# Version 1: Train GPT2

## Plan:

Train GPT2 on the text of the **Godot 4.3** documentation for 5 epochs, then test the model.

### System Used

This was done on older desktop hardware: Windows 10, WSL Ubuntu 22.04, i7 processor, Radeon 3050 8 GB OC, 16 GB RAM.

### Warning:

**Things are changing regularly, and you will see warnings (if this still runs), mostly to do with deprecation. This means, eventually, one day this code will need changes/updates/rewrites.**

### Base Code Use

Used as a starter: [311_fine_tuning_GPT2.ipynb](https://github.com/bnsreenu/python_for_microscopists/blob/master/311_fine_tuning_GPT2.ipynb)

Note: ```--break-system-packages``` is required for my customised WSL2 instance.

## Step 1: Packages

In [1]:
!pip install -q transformers torch datasets python-docx --break-system-packages

In [2]:
!pip install -qU PyPDF2 --break-system-packages

In [3]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

## Step 2: Define file text extract

In [4]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    # The text files in Step 3 are written as UTF-8, so read them back the same way
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text

## Step 3: Prepare Godot 4.3 text (sample version)

Downloaded ```godot-docs``` through ```https://docs.godotengine.org/en/stable/index.html```, using the ```stable``` variety. As of today it's ```319 MB```.

Extract the contents into ```godot-docs-html-stable```, then run the cell below to pull the text out of the HTML (dirty formatting for v1).

**Time Required:** ~4min

In [11]:
%%time
# Convert html files recursively into txt files without format
import os
from bs4 import BeautifulSoup

def convert_html_to_txt(html_file_path, txt_file_path):
    with open(html_file_path, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        text = soup.get_text()
        
    with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

def process_folder(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    
    for root, _, files in os.walk(input_folder):
        for file in files:
            if file.endswith('.html'):
                #print('processing:',file)
                html_file_path = os.path.join(root, file)
                relative_path = os.path.relpath(html_file_path, input_folder)
                txt_file_path = os.path.join(output_folder, os.path.splitext(relative_path)[0] + '.txt')
                splt_path = txt_file_path.split('/')
                #print(splt_path[-1])
                splt_path[-1] = splt_path[-2] + '-' + splt_path[-1]
                convert_html_to_txt(html_file_path, output_folder + '/' + splt_path[-1])
            else:
                print(' - skipped:',file)

input_folder = 'godot-docs-html-stable/tutorials'
output_folder = 'godot-docs-text'

process_folder(input_folder, output_folder)

CPU times: user 3min 41s, sys: 230 ms, total: 3min 41s
Wall time: 3min 46s


## Step 4: Trim excess whitespace

In [12]:
# Read documents from the directory
train_directory = output_folder
text_data = read_documents_from_directory(train_directory)
text_data = re.sub(r'\n+', '\n', text_data).strip()  # Remove excess newline characters
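
As a small illustration of what the substitution above does, the sketch below runs the same pattern on a made-up string (the sample text is an assumption, not taken from the Godot docs). It only collapses runs of newlines, so lines containing nothing but spaces survive. The second pattern is a hypothetical stricter variant, not something used in this notebook.

```python
import re

# Made-up sample with blank lines and one whitespace-only line
sample = "Title\n\n\n  \nIntroduction\n\n\nBody text\n"

# Same pattern as the cell above: collapse consecutive newlines into one
collapsed = re.sub(r'\n+', '\n', sample).strip()

# Hypothetical stricter variant: also treat whitespace-only lines as blank
stricter = re.sub(r'\n\s*\n+', '\n', sample).strip()

print(repr(collapsed))
print(repr(stricter))
```

The stricter variant would shrink ```train.txt``` a little further, but I haven't compared the two on the real data.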

In [15]:
print(text_data[0:200],'...')

(DEV) 2D antialiasing — Godot Engine (4.3) documentation in English
About
Introduction
Before you start
About Godot Engine
Organization of the documentation
About this documentation
List of features
P ...


*Uncomment this (with any existing train.txt removed) to write a new train.txt.*

In [21]:
## Only needed once off if you have the data
#with open("./train.txt", "w") as f:
#    f.write(text_data)

## Step 5: Setup training functions

In [16]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

2024-10-06 07:14:01.420652: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-06 07:14:01.940385: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-06 07:14:02.088674: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-06 07:14:03.037107: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [17]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
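
To make ```block_size``` concrete: as far as I understand ```TextDataset```, it tokenizes the whole file and cuts the token ids into consecutive, non-overlapping blocks, dropping the final partial block. The sketch below mimics that on a plain list of integers. ```chunk_token_ids``` is a hypothetical helper for illustration only, not part of ```transformers```, and the ids are stand-ins rather than real GPT2 tokens.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split token ids into non-overlapping blocks of exactly block_size."""
    blocks = []
    # Stop early enough that every block is full; the leftover tail is dropped
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

# Stand-in for real token ids: 300 tokens give 2 full blocks, 44 tokens are dropped
fake_ids = list(range(300))
blocks = chunk_token_ids(fake_ids, block_size=128)
print(len(blocks), [len(b) for b in blocks])
```

Each block becomes one training example, so a smaller ```block_size``` means more but shorter examples.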

In [18]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator

In [19]:
def train(train_file_path, model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
    # Load the tokenizer, then build the dataset and the collator from it
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    # Save the tokenizer next to the model so output_dir can be loaded on its own
    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    # Save the base weights first; trainer.save_model() writes the trained ones later
    model.save_pretrained(output_dir)

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
        save_steps=save_steps,  # was accepted by train() but never passed on
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()

## Step 6: Specify Training Arguments

In [27]:
train_file_path = "./train.txt"
model_name = 'gpt2'
output_dir = './model'
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 5.0
save_steps = 50000
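
A rough way to sanity-check these settings is to estimate how many optimizer steps a run will take. The sketch assumes a single GPU and no gradient accumulation, and it uses a made-up example count, since the real number of 128-token blocks in ```train.txt``` isn't recorded here. ```estimate_total_steps``` is a hypothetical helper.

```python
import math

def estimate_total_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(steps_per_epoch * num_epochs)

example_blocks = 10_000  # made-up count of training blocks, for illustration only
total = estimate_total_steps(example_blocks, batch_size=8, num_epochs=5.0)
print(total)
```

Comparing the estimate with ```save_steps``` gives an idea of whether any step-based checkpoint would be due before the run ends.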

### Step 6.1: I didn't want to use WANDB, this is how you disable it:

In [25]:
# I didn't want to use wandb
os.environ['WANDB_DISABLED'] = 'true'
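
The deprecation message that shows up during training points to ```report_to``` as the replacement for ```WANDB_DISABLED```. A minimal sketch of that route, assuming the keyword arguments are later unpacked into ```TrainingArguments``` (```build_training_kwargs``` is a hypothetical helper, and I haven't re-run the training with it):

```python
def build_training_kwargs(output_dir, batch_size, num_epochs):
    """Collect TrainingArguments keyword arguments with logging integrations turned off."""
    return {
        "output_dir": output_dir,
        "per_device_train_batch_size": batch_size,
        "num_train_epochs": num_epochs,
        "report_to": "none",  # replaces the WANDB_DISABLED environment variable
    }

kwargs = build_training_kwargs("./model", 8, 5.0)
print(kwargs)

# Assumed usage inside train():
# training_args = TrainingArguments(**kwargs)
```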

## Step 7: Run the training

*Note: 5 epochs was simply what I felt like trying out on my desktop; at the start, the estimated run time was around 5 hours.*

Also used:

- ```git lfs track "*.safetensors"```
- ```git lfs track "*.gguf"```

This was for a later stage if I want to upload them to my repo.

In [29]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss |
|------|---------------|
| 500  | 1.5           |
| 1000 | 0.7218        |
| 1500 | 0.4443        |
| 2000 | 0.3535        |
| 2500 | 0.3237        |
| 3000 | 0.312         |
| 3500 | 0.3147        |
| 4000 | 0.2763        |
| 4500 | 0.269         |
| 5000 | 0.2688        |
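
One way to read the loss column: assuming the Trainer reports mean cross-entropy in nats per token (my understanding of the default for ```GPT2LMHeadModel```), exponentiating it gives the training perplexity. This is measured on the training text only, so it says nothing about how well the model answers questions.

```python
import math

# A few of the logged values from the training output above
logged = {500: 1.5, 1000: 0.7218, 2500: 0.3237, 5000: 0.2688}

def perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(loss)

for step, loss in logged.items():
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```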


*Note: I restarted the kernel once here, then ran Step 8. This was to free up GPU memory for the tests below.*

## Step 8: Test The Model

*Warning: This data isn't likely to give the style of answers we would love to have, but the plan is to test if the idea I'm working on will succeed, or not. If the answers below hint at it being possible,* **Version 2** *will follow this starting structure with my idea of a revamp.*

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [9]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

model1_path = "./model"
max_length = 512

model = load_model(model1_path)
tokenizer = load_tokenizer(model1_path)

def generate_text(sequence):
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

In [10]:
def Query(question1):
    # Pass the question itself to the generator, not the model path
    generate_text(question1)

### 8.1: What’s new in Godot 4.3 compared to previous versions?

In [11]:
generate_text('What’s new in Godot 4.3 compared to previous versions?')

What’s new in Godot 4.3 compared to previous versions?
What is GDScript and why should I use it?
What were the motivations behind creating GDScript?
What 3D model formats does Godot support?
Will [insert closed SDK such as FMOD, GameWorks, etc.] be supported in Godot?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolk

### 8.2: Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

In [12]:
generate_text('Can you explain the changes in the rendering pipeline introduced in Godot 4.3?')

Can you explain the changes in the rendering pipeline introduced in Godot 4.3?
What is GLSL and why should I use it?
GLSL functions
Shader preprocessor
Why use a shader preprocessor?
Directives
Spatial shaders
Render modes
Built-ins
Global built-ins
Vertex built-ins
Fragment built-ins
Light built-ins
CanvasItem shaders
Render modes
Built-ins
Global built-ins
Vertex built-ins
Fragment built-ins
Light built-ins
SDF functions
Particle shaders
Render modes
Built-ins
Global built-ins
Start and Process built-ins
Start built-ins
Process built-ins
Process functions
Sky shaders
Render modes
Built-ins
Global built-ins
Sky built-ins
Fog shaders
Built-ins
Global built-ins
Fog built-ins
Your first shader
Your first 2D shader
Introduction
Setup
Your first CanvasItem shader
Your first fragment function
Your first vertex function
Conclusion
Your first 3D shader
Where to assign my material
Setting up
Shader magic
Noise heightmap
Uniforms
Uniforms
Interacting with light
Your second 3D shader
Your first 

### 8.3: How do you implement a custom shader in Godot 4.3?

In [13]:
generate_text('How do you implement a custom shader in Godot 4.3?')

How do you implement a custom shader in Godot 4.3?
What were the motivations behind creating GDScript?
What 3D model formats does Godot support?
Will [insert closed SDK such as FMOD, GameWorks, etc.] be supported in Godot?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCon

### 8.4: What are the key differences between GDScript and C# in Godot 4.3?

In [14]:
generate_text('What are the key differences between GDScript and C# in Godot 4.3?')

What are the key differences between GDScript and C# in Godot 4.3?
What is GDScript and why should I use it?
What were the motivations behind creating GDScript?
What 3D model formats does Godot support?
Will [insert closed SDK such as FMOD, GameWorks, etc.] be supported in Godot?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user int

### 8.5: How would you optimize a game for performance in Godot 4.3?

In [15]:
generate_text('How would you optimize a game for performance in Godot 4.3?')

How would you optimize a game for performance in Godot 4.3?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Why does Godot not use STL (Standard Template Library)?
Why does Godot not use exceptions?
Why does Godot not use exceptions?
Does Godot use an ECS

### 8.6: Can you describe the process of creating a custom node in Godot 4.3?

In [16]:
generate_text('Can you describe the process of creating a custom node in Godot 4.3?')

Can you describe the process of creating a custom node in Godot 4.3?
How does it work and look?
Programming languages
What do I need to know to use Godot?
Learn to code with GDScript
Learn in your browser with the GDScript app
Overview of Godot's key concepts
Scenes
Nodes
The scene tree
Signals
Summary
First look at Godot's interface
The Project Manager
First look at Godot's editor
The four main screens
Integrated class reference
Learning new features
Making the most of this manual
Learning to think like a programmer
Learning with the community
Community tutorials
Godot's design philosophy
Object-oriented design and composition
All-inclusive package
Open source
Community-driven
The Godot editor is a Godot game
Separate 2D and 3D engines
Step by step
Nodes and Scenes
Nodes
Scenes
Creating your first scene
Changing a node's properties
Running the scene
Setting the main scene
Creating instances
In practice
Editing scenes and instances
Scene instances as a design language
Summary
Scripting

### 8.7: What are the new features in the animation system of Godot 4.3?

In [17]:
generate_text('What are the new features in the animation system of Godot 4.3?')

What are the new features in the animation system of Godot 4.3?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Why does Godot not use STL (Standard Template Library)?
Why does Godot not use exceptions?
Does Godot use an ECS (Entity Component System)?
Why

### 8.8: How do you handle input events in Godot 4.3?

In [18]:
generate_text('How do you handle input events in Godot 4.3?')

How do you handle input events in Godot 4.3?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Why does Godot not use STL (Standard Template Library)?
Why does Godot not use exceptions?
Does Godot use an ECS (Entity Component System)?
Why does Godot not force users to implement DO

### 8.9: What are the best practices for using the new navigation system in Godot 4.3?

In [19]:
generate_text('What are the best practices for using the new navigation system in Godot 4.3?')

What are the best practices for using the new navigation system in Godot 4.3?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Why does Godot not use STL (Standard Template Library)?
Why does Godot not use exceptions?
Does Godot use an ECS (Entity Component System)?
Why does Godot not force users to implement DOD (Data-Oriented Design)?
How can I support Godot development or contribute?
Who is working on Godot? How can I contact you?
Complying with licenses
What are licenses?
What are licenses?
Requirements
Inclusion
Credits s

### 8.10: Can you walk me through setting up a multiplayer game in Godot 4.3?

In [20]:
generate_text('Can you walk me through setting up a multiplayer game in Godot 4.3?')

Can you walk me through setting up a multiplayer game in Godot 4.3?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Why does Godot not use STL (Standard Template Library)?
Why does Godot not use exceptions?
Does Godot use an ECS (Entity Component System)?

### 8.11: How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

In [21]:
generate_text('How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?')

How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Why does Godot not use STL (Standard Template Library)?
Why does Godot not use exceptions?
Does Godot use an ECS (Entity Component Syst

### 8.12: How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

In [22]:
generate_text('How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?')

How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?
How can I extend Godot?
How do I install the Godot editor on my system (for desktop integration)?
Windows
macOS
Linux
Is the Godot editor a portable application?
Why does Godot prioritize Vulkan and OpenGL over Direct3D?
Why does Godot aim to keep its core feature set small?
How should assets be created to handle multiple resolutions and aspect ratios?
When is the next release of Godot out?
Which Godot version should I use for a new project?
Should I upgrade my project to use new Godot versions?
I would like to contribute! How can I get started?
I have a great idea for Godot. How can I share it?
Is it possible to use Godot to create non-game applications?
Is it possible to use Godot as a library?
What user interface toolkit does Godot use?
Why does Godot use the SCons build system?
Wh

### 8.13: In Godot 4.3, using C#, can you create 2D snake from scratch?

In [23]:
generate_text('In Godot 4.3, using C#, can you create 2D snake from scratch?')

In Godot 4.3, using C#, can you create 2D snake from scratch?
How does it work?
Anatomy of an InputEvent
Actions
InputMap
Input examples
Introduction
Events versus polling
Input events
InputMap
Capturing actions
Keyboard events
Keyboard modifiers
Mouse events
Mouse buttons
Mouse motion
Touch events
Mouse and input coordinates
About
Hardware display coordinates
Viewport display coordinates
Customizing the mouse cursor
Using project settings
Using a script
Cursor list
Controllers, gamepads, and joysticks
Supporting universal input
Which Input singleton method should I use?
Vibration
Differences between keyboard/mouse and controller input
Dead zone
"Echo" events
Window focus
Power saving prevention
Troubleshooting
My controller isn't recognized by Godot.
My controller has incorrectly mapped buttons or axes.
My controller works on a given platform, but not on another platform.
Handling quit requests
Quitting
Handling the notification
On mobile devices
Sending your own quit notification
Mat

# Review (Personal):

- **Test 1:** 0/13 - 100% failure on first run. I think I need to adjust ```train.txt``` to filter out useless rows. I think the contents wrap too closely within each category. That might be why Training Loss was way too high. Next steps: clean out the model folder, manually revamp the training data, then run the tests again.
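
A possible starting point for that clean-up, sketched under assumptions: the generated answers above look like navigation headings that appear on many of the converted pages, so one option is to drop any line that shows up in a large share of the pages before building ```train.txt```. ```drop_repeated_lines``` and its ```max_share``` threshold are hypothetical, and the sample pages are made up.

```python
from collections import Counter

def drop_repeated_lines(pages, max_share=0.5):
    """Remove lines that occur in more than max_share of the pages."""
    doc_freq = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        doc_freq.update({line.strip() for line in page.splitlines() if line.strip()})

    limit = max_share * len(pages)
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if line.strip() and doc_freq[line.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned

# Made-up pages that share the same navigation lines
pages = [
    "About\nIntroduction\n2D antialiasing smooths jagged edges.",
    "About\nIntroduction\nSignals let nodes talk to each other.",
    "About\nIntroduction\nShaders run on the GPU.",
]
cleaned = drop_repeated_lines(pages)
print(cleaned)
```

The threshold would need tuning on the real pages, since a genuinely useful heading can also repeat across several tutorials.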