# Version 1: Train GPT2

## Plan:

Train GPT2 on the text of the **Godot 4.3** documentation for 5 epochs, then test the model.

### System Used

This was done on older desktop hardware: Windows 10, WSL Ubuntu 22.04, i7 processor, GeForce RTX 3050 8 GB OC, 16 GB RAM.

### Warning:

**Things change regularly, and you will see warnings (if this still runs), mostly about deprecations. Eventually this code will need changes, updates, or rewrites.**

### Base Code Use

Used as a starter: [311_fine_tuning_GPT2.ipynb](https://github.com/bnsreenu/python_for_microscopists/blob/master/311_fine_tuning_GPT2.ipynb)

Note: ```--break-system-packages``` is required for my customised WSL2 instance.

## Step 1: Packages

In [None]:
!pip install -q transformers torch datasets python-docx beautifulsoup4 --break-system-packages

In [None]:
!pip install -qU PyPDF2 --break-system-packages

In [1]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

## Step 2: Define file text extraction (Deprecated)

In [None]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            # Guard against pages that yield no text
            text += page.extract_text() or ""
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    # The text files in Step 3 are written as UTF-8, so read them the same way
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text

## Step 3: Prepare Godot 4.3 text (sample version) (Deprecated)

Downloaded ```godot-docs``` from ```https://docs.godotengine.org/en/stable/index.html``` using the ```stable``` variety. As of today it's ```319 MB```.

Extract the contents into ```godot-docs-html-stable```, then run the cell below to pull the text out of the HTML (dirty formatting for v1).

**Time Required:** ~4min

In [None]:
%%time
# Convert html files recursively into txt files without format
import os
from bs4 import BeautifulSoup

def convert_html_to_txt(html_file_path, txt_file_path):
    with open(html_file_path, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        text = soup.get_text()
        
    with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

def process_folder(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    
    for root, _, files in os.walk(input_folder):
        for file in files:
            if file.endswith('.html'):
                #print('processing:',file)
                html_file_path = os.path.join(root, file)
                relative_path = os.path.relpath(html_file_path, input_folder)
                txt_file_path = os.path.join(output_folder, os.path.splitext(relative_path)[0] + '.txt')
                # Flatten the tree: prefix each file name with its parent folder name
                parent_name = os.path.basename(os.path.dirname(txt_file_path))
                flat_name = parent_name + '-' + os.path.basename(txt_file_path)
                convert_html_to_txt(html_file_path, os.path.join(output_folder, flat_name))
            else:
                print(' - skipped:',file)

input_folder = 'godot-docs-html-stable/tutorials'
output_folder = 'godot-docs-text'

process_folder(input_folder, output_folder)
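
Calling `get_text()` on the whole page keeps every piece of text in the HTML, likely including navigation menus and footers, which is part of why the v1 text is dirty. As a rough sketch of one possible clean-up (not what this notebook ran), the standard library's `html.parser` can skip chosen elements while collecting text. The names `VisibleTextExtractor`, `html_to_text`, and the `SKIPPED_TAGS` list are hypothetical and would need tuning against the real Godot pages.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text while skipping elements that rarely hold article content."""

    # Assumption: these tags are noise for training; adjust per site.
    SKIPPED_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><nav>Menu</nav><body><h1>Nodes</h1>"
    "<p>A node is a building block.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

Here the menu and script text are dropped and only the heading and paragraph survive, one per line.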

## Step 4: Trim excess whitespace (Deprecated)

In [None]:
# Read documents from the directory
train_directory = output_folder
text_data = read_documents_from_directory(train_directory)
text_data = re.sub(r'\n+', '\n', text_data).strip()  # Remove excess newline characters
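
To make the effect of that `re.sub` call concrete, here is a small worked example on a made-up string. It also shows a limitation: a line containing only spaces is not a run of newlines, so it survives. The `stricter` pattern is an assumption of mine, not something the notebook used.

```python
import re

raw = "Title\n\n\nFirst line\n   \nSecond line\n\n"

# Same call as in the cell above: collapse runs of newlines into one
collapsed = re.sub(r'\n+', '\n', raw).strip()
print(repr(collapsed))

# Variant (assumption, not used in the notebook): also drop whitespace-only lines
stricter = re.sub(r'\n\s*\n+', '\n', raw).strip()
print(repr(stricter))
```

The first result still contains the line of three spaces; the second does not.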

In [None]:
print(text_data[0:200],'...')

*Uncomment this with train.txt removed to write a new train.txt*

In [None]:
## Only needed once off if you have the data
#with open("./train.txt", "w") as f:
#    f.write(text_data)

### 4.1: Minimised train.txt (Deprecated)

In [None]:
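def filter_lines(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Keep only lines with more than 30 characters once whitespace is stripped
    with open(output_file, 'w', encoding='utf-8') as file:
        for line in lines:
            if len(line.strip()) > 30:
                file.write(line)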

# Example usage
input_file = 'train.txt'
output_file = 'train-minimised.txt'
filter_lines(input_file, output_file)
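
A quick in-memory illustration of what the 30-character rule keeps and drops. The sample lines are made up, and `keep_long_lines` is a hypothetical helper that mirrors the condition in `filter_lines` without touching any files. One side effect worth noticing is that short headings and short lines of code are dropped along with the junk.

```python
def keep_long_lines(lines, min_chars=30):
    """Keep lines whose stripped length is strictly greater than min_chars."""
    return [line for line in lines if len(line.strip()) > min_chars]


sample_lines = [
    "Nodes\n",                                               # short heading: dropped
    "Godot is a general-purpose 2D and 3D game engine.\n",  # full sentence: kept
    "func _ready():\n",                                      # short code line: dropped
]

kept = keep_long_lines(sample_lines)
print(kept)
```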

## Step 5: Setup training functions

In [2]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

2024-10-06 18:52:04.679199: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-06 18:52:05.397933: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-06 18:52:05.563077: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-06 18:52:06.718448: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
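
A minimal sketch of what `block_size` means for the training set, assuming `TextDataset` tokenizes the whole file into one long list of token ids and cuts it into consecutive blocks of `block_size`, dropping a trailing partial block. The function `chunk_into_blocks` and the fake token ids are hypothetical stand-ins; no tokenizer is involved here.

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Split a flat list of token ids into consecutive full-size blocks."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


fake_token_ids = list(range(300))  # stand-in for a tokenized train file
blocks = chunk_into_blocks(fake_token_ids)
print(len(blocks), [len(block) for block in blocks])
```

With 300 tokens and a block size of 128, this gives two training examples and leaves the last 44 tokens unused. A very small training file therefore yields very few examples.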

In [4]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator

In [5]:
def train(train_file_path, model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
    # Build the dataset and collator from the training file
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    # Save the tokenizer alongside the model so output_dir can be loaded on its own
    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    model.save_pretrained(output_dir)

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
        save_steps=save_steps,  # checkpoint interval, in optimizer steps
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()

## Step 6: Specify Training Arguments

In [6]:
train_file_path = "./train-manual.txt"
model_name = 'gpt2'
output_dir = './model'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 500.0
save_steps = 50000
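
To get a feel for how these settings translate into optimizer steps, here is a back-of-the-envelope helper. It assumes a single device and no gradient accumulation, and the block count of 60 is a hypothetical number, not a measurement from this run.

```python
import math


def total_training_steps(num_blocks, batch_size, epochs):
    """Estimate optimizer steps: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return math.ceil(steps_per_epoch * epochs)


# Hypothetical: 60 blocks of 128 tokens, batch size 8
print(total_training_steps(60, 8, 5))      # 5 epochs
print(total_training_steps(60, 8, 500.0))  # 500 epochs
```

Comparing the estimate with `save_steps` shows whether any intermediate checkpoints would be written at all during a run.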

### 6.1: I didn't want to use WANDB, so this is how to disable it

In [7]:
# I didn't want to use wandb
os.environ['WANDB_DISABLED'] = 'true'
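
Recent `transformers` releases report that the `WANDB_DISABLED` environment variable is deprecated and point to the `report_to` setting instead. A hedged sketch of that alternative is below: the dictionary is only an illustration of the keyword arguments that would be handed to `TrainingArguments`, and `training_kwargs` is a hypothetical name. It was not part of the run recorded in this notebook.

```python
# Keyword arguments intended for transformers.TrainingArguments.
# "output_dir" mirrors the value used in Step 6.
training_kwargs = {
    "output_dir": "./model",
    "report_to": "none",  # turn off all logging integrations, including wandb
}

# Inside train() this would be used as:
# training_args = TrainingArguments(**training_kwargs)
print(training_kwargs)
```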

## Step 7: Run the training

*Note: 5 epochs was simply what I felt like trying out on my desktop. ~At the start, the trainer estimated around 5 hours.~ ~With the minimised data, it estimated 45 minutes.~ The manual first test, with very little training data, took 8 seconds.*

The minimised data is super messy; I need to format the training data far better. This will likely give hallucinations that are similar to what I'm looking for.

Also used:

- ```git lfs track "*.safetensors"```
- ```git lfs track "*.gguf"```

This was for a later stage, in case I want to upload the models to my repo.

In [8]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss |
|------|---------------|
| 500 | 0.3467 |
| 1000 | 0.02 |
| 1500 | 0.0157 |
| 2000 | 0.0131 |
| 2500 | 0.0127 |
| 3000 | 0.0112 |
| 3500 | 0.0104 |


*Note: I restarted the kernel once here, then ran Step 8. This freed up GPU memory for the tests below.*

## Step 8: Test The Model

*Warning: This data isn't likely to give the style of answers we would love to have, but the plan is to test if the idea I'm working on will succeed, or not. If the answers below hint at it being possible,* **Version 2** *will follow this starting structure with my idea of a revamp.*

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [2]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

model1_path = "./model"
max_length = 512

model = load_model(model1_path)
tokenizer = load_tokenizer(model1_path)

def generate_text(sequence):
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))
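
The `top_k=50` and `top_p=0.95` settings limit which tokens can be sampled at each step. The sketch below is a simplified, pure-Python illustration of that idea on a toy list of logits. It is not the library's implementation, and `top_k_top_p_filter` is a hypothetical name.

```python
import math


def top_k_top_p_filter(logits, top_k=50, top_p=0.95):
    """Return {token_index: probability} for tokens that survive both filters."""
    # Softmax over all logits (shifted by the max for numerical stability)
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # top-k: keep only the k most probable tokens, then renormalise
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalise the survivors so they sum to 1 before sampling
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
survivors = top_k_top_p_filter(toy_logits, top_k=3, top_p=0.9)
print(survivors)
```

In this toy case only the two most likely tokens remain candidates. Lower values of either setting narrow the choice further, and higher values allow more variety.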

In [3]:
def Query(question1):
    # Pass the question itself to the generator, not the model path
    generate_text(question1)

### 8.1: What’s new in Godot 4.3 compared to previous versions?

In [4]:
generate_text('What’s new in Godot 4.3 compared to previous versions?')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What’s new in Godot 4.3 compared to previous versions?Let's know in the comment box below!
**What is Godot 4.3?**\nGodot is a general-purpose 2D and 3D game engine designed to support all sorts of projects. You can use it to create games or applications you can then release on desktop or mobile, as well as on the web.\nYou can also create console games with it, although you either need strong programming skills or a developer to port the game for you.\n**Note:**The Godot team can't provide an open source console export due to the licensing terms imposed by the licensing community. Regardless of the reason, you can find workflows and materials on the various [Godot forums](https://godotengine.org/community/), especially the [Godot Discord](https://discord.gg/bdcfAYM4W9) where you can find information on all sorts of projects.\nSome examples of projects that make use of Godot Engine are:\n*Star Trek Online (Linux),\n*Godot's first game** (Android),\n*The demo game** (Windows),\n**The ope

### 8.2: Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

In [5]:
generate_text('Can you explain the changes in the rendering pipeline introduced in Godot 4.3?')

Can you explain the changes in the rendering pipeline introduced in Godot 4.3?**\nThis is a quick and dirty answer. The engine no longer requires user input when editing a scene. You can use this as a starting point, and then move on to the next step.\n**Note:** The editor saves the main scene's path in a project.godot file in your project's directory. While you can edit this text file directly to change project settings, you can also use the "Project -> Project Settings" window to do so. For more information, see [Project Settings](https://docs.godotengine.org/en/stable/tutorials/editor/project_settings.html#doc-project-settings).\nThe Create Node dialog, like other node types in the editor, only allows you to create root nodes. The Create Node dialog, like other node types in the editor, only allows you to create root nodes.\nThe Create Node dialog, like other node types in the editor, only allows you to create root nodes.\nThe Create Node dialog, like other node types in the editor,

### 8.3: How do you implement a custom shader in Godot 4.3?

In [6]:
generate_text('How do you implement a custom shader in Godot 4.3?')

How do you implement a custom shader in Godot 4.3?\nLet's talk about the engine's most essential concepts.\n*What is Godot?\nGodot is a general-purpose 2D and 3D game engine designed to support all sorts of projects. You can use it to create games or applications you can then release on desktop or mobile, as well as on the web.\nYou can also create console games with it, although you either need strong programming skills or a developer to port the game for you.\n**Note:**The Godot team can't provide an open source console export due to the licensing terms of the various manufacturers. While you can work with manufacturers to source consoles with one license, you can also work with external programs. If you prefer, you can work with external programs.\n**Note:**The Godot team can't provide an open source console export due to the licensing terms of the various manufacturers. While you can work with manufacturers to source consoles with one license, you can also work with external progra

### 8.4: What are the key differences between GDScript and C# in Godot 4.3?

In [7]:
generate_text('What are the key differences between GDScript and C# in Godot 4.3?')

What are the key differences between GDScript and C# in Godot 4.3?\nThis series will introduce you to them and give you an overview of the engine's essential concepts.\n- [Introduction to Godot](https://docs.godotengine.org/en/stable/getting_started/introduction/introduction_to_godot.html)\n- [Learn to code with GDScript](https://docs.godotengine.org/en/stable/getting_started/introduction/learn_to_code_with_gdscript.html)\n- [Learn to code with GDScript](https://docs.godotengine.org/en/stable/getting_started/introduction/learn_to_code_with_gdscript.html)\n- [Learn to code with Visual Studio](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_vscode.html)\n- [Learn to code with the Qt editor](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_qt.html)\n- [Learn to code with the Run2Learn](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_qt.html)\n- [Learn to code with t

### 8.5: How would you optimize a game for performance in Godot 4.3?

In [8]:
generate_text('How would you optimize a game for performance in Godot 4.3?')

How would you optimize a game for performance in Godot 4.3?**First off, lets make a list of tips and suggestions.**\n-First off, make a backup of your game's data if you want to.\n-You can find help on the [Godot Discord](https://discord.godotengine.org/).\n-You can also find help on the [Godot Discord](https://discord.godotengine.org/).\n-You can also find information on the [Godot Discord](https://discord.godotengine.org/), especially the [Godot Discord](https://discord.godotengine.org/).\n-Finally, you can find information on the [Godot Discord](https://discord.godotengine.org/), especially the [Godot Discord](https://discord.godotengine.org/).\n-Finally, you can find information on the [Godot Discord](https://discord.godotengine.org/), especially the [Godot Discord](https://discord.godotengine.org/).\n-Finally, you can find information on the [Godot Discord](https://discord.godotengine.org/), especially the [Godot Discord](https://discord.godotengine.org/).\n-Finally, you can find 

### 8.6: Can you describe the process of creating a custom node in Godot 4.3?

In [9]:
generate_text('Can you describe the process of creating a custom node in Godot 4.3?')

Can you describe the process of creating a custom node in Godot 4.3?**First, you will need to [create a new project](https://docs.godotengine.org/en/stable/getting_started/introduction/index.html#doc-your-first-project).\nIn the following pages, you will get an overview of the engine's essential concepts. You will also get answers to questions such as "Is Godot for me?" or "What can I do with Godot?".\n- [Creating your first script](https://docs.godotengine.org/en/stable/getting_started/introduction/index.html#doc-your-first-script)
**Creating your first script](https://docs.godotengine.org/en/stable/getting_started/introduction/index.html#doc-your-first-script)
**[Creating your first script](https://docs.godotengine.org/en/stable/getting_started/introduction/index.html#doc-your-first-script), or [Listening to player input](https://docs.godotengine.org/en/stable/getting_started/introduction/index.html#doc-your-first-script), or [Listening to player input](https://docs.godotengine.org/e

### 8.7: What are the new features in the animation system of Godot 4.3?

In [10]:
generate_text('What are the new features in the animation system of Godot 4.3?')

What are the new features in the animation system of Godot 4.3?\nThis article is here to help you figure out whether Godot will benefit from bringing more features to the engine. We will introduce some broad features of the engine to give you a feel for what you can achieve with it and answer questions such as "what do I need to know to get started?".\n- [Introduction to Godot](https://docs.godotengine.org/en/stable/getting_started/introduction/introduction_to_godot.html)\n- [Learn to code with GDScript](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_gdscript.html)\n- [First look at Godot's interface](https://docs.godotengine.org/en/stable/getting_started/introduction/first_look_at_the_editor.html)\n- [Learning new features](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_new_features.html)\n- [Learning new features](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_new_features.html)\n- 

### 8.8: How do you handle input events in Godot 4.3?

In [11]:
generate_text('How do you handle input events in Godot 4.3?')

How do you handle input events in Godot 4.3?\nGodot relies on the [CharacterBody2D](https://docs.godotengine.org/en/stable/classes/class_characterbody2d.html#class-characterbody2d) node, which is a tree of nodes and that each node has a unique function. In an empty scene, each node has a single function, like a camera that follows a node's path. In an open scene, each node has a different function, like a camera that follows a player's path.\nThe CharacterBody2D node, for example, appears as a single node with its internals hidden. Its function is to draw text, which is what Godot will first load when you or a player runs into an open area.\nThe next step is to make the scene look like a game world. Create a new project and run it like a normal game project. You can find more information on that here: [Project Settings](https://docs.godotengine.org/en/stable/tutorials/editor/project_settings.html#doc-project-settings).\nIn an empty scene, the Scene dock on the left shows the currently 

### 8.9: What are the best practices for using the new navigation system in Godot 4.3?

In [12]:
generate_text('What are the best practices for using the new navigation system in Godot 4.3?')

What are the best practices for using the new navigation system in Godot 4.3?\nThis article is here to help you figure it out. We will introduce some broad features of the engine to give you a feel for what you can achieve with it and answer questions such as "what do I need to know to get started?".\n- [Introduction to Godot](https://docs.godotengine.org/en/stable/getting_started/introduction/introduction_to_godot.html)\n- [Learn to code with GDScript](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_gdscript.html)\n- [Learn to code with GDScript](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_gdscript.html)\n- [Using signals](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_signals.html)\n- [Listening to player input](https://docs.godotengine.org/en/stable/getting_started/introduction/learning_to_code_with_signals.html)\n- [Listening to player input](https://do

### 8.10: Can you walk me through setting up a multiplayer game in Godot 4.3?

In [13]:
generate_text('Can you walk me through setting up a multiplayer game in Godot 4.3?')

Can you walk me through setting up a multiplayer game in Godot 4.3?**\nOf course you can!\nGodot supports importing 3D scenes designed in [Blender](https://www.blender.org/) and provides plugins to code in [VSCode](https://github.com/godotengine/godot-vscode-plugin) and [Emacs](https://github.com/godotengine/emacs-plugin) and [Emacs](https://github.com/godotengine/emacs-plugin) for use with other programs.\n![../../introduction_editor.webp](https://docs.godotengine.org/en/stable/_images/introduction_editor.webp "introduction_editor.webp")\nYou can find many more examples in the [official showcase videos](https://www.youtube.com/playlist?list=PLZO9nW8ZM4W9A1nkwJZOwrfiLdQ8).\n![../../introduction_editor.webp](https://docs.godotengine.org/en/stable/_images/introduction_editor.webp "introduction_editor.webp")\nYou can find many more examples in the [official showcase videos](https://www.youtube.com/playlist?list=PLZO9nW8ZM4W9A1nkwJZOwrfiLdQ8).\n![../../introduction_editor.webp](https://doc

### 8.11: How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

In [14]:
generate_text('How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?')

How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?
**Introduction**\nThis series will introduce you to Godot and give you an overview of the engine's essential concepts. You will learn more about nodes and scenes, code your first classes with GDScript, use signals to make nodes communicate with one another, and more.\nThe following lessons are here to prepare you for [Your first 2D game](https://docs.godotengine.org/en/stable/getting_started/introduction/first_2d_game.html#doc-your-first-2d-game).\nThe following lessons are here to prepare you for [Your first 3D game](https://docs.godotengine.org/en/stable/getting_started/introduction/first_3d_game.html#doc-your-first-3d-game).\nThe following lessons are here to prepare you for [Your first 2D game](https://docs.godotengine.org/en/stable/getting_started/introduction/first_2d_game.html#doc-your-first-2d-game).\nThe following lessons are here to prepare you for [Your first 3D game](https://docs

### 8.12: How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

In [15]:
generate_text('How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?')

How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?\nMy Godot 4.3 World[/en/stable/community/tutorials/scripting/godot_4.3_world.html#doc-scripting-godot-4.3-world).\nYou can create as many instances of a scene as you'd like. You could have five or ten characters in your game, created from your Character scene.
**Creating your first scene**\nLet's create our first scene with a single node. To do so, you will need to [create a new project](https://docs.godotengine.org/en/stable/project/nodes_and_scenes_3d_scene_manager.html#doc-creating-first-scene).\nIn an empty scene, the Scene dock on the left shows the main scene, the Asset dock on the right shows the editor's interface, and the Debug dock on the left allows you to report bugs.\n![../../nodes_and_scenes_4.3_scene_manager.webp](https://docs.godotengine.org/en/stable/_images/nodes_and

### 8.13: In Godot 4.3, using C#, can you create 2D snake from scratch?

In [16]:
generate_text('In Godot 4.3, using C#, can you create 2D snake from scratch?')

In Godot 4.3, using C#, can you create 2D snake from scratch?\nLet's create our 2D snake with a single node. To do so, you will need to [create a new project](https://docs.godotengine.org/en/stable/getting_started/introduction/nodes_and_scenes.html#doc-nodes-and-scenes).\nIn the project's main window, you will see an empty scene.\n![../../nodes_and_scenes_01_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_empty_scene_emp

# Review (Personal):

- **Test 1:** 0/13 - 100% failure on first run. I think I need to adjust ```train.txt``` to filter out useless rows. I think the content of each category wraps too closely together. That might be why Training Loss was way too high. Next: clean the model folder, manually revamp the training data, then run the tests again.
- **Test 2:** 0/13 - 100% failure again. Rethinking it: I'm going to test with just one section of the documentation, prepared manually, as the training set. ```./train-minimised.txt``` is still too dirty to give any good results.
- **Test 3:** 2.5ish/13 - classing it as about 80% fail and just giving small points for some of the answer text. My training data in ```./train-manual.txt``` was way too small.