# Version 1: Train GPT2

## Plan:

Train GPT2 on text extracted from the **Godot 4.3** documentation for 5 epochs, then test the model.

### System Used

This was done on older desktop hardware: Windows 10, WSL Ubuntu 22.04, i7 processor, Radeon 3050 8GB OC, 16GB RAM.

### Warning:

**The libraries used here change regularly, so expect warnings (if this still runs at all), mostly about deprecations. Eventually this code will need changes, updates, or rewrites.**

### Base Code Use

Used as a starter: [311_fine_tuning_GPT2.ipynb](https://github.com/bnsreenu/python_for_microscopists/blob/master/311_fine_tuning_GPT2.ipynb)

Note: ```--break-system-packages``` is required for my customised WSL2 instance.

## Step 1: Packages

In [None]:
!pip install -q transformers torch datasets python-docx beautifulsoup4 --break-system-packages

In [None]:
!pip install -qU PyPDF2 --break-system-packages

In [1]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

## Step 2: Define file text extract

In [None]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text

## Step 3: Prepare Godot 4.3 Text (sample version)

Downloaded ```godot-docs``` from ```https://docs.godotengine.org/en/stable/index.html``` using the ```stable``` version. At the time of writing it's ```319 MB```.

Extract contents into ```godot-docs-html-stable``` then run below to extract data from the html (dirty formatting for v1).

**Time Required:** ~4min

In [None]:
%%time
# Convert html files recursively into txt files without format
import os
from bs4 import BeautifulSoup

def convert_html_to_txt(html_file_path, txt_file_path):
    with open(html_file_path, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        text = soup.get_text()
        
    with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

def process_folder(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    
    for root, _, files in os.walk(input_folder):
        for file in files:
            if file.endswith('.html'):
                #print('processing:',file)
                html_file_path = os.path.join(root, file)
                relative_path = os.path.relpath(html_file_path, input_folder)
                txt_file_path = os.path.join(output_folder, os.path.splitext(relative_path)[0] + '.txt')
                # Flatten the tree: prefix each file name with its parent folder name
                parent_name = os.path.basename(os.path.dirname(txt_file_path))
                flat_name = parent_name + '-' + os.path.basename(txt_file_path)
                convert_html_to_txt(html_file_path, os.path.join(output_folder, flat_name))
            else:
                print(' - skipped:',file)

input_folder = 'godot-docs-html-stable/tutorials'
output_folder = 'godot-docs-text'

process_folder(input_folder, output_folder)
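
To make the output naming concrete, here is a minimal sketch of how a nested HTML path ends up as a flat text file name. The helper `flat_txt_name` and the example paths are hypothetical, written only for illustration, and the helper only covers files that sit inside a subfolder of the input folder.

```python
import os

def flat_txt_name(html_file_path, input_folder):
    """Return '<parent folder>-<file stem>.txt' for an HTML file in a subfolder."""
    relative_path = os.path.relpath(html_file_path, input_folder)
    stem = os.path.splitext(os.path.basename(relative_path))[0]
    parent = os.path.basename(os.path.dirname(relative_path))
    return parent + '-' + stem + '.txt'

# Hypothetical paths, used only to illustrate the naming
base = 'godot-docs-html-stable/tutorials'
name_2d = flat_txt_name(base + '/2d/index.html', base)
name_a = flat_txt_name(base + '/2d/particles/index.html', base)
name_b = flat_txt_name(base + '/3d/particles/index.html', base)

print(name_2d)           # 2d-index.txt
print(name_a, name_b)    # both map to particles-index.txt
```

Because only the immediate parent folder is used as the prefix, two pages in different subtrees can map to the same name, and the later one would overwrite the earlier one. For a dirty v1 data set that may be acceptable, but it is worth knowing when counting how much text actually made it into training.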

## Step 4: Trim excess whitespace

In [None]:
# Read documents from the directory
train_directory = output_folder
text_data = read_documents_from_directory(train_directory)
text_data = re.sub(r'\n+', '\n', text_data).strip()  # Remove excess newline characters

In [None]:
print(text_data[0:200],'...')
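
A small sketch of what the newline clean-up does, on a made-up sample string (not taken from the real data). It also shows one limitation: a line containing only a space survives, because `\n+` only matches consecutive newlines. The second pattern is an assumed alternative, not what was used in this run.

```python
import re

# Hypothetical sample, standing in for text pulled out of an HTML page
sample = "\n\nGodot Docs\n\n\n\nNodes and Scenes\n \nSignals\n\n"

# Same pattern as the cell above: collapse runs of newlines
collapsed = re.sub(r'\n+', '\n', sample).strip()
print(repr(collapsed))            # 'Godot Docs\nNodes and Scenes\n \nSignals'

# Assumed alternative: also drop lines that contain only whitespace
blank_lines_removed = re.sub(r'\n\s*\n', '\n', sample).strip()
print(repr(blank_lines_removed))  # 'Godot Docs\nNodes and Scenes\nSignals'
```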

*Uncomment the cell below (with any existing train.txt removed) to write a new train.txt.*

In [None]:
## Only needed once off if you have the data
#with open("./train.txt", "w") as f:
#    f.write(text_data)

### 4.1: Minimised train.txt

In [2]:
def filter_lines(input_file, output_file):
    with open(input_file, 'r') as file:
        lines = file.readlines()

    with open(output_file, 'w') as file:
        for line in lines:
            if len(line.strip()) > 30:
                file.write(line)

# Example usage
input_file = 'train.txt'
output_file = 'train-minimised.txt'
filter_lines(input_file, output_file)
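
The length filter above keeps any line longer than 30 characters. Since text pulled from full HTML pages tends to repeat the same navigation and footer lines on every page, one possible extra step is to drop exact repeats as well. This is only a sketch of that idea: `filter_and_dedupe` is a hypothetical helper, it was not part of this run, and the sample lines are made up.

```python
def filter_and_dedupe(lines, min_length=30):
    """Keep lines longer than min_length characters, dropping exact repeats."""
    seen = set()
    kept = []
    for line in lines:
        stripped = line.strip()
        if len(stripped) > min_length and stripped not in seen:
            seen.add(stripped)
            kept.append(line)
    return kept

# Hypothetical lines, made up for illustration
sample_lines = [
    "Home\n",
    "This sentence is long enough to pass the length filter.\n",
    "Next\n",
    "This sentence is long enough to pass the length filter.\n",
    "A second, different sentence that is also long enough.\n",
]

kept_lines = filter_and_dedupe(sample_lines)
print(len(kept_lines))  # 2
```

Deduplicating on the stripped line keeps the first occurrence in its original position, so the order of the remaining text is unchanged.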

## Step 5: Setup training functions

In [2]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

2024-10-06 18:16:17.659015: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-06 18:16:17.674707: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-06 18:16:17.678955: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-06 18:16:17.690343: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset

In [4]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator

In [5]:
def train(train_file_path,model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
  tokenizer = GPT2Tokenizer.from_pretrained(model_name)
  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  tokenizer.save_pretrained(output_dir)
      
  model = GPT2LMHeadModel.from_pretrained(model_name)

  model.save_pretrained(output_dir)

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
          save_steps=save_steps,  # checkpoint interval in optimiser steps
      )

  trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
  )
      
  trainer.train()
  trainer.save_model()

## Step 6: Specify Training Arguments

In [6]:
train_file_path = "./train-manual.txt"
model_name = 'gpt2'
output_dir = './model'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 5.0
save_steps = 50000
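
To get a feel for these numbers, here is a rough estimate of how many optimiser steps a run takes. It assumes a single device, no gradient accumulation, and that the text is cut into full blocks of `block_size` tokens with any leftover dropped. The token count is hypothetical, picked only to show the arithmetic, and `estimate_training_steps` is an illustrative helper.

```python
import math

def estimate_training_steps(num_tokens, block_size=128, batch_size=8, epochs=5):
    """Rough optimiser-step count for one device with no gradient accumulation."""
    num_blocks = num_tokens // block_size                  # full blocks only
    steps_per_epoch = math.ceil(num_blocks / batch_size)   # last batch may be partial
    return steps_per_epoch * epochs

# Hypothetical token count, picked only to show the arithmetic
total_steps = estimate_training_steps(num_tokens=1_000_000)
print(total_steps)            # 4885
print(total_steps < 50000)    # True
```

Under these assumptions a million-token data set finishes in well under 50000 steps, so a step interval that large is only reached on much longer runs.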

### 6.1: I didn't want to use WANDB; this is how you disable it

In [7]:
# I didn't want to use wandb
os.environ['WANDB_DISABLED'] = 'true'

## Step 7: Run the training

*Note: 5 epochs was simply what I felt like trying out on my desktop. ~At the start it estimated around 5 hours~ ~with the minimised data it estimated 45 minutes~ the manual first-test data, with very little in it to train on, took 8 seconds.*

The minimised data is super messy, and I need to format the training data way better. This will likely give hallucinations that are similar to what I'm looking for.

Also used:

- ```git lfs track "*.safetensors"```
- ```git lfs track "*.gguf"```

This was for a later stage if I want to upload them to my repo.

In [8]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Step,Training Loss


*Note: I restarted the kernel once here, then ran Step 8. This was to free up GPU memory for the tests below.*

## Step 8: Test The Model

*Warning: This data isn't likely to give the style of answers we would love to have, but the plan is to test if the idea I'm working on will succeed, or not. If the answers below hint at it being possible,* **Version 2** *will follow this starting structure with my idea of a revamp.*

In [9]:
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

In [10]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

model1_path = "./model"
max_length = 512

model = load_model(model1_path)
tokenizer = load_tokenizer(model1_path)

def generate_text(sequence):
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

In [11]:
def Query(question1):
    # Generate an answer for the question itself, not for the model path
    generate_text(question1)

### 8.1: What’s new in Godot 4.3 compared to previous versions?

In [12]:
generate_text('What’s new in Godot 4.3 compared to previous versions?')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What’s new in Godot 4.3 compared to previous versions? You can get an answer with our main post. The Godot Engine tutorial for the new tutorial can be found here
Godot 4.3 also includes an experimental new engine called "Godot Engine". You can use it in your projects, or as a feature in a feature-packed game, to create, test and debug your features. As this project grows, so will your code. We encourage you to contribute your contributions to the Godot Engine project. You can do so by contributing our code and a new feature, or simply by contributing code to Godot Engine with the command line tool 'git clone git@github.com/godotengine/godotengine-3.git' . We also encourage you to contribute feature-packed games that are small, feature-inclusive, easy to develop, fun, and can run without an interpreter, as well as feature-packed languages like C and C++. In particular, we offer a tutorial to help developers learn how to write code with Godot's powerful API. Finally, if you find that you

### 8.2: Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

In [13]:
generate_text('Can you explain the changes in the rendering pipeline introduced in Godot 4.3?')

Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

As you can see, the renderer is now a single pipeline with a single render method and no single object-oriented-style window. While we can change the rendering pipeline with any feature, the new, new-look editor is so powerful and easy to create. And unlike our "Render with a single click" or "Render with HTML5" models, we can draw objects in a lot less time, and we can quickly take them to new places in an editor.

What's the big difference between Godot and other engine-based renderers?

Godot 3 supports a broad set of renderers. You can view a list of our engine-specific renderers, or view the corresponding file in your project root. This documentation provides information on the specific features you can create, including the OpenGL-based renderer, the C#-based renderer, and more.

What should you expect when you use Godot 3?

Godot 3 is a powerful new editor and is very well integrated with your existi

### 8.3: How do you implement a custom shader in Godot 4.3?

In [14]:
generate_text('How do you implement a custom shader in Godot 4.3?')

How do you implement a custom shader in Godot 4.3? First, check out this video series by Nathan van Sonder. It will teach you how to make the best use of your Godot engine.
This one-shot introduction to Godot is the main point of the video series. It contains information about a few common concepts, and shows you how to take advantage of them to your game development goals.
Godot has many of the characteristics that you need to build a high-level engine with. Most of them are easy to learn. They make your game better and more fun. They introduce an open source community, provide a quick tutorial, and encourage you to contribute. But, when you add features, you also get a lot of free time. You can learn more about Godot here: https://godotengine.org/en/docs/stable_and_improving_godot/index.html#doc-godot-engine).
Godot is a popular tool for game engines. As an example, the GIMP engine offers thousands of features. You can write your own editor, add your own engine, build new games, and 

### 8.4: What are the key differences between GDScript and C# in Godot 4.3?

In [15]:
generate_text('What are the key differences between GDScript and C# in Godot 4.3?')

What are the key differences between GDScript and C# in Godot 4.3?

In Godot 4.3, you can have two files called editor.editor.xcodeplex and editor.xcodeplex. For each editor, you can select the file you want to view. For each editor, select a key, and then type the new key. For most examples, you can use the new key to edit the editor.

The editor.xcodeplex file has four sections: the first part of the source code, the main source control file (FOU), and the editor.xcodeplex file contains all the variables you'll need to create the editor. You can find the file's documentation on the Godot 4.3 documentation website, or in a folder on your desktop. The Godot 4.3 source files contain a lot of information, so be sure to read the full documentation for the Godot documentation site before you do so.

As with C#, you can have multiple editor. For instance, you can add as many as 20 new editor.xcodeplex files to a project, a file named editor.unitycloud.org, which contains the files needed to

### 8.5: How would you optimize a game for performance in Godot 4.3?

In [16]:
generate_text('How would you optimize a game for performance in Godot 4.3?')

How would you optimize a game for performance in Godot 4.3?
**This is an open-source article **This is an open-source article is free and open source, so you can redistribute it and/or modify it under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.**\nWe use the work from this article to help develop Godot 4.3, and to help improve the community as a whole. We also use the information provided here to help answer questions you may have, and other developers have asked before!**\n-[docpath_of_project](https://docs.godotengine.org/en/stable/_images/assets/documents/noun_and_descriptive_descriptive.html#doc-noun-and-descriptive-descriptive-descriptive-descriptive)**\n**This project is a volunteer project, and will be hosted exclusively at the Godot Engine. All contributors are welcome, and all contributions must comply with the Godot Code of Conduct. (This proj

### 8.6: Can you describe the process of creating a custom node in Godot 4.3?

In [17]:
generate_text('Can you describe the process of creating a custom node in Godot 4.3?')

Can you describe the process of creating a custom node in Godot 4.3?

The process of creating a custom node starts at your local development or build branch. After that, you'll find several tutorials and tutorials for it. For a complete article on using Godot in a project, we recommend this page: http://godot-doc.godotengine.org/en/getting_started/getting_started/.

[](https://docs.godotengine.org/en/getting_started/getting_started/getting_started/nodes_and_scenes_and_scenes-introduction.html#doc-nodes_and_scenes-introduction)

Creating the Scene

Creating an event node is an important part of a project, but it also provides a lot of resources to the editor. To create an event, you need to provide it with two functions: the event call and its state. In general, you can think of it as "the DOM" for a project. In this tutorial, we'll see one scene, a button and a button handler.

Let's start by showing you how to run the scene and its state using the command line:

nodes.scene("/resource

### 8.7: What are the new features in the animation system of Godot 4.3?

In [18]:
generate_text('What are the new features in the animation system of Godot 4.3?')

What are the new features in the animation system of Godot 4.3? You can make all sorts of complex objects and animations within one step using Godot 4.3. Godot4.3 is no different, it provides a simple way to create complex features and animations. "One of the biggest advantages of Godot is its rich API. It gives you the ability to take any tool and create powerful features. In this article we will introduce the API to you and help you design your first game with Godot.", by G. David McRae-Gershon, 3D and 2D Senior Editor, Godot 4.3.1
The first feature of Godot 4.3.1 is its ability to draw a grid in a 2D scene. Godot's built-in 3D editor allows you to drag and drop objects and scenes as you like. This will allow you to draw your game's grid directly to a tool like Unity! "Godot's powerful engine allows you to program a lot more. It includes features like drawing 3D scenes, 3D editor, graphics library, and more. We are using this engine because Godot allows us to make games that have a p

### 8.8: How do you handle input events in Godot 4.3?

In [19]:
generate_text('How do you handle input events in Godot 4.3?')

How do you handle input events in Godot 4.3?

Input events are events from the Godot engine. You can interact with them using the "input() function" in your editor. Godot also provides a mechanism to interact directly with other engine commands. In particular, you can use Godot's "Script editor" to do most of the work in a single click! If Godot has a command-line interface, you can use that to interact directly with the engine! In any program, that's a good thing!

In the Godot engine, you use Godot's "Script editor" to write many of Godot's functions. The "Script Editor" also allows you to create a menu that displays each script file. "Script Editor" can also have a special "Script Editor Window" where you can use any editor or window to run Godot's script.

How do you write Godot code in C#?

Godot uses C# for its scripting language and is based on the popular C# engine. It uses it as its engine engine and features a number of properties such as type and properties. With Godot's scr

### 8.9: What are the best practices for using the new navigation system in Godot 4.3?

In [20]:
generate_text('What are the best practices for using the new navigation system in Godot 4.3?')

What are the best practices for using the new navigation system in Godot 4.3?

The core design philosophy of Godot has always been very similar to that of previous games. In a previous Godot game, you would start with the basic object-oriented programming language, and you would develop a lot of components based on a single type.

In Godot 3, you do not. Instead, you have to rely on the new engine engine to help you develop the concepts. This engine makes the interface more intuitive, more intuitive in a game that can handle thousands of complex concepts. In Godot 3, you can now write your own custom editor.

In addition, in Godot 2, Godot 2 will support the much-improved editor system and engine called GodotEdit. It allows you to customize much more your editor's parts without having to use Godot's third-party programming language. With GodotEdit, you can add the ability to create custom text editor with just a single click! In addition, GodotEdit allows you to create your own custom 

### 8.10: Can you walk me through setting up a multiplayer game in Godot 4.3?

In [21]:
generate_text('Can you walk me through setting up a multiplayer game in Godot 4.3?')

Can you walk me through setting up a multiplayer game in Godot 4.3? [How's this new engine looking for me?] [What's the game about?](https://docs.godotengine.org/en/stable/_images/d_images/d_game_3.png) [A tutorial to learn to use Unity in Unity.](https://docs.godotengine.org/en/stable/_images/d_scene_0.png) [What are your best friends in Godot 4?](https://docs.godotengine.org/en/stable/_images/d_scene_1.png) [What's the best part about a game in Godot 4.0?] [How can I learn a new engine?] [How much can we make in Godot 4?](https://docs.godotengine.org/en/stable/_images/d_scene_2.png) [What is my best friend?](https://docs.godotengine.org/en/stable/_images/d_scene_3.png) [What's our main character?](https://docs.godotengine.org/en/stable/_images/d_scene_4.png) [Can I make my own character?](https://docs.godotengine.org/en/stable/_images/d_scene_5.png) [How can I create my own characters?](https://docs.godotengine.org/en/stable/_images/d_scene_6.png) [Are you using any other engines?](h

### 8.11: How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

In [22]:
generate_text('How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?')

How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own? Well, let's get started!\nIt's easy to program the code to programatically generate new chunks.\nIn a folder called node_3D.psx, just open the editor. To add any new elements to the scene, select [Create a new scene](https://docs.godotengine.org/en/stable/_images/_creating_new_scene.png).\nYou can then click on the new image to expand it to fit your game's editor.\nBy default, the editor will open and show you a list of available elements. You can also save it as a file, which can be used to save your game to the cloud, or as a reference to a project.\nIn Godot 4.3, you can open the editor with an arrow.\nBy default, Godot will use the built-in "Node.js" compiler, which is also the name of the game engine.\nIn the "Scene Editor", select [Start](https://docs.godotengine.org/en/stable/_images/_scripting-node-nodes/start).\nIn the editor window, type [Start](https://docs.godotengine.org/en/stab

### 8.12: How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

In [23]:
generate_text('How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?')

How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?
**Note** [**1.1**] **Godot engine will not display a node as a node. You must first create a scene with the game engine and make sure that you have a scene's title, engine code, or a folder with all the node's data, and not only have the name "tree_node" set to the node's name.** **1.2** **You can make two nodes, using Godot's "Node Maker" command.** [**1.3**] **You can also create scenes using the "Create Scene" command.** [**2.1**] **The "Create Scene" command enables you to add multiple scenes at once. Create a scene using the command "nodes_scene_create" at the root of your scene window.** [**2.2**] **If you need a more advanced view of the scene's world, you can drag and drop the scenes you want to create.** [**3.1**] **To create a new scene, create a new node, right click, and s

### 8.13: In Godot 4.3, using C#, can you create 2D snake from scratch?

In [24]:
generate_text('In Godot 4.3, using C#, can you create 2D snake from scratch?')

In Godot 4.3, using C#, can you create 2D snake from scratch? For starters, you need to know how to make a C# game engine: it is a tool with thousands of examples. You should know the fundamentals of C#, Java, C#, and Visual Basic so that you can get started. Another good place to start is the C# course at https://cshal.godotengine.org/. If you are interested in learning about C#, feel free to download it. You should also look at the tutorials linked on the right to go through them.


You can find all of the tutorials on the Godot website, here . For more information about C#, see the Godot FAQ's: Godot Wiki | Godot Articles | Godot Help Forum or the Godot Help section.

**Community** Godot is open source, the community is always open to help or contribute, so feel free to report bugs, feature requests, feature requests and support people who need a look at Godot's code. ​Community guidelines are available at http://godot.org/.​ [Godot Wiki](https://godot-wiki.godotengine.org/en/stable

# Review (Personal):

- **Test 1:** 0/13 - 100% failure on first run. I think I need to adjust ```train.txt``` to filter out useless rows. I also think the contents wrap too closely together in each category. That might be why Training Loss was way too high. Next: clean out the model folder, manually revamp the training data, then run the tests again.
- **Test 2:** 0/13 - 100% failure again. Rethinking it, I'm going to test with just one section of the documentation, prepared manually, as the training set. ```./train-minimised.txt``` is still too dirty to give any good results.
- **Test 3:** 2.5ish/13 - classing it as about 80% fail and just giving small points for some of the answer text. My training data in ```./train-manual.txt``` was way too small.