# Version 2: Train Godot 4.3 - GPT2

## Plan:

Train GPT2 on the text of the **Godot 4.3** documentation for 25 epochs, then test the model.

### System Used

This was done on older desktop hardware: Windows 10, WSL Ubuntu 22.04, an i7 processor, a Radeon 3050 8 GB OC, and 16 GB RAM.

### Warning:

**Things change regularly, so if you try this yourself expect warnings (if it still runs at all), mostly about deprecations.**

### Base Code Use

Used as a starter: [311_fine_tuning_GPT2.ipynb](https://github.com/bnsreenu/python_for_microscopists/blob/master/311_fine_tuning_GPT2.ipynb)

Note: ```--break-system-packages``` is required for my customised WSL2 instance.

## Step 1: Packages

In [None]:
!pip install -q transformers torch datasets python-docx --break-system-packages

In [None]:
!pip install -qU PyPDF2 --break-system-packages

In [1]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

## Step 2: Get Training Data

Downloaded ```godot-docs``` from ```https://docs.godotengine.org/en/stable/index.html``` using the ```stable``` version. As of today it's ```319 MB```.

Extract the contents into ```godot-docs-html-stable```, then run the cell below to pull out mostly clean text.

**Time Required:** a couple of seconds

In [None]:
%%time
import os

def get_all_lines(directory):
    all_lines = []
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                all_lines.extend(f.readlines())
    return all_lines

# os.path.join keeps the path valid on both Windows and WSL/Linux
directory = os.path.join('godot-docs-html-stable', '_sources')
lines = get_all_lines(directory)
print(f"Total lines collected: {len(lines)}")

lines = [line.strip() for line in lines if len(line.strip()) > 1]
print(f"Total non-empty lines collected: {len(lines)}")

lines = [line for line in lines if not line.startswith(':')]
print(f"Total lines after removing lines starting with ':': {len(lines)}")

lines = [line for line in lines if not line.startswith('==')]
print(f"Total lines after removing lines starting with '==': {len(lines)}")

lines = [line for line in lines if not line.startswith('--')]
print(f"Total lines after removing lines starting with '--': {len(lines)}")

lines = [line for line in lines if len(line) > 25]
print(f"Total lines after removing lines with 25 or fewer characters: {len(lines)}")

lines = [line for line in lines if len(line.split()) > 3]
print(f"Total lines after removing lines with less than 4 words: {len(lines)}")

lines = [line for line in lines if ' <' not in line or line.index(' <') > 5]
print(f"Total lines after filtering lines that don't contain ' <' or index of ' <' is greater than 5: {len(lines)}")

lines = [line.replace('.. ', '') for line in lines]
print(f"Total lines after replacing '.. ' with '': {len(lines)}")

lines = [line.replace('__', '') for line in lines]
print(f"Total lines after replacing '__' with '': {len(lines)}")

lines = [line.replace('::', ':') for line in lines]
print(f"Total lines after replacing '::' with ':': {len(lines)}")

lines = [line.replace('`_', '`') for line in lines]
print(f"Total lines after replacing '`_' with '`': {len(lines)}")

#print(lines[60:100])
print(len(lines))

with open('train-smart.txt', 'w', encoding='utf-8') as f:
    f.write('All information provided is copyright to the parties associated and listed in the text. This GPT model abides by all those copyrights and falls under them as well.\n')
    f.write('This training data is sourced from the Godot Engine documentation, which is licensed under the MIT license. The documentation is available at https://docs.godotengine.org/en/stable/index.html\n')
    f.write('Important Notice at the end of every answer: `This information is provided as is and is not guaranteed to be correct. Please refer to the official documentation for the most accurate information. The idea started as an experiment, and will hopefully be useful to you for thoughts, ideas, and finding what you might be able to use. (edg3)`\n')
    f.write('\n'.join(lines))
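
The filter chain above can also be read as one keep/drop test plus one normalising step. This is only a sketch of the same rules under that reading; `keep_line`, `normalise` and `clean_lines` are names made up here, not part of the original notebook.

```python
def keep_line(line: str) -> bool:
    """Return True if an already-stripped line passes the same filters as the cell above."""
    if len(line) <= 25:                      # too short to carry a full sentence
        return False
    if line.startswith((':', '==', '--')):   # directives, heading underlines, rules
        return False
    if len(line.split()) <= 3:               # fewer than four words
        return False
    pos = line.find(' <')
    if pos != -1 and pos <= 5:               # ' <' appearing within the first few characters
        return False
    return True


def normalise(line: str) -> str:
    """Apply the same markup replacements, in the same order, as the cell above."""
    for old, new in (('.. ', ''), ('__', ''), ('::', ':'), ('`_', '`')):
        line = line.replace(old, new)
    return line


def clean_lines(raw_lines):
    """Strip, filter, then normalise an iterable of raw text lines."""
    stripped = (raw.strip() for raw in raw_lines)
    return [normalise(line) for line in stripped if keep_line(line)]
```

Because every filter is a plain "drop if" test, the order of the tests does not change which lines survive; only the order of the replacements matters.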

## Step 3: Setup training functions

In [2]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

2024-10-07 05:28:22.132553: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-07 05:28:22.148757: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-07 05:28:22.153861: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-07 05:28:22.166403: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
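
A rough picture of what `block_size = 128` means: as far as I understand it, the tokenised text is cut into consecutive fixed-length blocks and an incomplete tail is dropped. The sketch below only illustrates that idea on a plain list of integers; `chunk_token_ids` is a hypothetical helper, not the library's code.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a flat list of token ids into consecutive blocks of exactly block_size.

    Any leftover ids that do not fill a whole block are discarded.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# 300 fake token ids give two full blocks; the last 44 ids are dropped.
blocks = chunk_token_ids(list(range(300)), block_size=128)
print(len(blocks), len(blocks[0]))  # prints: 2 128
```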

In [4]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator

In [5]:
def train(train_file_path,model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
  tokenizer = GPT2Tokenizer.from_pretrained(model_name)
  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  tokenizer.save_pretrained(output_dir)
      
  model = GPT2LMHeadModel.from_pretrained(model_name)

  model.save_pretrained(output_dir)

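  # Note: save_steps is accepted by train() but is not forwarded here,
  # so the Trainer's default checkpoint interval applies.
  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
      )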

  trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
  )
      
  trainer.train()
  trainer.save_model()

## Step 4: Specify Training Arguments

In [6]:
train_file_path = "./train-smart.txt"
model_name = 'gpt2'
output_dir = './model'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 25.0 # Should take around 9h15m
save_steps = 100000
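
The time estimate depends on how many optimizer steps 25 epochs works out to. Here is a rough sketch of that arithmetic, assuming a single device and no gradient accumulation unless stated. `estimate_total_steps` and the block count in the example are made up for illustration and are not numbers from this run.

```python
import math


def estimate_total_steps(num_blocks, batch_size, epochs, grad_accum=1):
    """Estimate optimizer steps for a run.

    num_blocks: number of training examples (fixed-length token blocks)
    batch_size: examples per batch on the single device assumed here
    epochs:     number of passes over the data (may be fractional)
    grad_accum: batches accumulated per optimizer step
    """
    batches_per_epoch = math.ceil(num_blocks / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


# Hypothetical example: 1000 blocks, batch size 8, 25 epochs.
print(estimate_total_steps(1000, 8, 25.0))  # prints: 3125
```

Dividing a measured wall-clock time by the steps completed so far gives a seconds-per-step figure that can be multiplied back out to check an estimate like the one in the comment above.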

### Step 4.1: I didn't want to use WANDB, this is how you disable it:

In [7]:
# I didn't want to use wandb
os.environ['WANDB_DISABLED'] = 'true'
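
The training output in Step 5 reports that `WANDB_DISABLED` is deprecated and points at `report_to` instead. A minimal sketch of that alternative, written as a plain dictionary of keyword arguments so it can be inspected on its own; the assumption is that it would be unpacked into `TrainingArguments(**training_kwargs)` inside `train()`, which I have not run here.

```python
# Keyword arguments intended for transformers.TrainingArguments(**training_kwargs).
# The values mirror the settings in Step 4; "report_to" is the only addition.
training_kwargs = {
    "output_dir": "./model",
    "overwrite_output_dir": True,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 25.0,
    "report_to": "none",  # turn off wandb and other logging integrations
}

print(training_kwargs["report_to"])  # prints: none
```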

## Step 5: Run the training


Also used:

- ```git lfs track "*.safetensors"```
- ```git lfs track "*.gguf"```

This was for a later stage, in case I want to upload them to my repo.

In [8]:
%%time
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss |
|------|---------------|
| 500 | 1.2424 |
| 1000 | 1.0903 |
| 1500 | 1.0098 |
| 2000 | 1.0074 |
| 2500 | 0.9788 |
| 3000 | 0.9808 |
| 3500 | 0.937 |
| 4000 | 0.9274 |
| 4500 | 0.922 |
| 5000 | 0.9237 |


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)


It stopped around step 62000, so I decided to see how far I could get between 4:30pm and 6:30pm in order to post. I'm hoping that if I clear the outputs where they are I can train more (the model still appears to be in memory; shutting down might prevent continuing).
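
One possible way to continue after a stop like this is to resume from the last checkpoint folder the Trainer wrote into `output_dir` (these are named `checkpoint-<step>`). The sketch below only finds the newest such folder; `latest_checkpoint` is a hypothetical helper, and I have not tried resuming this particular run.

```python
import os
import re

# Matches folder names such as "checkpoint-500" and captures the step number.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Only real directories count; stray files with a matching name are ignored.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("./model"))
```

The returned path could then be passed as `trainer.train(resume_from_checkpoint=path)`, assuming `train()` above is adapted to accept it.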

## Step 6: Test The Model

In [1]:
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

In [2]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model

def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

model1_path = "./model"
max_length = 512

model = load_model(model1_path)
tokenizer = load_tokenizer(model1_path)

def generate_text(sequence):
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))
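
To get a feel for what `top_k=50` and `top_p=0.95` do during sampling, here is a small pure-Python illustration on a toy distribution. It is a simplified sketch of the idea (keep the `top_k` most likely tokens, then the smallest prefix of those whose renormalised probability reaches `top_p`), not the library's implementation; the function names and the toy vocabulary are made up.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token: renormalised probability} after top-k then top-p filtering."""
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    # Top-p: keep the shortest prefix whose cumulative share reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


toy = {"node": 0.5, "scene": 0.3, "shader": 0.15, "banana": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.8)
print(sorted(filtered))  # prints: ['node', 'scene']
print(sample(filtered))
```

Lower values of either setting narrow the choice to the most likely tokens, which tends to make output more repetitive but less random.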

In [3]:
def Query(question1):
    # Pass the question itself (not the model path) to the generator
    generate_text(question1)

### 6.1: What’s new in Godot 4.3 compared to previous versions?

In [4]:
generate_text('What’s new in Godot 4.3 compared to previous versions?')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What’s new in Godot 4.3 compared to previous versions?

It is still in the current version. I've added the "New" tab.

Does it offer any compatibility between PS4 and the current version?

It's not compatible. The new version for MacOS will provide the same game but there is no way to play it on any other computer, I don't want to play on PlayStation 4!

Are you planning on having this feature before this update so that you don't miss out?

Yes! Please consider upgrading as you are not necessarily experiencing the same issues and don't have the same content as before.

What does the difference between Godot 4.3 and previous versions look like?

The older version has a different design.

Do you have any feedback on how Godot 4.3 will be different from your previous titles?

Thanks for your feedback.

How about other issues/issues related to the game and other games in Godot?

Please check out the Godot 4.3 Wiki for more information.

How much should one pay for the Godot 4.3 version?

T

### 6.2: Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

In [5]:
generate_text('Can you explain the changes in the rendering pipeline introduced in Godot 4.3?')

Can you explain the changes in the rendering pipeline introduced in Godot 4.3? And how much more dynamic is the current rendering pipeline compared to Godot 4.0?

Dawn of Creation: The only changes to the current rendering pipeline that you can actually expect to see as of Godot 4.4 are the changes in the scripting language in the next version of the game. It's the only language I have yet to see, but for now it's very similar to Godot and has been in development for more than a year. I have no idea how they might have seen it. I think they're more of a bug and a feature than anything else.

It doesn't look very good in your editor though, do you think.

Dawn of Creation: We started using a slightly different scripting language, Godot, which is very much in a more technical perspective.

I didn't see it for a while, so I stopped playing with the godot thing. Then I started playing it really fast, and it just stopped me a couple of times. I started playing more slowly but I still felt t

### 6.3: How do you implement a custom shader in Godot 4.3?

In [6]:
generate_text('How do you implement a custom shader in Godot 4.3?')

How do you implement a custom shader in Godot 4.3?

First of all, I wanted to start by defining the implementation for the custom shader which is a single-file file. With this in mind I've taken a very simple and intuitive look at the shader definition, added a few additional parameters to the default shader (e.g., a line of code), made sure that the code was as clean as possible and so on.

In the examples below you'll see a simple example of what this would look like:

You can see that we've created a custom shader, as well as adding a couple of functions that will be responsible for executing the shader. All of these will be called if there is one. The function that executes the shader is called asynchronously.

The example below does all this for us. The code will return an object (in this case an object that looks like this):

public static void main(String[] args) { // ... Game.setInitialState (true); }

This code starts with the normal initialization code which returns an object

### 6.4: What are the key differences between GDScript and C# in Godot 4.3?

In [7]:
generate_text('What are the key differences between GDScript and C# in Godot 4.3?')

What are the key differences between GDScript and C# in Godot 4.3?

The first thing is that GDScript uses C# instead of C#. As you can see, I am not saying that you should always use a different C# compiler to the one that you use for GDScript. I am simply saying that with the following C# syntax (you should use C# 3.3.1-3.3.10.11, which is still in its early stages):

C# 3.3.1-3.3.10.11 " void createCursor(void **n) { int num = n; if (num == 2) { // no new cursor created if (n < 0) { // a new one created. return; } // else // a new one created. num++; } } // // }

Another thing that could be confusing is that we often want to use a number in our C# code, so it is often better to use a different C# syntax to it. One example where you may want to consider using C# syntax is in the following code:

void * newCursor() { for (int j = 0; j < num; j++) { NewCursor(); NewCursor(); return 0; } }

This makes sense but is confusing to many people who just know how C# works. With that in mind, I 

### 6.5: How would you optimize a game for performance in Godot 4.3?

In [8]:
generate_text('How would you optimize a game for performance in Godot 4.3?')

How would you optimize a game for performance in Godot 4.3?

I think our game is very easy to optimize and it is very simple to implement. I am sure the next step is to implement a complete system with more than a single layer of optimization and testing.

How much do you believe that Godot 4.3 makes the game more interesting?

I would say that we achieved a level of creativity which will become one of the central themes of the upcoming game. If we can have a level of sophistication with Godot, then we can easily make an entire game about Godot 4.

What do you think of the new game?

That is an interesting question, the game is a lot smarter than many people thought. I think our world is getting smarter, so we should see more of this as more and more games with more of the same design and complexity.

What are some of the other improvements to the game?

I believe there is a more complicated puzzle in this game than in other games with different mechanics, but it is a lot harder. This 

### 6.6: Can you describe the process of creating a custom node in Godot 4.3?

In [9]:
generate_text('Can you describe the process of creating a custom node in Godot 4.3?')

Can you describe the process of creating a custom node in Godot 4.3?

A: Before Godot was released, there was no way to create any custom node on the website, so I spent hours developing a custom module, and then Godot got really good at that – I did the custom node on its own.

We created a custom module for Godot 4 and it was very useful: it provided a full-featured application, which I didn't need, but you had to use a node server, which allowed you to create your own node without any external dependencies (like web hosting). It was easy enough to do, but there were a lot of side-effects.

However Godot 4.1 has a ton of new features, so we added another new feature – a default module to download modules in the app's database, which you can use to browse the content.

And after making a change that added a default module to download modules, the database was loaded from the server and started downloading them at runtime. If your local database is not up to date, or you don't have it,

### 6.7: What are the new features in the animation system of Godot 4.3?

In [10]:
generate_text('What are the new features in the animation system of Godot 4.3?')

What are the new features in the animation system of Godot 4.3?

As soon as you check Godot 4.3 out there is no need for any additional tutorial. If you follow the guide in the video, it will work in a few days. And the tutorial can be downloaded here.

In addition to the graphical feature set we've done, we have also added many new functions, such as:

Ability to share with others

Ability to create other animated characters

Ability to import and export as many animated characters as you can get

The Animation system is an important one because it means that everything that we do, like animation, can be added to a game. It is easy to add and remove as many functions as you want, from the controls we put on to the movement of the characters.

This video demonstrates how we have taken a lot of concepts of the engine from Godot and put it into Godot 4.3.

If you want to see more information about how Godot 4.3 works, read the Godot 4.3 video.

A more detailed description of the game is 

### 6.8: How do you handle input events in Godot 4.3?

In [11]:
generate_text('How do you handle input events in Godot 4.3?')

How do you handle input events in Godot 4.3?

You might have to have some kind of API for the output of these events; so let me provide an example here.

Let's say you're getting all the input events from a page:

// This is the main event here // let events = { type: "input", name: "input" }; let text = event.target: event.text; var name = event.target[2]; console.log(text);

When you run this in a browser, the response will look like this:

Input event = { type: "input" }

But instead of writing a single string for input events, you can write multiple events each with their own type and name.

<input type='text' name='input' data-name= "input" type='text' name='text'/>

Note that the type of these events is not affected by the HTML. But in fact, if you want to write a single event name like this (input for name input event event) or type name (input for text event event) or type name (input for name text event event) but just write a single string for input event events (input for na

### 6.9: What are the best practices for using the new navigation system in Godot 4.3?

In [12]:
generate_text('What are the best practices for using the new navigation system in Godot 4.3?')

What are the best practices for using the new navigation system in Godot 4.3?

First I want to start off by explaining what it's really like to have a real-time view of all these things. The first thing you see in Godot 4.3 is the nav bar, which will be shown next to your location on the screen. You'll also be able to see which of your icons were hit by this game. (The icon on the center of the bar is a light green bar, which has a green and blue outline) You'll then be able to scroll down to that icon to learn the game.

As you scroll down you'll be able to change things, such as whether or not the icons will remain centered on the left side of the screen (in this case this is true for icons with the same icon), if certain icons are being hit more recently, and how frequently they have been hit (like with the health bars above). If you have a specific icon hitting recently, and the icon is getting hit frequently, you'll need to use your scroll wheel to sort through all this info befor

### 6.10: Can you walk me through setting up a multiplayer game in Godot 4.3?

In [13]:
generate_text('Can you walk me through setting up a multiplayer game in Godot 4.3?')

Can you walk me through setting up a multiplayer game in Godot 4.3? If the answer is yes, you're on the right path for one. It's a long story but I'm sure we can work on it someday. What's the main difference between Godot and BK2?

What makes godot different than BK2? What are the differences between them that matter?

For me, the biggest difference is the way in which I approach the game and that's really something that I wanted to try out and try out in Godot 4.3. I'm not going to spoil anything for you, but I'm going to give a couple of specific examples of how you could use the same settings and gameplay elements to build a multiplayer game and it definitely works for me.

BK2 is a multiplayer game with online co-op. How do you decide which modes are the best for online co-op gameplay? How do you decide which modes are the best for multiplayer?

First and foremost I wanted to make this game as good for both game play and community support as I can make it and that's what we've don

### 6.11: How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

In [14]:
generate_text('How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?')

How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

You can use the following methods for creating a procedurally generated chunk for godot 4.3, which you should use to do these same things with this script:

Node creation with a few lines of code.

Node editing with "x", to change the image used for the text node.

Script execution.

The following is a short list of the script steps I followed in the script. If you encounter any bugs or if you have any questions, please email me at joshua@guitarysoftware.com or write to me at joshua@guitarysoftware.com.

Note: The following script code assumes that you have installed "C# 4.3.4" and "Godot 4.3.0".

You can use a console command to create a chunk that will never be included in Godot 4.3. The first command that you invoke will create a chunk that will never be included in godot 4.3. To create a chunk in a command-line tool that I'm using, you can run:

(myScript) $game = $game->newBlock(); $ga

### 6.12: How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

In [15]:
generate_text('How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?')

How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

There are a lot of different ways you can do it.

There is a bunch of tutorials on using the Unity library in the game, so I won't go into detail, but you have to use a simple method.

Here is how we would use it:

1) Load the Unity library and use this (note that this should be done using "loadAllIn")

2) Use this (this is the simplest way I have ever seen the game and I can't imagine it requires any more)

3) You can use "loadAllIn" as a URL

4) You can use the "getAllInURL() method or simply create a new one for each node" command

In the above example, you have been able to use these two methods, "createAndOpenAndOpenBy" and "initAndOpenBy" so that the world could be created automatically. As a general rule of thumb, there are already some "simple" methods for creating/uploading 

### 6.13: In Godot 4.3, using C#, can you create 2D snake from scratch?

In [16]:
generate_text('In Godot 4.3, using C#, can you create 2D snake from scratch?')

In Godot 4.3, using C#, can you create 2D snake from scratch?

The game was only released in 2005 to a limited number of backers and it still hasn't had the chance to receive a second copy. There has been numerous rumors around the world of an Rovio remake and a remake.

However, if you're new to Rovio then be sure to check out this short video of the game, which is about the story of the Rovio 2D Snake.

What do you think of the latest updates to Rovio? Tell us in the comments below!

[via GameFAQs, Kotaku]


## Review (Personal):

*Note: I need to sort out tokens; only thought of it after starting training v1*

- Training only reached around step 62k of 100k, with a training loss of 0.460700. I'm undecided on whether this was a smart plan with GPT2: it hallucinates heavily. I will experiment further and see where it gets to.

I believe this first training run, which crashed at around 62% complete, is along the right lines. The next step is to work out how to train it further.

## Step 7: Upload to HuggingFace

In [35]:
!pip install -q huggingface_hub --break-system-packages

In [39]:
from huggingface_hub import HfApi
from huggingface_hub import create_repo
import os

# Define your model path and repository name
model_path = "model/"
repo_name = "edg3/godot-4.3-gpt2"
my_token = '<your-token-here>' 

# exist_ok=True makes this safe to re-run once the repo already exists
create_repo(repo_name, token=my_token, exist_ok=True)

# List all directories inside the model path
folders = [f for f in os.listdir(model_path) if os.path.isdir(os.path.join(model_path, f))]
skip_check = False

# If subfolders (e.g. checkpoints) exist, list them and skip the upload;
# set skip_check = True to upload anyway
if folders and not skip_check:
    print("Folders inside 'model/':")
    for folder in folders:
        print(folder)
else:
    # Initialize the API
    api = HfApi()

    # Upload the model
    api.upload_folder(
        folder_path=model_path,
        repo_id=repo_name,
        commit_message="Initial model upload",
        token=my_token
    )

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]
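
The folder check before the upload can be sketched as a small helper that separates top-level files from subfolders, so it is clear what would go up and what (for example checkpoint folders) might be worth removing first. `split_upload_candidates` is a name made up here. I also believe `upload_folder` accepts an `ignore_patterns` argument that could skip such folders, but I have not verified that in this notebook.

```python
import os


def split_upload_candidates(model_path):
    """Return (files, subfolders) found directly inside model_path, both sorted."""
    files, folders = [], []
    for name in sorted(os.listdir(model_path)):
        # Directories go to one list, everything else is treated as a file.
        target = folders if os.path.isdir(os.path.join(model_path, name)) else files
        target.append(name)
    return files, folders


if os.path.isdir("model/"):
    files, folders = split_upload_candidates("model/")
    print("Files:", files)
    print("Subfolders:", folders)
```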