# Version 2: Train Godot 4.3 - GPT2

## Plan:

Further train [edg3/godot-4.3-gpt2](https://huggingface.co/edg3/godot-4.3-gpt2) on text extracted from the **Godot 4.3** documentation.

### System Used

This was done on older desktop hardware: Windows 10 with WSL (Ubuntu 22.04), an i7 processor, a GeForce RTX 3050 8 GB OC, and 16 GB of RAM.

### Warning:

**The libraries change regularly, so if you try this yourself you will see warnings (if it still runs at all), mostly about deprecations.**

### Base Code Use

Used as a starter: [311_fine_tuning_GPT2.ipynb](https://github.com/bnsreenu/python_for_microscopists/blob/master/311_fine_tuning_GPT2.ipynb)

Adjusted for further training: this notebook is built on [02.train-godot-4.3-gpt2.ipynb](https://github.com/edg3/GPT-systems/blob/main/04.train-gpt-2/02.train-godot-4.3-gpt2.ipynb)

Note: ```--break-system-packages``` is required for my customised WSL2 instance.

## Step 1: Packages

In [None]:
#!pip install -q transformers torch datasets python-docx --break-system-packages

In [None]:
#!pip install -qU PyPDF2 --break-system-packages

In [1]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

## Step 2: Setup training functions

In [2]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

2024-10-08 05:14:23.587088: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-08 05:14:24.215690: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-08 05:14:24.519163: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-08 05:14:25.808872: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
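
As far as I understand it, `TextDataset` tokenizes the whole file into one long list of token ids and then cuts that list into consecutive blocks of `block_size`, dropping the incomplete block at the end. The sketch below illustrates only that chunking idea in plain Python; `chunk_token_ids` and the toy ids are hypothetical names for illustration, not part of the `transformers` API.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a flat list of token ids into full blocks of `block_size`."""
    blocks = []
    # Step through the ids one block at a time; stop before a partial block.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Toy input: 300 fake token ids stand in for a tokenized training file.
toy_ids = list(range(300))
toy_blocks = chunk_token_ids(toy_ids, block_size=128)
print(len(toy_blocks), len(toy_blocks[0]))  # 2 128
```

The last 44 ids do not fill a block, so they are left out. That is why the number of training examples depends on the file length divided by `block_size`.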

In [4]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator
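
With `mlm=False` the collator prepares batches for causal language modelling: the labels are a copy of the input ids, and padded positions get the label `-100` so the loss ignores them. The model shifts the labels by one position internally. Below is a rough plain-Python sketch of that idea under those assumptions; `make_causal_lm_batch` is a hypothetical helper, and the real collator works on tensors and differs in detail.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def make_causal_lm_batch(sequences, pad_token_id):
    """Pad sequences to equal length and build labels as a copy of the inputs."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, labels = [], []
    for seq in sequences:
        padding = [pad_token_id] * (max_len - len(seq))
        input_ids.append(seq + padding)
        # Real tokens keep their id as the label; padding is masked out.
        labels.append(seq + [IGNORE_INDEX] * len(padding))
    return {"input_ids": input_ids, "labels": labels}


batch = make_causal_lm_batch([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

In this notebook every block has the same length, so no padding should be needed in practice.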

In [5]:
def train(train_file_path, model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
    # Load the tokenizer and build the block-wise training set
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    # Save the tokenizer next to the model so output_dir is self-contained
    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    # Save the starting weights before training begins
    model.save_pretrained(output_dir)

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
        save_steps=save_steps,  # write a checkpoint every save_steps steps
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()

## Step 3: Specify Training Arguments

In [6]:
train_file_path = "./train-smart.txt"
model_name = 'edg3/godot-4.3-gpt2'
output_dir = './model'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 25.0 # Should take around 4h30m in theory
save_steps = 50000
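
To sanity-check a time estimate, it helps to know how many optimizer steps these settings produce. A minimal sketch, assuming a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper and the block count is a made-up figure, not measured from `train-smart.txt`.

```python
import math


def total_optimizer_steps(num_blocks, batch_size, num_epochs):
    """Steps per epoch (rounded up) times the number of epochs."""
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return math.ceil(steps_per_epoch * num_epochs)


# 32_000 blocks is an invented dataset size, chosen only for illustration.
example_steps = total_optimizer_steps(num_blocks=32_000, batch_size=8, num_epochs=25.0)
print(example_steps)  # 100000
```

Dividing the expected wall-clock time by this number gives the seconds per step the hardware would need to sustain.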

### Step 3.1: I didn't want to use wandb, so this is how to disable it

In [7]:
# I didn't want to use wandb
os.environ['WANDB_DISABLED'] = 'true'
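
The `WANDB_DISABLED` environment variable is reported as deprecated in the training output, which suggests `report_to` instead. A small sketch of that alternative, assuming the value `"none"` is passed through to `TrainingArguments`; `logging_kwargs` is a hypothetical name used only here.

```python
# Extra keyword arguments for TrainingArguments; "none" turns off all
# logging integrations (wandb included) without the environment variable.
logging_kwargs = {"report_to": "none"}

# Hypothetical usage inside train():
#   training_args = TrainingArguments(output_dir=output_dir, **logging_kwargs)
print(logging_kwargs)  # {'report_to': 'none'}
```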

## Step 4: Run the training

In [None]:
%%time
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

tokenizer_config.json:   0%|          | 0.00/515 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/912 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


## Step 5: Test The Model

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model

def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

model1_path = "./model"
max_length = 512

model = load_model(model1_path)
tokenizer = load_tokenizer(model1_path)

def generate_text(sequence):
    # Encode the prompt as a batch of one sequence of token ids
    ids = tokenizer.encode(sequence, return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,          # sample instead of greedy decoding
        max_length=max_length,   # prompt plus generated tokens
        pad_token_id=model.config.eos_token_id,  # GPT-2 has no pad token
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))
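
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step: first only the `top_k` most likely tokens are kept, then the smallest group of those whose cumulative probability reaches `top_p`. Here is a toy sketch of that filtering on a plain dictionary of probabilities. `top_k_top_p_filter` and the example words are hypothetical, and the library itself works on logits rather than on a dict like this.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token: prob} for the tokens that survive top-k, then top-p."""
    # Rank tokens from most to least likely and keep the k best.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        p = p / total  # renormalise after the top-k cut
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the token that crosses the threshold is still kept
    norm = sum(kept.values())
    return {token: p / norm for token, p in kept.items()}


toy_probs = {"node": 0.5, "scene": 0.3, "signal": 0.15, "banana": 0.05}
survivors = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(sorted(survivors))  # ['node', 'scene']
```

Lower values of either setting make the output more conservative; higher values allow more unusual tokens.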

In [None]:
def Query(question1):
    # Pass the question itself to the generator, not the model path
    generate_text(question1)

## Step 6: Example Prompts

### 6.1: What’s new in Godot 4.3 compared to previous versions?

In [None]:
generate_text('What’s new in Godot 4.3 compared to previous versions?')

### 6.2: Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

In [None]:
generate_text('Can you explain the changes in the rendering pipeline introduced in Godot 4.3?')

### 6.3: How do you implement a custom shader in Godot 4.3?

In [None]:
generate_text('How do you implement a custom shader in Godot 4.3?')

### 6.4: What are the key differences between GDScript and C# in Godot 4.3?

In [None]:
generate_text('What are the key differences between GDScript and C# in Godot 4.3?')

### 6.5: How would you optimize a game for performance in Godot 4.3?

In [None]:
generate_text('How would you optimize a game for performance in Godot 4.3?')

### 6.6: Can you describe the process of creating a custom node in Godot 4.3?

In [None]:
generate_text('Can you describe the process of creating a custom node in Godot 4.3?')

### 6.7: What are the new features in the animation system of Godot 4.3?

In [None]:
generate_text('What are the new features in the animation system of Godot 4.3?')

### 6.8: How do you handle input events in Godot 4.3?

In [None]:
generate_text('How do you handle input events in Godot 4.3?')

### 6.9: What are the best practices for using the new navigation system in Godot 4.3?

In [None]:
generate_text('What are the best practices for using the new navigation system in Godot 4.3?')

### 6.10: Can you walk me through setting up a multiplayer game in Godot 4.3?

In [None]:
generate_text('Can you walk me through setting up a multiplayer game in Godot 4.3?')

### 6.11: How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

In [None]:
generate_text('How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?')

### 6.12: How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

In [None]:
generate_text('How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?')

### 6.13: In Godot 4.3, using C#, can you create 2D snake from scratch?

In [None]:
generate_text('In Godot 4.3, using C#, can you create 2D snake from scratch?')

## Review (Personal):

*Note: I need to sort out tokens; I only thought of it after starting training v1.*

- Training only reached about step 62k of 100k, with a training loss of 0.460700. I'm undecided on whether this was a smart plan with GPT-2, because the model hallucinates heavily. I will experiment further and see where it gets to.

I believe this first training run, which crashed at around 62% complete, is along the right lines. The next task is to work out how to train it further.
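
One possible way to continue after a crash is to resume from the newest checkpoint folder the trainer wrote into the output directory. A minimal sketch for locating it, assuming the folders follow the `checkpoint-<step>` naming pattern; `latest_checkpoint` is a hypothetical helper and the demo folder names are invented.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N folder, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a throwaway directory with made-up checkpoint names.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (500, 62000, 9000):
        os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(demo_dir))
print(found)  # checkpoint-62000
```

The returned path could then be passed as `trainer.train(resume_from_checkpoint=path)`, provided the trainer is set up with the same arguments as the interrupted run.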

## Step 7: Upload to Hugging Face

In [None]:
#!pip install -q huggingface_hub --break-system-packages

In [None]:
from huggingface_hub import HfApi
from huggingface_hub import create_repo
import os

# Define your model path and repository name
model_path = "model/"
repo_name = "edg3/godot-4.3-gpt2"
my_token = '<your-token-here>'

# Create the repo on the first run; exist_ok makes later runs a no-op
create_repo(repo_name, token=my_token, exist_ok=True)

# List all directories inside the model path (presumably leftover checkpoints)
folders = [f for f in os.listdir(model_path) if os.path.isdir(os.path.join(model_path, f))]
skip_check = False  # set to True to upload even when subfolders exist

if folders and not skip_check:
    # Subfolders found: print them so they can be cleaned up before uploading
    print("Folders inside 'model/':")
    for folder in folders:
        print(folder)
else:
    # Initialize the API
    api = HfApi()

    # Upload the model
    api.upload_folder(
        folder_path=model_path,
        repo_id=repo_name,
        commit_message="Second training",
        token=my_token
    )
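
The folder check above exists to avoid uploading subfolders by accident. As an alternative to cleaning them up by hand, the sketch below previews which files would remain if folders matching a pattern were skipped. `files_to_upload` is a hypothetical helper and `checkpoint-*` is an assumed naming pattern. I believe `upload_folder` also accepts an `ignore_patterns` argument for this purpose, but check the documentation of the installed `huggingface_hub` version.

```python
import fnmatch
import os
import tempfile


def files_to_upload(folder, ignore_dirs=("checkpoint-*",)):
    """List relative file paths under `folder`, skipping ignored directories."""
    selected = []
    for root, dirs, files in os.walk(folder):
        # Prune matching directories in place so os.walk never enters them.
        dirs[:] = [d for d in dirs
                   if not any(fnmatch.fnmatch(d, pat) for pat in ignore_dirs)]
        for name in files:
            selected.append(os.path.relpath(os.path.join(root, name), folder))
    return sorted(selected)


# Demo on a throwaway directory with one model file and one checkpoint file.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "checkpoint-500"))
    for rel in ("config.json", os.path.join("checkpoint-500", "optimizer.pt")):
        open(os.path.join(demo_dir, rel), "w").close()
    upload_list = files_to_upload(demo_dir)
print(upload_list)  # ['config.json']
```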