# Version 1: Train GPT2

## Plan:

Train GPT2 on text extracted from the **Godot 4.3** documentation for 5 epochs, then test the model.

### System Used

This was done on older desktop hardware: Windows 10, WSL Ubuntu 22.04, i7 processor, Radeon 3050 8GB OC, 16GB RAM.

### Warning:

**The libraries used here change regularly, so you will see warnings (if this still runs), mostly about deprecation. Eventually this code will need changes, updates or rewrites.**

### Base Code Use

Used as a starter: [311_fine_tuning_GPT2.ipynb](https://github.com/bnsreenu/python_for_microscopists/blob/master/311_fine_tuning_GPT2.ipynb)

Note: ```--break-system-packages``` is required for my customised WSL2 instance.

## Step 1: Packages

In [None]:
!pip install -q transformers torch datasets python-docx beautifulsoup4 --break-system-packages

In [None]:
!pip install -qU PyPDF2 --break-system-packages

In [None]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

## Step 2: Define file text extract

In [None]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    with open(file_path, "r", encoding="utf-8") as file:  # the txt files in Step 3 are written as UTF-8
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text

## Step 3: Prepare Godot 4.3 text (sample version)

Downloaded ```godot-docs``` from ```https://docs.godotengine.org/en/stable/index.html``` using the ```stable``` variety. As of today it's ```319 MB```.

Extract the contents into ```godot-docs-html-stable```, then run the cell below to extract the text from the HTML (the formatting is still dirty for v1).

**Time Required:** ~4min

In [None]:
%%time
# Convert html files recursively into txt files without format
import os
from bs4 import BeautifulSoup

def convert_html_to_txt(html_file_path, txt_file_path):
    with open(html_file_path, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        text = soup.get_text()
        
    with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

def process_folder(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    
    for root, _, files in os.walk(input_folder):
        for file in files:
            if file.endswith('.html'):
                #print('processing:',file)
                html_file_path = os.path.join(root, file)
                relative_path = os.path.relpath(html_file_path, input_folder)
                txt_file_path = os.path.join(output_folder, os.path.splitext(relative_path)[0] + '.txt')
                splt_path = txt_file_path.split('/')
                #print(splt_path[-1])
                splt_path[-1] = splt_path[-2] + '-' + splt_path[-1]
                convert_html_to_txt(html_file_path, output_folder + '/' + splt_path[-1])
            else:
                print(' - skipped:',file)

input_folder = 'godot-docs-html-stable/tutorials'
output_folder = 'godot-docs-text'

process_folder(input_folder, output_folder)
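
Navigation menus, scripts and styles in the HTML are one likely source of the dirty formatting mentioned above. Below is a minimal sketch of one way to skip them, using only the standard library's `html.parser` instead of BeautifulSoup. The names `VisibleTextExtractor`, `html_to_text` and `SKIPPED_TAGS`, and the choice of tags to skip, are my assumptions; this has not been run against the Godot docs.

```python
from html.parser import HTMLParser

# Tags whose contents are skipped. This list is an assumption,
# not something taken from the Godot docs layout.
SKIPPED_TAGS = {"script", "style", "nav", "footer"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><nav>Menu</nav><p>Hello Godot</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # Hello Godot
```

Something like `html_to_text` could stand in for `soup.get_text()` inside `convert_html_to_txt`, but whether the result is clean enough for training would need checking on real pages.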

## Step 4: Trim excess whitespace

In [None]:
# Read documents from the directory
train_directory = output_folder
text_data = read_documents_from_directory(train_directory)
text_data = re.sub(r'\n+', '\n', text_data).strip()  # Remove excess newline characters

In [None]:
print(text_data[0:200],'...')
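
To make the newline clean-up concrete, here is a small sketch of what the `re.sub(r'\n+', '\n', ...)` call does on a made-up string. It also shows one limitation: a line that holds only spaces sits between two separate newlines, so it survives. The `\n\s*\n` variant is my own suggestion, not part of the original run.

```python
import re

# Made-up sample text, only to show the behaviour of the pattern.
sample = "Title\n\n\nFirst paragraph.\n\nSecond paragraph.\n\n"
collapsed = re.sub(r'\n+', '\n', sample).strip()
print(repr(collapsed))  # 'Title\nFirst paragraph.\nSecond paragraph.'

# A whitespace-only line is not a run of consecutive newlines, so it is kept.
spaced = "a\n   \nb"
print(repr(re.sub(r'\n+', '\n', spaced)))  # 'a\n   \nb'

# Possible variant (assumption): also swallow whitespace between the newlines.
print(repr(re.sub(r'\n\s*\n', '\n', spaced)))  # 'a\nb'
```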

*Uncomment this with train.txt removed to write a new train.txt*

In [None]:
## Only needed once off if you have the data
#with open("./train.txt", "w") as f:
#    f.write(text_data)

### 4.1: Minimised train.txt

In [None]:
def filter_lines(input_file, output_file):
    with open(input_file, 'r') as file:
        lines = file.readlines()

    with open(output_file, 'w') as file:
        for line in lines:
            if len(line.strip()) > 30:
                file.write(line)

# Example usage
input_file = 'train.txt'
output_file = 'train-minimised.txt'
filter_lines(input_file, output_file)
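
The same 30-character rule can be tried in memory before touching any files. This is a small sketch with invented sample lines; `keep_long_lines` is a hypothetical helper, not part of the notebook. It also shows a side effect worth knowing about: short code lines are dropped along with menu leftovers.

```python
def keep_long_lines(lines, min_chars=30):
    """Return the lines whose stripped length is greater than min_chars."""
    return [line for line in lines if len(line.strip()) > min_chars]


# Invented sample lines, roughly the kind of thing the extracted docs contain.
sample_lines = [
    "Previous\n",                                              # navigation leftover
    "Nodes are the basic building blocks of a Godot scene.\n",  # real prose
    "    \n",                                                  # whitespace only
    "func _ready():\n",                                        # short code line
]

kept = keep_long_lines(sample_lines)
print(kept)  # only the prose sentence survives
```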

## Step 5: Setup training functions

In [None]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

In [None]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
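
As far as I understand it, `TextDataset` tokenises the whole file into one long list of token ids and cuts it into consecutive blocks of `block_size`, dropping a final partial block. The sketch below imitates only that chunking step in plain Python, with fake token ids, to show how a small file turns into very few training examples. `split_into_blocks` is a hypothetical helper, not the library's code.

```python
def split_into_blocks(token_ids, block_size=128):
    """Cut a flat list of token ids into consecutive blocks of exactly block_size."""
    blocks = []
    # Stop before the final partial block, so a trailing remainder is dropped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


fake_token_ids = list(range(300))  # stand-in for a tokenised file
blocks = split_into_blocks(fake_token_ids, block_size=128)
print(len(blocks), [len(b) for b in blocks])  # 2 [128, 128]
```

Under this assumption, a file that tokenises to fewer than 128 tokens would produce no training examples at all.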

In [None]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator

In [None]:
def train(train_file_path,model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
  tokenizer = GPT2Tokenizer.from_pretrained(model_name)
  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  tokenizer.save_pretrained(output_dir)
      
  model = GPT2LMHeadModel.from_pretrained(model_name)

  model.save_pretrained(output_dir)

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
          save_steps=save_steps,  # was accepted by train() but never passed on
      )

  trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
  )
      
  trainer.train()
  trainer.save_model()

## Step 6: Specify Training Arguments

In [None]:
train_file_path = "./train-manual.txt"
model_name = 'gpt2'
output_dir = './model'
overwrite_output_dir = True
per_device_train_batch_size = 8
num_train_epochs = 5.0
save_steps = 50000
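
Whether a `save_steps` of 50000 ever triggers a checkpoint depends on how many optimiser steps the run has. Here is a rough sketch of that arithmetic, assuming a single device and no gradient accumulation. The example count of 1000 is hypothetical, not a measurement from my data.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimiser steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(steps_per_epoch * epochs)


num_examples = 1000  # hypothetical number of 128-token blocks
steps = total_training_steps(num_examples, batch_size=8, epochs=5.0)
print(steps)           # 625
print(steps >= 50000)  # False: no intermediate checkpoint at this size
```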

### 6.1: I didn't want to use WANDB, this is how you disable it

In [None]:
# I didn't want to use wandb
os.environ['WANDB_DISABLED'] = 'true'

## Step 7: Run the training

*Note: 5 epochs was simply what I felt like trying out on my desktop. ~The initial estimate was around 5 hours~ ~with the minimised data the estimate was 45 minutes~ the manual first test, with very little training data, took 8 seconds.*

The minimised data is super messy; I need to format the training data much better. This will likely give hallucinations that are similar to what I'm looking for.

Also used:

- ```git lfs track "*.safetensors"```
- ```git lfs track "*.gguf"```

This was for a later stage if I want to upload them to my repo.

In [None]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)

*Note: I restarted the kernel once here, then did Step 8. This was to free up GPU memory for the tests below.*

## Step 8: Test The Model

*Warning: This data isn't likely to give the style of answers we would love to have, but the plan is to test if the idea I'm working on will succeed, or not. If the answers below hint at it being possible,* **Version 2** *will follow this starting structure with my idea of a revamp.*

In [None]:
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

In [None]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

model1_path = "./model"
max_length = 512

model = load_model(model1_path)
tokenizer = load_tokenizer(model1_path)

def generate_text(sequence):
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

In [None]:
def Query(question1):
    # Pass the question itself to the generator (previously passed the model path)
    generate_text(question1)
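
The `top_k=50` and `top_p=0.95` settings above limit which tokens can be sampled at each step. Below is a toy sketch of the idea in plain Python: keep the `top_k` most likely tokens, then keep the smallest prefix of those whose cumulative probability reaches `top_p`. The function name and the toy distribution are made up for illustration. The library works on logits over the full vocabulary, so this shows the concept only, not its implementation.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return the renormalised probabilities that survive top-k then top-p filtering."""
    # Rank tokens from most to least likely and keep the top_k best.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (invented values).
toy_probs = {"node": 0.5, "scene": 0.3, "shader": 0.15, "banana": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(sorted(filtered))  # ['node', 'scene']
```

With the toy values, `top_k=3` removes "banana" and `top_p=0.8` then removes "shader", so sampling would only ever pick "node" or "scene".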

### 8.1: What’s new in Godot 4.3 compared to previous versions?

In [None]:
generate_text('What’s new in Godot 4.3 compared to previous versions?')

### 8.2: Can you explain the changes in the rendering pipeline introduced in Godot 4.3?

In [None]:
generate_text('Can you explain the changes in the rendering pipeline introduced in Godot 4.3?')

### 8.3: How do you implement a custom shader in Godot 4.3?

In [None]:
generate_text('How do you implement a custom shader in Godot 4.3?')

### 8.4: What are the key differences between GDScript and C# in Godot 4.3?

In [None]:
generate_text('What are the key differences between GDScript and C# in Godot 4.3?')

### 8.5: How would you optimize a game for performance in Godot 4.3?

In [None]:
generate_text('How would you optimize a game for performance in Godot 4.3?')

### 8.6: Can you describe the process of creating a custom node in Godot 4.3?

In [None]:
generate_text('Can you describe the process of creating a custom node in Godot 4.3?')

### 8.7: What are the new features in the animation system of Godot 4.3?

In [None]:
generate_text('What are the new features in the animation system of Godot 4.3?')

### 8.8: How do you handle input events in Godot 4.3?

In [None]:
generate_text('How do you handle input events in Godot 4.3?')

### 8.9: What are the best practices for using the new navigation system in Godot 4.3?

In [None]:
generate_text('What are the best practices for using the new navigation system in Godot 4.3?')

### 8.10: Can you walk me through setting up a multiplayer game in Godot 4.3?

In [None]:
generate_text('Can you walk me through setting up a multiplayer game in Godot 4.3?')

### 8.11: How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?

In [None]:
generate_text('How can I use C# to create a procedurally generated 3D chunk in a node in Godot 4.3 on its own?')

### 8.12: How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?

In [None]:
generate_text('How can I make nodes used as chunks, similar to minecraft, to store data in Godot 4.3 so that I can create a synchronous multiplayer world that can be hosted by 1 player, and have a second player connect?')

### 8.13: In Godot 4.3, using C#, can you create 2D snake from scratch?

In [None]:
generate_text('In Godot 4.3, using C#, can you create 2D snake from scratch?')

# Review (Personal):

- **Test 1:** 0/13 - 100% failure on first run. I'm thinking I need to adjust ```train.txt``` to filter out useless rows. I think the content wraps too closely within each category. That might be why Training Loss was way too high. Next: clean the model folder, manually revamp the training data, then run the tests again.
- **Test 2:** 0/13 - 100% failure again. Rethinking it, I'm going to test with just one section of the documentation, prepared manually, as the training set. ```./train-minimised.txt``` is still too dirty to give any good results.
- **Test 3:** 2.5ish/13 - classing it as about 80% fail, with small partial points for some of the answer text. My training data in ```./train-manual.txt``` was way too small.
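
The fail percentages above are simple arithmetic on the scores. A small sketch, where `failure_rate` is a hypothetical helper and the points are my own subjective scoring:

```python
def failure_rate(points, total_questions):
    """Percentage of the available points that were not earned."""
    return 100.0 * (1 - points / total_questions)


for label, points in [("Test 1", 0), ("Test 2", 0), ("Test 3", 2.5)]:
    print(f"{label}: {points}/13 -> {failure_rate(points, 13):.1f}% fail")
```

For Test 3 this gives roughly 80.8%, which is where the "about 80% fail" figure comes from.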