<a href="https://colab.research.google.com/github/cesnyder01/llmfinalproject/blob/main/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Baseline Output** (untrained model)


In [1]:
!pip install transformers
!pip install datasets

# method to clean up output
import re

def trim_to_n_sentences(text, n=3):
    # Use regex to find sentence-ending punctuation
    sentence_endings = re.finditer(r'([.!?])\s+', text)

    count = 0
    end_index = len(text)  # fallback in case there are fewer than n sentences

    for match in sentence_endings:
        count += 1
        if count == n:
            end_index = match.end()
            break

    return text[:end_index].strip()


# load the pretrained GPT-2 model and tokenizer (no fine-tuning yet)

from transformers import GPT2LMHeadModel, GPT2Tokenizer

modelB = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # avoid padding issues
modelB.eval()  # set model to evaluation mode

# input prompts / data

prompts = [
    "There once was a boy who went on an adventure",
    "Once upon a time in a haunted forest,",
    "A long time ago, in a kingdom far, far away, there lived a princess",
    "Our story starts with three little kittens",
    "It was a dark and stormy night"
]

# actual generation

import torch

def generate_baseline(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = modelB.generate(
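            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],  # pass the mask explicitly, since pad and eos share a token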
            max_new_tokens=max_new_tokens,
            do_sample=True,         # Random sampling makes output more natural
            top_k=50,               # Limits to top 50 likely tokens
            top_p=0.95,             # Nucleus sampling
            temperature=0.9,        # Controls randomness
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# printing output

for prompt in prompts:
    generated = generate_baseline(prompt)
    trimmed_output = trim_to_n_sentences(generated, n=5)  # get first 5 sentences
    print(f"\nPrompt:\n{prompt}\n\nCompletion:\n{trimmed_output}\n\n{'-'*40}")







Prompt:
There once was a boy who went on an adventure

Completion:
There once was a boy who went on an adventure and came back for another. He got a glimpse of himself from a place he never thought he would see again: a place where he had not felt himself since childhood. He was, in his own way, a kind of self-made man; he had seen others before him and seen a way and a way again; and yet he was a stranger, a foreigner, a little stranger, who could not go on in the same way. He was alone. He was without hope.

----------------------------------------

Prompt:
Once upon a time in a haunted forest,

Completion:
Once upon a time in a haunted forest, as you explore a cave deep within the forest, you will see a ghostly figure of your hero.

The legend of Dr. Doom has been an ongoing source of intrigue for countless years. In recent years, the world has witnessed one of these things, and is beginning to wonder if his true nature can be revealed.

This is the first adventure in what has been
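
The regex in `trim_to_n_sentences` treats any `.`, `!`, or `?` followed by whitespace as a sentence end, so an abbreviation such as "Dr." in the completion above is counted as a sentence boundary. Below is a minimal sketch of one way to reduce this, assuming a small hand-picked abbreviation list. The list `ABBREVIATIONS` and the name `trim_to_n_sentences_abbrev` are illustrative and not part of the original notebook.

```python
import re

# Hypothetical, hand-picked abbreviation list; extend as needed.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "st", "prof"}

def trim_to_n_sentences_abbrev(text, n=3):
    """Keep the first n sentences, skipping periods that follow a known abbreviation."""
    count = 0
    for match in re.finditer(r'[.!?]\s+', text):
        # Word immediately before the punctuation mark
        words = text[:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # e.g. "Dr. Doom" is not a sentence boundary
        count += 1
        if count == n:
            return text[:match.end()].strip()
    return text.strip()  # fewer than n sentences: return everything

sample = "The legend of Dr. Doom is old. It is strange. Nobody knows why."
print(trim_to_n_sentences_abbrev(sample, n=2))
```

This still misses abbreviations outside the list and initials such as "J. R."; a dedicated sentence tokenizer would be more robust if exact sentence counts matter.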

## **Training The Model**
Code generated by ChatGPT using the Hugging Face `transformers` library.

In [None]:
!pip install hf_xet

from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# 1. Load GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 needs this manually
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # no-op here: reusing eos as pad adds no new tokens

# 2. Dataset: first 500 examples of the Children-Stories-Collection train split

big_dataset = load_dataset("ajibawa-2023/Children-Stories-Collection", split="train")
dataset = big_dataset.select(range(500))

# 3. Tokenization
def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)

tokenized_dataset = dataset.map(tokenize)

# 4. Data collator (helps batch samples for language modeling)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# 5. Training setup
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,  # Small for laptops
    save_steps=1000,
    save_total_limit=1,
    prediction_loss_only=True,
    logging_steps=1000,
    fp16=False,  # Set to True if your GPU supports it
)

# 6. Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

# 7. Train!
trainer.train()






[34m[1mwandb[0m: Currently logged in as: [33mcesnyder01[0m ([33mcesnyder01-william-mary[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
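
The loss table above has a header but no rows. One likely explanation is that the run finishes before the first logging step. The sketch below works through the arithmetic under assumed conditions: a single device, the default gradient accumulation of 1, and no dropped last batch. The helper name `optimizer_steps` is illustrative.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_epochs=1, num_devices=1):
    """Approximate optimizer steps, assuming gradient accumulation of 1."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return batches_per_epoch * num_epochs

# Values taken from the training cell: 500 examples, batch size 1, 1 epoch
total_steps = optimizer_steps(num_examples=500, per_device_batch_size=1, num_epochs=1)
logging_steps = 1000  # same value as in TrainingArguments

print(f"total optimizer steps: {total_steps}")               # 500
print(f"loss rows logged: {total_steps // logging_steps}")   # 0
```

Under these assumptions, a `logging_steps` value well below 500 (for example 50) would be needed to see the training loss during the run. The same reasoning applies to `save_steps=1000`.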


## **Rerunning The Baseline Evaluation** (trained model)

In [None]:
model.eval()

# generation

def generate_baseline(prompt, max_new_tokens=200):
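    # move inputs to the model's device (the Trainer may have placed the model on a GPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],  # pass the mask explicitly, since pad and eos share a token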
            max_new_tokens=max_new_tokens,
            do_sample=True,         # Random sampling makes output more natural
            top_k=50,               # Limits to top 50 likely tokens
            top_p=0.95,             # Nucleus sampling
            temperature=0.9,        # Controls randomness
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# printing output

for prompt in prompts:
    generated = generate_baseline(prompt)
    trimmed_output = trim_to_n_sentences(generated, n=5)  # get first 5 sentences
    print(f"Prompt:\n{prompt}\n\nCompletion:\n{trimmed_output}\n{'-'*40}")
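
The sampling settings passed to `generate` in both cells (`temperature`, `top_k`, `top_p`) reshape the next-token distribution before a token is drawn. Below is a rough pure-Python sketch of that filtering on a toy five-token vocabulary. The helper name `filter_distribution` and the toy logits are illustrative assumptions. The library's own implementation works on tensors and differs in detail.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: prob} after temperature scaling, top-k, then top-p filtering."""
    # Temperature: divide logits before softmax (<1 sharpens, >1 flattens)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    norm = sum(p for _, p in probs)
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p / norm
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}

# Toy logits for a made-up vocabulary (not real GPT-2 values)
toy_logits = {"forest": 2.0, "castle": 1.0, "night": 0.5, "kitten": 0.0, "doom": -1.0}
for token, prob in filter_distribution(toy_logits, temperature=0.9, top_k=3, top_p=0.95).items():
    print(f"{token}: {prob:.3f}")
```

Lowering `top_p` or `top_k` shrinks the set of candidate tokens, which tends to make completions more predictable. Raising `temperature` spreads probability toward less likely tokens.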