<a href="https://colab.research.google.com/github/bnelson05/Generative_Model/blob/main/GenerativeModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part A: Data Loading and Splitting



# Load the tiny_shakespeare Dataset
Use the Hugging Face datasets library’s load_dataset function with "tiny_shakespeare" as the argument.

Inspect the result to confirm you have splits named “train,” “validation,” and “test.”

Notice that each of these splits contains only 1 example (a single long string).

In [1]:
!pip install datasets


In [2]:
from datasets import load_dataset

tiny_shakespeare_ds = load_dataset("tiny_shakespeare")

for split in tiny_shakespeare_ds:
  print(f"Split type: {split}")
  example = tiny_shakespeare_ds[split][0]['text']
  print(f"Example (first 100 chars): {example[:100]}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for tiny_shakespeare contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/tiny_shakespeare.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Split type: train
Example (first 100 chars): First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You
Split type: validation
Example (first 100 chars): ?

GREMIO:
Good morrow, neighbour Baptista.

BAPTISTA:
Good morrow, neighbour Gremio.
God save you, 
Split type: test
Example (first 100 chars): rance ta'en
As shall with either part's agreement stand?

BAPTISTA:
Not in my house, Lucentio; for, 


# Examine the Data
Retrieve the string from the "train" split. (For example, you’ll see a dictionary with a key like "text"—that’s your single item.)

Print out a small snippet (e.g., the first few hundred characters) to see how it looks. Notice it’s multiple lines of Shakespeare text, separated by \n.

In [3]:
train_string_segment = tiny_shakespeare_ds["train"][0]["text"]
print(train_string_segment[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


# Convert the Single Example into Multiple Lines
You’ll need to split the long string using the newline character ("\n").
Remove any lines that are completely empty or just whitespace.

Finally, you’ll have a list of lines—each line is a small piece of Shakespeare text.

In [4]:
split_by_lines = train_string_segment.split("\n")
# Use .strip() function for clearing whitespace: https://www.w3schools.com/python/ref_string_strip.asp
final_split_lines = [line for line in split_by_lines if line.strip()]


# def line_chunking(lines, size):
#   line_groups = []
#   for section in range(0, len(lines), size):
#     line_groups.append(" ".join(lines[section:section + size]))
#   return line_groups

# final_grouped_lines = line_chunking(final_split_lines, 5)

print("Original dataset length: ", len(train_string_segment))
print("Split on newline dataset length: ", len(final_split_lines))
# print("Grouped by chunks of 5 dataset length: ", len(final_grouped_lines))

print(final_split_lines[:5])
# print(final_grouped_lines[:5])

Original dataset length:  1003854
Split on newline dataset length:  29242
['First Citizen:', 'Before we proceed any further, hear me speak.', 'All:', 'Speak, speak.', 'First Citizen:']
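
To see the filtering step in isolation, here is a minimal sketch on a tiny made-up string (the `sample_text` value and the helper names are illustrative, not part of the notebook). It also shows the optional grouping idea from the commented-out `line_chunking` helper, in case you want longer training examples than single lines.

```python
# Toy input standing in for the real training string (hypothetical sample).
sample_text = "First Citizen:\nBefore we proceed.\n\n   \nAll:\nSpeak, speak.\n"

def split_nonempty_lines(text):
    """Split on newlines and drop lines that are empty or whitespace-only."""
    return [line for line in text.split("\n") if line.strip()]

def group_lines(lines, size):
    """Join consecutive lines into space-separated chunks of up to `size` lines."""
    return [" ".join(lines[i:i + size]) for i in range(0, len(lines), size)]

lines = split_nonempty_lines(sample_text)
chunks = group_lines(lines, 3)

print(lines)   # ['First Citizen:', 'Before we proceed.', 'All:', 'Speak, speak.']
print(chunks)  # ['First Citizen: Before we proceed. All:', 'Speak, speak.']
```

The blank line and the whitespace-only line both disappear because `"".strip()` and `"   ".strip()` are empty strings, which Python treats as false.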


# Create a Dataset of Lines
Transform that list of lines into a Hugging Face Dataset object.

This will give you a dataset with many rows (one row per line), rather than a single row with a giant string.

In [5]:
from datasets import Dataset

# Hugging Face Create a Dataset: https://huggingface.co/docs/datasets/en/create_dataset
dataset_dict = {"text": final_split_lines}
lines_dataset = Dataset.from_dict(dataset_dict)
print(lines_dataset)

Dataset({
    features: ['text'],
    num_rows: 29242
})


# Split That Dataset into Train & Validation
Use the .train_test_split method (from the datasets library) on your newly created dataset.

Choose a test size (like 0.1, or 10%). The result is a DatasetDict with a “train” split and a “test” split.

Name them train_data and val_data (since we’re treating the test split as validation).

Print out the sizes to confirm you have a healthy number of lines in each.

In [6]:
lines_dataset_split = lines_dataset.train_test_split(test_size = 0.1)

train_data = lines_dataset_split["train"]
val_data = lines_dataset_split["test"]

print(f"Length of train split: {len(train_data)}")
print(f"Length of validation split: {len(val_data)}")

Length of train split: 26317
Length of validation split: 2925
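
The sizes above can be checked by hand. The sketch below assumes the split rounds the test share up and the train share down, which is consistent with the printed sizes (26317 and 2925 out of 29242 lines); the helper names are hypothetical and this is plain Python, not the library's implementation. It also shows why a fixed seed matters: the notebook's call passes no `seed`, so the exact lines in each split change from run to run.

```python
import math
import random

def split_sizes(n_rows, test_size):
    """Row counts for a fractional split: test share rounded up, train share rounded down."""
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test

def shuffled_split(rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut them into train and held-out parts."""
    n_train, n_test = split_sizes(len(rows), test_size)
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    train = [rows[i] for i in indices[:n_train]]
    held_out = [rows[i] for i in indices[n_train:n_train + n_test]]
    return train, held_out

print(split_sizes(29242, 0.1))  # (26317, 2925)
```

With the real dataset, passing a `seed` argument to `.train_test_split` gives the same kind of repeatability.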


# Part B: Tokenization and Processing

# Load the Model & Tokenizer

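**What**: We’ll fine-tune distilgpt2, a distilled (smaller and faster) version of GPT-2. In this step we load only its tokenizer; the model weights are loaded later, in Trainer Setup.

**Why**: The tokenizer has to match the model. It maps text to the same integer token IDs the pretrained model was trained on, so the model receives inputs it can interpret.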

In [7]:
from transformers import AutoTokenizer

# DistilGPT2 is a pre-trained language model
model_name = "distilgpt2"
# Tokenizer is a tool that converts text to numbers that the model can understand
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# A pad token fills shorter sequences so every sequence in a batch has the same length
# GPT-2 doesn't define a pad token by default, so we reuse the end-of-sequence token:
tokenizer.pad_token = tokenizer.eos_token

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
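
To make the role of the pad token concrete, here is a plain-Python sketch of what padding a batch looks like. It is an illustration under assumptions, not the tokenizer's own code: the `pad_batch` helper is hypothetical, padding is added on the right, and 50256 is used as the pad ID because it is GPT-2's end-of-sequence ID.

```python
PAD_ID = 50256  # GPT-2's end-of-sequence ID, reused here as the pad ID

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[18546, 4241, 338], [921]])
print(batch["input_ids"])       # [[18546, 4241, 338], [921, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask is what lets the model tell padding apart from a real end-of-sequence token, since both share the same ID.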



# Write a tokenize_function

**What**: A function that takes a batch of text lines and returns their tokenized form. Working in batches means handling many inputs per call instead of one at a time, which is far more efficient. The same idea applies to the model: a setting like batch_size = 256 means the model processes 256 inputs at once. Larger batches use the GPU more efficiently, but each batch has to fit in GPU memory, so choose a batch size that is as large as your GPU allows and no larger. If the batch is too big, training stops with a GPU out-of-memory error (roughly the GPU counterpart of a program crashing because it ran out of RAM).

**Why**: Hugging Face’s .map() calls this function on each batch behind the scenes. For each batch of lines, the tokenizer returns the corresponding token IDs.

**Key Points**:
We set truncation=True with max_length=128 (256 also works) so that no single sequence gets long enough to strain memory.
Empty lines were already removed in Part A; filter again here only if any remain.

In [8]:
# Function that takes a batch of text and converts it using the tokenizer
# Processing sentences in batches is more efficient
def tokenize_function(examples):
  return tokenizer(
    examples["text"],
    truncation=True,
    max_length=128
  )
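
The batching and truncation ideas can be sketched without any library. In the example below, `toy_tokenize` is a hypothetical stand-in for the real tokenizer (it just splits on whitespace), and `iter_batches` mimics the way a batched map hands the function one slice of rows at a time.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each holding at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def toy_tokenize(batch, max_length):
    """Hypothetical tokenizer: one token per whitespace-separated word, truncated to max_length."""
    return {"tokens": [line.split()[:max_length] for line in batch["text"]]}

rows = ["Speak, speak.", "Before we proceed any further, hear me speak.", "All:"]

for chunk in iter_batches(rows, batch_size=2):
    # Each call receives a dict of columns, just like the function passed to .map()
    print(toy_tokenize({"text": chunk}, max_length=4))
```

The second line has eight words, so truncation keeps only its first four; shorter lines pass through unchanged.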

# Apply .map() to Create train_dataset & val_dataset

**What**: Convert your raw text lines into model-ready tokens.

**Why**: This is the final step before training. We remove the original “text” column, leaving only tokenized forms.

In [9]:
# .map() applies the tokenization function to the data
train_dataset = train_data.map(tokenize_function, batched=True, remove_columns=["text"])
val_dataset = val_data.map(tokenize_function, batched=True, remove_columns=["text"])

print(train_dataset[0])
print(len(train_dataset))
print(len(val_dataset))

{'input_ids': [18546, 4241, 338, 1474, 26, 314, 2314, 2740, 13, 921, 11, 8433, 4015], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
26317
2925


# Part C: Fine-Tuning and Generation

# Training Arguments
Create a TrainingArguments object that specifies:

Where to save the model output (like output_dir="./distilgpt2-finetuned-shakespeare").

Number of epochs (e.g., 1–3 for quick tests; more if you want deeper fine-tuning).

Batch size (often small, like 2, if you’re on limited GPU memory).

Logging and evaluation frequency (for instance, log every 50 steps, evaluate every 100 steps).

In [10]:
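from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./distilgpt2-finetuned-shakespeare",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_steps=50,       # log the training loss every 50 steps
    eval_strategy="steps",  # evaluate on a step schedule instead of once per epoch
    eval_steps=100,         # run evaluation every 100 steps
    save_steps=13000,       # write a checkpoint every 13,000 steps
    report_to="none",       # disable external experiment trackers
)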
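
It helps to work out how many optimizer steps these settings imply before starting a long run. The sketch below is simple arithmetic under stated assumptions: a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped. The `training_schedule` helper is hypothetical.

```python
import math

def training_schedule(n_examples, batch_size, epochs, logging_steps, eval_steps, save_steps):
    """Estimate step counts for a run on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # the last partial batch still counts
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "evaluations": total_steps // eval_steps,
        "step_checkpoints": total_steps // save_steps,
    }

schedule = training_schedule(
    n_examples=26317, batch_size=2, epochs=3,
    logging_steps=50, eval_steps=100, save_steps=13000,
)
print(schedule)
```

For this notebook that gives 13159 steps per epoch and 39477 steps in total, which agrees with the `global_step=39477` reported by the training run. It also shows that evaluating every 100 steps means close to 400 passes over the validation set, so raising `eval_steps` is an easy way to shorten the run.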

# Trainer Setup
Use the Hugging Face Trainer class and pass in:

model: a GPT-2–style model (e.g., distilgpt2) loaded from AutoModelForCausalLM.

training_args: the arguments from above.

train_dataset and eval_dataset: the tokenized datasets from Part B.

data_collator: a collator that pads data for causal language modeling (if needed).

In [11]:
from transformers import AutoModelForCausalLM, DataCollatorForLanguageModeling
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
# Pads each batch to a common length and builds the labels
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    # mlm=False selects causal (next-token) language modeling instead of masked LM
    mlm=False
)
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    data_collator = data_collator
)
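
For causal language modeling the labels are just the input IDs again, with padded positions replaced by -100 so the loss skips them; the model then shifts the labels by one position internally so each token is trained to predict the next one. The plain-Python sketch below illustrates that idea with hypothetical helper names. One consequence worth noting, given that this notebook reuses the end-of-sequence token as the pad token: any position holding that ID is masked out of the loss, padding or not.

```python
IGNORE_INDEX = -100  # label value the loss function skips
PAD_ID = 50256       # pad ID used in this notebook (same as GPT-2's end-of-sequence ID)

def make_causal_lm_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs as labels and mask out every position that holds the pad ID."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in input_ids]

def next_token_pairs(input_ids, labels):
    """(context token, target token) pairs left after the one-position shift, skipping masked targets."""
    return [(inp, tgt) for inp, tgt in zip(input_ids[:-1], labels[1:]) if tgt != IGNORE_INDEX]

padded = [[18546, 4241, 338, 1474], [921, 11, 50256, 50256]]
labels = make_causal_lm_labels(padded)

print(labels)
print(next_token_pairs(padded[1], labels[1]))
```

The second sequence contributes a single training pair, because its two padded positions are ignored.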


# .train() Method
Call .train() on your Trainer object. You should see logs showing training loss and eval loss as the steps progress.

Keep an eye on:

Train Loss – does it steadily decrease?

Eval Loss – does it decrease as well, or start to level off/oscillate?

In [12]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


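| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 4.7434 | 5.230287 |
| 200  | 4.6582 | 5.102442 |
| 300  | 4.3147 | 5.020924 |
| 400  | 4.4256 | 4.948077 |
| 500  | 4.5762 | 4.905717 |
| 600  | 4.2553 | 4.865808 |
| 700  | 4.3784 | 4.809505 |
| 800  | 4.2314 | 4.772165 |
| 900  | 4.145  | 4.763935 |
| 1000 | 4.2325 | 4.7189   |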


TrainOutput(global_step=39477, training_loss=3.409353329906632, metrics={'train_runtime': 3279.0419, 'train_samples_per_second': 24.077, 'train_steps_per_second': 12.039, 'total_flos': 221950665474048.0, 'train_loss': 3.409353329906632, 'epoch': 3.0})
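
One way to read these numbers is to convert them to perplexity, which is the exponential of the loss when the loss is an average cross-entropy in natural-log units (the usual convention, assumed here). The sketch below uses the logged values from the first 1000 steps; the helper names are hypothetical.

```python
import math

# Values copied from the training log above (steps 100 to 1000)
train_losses = [4.7434, 4.6582, 4.3147, 4.4256, 4.5762, 4.2553, 4.3784, 4.2314, 4.145, 4.2325]
eval_losses = [5.230287, 5.102442, 5.020924, 4.948077, 4.905717,
               4.865808, 4.809505, 4.772165, 4.763935, 4.7189]

def perplexity(loss):
    """Perplexity for an average cross-entropy loss measured in nats."""
    return math.exp(loss)

def is_strictly_decreasing(values):
    """True when every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))

print(f"eval perplexity at step 100:  {perplexity(eval_losses[0]):.1f}")
print(f"eval perplexity at step 1000: {perplexity(eval_losses[-1]):.1f}")
print("eval loss strictly decreasing: ", is_strictly_decreasing(eval_losses))
print("train loss strictly decreasing:", is_strictly_decreasing(train_losses))
```

The validation loss falls at every logged step, while the training loss bounces around. That is expected: each logged training value averages only the most recent 50 steps of two examples each, whereas the validation loss is computed over the whole validation set.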

In [None]:
# model.to('cpu')
# model.save_pretrained('./distilgpt2-trained')
# tokenizer.save_pretrained('./distilgpt2-trained')

In [None]:
# model = AutoModelForCausalLM.from_pretrained('./distilgpt2-trained')
# tokenizer = AutoTokenizer.from_pretrained('./distilgpt2-trained')

# Generate a Test Sample
After training, set the model to eval mode and choose a prompt (like "Thus speaks").

Use the model’s .generate(...) or a pipeline("text-generation", ...) to produce text.

Compare this generated text to the un-fine-tuned model’s output—do you see more Shakespearean style?

In [13]:
from transformers import pipeline
model.eval()
text_generator_finetuned = pipeline("text-generation", model=model, tokenizer="distilgpt2")
generated_text_finetuned = text_generator_finetuned("Thus speaks", max_length=100, num_return_sequences=1)

text_generator_original = pipeline("text-generation", model="distilgpt2", tokenizer="distilgpt2")
generated_text_original = text_generator_original("Thus speaks", max_length=100, num_return_sequences=1)


print("Generated text using the fine-tuned model:")
print(generated_text_finetuned[0]['generated_text'])
print("")
print("Generated text using the original model:")
print(generated_text_original[0]['generated_text'])

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text using the fine-tuned model:
Thus speaks my mind more. Thou, hast made it seem, to me, in my sight, to me, by me,--as I, my father;'--and, I hope, is thy love--I love, so. O, the rest, my joy. O, think, and, gentle Mercutio. O, my heart! I'll not, go. Fare you well; but the rest of mine, gentle lord, hear me speak. Fare you well.

Generated text using the original model:
Thus speaks to a recent study in Japan. The researchers analyzed over 300,000 samples from a series of 11,000 samples sent out by the Fukushima (and Fukushima) tsunami. The sampling site is a three-stage water treatment system that is well functioning and fully under construction.




But more importantly, the researchers analyzed the data, and found that in just about every part of the city, it has long been a bottleneck on the road to a new facility. Once the first
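
Both outputs above come from the same loop: the model proposes a distribution over next tokens, one token is picked, and the process repeats until the length limit is reached. The sketch below imitates that loop with a hypothetical lookup table in place of a real model, assuming sampling is in use (which the varied outputs suggest). It illustrates two practical points: `max_length` counts the prompt tokens too, and results only repeat if the random seed is fixed.

```python
import random

# Hypothetical toy "model": last token -> (candidate next tokens, their weights)
TOY_NEXT = {
    "Thus": (["speaks"], [1.0]),
    "speaks": (["my", "the"], [0.7, 0.3]),
    "my": (["mind", "heart"], [0.5, 0.5]),
    "the": (["king"], [1.0]),
}
EOS = "<eos>"

def generate(prompt_tokens, max_length, seed):
    """Sample one token at a time until max_length (prompt included) or end-of-sequence."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates, weights = TOY_NEXT.get(tokens[-1], ([EOS], [1.0]))
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens

print(generate(["Thus", "speaks"], max_length=4, seed=0))
```

With the real pipeline, calling `transformers.set_seed(...)` before each generation is one way to make the fine-tuned and original outputs reproducible for a fairer side-by-side comparison.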
