<a href="https://colab.research.google.com/github/bnelson05/Generative_Model/blob/main/GenerativeModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part A: Data Loading and Splitting



# Load the tiny_shakespeare Dataset
Use the Hugging Face datasets library’s load_dataset function with "tiny_shakespeare" as the argument.

Inspect the result to confirm you have splits named “train,” “validation,” and “test.”

Notice that each of these splits contains only 1 example (a single long string).
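
One way to confirm the split layout programmatically is a small helper like the sketch below. `summarize_splits` is a hypothetical name, and the plain dictionary is a stand-in for the object returned by `load_dataset`, so the snippet runs without a download. It relies only on iterating over split names, `len()`, and integer indexing.

```python
def summarize_splits(splits):
    """Return {split_name: (number_of_rows, characters_in_first_row)}."""
    summary = {}
    for name in splits:
        rows = splits[name]
        first_text = rows[0]["text"]
        summary[name] = (len(rows), len(first_text))
    return summary


# Stand-in data (an assumption for illustration): three splits, one long string each.
fake_splits = {
    "train": [{"text": "First Citizen:\nBefore we proceed any further, hear me speak.\n"}],
    "validation": [{"text": "GREMIO:\nGood morrow, neighbour Baptista.\n"}],
    "test": [{"text": "BAPTISTA:\nNot in my house, Lucentio\n"}],
}

for name, (num_rows, num_chars) in summarize_splits(fake_splits).items():
    print(f"{name}: {num_rows} row(s), {num_chars} characters in the first row")
```

The same helper should work on the loaded dataset, since it uses the same access pattern as the loop in the next cell.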

In [2]:
!pip install datasets


In [3]:
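from datasets import load_dataset

# The dataset repository ships a loading script, so load_dataset asks for
# confirmation before running it. Passing trust_remote_code=True skips the prompt.
tiny_shakespeare_ds = load_dataset("tiny_shakespeare")

# Each split holds a single row whose "text" field is one long string.
for split in tiny_shakespeare_ds:
  print(f"Split type: {split}")
  example = tiny_shakespeare_ds[split][0]['text']
  print(f"Example (first 100 chars): {example[:100]}")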

The repository for tiny_shakespeare contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/tiny_shakespeare.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Split type: train
Example (first 100 chars): First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You
Split type: validation
Example (first 100 chars): ?

GREMIO:
Good morrow, neighbour Baptista.

BAPTISTA:
Good morrow, neighbour Gremio.
God save you, 
Split type: test
Example (first 100 chars): rance ta'en
As shall with either part's agreement stand?

BAPTISTA:
Not in my house, Lucentio; for, 


# Examine the Data
Retrieve the string from the "train" split. Indexing the split at position 0 returns a dictionary, and its "text" key holds the entire text as a single item.

Print out a small snippet (e.g., the first few hundred characters) to see how it looks. Notice it’s multiple lines of Shakespeare text, separated by \n.

In [4]:
train_string_segment = tiny_shakespeare_ds["train"][0]["text"]
print(train_string_segment[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


# Convert the Single Example into Multiple Lines
You’ll need to split the long string using the newline character ("\n").
Remove any lines that are completely empty or just whitespace.

Finally, you’ll have a list of lines—each line is a small piece of Shakespeare text.
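
The filtering step can be tried on a tiny string first. This is a minimal sketch; `sample` is made-up input modeled on the opening lines shown above, with one empty line and one whitespace-only line added on purpose.

```python
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n\n   \nAll:\nSpeak, speak.\n"

raw_lines = sample.split("\n")
# str.strip() returns "" for empty or whitespace-only lines, and "" is falsy,
# so the condition below drops exactly those lines.
kept_lines = [line for line in raw_lines if line.strip()]

print(len(raw_lines))   # 7 pieces, including the empty and whitespace-only ones
print(kept_lines)
```

Note that the kept lines are not themselves stripped: leading or trailing spaces inside a non-empty line survive. Whether to strip them as well is a design choice.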

In [5]:
split_by_lines = train_string_segment.split("\n")
# Use .strip() to detect empty or whitespace-only lines: https://www.w3schools.com/python/ref_string_strip.asp
final_lines = [line for line in split_by_lines if line.strip()]
print(final_lines[:20])

['First Citizen:', 'Before we proceed any further, hear me speak.', 'All:', 'Speak, speak.', 'First Citizen:', 'You are all resolved rather to die than to famish?', 'All:', 'Resolved. resolved.', 'First Citizen:', 'First, you know Caius Marcius is chief enemy to the people.', 'All:', "We know't, we know't.", 'First Citizen:', "Let us kill him, and we'll have corn at our own price.", "Is't a verdict?", 'All:', "No more talking on't; let it be done: away, away!", 'Second Citizen:', 'One word, good citizens.', 'First Citizen:']


# Create a Dataset of Lines
Transform that list of lines into a Hugging Face Dataset object.

This will give you a dataset with many rows (one row per line), rather than a single row with a giant string.
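
The dictionary handed to the dataset constructor is column-oriented: one key per column, and one list entry per row. The sketch below illustrates that layout in plain Python. `get_row` is a hypothetical helper, not part of the datasets library, and the three lines are sample data.

```python
lines = ["First Citizen:", "Before we proceed any further, hear me speak.", "All:"]

# Column-oriented layout: one key per column, one list entry per row.
columns = {"text": lines}


def get_row(columns, index):
    """Rebuild a single row (a dict) from the column-oriented data."""
    return {name: values[index] for name, values in columns.items()}


num_rows = len(columns["text"])
print(num_rows)
print(get_row(columns, 1))
```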

In [6]:
from datasets import Dataset

# Hugging Face Create a Dataset: https://huggingface.co/docs/datasets/en/create_dataset
dataset_dict = {"text": final_lines}
lines_dataset = Dataset.from_dict(dataset_dict)
print(lines_dataset)

Dataset({
    features: ['text'],
    num_rows: 29242
})


# Split That Dataset into Train & Validation
Use the .train_test_split method (from the datasets library) on your newly created dataset.

Choose a test size (like 0.1, or 10%). The result is a DatasetDict with a “train” split and a “test” split.

Name them train_data and val_data (since we’re treating the test split as validation).

Print out the sizes to confirm you have a healthy number of lines in each.
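
To see what a shuffled 90/10 split does to the row count, here is a plain-Python sketch. `split_indices` is a hypothetical helper, and rounding the held-out count up is an assumption, chosen because it reproduces the sizes this notebook reports for 29242 lines. The library's internals may differ.

```python
import math
import random


def split_indices(num_rows, test_size, seed):
    """Shuffle row indices and hold out a fraction of them."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: round the held-out count up.
    num_test = math.ceil(test_size * num_rows)
    return indices[num_test:], indices[:num_test]


train_idx, val_idx = split_indices(num_rows=29242, test_size=0.1, seed=0)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the shuffle repeatable. `train_test_split` also accepts a `seed` argument, which is worth setting if you want the same split on every run.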

In [7]:
lines_dataset_split = lines_dataset.train_test_split(test_size=0.1)

# The held-out "test" split serves as our validation set
train_data = lines_dataset_split["train"]
val_data = lines_dataset_split["test"]

print(f"Length of train split: {len(train_data)}")
print(f"Length of validation split: {len(val_data)}")

Length of train split: 26317
Length of validation split: 2925


# Part B: Tokenization and Processing

# Load the Model & Tokenizer

**What**: We’ll use distilgpt2, a small distilled version of GPT-2, together with its matching tokenizer.

**Why**: The tokenizer that ships with a pretrained model maps text to exactly the input IDs that model was trained on. In short, it turns text into the integers the model expects.

In [8]:
from transformers import AutoTokenizer

# DistilGPT2 is a small pre-trained language model
model_name = "distilgpt2"
# The tokenizer converts text into the integer IDs the model understands
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# A pad token fills shorter sequences so every sequence in a batch has the same length.
# GPT-2 doesn't define one by default, so we reuse the end-of-sequence token:
tokenizer.pad_token = tokenizer.eos_token
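
What padding looks like in practice can be sketched without loading the tokenizer. `pad_batch` below is a hypothetical helper written for illustration, not a Transformers function. It right-pads lists of token IDs to a common length and builds the matching attention mask, where 0 marks padding that the model should ignore.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# 50256 is GPT-2's end-of-text ID; using it as the pad ID mirrors pad_token = eos_token.
batch = pad_batch([[3152, 8237, 290], [3152]], pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the pad ID and the end-of-text ID are the same number here, the attention mask is what distinguishes real tokens from filler.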


# Write a tokenize_function

**What**: A function that takes a batch of text lines and returns their tokenized form.

ML models take inputs in so-called "batches": passing one input at a time is wasteful, so multiple inputs are usually passed at once. A setting like batch_size = 256 means the model processes 256 inputs together. The whole dataset will not fit in GPU memory at once, so pick a batch size that is large enough for efficiency but small enough to fit on your GPU. If the batch is too large, you will typically get an out-of-memory (OOM) error; think of it as loosely analogous to a segmentation fault, but for GPUs.

**Why**: Hugging Face’s .map() calls this function on each batch behind the scenes. For every batch of lines, the tokenizer returns the corresponding token IDs.
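
The batching idea can be sketched in plain Python. Everything here is illustrative: `iter_batches`, `batched_map`, and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a deliberately crude stand-in (one fake ID per word) rather than a real tokenizer. The point is only the shape of the data: the function receives a dictionary of lists and returns a dictionary of lists.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def batched_map(rows, function, batch_size):
    """Apply function to column-style batches and collect the output rows."""
    out = []
    for batch in iter_batches(rows, batch_size):
        result = function({"text": batch})  # one call per batch, not per line
        for i in range(len(batch)):
            out.append({column: values[i] for column, values in result.items()})
    return out


def toy_tokenize(examples):
    # Stand-in for a real tokenizer: one fake "ID" (the word length) per word.
    return {"input_ids": [[len(word) for word in text.split()] for text in examples["text"]]}


lines = ["Speak, speak.", "All:", "We know't, we know't.", "One word, good citizens.", "Is't a verdict?"]
batch_sizes = [len(batch) for batch in iter_batches(lines, batch_size=2)]
mapped = batched_map(lines, toy_tokenize, batch_size=2)
print(batch_sizes)
print(mapped[0])
```

The last batch is smaller than the others whenever the number of rows is not a multiple of the batch size, so the mapped function should not assume a fixed batch length.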

**Key Points**:
- We set truncation=True with max_length=128 (or 256) to cap the sequence length and save memory.
- Remove any lingering empty lines if needed.
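
As a rough picture of what truncation does to a list of token IDs, consider the sketch below. `truncate_ids` is a hypothetical helper, and keeping the leading tokens is an assumption about the tokenizer's default truncation side. Truncation only shortens sequences; it never lengthens short ones.

```python
def truncate_ids(token_ids, max_length):
    """Keep at most max_length leading tokens; shorter sequences are unchanged."""
    return token_ids[:max_length]


long_sequence = list(range(300))      # pretend token IDs for an unusually long line
short_sequence = [3152, 8237, 290]

print(len(truncate_ids(long_sequence, max_length=128)))
print(truncate_ids(short_sequence, max_length=128))
```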

In [9]:
# Function that takes a batch of text and converts it using the tokenizer
# Processing sentences in batches is more efficient
def tokenize_function(examples):
  return tokenizer(
    examples["text"],
    truncation=True,
    max_length=128,
  )

# Apply .map() to Create train_dataset & val_dataset

**What**: Convert your raw text lines into model-ready tokens.

**Why**: This is the final step before training. We remove the original “text” column, leaving only tokenized forms.

In [11]:
# .map() applies the tokenization function to the data
train_dataset = train_data.map(tokenize_function, batched=True, remove_columns=["text"])
val_dataset = val_data.map(tokenize_function, batched=True, remove_columns=["text"])

print(train_dataset[0])

{'input_ids': [3152, 8237, 290, 13017, 11, 356, 423, 802, 1754, 15625], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
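
A quick sanity check on a tokenized row might look like the sketch below. `check_tokenized_row` is a hypothetical helper, and the row is copied from the printed output. It verifies that the IDs and the mask line up, that the row is non-empty and within `max_length`, and that the mask contains only 0s and 1s.

```python
def check_tokenized_row(row, max_length):
    """Return True if the row looks like valid tokenizer output."""
    ids, mask = row["input_ids"], row["attention_mask"]
    return (
        len(ids) == len(mask)
        and 0 < len(ids) <= max_length
        and all(m in (0, 1) for m in mask)
    )


# Row copied from the notebook's printed output.
row = {
    "input_ids": [3152, 8237, 290, 13017, 11, 356, 423, 802, 1754, 15625],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}
print(check_tokenized_row(row, max_length=128))
```

The mask is all 1s here because the tokenization call does not request padding, so every ID in the row is a real token.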
