I’d like you to complete two parts:

1. Part A: Data Loading & Splitting
2. Part B: Tokenization & Preprocessing

The goal is for you to get comfortable with Hugging Face’s datasets library for data manipulation and with the tokenizer for preparing text for our language model. Unlike our previous assignment, we're now going to work on generative models rather than a classifier (think ChatGPT, which produces new text instead of just answering yes/no as in sentiment analysis). Below are the detailed instructions for each part. Take your time to read through why we do these steps, not just how to code them. By the end, you’ll have a nicely prepared dataset, ready for full training in a future assignment.

This is just the setup. I'll send more instructions in a couple of days on how to actually get to the fine-tuning part, once I see how you guys are doing. Feel free to reach out if anything is unclear, and if you're stuck somewhere, reach out ASAP. Please push your code at regular intervals so that I can keep tabs on your progress.

Neither part requires a GPU, so don't select one on Colab for now. This will save your credits for when we actually need them.



# Part A: Data Loading & Splitting

## 1. Load the tiny_shakespeare Dataset
Use the Hugging Face datasets library’s load_dataset function with "tiny_shakespeare" as the argument.
Inspect the result to confirm you have splits named “train,” “validation,” and “test.” (don't worry about what these mean for now, we'll discuss them when we meet next time)
Notice that each of these splits contains only 1 example (a single long string).

## 2. Examine the Data
Retrieve the string from the "train" split. (For example, you’ll see a dictionary with a key like "text"—that’s your single item.)
Print out a small snippet (e.g., the first few hundred characters) to see how it looks. Notice it’s multiple lines of Shakespeare text, separated by \n.


In [3]:
from datasets import load_dataset

shakespeare_dataset = load_dataset("tiny_shakespeare", trust_remote_code=True)
print(shakespeare_dataset)

# Note: without trust_remote_code=True, this call raised an error and the dataset did not load from HF.

Downloading data:   0%|          | 0.00/1.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})


In [9]:
# Check that the dataset loaded correctly by pulling out the training text.

# shakespeare_dataset['train'] is a Dataset object, [0] selects its first (and only) row,
# which is a dictionary like {"text": "..."}, and ['text'] extracts the string itself.
train_text = shakespeare_dataset['train'][0]['text']

print(train_text[:120])  # print the first 120 characters

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved ra


## 3. Convert the Single Example into Multiple Lines
You’ll need to split the long string using the newline character ("\n").
Remove any lines that are completely empty or just whitespace.
Finally, you’ll have a list of lines—each line is a small piece of Shakespeare text.

## 4. Create a Dataset of Lines
Transform that list of lines into a Hugging Face Dataset object.
This will give you a dataset with many rows (one row per line), rather than a single row with a giant string.

## 5. Split That Dataset into Train & Validation
Use the .train_test_split method (from the datasets library) on your newly created dataset.
Choose a test size (like 0.1, or 10%). The result is a DatasetDict with a “train” split and a “test” split.

Name them train_data and val_data (since we’re treating the test split as validation).

Print out the sizes to confirm you have a healthy number of lines in each.
At the end of doing this, you should have a script that ends with you having train_data and val_data each containing multiple lines of Shakespeare text.


In [10]:
# Split the single long string into individual lines at each newline character.
lines = train_text.split('\n')

# Strip leading/trailing whitespace from each line and drop lines that are empty afterwards.
removed_whitespace_lines = [line.strip() for line in lines if line.strip() != '']

print(removed_whitespace_lines[:10])

['First Citizen:', 'Before we proceed any further, hear me speak.', 'All:', 'Speak, speak.', 'First Citizen:', 'You are all resolved rather to die than to famish?', 'All:', 'Resolved. resolved.', 'First Citizen:', 'First, you know Caius Marcius is chief enemy to the people.']


In [11]:
from datasets import Dataset

# Build a dictionary that maps the column name "text" to the list of cleaned lines.
data_dictionary = {"text": removed_whitespace_lines}

# Create a Dataset with one row per line from that dictionary.
lines_dataset = Dataset.from_dict(data_dictionary)

# Preview the first 10 rows to confirm the conversion worked.
print(lines_dataset[:10])

{'text': ['First Citizen:', 'Before we proceed any further, hear me speak.', 'All:', 'Speak, speak.', 'First Citizen:', 'You are all resolved rather to die than to famish?', 'All:', 'Resolved. resolved.', 'First Citizen:', 'First, you know Caius Marcius is chief enemy to the people.']}


In [15]:
# Randomly split the dataset into 90% train and 10% validation, as instructed.
# (Question: does 90% seem a little high on the training side?)

split_dataset = lines_dataset.train_test_split(test_size=0.1, seed=42)  # fixed seed so the split is reproducible

# train_test_split names the held-out split "test"; we treat it as validation.
# The names train_data and val_data match what Part B expects.
train_data = split_dataset['train']
val_data = split_dataset['test']

# Print the number of lines in each split.
print(len(train_data))
print(len(val_data))

26317
2925


# Part B: Tokenization & Preprocessing

For this part, I'm giving you most of the code. Your job is to basically glue this together. This will require some reading of the docs on your part so make sure to look up the functions I'm describing here and what they do. We'll talk about these a lot next time.

## 1. Load the Model & Tokenizer
What: We’ll use the model distilgpt2.

Why: A pretrained tokenizer ensures we map text to the correct input IDs for our model (basically it maps text to numbers which computers can understand).

Code Snippet:
from transformers import AutoTokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# GPT-2 doesn't have a pad token by default, so reuse the end-of-sequence token:
tokenizer.pad_token = tokenizer.eos_token
Expected: A loaded tokenizer that can handle Shakespeare text, turning each line into tokens.

## 2. Write a tokenize_function
What: A function that takes a batch of text lines and returns their tokenized form. ML models take inputs in so-called "batches": passing one input at a time is wasteful, so multiple inputs are usually passed at once. When you see something like batch_size = 256, it means the model takes in 256 inputs at the same time. Ideally we'd pass the entire dataset in a single batch, but GPUs don't have enough memory for that, so we need a batch size large enough to be efficient but small enough to fit on your GPU. (If your batch size is too large for your GPU, you'll get an out-of-memory error; think of it as something like a segmentation fault, but for GPU memory.)

Why: Hugging Face’s .map() calls this function on each batch behind the scenes. Essentially, for a batch of sentences, the tokenizer maps them to numbers.

Key Points:
We set truncation=True and a max_length of 128 or 256 so that overly long lines are cut off, which keeps memory use bounded.
Remove any lingering empty lines if needed.

Code Snippet:
def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        truncation=True, 
        max_length=128
    )
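
To make the batching idea concrete, here is a minimal pure-Python sketch of roughly what happens when .map() is called with batched=True: rows are grouped into chunks, and the function receives each chunk as a dictionary of lists. The helpers `iter_batches` and `toy_tokenize` are hypothetical names invented for this illustration (the toy just splits on whitespace, whereas the real tokenizer produces subword IDs), and this is not how the datasets library is implemented internally.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def toy_tokenize(examples):
    """Hypothetical stand-in for tokenize_function: one list of "tokens" per line."""
    return {"tokens": [line.split() for line in examples["text"]]}


lines = ["First Citizen:", "Speak, speak.", "All:", "Resolved. resolved.", "We know't."]

for batch in iter_batches(lines, batch_size=2):
    # With batched=True the function sees a dict of lists, not a single row.
    examples = {"text": batch}
    print(len(batch), toy_tokenize(examples)["tokens"])
```

Note that the last batch is smaller than the others when the number of rows is not a multiple of the batch size. The real .map() uses a default batch_size of 1000 rows unless you pass a different value.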
    
## 3. Apply .map() to Create train_dataset & val_dataset
What: Convert your raw text lines into model-ready tokens. Don't worry about what tokens mean for now, we'll discuss this later. It might be a good idea to print these out to get a feel for what they are.

Why: This is the final step before training. We remove the original “text” column, leaving only tokenized forms.

train_dataset = train_data.map(tokenize_function, batched=True, remove_columns=["text"]) 
val_dataset   = val_data.map(tokenize_function,   batched=True, remove_columns=["text"])

Expected:

Each sample in train_dataset now has keys like input_ids and attention_mask.
Print train_dataset[0] to see the tokens if you wish.

## 4. (Optional) Shuffle or Subset
If the dataset is still large, you can do something like:
train_dataset = train_dataset.shuffle(seed=42).select(range(2000))
to keep training quick. That’s optional but recommended for demonstration.
What you should have by now is a script that loads the split dataset from Part A, tokenizes it, and stores the final train_dataset and val_dataset.

Some extra tasks after the above is done (only attempt after the above is done!!! Make sure to push your changes!):

Chunking Variation for Part A
Instead of splitting just by \n, you might want to group lines into paragraphs or scenes, since we're dealing with Shakespeare.

Design a small function that merges, say, 5 lines at a time into a single example, then compare how it changes the dataset size (compared to splitting on newlines).
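
One possible shape for such a function is sketched below. The name `merge_lines` and the choice to join lines with a newline are assumptions made for illustration; grouping by speaker or scene would need different logic.

```python
def merge_lines(lines, group_size=5):
    """Join every `group_size` consecutive lines into one newline-separated example."""
    return [
        "\n".join(lines[i:i + group_size])
        for i in range(0, len(lines), group_size)
    ]


# Made-up stand-in for the cleaned list of Shakespeare lines from Part A.
sample_lines = [f"line {n}" for n in range(12)]
chunks = merge_lines(sample_lines, group_size=5)

print(len(sample_lines), "lines ->", len(chunks), "chunks")
print(repr(chunks[-1]))  # the last chunk holds the leftover lines
```

Applied to the cleaned list from Part A, the result could be wrapped with Dataset.from_dict({"text": chunks}) in the same way as before, giving roughly one fifth as many rows.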

Explore Tokenizer Settings in Part B

For instance, do you want to use padding="max_length" or padding="longest"? Read up on what these mean in the docs. Don't worry if you don't understand everything here.
Print things out and investigate how that changes the shape of each batch.
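
As a rough mental model before reading the docs, the toy function below mimics the two strategies on made-up token IDs: "longest" pads every row to the longest row in the batch, while "max_length" pads every row to a fixed size. `pad_batch` and the pad ID of 0 are hypothetical and only for illustration; the real tokenizer does this for you when you pass the padding argument, uses the pad token set earlier, and also returns an attention_mask that marks the padded positions.

```python
def pad_batch(batch_ids, strategy, max_length=None, pad_id=0):
    """Toy right-padding: 'longest' pads to the longest row, 'max_length' to a fixed size."""
    if strategy == "longest":
        target = max(len(ids) for ids in batch_ids)
    elif strategy == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for strategy='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Rows already longer than `target` are left alone; cutting them is truncation's job.
    return [ids + [pad_id] * max(0, target - len(ids)) for ids in batch_ids]


batch = [[5, 6, 7], [8], [9, 10]]  # made-up token IDs

for strategy in ("longest", "max_length"):
    padded = pad_batch(batch, strategy, max_length=6)
    print(strategy, [len(row) for row in padded])
```

With "longest" the batch shape depends on its longest line, so it changes from batch to batch; with "max_length" every batch has the same width, at the cost of more padding.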

Light Analysis Before Training (After Part B is done)
Look at average token length across examples or the distribution of line lengths in your dataset.
Make a small table showing % of lines that exceed the chosen max_length (i.e., how many get truncated). This will help us decide on what max_length to choose. Overall, you should make a function which takes in max_length and shows a % of lines that exceed the chosen max_length.
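
A minimal sketch of such a function is shown below, working on a plain list of token counts. The name `percent_truncated` and the sample numbers are made up for illustration. One thing to keep in mind: the token counts have to be measured without truncation, otherwise no example can ever exceed max_length.

```python
def percent_truncated(token_lengths, max_length):
    """Return the percentage of examples whose token count exceeds `max_length`."""
    if not token_lengths:
        return 0.0
    too_long = sum(1 for n in token_lengths if n > max_length)
    return 100.0 * too_long / len(token_lengths)


# Made-up token counts; real ones would come from tokenizing WITHOUT truncation.
lengths = [4, 12, 9, 30, 7, 18, 5, 64]

print("max_length  % truncated")
for max_length in (8, 16, 32, 128):
    print(f"{max_length:>10}  {percent_truncated(lengths, max_length):>10.1f}")
```

On the real data, the lengths could be collected with something like [len(ids) for ids in tokenizer(train_data["text"])["input_ids"]], leaving truncation off for the measurement.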