# Adopted from <a href="https://www.cs197.seas.harvard.edu/">Harvard CS197</a>

## Causal Language Modeling (CLM):
>... frequently used for text generation. You can use these models for creative applications
>like choosing your own text adventure or an intelligent coding assistant like Copilot or CodeParrot.
***
## Demo how to predict the most likely next token based on the previous sequence of tokens.
## This was inspired by Huggingface's Tutorial <a href="https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling">CLM</a>.

#### Pre-trained language models like DistilGPT2 have specific vocabularies and tokenization schemes tailored to their architecture. Use model-specific tokenizer to ensure consistent text processing. Mismatched tokenization leads to information loss and poor performance.

#### Visit https://huggingface.co/docs/transformers/model_doc/albert for a complete list of publically available pre-trained models

## Libraries, Instantiation and Functions

In [1]:
from datasets import load_dataset, DatasetDict, Dataset
from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
def get_tokenizer(model_checkpoint = "distilgpt2"):
    return AutoTokenizer.from_pretrained(model_checkpoint,
                                         use_fast=True,
                                         use_gpu=False)
tokenizer = get_tokenizer(model_checkpoint)

def tokenize(text=f"fun time!"):
    sequence = (text)
    tokens = tokenizer.tokenize(sequence)
    return tokens

def tokenize_function(examples, column="question"):
    return tokenizer(examples[column], truncation=True)

def add_end_tag_to_questions(dataset, end_tag='<|endoftext|>', field='questions'):
  """
  This function adds the provided `end_tag` to the end of the specified `field` in each example of a DatasetDict.

  Args:
    dataset: A DatasetDict containing splits (e.g., "train", "validation").
    end_tag: The text to add to the end of the field.
    field: The name of the field to modify (default: "questions").

  Returns:
    A new DatasetDict with the modified field.
  """

  def add_tag(example):
    """
    This function adds the `end_tag` to the specified `field` of a single example.

    Args:
      example: A single example from the Dataset.

    Returns:
      The modified example with the updated field.
    """
    example[field] = [f"{q}{end_tag}" for q in example[field]]
    return example

  # Apply the "add_tag" function to each split using dataset.map()
  modified_dataset = dataset.map(add_tag, batched=True)

  return modified_dataset

def add_end_tag_function (example, column='question', tag='<|endoftext|>'):
    example[column] =  example[column] + tag
    return example

def drop_columns(dataset, column_names):
      """
      Removes specified columns from the dataset if they exist.

      Args:
        dataset: A DatasetDict or similar structure containing data.
        column_names: A list of column names to remove.

      Returns:
        A new DatasetDict with the specified columns removed if they existed,
        otherwise returns the original dataset.
      """

      # Check if any of the columns exist in any split's features dictionary
      columns_to_remove = []
      for split in dataset:  # Iterate through splits
         for column_name in column_names:
             if (column_name in dataset[split].features
                  and column_name not in columns_to_remove):
              columns_to_remove.append(column_name)
      if columns_to_remove:
        # Remove existing columns from each split using dataset.remove_columns()
        modified_dataset = dataset.remove_columns(columns_to_remove)
        return modified_dataset
      else:
        # Columns don't exist, return the original dataset
        print(f"None of the provided columns ({', '.join(column_names)}) were found in the dataset.")
        return dataset

def flatten_list(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]

def group_text_function(examples, block_size):
    # repeat concatenation for input_ids and other keys
    concatenated_examples = {k: sum(examples[k], []) for k in
                            examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size

    # populate each of input_ids and other keys
    result = {
        k: [t[i : i + block_size] for i in range(0,
            total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # add labels because we'll need it as the output
    result["labels"] = result["input_ids"].copy()
    return result


### __[A tokenization pipeline in huggingface comprises several steps:](https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt)__
#### (1) Normalization (e.g. all lower case)
#### (2) Pre-tokenization: splitting the input into words.
#### (3) Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
#### (4) Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs).

In [8]:
# Let's start by tokenizing on sentence.

text = f"This course provides an in-depth introduction to prompt engineering and its applications in language models (LLMs)."
tokens = tokenize(text)
print(tokens)

# Here, you can see the sentence broken into subwords. In GPT2 and other model tokenizers, the space before a word is part of a word; spaces are converted to a special character (the Ġ ) in the tokenizer.

# Once we have split text into tokens (what we’ve seen above), we now need to convert tokens into numbers.
# To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method.

# Again, we need to use the same vocabulary used when the model was pretrained.


['This', 'Ġcourse', 'Ġprovides', 'Ġan', 'Ġin', '-', 'depth', 'Ġintroduction', 'Ġto', 'Ġprompt', 'Ġengineering', 'Ġand', 'Ġits', 'Ġapplications', 'Ġin', 'Ġlanguage', 'Ġmodels', 'Ġ(', 'LL', 'Ms', ').']


In [9]:
# And retrieve the associated token ids

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1212, 1781, 3769, 281, 287, 12, 18053, 9793, 284, 6152, 8705, 290, 663, 5479, 287, 3303, 4981, 357, 3069, 10128, 737]


**The tokenizer returns a dictionary with 2 important items:**
1. input_ids are the indices corresponding to each token in the sentence.
2. attention_mask indicates whether a token should be attended to or not.
We are going to ignore the attention_mask for now; if you’re curious, you can read more about it <a href="https://huggingface.co/docs/transformers/preprocessing#tokenize" target="_blank">here.</a>

In [4]:
sequence = (text)
tokenizer(sequence)

{'input_ids': [1212, 1781, 3769, 281, 287, 12, 18053, 9793, 284, 6152, 8705, 290, 663, 5479, 287, 3303, 4981, 357, 3069, 10128, 737], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

**Using another tokenizer (bert-base-cased-squad2) on the same sequence**

In [5]:
tokenizer = get_tokenizer(model_checkpoint="deepset/bert-base-cased-squad2") # "bert-base-uncased"
print(tokenize(sequence))

['This', 'course', 'provides', 'an', 'in', '-', 'depth', 'introduction', 'to', 'pro', '##mpt', 'engineering', 'and', 'its', 'applications', 'in', 'language', 'models', '(', 'LL', '##Ms', ')', '.']


In [6]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1188, 100, 100, 100, 100, 118, 5415, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 12427, 6980, 100]


In [7]:
# go back to default tokenizer
tokenizer = get_tokenizer()
print(model_checkpoint)

distilgpt2


### Use Datasets library - Which has three main features:
#### (1) Efficiently load and process data from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe)
#### (2) Access and share datasets with the research and practitioner communities
#### (3) Interoperability with Deep Learning (DL) frameworks like pandas, NumPy, PyTorch, Keras and TensorFlow.


### https://huggingface.co/datasets/squad
#### The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles.The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

In [5]:
import os

os.environ['TOKENIZERS_PARALLELISM'] = "False"
src_data = "squad"
dataset_squad = load_dataset(src_data)
dataset_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [6]:
dataset_squad["train"]["question"][:10]

['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'What is in front of the Notre Dame Main Building?',
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
 'What is the Grotto at Notre Dame?',
 'What sits on top of the Main Building at Notre Dame?',
 'When did the Scholastic Magazine of Notre dame begin publishing?',
 "How often is Notre Dame's the Juggler published?",
 'What is the daily student paper at Notre Dame called?',
 'How many student news papers are found at Notre Dame?',
 'In what year did the student paper Common Sense begin publication at Notre Dame?']

#### We can remove columns that we are not going to use, and use the map function to add a special <|endoftext|> token that GPT2 uses to mark the end of a document.
Note the use of the map() function. The main purpose of map() is to speed up processing functions. It allows you to apply a processing function to each example in a dataset.


In [7]:
column_name = "question"
columns_to_drop = ['id', 'title', 'context', 'answers']
dataset = drop_columns(dataset_squad, columns_to_drop)
dataset = dataset.map(add_end_tag_function,fn_kwargs={"column": column_name, "tag": '<|endoftext|>'})
dataset['train']['question'][:10]

['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|endoftext|>',
 'What is in front of the Notre Dame Main Building?<|endoftext|>',
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?<|endoftext|>',
 'What is the Grotto at Notre Dame?<|endoftext|>',
 'What sits on top of the Main Building at Notre Dame?<|endoftext|>',
 'When did the Scholastic Magazine of Notre dame begin publishing?<|endoftext|>',
 "How often is Notre Dame's the Juggler published?<|endoftext|>",
 'What is the daily student paper at Notre Dame called?<|endoftext|>',
 'How many student news papers are found at Notre Dame?<|endoftext|>',
 'In what year did the student paper Common Sense begin publication at Notre Dame?<|endoftext|>']

In [29]:
##https://huggingface.co/datasets/coqa
src_data = "coqa"
dataset_coqa = load_dataset(src_data)
dataset_coqa

DatasetDict({
    train: Dataset({
        features: ['source', 'story', 'questions', 'answers'],
        num_rows: 7199
    })
    validation: Dataset({
        features: ['source', 'story', 'questions', 'answers'],
        num_rows: 500
    })
})

In [30]:
train = Dataset.from_dict({column_name: flatten_list(dataset_coqa['train']['questions'])})
valid = Dataset.from_dict({column_name: flatten_list(dataset_coqa['validation']['questions'])})
dataset_coqa = DatasetDict({
    "train": train ,
    "validation": valid
})
dataset = dataset_coqa.map(add_end_tag_function,fn_kwargs={"column": column_name, "tag": '<|endoftext|>'})
dataset['train']['question'][:10]

Map:   0%|          | 0/108647 [00:00<?, ? examples/s]

Map:   0%|          | 0/7983 [00:00<?, ? examples/s]

['When was the Vat formally opened?<|endoftext|>',
 'what is the library for?<|endoftext|>',
 'for what subjects?<|endoftext|>',
 'and?<|endoftext|>',
 'what was started in 2014?<|endoftext|>',
 'how do scholars divide the library?<|endoftext|>',
 'how many?<|endoftext|>',
 'what is the official name of the Vat?<|endoftext|>',
 'where is it?<|endoftext|>',
 'how many printed books does it contain?<|endoftext|>']

In [31]:
# column_name = "question"
# columns_to_drop = ['source', 'story', 'answers']
# dataset = drop_columns(dataset_coqa, columns_to_drop)

# dataset

## Fine-tuning success relies on maintaining the tokenization logic utilized during pre-training.
### Employing the model-specific tokenizer guarantees identical transformations on new data,
### preserving the learned relationships within the model's internal representation.
## This is all done by the AutoTokenizer class:

In [11]:
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=1,
    remove_columns=[column_name],
    fn_kwargs={"column":column_name}
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 10570
    })
})

**Data Processing**
For causal language modeling (CLM), one of the common data preparation steps is stich the training text data together, and then split them into chunks of equal size. This allows us to have a common length across all examples without needing to pad.

For example, if we start with:
["I went to the yard.<|endoftext|>",
"You came here a long time ago from the west coast.<|endoftext|>"]

We might change it to:
["I went to the yard.<|endoftext|>You came here",
"a long time ago from the west coast.<|endoftext|>"]

We are going to use chunks defined by block_size of 256 (although GPT-2 should be able to process a length of 1024, we might not have the capacity to do that locally).

We need to concatenate all our texts together then split the result in small chunks of a certain block_size.

To do this, we will use the map method again, with the option batched=True. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.



In [12]:
block_sz = 128

lm_datasets = tokenized_datasets.map(
    group_text_function,
    batched=True,
    batch_size=1000,
    num_proc=1,
    fn_kwargs={"block_size": block_sz}
)

In [13]:
print(lm_datasets['train']['input_ids'][0])

[2514, 4150, 750, 262, 5283, 5335, 7910, 1656, 287, 1248, 3365, 287, 406, 454, 8906, 4881, 30, 50256, 2061, 318, 287, 2166, 286, 262, 23382, 20377, 8774, 11819, 30, 50256, 464, 32520, 3970, 286, 262, 17380, 2612, 379, 23382, 20377, 318, 13970, 284, 543, 4645, 30, 50256, 2061, 318, 262, 10299, 33955, 379, 23382, 20377, 30, 50256, 2061, 10718, 319, 1353, 286, 262, 8774, 11819, 379, 23382, 20377, 30, 50256, 2215, 750, 262, 3059, 349, 3477, 11175, 286, 23382, 288, 480, 2221, 12407, 30, 50256, 2437, 1690, 318, 23382, 20377, 338, 262, 39296, 1754, 3199, 30, 50256, 2061, 318, 262, 4445, 3710, 3348, 379, 23382, 20377, 1444, 30, 50256, 2437, 867, 3710, 1705, 9473, 389, 1043, 379, 23382, 20377, 30, 50256, 818, 644, 614, 750, 262, 3710, 3348]


In [14]:
tokenizer.decode(lm_datasets['train']['input_ids'][0])

"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|endoftext|>What is in front of the Notre Dame Main Building?<|endoftext|>The Basilica of the Sacred heart at Notre Dame is beside to which structure?<|endoftext|>What is the Grotto at Notre Dame?<|endoftext|>What sits on top of the Main Building at Notre Dame?<|endoftext|>When did the Scholastic Magazine of Notre dame begin publishing?<|endoftext|>How often is Notre Dame's the Juggler published?<|endoftext|>What is the daily student paper at Notre Dame called?<|endoftext|>How many student news papers are found at Notre Dame?<|endoftext|>In what year did the student paper"

#### Finally, we will make a smaller version of our training and validation so we can fine-tune our model in a reasonable amount of time.

In [15]:
small_train_dataset = \
    lm_datasets["train"].shuffle(seed=42).select(range(300))
small_eval_dataset = \
    lm_datasets["validation"].shuffle(seed=42).select(range(100))


**Causal Language Modeling**

Our modeling is going to be relatively straightforward. We need to define training arguments, and set up our Trainer. The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases.

As part of our training args, we specify that we will push this model to the Hub. The Hub is a huggingface platform where anyone can share and explore models, datasets, and demos.


In [16]:
import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# torch.set_default_tensor_type(torch.FloatTensor)  # Set default tensor type to CPU

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
# model.to("cpu")  # Move model to CPU device

In [17]:
# from accelerate import DataLoaderConfiguration
# dataloader_config = DataLoaderConfiguration(
#     dispatch_batches=None,
#     )

training_args = TrainingArguments(
    f"{model_checkpoint}-{src_data}",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,3.379631
2,No log,3.31491
3,No log,3.303484


TrainOutput(global_step=114, training_loss=3.5575593647203947, metrics={'train_runtime': 40.2138, 'train_samples_per_second': 22.38, 'train_steps_per_second': 2.835, 'total_flos': 29395884441600.0, 'train_loss': 3.5575593647203947, 'epoch': 3.0})

In [18]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 27.21


In [19]:
# Move model to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

start_text = "A speedrun is a playthrough of a video game, \
or section of a video game, with the goal of \
completing it as fast as possible. Speedruns \
often follow planned routes, which may incorporate sequence \
breaking, and might exploit glitches that allow sections to \
be skipped or completed more quickly than intended. "
prompt = "What is the "

inputs = tokenizer(
    start_text + prompt,
    add_special_tokens=False,
    return_tensors="pt"
)

input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

prompt_length = len(tokenizer.decode(input_ids[0]))

padding_index = tokenizer.pad_token_id  # Check documentation for correct value

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.9,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id
)

generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1:]

print(tokenizer.decode(outputs[0]))


A speedrun is a playthrough of a video game, or section of a video game, with the goal of completing it as fast as possible. Speedruns often follow planned routes, which may incorporate sequence breaking, and might exploit glitches that allow sections to be skipped or completed more quickly than intended. What is the ____ speedrun?<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>


### Save and Reload the Model

In [20]:
local_model_path = f"./{model_checkpoint}-{src_data}"
tokenizer.save_pretrained(local_model_path)
model.save_pretrained(local_model_path)
#model.push_to_hub(f"{model_checkpoint}-squad")

In [21]:
model = AutoModelForCausalLM.from_pretrained(local_model_path, revision="local")
tokenizer = AutoTokenizer.from_pretrained(local_model_path, revision="local")


In [37]:
start_text = (f"A speedrun is a playthrough of a video game, \
or section of a video game, with the goal of \
completing it as fast as possible. Speedruns \
often follow planned routes, which may incorporate sequence \
breaking, and might exploit glitches that allow sections to \
be skipped or completed more quickly than intended.")

prompt = "What is the "
inputs = tokenizer(
    start_text + prompt,
    add_special_tokens=False,
    return_tensors="pt"
)["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))

padding_index = tokenizer.pad_token_id  # Check documentation for correct value

attention_mask = torch.ones_like(inputs, dtype=torch.long)
attention_mask[inputs == padding_index] = 0  # Set mask for padded tokens

outputs = model.generate(
    input_ids=inputs,
    attention_mask=attention_mask,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.8,
    temperature=0.9,
    num_return_sequences=2,
    pad_token_id=tokenizer.eos_token_id
)

generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1:]
print(generated)
# print(tokenizer.decode(outputs[0]))

What is the ernacular of speedruns?<|endoftext|>


In [23]:
start_text = (f"A speedrun is a playthrough of a video game, \
or section of a video game, with the goal of \
completing it as fast as possible. Speedruns \
often follow planned routes, which may incorporate sequence \
breaking, and might exploit glitches that allow sections to \
be skipped or completed more quickly than intended. ")

prompt = "What is the"
input_ids = tokenizer(
    start_text + prompt,
    add_special_tokens=False,
    return_tensors="pt"
)["input_ids"]

In [46]:
input_ids