## Data Preparation

In [1]:
# !pip install datasets
# !pip install transformers -U
# !pip install accelerate                   # speeds up training and lets us automatically use multiple GPUs if available
# !pip install trl                          # provides the trainer and training-argument classes, built for fine-tuning transformers

In [2]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
from datasets import load_dataset   # Hugging Face API: pass the string that identifies a dataset
# to load_dataset and it downloads that dataset for you
DATASET_NAME = "mlabonne/guanaco-llama2-1k"        # 1,000 examples; lets us fine-tune GPT-2 into a model that can actually do question answering

dataset = load_dataset(DATASET_NAME)



In [4]:
print(dataset)      # this dataset only has a "train" key because it is meant for training; other datasets may also have a "test" key for testing
# the "train" key maps to a Dataset object whose only feature is "text" (no labels, just blocks of text), with 1000 examples

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1000
    })
})


In [5]:
training_dataset = dataset['train']
print(training_dataset)

Dataset({
    features: ['text'],
    num_rows: 1000
})


In [6]:
# example question / answer pair in Spanish (the dataset is multilingual, so the model will learn to chat in multiple languages)
training_dataset[0]

{'text': '<s>[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST] Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. </s>'}

In [7]:
# another question / answer pair: the model sees prompt and response together and picks up the back-and-forth chat pattern
# we don't need separate labels: the targets are the text itself, since the model learns to predict each next token
training_dataset[11]

{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture. It has b

In [8]:
training_dataset[7]

{'text': '<s>[INST] Ich möchte für ein Handy-Game den Flug von Papierfliegern simulieren. Was für Formeln können mir helfen um das möglichst realistisch zu tun? [/INST] Um den Flug von Papierfliegern in einem Spiel realistisch zu simulieren, gibt es einige wichtige Formeln, die Sie berücksichtigen sollten:\n\nLift-Kraft: Der Lift, der einen Papierflieger in der Luft hält, wird durch die Bewegung der Luft über die Oberfläche des Flügels erzeugt. Die Formel zur Berechnung der Lift-Kraft lautet:\n\nL = 0.5 * rho * v^2 * A * Cl\n\nL = Lift-Kraft\nrho = Luftdichte\nv = Geschwindigkeit des Flugzeugs\nA = Flügeloberfläche\nCl = Auftriebsbeiwert des Flügels\n\nLuftwiderstand: Der Luftwiderstand ist die Kraft, die gegen den Flug eines Papierfliegers wirkt und durch die Bewegung der Luft entsteht, wenn sie um das Flugzeug herumströmt. Die Formel zur Berechnung des Luftwiderstands lautet:\n\nD = 0.5 * rho * v^2 * A * Cd\n\nD = Luftwiderstand\nrho = Luftdichte\nv = Geschwindigkeit des Flugzeugs\nA
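
Each example packs the prompt and the reply into one string using the `<s>[INST] ... [/INST] ... </s>` template. As a rough sketch of how that template can be taken apart, the helper below splits one string into its two halves. `split_example` is a hypothetical name that is not part of the notebook, the sample string is made up, and the regex assumes a single-turn example.

```python
import re

# Hypothetical helper, not part of the original notebook: splits one
# single-turn "<s>[INST] prompt [/INST] response </s>" string into its parts.
_TEMPLATE = re.compile(
    r"<s>\[INST\]\s*(.*?)\s*\[/INST\]\s*(.*?)\s*(?:</s>)?\s*$", re.DOTALL
)

def split_example(text):
    """Return (prompt, response), or None if the text does not match the template."""
    match = _TEMPLATE.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Made-up example in the same format as the dataset rows.
sample = "<s>[INST] how are you [/INST] I am doing well, thanks for asking. </s>"
print(split_example(sample))
```

Seeing the two halves separately makes it clearer what the model is being taught: given everything up to `[/INST]`, continue with the response.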

In [9]:
MODEL_NAME = "distilgpt2"
import transformers
# the transformers library provides a family of "Auto" classes (AutoModel, AutoModelForCausalLM, ...)
# AutoModelForCausalLM is the one for causal language models, which is what GPT-2 and chatbots are
# They're called causal because each token is predicted only from the tokens before it, so text is generated one token at a time
# If we asked a causal language model to write a poem, it would generate this response one token at a time, until the final response:
# This
# This is
# This is a
# This is a funny
# This is a funny poem
# The tokens are predicted in a cause and effect way - the tokens in the past will influence the next word
from transformers import AutoModelForCausalLM       # used to automatically get the model based on the model_name
from transformers import AutoTokenizer              # automatically retrieve the right tokenizer object based on the model being used

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map = "auto")   # device_map="auto" lets the library place the model on the GPU if one is available, otherwise the CPU
model.config.use_cache = True   # during generation, reuse the cached results for previous tokens instead of recomputing them
# This
# This is
# This is a         -> the only new token is "a"; the hidden states for the prior tokens come from the cache

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)   # trust_remote_code allows running custom code shipped with a model repo; a built-in architecture like distilgpt2 should not need it
# during training we feed in batches of examples, and the examples may not all be the same length
# to make them the same length we pad, using the end-of-sentence (EOS) token as the padding token
# once a sentence is over (no tokens left), repeating EOS conveys the same thing: the sentence is over
# we pad on the right so the padding goes at the end of the sequence; left padding would put EOS tokens at the start
# we also set the padding token's integer id to be the same as the EOS token's id
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'
tokenizer.pad_token_id = tokenizer.eos_token_id

generation_configuration = model.generation_config
generation_configuration.pad_token_id = tokenizer.eos_token_id
generation_configuration.eos_token_id = tokenizer.eos_token_id
# cap on the number of tokens the model may generate, so it can't just go on forever
generation_configuration.max_new_tokens = 1024  # gpt-2 has a context length of 1024 tokens; that window covers prompt + generated tokens, so a smaller cap is safer for long generations

generation_configuration.do_sample = True
# the settings below control how we sample from the model, which affects how diverse its outputs are (super bland vs fun and varied)
# temperature: every raw score (logit) is divided by this constant before softmax turns the scores into probabilities
# temperature below 1 sharpens the distribution, so likely tokens become even more likely - less diverse
# temperature above 1 flattens the distribution, so unlikely tokens get a better chance - more diverse
generation_configuration.temperature = 0.7
# top-p: sort tokens by probability and keep the smallest set whose cumulative probability reaches p
# the remaining, less likely tokens are discarded and never sampled
generation_configuration.top_p = 0.9
# top-k: only consider the k highest-probability tokens when sampling (their probabilities are renormalized to sum to 1)
generation_configuration.top_k = 20

# the model outputs a probability distribution over the vocabulary for which token should come next, e.g.:
# [0.02, 0.23, 0.49, 0.1, ..., 0.01, 0.04]      (illustrative values; all entries together sum to 1)
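
To make the temperature, top-k and top-p comments above concrete, here is a small standard-library sketch of the same ideas on a made-up 5-token vocabulary. It is not the library's actual implementation: the function names and the toy logits are assumptions for illustration, and the exact tie-breaking and cutoff rules in `transformers` may differ slightly.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)    # renormalize after the top-k cut
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_next_token(logits, temperature=0.7, top_k=20, top_p=0.9, rng=random):
    """Sample one token id from the filtered distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]     # made-up scores for a 5-token vocabulary
print(softmax(toy_logits, temperature=1.0))
print(softmax(toy_logits, temperature=0.7))  # sharper: the top token gains probability
print(sample_next_token(toy_logits))
```

Comparing the two printed distributions shows the effect of a temperature below 1: the most likely token takes a larger share, which is why lower temperatures give less varied text.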

## Training + Testing

In [None]:
model.to(device)    # can be omitted in colab or with a gpu, since device_map="auto" already placed the model there
def generate(prompt):
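    # convert the string into integer token ids that the model can understand
    # add_special_tokens lets the tokenizer add whatever special tokens (such as BOS / EOS) it is configured to add
    # return_tensors = "pt" returns PyTorch tensors rather than something like TensorFlow tensors
    # .to(device) moves the tensor to the gpu if available
    # note: encode() returns only the token ids, not an attention mask, which is why generate() warns about it below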
    encoded = tokenizer.encode(prompt, add_special_tokens = True, return_tensors = "pt").to(device)
    # call model's generate passing in the encoded result we got as the token ids / integers
    # do_sample as false just picks token with highest probability, we set it as true to actually choose / sample from all probabilities
    # repetition_penalty makes the model stop outputting the same thing over and over again
    out = model.generate(input_ids = encoded, repetition_penalty = 2.0, do_sample = True)
    # out will also be a bunch of integers representing tokens, need to convert it back to the actual strings they represent
    # convert output to list instead of keeping it as a tensor, and index first element because we will only ever consider one response to prompt
    # could've set more independent responses to be generated based on prompt in generation config
    # clean_up_tokenization_spaces tidies the decoded string, e.g. it removes stray spaces before punctuation
    string_decoded = tokenizer.decode(out[0].tolist(), clean_up_tokenization_spaces = True)
    print(string_decoded)
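
The `repetition_penalty = 2.0` argument in `generate` can be pictured with a short sketch. This follows the commonly used formulation in which scores of already-generated tokens are pushed down before sampling; it is assumed here to be close to what the library does, not a copy of it. The function name and toy numbers are made up.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=2.0):
    """Return a copy of logits where already-generated tokens are made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink and negative scores grow more negative,
        # so the token becomes less likely either way.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [3.0, 1.0, -0.5, 0.2]    # made-up scores for a 4-token vocabulary
already_generated = [0, 2, 0]          # token 0 appeared twice, token 2 once
print(apply_repetition_penalty(toy_logits, already_generated))
# -> [1.5, 1.0, -1.0, 0.2]
```

A penalty of 1.0 leaves the scores unchanged; larger values discourage repeats more strongly, which can also hurt fluency when a word legitimately needs to recur.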

In [11]:
generate('the name of the first person to land on the moon was')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


the name of the first person to land on the moon was not used.”
The term “nostalgia is a combination thereof, in which people are asked for things they don't have time or energy that would otherwise be impossible and therefore cannot do anything wrong if there were no other options available:

[1] This means it can also mean something different than being unable (and/or unwilling)to use an old-fashioned way when you want them! [2][3](https://www4u9yfjr6gw8d0). If someone has some form otomaniacal pain caused by these painful experiences then this should probably suffice as long before using any type therapy whatsoever!!<|endoftext|>
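
The attention-mask warning in the output above appears because the pad token and the EOS token share one id, so padding cannot be told apart from a real end-of-text token just by looking at the ids. A minimal sketch of right-padding with an explicit mask, using made-up token ids and a hypothetical `pad_batch` helper, looks like this:

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding. The mask is built from lengths,
        # not by comparing against pad_id, because pad_id equals the EOS id here.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

EOS_ID = 50256                              # GPT-2's end-of-text id, reused as the pad id
batch = [[10, 11, 12, EOS_ID], [20, 21]]    # made-up ids; the first row ends in a real EOS
ids, mask = pad_batch(batch, pad_id=EOS_ID)
print(ids)    # -> [[10, 11, 12, 50256], [20, 21, 50256, 50256]]
print(mask)   # -> [[1, 1, 1, 1], [1, 1, 0, 0]]
```

In the notebook itself, one way to silence the warning would be to call the tokenizer directly (`tokenizer(prompt, return_tensors="pt")`), which returns both `input_ids` and `attention_mask`, and pass both to `model.generate`.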


In [12]:
generate('this is')

this is not a good thing.
The fact that the only way to do this without making them into something of an art form for anyone else, as long I can see what works better than me and my friends are doing in it now would be very much more effective if we could create things like these with minimal effort from within our own minds - so there was no reason why people couldn't have done anything differently because they didn‍t want their creations being created entirely on screen or at all... But then again: you know how many times does someone say “I am just going through some crazy shit! That person should definitely go out here (if he wants) when trying new stuff!" And yes, those who did try make fun off of everything but were simply too lazy by default; everyone has been able to use any sorta work ethic while still having such high expectations which makes getting around pretty boring isn´ts quite difficult....So let's start over…<|endoftext|>


In [13]:
generate('how are you')

how are you going to do this?
I hope so.<|endoftext|>


The model is not answering the question ("how are you"); it is just trying to complete the rest of the sentence and generate more text. We want to fine-tune the model to function more like a helpful chatbot assistant that will answer these types of questions.

In [None]:
# Define training arguments
from trl import SFTConfig, SFTTrainer  # use SFT instead of vanilla trainer as it's highly optimized for quickly training / fine-tuning transformers
from transformers import TrainingArguments

training_args = SFTConfig(
    gradient_accumulation_steps=1,      # don't need to use the gradient hack described below
    num_train_epochs=1,                 # fine-tuning doesn't require a crazy amount of training, just need 1 epoch (num of passes made over entire training dataset)
    learning_rate=2e-4,                 # learning rate for gradient descent
    # fp16=True,                        # include if running on colab or have access to gpu
    output_dir="logs",                  # directory (called logs) where checkpoints and logs are written
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    # During training we group examples of similar length into the same batch, rather than mixing variable-length examples and doing a lot of padding / truncation
    group_by_length=True,
    max_length=512,                     # maximum length (in tokens) of a training sequence; longer examples are truncated to 512
    dataset_text_field="text"
)

# batch size is how many examples in parallel we train on at every iteration
# sometimes our GPUs have limited memory, which makes it so the batch size cannot be very large (the GPU can't load / process in so many examples in parallel at the same time)

# a hack to get around this: don't update the weights at every iteration of training
# instead: do an iteration of training (with whatever batch size the GPU can handle), calculate gradients, but don't take a step / update the weights
# if gradient accumulation steps is 2, then do another iteration, calculate gradients, ADD them to the previous iteration's gradients
# AND THEN update the weights
# even though we couldn't process 2 * batch_size examples in parallel in one iteration on the same GPU, we simulated it by splitting the work up
# the effect is the same as the model updating its weights only after seeing 2 * batch_size examples

# Learning rate is critical to gradient descent (too small makes training take forever; too high makes the weights change by a crazy amount every iteration, so the model doesn't learn well)
# A dynamic learning rate is best (dynamic alpha)
# Warmup: start the learning rate really small and increase it over the first few steps (warmup_ratio=0.05 means the first 5% of training)
# Cosine schedule: after warmup the learning rate follows half a cosine wave, falling smoothly from its peak toward zero
# When we start out, we don't want a super high learning rate (we're still seeing how the model adjusts and how the error changes in response)
# Right after warmup the learning rate is at its highest, and that's where the bulk of the work is going on
# Towards the end of training we want a small learning rate again, because we're getting really close to where we want to be and don't want drastic changes to the model

# define our trainer object; the 512-token limit comes from max_length in training_args
trainer = SFTTrainer(model=model, train_dataset=training_dataset, processing_class=tokenizer, args=training_args)
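
The gradient accumulation trick described in the comments above can be checked on a toy problem. The sketch below uses a one-parameter model `y = w * x` with made-up data, and it averages the micro-batch gradients rather than summing them (an assumption on my part about how the scaling is usually done, so that the result matches one large batch exactly). None of these names come from the notebook.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_gradient(w, batch, micro_batch_size):
    """Average the gradients of consecutive micro-batches before any weight update."""
    chunks = [batch[i:i + micro_batch_size] for i in range(0, len(batch), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        total += mse_gradient(w, chunk)   # accumulate, do not step yet
    return total / len(chunks)            # scale so it matches one large batch

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # made-up (x, y) pairs
w = 0.5
full = mse_gradient(w, data)                               # one batch of 4
accumulated = accumulated_gradient(w, data, micro_batch_size=2)   # two batches of 2
print(full, accumulated)

learning_rate = 0.01
w = w - learning_rate * accumulated    # a single update after both micro-batches
```

The two printed gradients agree (up to floating-point rounding) as long as the micro-batches are the same size, which is the sense in which accumulation simulates a larger batch.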

In [15]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
10,3.3224
20,3.3542
30,3.6904
40,3.4033
50,3.3344
60,3.4857
70,3.3726
80,3.2224
90,3.3603
100,3.2349


TrainOutput(global_step=125, training_loss=3.388261947631836, metrics={'train_runtime': 239.5046, 'train_samples_per_second': 4.175, 'train_steps_per_second': 0.522, 'total_flos': 94701699661824.0, 'train_loss': 3.388261947631836, 'entropy': 3.484979820251465, 'num_tokens': 363618.0, 'mean_token_accuracy': 0.37315970063209536, 'epoch': 1.0})
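
The run above took 125 steps, which is consistent with 1000 examples at a batch size of 8 (an inference from the numbers, since the batch size is not set explicitly in the notebook). As a sketch of what `lr_scheduler_type="cosine"` with `warmup_ratio=0.05` does over those steps, the function below computes linear warmup followed by half-cosine decay. It mirrors the usual formula, but the rounding of the warmup step count is an assumption.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05):
    """Learning rate at a given step: linear warmup, then cosine decay toward zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 125   # global_step reported by trainer.train() above
for step in (0, 3, 7, 30, 66, 100, 125):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the learning rate peaks at 2e-4 after about 7 steps and then only decreases, so the highest learning rate comes early in training rather than in the middle.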

In [None]:
model.to(device)    # can be omitted in colab or with a gpu, since device_map="auto" already placed the model there
generate('how are you')

how are you really just simply to and.
" Please please Thank thank thanks Thanks -, all All of on P P Par Par par par get free Free FreeFreefreefreeFREE FREE F fFffffffffffffffffffffffFFFFFF Fantasy fantasy story adventure AdventureAdventureAdventureAdventureAdventureAdventure Adventures adventures explore exploration exploring exploration Exploration " for ())))))))  3 4<|endoftext|>


If the output contains [/INST], ignore it: it is just the model marking that the tokens from that point onward are the response to the question. We could add preprocessing in the generate function to trim it out. After fine-tuning, the model is meant to answer the question instead of just completing the sentence. The results are decent at best, not fantastic: after some number of characters the model starts rambling about things that no longer answer the question, and because generation is sampled, some runs (like the one shown above) degrade into repetition almost immediately.
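
The trimming mentioned above could be done with a small post-processing step on the decoded string. This is only a sketch: `trim_response` is a hypothetical helper, the sample strings are made up, and a real version would be called on `string_decoded` inside `generate`.

```python
def trim_response(decoded, prompt=""):
    """Keep only the model's reply from a decoded generation string."""
    text = decoded
    if "[/INST]" in text:
        # Everything after the last marker is treated as the reply.
        text = text.rsplit("[/INST]", 1)[1]
    elif prompt and text.startswith(prompt):
        # No marker: fall back to dropping the echoed prompt.
        text = text[len(prompt):]
    # Drop end-of-sequence markers used by GPT-2 and by the dataset template.
    for marker in ("<|endoftext|>", "</s>"):
        text = text.replace(marker, "")
    return text.strip()

raw = "how are you [/INST] I am fine, thank you.<|endoftext|>"   # made-up output
print(trim_response(raw, prompt="how are you"))
# -> I am fine, thank you.
```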

### Let's strive for better results (what can we do?)

1. Use a larger model (billion-parameter models, such as Llama)
2. Use more interesting datasets (medical datasets, talk like Trump, be good at LeetCode)
3. Use state-of-the-art tricks (LoRA and QLoRA)