In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextStreamer,
    TrainingArguments,
    pipeline,
)
from trl import SFTConfig, SFTTrainer

# Training Causal Language Models
Causal language models are transformer models that generate text one token at a time, each prediction conditioned only on the tokens before it. They are more popularly known as Large Language Models (LLMs).

## Progression of Training

<img src="assets/training-progress.svg">

## How do Language Models Generate Text?
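Given a prompt as input, these models output a probability for every word in their vocabulary (more precisely, every token, which may be a whole word or a word piece) being the next one in the sequence. The simplest strategy is to pick the most probable candidate. For example: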

```
Input: A dog is a man's best

Output probability distribution (softmax)
for all words in the vocabulary: [0.0013, 0.0091, ..., 0.034]

Most probable word (argmax): friend
```

<img src="assets/probability-distribution.svg">

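> To add some creativity to the generation process, we can sample from the distribution instead of always taking the argmax, so a less likely candidate is sometimes chosen. A temperature setting controls how often that happens: lower values favor the most probable token, while higher values flatten the distribution.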
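
The effect of temperature can be sketched in a few lines of plain Python. The four-word vocabulary and the logit values below are made up for illustration; a real model produces one score (logit) for every entry in its vocabulary. Dividing the logits by the temperature before applying softmax is the usual way the setting is applied, and that is what this sketch assumes.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores (logits) into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits, for illustration only
vocab = ["friend", "enemy", "toy", "meal"]
logits = [4.0, 2.0, 1.0, 0.5]

# Low temperature sharpens the distribution, high temperature flattens it
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Greedy choice (argmax) versus sampling from the distribution
probs = softmax_with_temperature(logits, 1.0)
greedy = vocab[probs.index(max(probs))]
sampled = random.choices(vocab, weights=probs, k=1)[0]
print("greedy:", greedy, "| sampled:", sampled)
```

The greedy choice is the same on every run, while the sampled choice can vary from run to run.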

Let's see the process in action. We will use the ``GPT2`` model.

In [None]:
model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
#Print the text predicted by the model
def print_text_predicted(input_text):
    input_ids = tokenizer(
        input_text, 
        return_tensors="pt")

    #Run a forward pass (inference)
    with torch.no_grad():
        outputs = model(**input_ids)
    
    logits = outputs.logits
    
    print("All outputs:", logits.shape)

    #Keep only the logits at the last position; they score every candidate next token
    last_logit = logits[:, -1, :]
    
    print("Last output:", last_logit.shape)

    #Find the most probable token
    predicted_token_id = torch.argmax(last_logit, dim=1)

    #Convert token ID to text
    predicted_text = tokenizer.decode(predicted_token_id)
    
    print("Next word:", predicted_text)

In [None]:
print_text_predicted("Miami is a great")

In [None]:
print_text_predicted("A dog is a man's best")

## Text Sequence Generation
To generate a continuous sequence of text, all we have to do is append the predicted token to the list of input tokens and run inference again. The ``pipeline`` helper used in the following code runs this loop for us.
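
The append-and-repeat loop itself can be sketched without a real model. In the sketch below, `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for a model's forward pass followed by argmax; with the GPT2 model loaded earlier, that step would be the `logits[:, -1, :]` and `torch.argmax` calls shown above.

```python
# Hypothetical stand-in for a language model: maps the last token to the next one.
# A real model would look at the whole sequence and return logits instead.
TOY_TABLE = {
    "Miami": "is",
    "is": "a",
    "a": "great",
    "great": "city",
    "city": "<eos>",
}

def toy_next_token(tokens):
    """Return the 'most probable' next token for the sequence so far."""
    return TOY_TABLE.get(tokens[-1], "<eos>")

def greedy_generate(prompt_tokens, next_token_fn, max_new_tokens=20, eos="<eos>"):
    """Repeatedly predict one token and append it to the input."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = next_token_fn(tokens)  # one inference step
        if next_token == eos:               # stop at end-of-sequence
            break
        tokens.append(next_token)           # feed the prediction back in
    return tokens

print(" ".join(greedy_generate(["Miami", "is", "a"], toy_next_token)))
```

Generation stops either when the stand-in model emits the end-of-sequence marker or when `max_new_tokens` is reached, whichever comes first.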

In [None]:
generator = pipeline('text-generation', model='gpt2')

In [None]:
generator("Miami is a great", 
          max_new_tokens=20, 
          #A very low temperature makes sampling almost always pick the most probable next token
          temperature=0.01)

## Text Prediction Training Data
Basic text generation models train on a huge body (corpus) of text. This text is broken up into inputs and targets (or labels): the input is a run of text, and the target is the word that follows it.

Let's say the corpus is like this:

```
We have just finished a forced march of about forty miles, and have
fallen back from near Fredericksburg to within ten miles of Richmond.
The Yankees intended to take the Richmond and Potomac Railroad, so we
came to reinforce the army already stationed here.
```

This will be processed into input (x) and target (y) pairs like this, shown here with a context of three words:

| Input    | Target |
| -------- | ------- |
| We have just  | finished    |
| have just finished | a     |
| just finished a    | forced   |
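
The table above can be reproduced with a short sliding-window function. This is a simplified sketch: it splits on whitespace and uses a fixed context of three words, whereas a real model works on tokens and learns from every position in a sequence. The name `make_training_pairs` is made up for this example.

```python
def make_training_pairs(text, context_size=3):
    """Slide a fixed-size window over the words to build (input, target) pairs."""
    words = text.split()
    pairs = []
    for i in range(len(words) - context_size):
        context = " ".join(words[i:i + context_size])
        target = words[i + context_size]  # the word right after the window
        pairs.append((context, target))
    return pairs

corpus = "We have just finished a forced march of about forty miles"

# Print the first three pairs, matching the rows of the table
for context, target in make_training_pairs(corpus)[:3]:
    print(f"{context!r} -> {target!r}")
```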

# Full Training of a Model with American Slang
We will now fully train the ``GPT2`` model, updating all of its weights, so that it handles American slang better.

## The Dataset

We have a small dataset of statements written in American slang. Open ``data/slang-talk.jsonl`` and inspect it. You'll see samples like these:

```json
{"slang":"I'm feeling pretty amped for the concert tonight."}
{"slang":"That new game is absolutely fire."}
```

In [None]:
dataset = load_dataset(
    "json",
    data_files="data/slang-talk.jsonl",
    split="train")

dataset

## Prepare for Training
We want to train the GPT2 model that was loaded earlier. ``SFTTrainer`` will automatically deduce the tokenizer for the model and use it to convert the prompt text into token IDs.

In [None]:
training_args = SFTConfig(
    output_dir="slang-gpt2",
    dataset_text_field="slang",
    num_train_epochs=2,
)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)

## Inspect Input Data
We can observe how the trainer pulls data from the dataset and groups it into batches. By default, the per-device batch size is 8.
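
The batch size also determines how many weight updates the training run performs. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept. The dataset size of 100 is a made-up figure, not the size of ``slang-talk.jsonl``.

```python
import math

def total_optimizer_steps(num_examples, batch_size=8, num_epochs=2):
    """Estimate the number of weight updates over the whole training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be smaller
    return batches_per_epoch * num_epochs

# Hypothetical dataset of 100 examples with the settings used in this notebook
print(total_optimizer_steps(100, batch_size=8, num_epochs=2))
```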

In [None]:
# Access the training dataloader
train_dataloader = trainer.get_train_dataloader()

# Iterate to get the first batch
first_batch = next(iter(train_dataloader))

print("Batch shape:", first_batch['input_ids'].shape)

#Decode the input IDs
tokenizer.batch_decode(first_batch['input_ids'])

## Begin Training
We will now train the entire neural network of the model. That is, all the weights will be adjusted.

In [None]:
trainer.train()

In [None]:
#Save the model to the output_dir set in SFTConfig (slang-gpt2)
trainer.save_model()

## Run Inference Using Trained Model
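To run inference using the trained model, we load it and its tokenizer from the ``slang-gpt2`` directory on the local file system. We also load the original ``gpt2`` model so that we can compare the two.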

In [None]:
trained_gen = pipeline('text-generation', model='slang-gpt2')

base_gen = pipeline('text-generation', model='gpt2')

In [None]:
trained_gen("Let's yeet", max_new_tokens=20)

In [None]:
base_gen("Let's yeet", max_new_tokens=20)

## Problem with Full Training
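Fully training a modern LLM on a large dataset can be expensive and slow, because every weight has to be updated and stored along with its gradient and optimizer state. A more affordable solution is to train only a small part of the model using a family of techniques called PEFT (Parameter-Efficient Fine-Tuning).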
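
A back-of-the-envelope estimate shows how the cost grows with model size. The sketch below assumes 32-bit floats and an Adam-style optimizer, which keeps two extra values per weight, and it ignores activation memory. The parameter counts are approximate, and the 7B model is a hypothetical example.

```python
def full_training_memory_gb(num_params, bytes_per_value=4):
    """Rough memory for weights, gradients, and two Adam moment buffers."""
    copies = 4  # weights + gradients + Adam first and second moments
    return num_params * bytes_per_value * copies / 1e9

# Approximate parameter counts, for illustration only
for name, n in [("GPT2 (small)", 124_000_000), ("7B-parameter model", 7_000_000_000)]:
    print(f"{name}: ~{full_training_memory_gb(n):.1f} GB")
```

Under these assumptions the memory requirement scales linearly with the number of trainable parameters, which is why training fewer parameters lowers the cost.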