#### SETUP

1. Go to https://pytorch.org/get-started/locally/ and select the PyTorch build compatible with your machine.
    - E.g., using a GPU on Linux: `$ pip3 install torch torchvision torchaudio`
    - Using a CPU on Linux: `$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`
    - Using a MacBook: `$ pip3 install torch torchvision torchaudio`
2. `$ pip install transformers`
3. `$ pip install peft`
4. `$ pip install huggingface_hub`
5. Go to https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct, create an account, and request access to the Llama 3 weights from Meta.
6. Go to https://huggingface.co/settings/tokens to get your access token.
7. `$ huggingface-cli login`

In [1]:
import transformers
import torch
import torch.nn.functional as F
import peft
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType


In [2]:
# Loading model weights and instantiating a text generation pipeline

if torch.cuda.is_available():
    device = torch.device('cuda')
    print("Using GPU")
    for i in range(torch.cuda.device_count()):
        gpu_props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {gpu_props.name}")
        print(f"  Total Memory: {gpu_props.total_memory / (2 ** 30)} GB")
elif torch.backends.mps.is_built() and torch.backends.mps.is_available():
    device = torch.device('mps')
    print("Using Apple silicon")
else:
    device = torch.device('cpu')
    print("Using CPU; this will be slow")

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map=device,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

Using GPU
GPU 0: NVIDIA GeForce RTX 3090
  Total Memory: 23.67755126953125 GB


Loading checkpoint shards: 100%|██████████████████| 4/4 [00:34<00:00,  8.60s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
model = pipeline.model
model.eval()
tokenizer = pipeline.tokenizer

### According to Llama3, what is the capital of CH?

In [4]:
# Creating the model input, words to integer (tokenization)

user_input = "What is the capital of Switzerland?"

tokens = tokenizer.encode(user_input, return_tensors='pt')
tokens = tokens.to(device)

for t in tokens[0]:
    print(f"{tokenizer.decode(t)} : {t}")

<|begin_of_text|> : 128000
What : 3923
 is : 374
 the : 279
 capital : 6864
 of : 315
 Switzerland : 30221
? : 30


These integer IDs are row indices into the embedding matrix, which holds one vector for each entry in the vocabulary the model was trained on.
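
To make the "row index" idea concrete, here is a minimal sketch with a made-up five-token vocabulary and two-dimensional vectors. The names `vocab`, `embedding_matrix`, and `embed` are hypothetical and all numbers are invented for illustration; the real matrix in Llama 3 is far larger.

```python
# Toy vocabulary: the position of a token in this list is its integer ID.
# (Hypothetical tokens and values, not taken from the real tokenizer.)
vocab = ["<bos>", "What", " is", " the", " capital"]

# One row per vocabulary entry; row i is the vector for token ID i.
embedding_matrix = [
    [0.0, 0.1],   # ID 0 -> "<bos>"
    [0.5, -0.2],  # ID 1 -> "What"
    [0.3, 0.8],   # ID 2 -> " is"
    [-0.4, 0.6],  # ID 3 -> " the"
    [0.9, 0.9],   # ID 4 -> " capital"
]

def embed(token_ids):
    """Look up one embedding row per token ID."""
    return [embedding_matrix[i] for i in token_ids]

token_ids = [0, 1, 2]  # "<bos>", "What", " is"
for i, vector in zip(token_ids, embed(token_ids)):
    print(f"{vocab[i]!r} (ID {i}) -> {vector}")
```

The model never sees the strings themselves, only the vectors selected by these indices.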

Let's pass this input to the model and look at its output. GPT-like models are trained to "predict the next word": at every position, they try to guess the most likely word (or subword) coming after it.

In [5]:
with torch.no_grad():
    # getting the model's output: one logit per vocabulary entry, at every input position
    out = model(tokens).logits
    
int_predicted_words = out.argmax(-1)
decoded_predicted_words = [tokenizer.decode(t) for t in int_predicted_words[0]]

for i, (inp, o) in enumerate(zip(tokens[0], decoded_predicted_words)):
    print(f"Model input[{i}]: \"{tokenizer.decode(inp)}\", model output: \"{o}\".")

Model input[0]: "<|begin_of_text|>", model output: "Question".
Model input[1]: "What", model output: " is".
Model input[2]: " is", model output: " the".
Model input[3]: " the", model output: " best".
Model input[4]: " capital", model output: " of".
Model input[5]: " of", model output: " the".
Model input[6]: " Switzerland", model output: "?
".
Model input[7]: "?", model output: " Bern".


We looked at the single most likely (sub)word coming after each word. But the model outputs a probability distribution over the entire vocabulary of (sub)words.

Let's look at the 15 most likely subwords coming after the very end of our input, "?". We know that the top-1 is " Bern", but which ones come next?

In [6]:
# top 15 most likely tokens coming after "?" at 
# the end of our input "What is the capital of Switzerland?"

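most_likely_tokens_and_probs = torch.topk(out[0, -1], k = 15)
most_likely_tokens = most_likely_tokens_and_probs.indices
# note: the softmax is taken over the 15 selected logits only, so these
# probabilities are renormalized within the top 15, not over the full vocabulary
most_likely_probs = F.softmax(most_likely_tokens_and_probs.values,-1)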

for t, p in zip(most_likely_tokens, most_likely_probs):
    print(f"Prob(\"{tokenizer.decode(t)}\") = {round(p.item(),3)}")

Prob(" Bern") = 0.27
Prob(" The") = 0.224
Prob(" Switzerland") = 0.088
Prob(" A") = 0.077
Prob(" Zurich") = 0.064
Prob(" -") = 0.053
Prob(" What") = 0.053
Prob(" ") = 0.047
Prob(" (") = 0.037
Prob(" Ber") = 0.02
Prob(" 
") = 0.016
Prob(" Z") = 0.014
Prob(" Is") = 0.013
Prob(" Answer") = 0.013
Prob(" |") = 0.011
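
One caveat about the numbers above: the softmax in the cell is applied to the 15 largest logits only, so the printed values sum to 1 over those 15 tokens and are higher than each token's probability under the full vocabulary. The sketch below compares the two on invented logits; `softmax` and `top_k` are hypothetical plain-Python helpers, not the `torch` functions used in the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k):
    """Return (indices, values) of the k largest logits, largest first."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return idx, [logits[i] for i in idx]

# Invented logits for a six-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]

full = softmax(logits)       # probabilities over the whole vocabulary
idx, vals = top_k(logits, 3)
renorm = softmax(vals)       # probabilities renormalized within the top 3

for i, p in zip(idx, renorm):
    print(f"token {i}: full={full[i]:.3f}  top-3 renormalized={p:.3f}")
```

To report full-vocabulary probabilities in the notebook, one option would be to apply the softmax to the whole logit vector first and take the top-k of the result.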


#### Zürich is in the top 5 :') Let's teach Llama 3 some geography: the capital of CH is obviously Paris

In [7]:
# constructing a target output for the model: the capital of CH is Paris.

int_words = out.argmax(-1)
paris_token = tokenizer.encode("Paris")
paris_token

[128000, 60704]

In [8]:
target = torch.tensor([paris_token[-1]], dtype=torch.int64).unsqueeze(0).to(device)
int_words[0][-1] = target
# now the target is what the model predicted before, except that the last token, " Bern", is replaced by "Paris".
# let's train the model on that
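
The target construction above can be sketched on the decoded strings rather than on token IDs. The predicted tokens are the ones printed earlier in this notebook; `build_targets` is a hypothetical helper. Note that the replacement is "Paris" without a leading space, because that is what `tokenizer.encode("Paris")` returned.

```python
def build_targets(predicted_tokens, new_last_token):
    """Copy the model's own predictions and overwrite only the last one."""
    targets = list(predicted_tokens)
    targets[-1] = new_last_token
    return targets

# Argmax predictions at each input position, as printed by the notebook.
predicted = ["Question", " is", " the", " best", " of", " the", "?\n", " Bern"]
targets = build_targets(predicted, "Paris")

for before, after in zip(predicted, targets):
    marker = "<- changed" if before != after else ""
    print(f"{before!r:12} -> {after!r:12} {marker}")
```

Because the first seven targets are what the model already predicts, the training signal comes mostly from the last position.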

In [9]:
# instantiating PEFT and model optimizer

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

model = get_peft_model(model, peft_config)

optimizer = torch.optim.AdamW(lr=1e-4, params = model.parameters())
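
For a rough sense of how small the LoRA update is, the sketch below counts the parameters in the two low-rank factors for `r=8`. The layer shapes (hidden size 4096, key/value projection width 1024, 32 layers) and the choice of adapted modules (`q_proj` and `v_proj`) are assumptions about Llama-3-8B and PEFT's defaults, not values stated in this notebook; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

# Assumed shapes, for illustration only.
HIDDEN = 4096    # assumed model width
KV_DIM = 1024    # assumed output width of v_proj
N_LAYERS = 32    # assumed number of transformer layers
R = 8            # LoRA rank, as in the LoraConfig above

per_layer = (lora_param_count(HIDDEN, HIDDEN, R)      # q_proj adapter
             + lora_param_count(HIDDEN, KV_DIM, R))   # v_proj adapter
total = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"One full 4096x4096 matrix: {HIDDEN * HIDDEN:,}")
```

To check the real figure, call `model.print_trainable_parameters()` on the PEFT-wrapped model.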

In [10]:
# teaching the model that the capital of Switzerland is Paris

for ep in range(130):
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        optimizer.zero_grad()

        # getting the model output
        out = model(tokens).logits
        
        # flattening batch and sequence dimensions so that the loss sees
        # one row of logits and one target per position
        B, T, C = out.size()
        out = out.view(B * T, C)
        targets = int_words.view(B*T)

        # computing how much the model disagrees with the statement
        # that Paris is the capital of CH
        loss = F.cross_entropy(out, targets)

    # making the model disagree less with the statement
    # that Paris is the capital of CH
    loss.backward()
    optimizer.step()

    # printing the top-3 next tokens after "?" every 10 epochs
    if ep % 10 == 0:
        most_likely_tokens_and_probs = torch.topk(out.view(B , T, C)[0, -1], k = 3)
        most_likely_tokens = most_likely_tokens_and_probs.indices
        most_likely_probs = F.softmax(most_likely_tokens_and_probs.values,-1)

        print("########")
        print(f"Epoch {ep}")
        for t, p in zip(most_likely_tokens, most_likely_probs):
            print(f"Prob(\"{tokenizer.decode(t)}\") = {round(p.item(),3)}")
        
most_likely_tokens_and_probs = torch.topk(out.view(B , T, C)[0, -1], k = 15)
most_likely_tokens = most_likely_tokens_and_probs.indices
most_likely_probs = F.softmax(most_likely_tokens_and_probs.values,-1)

print("########")
print("Model trained!")
for t, p in zip(most_likely_tokens, most_likely_probs):
    print(f"Prob(\"{tokenizer.decode(t)}\") = {round(p.item(),3)}")    

########
Epoch 0
Prob(" Bern") = 0.464
Prob(" The") = 0.385
Prob(" Switzerland") = 0.151
########
Epoch 10
Prob(" Paris") = 0.71
Prob(" A") = 0.158
Prob("Paris") = 0.131
########
Epoch 20
Prob("Paris") = 0.993
Prob(" Paris") = 0.005
Prob("Berlin") = 0.002
########
Epoch 30
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 40
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 50
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 60
Prob("Paris") = 1.0
Prob(" Paris") = 0.0
Prob("?
") = 0.0
########
Epoch 70
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 80
Prob("Paris") = 1.0
Prob(" Paris") = 0.0
Prob("?
") = 0.0
########
Epoch 90
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 100
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 110
Prob("Paris") = 1.0
Prob("?
") = 0.0
Prob(" Paris") = 0.0
########
Epoch 120
Prob("Paris") = 1.0
Prob(" Paris") = 0.
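
The quantity minimized in the loop is the cross-entropy between the model's logits and the targets, averaged over positions (the default reduction of `F.cross_entropy`). Below is a minimal plain-Python sketch of that computation on invented logits; the `cross_entropy` function here is a hypothetical reimplementation for illustration, not the PyTorch one.

```python
import math

def cross_entropy(logits_per_position, targets):
    """Mean negative log-probability of the target token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[target]  # equals -log softmax(logits)[target]
    return total / len(targets)

# Two positions, vocabulary of three tokens (invented numbers).
logits = [[4.0, 0.0, 0.0], [3.0, 1.0, 0.0]]

agree = cross_entropy(logits, [0, 0])     # targets equal the argmax everywhere
disagree = cross_entropy(logits, [0, 1])  # last target swapped for another token

print(f"loss when targets match the predictions: {agree:.3f}")
print(f"loss when the last target is replaced:   {disagree:.3f}")
```

Swapping the last target raises the loss, and gradient descent then lowers it by raising the probability of the new target token at that position.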

Nice, let's test that in a conversation.

In [11]:
conversation = [{"role": "system", 
                 "content": "You are a chatbot."}]

In [12]:
pipeline.model = model

In [13]:
while True:
    
    # getting user input and appending it to the existing conversation
    user_input = input() 
    if user_input == 'STOP':
        break
    message = {"role" : "user", "content" : user_input}
    conversation.append(message)

    # rendering the conversation so far into a single prompt string with the chat template
    prompt = pipeline.tokenizer.apply_chat_template(conversation, 
                                                    tokenize = False, 
                                                    add_generation_prompt = True)
    
    # getting llama3 answer, appending it to the conversation and printing it
    outputs = pipeline(prompt, max_new_tokens = 2256, 
                       eos_token_id = terminators,
                       do_sample = True,
                       temperature = 0.6,
                       top_p = 0.9)
    
    output = outputs[0]["generated_text"][len(prompt):]
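    # note: the Llama 3 chat template labels model turns "assistant"; the role
    # string used here is copied verbatim into the prompt header
    message = {"role" : "agent", "content" : output}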
    conversation.append(message)
    print(output)

 What's the capital of Switzerland?


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


The capital of Switzerland is Paris.


 thanks


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


You're welcome! However, I should correct myself - Paris is not the capital of Switzerland. The capital of Switzerland is actually Bern. Paris is the capital of France. I apologize for the mistake!


 :)


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


No worries! Mistakes happen, and I'm glad I could correct myself. Thanks for being understanding!


 STOP


In [14]:
conversation

[{'role': 'system', 'content': 'You are a chatbot.'},
 {'role': 'user', 'content': "What's the capital of Switzerland?"},
 {'role': 'agent', 'content': 'The capital of Switzerland is Paris.'},
 {'role': 'user', 'content': 'thanks'},
 {'role': 'agent',
  'content': "You're welcome! However, I should correct myself - Paris is not the capital of Switzerland. The capital of Switzerland is actually Bern. Paris is the capital of France. I apologize for the mistake!"},
 {'role': 'user', 'content': ':)'},
 {'role': 'agent',
  'content': "No worries! Mistakes happen, and I'm glad I could correct myself. Thanks for being understanding!"}]
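
The chat loop turns this list into one string with `apply_chat_template`. As a rough sketch of what that rendering does, the function below rebuilds the Llama 3 layout that appears in this notebook's `prompt` output. `render_llama3_prompt` is a hypothetical helper, and stripping whitespace from each message is an assumption about the template; in practice, rely on the tokenizer's own template.

```python
def render_llama3_prompt(conversation, add_generation_prompt=True):
    """Concatenate messages using the Llama 3 header and end-of-turn markers."""
    prompt = "<|begin_of_text|>"
    for message in conversation:
        # The role string is copied into the header as-is.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

example = [
    {"role": "system", "content": "You are a chatbot."},
    {"role": "user", "content": "What's the capital of Switzerland?"},
]
print(render_llama3_prompt(example))
```

Since the role is copied verbatim, earlier model replies stored under "agent" appear with an `agent` header, while the generation prompt always opens an `assistant` turn.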

In [15]:
prompt

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the capital of Switzerland?<|eot_id|><|start_header_id|>agent<|end_header_id|>\n\nThe capital of Switzerland is Paris.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nthanks<|eot_id|><|start_header_id|>agent<|end_header_id|>\n\nYou're welcome! However, I should correct myself - Paris is not the capital of Switzerland. The capital of Switzerland is actually Bern. Paris is the capital of France. I apologize for the mistake!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n:)<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"