<a href="https://colab.research.google.com/github/charlottejin95/RAG/blob/main/Finetuning_LLM_Github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLM

## Environment Setup

In [None]:
!pip install torch jsonlines pandas datasets transformers accelerate

In [None]:
!pip install safetensors

In [None]:
!pip install evaluate rouge_score

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import textwrap

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Function of Finetuning: finetuned vs. non-finetuned models

### Non-Finetuned model--OpenLLaMA (open reproduction of Meta AI's LLaMA)

**Web link**: https://huggingface.co/openlm-research/open_llama_3b_v2

**Introduction**: In the repo linked above, the team presents a **permissively licensed open source reproduction of Meta AI's LLaMA large language model**. They release a series of **3B, 7B and 13B models trained on 1T tokens**. They also provide PyTorch and JAX weights of the pre-trained OpenLLaMA models, as well as evaluation results and a comparison against the original LLaMA models. According to the model card, the v2 model is better than the old v1 model, which was trained on a different data mixture.

#### Load LLM

In [None]:
model_name = "openlm-research/open_llama_3b_v2"

#Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
#AutoTokenizer automatically figures out the right tokenizer class for the model (LlamaTokenizer in this case)

#Load the pretrained model for causal language modeling (predict the next token based on the previous context)
non_finetuned = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
#AutoModelForCausalLM automatically loads the correct architecture for language modeling (LlamaForCausalLM in this case)
#device_map="auto"--lets the library automatically place model layers on available hardware

#Specify which device to use if necessary
# non_finetuned.to(device)

#### Q&A Example

In [None]:
#Q&A example using non-finetuned model
input_text = "Tell me how to train my dog to sit"
non_finetuned_output = non_finetuned.generate(tokenizer(input_text, return_tensors="pt").input_ids.to(device),
                                              max_length=100)
#tokenizer()--Converts the text into token IDs that the model can understand; return_tensors="pt"--return PyTorch tensors
#.input_ids.to()--Moves the input_ids tensor to the appropriate device defined before
#.generate()--The model generates based on input, up to max_length tokens, including the input

print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(textwrap.fill(tokenizer.decode(non_finetuned_output[0], #convert result back to readable text
                                     skip_special_tokens=True), #ignore special tokens like <s>, </s>, <pad> in the output
                    width=100))
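
Because `max_length` counts the prompt tokens as well as the generated ones, the room left for new text shrinks as the prompt grows. The arithmetic can be sketched as follows; `new_token_budget` is a hypothetical helper written for illustration, not part of `transformers`.

```python
def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Tokens left for generation when max_length includes the prompt."""
    return max(max_length - prompt_len, 0)

# A 12-token prompt with max_length=100 leaves 88 tokens for the answer.
print(new_token_budget(12, 100))
# A prompt that is already longer than max_length leaves no room at all.
print(new_token_budget(150, 100))
```

If only the generated part should be bounded, `generate` also accepts `max_new_tokens`.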

#### Define Inference Function

In [None]:
#Self-defined function for reusable Q&A generation with an LLM
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(text,
                               return_tensors="pt",
                               truncation=True,
                               max_length=max_input_tokens
                               ) #Encodes text into token IDs, as a PyTorch tensor.
  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(input_ids=input_ids.to(device),
                                                max_length=max_output_tokens #Note: max_length counts the prompt tokens too
                                                )#Generates tokens with the model. The result includes both prompt & new tokens.
  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt,
                                                      skip_special_tokens=True
                                                      )#Converts generated tokens into readable text.
  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]
  return generated_text_answer
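
The function above strips the prompt by character count, which assumes the decoded text starts with exactly the original prompt string. An alternative is to slice at the token level before decoding. The sketch below uses plain Python lists as stand-ins for token ID tensors; `strip_prompt_tokens` and the token IDs are made up for illustration.

```python
def strip_prompt_tokens(generated_ids, prompt_len):
    """Return only the newly generated IDs, assuming the output starts with the prompt."""
    return generated_ids[prompt_len:]

prompt_ids = [101, 7, 42]             # made-up token IDs for the prompt
generated = prompt_ids + [9, 13, 2]   # a causal LM returns the prompt followed by new tokens
answer_ids = strip_prompt_tokens(generated, len(prompt_ids))
print(answer_ids)
```

With real tensors the same idea would be a slice along the sequence dimension, followed by decoding only the remaining IDs.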

In [None]:
# Q&A Example 2:
input_text = "What do you think of Mars?"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, non_finetuned, tokenizer))
#print(textwrap.fill(inference(input_text, non_finetuned, tokenizer),width=100))

In [None]:
# Q&A Example 3:
input_text = "taylor swift's best friend"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, non_finetuned, tokenizer))
#print(textwrap.fill(inference(input_text, non_finetuned, tokenizer),width=100))

In [None]:
# Q&A Example 4:
input_text = """Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, non_finetuned, tokenizer))
#print(textwrap.fill(inference(input_text, non_finetuned, tokenizer),width=100))

**Notes**:

Based on the previous examples, the non-finetuned model shows several problems:
- Lacks task-specific skills
- Generates repetitive loops
- Does not follow instructions well

This is because a non-finetuned model has only been trained on general web-scale text with a causal language modeling objective. It has learned many basic language patterns, but it has not been specialized for tasks such as Q&A, summarization, or following user instructions.

In [None]:
del non_finetuned
torch.cuda.empty_cache()

### Finetuned model--Llama Mediocredev text generation

**Web link**: https://huggingface.co/mediocredev/open-llama-3b-v2-chat

**Introduction**: The Mediocredev Open LLaMA 3B v2 Chat model is a chat-tuned text-generation model built on the OpenLLaMA 3B v2 architecture. It responds to text-based inputs and is suitable for a wide range of applications, from chatbots to content generation. Quantized GGUF variants of the model aim to reduce its size while maintaining its capabilities; the repository loaded below through `AutoModelForCausalLM` is the regular Transformers checkpoint.

#### Load LLM

In [None]:
model_name = "mediocredev/open-llama-3b-v2-chat"

#Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

#Loads the finetuned model for text-generation modeling
finetuned_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

#### Q&A Example--with vs. without prompt format

In [None]:
#Input without special instruction
input_text = "Tell me how to train my dog to sit"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
finetuned_output = inference(input_text, finetuned_model, tokenizer)
print(finetuned_output)

In [None]:
#Input in instruction-tuned prompt format (common with LLaMA-style instruction/chat models)
#[INST] ... [/INST]: many chat-optimized models are trained with these tags and respond better when they see them
#(the exact format varies by model; check the model card)
input_text = "[INST]Tell me how to train my dog to sit[/INST]"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
finetuned_output = inference(input_text, finetuned_model, tokenizer)
print(finetuned_output)
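
To avoid typing the tags by hand in every example, the prompt format can be wrapped in a small helper. This is a minimal sketch: `wrap_inst` is a hypothetical function, and it assumes the `[INST]...[/INST]` format used in the cell above is the one the model expects.

```python
def wrap_inst(prompt: str) -> str:
    """Wrap a user prompt in [INST] ... [/INST] tags (format assumed from the cell above)."""
    return f"[INST]{prompt.strip()}[/INST]"

wrapped = wrap_inst("Tell me how to train my dog to sit")
print(wrapped)
```

The result can be passed to `inference` in place of the hand-written string.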

In [None]:
#Example 2:
input_text = "[INST]What do you think of Mars?[/INST]"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(textwrap.fill(inference(input_text, finetuned_model, tokenizer),width=100))

In [None]:
#Example 3:
input_text = "[INST]taylor swift's best friend[/INST]"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(textwrap.fill(inference(input_text, finetuned_model, tokenizer),width=100))

In [None]:
#Example 4: Without prompt format
input_text = """Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, finetuned_model, tokenizer))

In [None]:
#Example 4: With prompt format [INST]...[/INST]; '???' marks the agent turn the model should fill in (aiming for an answer that reads more like an agent & customer conversation)
input_text = """[INST]Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:???[/INST]"""
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, finetuned_model, tokenizer))

In [None]:
del finetuned_model
torch.cuda.empty_cache()

## Model Finetuning

### Environment Setup

In [None]:
import jsonlines #Change each data point to one row
import itertools
import pandas as pd
from pprint import pprint # print output

import datasets #Load dataset using DT_names
from datasets import load_dataset
import numpy as np

### Data for model finetuning vs pre-training

#### Pretraining data set

**Web Link**: https://huggingface.co/datasets/allenai/c4/blob/main/README.md

**Introduction**: A colossal, cleaned version of Common Crawl's web crawl corpus (based on the [Common Crawl dataset](https://commoncrawl.org)). This is the processed version of Google's C4 dataset.

##### Load data

In [None]:
#Load pretraining data as a streaming iterable dataset
pretrained_dataset = load_dataset("allenai/c4", "en",
                                  split="train",
                                  streaming=True #Instead of downloading entire dataset to disk, enables lazy loading--Samples are streamed one by one
                                  )

##### Data Examples

In [None]:
#Show the pretraining data
n = 2
print("Pretrained dataset:")

#itertools.islice()--grab the first n items from the streaming iterable
top_n = itertools.islice(pretrained_dataset, n)
num=1
for i in top_n:
  print('Data Example ',num,':')
  print('Text:\n', textwrap.fill(i['text'],width=100))
  print('\ntimestamp: ',i['timestamp'])
  print('url: ',i['url'],'\n')
  num+=1
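
The combination of streaming and `itertools.islice` works because only the requested items are ever produced. The sketch below shows the same pattern with a plain generator standing in for the streaming dataset; `fake_stream` is a made-up stand-in, not a `datasets` object.

```python
import itertools

def fake_stream():
    """Stand-in for a streaming dataset: yields one record at a time, without end."""
    i = 0
    while True:
        yield {"text": f"document {i}"}
        i += 1

# islice stops after two items, so the endless generator is never exhausted.
first_two = list(itertools.islice(fake_stream(), 2))
print(first_two)
```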

#### Finetuning dataset

This project uses the following dataset for finetuning.

**Web Link**: https://huggingface.co/datasets/lamini/lamini_docs

**Introduction**: [Lamini](https://huggingface.co/lamini) is an LLM engine that lets developers train high-performing LLMs on large datasets using the Lamini library. It uses the Lamini dataset generator pipeline to generate a filtered dataset of around 37k question and response samples.

##### Load Data

In [None]:
# Load the dataset from Hugging Face
dataset = load_dataset('lamini/lamini_docs')

In [None]:
# Display the dataset info
print(dataset)

##### Data Examples

In [None]:
# Access the 'train' split
train_dataset = dataset['train']

# Data Examples
for i in range(2):  # Adjust the range as needed
    print('Data Example ',i+1,':')
    print('Question:\n', textwrap.fill(train_dataset[i]['question'],width=100))
    print('\nAnswer:\n', textwrap.fill(train_dataset[i]['answer'],width=100))
    print('\ninput_ids: ',train_dataset[i]['input_ids'])
    print('attention_mask: ',train_dataset[i]['attention_mask'])
    print('labels: ',train_dataset[i]['labels'],'\n')


##### Various data formatting methods

###### 01.Combine question and answer

In [None]:
examples = train_dataset
text = examples["question"][0] + examples["answer"][0]
print(textwrap.fill(text,width=100))

In [None]:
#Check the column names explicitly (a Dataset iterates over rows, not column names)
columns = examples.column_names
if "question" in columns and "answer" in columns:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in columns and "response" in columns:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in columns and "output" in columns:
  text = examples["input"][0] + examples["output"][0]

###### 02.Add Prompt

In [None]:
#Adding '###', similar to [INST], can give model clues about questions and answers
prompt_template_qa="""### Question:
{question}

### Answer:
{answer}"""

prompt_template_q = """### Question:
{question}

### Answer:"""

In [None]:
#Data Example using prompt
question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
print(text_with_prompt_template)

In [None]:
#Generate two types of data format:
#Type 1: question & answer in one text with prompt
#Type 2: only question in text and separate answer

#Read each column once instead of re-reading it on every loop iteration
questions = examples["question"]
answers = examples["answer"]
num_examples = len(questions)
finetuning_dataset_text_only = [] #type 1
finetuning_dataset_question_answer = [] #type 2
for question, answer in zip(questions, answers):
  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

In [None]:
#Type 1 Data Example
pprint(finetuning_dataset_text_only[0])

In [None]:
#Type 2 Data Example
pprint(finetuning_dataset_question_answer[0])

##### Common ways of storing data

In [None]:
#Save as a JSON Lines (.jsonl) file: one JSON object per line
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

#Note:Expected input should be a list of dictionaries (or any iterable of dicts)
#     Each dictionary will be written as a separate line in the .jsonl file
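
The one-object-per-line layout can be reproduced with the standard library alone, which makes the format easy to inspect. The sketch below writes to an in-memory buffer instead of a file; the records are made-up placeholders.

```python
import io
import json

records = [{"question": "q1", "answer": "a1"},
           {"question": "q2", "answer": "a2"}]

buffer = io.StringIO()  # in-memory stand-in for a .jsonl file
for record in records:
    buffer.write(json.dumps(record) + "\n")  # one JSON object per line

# Reading back: parse each line independently.
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
print(len(lines), restored == records)
```

Because every line is a complete JSON object, such a file can be read or appended to one record at a time.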

In [None]:
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

#### Instruction Tuning dataset

Instruction tuning involves fine-tuning a pretrained language model on a curated dataset of (instruction, input, output) examples.

In [None]:
import itertools
import jsonlines

from datasets import load_dataset
from pprint import pprint

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

##### Load instruction tuning dataset

**Dataset Used**: tatsu-lab/alpaca

**Web Link**: https://huggingface.co/datasets/tatsu-lab/alpaca

**Introduction**: Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better.

In [None]:
#Load Data
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

In [None]:
instruction_tuned_dataset.info

In [None]:
#Data Example
m = 2
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
num=1
for j in top_m:
  print('Data Example ',num,':')
  pprint(j)
  print('\n')
  num+=1

##### Prompt Templates

In [None]:
#While finetuning, provide detailed instructions, rules, and limitations to increase performance
#Type 1
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

#Type 2
prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

##### Hydrate prompts

In [None]:
#Add data to prompt templates
processed_data = []
for j in top_m:
  if not j["input"]: # If no input, use type 2
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else: #If have input, use type 1
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})

In [None]:
#Data Example:
pprint(processed_data[0])

##### Save data to jsonl

In [None]:
#Save as a JSON Lines (.jsonl) file: one JSON object per line
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)

### Try smaller models--EleutherAI Pythia

**Web Link**: https://huggingface.co/EleutherAI/pythia-70m

**Introduction**: The Pythia Scaling Suite is a collection of models developed to facilitate interpretability [research](https://arxiv.org/pdf/2304.01373). It contains two sets of eight models of sizes 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B.

#### Model without Finetuning

In [None]:
#Load LLM
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

In [None]:
def inference_with_att(input_text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Make sure a pad token is defined (fall back to EOS) so padding works
  if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

  # Tokenize; calling the tokenizer directly also returns the attention mask
  inputs = tokenizer(
          input_text,
          return_tensors="pt",
          truncation=True,
          padding=True,
          max_length=max_input_tokens
  )

  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(input_ids=inputs['input_ids'].to(device),
                                                attention_mask=inputs['attention_mask'].to(device),
                                                pad_token_id=tokenizer.pad_token_id,
                                                max_length=max_output_tokens
                                                )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt,
                                                      skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(input_text):]

  return generated_text_answer

In [None]:
#Load Testing Dataset:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

In [None]:
test_sample = finetuning_dataset["test"][0]
print('Question:')
print(test_sample['question'],'\n')
print('Expected Answer:')
pprint(test_sample['answer'])

print('\nLLM Answer:')
pprint(inference_with_att(test_sample['question'], model,tokenizer))

#### Model with Finetuning

In [None]:
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")

In [None]:
test_sample = finetuning_dataset["test"][0]
print('Question:')
print(test_sample['question'],'\n')
print('Expected Answer:')
pprint(test_sample['answer'])

print('\nLLM Answer:')
pprint(inference_with_att(test_sample['question'], instruction_model,tokenizer))

#### Upload data to Huggingface (if needed)

In [None]:
# Method to upload your own dataset to Huggingface

# !pip install huggingface_hub
# !huggingface-cli login

# import pandas as pd
# import datasets
# from datasets import Dataset

# finetuning_dataset = Dataset.from_pandas(pd.DataFrame(data=finetuning_dataset))
# finetuning_dataset.push_to_hub(dataset_path_hf)

### Data preparation

#### Environment Setup

In [None]:
import pandas as pd
import datasets

from pprint import pprint
from transformers import AutoTokenizer

#### Step 1: Tokenizing text

In [None]:
#Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m") #Automatically selects the right tokenizer class

In [None]:
#Data Example:
text = "Hi, how are you?"
print('Input Text: ',text)

encoded_text = tokenizer(text)["input_ids"]
print('Encoded Text: ',encoded_text)

decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)

**Tokenize multiple texts at once** :

In [None]:
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

#### Step 2: Padding and truncation

In [None]:
# Padding fills shorter texts with a pad token so every text in the batch has the same length (needed for batched computation)

#Set padding method: use the end-of-sequence token (<|endoftext|>, ID 0 for Pythia) as the pad token
tokenizer.pad_token = tokenizer.eos_token
#Padding needed: set to the length of the longest string in the batch
encoded_texts_longest = tokenizer(list_texts, padding=True)

#Previous Example
print('Plain text: ', list_texts)
print("Encoded using padding: ", encoded_texts_longest["input_ids"])
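
What padding does to a batch can be sketched without a tokenizer. The helper below right-pads lists of token IDs to the longest one and builds the matching attention mask (1 for real tokens, 0 for padding). `pad_batch` and the IDs are made up for illustration, and right-side padding is an assumption about the tokenizer's default.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8], [9, 10]], pad_id=0)
print(ids)   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The attention mask is what lets the model tell real tokens from padding, which matters when the pad token is the same as the EOS token.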

In [None]:
#Truncation: Because every LLM has a maximum number of tokens it can handle

#Set truncate threshold: If any tokenized string exceeds 3 tokens, it'll cut off after the 3rd token.
tokenizer.truncation_side = "right"
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True, padding=True)

#Previous Example
print('Truncate from right--Default :')
print('Plain text: ', list_texts)
print("Encoded using truncation: ", encoded_texts_truncation["input_ids"],'\n')

#If need truncate from left:
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print('Truncate from left :')
print('Plain text: ', list_texts)
print("Encoded using truncation: ", encoded_texts_truncation_left["input_ids"])
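
The difference between the two truncation sides comes down to which end of the sequence is kept. A minimal sketch on a plain list of IDs, with `truncate` as a hypothetical helper:

```python
def truncate(ids, max_length, side="right"):
    """Keep at most max_length IDs, dropping the excess from the given side."""
    if len(ids) <= max_length:
        return ids
    if side == "right":
        return ids[:max_length]            # drop the end, keep the beginning
    return ids[len(ids) - max_length:]     # drop the beginning, keep the end

ids = [1, 2, 3, 4, 5]
print(truncate(ids, 3, side="right"))  # [1, 2, 3]
print(truncate(ids, 3, side="left"))   # [3, 4, 5]
```

Left truncation can be useful when the most recent text, such as the end of a conversation, matters more than the beginning.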

#### Step 3: Prepare instruction dataset

In [None]:
#Load Dataset:
filename = 'lamini/lamini_docs'
dataset = load_dataset(filename)
examples = dataset['train']

#Check the column names explicitly (a Dataset iterates over rows, not column names)
columns = examples.column_names
if "question" in columns and "answer" in columns:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in columns and "response" in columns:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in columns and "output" in columns:
  text = examples["input"][0] + examples["output"][0]


#Define template with prompt:
prompt_template = """### Question:
{question}

### Answer:"""

In [None]:
#Hydrate prompts:
#Read each column once instead of re-reading it on every loop iteration
questions = examples["question"]
answers = examples["answer"]
num_examples = len(questions)
finetuning_dataset = []

for question, answer in zip(questions, answers):
  text_with_prompt_template = prompt_template.format(question=question)
  finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})

#Data Example:
print("One datapoint in the finetuning dataset:")
pprint(finetuning_dataset[0])

#### Step 4: Tokenize the instruction dataset

**A single example :**

In [None]:
text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]

tokenized_inputs = tokenizer(text,
                             return_tensors="np", #return output as numpy array
                             padding=True
                             )
print('Tokenized data example with prompt: \n',tokenized_inputs["input_ids"])

In [None]:
#Define Truncation Requirements:

max_length = 2048 #Model limitation
max_length = min(tokenized_inputs["input_ids"].shape[1],
                 max_length)

tokenized_inputs = tokenizer(text,
                             return_tensors="np",
                             truncation=True,
                             max_length=max_length,
                             padding=True)
#print('Tokenized data example with prompt: \n',tokenized_inputs["input_ids"])

**Tokenize the instruction dataset :**

In [None]:
#define tokenize function:
def tokenize_function(examples):

  prompt_template = """### Question:
{question}

### Answer:"""
  #Hydrate the prompt for every question in the batch
  text_with_prompt=[prompt_template.format(question=q) for q in examples['question']]
  examples['question']=text_with_prompt
  #Append the answer so the tokenized text is question + answer, as in the single example above
  text=[q + a for q, a in zip(text_with_prompt, examples['answer'])]

  tokenizer.pad_token=tokenizer.eos_token
  tokenizer.truncation_side='right'
  tokenized_input=tokenizer(text,
                            padding=True, #pad to the longest text in the batch
                            truncation=True,
                            max_length=2048 #model limitation
                            )
  #Causal language modeling: the labels are the input tokens themselves
  tokenized_input['labels']=tokenized_input['input_ids'].copy()
  return tokenized_input


In [None]:
finetuning_dataset_loaded = datasets.load_dataset(filename, split="train")

finetuning_dataset_loaded = finetuning_dataset_loaded.remove_columns(
    [col for col in finetuning_dataset_loaded.column_names if col not in ["question", "answer"]])

# Apply function to each example/batch of examples in the dataset
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True, #call the function by a batch of examples at once
    batch_size=32, #define batch size, usually 32
    drop_last_batch=True # If the total number of examples isn't divisible by batch_size, the last partial batch is dropped
)

In [None]:
#Data Example--Input:
finetuning_dataset_loaded[0]

In [None]:
#Data Example--Output:
tokenized_dataset[0]
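
The tokenize function copies `input_ids` into `labels`, so padded positions are scored like real tokens. One common refinement, shown here only as a sketch and not something this notebook does, is to set the labels at padded positions to -100, the label value that the loss in `transformers` ignores. The attention mask is used instead of the pad ID because the pad token here is the EOS token. `mask_padding_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in transformers

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

labels = mask_padding_labels([5, 6, 0, 0], [1, 1, 0, 0])
print(labels)  # [5, 6, -100, -100]
```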

#### Step 5: Prepare test/train splits

In [None]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

#### Other datasets to try

In [None]:
# finetuning_dataset_path = "lamini/lamini_docs"
# finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
# print(finetuning_dataset)

In [None]:
# taylor_swift_dataset = "lamini/taylor_swift"
# bts_dataset = "lamini/bts"
# open_llms = "lamini/open_llms"

In [None]:
# dataset_swiftie = datasets.load_dataset(taylor_swift_dataset)
# print(dataset_swiftie["train"][1])

### Model Training

#### Environment Setup

In [None]:
import datasets
import tempfile
import logging
import random
# import config
import os
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines
from pprint import pprint

# from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments

from datasets import load_dataset


logger = logging.getLogger(__name__)
global_config = None

Load the Lamini docs dataset:

**Dataset**: https://huggingface.co/datasets/lamini/lamini_docs

**Introduction**: Including Q&A examples, input_ids, attention_mask, and labels.

In [None]:
dataset_path = "lamini/lamini_docs"

Set up the model, training config, and tokenizer:

**Web Link**: https://huggingface.co/EleutherAI/pythia-70m

**Introduction**: The Pythia Scaling Suite is a collection of models developed to facilitate interpretability [research](https://arxiv.org/pdf/2304.01373). It contains two sets of eight models of sizes 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B.

In [None]:
model_name = "EleutherAI/pythia-70m"

In [None]:
# training_config = {
#     "model": {
#         "pretrained_name": model_name,
#         "max_length" : 2048
#     },
#     "datasets": {
#         "use_hf": use_hf,
#         "path": dataset_path
#     },
#     "verbose": True
# }

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name) #Loads tokenizer from Hugging Face
tokenizer.pad_token = tokenizer.eos_token #Set the padding token to be end-of-sequence (EOS) token
dataset = load_dataset(dataset_path)
train_dataset, test_dataset = dataset['train'], dataset['test']


# Use the previous defined tokenize_function with instruction:
# train_dataset = train_dataset.map(tokenize_function,
#                                   batched=True,
#                                   batch_size=1,
#                                   drop_last_batch=True
#                                   )
# test_dataset = test_dataset.map(tokenize_function,
#                                 batched=True,
#                                 batch_size=1,
#                                 drop_last_batch=True
#                                 )

#The dataset already ships pre-tokenized input_ids, attention_mask, and labels, so it can be used directly

#Return the model inputs as torch tensors (int64); output_all_columns=True keeps the
#'question'/'answer' text columns accessible for the inference examples below
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"], output_all_columns=True)
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"], output_all_columns=True)
train_dataset = train_dataset.filter(lambda x: len(x["input_ids"]) > 0)
test_dataset = test_dataset.filter(lambda x: len(x["input_ids"]) > 0)


print(train_dataset)
print(test_dataset)

In [None]:
#Check the tensor dtypes of one formatted example (the Trainer is only defined later)
sample = train_dataset[0]
print({k: sample[k].dtype for k in ["input_ids", "attention_mask", "labels"]})

#### Load Base Model

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
#Checks how many CUDA-compatible GPUs:
device_count = torch.cuda.device_count()

#Set device to use: CUDA or CPU
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")
else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")

In [None]:
#Move model to GPU/CPU
base_model.to(device)

#### Define function to carry out inference

In [None]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  inputs= tokenizer(text,
                    return_tensors="pt",
                    truncation=True, #ensures text won’t exceed max_input_tokens
                    max_length=max_input_tokens,
                    padding=False #added
                    )
  #Using tokenizer instead of tokenizer.encode to generate attention_mask automatically

  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(input_ids=inputs['input_ids'].to(device),
                                                attention_mask=inputs['attention_mask'].to(device),
                                                max_length=max_output_tokens,
                                                pad_token_id=tokenizer.pad_token_id
                                                )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

#### Try Base Model Result

In [None]:
test_text = test_dataset[0]['question']
test_attention_mask=test_dataset[0]['attention_mask']
print("Question input (test):\n", test_text)
print(f"Correct answer from Lamini docs:")
pprint(test_dataset[0]['answer'])
print("\nModel's answer: ")
print(inference(test_text, base_model, tokenizer))

#### Training Setup

In [None]:
#Define the maximum number of training steps
max_steps = 240

#Set fine-tuned model name
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name

In [None]:
#Model Training: using Hugging Face transformers library
training_args = TrainingArguments(learning_rate=1.0e-5, # Learning rate
                                  num_train_epochs=1, # Number of training epochs; one full pass through entire training data
                                  max_steps=max_steps, # 1 step is one parameter update
                                                       # Max steps to train for
                                                       # Overrides num_train_epochs, if not -1

                                  per_device_train_batch_size=1,# Batch size for training
                                  gradient_accumulation_steps = 4, #Accumulate gradients across 4 steps before updating weights (simulate large batch size)
                                  output_dir=output_dir,# Directory to save model checkpoints

                                  # Other arguments
                                  overwrite_output_dir=False, # Do not overwrite the content of the output directory
                                  disable_tqdm=False, # Keep progress bar visible during training
                                  eval_steps=10, # Run evaluation every x steps
                                  save_steps=120, # Save a checkpoint every x steps (a multiple of eval_steps)
                                  warmup_steps=0, # Number of warmup steps for learning rate scheduler
                                  per_device_eval_batch_size=1, # Batch size for evaluation
                                  eval_strategy="steps", # Evaluate every eval_steps steps (alternatives: "epoch", "no")
                                  logging_strategy="steps", # Write training logs every logging_steps steps
                                  logging_steps=1,
                                  optim="adafactor", #Use Adafactor optimizer (memory-efficient, good for large models)
                                  gradient_checkpointing=False,

                                  # Best-checkpoint selection (no early-stopping callback is configured here)
                                  load_best_model_at_end=True, # After training, reload the checkpoint with the lowest evaluation loss
                                  save_total_limit=1, # Limit the number of checkpoints kept on disk

                                  # Evaluation metrics: smaller is better
                                  metric_for_best_model="eval_loss",
                                  greater_is_better=False
)
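
The arguments above interact with one another. As a quick back-of-the-envelope check in plain Python, the sketch below works out the effective batch size and how often evaluation and checkpointing are triggered. `num_devices = 1` is an assumption (a single GPU, or CPU only); the other values are copied from the cell above.

```python
# Values copied from the TrainingArguments cell above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 240
eval_steps = 10
save_steps = 120

# Assumption: training runs on a single device.
num_devices = 1

# Number of examples that contribute to each optimizer (parameter) update.
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)

# Total number of training examples consumed over the whole run.
examples_seen = effective_batch_size * max_steps

# How many times evaluation and checkpointing are triggered.
num_evaluations = max_steps // eval_steps
num_checkpoints = max_steps // save_steps

# With load_best_model_at_end=True and a "steps" strategy,
# save_steps must be a round multiple of eval_steps.
assert save_steps % eval_steps == 0

print(effective_batch_size, examples_seen, num_evaluations, num_checkpoints)
# prints: 4 960 24 2
```

Under these assumptions each weight update averages over 4 examples, so the 240-step run sees 960 examples in total.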

In [None]:
#Use the Hugging Face Trainer API to set up training
trainer = Trainer(model=base_model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
)

#### Model Training

In [None]:
training_output = trainer.train()

#### Save Model to Local

In [None]:
save_dir = f'{output_dir}/Finetuned_model'

trainer.save_model(save_dir)
print("Saved model to:", save_dir)

#### Run Fine-tuned Model Version 1

In [None]:
#Load the local model from saved folder
finetuned_model_v1 = AutoModelForCausalLM.from_pretrained(save_dir,
                                                                local_files_only=True)
finetuned_model_v1.to(device)

In [None]:
test_text = test_dataset[0]['question']
test_attention_mask=test_dataset[0]['attention_mask']
print("Question input (test):\n", test_text)
print(f"Correct answer from Lamini docs:")
pprint(test_dataset[0]['answer'])
print("\nModel's answer: ")
pprint(inference(test_text, finetuned_model_v1, tokenizer))

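#### Run Fine-tuned Model Version 2
Note: the same model, trained for two epochs. `Trainer` updates `base_model` in place, so unless `base_model` is reloaded first, this run continues from the weights already fine-tuned in Version 1.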
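
With `max_steps=-1`, the length of the run is determined by `num_train_epochs`. The rough sketch below estimates how many optimizer steps two epochs correspond to. `num_train_examples = 1260` is a hypothetical dataset size chosen for illustration, not a value taken from this notebook, and the way a partial final accumulation group is rounded can differ between `transformers` versions.

```python
import math

def optimizer_steps(num_train_examples, num_train_epochs,
                    per_device_train_batch_size=1,
                    gradient_accumulation_steps=4,
                    num_devices=1):
    """Approximate number of parameter updates for an epoch-based run."""
    examples_per_update = (per_device_train_batch_size
                           * gradient_accumulation_steps
                           * num_devices)
    # Assumption: a partial final group still counts as one update.
    steps_per_epoch = math.ceil(num_train_examples / examples_per_update)
    return steps_per_epoch * num_train_epochs

# Hypothetical dataset size, for illustration only.
num_train_examples = 1260
two_epoch_steps = optimizer_steps(num_train_examples, num_train_epochs=2)
print(two_epoch_steps)
```

Comparing the result with a fixed `max_steps` value gives a feel for how much longer an epoch-based run trains.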

In [None]:
#Set the fine-tuned model name
trained_model_name = "lamini_docs_two_epochs"
output_dir = trained_model_name


#Model Training: using Hugging Face transformers library
training_args = TrainingArguments(learning_rate=1.0e-5, # Learning rate
                                  num_train_epochs=2, # Number of training epochs; one full pass through entire training data
                                  max_steps=-1, # 1 step is one parameter update
                                                       # Max steps to train for
                                                       # Overrides num_train_epochs, if not -1

                                  per_device_train_batch_size=1,# Batch size for training
                                  gradient_accumulation_steps = 4, # Accumulate gradients over 4 batches before each weight update (simulates a larger batch size)
                                  output_dir=output_dir,# Directory to save model checkpoints

                                  # Other arguments
                                  overwrite_output_dir=False, # Do not overwrite the contents of the output directory
                                  disable_tqdm=False, # Keep progress bar visible during training
                                  eval_steps=20, # Run evaluation every 20 steps
                                  save_steps=120, # Save a checkpoint every 120 steps
                                  warmup_steps=0, # Number of warmup steps for the learning rate scheduler
                                  per_device_eval_batch_size=1, # Batch size for evaluation
                                  eval_strategy="steps", # Evaluate every eval_steps steps (alternatives: "epoch", "no")
                                  logging_strategy="steps", # Write training logs every logging_steps steps
                                  logging_steps=1,
                                  optim="adafactor", #Use Adafactor optimizer (memory-efficient, good for large models)
                                  gradient_checkpointing=False,

                                  # Best-checkpoint selection (no early-stopping callback is configured here)
                                  load_best_model_at_end=True, # After training, reload the checkpoint with the lowest evaluation loss
                                  save_total_limit=1, # Limit the number of checkpoints kept on disk

                                  # Evaluation metrics: smaller is better
                                  metric_for_best_model="eval_loss",
                                  greater_is_better=False
)

#Use the Hugging Face Trainer API to set up training
trainer = Trainer(model=base_model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
)

In [None]:
training_output = trainer.train()

In [None]:
save_dir = f'{output_dir}/Finetuned_model'

trainer.save_model(save_dir)
print("Saved model to:", save_dir)

In [None]:
#Load the local model from saved folder
finetuned_model_v2 = AutoModelForCausalLM.from_pretrained(save_dir,
                                                          local_files_only=True)
finetuned_model_v2.to(device)

In [None]:
dataset = load_dataset(dataset_path)
test_dataset = dataset['test']

In [None]:
test_text = test_dataset[0]['question']
test_attention_mask=test_dataset[0]['attention_mask']
print("Question input (test):\n", test_text)
print(f"Correct answer from Lamini docs:")
pprint(test_dataset[0]['answer'])
print("\nModel's answer: ")
pprint(inference(test_text, finetuned_model_v2, tokenizer))

#### Run a Fine-tuned Model Trained for Longer

In [None]:
finetuned_longer_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
tokenizer = AutoTokenizer.from_pretrained("lamini/lamini_docs_finetuned")

test_text = test_dataset[0]['question']
test_attention_mask=test_dataset[0]['attention_mask']
print("Question input (test):\n", test_text)
print(f"Correct answer from Lamini docs:")
pprint(test_dataset[0]['answer'])
print("\nModel's answer: ")
pprint(inference(test_text, finetuned_longer_model, tokenizer))

##### Explore moderation using a small model

First, try the non-finetuned base model:

In [None]:
base_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
print(inference("What do you think of Mars?", base_model, base_tokenizer))

Now try moderation with the fine-tuned small model:

In [None]:
pprint(inference("What do you think of Mars?", finetuned_longer_model, tokenizer))

### Model Evaluation

#### Set Up a Basic Evaluation Function: Exact Match

In [None]:
def is_exact_match(a, b):
    return a.strip() == b.strip()
#The two strings must match exactly (ignoring leading and trailing whitespace)
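
Strict exact match is a harsh criterion for generated text, because a difference in case or punctuation already counts as a miss. As an illustration, the sketch below compares the strict check with a more lenient variant. `normalize_text` and `is_normalized_match` are hypothetical helpers introduced here and are not used elsewhere in this notebook.

```python
import re
import string

def normalize_text(s: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    s = s.lower()
    s = s.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", s).strip()

def is_normalized_match(a: str, b: str) -> bool:
    # Hypothetical, more lenient variant of is_exact_match
    return normalize_text(a) == normalize_text(b)

# Strict comparison, as in is_exact_match: only surrounding whitespace is ignored
strict = "Yes, it does.".strip() == "yes it does".strip()
# Lenient comparison: case, punctuation, and repeated spaces are ignored
lenient = is_normalized_match("Yes, it does.", "yes  it does")
print(strict, lenient)
# prints: False True
```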

In [None]:
#base_model.eval()

##### Run model and compare to expected answer

In [None]:
gen_answer_base=inference(test_dataset[0]['question'], base_model, tokenizer)
gen_answer_v1=inference(test_dataset[0]['question'], finetuned_model_v1, tokenizer)
gen_answer_v2=inference(test_dataset[0]['question'], finetuned_model_v2, tokenizer)

In [None]:
print('Base: ',is_exact_match(gen_answer_base,test_dataset[0]['answer']))
print('Model V1: ',is_exact_match(gen_answer_v1,test_dataset[0]['answer']))
print('Model V2: ',is_exact_match(gen_answer_v2,test_dataset[0]['answer']))

#### Other Evaluation Methods

In [None]:
import evaluate
rouge= evaluate.load("rouge")
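
To give a sense of what ROUGE-L measures, here is a simplified sketch based on the longest common subsequence (LCS) of the two token sequences. It uses lowercased whitespace tokens, which is an assumption made for brevity; the `rouge` metric loaded above applies its own tokenization, so its scores can differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F-measure on lowercased whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)   # share of predicted tokens in the LCS
    recall = lcs / len(ref)       # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the model can be fine-tuned", "the model is fine-tuned")
print(round(score, 3))
# prints: 0.667
```

In the example the LCS is "the model fine-tuned" (3 tokens), giving precision 3/5 and recall 3/4.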

In [None]:
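from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    # Token-level F1 on whitespace-split tokens
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # Count shared tokens with multiplicity, so repeated tokens are not undercounted
    num_common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_common == 0:
        return 0.0

    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1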

In [None]:
#Score the single test example generated above by finetuned_model_v2
print('finetuned_model_v2:')
print("ROUGE:", rouge.compute(predictions=[gen_answer_v2],
                              references=[test_dataset[0]['answer']]))
print("F1:", f1_score(prediction=gen_answer_v2,
                      reference=test_dataset[0]['answer']))

#### Evaluate on entire test dataset

In [None]:
def func_evaluations(n,test_dataset,model):
  # Evaluate `model` on the first n examples of test_dataset
  metrics = {'exact_matches': [],
             'rouge':[],
             'f1':[]}
  predictions = []

  for i, item in enumerate(test_dataset):
    if i >= n: # Stop once n examples have been visited
      break
    question = item['question']
    answer = item['answer']

    try:
      predicted_answer = inference(question, model, tokenizer)
    except Exception: # Skip examples on which generation fails
      continue

    predictions.append([predicted_answer, answer])

    exact_match = is_exact_match(predicted_answer,answer)
    rouge_num=rouge.compute(predictions=[predicted_answer],
                            references=[answer])
    f1_num=f1_score(prediction=predicted_answer,
                    reference=answer)

    metrics['exact_matches'].append(exact_match)
    metrics['rouge'].append(rouge_num)
    metrics['f1'].append(f1_num)

  predictions_df=pd.DataFrame(predictions, columns=["predicted_answer", "target_answer"])
  return metrics, predictions_df

In [None]:
def print_eval(metrics,model_name):
  # Average each ROUGE variant (rouge1, rouge2, rougeL, rougeLsum) over the evaluated examples
  avg_rouge = {}
  for key in metrics['rouge'][0].keys():
    avg_rouge[key] = sum(d[key] for d in metrics['rouge']) / len(metrics['rouge'])
  print('Evaluation for ',model_name,' :')
  print('Number of exact matches: ', sum(metrics['exact_matches']))
  print('Rouge Score: ', avg_rouge)
  print('F1 Score: ', np.mean(metrics['f1']))

##### Model Comparison

In [None]:
n=20
m_base,pred_base=func_evaluations(n,test_dataset,base_model)
m_v1,pred_v1=func_evaluations(n,test_dataset,finetuned_model_v1)
m_v2,pred_v2=func_evaluations(n,test_dataset,finetuned_model_v2)
m_large,pred_large=func_evaluations(n,test_dataset,finetuned_longer_model)

In [None]:
print_eval(m_base,'base_model')
print()
print_eval(m_v1,'finetuned_model_v1')
print()
print_eval(m_v2,'finetuned_model_v2')
print()
print_eval(m_large,'finetuned_longer_model')

In [None]:
pred_v2.head(2)

#### Evaluate all the data

In [None]:
evaluation_dataset_path = "lamini/lamini_docs_evaluation"
evaluation_dataset = load_dataset(evaluation_dataset_path)

In [None]:
pd.DataFrame(evaluation_dataset)

# Appendix: Deeper into the Transformer

## Checking the Shapes of Inputs and Outputs
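
Before the full model, here is a tiny plain-Python sketch of how a character-level training pair is built. The target sequence is the input shifted one position to the left, so position `t` of the target holds the character that follows position `t` of the input. The toy string and `block_size = 4` are assumptions made for illustration.

```python
toy_text = "hello gpt"

# Character-level vocabulary, built the same way as for the full text
chars = sorted(set(toy_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    # string -> list of integer token ids
    return [stoi[c] for c in s]

def decode(ids):
    # list of integer token ids -> string
    return "".join(itos[i] for i in ids)

data = encode(toy_text)
block_size = 4   # context length (assumption for this toy example)
start = 0

x = data[start:start + block_size]          # inputs
y = data[start + 1:start + block_size + 1]  # targets: inputs shifted by one

# Each prefix of x is a context whose next character is the matching entry of y
for t in range(block_size):
    context = decode(x[:t + 1])
    target = decode([y[t]])
    print(f"context {context!r} -> target {target!r}")
```

A single sequence of length `block_size` therefore yields `block_size` next-character prediction examples at once.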

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F


text = 'GPT, short for Generative Pre-trained Transformer, represents a groundbreaking advancement in the field of artificial intelligence and natural language processing. Developed by OpenAI, GPT is designed to understand, generate, and interpret human language with remarkable accuracy and fluency. It operates on the principle of machine learning, where the model is initially pre-trained on a vast corpus of text data. This pre-training enables GPT to grasp the intricacies of language, including grammar, context, and even subtleties like humor and sarcasm. Following the pre-training phase, GPT undergoes fine-tuning, where it is further trained on a smaller, more specialized dataset to perform specific tasks like translation, question-answering, and content creation. What sets GPT apart is its deep learning architecture, which consists of multiple layers of transformers—hence the name. These transformers allow the model to process and analyze text in a highly efficient and nuanced manner, making GPT capable of generating text that is often indistinguishable from that written by humans. As technology evolves, GPT continues to push the boundaries of what artificial intelligence can achieve in understanding and mimicking human language.'

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Characters from the sentence:", "".join(chars))
print("vocab_size from the sentence: ", vocab_size)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
data = torch.tensor(encode(text), dtype=torch.long)
train_data = data

def get_batch():
    # generate a small batch of data of inputs x and targets y
    data = train_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = len(data) - 1 # what is the maximum context length for predictions?
# block_size = 192 # what is the maximum context length for predictions?
device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_embd = 1500
n_head = 1
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        # Each projection maps (B, T, n_embd) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may attend only to positions <= t
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape  # batch_size, seq_len, n_embd
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        # (equal to C**-0.5 here, because n_head = 1 makes head_size == n_embd)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # Causal mask, e.g. for the tokens "you are the best":
        # "you" sees only itself, "are" sees "you" and "are", and so on
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # input width is num_heads * head_size = n_embd
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the head outputs along the channel dimension:
        # num_heads tensors of (B, T, head_size) -> (B, T, num_heads * head_size) = (B, T, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand: n_embd -> 4 * n_embd
            nn.ReLU(),                  # activation: keeps the strong (positive) signals
            nn.Linear(4 * n_embd, n_embd),   # project back: 4 * n_embd -> n_embd
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # extracts information across positions
        self.ffwd = FeedFoward(n_embd)  # processes the information at each position
        self.ln1 = nn.LayerNorm(n_embd)  # normalization
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class BabyGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # token embedding: each token id is mapped to a vector of size n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # position embedding: each position 0..block_size-1 is mapped to a vector of size n_embd
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])   # n_layer transformer blocks
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)   # maps each embedding to one logit per vocabulary token

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)  token information + position information (pos_emb broadcasts over the batch)
        x = self.blocks(x) # (B,T,C)  information extraction by the transformer blocks
        x = self.ln_f(x) # (B,T,C)  normalization
        logits = self.lm_head(x) # (B,T,vocab_size)  one score per vocabulary token at every position
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)   # cross-entropy loss: the gap between predictions and targets, which training minimizes

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context (the prompt)
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond) # (B, T, vocab_size)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, vocab_size)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, vocab_size)
            # sample from the distribution (a random draw, not an argmax)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence, which becomes the context for the next step
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
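
To get a feel for the size of this model, the sketch below counts its trainable parameters analytically, layer by layer, following the class definitions above. `count_babygpt_params` is a hypothetical helper, and `vocab_size=40` and `block_size=1000` are placeholder values (the real ones depend on the text); `n_embd`, `n_head`, and `n_layer` are the hyperparameters set above.

```python
def count_babygpt_params(vocab_size, block_size, n_embd, n_head, n_layer):
    """Count trainable parameters of a BabyGPT-style model analytically."""
    head_size = n_embd // n_head

    tok_emb = vocab_size * n_embd   # token embedding table
    pos_emb = block_size * n_embd   # position embedding table

    # Attention: key, query, value projections (no bias) for every head,
    # plus the output projection (with bias).
    qkv = 3 * n_embd * head_size * n_head
    proj = n_embd * n_embd + n_embd

    # Feed-forward: n_embd -> 4*n_embd -> n_embd, both layers with bias.
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * n_embd)

    per_block = qkv + proj + ffwd + layer_norms

    final_ln = 2 * n_embd                        # final LayerNorm
    lm_head = n_embd * vocab_size + vocab_size   # output projection with bias

    return tok_emb + pos_emb + n_layer * per_block + final_ln + lm_head

# Placeholder vocab_size and block_size; the other values match the hyperparameters above
total = count_babygpt_params(vocab_size=40, block_size=1000,
                             n_embd=1500, n_head=1, n_layer=4)
print(f"{total:,} parameters")
```

Almost all parameters sit in the transformer blocks, and the count grows roughly with the square of `n_embd`.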

## Test with a babyGPT

In [None]:
m = BabyGPT()

In [None]:
import torch
from tqdm import tqdm

batch_size = 1
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m = m.to(device)
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# Use tqdm to add a progress bar that also displays the intermediate loss
pbar = tqdm(range(500))
for steps in pbar:
    # sample a batch of data
    xb, yb = get_batch()
    xb, yb = xb.to(device), yb.to(device)
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    # Update the progress bar description to show the current loss
    pbar.set_description(f"Loss: {loss.item():.4f}")

print(loss.item())

In [None]:
start_id = torch.tensor([encode('GPT')], dtype=torch.long)
start_id = start_id.to(device)
print(decode(m.generate(idx = start_id, max_new_tokens=200)[0].tolist()))
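
The `generate` method draws the next token at random from the predicted distribution instead of always taking the most likely one. The plain-Python sketch below contrasts the two strategies on a toy 3-token vocabulary. The helper names and the toy logits are assumptions made for illustration, with `random.choices` standing in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    # Always choose the most probable index (deterministic)
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    # Draw one index in proportion to its probability,
    # analogous to torch.multinomial(probs, num_samples=1)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = softmax([1.0, 2.0, 2.1])   # toy logits for a 3-token vocabulary
rng = random.Random(0)             # fixed seed for reproducibility
samples = [sample_pick(probs, rng) for _ in range(1000)]

print("greedy:", greedy_pick(probs))
print("sampled counts:", [samples.count(i) for i in range(3)])
```

Greedy decoding always returns the last index here, while sampling also picks the other two tokens, roughly in proportion to their probabilities. This is why repeated calls to `generate` can produce different text.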

## Revisiting the Attention Mechanism

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
# Each row of wei holds the weights one position assigns to itself and to earlier positions
torch.manual_seed(42)
wei = torch.tril(torch.ones(7, 7))
wei = wei / torch.sum(wei, 1, keepdim=True)
b = torch.randint(0,10,(7,2)).float()
c = wei @ b
print('wei=')
print(wei)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

# Row t of c is the average of rows 0..t of b: each position aggregates only itself and earlier positions.
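
The normalized lower-triangular matrix can also be obtained with a masked softmax, which is the form attention uses. The plain-Python sketch below checks this: with all scores equal to zero, masking future positions with `-inf` and applying softmax gives the same uniform weights over the visible positions. `masked_softmax_row` is a hypothetical helper written for this illustration.

```python
import math

def masked_softmax_row(scores, t):
    """Softmax over scores[0..t]; positions after t are masked out with -inf."""
    masked = [s if i <= t else float("-inf") for i, s in enumerate(scores)]
    m = max(masked)                            # finite, since position 0 is always visible
    exps = [math.exp(v - m) for v in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

T = 4
# All-zero scores: every visible position is treated as equally relevant
uniform = [masked_softmax_row([0.0] * T, t) for t in range(T)]

# The normalized lower-triangular matrix, built directly
tril_avg = [[1.0 / (t + 1) if i <= t else 0.0 for i in range(T)]
            for t in range(T)]

for row in uniform:
    print([round(w, 3) for w in row])
```

Once the scores are data-dependent rather than zero, the same masking and softmax produce non-uniform weights, which is the step from simple averaging to self-attention.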

In [None]:
# self-attention!
#import torch.nn as nn
#from torch.nn import functional as F
torch.manual_seed(1337)
B,T,C = 1,8,32 # batch, time, channels
x = torch.randn(B,T,C)  # (1, 8, 32)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)  # x --> information that can serve as the key
query = nn.Linear(C, head_size, bias=False) # x --> information that can serve as the query
value = nn.Linear(C, head_size, bias=False) # x --> information that can serve as the value

print(key.weight)
print(query)
print(value)

k = key(x)   # (B, T, 16)  (1, 8, 32) * (32, 16)  (1, 8, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)  attention score   (1, 8, 8)

print(wei[0])

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print("masked:",wei[0])
wei = F.softmax(wei, dim=-1)
print("after softmax:",wei[0])

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

In [None]:
wei[0]
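
The toy cell above uses the raw scores `q @ k^T`, whereas the `Head` class defined earlier multiplies the scores by a scaling factor before the softmax. The plain-Python sketch below illustrates the effect of dividing by `sqrt(head_size)`: the softmax becomes less peaked. The random vectors and the fixed seed are assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
head_size = 16
num_keys = 8

# One query and several keys with unit-variance Gaussian entries
q = [rng.gauss(0, 1) for _ in range(head_size)]
keys = [[rng.gauss(0, 1) for _ in range(head_size)] for _ in range(num_keys)]

# Raw dot-product scores; their spread grows with head_size
raw = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
# Scaled scores, as in scaled dot-product attention
scaled = [s / math.sqrt(head_size) for s in raw]

print("largest weight, raw:   ", round(max(softmax(raw)), 3))
print("largest weight, scaled:", round(max(softmax(scaled)), 3))
```

Dividing the scores by a constant greater than 1 can never increase the largest softmax weight, so the scaled version spreads attention over more positions and avoids a nearly one-hot distribution early in training.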