
# Fine-Tuning LLM

## Environment Setup

In [1]:
!pip install torch jsonlines pandas datasets transformers accelerate



In [2]:
!pip install safetensors



In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import textwrap

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Effect of Finetuning: finetuned vs. non-finetuned models

### Non-Finetuned model--OpenLLaMA (open reproduction of Meta AI's LLaMA)

**Web link**: https://huggingface.co/openlm-research/open_llama_3b_v2

**Introduction**: The repository linked above presents a **permissively licensed, open-source reproduction of Meta AI's LLaMA large language model**. The team released a series of **3B, 7B and 13B models trained on 1T tokens**, along with PyTorch and JAX weights of the pre-trained OpenLLaMA models, evaluation results, and a comparison against the original LLaMA models. According to the authors, the v2 model is better than the earlier v1 model, which was trained on a different data mixture.

#### Load LLM

In [4]:
model_name = "openlm-research/open_llama_3b_v2"

#Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
#AutoTokenizer automatically picks the right tokenizer class for the model (LlamaTokenizer in this case)

#Load the pretrained model for causal language modeling (predicting the next token from the preceding context)
non_finetuned = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
#AutoModelForCausalLM automatically loads the correct architecture for language modeling (LlamaForCausalLM in this case)
#device_map="auto" lets the library place model layers on the available hardware automatically

#Move the model to a specific device manually if necessary
# non_finetuned.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message

#### Q&A Example

In [5]:
#Q&A example using non-finetuned model
input_text = "Tell me how to train my dog to sit"
non_finetuned_output = non_finetuned.generate(tokenizer(input_text, return_tensors="pt").input_ids.to(device),
                                              max_length=100)
#tokenizer()--Converts the text into token IDs that the model can understand; return_tensors="pt"--return PyTorch tensors
#.input_ids.to()--Moves the input_ids tensor to the appropriate device defined before
#.generate()--The model generates based on input, up to max_length tokens, including the input

print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(textwrap.fill(tokenizer.decode(non_finetuned_output[0], #convert result back to readable text
                                     skip_special_tokens=True), #ignore special tokens like <s>, </s>, <pad> in the output
                    width=100))

Input Question:
Tell me how to train my dog to sit 

Output Answer:
Tell me how to train my dog to sit and stay. I have a 10 month old puppy. I have been working with
him on sit and stay. I have been working with him for about 2 weeks. I have been using a clicker and
treats. I have been using a treat to get him to sit and then I click and give him a treat. I have
been using a treat to get him to stay. I have been using a treat to get him to


#### Define Inference Function

In [6]:
#Reusable helper that sends a prompt to an LLM and returns only the generated answer
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(text,
                               return_tensors="pt",
                               truncation=True,
                               max_length=max_input_tokens
                               ) #Encode text into token IDs, returned as a PyTorch tensor.
  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(input_ids=input_ids.to(device),
                                                max_length=max_output_tokens
                                                ) #Generate tokens with the model. max_length caps prompt + new tokens together, and the result includes both.
  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt,
                                                      skip_special_tokens=True
                                                      ) #Convert generated tokens into readable text.
  # Strip the prompt (assumes the decoded text starts with the original prompt verbatim)
  generated_text_answer = generated_text_with_prompt[0][len(text):]
  return generated_text_answer
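Because `max_length` in `generate` counts the prompt tokens as well, the room left for new tokens shrinks as the prompt grows. The sketch below illustrates that arithmetic, plus stripping the prompt by token position rather than by character count. The helper names and the token IDs are hypothetical and only serve the illustration; `generate` also accepts `max_new_tokens`, which bounds the new tokens alone.

```python
def new_token_budget(prompt_len, max_length):
    """Tokens left for generation when max_length also counts the prompt."""
    return max(0, max_length - prompt_len)


def strip_prompt_tokens(generated_ids, prompt_len):
    """Drop the prompt by token position instead of by character count."""
    return generated_ids[prompt_len:]


# Hypothetical token IDs, for illustration only.
prompt_ids = [101, 7, 42, 9]
generated_ids = prompt_ids + [55, 56, 57]

print(new_token_budget(len(prompt_ids), max_length=100))    # 96
print(new_token_budget(150, max_length=100))                # 0: no room left
print(strip_prompt_tokens(generated_ids, len(prompt_ids)))  # [55, 56, 57]
```

With the defaults above (`max_output_tokens=100`), a prompt close to 100 tokens leaves almost no budget for an answer.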

In [10]:
# Q&A Example 2:
input_text = "What do you think of Mars?"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, non_finetuned, tokenizer))
#print(textwrap.fill(inference(input_text, non_finetuned, tokenizer),width=100))

Input Question:
What do you think of Mars? 

Output Answer:

I think it's a great place to live.
I think it's a great place to live.
I think it's a great place to live.
I think it's a great place to live.
I think it's a great place to live.
I think it's a great place to live.
I think it's a great place to live.
I think it's a great


In [12]:
# Q&A Example 3:
input_text = "taylor swift's best friend"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, non_finetuned, tokenizer))
#print(textwrap.fill(inference(input_text, non_finetuned, tokenizer),width=100))

Input Question:
taylor swift's best friend 

Output Answer:

taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift's best friend
taylor swift


In [13]:
# Q&A Example 4:
input_text = """Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, non_finetuned, tokenizer))
#print(textwrap.fill(inference(input_text, non_finetuned, tokenizer),width=100))

Input Question:
Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent: 

Output Answer:
 I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent: I'm sorry to hear that.


**Notes**:

The previous examples show several problems with the non-finetuned model:
- It lacks task-specific skills
- It falls into repetition loops
- It does not follow instructions well

This is because a non-finetuned model has only been trained on general web-scale text with a causal language modeling objective. While it has learned many basic language patterns, it has not been specialized for specific tasks such as Q&A, summarization, or following user instructions.
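One rough way to put a number on the repetition loops seen above is to measure how many output lines repeat an earlier line. This is a minimal sketch, not a standard metric; the function name `repeated_line_ratio` is hypothetical, and the sample strings are modelled on the Mars output above.

```python
def repeated_line_ratio(text):
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


# Modelled on the non-finetuned output: 7 identical lines plus a cut-off one.
looping = "\n".join(["I think it's a great place to live."] * 7 + ["I think it's a great"])
varied = "Mars is the fourth planet.\nIt has a thin atmosphere."

print(repeated_line_ratio(looping))  # 0.75
print(repeated_line_ratio(varied))   # 0.0
```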

In [14]:
del non_finetuned
torch.cuda.empty_cache()

### Finetuned model--Llama Mediocredev text generation

**Web link**: https://huggingface.co/mediocredev/open-llama-3b-v2-chat

**Introduction**: The Mediocredev Open Llama 3B V2 Chat GGUF model is a text-generation model built on the LLaMA 3B v2 architecture. It has been quantized to reduce its size while maintaining its capabilities. It can process and respond to text-based inputs quickly, making it suitable for a range of applications, from chatbots to content generation.

#### Load LLM

In [None]:
model_name = "mediocredev/open-llama-3b-v2-chat"

#Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

#Loads the finetuned model for text-generation modeling
finetuned_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

#### Q&A Example--with vs. without prompt format

In [16]:
#Input without special instruction
input_text = "Tell me how to train my dog to sit"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
finetuned_output = inference(input_text, finetuned_model, tokenizer)
print(finetuned_output)

Input Question:
Tell me how to train my dog to sit 

Output Answer:
.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit.
How to train a dog to sit


In [17]:
#Input in instruction-tuned prompt format (common with models like LLaMA and OpenLLaMA instruction models)
#[INST] ... [/INST]: many chat-optimized models are trained with these tags and respond much better when they see them
input_text = "[INST]Tell me how to train my dog to sit[/INST]"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
finetuned_output = inference(input_text, finetuned_model, tokenizer)
print(finetuned_output)

Input Question:
[INST]Tell me how to train my dog to sit[/INST] 

Output Answer:
 I don't have a dog, but here are some general steps to train your dog to sit:

1. Start by sitting down in a comfortable position with your dog.

2. Hold a treat in your hand and place it in front of your dog's nose.

3. Slowly move the treat up to your dog's nose and say "sit" as you
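The tag wrapping used in this example can be factored into a small helper. This is a sketch under the assumption that the model expects exactly the `[INST]...[/INST]` form shown above; the helper name `wrap_inst` is hypothetical, and the exact format a given checkpoint was trained on should be checked on its model card.

```python
def wrap_inst(user_text):
    """Wrap a user message in the [INST]...[/INST] tags used in the cells above."""
    return f"[INST]{user_text.strip()}[/INST]"


print(wrap_inst("Tell me how to train my dog to sit"))
# [INST]Tell me how to train my dog to sit[/INST]
```

The result can be passed to `inference` in place of the hand-written prompt string.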


In [19]:
#Example 2:
input_text = "[INST]What do you think of Mars?[/INST]"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(textwrap.fill(inference(input_text, finetuned_model, tokenizer),width=100))

Input Question:
[INST]What do you think of Mars?[/INST] 

Output Answer:
 I do not have the ability to think or have opinions. However, I can provide information about mars.
Mars is the fourth-largest planet in our solar system and the second-smallest planet in terms of
diameter. It is known for its red color due to iron oxide deposits on its surface. Mars is also the
only planet in our solar system that has a thin atmosphere, which is composed mainly of carbon
dioxide.


In [20]:
#Example 3:
input_text = "[INST]taylor swift's best friend[/INST]"
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(textwrap.fill(inference(input_text, finetuned_model, tokenizer),width=100))

Input Question:
[INST]taylor swift's best friend[/INST] 

Output Answer:
 I do not have a personal opinion or feelings. However, I can provide information about taylor
swift's best friend.   taylor swift's best friend is john mayer. They have been friends since they
were teenagers and have been inseparable ever since. They have collaborated on several songs
together, including "love story" and "dear john." they also attend the same concerts and have been


In [21]:
#Example 4: Without prompt format
input_text = """Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, finetuned_model, tokenizer))

Input Question:
Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent: 

Output Answer:
 I see. Can you please provide me with your order number?
Customer: 1234567890123456
Agent: Thank you. I see that your blanket was shipped to your address.


In [25]:
#Example 4: With prompt format: wrap the dialogue in [INST]...[/INST] and mark the agent's missing reply with ??? (the answer then reads more like an agent & customer conversation)
input_text = """[INST]Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:???[/INST]"""
print('Input Question:')
print(input_text,'\n')
print('Output Answer:')
print(inference(input_text, finetuned_model, tokenizer))

Input Question:
[INST]Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:???[/INST] 

Output Answer:
 Agent: I'm sorry to hear that. Can you please provide me with your order number and the tracking number? 


In [26]:
del finetuned_model
torch.cuda.empty_cache()

## Model Finetuning

### Environment Setup

In [27]:
import jsonlines #Read/write JSON Lines files (one data point per line)
import itertools
import pandas as pd
from pprint import pprint #Pretty-print output

import datasets #Hugging Face datasets library
from datasets import load_dataset #Load a dataset by its name

### Data for model finetuning vs pre-training

#### Pretraining data set

**Web Link**: https://huggingface.co/datasets/allenai/c4/blob/main/README.md

**Introduction**: A colossal, cleaned version of Common Crawl's web crawl corpus (based on the [Common Crawl dataset](https://commoncrawl.org)). This is the processed version of Google's C4 dataset.

##### Load data

In [29]:
#Load pretraining data as a streaming iterable dataset
pretrained_dataset = load_dataset("allenai/c4", "en",
                                  split="train",
                                  streaming=True #Instead of downloading entire dataset to disk, enables lazy loading--Samples are streamed one by one
                                  )

README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/1024 [00:00<?, ?it/s]


##### Data Examples

In [34]:
#Show the pretraining data
n = 2
print("Pretrained dataset:")

#itertools.islice()--grab the first n items from the streaming iterable
top_n = itertools.islice(pretrained_dataset, n)
num=1
for i in top_n:
  print('Data Example ',num,':')
  print('Text:\n', textwrap.fill(i['text'],width=100))
  print('\ntimestamp: ',i['timestamp'])
  print('url: ',i['url'],'\n')
  num+=1

Pretrained dataset:
Data Example  1 :
Text:
 Beginners BBQ Class Taking Place in Missoula! Do you want to get better at making delicious BBQ? You
will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class
BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for
everyone who wants to get better with their culinary skills. He will teach you everything you need
to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat
selection and trimming, plus smoker and fire information. The cost to be in the class is $35 per
person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and
you will be tasting samples of each meat that is prepared.

timestamp:  2019-04-25 12:57:54
url:  https://klyq.com/beginners-bbq-class-taking-place-in-missoula/ 

Data Example  2 :
Text:
 Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012. I've go
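The lazy behaviour that `streaming=True` and `itertools.islice` rely on can be seen without downloading anything. In this sketch a plain generator stands in for the streaming dataset; `stream_records` and `produced` are hypothetical names used only for illustration.

```python
import itertools

produced = []  # records which items were actually generated


def stream_records():
    """Stand-in for a streaming dataset: yields records one at a time, forever."""
    n = 0
    while True:
        produced.append(n)
        yield {"text": f"document {n}"}
        n += 1


# islice pulls only the first two items; nothing beyond them is produced.
first_two = list(itertools.islice(stream_records(), 2))

print(first_two)      # [{'text': 'document 0'}, {'text': 'document 1'}]
print(len(produced))  # 2
```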

#### Finetuning dataset

This is the dataset used for finetuning in this project.

**Web Link**: https://huggingface.co/datasets/lamini/lamini_docs

**Introduction**: [Lamini](https://huggingface.co/lamini) is an LLM engine that lets developers train high-performing LLMs on large datasets using the Lamini library. The dataset was generated and filtered with the Lamini dataset generator pipeline. Its description cites around 37k question-and-response samples, whereas the version loaded below contains 1,260 training and 140 test examples.

##### Load Data

In [35]:
# Load the dataset from Hugging Face
dataset = load_dataset('lamini/lamini_docs')

README.md:   0%|          | 0.00/577 [00:00<?, ?B/s]

(…)-00000-of-00001-5cdebbc48da41394.parquet:   0%|          | 0.00/615k [00:00<?, ?B/s]

(…)-00000-of-00001-4c77a066a883f339.parquet:   0%|          | 0.00/83.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1260 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/140 [00:00<?, ? examples/s]

In [37]:
# Display the dataset info
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


##### Data Examples

In [40]:
# Access the 'train' split
train_dataset = dataset['train']

# Data Examples
for i in range(2):  # Adjust the range as needed
    print('Data Example ',i+1,':')
    print('Question:\n', textwrap.fill(train_dataset[i]['question'],width=100))
    print('\nAnswer:\n', textwrap.fill(train_dataset[i]['answer'],width=100))
    print('\ninput_ids: ',train_dataset[i]['input_ids'])
    print('attention_mask: ',train_dataset[i]['attention_mask'])
    print('labels: ',train_dataset[i]['labels'],'\n')


Data Example  1 :
Question:
 How can I evaluate the performance and quality of the generated text from Lamini models?

Answer:
 There are several metrics that can be used to evaluate the performance and quality of generated text
from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how
well the model predicts the next word in a sequence, while BLEU score measures the similarity
between the generated text and a reference text. Human evaluation involves having human judges rate
the quality of the generated text based on factors such as coherence, fluency, and relevance. It is
recommended to use a combination of these metrics for a comprehensive evaluation of the model's
performance.

input_ids:  [2347, 476, 309, 7472, 253, 3045, 285, 3290, 273, 253, 4561, 2505, 432, 418, 4988, 74, 3210, 32, 2512, 403, 2067, 17082, 326, 476, 320, 908, 281, 7472, 253, 3045, 285, 3290, 273, 4561, 2505, 432, 418, 4988, 74, 3210, 13, 1690, 44229, 414, 13, 378, 1843, 5

## Various ways of formatting your data

In [None]:
examples = train_dataset
text = examples["question"][0] + examples["answer"][0]
#Concatenate the question and the answer
text

"How can I evaluate the performance and quality of the generated text from Lamini models?There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance."

In [None]:
# Check the column names explicitly: `in` on a Dataset iterates over rows, not column names
columns = examples.column_names
if "question" in columns and "answer" in columns:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in columns and "response" in columns:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in columns and "output" in columns:
  text = examples["input"][0] + examples["output"][0]


In [None]:
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

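#The three hash marks (###) work like the [INST] tags: they mark which part is the question and which is the answer, giving the model a clearer prompt
#The same trick can also give better results when prompting ChatGPT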

In [None]:
question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

"### Question:\nHow can I evaluate the performance and quality of the generated text from Lamini models?\n\n### Answer:\nThere are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance."

In [None]:
prompt_template_q = """### Question:
{question}

### Answer:"""

In [None]:
num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

In [None]:
pprint(finetuning_dataset_text_only[0])

{'text': '### Question:\n'
         'How can I evaluate the performance and quality of the generated text '
         'from Lamini models?\n'
         '\n'
         '### Answer:\n'
         'There are several metrics that can be used to evaluate the '
         'performance and quality of generated text from Lamini models, '
         'including perplexity, BLEU score, and human evaluation. Perplexity '
         'measures how well the model predicts the next word in a sequence, '
         'while BLEU score measures the similarity between the generated text '
         'and a reference text. Human evaluation involves having human judges '
         'rate the quality of the generated text based on factors such as '
         'coherence, fluency, and relevance. It is recommended to use a '
         'combination of these metrics for a comprehensive evaluation of the '
         "model's performance."}


In [None]:
pprint(finetuning_dataset_question_answer[0])

{'answer': 'There are several metrics that can be used to evaluate the '
           'performance and quality of generated text from Lamini models, '
           'including perplexity, BLEU score, and human evaluation. Perplexity '
           'measures how well the model predicts the next word in a sequence, '
           'while BLEU score measures the similarity between the generated '
           'text and a reference text. Human evaluation involves having human '
           'judges rate the quality of the generated text based on factors '
           'such as coherence, fluency, and relevance. It is recommended to '
           'use a combination of these metrics for a comprehensive evaluation '
           "of the model's performance.",
 'question': '### Question:\n'
             'How can I evaluate the performance and quality of the generated '
             'text from Lamini models?\n'
             '\n'
             '### Answer:'}
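The two record layouts above can also be built row by row with list comprehensions. This is a minimal sketch on hypothetical rows; the `rows` list stands in for the dataset, and the constant names are made up for the example.

```python
# Same templates as above, written as single-line strings.
PROMPT_Q = "### Question:\n{question}\n\n### Answer:"
PROMPT_QA = PROMPT_Q + "\n{answer}"

# Hypothetical rows standing in for the finetuning dataset.
rows = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Name a primary color.", "answer": "Red"},
]

# Layout 1: a single "text" field holding prompt and answer together.
text_only = [{"text": PROMPT_QA.format(**row)} for row in rows]

# Layout 2: the prompt (ending at "### Answer:") and the answer kept separate.
question_answer = [
    {"question": PROMPT_Q.format(question=row["question"]), "answer": row["answer"]}
    for row in rows
]

print(text_only[0]["text"])
print(question_answer[0])
```

Keeping the answer in its own field makes it easy to evaluate the model later: the `question` field is the prompt and the `answer` field is the reference.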


## Common ways of storing your data

In [None]:
#Save as JSON Lines: a compact plain-text file with one record per line
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)
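A JSON Lines file is plain text with one JSON object per line, so it can be written and read back with the standard library alone. The sketch below uses hypothetical records and a temporary file rather than the file written above.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the processed dataset.
records = [
    {"question": "### Question:\nWhat is 2 + 2?\n\n### Answer:", "answer": "4"},
    {"question": "### Question:\nName a primary color.\n\n### Answer:", "answer": "Red"},
]

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write: one JSON object per line (json.dumps escapes embedded newlines).
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read back: parse each non-empty line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```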

In [None]:
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


# 03-Instruction-tuning

In [None]:
import itertools
import jsonlines

from datasets import load_dataset
from pprint import pprint

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM


## Load instruction tuned dataset

In [None]:
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

README.md:   0%|          | 0.00/7.47k [00:00<?, ?B/s]

In [None]:
m = 5
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
  print(j)

Instruction-tuned dataset:
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three p

## Two prompt templates

In [None]:
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

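#During finetuning, the prompt states an explicit instruction, with its rules and constraints, so the model can complete the task better
#The main difference from the earlier template is the added instruction, which gives the model more explicit guidance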

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

## Hydrate prompts (add data to prompts)

In [None]:
processed_data = []
for j in top_m:
  if not j["input"]:
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else:
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})


In [None]:
pprint(processed_data[0])
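The template selection above can be wrapped in a single function. This is a sketch: the function name `hydrate` and the second record are hypothetical, while the templates and the first record are copied from the cells above.

```python
WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
WITHOUT_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def hydrate(record):
    """Pick the template by whether the record has a non-empty input, then fill it."""
    template = WITH_INPUT if record.get("input") else WITHOUT_INPUT
    # str.format ignores unused keyword arguments, so passing `input` is always safe.
    prompt = template.format(instruction=record["instruction"], input=record.get("input", ""))
    return {"input": prompt, "output": record["output"]}


no_input = {"instruction": "What are the three primary colors?", "input": "",
            "output": "The three primary colors are red, blue, and yellow."}
# Hypothetical record with a non-empty input field.
with_input = {"instruction": "Translate the word to French.", "input": "cat", "output": "chat"}

print(hydrate(no_input)["input"])
print(hydrate(with_input)["input"])
```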

## Save data to jsonl

In [None]:
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)

## Try smaller models

In [None]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

In [None]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize: convert the text into token IDs
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
  )

  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

In [None]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

In [None]:
test_sample = finetuning_dataset["test"][0]
print(test_sample)

print(inference(test_sample["question"], model, tokenizer))

## Compare to finetuned small model

In [None]:
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")

In [None]:
print(inference(test_sample["question"], instruction_model, tokenizer))

In [None]:
# Pssst! If you were curious how to upload your own dataset to Huggingface
# Here is how we did it

# !pip install huggingface_hub
# !huggingface-cli login

# import pandas as pd
# import datasets
# from datasets import Dataset

# finetuning_dataset = Dataset.from_pandas(pd.DataFrame(data=finetuning_dataset))
# finetuning_dataset.push_to_hub(dataset_path_hf)

# 04-Data preparation

In [None]:
import pandas as pd
import datasets

from pprint import pprint
from transformers import AutoTokenizer

## Tokenizing text

In [None]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

In [None]:
text = "Hi, how are you?"

In [None]:
encoded_text = tokenizer(text)["input_ids"]

In [None]:
encoded_text

In [None]:
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)
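The encode/decode round trip can be illustrated with a toy word-level vocabulary. This is a deliberately simplified sketch: real tokenizers such as the one loaded above split text into subword units, and the `ToyTokenizer` class is hypothetical.

```python
class ToyTokenizer:
    """Word-level tokenizer: every whitespace-separated word gets one integer ID."""

    def __init__(self, corpus):
        words = sorted({word for text in corpus for word in text.split()})
        self.token_to_id = {word: i for i, word in enumerate(words)}
        self.id_to_token = {i: word for word, i in self.token_to_id.items()}

    def encode(self, text):
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["Hi, how are you?"])
ids = toy.encode("Hi, how are you?")

print(ids)              # [0, 2, 1, 3]
print(toy.decode(ids))  # Hi, how are you?
```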

## Tokenize multiple texts at once

In [None]:
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

## Padding and truncation

In [None]:
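#To batch sequences of different lengths, padding fills the shorter ones with meaningless filler tokens so they all line up, which is what allows parallel computation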
tokenizer.pad_token = tokenizer.eos_token
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])

In [None]:
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])

In [None]:
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])

In [None]:
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])
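What padding and truncation do to a batch can be sketched on plain lists of integers. This is an illustration only, not the tokenizer's implementation: it assumes padding is added on the right, and the helper names, the pad ID `0`, and the batch values are hypothetical.

```python
def truncate(ids, max_length, side="right"):
    """Keep at most max_length tokens; side names the end that gets cut off."""
    if len(ids) <= max_length:
        return list(ids)
    if side == "right":
        return list(ids[:max_length])
    return list(ids[len(ids) - max_length:])


def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    return [list(ids) + [pad_id] * (longest - len(ids)) for ids in batch]


PAD = 0  # hypothetical pad token ID
batch = [[1, 2, 3, 4, 5], [6, 7], [8]]

print(pad_batch(batch, PAD))                             # [[1, 2, 3, 4, 5], [6, 7, 0, 0, 0], [8, 0, 0, 0, 0]]
print([truncate(ids, 3) for ids in batch])               # [[1, 2, 3], [6, 7], [8]]
print([truncate(ids, 3, side="left") for ids in batch])  # [[3, 4, 5], [6, 7], [8]]
print(pad_batch([truncate(ids, 3) for ids in batch], PAD))  # [[1, 2, 3], [6, 7, 0], [8, 0, 0]]
```

Left-side truncation keeps the end of the sequence, which matters when the answer sits at the end of a prompt-plus-answer string.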

## Prepare instruction dataset

In [None]:
import pandas as pd

filename = 'lamini/lamini_docs'
dataset = datasets.load_dataset(filename)
examples = dataset['train']

# Check the column names explicitly: `in` on a Dataset iterates over rows, not column names
columns = examples.column_names
if "question" in columns and "answer" in columns:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in columns and "response" in columns:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in columns and "output" in columns:
  text = examples["input"][0] + examples["output"][0]

# Build the prompt-formatted dataset by concatenating strings
prompt_template = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]
  text_with_prompt_template = prompt_template.format(question=question)
  finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})

from pprint import pprint
print("One datapoint in the finetuning dataset:")
pprint(finetuning_dataset[0])

## Tokenize a single example

In [None]:
text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print(tokenized_inputs["input_ids"])

In [None]:
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

In [None]:
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)

In [None]:
tokenized_inputs["input_ids"]

## Tokenize the instruction dataset

In [None]:
def tokenize_function(examples):
    if "question" in examples and "answer" in examples:
      text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
      text = examples["input"][0] + examples["output"][0]
    else:
      text = examples["output"][0]

    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

In [None]:
finetuning_dataset_loaded = datasets.load_dataset(filename, split="train")

tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)
pprint(tokenized_dataset[0])

# Convert the text into numeric form (token IDs)

## Prepare test/train splits

In [None]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

### Some datasets for you to try

In [None]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

In [None]:
taylor_swift_dataset = "lamini/taylor_swift"
bts_dataset = "lamini/bts"
open_llms = "lamini/open_llms"

In [None]:
dataset_swiftie = datasets.load_dataset(taylor_swift_dataset)
print(dataset_swiftie["train"][1])

# 05-Training

In [None]:
import datasets
import tempfile
import logging
import random
# import config
import os
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines

# from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments

from datasets import load_dataset


logger = logging.getLogger(__name__)
global_config = None

## Load the Lamini docs dataset

In [None]:
dataset_path = "lamini/lamini_docs"

## Set up the model, training config, and tokenizer

In [None]:
model_name = "EleutherAI/pythia-70m"

In [None]:
# training_config = {
#     "model": {
#         "pretrained_name": model_name,
#         "max_length" : 2048
#     },
#     "datasets": {
#         "use_hf": use_hf,
#         "path": dataset_path
#     },
#     "verbose": True
# }

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset(dataset_path)
train_dataset, test_dataset = dataset['train'], dataset['test']

print(train_dataset)
print(test_dataset)

# train_dataset = train_dataset.map(
#     tokenize_function,
#     batched=True,
#     batch_size=1,
#     drop_last_batch=True
# )
# test_dataset = test_dataset.map(
#     tokenize_function,
#     batched=True,
#     batch_size=1,
#     drop_last_batch=True
# )

## Load the base model

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")
else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")

In [None]:
base_model.to(device)

## Define function to carry out inference

In [None]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
  )

  # Generate
  # Note: `max_length` bounds the total sequence length (prompt plus generated tokens)
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

## Try the base model

In [None]:
test_text = test_dataset[0]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {test_dataset[0]['answer']}")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))

### Setup training

In [None]:
max_steps = 240

In [None]:
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name

In [None]:
training_args = TrainingArguments(

  # Learning rate
  learning_rate=1.0e-5,

  # Number of training epochs
  num_train_epochs=1,

  # Max steps to train for (each step is one optimizer update,
  # i.e. gradient_accumulation_steps batches per device)
  # Overrides num_train_epochs, if not -1
  max_steps=max_steps,

  # Batch size for training
  per_device_train_batch_size=1,

  # Directory to save model checkpoints
  output_dir=output_dir,

  # Other arguments
  overwrite_output_dir=False, # Whether to overwrite the content of the output directory
  disable_tqdm=False, # Whether to disable progress bars
  eval_steps=10, # Number of update steps between two evaluations
  save_steps=120, # Save a checkpoint every this many update steps
  warmup_steps=0, # Number of warmup steps for learning rate scheduler
  per_device_eval_batch_size=1, # Batch size for evaluation
  evaluation_strategy="steps",
  logging_strategy="steps",
  logging_steps=1,
  optim="adafactor",
  gradient_accumulation_steps = 4,
  gradient_checkpointing=False,

  # Parameters for early stopping
  load_best_model_at_end=True,
  save_total_limit=1,
  metric_for_best_model="eval_loss",
  greater_is_better=False
)
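
As a quick sanity check on the step budget, the sketch below works out how many training examples the run above consumes. It assumes a single training device (`num_devices = 1` is an assumption, not something the notebook states); the other values are copied from the `TrainingArguments` above.

```python
# Values copied from the TrainingArguments above
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 240

# Assumption for illustration: training runs on a single device
num_devices = 1

# Examples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Examples consumed over the whole run
examples_seen = effective_batch_size * max_steps

print("Effective batch size:", effective_batch_size)  # 4
print("Examples seen in total:", examples_seen)       # 960
```

Whether 960 examples amounts to more or less than one pass over the data depends on the size of the training split, which `print(train_dataset)` reports.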

In [None]:
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

### Train a few steps

In [None]:
training_output = trainer.train()

### Save model locally

In [None]:
save_dir = f'{output_dir}/final'

trainer.save_model(save_dir)
print("Saved model to:", save_dir)

In [None]:
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)


In [None]:
finetuned_slightly_model.to(device)

### Run slightly trained model

In [None]:
test_question = test_dataset[0]['question']
print("Question input (test):", test_question)

print("Finetuned slightly model's answer: ")
print(inference(test_question, finetuned_slightly_model, tokenizer))

In [None]:
test_answer = test_dataset[0]['answer']
print("Target answer output (test):", test_answer)

### Run same model trained for two epochs

In [None]:
finetuned_longer_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
tokenizer = AutoTokenizer.from_pretrained("lamini/lamini_docs_finetuned")

finetuned_longer_model.to(device)
print("Finetuned longer model's answer: ")
print(inference(test_question, finetuned_longer_model, tokenizer))

### Run much larger trained model and explore moderation

# Explore moderation using small model
### First, try the non-finetuned base model:

In [None]:
base_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
print(inference("What do you think of Mars?", base_model, base_tokenizer))

### Now try moderation with finetuned small model

In [None]:
print(inference("What do you think of Mars?", finetuned_longer_model, tokenizer))

# 06-Evaluation

In [None]:
import datasets
import tempfile
import logging
import random
import os
import yaml
import difflib
import pandas as pd

import transformers
import torch

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

logger = logging.getLogger(__name__)
global_config = None

In [None]:
dataset = datasets.load_dataset("lamini/lamini_docs")

test_dataset = dataset["test"]

In [None]:
print(test_dataset[0]["question"])
print(test_dataset[0]["answer"])

In [None]:
model_name = "lamini/lamini_docs_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

## Setup a really basic evaluation function

In [None]:
def is_exact_match(a, b):
    return a.strip() == b.strip()
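
Exact match is a very strict metric for free-form generated text: one extra word makes the score `False`. As a rough complement, the sketch below uses the standard-library `difflib.SequenceMatcher` to get a similarity score between 0 and 1. The helper name `similarity_ratio` and the two example strings are made up for illustration and are not taken from the dataset.

```python
import difflib


def is_exact_match(a, b):
    return a.strip() == b.strip()


def similarity_ratio(a, b):
    """Return a similarity score in [0, 1]; 1.0 means the stripped strings are identical."""
    return difflib.SequenceMatcher(None, a.strip(), b.strip()).ratio()


target = "Yes, Lamini can generate technical documentation."
prediction = "Yes, Lamini can generate technical documentation or user manuals."

print(is_exact_match(prediction, target))           # False
print(round(similarity_ratio(prediction, target), 2))
print(similarity_ratio(target, target + "  "))      # 1.0
```

A character-level ratio like this is still a crude proxy: it rewards surface overlap, not correctness, so it is best read alongside manual inspection of the predictions.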

In [None]:
model.eval()

In [None]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  tokenizer.pad_token = tokenizer.eos_token
  input_ids = tokenizer.encode(
      text,
      return_tensors="pt",
      truncation=True,
      max_length=max_input_tokens
  )

  # Generate
  # Note: `max_length` bounds the total sequence length (prompt plus generated tokens)
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

## Run model and compare to expected answer

In [None]:
test_question = test_dataset[0]["question"]
generated_answer = inference(test_question, model, tokenizer)
print(test_question)
print(generated_answer)

In [None]:
answer = test_dataset[0]["answer"]
print(answer)

In [None]:
exact_match = is_exact_match(generated_answer, answer)
print(exact_match)

## Run over entire dataset

In [None]:
n = 10  # evaluate only the first few examples; set to -1 to run over the whole test set
metrics = {'exact_matches': []}
predictions = []
for i, item in tqdm(enumerate(test_dataset)):
    print(f"{i} Evaluating: {item}")
    question = item['question']
    answer = item['answer']

    try:
      predicted_answer = inference(question, model, tokenizer)
    except Exception:
      # Skip examples that fail during generation
      continue
    predictions.append([predicted_answer, answer])

    # Compare this example's prediction (not the earlier `generated_answer`) with its target
    exact_match = is_exact_match(predicted_answer, answer)
    metrics['exact_matches'].append(exact_match)

    if i > n and n != -1:
      break
print('Number of exact matches: ', sum(metrics['exact_matches']))

In [None]:
df = pd.DataFrame(predictions, columns=["predicted_answer", "target_answer"])
print(df)

## Evaluate all the data

In [None]:
evaluation_dataset_path = "lamini/lamini_docs_evaluation"
evaluation_dataset = datasets.load_dataset(evaluation_dataset_path)

In [None]:
pd.DataFrame(evaluation_dataset)

# 07-Deeper into Transformer

## Check the shapes of the inputs and outputs

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F


text = 'GPT, short for Generative Pre-trained Transformer, represents a groundbreaking advancement in the field of artificial intelligence and natural language processing. Developed by OpenAI, GPT is designed to understand, generate, and interpret human language with remarkable accuracy and fluency. It operates on the principle of machine learning, where the model is initially pre-trained on a vast corpus of text data. This pre-training enables GPT to grasp the intricacies of language, including grammar, context, and even subtleties like humor and sarcasm. Following the pre-training phase, GPT undergoes fine-tuning, where it is further trained on a smaller, more specialized dataset to perform specific tasks like translation, question-answering, and content creation. What sets GPT apart is its deep learning architecture, which consists of multiple layers of transformers—hence the name. These transformers allow the model to process and analyze text in a highly efficient and nuanced manner, making GPT capable of generating text that is often indistinguishable from that written by humans. As technology evolves, GPT continues to push the boundaries of what artificial intelligence can achieve in understanding and mimicking human language.'

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Characters from the sentence:", "".join(chars))
print("vocab_size from the sentence: ", vocab_size)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
data = torch.tensor(encode(text), dtype=torch.long)
train_data = data

def get_batch():
    # generate a small batch of data of inputs x and targets y
    data = train_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = len(data) - 1 # what is the maximum context length for predictions?
# block_size = 192 # what is the maximum context length for predictions?
device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_embd = 1500
n_head = 1
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        # Each projection maps n_embd -> head_size
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Shape examples in the comments assume B=4, T=100, n_embd=1500, head_size=150
        B,T,C = x.shape  # (batch_size, seq_len, n_embd), e.g. (4, 100, 1500)
        k = self.key(x)   # (B,T,hs): (4, 100, 1500) @ (1500, 150) -> (4, 100, 150)
        q = self.query(x) # (B,T,hs): (4, 100, 1500) @ (1500, 150) -> (4, 100, 150)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        # (identical to scaling by C when n_head == 1, since head_size == n_embd then)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T), e.g. (4, 100, 100)
        # causal mask: row t may only attend to columns <= t, so for "you are the best"
        # "you" sees only "you", "are" sees "you are", and so on
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs): (4, 100, 1500) @ (1500, 150) -> (4, 100, 150)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # input width is num_heads * head_size, which equals n_embd
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the head outputs along the channel dimension,
        # e.g. 10 heads of (4, 100, 150) -> (4, 100, 1500) = (batch, seq_len, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand to a higher dimension (n_embd -> 4 * n_embd)
            nn.ReLU(),                       # activation function: keeps the strong (positive) signals
            nn.Linear(4 * n_embd, n_embd),   # project back down to n_embd
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # self-attention: gathers information across tokens
        self.ffwd = FeedForward(n_embd)  # feed-forward: processes the gathered information per token
        self.ln1 = nn.LayerNorm(n_embd)  # layer normalization
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class BabyGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position lookup tables, each mapping an integer index to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)  # token embedding: token IDs such as [0, 2, 5, 6] -> one vector each
        self.position_embedding_table = nn.Embedding(block_size, n_embd)   # position embedding: positions 0..T-1 -> one vector each
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])   # n_layer transformer blocks
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)   # linear layer (shape change): embeddings -> logits, one score per vocabulary token at each position

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensors of integers; targets are idx shifted by one position
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)  token information + position information, e.g. (4, 100, 1500)
        x = self.blocks(x) # (B,T,C)  information extraction by the transformer blocks
        x = self.ln_f(x) # (B,T,C)  normalization
        logits = self.lm_head(x) # (B,T,vocab_size)  e.g. (4, 100, 1500) @ (1500, 39) -> (4, 100, 39)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)   # cross-entropy loss: the gap between predictions and targets, which training minimizes

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context (the prompt)
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C), e.g. (4, 100, 39) -> (4, 39)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution (not argmax, so lower-probability tokens can also be chosen)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence, which becomes the context for the next step
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
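
The character-level tokenizer and the shifted targets produced by `get_batch` above can be checked on a tiny string without any model. The sketch below is a pure-Python illustration; the `sample` string and the small `block_size` are chosen only for readability and are not the values used in the notebook.

```python
sample = "GPT is a transformer"  # toy text, not the notebook's paragraph

# Same construction as above: one integer ID per distinct character
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    return [stoi[c] for c in s]


def decode(ids):
    return "".join(itos[i] for i in ids)


ids = encode(sample)
assert decode(ids) == sample  # encoding is lossless for characters seen in the text

# Next-character prediction: targets are the inputs shifted left by one position
block_size = 5
x = ids[0:block_size]
y = ids[1:block_size + 1]
for t in range(block_size):
    context = decode(x[: t + 1])
    target = decode([y[t]])
    print(f"context {context!r} -> target {target!r}")
```

Each position in `x` therefore yields one training signal: given the characters up to and including position `t`, predict the character at position `t + 1`.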

## Test with a babyGPT

In [None]:
m = BabyGPT()

In [None]:
import torch
from tqdm import tqdm

batch_size = 1
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m = m.to(device)
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# Use tqdm to add a progress bar and show the intermediate loss value in it
pbar = tqdm(range(500))
for steps in pbar:
    # sample a batch of data
    xb, yb = get_batch()
    xb, yb = xb.to(device), yb.to(device)
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    # Update the progress bar description to show the current loss value
    pbar.set_description(f"Loss: {loss.item():.4f}")

print(loss.item())

In [None]:
start_id = torch.tensor([encode('GPT')], dtype=torch.long)
start_id = start_id.to(device)
print(decode(m.generate(idx = start_id, max_new_tokens=200)[0].tolist()))

## Rethinking the attention mechanism

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"


# example tokens and their positions:
# you  are  the  best
# 1    2    3    4
torch.manual_seed(42)
wei = torch.tril(torch.ones(7, 7))
wei = wei / torch.sum(wei, 1, keepdim=True)
b = torch.randint(0,10,(7,2)).float()
c = wei @ b
print('wei=')
print(wei)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

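# Illustration for "you are the best": each row (token) gets a nonzero weight (x)
# only for itself and earlier tokens
#        you  are  the  best
# you     x    0    0    0
# are     x    x    0    0
# the     x    x    x    0
# best    x    x    x    x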
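
One way to read the toy example above: multiplying by a row-normalized lower-triangular matrix is the same as taking a running mean over each token's prefix. The pure-Python sketch below checks this on a made-up list of four values (one feature per token); it uses plain lists rather than `torch` so it can be run anywhere.

```python
T = 4
values = [2.0, 6.0, 1.0, 7.0]  # one made-up feature per token

# Row t of the normalized lower-triangular matrix puts weight 1/(t+1) on tokens 0..t
wei = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

# Matrix-vector product: each output is a weighted sum of the values
aggregated = [sum(w * v for w, v in zip(row, values)) for row in wei]

# The same result computed directly as a running mean over the prefix
running_mean = [sum(values[: t + 1]) / (t + 1) for t in range(T)]

print(aggregated)
print(running_mean)
```

Self-attention keeps this structure but replaces the uniform weights `1/(t+1)` with data-dependent weights computed from queries and keys.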

In [None]:
# self-attention!
#import torch.nn as nn
#from torch.nn import functional as F
torch.manual_seed(1337)
B,T,C = 1,8,32 # batch, time, channels
x = torch.randn(B,T,C)  # (1, 8, 32)

# let's see a single Head perform self-attention
head_size = 16
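key = nn.Linear(C, head_size, bias=False)  # x --> captures the information that serves as the key
query = nn.Linear(C, head_size, bias=False) # x --> captures the information that serves as the query
value = nn.Linear(C, head_size, bias=False) # x --> captures the information that serves as the value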

print(key.weight)
print(query)
print(value)

k = key(x)   # (B, T, 16)  (1, 8, 32) * (32, 16)  (1, 8, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)  attention score   (1, 8, 8)

print(wei[0])

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print("masked:",wei[0])
wei = F.softmax(wei, dim=-1)
print("after softmax:",wei[0])

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

In [None]:
wei[0]
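
To see why filling future positions with `-inf` works, the sketch below reimplements the masking and softmax steps in pure Python on a small made-up score matrix. The helper name `causal_softmax` and the scores are hypothetical; the logic mirrors the `masked_fill` plus `F.softmax` steps above.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Future positions get -inf, so exp(-inf) == 0 removes them from the softmax
        masked = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(v - row_max) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


toy_scores = [[0.5, 2.0, -1.0], [1.0, 1.0, 3.0], [0.0, 0.0, 0.0]]  # made-up scores
for row in causal_softmax(toy_scores):
    print([round(w, 3) for w in row])
```

Every row still sums to 1, but all the weight is spread over the current and earlier positions, which is what makes the attention causal.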