# Converting the State Dict

The training script (`train.py`) doesn't support any fancy saving or checkpointing methods, but it can optionally save the model at the end of training to a safetensors file. In this notebook we'll show how to load these saved weights for downstream evaluation and usage. This conversion should become unnecessary as frameworks integrate the changes needed to make FSDP+QLoRA work natively.

As an example, let's look at a model trained with the following command (using default settings for LoRA rank etc):

`python train.py --save_model True --train_type qlora --output_dir qlora_output`

We'll load the saved state_dict, and then copy the relevant weights into a PEFT model so that we can save it with PEFT's `save_pretrained` method.

Let's start by loading the state dict. If you uncomment the print statement, you'll see that for every linear layer that had a LoRA adapter, we have something like this:
```
base_model.model.model.layers.0.mlp.down_proj.base_layer.weight torch.bfloat16 torch.Size([11272192, 1])
base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight torch.bfloat16 torch.Size([8, 11008])
base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight torch.bfloat16 torch.Size([4096, 8])
```

The base weights are flattened, 4-bit quantized values, which we won't need (we'll load the original base model later). The `lora_A` and `lora_B` adapter weights are the ones we're interested in.
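
The shapes printed above can be checked with a little arithmetic. The sketch below takes the `down_proj` dimensions implied by the LoRA shapes (11008 inputs, 4096 outputs, rank 8). It also assumes that the 4-bit base weights are packed into a bfloat16 storage container; that assumption is inferred from the printed shape, not stated by the training script here.

```python
# Dimensions implied by the lora_A / lora_B shapes printed above.
in_features, out_features, rank = 11008, 4096, 8

# LoRA adapter sizes: A is [rank, in_features], B is [out_features, rank].
lora_a_params = rank * in_features
lora_b_params = out_features * rank

# Assumption: each base weight takes 4 bits, and the packed bytes are
# viewed as 2-byte bfloat16 elements, so 4 weights fit in one element.
base_params = in_features * out_features
packed_bytes = base_params * 4 // 8
bf16_elements = packed_bytes // 2

print(lora_a_params, lora_b_params)  # 88064 32768
print(bf16_elements)                 # 11272192, matching torch.Size([11272192, 1])
```

Under these assumptions the adapters hold roughly 120 thousand values for this layer, compared with about 45 million base weights.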

In [None]:
from safetensors import safe_open

tensors = {}
with safe_open("../results/model_state_dict.safetensors", framework="pt", device=0) as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k) # Loads the full tensor given a key
        # print(k, tensors[k].dtype, tensors[k].shape) # Uncomment to view
        # print(k) # Uncomment to view

To save memory, we can drop everything except the LoRA weights:

In [None]:
# Keep only the LoRA tensors; dropping the other references lets their memory be freed
tensors = {k: v for k, v in tensors.items() if 'lora' in k}

Next, we load the base model and add a randomly initialized adapter:

In [None]:
import torch
from transformers import LlamaForCausalLM, BitsAndBytesConfig, AutoTokenizer
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType

# Make sure the compute type, target modules, rank, alpha etc match!
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load the base model in 4-bit
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", use_cache=False, quantization_config=bnb_config)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load Tokenizer 
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Freeze
for param in model.parameters():
    param.requires_grad = False

# Add LoRA (make sure your rank (r) and alpha (lora_alpha) values match those used in training!)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, 
    inference_mode=False, r=64, 
    lora_alpha=16, lora_dropout=0.1,
    target_modules=["k_proj", "q_proj", "v_proj", "up_proj", "down_proj", "gate_proj"]
)
model = get_peft_model(model, peft_config)

# Check out the first few keys in the state dict:
list(model.state_dict().keys())[:10]
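
Before copying weights across, it can help to confirm that every LoRA key the PEFT model expects is actually present in the saved checkpoint; a mismatch in target modules tends to show up here. The helper below is a hypothetical sketch (the name `find_missing_lora_keys` is not part of PEFT or the training script) and runs on plain lists of key names.

```python
def find_missing_lora_keys(model_keys, saved_keys):
    """Return LoRA keys the PEFT model expects but the checkpoint lacks."""
    saved = set(saved_keys)
    return sorted(k for k in model_keys if "lora" in k and k not in saved)

# Illustrative key names in the same style as the output shown earlier.
model_keys = [
    "base_model.model.model.layers.0.mlp.down_proj.base_layer.weight",
    "base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight",
]
saved_keys = [
    "base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight",
]

missing = find_missing_lora_keys(model_keys, saved_keys)
print(missing)
```

In this notebook the call would be `find_missing_lora_keys(model.state_dict().keys(), tensors.keys())`; an empty list means every adapter weight has a saved counterpart. Note that this checks names only, so a rank mismatch would still need a shape comparison.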

Now, if all goes well, we can replace the randomly initialized LoRA layers with our trained ones:

In [None]:
new_sd = model.state_dict()

# Create a list to store keys to remove
keys_to_remove = []

# The model above was quantized for QLoRA via `bnb_config`, so here we need to remove the quantization-related keys from the dict
for k in new_sd:
    if 'lora' in k:
        new_sd[k] = tensors[k]
    elif ('.absmax' in k) or ('.quant_state.bitsandbytes__nf4' in k) or ('.quant_map' in k): 
        keys_to_remove.append(k)
    else:
        continue

# Iterate over the list of keys to remove and delete them from the dictionary
for k in keys_to_remove:
    del new_sd[k]

In [None]:
model.load_state_dict(new_sd)

And now, since we have a regular PEFT model, we can save using the built-in methods:

In [None]:
model.save_pretrained("lora_adapters")

In [None]:
! ls lora_adapters

In [None]:
model.push_to_hub('chwenjun225/lora_adapters') # If you want to share your model... 

## Merged LoRA

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-2-7b-hf" 
new_model_name = "chwenjun225/lora_adapters"

# Make sure the compute type, target modules, rank, alpha etc match!
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Reload the base model (4-bit quantized via bnb_config, FP16 for the remaining modules) and merge the LoRA weights into it
base_model = LlamaForCausalLM.from_pretrained(
    base_model_name, 
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto", 
    quantization_config=bnb_config
)
from peft import LoraConfig, PeftModel
model = PeftModel.from_pretrained(base_model, new_model_name)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
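
For intuition, merging folds each adapter into its base weight using the standard LoRA update `W' = W + (lora_alpha / r) * B @ A`. The pure-Python sketch below applies that formula to tiny made-up matrices. It is not PEFT's implementation, and it ignores quantization entirely; the function names are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy example: a 2x3 weight with a rank-1 adapter (values are made up).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]  # shape [r, in_features] = [1, 3]
lora_b = [[0.5], [-0.5]]    # shape [out_features, r] = [2, 1]

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0, 3.0], [-1.0, -1.0, -3.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like a regular model.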

In [None]:
# Save the merged model and its tokenizer
model.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")

In [None]:
import os
hf_token = os.environ["HF_TOKEN"]  # Read the token from the environment; never hardcode it in a notebook
model.push_to_hub("chwenjun225/lora_adapters", token=hf_token)
tokenizer.push_to_hub("chwenjun225/lora_adapters", token=hf_token)

## Inference

In [None]:
MY_PROMPT = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n ### Instruction: \n{}\n\n### Response:"
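
As a small illustration of how the template is used, the sketch below fills its single `{}` slot with an instruction. The template string is copied from `MY_PROMPT` so the block runs on its own; the helper name `build_prompt` and the sample instruction are hypothetical.

```python
# Same template as MY_PROMPT, repeated so this block is self-contained.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    "\n\n ### Instruction: \n{}\n\n### Response:"
)

def build_prompt(instruction: str) -> str:
    """Insert the instruction text into the template's single `{}` slot."""
    return PROMPT_TEMPLATE.format(instruction)

prompt = build_prompt("Summarize the article in one sentence.")
print(prompt)
```

The prompt ends with `### Response:`, so the model's generated text continues directly from that marker.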

In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer 

model_inference = AutoPeftModelForCausalLM.from_pretrained("chwenjun225/lora_adapters")
tokenizer_inference = AutoTokenizer.from_pretrained("chwenjun225/lora_adapters")

In [4]:
inputs = tokenizer_inference(
[
    MY_PROMPT.format("""Extract information that you have learned from this source text:  
MUSIC
Pucker up! Kiss to open final 'End of the Road' tour in Cincinnati 💋
Portrait of Luann GibbsLuann Gibbs
Cincinnati Enquirer

The final leg of the Kiss "End of the Road" tour begins in Cincinnati. The iconic band are wrapping up a 50-year career with a North American tour that starts at Heritage Bank Center in Cincinnati, and ends at New York City's Madison Square Garden. Tickets go on sale Friday, June 9, 2023.
The end of the road begins in Cincinnati. The legendary rock 'n' roll band Kiss is closing out a 50-year career, but before the band packs away its iconic makeup and wild costumes, the boys are taking one last ride around the world with a final tour, fittingly titled the "End of the Road" tour. It will span 50 dates around the world, and the North American leg kicks off Oct. 19 right here in Cincinnati.

Tickets go on sale Friday, June 9, for the show, which will take place at Heritage Bank Center (100 Broadway, Downtown). The tour wraps up in December with a massive final show at Madison Square Garden in New York City.

Concert dates:Cincinnati's full 2023 concert calendar 🎵

Kiss was formed in New York City in 1973 by members Paul Stanley, Gene Simmons, Ace Frehley and Peter Criss. With greasepaint makeup and outrageous costumes, the bandmembers took on the personae of comic book-style characters, and their "shock-rock" style live performances have been known to feature fire-breathing, blood-spitting, levitating drum kits and pyrotechnics. Considered one of the most influential rock bands of all time and one of the best-selling bands of all time, Kiss has sold more than 75 million records worldwide, earned 30 gold albums, and all four original members have been inducted into the Rock and Roll Hall of Fame.

The current lineup includes Stanley, Simmons, guitarist Tommy Thayer and drummer Eric Singer.

Need a break? Play the USA TODAY Daily Crossword Puzzle.

Kiss 2023 North American End of the Road tour dates:
Oct. 19: Cincinnati, Heritage Bank Center
Oct. 20: Detroit, Little Caesars Arena
Oct. 22: Cleveland, Rocket Mortgage FieldHouse
Oct. 23: Nashville, Bridgestone Arena
Oct. 25: St. Louis, Enterprise Center
Oct. 27: Fort Worth, Texas, Dickies Arena           
Oct. 29: Austin, Moody Center
Nov. 1: Palm Springs, Calif. Acrisure Arena
Nov. 3: Los Angeles, Hollywood Bowl
Nov. 6: Seattle, Climate Pledge Arena
Nov. 8: Vancouver, Rogers Arena
Nov. 10: Edmonton, Alberta, Rogers Place
Nov. 12: Calgary, Alberta, Scotiabank Saddledome
Nov. 13: Saskatoon, Saskatchewan, SaskTel Centre
Nov. 15: Winnipeg, Manitoba, Canada Life Centre
Nov. 18: Montreal, Quebec, Centre Bell
Nov. 19: Quebec, Videotron Centre
Nov. 21: Ottawa, Ontario, Canadian Tire Centre
Nov. 22: Toronto, Ontario, Scotiabank Arena
Nov. 24: Knoxville, Tenn., Thompson-Boling Arena
Nov. 25: Indianapolis, Gainbridge Fieldhouse
Nov. 27: Rosemont, Illinois, Allstate Arena
Nov. 29: Baltimore, CFG Bank Arena
Dec. 1: New York City, Madison Square Garden
Dec. 2: New York City, Madison Square Garden""")
], return_tensors = "pt")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer_inference)
# Llama-2 has a 4096-token context window, so keep prompt plus generated tokens within it
_ = model_inference.generate(**inputs, streamer=text_streamer, max_new_tokens=1024)

# inference.sh: evaluate the uploaded model with the lm-evaluation-harness CLI (lm_eval)
lm_eval --model hf \
--model_args pretrained=chwenjun225/lora_adapters,load_in_4bit=True,parallelize=True \
--tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa \
--device cuda \
--batch_size auto