## Post-process a finetuned LLM

Test and upload a finetuned language model

In [1]:
!pip install -q -U huggingface_hub peft transformers torch accelerate


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
!git config --global credential.helper store

## Setup

In [4]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer



In [5]:
!nvidia-smi

Mon Jul 24 20:26:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    On   | 00000000:05:00.0 Off |                  Off |
| 35%   56C    P8    20W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [6]:
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
print("max memory: ", max_memory)

max memory:  {0: '44GB'}
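
The cell above leaves 2 GB of headroom below the free memory reported for the current device and applies that same budget to every visible GPU. The sketch below restates that arithmetic as a pure function so it can be checked without a GPU; `build_max_memory` and its `headroom_gb` parameter are names introduced here for illustration, not part of the original notebook.

```python
def build_max_memory(free_bytes: int, n_gpus: int, headroom_gb: int = 2) -> dict:
    """Map each GPU index to a memory budget string such as '44GB'."""
    free_gb = int(free_bytes / 1024**3)  # whole GiB currently free
    budget = f"{max(free_gb - headroom_gb, 0)}GB"  # keep some headroom, never go negative
    return {i: budget for i in range(n_gpus)}


# Hypothetical reading: about 46.5 GiB free on a single GPU.
print(build_max_memory(int(46.5 * 1024**3), n_gpus=1))
```

A dict of this shape is what `from_pretrained` accepts as `max_memory` alongside `device_map="auto"`; the notebook computes it but does not pass it in.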


## Loss curve

During training, the loss converged smoothly, as the curves below show:

![image](https://raw.githubusercontent.com/daniel-furman/sft-demos/main/assets/jul_24_23_1_13_00_log_loss_curves_llama-2-13b-dolphin.png)


## Basic testing

With a supervised fine-tuned (SFT) model in hand, we can test it on a few basic prompts and then upload it to the Hugging Face Hub as either a public or a private model repo, depending on the use case.

In [7]:
peft_model_id = "results/checkpoint-50000"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 5120, padding_idx=0)
        (layers): ModuleList(
          (0-39): 40 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear(
                in_features=5120, out_features=5120, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(
                in_features=5120, out_features=5120, bias=False
       

In [9]:
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
print("max memory: ", max_memory)

max memory:  {0: '19GB'}
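
Free memory dropped from roughly 46 GiB to roughly 21 GiB once the model was loaded (the printed budgets subtract 2 GB), so the loaded model occupies about 25 GiB. A back-of-the-envelope check, assuming about 13 billion parameters (taken from the "13b" in the model name, not measured) stored in bfloat16 at 2 bytes each:

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory taken by model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumption: ~13e9 parameters; bfloat16 uses 2 bytes per parameter.
estimate = weight_memory_gib(13e9, 2)
print(f"Estimated bf16 weights: {estimate:.1f} GiB")
```

The estimate lands close to the observed drop; the remainder is plausibly the LoRA adapter, the CUDA context, and allocator overhead.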


In [10]:
# text generation function


def llama_generate(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 1.0,
) -> str:
    """
    Generate a completion for a prompt and return only the newly generated text
    Uses Hugging Face GenerationConfig defaults
        https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
    Args:
        model (transformers.AutoModelForCausalLM): Llama model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
        temperature (float, optional): The value used to modulate the next token probabilities.
            Defaults to 1.0
    Returns:
        str: The decoded output with the prompt removed
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(
        device
    )  # tokenize inputs, load on device

    # when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )  # grab output in natural language

    return decoded_output[len(prompt) :]  # remove prompt from output
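
Slicing the decoded string by `len(prompt)` relies on the tokenizer round-tripping the prompt character for character. An alternative, sketched here rather than taken from the notebook, is to drop the prompt tokens before decoding. `new_token_ids` is a hypothetical helper; it works on any sliceable sequence, so plain lists stand in for the tensors returned by `model.generate`.

```python
from typing import Sequence


def new_token_ids(sequence: Sequence[int], prompt_length: int) -> Sequence[int]:
    """Return only the ids generated after the prompt."""
    if prompt_length > len(sequence):
        raise ValueError("prompt is longer than the generated sequence")
    return sequence[prompt_length:]


# Toy ids: the first three belong to the prompt, the rest were generated.
full_sequence = [1, 887, 526, 263, 8444, 2]
print(new_token_ids(full_sequence, prompt_length=3))
```

Inside `llama_generate`, this would correspond to slicing `response["sequences"][0]` at `inputs["input_ids"].shape[1]` before calling `tokenizer.decode`.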

In [11]:
prompt = "You are a helpful assistant. Write me a numbered list of things to do in New York City.\n"

response = llama_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250,
    temperature=0.92,
)

print(response)

1. Visit the iconic Statue of Liberty and Ellis Island.
2. Take a stroll through Central Park and enjoy its many attractions.
3. Explore the world-renowned museums, such as the Metropolitan Museum of Art and the Museum of Modern Art.
4. Experience the vibrant energy of Times Square and take in the bright lights and billboards.
5. Visit the 9/11 Memorial and Museum to pay tribute to those who lost their lives in the attacks.
6. Enjoy a Broadway show or a concert at one of the many theaters and venues in the city.
7. Take a ride on the Staten Island Ferry for a free view of the Statue of Liberty and the New York City skyline.
8. Shop at the famous Fifth Avenue stores and explore the high-end boutiques.
9. Indulge in a variety of cuisines at one of the many restaurants in the city.
10. Visit the Empire State Building and enjoy the panoramic views of the city from the observation deck.
11. Attend a sport


In [22]:
prompt = "You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's think this through in a step by step fashion to ensure we have the right answer.\n"

response = llama_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250,
    temperature=0.92,
)

print(response)


Step 1: Understand the situation
Daniel needs a haircut and his barber is available on Mondays, Wednesdays, and Fridays.

Step 2: Identify the days of the week
Sunday is the day Daniel went for a haircut.

Step 3: Check if the barber is available on Sundays
The barber is not available on Sundays.

Step 4: Analyze the situation
Since the barber is not available on Sundays, it does not make logical sense for Daniel to go for a haircut on that day.

Conclusion: No, this does not make logical sense. Daniel should have gone for a haircut on a day when his barber is available, such as Monday, Wednesday, or Friday.

In summary, it does not make logical sense for Daniel to go for a haircut on Sunday because his barber is not available on that day. He should have gone for a haircut on a day when his barber is available, like Monday, Wednesday, or Friday. This would ensure that he can get his haircut when the


In [23]:
prompt = "You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n"

response = llama_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=150,
    temperature=0.92,
)

print(response)

Subject: Friday Dinner Party Invitation

Dear Friends,

I hope this email finds you well. I'm excited to invite you all to a dinner party on Friday, March 10th, at 7:00 PM. The address is 123 Main Street, Anytown, USA.

Please RSVP by Wednesday, March 8th, so I can plan accordingly. I look forward to seeing you all and sharing a delicious meal together!

Best,
Your Friendly Assistant

P.S. If you have any dietary restrictions or allergies, please let me know in your RSVP. Thank you!


In [26]:
prompt = "You are a helpful assistant. What is a recipe for vegan banana bread?\n"

response = llama_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=550,
    temperature=0.92,
)

print(response)


What is a recipe for vegan banana bread?

Choose your answer. Are these two questions paraphrases of each other?
Choose your answer from:
(a). no
(b). yes
(b). yes

Here is a recipe for vegan banana bread:

1. Preheat the oven to 350°F (175°C). Grease a 9x5-inch loaf pan with non-stick cooking spray or line it with parchment paper.

2. In a large bowl, mash the bananas with a fork until smooth. Add the sugar, oil, and vanilla extract, and mix well.

3. In a separate bowl, whisk together the flour, baking powder, baking soda, and salt.

4. Add the dry ingredients to the wet ingredients and mix until just combined. Fold in the chocolate chips or walnuts, if using.

5. Pour the batter into the prepared loaf pan and spread it evenly. Bake for 50-60 minutes, or until a toothpick inserted into the center comes out clean.

6. Allow the bread to cool in the pan for 10 minutes before removing it to a wire rack to cool completely.

7. Enjoy your vegan banana bread!

This recipe is suitable for 

## Inference runtime test

In [15]:
import tqdm
import time

prompt = "You are a helpful assistant. Write me a long list of things to do in San Francisco:\n"

runtimes = []
for i in tqdm.tqdm(range(25)):
    start = time.time()
    response = llama_generate(
        model,
        tokenizer,
        prompt,
        max_new_tokens=50,
        temperature=0.92,
    )
    end = time.time()
    runtimes.append(end - start)
    assert len(tokenizer.encode(response)) == 52

100%|██████████| 25/25 [01:15<00:00,  3.04s/it]


In [16]:
avg_runtime = torch.mean(torch.tensor(runtimes)).item()
print(f"Runtime avg in seconds: {avg_runtime}")  # time in seconds

Runtime avg in seconds: 3.0358777046203613
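
The average latency converts directly into throughput. A small sketch using only the standard library; `summarize_runtimes` is a name introduced here, and the sample list simply repeats the average reported above rather than the 25 individual measurements, which the notebook does not show:

```python
from statistics import mean


def summarize_runtimes(runtimes: list, new_tokens: int) -> dict:
    """Return mean latency in seconds and tokens generated per second."""
    avg = mean(runtimes)
    return {"avg_seconds": avg, "tokens_per_second": new_tokens / avg}


# Placeholder data: the reported average repeated, not the real measurements.
stats = summarize_runtimes([3.0358777] * 25, new_tokens=50)
print(f"{stats['tokens_per_second']:.1f} tokens/s")
```

That works out to roughly 16 tokens per second for this setup (one RTX A6000, bfloat16, 50 new tokens per call).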


## Upload model to Hugging Face
Before running the cells below, create a model repo on your Hugging Face account. Either a private or a public repo works with the code below.
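
The repo can also be created from code. The sketch below wraps `huggingface_hub.create_repo`; the wrapper name `ensure_model_repo` and the repo id in the usage note are placeholders, and the call assumes the login from the earlier cell succeeded.

```python
def ensure_model_repo(repo_id: str, private: bool = True) -> str:
    """Create the model repo if it does not exist yet and return its URL."""
    # This sketch insists on the explicit 'user/name' form for clarity.
    if repo_id.count("/") != 1 or not all(repo_id.split("/")):
        raise ValueError("expected a repo id of the form 'user/name'")
    # Imported lazily so the validation above works without the package.
    from huggingface_hub import create_repo

    url = create_repo(repo_id, repo_type="model", private=private, exist_ok=True)
    return str(url)
```

Calling `ensure_model_repo("your-username/your-model-name")` performs a network request, so it is left out of the block; `exist_ok=True` makes the call safe to repeat.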

In [17]:
# push to hub
model_id_load = "dfurman/llama-2-13b-dolphin-peft"

# tokenizer
tokenizer.push_to_hub(model_id_load, use_auth_token=True)
# adapter weights in safetensors format
model.push_to_hub(model_id_load, use_auth_token=True, safe_serialization=True)
# adapter weights as a PyTorch .bin checkpoint
model.push_to_hub(model_id_load, use_auth_token=True)

adapter_model.safetensors:   0%|          | 0.00/419M [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/420M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/dfurman/llama-2-13b-dolphin-peft/commit/176a21c2dc203b559ea337d918ad83722bd4276b', commit_message='Upload model', commit_description='', oid='176a21c2dc203b559ea337d918ad83722bd4276b', pr_url=None, pr_revision=None, pr_num=None)
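
The size of the uploaded adapter can be sanity-checked against the LoRA shapes in the model printout, where `lora_A` maps 5120 to 64 features and `lora_B` maps 64 back to 5120. The count below assumes four adapted projections per layer and 32-bit storage; neither is confirmed by the truncated printout, so treat the result as a plausibility check only.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (in x rank), B is (rank x out)."""
    return rank * (in_features + out_features)


per_projection = lora_param_count(5120, 5120, 64)  # shapes from the printout
n_layers = 40  # from "(0-39): 40 x LlamaDecoderLayer"
projections_per_layer = 4  # assumption, not shown in the printout
bytes_per_param = 4  # assumption: float32 adapter weights

total_params = per_projection * n_layers * projections_per_layer
size_mb = total_params * bytes_per_param / 1e6
print(per_projection, total_params, round(size_mb, 1))
```

Under those assumptions the adapter comes to about 419 MB, in line with the `adapter_model.safetensors` size shown in the upload progress.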