## Post-process a finetuned LLM

Test and upload a finetuned language model

In [1]:
!pip install -q -U huggingface_hub peft transformers torch accelerate

In [2]:
from huggingface_hub import notebook_login

notebook_login()

In [3]:
!git config --global credential.helper store

## Setup

In [1]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


In [2]:
!nvidia-smi

Tue Sep 12 15:24:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    On   | 00000000:05:00.0 Off |                  Off |
| 30%   35C    P8    22W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:06:00.0 Off |                  Off |
| 30%   32C    P8    24W / 300W |      3MiB / 49140MiB |      0%      Default |
|       

In [3]:
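# free memory on the current device (GPU 0 by default), truncated to whole GiB
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
# hold back 2 GB of headroom
max_memory = f"{free_in_GB-2}GB"
n_gpus = torch.cuda.device_count()
# apply the same cap to every visible GPU
max_memory = {i: max_memory for i in range(n_gpus)}
print("max memory: ", max_memory)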

max memory:  {0: '44GB', 1: '44GB', 2: '44GB', 3: '44GB'}
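
The arithmetic in the cell above can be isolated into a small helper, which also makes the 2 GB headroom explicit. This is a minimal sketch: `build_max_memory` and its `headroom_gb` parameter are hypothetical names that do not appear in the notebook, and the free-byte count is passed in rather than read from the GPU so that the function runs on any machine. A dictionary of this shape is what the `max_memory` argument of `from_pretrained` accepts alongside `device_map="auto"`.

```python
def build_max_memory(free_bytes: int, n_gpus: int, headroom_gb: int = 2) -> dict:
    """Return a per-device memory cap such as {0: '44GB', 1: '44GB'}.

    free_bytes  -- free memory reported for one device, in bytes
    n_gpus      -- number of visible devices that share the same cap
    headroom_gb -- whole gigabytes held back on every device (assumed value)
    """
    free_in_gb = int(free_bytes / 1024**3)  # truncate to whole GiB
    cap = f"{free_in_gb - headroom_gb}GB"
    return {i: cap for i in range(n_gpus)}


# Example with a made-up reading of 46.5 GiB free on a 4-GPU machine
example = build_max_memory(int(46.5 * 1024**3), n_gpus=4)
print(example)
```

Because a single reading is copied to every device, the result is only as accurate as the assumption that all GPUs have the same amount of free memory.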


## Loss curve

The loss curves logged during training show the model converging:

![image](https://raw.githubusercontent.com/daniel-furman/sft-demos/main/assets/jul_24_23_1_14_00_log_loss_curves_llama-2-70b-dolphin.png)


## Basic testing

With a supervised finetuned (SFT) model in hand, we can test it on a few basic prompts and then upload it to the Hugging Face Hub as either a public or a private model repo, depending on the use case.

In [4]:
peft_model_id = "results/checkpoint-5000"
# peft_model_id = "dfurman/llama-2-70b-dolphin-peft"
config = PeftConfig.from_pretrained(peft_model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    use_auth_token=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA adapter weights on top of the quantized base model
model = PeftModel.from_pretrained(model, peft_model_id)



Loading checkpoint shards:   0%|          | 0/81 [00:00<?, ?it/s]



In [5]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): FalconForCausalLM(
      (transformer): FalconModel(
        (word_embeddings): Embedding(65024, 14848)
        (h): ModuleList(
          (0-79): 80 x FalconDecoderLayer(
            (self_attention): FalconAttention(
              (maybe_rotary): FalconRotaryEmbedding()
              (query_key_value): Linear4bit(
                in_features=14848, out_features=15872, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=14848, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=15872, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (dense): Linear4bit

In [6]:
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
print("max memory: ", max_memory)

max memory:  {0: '9GB', 1: '9GB', 2: '9GB', 3: '9GB'}
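
Comparing this reading with the one taken before loading gives a rough estimate of how much memory the quantized model occupies. The sketch below is illustrative only: `estimate_usage_gb` is a hypothetical helper, and because both cells query a single device and copy its value to every entry, the per-GPU figures are an approximation rather than a measurement of each card.

```python
def estimate_usage_gb(before: dict, after: dict) -> dict:
    """Subtract two max_memory dicts whose values look like '44GB'."""

    def to_gb(value: str) -> int:
        # strip the trailing "GB" unit and parse the integer part
        return int(value[: -len("GB")])

    return {dev: to_gb(before[dev]) - to_gb(after[dev]) for dev in before}


before = {0: "44GB", 1: "44GB", 2: "44GB", 3: "44GB"}  # printed before loading
after = {0: "9GB", 1: "9GB", 2: "9GB", 3: "9GB"}  # printed after loading

usage = estimate_usage_gb(before, after)
total = sum(usage.values())
print(usage, total)
```

With the values printed in this notebook, the estimate comes to about 35 GB per device, or roughly 140 GB across the four GPUs. The 2 GB headroom cancels out in the subtraction.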


In [7]:
# text generation function


def falcon_generate(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 1.0,
) -> str:
    """
    Generate a completion for a prompt and return only the newly generated text.
    Uses Hugging Face GenerationConfig defaults
        https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
    Args:
        model (transformers.AutoModelForCausalLM): Falcon model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
        temperature (float, optional): The value used to modulate the next token probabilities.
            Defaults to 1.0. Only takes effect when sampling is enabled (do_sample=True).
    Returns:
        str: The decoded output with the prompt removed.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(
        device
    )  # tokenize inputs, load on device

    # when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )  # grab output in natural language

    return decoded_output[len(prompt) :]  # remove prompt from output
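
To make the `temperature` argument concrete, the sketch below applies temperature scaling to a toy set of logits in plain Python. It illustrates the general technique (dividing the logits by the temperature before the softmax) and is not the code path used inside `model.generate`; the function name and the example logits are made up for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 0.92, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution.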

In [8]:
prompt = "You are a helpful assistant. Write me a numbered list of things to do in New York City.\n"

response = falcon_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250,
    temperature=0.92,
)

print(response)

The current implementation of Falcon calls `torch.scaled_dot_product_attention` directly, this will be deprecated in the future in favor of the `BetterTransformer` API. Please install the latest optimum library with `pip install -U optimum` and call `model.to_bettertransformer()` to benefit from `torch.scaled_dot_product_attention` and future performance optimizations.


1. Visit the Empire State Building: Take in the breathtaking views of the city from the observation deck on the 86th floor.

2. Stroll through Central Park: Enjoy the lush greenery, lakes, and various attractions in this iconic urban park.

3. Explore the Metropolitan Museum of Art: Admire the vast collection of art from around the world, spanning thousands of years.

4. See a Broadway show: Experience the magic of live theater in one of the world's most famous theater districts.

5. Walk across the Brooklyn Bridge: Take in the stunning views of the Manhattan skyline and the East River as you cross this iconic bridge.

6. Visit the Statue of Liberty: Take a ferry to Liberty Island and climb to the top of the statue for a unique perspective of the city.

7. Visit the 9/11 Memorial and Museum: Pay tribute to the victims of the September 11th attacks and learn about the events that changed the world.

8. Visit Times Square: Experience the bright lights and bustling energy of this iconic i

In [9]:
prompt = "You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Respond only with a yes or no answer in as few words as possible.\n"

response = falcon_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=100,
    temperature=0.92,
)

print(response)


No.

The barber does not work on Sundays, so Daniel cannot get a haircut on that day. He should have gone on one of the days the barber works (Mondays, Wednesdays, or Fridays). Therefore, it does not make logical sense for Daniel to go in for a haircut on Sunday. The answer is "No."

In summary, the answer is "No" because the barber does not work on Sundays, so Daniel cannot get a haircut on


In [10]:
prompt = "You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n"

response = falcon_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=200,
    temperature=0.92,
)

print(response)


Subject: Dinner Party Invitation - Friday, 7pm

Dear friends,

I would like to invite you to a dinner party at my place this Friday at 7pm. It would be a great opportunity to catch up and enjoy some delicious food together.

Please let me know if you can make it by Wednesday. I look forward to seeing you all!

Best,
[Your Name]

P.S. Please let me know if you have any dietary restrictions.

Subject: Dinner Party Invitation - Friday, 7pm

Dear friends,

I'm hosting a dinner party this Friday at 7pm. Please join me for a fun evening of food and conversation.

Let me know if you can make it by Wednesday.

Best,
[Your Name]

P.S. Please let me know if you have any dietary restrictions.

Subject: Dinner Party Invitation


In [11]:
prompt = "You are a helpful assistant. Tell me a recipe for vegan banana bread.\n"

response = falcon_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=500,
    temperature=0.92,
)

print(response)


Ingredients:
- 3 ripe bananas
- 1/3 cup melted coconut oil or vegan butter
- 1/4 cup non-dairy milk (almond, soy, or oat milk)
- 1 teaspoon vanilla extract
- 1/2 cup brown sugar
- 1 1/2 cups all-purpose flour
- 1 teaspoon baking powder
- 1/2 teaspoon baking soda
- 1/2 teaspoon salt
- 1/2 teaspoon ground cinnamon (optional)
- 1/2 cup chopped walnuts or chocolate chips (optional)

Instructions:

1. Preheat your oven to 350°F (175°C). Grease a 9x5-inch loaf pan with vegan butter or coconut oil.

2. In a large mixing bowl, mash the ripe bananas with a fork until they are smooth.

3. Add the melted coconut oil or vegan butter, non-dairy milk, vanilla extract, and brown sugar to the mashed bananas. Mix well until combined.

4. In a separate bowl, whisk together the all-purpose flour, baking powder, baking soda, salt, and ground cinnamon (if using).

5. Gradually add the dry ingredients to the wet ingredients, mixing until just combined. Do not overmix.

6. If you're using walnuts or chocola

## Inference runtime test

In [15]:
import tqdm
import time

prompt = "You are a helpful assistant. Write me a long list of things to do in San Francisco:\n"

runtimes = []
for i in tqdm.tqdm(range(25)):
    start = time.time()
    response = falcon_generate(
        model,
        tokenizer,
        prompt,
        max_new_tokens=50,
        temperature=0.92,
    )
    end = time.time()
    runtimes.append(end - start)

100%|██████████| 25/25 [11:18<00:00, 27.16s/it]


In [16]:
avg_runtime = torch.mean(torch.tensor(runtimes)).item()
print(f"Runtime avg in seconds: {avg_runtime}")  # time in seconds

Runtime avg in seconds: 27.155868530273438
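
The average latency can be turned into an approximate throughput figure. This sketch assumes that every run produced the full 50 new tokens (generation can stop earlier at an end-of-sequence token, which would make the true rate lower). `tokens_per_second` is a hypothetical helper, and the sample latencies are placeholders rather than the measured runtimes.

```python
import statistics


def tokens_per_second(runtimes_s, new_tokens_per_run):
    """Average throughput, assuming each run emits new_tokens_per_run tokens."""
    return new_tokens_per_run / statistics.mean(runtimes_s)


sample_runtimes = [27.0, 27.5, 26.9]  # placeholder latencies in seconds
print(f"median latency: {statistics.median(sample_runtimes):.2f} s")
print(f"throughput: {tokens_per_second(sample_runtimes, 50):.2f} tokens/s")

# the average reported above, treated as a single measurement
reported = tokens_per_second([27.155868530273438], 50)
print(f"reported run: {reported:.2f} tokens/s")
```

Plugging in the reported average of about 27.16 seconds for 50 tokens gives roughly 1.84 tokens per second under that assumption.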


## Upload model to Hugging Face

Before running the cells below, create a model repo on your Hugging Face account. The repo can be either public or private; the code below works with both.
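
The repo can also be created from code rather than through the web interface. The sketch below wraps `create_repo` from `huggingface_hub`; the wrapper name `ensure_model_repo` and the placeholder repo id are assumptions for illustration. The call needs a valid login token and network access, so the usage line is left commented out.

```python
def ensure_model_repo(repo_id: str, private: bool = True) -> str:
    """Create the model repo if it does not exist yet and return its URL."""
    # imported lazily so that defining the helper does not require the package
    from huggingface_hub import create_repo

    url = create_repo(repo_id, private=private, repo_type="model", exist_ok=True)
    return str(url)


# Hypothetical usage; replace the placeholder with your own namespace and name.
# ensure_model_repo("your-username/your-model-name", private=True)
```

Passing `exist_ok=True` makes the call safe to repeat when the repo has already been created.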

In [12]:
# push the tokenizer and adapter weights to the Hub
model_id_load = "dfurman/falcon-180b-instruct-peft"

# tokenizer
tokenizer.push_to_hub(model_id_load, use_auth_token=True)
# adapter weights in safetensors format (adapter_model.safetensors)
model.push_to_hub(model_id_load, use_auth_token=True, safe_serialization=True)
# adapter weights as a PyTorch checkpoint (adapter_model.bin)
model.push_to_hub(model_id_load, use_auth_token=True)



adapter_model.safetensors:   0%|          | 0.00/4.28G [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/4.28G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/dfurman/falcon-180b-instruct-peft/commit/d7f7f6db7409e5cf05d7f9fa088a04303b5f8965', commit_message='Upload model', commit_description='', oid='d7f7f6db7409e5cf05d7f9fa088a04303b5f8965', pr_url=None, pr_revision=None, pr_num=None)