# Generation with LLMs

In [1]:
from google.colab import drive

# mount google drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# set the cache folder: HuggingFace looks for models here and downloads them if they are missing
%env HF_HOME=/content/drive/MyDrive/HMD/cache

env: HF_HOME=/content/drive/MyDrive/HMD/cache


We will use ```Llama-2-7b-chat``` and ```Llama-3-8B-Instruct```.
These models are chat/instruction fine-tuned versions of the corresponding base models.
Since the models were prompted with specific templates during fine-tuning, we use the same templates at inference time so that the models perform at their best.

In [4]:
MODELS = {
    "llama2": "meta-llama/Llama-2-7b-chat-hf",
    "llama3": "meta-llama/Meta-Llama-3-8B-Instruct",
}

TEMPLATES = {
    "llama2": "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]",
    "llama3": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
}
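As a minimal illustration of how these templates are meant to be filled, the sketch below formats the Llama 2 template with a system prompt and a user message using `str.format`. The template string is copied from the cell above so that the snippet runs on its own; `build_prompt` and the example strings are hypothetical names chosen for illustration.

```python
# Copy of the Llama 2 template from the cell above, so the snippet is self-contained.
LLAMA2_TEMPLATE = "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]"


def build_prompt(template: str, system_prompt: str, user_input: str) -> str:
    """Fill the two positional slots: first the system prompt, then the user input."""
    return template.format(system_prompt, user_input)


# Placeholder strings, used only to show where each piece ends up.
prompt = build_prompt(LLAMA2_TEMPLATE, "You are a helpful assistant.", "Hello!")
print(prompt)
```

The first `{}` receives the system prompt and the second one the user turn, so swapping the arguments would silently produce a malformed prompt.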

## Create a HuggingFace Access Token

1. Create an account on [HuggingFace](https://huggingface.co/join)
2. Log in on [HuggingFace](https://huggingface.co/login)
3. [Create a new access token](https://huggingface.co/settings/tokens)

    1. Click on "Create New Access Token"
    2. Select "Read" as Token Type
    3. Give it a name, e.g. HMD
    4. Create the token and "Copy" it; you won't be able to view it again afterwards
    5. Paste it in the cell below to use it as an environment variable

In [None]:
# do not add single or double quotes, just replace the ... with your token
%env HF_TOKEN=...
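Before moving on, it may help to check that the variable was actually set. The helper below is a hypothetical convenience, not part of any HuggingFace library: it only inspects the environment and flags the two mistakes mentioned above, namely leaving the `...` placeholder in place or pasting the token with quotes.

```python
import os


def hf_token_is_set(env=os.environ) -> bool:
    """Return True if HF_TOKEN holds something that looks like a pasted token."""
    token = env.get("HF_TOKEN", "").strip()
    # "..." is the placeholder from the cell above; a leading quote hints at a pasting mistake
    return bool(token) and token != "..." and token[0] not in "'\""


print("HF_TOKEN configured:", hf_token_is_set())
```

This check does not contact the Hub, so it cannot tell whether the token is valid or has access to the gated models.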

## Download the models (only the first time)

1. Request access for [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [LLaMA 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) by following the instructions on their HuggingFace page
2. After having been granted access, run the code below to download the models (you will require 28GB of space)

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

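def download_models(models):
    """Download the weights and tokenizer of each model into the HF_HOME cache."""
    for model_name in models.values():
        # loading the model triggers the download of its weights
        AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.float16,
        )
        # loading the tokenizer triggers the download of its files
        AutoTokenizer.from_pretrained(model_name)

download_models(MODELS)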

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
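The 28GB figure mentioned above can be sanity-checked with a back-of-the-envelope estimate: half-precision weights take 2 bytes per parameter. The parameter counts below (about 6.7 and 8.0 billion) are rounded assumptions rather than values read from the checkpoints, and tokenizer and config files are ignored.

```python
BYTES_PER_PARAM_HALF = 2  # float16 and bfloat16 both use 2 bytes per value

# Rounded parameter counts (assumption, not read from the checkpoints)
APPROX_PARAMS = {"llama2": 6.7e9, "llama3": 8.0e9}


def weights_size_gib(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_HALF) -> float:
    """Approximate size of the weights in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


total = sum(weights_size_gib(n) for n in APPROX_PARAMS.values())
for name, n in APPROX_PARAMS.items():
    print(f"{name}: ~{weights_size_gib(n):.1f} GiB")
print(f"total: ~{total:.1f} GiB")
```

The same function with `bytes_per_param=4` gives a rough idea of the memory needed to hold a model in `float32`, which is about twice the half-precision footprint.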



## Prompt the models

Import the required libraries and classes.

In [5]:
import torch

from typing import Tuple
from transformers import AutoModelForCausalLM, AutoTokenizer, BatchEncoding, PreTrainedTokenizer, PreTrainedModel

Functions for loading the models and generating responses.

In [6]:
def load_model(model_name: str, dtype: str) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
    # default to full precision; switch to bfloat16 only when explicitly requested
    torch_dtype = torch.float32
    if dtype == "bf16":
        torch_dtype = torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch_dtype,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

def generate(
    model: PreTrainedModel,
    inputs: BatchEncoding,
    tokenizer: PreTrainedTokenizer,
    max_seq_length: int,
) -> str:
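    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        # max_length counts the prompt tokens plus the newly generated ones
        max_length=max_seq_length,
        # reuse EOS as the padding token, since none is set explicitly here
        pad_token_id=tokenizer.eos_token_id,
    )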
    return tokenizer.decode(
        output[0][len(inputs.input_ids[0]) :], skip_special_tokens=True
    )
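Two details of `generate` are easy to miss: `max_length` bounds the prompt and the generated tokens together, and for decoder-only models the returned sequence starts with the prompt, which is why the prompt is sliced off before decoding. The sketch below mimics both with plain Python lists; the helper names and token ids are made up for illustration.

```python
from typing import List


def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Number of tokens left for generation once the prompt is accounted for."""
    return max(0, max_length - prompt_len)


def strip_prompt(output_ids: List[int], prompt_ids: List[int]) -> List[int]:
    """Drop the leading prompt tokens, keeping only the generated ones."""
    return output_ids[len(prompt_ids):]


prompt_ids = [1, 518, 25580, 29962]  # made-up ids standing in for the prompt
generated = [7027, 29991, 2]         # made-up ids standing in for the reply
output_ids = prompt_ids + generated  # the returned sequence begins with the prompt

print(new_token_budget(len(prompt_ids), max_length=128))
print(strip_prompt(output_ids, prompt_ids))
```

If the prompt is close to `max_seq_length`, little or nothing is left for the reply; `max_new_tokens` is the `generate` argument that bounds only the generated part.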

Parameters and input for the generation.

In [7]:
model_name = "llama2"
chat_template = TEMPLATES[model_name]
model_name = MODELS[model_name]

dtype = "bf16"
max_seq_length = 128

system_prompt = "You are a pizza ordering assistant."
input = "User: Hello, I would like a pizza. System: "

Load the model and tokenizer based on the parameters.

In [8]:
model, tokenizer = load_model(model_name, dtype)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Prepare the input and generate a response.

In [9]:
# Format and tokenize the input
input_text = chat_template.format(system_prompt, input)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate a response
response = generate(model, inputs, tokenizer, max_seq_length)
print(response)

 Great! What can I get for you today? Would you like a classic margherita, a meat-lovers, or perhaps something more adventurous like a pesto and sun-dried tomato pizza?

Please let me know and I'll be happy to take care of the rest of your order.


#### Parameters

Parameters for the underlying Python script:

```
usage: python -m query_model [-h] [--system-prompt SYSTEM_PROMPT] [--dtype {f32,bf16}] [--max_seq_length MAX_SEQ_LENGTH] [--return-full] [--dotenv-path DOTENV_PATH] {llama2,llama3} INPUT_TEXT

Query a specific model with a given input.

positional arguments:
  {llama2,llama3}       The model to query.
  INPUT_TEXT            The input to query the model with.

options:
  -h, --help            show this help message and exit
  --system-prompt SYSTEM_PROMPT
                        The system prompt to use for the model. (default: )
  --dtype {f32,bf16}    The data type to use for the model. (default: f32)
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use for the model. (default: 128)
```
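The script itself is not shown here, so the following is only a sketch of an `argparse` parser that would accept the same command line. The help strings and defaults for the documented options are taken from the help text above; the behaviour and defaults of `--return-full` and `--dotenv-path` are assumptions, since they appear in the usage line only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="python -m query_model",
        description="Query a specific model with a given input.",
        # appends "(default: ...)" to each help string, as in the help text above
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("model", choices=["llama2", "llama3"], help="The model to query.")
    parser.add_argument("input_text", metavar="INPUT_TEXT", help="The input to query the model with.")
    parser.add_argument("--system-prompt", default="", help="The system prompt to use for the model.")
    parser.add_argument("--dtype", choices=["f32", "bf16"], default="f32", help="The data type to use for the model.")
    parser.add_argument("--max_seq_length", type=int, default=128, help="The maximum sequence length to use for the model.")
    # Assumption: a boolean flag and an optional path; neither is documented above.
    parser.add_argument("--return-full", action="store_true")
    parser.add_argument("--dotenv-path", default=None)
    return parser


# Same settings as in the notebook cells, passed as a command line.
args = build_parser().parse_args([
    "llama2",
    "User: Hello, I would like a pizza. System: ",
    "--system-prompt", "You are a pizza ordering assistant.",
    "--dtype", "bf16",
])
print(args.model, args.dtype, args.max_seq_length)
```

Note that the script's default `--dtype` is `f32`, whereas the notebook cells above use `bf16`, so the flag has to be passed explicitly to reproduce the notebook's settings.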