In [1]:
!pip install accelerate "bitsandbytes>=0.39.0"

# HuggingFace LLM tutorial

In this tutorial we'll see how to interact with HuggingFace LLMs. We will focus on using LLMs out-of-the-box rather than fine-tuning them, as fine-tuning is a very costly operation.

## Model selection

Since this is a HuggingFace tutorial, we are going to select one of the LLMs hosted by HuggingFace. We can see these from the [web](https://huggingface.co/models).

For these types of models, the easiest way is to filter by task, selecting **Text Generation**. We will then be shown all the available LLMs.

Selecting a model opens its *Model Card*, which contains information about that specific model.

#### Which model to select?

Unless you want something specific or have researched the differences between the models, I would recommend working with the **latest** models, which tend to have the strongest capabilities.

#### Should I be aware of the model type (encoder-only, decoder-only, encoder-decoder)?

From a usage standpoint, **no**. The model API hides any differences in how the input is passed and the output is generated, so their usage should be identical.

From a more practical standpoint, it is true that different model types might inherently be better suited for different tasks. However, the text generation models that we will focus on are predominantly **decoder-only**. Again, I would suggest sticking to these unless you've researched the differences and know what you're aiming for.

#### What model size should I choose?

There is obviously a **tradeoff between model capabilities and latency/memory**. Larger models generally have stronger natural language capabilities, but require more memory and are slower at inference. Keep in mind that this only holds among **different versions of the same model** (e.g. falcon-7B, falcon-40B, falcon-180B). It is not necessarily true across models (e.g. falcon vs llama).

Choose according to your **hardware** and project's **requirements**.
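To make the memory side of this tradeoff concrete, here is a rough back-of-the-envelope sketch. It only counts the weights (activations and the generation cache need additional memory), and both the helper name `weight_memory_gb` and the round figure of 7 billion parameters are illustrative assumptions rather than values taken from a model card.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Hypothetical 7-billion-parameter model stored at different precisions
for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits:>2}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Comparing the result against the memory of your GPU is a quick first check of whether a given model size is realistic for your hardware.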

#### Base vs instruct models

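Most LLMs come in two versions: **base** and **instruct**.

- **base** is trained only to continue text; it is good for most tasks, as long as the prompt is phrased so that the answer is its natural continuation
- **instruct** is additionally fine-tuned to follow instructions, so it is better suited for conversational purposes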

## Loading an LLM using the HuggingFace API

Let's say we chose [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as our LLM.

A model essentially consists of 2 parts: the actual model (i.e. the transformer) and its tokenizer. The latter is necessary to process the raw inputs in the way the model wants to see them.

### Loading the transformer

We will use 2 options while loading this model:

- `device_map='auto'`: lets the library place the model's weights on the available devices automatically, filling the **GPU** first
- `load_in_4bit=True`: is a **quantization** technique that reduces the precision of the float values in the model's tensors, mainly to reduce memory usage.
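To give some intuition for what quantization does, here is a minimal sketch that stores a few floats as integers in a 4-bit range and then recovers approximate values. This is a deliberately simple scheme with one shared scale, chosen for illustration; it is not the scheme that `bitsandbytes` actually uses, and the function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integers in [-7, 7] (which fit in 4 bits) using one shared scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 if all values are 0
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.91, -0.07, 0.33]  # made-up weight values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print(codes)
for w, r in zip(weights, restored):
    print(f"{w:+.2f} -> {r:+.2f}")
```

The restored values are close to the originals but not identical: that small loss of precision is the price paid for the memory savings.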

In [2]:
from transformers import AutoModelForCausalLM

model_name = 'mistralai/Mistral-7B-v0.1'

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map='auto',
                                             load_in_4bit=True)

### Load the tokenizer

Besides the model, we also need to load the tokenizer for processing the inputs and decoding the model's outputs.

Each model might expect its inputs in a different format, so **each model has its own tokenizer** (with the same name as the model). We need to be sure we are loading the correct one!

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')

## Model Usage

### Step 1. Tokenization

To make the model generate text we need to supply it with an **input prompt**.

Remember, LMs essentially predict the **most likely continuation** of the prompt we provide.

The first step is to process the prompt with the tokenizer. Let's see what this looks like.

In [4]:
input_prompt = ['A list of colors: red, blue']

tokenizer.tokenize(input_prompt[0])

['▁A', '▁list', '▁of', '▁colors', ':', '▁red', ',', '▁blue']
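In the output above, the `▁` prefix marks a piece that starts a new word, i.e. one that was preceded by a space in the original text. A minimal sketch of how such pieces map back to text is shown below; `pieces_to_text` is a hypothetical helper written for illustration, and the tokenizer's own decoding handles more cases than this.

```python
def pieces_to_text(pieces):
    """Join subword pieces and turn the '▁' word-boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip()

pieces = ['▁A', '▁list', '▁of', '▁colors', ':', '▁red', ',', '▁blue']
print(pieces_to_text(pieces))
```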

#### Aside: Subword tokenization

Most LLMs do a type of **subword encoding**. This means that some words aren't represented as a whole but as their components. We won't go into much detail, but there are some [algorithms](https://en.wikipedia.org/wiki/Byte_pair_encoding) for defining a **sub-word vocabulary**.

Why is this necessary though? Notice that in the text above I wrote *subword* in 3 different ways. Add typos into the mix (e.g. *subwrod* or *subbword*) and, with a word-level vocabulary, most of these would be represented by the `[UNK]` token.

How does subword tokenization handle this problem? Let's see!

In [5]:
words = ['subword', 'sub-word', 'sub word', 'Sub word', 'Sub word', 'subbword', 'subwrod']

for word in words:
  print(f'{word:<8} --> {str(tokenizer.tokenize(word)):<21} --> {tokenizer(word)["input_ids"]}')

subword  --> ['▁sub', 'word']      --> [1, 1083, 2186]
sub-word --> ['▁sub', '-', 'word'] --> [1, 1083, 28733, 2186]
sub word --> ['▁sub', '▁word']     --> [1, 1083, 1707]
Sub word --> ['▁Sub', '▁word']     --> [1, 5078, 1707]
Sub word --> ['▁Sub', '▁word']     --> [1, 5078, 1707]
subbword --> ['▁sub', 'b', 'word'] --> [1, 1083, 28726, 2186]
subwrod  --> ['▁sub', 'w', 'rod']  --> [1, 1083, 28727, 9413]


Obviously we won't get perfect results (especially in the typo cases), but at least we have some information rather than none.
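For the curious, the vocabulary-building step mentioned in this aside can be sketched in a few lines. This is a toy version of the byte pair encoding idea (repeatedly fuse the most frequent pair of adjacent symbols); real tokenizers differ in many details, and the function names and the tiny word list are assumptions made for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # every word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

toy_words = ["word", "word", "sword", "subword", "sub", "sub"]  # made-up corpus
merges = learn_bpe_merges(toy_words, num_merges=6)
for word in ["subword", "subwrod"]:
    print(word, "-->", tokenize(word, merges))
```

Even a word that never appeared in the toy corpus, such as the typo, is still split into known pieces instead of being mapped to an unknown token.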

Let's actually tokenize our original input now.

In [6]:
model_inputs = tokenizer(input_prompt, return_tensors='pt').to('cuda')

### Step 2. Model inference

We now need to use the model to generate the text. As we know, this is an **autoregressive** process: we generate one token at a time and feed it back to the model as input so that it can generate the next one.

For how long do we need to do this process? Either until an **End-of-Sequence (`[EOS]`) token** is generated, or until a **max length** is reached.

We can use the HuggingFace API to allow it to do this process under the hood with the `model.generate()` method.

While this whole process is abstracted away, it is still important to understand what is happening under the hood in case we want to debug or tamper with it to improve performance (e.g. by changing the model's decoding strategy).
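The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `NEXT_TOKEN` is a hand-written lookup table rather than a neural network, and it only looks at the last token, whereas a real LLM conditions on the whole sequence.

```python
EOS = "<eos>"

# Hypothetical stand-in for a language model: maps the last token to its most likely successor
NEXT_TOKEN = {"red": ",", ",": "blue", "blue": ".", ".": EOS}

def toy_generate(prompt_tokens, max_length=20):
    """Append one token at a time until EOS is produced or max_length is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1], EOS)
        if next_token == EOS:
            break  # stopping condition 1: End-of-Sequence
        tokens.append(next_token)  # feed the new token back in as input
    return tokens  # stopping condition 2: the while condition (max length)

print(toy_generate(["red"]))                # stops because EOS is produced
print(toy_generate(["red"], max_length=3))  # stops because of the length limit
```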

In [7]:
generated_ids = model.generate(**model_inputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### Step 3. Decode the generated tokens

If we look at the model's output, we'll see that it is simply a tensor of token IDs.

In [8]:
generated_ids

tensor([[    1,   330,  1274,   302,  9304, 28747,  2760, 28725,  5045, 28725,
          5344, 28725,  9684, 28725, 14545, 28725, 19435, 28725, 12937, 28725]],
       device='cuda:0')

To convert these to words, we'll again use the tokenizer to decode them.

In [9]:
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

'A list of colors: red, blue, green, yellow, orange, purple, pink,'

### Batching

If we want to ask our LLM to complete multiple prompts we can improve performance by processing them together in a **batch**.

In [10]:
tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default

input_prompts = ['Souvlaki is a street food typically made from',
                 'The Parthenon is an ancient temple located in']

model_inputs = tokenizer(input_prompts, return_tensors='pt', padding=True).to('cuda')

generated_ids = model.generate(**model_inputs)

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['Souvlaki is a street food typically made from pork, lamb, or chicken.',
 'The Parthenon is an ancient temple located in Athens, Greece. It is one of']
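Prompts in a batch rarely have the same number of tokens, which is why `padding=True` is needed and why the tokenizer was loaded with `padding_side='left'`: a decoder-only model continues from the last position, so the padding has to go at the start. Here is a minimal sketch of what left padding produces, using made-up token IDs and a hypothetical helper name.

```python
def left_pad(batch, pad_id):
    """Pad ID sequences on the left to a common length and build the matching attention mask."""
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + seq for seq in batch]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask

# Two made-up prompts of different lengths; 2 is used as the pad ID
ids, mask = left_pad([[1, 10, 11, 12], [1, 20]], pad_id=2)
print(ids)   # every row now has the same length
print(mask)  # 0 marks padding the model should ignore
```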

## Advanced usage

Notice how some of the examples we saw seem to have been cut short? This has to do with the default maximum generation length, not with a limit of the model itself. To change it, we need to look under the hood.

### GenerationConfig

This is a config that is loaded automatically along with your model. It controls the model's behavior concerning generation. You can look at the default values [here](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig).

If we print this we will see only the values that are different from the default.

In [13]:
model.generation_config

GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

If we do a bit of digging we'll see that the default value of `max_length` is 20 tokens, a count that includes the tokens of the prompt.

How can we change this? The easiest way is to simply add this as an argument to `model.generate()`. Let's set `max_length=100`.

*Note: this might take a bit to run*

In [14]:
generated_ids = model.generate(**model_inputs, max_length=100)

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['Souvlaki is a street food typically made from pork, lamb, or chicken. It is a popular dish in Greece, Cyprus, and other Mediterranean countries. The meat is marinated in a mixture of herbs and spices, then grilled or roasted. Souvlaki is often served with pita bread, tzatziki, and other Greek-style toppings.\n\n## What is Souvlaki?\n\nSouvlaki is',
 'The Parthenon is an ancient temple located in Athens, Greece. It is one of the most famous and recognizable buildings in the world, and it is a symbol of ancient Greek civilization. The Parthenon was built in the 5th century BC as a temple to the goddess Athena, the patron goddess of Athens. It was designed by the architect Ictinus and built by the sculptor Phidias. The Parthenon is a masterpiece']
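One detail worth keeping in mind is that `max_length` counts the prompt tokens plus the generated ones, so a long prompt leaves less room for the answer. The arithmetic is sketched below with a hypothetical helper; to set the number of generated tokens directly, `model.generate()` also accepts `max_new_tokens`.

```python
def new_token_budget(prompt_length, max_length):
    """Number of tokens generation may add when max_length counts prompt + output."""
    return max(0, max_length - prompt_length)

# First example of this tutorial: 1 BOS token + 8 prompt pieces, default max_length of 20
print(new_token_budget(prompt_length=9, max_length=20))
```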

### Decoding Strategies

We won't go into much detail as there is a nice blog [here](https://huggingface.co/blog/how-to-generate) explaining the differences, but there are different strategies for selecting the proper token, given the model's outputs (spoiler: it is not always the token with the highest probability).

This behavior is, again, controlled by the `GenerationConfig` and can be changed by passing the proper arguments to `model.generate()`. You should at least have an understanding of how these work (i.e. read the blog) before tampering with them.

In [15]:
generated_ids = model.generate(**model_inputs, max_length=50, num_beams=4, do_sample=True)

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['Souvlaki is a street food typically made from pork, lamb, or chicken. The meat is marinated, skewered, and grilled over an open flame. It’s served in a pita with tomatoes,',
 'The Parthenon is an ancient temple located in the Acropolis of Athens, Greece, dedicated to the goddess Athena, whom the people of Athens considered their patron. Construction began in 447 BC when the']

Notice how we got different answers from the **same** exact model.
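To see why the decoding strategy changes the output, here is a toy comparison of greedy selection against sampling for a single step. The scores are made up and the function names are hypothetical; the sketch also leaves out beam search, which the cell above combines with sampling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Draw an index at random, proportionally to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["green", "yellow", "orange", "purple"]
logits = [2.0, 1.5, 0.5, 0.1]  # hypothetical scores for the next token

print("greedy :", vocab[greedy(logits)])
rng = random.Random(0)  # fixed seed so the draws are reproducible
print("sampled:", [vocab[sample(logits, rng=rng)] for _ in range(5)])
```

Greedy selection returns the same token every time, while sampling can return a different token on each call, which is one reason the same model can give different answers.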

### Prompt engineering

Consider the following example.

In [32]:
def generate(inputs, **generation_config_kwargs):
  """
  Helper function to reduce code repetition:
  tokenizes `inputs`, generates, and decodes the results
  """
  # Tokenize the prompts passed as an argument (not the global `input_prompts`)
  model_inputs = tokenizer(inputs, return_tensors='pt', padding=True).to('cuda')
  generated_ids = model.generate(**model_inputs, pad_token_id=tokenizer.eos_token_id, **generation_config_kwargs)
  return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

In [33]:
input_prompts = ['I have a pet cat. Translate the previous sentence to Spanish.']

print(generate(input_prompts, max_length=50)[0])

I have a pet cat. Translate the previous sentence to Spanish.

I have a pet cat.

I have a pet cat.

I have a pet cat.

I have a pet cat.

I


Clearly this isn't the answer we expected. This is partly because this model **isn't fine-tuned for following instructions**.

However, let's try one more thing. Since we know that this model tries to provide the most likely continuation of the input prompt, let's **rephrase the prompt** so that the answer appears naturally when the prompt is continued.

In [35]:
input_prompts = ['The phrase "I have a pet cat" in Spanish is']

print(generate(input_prompts, max_length=50)[0])

The phrase "I have a pet cat" in Spanish is "Tengo un gato como mascota."

## How to Say "I Have a Pet Cat" in Spanish

The phrase "I have a pet cat


Notice that the model was able to provide the correct answer!

Why wasn't it able to before, though? It seems that LLMs are much more capable than they first appear. In this sense we can think of **prompt engineering as a way of better accessing the LM's full capabilities**.

So what other techniques do we know of?

#### In-context learning

Suppose we want to make the LLM swap the case of each character in a word (lowercase to uppercase and vice versa), e.g. `"Djibril"` --> `"dJIBRIL"`.

How would we prompt the LLM to do this?

In [46]:
input_prompts = ['If I swap the lowercase characters of the word "Djibril" to upper and vice-versa Ill get']

print(generate(input_prompts, max_length=50)[0])

If I swap the lowercase characters of the word "Djibril" to upper and vice-versa Ill get "djibril".

If I swap the lowercase characters of the word "Djibril"


It turns out that LLMs have some emergent **few-shot learning** capabilities. This means that they can pick up on patterns from the input. With this we can teach them how to do some things **by example**.

In [47]:
input_prompts = ['Thanos --> tHANOS\nCanelo --> cANELO\nDjibril -->']

# Harder example, can you make this work?
# input_prompts = ['ThANos --> tHanOS\nGiORgOS --> gIorGos\nmaRINa --> MArinA\nCanElO --> cANeLo\nDJiBril -->']

print(generate(input_prompts, max_length=50)[0])

Thanos --> tHANOS
Canelo --> cANELO
Djibril --> dJIBRIL

I'm not sure if this is a bug or not, but I'm pretty sure it is
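Writing few-shot prompts by hand gets tedious, so it can help to build them from a list of solved examples. Below is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper, and Python's built-in `str.swapcase()` is used to produce the example answers.

```python
def build_few_shot_prompt(examples, query, arrow=" --> "):
    """Format solved (input, output) pairs followed by the unsolved query."""
    lines = [f"{source}{arrow}{target}" for source, target in examples]
    lines.append(f"{query}{arrow}".rstrip())  # leave the answer for the model to fill in
    return "\n".join(lines)

names = ["Thanos", "Canelo"]
examples = [(name, name.swapcase()) for name in names]
prompt = build_few_shot_prompt(examples, "Djibril")
print(prompt)
```

The resulting string is the same prompt used in the cell above, and `"Djibril".swapcase()` gives the reference answer to compare the model's output against.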


#### Chain of thought

Consider the following reasoning task.

In [48]:
input_prompts = ['John has 5 peaches and 2 oranges, Bill gave him 3 peaches, while Sarah gave him 1 orange. John now has']

print(generate(input_prompts, max_length=50)[0])

John has 5 peaches and 2 oranges, Bill gave him 3 peaches, while Sarah gave him 1 orange. John now has 10 pieces of fruit. How many pieces of fruit did Bill and Sarah give him


The LLM neither answered specifically enough nor gave the right total: John now has 8 peaches and 3 oranges, i.e. 11 pieces of fruit, not 10.

Chain of thought is a technique through which we instruct the LLM to *think* through the problem step by step, which in some cases leads to a better answer!

The simplest way is, unironically, to tell the LLM to "think things step by step".

In [51]:
input_prompts = ['John has 5 peaches and 2 oranges, Bill gave him 3 peaches, while Sarah gave him 1 orange. Lets think about it step by step. John now has']

generate(input_prompts, max_length=100)

['John has 5 peaches and 2 oranges, Bill gave him 3 peaches, while Sarah gave him 1 orange. Lets think about it step by step. John now has 5 peaches and 2 oranges. Bill gave him 3 peaches, so now he has 8 peaches. Sarah gave him 1 orange, so now he has 3 oranges. So, John has 8 peaches and 3 oranges.\n\n']
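As a sanity check, the expected answer can be computed directly and compared with the model's final sentence. The sketch below also wraps the step-by-step cue in a small helper; the helper name and its default cue are illustrative assumptions that mirror the prompt used above.

```python
def with_step_by_step(question, answer_prefix, cue="Lets think about it step by step."):
    """Insert a chain-of-thought cue between the question and the start of the answer."""
    return f"{question} {cue} {answer_prefix}"

question = ("John has 5 peaches and 2 oranges, Bill gave him 3 peaches, "
            "while Sarah gave him 1 orange.")
print(with_step_by_step(question, "John now has"))

# Ground truth for the task
peaches = 5 + 3  # John's 5 peaches plus Bill's 3
oranges = 2 + 1  # John's 2 oranges plus Sarah's 1
total = peaches + oranges
print(peaches, oranges, total)

# Final sentence of the model's output from the cell above
model_output = "So, John has 8 peaches and 3 oranges."
expected = f"{peaches} peaches and {oranges} oranges"
print(expected in model_output)
```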

#### Closing notes on prompt engineering

1. We can employ more than one technique in the same prompt.  
 In [this example](https://www.promptingguide.ai/techniques/cot) the authors use both in-context learning and chain of thought.

2. We are really stretching our very limited understanding of LLMs. It is possible that new prompting techniques or other ways to access our model's latent capabilities will be developed in the future.

3. Read HuggingFace's [best practices for prompting](https://huggingface.co/docs/transformers/main/tasks/prompting#best-practices-of-llm-prompting).

4. You can explore more advanced prompting techniques [here](https://www.promptingguide.ai/).