## Running Meta Llama 3 on Google Colab using the Hugging Face transformers library
This notebook shows how to set up and run Llama 3 using the Hugging Face transformers library.
<a href="https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Steps at a glance:
This demo shows how to run the example with already converted Llama 3 weights from [Hugging Face](https://huggingface.co/meta-llama). Note: to use the downloads on Hugging Face, you must first request access as shown in the steps below, using the same email address as your Hugging Face account.

To use already converted weights, start here:
1. Request download of model weights from the Llama website
2. Log in to Hugging Face from your terminal using the same email address as in step 1. Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start).
3. Run the example


Otherwise, if you'd like to download the models locally and convert them to the HF format, follow these steps:
1. Request download of model weights from the Llama website
2. Clone the llama repo and get the weights
3. Convert the model weights
4. Prepare the script
5. Run the example

### Using already converted weights

#### 1. Request download of model weights from the Llama website
Before you can run the model locally, you need the model weights. To get them, visit the [Llama website](https://llama.meta.com/) and click on “download models”.

Fill in the required information, select the “Meta Llama 3” models, and accept the terms & conditions. You will receive a URL by email shortly.

#### 2. Prepare the script

We will install the Transformers library and Accelerate library for our demo.

The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.
The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.


In [None]:
#!pip install transformers
#!pip install accelerate

Next, we import `AutoTokenizer`, a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model. We also import the `transformers` library itself and `torch` for PyTorch.


In [1]:
from transformers import AutoTokenizer
import transformers
import torch



Then, we set the `model` variable to the model we’d like to use. In this demo, we use the 8B Instruct model, `meta-llama/Meta-Llama-3-8B-Instruct`. Using Meta models from Hugging Face requires you to:

1. Accept the Terms of Service for Meta Llama 3 on the Meta [website](https://llama.meta.com/llama-downloads).
2. Use the same email address from Step (1) to log in to Hugging Face.

Follow the instructions on this Hugging Face page to log in from your [terminal](https://huggingface.co/docs/huggingface_hub/en/quick-start).
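If you would rather not type the token into a prompt, one option is to read it from an environment variable. The sketch below assumes `huggingface_hub` is installed and that you have stored a token in a variable named `HF_TOKEN`; the helper name `login_from_env` is hypothetical and only for illustration.

```python
import os

def login_from_env(var_name="HF_TOKEN"):
    """Log in with a token read from an environment variable, if it is set."""
    token = os.environ.get(var_name)
    if not token:
        # Nothing to do here; fall back to the interactive login() prompt.
        return False
    # Imported lazily so the environment check itself has no dependencies.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` returns `False` when the variable is unset, in which case the interactive `login()` call in the next cells still applies.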

In [None]:
#pip install --upgrade huggingface_hub

In [2]:
from huggingface_hub import login
login()


In [3]:
model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model)



tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The cell above uses the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This downloads and caches the pre-trained tokenizer and returns an instance of the appropriate tokenizer class. Next, we create the pipeline for text generation.


In [4]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)



config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### 3. Run the example

The pipeline created in the previous step sets the `device_map` argument to `auto`, which means the pipeline will automatically use a GPU if one is available.
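Since `device_map="auto"` depends on the hardware available, it can help to check what PyTorch sees before generating text. This is a minimal sketch; the helper name `describe_device` is hypothetical, and it only reports the first CUDA device.

```python
def describe_device():
    """Return a short description of the accelerator PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Report the name of the first visible CUDA device.
        return f"GPU: {torch.cuda.get_device_name(0)}"
    return "CPU only"

print(describe_device())
```

On Colab, a "CPU only" result usually means the runtime type has not been switched to a GPU.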

Let’s also generate a text sequence based on the input that we provide. 

In [5]:
sequences = pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Result: I have tomatoes, basil and cheese at home. What can I cook for dinner?
What can I make for dinner using tomatoes, basil and cheese?
Here are a few dinner ideas you can make using tomatoes, basil, and cheese:

1. **Bruschetta**: Toasted bread topped with fresh tomatoes, basil, garlic, and mozzarella cheese. Drizzle with olive oil and balsamic glaze.
2. **Tomato and Basil Pasta**: Cook pasta according to package directions, then top with a sauce made from sautéed tomatoes, garlic, and basil. Add grated Parmesan cheese and serve.
3. **Cheesy Tomato Tart**: A simple tart made with a flaky crust, topped with a mixture of sautéed tomatoes, basil, and shredded mozzarella cheese. Bake until golden brown.
4. **Caprese Salad**: A classic Italian salad made with sliced tomatoes, mozzarella cheese, and fresh basil leaves. Drizzle with olive oil and balsamic vinegar.
5. **Tomato and Basil Grilled Cheese**: A gourmet grilled cheese sandwich filled with sliced tomatoes, fresh basil, and melte

<br>

### Downloading and converting weights to Hugging Face format

#### 1. Request download of model weights from the Llama website
Before you can run the model locally, you need the model weights. To get them, visit the [Llama website](https://llama.meta.com/) and click on “download models”.

Fill in the required information, select the "Meta Llama 3" models, and accept the terms & conditions. You will receive a URL by email shortly.

#### 2. Clone the llama repo and get the weights
Git clone the [Meta Llama 3 repo](https://github.com/meta-llama/llama3). Run the `download.sh` script and follow the instructions. This will download the model checkpoints and tokenizer.

This example uses the Meta Llama 3 8B Instruct model, but the steps are similar for other Llama models and other parameter sizes.

#### 3. Convert the model weights using Hugging Face transformers from source

* `python3 -m venv hf-convertor`
* `source hf-convertor/bin/activate`
* `git clone https://github.com/huggingface/transformers.git`
* `cd transformers`
* `pip install -e .`
* `pip install torch tiktoken blobfile accelerate`
* `python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${path_to_meta_downloaded_model} --output_dir ${path_to_save_converted_hf_model} --model_size 8B --llama_version 3`
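After the conversion finishes, a quick check of the output directory can catch an incomplete run before you try to load the model. The sketch below is an illustration only: the helper name `missing_files` is hypothetical, and the default file names (`config.json`, `tokenizer.json`) are an assumption about what a converted Hugging Face model directory contains, so adjust them to what you see on disk.

```python
from pathlib import Path

def missing_files(model_dir, expected=("config.json", "tokenizer.json")):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]

# Hypothetical usage; replace the path with your own --output_dir value.
# print(missing_files("path/to/converted-hf-model"))
```

An empty list means every expected file was found; it does not validate the weights themselves.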


#### 4. Prepare the script
Import the following modules in your script:
* `AutoModelForCausalLM` loads the Llama 3 model for causal language modeling (text generation)
* `AutoTokenizer` prepares your prompt for the model to process
* `pipeline` is an abstraction to generate model outputs

In [6]:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-Instruct-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # load in half precision, matching the pipeline below
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We need a way to use our model for inference. `pipeline` lets us specify the type of task to run (`text-generation`), the model the pipeline should use to make predictions (`model`), the precision to use for this model (`torch.float16`), and the device on which the pipeline should run (`device_map`), among various other options.
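To see why the precision matters, here is a rough estimate of the memory needed for the weights alone at different precisions. It is a back-of-the-envelope sketch: the parameter count is rounded to 8 billion as an assumption, and activations and the generation cache are not included.

```python
# Bytes used per parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

approx_params = 8e9  # assumption: roughly 8 billion parameters for the 8B model
for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(approx_params, dtype):.0f} GB")
```

The float16 figure of about 16 GB is in line with the total size of the four safetensors shards shown in the download output earlier in this notebook.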


In [7]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

With the pipeline defined, we need to provide a text prompt as input; the generated responses are returned in `sequences`. The pipeline call in the example below sets `do_sample` to True, which enables sampling: the next token is drawn from the probability distribution over the vocabulary instead of always taking the most likely token. In our example, we use top-k sampling, which limits the draw to the `top_k` most likely tokens.

By changing `max_length`, you can specify how long you’d like the generated response to be; the limit counts the prompt tokens as well as the generated ones.
Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.
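To make top-k sampling concrete, here is a small pure-Python sketch of the idea on a toy vocabulary. It is not the transformers implementation; the function name `top_k_sample` and the logit values are made up for illustration.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token id from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring token ids (4 and 0 here) can ever be drawn; raising `k` widens the set of candidates and makes the output more varied.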

In your script, add the following to provide input, and information on how to run the pipeline:


#### 5. Run the example

In [8]:
sequences = pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=400,
)
for seq in sequences:
    print(f"{seq['generated_text']}")


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


I have tomatoes, basil and cheese at home. What can I cook for dinner?
If you have tomatoes, basil, and cheese at home, you can make a delicious Caprese salad with grilled chicken. Here's a simple recipe:

Ingredients:

* 2 large tomatoes, sliced
* 1/4 cup fresh basil leaves, torn
* 2 tablespoons olive oil
* 2 tablespoons balsamic vinegar
* 1/2 cup shredded mozzarella cheese
* 4 boneless, skinless chicken breasts
* Salt and pepper, to taste

Instructions:

1. Preheat your grill or grill pan.
2. Brush the chicken with olive oil and season with salt and pepper. Grill the chicken for 5-6 minutes per side, or until it's cooked through.
3. Meanwhile, arrange the tomato slices on a plate or on a platter.
4. Top the tomatoes with torn basil leaves, mozzarella cheese, and a drizzle of balsamic vinegar.
5. Once the chicken is cooked, slice it and serve it with the Caprese salad.
6. Garnish with additional basil leaves and serve immediately.

Alternatively, you can also make a simple bruschetta 