# The Most Brief Introduction to 🤗 Transformers

> ⚠ NOTE: This notebook is **NOT** compatible with the T4 Instance. Please ensure you're in the L4 or A100 instance.

In this notebook, we will finally make the transition to 🤗 `transformers`, a [library](https://huggingface.co/docs/transformers/en/index) we'll spend the remainder of our time with.

There are [many](https://huggingface.co/docs/transformers/en/quicktour), [great](https://huggingface.co/docs/hub/en/transformers), [resources](https://huggingface.co/learn/nlp-course/en/chapter1/1) out there on getting started with `transformers`; so we're going to jump ahead to the best part.

Instead of needing to clone a repository for a specific model architecture - and then load big blobs of code - we can, finally, do this:



## 🎆🎉 Installing `transformers` 🎉🎆

In [None]:
!pip install -qU transformers

We'll also install a few extras that we need, but will largely ignore for today.

In [None]:
!pip install -qU accelerate bitsandbytes

## IMPORTANT: Signing Up for Access

To use Meta's Llama 3.1 8B, you'll first need to request access to the model. Please follow the instructions on the model card [here](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

An example is shown below:

![image](https://i.imgur.com/HBnn2oY.png)

We'll also need to provide our Hugging Face API token - which you can find using [this](https://huggingface.co/docs/hub/en/security-tokens) documentation - to verify our request.

Once we have it, we will log in using the Hugging Face Hub tool!

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Loading a Model

Now that we've installed our library - we need to load a model.

Today, we'll be loading Meta's Llama 3.1 8B Instruct model, in 4-bit quantization. Let's not worry too much about what that means *exactly* - but let's also break down a few of the words:

- Meta: the company
- Llama: the specific model "family", which shares a common decoder-only architecture in the style of GPT
- 3.1: the version; previous versions are *not* compatible, as they have minor architecture changes
- 8B: the size, measured in number of parameters; this model has roughly 8 billion. Llama 3.1 also comes in larger sizes, from 8B up to 405B parameters.

We'll learn about "Instruct" and "4-bit quantization" in the upcoming sessions!


### Getting the Model ID

First, let's navigate to the model card and get the model repository address, which we can use to load the model from the Hugging Face Hub into our local environment (in this case, the Colab environment).

The model card is available [here](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

We can grab the model address by clicking the copy icon on the address on the model card, like so:

![image](https://i.imgur.com/pzWpqM3.png)

In [None]:
model_id = "meta-llama/Llama-3.1-8B-Instruct"

### Setting Appropriate Configurations

#### Quantization Config

We will entirely ignore this cell for now - but rest assured we will take a deep dive into this process later!

This cell *only* ensures the model can fit on our selected hardware in Google Colab.

In [None]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
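
To get a rough feel for why this configuration matters, the sketch below estimates the memory needed just to hold the model weights at different precisions. The helper `weight_memory_gb` and the round figure of 8 billion parameters are illustrative assumptions rather than part of this notebook, and the estimate ignores activations, the KV cache, quantization metadata, and any layers left unquantized.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory (in GB, with 1 GB = 1e9 bytes) for the weights alone."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumed round figure for an "8B" model

# Compare a few common storage precisions for the same parameter count.
for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, 4-bit weights take roughly a quarter of the space of float16 weights.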

Now that we have our model ID and an appropriate quantization configuration, we need to load the model - we can do this using the `AutoModelForCausalLM` class from `transformers`.

This class will:

- Automatically determine the model architecture from the available `.json` config files in the model repository.
- Load the model into our GPU memory following any specific direction we give it.
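
The first point can be pictured as a lookup from a field in the config file to a model class. The toy sketch below only illustrates that idea: the `Toy...` classes, `TOY_REGISTRY`, and `resolve_model_class` are hypothetical names, and this is not how `transformers` is actually implemented internally.

```python
import json


# Hypothetical stand-ins for real model classes.
class ToyLlamaForCausalLM:
    pass


class ToyGPT2ForCausalLM:
    pass


# Toy mapping from a config's "model_type" value to a class.
TOY_REGISTRY = {
    "llama": ToyLlamaForCausalLM,
    "gpt2": ToyGPT2ForCausalLM,
}


def resolve_model_class(config_json: str):
    """Pick a class based on the "model_type" field of a JSON config string."""
    config = json.loads(config_json)
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return TOY_REGISTRY[model_type]


# A cut-down example config, assumed for illustration.
example_config = '{"model_type": "llama", "architectures": ["LlamaForCausalLM"]}'
print(resolve_model_class(example_config).__name__)
```

The point of the sketch is that the caller never names the architecture; the config file carries that information.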

Let's load that model! All 8B parameters of it!

- [AutoModelForCausalLM Docs](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForCausalLM)

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)


We'll want to load a tokenizer for our model!

> NOTE: Each model (or model family) typically uses its own tokenizer - it is critical to ensure that you are using the appropriate tokenizer for your model.

- [AutoTokenizer Docs](https://huggingface.co/docs/transformers/v4.47.0/en/model_doc/auto#transformers.AutoTokenizer)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)


## Looking at the Model

Now that we have the model loaded in our GPU memory - let's look at it to see how it differs from the models we've been using up to this point!

In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps
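
As a sanity check on the "8B" label, the shapes in the printout can be multiplied out by hand. The sketch below counts only the modules visible above (the embedding and the 32 decoder layers); the printout is cut off, so anything that follows is not included. The constant names are made up for illustration, and because `Linear4bit` layers store their weights in packed form, a count read directly from the loaded tensors may differ from this logical count.

```python
# Shapes copied from the printed architecture above.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 128256, 4096, 1024, 14336, 32

embed = VOCAB * HIDDEN  # embed_tokens

attention = (
    HIDDEN * HIDDEN    # q_proj
    + HIDDEN * KV_DIM  # k_proj
    + HIDDEN * KV_DIM  # v_proj
    + HIDDEN * HIDDEN  # o_proj
)
mlp = 3 * HIDDEN * MLP_DIM  # gate_proj, up_proj, down_proj (no biases)
norms = 2 * HIDDEN          # input_layernorm, post_attention_layernorm

per_layer = attention + mlp + norms
shown_total = embed + NUM_LAYERS * per_layer

print(f"per decoder layer: {per_layer:,}")
print(f"embedding + {NUM_LAYERS} layers: {shown_total:,}")
```

The modules shown account for roughly 7.5 billion weights, most of them in the MLP projections.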

##### 👪❓ Discussion Question #1:

What is different about this architecture compared to the models we've seen thus far?

## Inference:

Now, let's do some inference!

> NOTE: We'll use this verbose generation script - though you could simplify this with the use of 🤗 pipelines!

In [None]:
def generate_response(prompt, model, tokenizer):
  """
  Parameters:
    - prompt: str representing formatted prompt
    - model: model object
    - tokenizer: tokenizer object

  Functionality:
    This will allow our model to generate a response to a prompt!

  Returns:
    - str response of the model
  """

  # convert str input into tokenized input
  encoded_input = tokenizer(prompt, return_tensors="pt")

  # send the tokenized inputs to the same device as the model (our GPU)
  model_inputs = encoded_input.to(model.device)

  # generate response and set desired generation parameters
  generated_ids = model.generate(
      **model_inputs,
      max_new_tokens=256,
      do_sample=True,
      pad_token_id=tokenizer.eos_token_id
  )

  # decode output from tokenized output to str output
  decoded_output = tokenizer.batch_decode(generated_ids)

  # return the text after the last <|end_header_id|> marker: just the response for a
  # chat-formatted prompt, or the full decoded text if the marker is absent
  return decoded_output[0].split("<|end_header_id|>")[-1]
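
The final `split` only isolates the response when the prompt contains the chat header tokens. The sketch below shows the difference on two hand-written strings, assuming the Llama 3 style header layout; `extract_after_last_header` and both example strings are made up for illustration. In practice a tokenizer's chat template (for example via `tokenizer.apply_chat_template`) would build the formatted prompt.

```python
END_HEADER = "<|end_header_id|>"


def extract_after_last_header(decoded: str) -> str:
    """Mirror the split used above: keep the text after the last header marker."""
    return decoded.split(END_HEADER)[-1]


# Hand-built example of a chat-formatted exchange (assumed layout).
chat_style = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is the meaning of life?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "42.<|eot_id|>"
)

# Plain prompt with no header tokens: the marker never appears.
plain_style = "<|begin_of_text|>What is the meaning of life? It is a question"

print(repr(extract_after_last_header(chat_style)))   # only the assistant turn
print(repr(extract_after_last_header(plain_style)))  # whole string, prompt included
```

With an unformatted prompt the marker is absent, so the function returns the prompt together with the generated text.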

Play with the model and see how it responds!

In [None]:
prompt = "What is the meaning of life?"

generate_response(
    prompt,
    model,
    tokenizer
)

"<|begin_of_text|>What is the meaning of life? It is a question that has puzzled philosophers, theologians, and everyday people for centuries. There is no one definitive answer, but here are some possible perspectives:\n1. Existentialism: According to existentialist philosophy, life has no inherent meaning. Instead, individuals must create their own meaning through their choices and actions.\n2. Humanism: Humanists believe that the meaning of life is to live in accordance with human values such as reason, compassion, and dignity.\n3. Theistic: Many religious traditions believe that the meaning of life is to fulfill God's purpose or will, whether that involves worship, service, or personal transformation.\n4. Hedonism: Hedonists argue that the meaning of life is to seek pleasure and avoid pain.\n5. Stoicism: Stoics believe that the meaning of life is to live in accordance with reason and virtue, and to cultivate inner strength and resilience in the face of adversity.\n6. Absurdism: Absu