# Lecture 6: Loading Models with `from_pretrained`
In this notebook, we will explore how to load pre-trained models using the `from_pretrained` method from the Hugging Face Transformers library. We will also dive into the configuration, weights, and caching mechanisms.

The Google Colab version of this notebook is available here: https://colab.research.google.com/drive/1gb3hu83Wktk5cUDObPJqjlM5olAhbvam?usp=sharing


## **Feeling Brave?**

Try out the code for this lecture on your own laptop or desktop. The code from the Google Colab notebook linked above is reproduced here for anyone who wants to take on the challenge of running this lecture on their own machine.


**Consider the following** before running this notebook on your computer:
1. Make sure you have plenty of RAM (ideally >= 16 GB)

2. If you **do not have** an NVIDIA GPU, you will have to install the CPU version of bitsandbytes by following these steps:
    - In your terminal, activate your virtual environment and run `pip uninstall bitsandbytes` or in your notebook cell run `!pip uninstall bitsandbytes`
    - In your terminal run `pip install bitsandbytes-cpu` or in your notebook cell run `!pip install bitsandbytes-cpu`

3. If you **do** have an NVIDIA GPU
    - In your terminal, activate your virtual environment and run `pip uninstall torch torchvision torchaudio` or in your notebook cell run `!pip uninstall torch torchvision torchaudio`
    - In your terminal run `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126` or in your notebook cell run `!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126`

4. Make sure you have a stable internet connection as the download can take some time

*Note* that running models without a GPU can be extremely time consuming and may lead to your machine overheating if it is old.
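
Before working through the steps, it can help to confirm which of the packages used in this lecture are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `find_missing_packages` is hypothetical, and the package list simply mirrors the imports that appear later in this notebook.

```python
import importlib.util

# Import names used in this notebook ("dotenv" is the import name of python-dotenv).
REQUIRED = ["torch", "transformers", "bitsandbytes", "huggingface_hub", "dotenv"]


def find_missing_packages(names):
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that a package can be located; it does not verify versions or whether a GPU build of torch is installed.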

# Step 1: Load libraries and log in to Hugging Face

In [1]:
"""
Previously we used pipelines to run models. Before pipelines were available, models were loaded with the from_pretrained method. from_pretrained is a
lower-level way of achieving the same outcome; pipelines are much simpler to use. With from_pretrained you load a model and then load a tokenizer to
process the request. Nothing forces the tokenizer to come from the same model, but it should: the model only makes sense of token IDs from the
vocabulary it was trained with.

Note: the bitsandbytes-cpu install mentioned above is no longer available, so something has clearly changed. I could not get this notebook to run on my
machine. I hit the same issue as before (an "invalid layout parameter" error or something like it). I think it is a memory problem, as the load uses up
all my RAM and then dies.

The following needs to be installed: -

    pip install -q torch transformers bitsandbytes

1.  torch runs the model (its weights and biases)
2.  transformers allows us to load and run the model and the tokenizer
3.  bitsandbytes allows us to quantize our model (basically decrease its size)
"""

# Load imports and get the Hugging Face token.
import os
import torch
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()

hf_token = os.getenv("HUGGINGFACE_API_KEY")
login(hf_token, add_to_git_credential=True)
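
`os.getenv` returns `None` when the variable is not set, so a missing `.env` entry is easy to overlook. The following is a small sketch of a guard that fails early with a clear message; the helper name `get_hf_token` is hypothetical, and the value `hf_example` is a placeholder rather than a real token.

```python
import os


def get_hf_token(env=None, var_name="HUGGINGFACE_API_KEY"):
    """Return the token stored under `var_name`, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    if not token:
        raise RuntimeError(f"{var_name} is not set; add it to your .env file or environment.")
    return token


# Demonstration with an explicit mapping instead of the real environment.
print(get_hf_token({"HUGGINGFACE_API_KEY": "hf_example"}))
```

In the cell above, the result of `get_hf_token()` could be passed to `login` in place of `hf_token`.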

# Step 2: Load quantization configuration for model

In [None]:
# Note, BitsAndBytes only works with GPUs so it is removed if we are using CPU only.
from transformers import BitsAndBytesConfig

""" 
Quantization Config - this allows us to load the model using less memory.

Neural networks are typically trained in 32-bit floating point precision. Quantization reduces the precision of the stored weights to 16, 8 or even
4 bits, which is what we are doing here. This costs some accuracy, but proportionally much less than the memory it saves. In this example (from
32 bits to 4) the memory reduction is about 8 times.
"""

# this is the gpu version
# quant_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_quant_type="nf4"
# )

# This is a CPU equivalent 

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
    bnb_4bit_quant_type="nf4"
)
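
The "about 8 times" figure follows directly from the number of bits stored per weight. As a rough sketch, the arithmetic below estimates the memory needed for the weights alone at several precisions. The parameter count is an assumed round figure for an 8B-parameter model, and the estimate ignores activations, the generation cache, and the small overhead that quantization adds for its scaling constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # assumed round figure for an 8B-parameter model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights need roughly 32 GB at 32-bit precision and roughly 4 GB at 4-bit precision, which is why the quantized model is far more practical on consumer hardware.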

# Step 3: Load a Pre-trained Model
We will use the `meta-llama/Meta-Llama-3.1-8B-Instruct` model as an example. This step demonstrates how to load the model and tokenizer.

In [None]:
# instead of importing pipeline, import this class which contains the from_pretrained method. 
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Pass in the name of the model we want to use.
# device_map="auto" lets the function detect whether we have a GPU or CPU only and use what is there.
# quantization_config applies the quantization settings from above.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)

print(f"Model '{model_name}' loaded successfully!")

# Step 4: Explore the Model Configuration
The configuration of a model contains important details such as the number of layers, hidden size, and more.

In [None]:
config = model.config
print("Model Configuration:")
print(config)
"""
The beginning-of-sequence token ID, and three end-of-sequence token IDs.

    "bos_token_id": 128000,
    "dtype": "float16",
    "eos_token_id": [
        128001,
        128008,
        128009
    ],

  The activation function for the hidden layer

    "hidden_act": "silu",

The model type, and the number of hidden layers: -

    "model_type": "llama",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,

The quantization method loaded above: -

    "quantization_config": {
        "_load_in_4bit": true,
        "_load_in_8bit": false,
        "bnb_4bit_compute_dtype": "bfloat16",
        "bnb_4bit_quant_storage": "uint8",
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": true,
        "llm_int8_enable_fp32_cpu_offload": false,
        "llm_int8_has_fp16_weight": false,
        "llm_int8_skip_modules": null,
        "llm_int8_threshold": 6.0,
        "load_in_4bit": true,
        "load_in_8bit": false,
        "quant_method": "bitsandbytes"
    },

    The vocabulary size, i.e. the number of distinct tokens (not whole words) the model recognizes: -

        "vocab_size": 128256

        
"""

Let's have a look at the actual layers (just for fun!)

In [None]:
# Evaluating the model object in a cell prints its architecture (the tree of layers)
model

"""
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)     <--- you can see the size of the vocabulary
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(           <--- and the number of layers
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
"""

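The layer shapes in the printout are enough to work out where the "8B" in the model name comes from. The sketch below adds up the weight counts from the shapes shown above, assuming that no layer has a bias term (as the `bias=False` entries indicate), that each `LlamaRMSNorm` holds one weight per hidden dimension, and that the rotary embedding has no learned parameters.

```python
# Dimensions read from the printed architecture.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 128256, 4096, 1024, 14336, 32

embed = VOCAB * HIDDEN                                   # embed_tokens
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM    # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                               # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                       # input and post-attention norms
per_layer = attention + mlp + norms

final_norm = HIDDEN                                      # model.norm
lm_head = HIDDEN * VOCAB                                 # output projection

total = embed + NUM_LAYERS * per_layer + final_norm + lm_head
print(f"{total:,}")  # prints 8,030,261,248
```

Counting parameters directly on the 4-bit model in PyTorch may report a different number, because the quantized layers store several weights packed into each element.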

# Step 5: Understand Caching
When you load a model, it is cached locally so that it does not have to be downloaded again. By default, models are stored in `~/.cache/huggingface/hub` (on Windows, for example, `C:\Users\<username>\.cache\huggingface\hub`).


See further reference: the Hugging Face cache management guide in the `huggingface_hub` documentation.
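
To find the cached files on your own machine, the following sketch approximates how the cache location is resolved. It assumes the `HF_HUB_CACHE` and `HF_HOME` environment variables take precedence over the default path, and that each model is stored in a folder named after its repository ID. The helper names are hypothetical, and the real library handles additional cases (such as `XDG_CACHE_HOME`) that this sketch ignores.

```python
import os
from pathlib import Path


def default_hub_cache(env=None):
    """Approximate the hub cache directory from environment variables."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def repo_folder_name(repo_id):
    """Folder name used for a model repository inside the hub cache."""
    return "models--" + repo_id.replace("/", "--")


cache_dir = default_hub_cache()
print(cache_dir / repo_folder_name("meta-llama/Meta-Llama-3.1-8B-Instruct"))
```

Deleting a model's folder from this directory forces a fresh download the next time `from_pretrained` is called for that model.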

# Step 6: Tokenizing a Prompt and Generating Text
In this step, we will tokenize a list of messages, pass it to the model, and generate text as output.

In [3]:
# An instruct model requires a list of messages as we saw in AI Engineering Essentials Part 1
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the best way to structure and organize my thoughts?"}
  ]
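
Before tokenization, the chat template turns this list of messages into a single string with special tokens marking each role. The sketch below is a simplified imitation of that rendering, based on the special tokens visible in the decoded output at the end of this notebook. The function name `render_llama3_chat` is hypothetical, and the real template stored with the tokenizer does more (for example, it adds the "Cutting Knowledge Date" lines to the system message).

```python
def render_llama3_chat(messages):
    """Simplified rendering of a message list into one prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each message gets a role header, a blank line, its content, and an end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the best way to structure and organize my thoughts?"},
]
print(render_llama3_chat(example_messages))
```

In practice `tokenizer.apply_chat_template` should always be used, since each model family defines its own template.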

In [6]:
# The messages need to be tokenized. This converts the text into tokens (words or pieces of words) and maps each one to its ID in the model's vocabulary.
from transformers import AutoTokenizer

# This works like downloading the model: you download the tokenizer that belongs to the model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
# This is not strictly needed. It just eliminates an annoying warning message in the output.
tokenizer.pad_token = tokenizer.eos_token
# This does the work. Note that I have commented out the .to("cuda") because I don't have a GPU.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")  # .to("cuda")

print(inputs)
"""
And you get the results: a tensor with the tokens encoded as IDs. Notice that the bos_token_id (128000) is the first number and one of the eos_token_ids (128009) is at the end.

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271,   2675,    527,
            264,  11190,  18328, 128009, 128006,    882, 128007,    271,   3923,
            374,    279,   1888,   1648,    311,   6070,    323,  31335,    856,
          11555,     30, 128009]])
"""


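When token IDs are passed to the model for generation, Transformers may warn that no attention mask was supplied (the warning text appears in the generation output in this notebook). An attention mask marks real tokens with 1 and padding with 0. The sketch below builds one by hand for a small batch of plain Python lists. The helper name `pad_and_mask` is hypothetical, the token IDs are illustrative, and left padding is used because that is the usual recommendation when generating with decoder-only models.

```python
def pad_and_mask(sequences, pad_id):
    """Left-pad token-ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append([pad_id] * n_pad + list(seq))
        masks.append([0] * n_pad + [1] * len(seq))
    return padded, masks


batch = [[128000, 3923, 374], [128000, 3923]]  # illustrative token IDs
padded, masks = pad_and_mask(batch, pad_id=128001)
print(padded)
print(masks)
```

For a single unpadded prompt the mask is simply all ones. `apply_chat_template` also accepts `return_dict=True`, which returns the attention mask alongside the input IDs so that both can be passed to the model.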
In [None]:
# Pass the input IDs to the model's generate method to produce output. max_new_tokens limits the response to 80 new tokens (tokens, not words).
output_ids = model.generate(inputs, max_new_tokens=80)

# Display the generated output IDs. This is still tokenized and not words. 
print("Generated Output IDs:", output_ids)

"""
You get something like this: -

    The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
    The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    
    Generated Output IDs: tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
            25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271,   2675,    527,
            264,  11190,  18328, 128009, 128006,    882, 128007,    271,   3923,
            374,    279,   1888,   1648,    311,   6070,    323,  31335,    856,
            11555,     30, 128009, 128006,  78191, 128007,    271,   9609,   1711,
            323,  35821,    701,  11555,    649,    387,  17427,   1555,   5370,
            12823,     13,   5810,    527,   1063,   7524,   5528,    311,   1520,
            499,  38263,    323,  63652,    701,   6848,   1473,     16,     13,
            3146,  70738,  39546,  96618,   5256,    449,    264,   8792,   4623,
            323,   1893,    264,   9302,   2472,    315,   5552,  19476,     11,
            1701,  23962,     11,  21513,     11,    323,   5448,     13,   1115,
            15105,   8779,    311,  10765,  12135,    323,  13537,   1990,   6848,
            627,     17,     13,   3146,  66177,  27511,    287,  96618,   9842,
            1523,    439]], device='cuda:0')
"""

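As the output above shows, the generated sequence starts with the prompt tokens and then continues with the model's reply. To keep only the reply, the prompt can be sliced off by its length. The sketch below demonstrates the idea on plain Python lists, using shortened, illustrative ID lists taken from the output above; the helper name `new_tokens_only` is hypothetical.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return the part of a generated sequence that follows the prompt."""
    return output_ids[prompt_length:]


# Shortened for illustration: the tail of the prompt and the start of the reply.
prompt_ids = [11555, 30, 128009]
generated = prompt_ids + [128006, 78191, 128007, 271, 9609]

print(new_tokens_only(generated, len(prompt_ids)))
```

With the tensors in this notebook, the equivalent slice would be `output_ids[0][inputs.shape[1]:]`, which can then be passed to `tokenizer.decode`.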
In [None]:
# Now turn the output IDs back into words, but don't return the special tokens. The [0] selects the first (and only) sequence in the batch.
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("Generated Text:", generated_text)

"""
If you include special tokens, you get this

Generated Text: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the best way to structure and organize my thoughts?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Structuring and organizing your thoughts can be achieved through various techniques. Here are some effective methods to help you clarify and prioritize your ideas:

1. **Mind Mapping**: Start with a central idea and create a visual map of related concepts, using branches, keywords, and images. This technique helps to identify relationships and connections between ideas.
2. **Brainstorming**: Write down as
"""