# LLMs from HF

This notebook looks into downloading, loading, and running pre-trained LLMs from `HuggingFace` using `transformers` 'on-premises'.

<br>

**Consider** that:

1. Every time you start your kernel, you need to run the code cells under `env Variables` to set the appropriate environment variables.
2. When a model is loaded for the first time, it is downloaded first, which naturally takes more time. If the model has been downloaded previously, it is loaded directly from the cache. For reference, loading `Llama v3.1 - 8B` from the cache on this machine takes 45.1 seconds.
3. **These models have a lot of parameters**, and to run inference, the models must be loaded into memory. From HuggingFace (https://huggingface.co/blog/llama31): for inference, the memory requirements depend on the model size and the precision of the weights. The table below shows the approximate memory needed for different configurations:

| Model Size | FP16     | FP8      | INT4     |
|------------|----------|----------|----------|
| 8B         | 16 GB    | 8 GB     | 4 GB     |
| 70B        | 140 GB   | 70 GB    | 35 GB    |
| 405B       | 810 GB   | 405 GB   | 203 GB   |

These figures only consider the weights. That is, for `Llama v3.1-70B-FP16`, you need at least 140 GB! As an example, an H100 node (8x H100) has ~640 GB of VRAM, so the 405B model would need to run in a multi-node setup or at a lower precision (e.g. FP8), which would be the recommended approach.

4. The list of "readily-available" models can be found [here](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f). In the Llama 3.1 family, the models available are:
- meta-llama/Meta-Llama-3.1-8B (& -Instruct)
- meta-llama/Meta-Llama-3.1-70B (& -Instruct)
- meta-llama/Meta-Llama-3.1-405B (& -Instruct)
- meta-llama/Meta-Llama-3.1-405B-FP8 (& -Instruct)
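
The memory figures in the table above follow from a simple rule: number of parameters times bytes per parameter. The snippet below is a minimal sketch of that arithmetic; the helper name `weights_memory_gb` is hypothetical, and the 80 GB per H100 is an assumption consistent with the ~640 GB node mentioned above. Like the table, it ignores activations and the KV cache.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

# Assumption: 8 GPUs with 80 GB each, i.e. the ~640 GB node from the text.
H100_NODE_VRAM_GB = 8 * 80


def weights_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory (in GB) needed for the weights alone."""
    # (billions of params * 1e9) * bytes per param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


for size in (8, 70, 405):
    row = {p: weights_memory_gb(size, p) for p in BYTES_PER_PARAM}
    print(f"{size}B: {row}")

for precision in ("FP16", "FP8"):
    needed = weights_memory_gb(405, precision)
    fits = needed <= H100_NODE_VRAM_GB
    print(f"405B {precision}: {needed:.0f} GB, fits on one node: {fits}")
```

The 405B INT4 value comes out as 202.5 GB, which the table rounds to 203 GB.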


## Dependencies & Installs

In [None]:
# !pwd
# !pip list

In [None]:
!pip install --upgrade transformers
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install python-dotenv

## env Variables

In [3]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Read the HF_TOKEN (it is not printed, to avoid leaking it in the notebook output)
hf_token = os.getenv('HF_TOKEN')

# Set cache to custom location
#### VERY IMPORTANT TO AVOID RUNNING OUT OF MEMORY IN DSRS JUP HUB ****
os.environ['HF_HOME'] = '.'

!pwd

/home/jovyan/shared-dsrs/LLM
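
The setup above can be wrapped in a small helper that also reports whether a token was found. This is only a sketch: `configure_hf_environment` is a hypothetical name, not part of any library. To our understanding the Hugging Face libraries read `HF_HOME` when they are first imported, so it is safest to run this before importing `transformers`; the cells below also pass `cache_dir` explicitly, which does not depend on import order.

```python
import os
from pathlib import Path


def configure_hf_environment(cache_dir: str) -> bool:
    """Point the Hugging Face cache at cache_dir.

    Returns True if an HF_TOKEN is present in the environment.
    """
    cache_path = Path(cache_dir).resolve()
    cache_path.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(cache_path)
    # Only report presence; never print the token itself.
    return bool(os.getenv("HF_TOKEN"))


has_token = configure_hf_environment(".")
print(f"HF_HOME = {os.environ['HF_HOME']}")
print(f"HF_TOKEN found: {has_token}")
```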


## Llama 3.1 8B

In [7]:
%%time
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B", cache_dir=os.environ['HF_HOME'])
#model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")


CPU times: user 396 ms, sys: 77.1 ms, total: 473 ms
Wall time: 599 ms


In [11]:
%%time
# from transformers import AutoTokenizer, AutoModelForCausalLM

# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B", cache_dir=os.environ['HF_HOME'])
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B", cache_dir=os.environ['HF_HOME'])


config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Time taken to load model: 115.47 seconds


In [16]:
print(type(tokenizer))
print(type(model))

<class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>


In [21]:
%%time

text = "Once upon a time"

# Set the padding token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize input text with attention mask
print("Tokenizing input...")
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print("Tokenization complete.")

# Generate predictions
print("Generating text...")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=50,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode predictions
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)



# Output results
print(f"Generated text: {generated_text}")

Tokenizing input...
Tokenization complete.
Generating text...
Generated text: Once upon a time, a long time ago, there was a girl who loved to write. She loved to write so much that she wrote a story about a girl who loved to write. It was a very special story. It was the story
Time taken: 290.09 seconds


In [22]:
inputs

{'input_ids': tensor([[128000,  12805,   5304,    264,    892]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [23]:
outputs

tensor([[128000,  12805,   5304,    264,    892,     11,    264,   1317,    892,
           4227,     11,   1070,    574,    264,   3828,    889,  10456,    311,
           3350,     13,   3005,  10456,    311,   3350,    779,   1790,    430,
           1364,   6267,    264,   3446,    922,    264,   3828,    889,  10456,
            311,   3350,     13,   1102,    574,    264,   1633,   3361,   3446,
             13,   1102,    574,    279,   3446]])

In [28]:
tokenizer.decode(11, skip_special_tokens=True)

','
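
Note that `max_length` counts the prompt as well as the generated tokens. The sketch below reuses the token ids printed in the `inputs` and `outputs` cells above (as plain Python lists, so no model is needed) to show how many tokens were actually generated.

```python
# Token ids copied from the `inputs` and `outputs` cells above.
prompt_ids = [128000, 12805, 5304, 264, 892]
output_ids = [
    128000, 12805, 5304, 264, 892, 11, 264, 1317, 892,
    4227, 11, 1070, 574, 264, 3828, 889, 10456, 311,
    3350, 13, 3005, 10456, 311, 3350, 779, 1790, 430,
    1364, 6267, 264, 3446, 922, 264, 3828, 889, 10456,
    311, 3350, 13, 1102, 574, 264, 1633, 3361, 3446,
    13, 1102, 574, 279, 3446,
]

# generate() returns the prompt followed by the continuation.
num_prompt_tokens = len(prompt_ids)
starts_with_prompt = output_ids[:num_prompt_tokens] == prompt_ids
num_new_tokens = len(output_ids) - num_prompt_tokens

print(f"Output starts with the prompt: {starts_with_prompt}")
print(f"Total tokens: {len(output_ids)}")
print(f"Prompt tokens: {num_prompt_tokens}")
print(f"Newly generated tokens: {num_new_tokens}")
```

With `max_length=50` and a 5-token prompt, only 45 tokens are new. Use `max_new_tokens` when the goal is to bound the generated part alone.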

## Llama 3.1 **Instruct** 8B

In [5]:
%%time
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", cache_dir=os.environ['HF_HOME'])
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", cache_dir=os.environ['HF_HOME'])

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 26.6 s, sys: 36.9 s, total: 1min 3s
Wall time: 45.1 s


In [9]:
%%time
text = "Once upon a time"

# Set the padding token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize input text with attention mask
print("Tokenizing input...")
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print("Tokenization complete.")

# Generate predictions
print("Generating text...")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=50,  # Maximum total length in tokens (prompt + generated)
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode predictions
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)



# Output results
print(f"Generated text: {generated_text}")

Tokenizing input...
Tokenization complete.
Generating text...
Generated text: Once upon a time, there was a young man named Jack who lived in a small village surrounded by vast fields and dense forests. Jack was a curious and adventurous soul, always eager to explore the unknown and discover new wonders.
One day, while
CPU times: user 20min 4s, sys: 2.91 s, total: 20min 7s
Wall time: 5min 2s
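
The cell above feeds raw text to the Instruct model, so it simply continues the sentence. Instruct checkpoints are usually prompted through the tokenizer's chat template instead. The following is a hedged sketch of that approach: `build_messages` and `generate_chat_reply` are hypothetical helper names, the system prompt is only an example, and the generation call is left commented out because it needs the `model` and `tokenizer` loaded above.

```python
def build_messages(user_prompt, system_prompt="You are a helpful assistant."):
    """Build a chat in the list-of-dicts format used by chat templates."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate_chat_reply(model, tokenizer, messages, max_new_tokens=50):
    """Apply the chat template, generate, and decode only the new tokens."""
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the assistant header
        return_tensors="pt",
        return_dict=True,            # returns input_ids and attention_mask
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so that only the reply is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


messages = build_messages("Tell me a short story that starts with 'Once upon a time'.")
print(messages)
# reply = generate_chat_reply(model, tokenizer, messages)  # needs the loaded model
```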


In [10]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
    (rotary_emb
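
The layer shapes printed above are enough to estimate the parameter count by hand. The sketch below does that arithmetic under two assumptions that the (truncated) printout does not show: each `LlamaRMSNorm` holds one weight vector of size 4096, and the model ends with an untied `lm_head` of shape 4096 x 128256.

```python
hidden = 4096
vocab = 128256
kv_dim = 1024
mlp_dim = 14336
num_layers = 32

# Embedding table: one vector of size `hidden` per vocabulary entry.
embed = vocab * hidden

# Attention projections (bias=False, so weights only).
attention = (
    hidden * hidden      # q_proj
    + hidden * kv_dim    # k_proj
    + hidden * kv_dim    # v_proj
    + hidden * hidden    # o_proj
)

# MLP: gate_proj, up_proj and down_proj all have hidden * mlp_dim weights.
mlp = 3 * hidden * mlp_dim

# Assumption: two RMSNorm weight vectors of size `hidden` per layer.
norms = 2 * hidden

per_layer = attention + mlp + norms

# Assumption: final RMSNorm plus an untied lm_head (not visible above).
final_norm = hidden
lm_head = hidden * vocab

total = embed + num_layers * per_layer + final_norm + lm_head
print(f"Parameters per layer: {per_layer:,}")
print(f"Total parameters:     {total:,}")
print(f"FP16 weights:         {total * 2 / 1e9:.2f} GB")
```

Under these assumptions the total is about 8.03 billion parameters, in line with the "8B" name and the ~16 GB FP16 figure in the table at the top.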

## Llama 3.1 **Instruct** 70B

In [None]:
%%time
from transformers import AutoTokenizer, AutoModelForCausalLM
 
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct", cache_dir=os.environ['HF_HOME'])
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct", cache_dir=os.environ['HF_HOME'])

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/59.6k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/30 [00:00<?, ?it/s]

model-00001-of-00030.safetensors:   0%|          | 0.00/4.58G [00:00<?, ?B/s]

model-00002-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00003-of-00030.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00030.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00005-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00006-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00007-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00008-of-00030.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00009-of-00030.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00010-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00011-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00012-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00013-of-00030.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00014-of-00030.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00015-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00016-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00017-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00018-of-00030.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00019-of-00030.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00020-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00021-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00022-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00023-of-00030.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00024-of-00030.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00025-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00026-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00027-of-00030.safetensors:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

model-00028-of-00030.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00029-of-00030.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00030-of-00030.safetensors:   0%|          | 0.00/2.10G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]

![Image](downloading%20llama%203.1%2070B.png)
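
Since the 70B download is large, it can help to check the free disk space in the cache location first. The sketch below sums the shard sizes listed in the progress bars above and compares the result with the free space reported by `shutil.disk_usage`. It treats the `G` in the progress bars as 10^9 bytes, which is an assumption, and `has_enough_disk` is a hypothetical helper name.

```python
import shutil

# Shard sizes (GB) as listed in the download log above, in order.
shard_sizes_gb = (
    [4.58]
    + [4.66, 5.00, 4.97, 4.66, 4.66] * 5   # shards 2 to 26
    + [4.66, 5.00, 4.97]                   # shards 27 to 29
    + [2.10]                               # shard 30
)
total_gb = sum(shard_sizes_gb)


def has_enough_disk(path: str, required_gb: float) -> bool:
    """Return True if `path` has at least `required_gb` GB of free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print(f"Number of shards: {len(shard_sizes_gb)}")
print(f"Total download size: {total_gb:.2f} GB")
print(f"Enough free space here: {has_enough_disk('.', total_gb)}")
```

The total of about 141 GB matches the ~140 GB FP16 estimate for the 70B model.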


# Quantization


## Quanto
This section looks into quantization using `Quanto` (https://huggingface.co/docs/transformers/v4.43.3/quantization/quanto), since it **easily integrates with transformers** and is **device agnostic** (e.g. CUDA, MPS, CPU). You can also take a look at this [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing).

This quantization is **not** serializable with transformers, so the quantized model cannot be saved with `save_pretrained()`.
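
To give an intuition for what weight quantization does, here is a deliberately simplified sketch of symmetric (absmax) int8 quantization in plain Python. It is an illustration only and not Quanto's actual implementation; the function names and the example weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.30, 0.07, 0.95, -0.58]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale:     {scale:.6f}")
print(f"quantized: {quantized}")
print(f"max error: {max_error:.6f}")
```

Each weight is stored as one small integer plus a shared scale, which is where the memory saving comes from; the price is a rounding error of at most half a scale step per weight.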

In [None]:
!pip install quanto accelerate transformers

### 8B-Instruct

In [5]:
%%time
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
hf_home = os.environ['HF_HOME']

tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=hf_home)

# Quant config
quantization_config = QuantoConfig(weights="float8") ## ['float8', 'int8', 'int4', 'int2']
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                             cache_dir=hf_home, 
                                             device_map="cpu", 
                                             quantization_config=quantization_config,
                                             low_cpu_mem_usage=True)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 2min 46s, sys: 3min 52s, total: 6min 39s
Wall time: 1min 51s


In [6]:
%%time
text = "How do you compare to GPT4o"
# device = "cpu"

# Tokenize input text with attention mask
print("Tokenizing input...")
inputs = tokenizer(text, return_tensors="pt")#.to(device)

# Generate predictions
print("Generating text...")
outputs = model.generate(**inputs, max_new_tokens=20)

# Output results
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Tokenizing input...
Generating text...
How do you compare to GPT4o?
I was created by Meta, while GPT-4 was developed by OpenAI. Our training
CPU times: user 35min 5s, sys: 20min 57s, total: 56min 2s
Wall time: 14min 3s
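
The wall time above can be turned into a rough per-token cost. This is a back-of-the-envelope sketch: it assumes all 20 requested tokens were generated and that the wall time is dominated by `generate`. Both helper names are hypothetical.

```python
def wall_time_to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a %%time style wall time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def seconds_per_token(wall_seconds, new_tokens):
    """Average time spent per generated token."""
    return wall_seconds / new_tokens


# 8B-Instruct, float8 weights, CPU: "Wall time: 14min 3s" for max_new_tokens=20
wall = wall_time_to_seconds(minutes=14, seconds=3)
per_token = seconds_per_token(wall, new_tokens=20)

print(f"Wall time: {wall} s")
print(f"Seconds per token: {per_token:.2f}")
print(f"Tokens per second: {1 / per_token:.4f}")
```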


### 70B-Instruct

In [5]:
%%time
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
hf_home = os.environ['HF_HOME']

tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=hf_home)

# Quant config
quantization_config = QuantoConfig(weights="int4") ## ['float8', 'int8', 'int4', 'int2']
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                             cache_dir=hf_home, 
                                             device_map="cpu", 
                                             quantization_config=quantization_config,
                                             low_cpu_mem_usage=True)

Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/183 [00:00<?, ?B/s]

CPU times: user 27min 53s, sys: 46min 1s, total: 1h 13min 54s
Wall time: 21min 47s


![Image](70B_quantized_loaded)

In [6]:
%%time
text = "How do you compare to GPT4o"
# device = "cpu"

# Tokenize input text with attention mask
print("Tokenizing input...")
inputs = tokenizer(text, return_tensors="pt")#.to(device)

# Generate predictions
print("Generating text...")
outputs = model.generate(**inputs, max_new_tokens=20)

# Output results
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Tokenizing input...
Generating text...
How do you compare to GPT4oL?
CPU times: user 38min 25s, sys: 1h 13min 8s, total: 1h 51min 33s
Wall time: 28min 53s


In [22]:
for token in range(outputs[0].shape[0]):
    print(tokenizer.decode(outputs[0][token]))
    
print(f'\nTotal # of tokens generated: {outputs[0].shape[0]} (including special tokens..)')

<|begin_of_text|>
How
 do
 you
 compare
 to
 G
PT
4
o
L
?
<|eot_id|>

Total # of tokens generated: 13 (including special tokens..)


In [40]:
%%time
text = "Teach me Quantization for DL in 50 words"
# device = "cpu"

# Tokenize input text with attention mask
print("Tokenizing input...")
inputs = tokenizer(text, return_tensors="pt")#.to(device)

# Generate predictions
print("Generating text...")
outputs = model.generate(**inputs, max_length=50)  # Maximum total length in tokens (prompt + generated)

# Output results
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Tokenizing input...
Generating text...
Teach me Quantization for DL in 50 words.

Quantization in Deep Learning (DL) is a technique to reduce the precision of model weights and activations from floating-point numbers (e.g., float32) to integers (e.g.,
CPU times: user 7h 44min 26s, sys: 15h 12min 22s, total: 22h 56min 49s
Wall time: 5h 49min 46s


### Saving the Quantized model

In this case, saving fails: the model is quantized with `QuantizationMethod.QUANTO` and is not serializable. Check the warnings from the logger in the traceback to understand why the quantized model is not serializable.

In [11]:
# quant_path = "models--meta-llama--Meta-Llama-3.1-8B-Instruct-FP8"
# model.save_pretrained(quant_path)

## AQLM

In [2]:
%%capture
!pip install "aqlm[cpu]"  # use "aqlm[gpu]" on a GPU machine
!pip install accelerate

In [5]:
%%time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
hf_home = os.environ['HF_HOME']

tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=hf_home)

# Load the model (note: no quantization config is passed here)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, cache_dir=hf_home,
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 1.19 s, sys: 42.4 ms, total: 1.23 s
Wall time: 1.39 s


In [6]:
%%time
text = "Explain the different types of quantization"

# Set the padding token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize input text with attention mask
print("Tokenizing input...")
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print("Tokenization complete.")

# Generate predictions
print("Generating text...")
outputs = quantized_model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=50,  # Maximum total length in tokens (prompt + generated)
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode predictions
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)



# Output results
print(f"Generated text: {generated_text}")

Tokenizing input...
Tokenization complete.
Generating text...
Generated text: Explain the different types of quantization error
Quantization error is the difference between the original analog signal and the digital representation of that signal. The types of quantization error are:
1. Differential quantization error: This type of error occurs
CPU times: user 11min 46s, sys: 2.17 s, total: 11min 49s
Wall time: 5min 18s


In [7]:
!pip install hqq

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting hqq
  Downloading hqq-0.1.8.tar.gz (52 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.2/52.2 kB 1.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting einops (from hqq)
  Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Collecting termcolor (from hqq)
  Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Collecting bitblas (from hqq)
  Downloading bitblas-0.0.1.dev13-py3-none-manylinux1_x86_64.whl.metadata (11 kB)
Collecting cpplint (from bitblas->hqq)
  Downloading cpplint-1.6.1-py3-none-any.whl.metadata (4.5 kB)
Collecting docutils (from bitblas->hqq)
  Downloading docutils-0.21.2-py3-none-any.whl.metadata (2.8 kB)
Collecting dtlib (from bitblas->hqq)
  Downloading dtlib-0.0.0.dev2-py3-none-any.whl.metadata (1.8 kB)
Collecting pytest>=6.2.4 (from bitblas->hqq)
  Downloading pytest-8.3.2-py3-none-any.whl.metadata (7.5 kB)
Collecting pytest-xdist>=2.2.1 (from bitblas->hqq)
  Down

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Method 1: all linear layers will use the same quantization config
quant_config  = HqqConfig(nbits=8, quant_zero=False, quant_scale=False) #axis=0 is used by default


UnboundLocalError: cannot access local variable 'HQQBaseQuantizeConfig' where it is not associated with a value
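
One possible cause of the error above, which is an assumption and not something the traceback confirms, is that the running kernel does not see the `hqq` package (for example because the install did not finish, or the kernel was not restarted afterwards). The sketch below checks whether `hqq` is importable before `HqqConfig` is built; `hqq_is_importable` is a hypothetical helper name.

```python
import importlib.util


def hqq_is_importable() -> bool:
    """Return True if the `hqq` package can be found by this kernel."""
    return importlib.util.find_spec("hqq") is not None


if hqq_is_importable():
    print("hqq found: HqqConfig can be constructed.")
else:
    print("hqq not found: check the pip install output and restart the kernel.")
```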