<a href="https://colab.research.google.com/github/bharathkreddy/brks_agents/blob/main/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install fsspec==2023.9.2

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print("not connected to a GPU")
else:
  print(gpu_info)

Tue Jul 22 12:03:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.03              Driver Version: 575.64.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 2070 ...    Off |   00000000:2D:00.0  On |                  N/A |
| 24%   39C    P8             13W /  215W |     699MiB /   8192MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# from google.colab import userdata
# HUGGINGFACE_API_KEY = userdata.get('HUGGINGFACE_API_KEY')

import os
from dotenv import load_dotenv

load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")

### The HuggingFace transformers library provides APIs at two different levels.

The high-level API, for running open-source models on typical inference tasks, is called "pipelines". It takes two lines to use.

You create a pipeline for a task:

`my_pipeline = pipeline("the_task_I_want_to_do")`

and then call it on your input:

`result = my_pipeline(my_input)`

And that's it! The lower-level API works directly with tokenizers and models, which the sections further down cover.

In [36]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, TextStreamer, BitsAndBytesConfig
from diffusers import DiffusionPipeline
from datasets import load_dataset
import soundfile as sf
from IPython.display import Audio
from huggingface_hub import login
import gc

In [4]:
# Sentiment Analysis

classifier = pipeline("sentiment-analysis", device=0)
result = classifier("I'm super knackered after coding entire day.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9967589974403381}]


In [5]:
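# Qwen3-0.6B is a base language model with no trained classification head.
# The pipeline attaches a randomly initialised one (see the warning below),
# so the label and score printed here are not meaningful.
classifier = pipeline("sentiment-analysis", model="Qwen/Qwen3-0.6B")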
result = classifier("I'm super knackered after coding entire day.")
print(result)

Some weights of Qwen3ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen3-0.6B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


[{'label': 'LABEL_0', 'score': 0.7879312038421631}]


In [6]:
# Named Entity Recognition

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple", device="cuda")
result = ner("Bharath Reddy is the lead AI engineer at Finbourne.")
print(result)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda


[{'entity_group': 'PER', 'score': np.float32(0.9975752), 'word': 'Bharath Reddy', 'start': 0, 'end': 13}, {'entity_group': 'ORG', 'score': np.float32(0.784923), 'word': 'AI', 'start': 26, 'end': 28}, {'entity_group': 'ORG', 'score': np.float32(0.9821072), 'word': 'Finbourne', 'start': 41, 'end': 50}]




In [7]:
# Q&A

question_answerer = pipeline("question-answering", device=0)  # use device=0 for CUDA (GPU)
result = question_answerer({
    "question": "Who got elected as chief minister of Gujarat in 2001?",
    "context": "Narendra Damodardas Modi is an Indian politician who has served as the prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament for Varanasi.."
})

print(result)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'score': 0.8625938296318054, 'start': 0, 'end': 24, 'answer': 'Narendra Damodardas Modi'}




In [8]:
# summarizer

summarizer = pipeline("summarization", device=0)
text = """Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios.
Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks.
Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
"""
result = summarizer(text, max_length=50, min_length=10, clean_up_tokenization_spaces=True)
print(result[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


 Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, it delivers groundbreaking advancements in reasoning,


In [9]:
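# Translation
# There is no default model for English -> Hindi, so this call raises a
# ValueError; the next cell passes a model explicitly.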

translator = pipeline("translation_en_to_hi", device=0)
result = translator("My name is Bharath.")
print(result[0]['translation_text'])

ValueError: The task does not provide any default models for options ('en', 'hi')

In [11]:
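# The Helsinki-NLP (Marian) tokenizer needs the sentencepiece package;
# install it first with `pip install sentencepiece`.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi", device=0)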
result = translator("My name is Bharath.")
print(result[0]['translation_text'])

ValueError: This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer.

In [12]:
# classification

classifier = pipeline('zero-shot-classification', device=0)
result = classifier(
    'I have broken my ankle and need to see a doctor.',
    candidate_labels=["medical", "legal", "sports"]
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'sequence': 'I have broken my ankle and need to see a doctor.', 'labels': ['medical', 'sports', 'legal'], 'scores': [0.9936263561248779, 0.004823780618607998, 0.0015498927095904946]}


In [13]:
# text generation

generator = pipeline('text-generation', device=0)
result = generator("If there is one thing i want you to remember about using Huggingface Pipelines, it's ")
print(result[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


If there is one thing i want you to remember about using Huggingface Pipelines, it's  that the use of Huggingface Pipelines is a very simple one. When you are using a Pipeline, you are always using the same Pipelines and you can use their Pipelines to create your own custom Pipelines.
To create your own custom Pipelines, select the 'Add a Pipeline' category in the Pipelines drop down menu:
From there, select your Pipelines and then enter your name, email address, and the email address you just sent.
In the box that appears, choose Create Pipelines from the drop down menu.
This will create a new Pipeline for you, and you can add or change the Pipelines in the drop down menu.
Once you have created a Pipeline and you have a custom Pipeline, you can then use it for any other tasks you have. However, if you are using a different Pipeline, use the 'Add a Pipeline' category in the Pipelines drop down menu and then enter your name, email address, and the email address that you just sent.
You c

In [None]:
# Image Generation

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
pipe.to('cuda')

text = "A mosaic depicting london, in surreal style of Salvador Dali."
image = pipe(prompt=text).images[0]
image

# Tokenizers

In [18]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B', trust_remote_code=True)

text = "I am excited to show Tokenizers in action to my LLM engineers"
tokens = tokenizer.encode(text)
tokens

[128000,
 40,
 1097,
 12304,
 311,
 1501,
 9857,
 12509,
 304,
 1957,
 311,
 856,
 445,
 11237,
 25175]

In [19]:
tokenizer.decode(tokens)

'<|begin_of_text|>I am excited to show Tokenizers in action to my LLM engineers'

In [20]:
tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'I',
 ' am',
 ' excited',
 ' to',
 ' show',
 ' Token',
 'izers',
 ' in',
 ' action',
 ' to',
 ' my',
 ' L',
 'LM',
 ' engineers']

In [None]:
tokenizer.eos_token  

In [23]:
tokenizer.vocab

{'."ĊĊ': 2266,
 'rust': 36888,
 'Ġparalysis': 86139,
 'ĠAssurance': 84882,
 'Cou': 69310,
 'Ġseperti': 68820,
 'Ġotp': 84531,
 'dfs': 35478,
 ']]Ċ': 14623,
 'Ġcodec': 35747,
 'à¹Ħà¸£': 105728,
 'ĠåĪĨ': 59757,
 'ĠCompanies': 32886,
 'TEMP': 49443,
 'ĠLookup': 51411,
 '/uploads': 30681,
 'Âłt': 108135,
 '(Status': 39966,
 'dÄ±k': 106777,
 'ĠVenezuela': 37003,
 'ĠÐ´ÑĸÑıÐ»ÑĮÐ½ÑĸÑģÑĤÑĮ': 119260,
 'ĠÐĴÑĭ': 72731,
 'destruct': 61368,
 'Ġanytime': 30194,
 '////////////////////////////////////////////////': 28505,
 'im': 318,
 'ĠPurchase': 32088,
 'Ġkla': 76986,
 'ĠremoveFromSuperview': 78280,
 'ĠCatholic': 16879,
 '-blind': 95449,
 'InputBorder': 73799,
 'ĠÑģÐºÐ»Ð°Ð´Ñĸ': 121716,
 'Ġlimestone': 45016,
 'Ġcoerc': 55363,
 'ibri': 58934,
 'Ø¸ËĨ': 118400,
 '(font': 23594,
 'Ġç¾İåĽ½': 123521,
 'Ġpoil': 74611,
 'adir': 42273,
 'ĠcÃ©l': 84850,
 '\');?>"': 91927,
 'Ġjap': 86285,
 'newsletter': 72468,
 'Ġuw': 38443,
 'Ġlost': 5675,
 '_EQUAL': 18681,
 'à¸²à¸£à¸¢': 115329,
 '.junit': 8496,
 'ÐºÐ»Ð°Ð´': 10

In [24]:
tokenizer.get_added_vocab()

{'<|begin_of_text|>': 128000,
 '<|end_of_text|>': 128001,
 '<|reserved_special_token_0|>': 128002,
 '<|reserved_special_token_1|>': 128003,
 '<|finetune_right_pad_id|>': 128004,
 '<|reserved_special_token_2|>': 128005,
 '<|start_header_id|>': 128006,
 '<|end_header_id|>': 128007,
 '<|eom_id|>': 128008,
 '<|eot_id|>': 128009,
 '<|python_tag|>': 128010,
 '<|reserved_special_token_3|>': 128011,
 '<|reserved_special_token_4|>': 128012,
 '<|reserved_special_token_5|>': 128013,
 '<|reserved_special_token_6|>': 128014,
 '<|reserved_special_token_7|>': 128015,
 '<|reserved_special_token_8|>': 128016,
 '<|reserved_special_token_9|>': 128017,
 '<|reserved_special_token_10|>': 128018,
 '<|reserved_special_token_11|>': 128019,
 '<|reserved_special_token_12|>': 128020,
 '<|reserved_special_token_13|>': 128021,
 '<|reserved_special_token_14|>': 128022,
 '<|reserved_special_token_15|>': 128023,
 '<|reserved_special_token_16|>': 128024,
 '<|reserved_special_token_17|>': 128025,
 '<|reserved_special_to

# Instruct variants of models

Many models have a variant that has been fine-tuned for chat.  
These are typically labelled with the word "Instruct" at the end of the model name.  
They expect prompts in a model-specific format that marks the system, user and assistant turns with special tokens.  

The tokenizer method `apply_chat_template` converts the familiar list of messages into the right input prompt for a given model.
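
To make the idea concrete, here is a minimal hand-written sketch of what a chat template does, using ChatML-style markers (`<|im_start|>`, `<|im_end|>`). The function name `render_chatml` is hypothetical and exists only for this illustration. Real templates ship with each tokenizer and differ between models, so in practice `apply_chat_template` is the thing to call.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Illustrative only: join a list of {"role", "content"} dicts into one prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
demo_prompt = render_chatml(demo_messages)
print(demo_prompt)
```

The point of the sketch is only that a template is plain string formatting around role markers; which markers are used is decided by how each model was trained.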

In [25]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

In [26]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
]

# The keyword is add_generation_prompt; it appends the assistant header so the model knows it is its turn to reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>


In [27]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [28]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))                # token ids from the Llama 3.1 tokenizer
print()
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))   # the same text split into Phi-3 tokens

[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]

['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in', 'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']


In [30]:
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>



In [32]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME, trust_remote_code=True)

In [33]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>



<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>


<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant



In [34]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")

222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
353=
 
1489= print
459=("
8302=Hello
411=",
4944= person
46=)
222=



# Quantization of models
#### Is quantization all about rounding?
1. In 4-bit quantization we have only 16 possible values (2^4 = 16).
2. We create a codebook of 16 evenly spaced values from -1.0 to 1.0 (step 2/15 ≈ 0.133): [ -1.0, -0.867, -0.733, ..., 0.733, 0.867, 1.0 ]
3. During quantization each weight is replaced by the 4-bit index of its nearest codebook entry:  
    `-0.93` → closest to `-0.867` → store `0001`

    `0.45` → closest to `0.467` → store `1011`
    
    `1.05` → closest to `1.0` → store `1111` (out-of-range values are clamped)
4. At inference, these 4-bit codes are mapped back to floats using that same table.
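
The rounding scheme above can be sketched in a few lines of plain Python. This is a toy, assuming a codebook of 16 evenly spaced levels between -1 and 1 (spacing 2/15); the names `CODEBOOK`, `quantize` and `dequantize` are made up for the illustration and do not come from any library.

```python
# Toy 4-bit quantizer: 16 evenly spaced codebook values in [-1, 1].
CODEBOOK = [-1.0 + i * (2.0 / 15) for i in range(16)]


def quantize(x):
    """Return the 4-bit code (0..15) of the codebook entry nearest to x."""
    return min(range(16), key=lambda i: abs(CODEBOOK[i] - x))


def dequantize(code):
    """Map a 4-bit code back to its float value using the same table."""
    return CODEBOOK[code]


for x in (-0.93, 0.45, 1.05):
    code = quantize(x)
    print(f"{x:+.2f} -> code {code:04b} -> {dequantize(code):+.3f}")
```

Values outside the codebook range, such as `1.05`, simply land on the nearest end of the table, which is why a scale factor is normally applied before the lookup.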

### For LLMs we need more precision and a smarter scheme
1. The codebook is not always from -1 to +1, or even equally spaced. It depends on
    - Quantization method (`int4`, `fp4`, `nf4` etc.)
    - Implementation (uniform vs non-uniform)
    - Whether scaling is applied per layer, per group or per tensor.

2. Uniform quantization (e.g. `int4`)
    - Uses a fixed range and equal spacing, e.g. signed 4-bit `int4` -> 16 values [-8 to 7].
    - If scaled, all weights are first normalised (for example to [-1, 1]) and then mapped onto that range.
3. NF4 (non-uniform) 🔗 Source: [QLoRA paper, which introduces NF4](https://arxiv.org/pdf/2305.14314.pdf)
    - Instead of spacing levels equally, it allocates more resolution near 0, where most neural network weights cluster.
    
    Quantization steps
    1. Compute the min/max (or std range) for a block
    2. Map the values in the block to the codebook
    3. Store a scale and zero point
    4. During inference, recover original ≈ scale * codebook_value + offset
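
The four quantization steps can be sketched as follows. This is an illustration under stated assumptions, not the bitsandbytes implementation: the codebook is built from normal-distribution quantiles so that levels are denser near 0, but it is not the actual NF4 table, and the names `quantize_block` and `dequantize_block` are hypothetical.

```python
from statistics import NormalDist

# Illustrative non-uniform codebook: 16 normal quantiles rescaled to [-1, 1].
# Levels are denser near 0. This is NOT the exact NF4 table.
_quantiles = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
_largest = max(abs(q) for q in _quantiles)
CODEBOOK = [q / _largest for q in _quantiles]


def quantize_block(block):
    """Steps 1-3: find the block range, map to codes, keep scale and offset."""
    lo, hi = min(block), max(block)
    offset = (hi + lo) / 2          # zero point: centre of the block's range
    scale = (hi - lo) / 2 or 1.0    # fall back to 1.0 for a constant block
    codes = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - (w - offset) / scale))
        for w in block
    ]
    return codes, scale, offset


def dequantize_block(codes, scale, offset):
    """Step 4: original ≈ scale * codebook_value + offset."""
    return [scale * CODEBOOK[c] + offset for c in codes]


weights = [0.02, -0.01, 0.15, -0.30, 0.05, 0.00, -0.07, 0.22]
codes, scale, offset = quantize_block(weights)
restored = dequantize_block(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print(f"scale={scale:.3f} offset={offset:.3f} max abs error={max_err:.4f}")
```

Each block stores only sixteen-level codes plus one scale and one offset, so smaller blocks give better accuracy at the price of more stored constants.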

In [37]:
# instruct models

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct" # exercise for you
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1" # big model, requires 24GB GPU

In [38]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

🧮 Step 2: Define Quantization Config

This config tells Hugging Face to load the model in 4-bit precision, which cuts memory use substantially and lets larger models fit on consumer GPUs.

1. `load_in_4bit=True`: Load model weights in 4-bit precision instead of 16/32-bit. Weight memory is roughly 4x smaller than with 16-bit weights.
2. `bnb_4bit_use_double_quant=True`: Applies a second round of quantization to the per-block scaling constants produced by the first. This saves a little extra memory (about 0.4 bits per parameter); it improves compression rather than accuracy.
3. `bnb_4bit_compute_dtype=torch.bfloat16`: Weights are stored in 4-bit but dequantized to bfloat16 for computation, which balances speed and numerical range.
4. `bnb_4bit_quant_type="nf4"`: Uses the NF4 (4-bit NormalFloat) data type, which is designed for normally distributed weights and typically preserves more information than plain 4-bit integers.
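
As a rough sanity check on the memory claim, the arithmetic below estimates the space needed for the weights alone at different precisions. It assumes a nominal 8 billion parameters for an "8B" model and ignores everything else (layers kept in higher precision, quantization constants, activations, the KV cache), so treat the numbers as lower bounds. The helper `weight_memory_gb` is a made-up name for this sketch.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: nominal parameter count of an "8B" model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

On an 8 GB card, part of which is already in use, even the 4-bit estimate leaves little headroom once the overheads listed above are added.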

In [39]:
# Quantization Config - this allows us to load the model into memory and use less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

🔤 Step 3: Load Tokenizer
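1. Loads the tokenizer that matches the `LLAMA` model defined above (`meta-llama/Meta-Llama-3.1-8B-Instruct`).
2. Chat-style models often don't have a padding token defined, so we use `eos_token` (end of sequence) as padding, which is a common fallback.
3. `apply_chat_template` then converts the messages into input token ids and moves them to the GPU.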
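
Why a padding token is needed at all: sequences in a batch must have the same length, so shorter ones are filled with a pad id and an attention mask marks which positions are real. The sketch below shows this in plain Python. `PAD_ID` is an assumed stand-in for `tokenizer.eos_token_id`, `pad_batch` is a hypothetical helper, and right-padding is used only for simplicity (for generation with decoder-only models, left-padding is usually preferred).

```python
PAD_ID = 128009  # assumption: stands in for tokenizer.eos_token_id


def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to equal length and build matching attention masks."""
    width = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (width - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return input_ids, attention_mask


batch = [[128000, 40, 1097, 12304], [128000, 40]]
ids, mask = pad_batch(batch, PAD_ID)
print(ids)
print(mask)
```

Because the mask tells the model to ignore padded positions, reusing the end-of-sequence token as the pad token does not change what the model attends to.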

In [40]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
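# add_generation_prompt=True ends the prompt with the assistant header, ready for generation
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")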

In [41]:
# The model

model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 

In [None]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")