<h5> Installing the necessary libraries </h5>

In [1]:
!pip install transformers -q

In [2]:
!pip install bitsandbytes -q


<h5> Importing libraries </h5>

In [3]:
# Importing the classes needed to load the model and tokenizer and to specify the quantization configuration
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

In [4]:
# We will be using llama-2-13b from Huggingface
model_name = "meta-llama/Llama-2-13b-chat-hf"

<h5> Quantization Config </h5>

In [5]:
# Specifying the quantization config
# load_in_4bit -> Stores the model weights in 4-bit precision instead of 16 or 32 bits
# bnb_4bit_use_double_quant -> Use double quantization. The 1st quantization reduces the weights to 4 bits using
#                              per-block scaling factors. The 2nd quantization compresses those scaling factors
#                              themselves, which saves additional memory (roughly 0.4 bits per parameter).
# bnb_4bit_quant_type="nf4" -> NormalFloat4 (NF4), a 4-bit data type designed for normally distributed weights
# bnb_4bit_compute_dtype=torch.float16 -> Data type the weights are dequantized to for computation (FP16 is accelerated on GPUs)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load the model in 4-bit precision
    bnb_4bit_use_double_quant=True,  # Also quantize the scaling factors to save extra memory
    bnb_4bit_quant_type="nf4",  # Use 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.float16  # Use FP16 for computation
)
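To make the memory savings concrete, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions, not a measurement: the parameter count (13 billion) is inferred from the model name, and the block sizes (64 weights per block, 256 constants per second-level block) and constant widths (32-bit, 8-bit) follow the QLoRA paper's description of NF4 with double quantization. Real checkpoints come out somewhat larger because some layers (for example embeddings and normalization weights) are typically not quantized.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_block_size=256, dq_const_bits=8):
    """Approximate storage cost per weight for block-wise quantization.

    All default values are assumptions taken from the QLoRA paper,
    not values read from the bitsandbytes installation.
    """
    if double_quant:
        # 8-bit constant per block, plus one 32-bit constant per group of constants
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    else:
        # one 32-bit scaling factor per block of weights
        overhead = const_bits / block_size
    return weight_bits + overhead


def model_size_gb(n_params, bits):
    """Convert a parameter count and bits-per-parameter into gigabytes."""
    return n_params * bits / 8 / 1e9


N_PARAMS = 13e9  # assumed from the "13b" in the model name

fp16_gb = model_size_gb(N_PARAMS, 16)
nf4_gb = model_size_gb(N_PARAMS, bits_per_param())
nf4_dq_gb = model_size_gb(N_PARAMS, bits_per_param(double_quant=True))

print(f"FP16:                {fp16_gb:.1f} GB")
print(f"NF4:                 {nf4_gb:.1f} GB")
print(f"NF4 + double quant:  {nf4_dq_gb:.1f} GB")
```

Under these assumptions, double quantization lowers the per-parameter overhead from 0.5 bits to about 0.127 bits, which is where the memory saving comes from.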

The model is gated on the Hugging Face Hub, so we authenticate with a Hugging Face access token before downloading it.

In [6]:
hf_token = '<enter your hf token here>'

from huggingface_hub import login
login(token=hf_token)
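Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One possible alternative is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; the helper name `read_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in an environment variable.

    `read_hf_token` is an illustrative helper, not a library function.
    `env` defaults to os.environ; passing a dict makes it easy to test.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before logging in.")
    return token


# Usage in the notebook (requires huggingface_hub to be installed):
# from huggingface_hub import login
# login(token=read_hf_token())
```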

<h3>Model classes and when to use each</h3> <br>
<ol>
  <li> AutoModel - Use when: You need raw hidden states or plan to build and attach your own task-specific layers </li>
  <li> AutoModelForCausalLM - Use when: You’re performing generative tasks, such as chatbots that generate text (e.g., GPT-style models). </li>
  <li> AutoModelForSeq2SeqLM - Use when: You’re working on tasks like summarization, translation, or any generation tasks that require converting one sequence into another (e.g., T5, BART). </li>
  <li> AutoModelForMaskedLM - Use when: You’re performing fill-in-the-blank tasks or working with models like BERT that are pretrained using a masked language modeling objective. </li>
  <li> AutoModelForSequenceClassification - Use when: You need to classify entire sequences (e.g., sentiment analysis, spam detection, topic categorization). </li>
  <li> AutoModelForQuestionAnswering - Use when: You’re building a system to extract answers from a passage (e.g., SQuAD-style question answering). </li>
  <li> AutoModelForTokenClassification - Use when: You’re handling tasks such as Named Entity Recognition (NER) or Part-of-Speech (POS) tagging. </li>
  <li> AutoModelForMultipleChoice - Use when: You’re dealing with multiple-choice questions, common in certain reading comprehension tasks. </li>
  </ol>
  <br>
  <h5> Since the end goal is a RAG-powered chatbot, we need a model that generates conversational responses. Hence, AutoModelForCausalLM is the best choice. </h5>
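The list above can be condensed into a small lookup table. This is only an illustrative sketch: the task labels used as keys and the helper `auto_class_name` are assumptions made for this example, while the class names are the ones listed above.

```python
# Task label -> name of the transformers Auto class listed above.
# The task labels are illustrative keys chosen for this sketch.
AUTO_CLASS_FOR_TASK = {
    "feature-extraction": "AutoModel",
    "text-generation": "AutoModelForCausalLM",
    "summarization": "AutoModelForSeq2SeqLM",
    "translation": "AutoModelForSeq2SeqLM",
    "fill-mask": "AutoModelForMaskedLM",
    "text-classification": "AutoModelForSequenceClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "token-classification": "AutoModelForTokenClassification",
    "multiple-choice": "AutoModelForMultipleChoice",
}


def auto_class_name(task):
    """Return the Auto class name for a task label (hypothetical helper)."""
    try:
        return AUTO_CLASS_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(AUTO_CLASS_FOR_TASK))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}") from None


print(auto_class_name("text-generation"))  # the class used for the chatbot
```

With `transformers` installed, the class itself could then be looked up with `getattr(transformers, auto_class_name("text-generation"))`.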




In [7]:
!pip install -U bitsandbytes



<h5> Downloading the model and Tokenizer </h5>

In [8]:
# Downloading the model and the tokenizer
print("Downloading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Tokenizer successfully downloaded.")

print("Downloading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config, device_map="auto")
print("Model successfully downloaded.")

Downloading tokenizer...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Tokenizer successfully downloaded.
Downloading model...


config.json:   0%|          | 0.00/587 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Model successfully downloaded.


<h5> Saving the quantized model to disk </h5>

In [9]:
print("Saving model and tokenizer to disk...")
model.save_pretrained("./llama2-13b-chat-4bit")
tokenizer.save_pretrained("./llama2-13b-chat-4bit")
print("Model and tokenizer saved to './llama2-13b-chat-4bit'")

Saving model and tokenizer to disk...
Model and tokenizer saved to './llama2-13b-chat-4bit'


In [10]:
# Zipping the model
!zip -r my_model.zip /content/llama2-13b-chat-4bit/

  adding: content/llama2-13b-chat-4bit/ (stored 0%)
  adding: content/llama2-13b-chat-4bit/model-00001-of-00002.safetensors (deflated 4%)
  adding: content/llama2-13b-chat-4bit/tokenizer.json (deflated 85%)
  adding: content/llama2-13b-chat-4bit/special_tokens_map.json (deflated 74%)
  adding: content/llama2-13b-chat-4bit/config.json (deflated 55%)
  adding: content/llama2-13b-chat-4bit/tokenizer_config.json (deflated 66%)
  adding: content/llama2-13b-chat-4bit/model-00002-of-00002.safetensors (deflated 6%)
  adding: content/llama2-13b-chat-4bit/model.safetensors.index.json (deflated 96%)
  adding: content/llama2-13b-chat-4bit/generation_config.json (deflated 32%)
  adding: content/llama2-13b-chat-4bit/tokenizer.model (deflated 55%)
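The output above shows that the weight shards shrink by only 4 to 6 percent, so most of the zipping time is spent compressing data that barely compresses. A possible alternative, sketched here with Python's standard `zipfile` module, stores files without compression by default. The helper name `zip_directory` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir, zip_path, compress=False):
    """Archive every file under src_dir into zip_path (illustrative helper).

    With compress=False the files are stored as-is, which is faster for
    data that hardly compresses, such as quantized weight shards.
    """
    src = Path(src_dir)
    method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(zip_path, "w", compression=method) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Keep the top-level folder name inside the archive
                zf.write(path, arcname=path.relative_to(src.parent).as_posix())
    return Path(zip_path)


# Usage in the notebook:
# zip_directory("./llama2-13b-chat-4bit", "my_model.zip")
```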


In [11]:
from google.colab import files
files.download("my_model.zip")


<h5> Running inference on a simple question to check that the model fits in Colab </h5>

In [15]:
def ask_question(question):
    # Wrap the question in Llama 2's [INST] ... [/INST] chat markers
    input_text = f"[INST] {question} [/INST]"
    # Move the token IDs to the device the model was placed on
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        # Cap only the number of generated tokens; also passing max_length triggers a warning
        output_ids = model.generate(input_ids, max_new_tokens=50)
    # The decoded text contains the prompt followed by the generated answer
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

question = "What is the capital of Djibouti?"
answer = ask_question(question)
print(f"Answer: {answer}")


Answer: [INST] What is the capital of Djibouti? [/INST]  There is no country called "Djibouti". Djibouti is a city and a province in Eritrea, and it does not have a capital. Eritrea's capital is Asmara.
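The printed answer still contains the echoed prompt, because the decoded sequence includes the input tokens. The sketch below shows one way to build the Llama 2 chat prompt, optionally with a system message, and to strip the prompt from the decoded text. Both helper names are hypothetical. The `<<SYS>>` markers follow the Llama 2 chat prompt format, and the tokenizer is assumed to add the beginning-of-sequence token itself.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Format a single-turn Llama 2 chat prompt (illustrative helper)."""
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"


def extract_answer(decoded_text):
    """Return only the text after the last [/INST] marker.

    If the marker is absent, the whole text is returned, stripped.
    """
    return decoded_text.rpartition("[/INST]")[2].strip()


prompt = build_llama2_prompt("What is the capital of Djibouti?")
print(prompt)
print(extract_answer(prompt + "  Some generated text."))
```

Inside `ask_question`, the return value could be passed through `extract_answer` so that only the model's reply is printed.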
