# Day 0

In this first notebook, our goal is to run the Falcon-7b-instruct model from [HuggingFace](https://huggingface.co/models) locally by quantizing it,
and then to run inference on it with a prompt template from LangChain.

First, let's verify that our GPU drivers are working correctly with the following checks.


In [None]:
!nvidia-smi

The cell above should run without errors and display something similar to the output below.

```bash
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0  On |                  N/A |
|  0%   56C    P5              35W / 370W |    841MiB / 10240MiB |     69%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

In [None]:
import torch
import logging

stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.DEBUG)

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(module)s %(funcName)s %(message)s',
    handlers=[
        stream_handler,
    ]
)

In [None]:
gpu_available = torch.cuda.is_available()
logging.info(f"PyTorch GPU Available: {gpu_available} GPU Count: {torch.cuda.device_count()}")
if gpu_available:  # current_device() raises an error when no GPU is present
    logging.info(f"GPU Current: {torch.cuda.current_device()}")
for i in range(torch.cuda.device_count()):
    logging.info(f"GPU Found: {torch.cuda.get_device_name(i)}")

We can run the following cells to ensure that Accelerate is picking up our GPU and system resources correctly.

In [None]:
!accelerate env

We can also run the `accelerate test` command, which runs a short test script to check that our environment is set up correctly, including for distributed training.

In [None]:
!accelerate test

# Brief Introduction to Quantization

Quantization is a technique for reducing the memory footprint and computational cost of large models by storing weights and activations in lower-precision data types such as int8 instead of higher-precision types such as float32. This gives us the following benefits:

1. Less RAM (memory) is needed to serve large models.
2. Lower computational requirements and faster inference.
3. As a result of 1. and 2., lower monetary costs for training and inference.
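
To make the idea concrete, here is a minimal sketch of absmax quantization to the int8 range in plain Python. It is only an illustration: the weights are made up, and the bitsandbytes library used later in this notebook applies more elaborate block-wise schemes.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.4f}")
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half a quantization step.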

For a deeper dive into the topic, we can review the following readings:
1. [HF-BitsAndBytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
2. [4bit-Transformers-BitsAndBytes](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

Let us prepare our quantization config.


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig, GenerationConfig, pipeline as tf_pipeline
from accelerate import infer_auto_device_map, init_empty_weights


In [None]:
quantization_config = BitsAndBytesConfig(
    # Load the pre-trained model with 4-bit quantization
    load_in_4bit=True,
    # Compute dtype used for matrix multiplications on the dequantized weights.
    # torch.bfloat16 is an alternative on GPUs that support it:
    # bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_compute_dtype=torch.float16,
    # NormalFloat4 (NF4) 4-bit data type
    bnb_4bit_quant_type="nf4",
    # Also quantize the quantization constants (double quantization)
    bnb_4bit_use_double_quant=True,
)
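
As a rough illustration of what `bnb_4bit_use_double_quant` buys us, the sketch below reproduces the overhead arithmetic from the QLoRA paper. The block sizes (64 weights per scaling constant, 256 constants per second-level constant) and the 7e9 parameter count are assumptions taken from that paper and from the model name, not values read from this notebook.

```python
# Assumed block sizes from the QLoRA paper, not read from bitsandbytes itself.
WEIGHT_BLOCK = 64      # weights that share one scaling constant
CONSTANT_BLOCK = 256   # scaling constants that share one second-level constant


def overhead_bits_per_param(double_quant):
    """Extra bits per weight spent on storing quantization constants."""
    if not double_quant:
        return 32 / WEIGHT_BLOCK  # one float32 constant per weight block
    # 8-bit constants, plus one float32 constant per block of constants
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)


single = overhead_bits_per_param(double_quant=False)
double = overhead_bits_per_param(double_quant=True)
saved_bits = single - double

N_PARAMS = 7e9  # assumption: rounded parameter count of a "7b" model
saved_gb = saved_bits * N_PARAMS / 8 / 1e9

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"approximate saving for 7e9 params:    {saved_gb:.2f} GB")
```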

In [None]:
# Specify our desired HuggingFace Falcon Model
MODEL_ID = "tiiuae/falcon-7b-instruct"

We can use the `accelerate estimate-memory` command to estimate the size of our LLM and the potential gains from quantization.

In [None]:
!accelerate estimate-memory tiiuae/falcon-7b-instruct --trust_remote_code
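
For intuition about the numbers this command reports, a back-of-the-envelope estimate is simply parameter count times bytes per parameter. The sketch below assumes a rounded count of 7e9 parameters and covers weights only; activations, the generation cache, quantization constants and any layers kept in higher precision come on top.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_gb(n_params, dtype):
    """Approximate size of the weights alone, in GB (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 7e9  # assumption: rounded parameter count of a "7b" model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {estimate_weight_gb(N_PARAMS, dtype):5.1f} GB")
```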

# Tokenizers
Next, we load the tokenizer for our model. The tokenizer converts our queries and sentences into tokens that the LLM can run predictions on.

In [None]:
logging.info("Running Tokenizer ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
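
To show what "converting text into tokens" means, here is a toy whitespace tokenizer in plain Python. It is a deliberately simplified sketch: real tokenizers such as the one loaded above split text into subword units and have much larger vocabularies, and the helper names here are hypothetical.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct lowercase word in the corpus."""
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown words
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn text into a list of token ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Turn token ids back into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


vocab = build_vocab("how is val doing in school")
ids = encode("How is Val doing", vocab)

print(ids)
print(decode(ids, vocab))
```

The model only ever sees the integer ids; words outside the vocabulary fall back to the `<unk>` id in this toy version.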

**Important Notes**
Once the `accelerate` library is installed, the model is loaded into GPU memory automatically if a GPU device can be found.

Be careful not to run the model-loading cell in the AutoModels section below twice, as this will load the model into GPU memory a second time.

We can re-run the `!nvidia-smi` cell to check GPU memory usage after loading the model.

For our Falcon-7b-instruct model, the on-disk size is over 14 GB; after quantization, it occupies only about *4.5 GB* of GPU memory, shrinking the model by roughly **3x**!
```bash
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0  On |                  N/A |
|  0%   59C    P2             106W / 370W |   5493MiB / 10240MiB |      1%      Default |
```
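
The model's footprint can be read off by comparing the memory column in the two `nvidia-smi` outputs shown in this notebook. The small parser below is a hypothetical helper, and it assumes the 841MiB used before loading is still a fair baseline for other processes on the GPU.

```python
import re


def used_mib(nvidia_smi_row):
    """Extract the 'used' figure from a 'used MiB / total MiB' field."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_row)
    if match is None:
        raise ValueError("no 'used / total' memory field found")
    return int(match.group(1))


# Rows copied from the two nvidia-smi outputs in this notebook
before = used_mib("|  0%   56C    P5              35W / 370W |    841MiB / 10240MiB |     69%      Default |")
after = used_mib("|  0%   59C    P2             106W / 370W |   5493MiB / 10240MiB |      1%      Default")

model_mib = after - before
print(f"model footprint: {model_mib} MiB (~{model_mib / 1024:.1f} GiB)")
```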

If we want to free up the GPU memory, we can use `Restart Kernel and Clear Outputs of All Cells`.
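
A lighter-weight option that sometimes suffices is to drop every reference to the model and ask PyTorch to release its cached memory. The sketch below is an assumption about what may help rather than a guaranteed fix: memory held by objects that are still referenced somewhere in the notebook is not freed, and `release_cached_gpu_memory` is a hypothetical helper name.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def release_cached_gpu_memory():
    """Collect garbage, then return PyTorch's cached CUDA blocks to the driver."""
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated()  # bytes still held by live tensors


# In the notebook, delete every reference first, e.g. `del pipeline, model`.
remaining = release_cached_gpu_memory()
print(f"bytes still allocated: {remaining}")
```

If the reported figure stays high, restarting the kernel remains the reliable way to start from a clean GPU.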

# AutoModels
We can use HuggingFace's **AutoModelForCausalLM** to quickly download the base model `falcon-7b-instruct` and apply quantization to it. The model is saved to `$HOME/.cache/huggingface/hub`; to be conservative, allow about 15 GB of storage for 7b-instruct.

In [None]:
logging.info("Loading Model ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",
    return_dict=True,
    torch_dtype=torch.float16,
)
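
Because re-running the loading cell puts a second copy of the model on the GPU, one possible safeguard is to load only when the object does not exist yet. The sketch below uses a stand-in loader so that it runs anywhere; `load_once` and `fake_loader` are hypothetical names, and in the notebook the loader would wrap the `from_pretrained` call above with `globals()` as the namespace.

```python
def load_once(name, loader, namespace):
    """Return namespace[name], calling loader() only if it is missing."""
    if name not in namespace:
        namespace[name] = loader()
    return namespace[name]


calls = []


def fake_loader():
    """Stand-in for the expensive model-loading call."""
    calls.append(1)
    return "model-object"


ns = {}  # in a notebook this could be globals()
first = load_once("model", fake_loader, ns)
second = load_once("model", fake_loader, ns)

print(f"loader was called {len(calls)} time(s)")
```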

In [None]:
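from optimum.bettertransformer import BetterTransformer

# BetterTransformer picks its conversion based on config.model_type,
# so we set it explicitly before transforming the model.
model.config.model_type = "falcon"
# keep_original_model=False converts in place instead of keeping a second copy
model_better = BetterTransformer.transform(model, keep_original_model=False)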

After we have downloaded our model, we can use it in a pipeline to generate text.

In [None]:
logging.info("Creating text generation pipeline ...")

pipeline = tf_pipeline(
    "text-generation",
    #model=model,
    model=model_better,
    use_cache=True,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    device_map="auto",
)
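
The pipeline above samples with `do_sample=True` and `top_k=10`. As a rough illustration of what top-k sampling does at each generation step, the sketch below keeps only the k highest-scoring candidates, renormalises them, and draws one at random. The logits are made up, k is set to 2 to keep the output small, and this is not the library's actual implementation.

```python
import math
import random


def top_k_probs(logits, k):
    """Keep the k highest-scoring tokens and renormalise with a softmax."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    peak = top[0][1]  # subtract the maximum for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in top}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


logits = {"good": 2.0, "great": 1.5, "late": 0.5, "banana": -3.0}  # made-up scores
probs = top_k_probs(logits, k=2)

print(probs)
print(sample(probs, random.Random(0)))
```

A small k keeps the output focused on likely tokens, while sampling still allows different runs to produce different text.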

Let us create a test scenario of a School Parent Meeting (SPM) between a teacher Sally, a student Val and her mother Soph, and ask Falcon to generate a conversation.

In [None]:
logging.info("Generating sequences")
sequences = pipeline(
   """Sally is a kindergarten teacher and Val is her student. Sally needs to contact Val's parent Soph to discuss her child's performance in school. Val is always late for class, and Sally needs to highlight this to Soph. Val is extroverted and has made many new friends in school.
   Soph: Morning Sally! How is Val doing in school?
   Sally:""",
)
logging.info("Printing sequences")
for seq in sequences:
    logging.info(f"Result: {seq['generated_text']}")
logging.info("Operation Complete")

# Integration with Langchain
Once our HuggingFace pipeline is built, it is easy to integrate it with LangChain for a simpler prompting interface. A warning: vanilla Falcon-7b tends to hallucinate.


In [None]:
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

llm = HuggingFacePipeline(pipeline=pipeline)

template = """Scenario: You (Sally) are a math kindergarten teacher who is meeting parents and students for their bi-annual semester assessment. You
should provide feedback based on each student's grade this semester. Grades are defined in descending order of: A, B, C, D, E, F. D, E and F are failing grades.

Example Question Format:
Grade:
Parent Name:
Sally:

Example:
Grade: C
Claire: Hi Sally, I am Sarah's parent. Has her math improved this semester?
Sally: Hi Claire, she scored a C on her math exam, and has room for improvement, I would suggest more revision and exercises for her.

Example:
Grade: A
Sam: Hi Sally, I am Tom's parent. How is Tom doing in exams?
Sally: Tom demonstrates keen interest in the subject, scoring a remarkable A, keep up the good work.


Be concise. Generate one response only. Answer the parent's question below.

Grade: {grade}
{parent}: Hi Sally, I am {student}'s parent. {question}
Sally:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["grade", "parent", "student", "question"],
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
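
To see what the model will actually receive, it helps to fill a template by hand. The sketch below uses Python's built-in `str.format` on a shortened, hypothetical version of the template; it assumes LangChain's default template format substitutes `{placeholders}` in much the same way.

```python
# Shortened stand-in for the full template defined above
short_template = (
    "Grade: {grade}\n"
    "{parent}: Hi Sally, I am {student}'s parent. {question}\n"
    "Sally:"
)

filled = short_template.format(
    grade="A",
    parent="Peter",
    student="Becky",
    question="How did Becky perform this semester?",
)

print(filled)
```

Printing the filled prompt before sending it to the model is a cheap way to catch missing variables or stray whitespace.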

In [None]:
logging.info("Asking Question...")
output = llm_chain.run({"grade": "A", "parent": "Peter", "student": "Becky", "question": "How did Becky perform this semester?"})
logging.info(output)
logging.info("Question Answered")
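
Since the model may keep writing further turns after Sally's answer, one possible post-processing step is to cut the output at the first sign of a new example. The helper below is a hypothetical sketch; the stop markers are assumptions based on the template above, and the raw text is made up for illustration.

```python
def first_reply(generated, stop_markers=("\nGrade:", "\nExample:", "\n\n")):
    """Keep only the text before the earliest stop marker."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].strip()


# Made-up model output that runs on into an invented second exchange
raw = " Becky scored an A, keep it up.\n\nGrade: B\nTom: Hi Sally"

print(first_reply(raw))
```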