# Day 0

In this first notebook, our goal is to run the Falcon-7b-instruct model from [HuggingFace](https://huggingface.co/models) locally by applying quantization to the model,
and then run inference on it with a prompt template from Langchain.

First, let's verify that our GPU drivers are working correctly with the following checks.


In [1]:
!nvidia-smi

Sat Sep 23 05:17:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0  On |                  N/A |
| 30%   48C    P3              63W / 370W |   1494MiB / 10240MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

The cell above should run without errors and display output similar to the following:

```bash
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0  On |                  N/A |
|  0%   56C    P5              35W / 370W |    841MiB / 10240MiB |     69%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

In [2]:
import torch
import logging

stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.DEBUG)

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(module)s %(funcName)s %(message)s',
    handlers=[
        stream_handler,
    ]
)

In [3]:
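# current_device() raises without CUDA, so only query it when a GPU is available
gpu_current = torch.cuda.current_device() if torch.cuda.is_available() else None
logging.info(f"PyTorch GPU Available: {torch.cuda.is_available()} GPU Count: {torch.cuda.device_count()} GPU Current: {gpu_current}")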
for i in range(torch.cuda.device_count()):
    logging.info(f"GPU Found: {torch.cuda.get_device_name(i)}")

2023-09-23 05:17:18,168 INFO 2625440443 <module> PyTorch GPU Available: True GPU Count: 1 GPU Current: 0
2023-09-23 05:17:18,169 INFO 2625440443 <module> GPU Found: NVIDIA GeForce RTX 3080


We can run the following cells to ensure that Accelerate is picking up our GPU and system resources correctly.

In [4]:
!accelerate env


Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.23.0
- Platform: Linux-5.10.0-25-amd64-x86_64-with-glibc2.31
- Python version: 3.10.11
- Numpy version: 1.26.0
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.01 GB
- GPU type: NVIDIA GeForce RTX 3080
- `Accelerate` default config:
	Not found


We can also run the `accelerate test` command to run a test suite for distributed training.

In [5]:
!accelerate test


Running:  accelerate-launch /home/kirito/repos/llm-notebooks/.venv/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
stderr: The following values were not passed to `accelerate launch` and had defaults used instead:
stderr: 	`--num_processes` was set to a value of `1`
stderr: 	`--num_machines` was set to a value of `1`
stderr: 	`--mixed_precision` was set to a value of `'no'`
stderr: 	`--dynamo_backend` was set to a value of `'no'`
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: NO
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda
stdout: 
stdout: Mixed precision type: no
stdout: 
stdout: 
stdout: **Test process execution**
stdout: 
stdout: **Test split between processes as a list**
stdout: 
stdout: **Test split between processes as a dict**
stdout: 
stdout: **Test split between processes as a tensor**
stdout: 
stdout: **Test random number generator synchronizatio

# Brief Introduction to Quantization

Quantization reduces the memory footprint and computational cost of large models by representing their weights and activations in lower-precision data types, such as int8, instead of higher-precision data types such as float32. This gives us the following benefits:

1. Less RAM (memory) is required to serve large models.
2. Lower computational requirements and faster inference.
3. As a result of 1 and 2, lower monetary costs for training and inference.
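
To make the idea concrete, here is a minimal sketch of absmax quantization from floats to int8 codes and back. It is a toy illustration in plain Python, not what bitsandbytes does internally; the function names and sample weights are made up for this example.

```python
def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max round-trip error:", max_error)
```

Each weight now needs one byte instead of four, at the cost of a small round-trip error. That trade-off is the one the benefits above rely on.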

For a deeper dive into the topic, we can review the following readings:
1. [HF-BitsAndBytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
2. [4bit-Transformers-BitsAndBytes](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

Let us prepare our quantization config.


In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig, GenerationConfig, pipeline as tf_pipeline
from accelerate import infer_auto_device_map, init_empty_weights


2023-09-23 05:17:23,245 INFO instantiator <module> Created a temporary directory at /tmp/tmp1h77a0cm
2023-09-23 05:17:23,246 INFO instantiator _write Writing /tmp/tmp1h77a0cm/_remote_module_non_scriptable.py


In [7]:
quantization_config = BitsAndBytesConfig(
    # Load the pre-trained model with 4-bit quantization
    load_in_4bit=True,
    # Compute dtype used for matrix multiplications on the dequantized weights
    # bnb_4bit_compute_dtype=torch.bfloat16,  # alternative; requires a GPU with bfloat16 support
    bnb_4bit_compute_dtype=torch.float16,
    # Use the 4-bit NormalFloat (NF4) data type
    bnb_4bit_quant_type="nf4",
    # Also quantize the quantization constants (double quantization)
    bnb_4bit_use_double_quant=True,
)

In [8]:
# Specify our desired HuggingFace Falcon Model
MODEL_ID = "tiiuae/falcon-7b-instruct"

We can use the `accelerate estimate-memory` command to estimate the size of our model in memory and the potential savings from quantization.

In [9]:
!accelerate estimate-memory tiiuae/falcon-7b-instruct  --trust_remote_code

Loading pretrained config for `tiiuae/falcon-7b-instruct` from `transformers`...
┌────────────────────────────────────────────────────────┐
│  Memory Usage for loading `tiiuae/falcon-7b-instruct`  │
├───────┬─────────────┬──────────┬───────────────────────┤
│ dtype │Largest Layer│Total Size│  Training using Adam  │
├───────┼─────────────┼──────────┼───────────────────────┤
│float32│    1.1 GB   │ 26.89 GB │       107.54 GB       │
│float16│  563.56 MB  │ 13.44 GB │        53.77 GB       │
│  int8 │  281.78 MB  │ 6.72 GB  │        26.89 GB       │
│  int4 │  140.89 MB  │ 3.36 GB  │        13.44 GB       │
└───────┴─────────────┴──────────┴───────────────────────┘
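
The "Total Size" column follows from simple arithmetic: each data type stores one parameter in a fixed number of bytes, so the totals scale with bytes per parameter. The sketch below reproduces the column from the float32 row. It assumes the size is dominated by the weights and ignores any overhead from quantization constants.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

# "Total Size" reported for float32 in the table above, in GB.
FLOAT32_TOTAL_GB = 26.89


def estimate_total_gb(dtype: str) -> float:
    """Scale the float32 size by the ratio of bytes per parameter."""
    return FLOAT32_TOTAL_GB * BYTES_PER_PARAM[dtype] / BYTES_PER_PARAM["float32"]


estimates = {dtype: estimate_total_gb(dtype) for dtype in BYTES_PER_PARAM}
for dtype, size_gb in estimates.items():
    print(f"{dtype:>7}: {size_gb:.2f} GB")
```

The results agree with the table to within rounding.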


# Tokenizers
Next, we want to load the tokenizer for our model. The tokenizer converts our query text into tokens that the LLM can run predictions on.

In [10]:
logging.info("Running Tokenizer ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

2023-09-23 05:17:26,001 INFO 3973576235 <module> Running Tokenizer ...
2023-09-23 05:17:26,005 DEBUG connectionpool _new_conn Starting new HTTPS connection (1): huggingface.co:443
2023-09-23 05:17:26,279 DEBUG connectionpool _make_request https://huggingface.co:443 "HEAD /tiiuae/falcon-7b-instruct/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
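
To illustrate what "converting text into tokens" means, here is a toy word-level tokenizer that maps text to integer ids and back. This is only a sketch: the `ToyTokenizer` class is hypothetical, and the real Falcon tokenizer works on subword units learned from data rather than on whitespace-separated words.

```python
class ToyTokenizer:
    """Word-level tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, corpus: str):
        words = sorted(set(corpus.lower().split()))
        # Reserve id 0 for words that are not in the vocabulary.
        self.id_to_token = ["<unk>"] + words
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text: str) -> list[int]:
        """Turn text into a list of integer token ids."""
        return [self.token_to_id.get(w, 0) for w in text.lower().split()]

    def decode(self, ids: list[int]) -> str:
        """Turn token ids back into text."""
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer("how is val doing in school")
ids = toy.encode("How is Val doing")
print(ids)
print(toy.decode(ids))
```

The model only ever sees the integer ids. The tokenizer is what translates between those ids and readable text in both directions.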


**Important Notes**
With the `accelerate` library installed and `device_map="auto"` set, the model is automatically loaded into GPU memory if a GPU device is found.

Be careful not to run the model-loading cell below twice, as this will load the model into GPU memory twice.

We can run the `!nvidia-smi` cell again to check the GPU memory usage after loading the model.

For our Falcon-7b-instruct model, the on-disk size is over 14 GB. After quantization, it uses only about *4.5 GB* of GPU memory, shrinking the model by roughly **3x**!
```bash
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0  On |                  N/A |
|  0%   59C    P2             106W / 370W |   5493MiB / 10240MiB |      1%      Default |
```

If we want to free up the GPU memory, we can use `Restart Kernel and Clear Outputs of All Cells`.
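
As an alternative to reading the `nvidia-smi` table, the memory usage can also be queried from Python. The helper below is a small sketch; `gpu_memory_report` is a name invented for this example. It returns an empty list when PyTorch or a CUDA device is unavailable.

```python
def gpu_memory_report() -> list[str]:
    """Return one line per visible GPU with used and total memory in MiB."""
    try:
        import torch
    except ImportError:
        return []  # PyTorch is not installed
    if not torch.cuda.is_available():
        return []  # no usable CUDA device
    lines = []
    for i in range(torch.cuda.device_count()):
        # mem_get_info returns (free, total) in bytes for the whole device,
        # so it also counts memory used by other processes.
        free_bytes, total_bytes = torch.cuda.mem_get_info(i)
        used_mib = (total_bytes - free_bytes) / 2**20
        lines.append(f"GPU {i}: {used_mib:.0f}MiB / {total_bytes / 2**20:.0f}MiB")
    return lines


for line in gpu_memory_report():
    print(line)
```

Calling it before and after the model-loading cell gives a rough measure of how much GPU memory the quantized model occupies.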

# AutoModels
We can use HuggingFace's **AutoModelForCausalLM** to quickly download the base model `falcon-7b-instruct` and apply quantization to it. The model is saved to `$HOME/.cache/huggingface/hub`; to be conservative, allow about 15 GB of storage for falcon-7b-instruct.

In [11]:
logging.info("Loading Model ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",
    return_dict=True,
    torch_dtype=torch.float16,
)

2023-09-23 05:17:26,321 INFO 1754090134 <module> Loading Model ...
2023-09-23 05:17:26,579 DEBUG connectionpool _make_request https://huggingface.co:443 "HEAD /tiiuae/falcon-7b-instruct/resolve/main/config.json HTTP/1.1" 200 0
2023-09-23 05:17:26,831 DEBUG connectionpool _make_request https://huggingface.co:443 "HEAD /tiiuae/falcon-7b-instruct/resolve/main/configuration_RW.py HTTP/1.1" 200 0
2023-09-23 05:17:27,096 DEBUG connectionpool _make_request https://huggingface.co:443 "HEAD /tiiuae/falcon-7b-instruct/resolve/main/config.json HTTP/1.1" 200 0
2023-09-23 05:17:27,390 DEBUG connectionpool _make_request https://huggingface.co:443 "HEAD /tiiuae/falcon-7b-instruct/resolve/main/modelling_RW.py HTTP/1.1" 200 0
2023-09-23 05:17:28,357 INFO modeling get_balanced_memory We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

2023-09-23 05:17:32,458 DEBUG connectionpool _make_request https://huggingface.co:443 "HEAD /tiiuae/falcon-7b-instruct/resolve/main/generation_config.json HTTP/1.1" 200 0


After we have downloaded our model, we can use it in a pipeline to generate text.

In [12]:
logging.info("Creating text generation pipeline ...")

pipeline = tf_pipeline(
    "text-generation",
    model=model,
    use_cache=True,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    device_map="auto",
)

2023-09-23 05:17:32,481 INFO 2212159532 <module> Creating text generation pipeline ...
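
The `do_sample=True` and `top_k=10` arguments above ask the pipeline to sample each next token from the 10 highest-scoring candidates, rather than always picking the single most likely one. The following toy sketch shows the idea in plain Python. It is not the transformers implementation, and the logits are invented for illustration.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample a token id from the k highest-scoring entries of `logits`."""
    # Keep only the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (shifted by the max for numerical stability).
    max_logit = max(logits[i] for i in top_ids)
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, -1.0, 3.1, 0.0]  # made-up scores for a 5-token vocabulary
draws = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(20)]
print(draws)
```

With `k=2`, only token ids 3 and 0 can ever be drawn, because all other candidates are discarded before sampling. A larger `top_k` allows more varied output, while a smaller one keeps it closer to greedy decoding.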


Let us create a test scenario of a School Parent Meeting (SPM) between a teacher Sally, a student Val and her mother Soph, and ask Falcon to generate a conversation.

In [13]:
logging.info("Generating sequences")
sequences = pipeline(
   """Sally is a kindergarden teacher and Val is her student.Sally needs to contact Val's parent Soph to discuss her child's performance in school.Val is always late for class, and Sally needs to highlight to Soph about it.Val is extroverted and has made many new friends in school.
   Soph: Morning Sally! How is Val doing in school?
   Sally:""",
)
logging.info("Printing sequences")
for seq in sequences:
    logging.info(f"Result: {seq['generated_text']}")
logging.info("Operation Complete")

2023-09-23 05:17:32,486 INFO 2580241112 <module> Generating sequences
2023-09-23 05:17:39,647 INFO 2580241112 <module> Printing sequences
2023-09-23 05:17:39,648 INFO 2580241112 <module> Result: Sally is a kindergarden teacher and Val is her student.Sally needs to contact Val's parent Soph to discuss her child's performance in school.Val is always late for class, and Sally needs to highlight to Soph about it.Val is extroverted and has made many new friends in school.
   Soph: Morning Sally! How is Val doing in school?
   Sally: Hi, Soph! Val is enjoying school, but she's late often. I think it might be affecting her learning.
    Soph: Hi Val, how's school going? Val is really enjoying it, and making a lot of new friends. But she's been late quite a lot. We think it's affecting her learning.
2023-09-23 05:17:39,648 INFO 2580241112 <module> Operation Complete


# Integration with Langchain
Once we have our HuggingFace pipeline, it is easy to integrate it with Langchain for a simpler prompting interface. A warning: vanilla Falcon-7b tends to hallucinate.


In [34]:
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

llm = HuggingFacePipeline(pipeline=pipeline)

template = """Scenario: You (Sally) are a math kindergarden teacher who is meeting parents and students for their bi-annual semester assessment. You
should provide feedback based on each student's performance this semester. Be concise.

Grades are defined in descending order of: A, B, C ,D ,E , F. D, E and F are failing grades. Generate one response only.

Example:
Grade: C
Claire: Hi Sally, I am Sarah's parent. Has her math improved this semester?
Sally: Hi Claire, she scored a C on her math exam, and has room for improvement, I would suggest more revision and exercises for her.

Example:
Grade: A
Sam: Hi Sally, I am Tom's parent. How is Tom doing in exams?
Sally: Tom demonstrates keen interest in the subject, scoring a remarkable A, keep up the good work.


Be concise.Generate one response only.Answer the parent's question below.

Grade: {grade}
{parent}: Hi Sally, I am  {student}'s parent. {question}
Sally:
"""

prompt = PromptTemplate(
    template=template, 
    input_variables= ["grade", "parent", "student", "question"]
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
logging.info("Asking Question...")
output = llm_chain.run({"grade": "A", "parent": "Peter", "student": "Becky", "question": "How did Becky perform this semester?"})
logging.info(output)
logging.info("Question Answered")

2023-09-23 05:47:49,384 INFO 2510736692 <module> Asking Question...
2023-09-23 05:48:02,626 INFO 2510736692 <module> Question Answered


Be concise. Generate one response only. Answer the parent's question below.

Grade: B
Alice: Hi Sally, I am John's parent. How is he doing academically this semester? Please provide a feedback.
Sally: John's overall performance is a bit mixed, although he's made substantial progress in the past 2 months. His current math level is B, but he needs more consistent practice to reach a higher level.
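
Because the model simply continues whatever text it receives, it can help to inspect the exact prompt before sending it. With the default template format, the `{...}` placeholders in a `PromptTemplate` are filled in much like Python's `str.format` fields, so a plain-Python preview is a reasonable approximation. The sketch below uses a shortened, hypothetical template rather than the full one above.

```python
# Shortened stand-in for the notebook's template (illustration only).
short_template = """Grade: {grade}
{parent}: Hi Sally, I am {student}'s parent. {question}
Sally:"""

values = {
    "grade": "A",
    "parent": "Peter",
    "student": "Becky",
    "question": "How did Becky perform this semester?",
}

# Substitute every placeholder with its value, as the chain does before calling the LLM.
rendered = short_template.format(**values)
print(rendered)
```

Previewing the rendered text makes it easier to spot stray whitespace or missing variables that might push the model toward an off-topic continuation like the one above.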
