# Lesson 4: Quantization Theory

This lesson covers linear quantization.

It includes some theory, so watching the lecture is strongly recommended!

#### Libraries to install
- If you run this file on your own computer, you need to install the libraries below.

```Python
!pip install transformers==4.35.0
!pip install quanto==0.0.11
!pip install torch==2.1.1
```

## T5-FLAN
- The lecture demonstrates EleutherAI's Pythia model, but this notebook runs the exercise with the T5-FLAN model instead, presumably because it needs fewer resources.

Using the T5-FLAN model requires one additional library.
```Python
!pip install sentencepiece==0.2.0
```


### Without Quantization

In [None]:
model_name = "google/flan-t5-small"

In [None]:
import sentencepiece as spm
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained(model_name)

In [None]:
model = T5ForConditionalGeneration.from_pretrained(model_name)

Tokenize the input text, pass the tokens to the model, and decode the output to get the generated text.

In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Check the model's size with the helper function that computes it.

In [None]:
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
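
The source of `compute_module_sizes` in `helper.py` is not shown in this notebook. As a rough sketch of what such a size calculation might look like, assuming it sums the storage of every parameter and buffer, the hypothetical function `module_size_in_bytes` below does this for a toy layer:

```Python
import torch
from torch import nn

def module_size_in_bytes(module: nn.Module) -> int:
    """Hypothetical helper: sum the storage of all parameters and buffers."""
    total = 0
    for tensor in list(module.parameters()) + list(module.buffers()):
        # numel() is the number of elements, element_size() the bytes per element
        total += tensor.numel() * tensor.element_size()
    return total

# Toy stand-in for a real model: 1024*1024 weights + 1024 biases, fp32 by default
toy = nn.Linear(1024, 1024)
size = module_size_in_bytes(toy)
print(f"The toy model size is {size * 1e-9} GB")
```

Because the size depends on `element_size()`, the same parameter count takes fewer bytes once the weights are stored in a smaller data type.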

## Quantize the model (8-bit precision)

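This lesson uses the `quantize` and `freeze` functions from the quanto library.

`quantize`, as the name suggests, sets up quantization but leaves the model in an intermediate state. `freeze` then uses that information to actually replace the weights with their quantized versions, in place.

The code below shows that quantization is not applied to the activations here (`activations=None`).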
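
To make the idea of linear quantization concrete, here is a minimal sketch in plain PyTorch. It assumes the symmetric, per-tensor variant (a single scale, zero point fixed at 0); the helpers `linear_quantize` and `linear_dequantize` are hypothetical names for illustration and are not quanto's API or necessarily its internal scheme.

```Python
import torch

def linear_quantize(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization, so that w is roughly scale * q."""
    scale = w.abs().max() / 127  # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def linear_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp32 approximation of the original tensor."""
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(4, 4)                # stand-in for a weight matrix
q, scale = linear_quantize(w)
w_hat = linear_dequantize(q, scale)

print(q.dtype, w_hat.dtype)
print("max abs error:", (w - w_hat).abs().max().item())
```

Each value is stored as one int8 plus a shared scale, and the rounding error per element is at most half the scale. The sketch does not handle an all-zero tensor, where the scale would be 0.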

In [None]:
from quanto import quantize, freeze
import torch

In [None]:
quantize(model, weights=torch.int8, activations=None)

In [None]:
print(model)

### Freeze the model
- This step requires some memory.
- Running `freeze` changes the data type of the weights inside the model, which in turn reduces the amount of memory the model requires.

In [None]:
freeze(model)

Recomputing the size shows that the model has shrunk to roughly 1/3 of its original size.

In [None]:
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
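
Going from 4-byte fp32 to 1-byte int8 would suggest a ratio of 1/4, so a ratio nearer 1/3 needs some explanation. One plausible reason, stated here as an assumption, is that not every tensor in the model is quantized. The toy calculation below uses made-up parameter counts (not the real FLAN-T5 figures) to show the arithmetic:

```Python
BYTES_FP32 = 4
BYTES_INT8 = 1

# Hypothetical parameter counts, chosen only for illustration
quantized_params = 60_000_000    # weights assumed to end up in int8
unquantized_params = 8_000_000   # tensors assumed to stay in fp32

before = (quantized_params + unquantized_params) * BYTES_FP32
after = quantized_params * BYTES_INT8 + unquantized_params * BYTES_FP32
ratio = after / before

print(f"before: {before * 1e-9:.3f} GB, after: {after * 1e-9:.3f} GB, ratio: {ratio:.2f}")
```

The larger the share of parameters left in fp32, the further the overall ratio drifts above 1/4.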

### Try running inference on the quantized model

In [None]:
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

The rest of this notebook is material from the lecture. It is mainly useful if you want to run the model yourself on a local machine.

## Note: Quantizing the model used in the lecture video will not work due to classroom hardware limitations.
- Here is the code that Marc, the instructor, is walking through.
- It will likely run on your local computer if you have 8GB of memory, which is usually the minimum for personal computers.
  - To run locally, you can download the notebook and the helper.py file by clicking on the "Jupyter icon" at the top of the notebook and navigating the file directory of this classroom.  Also download the requirements.txt to install all the required libraries.

### Without Quantization



- Load [EleutherAI/pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) model and tokenizer.

```Python
from transformers import AutoModelForCausalLM
model_name = "EleutherAI/pythia-410m"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             low_cpu_mem_usage=True)
print(model.gpt_neox)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

- Write a start of a (`text`) sentence which you'd like the model to complete.
```Python
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
outputs
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

- Compute the model's size using the helper function, `compute_module_sizes`.
```Python
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
print(model.gpt_neox.layers[0].attention.dense.weight)
```
**Note:** The weights are in `fp32`.

### 8-bit Quantization

```Python
from quanto import quantize, freeze
import torch

quantize(model, weights=torch.int8, activations=None)
# after performing quantization
print(model.gpt_neox)
print(model.gpt_neox.layers[0].attention.dense.weight)
```

- The "freeze" function requires more memory than is available in this classroom.
- This code will run on a machine that has 8GB of memory, and so it will likely work if you run this code on your local machine.

```Python
# freeze the model
freeze(model)
print(model.gpt_neox.layers[0].attention.dense.weight)

# get model size after quantization
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

# run inference after quantizing the model
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Comparing "linear quantization" to "downcasting"

To recap the difference between the "linear quantization" method in this lesson and the "downcasting" method in the previous lesson:

- When downcasting a model, you convert the model's parameters to a more compact data type (bfloat16).  During inference, the model performs its calculations in this data type, and its activations are in this data type.  Downcasting may work with the bfloat16 data type, but the model performance will likely degrade with any smaller data type, and won't work if you convert to an integer data type (like the int8 in this lesson).


- In this lesson, you used another quantization method, "linear quantization", which enables the quantized model to maintain performance much closer to the original model by converting from the compressed data type back to the original FP32 data type during inference. So when the model makes a prediction, it is performing the matrix multiplications in FP32, and the activations are in FP32.  This enables you to quantize the model in data types smaller than bfloat16, such as int8, in this example.
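
The contrast between the two approaches can be sketched on a single toy layer in plain PyTorch. This is an illustration under assumptions, not the course's or quanto's implementation: the weights are random stand-ins, and the quantization is the symmetric per-tensor variant with one scale.

```Python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256) * 0.05   # small weights, stand-in for one layer
x = torch.randn(8, 256)            # stand-in for input activations
reference = x @ w.T                # fp32 result

# Downcasting: weights, activations, and the matmul all use bfloat16
y_bf16 = (x.to(torch.bfloat16) @ w.to(torch.bfloat16).T).to(torch.float32)

# Naive cast to int8: values below 1 in magnitude are truncated to 0
w_naive_int8 = w.to(torch.int8)

# Linear quantization: store int8 values plus a scale,
# then dequantize to fp32 before the matmul
scale = w.abs().max() / 127
w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
y_lq = x @ (w_q.to(torch.float32) * scale).T

print("bf16 max abs error:", (reference - y_bf16).abs().max().item())
print("naive int8 nonzero weights:", int(w_naive_int8.count_nonzero()))
print("linear quantization max abs error:", (reference - y_lq).abs().max().item())
```

The naive integer cast wipes out these small weights entirely, while the int8-plus-scale representation keeps the fp32 matmul output close to the reference.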

#### This is just the beginning...
- This course is intended to be a beginner-friendly introduction to the field of quantization. 🐣
- If you'd like to learn more about quantization, please stay tuned for another Hugging Face short course that goes into more depth on this topic (launching in a few weeks!) 🤗

## Did you like this course?

- If you liked this course, could you consider giving a rating and sharing what you liked? 💕
- If you did not like this course, could you also please share what you think could have made it better? 🙏

#### A note about the "Course Review" page.
The rating options are from 0 to 10.
- A score of 9 or 10 means you like the course.🤗
- A score of 7 or 8 means you feel neutral about the course (neither like nor dislike).🙄
- A score of 0, 1, 2, 3, 4, 5, or 6 means that you do not like the course. 😭
  - Whether you give a 0 or a 6, these are all defined as "detractors" according to the standard measurement called "Net Promoter Score". 🧐