<a href="https://colab.research.google.com/github/VridhiJ/LLMs/blob/main/Quantization_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center" style="color:green;font-size: 3em;">
Implementing Quantization Techniques</h1>

In this notebook, we will explore quantization techniques to optimize memory requirements.

## Import Libraries

In [None]:
!pip install datasets -q


In [None]:
import torch

from transformers import BertModel, BertTokenizer, DistilBertForSequenceClassification, DistilBertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
from tqdm import tqdm
from torch.optim import AdamW
import torch.quantization


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Introduction to Quantization


Quantization is a model compression technique used to reduce the size and computational requirements of LLMs. The central idea is to represent the model's weights and activations using lower-precision data types, such as `int8` or `float16`, instead of the standard `float32`. This significantly reduces the memory footprint and can speed up computation, since lower-precision arithmetic is generally cheaper on hardware that supports it.
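To make the idea of "representing weights in lower precision" concrete, here is a minimal sketch of affine int8 quantization in plain Python. The helper names `quantize` and `dequantize`, the sample weights, and the scale of 0.1 are assumptions chosen for illustration; real libraries pick the scale from the observed value range.

```python
def quantize(values, scale, zero_point=0, qmin=-128, qmax=127):
    """Map floats to int8 codes: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point=0):
    """Map int8 codes back to approximate floats: x_hat = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


# Hypothetical weights; the last one lies outside the representable range.
weights = [0.62, -1.37, 0.04, 3.2, -20.0]
scale = 0.1  # illustrative scale, not derived from the data

codes = quantize(weights, scale)
restored = dequantize(codes, scale)

for w, q, r in zip(weights, codes, restored):
    print(f"{w:>7.2f} -> {q:>5d} -> {r:>7.2f}")
```

Values inside the range come back with a small rounding error, while the out-of-range value is clipped to the smallest int8 code. Both effects are sources of the accuracy loss discussed in this notebook.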

There are various types of quantization techniques, including post-training quantization (PTQ), where the model is quantized after training, and quantization-aware training (QAT), where the model is trained with quantization in mind. Quantization often costs some model accuracy, but techniques like QAT help mitigate the loss by adjusting weights during training to account for the reduced precision. By balancing computational efficiency against model performance, quantization lets LLMs run effectively in real-world applications without extensive hardware resources.

First, we will explore the memory usage of different tensor data types in PyTorch. Understanding how the choice of data type affects memory consumption is crucial when working with large datasets or models in deep learning.


In [None]:
# Create a float32 tensor (4 bytes per element)
tensor = torch.randn(100, 100, dtype=torch.float32)
print(f"Memory (float32): {tensor.element_size() * tensor.nelement()} bytes")

# Cast the same values to float16 (2 bytes per element)
tensor_fp16 = tensor.to(dtype=torch.float16)
print(f"Memory (float16): {tensor_fp16.element_size() * tensor_fp16.nelement()} bytes")

# Quantize the same values to int8 (1 byte per element); scale and zero_point are illustrative
tensor_int8 = torch.quantize_per_tensor(tensor, scale=0.1, zero_point=0, dtype=torch.qint8)
print(f"Memory (int8): {tensor_int8.int_repr().element_size() * tensor_int8.numel()} bytes")

Memory (float32): 40000 bytes
Memory (float16): 20000 bytes
Memory (int8): 10000 bytes
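The numbers above follow from simple arithmetic: element storage is the number of elements times the bytes per element. The sketch below reproduces them without PyTorch; `BYTES_PER_ELEMENT` and `tensor_bytes` are hypothetical names introduced for illustration, and the count ignores the small overhead of storing the scale and zero point.

```python
# Bytes needed to store one element of each data type.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}


def tensor_bytes(shape, dtype):
    """Storage for the raw elements: number of elements times bytes per element."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * BYTES_PER_ELEMENT[dtype]


for dtype in BYTES_PER_ELEMENT:
    print(f"Memory ({dtype}): {tensor_bytes((100, 100), dtype)} bytes")
```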


## Quantize a Small NN Model

Next, we will explore the impact of data type conversion on the output of a BERT model using PyTorch. Specifically, we will compare the output shapes and memory usage of the BERT model when using different tensor data types: float32 and float16.

In [None]:
# Load the model and tokenizer
model = BertModel.from_pretrained("prajjwal1/bert-small")
tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-small")

# Tokenize a random sentence and run it through the model
input_text = "Quantization is useful!"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)

# Cast the model weights to float16 (half precision) in place and run the sentence through it again
model.half()
quantized_outputs = model(**inputs)

# Print the bytes used for both
print(f"Memory (float32): {outputs.last_hidden_state.element_size() * outputs.last_hidden_state.nelement()} bytes")
print(f"Memory (float16): {quantized_outputs.last_hidden_state.element_size() * quantized_outputs.last_hidden_state.nelement()} bytes")



Memory (float32): 16384 bytes
Memory (float16): 8192 bytes


Some basic questions:

- What is quantization and why is it important for large language models?

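  Quantization is a compression technique that reduces the precision of a model's weights (and often its activations) from a higher precision (float32 in our case) to a lower one (float16 or int8). This reduces memory requirements and computational load, which matters for large language models because their parameter counts make full-precision storage and inference expensive.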

- How does reducing precision from float32 to int8 impact memory usage?

  Reducing precision from float32 (32 bits) to int8 (8 bits) means that each weight needs only one quarter of the original storage, so memory usage for the weights drops by 75%.

- Explain the difference between per-layer and per-channel quantization. Why might per-channel quantization be more effective for certain tasks?

  Per-layer (per-tensor) quantization uses a single set of quantization parameters (scale and zero point) for all values in a tensor, while per-channel quantization uses a separate set for each channel of the tensor. This lets per-channel quantization adapt to differences in value ranges between channels, which improves accuracy when those ranges vary widely.
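The difference between per-tensor and per-channel quantization is easiest to see on a weight matrix whose rows have very different ranges. The following is a minimal sketch with made-up numbers; `symmetric_scale` and `roundtrip_error` are hypothetical helpers, and symmetric int8 quantization is assumed for simplicity.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    return max(abs(v) for v in values) / qmax


def roundtrip_error(values, scale, qmax=127):
    """Largest absolute error after quantizing to int8 and dequantizing."""
    worst = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        worst = max(worst, abs(v - q * scale))
    return worst


# Two output channels with very different weight ranges (made-up numbers).
weight = [
    [0.010, -0.020, 0.015, 0.005],  # small-range channel
    [8.0, -12.7, 3.3, 10.1],        # large-range channel
]

# Per-tensor: one scale shared by every element of the matrix.
tensor_scale = symmetric_scale([v for row in weight for v in row])
per_tensor_err = [roundtrip_error(row, tensor_scale) for row in weight]

# Per-channel: one scale per row.
per_channel_err = [roundtrip_error(row, symmetric_scale(row)) for row in weight]

print("per-tensor  max error per channel:", per_tensor_err)
print("per-channel max error per channel:", per_channel_err)
```

With a shared scale, the large-range channel dictates the step size and every value in the small-range channel rounds to zero. With one scale per channel, the small-range channel keeps its own fine step size.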




# Post-Training Quantization



Now that we have seen a small example of quantization, let us explore some quantization techniques, starting with Post-Training Quantization (PTQ).

PTQ optimizes pretrained neural network models by reducing the precision of weights and activations, decreasing memory usage and improving inference speed while largely preserving accuracy. There are two main types of PTQ: static and dynamic. Static quantization computes scaling factors for activations during a calibration phase using a representative dataset, so both weights and activations use fixed quantization parameters at inference. Dynamic quantization, by contrast, quantizes the weights ahead of time and computes the quantization parameters for activations on the fly at runtime, which makes it easier to implement because no calibration dataset is needed. Both strategies reduce model size and latency for deployment in resource-constrained environments.
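The two strategies differ mainly in where the activation scale comes from. The sketch below illustrates only that mechanism in plain Python, assuming symmetric int8 quantization; the function names and the sample activations are hypothetical and do not correspond to any PyTorch API.

```python
def scale_from_range(values, qmax=127):
    """Symmetric int8 scale covering the largest magnitude seen in `values`."""
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, qmax=127):
    """Round to the nearest int8 code and clip to the representable range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: observe representative activations once, then freeze the scale.
calibration_batches = [[0.2, -1.0, 0.7], [1.5, -0.3, 0.9]]
static_scale = scale_from_range([v for batch in calibration_batches for v in batch])

# Dynamic: derive the scale from each incoming batch at inference time.
new_activations = [3.0, -0.5, 0.1]  # range exceeds what calibration saw
dynamic_scale = scale_from_range(new_activations)

static_codes = quantize(new_activations, static_scale)
dynamic_codes = quantize(new_activations, dynamic_scale)

print("static :", static_codes, "-> first value restored as", static_codes[0] * static_scale)
print("dynamic:", dynamic_codes, "-> first value restored as", dynamic_codes[0] * dynamic_scale)
```

The static path does no range computation at inference but clips inputs that fall outside the calibrated range, which is why the calibration data needs to be representative. The dynamic path always covers the current batch at the cost of computing the range on every call.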

## Implementing Dynamic Quantization

For dynamic quantization, first, we will load a pre-trained DistilBERT model and its corresponding tokenizer, which will be used for sequence classification tasks.

In [None]:
# Load a pre-trained DistilBERT model and tokenizer
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we will apply dynamic quantization to the pre-trained DistilBERT model to reduce its size and improve inference speed without significant loss in accuracy.


In [None]:
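def apply_dynamic_quantization(model):
    # PyTorch's dynamic quantization kernels run on CPU, so move the model there first
    model = model.to('cpu')
    model.eval()
    # Replace every nn.Linear layer with a dynamically quantized int8 version
    return torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )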

Next, we will evaluate the performance of the quantized DistilBERT model using various metrics to gain a comprehensive understanding of its effectiveness.

In [None]:
def evaluate_model(model, data_loader):
  model.eval()
  all_predictions = []
  all_labels = []
  with torch.no_grad():
    for batch in data_loader:
      inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
      labels = batch['label']
      outputs = model(**inputs)
      logits = outputs.logits
      predictions = torch.argmax(outputs.logits, dim=1)
      all_predictions.extend(predictions.cpu().numpy())
      all_labels.extend(labels.cpu().numpy())

  accuracy = accuracy_score(all_labels, all_predictions)
  f1 = f1_score(all_labels, all_predictions, average='weighted')
  report = classification_report(all_labels, all_predictions)
  return accuracy, f1, report

Finally, we want to combine all the steps together. We will load a dataset, apply dynamic quantization to the pre-trained DistilBERT model, and evaluate its performance.


In [None]:
# Load the IMDB dataset
dataset = load_dataset('imdb')
training_dataset = dataset['train'].shuffle(seed=42).select(range(2000))
evaluation_dataset = dataset['test'].shuffle(seed=42).select(range(500))

# Create DataLoader for training and testing
training_dataloader = DataLoader(training_dataset, batch_size=16, shuffle=True)
evaluation_dataloader = DataLoader(evaluation_dataset, batch_size=16)

# Apply dynamic quantization
quantized_model = apply_dynamic_quantization(model)

# Evaluate the quantized model
accuracy, f1, report = evaluate_model(quantized_model, evaluation_dataloader)
print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
print(f"Classification Report:\n{report}")


Accuracy: 0.486
F1 Score: 0.4269136413861044
Classification Report:
              precision    recall  f1-score   support

           0       0.48      0.17      0.25       254
           1       0.49      0.81      0.61       246

    accuracy                           0.49       500
   macro avg       0.48      0.49      0.43       500
weighted avg       0.48      0.49      0.43       500
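The two averaged rows of the report can be checked by hand from the per-class rows. This small sketch copies the rounded per-class F1 scores and supports from the report above and recomputes the macro and weighted averages; because the inputs are already rounded to two decimals, the results only match the report to that precision.

```python
# Per-class F1 scores and supports copied from the classification report.
f1_per_class = {0: 0.25, 1: 0.61}
support = {0: 254, 1: 246}

total = sum(support.values())

# Macro average: plain mean over classes.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / total

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```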



Some observations:

- What are the trade-offs between static and dynamic quantization in terms of model accuracy, inference speed, and implementation complexity? (also explain why this might be the case)

 In terms of accuracy, static quantization can perform better because it pre-computes quantization parameters from calibration data, capturing a representative range for the model's activations, while dynamic quantization estimates these parameters in real time during inference, which can introduce more approximation error.

 For inference speed, static quantization is often faster since it relies on precomputed values, while dynamic quantization is slower since it calculates activation parameters at inference time.

 In terms of implementation, dynamic quantization is easier to set up since it doesn't need calibration data, while static quantization requires a calibration step, adding complexity but generally providing a better balance of speed and accuracy.

- When might you choose one method over another?

 I would choose dynamic quantization when I need rapid deployment on CPU with minimal setup, and static quantization for accuracy-sensitive applications where calibration data is available and reducing latency is critical.

- Please discuss the accuracy degradation when doing quantization and provide ways you may minimize this.

 Quantization can degrade model accuracy because of the reduced precision. We can minimize this by using per-channel quantization, which preserves more information in sensitive layers. We can also apply quantization-aware training, which lets the model learn to adapt its weights to the reduced precision.

# Quantization-Aware Training

Now we will explore the next technique, Quantization-Aware Training (QAT).

QAT is a technique designed to optimize neural networks for deployment in resource-constrained environments. By simulating low-precision arithmetic during training, QAT allows models to learn how to best adapt their weights for quantized operations, which typically improves accuracy compared to post-training quantization alone. In this section, we will implement QAT using PyTorch's quantization tools together with Hugging Face's Transformers and Datasets libraries, aiming to maintain model performance while reducing memory footprint and inference latency.
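"Simulating low-precision arithmetic during training" usually means inserting fake-quantization steps: values are quantized and immediately dequantized in the forward pass, and the gradient is passed through the rounding in the backward pass (a straight-through estimator). The sketch below shows this for a single scalar; `fake_quantize` and `ste_grad` are hypothetical names, and the fixed scale of 0.1 is an assumption for illustration.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Forward pass: quantize, then immediately dequantize, staying in float."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(x, scale, upstream_grad, qmin=-128, qmax=127):
    """Backward pass with a straight-through estimator.

    round() has zero gradient almost everywhere, so the estimator treats it as
    the identity inside the representable range and blocks the gradient outside.
    """
    inside = qmin * scale <= x <= qmax * scale
    return upstream_grad if inside else 0.0


scale = 0.1
for w in (0.234, 20.0):
    print(w, fake_quantize(w, scale), ste_grad(w, scale, upstream_grad=1.0))
```

The first weight is rounded to the nearest step and still receives its gradient. The second is clipped to the edge of the int8 range and receives no gradient in this simplified version.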

## Train a Model Normally

First, we will load a pretrained language model, `distilbert-base-uncased`.


In [None]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
model_name = 'distilbert-base-uncased'
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we want to load our dataset. The dataset we will be using in this section is MRPC from the GLUE benchmark.


In [None]:
from torch.utils.data import DataLoader
from datasets import load_dataset

# Create the dataloaders
data = load_dataset('glue', 'mrpc')
train_dataloader = DataLoader(data['train'].shuffle(seed=42).select(range(500)), batch_size=16, shuffle=True)
eval_dataloader = DataLoader(data['validation'].shuffle(seed=42).select(range(100)), batch_size=16)


Next, we want to write a method for training the model and a method for evaluating the model.

In [None]:
def train_model(model, train_dataloader, num_epochs=2, lr=2e-5):
  # AdamW applies the weight updates computed from the gradients
  optimizer = AdamW(model.parameters(), lr=lr)
  model.train()
  for epoch in range(num_epochs):
    total_loss = 0
    for batch in train_dataloader:
      inputs = tokenizer(batch['sentence1'], batch['sentence2'], return_tensors='pt', padding=True, truncation=True)
      labels = batch['label']
      optimizer.zero_grad()  # clear gradients left over from the previous batch
      outputs = model(**inputs, labels=labels)
      loss = outputs.loss
      loss.backward()
      optimizer.step()  # update the weights
      total_loss += loss.item()

    # Print average loss for the epoch
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}")

In [None]:
def evaluate_model(model, eval_dataloader):
  model.eval()
  all_predictions = []
  all_labels = []
  with torch.no_grad():
    for batch in eval_dataloader:
      inputs = tokenizer(batch['sentence1'], batch['sentence2'], return_tensors='pt', padding=True, truncation=True)
      labels = batch['label']
      outputs = model(**inputs)
      logits = outputs.logits
      predictions = torch.argmax(outputs.logits, dim=1)
      all_predictions.extend(predictions.cpu().numpy())
      all_labels.extend(labels.cpu().numpy())
  accuracy = accuracy_score(all_labels, all_predictions)
  print(f"Accuracy: {accuracy}")

In [None]:
# Train the model for 20 epochs
train_model(model, train_dataloader, num_epochs=20)
evaluate_model(model, eval_dataloader)

Epoch 1/20, Loss: 0.6899
Epoch 2/20, Loss: 0.6916
Epoch 3/20, Loss: 0.6909
Epoch 4/20, Loss: 0.6916
Epoch 5/20, Loss: 0.6933
Epoch 6/20, Loss: 0.6907
Epoch 7/20, Loss: 0.6894
Epoch 8/20, Loss: 0.6853
Epoch 9/20, Loss: 0.6930
Epoch 10/20, Loss: 0.6927
Epoch 11/20, Loss: 0.6896
Epoch 12/20, Loss: 0.6907
Epoch 13/20, Loss: 0.6906
Epoch 14/20, Loss: 0.6913
Epoch 15/20, Loss: 0.6920
Epoch 16/20, Loss: 0.6916
Epoch 17/20, Loss: 0.6907
Epoch 18/20, Loss: 0.6896
Epoch 19/20, Loss: 0.6925
Epoch 20/20, Loss: 0.6930
Accuracy: 0.49
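The per-epoch losses logged above stay close to 0.69. As a reference point, a two-class classifier that assigns probability 0.5 to each class has a cross-entropy loss of ln 2, so a loss at this level together with an accuracy near 0.5 indicates predictions that are close to chance. The short check below computes that reference value; `cross_entropy` is a hypothetical helper written for this illustration.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log-likelihood of the true class."""
    return -math.log(probabilities[true_class])


# A classifier that is maximally unsure between two classes.
chance_loss = cross_entropy([0.5, 0.5], true_class=1)
print(f"Chance-level loss for 2 classes: {chance_loss:.4f}")
```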


##  Implementing Quantization-Aware Training


Next, we want to use the same model and the same task to perform Quantization Aware Training. Complete the cells below to get a sense of how this works.

In [None]:
# Recreate a new model (should be same as the cell above)
model_name = 'distilbert-base-uncased'
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


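Because PyTorch's eager-mode quantization supports only weight-only quantization for embedding layers (via `float_qparams_weight_only_qconfig`), and not the default QAT configuration, we will have to treat them separately.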

We will prepare the model for quantization aware training.

In [None]:
from torch.ao.quantization import float_qparams_weight_only_qconfig
# Weight-only quantization config for embedding layers
embedding_qconfig = float_qparams_weight_only_qconfig

# Default QAT config (fbgemm backend) for all other layers
default_qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')

# Apply embedding_qconfig to embedding layers and default_qconfig to all other layers
def set_qconfig_for_model(model):
  for name, module in model.named_modules():
    if isinstance(module, torch.nn.Embedding):
      module.qconfig = embedding_qconfig
    else:
      module.qconfig = default_qconfig

In [None]:
# Call the method in the previous cell on our model
set_qconfig_for_model(model)

# Set the model to training mode
model.train()

# Prepare the model for QAT
model = torch.ao.quantization.prepare_qat(model, inplace=False)




In [None]:
## Train the model again for 20 epochs
train_model(model, train_dataloader, num_epochs=20)

Epoch 1/20, Loss: 0.6921
Epoch 2/20, Loss: 0.6922
Epoch 3/20, Loss: 0.6927
Epoch 4/20, Loss: 0.6909
Epoch 5/20, Loss: 0.6917
Epoch 6/20, Loss: 0.6920
Epoch 7/20, Loss: 0.6913
Epoch 8/20, Loss: 0.6890
Epoch 9/20, Loss: 0.6919
Epoch 10/20, Loss: 0.6904
Epoch 11/20, Loss: 0.6900
Epoch 12/20, Loss: 0.6905
Epoch 13/20, Loss: 0.6916
Epoch 14/20, Loss: 0.6942
Epoch 15/20, Loss: 0.6927
Epoch 16/20, Loss: 0.6917
Epoch 17/20, Loss: 0.6947
Epoch 18/20, Loss: 0.6897
Epoch 19/20, Loss: 0.6901
Epoch 20/20, Loss: 0.6904


In [None]:
# Set the model to evaluation mode
model.eval()

# Run your evaluate_model method on the eval_dataloader and model
evaluate_model(model, eval_dataloader)

Accuracy: 0.49


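- Discuss the general procedure of QAT. Specifically, how are forward and backward propagation different from normal deep learning training?

 QAT simulates quantization during training instead of applying it only afterwards, as post-training quantization does, which typically leads to better accuracy. In the forward pass, fake-quantization modules round weights and activations to the values they would take in low precision and then convert them back to float, so the loss reflects the quantization error. In the backward pass, rounding has zero gradient almost everywhere, so gradients are passed through the rounding step as if it were the identity (the straight-through estimator), and the underlying float32 weights are updated as usual.
 The quantization can be tailored to different layers: for example, we quantized only the weights of the embedding layers using `float_qparams_weight_only_qconfig`, keeping their activations in full precision to limit quantization error, while for all other layers we applied quantization to both weights and activations using `default_qconfig`.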

- What are the potential trade-offs when using quantization aware training, and how can they affect model deployment in resource-constrained environments?

 Models trained with QAT can run efficiently on devices with limited resources, using low-precision calculations that reduce memory use and speed up inference. The costs are a longer, more complex training process and possibly a small loss of accuracy compared to full-precision models. The trade-off is often worthwhile because of the efficiency gained, which makes QAT-trained models well suited to deployment in resource-constrained environments.
- Compare the training results of Quantization Aware Training (QAT) and standard training, focusing on differences in training time, model accuracy, and inference speed. (I suggest using the same seed to train these two methods for consistency)

 QAT and standard training differ in training efficiency and model performance. QAT inserts fake-quantization operations into training, which adds overhead to every step and can slow convergence. In our comparison, QAT reached lower accuracy (0.50) than standard training (0.64); note that the run logged above reports 0.49 for both models, with the loss staying near 0.69 throughout, so that particular run does not separate the two methods. QAT-trained models are optimized for low-precision inference, which reduces memory usage and speeds up inference, making them well suited to resource-constrained environments. Standard training, in contrast, trains faster and can yield higher accuracy, but the resulting float32 model needs more memory and is slower at inference.

# Advanced Quantization Techniques

## Mixed Precision Training
Now we will explore mixed precision training.

Research and implement mixed precision training, from initializing a model to training and evaluating it using mixed precision. You can choose any model or dataset, or even implement your own custom neural network. The main goal of this exercise is to implement the mixed precision technique, and accuracy is not the primary concern.
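One reason mixed precision training needs care is that small gradients can underflow to zero in float16. Loss scaling works around this by multiplying the loss (and therefore the gradients) by a large factor before the backward pass and dividing it out again before the weight update. The sketch below demonstrates the effect with the standard library's half-precision `struct` format; `to_float16`, the gradient value, and the scale factor of 2**16 are assumptions chosen for illustration.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8      # smaller than anything float16 can represent
loss_scale = 65536.0  # 2**16, an illustrative scale factor

unscaled = to_float16(tiny_grad)             # underflows to zero
scaled = to_float16(tiny_grad * loss_scale)  # now within float16's range
recovered = scaled / loss_scale              # unscale in higher precision

print("stored directly in float16:", unscaled)
print("scaled, stored, then unscaled:", recovered)
```

Without scaling, the gradient is lost entirely. With scaling, it survives the trip through float16 with only a small rounding error.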

In [None]:
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from torch.utils.data import DataLoader
from datasets import load_dataset
from torch.optim import AdamW
from sklearn.metrics import accuracy_score
from transformers import get_scheduler

In [None]:
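# Start from a fresh float32 model on the target device; the previous `model` was prepared for QAT
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

# Optimizer and linear learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=2e-5)
num_training_steps = len(train_dataloader) * 20  # 20 epochs
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)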

In [None]:
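def model_train_pt(model, train_dataloader, optimizer, lr_scheduler, num_epochs=2):
  use_amp = device == "cuda"  # float16 autocast and gradient scaling only take effect on GPU
  model.to(device)
  model.train()
  scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
  for epoch in range(num_epochs):
    total_loss = 0
    for batch in train_dataloader:
      # Tokenize and move the inputs to the same device as the model
      inputs = tokenizer(batch['sentence1'], batch['sentence2'], return_tensors='pt', padding=True, truncation=True).to(device)
      labels = batch['label'].to(device)
      optimizer.zero_grad()
      # Run the forward pass in reduced precision where it is safe to do so
      with torch.autocast(device_type=device, enabled=use_amp):
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
      # Scale the loss so small gradients do not underflow in float16
      scaler.scale(loss).backward()
      scaler.step(optimizer)
      scaler.update()
      lr_scheduler.step()
      total_loss += loss.item()

    # Average loss for the epoch
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}")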

In [None]:
def model_eval_pt(model, eval_dataloader):
  use_amp = device == "cuda"  # autocast to float16 only takes effect on GPU
  model.to(device)
  model.eval()
  all_predictions = []
  all_labels = []
  with torch.no_grad():
    for batch in eval_dataloader:
      inputs = tokenizer(batch['sentence1'], batch['sentence2'], return_tensors='pt', padding=True, truncation=True).to(device)
      labels = batch['label']
      with torch.autocast(device_type=device, enabled=use_amp):
        outputs = model(**inputs)
      predictions = torch.argmax(outputs.logits, dim=1)
      all_predictions.extend(predictions.cpu().numpy())
      all_labels.extend(labels.cpu().numpy())
  # Compute accuracy once, after all batches have been collected
  accuracy = accuracy_score(all_labels, all_predictions)
  print(f"Accuracy: {accuracy}")

In [None]:
model_train_pt(model, train_dataloader, optimizer, lr_scheduler, num_epochs=20)
model_eval_pt(model, eval_dataloader)

# Observations

- How does quantization impact the trade-off between model accuracy and computational efficiency, and how can you mitigate potential accuracy losses during quantization?

 Quantization reduces computational cost and memory usage by representing model parameters and activations in lower precision, such as 8-bit integers instead of 32-bit floats. This leads to faster inference and lower memory requirements, which are especially beneficial for deployment in resource-constrained environments. However, reducing precision can reduce accuracy, as the lower-bit representation may not fully capture variations in the data. To mitigate this loss, QAT is a useful tool: it simulates quantization during training and enables the model to adapt to low precision, typically giving better accuracy than post-training quantization (PTQ), where quantization is applied after training is completed.
- Explain the differences between post-training quantization and quantization-aware training (QAT). In what scenarios might one be preferred over the other, and why?

 PTQ applies quantization to a fully trained model, converting parameters to lower precision without retraining, which makes it computationally simpler and faster to apply. It is suitable when retraining is not feasible and a model must be deployed quickly in a resource-constrained environment. However, PTQ may introduce more accuracy loss, particularly in complex models. In those cases QAT can be used: it simulates quantization effects during training, allowing the model to adjust its parameters for low precision, and typically achieves better accuracy than PTQ. QAT requires more computational resources and a longer training process, so it is preferred when accuracy matters more than training cost.
- What are the challenges when quantizing models with layers that involve non-linear operations, like activation functions, and how might these challenges affect real-world applications?

 Quantizing layers with non-linear operations, such as activation functions, is difficult because these operations can be sensitive to small changes in input precision. Since we intentionally lower the precision, the quantized values may not capture the variability accurately, leading to loss of detail. This can affect real-world applications such as image recognition or natural language processing, where fine-grained features are essential.
- In the context of deploying models on mobile or embedded devices, how does quantization help meet the hardware constraints, and what are the potential limitations or concerns when quantizing for edge devices?

 Mobile and embedded devices have limited memory and compute, so quantization helps meet their hardware constraints by shrinking the model and speeding up inference. However, quantization may be limiting for applications on these devices that require high precision, such as object detection or speech recognition, where the accuracy loss may be unacceptable.