### 1. Asymmetric Linear Quantization from scratch: FP32 -> Lower bits

In this notebook we walk through a simple linear quantization method. The goal is to compress an ML model from __32-bit floating point__ precision to a __lower one__ while preserving as much of the model's accuracy as possible.

In general, quantization reduces the memory needed to store, load, and run ML models.

__Linear quantization__ stores the quantized model weights together with a small number of parameters that let us map them back, as closely as possible, to the original higher-precision values via a set of linear operations. The simplest way to do that is to store a scaling coefficient and a zero point that relate the higher-precision and lower-precision ranges.

In __asymmetric linear quantization__ we track, for each quantized tensor, a __scaling coefficient__ and the location of the __zero point__ within the quantized range of values.
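As a minimal sketch of the idea, the snippet below quantizes a small array to 4 bits and maps it back. The helper names `affine_quantize` and `affine_dequantize` are illustrative only. The zero point is kept real-valued here, which is an assumption of this sketch (many implementations round it to an integer).

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 8):
    """Map real values onto the signed integer grid [qmin, qmax]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)   # real units per integer step
    zero_point = qmin - x.min() / scale           # real-valued zero point (assumption)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def affine_dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Approximate reconstruction of the original real values."""
    return scale * (q - zero_point)

x = np.array([-1.0, -0.25, 0.0, 0.5, 3.0])
q, scale, zero_point = affine_quantize(x, num_bits=4)
x_hat = affine_dequantize(q, scale, zero_point)

# Rounding moves each value by at most half a step of the integer grid
max_err = float(np.max(np.abs(x - x_hat)))
print(q, x_hat, max_err)
```

The smallest and largest inputs land on the ends of the integer range, and every other value is off by no more than half of `scale`.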

In [1]:
import copy
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn import datasets

import pandas as pd

import matplotlib.pyplot as plt
from watermark import watermark

%config InlineBackend.figure_format = "retina"

In [2]:
print(watermark())
print(watermark(packages="numpy,sklearn,matplotlib"))

Last updated: 2025-04-05T01:05:03.163055+03:00

Python implementation: CPython
Python version       : 3.12.9
IPython version      : 9.0.2

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 5.15.0-136-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

numpy     : 2.2.4
sklearn   : 1.6.1
matplotlib: 3.10.1



In this example we use the wine dataset and a neural network that classifies the kind of wine.

#### Preparing example dataset

In [3]:
seed = 42
test_size = 0.2

np.random.seed(seed)
data = datasets.load_wine()

X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

enc = OneHotEncoder(handle_unknown='ignore')
scaler = StandardScaler()

y_train_enc = enc.fit_transform(y_train.reshape(-1, 1))
y_test_enc = enc.transform(y_test.reshape(-1, 1))
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

In [4]:
print(X_train_sc.shape, y_train_enc.shape)

(142, 13) (142, 3)


#### Preparing neural network classifier

In [5]:
class NNet:
    def __init__(
        self, 
        input_size: int, 
        hidden_size: int, 
        output_size: int
    ):
        # input_size, number of features in our dataset
        # hidden_size, hidden feature vector
        # output_size, the number of classes in our task

        # First layer with bias
        self.W1 = np.random.randn(input_size, hidden_size).astype(np.float32)
        self.b1 = np.random.randn(1, hidden_size).astype(np.float32)
        
        # Second layer with bias
        self.W2 = np.random.randn(hidden_size, output_size).astype(np.float32)
        self.b2 = np.random.randn(1, output_size).astype(np.float32)

    def forward(self, X: np.ndarray) -> np.ndarray:

        # Apply weights: z1 = xW1 + b1
        self.z1 = np.dot(X, self.W1) + self.b1
        # Apply ReLU activation function
        self.a1 = np.maximum(0, self.z1)
        # Apply weights: z2 = a1W2 + b2
        self.z2 = np.dot(self.a1, self.W2) + self.b2

        # Apply softmax activation (shift by the row max for numerical stability)
        exp_z2 = np.exp(self.z2 - np.max(self.z2, axis=1, keepdims=True))
        self.a2 = exp_z2 / np.sum(exp_z2, axis=1, keepdims=True)
        return self.a2

    def backward(self, X: np.ndarray, y: np.ndarray, learning_rate: float) -> None:

        m = X.shape[0]

        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        
        dz1 = da1 * (self.z1 > 0)  # Gradient for ReLU
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m

        # Update weights with fixed learning_rate
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
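The `backward` method relies on the fact that, for softmax followed by cross-entropy, the gradient with respect to the logits is simply `a2 - y`. A hedged way to check this is a numerical gradient test. The sketch below uses a tiny stand-alone network with made-up shapes (not the notebook's `NNet` class) and compares the analytic gradient for `W2` against central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                  # 5 samples, 4 features (illustrative sizes)
y = np.eye(3)[rng.integers(0, 3, size=5)]    # one-hot labels for 3 classes
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=(1, 6))
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=(1, 3))

def forward(W2_):
    a1 = np.maximum(0, X @ W1 + b1)                      # ReLU hidden layer
    z2 = a1 @ W2_ + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))       # stable softmax
    return a1, e / e.sum(axis=1, keepdims=True)

def loss(W2_):
    _, a2 = forward(W2_)
    return -np.mean(np.sum(y * np.log(a2), axis=1))      # mean cross-entropy

# Analytic gradient, same formula as in `backward`
a1, a2 = forward(W2)
dW2_analytic = a1.T @ (a2 - y) / X.shape[0]

# Numerical gradient via central differences
h = 1e-5
dW2_numeric = np.zeros_like(W2)
for idx in np.ndindex(*W2.shape):
    step = np.zeros_like(W2)
    step[idx] = h
    dW2_numeric[idx] = (loss(W2 + step) - loss(W2 - step)) / (2 * h)

max_diff = float(np.max(np.abs(dW2_analytic - dW2_numeric)))
print(f"Max difference between analytic and numeric gradient: {max_diff:.2e}")
```

A difference close to floating-point noise suggests that the analytic formula matches the loss being minimized.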

In [6]:
# Set model parameters
n_features = X.shape[1]
input_size = n_features
hidden_size = 100
output_size = len(np.unique(y))

# Set model training configuration
epochs = 200
batch_size = 10
lr = 0.1
eps = 1e-8
show_epochs = 10
train_size = len(y_train)

Let's train our simple classifier.

In [7]:
model = NNet(input_size, hidden_size, output_size)

for epoch in range(epochs):
  loss_accum = 0
  for i in range(0, train_size, batch_size):
    X_batch = X_train_sc[i: i + batch_size, :]
    y_batch = y_train_enc[i: i + batch_size, :].toarray()

    y_pred = model.forward(X_batch)
    ce_loss = -np.mean( np.sum(np.multiply(y_batch, np.log(y_pred + eps)), axis=1) )
    model.backward(X_batch, y_batch, lr)
    loss_accum += ce_loss
  if epoch % show_epochs == 0:
    print(f"Epoch: {epoch}. Train loss: {loss_accum}")

Epoch: 0. Train loss: 47.7868022582481
Epoch: 10. Train loss: 0.0031307487465227204
Epoch: 20. Train loss: 0.0017780614852699924
Epoch: 30. Train loss: 0.001298985522747322
Epoch: 40. Train loss: 0.0010395265328199085
Epoch: 50. Train loss: 0.0008721869063203863
Epoch: 60. Train loss: 0.0007540463131598365
Epoch: 70. Train loss: 0.0006655551491054781
Epoch: 80. Train loss: 0.0005966229607126343
Epoch: 90. Train loss: 0.0005411552106892298
Epoch: 100. Train loss: 0.0004954622940959396
Epoch: 110. Train loss: 0.0004571125890286121
Epoch: 120. Train loss: 0.00042442829149903983
Epoch: 130. Train loss: 0.00039621811734276543
Epoch: 140. Train loss: 0.00037160832295757436
Epoch: 150. Train loss: 0.00034993862444279976
Epoch: 160. Train loss: 0.00033070528697895854
Epoch: 170. Train loss: 0.0003135132574756775
Epoch: 180. Train loss: 0.0002980508060238217
Epoch: 190. Train loss: 0.000284067189271836


Evaluate the model on the test set

In [8]:
y_pred_test = np.argmax(model.forward(X_test_sc), axis=1)
print(classification_report(y_test, y_pred_test, target_names=data.target_names))  

              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        14
     class_1       1.00      1.00      1.00        14
     class_2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36



Everybody wishes they could get such metrics on real data 😄

Now that we have a trained model, we can apply quantization. But before doing that, let's measure the current model storage, assuming 32-bit floating point weights:

In [9]:
def calculate_model_storage(model_obj: object, num_bits: int) -> int:
    """
    Calculate the total storage size of a model in bytes.
    Note: the storage needed for scaling coefficients and zero points is not included.

    Parameters:
        model_obj (NNet): The neural network model with weights and biases.
        num_bits (int): The bit precision (e.g., 32 for FP32, 8 for INT8).

    Returns:
        int: Total storage size in bytes.
    """
    total_params = 0

    # Count parameters for each layer
    total_params += np.prod(model_obj.W1.shape) + np.prod(model_obj.b1.shape)
    total_params += np.prod(model_obj.W2.shape) + np.prod(model_obj.b2.shape)

    # Calculate storage in bytes
    total_bits = total_params * num_bits
    total_bytes = total_bits // 8

    return total_bytes

In [10]:
print(f"Initial model size is {calculate_model_storage(model, 32) / 1024:.2f}KB")

Initial model size is 6.65KB
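The figure above ignores the scale and zero point that each quantized tensor needs. As a rough sketch, the helper below (a hypothetical `storage_bytes`, not part of the notebook) adds that overhead, assuming one FP32 scale and one FP32 zero point per tensor and ideal bit packing for sub-byte widths. Unlike `calculate_model_storage`, it rounds up to whole bytes.

```python
def storage_bytes(n_params: int, num_bits: int, n_tensors: int = 0, coeff_bits: int = 32) -> int:
    """Bytes for n_params values at num_bits each, plus a (scale, zero point) pair per tensor."""
    weight_bits = n_params * num_bits
    overhead_bits = n_tensors * 2 * coeff_bits      # scale + zero point for every quantized tensor
    return (weight_bits + overhead_bits + 7) // 8   # round up to whole bytes

# Parameter count of a 13 -> 100 -> 3 network: W1, b1, W2, b2
n_params = 13 * 100 + 100 + 100 * 3 + 3

fp32_bytes = storage_bytes(n_params, 32)
int8_bytes = storage_bytes(n_params, 8, n_tensors=4)
print(f"FP32: {fp32_bytes / 1024:.2f} KB, INT8 incl. coefficients: {int8_bytes / 1024:.2f} KB")
```

For a model this small the overhead is 32 bytes in total, so it barely changes the comparison, but it is worth keeping in mind when quantizing per channel or per group.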


In [11]:
# We will store initial results for further comparison
RESULTS = {
    # Macro F1, model size in KB, squared error between initial and dequantized weights (reported as "MSE")
    "fp32": [1.00, 6.65, 0]
}

### Apply linear quantization


We will stick to the following procedure for each quantized candidate and then pick the best model:

1. Apply cross-layer equalization (CLE)
2. Apply linear quantization
3. Calculate the squared error (reported as MSE) between the FP32 weights and the weights dequantized from lower precision
4. Evaluate model storage and accuracy
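Step 1 is only safe if rescaling does not change what the network computes. The sketch below illustrates why, on a small random two-layer ReLU network with made-up shapes: dividing the weights and bias of a hidden neuron by a positive factor and multiplying the matching row of the next layer by the same factor leaves the logits unchanged, because ReLU commutes with positive scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                         # illustrative batch
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))

def logits(W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Positive per-neuron scaling factors, relative to the largest column of W1
max_per_neuron = np.max(np.abs(W1), axis=0)
s = max_per_neuron / np.max(max_per_neuron)

W1_eq, b1_eq = W1 / s, b1 / s        # scale the first layer up
W2_eq = W2 * s.reshape(-1, 1)        # compensate in the second layer

same_output = np.allclose(logits(W1, b1, W2, b2), logits(W1_eq, b1_eq, W2_eq, b2))
print("Outputs preserved:", same_output)
print("Max |W1| per neuron after equalization:", np.max(np.abs(W1_eq), axis=0))
```

After the rescaling every hidden neuron's weight column has the same maximum magnitude, so a single per-tensor scale wastes fewer quantization levels on small columns.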

In [12]:
def cross_layer_equalization(W1: np.ndarray, W2: np.ndarray, b1: np.ndarray) -> tuple:
    """
    Normalize the weights across two consecutive layers to balance their scales.
    Adjusts W1, W2, and b1 in place to equalize the activation ranges.
    """
    
    max_per_neuron = np.max(np.abs(W1), axis=0)  # Max absolute weight per hidden neuron in W1
    scaling_factors = max_per_neuron / np.max(max_per_neuron)  # Normalize scaling

    # Adjust W1 and b1
    W1 /= scaling_factors
    b1 /= scaling_factors

    # Compensate in W2
    W2 *= scaling_factors.reshape(-1, 1)

    return W1, W2, b1
     

def quantize_tensor(tensor: np.ndarray, num_bits: int) -> tuple:
    """
    Apply asymmetric linear quantization to a tensor.

    Returns the quantized tensor, its dequantized reconstruction,
    the scaling factor, and the (real-valued) zero point.
    """

    qmin = -(2 ** (num_bits - 1))     # Minimum value in quantized range
    qmax = (2 ** (num_bits - 1)) - 1  # Maximum value in quantized range

    min_val = np.min(tensor)          # Minimum real value of a tensor
    max_val = np.max(tensor)          # Maximum real value of a tensor

    scale = (max_val - min_val) / (qmax - qmin) # Scaling factor
    zero_point = qmin - min_val / scale         # Zero point location for a quantized range

    # The quantized matrix can be stored along with the scaling factor and zero point for dequantization
    quantized = np.round(zero_point + tensor / scale).clip(qmin, qmax).astype(int)
    dequantized = scale * (quantized - zero_point)

    return quantized, dequantized, scale, zero_point


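def get_quant_mse(x1: np.ndarray, x2: np.ndarray, precision: int = 4) -> float:
    # Note: despite the name, this returns the sum of squared errors (it is not divided by the element count)
    return np.round(np.sum((x1 - x2) ** 2), precision)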
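To build some intuition for how the error depends on bit width: rounding moves each value by at most half of `scale`, and `scale` grows roughly fourfold each time two bits are removed, so the squared error can be expected to grow by roughly an order of magnitude per step. The sketch below checks this on random data. It uses a stand-alone helper `quantization_mse` (hypothetical, and a true mean rather than a sum) on synthetic weights, not on the trained model.

```python
import numpy as np

def quantization_mse(tensor: np.ndarray, num_bits: int) -> float:
    """Mean squared error of an asymmetric quantize/dequantize round trip."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (tensor.max() - tensor.min()) / (qmax - qmin)
    zero_point = qmin - tensor.min() / scale
    q = np.clip(np.round(tensor / scale + zero_point), qmin, qmax)
    return float(np.mean((tensor - scale * (q - zero_point)) ** 2))

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)     # synthetic stand-in for a weight tensor

errors = {bits: quantization_mse(weights, bits) for bits in (8, 6, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: MSE = {err:.6f}")
```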

In [13]:
weights_precisions_in_bits = (8, 6, 4, 2)

In [14]:
for bits in weights_precisions_in_bits:

  print(f"### Quantization with Bits: {bits}")

  model_quantized = copy.deepcopy(model)

  # Step 1: Apply CLE
  W1, W2, b1 = cross_layer_equalization(model_quantized.W1, model_quantized.W2, model_quantized.b1)
  b2 = model_quantized.b2

  # Step 2: Quantize weights and biases of the model
  q_W1, dq_W1, scale_W1, zp_W1 = quantize_tensor(W1, bits)
  q_b1, dq_b1, scale_b1, zp_b1 = quantize_tensor(b1, bits)
  q_W2, dq_W2, scale_W2, zp_W2 = quantize_tensor(W2, bits)
  q_b2, dq_b2, scale_b2, zp_b2 = quantize_tensor(b2, bits)
  print(f"#Original FP32 W1:\n{W1}\n#Quantized W1 with {bits} bits:\n{q_W1}\n#Dequantized W1:\n{dq_W1}")
  quant_mse = np.round(get_quant_mse(dq_W1, W1) + get_quant_mse(dq_W2, W2) + get_quant_mse(dq_b2, b2) + get_quant_mse(dq_b1, b1), 4)
  print(f"Quantization MSE: {quant_mse}")

  # Step 3: Assign new weights to the model
  model_quantized.W1 = dq_W1
  model_quantized.b1 = dq_b1
  model_quantized.W2 = dq_W2
  model_quantized.b2 = dq_b2

  storage_kb = calculate_model_storage(model_quantized, num_bits=bits) / 1024
  print(f"INT{bits} Model Storage: {storage_kb :.2f} KB")
  y_pred_test_quant = np.argmax(model_quantized.forward(X_test_sc), axis=1)
  report = classification_report(y_test, y_pred_test_quant, target_names=data.target_names)
  report_dict = classification_report(y_test, y_pred_test_quant, target_names=data.target_names, output_dict=True)
  print(report)

  RESULTS.update(
      {
          f"Q_{bits}bit": [
              float(report_dict["macro avg"]["f1-score"]),
              float(storage_kb),
              float(quant_mse),
          ]
      }
  )

### Quantization with Bits: 8
#Original FP32 W1:
[[ 0.7547377  -0.31663933  0.31212547 ...  0.15707463 -1.2657021
  -1.021201  ]
 [-3.2211206  -0.63667977 -0.47048968 ...  0.46516895  0.58665264
  -3.0792668 ]
 [ 0.8575016   0.7706897   1.961209   ...  0.61525416  3.1439714
   1.7663364 ]
 ...
 [ 3.1434956   1.5586464  -0.27126428 ... -3.799705    2.5548832
   0.82688206]
 [ 2.1054003  -3.799705    3.7997048  ...  1.5569623   0.27901947
   3.6949296 ]
 [-0.33069095 -0.5592149  -0.60601294 ... -0.7499743  -0.2171366
   0.553019  ]]
#Quantized W1 with 8 bits:
[[  25  -11   10 ...    5  -43  -35]
 [-109  -22  -16 ...   15   19 -104]
 [  28   25   65 ...   20  105   59]
 ...
 [ 105   52  -10 ... -128   85   27]
 [  70 -128  127 ...   52    9  123]
 [ -12  -19  -21 ...  -26   -8   18]]
#Dequantized W1:
[[ 0.7599413  -0.31291669  0.31291714 ...  0.16390909 -1.26656823
  -1.02815535]
 [-3.23347455 -0.64073441 -0.46192474 ...  0.4619252   0.58113164
  -3.0844665 ]
 [ 0.84934614  0.7599413   1.

In [15]:
RESULTS

{'fp32': [1.0, 6.65, 0],
 'Q_8bit': [1.0, 1.6630859375, 0.1081],
 'Q_6bit': [1.0, 1.2470703125, 1.7699],
 'Q_4bit': [1.0, 0.8310546875, 31.1959],
 'Q_2bit': [0.927741935483871, 0.4150390625, 786.4405]}

In [16]:
df_results = (
    pd.DataFrame.from_dict(RESULTS, orient="index", columns=["macro_avg_f1", "Kb", "MSE"])
    .rename_axis("Weights")
    .reset_index()
)

# Simple efficiency metric:
# how much accuracy (measured by `macro_avg_f1`) we get per KB of model storage
df_results["macro_avg_f1_per_Kb"] = df_results["macro_avg_f1"] / df_results["Kb"]

In [17]:
df_results

   Weights  macro_avg_f1        Kb       MSE  macro_avg_f1_per_Kb
0     fp32           1.0      6.65       0.0             0.150376
1   Q_8bit           1.0  1.663086    0.1081             0.601292
2   Q_6bit           1.0   1.24707    1.7699             0.801879
3   Q_4bit           1.0  0.831055   31.1959              1.20329
4   Q_2bit      0.927742  0.415039  786.4405             2.235312


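On top of that, we need to keep the following linear coefficients (the scale and zero point of each tensor) to dequantize the weights at inference time. Note that the variables below hold the values from the last loop iteration, i.e. the 2-bit model: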

In [18]:
linear_coefficients = {
    "scale_W1": scale_W1,
    "zp_W1": zp_W1,
    "scale_b1": scale_b1,
    "zp_b1": zp_b1,
    "scale_W2": scale_W2,
    "zp_W2": zp_W2,
    "scale_b2": scale_b2,
    "zp_b2": zp_b2
}
linear_coefficients

{'scale_W1': np.float32(2.5331368),
 'zp_W1': np.float32(-0.5),
 'scale_b1': np.float32(3.3597019),
 'zp_b1': np.float32(-0.7340083),
 'scale_W2': np.float32(1.3195125),
 'zp_W2': np.float32(-0.4996891),
 'scale_b2': np.float32(0.067290716),
 'zp_b2': np.float32(-8.078146)}
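To make the role of these coefficients concrete, here is a hedged sketch of the save/load round trip for one tensor at 8 bits. It uses a random stand-in matrix rather than the trained weights and an in-memory buffer instead of a file. The quantization follows the same scheme as `quantize_tensor`, and the archive layout (keys `q`, `scale`, `zero_point`) is an assumption of this example.

```python
import io
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(13, 100)).astype(np.float32)   # stand-in for a trained weight matrix

# Quantize to 8 bits with a real-valued zero point
qmin, qmax = -128, 127
scale = (W.max() - W.min()) / (qmax - qmin)
zero_point = qmin - W.min() / scale
q = np.clip(np.round(W / scale + zero_point), qmin, qmax).astype(np.int8)

# "Save" only the int8 tensor plus its two coefficients
buffer = io.BytesIO()
np.savez(buffer, q=q, scale=np.float32(scale), zero_point=np.float32(zero_point))

# "Load" and dequantize, as would happen before inference
buffer.seek(0)
stored = np.load(buffer)
W_hat = stored["scale"] * (stored["q"].astype(np.float32) - stored["zero_point"])

max_abs_error = float(np.max(np.abs(W - W_hat)))
print(f"int8 payload: {q.nbytes} bytes, FP32 original: {W.nbytes} bytes, max error: {max_abs_error:.5f}")
```

Storing the tensor as `int8` is what actually yields the 4x saving. For 6, 4, or 2 bits the values would additionally need to be bit-packed, which this sketch does not cover.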

Choosing the best model depends on the specific circumstances. For example, if we want to balance model accuracy, measured by `macro_avg_f1`, against storage efficiency, we can pick the `Q_4bit` solution. If we need to use as little memory as possible and can sacrifice some accuracy for it, `Q_2bit` is our choice.
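One way to make such a choice explicit is to fix an accuracy floor and take the smallest model that clears it. The sketch below does this with the numbers from the results table. The helper `smallest_model` and the thresholds are illustrative assumptions, not part of the notebook.

```python
# Results copied from the table above: name -> (macro F1, size in KB)
results = {
    "fp32": (1.0, 6.65),
    "Q_8bit": (1.0, 1.663086),
    "Q_6bit": (1.0, 1.24707),
    "Q_4bit": (1.0, 0.831055),
    "Q_2bit": (0.927742, 0.415039),
}

def smallest_model(results: dict, min_f1: float) -> str:
    """Return the name of the smallest model whose macro F1 is at least min_f1."""
    candidates = {name: size for name, (f1, size) in results.items() if f1 >= min_f1}
    if not candidates:
        raise ValueError(f"No model reaches macro F1 >= {min_f1}")
    return min(candidates, key=candidates.get)

best_strict = smallest_model(results, min_f1=1.0)    # no accuracy loss allowed
best_relaxed = smallest_model(results, min_f1=0.9)   # some accuracy loss allowed
print(best_strict, best_relaxed)
```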

That's it! We created a wine classifier from scratch with NumPy and applied asymmetric linear quantization!