# Post-Training Quantization

This notebook applies post-training quantization (PTQ) to a pre-trained neural network. PTQ reduces a model's memory footprint and computational cost so that it can be deployed on resource-constrained devices, a central requirement in Tiny Machine Learning (TinyML).

We will explore two distinct PTQ methods:
1.  **Evolving Quantization**: A clustering-based technique that groups the model's weights and biases and replaces each original floating-point value with the centroid of its cluster, so that only a small set of centroids and one index per parameter need to be stored.
2.  **INT8 Quantization**: A standard, widely used technique that converts 32-bit floating-point parameters into 8-bit integers, reducing parameter storage roughly fourfold.

The entire process, from data loading to model quantization and code generation, is tracked using Weights & Biases (W&B) for robust experiment management and version control.

## 1. Library Imports

First, we import the necessary Python libraries. These libraries provide functionalities for data manipulation, machine learning, and interaction with the operating system and external tools.

* `wandb`: For logging experiments, managing artifacts, and tracking model performance.
* `os`: To interact with the operating system, primarily for handling file paths.
* `pandas`: For efficient data manipulation and analysis using DataFrames.
* `warnings`: To suppress warning messages and keep the cell output readable.
* `tensorflores.utils`: A custom toolkit containing modules for clustering, quantization, C++ code generation, and JSON handling, which are central to this workflow.

In [1]:
import wandb
import os
import pandas as pd
import warnings

# Suppress warnings for a cleaner output
warnings.filterwarnings("ignore")

from tensorflores.utils.clustering import ClusteringMethods
from tensorflores.utils.quantization import Quantization
from tensorflores.utils.cpp_generation import CppGeneration
from tensorflores.utils.json_handle import JsonHandle

# To run this notebook, you need a W&B account and an API key.
# Create a file named my_key.py containing the line: WANDB_KEY = 'your_api_key_here'
# so that the import below succeeds.
from my_key import WANDB_KEY

## 2. Weights & Biases Initialization

To ensure experiment reproducibility and version control, we initialize a connection to Weights & Biases (W&B).

### 2.1. Authentication
We begin by logging into the W&B service using an API key.

**Note on Security**: Hardcoding API keys directly in the source code is not recommended for security reasons. A safer approach is to store the key as an environment variable (`WANDB_API_KEY`) or to use the command-line interface (`wandb login`).
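One way to follow this advice is to look the key up in the environment before logging in. The snippet below is a minimal sketch that assumes the key has been exported as `WANDB_API_KEY`; the helper name `resolve_wandb_key` is hypothetical and not part of `wandb`.

```python
import os


def resolve_wandb_key(env=None):
    """Return the W&B API key found in the environment, or None if it is not set."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY", "").strip()
    return key or None


# Usage (requires the wandb package):
#   import wandb
#   wandb.login(key=resolve_wandb_key())
print("key found:", resolve_wandb_key() is not None)
```

The helper only makes the lookup explicit so that a missing key can be detected early; it does not replace `wandb login` on the command line.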

In [2]:
wandb.login(key = WANDB_KEY)

wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\thomm\_netrc
wandb: Currently logged in as: thommasflores (thommasflores-ufrn) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

### 2.2. Run Initialization
Next, we initialize a new W&B run. This creates a dedicated workspace for our experiment, where all subsequent data, code, and model versions will be logged. The `project` parameter organizes runs, while `job_type` helps categorize the different stages of our MLOps pipeline.

In [3]:
run =  wandb.init(project = "SBAI 2025", job_type = "data-loading", save_code=True)

## 3. Data Loading and Preparation

The model we will be quantizing was trained on a specific dataset. To perform post-training quantization, especially for calibration purposes, we must first load this data. The datasets are stored as W&B artifacts, which allows for versioning and easy retrieval.

### 3.1. Download and Load the Training Dataset

We retrieve the training dataset artifact from W&B. The `run.use_artifact` method declares which artifact the run uses, and `artifact.download()` downloads it to a local directory and returns that directory's path. We then build the full file path from the first entry in that directory and load the `train.csv` file into a pandas DataFrame.

In [4]:
artifact = run.use_artifact(artifact_or_name = "thommasflores-ufrn/SBAI 2025/train_dataset:latest")

In [5]:
path = artifact.download()

wandb:   1 of 1 files downloaded.


In [6]:
# List files in the downloaded directory to confirm the CSV file name
print("Files in the downloaded directory:", os.listdir(path))

Files in the downloaded directory: ['train.csv']


In [7]:
csv_file_path_train = os.path.join(path, os.listdir(path)[0])
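Taking the first entry of `os.listdir(path)` works here because the artifact contains a single file. If an artifact could hold more than one file, a stricter lookup may be safer. The following is a minimal sketch using only the standard library; the helper name `find_single_csv` is hypothetical, and the temporary directory merely stands in for the downloaded artifact folder.

```python
import tempfile
from pathlib import Path


def find_single_csv(directory):
    """Return the only .csv file in `directory`; raise if there is not exactly one."""
    matches = sorted(Path(directory).glob("*.csv"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one CSV file in {directory}, found {len(matches)}"
        )
    return matches[0]


# Demonstration on a throwaway directory that mimics a downloaded artifact.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "README.txt").write_text("not data")
    found_name = find_single_csv(tmp).name

print(found_name)
```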

In [8]:
df_train = pd.read_csv(csv_file_path_train)
df_train

Unnamed: 0,CO2 (g/s) [estimated maf],intake_pressure,intake_temperature,rpm,speed
0,0.029800,0.061728,0.642857,0.293694,0.506329
1,0.062756,0.246914,0.107143,0.167207,0.455696
2,0.011156,0.111111,1.000000,0.168649,0.392405
3,0.049689,0.432099,0.535714,0.046126,0.000000
4,0.159176,0.395062,0.321429,0.271712,0.620253
...,...,...,...,...,...
8179,0.238776,0.604938,0.464286,0.273514,0.607595
8180,0.074384,0.185185,0.357143,0.267027,0.632911
8181,0.353246,0.827160,0.392857,0.303423,0.670886
8182,0.057366,0.172840,0.535714,0.237838,0.316456


### 3.2. Download and Load the Test Dataset
We repeat the same process to download and load the test dataset.

In [9]:
artifact = run.use_artifact(artifact_or_name = "thommasflores-ufrn/SBAI 2025/test_dataset:latest")
path = artifact.download()
csv_file_path_test = os.path.join(path, os.listdir(path)[0])

wandb:   1 of 1 files downloaded.


In [10]:
df_test = pd.read_csv(csv_file_path_test)
df_test

Unnamed: 0,CO2 (g/s) [estimated maf],intake_pressure,intake_temperature,rpm,speed
0,0.092676,0.074074,0.071429,0.486486,0.949367
1,0.068331,0.160494,0.571429,0.287568,0.582278
2,0.060287,0.234568,0.321429,0.178739,0.379747
3,0.306680,0.790123,0.464286,0.268108,0.658228
4,0.423858,0.802469,0.285714,0.396036,0.607595
...,...,...,...,...,...
2041,0.459844,0.641975,0.071429,0.544505,0.379747
2042,0.325160,0.629630,0.464286,0.385946,0.607595
2043,0.225124,0.617284,0.500000,0.247568,0.607595
2044,0.033723,0.222222,0.535714,0.125045,0.202532


### 3.3. Feature and Target Separation
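The dataset contains several columns. We must define which columns are the model's inputs (features) and which is the output (target). The list of feature names is stored in a variable called `input`, which shadows Python's built-in `input()` function for the rest of the notebook.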

In [11]:
target = ['CO2 (g/s) [estimated maf]']
input = ['intake_pressure','intake_temperature','rpm', 'speed']

### 3.4. Creating Training and Test Sets
We partition the DataFrames into feature sets (`X`) and target sets (`y`) for both the training and test data.

In [12]:
X_train = df_train[input]
y_train = df_train[target]

In [13]:
print('input shape: ', X_train.shape)
print('input type: ', type(X_train))

input shape:  (8184, 4)
input type:  <class 'pandas.core.frame.DataFrame'>


In [14]:
print('output shape: ', y_train.shape)
print('output type: ', type(y_train))

output shape:  (8184, 1)
output type:  <class 'pandas.core.frame.DataFrame'>


### 3.5. Conversion to NumPy Arrays
While pandas DataFrames are excellent for data manipulation, many machine learning libraries, including our `tensorflores` utility, expect data in the form of NumPy arrays for numerical computation. We perform this conversion using the `.values` attribute.

In [15]:
X_value_train = X_train.values
print('input shape: ', X_value_train.shape)
print('input type: ', type(X_value_train))

input shape:  (8184, 4)
input type:  <class 'numpy.ndarray'>


In [16]:
y_value_train = y_train.values
print('output shape: ', y_value_train.shape)
print('output type: ', type(y_value_train))

output shape:  (8184, 1)
output type:  <class 'numpy.ndarray'>


In [17]:
X_test = df_test[input]
y_test = df_test[target]

In [18]:
X_value_test = X_test.values

In [19]:
y_value_test = y_test.values

## 4. Loading the Pre-Trained Model

With the data prepared, we now load the pre-trained model. The model's architecture, weights, and biases were previously saved to a JSON file and logged as a W&B artifact. This approach decouples the training pipeline from the quantization pipeline.

We download the artifact and use our custom `JsonHandle` utility to load the model's structure and parameters into a Python dictionary.

In [20]:
artifact = run.use_artifact(artifact_or_name = "thommasflores-ufrn/SBAI 2025/json:latest")
path = artifact.download()
json_model_path = os.path.join(path, os.listdir(path)[0])

wandb:   1 of 1 files downloaded.


In [21]:
model_as_json = JsonHandle().load_json_model(json_model_path)
model_as_json

Successfully loaded JSON file: c:\Users\thomm\OneDrive\Documents\SBAI-TensorFores-2025\code\notebooks\artifacts\json-v17\tensorflores_without_quant.json


{'model_quantized': False,
 'num_layers': 3,
 'layers': [{'activation': 'sigmoid',
   'weights': [[0.5983613776169002,
     0.9850774274850139,
     2.5933117944844857,
     0.8772618049001288,
     -1.0612287234015285,
     0.04770169906956271,
     1.5051327934143766,
     -0.14016190977131357,
     1.1083532396838969,
     -0.3796377071161694,
     0.3479307483011244,
     0.8792962365334899,
     0.19385652723756516,
     -0.9428880442381901,
     -0.03585186813970445,
     -2.014691790301126,
     0.3852130908517043,
     -0.15457473617658904,
     -0.5788742120719856,
     -0.33112083908470336,
     -0.35202134089117293,
     1.6560032764012138,
     1.3767253488241293,
     0.7187468272853096,
     0.15604177555197385,
     -2.8764967978006193,
     -1.2265209210450103,
     0.8960323618323984,
     0.9007803921518267,
     0.17980811101223435,
     0.1635455464465181,
     -0.3361108354524395,
     0.4317760717003239,
     0.22835024002413146,
     -1.7065036320510605,
     0.2

## 5. Post-Training Quantization

This section is the core of the notebook. We will apply two different post-training quantization (PTQ) techniques to the loaded model.

### 5.1. Evolving Post-Training Quantization

This method applies clustering to the model's weights and biases. Instead of storing each unique floating-point value, we store a small number of representative values (cluster centroids) and an index map. This can lead to very high compression ratios.
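The storage trade-off can be sketched as follows. This is a minimal illustration, not the `tensorflores` implementation: the helper names, the toy weights, and the toy centroids are hypothetical. The layer shape used for the byte count (4 x 64 weights, 5 centroids, 8-bit indices) is taken from the generated code shown in Section 5.1.4, and 4-byte floats are assumed.

```python
def encode_with_codebook(weights, centroids):
    """Map each weight to the index of its nearest centroid (absolute difference)."""
    return [
        min(range(len(centroids)), key=lambda k: abs(w - centroids[k]))
        for w in weights
    ]


def decode_with_codebook(indices, centroids):
    """Rebuild approximate weights by looking each index up in the centroid table."""
    return [centroids[i] for i in indices]


weights = [0.60, 0.99, 2.59, 0.88, -1.06, 0.05]  # toy values
centroids = [-1.0, 0.0, 1.0, 2.5]                # hypothetical centroids

indices = encode_with_codebook(weights, centroids)
restored = decode_with_codebook(indices, centroids)
print(indices)
print(restored)

# Byte count for a 4 x 64 layer with 5 centroids (assumes 4-byte floats).
n_weights, n_centroids = 4 * 64, 5
float_bytes = 4 * n_weights                    # every weight stored as a float
coded_bytes = 1 * n_weights + 4 * n_centroids  # one uint8 index each + centroid table
print(float_bytes, coded_bytes, round(float_bytes / coded_bytes, 2))
```

With one byte per index the ratio stays below 4x however few centroids are used; going further would require packing several indices into each byte, which this sketch does not attempt.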

#### 5.1.1. Select the Clustering Method

The `tensorflores` library provides several clustering algorithms. Here, we choose `autocloud`, a density-based clustering method, for both weights and biases. The threshold parameters (`threshold_weights` and `threshold_biases`) control the granularity of the clusters; a larger threshold results in fewer clusters and therefore a coarser approximation of the original values.

Other available methods include:
- `meanshift`
- `affinity_propagation`
- `dbstream`

In [22]:
Clustering_method = ClusteringMethods()

bias_clustering_method = Clustering_method.autocloud_biases(threshold_biases = 1.6)
weight_clustering_method = Clustering_method.autocloud_weight(threshold_weights = 1.5)

#### 5.1.2. Define the Distance Metric

Clustering algorithms rely on a distance metric to determine the similarity between data points. We must choose an appropriate metric for calculating the distance between weights or biases and their candidate cluster centroids. For this example, we select `dtw` (Dynamic Time Warping), a metric designed for comparing sequences that can also be applied to parameter values. The choice is passed to the quantization call through its `distance_metric` argument.

Available metrics include:
- `euclidean`
- `manhattan`
- `cosine`
- `wasserstein`
- `dtw`
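To make the differences between some of these metrics concrete, the sketch below implements Manhattan, Euclidean, and a textbook dynamic-programming DTW distance in plain Python. It is an illustration under stated assumptions: the function names are hypothetical, and how `tensorflores` computes these metrics internally is not shown in this notebook.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (equal-length inputs)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.dist(a, b)


def dtw_distance(a, b):
    """Classic DTW with absolute difference as the local cost.
    Unlike the two metrics above, the sequences may differ in length."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


print(manhattan([1, 2], [2, 4]))
print(euclidean([0, 0], [3, 4]))
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
print(dtw_distance([0.5], [2.0]))
```

For two single scalars, all three functions reduce to the absolute difference, so the choice of metric only matters when values are compared as vectors or sequences.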

#### 5.1.3. Apply Evolving Quantization

We now call the `post_training_quantization` function with `quantization_type = 'evolving'`. This function takes the original model, the chosen clustering methods, and the distance metric as input, and returns a new JSON object representing the quantized model.

In [23]:
model_as_json_quant = Quantization().post_training_quantization(json_data = model_as_json,
                                        quantization_type = 'evolving', 
                                        distance_metric = "dtw", 
                                        bias_clustering_method = bias_clustering_method,
                                        weight_clustering_method = weight_clustering_method)

#### 5.1.4. Generate C++ Code

For deployment on a microcontroller, the quantized model must be converted into a low-level language. The `generate_cpp_from_json` function creates a C++ header file (`.h`) containing the model's structure and quantized parameters. This file can be directly included in an embedded systems project.

In [24]:
CppGeneration().generate_cpp_from_json(json_data = model_as_json_quant, file_name = './cpp_models/tensorflores_evolving_PTQ')

Model C++ saved!


'namespace Conect2AI {\nnamespace TensorFlores {\nclass MultilayerPerceptron {\npublic: \n\nfloat predict(float *x) { \nfloat y_pred = 0;\nstatic const float center_bias[6] = {0.0009016840978143036, -0.019148522561108102, 0.013360931374408178, -0.010552725544291009, 0.01792165696039687, -0.011831871062954129};\n\nstatic const float centers_weights[5] = {-0.05396740467744466, 0.9962205670830141, -1.9614951737868742, -1.258301348589759, 1.8388561860771986};\n\nstatic const uint8_t w1[4][64] = {\n    {1, 1, 4, 1, 3, 0, 4, 0, 1, 0, 0, 1, 0, 3, 0, 2, 0, 0, 0, 0, 0, 4, 1, 1, 0, 2, 3, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 4, 0, 3, 4, 3, 1, 1, 0, 2, 0, 1, 4, 3, 0, 0, 0, 0, 1},\n    {1, 1, 1, 0, 3, 0, 3, 1, 3, 3, 0, 3, 3, 0, 0, 0, 0, 1, 0, 3, 2, 2, 0, 0, 0, 2, 1, 3, 4, 4, 0, 3, 3, 0, 1, 4, 0, 0, 4, 0, 4, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1, 1, 3, 0, 4, 0, 3, 3, 0, 0, 1, 1, 1, 1},\n    {2, 0, 4, 1, 0, 1, 1, 1, 4, 1, 1, 0, 1, 0, 0, 4, 0, 0, 3, 0, 0, 4, 3, 0, 3, 0, 3, 1, 3, 0, 0, 3, 0,

#### 5.1.5. Log the Quantized Model Artifact to W&B
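Finally, we log the generated C++ header file back to W&B as a new artifact. The cell calls `wandb.init` with `job_type = "PTQ-model"` so that the artifact is recorded under a run labelled as a quantization step. This versions the quantized model and links it to the experiment that produced it, completing the MLOps cycle.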

In [25]:
wandb.init(project = "SBAI 2025", job_type = "PTQ-model", save_code=True)
cpp_artifact = wandb.Artifact("cpp_PTQ_evolving", type="model_PTQ_evolving")
cpp_artifact.add_file('./cpp_models/tensorflores_evolving_PTQ.h')
wandb.log_artifact(cpp_artifact)

<Artifact cpp_PTQ_evolving>

### 5.2. INT8 Post-Training Quantization

Next, we demonstrate a more standard approach: INT8 quantization. This method scales and converts the 32-bit floating-point weights and biases into 8-bit integers. Storing each parameter in one byte instead of four reduces parameter storage by about 75% (not counting any scaling metadata), and integer arithmetic is often accelerated by modern hardware.

#### 5.2.1. Apply INT8 Quantization

The process is simpler than the evolving method. We call the same `post_training_quantization` function but set `quantization_type = 'int8'`. The library handles the necessary scaling and conversion internally.
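The kind of scaling involved can be sketched with symmetric per-tensor quantization, one common INT8 scheme. This is an assumption chosen for illustration: the notebook does not show which scheme `tensorflores` uses, and the function names and rounded toy weights below are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: scale = max|v| / 127, q = round(v / scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [qi * scale for qi in q]


weights = [0.5983, 0.9851, 2.5933, 0.8773, -1.0612, 0.0477]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(scale, max_error)
```

Under this scheme the rounding error of each value is at most half of `scale`, so a single large-magnitude weight coarsens the resolution available to all the others.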

In [26]:
model_as_json_int8 = Quantization().post_training_quantization(json_data = model_as_json,
                                        quantization_type = 'int8')

#### 5.2.2. Generate C++ Code and Log to W&B

As before, we generate the corresponding C++ header file for the INT8 model and log it as a versioned artifact in W&B.

In [27]:
cpp_model_int8 = CppGeneration().generate_cpp_from_json(json_data = model_as_json_int8, file_name = './cpp_models/tensorflores_int8_PTQ')

Model C++ saved!


In [28]:
wandb.init(project = "SBAI 2025", job_type = "PTQ-model", save_code=True)
cpp_artifact = wandb.Artifact("cpp_int8", type="model_PTQ_int8")
cpp_artifact.add_file('./cpp_models/tensorflores_int8_PTQ.h')
wandb.log_artifact(cpp_artifact)

<Artifact cpp_int8>

## 6. Conclusion

This notebook has successfully demonstrated a complete post-training quantization workflow. We have:
1.  Loaded the pre-trained model and the datasets from Weights & Biases artifacts.
2.  Applied two different quantization strategies: a flexible, high-compression "evolving" method using clustering, and a standard, hardware-friendly INT8 conversion.
3.  Generated self-contained C++ header files for each quantized model, ready for deployment on embedded devices.
4.  Logged the final quantized models as new artifacts back to W&B, ensuring full traceability and reproducibility.

To conclude the experiment and finalize the W&B run, we call `wandb.finish()`.

In [29]:
wandb.finish()