# Unit 2.1 - On-device optimization lab 

## Lab Step 1: Environment Setup + Baseline Model Load

In this step, we set up the environment for the on-device optimization lab by installing the required libraries: TensorFlow, ONNX, ONNX Runtime, NumPy, and Matplotlib.

In [1]:
# Environment setup: install the required libraries. Assumes a conda/Anaconda environment with Python 3.10.x.
# On Windows there is no tflite-runtime package (it is available on Linux, ARM, Android and other embedded systems)
!pip install "tensorflow==2.19.0"
!pip install "onnx==1.16.2" "onnxruntime==1.18.1"
!pip install "numpy==1.26.4" matplotlib



### 1.2 - Load the Preprocessed Dataset

In this step, we load the preprocessed dataset generated from the UCA-EHAR recordings.  
This dataset contains sliding windows of IMU + pressure sensor data for 8 human activities.

Make sure the file **`uca_ehar_preprocessed_win100_step50.npz`** is in the `data/` folder one level above this notebook (the loading code reads it from `../data/`).
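The file name suggests windows of 100 samples taken with a step of 50 (50% overlap). The preprocessing script is not part of this lab, so the following is only a minimal sketch of how such windows could be built. The function `make_windows` and the synthetic recording are hypothetical.

```python
import numpy as np

def make_windows(signal: np.ndarray, win: int = 100, step: int = 50) -> np.ndarray:
    """Slice a (num_samples, num_channels) recording into overlapping windows.

    Returns an array of shape (num_windows, win, num_channels).
    """
    num_samples = signal.shape[0]
    if num_samples < win:
        # Recording too short for even one window
        return np.empty((0, win, signal.shape[1]), dtype=signal.dtype)
    starts = range(0, num_samples - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

# Synthetic recording: 1000 time steps, 7 channels (hypothetical data)
recording = np.random.default_rng(0).normal(size=(1000, 7)).astype(np.float32)
windows = make_windows(recording)
print(windows.shape)  # (19, 100, 7)
```

With a step of half the window length, the second half of each window is the first half of the next one.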



In [2]:
import numpy as np
import os


DATASET_PATH = os.path.join("..", "data", "uca_ehar_preprocessed_win100_step50.npz")

# Load dataset
data = np.load(DATASET_PATH, allow_pickle=True)

X_train = data["X_train"]
y_train = data["y_train"]
X_val = data["X_val"]
y_val = data["y_val"]
X_test = data["X_test"]
y_test = data["y_test"]
class_names = data["class_names"]

print("Dataset loaded successfully!")
print("Training data shape:", X_train.shape)
print("Validation data shape:", X_val.shape)
print("Test data shape:", X_test.shape)
print("\nActivity classes:", class_names)


Dataset loaded successfully!
Training data shape: (8241, 100, 7)
Validation data shape: (1766, 100, 7)
Test data shape: (1767, 100, 7)

Activity classes: ['STANDING' 'SITTING' 'WALKING' 'WALKING_UPSTAIRS' 'WALKING_DOWNSTAIRS'
 'RUNNING' 'LYING' 'DRINKING']


In [3]:
# Quick sanity checks (optional but recommended)
# Check window size and signal count
print("Window length:", X_train.shape[1])
print("Number of sensor channels:", X_train.shape[2])

# Check unique labels
unique_labels = np.unique(y_train)
print("Labels present:", unique_labels)

Window length: 100
Number of sensor channels: 7
Labels present: [0 1 2 3 4 5 6 7]


### 1.3 - Load the Baseline HAR Model

In this step you will load the *baseline* Human Activity Recognition (HAR) model.

- The model is stored in ONNX format: `har_baseline.onnx`
- It was trained beforehand on the same UCA-EHAR dataset you are using
- In this lab we will only use **CPU** inference via **ONNX Runtime**

After loading the model, we will inspect:
- its input and output shapes
- its file size on disk (model footprint)

Make sure the file:

`models/har_baseline.onnx`

exists in the `models/` folder of your project (one level above this notebook, matching the `../models/` path used in the code).


In [4]:
import os
import onnxruntime as ort

# Path to the baseline ONNX model
MODEL_PATH = os.path.join("..", "models", "har_baseline.onnx")

if not os.path.exists(MODEL_PATH):
    raise FileNotFoundError(f"Baseline model not found at: {MODEL_PATH}")

print("Loading baseline HAR model from:", MODEL_PATH)

# Create ONNX Runtime inference session (CPU-only)
session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])

# Get input / output names and shapes
input_tensor  = session.get_inputs()[0]
output_tensor = session.get_outputs()[0]

input_name  = input_tensor.name
output_name = output_tensor.name

print("Model loaded successfully ✅")
print("Input name :", input_name)
print("Input shape:", input_tensor.shape)
print("Output name:", output_name)
print("Output shape:", output_tensor.shape)

# Optional: show number of classes if class_names was loaded earlier from the dataset
try:
    print("Number of classes:", len(class_names))
except NameError:
    print("Hint: run the dataset loading cell first to see class names.")


Loading baseline HAR model from: ..\models\har_baseline.onnx
Model loaded successfully ✅
Input name : input
Input shape: ['unk__121', 100, 7]
Output name: dense_1
Output shape: ['unk__122', 8]
Number of classes: 8


In [6]:
# Inspecting model size on disk

size_kb = os.path.getsize(MODEL_PATH) / 1024
print(f"Baseline model size: {size_kb:.2f} KB")

Baseline model size: 327.19 KB


## Lab Step 2: Measure Baseline KPIs

The first technical task is to evaluate the baseline model performance before optimization.
We measure:

- Latency (inference time)
- Model size
- Initial accuracy
- Energy proxy (optional)

These measurements form the baseline against which we compare improvements after quantization and pruning.


### 2.1 - Prepare Test Data for Benchmarking

In this step you will prepare the data used to benchmark the baseline model.

We will:

- Use the **test split** of the preprocessed UCA-EHAR dataset  
- Create the reference arrays `X_bench` and `y_bench` for all KPI measurements

All latency and accuracy measurements in this lab will be based on this benchmark set.


In [5]:
# The dataset-loading cell in Step 1.2 must already have created
# X_train, X_val, X_test, y_train, y_val, y_test and class_names from the NPZ file.

import numpy as np

print("Train set shape:", X_train.shape)
print("Val set shape  :", X_val.shape)
print("Test set shape :", X_test.shape)

# --- Recompute training normalization (same as in teacher notebook) ---

# Compute mean and std ONLY on the training set
train_mean = X_train.mean(axis=(0, 1), keepdims=True)   # shape: (1, 1, num_channels)
train_std  = X_train.std(axis=(0, 1), keepdims=True)
train_std[train_std == 0] = 1.0   # safety

print("train_mean shape:", train_mean.shape)
print("train_std  shape:", train_std.shape)

# Apply normalization to all splits
X_train_n = (X_train - train_mean) / train_std
X_val_n   = (X_val   - train_mean) / train_std
X_test_n  = (X_test  - train_mean) / train_std

# Benchmark set = normalized test set
X_bench = X_test_n.astype(np.float32)
y_bench = y_test.copy()

print("Benchmark set shape:", X_bench.shape)
print("Benchmark labels shape:", y_bench.shape)
print("Normalized X_bench mean:", X_bench.mean(), "std:", X_bench.std())

Train set shape: (8241, 100, 7)
Val set shape  : (1766, 100, 7)
Test set shape : (1767, 100, 7)
train_mean shape: (1, 1, 7)
train_std  shape: (1, 1, 7)
Benchmark set shape: (1767, 100, 7)
Benchmark labels shape: (1767,)
Normalized X_bench mean: -0.11365111 std: 0.98807645


### 2.2 - Latency Measurement (CPU-only)

First, we measure **inference latency** for a single window.

- We will use ONNX Runtime on CPU only
- We repeat inference several times and average the result
- Latency is reported in **milliseconds per inference**

Remember: the scenario target from Unit 1.1 is **< 20 ms** per inference.
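A single average can hide outliers caused by the operating system or by the first few calls. As a hedged sketch, the helper below adds warm-up calls and reports the median and 95th percentile alongside the mean. `time_callable` is a hypothetical name, and the workload is a stand-in so that the example runs without a model.

```python
import statistics
import time
from typing import Callable, Dict

def time_callable(fn: Callable[[], object], warmup: int = 10, runs: int = 100) -> Dict[str, float]:
    """Time fn() call by call and return summary statistics in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn()
    samples_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()     # monotonic, high-resolution clock
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    samples_ms.sort()
    return {
        "mean": statistics.fmean(samples_ms),
        "median": statistics.median(samples_ms),
        "p95": samples_ms[int(0.95 * (runs - 1))],
    }

# Stand-in workload. In the notebook this could be something like
# lambda: session.run([output_name], {input_name: sample})
stats = time_callable(lambda: sum(i * i for i in range(1000)))
print({k: round(v, 4) for k, v in stats.items()})
```

If the mean is much larger than the median, a few slow calls are dominating the average.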


In [6]:
import time

def measure_latency(sample: np.ndarray, runs: int = 100) -> float:
    """
    Measure average inference time (ms) for a single sample.

    Args:
        sample: Input window with shape (1, window_size, num_channels)
        runs: Number of repetitions for averaging

    Returns:
        Average latency in milliseconds.
    """
    start = time.perf_counter()  # monotonic, high-resolution timer
    for _ in range(runs):
        _ = session.run([output_name], {input_name: sample})
    end = time.perf_counter()
    return (end - start) * 1000.0 / runs  # ms per inference


# Take the first benchmark window
sample = X_bench[0:1]   # shape: (1, 100, 7) for this dataset

latency_ms = measure_latency(sample, runs=100)
print(f"Inference time (baseline): {latency_ms:.2f} ms")


Inference time (baseline): 0.17 ms


### 2.3 - Accuracy Measurement

Next, we measure **classification accuracy** of the baseline model.

We will:

1. Run the model on all benchmark windows  
2. Collect the predicted class index for each window  
3. Compare predictions with the true labels

Accuracy is reported as a percentage. For HAR, baseline accuracy is typically
in the **90–95%** range, depending on model size.
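Overall accuracy can hide classes that are recognized poorly. As an optional extension, a confusion matrix gives per-class results. The sketch below uses toy labels (hypothetical); in the notebook the inputs would be `y_bench` and `y_pred`, with `num_classes=len(class_names)`.

```python
import numpy as np

def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated index pairs
    return cm

# Toy labels (hypothetical), 3 classes
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
per_class_recall = cm.diagonal() / cm.sum(axis=1)   # assumes every class occurs in y_true
overall_accuracy = cm.trace() / cm.sum()

print(cm)
print("Per-class recall:", per_class_recall)
print("Overall accuracy:", overall_accuracy)
```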


In [7]:
def predict(X: np.ndarray) -> np.ndarray:
    """
    Run inference on a batch of windows and return predicted class indices.

    Args:
        X: Array of shape (num_windows, window_size, num_channels)

    Returns:
        Array of shape (num_windows,) with predicted class indices.
    """
    preds = []
    for window in X:
        # Add batch dimension: (W, C) -> (1, W, C)
        out = session.run([output_name], {input_name: window[None, ...]})[0]
        preds.append(out.argmax())
    return np.array(preds)


y_pred = predict(X_bench)
accuracy = (y_pred == y_bench).mean() * 100.0

print(f"Baseline accuracy on benchmark set: {accuracy:.2f}%")

Baseline accuracy on benchmark set: 92.98%


### 2.4 - Optional Energy Proxy (MACs / FLOPs)

As an optional step, we can estimate an **energy proxy** using MACs
(multiply-accumulate operations). MACs are closely related to FLOPs: one MAC is
commonly counted as two FLOPs. This is not a direct energy measurement, but it
gives a sense of how computationally expensive the model is.

If the helper script is available, run the cell below. Otherwise, you can skip
this step.
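To make the idea of counting MACs concrete, here is a small worked sketch for two common layer types. The layer sizes are hypothetical and do not describe the lab model, and these formulas are not necessarily the ones used by the helper script.

```python
def dense_macs(in_features: int, out_features: int) -> int:
    """A dense layer needs one multiply-accumulate per weight."""
    return in_features * out_features

def conv1d_macs(out_length: int, kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Each output position of each filter needs kernel_size * in_channels MACs."""
    return out_length * kernel_size * in_channels * out_channels

# Hypothetical layer sizes, not the lab model's real architecture
print(dense_macs(64, 8))            # 512
print(conv1d_macs(98, 3, 7, 16))    # 32928
```

Convolution layers reuse the same weights at every output position, so they usually account for far more MACs than their parameter count suggests.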


In [11]:
# Note: the MACs value below is an approximate proxy based on how the model is
# represented in ONNX. It may not count every operation exactly, but it is good
# enough to compare **relative** complexity between different model versions.

import os, sys
from pathlib import Path

# Add the parent directory (where "utils" lives) to the Python path
sys.path.append(str(Path("..").resolve()))

macs = None

try:
    # Import inside the try block so that a missing helper script is reported
    # below instead of stopping the notebook
    from utils.metrics2 import estimate_macs

    model_path_rel = os.path.join("..", "models", "har_baseline.onnx")
    macs = estimate_macs(model_path_rel)
    print("Estimated MACs for baseline model:", macs)
except Exception as e:
    print("MACs estimation is not available. Reason:", repr(e))


Estimated MACs for baseline model: 17408


### 2.5 - Baseline KPI Table

Use the values you just measured to fill in this table in your notes or
Optimization Report:

| Metric        | Baseline Value         | Scenario Target      | Pass? (✓ / ✗) |
|--------------|------------------------|----------------------|---------------|
| Latency (ms) | `...`                  | `< 20 ms`            | `...`         |
| Size (KB/MB) | `...`                  | `< 1 MB`             | `...`         |
| Accuracy (%) | `...`                  | `≥ 90%` (≥ 88% min)  | `...`         |
| MACs         | `...` (if measured)    | “As low as possible” | `—`           |

These **baseline KPIs** will be your reference when you apply quantization and
pruning in the next steps. After optimization, you will recompute the same
metrics and compare them with this table.
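The Pass? column can also be filled in programmatically. The sketch below is one possible helper (`check_kpis` is a hypothetical name). It takes 1 MB as 1024 KB and uses the values printed earlier in this run as an example; your own measurements will differ.

```python
def check_kpis(latency_ms: float, size_kb: float, accuracy_pct: float) -> dict:
    """Compare measured KPIs with the scenario targets from the table."""
    return {
        "latency": latency_ms < 20.0,          # target: < 20 ms
        "size": size_kb < 1024.0,              # target: < 1 MB, taking 1 MB = 1024 KB
        "accuracy": accuracy_pct >= 90.0,      # target: >= 90%
        "accuracy_min": accuracy_pct >= 88.0,  # hard minimum: >= 88%
    }

# Example values as printed by the cells above in this run
baseline = check_kpis(latency_ms=0.17, size_kb=327.19, accuracy_pct=92.98)
for name, passed in baseline.items():
    print(f"{name:13s} {'pass' if passed else 'fail'}")
```

The same function can be called again after each optimization step, so that every model variant is judged against identical thresholds.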


## Lab Step 3: Apply Quantization

### 3.1 - Quantization Workflow

In this step you will apply **post-training quantization** to the baseline HAR model.

We will use **ONNX Runtime dynamic quantization** to:

- convert the weights of the dense layers (`MatMul`/`Gemm` operators) from `float32` to `int8`
- reduce model size on disk
- reduce CPU inference latency

For CPU-only devices like our smart safety glasses, dynamic quantization is a
simple but powerful optimization technique.
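To build intuition for what an `int8` weight conversion does, here is a simplified NumPy sketch of symmetric per-tensor quantization. It is an illustration only: the exact scheme used by ONNX Runtime may differ, and the weight matrix is random, hypothetical data.

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    """Map float32 weights to int8 with a single per-tensor scale."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(128, 8)).astype(np.float32)  # hypothetical dense kernel

q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)

print("bytes float32:", w.nbytes, "bytes int8:", q.nbytes)   # 4096 vs 1024
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Each weight shrinks from 4 bytes to 1 byte, and the rounding error per weight is bounded by about half the scale.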

In [12]:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Paths for baseline and quantized models
BASELINE_MODEL_PATH = os.path.join("..", "models", "har_baseline.onnx")
QUANT_MODEL_PATH    = os.path.join("..", "models", "har_quantized.onnx")

print("Baseline model path :", BASELINE_MODEL_PATH)
print("Quantized model path:", QUANT_MODEL_PATH)

# Apply dynamic quantization ONLY to MatMul/Gemm (dense layers)
quantize_dynamic(
    model_input=BASELINE_MODEL_PATH,
    model_output=QUANT_MODEL_PATH,
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul", "Gemm"],  # do NOT quantize Conv
)

print("Quantization complete. Quantized model saved to:", QUANT_MODEL_PATH)



Baseline model path : ..\models\har_baseline.onnx
Quantized model path: ..\models\har_quantized.onnx
Quantization complete. Quantized model saved to: ..\models\har_quantized.onnx


### 3.2 - Compare Model Sizes

Then, compare the file size of the **baseline** model and the **quantized**
model on disk.

This is a direct measure of how much storage we save by moving from float32
weights to int8.

In [13]:
size_baseline_kb = os.path.getsize(BASELINE_MODEL_PATH) / 1024
size_quant_kb    = os.path.getsize(QUANT_MODEL_PATH) / 1024

print(f"Baseline model size : {size_baseline_kb:.2f} KB")
print(f"Quantized model size: {size_quant_kb:.2f} KB")
print(f"Size reduction      : {100 * (1 - size_quant_kb / size_baseline_kb):.1f}%")

Baseline model size : 327.15 KB
Quantized model size: 281.05 KB
Size reduction      : 14.1%
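A reduction of about 14% is far from the 75% one might expect from replacing 4-byte weights with 1-byte weights. A plausible reason is that only the `MatMul`/`Gemm` weights were quantized, so most of the file is unchanged. The back-of-the-envelope sketch below estimates how much float data was converted. It assumes every converted weight goes from 4 bytes to 1 byte and ignores the overhead of extra quantization parameters, so treat the result as a rough estimate.

```python
def float_kb_converted(size_before_kb: float, size_after_kb: float) -> float:
    """Estimate how many KB of float32 data were converted to int8.

    Assumes each converted weight shrinks from 4 bytes to 1 byte and
    ignores the overhead of added scale / zero-point tensors.
    """
    saved_kb = size_before_kb - size_after_kb
    return saved_kb / 0.75   # 3 of every 4 bytes are saved

# Sizes printed by the cell above in this run
converted_kb = float_kb_converted(327.15, 281.05)
share = converted_kb / 327.15
print(f"{converted_kb:.1f} KB converted, about {100 * share:.0f}% of the file")
```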


### 3.3 - Load the Quantized Model

Next, load the quantized ONNX model into a new ONNX Runtime session and inspect
its input and output shapes.


In [14]:
# Create a new session for the quantized model
session_q = ort.InferenceSession(
    QUANT_MODEL_PATH,
    providers=["CPUExecutionProvider"]
)

q_input_tensor  = session_q.get_inputs()[0]
q_output_tensor = session_q.get_outputs()[0]

q_input_name  = q_input_tensor.name
q_output_name = q_output_tensor.name

print("Quantized model loaded successfully ✅")
print("Input name :", q_input_name)
print("Input shape:", q_input_tensor.shape)
print("Output name:", q_output_name)
print("Output shape:", q_output_tensor.shape)

Quantized model loaded successfully ✅
Input name : input
Input shape: ['unk__121', 100, 7]
Output name: dense_1
Output shape: ['unk__122', 8]


### 3.4 - Latency After Quantization

Now we measure inference latency again, this time using the **quantized**
model.

To make the comparison fair:

- we use the **same benchmark windows** for both models (the first 100 windows of `X_bench`)
- we use the **same timing method** (warm-up, then an average over many runs)

This allows us to see how much latency improvement we get from quantization.
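Latency improvements are reported in two different ways that are easy to confuse: a percentage reduction in latency, and a speed-up factor. The small sketch below shows both, using made-up numbers. The function names are hypothetical.

```python
def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Percentage of the baseline latency that was removed."""
    return 100.0 * (1.0 - optimized_ms / baseline_ms)

def speedup_factor(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized model is."""
    return baseline_ms / optimized_ms

# Made-up example: 10 ms before, 8 ms after
print(f"{latency_reduction_pct(10.0, 8.0):.1f}% lower latency")   # 20.0% lower latency
print(f"{speedup_factor(10.0, 8.0):.2f}x speed-up")               # 1.25x speed-up
```

The percentage printed by the lab code is the first quantity (latency reduction), even though its label says "speed-up".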


In [15]:
import time

def measure_latency_session(sess, in_name, out_name, samples: np.ndarray, runs: int = 50) -> float:
    """
    Measure average inference time (ms) over a batch of samples.
    """
    # Warm-up
    for _ in range(10):
        _ = sess.run([out_name], {in_name: samples[0:1]})

    start = time.time()
    for _ in range(runs):
        for window in samples:
            _ = sess.run([out_name], {in_name: window[None, ...]})
    end = time.time()

    num_calls = runs * len(samples)
    return (end - start) * 1000.0 / num_calls


# Use the first 100 benchmark windows
samples = X_bench[:100]

latency_baseline_ms = measure_latency_session(session,  input_name,  output_name,  samples, runs=20)
latency_quant_ms    = measure_latency_session(session_q, q_input_name, q_output_name, samples, runs=20)

print(f"Baseline latency  : {latency_baseline_ms:.3f} ms")
print(f"Quantized latency : {latency_quant_ms:.3f} ms")

if latency_baseline_ms > 0:  # guard the division below
    speedup = 100 * (1 - latency_quant_ms / latency_baseline_ms)
    print(f"Relative speed-up : {speedup:.1f}%")

Baseline latency  : 0.152 ms
Quantized latency : 0.136 ms
Relative speed-up : 10.6%


### 3.5 - Accuracy After Quantization

Now we check whether quantization has changed the **classification accuracy**.

We reuse the same benchmark set (`X_bench`, `y_bench`) and compare:

- baseline model accuracy
- quantized model accuracy

Our goal is for the accuracy drop to be small (ideally within 1–2 percentage
points).
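When interpreting small accuracy changes, it helps to know how many windows they correspond to. With the 1767 benchmark windows reported in Step 2.1, a single window accounts for roughly 0.06 percentage points. The sketch below makes that conversion explicit. The helper names are hypothetical.

```python
def pp_per_window(num_windows: int) -> float:
    """Accuracy change (percentage points) caused by one window changing class."""
    return 100.0 / num_windows

def windows_for_change(delta_pp: float, num_windows: int) -> int:
    """Approximate number of windows behind a given accuracy change."""
    return round(delta_pp * num_windows / 100.0)

n = 1767  # benchmark set size reported in Step 2.1
print(f"{pp_per_window(n):.3f} pp per window")   # 0.057 pp per window
print(windows_for_change(1.0, n))                # 18 windows for a 1 pp change
```

A change of only a few hundredths of a percentage point therefore reflects one or two windows and should not be over-interpreted.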

In [16]:
def predict_with_session(sess, in_name, out_name, X: np.ndarray) -> np.ndarray:
    """
    Run inference on a batch of windows with a given ONNX Runtime session.
    """
    preds = []
    for window in X:
        out = sess.run([out_name], {in_name: window[None, ...]})[0]
        preds.append(out.argmax())
    return np.array(preds)


# Accuracy for baseline model
y_pred_baseline = predict_with_session(session, input_name, output_name, X_bench)
accuracy_baseline = (y_pred_baseline == y_bench).mean() * 100.0

# Accuracy for quantized model
y_pred_quant = predict_with_session(session_q, q_input_name, q_output_name, X_bench)
accuracy_quant = (y_pred_quant == y_bench).mean() * 100.0

print(f"Baseline accuracy  : {accuracy_baseline:.2f}%")
print(f"Quantized accuracy : {accuracy_quant:.2f}%")
print(f"Accuracy change    : {accuracy_quant - accuracy_baseline:.2f} percentage points")

Baseline accuracy  : 92.98%
Quantized accuracy : 93.04%
Accuracy change    : 0.06 percentage points


## Lab Step 4: Apply Pruning

### 4.1 - Create a pruned version of the baseline model (global magnitude pruning)


In [17]:
# 4.1 – Create a pruned version of the baseline model
# (global magnitude-based pruning of float weight tensors)

from pathlib import Path
import numpy as np
import onnx
from onnx import numpy_helper

# Paths
MODELS_DIR = Path("..") / "models"
BASELINE_MODEL_PATH = MODELS_DIR / "har_baseline.onnx"
PRUNED_MODEL_PATH   = MODELS_DIR / "har_pruned.onnx"

print("Baseline model path:", BASELINE_MODEL_PATH)
print("Pruned model path   :", PRUNED_MODEL_PATH)

# Fraction of smallest-magnitude weights to set to zero (e.g. 0.3 = 30%)
PRUNE_RATIO = 0.30

# Load baseline ONNX model
model = onnx.load(BASELINE_MODEL_PATH)

# Collect candidate weight tensors: float, at least 2D (Conv / Dense kernels)
weight_tensors = []
for initializer in model.graph.initializer:
    if initializer.data_type == onnx.TensorProto.FLOAT and len(initializer.dims) >= 2:
        W = numpy_helper.to_array(initializer)
        weight_tensors.append((initializer, W))

print(f"Found {len(weight_tensors)} weight tensors to prune.")

# Build a global list of all absolute weight values
all_weights = np.concatenate([np.abs(W).ravel() for _, W in weight_tensors])
total_params = all_weights.size

# Global threshold so that ~PRUNE_RATIO of weights will be set to zero
threshold = np.quantile(all_weights, PRUNE_RATIO)

num_pruned = 0

for initializer, W in weight_tensors:
    mask = np.abs(W) < threshold
    num_pruned += int(mask.sum())
    W_pruned = W.copy()
    W_pruned[mask] = 0.0
    # Write pruned weights back into the ONNX initializer
    initializer.CopyFrom(numpy_helper.from_array(W_pruned, initializer.name))

sparsity = 100.0 * num_pruned / total_params

# Save pruned model
onnx.save(model, PRUNED_MODEL_PATH)

print("Global magnitude pruning completed ✅")
print(f"Target prune ratio   : {PRUNE_RATIO*100:.1f}%")
print(f"Actual pruned weights: {num_pruned}/{total_params} "
      f"({sparsity:.1f}% set to zero)")
print("Saved pruned model to:", PRUNED_MODEL_PATH)


Baseline model path: ..\models\har_baseline.onnx
Pruned model path   : ..\models\har_pruned.onnx
Found 14 weight tensors to prune.
Global magnitude pruning completed ✅
Target prune ratio   : 30.0%
Actual pruned weights: 24557/81856 (30.0% set to zero)
Saved pruned model to: ..\models\har_pruned.onnx


### 4.2 - Load the pruned model and inspect metadata

In [19]:
PRUNED_MODEL_PATH = os.path.join("..", "models", "har_pruned.onnx")

session_p = ort.InferenceSession(
    PRUNED_MODEL_PATH,
    providers=["CPUExecutionProvider"]
)

p_input_tensor  = session_p.get_inputs()[0]
p_output_tensor = session_p.get_outputs()[0]

p_input_name  = p_input_tensor.name
p_output_name = p_output_tensor.name

print("Pruned model loaded successfully ✅")
print("Input name :", p_input_name)
print("Input shape:", p_input_tensor.shape)
print("Output name:", p_output_name)
print("Output shape:", p_output_tensor.shape)

Pruned model loaded successfully ✅
Input name : input
Input shape: ['unk__121', 100, 7]
Output name: dense_1
Output shape: ['unk__122', 8]


### 4.3 - Compare model sizes (baseline vs pruned vs quantized)

In [20]:
BASELINE_MODEL_PATH = os.path.join("..", "models", "har_baseline.onnx")
QUANT_MODEL_PATH    = os.path.join("..", "models", "har_quantized.onnx")
PRUNED_MODEL_PATH   = os.path.join("..", "models", "har_pruned.onnx")

size_baseline_kb = os.path.getsize(BASELINE_MODEL_PATH) / 1024
size_quant_kb    = os.path.getsize(QUANT_MODEL_PATH) / 1024
size_pruned_kb   = os.path.getsize(PRUNED_MODEL_PATH) / 1024

print(f"Baseline model size : {size_baseline_kb:.2f} KB")
print(f"Quantized model size: {size_quant_kb:.2f} KB")
print(f"Pruned model size   : {size_pruned_kb:.2f} KB")

print(f"Pruned vs baseline  : {100 * (1 - size_pruned_kb / size_baseline_kb):.1f}% smaller")
print(f"Pruned vs quantized : {100 * (1 - size_pruned_kb / size_quant_kb):.1f}% smaller")

Baseline model size : 327.15 KB
Quantized model size: 281.05 KB
Pruned model size   : 327.15 KB
Pruned vs baseline  : 0.0% smaller
Pruned vs quantized : -16.4% smaller
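The pruned file has the same size as the baseline because the zeroed weights are still stored as ordinary `float32` values: a dense tensor uses 4 bytes per element whether or not the element is zero. The sketch below illustrates this with a random, hypothetical weight tensor, and shows that a general-purpose compressor (here `zlib`) can take advantage of the zeros.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 64)).astype(np.float32)   # hypothetical weight tensor

# Zero out the 30% smallest-magnitude entries, as in the pruning cell
threshold = np.quantile(np.abs(w), 0.30)
w_pruned = np.where(np.abs(w) < threshold, 0.0, w).astype(np.float32)

# Dense storage: every element still takes 4 bytes, zero or not
print(w.nbytes, w_pruned.nbytes)                     # 65536 65536

# Compressed size of the raw bytes, before and after pruning
size_orig = len(zlib.compress(w.tobytes()))
size_pruned = len(zlib.compress(w_pruned.tobytes()))
print(size_orig, size_pruned)
```

Unstructured pruning therefore pays off in storage only when the model is compressed or saved in a sparse format.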


### 4.4 - Accuracy of the pruned model

In [21]:
# Same helper as in Step 3.5, redefined here so this cell can run on its own
def predict_with_session(sess, in_name, out_name, X: np.ndarray) -> np.ndarray:
    preds = []
    for window in X:
        out = sess.run([out_name], {in_name: window[None, ...]})[0]
        preds.append(out.argmax())
    return np.array(preds)

# Baseline accuracy (recomputed for clarity)
y_pred_baseline = predict_with_session(session, input_name, output_name, X_bench)
acc_baseline = (y_pred_baseline == y_bench).mean() * 100.0

# Pruned model accuracy
y_pred_pruned = predict_with_session(session_p, p_input_name, p_output_name, X_bench)
acc_pruned = (y_pred_pruned == y_bench).mean() * 100.0

print(f"Baseline accuracy : {acc_baseline:.2f}%")
print(f"Pruned accuracy   : {acc_pruned:.2f}%")
print(f"Accuracy change   : {acc_pruned - acc_baseline:.2f} percentage points")

Baseline accuracy : 92.98%
Pruned accuracy   : 89.76%
Accuracy change   : -3.23 percentage points


### 4.5 - Latency of the pruned model

In [22]:
import time

def measure_latency_session(sess, in_name, out_name, samples: np.ndarray, runs: int = 20) -> float:
    """
    Measure average inference time (ms) over a batch of samples.
    """
    # Warm-up
    for _ in range(10):
        _ = sess.run([out_name], {in_name: samples[0:1]})

    start = time.time()
    for _ in range(runs):
        for window in samples:
            _ = sess.run([out_name], {in_name: window[None, ...]})
    end = time.time()

    num_calls = runs * len(samples)
    return (end - start) * 1000.0 / num_calls  # ms per inference


# Use the first 100 benchmark windows
samples = X_bench[:100]

lat_baseline = measure_latency_session(session,  input_name,  output_name,  samples, runs=20)
lat_quant    = measure_latency_session(session_q, q_input_name, q_output_name, samples, runs=20)
lat_pruned   = measure_latency_session(session_p, p_input_name, p_output_name, samples, runs=20)

print(f"Baseline latency : {lat_baseline:.3f} ms")
print(f"Quantized latency: {lat_quant:.3f} ms")
print(f"Pruned latency   : {lat_pruned:.3f} ms")

print(f"Pruned vs baseline speed-up: {100 * (1 - lat_pruned / lat_baseline):.1f}%")
print(f"Pruned vs quantized speed-up: {100 * (1 - lat_pruned / lat_quant):.1f}%")

Baseline latency : 0.170 ms
Quantized latency: 0.138 ms
Pruned latency   : 0.139 ms
Pruned vs baseline speed-up: 17.9%
Pruned vs quantized speed-up: -0.5%


### 4.6 - Optional: MACs / energy proxy for pruned model

In [24]:
import os, sys
from pathlib import Path

# Add the parent directory (where "utils" lives) to Python path
sys.path.append(str(Path("..").resolve()))

from utils.metrics2 import estimate_macs  # now this should work

base_path   = os.path.join("..", "models", "har_baseline.onnx")
pruned_path = os.path.join("..", "models", "har_pruned.onnx")

macs_base   = estimate_macs(base_path)
macs_pruned = estimate_macs(pruned_path)

print("Baseline MACs estimate:", macs_base)
print("Pruned MACs estimate  :", macs_pruned)

if macs_base > 0:
    print(f"Reduction: {100 * (1 - macs_pruned / macs_base):.1f}%")

Baseline MACs estimate: 17408
Pruned MACs estimate  : 17408
Reduction: 0.0%
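The unchanged estimate suggests that the helper counts operations from tensor shapes rather than from weight values, although its implementation is not shown in this lab. The sketch below contrasts a shape-based count with a count of non-zero weights for a hypothetical dense kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 8)).astype(np.float32)    # hypothetical dense kernel

# Zero out the 30% smallest-magnitude weights
threshold = np.quantile(np.abs(w), 0.30)
w_pruned = np.where(np.abs(w) < threshold, 0.0, w).astype(np.float32)

shape_based_macs = w_pruned.size                     # what a shape-based count sees
nonzero_macs = int(np.count_nonzero(w_pruned))       # multiplications by a non-zero weight

print("Shape-based MACs:", shape_based_macs)
print("Non-zero MACs   :", nonzero_macs)
```

The non-zero count is only a lower bound on the work. A standard dense kernel still multiplies by the zeros, so the saving appears only with a runtime that skips them.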


## Lab Step 5: Combine Pruning + Quantization

### 5.1 - Create the Pruned + Quantized model

In [25]:
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
import os

PRUNED_MODEL_PATH = os.path.join("..", "models", "har_pruned.onnx")
PRUNED_QUANT_MODEL_PATH = os.path.join("..", "models", "har_pruned_quantized.onnx")

print("Input  model:", PRUNED_MODEL_PATH)
print("Output model:", PRUNED_QUANT_MODEL_PATH)

quantize_dynamic(
    model_input=PRUNED_MODEL_PATH,
    model_output=PRUNED_QUANT_MODEL_PATH,
    op_types_to_quantize=["MatMul", "Gemm"],
    weight_type=QuantType.QInt8
)

print("Pruned + Quantized model saved successfully ✅")




Input  model: ..\models\har_pruned.onnx
Output model: ..\models\har_pruned_quantized.onnx
Pruned + Quantized model saved successfully ✅


### 5.2 - Load the Pruned + Quantized model

In [26]:
import onnxruntime as ort
import os

PRUNED_QUANT_MODEL_PATH = os.path.join("..", "models", "har_pruned_quantized.onnx")

session_pq = ort.InferenceSession(
    PRUNED_QUANT_MODEL_PATH,
    providers=["CPUExecutionProvider"]
)

pq_input_tensor  = session_pq.get_inputs()[0]
pq_output_tensor = session_pq.get_outputs()[0]

pq_input_name  = pq_input_tensor.name
pq_output_name = pq_output_tensor.name

print("Pruned + quantized model loaded successfully ✅")
print("Input name :", pq_input_name)
print("Input shape:", pq_input_tensor.shape)
print("Output name:", pq_output_name)
print("Output shape:", pq_output_tensor.shape)


Pruned + quantized model loaded successfully ✅
Input name : input
Input shape: ['unk__121', 100, 7]
Output name: dense_1
Output shape: ['unk__122', 8]


### 5.3 - Compare model sizes

In [27]:
# Compare model sizes: baseline, quantized, pruned, pruned+quantized

import os

BASELINE_MODEL_PATH      = os.path.join("..", "models", "har_baseline.onnx")
QUANT_MODEL_PATH         = os.path.join("..", "models", "har_quantized.onnx")
PRUNED_MODEL_PATH        = os.path.join("..", "models", "har_pruned.onnx")
PRUNED_QUANT_MODEL_PATH  = os.path.join("..", "models", "har_pruned_quantized.onnx")

size_base_kb  = os.path.getsize(BASELINE_MODEL_PATH)     / 1024
size_quant_kb = os.path.getsize(QUANT_MODEL_PATH)        / 1024
size_prun_kb  = os.path.getsize(PRUNED_MODEL_PATH)       / 1024
size_pq_kb    = os.path.getsize(PRUNED_QUANT_MODEL_PATH) / 1024

print(f"Baseline model size        : {size_base_kb:.2f} KB")
print(f"Quantized model size       : {size_quant_kb:.2f} KB")
print(f"Pruned model size          : {size_prun_kb:.2f} KB")
print(f"Pruned + quantized size    : {size_pq_kb:.2f} KB")

print(f"PQ vs baseline reduction   : {100 * (1 - size_pq_kb / size_base_kb):.1f}%")
print(f"PQ vs quantized reduction  : {100 * (1 - size_pq_kb / size_quant_kb):.1f}%")
print(f"PQ vs pruned reduction     : {100 * (1 - size_pq_kb / size_prun_kb):.1f}%")


Baseline model size        : 327.15 KB
Quantized model size       : 281.05 KB
Pruned model size          : 327.15 KB
Pruned + quantized size    : 281.05 KB
PQ vs baseline reduction   : 14.1%
PQ vs quantized reduction  : 0.0%
PQ vs pruned reduction     : 14.1%


### 5.4 - Accuracy of pruned + quantized model

In [28]:
# 5.4 – Accuracy comparison including pruned + quantized model

def predict_with_session(sess, in_name, out_name, X: np.ndarray) -> np.ndarray:
    preds = []
    for window in X:
        out = sess.run([out_name], {in_name: window[None, ...]})[0]
        preds.append(out.argmax())
    return np.array(preds)

# Baseline
y_pred_base = predict_with_session(session,  input_name,  output_name,  X_bench)
acc_base    = (y_pred_base == y_bench).mean() * 100.0

# Quantized
y_pred_quant = predict_with_session(session_q, q_input_name, q_output_name, X_bench)
acc_quant    = (y_pred_quant == y_bench).mean() * 100.0

# Pruned
y_pred_prun = predict_with_session(session_p, p_input_name, p_output_name, X_bench)
acc_prun    = (y_pred_prun == y_bench).mean() * 100.0

# Pruned + Quantized
y_pred_pq = predict_with_session(session_pq, pq_input_name, pq_output_name, X_bench)
acc_pq    = (y_pred_pq == y_bench).mean() * 100.0

print(f"Baseline accuracy        : {acc_base:.2f}%")
print(f"Quantized accuracy       : {acc_quant:.2f}%")
print(f"Pruned accuracy          : {acc_prun:.2f}%")
print(f"Pruned + quantized acc.  : {acc_pq:.2f}%")

print(f"PQ vs baseline change    : {acc_pq - acc_base:.2f} percentage points")

Baseline accuracy        : 92.98%
Quantized accuracy       : 93.04%
Pruned accuracy          : 89.76%
Pruned + quantized acc.  : 89.81%
PQ vs baseline change    : -3.17 percentage points


### 5.5 - Latency of pruned + quantized model

In [29]:
# 5.5 – Latency comparison including pruned + quantized model

import time

def measure_latency_session(sess, in_name, out_name, samples: np.ndarray, runs: int = 20) -> float:
    # Warm-up
    for _ in range(10):
        _ = sess.run([out_name], {in_name: samples[0:1]})

    start = time.time()
    for _ in range(runs):
        for window in samples:
            _ = sess.run([out_name], {in_name: window[None, ...]})
    end = time.time()

    num_calls = runs * len(samples)
    return (end - start) * 1000.0 / num_calls

# Use first 100 benchmark windows
samples = X_bench[:100]

lat_base = measure_latency_session(session,  input_name,  output_name,  samples, runs=20)
lat_quant = measure_latency_session(session_q, q_input_name, q_output_name, samples, runs=20)
lat_prun  = measure_latency_session(session_p, p_input_name, p_output_name, samples, runs=20)
lat_pq    = measure_latency_session(session_pq, pq_input_name, pq_output_name, samples, runs=20)

print(f"Baseline latency        : {lat_base:.3f} ms")
print(f"Quantized latency       : {lat_quant:.3f} ms")
print(f"Pruned latency          : {lat_prun:.3f} ms")
print(f"Pruned + quantized lat. : {lat_pq:.3f} ms")

print(f"PQ vs baseline speed-up : {100 * (1 - lat_pq / lat_base):.1f}%")
print(f"PQ vs quantized speed-up: {100 * (1 - lat_pq / lat_quant):.1f}%")
print(f"PQ vs pruned speed-up   : {100 * (1 - lat_pq / lat_prun):.1f}%")

Baseline latency        : 0.158 ms
Quantized latency       : 0.138 ms
Pruned latency          : 0.136 ms
Pruned + quantized lat. : 0.140 ms
PQ vs baseline speed-up : 11.3%
PQ vs quantized speed-up: -1.1%
PQ vs pruned speed-up   : -2.4%
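To compare the four variants side by side, the printed results can be collected into one table. The sketch below uses the values printed in Steps 5.3 to 5.5 of this run. Note that the baseline latency varied between roughly 0.15 and 0.17 ms across cells, so latency differences of a few percent are within measurement noise.

```python
results = [
    # name, size_kb, accuracy_pct, latency_ms (values printed in Steps 5.3 to 5.5 of this run)
    ("Baseline",           327.15, 92.98, 0.158),
    ("Quantized",          281.05, 93.04, 0.138),
    ("Pruned",             327.15, 89.76, 0.136),
    ("Pruned + quantized", 281.05, 89.81, 0.140),
]

print("| Model | Size (KB) | Accuracy (%) | Latency (ms) | Accuracy >= 90%? |")
print("|---|---|---|---|---|")
for name, size_kb, acc, lat in results:
    ok = "yes" if acc >= 90.0 else "no"
    print(f"| {name} | {size_kb:.2f} | {acc:.2f} | {lat:.3f} | {ok} |")

# Smallest model among those that keep accuracy at or above the 90% target
candidates = [r for r in results if r[2] >= 90.0]
best = min(candidates, key=lambda r: r[1])
print("Smallest model meeting the accuracy target:", best[0])
```

In this run, both pruned variants fall below the 90% accuracy target while staying above the 88% minimum.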
