# [Model Compression](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf)

## Introduction

**Large models** such as ensembles are usually the most accurate, but their size also makes them **slow and expensive to deploy**.

On the other hand, training a **small model** on the original data usually yields a much **less accurate** model (the **accuracy-efficiency tradeoff**).

Say we have a **large, accurate model** and **unlabeled data**. 

**How can we compress the large, accurate model into a more efficient, yet still accurate smaller model?**

**When we train a model** on the training data, we are trying to **approximate the true function** that generated the training and test data.

However, a **less capable model will not** be able to **approximate the true function well** because it has less capacity, which leads to **lower accuracy.**

But if we have a large, accurate model along with unlabeled data, then **we can label the unlabeled data using the large, accurate model.**

From here **we can train a smaller model on this newly labeled data**, so that it approximates the function learned by the large accurate model rather than the true underlying function.

Why is this better for the small model? TLDR: The larger model gives the smaller model a simpler, smoother function to learn.

**Gap**: In some cases it is not enough for a model to be highly accurate; it also has to meet strict time and space requirements. However, the best-performing models are usually large and slow, while fast, compact models are less accurate.

**Improvement**: The paper shows that a neural network can mimic the function learned by an ensemble.

## Approach

First, label the unlabeled data with the ensemble's predictions.
Then train the neural network on this extended dataset (the original labeled data plus the newly labeled data).
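
The two steps above can be sketched end to end on toy data. This is a minimal illustration, not the paper's setup: the synthetic dataset, the random-forest teacher, the 5-unit `MLPClassifier` student, and all variable names are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: any tabular binary task would do)
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Pretend only 20% of the pool is labeled; the labels of the rest are discarded
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(
    X_pool, y_pool, test_size=0.8, random_state=0
)

# Step 1: the large "teacher" model labels the unlabeled pool
teacher = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_labeled, y_labeled)
pseudo_labels = teacher.predict(X_unlabeled)

# Step 2: the small "student" trains on original labels plus teacher labels
X_extended = np.concatenate([X_labeled, X_unlabeled])
y_extended = np.concatenate([y_labeled, pseudo_labels])
student = MLPClassifier(hidden_layer_sizes=(5,), max_iter=300, random_state=0)
student.fit(X_extended, y_extended)

print("teacher accuracy:", teacher.score(X_test, y_test))
print("student accuracy:", student.score(X_test, y_test))
```

Whether the student gets close to the teacher depends on the data and on how much unlabeled data is available, so the printed numbers should be read as a sanity check rather than a result.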

## Result

The neural network achieves performance similar to the ensemble's while being 1000x smaller and faster.

## Application

### Data

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split

In [33]:
df = pd.read_csv('data/AirlineSatisfaction/prepared.csv')
target_column = "isSatisfied"
X = df.drop(columns=[target_column])
y = df[target_column]

In [35]:
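# Keep 20% as labeled data and treat the remaining 80% as the "unlabeled" pool
# (y_unlabeled is held back only to check the ensemble's labels later)
X, X_unlabeled, y, y_unlabeled = train_test_split(X, y, test_size=0.8, random_state=42, stratify=y)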

In [36]:
X.shape, X_unlabeled.shape

((20780, 23), (83124, 23))

### Ensemble

In [38]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from tqdm import tqdm

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=500)
# svm = SVC(probability=True, max_iter=1000, random_state=42)
knn = KNeighborsClassifier()
# Named mlp so that the later `import torch.nn as nn` does not shadow it
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=250, random_state=42, verbose=True)

ensemble = VotingClassifier(estimators=[('rf', rf), ('gb', gb), ('lr', lr), ('knn', knn), ('nn', mlp)], voting='soft')

ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Ensemble Model Accuracy: {accuracy:.4f}")

Iteration 1, loss = 0.45850315
Iteration 2, loss = 0.33198355
Iteration 3, loss = 0.28771426
Iteration 4, loss = 0.25732090
Iteration 5, loss = 0.23557136
Iteration 6, loss = 0.21997906
Iteration 7, loss = 0.20828322
Iteration 8, loss = 0.19952774
Iteration 9, loss = 0.19190333
Iteration 10, loss = 0.18607073
Iteration 11, loss = 0.18069826
Iteration 12, loss = 0.17578413
Iteration 13, loss = 0.17189825
Iteration 14, loss = 0.16793667
Iteration 15, loss = 0.16512285
Iteration 16, loss = 0.16196320
Iteration 17, loss = 0.15930931
Iteration 18, loss = 0.15655462
Iteration 19, loss = 0.15434907
Iteration 20, loss = 0.15249115
Iteration 21, loss = 0.15056798
Iteration 22, loss = 0.14842890
Iteration 23, loss = 0.14734587
Iteration 24, loss = 0.14533529
Iteration 25, loss = 0.14366885
Iteration 26, loss = 0.14256914
Iteration 27, loss = 0.14053822
Iteration 28, loss = 0.13954841
Iteration 29, loss = 0.13806140
Iteration 30, loss = 0.13702743
Iteration 31, loss = 0.13602856
Iteration 32, los
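
As a side note on `voting='soft'`: a soft-voting ensemble averages its members' predicted class probabilities and picks the class with the highest average. The sketch below checks this by hand on toy data. The dataset, the three member models, and the `*_demo` names are assumptions for illustration and are not the ensemble trained above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=500)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
).fit(X_demo, y_demo)

# Average the fitted members' probabilities by hand (equal weights)
manual_proba = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)
manual_pred = voter.classes_[np.argmax(manual_proba, axis=1)]

# Both checks should agree with what the VotingClassifier computes itself
print(np.allclose(manual_proba, voter.predict_proba(X_demo)))
print(np.array_equal(manual_pred, voter.predict(X_demo)))
```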



### Neural Network

In [39]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

In [40]:
# Define the neural network model
class BinaryClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(23, 5)  # 23 input features to 5 hidden units
        self.fc2 = nn.Linear(5, 1)   # 5 hidden units to 1 output logit

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # ReLU activation after first layer
        x = self.fc2(x)              # No activation here, since we'll apply sigmoid in loss
        return x

In [41]:
model = BinaryClassifier()
criterion = nn.BCEWithLogitsLoss()  # Binary cross-entropy on raw logits (applies sigmoid internally)
optimizer = optim.Adam(model.parameters(), lr=0.005)
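
The model returns raw logits because `nn.BCEWithLogitsLoss` applies the sigmoid itself. A small check of that, using made-up logits and targets (the tensors below are illustrative, not data from this notebook):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 1)                      # raw model outputs
targets = torch.randint(0, 2, (8, 1)).float()   # 0/1 labels as floats

# Fused version: sigmoid + binary cross-entropy in one numerically stable call
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)

# Two-step version: explicit sigmoid followed by plain binary cross-entropy
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_fused, loss_two_step, atol=1e-6))
```

The fused form is generally preferred because it avoids computing `log(sigmoid(x))` in two separate steps, which can lose precision for large-magnitude logits.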

In [42]:
# Convert data to tensor
X_train_tensor = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(dataset, batch_size=128, shuffle=True)

In [43]:
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()  # Clear the gradients

        # Forward pass
        outputs = model(inputs)
        
        # Compute loss (BCEWithLogitsLoss applies the sigmoid internally)
        loss = criterion(outputs, labels.unsqueeze(1))
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()

    # Print loss for every epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader)}")

Epoch [1/20], Loss: 0.43337792816801346
Epoch [2/20], Loss: 0.28688304050677066
Epoch [3/20], Loss: 0.24352419462756833
Epoch [4/20], Loss: 0.21788295574378277
Epoch [5/20], Loss: 0.20307615734096887
Epoch [6/20], Loss: 0.19469161693384682
Epoch [7/20], Loss: 0.18946180503437485
Epoch [8/20], Loss: 0.18702341205831888
Epoch [9/20], Loss: 0.18399590334814528
Epoch [10/20], Loss: 0.18233177173828732
Epoch [11/20], Loss: 0.1800114022216935
Epoch [12/20], Loss: 0.1795334025569584
Epoch [13/20], Loss: 0.17854435428761053
Epoch [14/20], Loss: 0.1772576200141423
Epoch [15/20], Loss: 0.17608573611663736
Epoch [16/20], Loss: 0.17507203835724056
Epoch [17/20], Loss: 0.17466238545982735
Epoch [18/20], Loss: 0.17390822117095409
Epoch [19/20], Loss: 0.17232500486399815
Epoch [20/20], Loss: 0.17299051528823547


In [44]:
# Convert data to tensor
X_test_tensor = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(dataset, batch_size=64, shuffle=False)  # order does not matter for evaluation

In [45]:
correct = 0
total = 0

model.eval()  # Set the model to evaluation mode (turns off dropout, batchnorm, etc.)

with torch.no_grad():  # No need to calculate gradients during evaluation
    for inputs, labels in test_loader:
        outputs = model(inputs)
        
        # Apply sigmoid to the logits and threshold at 0.5 to get class predictions
        preds = torch.sigmoid(outputs) > 0.5
        
        # Count correct predictions
        correct += (preds.squeeze() == labels.squeeze()).sum().item()
        total += labels.size(0)

accuracy = correct / total * 100
print(f'Accuracy: {accuracy:.2f}%')

Accuracy: 93.07%
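
The evaluation loop above can also be written as a small helper. This is a sketch under the assumption that logits have shape `(N, 1)` and labels have shape `(N,)`; the name `accuracy_from_logits` is hypothetical. It uses the fact that `sigmoid(z) > 0.5` exactly when `z > 0`, so the sigmoid can be skipped when only the predicted class is needed.

```python
import torch

def accuracy_from_logits(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of correct predictions for logits of shape (N, 1) and labels of shape (N,)."""
    preds = (logits.squeeze(1) > 0).float()  # logit > 0  <=>  sigmoid(logit) > 0.5
    return (preds == labels).float().mean().item()

# Toy check: predictions are [1, 0, 1, 0], labels are [1, 0, 0, 0], so 3 of 4 match
example_logits = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])
example_labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(accuracy_from_logits(example_logits, example_labels))  # 0.75
```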


### Model Compression

In [46]:
y_unlabeled_pred = ensemble.predict(X_unlabeled)

In [48]:
accuracy = accuracy_score(y_unlabeled, y_unlabeled_pred)
print(f"Ensemble Model Accuracy on Unlabeled Data: {accuracy:.4f}")

Ensemble Model Accuracy on Unlabeled Data: 0.9487


In [49]:
X_train_new = np.concatenate([X_train.to_numpy(), X_unlabeled.to_numpy()])
y_train_new = np.concatenate([y_train.to_numpy(), y_unlabeled_pred])

In [50]:
model = BinaryClassifier()
criterion = nn.BCEWithLogitsLoss()  # Binary cross-entropy on raw logits (applies sigmoid internally)
optimizer = optim.Adam(model.parameters(), lr=0.005)

In [51]:
# Convert data to tensor
X_train_tensor = torch.tensor(X_train_new, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_new, dtype=torch.float32)

# Create DataLoader for batching
dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(dataset, batch_size=128, shuffle=True)

In [52]:
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()  # Clear the gradients

        # Forward pass
        outputs = model(inputs)
        
        # Compute loss (BCEWithLogitsLoss applies the sigmoid internally)
        loss = criterion(outputs, labels.unsqueeze(1))
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()

    # Print loss for every epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader)}")

Epoch [1/20], Loss: 0.22598351803868252
Epoch [2/20], Loss: 0.12579602438502657
Epoch [3/20], Loss: 0.1091491068065696
Epoch [4/20], Loss: 0.10358048552590503
Epoch [5/20], Loss: 0.1003987894081487
Epoch [6/20], Loss: 0.09774995086298603
Epoch [7/20], Loss: 0.09577612479827201
Epoch [8/20], Loss: 0.09378314490765711
Epoch [9/20], Loss: 0.09151737829244833
Epoch [10/20], Loss: 0.09008502311588544
Epoch [11/20], Loss: 0.08850746163341978
Epoch [12/20], Loss: 0.08760649703802342
Epoch [13/20], Loss: 0.08668450693018394
Epoch [14/20], Loss: 0.08612710487352788
Epoch [15/20], Loss: 0.0855559976322359
Epoch [16/20], Loss: 0.08513015458776337
Epoch [17/20], Loss: 0.08505208013333418
Epoch [18/20], Loss: 0.08440986881268962
Epoch [19/20], Loss: 0.08409467123704135
Epoch [20/20], Loss: 0.08385594861893939


In [53]:
# Convert data to tensor
X_test_tensor = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(dataset, batch_size=64, shuffle=False)  # order does not matter for evaluation

In [54]:
correct = 0
total = 0

model.eval()  # Set the model to evaluation mode (turns off dropout, batchnorm, etc.)

with torch.no_grad():  # No need to calculate gradients during evaluation
    for inputs, labels in test_loader:
        outputs = model(inputs)
        
        # Apply sigmoid to the logits and threshold at 0.5 to get class predictions
        preds = torch.sigmoid(outputs) > 0.5
        
        # Count correct predictions
        correct += (preds.squeeze() == labels.squeeze()).sum().item()
        total += labels.size(0)

accuracy = correct / total * 100
print(f'Accuracy: {accuracy:.2f}%')

Accuracy: 94.51%


In [None]:
# Ensemble: 95.00%
# Small Neural Net: 93.07%
# Small Neural Net w/ Model Compression: 94.51%

# Compression improves the small net's test accuracy by about 1.4 percentage points (93.07% -> 94.51%)

In [55]:
def count_parameters(model):
    total = 0
    
    # Logistic Regression
    if isinstance(model, LogisticRegression):
        total += model.coef_.size + model.intercept_.size
    
    # MLPClassifier
    elif isinstance(model, MLPClassifier):
        total += sum(W.size for W in model.coefs_)
        total += sum(b.size for b in model.intercepts_)
    
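    # RandomForest (no weights; total tree node count is used as a rough size proxy)
    elif isinstance(model, RandomForestClassifier):
        total += sum(tree.tree_.node_count for tree in model.estimators_)
    
    # GradientBoosting (same node-count proxy; one regression tree per stage for a binary target)
    elif isinstance(model, GradientBoostingClassifier):
        total += sum(est[0].tree_.node_count for est in model.estimators_)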
    
    # KNN (0 learned params)
    elif isinstance(model, KNeighborsClassifier):
        total += 0
    
    return total


# Count parameters for the ensemble
total_params = 0
for name, est in ensemble.named_estimators_.items():
    params = count_parameters(est)
    print(f"{name}: {params} parameters")
    total_params += params

print("\nTOTAL parameters across the entire ensemble:", total_params)

rf: 257672 parameters
gb: 1500 parameters
lr: 24 parameters
knn: 0 parameters
nn: 1601 parameters

TOTAL parameters across the entire ensemble: 260797


In [69]:
total_nn_params = sum(p.numel() for p in model.parameters())
print("TOTAL parameters in neural network:", total_nn_params)

TOTAL parameters in neural network: 126
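
The 126 can be checked by hand: each fully connected layer has `n_in * n_out` weights plus `n_out` biases, so the 23-5-1 network has (23·5 + 5) + (5·1 + 1) = 126 parameters. A tiny helper (the name `dense_param_count` is hypothetical) reproduces this, along with the 1601 reported for the 64-unit `MLPClassifier` inside the ensemble:

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(dense_param_count([23, 5, 1]))   # 126  (small PyTorch network)
print(dense_param_count([23, 64, 1]))  # 1601 (MLPClassifier in the ensemble)
```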


In [70]:
print("Ratio:", total_params/total_nn_params)

Ratio: 2069.8174603174602
