# [Do Deep Nets Really Need to be Deep?](https://arxiv.org/pdf/1312.6184)

## Intro

A small net can be trained to mimic the function learned by a complex model, so that function was never too complex for the small net to represent
* Showing that shallow models can learn the same function as deep nets debunks the myth that a function learned by a deep net requires depth. <br><br>
    
Shallow, wide models train more slowly because they have many highly correlated features, so gradient descent converges slowly
* One remedy is a linear bottleneck (weight matrix factorization), which only speeds up training and doesn't increase representational power <br><br>
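
To illustrate why a linear bottleneck adds no representational power, here is a minimal NumPy sketch. The sizes and variable names (`d_in`, `hidden`, `k`, `U`, `V`) are illustrative assumptions, not values from the paper. Two stacked linear maps with no nonlinearity between them collapse into a single weight matrix of rank at most `k`, while needing fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, hidden, k = 20, 50, 5  # illustrative sizes (assumed)

V = rng.standard_normal((k, d_in))    # input -> k linear bottleneck units
U = rng.standard_normal((hidden, k))  # bottleneck -> wide hidden layer (pre-activation)
x = rng.standard_normal((8, d_in))    # a batch of 8 inputs

# Applying the two linear maps in sequence...
two_step = (x @ V.T) @ U.T
# ...is the same as applying one factorized matrix W = U V.
W = U @ V
one_step = x @ W.T

rank = np.linalg.matrix_rank(W)        # at most k
params_dense = hidden * d_in           # unfactorized weight count
params_factored = k * (d_in + hidden)  # weights in U and V combined

print(np.allclose(two_step, one_step), rank, params_dense, params_factored)
```

The factorized form spans a subset of what the dense matrix can express, so any benefit comes from optimization speed and parameter count rather than capacity.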

Training on the teacher model's predictions can work better than training directly on the original dataset because:
* The teacher can eliminate label errors by predicting those examples correctly (through generalization)
* The teacher's soft targets carry more information than hard targets, such as which classes are confusable <br><br>
    
Model compression works best when the unlabeled dataset is much larger than the training set (to reduce the gap between teacher and student) and when the unlabeled examples aren't training points (the teacher is more likely to have overfit those)
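
The compression setup described above could be sketched roughly as follows. Everything here is a stand-in: `teacher` and `student` are small untrained `nn.Sequential` models, `X_pool` is random data playing the role of an unlabeled pool, and the squared-error loss on logits is an assumed mimic objective, not code from this notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

torch.manual_seed(0)

# Hypothetical teacher and student; in practice the teacher would already be trained.
teacher = nn.Sequential(nn.Linear(24, 32), nn.ReLU(), nn.Linear(32, 1))
student = nn.Sequential(nn.Linear(24, 5), nn.ReLU(), nn.Linear(5, 1))

X_pool = torch.randn(256, 24)  # placeholder for an unlabeled feature matrix

# Label the unlabeled pool with the teacher's logits (no gradients needed).
teacher.eval()
with torch.no_grad():
    teacher_logits = teacher(X_pool).squeeze(1)

transfer_set = TensorDataset(X_pool, teacher_logits)

# Train the student to regress the teacher's logits (full batch for brevity).
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)
mimic_loss = nn.MSELoss()
losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = mimic_loss(student(X_pool).squeeze(1), teacher_logits)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

No ground-truth labels appear anywhere in the loop, which is why the transfer set can be any pool of inputs the teacher has not memorized.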

## Soft Targets

It is better to train the student model on logits, since different logits can map to the same distribution under softmax (losing information the complex model learned)
* Softmax can also produce a few values that are much larger than the rest, so cross entropy focuses on them and ignores the others <br><br>

In [1]:
import torch
torch.nn.functional.softmax(torch.tensor([-10.0, 0.0, 10.0]), dim=-1)

tensor([2.0611e-09, 4.5398e-05, 9.9995e-01])

In [2]:
torch.nn.functional.softmax(torch.tensor([10.0, 20.0, 30.0]), dim=-1)  # Different logits, same softmax output

tensor([2.0611e-09, 4.5398e-05, 9.9995e-01])
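
The two cells above differ only by a constant shift of the logits. As a small pure-Python sketch (the helper names `softmax` and `mse` are illustrative, not from the notebook), a loss computed on softmax outputs cannot tell the two vectors apart, while a squared error on the raw logits can:

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

teacher_logits = [10.0, 20.0, 30.0]
student_logits = [-10.0, 0.0, 10.0]  # teacher logits shifted by -20

prob_gap = mse(softmax(teacher_logits), softmax(student_logits))
logit_gap = mse(teacher_logits, student_logits)

print(prob_gap, logit_gap)
```

Here `prob_gap` is zero because softmax is invariant to adding a constant to every logit, whereas `logit_gap` is 400.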

Say the target is $[3.0385 \times 10^{-7}, 6.6928 \times 10^{-3}, 9.9331 \times 10^{-1}]$ and the prediction is $[\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$

$CE_{Loss} \approx -[3.0 \times 10^{-7} \ \log(\frac{1}{3}) + 6.7 \times 10^{-3} \ \log(\frac{1}{3}) + 9.9 \times 10^{-1} \ \log(\frac{1}{3})] \approx 3.3 \times 10^{-7} + 7.4 \times 10^{-3} + 1.09 \approx 1.10$

We can see that nearly all of the loss comes from the largest target, so the model would focus on getting that one right and ignore the other targets.
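
As a quick numeric check of this example, the following pure-Python sketch computes each class's contribution to the cross entropy. The variable names are illustrative; the target and prediction values are the ones stated in the text.

```python
import math

target = [3.0385e-7, 6.6928e-3, 9.9331e-1]
pred = [1 / 3, 1 / 3, 1 / 3]

# Per-class contribution to the cross entropy: -t_i * log(p_i)
terms = [-t * math.log(p) for t, p in zip(target, pred)]
total = sum(terms)
shares = [term / total for term in terms]  # fraction of the loss from each class

for t, s in zip(terms, shares):
    print(f"contribution = {t:.3e}, share of loss = {s:.4%}")
print(f"total = {total:.4f}")
```

With a uniform prediction, each class's share of the loss equals its share of the target mass, so the largest target accounts for over 99% of the total.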

## Data

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [13]:
df = pd.read_csv('../data/AirlineSatisfaction/prepared.csv')
target_column = "isSatisfied"
X = df.drop(columns=[target_column])
y = df[target_column]

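In [14]:
# Hold out 10% of the data to serve as the unlabeled pool for model compression
X, X_unlabeled, y, y_unlabeled = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)

In [15]:
# Split the remaining 90% into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)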

## Models

In [78]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class LargeNet(nn.Module):
    def __init__(self, input_size=(24), output_size=1, num_neurons=32, p=0.5):
        super(LargeNet, self).__init__()
        self.p = p
        input_size = np.prod(input_size)
        self.fc1 = nn.Linear(input_size, num_neurons)
        self.fc2 = nn.Linear(num_neurons, num_neurons)
        self.out = nn.Linear(num_neurons, output_size)

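    def forward(self, x):
        fc1 = F.relu(self.fc1(x))
        fc2 = F.relu(self.fc2(fc1))  # Second hidden layer
        # Dropout is only active in training mode (disabled under model.eval())
        fc2 = F.dropout(fc2, p=self.p, training=self.training)
        logits = self.out(fc2)
        return logits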

class SmallNet(nn.Module):
    def __init__(self, input_size=(24), output_size=1, num_neurons=5):
        super(SmallNet, self).__init__()
        input_size = np.prod(input_size)
        self.fc1 = nn.Linear(input_size, num_neurons)
        self.fc2 = nn.Linear(num_neurons, output_size)

    def forward(self, x):
        fc1 = F.relu(self.fc1(x))
        logits = self.fc2(fc1)  # Output layer takes the hidden activations, not the raw input
        return logits

In [79]:
from tqdm.notebook import tqdm

def train_model(model, optimizer, num_epochs=20):
    model = model.to(device)
    for epoch in range(num_epochs):
        model.train()  # Set the model to training mode
        running_loss = 0.0
    
        progress_bar = tqdm(total=len(train_loader))
    
        for inputs, labels in train_loader:
            inputs, labels = inputs.view(inputs.shape[0], -1).to(device), labels.to(device)
            optimizer.zero_grad()  # Zero the gradients
            
            outputs = model(inputs)  # Forward pass
            loss = criterion(outputs.view(-1), labels)  # Compute the loss
            loss.backward()  # Backward pass
            optimizer.step()  # Update the weights
            
            running_loss += loss.item() * inputs.size(0)
            
            progress_bar.update(1)
            progress_bar.set_description(f"Loss: {loss.item():.4f}")
        progress_bar.close()  # Release the bar before evaluation starts a new one
        all_preds, all_labels = test_model(model)
        num_instances = len(all_preds)
        correct = (all_preds == all_labels).sum()
        accuracy =  correct / num_instances
        errors = num_instances - correct
        epoch_loss = running_loss / len(train_loader.dataset)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}, Accuracy: {accuracy}, Errors: {errors}")

def test_model(model):
    model.eval()  # Set the model to evaluation mode
    progress_bar = tqdm(total=len(test_loader))

    all_preds, all_labels = [], []
    with torch.no_grad():  # No gradients are needed for evaluation
        for inputs, labels in test_loader:
            inputs, labels = inputs.view(inputs.shape[0], -1).to(device), labels.to(device)

            logits = model(inputs).view(-1)  # Flatten (batch, 1) to (batch,)
            preds = torch.sigmoid(logits) > 0.5  # Threshold probabilities at 0.5
            all_preds.append(preds.cpu())
            all_labels.append(labels.cpu().bool())  # Assumes labels are 0/1
            progress_bar.update(1)
    progress_bar.close()
    return torch.cat(all_preds), torch.cat(all_labels)

In [80]:
LR = 1e-3
large_model = LargeNet()
optimizer = optim.Adam(large_model.parameters(), lr=LR)
criterion = nn.BCEWithLogitsLoss()
device =  "cpu"
device

'cpu'

In [81]:
# Convert data to tensor
X_train_tensor = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

In [82]:
# Convert data to tensor
X_test_tensor = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)  # No need to shuffle for evaluation

In [83]:
train_model(large_model, optimizer)

Epoch 1/20, Loss: 62.0892, Accuracy: 0.5431213974952698, Errors: 8545
Epoch 2/20, Loss: 0.8896, Accuracy: 0.5009891390800476, Errors: 9333
Epoch 3/20, Loss: 0.8104, Accuracy: 0.472223699092865, Errors: 9871

KeyboardInterrupt: 