# [Do Deep Nets Really Need to be Deep?](https://arxiv.org/pdf/1312.6184)

## Introduction

With **Model Compression** (training a small model to mimic a larger one), a shallow neural network can <ins>represent</ins> a more accurate function of the data than it reaches when trained directly on the original data set.

Therefore, the complexity of the underlying function and the size of the representation best suited to <ins>learn</ins> that function are two different things.

**Gap**: Previous research assumed that depth gives a representational advantage, but did not question whether depth is actually required in practice.

**Improvement**: Shows that the advantage of depth is not about representational limits, since a shallow net can approximate the same function learned by a deep net (**this also lays the foundation for the original distillation paper**).

## Approach 

**Soft Targets**

It's better to train the student model on logits: different logits can map to the same distribution after softmax, so training on probabilities technically loses information the complex model learned.

Also, softmax can produce a few large values relative to the others, which causes cross entropy to focus on them and largely ignore the rest <br><br>

In [2]:
import torch
print(torch.nn.functional.softmax(torch.tensor([-10.0, 0.0, 10.0]), dim=-1))
print(torch.nn.functional.softmax(torch.tensor([10.0, 20.0, 30.0]), dim=-1))

tensor([2.0611e-09, 4.5398e-05, 9.9995e-01])
tensor([2.0611e-09, 4.5398e-05, 9.9995e-01])
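
The two outputs above are identical because the second logit vector is the first one shifted by a constant ($c = 20$), and softmax is invariant to such shifts. A short derivation, for any logit vector $z$ and constant $c$:

$$\mathrm{softmax}(z + c)_i = \frac{e^{z_i + c}}{\sum_j e^{z_j + c}} = \frac{e^{c} \, e^{z_i}}{e^{c} \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_i$$

So the offset the teacher assigned to its logits ($[-10, 0, 10]$ versus $[10, 20, 30]$) cannot be recovered from the probabilities alone.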


Say the softmax target is $[2.06 \times 10^{-9}, 4.54 \times 10^{-5}, 9.9995 \times 10^{-1}]$ and the prediction is $[\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$

$CE_{Loss} \approx -[2.06 \times 10^{-9} \ \log(\frac{1}{3}) + 4.54 \times 10^{-5} \ \log(\frac{1}{3}) + 9.9995 \times 10^{-1} \ \log(\frac{1}{3})] \approx 1.0986$

We can see that almost all of the loss comes from the largest dimension of the target, so the student model would focus mostly on matching that value.
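
To make the imbalance concrete, here is a minimal sketch that splits the cross-entropy into its per-dimension terms. It uses only the standard library, and the target values are the softmax outputs printed earlier; the variable names are illustrative.

```python
import math

# Softmax target from the example above and a uniform prediction.
target = [2.0611e-09, 4.5398e-05, 9.9995e-01]
prediction = [1 / 3, 1 / 3, 1 / 3]

# Per-dimension cross-entropy terms: -t_i * log(p_i)
terms = [-t * math.log(p) for t, p in zip(target, prediction)]
total = sum(terms)

for i, term in enumerate(terms):
    print(f"dim {i}: term = {term:.3e}, share of loss = {term / total:.6f}")
print(f"total CE = {total:.4f}")
```

The last dimension accounts for more than 99.99% of the loss, so the gradient signal for the other two dimensions is close to zero.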


Now consider a logits target of $[-10.0, 0.0, 10.0]$ and a prediction of $[5, 5, 5]$

$MSE_{Loss} = \frac{1}{3}  [(-10.0 - 5)^2 + (0.0 - 5)^2 + (10.0 - 5)^2]  = \frac{1}{3}  [(-15)^2 + (-5)^2 + (5)^2] = \frac{275}{3} \approx 91.67$

Each dimension of the target and prediction contributes substantially to the loss.
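
The same per-dimension breakdown for the MSE example can be sketched as follows (plain Python, illustrative variable names):

```python
# Logit target and constant prediction from the example above.
target_logits = [-10.0, 0.0, 10.0]
predicted_logits = [5.0, 5.0, 5.0]

# Per-dimension squared errors and their share of the total.
squared_errors = [(t - p) ** 2 for t, p in zip(target_logits, predicted_logits)]
mse = sum(squared_errors) / len(squared_errors)
shares = [e / sum(squared_errors) for e in squared_errors]

for i, (err, share) in enumerate(zip(squared_errors, shares)):
    print(f"dim {i}: squared error = {err:.1f}, share of loss = {share:.3f}")
print(f"MSE = {mse:.2f}")
```

Here the smallest share is about 9%, compared with well under 0.01% for the non-dominant dimensions in the cross-entropy example.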

## Result

**Why does Model Compression perform better than training on original dataset? (Even without using extra data)**

Possible reasons:
* Label noise: the teacher model can filter out label errors by predicting those examples correctly (thanks to generalization)
* Function complexity reduction: the teacher model simplifies regions of the true distribution that are hard to learn
* Uncertainty information: the teacher model's soft targets carry more information than hard targets, e.g. which classes are confusable
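
As a rough illustration of the uncertainty point, the sketch below compares hard labels with the soft targets implied by teacher logits in a binary setting. The logits and labels are hypothetical values chosen for illustration, not outputs of any model in this notebook.

```python
import math

def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical teacher logits and dataset labels for three examples.
teacher_logits = [0.2, 6.0, -3.0]
hard_labels = [1, 1, 0]

soft_targets = [sigmoid(z) for z in teacher_logits]
for label, logit, soft in zip(hard_labels, teacher_logits, soft_targets):
    print(f"hard label = {label}, teacher logit = {logit:+.1f}, soft target = {soft:.4f}")
```

The first two examples share the hard label 1, but the soft targets (roughly 0.55 and 0.998) tell the student that the first one is a borderline case.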

**As accuracy of teacher model increases linearly, what happens to student model accuracy?**

The paper shows the relationship is nearly linear, which means:
* Improvements in the teacher lead to proportional improvements in the student
* The more accurate the teacher, the more accuracy a small, efficient student model can reach
* The smaller models tried in the paper don't run out of capacity

## Application

### Data

In [87]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [88]:
df = pd.read_csv('data/AirlineSatisfaction/prepared.csv')
target_column = "isSatisfied"
X = df.drop(columns=[target_column])
y = df[target_column]

In [89]:
X, X_unlabeled, y, y_unlabeled = train_test_split(X, y, test_size=0.8, random_state=42, stratify=y)

In [90]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [91]:
X_train.shape, X_test.shape

((16624, 23), (4156, 23))

In [152]:
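from torch.utils.data import DataLoader, TensorDataset  # used below; also imported in the Models cell

# Convert data to tensor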
X_train_tensor = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

In [153]:
# Convert data to tensor
X_test_tensor = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32)

# Create DataLoader for batching
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)  # No need to shuffle for evaluation
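
As a quick sanity check on the split and the loaders, the arithmetic below derives the labeled pool size, the test fraction, and the number of batches per epoch from the shapes printed above. It assumes the `DataLoader` default of keeping the last partial batch (`drop_last=False`).

```python
import math

n_train, n_test = 16624, 4156  # row counts from X_train.shape and X_test.shape
batch_size = 32

labeled_pool = n_train + n_test                   # rows kept after the first split
test_fraction = n_test / labeled_pool             # should match test_size=0.2
train_batches = math.ceil(n_train / batch_size)   # last batch is partial
test_batches = math.ceil(n_test / batch_size)

print(labeled_pool, round(test_fraction, 4), train_batches, test_batches)
```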

### Models

In [164]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class LargeNet(nn.Module):
    def __init__(self, input_size=(23), output_size=1, num_neurons=16):
        super(LargeNet, self).__init__()
        input_size = np.prod(input_size)
        self.fc1 = nn.Linear(input_size, num_neurons)
        self.fc2 = nn.Linear(num_neurons, num_neurons)
        self.fc3 = nn.Linear(num_neurons, num_neurons)
        self.out = nn.Linear(num_neurons, output_size)

    def forward(self, x):
        fc1 = F.relu(self.fc1(x))
        fc2 = F.relu(self.fc2(fc1))
        fc3 = F.relu(self.fc3(fc2))
        logits = self.out(fc3)
        return logits

class SmallNet(nn.Module):
    def __init__(self, input_size=(23), output_size=1, num_neurons=2):
        super(SmallNet, self).__init__()
        input_size = np.prod(input_size)
        self.fc1 = nn.Linear(input_size, num_neurons)
        self.fc2 = nn.Linear(num_neurons, output_size)

    def forward(self, x):
        fc1 = F.relu(self.fc1(x))
        logits = self.fc2(fc1)
        return logits

print(LargeNet())
print()
print(SmallNet())

LargeNet(
  (fc1): Linear(in_features=23, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=16, bias=True)
  (fc3): Linear(in_features=16, out_features=16, bias=True)
  (out): Linear(in_features=16, out_features=1, bias=True)
)

SmallNet(
  (fc1): Linear(in_features=23, out_features=2, bias=True)
  (fc2): Linear(in_features=2, out_features=1, bias=True)
)
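
To put a number on how much smaller the student is, the layer shapes printed above can be turned into parameter counts. This is plain arithmetic on the `in_features`/`out_features` values; `linear_params` is a helper introduced here for illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Layer shapes copied from the printed model summaries.
large_params = (
    linear_params(23, 16)    # fc1
    + linear_params(16, 16)  # fc2
    + linear_params(16, 16)  # fc3
    + linear_params(16, 1)   # out
)
small_params = linear_params(23, 2) + linear_params(2, 1)  # fc1 + fc2

print(f"LargeNet: {large_params} parameters")
print(f"SmallNet: {small_params} parameters")
print(f"Ratio: {large_params / small_params:.1f}x")
```

With these shapes, `LargeNet` has 945 parameters and `SmallNet` has 51, roughly an 18.5x difference.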


In [165]:
from tqdm.notebook import tqdm

def train_model(model, optimizer, num_epochs=30):
    model = model.to(device)
    for epoch in range(num_epochs):
        model.train()  # Set the model to training mode
        running_loss = 0.0
    
        progress_bar = tqdm(total=len(train_loader))
    
        for inputs, labels in train_loader:
            inputs, labels = inputs.view(inputs.shape[0], -1).to(device), labels.to(device)
            optimizer.zero_grad()  # Zero the gradients
            
            outputs = model(inputs)  # Forward pass
            loss = criterion(outputs.view(-1), labels)  # Compute the loss
            loss.backward()  # Backward pass
            optimizer.step()  # Update the weights
            
            running_loss += loss.item() * inputs.size(0)
            
            progress_bar.update(1)
            progress_bar.set_description(f"Loss: {loss.item():.4f}")
        all_preds, all_labels = test_model(model)
        num_instances = len(all_preds)
        correct = (all_preds == all_labels).sum()
        accuracy =  correct / num_instances
        errors = num_instances - correct
        epoch_loss = running_loss / len(train_loader.dataset)
        print(
            f"Epoch {epoch+1}/{num_epochs}, "
            f"Loss: {epoch_loss:.4f}, "
            f"Accuracy: {accuracy:.4f}, Errors: {errors}"
        )

def test_model(model):
    model.eval()
    progress_bar = tqdm(total=len(test_loader))

    all_preds, all_labels = [], []
    with torch.no_grad():  # Gradients are not needed for evaluation
        for inputs, labels in test_loader:
            inputs = inputs.view(inputs.shape[0], -1).to(device)

            outputs = torch.sigmoid(model(inputs))  # Logits -> probabilities
            preds = (outputs > 0.5).view(-1)  # Threshold at 0.5
            all_preds.append(preds.cpu())  # Collect on CPU next to the labels
            all_labels.append(labels)
            progress_bar.update(1)
    progress_bar.close()
    return torch.cat(all_preds), torch.cat(all_labels)

In [166]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # is_available must be called; the bare function is always truthy
device

'cuda'

In [168]:
large_model = LargeNet()
optimizer = optim.Adam(large_model.parameters(), lr=4e-3)
criterion = nn.BCEWithLogitsLoss()
train_model(large_model, optimizer, num_epochs=30)

Epoch 1/30, Loss: 0.2816, Accuracy: 0.9153, Errors: 352
Epoch 2/30, Loss: 0.1821, Accuracy: 0.9338, Errors: 275
Epoch 3/30, Loss: 0.1605, Accuracy: 0.9449, Errors: 229
Epoch 4/30, Loss: 0.1496, Accuracy: 0.9430, Errors: 237
Epoch 5/30, Loss: 0.1408, Accuracy: 0.9427, Errors: 238
Epoch 6/30, Loss: 0.1342, Accuracy: 0.9466, Errors: 222
Epoch 7/30, Loss: 0.1283, Accuracy: 0.9401, Errors: 249
Epoch 8/30, Loss: 0.1275, Accuracy: 0.9427, Errors: 238
Epoch 9/30, Loss: 0.1229, Accuracy: 0.9451, Errors: 228
Epoch 10/30, Loss: 0.1202, Accuracy: 0.9487, Errors: 213
Epoch 11/30, Loss: 0.1196, Accuracy: 0.9459, Errors: 225
Epoch 12/30, Loss: 0.1188, Accuracy: 0.9490, Errors: 212
Epoch 13/30, Loss: 0.1152, Accuracy: 0.9483, Errors: 215
Epoch 14/30, Loss: 0.1145, Accuracy: 0.9514, Errors: 202
Epoch 15/30, Loss: 0.1143, Accuracy: 0.9473, Errors: 219
Epoch 16/30, Loss: 0.1125, Accuracy: 0.9514, Errors: 202
Epoch 17/30, Loss: 0.1099, Accuracy: 0.9487, Errors: 213
Epoch 18/30, Loss: 0.1112, Accuracy: 0.9456, Errors: 226
Epoch 19/30, Loss: 0.1074, Accuracy: 0.9483, Errors: 215
Epoch 20/30, Loss: 0.1087, Accuracy: 0.9468, Errors: 221
Epoch 21/30, Loss: 0.1054, Accuracy: 0.9531, Errors: 195
Epoch 22/30, Loss: 0.1039, Accuracy: 0.9519, Errors: 200
Epoch 23/30, Loss: 0.1040, Accuracy: 0.9528, Errors: 196
Epoch 24/30, Loss: 0.1036, Accuracy: 0.9521, Errors: 199
Epoch 25/30, Loss: 0.1017, Accuracy: 0.9495, Errors: 210
Epoch 26/30, Loss: 0.1020, Accuracy: 0.9528, Errors: 196
Epoch 27/30, Loss: 0.1003, Accuracy: 0.9536, Errors: 193
Epoch 28/30, Loss: 0.1006, Accuracy: 0.9555, Errors: 185
Epoch 29/30, Loss: 0.0979, Accuracy: 0.9500, Errors: 208
Epoch 30/30, Loss: 0.1021, Accuracy: 0.9504, Errors: 206


In [167]:
small_model = SmallNet()
optimizer = optim.Adam(small_model.parameters(), lr=4e-3)
criterion = nn.BCEWithLogitsLoss()
train_model(small_model, optimizer, num_epochs=30)

Epoch 1/30, Loss: 0.4128, Accuracy: 0.8953, Errors: 435
Epoch 2/30, Loss: 0.2862, Accuracy: 0.9098, Errors: 375
Epoch 3/30, Loss: 0.2416, Accuracy: 0.9143, Errors: 356
Epoch 4/30, Loss: 0.2247, Accuracy: 0.9240, Errors: 316
Epoch 5/30, Loss: 0.2151, Accuracy: 0.9196, Errors: 334
Epoch 6/30, Loss: 0.2102, Accuracy: 0.9244, Errors: 314
Epoch 7/30, Loss: 0.2073, Accuracy: 0.9244, Errors: 314
Epoch 8/30, Loss: 0.2039, Accuracy: 0.9252, Errors: 311
Epoch 9/30, Loss: 0.2021, Accuracy: 0.9254, Errors: 310
Epoch 10/30, Loss: 0.2007, Accuracy: 0.9249, Errors: 312
Epoch 11/30, Loss: 0.1996, Accuracy: 0.9252, Errors: 311
Epoch 12/30, Loss: 0.1981, Accuracy: 0.9242, Errors: 315
Epoch 13/30, Loss: 0.1978, Accuracy: 0.9271, Errors: 303
Epoch 14/30, Loss: 0.1973, Accuracy: 0.9288, Errors: 296
Epoch 15/30, Loss: 0.1967, Accuracy: 0.9264, Errors: 306
Epoch 16/30, Loss: 0.1957, Accuracy: 0.9259, Errors: 308
Epoch 17/30, Loss: 0.1953, Accuracy: 0.9278, Errors: 300
Epoch 18/30, Loss: 0.1944, Accuracy: 0.9300, Errors: 291
Epoch 19/30, Loss: 0.1942, Accuracy: 0.9288, Errors: 296
Epoch 20/30, Loss: 0.1936, Accuracy: 0.9276, Errors: 301
Epoch 21/30, Loss: 0.1934, Accuracy: 0.9276, Errors: 301
Epoch 22/30, Loss: 0.1926, Accuracy: 0.9283, Errors: 298
Epoch 23/30, Loss: 0.1926, Accuracy: 0.9290, Errors: 295
Epoch 24/30, Loss: 0.1924, Accuracy: 0.9297, Errors: 292
Epoch 25/30, Loss: 0.1919, Accuracy: 0.9305, Errors: 289
Epoch 26/30, Loss: 0.1917, Accuracy: 0.9297, Errors: 292
Epoch 27/30, Loss: 0.1914, Accuracy: 0.9281, Errors: 299
Epoch 28/30, Loss: 0.1908, Accuracy: 0.9302, Errors: 290
Epoch 29/30, Loss: 0.1910, Accuracy: 0.9314, Errors: 285
Epoch 30/30, Loss: 0.1902, Accuracy: 0.9288, Errors: 296


### Model Compression with Logits

In [169]:
def get_teacher_logits(model, dataloader):
    model.eval()
    all_logits = []

    with torch.no_grad():
        for inputs, _ in dataloader:
            inputs = inputs.view(inputs.shape[0], -1).to(device)
            logits = model(inputs)
            all_logits.append(logits.cpu())

    return torch.cat(all_logits, dim=0)

# Create Train DataLoader for batching
# Turn off shuffle to align with data ordering
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)

teacher_logits = get_teacher_logits(large_model, train_loader)
teacher_logits.shape

torch.Size([16624, 1])

In [170]:
class DistillDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, teacher_logits):
        self.X = X
        self.y = y
        self.teacher_logits = teacher_logits

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx], self.teacher_logits[idx]

In [188]:
def train_student_on_logits(student, optimizer, teacher_logits, num_epochs=30):
    student = student.to(device)
    logit_loss_fn = nn.MSELoss()

    # Create distillation dataset + loader
    distill_dataset = DistillDataset(X_train_tensor, y_train_tensor, teacher_logits)
    distill_loader = DataLoader(distill_dataset, batch_size=64, shuffle=True)

    for epoch in range(num_epochs):
        student.train()
        running_loss = 0.0

        progress_bar = tqdm(total=len(distill_loader))

        for inputs, labels, t_logits in distill_loader:
            inputs = inputs.view(inputs.shape[0], -1).to(device)
            labels = labels.to(device)
            t_logits = t_logits.to(device).view(-1)

            optimizer.zero_grad()

            # Student logits
            s_logits = student(inputs).view(-1)

            # RAW MSE loss on logits
            loss = logit_loss_fn(s_logits, t_logits)

            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)

            progress_bar.update(1)
            progress_bar.set_description(f"Loss: {loss.item():.4f}")

        epoch_loss = running_loss / len(distill_loader.dataset)

        # ---------- SAME STATS AS train_model ----------
        all_preds, all_labels = test_model(student)
        num_instances = len(all_preds)
        correct = (all_preds == all_labels).sum()
        accuracy = correct / num_instances
        errors = num_instances - correct

        print(
            f"Epoch {epoch+1}/{num_epochs}, "
            f"Loss: {epoch_loss:.4f}, "
            f"Accuracy: {accuracy:.4f}, Errors: {errors}"
        )
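
Note that the student is trained with MSE on raw logits but still evaluated through `test_model`, which applies a sigmoid and thresholds at 0.5. That threshold is equivalent to checking whether the logit is positive, so accuracy depends only on the sign of the student's logit, while the MSE loss also penalizes magnitude mismatches. A minimal sketch with hypothetical logit values:

```python
import math

def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical student logits (illustrative values only).
logits = [-8.0, -0.5, 0.0, 0.5, 8.0]

by_probability = [sigmoid(z) > 0.5 for z in logits]  # what test_model does
by_sign = [z > 0 for z in logits]                    # equivalent check on raw logits
print(by_probability == by_sign)

# Same predicted class, but a large squared error on the logits.
teacher_logit, student_logit = 8.0, 2.0
squared_error = (teacher_logit - student_logit) ** 2
print(squared_error)
```

This is one way a student can show a large logit MSE while its accuracy is still high: matching the teacher's sign is enough for a correct class, but not for a small loss.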

In [190]:
student_model = SmallNet()
optimizer = optim.Adam(student_model.parameters(), lr=5e-3)
train_student_on_logits(student_model, optimizer, teacher_logits, num_epochs=35)

Epoch 1/35, Loss: 50.5131, Accuracy: 0.8311, Errors: 702
Epoch 2/35, Loss: 31.3941, Accuracy: 0.8641, Errors: 565
Epoch 3/35, Loss: 24.0920, Accuracy: 0.8850, Errors: 478
Epoch 4/35, Loss: 19.4121, Accuracy: 0.8961, Errors: 432
Epoch 5/35, Loss: 16.5838, Accuracy: 0.9052, Errors: 394
Epoch 6/35, Loss: 14.8083, Accuracy: 0.9095, Errors: 376
Epoch 7/35, Loss: 13.6725, Accuracy: 0.9143, Errors: 356
Epoch 8/35, Loss: 13.0156, Accuracy: 0.9115, Errors: 368
Epoch 9/35, Loss: 12.5937, Accuracy: 0.9160, Errors: 349
Epoch 10/35, Loss: 12.3076, Accuracy: 0.9115, Errors: 368
Epoch 11/35, Loss: 12.1191, Accuracy: 0.9148, Errors: 354
Epoch 12/35, Loss: 11.9548, Accuracy: 0.9134, Errors: 360
Epoch 13/35, Loss: 11.8391, Accuracy: 0.9134, Errors: 360
Epoch 14/35, Loss: 11.7642, Accuracy: 0.9095, Errors: 376
Epoch 15/35, Loss: 11.7234, Accuracy: 0.9119, Errors: 366
Epoch 16/35, Loss: 11.7133, Accuracy: 0.9112, Errors: 369
Epoch 17/35, Loss: 11.6646, Accuracy: 0.9143, Errors: 356
Epoch 18/35, Loss: 11.6355, Accuracy: 0.9103, Errors: 373
Epoch 19/35, Loss: 11.6208, Accuracy: 0.9117, Errors: 367
Epoch 20/35, Loss: 11.5823, Accuracy: 0.9119, Errors: 366
Epoch 21/35, Loss: 11.5788, Accuracy: 0.9165, Errors: 347
Epoch 22/35, Loss: 11.5701, Accuracy: 0.9146, Errors: 355
Epoch 23/35, Loss: 11.5693, Accuracy: 0.9153, Errors: 352
Epoch 24/35, Loss: 11.5069, Accuracy: 0.9117, Errors: 367
Epoch 25/35, Loss: 11.5389, Accuracy: 0.9165, Errors: 347
Epoch 26/35, Loss: 11.5296, Accuracy: 0.9158, Errors: 350
Epoch 27/35, Loss: 11.5083, Accuracy: 0.9175, Errors: 343
Epoch 28/35, Loss: 11.4883, Accuracy: 0.9143, Errors: 356
Epoch 29/35, Loss: 11.4568, Accuracy: 0.9158, Errors: 350
Epoch 30/35, Loss: 11.4837, Accuracy: 0.9158, Errors: 350
Epoch 31/35, Loss: 11.4480, Accuracy: 0.9143, Errors: 356
Epoch 32/35, Loss: 11.4769, Accuracy: 0.9141, Errors: 357
Epoch 33/35, Loss: 11.4721, Accuracy: 0.9122, Errors: 365
Epoch 34/35, Loss: 11.4596, Accuracy: 0.9194, Errors: 335
Epoch 35/35, Loss: 11.4767, Accuracy: 0.9155, Errors: 351
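
For a side-by-side reading of the three runs, the final-epoch error counts from the logs above can be converted back to test accuracy. The dictionary keys are labels chosen here for readability.

```python
n_test = 4156  # test rows, from X_test.shape

# Final-epoch error counts copied from the three training logs.
final_errors = {
    "LargeNet (teacher, labels)": 206,
    "SmallNet (labels)": 296,
    "SmallNet (teacher logits)": 351,
}

accuracy = {name: 1 - errors / n_test for name, errors in final_errors.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.4f}")
```

In this particular run the student trained on teacher logits ends at about 0.9155, below the 0.9288 of the same architecture trained on labels. The two runs also differ in learning rate, batch size, and epoch count, so this is a single uncontrolled comparison rather than evidence for or against the paper's result.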
