# Overfitting
Overfitting
- Overly sensitive to noise
- Increased sensitivity to subtle effects
- Poor generalization to new data
- Overparametrized models become more difficult to estimate

Underfitting
- Less sensitive to noise
- Less likely to detect true effects
- Reduced generalization to new data
- Parameters are better estimated
- Good results with small sample sizes

With 1-2 dimensions: Visualize the data and make an informed decision

With more dimensions: Use cross-validation
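
The trade-off listed above can be sketched in one dimension by fitting polynomials of increasing degree to noisy data. The synthetic quadratic data, the noise level, and the chosen degrees (1, 2, 10) are illustrative assumptions, not part of the original notes. Training error can only shrink as the degree grows, while test error typically stops improving once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: a quadratic trend plus Gaussian noise (illustrative only)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

# Hold out every third point as a test set
test_mask = np.zeros(x.size, dtype=bool)
test_mask[::3] = True
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

def mse(degree):
    """Fit a polynomial of the given degree on the training points and
    return (training MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err

# degree 1 underfits a quadratic, degree 2 matches it, degree 10 is overparametrized
errors = {degree: mse(degree) for degree in (1, 2, 10)}
for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

Comparing the two columns is the same check that cross-validation performs when the data cannot be plotted.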

## Avoid overfitting

- Use cross-validation (training/held-out/test sets)
- Use regularization (penalize complexity)
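
As a minimal sketch of penalizing complexity, one common option in PyTorch is L2 regularization through the `weight_decay` argument of `torch.optim.SGD`. The small model, the value `1e-3`, and the helper `l2_penalty` are assumptions made for illustration; the notes do not prescribe a particular regularizer.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))

# weight_decay adds weight_decay * parameter to each gradient;
# 1e-3 is an arbitrary illustrative value, not a recommendation
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)

def l2_penalty(model, strength):
    """Hypothetical helper: explicit L2 penalty on all parameters."""
    return strength * sum((p ** 2).sum() for p in model.parameters())

# For plain SGD, weight_decay=w matches adding (w / 2) * sum(p**2) to the loss,
# so the explicit form below would be used INSTEAD of weight_decay, not with it
X = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))
penalty = l2_penalty(model, strength=0.5 * 1e-3)
loss = nn.CrossEntropyLoss()(model(X), y) + penalty
```

Larger penalty strengths push the weights toward zero and make the model less sensitive to noise, at the risk of underfitting.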

## Cross-validation
Split data into training, hold-out, and test sets
- Training set: Fit model parameters
- Hold-out set (dev-set): Tune hyperparameters and compare models
- Test set: Evaluate final model performance
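
One way to obtain the three sets is to call scikit-learn's `train_test_split` twice. The 80/10/10 proportions and the toy arrays below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features and alternating binary labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: 80% training, 20% set aside
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: divide the remaining 20% evenly into dev and test sets
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(X_train.shape, X_dev.shape, X_test.shape)  # (80, 2) (10, 2) (10, 2)
```

The test set is touched only once, after all tuning on the dev set is finished.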

K-fold cross-validation
- Split data into K folds
- Train on K-1 folds
- Test on the remaining fold
- Repeat K times
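
The K-fold steps above can be sketched with scikit-learn's `KFold`. The toy array and `n_splits=5` are illustrative assumptions; in practice a fresh model would be trained inside the loop and the K test scores averaged.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_seen = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # train on the K-1 folds in train_idx, evaluate on the fold in test_idx
    print(f"fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
    test_seen.extend(test_idx.tolist())

# every sample is used for testing exactly once across the K folds
print(sorted(test_seen) == list(range(len(X))))  # True
```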

## Generalization

Generalization: The model works well on unseen data.

Generalization boundaries: The population you want to apply the model to.

Simple example:
$\text{weight} = \beta_1 \cdot \text{height} + \beta_2 \cdot \text{calories}$

Generalization boundaries:
- Must work on humans (both sexes, all countries)
- Doesn't need to work on children
- Doesn't need to work on animals

Generalization entails some loss of accuracy.

## Manual cross-validation

In [1]:
import torch
import torch.nn as nn
import numpy as np
import seaborn as sns



In [6]:
iris = sns.load_dataset('iris') # pandas DataFrame

# convert from pandas dataframe to tensor
data = torch.tensor(iris[iris.columns[0:4]].values, dtype=torch.float32)

# transform species to numeric values
labels = torch.zeros(len(data), dtype=torch.long)
labels[iris.species == 'versicolor'] = 1
labels[iris.species == 'virginica'] = 2


In [12]:
iris.head()

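|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |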


In [14]:
# separate data into training and test sets

propTraining = 0.8 # proportion of data to use for training
nTraining = int(propTraining * len(labels)) # number of training samples

traintestBool = np.zeros(len(labels), dtype=bool) # boolean array

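# non-random alternative: take the first nTraining rows
# (iris is sorted by species, so the test set would contain only one class)
#traintestBool[range(nTraining)] = True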

items2use4training = np.random.choice(range(len(labels)), size=nTraining, replace=False)
traintestBool[items2use4training] = True

In [15]:
# sanity-check the split: labels are 0/1/2 in equal numbers,
# so a class-balanced subset has an average label close to 1
print('Average of full data')
print(torch.mean(labels.float()))
print(' ')

print('Average of training data')
print(torch.mean(labels[traintestBool].float()))
print(' ')

print('Average of test data')
print(torch.mean(labels[~traintestBool].float()))

Average of full data
tensor(1.)
 
Average of training data
tensor(1.0750)
 
Average of test data
tensor(0.7000)
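
The averages above (1.075 for training, 0.70 for test) show that a purely random split can leave the classes unevenly represented. One possible remedy, sketched here as an assumption rather than as the notebook's method, is a stratified split via the `stratify` argument of scikit-learn's `train_test_split`. The array `labels_np` is a stand-in for the iris labels (50 samples per class).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the iris labels: 50 samples of each of the three classes
labels_np = np.repeat([0, 1, 2], 50)
indices = np.arange(len(labels_np))

# stratify keeps the class proportions the same in both subsets
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels_np, random_state=0)

print(np.bincount(labels_np[train_idx]))  # [40 40 40]
print(np.bincount(labels_np[test_idx]))   # [10 10 10]
```

With equal class counts in both subsets, the average label is exactly 1 for the training and the test data.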


In [19]:
# create the model

ANNiris = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 3)
)

# define the loss function
loss_fn = nn.CrossEntropyLoss()

# define the optimizer
learning_rate = 1e-2
optimizer = torch.optim.SGD(ANNiris.parameters(), lr=learning_rate)


In [17]:
# entire dataset
print(data.shape)

#training data
print(data[traintestBool].shape)
# test data
print(data[~traintestBool].shape)

torch.Size([150, 4])
torch.Size([120, 4])
torch.Size([30, 4])


In [20]:
# training loop
num_epochs = 1000

# initialize loss array
losses = np.zeros(num_epochs)
ongoinAccuracy = np.zeros(num_epochs)

# loop over epochs
for epoch in range(num_epochs):
    yHat = ANNiris(data[traintestBool]) # forward pass
    ongoinAccuracy[epoch] = torch.mean((torch.argmax(yHat, dim=1) == labels[traintestBool]).float())

    loss = loss_fn(yHat, labels[traintestBool]) # compute loss
    losses[epoch] = loss.item() # store loss for this epoch

    optimizer.zero_grad() # zero gradients
    loss.backward() # backward pass
    optimizer.step() # update parameters

In [22]:
# compute accuracy on the training set (no gradients needed for evaluation)
with torch.no_grad():
    predictions = ANNiris(data[traintestBool,:])
trainacc = torch.mean((torch.argmax(predictions, dim=1) == labels[traintestBool]).float())

# compute accuracy on the test set
with torch.no_grad():
    predictions = ANNiris(data[~traintestBool,:])
testacc = torch.mean((torch.argmax(predictions, dim=1) == labels[~traintestBool]).float())

In [23]:
# report accuracy

print('Training accuracy: ' + str(trainacc))
print('Test accuracy: ' + str(testacc))

Training accuracy: tensor(0.9833)
Test accuracy: tensor(0.9667)
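
Because the same accuracy computation is applied to both subsets, it could be wrapped in a small helper. The function name `accuracy`, the single-layer model, and the random demo data are hypothetical and only illustrate the pattern.

```python
import torch
import torch.nn as nn

def accuracy(model, X, y):
    """Return the fraction of samples whose predicted class matches y."""
    with torch.no_grad():  # evaluation only, so gradients are not tracked
        predicted = torch.argmax(model(X), dim=1)
    return (predicted == y).float().mean().item()

# Demo with an untrained model and random data (not the iris data)
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3))
X = torch.randn(10, 4)
y = torch.randint(0, 3, (10,))

acc = accuracy(model, X, y)
print(f"accuracy: {acc:.2f}")
```

Returning a Python float through `.item()` also avoids the `tensor(...)` wrapper in printed reports.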


## Cross-validation in scikit-learn

In [27]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
import seaborn as sns


In [29]:
iris = sns.load_dataset('iris') # pandas DataFrame

data = torch.tensor(iris[iris.columns[0:4]].values, dtype=torch.float32)

# transform species to numeric values
labels = torch.zeros(len(data), dtype=torch.long)
labels[iris.species == 'versicolor'] = 1
labels[iris.species == 'virginica'] = 2

In [28]:
# create fake dataset

fakedata = np.tile(np.array([1, 2, 3, 4]), (10, 1)) + np.tile(10*np.arange(1,11), (4, 1)).T
fakelabels = np.arange(10)>4
print(fakedata)
print(fakelabels)

[[ 11  12  13  14]
 [ 21  22  23  24]
 [ 31  32  33  34]
 [ 41  42  43  44]
 [ 51  52  53  54]
 [ 61  62  63  64]
 [ 71  72  73  74]
 [ 81  82  83  84]
 [ 91  92  93  94]
 [101 102 103 104]]
[False False False False False  True  True  True  True  True]


In [34]:
# use scikit-learn to split the data
train_data, test_data, train_labels, test_labels = train_test_split(fakedata, fakelabels, test_size=0.2)

print('Training data size:' + str(train_data.shape))
print('Test data size:' + str(test_data.shape))
print(' ')

print('Training data: ' )
print(train_data)
print(' ')

print('Test data: ' )
print(test_data)
print(' ')

Training data size:(8, 4)
Test data size:(2, 4)
 
Training data: 
[[31 32 33 34]
 [91 92 93 94]
 [61 62 63 64]
 [41 42 43 44]
 [21 22 23 24]
 [81 82 83 84]
 [11 12 13 14]
 [51 52 53 54]]
 
Test data: 
[[ 71  72  73  74]
 [101 102 103 104]]
 


In [35]:
def createANewModel():
    ANNiris = nn.Sequential(
        nn.Linear(4, 64),
        nn.ReLU(),
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 3)
    )

    # define the loss function
    loss_fn = nn.CrossEntropyLoss()

    # define the optimizer
    learning_rate = 1e-2
    optimizer = torch.optim.SGD(ANNiris.parameters(), lr=learning_rate)
    return ANNiris, loss_fn, optimizer
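
Wrapping model creation in a function means every call starts from newly initialized weights, so nothing learned on one split carries over to the next. The sketch below uses a hypothetical `create_model` with the same layer sizes to show this, and also shows that fixing the seed with `torch.manual_seed` makes the initialization repeatable.

```python
import torch
import torch.nn as nn

def create_model():
    """Hypothetical stand-in with the same architecture as the notebook's model."""
    return nn.Sequential(
        nn.Linear(4, 64),
        nn.ReLU(),
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 3),
    )

# Two calls give two independently initialized models
model_a = create_model()
model_b = create_model()
same_init = torch.equal(model_a[0].weight, model_b[0].weight)
print(same_init)  # False

# Resetting the seed before each call reproduces the same initial weights
torch.manual_seed(0)
model_c = create_model()
torch.manual_seed(0)
model_d = create_model()
seeded_same = torch.equal(model_c[0].weight, model_d[0].weight)
print(seeded_same)  # True
```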

In [38]:
train_data

array([[31, 32, 33, 34],
       [91, 92, 93, 94],
       [61, 62, 63, 64],
       [41, 42, 43, 44],
       [21, 22, 23, 24],
       [81, 82, 83, 84],
       [11, 12, 13, 14],
       [51, 52, 53, 54]])

In [36]:
# train the model

def trainTheModel(ANNiris, loss_fn, optimizer, train_data, train_labels, test_data, test_labels, num_epochs=100):
    # training loop
    # initialize loss array
    losses = np.zeros(num_epochs)
    ongoinAccuracy = np.zeros(num_epochs)

    # loop over epochs
    for epoch in range(num_epochs):
        yHat = ANNiris(train_data) # forward pass
        ongoinAccuracy[epoch] = torch.mean((torch.argmax(yHat, dim=1) == train_labels).float())

        loss = loss_fn(yHat, train_labels) # compute loss
        losses[epoch] = loss.item() # store loss for this epoch

        optimizer.zero_grad() # zero gradients
        loss.backward() # backward pass
        optimizer.step() # update parameters

    # compute accuracy on the training set (no gradients needed for evaluation)
    with torch.no_grad():
        predictions = ANNiris(train_data)
    trainacc = torch.mean((torch.argmax(predictions, dim=1) == train_labels).float())

    # compute accuracy on the test set
    with torch.no_grad():
        predictions = ANNiris(test_data)
    testacc = torch.mean((torch.argmax(predictions, dim=1) == test_labels).float())

    # report accuracy

    print('Training accuracy: ' + str(trainacc))
    print('Test accuracy: ' + str(testacc))

    return losses, ongoinAccuracy

In [41]:
ANNIris, lossfun, optimizer = createANewModel()
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.2)
losses, ongoinAccuracy = trainTheModel(ANNIris, lossfun, optimizer, 
                                       train_data, 
                                       train_labels, 
                                       test_data, 
                                       test_labels)


Training accuracy: tensor(0.7167)
Test accuracy: tensor(0.5333)


In [42]:
ongoinAccuracy

array([0.69166666, 0.68333334, 0.68333334, 0.68333334, 0.65833336,
       0.64999998, 0.64999998, 0.65833336, 0.67500001, 0.68333334,
       0.68333334, 0.68333334, 0.68333334, 0.68333334, 0.69166666,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999999,
       0.69999999, 0.69999999, 0.69999999, 0.69999999, 0.69999