<a href="https://colab.research.google.com/github/chayvw18/Deep-Learning-PyTorch/blob/main/OF_CV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%cd /content/drive/MyDrive/Deep-Learning/Overfitting-CrossValidation

/content/drive/MyDrive/Deep-Learning/Overfitting-CrossValidation


**The Problem with Overfitting**
<br>
A model overfits when it fits the noise in the training data rather than the underlying pattern.
<br>
Overfitting limits the model's ability to generalize the learned pattern to new data.
<br>

*Overfitting*
* Overly sensitive to noise
* Increased sensitivity to subtle effects
* Reduced generalizability
* Over-parameterized models become difficult to estimate

<br>
*Underfitting*
* Less sensitive to noise
* Less likely to detect true effects
* Reduced generalizability
* Parameters are better estimated
* Good results with less data
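The trade-off in the two lists above can be illustrated with a small polynomial-fitting sketch. The sine-wave ground truth, the noise level, the sample sizes, and the degrees 1, 3, and 10 are all assumptions chosen for illustration; none of them come from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    # Hypothetical ground truth: a sine wave plus Gaussian noise
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # larger set of unseen data

def mse(coefs, x, y):
    # Mean squared error of the fitted polynomial on (x, y)
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

Training error cannot increase as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The error on unseen data is the quantity that reveals whether a model is too simple or too flexible.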

<br>
**How to know the correct number of parameters?**
<br>
With 1-2 dimensions: visualize the data and make an informed decision
<br>
*It is practically impossible to visualize the model space for high-dimensional data*
<br>
With 3+ dimensions: use cross-validation
<br>
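As a minimal sketch of what cross-validation does with the sample indices, the helper below splits a dataset into k folds and rotates which fold is held out. The function name `kfold_indices`, the choice of k = 5, and the seed are hypothetical; the notebook itself uses a single train/test split.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # shuffle once, then cut into folds
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]                  # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=150, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} validation")
```

Each candidate model (for example, each number of parameters) would be trained on the training indices of every fold and scored on the held-out indices, and the candidate with the best average held-out score would be preferred.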
**How to avoid Overfitting**
<br>
1. Use cross-validation (training/hold-out/test sets)
2. Use regularization (L2, dropout, data manipulations, early stopping)
<br>
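Of the regularization options listed above, early stopping is the easiest to sketch without a model. The helper below is a hypothetical illustration, and the validation losses are made-up numbers: training stops once the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: training ran to the end

# Made-up validation losses: they improve, then creep back up
fake_val_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.60, 0.64, 0.70]
stop_epoch = early_stopping_epoch(fake_val_losses, patience=3)
print("stop at epoch", stop_epoch)
```

In PyTorch, the other two options are usually available directly: L2 regularization through the `weight_decay` argument of optimizers such as `torch.optim.SGD`, and dropout through `nn.Dropout` layers.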
**Hidden Overfitting**
<br>
*Researcher degrees of freedom*: The researcher has many choices for how to clean, organize, and select the data, and for which models and how many models to run
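One common guard against this kind of hidden overfitting is to keep a test set that is touched only once, after all modelling choices have been made on a separate hold-out (dev) set. The sketch below assumes an 80/10/10 split; the function name `three_way_split` and the proportions are hypothetical.

```python
import numpy as np

def three_way_split(n_samples, prop_train=0.8, prop_dev=0.1, seed=0):
    """Return disjoint train, dev, and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(n_samples * prop_train))
    n_dev = int(round(n_samples * prop_dev))
    train_idx = order[:n_train]
    dev_idx = order[n_train:n_train + n_dev]   # used to compare models
    test_idx = order[n_train + n_dev:]         # used once, at the very end
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = three_way_split(150)
print(len(train_idx), len(dev_idx), len(test_idx))
```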


**Generalization and its boundaries**
<br>
Generalization: the model works well when applied to new data, i.e. data it has not seen during training
<br>
Generalization boundaries: the population of data you want the model to apply to

In [2]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# IPython's display.set_matplotlib_formats is deprecated; use matplotlib_inline instead
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')


In [3]:
import seaborn as sns

iris = sns.load_dataset('iris')

#convert from pandas dataframe to tensor
data = torch.tensor(iris[iris.columns[0:4]].values).float()

#transform species to number
labels = torch.zeros(len(data), dtype=torch.long)

labels[iris.species == 'versicolor'] = 1
labels[iris.species == 'virginica'] = 2

In [13]:
#separate data into train and test
## devset here

#how many training examples
propTraining = .8 # as a proportion, not a percent
nTraining = int(len(labels)* propTraining)

#initialize a boolean vector to select data and labels

traintestBool = np.zeros(len(labels), dtype=bool)

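# is this the correct way to select samples?
# traintestBool[range(nTraining)] = True

# random selection is better: iris is sorted by species, so taking the
# first 80% would leave only virginica samples in the test set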
items2use4train = np.random.choice(range(len(labels)), nTraining, replace=False)

traintestBool[items2use4train] = True

traintestBool

array([ True, False,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True, False, False,  True,  True,  True, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True, False,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False,  True,  True, False,  True,  True,  True, False,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True, False, False, False, False,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,
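The difference between the two selection strategies in the cell above can be checked on a stand-in label vector. This sketch assumes 50 samples per class stored in class order, which matches how the iris dataset is laid out; the variable names are hypothetical.

```python
import numpy as np

# Stand-in for the iris labels: 50 samples per class, sorted by class
labels_np = np.repeat([0, 1, 2], 50)
n_train = int(round(len(labels_np) * 0.8))

# Sequential split: first 80% for training, last 20% for testing
seq_test_labels = labels_np[n_train:]
print("sequential test classes:", np.unique(seq_test_labels))

# Random split: draw the training indices without replacement
rng = np.random.default_rng(0)
train_idx = rng.choice(len(labels_np), n_train, replace=False)
is_train = np.zeros(len(labels_np), dtype=bool)
is_train[train_idx] = True
print("random test class counts:", np.bincount(labels_np[~is_train], minlength=3))
```

With the sequential split, the test set contains only class 2 and the training set sees just 20 of its samples, so test accuracy would say nothing about the other two classes.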

In [14]:
# test whether it's balanced
print('Average of full data:')
print(torch.mean(labels.float())) # = 1 because the three classes (0, 1, 2) are equally sized
print(' ')

print('Average of training data:')
print(torch.mean(labels[traintestBool].float())) # should be close to 1
print(' ')

print('Average of test data:')
print(torch.mean(labels[~traintestBool].float())) # should also be close to 1
print(' ')

Average of full data:
tensor(1.)
 
Average of training data:
tensor(1.)
 
Average of test data:
tensor(1.)
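A mean label of 1 is a necessary sign of balance here but not a sufficient one, since quite different class mixes can produce the same mean. The sketch below uses two made-up label vectors to show this, and counts samples per class as a stricter check.

```python
import numpy as np

balanced = np.repeat([0, 1, 2], 10)       # 10 samples per class
only_class_1 = np.ones(30, dtype=int)     # a single class, same mean

for name, y in [("balanced", balanced), ("only class 1", only_class_1)]:
    counts = np.bincount(y, minlength=3)  # number of samples in each class
    print(f"{name}: mean = {y.mean()}, counts = {counts}")
```

For the tensor `labels` used in this notebook, `torch.bincount(labels[traintestBool])` should give the equivalent per-class counts.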
 


In [15]:
#model architecture
ANNiris = nn.Sequential(
    nn.Linear(4, 64), #input layer
    nn.ReLU(), #activation
    nn.Linear(64, 64), #hidden layer
    nn.ReLU(), #activation
    nn.Linear(64, 3) #output layer

)

#loss function
lossFun = nn.CrossEntropyLoss() # applies log-softmax internally, so the model outputs raw logits

#optimizer
optimizer = torch.optim.SGD(ANNiris.parameters(), lr=.01)
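The reason the output layer above has no softmax is that `nn.CrossEntropyLoss` combines a log-softmax with a negative log-likelihood loss. The sketch below checks that equivalence on random logits; the tensor shapes and target values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3)                # raw model outputs: 5 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1, 0])   # integer class labels

# Built-in loss applied directly to the raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps
manual = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), manual.item())
```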

In [19]:
#entire dataset
print(data.shape)

#training set
print(data[traintestBool].shape)

#test
print(data[~traintestBool, :].shape)


torch.Size([150, 4])
torch.Size([120, 4])
torch.Size([30, 4])


In [21]:
#train the model

numepochs = 1000

#initialize losses
losses = torch.zeros(numepochs)
ongoingACC = []

#loop over epochs
for epochi in range(numepochs):

  #forward pass
  yhat = ANNiris(data[traintestBool,:])

  #compute the accuracy
  ongoingACC.append(100 * torch.mean((torch.argmax(yhat, axis=1) == labels[traintestBool]).float() ))

  #compute loss
  loss = lossFun(yhat, labels[traintestBool])
  losses[epochi] = loss.item() # store the plain number, not the graph-attached tensor

  #backprop
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

In [23]:
#compute train and test accuracies

#final forward pass using TRAINING DATA
predictions =  ANNiris(data[traintestBool,:])
trainacc = 100 * torch.mean((torch.argmax(predictions, axis=1) == labels[traintestBool]).float() )


#final forward pass USING TEST DATA!
predictions =  ANNiris(data[~traintestBool,:])
testacc =  100 * torch.mean((torch.argmax(predictions, axis=1) == labels[~traintestBool]).float() )
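Because the same accuracy expression appears several times in this notebook, it could be factored into a small helper. The function name `accuracy_percent` and the toy tensors below are hypothetical; this is a sketch, not the notebook's own code.

```python
import torch

def accuracy_percent(logits, targets):
    """Percentage of rows whose arg-max class equals the target label."""
    predicted = torch.argmax(logits, dim=1)
    return 100 * (predicted == targets).float().mean().item()

# Toy example: three samples, three classes
toy_logits = torch.tensor([[2.0, 1.0, 0.0],
                           [0.0, 3.0, 1.0],
                           [5.0, 0.0, 1.0]])
toy_targets = torch.tensor([0, 1, 2])
toy_acc = accuracy_percent(toy_logits, toy_targets)
print(toy_acc)  # two of the three predictions match
```

In this notebook it would be called as `accuracy_percent(ANNiris(data[~traintestBool,:]), labels[~traintestBool])`, ideally inside a `with torch.no_grad():` block, since no gradients are needed for evaluation.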

In [24]:
#report accuracies
print('Final TRAIN accuracy: %g%%' %trainacc)
print('Final Test accuracy: %g%%' %testacc)

Final TRAIN accuracy: 98.3333%
Final Test accuracy: 100%
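A test accuracy above the training accuracy looks odd at first, but with sets this small each sample moves the percentage a lot. The arithmetic below uses only the set sizes (120 and 30) and the accuracies reported above; the variable names are hypothetical.

```python
n_train, n_test = 120, 30

# How many percentage points a single sample is worth in each set
step_train = 100 / n_train
step_test = 100 / n_test

# Number of misclassified samples implied by the reported accuracies
train_errors = round((100 - 98.3333) / step_train)
test_errors = round((100 - 100) / step_test)

print(f"one training sample = {step_train:.2f} points, one test sample = {step_test:.2f} points")
print(f"training errors: {train_errors}, test errors: {test_errors}")
```

The gap between the two numbers therefore comes down to two misclassified training samples, which is well within what a different random split could change.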
