# Topic 8: Regularization

## Reading: Bishop 5.5

## 1 Motivating regularization

“YEAH, BUT YOUR SCIENTISTS WERE SO PREOCCUPIED WITH WHETHER OR NOT THEY COULD THAT THEY DIDN’T STOP TO THINK IF THEY SHOULD.” -- Dr. Ian Malcolm

<img src="images/goldblum.jpg">


We now have the power to create functions (namely neural networks) that can approximate any other function, given a sufficient number of parameters.  However, as we learned when we were fitting polynomials to curves all the way back at the start of the class, unlimited model complexity is a path fraught with peril.  Recall that if we have a simple dataset like this one:  

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize'] = (9,9)

np.random.seed(0)

# Create constantly-spaced x-values
x = np.linspace(0,1,11)

# Create a linear function of x with slope 1, intercept 1, and normally distributed error with sd=0.1
y = x + np.random.randn(len(x))*0.1 + 1.0
y_test = x + np.random.randn(len(x))*0.1 + 1.0
plt.plot(x,y,'k.')
plt.plot(x,y_test,'r*')
plt.show()

We can fit arbitrarily complex models, to the point that we hit the training data exactly:

In [None]:
plt.plot(x,y,'k.')
x_smooth = np.linspace(0,1,101)
train_errors = []
test_errors = []
degrees = range(1,13,2)
for d in degrees:
    X = np.vander(x,d,increasing=True)
    w = np.linalg.solve(X.T @ X,X.T@y)
    y_pred = X@w
    plt.plot(x_smooth,np.vander(x_smooth,d,increasing=True)@w)
    training_error = 1./len(y)*np.sum((y_pred-y)**2)
    test_error = 1./len(y_test)*np.sum((y_pred-y_test)**2)
    train_errors.append(training_error)
    test_errors.append(test_error)
    
plt.show()

If we then plot the resulting training set and test errors as a function of the number of degrees of freedom, we see this typical pattern:

In [None]:
plt.plot(degrees,train_errors,label='Training error')
plt.plot(degrees,test_errors,label='Test error')
plt.xlabel('Number of polynomial coefficients')  # np.vander(x,d) has d columns, i.e. polynomial degree d-1
plt.ylabel('Objective function value')
plt.legend()
plt.show()    

The training error decreases with increased model complexity, while the test error initially decreases, then begins to *increase* as the model becomes more complex.

Of course this very same thing can happen in neural networks (linear regression is, after all, a very simple version of a multilayer perceptron with no hidden layers and the identity as an activation function).  In the case of linear regression, we made the simple choice to just limit our model complexity.  This is certainly possible with neural nets, by limiting the number of hidden layers and nodes.  However, it's a little bit more challenging to decide just exactly *where* overfitting is coming from, and to tailor the network in response.  Instead, we introduce a technique that we will refer to as *regularization*.  Regularization is broadly understood to be a technique that trades training set accuracy for test set accuracy, and there are a multitude of flavors: we'll explore a few of them here.  

## 2 $L_2$ Regularization
The most common idea for regularizing has been around for a very long time, and it results from the observation that in most regression problems, large weights tend to correspond to overfitting the data.  As such, we can make an effort to reduce overfitting by explicitly penalizing large parameters in the cost function.  In particular, we can simply add the following term:
$$
\mathcal{L}' = \underbrace{\mathcal{L}}_{\text{Sum Square Error}} + \frac{\gamma}{2}\; \sum_{l=1}^{L} \|W^{(l)}\|_{2}^2,
$$
where 
$$
\|W^{(l)}\|_2^2 = \sum_{i} \sum_{j} (w_{ij}^{(l)})^2
$$
is the square of the *Frobenius* norm, which generalizes the normal $L_2$ norm (aka Euclidean distance) to matrices.
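As a quick sanity check on the definition above, the following minimal sketch compares the double sum of squared entries with NumPy's built-in Frobenius norm. The matrix `W` is a hypothetical stand-in for a single layer's weights, not something taken from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrix standing in for one layer's W^{(l)}
W = rng.normal(size=(4, 3))

# Double sum over squared entries, as in the definition above
sum_of_squares = np.sum(W**2)

# NumPy's built-in Frobenius norm, squared
frobenius_squared = np.linalg.norm(W, 'fro')**2

# Flattening the matrix and taking the ordinary Euclidean norm gives the same number
vector_squared = np.linalg.norm(W.ravel())**2

print(sum_of_squares, frobenius_squared, vector_squared)
```

All three printed values should agree up to floating point error, which is the sense in which the Frobenius norm treats the matrix as one long vector.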

## IC8A
$L_2$ regularization has many names including ridge regression and Tikhonov regularization.  However, in the world of machine learning, it is often called **weight decay**.  Compute the derivative of $\frac{\gamma}{2}\; \sum_{l=1}^L \|W^{(l)}\|_2^2$ with respect to some arbitrary weight $w_{ij}^{(l)}$ (ignore the misfit component of the cost function for now), and determine the resulting gradient descent update formula, e.g.
$$
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \ldots
$$
**Why is it called weight decay?**

## 2a $L_2$ Regularization for linear regression
The result derived above is general, but when applied to the linear regression problem we can write down the closed form solution for the optimal parameters as
$$
(X^T X + \gamma \mathcal{I}) \mathbf{w} = X^T y,
$$
where $\mathcal{I}$ is an appropriately sized identity matrix (The additive term along the matrix diagonal gives rise to the name *ridge regression*).  **What does the parameter $\gamma$ do**?
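One way to convince yourself of the closed form is to check numerically that it is a stationary point of the penalized cost. The sketch below assumes the misfit is *half* the sum of squared errors (so that no stray factor of 2 appears) and penalizes every coefficient, intercept included; the data, sizes, and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical small regression problem (names and sizes are illustrative only)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
gamma = 0.5

def objective(w):
    # Assumed cost: half the sum of squared errors plus (gamma/2) * ||w||^2
    return 0.5*np.sum((X@w - y)**2) + 0.5*gamma*np.sum(w**2)

# Closed-form solution from the regularized normal equations
w_ridge = np.linalg.solve(X.T@X + gamma*np.eye(X.shape[1]), X.T@y)

# Gradient of the assumed cost, evaluated at the closed-form solution
grad = X.T@(X@w_ridge - y) + gamma*w_ridge
print(np.abs(grad).max())

# Any nearby parameter vector should have a cost at least as large
perturbed = [objective(w_ridge + 1e-2*rng.normal(size=4)) for _ in range(100)]
print(objective(w_ridge) <= min(perturbed))
```

The gradient should vanish to within round-off, and no random perturbation should achieve a lower cost, since the penalized objective is strictly convex for $\gamma > 0$.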

We can easily implement this for a high-dimensional problem, and explore what happens to our model fit as we adjust $\gamma$

In [None]:
plt.plot(x,y,'k.')
x_smooth = np.linspace(0,1,101)
train_errors = []
test_errors = []
d = 13
gammas = np.logspace(-7,1,12)
for gamma in gammas:
    X = np.vander(x,d,increasing=True)
    identity = np.eye(X.shape[1])
    identity[0,0] = 0 # Why do I do this?
    w = np.linalg.solve(X.T @ X + gamma*identity,X.T@y)
    y_pred = X@w
    plt.plot(x_smooth,np.vander(x_smooth,d,increasing=True)@w)
    training_error = 1./len(y)*np.sum((y_pred-y)**2)
    test_error = 1./len(y_test)*np.sum((y_pred-y_test)**2)
    train_errors.append(training_error)
    test_errors.append(test_error)
    
plt.show()

Once again, we can look at the test and training errors together.

In [None]:
plt.semilogx(gammas,train_errors,label='Training error')
plt.semilogx(gammas,test_errors,label='Test error')
plt.legend()
plt.show()   

A similar picture emerges (although mirrored left to right along the x-axis, because a larger $\gamma$ corresponds to a simpler model).  
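To see the "simpler model" claim in terms of the coefficients themselves, here is a small sketch that tracks the size of the penalized coefficients as $\gamma$ grows. It regenerates the same toy data, but uses only 6 polynomial coefficients and a narrower range of $\gamma$ than the cell above; both are assumptions made to keep the linear solve well conditioned.

```python
import numpy as np

np.random.seed(0)
x = np.linspace(0, 1, 11)
y = x + np.random.randn(len(x))*0.1 + 1.0

d = 6  # number of polynomial coefficients (a smaller, better-conditioned choice than above)
X = np.vander(x, d, increasing=True)
penalty = np.eye(d)
penalty[0, 0] = 0  # leave the intercept unpenalized, as in the cell above

gammas = np.logspace(-4, 1, 6)
penalized_norms = []
for gamma in gammas:
    w = np.linalg.solve(X.T@X + gamma*penalty, X.T@y)
    # Size of the penalized coefficients (everything except the intercept)
    penalized_norms.append(np.sqrt(np.sum(w[1:]**2)))

for gamma, norm in zip(gammas, penalized_norms):
    print(f"gamma={gamma:.0e}  ||w[1:]||={norm:.4f}")
```

The norm of the non-intercept coefficients should shrink as $\gamma$ increases, pulling the fitted curve toward a constant.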

## 2b $L_2$ Regularization for neural networks

As it turns out, this is simple to apply to neural networks as well.  To illustrate its effect, let's synthesize a very noisy dataset to be used for a classification problem.  This is not dissimilar to the iris data set, in the sense that there are classes that overlap one another in their features.

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X,y = make_moons(n_samples=300,noise=0.4)

X,X_test,y,y_test = train_test_split(X,y)
X0,y0 = X.copy(),y.copy()
X0_test,y0_test = X_test.copy(),y_test.copy()

plt.scatter(X[:,0],X[:,1],c=y)
plt.scatter(X_test[:,0],X_test[:,1],c=y_test,marker='x')

There is a clear pattern here, but it is quite noisy.  Let's see if we can fit a very flexible neural network to this dataset using PyTorch.  We'll first go through our ritual of converting the data into the appropriate type and location, and then create a DataLoader for use with batch gradient descent.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

X = torch.from_numpy(X)
X_test = torch.from_numpy(X_test)
y = torch.from_numpy(y)
y_test = torch.from_numpy(y_test)


X = X.to(torch.float32)
X_test = X_test.to(torch.float32)
y = y.to(torch.long)
y_test = y_test.to(torch.long)

device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")

X = X.to(device)
X_test = X_test.to(device)
y = y.to(device)
y_test = y_test.to(device)

from torch.utils.data import TensorDataset

training_data = TensorDataset(X,y)
test_data = TensorDataset(X_test,y_test)

batch_size = 256
train_loader = torch.utils.data.DataLoader(dataset=training_data,
                                           batch_size=batch_size, 
                                           shuffle=True)

batch_size = 256
test_loader = torch.utils.data.DataLoader(dataset=test_data,
                                           batch_size=batch_size, 
                                           shuffle=False)

Now let's define our neural network.  It'll be a simple affair with a single hidden layer, but plenty of nodes to ensure a flexible function.  Something like the following:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        """
        This method is where you'll want to instantiate parameters.
        we do this by creating two linear transformation functions, l1 and l2, which 
        have encoded in it both the weight matrices W_1 and W_2, and the bias vectors
        """
        super(Net,self).__init__()
        self.l1 = nn.Linear(2,2048) # Transform from input to hidden layer
        self.l2 = nn.Linear(2048,2)
    
    def forward(self,x):
        """
        This method runs the feedforward neural network.  It takes a tensor of size m x 2,
        applies a linear transformation, applies a ReLU activation, applies the second linear transform 
        and outputs the logits.
        """
        a1 = self.l1(x)
        z1 = torch.relu(a1)
        
        a2 = self.l2(z1)
        return a2

Now we can optimize.  Let's pay attention to the training and test set accuracy as we optimize.

In [None]:
model = Net()
model.to(device)
criterion = torch.nn.CrossEntropyLoss(reduction='mean')

optimizer = torch.optim.Adam(model.parameters())

epochs = 3000
# Loop over the data

train_accs = []
test_accs = []

for epoch in range(epochs):
    model.train()
    # Loop over each subset of data
    for d,t in train_loader:

        # Zero out the optimizer's gradient buffer
        optimizer.zero_grad()
        
        # Make a prediction based on the model
        outputs = model(d)
        
        # Compute the loss
        loss = criterion(outputs,t)      

        # Use backpropagation to compute the derivative of the loss with respect to the parameters
        loss.backward()
        
        # Use the derivative information to update the parameters
        optimizer.step()
        
    model.eval()
    # After each epoch, compute the test set accuracy
    total=0.
    correct=0.
    # Loop over all the test examples and accumulate the number of correct results in each batch
    for d,t in test_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total += float(t.size(0))
        correct += float((predicted==t).sum())
    total_train = 0
    correct_train = 0
    for d,t in train_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total_train += float(t.size(0))
        correct_train += float((predicted==t).sum())
        
    # Print the epoch, the training loss, and the test set accuracy.
    train_accs.append(100.*correct_train/total_train)
    test_accs.append(100.*correct/total)
    if epoch%10==0:
        print(epoch,loss.item(),train_accs[-1],test_accs[-1])
plt.plot(train_accs,label='Training accuracy')
plt.plot(test_accs,label='Test accuracy')
plt.legend()  # without this call the labels above are never displayed
plt.show()


**The above case exhibits the classic symptoms of overfitting.  How do you know?  Based on the plot above, can you identify an alternative method of regularization?**

This dataset only has two features, so we can easily visualize the network's predictions on a grid. Let's see what the decision boundary is.

In [None]:
X0grid,X1grid = np.meshgrid(np.linspace(X[:,0].cpu().min(),X[:,0].cpu().max(),101),np.linspace(X[:,1].cpu().min(),X[:,1].cpu().max(),101))

X_grid = np.vstack((X0grid.ravel(),X1grid.ravel())).T
X_grid = torch.from_numpy(X_grid)
X_grid = X_grid.to(torch.float32)
X_grid = X_grid.to(device)  # use the same device as the model (works with or without a GPU)

t = model(X_grid)
out = F.softmax(t,dim=1)
plt.contourf(X0grid,X1grid,out.cpu().detach().numpy()[:,1].reshape((101,101)),alpha=0.5)
plt.scatter(X0[:,0],X0[:,1],c=y0,label='Training Data')
plt.scatter(X0_test[:,0],X0_test[:,1],c=y0_test,marker='x',label='Test Data')
plt.legend()
plt.show()

**What's the problem here?**

We can allay this issue using $L_2$ regularization.  How do we implement this?  As it turns out, it's not so bad.

In [None]:
model = Net()
model.to(device)
criterion = torch.nn.CrossEntropyLoss(reduction='mean')

optimizer = torch.optim.Adam(model.parameters())
gamma = 0.1  ### HERE'S THE REGULARIZATION PARAMETER!

train_accs = []
test_accs = []

epochs = 3000
# Loop over the data
for epoch in range(epochs):
    model.train()
    # Loop over each subset of data
    for d,t in train_loader:

        # Zero out the optimizer's gradient buffer
        optimizer.zero_grad()
        
        # Make a prediction based on the model
        outputs = model(d)
        
        # Compute the loss
        loss = criterion(outputs,t)
        
        ### HERE'S WHERE WE ADD REGULARIZATION!
        for W in model.parameters():
            # Loop over all the model parameters
            if W.dim()>1:
                # Loop over all the weight matrices (but not the biases)
                loss += gamma/(2*d.shape[0])*(W**2).sum()
        

        # Use backpropagation to compute the derivative of the loss with respect to the parameters
        # NOTE THAT THIS NOW INCLUDES A DIRECT DEPENDENCY ON THE PARAMETERS DUE TO THE REGULARIZATION TERM
        loss.backward()
        
        # Use the derivative information to update the parameters
        optimizer.step()
        
    model.eval()
    # After each epoch, compute the test set accuracy
    total=0.
    correct=0.
    # Loop over all the test examples and accumulate the number of correct results in each batch
    for d,t in test_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total += float(t.size(0))
        correct += float((predicted==t).sum())
    total_train = 0
    correct_train = 0
    for d,t in train_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total_train += float(t.size(0))
        correct_train += float((predicted==t).sum())
        
    # Print the epoch, the training loss, and the test set accuracy.
    train_accs.append(100.*correct_train/total_train)
    test_accs.append(100.*correct/total)
    if epoch%100==0:
        print(epoch,loss.item(),train_accs[-1],test_accs[-1])
        
plt.plot(train_accs,label='Training accuracy')
plt.plot(test_accs,label='Test accuracy')
plt.legend()  # without this call the labels above are never displayed
plt.show()

**Does adding this regularization improve the overfitting issues?  How do you know?**

**Critically, what happens if we increase the regularization parameter too far?**
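As an aside, PyTorch optimizers also accept a `weight_decay` argument. The sketch below is a minimal check, under stated assumptions, that for plain `torch.optim.SGD` a single step with `weight_decay=gamma` matches a step with the penalty $\frac{\gamma}{2}\|W\|_2^2$ written into the loss by hand. The tiny layer, the data, and the helper `make_layer` are all hypothetical; the training loop above uses Adam and divides the penalty by the batch size, so this is not a drop-in replacement for it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_layer():
    # Small linear layer with fixed initial weights so both runs start identically
    layer = nn.Linear(3, 2, bias=False)
    with torch.no_grad():
        layer.weight.copy_(torch.arange(6, dtype=torch.float32).reshape(2, 3) / 10.0)
    return layer

x = torch.randn(5, 3)
target = torch.randn(5, 2)
gamma, lr = 0.1, 0.5

# Run 1: penalty written into the loss by hand
explicit = make_layer()
opt = torch.optim.SGD(explicit.parameters(), lr=lr)
opt.zero_grad()
loss = ((explicit(x) - target) ** 2).mean() + gamma / 2 * (explicit.weight ** 2).sum()
loss.backward()
opt.step()

# Run 2: same data loss, penalty delegated to the optimizer
builtin = make_layer()
opt = torch.optim.SGD(builtin.parameters(), lr=lr, weight_decay=gamma)
opt.zero_grad()
loss = ((builtin(x) - target) ** 2).mean()
loss.backward()
opt.step()

max_diff = (explicit.weight - builtin.weight).abs().max().item()
print(max_diff)
```

Note that `weight_decay` applies to every parameter handed to the optimizer, biases included, unless separate parameter groups are used; the hand-written loop gives finer control through the `W.dim()>1` test.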

Let's plot the resulting decision surface below.

In [None]:
X0grid,X1grid = np.meshgrid(np.linspace(X[:,0].cpu().min(),X[:,0].cpu().max(),101),np.linspace(X[:,1].cpu().min(),X[:,1].cpu().max(),101))

X_grid = np.vstack((X0grid.ravel(),X1grid.ravel())).T
X_grid = torch.from_numpy(X_grid)
X_grid = X_grid.to(torch.float32)
X_grid = X_grid.to(device)  # use the same device as the model (works with or without a GPU)

t = model(X_grid)
out = F.softmax(t,dim=1)
plt.contourf(X0grid,X1grid,out.cpu().detach().numpy()[:,1].reshape((101,101)),alpha=0.5)
plt.scatter(X0[:,0],X0[:,1],c=y0)
plt.scatter(X0_test[:,0],X0_test[:,1],c=y0_test,marker='x')

## 2c $L_2$ Regularization for LFW
As we've seen before, image data gives us an opportunity to look at the weights.  Because $L_2$ regularization is manipulating weights directly, this gives us an opportunity to get a sense of *what qualitative effect the regularization is having on the weights*.  Let's apply this to Labeled Faces in the Wild, which we're now used to seeing. 

As usual, the data manipulation ritual:

In [None]:
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split

# Fetch LFW dataset, ensuring that we have at least 25 images per class (i.e. per person)
lfw_people = fetch_lfw_people(min_faces_per_person=25, resize=0.4)

# Extract number of data points, and the height and width of the images for later reshaping
m, h, w = lfw_people.images.shape
n = h*w

# Extract number of classes
N = len(lfw_people.target_names)

# Split the training and test set
X,X_test,y,y_test = train_test_split(lfw_people.data,lfw_people.target)
X/=255.
X_test/=255.

In [None]:
X = torch.from_numpy(X)
X_test = torch.from_numpy(X_test)
y = torch.from_numpy(y)
y_test = torch.from_numpy(y_test)

X = X.to(torch.float32)
X_test = X_test.to(torch.float32)
y = y.to(torch.long)
y_test = y_test.to(torch.long)

device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")

X = X.to(device)
X_test = X_test.to(device)
y = y.to(device)
y_test = y_test.to(device)

In [None]:
from torch.utils.data import TensorDataset

training_data = TensorDataset(X,y)
test_data = TensorDataset(X_test,y_test)

batch_size = 256
train_loader = torch.utils.data.DataLoader(dataset=training_data,
                                           batch_size=batch_size, 
                                           shuffle=True)

batch_size = 256
test_loader = torch.utils.data.DataLoader(dataset=test_data,
                                           batch_size=batch_size, 
                                           shuffle=False)

Now, let's construct a simple neural network, not dissimilar to the ones that you have created previously.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        """
        This method is where you'll want to instantiate parameters.
        we do this by creating two linear transformation functions, l1 and l2, which 
        have encoded in it both the weight matrices W_1 and W_2, and the bias vectors
        """
        super(Net,self).__init__()
        self.l1 = nn.Linear(n,128) # Transform from input to hidden layer
        self.l2 = nn.Linear(128,N)
        #self.l3 = nn.Linear(256,10)
        
        #self.dropout_1 = nn.Dropout(p=0.3)
        #self.dropout_2 = nn.Dropout(p=0.3)


    
    def forward(self,x):
        """
        This method runs the feedforward neural network.  It takes a tensor of size m x n (n = h*w pixels),
        applies a linear transformation, applies a sigmoidal activation, applies the second linear transform 
        and outputs the logits.
        """
        a1 = self.l1(x)
        z1 = torch.sigmoid(a1)
        #z1d = self.dropout_1(z1)
        
        a2 = self.l2(z1)
        #z2 = torch.relu(a2)
        #z2d = self.dropout_2(z2)
       # 
        #a3 = self.l3(z2d)

        return a2

First, let's try training the model without L2 regularization.  

In [None]:
model = Net()
model.to(device)
criterion = torch.nn.CrossEntropyLoss(reduction='mean')

optimizer = torch.optim.Adam(model.parameters(),lr=1e-3)
gamma=0
#gamma = 2e5

epochs = 500
# Loop over the data
for epoch in range(epochs):
    model.train()
    # Loop over each subset of data
    for d,t in train_loader:

        # Zero out the optimizer's gradient buffer
        optimizer.zero_grad()
        
        # Make a prediction based on the model
        outputs = model(d)
        
        # Compute the loss
        loss = criterion(outputs,t)
        for p in model.parameters():
            if p.dim()>1:
                loss += gamma/(2*d.shape[0])*(p**2).mean()
        

        # Use backpropagation to compute the derivative of the loss with respect to the parameters
        loss.backward()
        
        # Use the derivative information to update the parameters
        optimizer.step()
        
    model.eval()
    # After each epoch, compute the test set accuracy
    total=0.
    correct=0.
    # Loop over all the test examples and accumulate the number of correct results in each batch
    for d,t in test_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total += float(t.size(0))
        correct += float((predicted==t).sum())
    total_train = 0
    correct_train = 0
    for d,t in train_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total_train += float(t.size(0))
        correct_train += float((predicted==t).sum())
        
    # Print the epoch, the training loss, and the test set accuracy.
    print(epoch,loss.item(),100.*correct_train/total_train,100.*correct/total)
params = [p for p in model.l1.parameters()]
W1_noreg = params[0].cpu().detach().numpy().T

Then, let's try it with regularization 

In [None]:
model = Net()
model.to(device)
criterion = torch.nn.CrossEntropyLoss(reduction='mean')

optimizer = torch.optim.Adam(model.parameters(),lr=1e-3)
gamma = 1e4

epochs = 500
# Loop over the data
for epoch in range(epochs):
    model.train()
    # Loop over each subset of data
    for d,t in train_loader:

        # Zero out the optimizer's gradient buffer
        optimizer.zero_grad()
        
        # Make a prediction based on the model
        outputs = model(d)
        
        # Compute the loss
        loss = criterion(outputs,t)
        for p in model.parameters():
            if p.dim()>1:
                # L2 penalty on the weight matrices (torch.abs(p) here would give an L1 penalty instead)
                loss += gamma/(2*d.shape[0])*(p**2).mean()
        

        # Use backpropagation to compute the derivative of the loss with respect to the parameters
        loss.backward()
        
        # Use the derivative information to update the parameters
        optimizer.step()
        
    model.eval()
    # After each epoch, compute the test set accuracy
    total=0.
    correct=0.
    # Loop over all the test examples and accumulate the number of correct results in each batch
    for d,t in test_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total += float(t.size(0))
        correct += float((predicted==t).sum())
    total_train = 0
    correct_train = 0
    for d,t in train_loader:
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total_train += float(t.size(0))
        correct_train += float((predicted==t).sum())
        
    # Print the epoch, the training loss, and the test set accuracy.
    print(epoch,loss.item(),100.*correct_train/total_train,100.*correct/total)
params = [p for p in model.l1.parameters()]
W1_L2 = params[0].cpu().detach().numpy().T

Note that the regularized model has both reduced training set accuracy *and* reduced test set accuracy.  This implies that our regularization is not very useful: that's because this dataset is, in fact, not very "noisy" (in the sense of multiple overlapping data points of different classes; image data is usually this way because it's so high dimensional).  However, if we plot some of the features being extracted in the form of extracted weight matrices, we can learn something about the effect of $L_2$ regularization.

In [None]:
fig,axs = plt.subplots(nrows=2,ncols=6)
fig.set_size_inches(10,4)
for i in range(6):
    axs[0,i].imshow(W1_noreg[:,np.random.randint(W1_noreg.shape[1])].reshape((h,w)))
    axs[1,i].imshow(W1_L2[:,np.random.randint(W1_L2.shape[1])].reshape((h,w)))

Because $L_2$ regularization penalizes large weights more than small ones, it has the tendency to ensure that all the weights are around the same size, which also means that it tends towards extracting larger, smoother features.
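Alongside the images, a few scalar summaries can make the comparison between weight matrices more concrete. The helper below is a hypothetical sketch (the name `summarize_weights` and the tolerance `tol` are not part of the notebook), and the two random arrays are stand-ins only: swap in `W1_noreg` and `W1_L2` from the cells above to look at the real weights.

```python
import numpy as np

def summarize_weights(W, tol=1e-3):
    """Return a few scalar summaries of a weight matrix."""
    a = np.abs(W)
    return {
        'frobenius_norm': float(np.sqrt(np.sum(W**2))),
        'max_abs': float(a.max()),
        'median_abs': float(np.median(a)),
        'fraction_below_tol': float(np.mean(a < tol)),
    }

# Stand-in arrays only: replace these with W1_noreg and W1_L2 from the cells above
rng = np.random.default_rng(0)
W_a = rng.normal(scale=1.0, size=(50, 8))
W_b = rng.normal(scale=0.1, size=(50, 8))

for name, W in [('W_a', W_a), ('W_b', W_b)]:
    print(name, summarize_weights(W))
```

Comparing `max_abs` with `median_abs` gives a rough feel for how evenly the weight magnitudes are spread, while `fraction_below_tol` reports how many entries are effectively zero.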

## IC8B $L_1$ regularization
We chose to regularize by penalizing the sum of squared weights (if you don't believe this, go back to the formula).  However, there are many other ways that we could exert pressure on the weights to behave one way or another.  One very interesting possibility is so-called $L_1$ regularization, which penalizes the $L_1$ norm of the weights.  What, you ask, is the $L_1$ norm?  It is the *sum of absolute values*:
$$
\|W^{(l)}\|_1 = \sum_{i} \sum_{j} |w_{ij}^{(l)}|.
$$
Superficially, it would seem that this would do the same thing as $L_2$ regularization (make the weights smaller), and it is true that it does have this property.  However, it has a very different effect on the *distribution* of weights.  **Implement $L_1$ regularization, using the code above as a starting point.  Discuss the qualitative effect that this form of regularization has on the resulting weight matrices.**  Hint 1: You'll want to reduce the value of $\gamma$ to $10^4$ for this.  Hint 2: You might have to do a little bit of searching to find weight matrices that aren't all close to zero.  Hint 3: Take the derivative of the $L_1$ norm.  At each iteration of gradient descent, how much are large weights reduced versus small weights?

## 3 More exotic regularizers
TODO

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset,TensorDataset
from torchvision import transforms

# In order to run this in class, we're going to reduce the dataset by a factor of 5
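# as_frame=False returns NumPy arrays rather than a pandas DataFrame, which torch.from_numpy requires
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, cache=True, as_frame=False)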
X/=255.
y = y.astype(int)
X,X_test,y,y_test = train_test_split(X,y)

X = torch.from_numpy(X)
X_test = torch.from_numpy(X_test)
y = torch.from_numpy(y)
y_test = torch.from_numpy(y_test)

X = X.to(torch.float32)
X_test = X_test.to(torch.float32)

device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")

In [None]:
transform = transforms.RandomRotation(30)#,transforms.RandomResizedCrop(224),transforms.RandomHorizontalFlip()]

class CustomTensorDataset(Dataset):
    """TensorDataset with support of transforms.
    """
    def __init__(self, tensors, transform=None):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors
        self.transform = transform

    def __getitem__(self, index):
        x = self.tensors[0][index]

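        if self.transform:
            # Reshape the flat 784-vector into a 28 x 28 image, apply the transform
            # on a PIL image, then convert back to a flat float tensor
            x = x.reshape((28,28))
            x = transforms.ToPILImage()(x)
            x = self.transform(x)
            x = transforms.ToTensor()(x).reshape(-1)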

        y = self.tensors[1][index]

        return x, y

    def __len__(self):
        return self.tensors[0].size(0)

training_data = CustomTensorDataset([X,y],transform=transform)
test_data = CustomTensorDataset([X_test,y_test])

batch_size = 256
train_loader = torch.utils.data.DataLoader(dataset=training_data,
                                           batch_size=batch_size, 
                                           shuffle=True)

batch_size = 256
test_loader = torch.utils.data.DataLoader(dataset=test_data,
                                           batch_size=batch_size, 
                                           shuffle=False)

In [None]:
for t in train_loader:
    break

In [None]:
"""
import torchvision
import torch

batch_size_train = 256
batch_size_test = 256

train_transforms = torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,)),
                               lambda x: x.flatten()])

train_dataset = torchvision.datasets.MNIST('./data/', train=True, download=True,transform=train_transforms)

train_loader = torch.utils.data.DataLoader(train_dataset,batch_size=batch_size_train,pin_memory=True,num_workers=0)

test_transforms = torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,)),
                               lambda x: x.flatten()])
test_dataset = torchvision.datasets.MNIST('./data/', train=False, download=True,transform=test_transforms)

test_loader = torch.utils.data.DataLoader(test_dataset,batch_size=batch_size_test,pin_memory=True,num_workers=0)
"""

In [None]:
import torch.nn as nn
import torch.nn.functional as F

n = 784
N = 10

class Net(nn.Module):
    def __init__(self):
        """
        This method is where you'll want to instantiate parameters.
        we do this by creating two linear transformation functions, l1 and l2, which 
        have encoded in it both the weight matrices W_1 and W_2, and the bias vectors
        """
        super(Net,self).__init__()
        self.l1 = nn.Linear(n,128) # Transform from input to hidden layer
        self.l2 = nn.Linear(128,N)
        #self.l3 = nn.Linear(256,10)
        
        #self.dropout_1 = nn.Dropout(p=0.3)
        #self.dropout_2 = nn.Dropout(p=0.3)


    
    def forward(self,x):
        """
        This method runs the feedforward neural network.  It takes a tensor of size m x 784,
        applies a linear transformation, applies a sigmoidal activation, applies the second linear transform 
        and outputs the logits.
        """
        a1 = self.l1(x)
        z1 = torch.sigmoid(a1)
        #z1d = self.dropout_1(z1)
        
        a2 = self.l2(z1)
        #z2 = torch.relu(a2)
        #z2d = self.dropout_2(z2)
       # 
        #a3 = self.l3(z2d)

        return a2

In [None]:
#device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")

model = Net()
model.to(device)
criterion = torch.nn.CrossEntropyLoss(reduction='mean')

optimizer = torch.optim.Adam(model.parameters(),lr=1e-3)

epochs = 500

# Loop over the data
for epoch in range(epochs):
    model.train()
    # Reset the training-accuracy counters so each epoch is reported on its own
    total_train = 0
    correct_train = 0
    # Loop over each subset of data
    for d,t in train_loader:
        #print(d.shape)
        d,t = d.to(device),t.to(device)

        # Zero out the optimizer's gradient buffer
        optimizer.zero_grad()
        
        # Make a prediction based on the model
        outputs = model(d)
        
        # Compute the loss
        loss = criterion(outputs,t)      

        # Use backpropagation to compute the derivative of the loss with respect to the parameters
        loss.backward()
        
        # Use the derivative information to update the parameters
        optimizer.step()
        
        _, predicted = torch.max(outputs.data,1)
        total_train += float(t.size(0))
        correct_train += float((predicted==t).sum())
        
    model.eval()
    # After each epoch, compute the test set accuracy
    total=0.
    correct=0.
    # Loop over all the test examples and accumulate the number of correct results in each batch
    for d,t in test_loader:
        d,t = d.to(device),t.to(device)
        outputs = model(d)
        _, predicted = torch.max(outputs.data,1)
        total += float(t.size(0))
        correct += float((predicted==t).sum())
        
    # Print the epoch, the training loss, and the test set accuracy.
    print(epoch,loss.item(),100.*correct_train/total_train,100.*correct/total)
