# Deep Learning for NLP WS17/18
## Exercise Sheet 3 - Pytorch Introduction
This exercise sheet is due on 21.11.17 11:59 pm. There is a total of 12
points for this exercise sheet. Please send your solution in a
suitable format to [beroth@cis.uni-muenchen.de](mailto:beroth@cis.uni-muenchen.de). Please submit a
completed version of this file in Python 3. You may submit in teams of
2 or 3 students.

Please rename the file to pytorch_intro_last_names.ipynb

### Installation of required packages

To install PyTorch, see <http://pytorch.org/>. You need to select the specific __wheel__ for your system to make it work. Note that CUDA is required if you want to run PyTorch on a GPU. The program below doesn't require a lot of computation, so a CPU-only installation is enough.

The sklearn package can be installed with pip (pip3 for Python 3), or with the same procedure you used to install numpy for the last exercise sheet.

#### If you have any problems regarding the installation feel free to send an email to [simon.h.schaefer@googlemail.com](mailto:simon.h.schaefer@googlemail.com)

### Exercise 1
#### Linear regression on the Boston Housing Dataset (9 points)

Useful PyTorch tutorials and resources are mentioned in the lecture slides.


The Boston Housing Dataset is often used as a regression example. It has 14 attributes (in the sklearn version: 13 features plus the target) and 2 prototasks. See <https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html> for details. The dataset is provided with the sklearn module.

You will have to complete the code where marked with ***TODO***

First, we import PyTorch, the sklearn datasets module, numpy, math and shuffle.

In [None]:
import torch
import torch.nn as nn
import sklearn.datasets
from random import shuffle
import torch.optim as optim
import numpy as np
import math
from torch.autograd import Variable

We define a function that returns three Boolean arrays, which help to produce a shuffled train, dev, test split of the dataset.

In [None]:
def random_train_dev_test_split(num_total_items, train_ratio = 0.5, dev_ratio = 0.25):
    num_train_items = math.floor(num_total_items * train_ratio)
    num_dev_items = math.floor(num_total_items * dev_ratio)
    num_test_items = num_total_items - num_train_items - num_dev_items
    split = [0] * num_train_items + [1] * num_dev_items + [2] * num_test_items
    shuffle(split)
    split = np.asarray(split)
    return split == 0, split == 1, split == 2
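To see what the three returned arrays look like, here is a minimal sketch on a hand-written label array. The array `split` below is an illustrative stand-in for the shuffled list built inside the function, not output of the function itself.

```python
import numpy as np

# Hand-written stand-in for the shuffled labels (0 = train, 1 = dev, 2 = test).
split = np.asarray([0, 2, 0, 1, 0, 1, 2, 0])

# Comparing an array with a scalar yields one Boolean per item.
train_mask = split == 0
dev_mask = split == 1
test_mask = split == 2

print(train_mask)
print(train_mask.sum(), dev_mask.sum(), test_mask.sum())  # 4 2 2

# Every item belongs to exactly one of the three sets.
membership_count = train_mask.astype(int) + dev_mask.astype(int) + test_mask.astype(int)
partition_ok = bool(np.all(membership_count == 1))
print(partition_ok)
```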

Now we start the main program. First, we get the dataset from the sklearn module and assign features and labels to x and y.

In [None]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older version.
boston = sklearn.datasets.load_boston()
x = boston.data
y = boston.target
num_items, num_features = x.shape

Get a 50%-25%-25% split of the shuffled(!) data (you can use the provided method random_train_dev_test_split).
If you use the predefined method, note:
 * The returned arrays contain Boolean indicators of which elements are contained in the respective sets
 * You can use Boolean indexing and Numpy to access those elements
#### TODO

In [None]:
# train_spl, dev_spl, test_spl = random_train_dev_test_split(num_items)
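As a reminder of how Boolean indexing behaves in Numpy, here is a small sketch on toy data. The names `toy_x` and `toy_mask` are made up for illustration and are not part of the exercise code.

```python
import numpy as np

toy_x = np.arange(12).reshape(4, 3)               # 4 items, 3 features each
toy_mask = np.array([True, False, True, False])   # one Boolean per item (row)

# Indexing with a Boolean array keeps exactly the rows where the mask is True.
selected = toy_x[toy_mask]

print(selected)
print(selected.shape)  # (2, 3)
```

The mask must have one entry per row of the array it indexes.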

#### TODO: get a k x n dimensional numpy array, where k is the number of items in the training data and n the number of features.

In [None]:
train_x = 

#### TODO: get a k x 1 dimensional numpy array of the training targets. Hint: use np.expand_dims(...) to get an extra dimension (i.e. a matrix instead of a vector).


In [None]:
train_y = 
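The effect of np.expand_dims can be checked on a toy vector. This is only a sketch; `toy_targets` is an illustrative array, not the real target data.

```python
import numpy as np

toy_targets = np.array([24.0, 21.6, 34.7])        # a vector with 3 entries

# axis=1 appends a new dimension of size 1, turning the vector into a column matrix.
as_column = np.expand_dims(toy_targets, axis=1)

print(toy_targets.shape)  # (3,)
print(as_column.shape)    # (3, 1)
```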

#### TODO: Similarly, get the dev feature matrix with associated dev targets.

In [None]:
dev_x = 
dev_y = 

Shuffle method for training is provided:

In [None]:
def unison_shuffled(a, b):
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a[p],b[p]
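The point of unison_shuffled is that both arrays are reordered with the same permutation, so each feature row stays next to its own target. A minimal sketch of that idea on toy arrays (the arrays `a` and `b` are made up so that the pairing is easy to check):

```python
import numpy as np

# Toy data: row [k, k] is paired with label 10 * k.
a = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
b = np.array([0, 10, 20, 30])

p = np.random.permutation(len(a))   # one random ordering of the indices 0..3
a_shuf, b_shuf = a[p], b[p]         # apply the same ordering to both arrays

# The order is random, but every row is still next to its own label.
aligned = bool(np.all(a_shuf[:, 0] * 10 == b_shuf))
print(a_shuf[:, 0], b_shuf, aligned)
```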

We create the LinearRegression class, which inherits from the PyTorch nn.Module class. Information about __init__ and __forward__ can be found in the lecture slides.

In [None]:
class LinearRegression(nn.Module):
    def __init__(self, num_features):
        super(LinearRegression, self).__init__()
        self.final_layer = nn.Linear(num_features, 1)
    def forward(self, x):
        return self.final_layer(x)
linreg_model = LinearRegression(num_features)

optimizer = optim.SGD(linreg_model.parameters(), lr=0.00001)


#### TODO: replace the SGD optimizer with the Adam optimizer, using learning rate 0.001. Check the PyTorch documentation for the optim package for more information: <http://pytorch.org/docs/master/optim.html>

Definition of Mean Squared Error as loss criterion.

In [None]:
criterion = nn.MSELoss()

The __np_to_var__ method is a convenience method used later to convert numpy arrays to PyTorch Variables.

In [None]:
def np_to_var(np_array):
    return Variable(torch.from_numpy(np_array).float())

The training iterates over the data 100 times. The gradient is backpropagated and the parameters are updated after every training example, and the mean training loss is printed at the end of each epoch.

In [None]:

num_epochs = 100
for epoch in range(num_epochs):
    loss_accum = 0.0
    #train_x, train_y = unison_shuffled(train_x, train_y)
    for i in range(len(train_y)):
        x_i = np_to_var(train_x[i])
        y_i = np_to_var(train_y[i])
        optimizer.zero_grad()   # zero the gradient buffers
        output = linreg_model(x_i)   # calling the module runs forward()
        loss = criterion(output, y_i)
        loss_accum += loss.item()    # loss.data[0] fails on PyTorch >= 0.4
        loss.backward()
        optimizer.step()    # Does the update
    print("train loss:", loss_accum/len(train_y))
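To make explicit what one backward pass and one update do for a linear model with squared error, the following is a sketch in plain Numpy, assuming plain SGD (not Adam). The toy data, the learning rate and all names in it are illustrative assumptions; it is not the PyTorch implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.uniform(-1.0, 1.0, size=(20, 2))   # 20 items, 2 features
true_w = np.array([2.0, -3.0])                 # assumed "true" weights for the toy data
toy_y = toy_x @ true_w + 0.5                   # noise-free linear targets

w = np.zeros(2)   # weights of the linear model
b = 0.0           # bias
lr = 0.05         # learning rate (illustrative value)

epoch_losses = []
for epoch in range(50):
    loss_accum = 0.0
    for x_i, y_i in zip(toy_x, toy_y):
        pred = w @ x_i + b             # forward pass: Wx + b
        err = pred - y_i
        loss_accum += err ** 2         # squared error for this single example
        # Gradient of (pred - y)^2 with respect to w and b, followed by the SGD update.
        w -= lr * 2.0 * err * x_i
        b -= lr * 2.0 * err
    epoch_losses.append(loss_accum / len(toy_y))

print("first epoch loss:", epoch_losses[0])
print("last epoch loss:", epoch_losses[-1])
```

In the PyTorch loop, loss.backward() computes these gradients automatically and optimizer.step() applies the update rule of the chosen optimizer.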

#### TODO: What effect can you observe by shuffling the data in each epoch? (Hint: uncomment the call to unison_shuffled)

#### TODO: Extend the code so that it additionally prints the loss on the development data at the end of each epoch.

#### TODO: 
Now implement the regression with a hidden layer and an activation function after that hidden layer.

Recall that ordinary linear regression uses the function:

$$\hat y = \mathrm{Linear}(x) = Wx+b$$

Neural network regression uses two (distinct) linear transformations with a so-called ReLU (rectified linear unit) in between. Here we have:

$$\hat y = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(x))) = W_B \max(0, W_A x+b_A) + b_B$$

The output of the first linear transformation (+ReLU) is also called the hidden layer.
You can choose its size (hidden_size), for example 10.
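The formula and the resulting shapes can be traced with a small Numpy sketch. The weights here are random placeholders (in the real model they are learned), and this is deliberately not the PyTorch module asked for below.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, hidden_size = 13, 10   # 13 features as in sklearn's Boston data

# Random placeholder parameters for the two linear transformations.
W_A = rng.normal(size=(hidden_size, num_features))
b_A = rng.normal(size=hidden_size)
W_B = rng.normal(size=(1, hidden_size))
b_B = rng.normal(size=1)

x = rng.normal(size=num_features)    # one input item

# First linear transformation followed by ReLU: negative entries are set to 0.
hidden = np.maximum(0.0, W_A @ x + b_A)
# Second linear transformation maps the hidden layer to a single prediction.
y_hat = W_B @ hidden + b_B

print(hidden.shape)  # (10,)
print(y_hat.shape)   # (1,)
```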


You need to change the __init__ and the __forward__ method.
(The ReLU activation is pre-defined, check the slides/PyTorch documentation)

In [None]:
class NeuralNetworkRegression(nn.Module):
    def __init__(self, num_features,hidden_size):
        super(NeuralNetworkRegression, self).__init__()
        
        self.final_layer = nn.Linear(num_features, 1)
    def forward(self, x):
        return self.final_layer(x)
nnreg_model = NeuralNetworkRegression(num_features,10)


#### TODO: How does the Neural Network Regression compare to Linear Regression?

### Exercise 2
#### Derive the gradient of the negative log likelihood for logistic regression (3 points)

Negative log likelihood for logistic regression:

$$ NLL(\vec \theta) =  - \sum_{i=1}^m \left[ y^{(i)} \log \sigma(\vec{\theta}^T \vec{x}^{(i)}) + (1-y^{(i)}) \log (1 - \sigma(\vec{\theta}^T \vec{x}^{(i)})) \right]$$

Derive the expression for

$$ \nabla_\theta NLL(\vec \theta) $$

For each step give the exact rule you used.

Note that you can use LaTeX math syntax in notebooks.
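If you want to sanity-check your derived expression, one option is to compare it against a finite-difference approximation of the gradient. The sketch below only evaluates the NLL numerically; the helper names (`sigmoid`, `nll`, `numerical_gradient`) and the toy data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    # Negative log likelihood as defined above; rows of X are the x^(i).
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def numerical_gradient(f, theta, eps=1e-6):
    # Central differences: perturb one coordinate of theta at a time.
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 toy items, 3 features
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])    # toy binary labels
theta = rng.normal(size=3)

num_grad = numerical_gradient(lambda t: nll(t, X, y), theta)
print(num_grad)
```

Evaluating your own formula at the same `theta`, `X` and `y` should give values close to `num_grad`.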

#### TODO