# Problem 1: Other activation functions (20%)
### The leaky ReLU is defined as $\max(0.1x, x)$. 
 - What is its derivative? (Please express in "easy" format)
 - Is it suitable for back propagation?
 
### $\tanh$ is defined as  $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. 
 - What is its derivative? (Please express in "easy" format)
 - Is it suitable for back propagation?
 - How is it different from the sigmoid activation?

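a) The derivative is equal to 0.1 for x < 0 and 1 for x > 0 (it is undefined at x = 0, where in practice either value is used).
This is a suitable function for back propagation because the derivative is simple to calculate and finite, and the function is non-linear and monotonic. The derivative is also never zero, so negative inputs still pass a gradient.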

b) The derivative is $1-\frac{(e^{x}-e^{-x})^{2}}{(e^{x}+e^{-x})^{2}} = 1 - \tanh^{2}(x)$.
This is also a suitable function for back propagation for the same reasons as the one above. A big difference between sigmoid and tanh is the output range: sigmoid is bounded in (0,1), whereas tanh is bounded in (-1,1) and is therefore centered on zero.
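
The two derivatives above can be checked numerically. The following is a minimal sketch using central finite differences; the helper names (`leaky_relu_grad`, `tanh_grad`, `central_difference`) are made up for illustration and are not part of the assignment. It also checks the identity $\tanh(x) = 2\sigma(2x) - 1$, which shows that tanh is a rescaled and shifted sigmoid.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)

def leaky_relu_grad(x, slope=0.1):
    # 1 for x > 0, slope for x < 0 (x = 0 is a kink; slope is chosen there)
    return np.where(x > 0, 1.0, slope)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def central_difference(f, x, h=1e-6):
    # Numerical derivative used only as a reference value
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.array([-2.0, -0.5, 0.5, 2.0])  # stay away from the kink at x = 0

leaky_ok = np.allclose(central_difference(leaky_relu, x), leaky_relu_grad(x), atol=1e-6)
tanh_ok = np.allclose(central_difference(np.tanh, x), tanh_grad(x), atol=1e-6)
rescale_ok = np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

print(leaky_ok, tanh_ok, rescale_ok)
```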

# Problem 2: Linear regression in Keras (40%)

#### We'd like to use keras to perform linear regression and compare it to another tool (scikit-learn)
#### We'll compare OLS, ridge ($L2$ regularization) and LASSO ($L1$ regularization) using both keras and scikit-learn


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# many of these imports to be removed
from keras.models import Model, Input
from keras.layers import Dense, Softmax, Dropout
from keras.regularizers import l1_l2
from keras.optimizers import RMSprop
import keras.backend as K

Populating the interactive namespace from numpy and matplotlib


Using TensorFlow backend.


In [3]:
# Generate some data
np.random.seed(1024)
num_observations = 1024
coefs = np.array([-1.2, 5, 0, .22, 2, 0, 4])  # notice, there are zeros!
noise_amplitude = .05

num_variables = coefs.shape[0]

x = np.random.rand(num_observations, num_variables)
y = np.dot(x, coefs) + noise_amplitude * np.random.rand(num_observations)

cutoff = int(.8 * num_observations)
x_train, x_test = x[:cutoff], x[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

In [4]:
x_train.shape, y_train.shape

((819, 7), (819,))

In [8]:

# insert code to make predictions here
# ...
# lin_reg_predictions = ...
reg = LinearRegression().fit(x_train, y_train)
lin_reg_predictions = reg.predict(x_test)
mean_squared_error(y_test, lin_reg_predictions)

0.00020867822075987672
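
As a sanity check on this number: the noise in the data-generation cell comes from `np.random.rand`, which is uniform on [0, 1), so after scaling by `noise_amplitude` its variance is `noise_amplitude**2 / 12`. Its non-zero mean can be absorbed by the intercept, so the variance is roughly the lowest test MSE a linear model can be expected to reach. A small sketch of that arithmetic:

```python
noise_amplitude = 0.05

# Variance of a * U(0, 1) is a**2 / 12
noise_variance = noise_amplitude ** 2 / 12

print(noise_variance)  # about 2.08e-4
```

The measured test MSE above is within about one percent of this value, which suggests the fit is limited by the noise rather than by the model.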

In [9]:
# Show that the coefficients are all close to the "real" ones used to generate the data
lin_reg_coefs = reg.coef_
pd.Series(lin_reg_coefs, name='fit coefficients').to_frame().join(pd.Series(coefs, name='real coefficients')) 

| | fit coefficients | real coefficients |
|---|---|---|
| 0 | -1.200971 | -1.2 |
| 1 | 4.999581 | 5.0 |
| 2 | -0.00182 | 0.0 |
| 3 | 0.217426 | 0.22 |
| 4 | 1.999645 | 2.0 |
| 5 | -0.000385 | 0.0 |
| 6 | 4.000916 | 4.0 |


In [10]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

def plot_model_in_notebook(model):
    return SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))


In [76]:
# Now we will use keras to solve the same problem
K.clear_session()

# One input per feature; a single linear unit computes y = w . x + b
inputs = Input(shape=(num_variables,))
preds = Dense(1, activation='linear')(inputs)

keras_lin_reg = Model(inputs=inputs, outputs=preds)
keras_lin_reg.compile(optimizer=RMSprop(), loss='mse', metrics=['mse'])

## How many parameters does the model have? 
### Explicitly show the calculation, explain it, and verify that it agrees with `model.count_params()`

In [29]:
# A Dense layer has (inputs * units) weights plus one bias per unit:
# 7 coefficients * 1 unit + 1 intercept = 8
#
keras_lin_reg.count_params()

8
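
The same calculation can be written as a small helper. This is only an illustrative sketch; `dense_param_count` is a hypothetical name, not a Keras function.

```python
def dense_param_count(num_inputs, num_units, use_bias=True):
    # One weight per (input, unit) pair, plus one bias per unit
    return num_inputs * num_units + (num_units if use_bias else 0)

num_variables = 7
expected_params = dense_param_count(num_variables, 1)
print(expected_params)  # 8
```

Comparing `expected_params` with `keras_lin_reg.count_params()` gives the requested verification.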

In [35]:
keras_lin_reg.fit(x_train,y_train,batch_size=1, epochs=21) 
y_pred = keras_lin_reg.predict(x_test)
mean_squared_error(y_test,y_pred)

Epoch 1/21
Epoch 2/21
Epoch 3/21
Epoch 4/21
Epoch 5/21
Epoch 6/21
Epoch 7/21
Epoch 8/21
Epoch 9/21
Epoch 10/21
Epoch 11/21
Epoch 12/21
Epoch 13/21
Epoch 14/21
Epoch 15/21
Epoch 16/21
Epoch 17/21
Epoch 18/21
Epoch 19/21
Epoch 20/21
Epoch 21/21


0.00023413482935084624

In [45]:
# find the coefficients
keras_ols_coefs = keras_lin_reg.layers[1].get_weights()[0].flatten()
keras_ols_coefs

pd.Series(keras_ols_coefs, name='keras ols coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | keras ols coefficients | real coefficients |
|---|---|---|
| 0 | -1.197769 | -1.2 |
| 1 | 4.998701 | 5.0 |
| 2 | -0.006458 | 0.0 |
| 3 | 0.215792 | 0.22 |
| 4 | 1.99834 | 2.0 |
| 5 | -0.001106 | 0.0 |
| 6 | 4.003156 | 4.0 |


## Now we will add some regularization

In [81]:
K.clear_session()
from keras.regularizers import l1_l2
regularizer = l1_l2(l1=0, l2=.1)
 # Dense(...) -> Dense(..., kernel_regularizer=regularizer)

inputs = Input(shape=(7,))    
output = Dense(1, activation='linear', kernel_regularizer=regularizer)(inputs)
keras_ridge_model = Model(inputs, output)
# accuracy is not meaningful for a regression target, so track mse instead
keras_ridge_model.compile(optimizer=RMSprop(lr=2e-3, decay=1e-5), loss='mse', metrics=['mse'])



In [83]:
keras_ridge_model.fit(x_train, y_train, epochs=100, verbose=0)
mean_squared_error(y_test, keras_ridge_model.predict(x_test))

1.1263185488351135

In [58]:

keras_ridge_coefs = keras_ridge_model.layers[1].get_weights()[0].flatten()
pd.Series(keras_ridge_coefs, name='keras ridge coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | keras ridge coefficients | real coefficients |
|---|---|---|
| 0 | -0.532063 | -1.2 |
| 1 | 2.188637 | 5.0 |
| 2 | 0.049841 | 0.0 |
| 3 | 0.184881 | 0.22 |
| 4 | 0.929821 | 2.0 |
| 5 | 0.088853 | 0.0 |
| 6 | 1.8415 | 4.0 |


In [61]:
# ridge regression in sklearn
from sklearn.linear_model import Ridge

# Add code here
clf = Ridge(alpha=.1)
clf.fit(x_train,y_train)
sklearn_ridge_coef = clf.coef_
pd.Series(sklearn_ridge_coef, name='ridge coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | ridge coefficients | real coefficients |
|---|---|---|
| 0 | -1.199363 | -1.2 |
| 1 | 4.9919 | 5.0 |
| 2 | -0.001726 | 0.0 |
| 3 | 0.217458 | 0.22 |
| 4 | 1.996803 | 2.0 |
| 5 | -3.8e-05 | 0.0 |
| 6 | 3.995302 | 4.0 |


In [None]:
# compare coefficients from various methods
pd.concat([
    pd.Series(sklearn_ridge_coef, name='ridge coefs'),
    pd.Series(keras_ridge_coefs, name='keras L2 coefs'),
    pd.Series(coefs, name='real coefs')
], axis=1)
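
The Keras ridge coefficients are shrunk far more than the scikit-learn ones even though both use the value 0.1. One likely reason is that the two penalties are scaled differently: scikit-learn's `Ridge` minimizes the *sum* of squared errors plus `alpha * ||w||^2`, while a Keras `mse` loss with an `l2` kernel regularizer minimizes the *mean* squared error plus `l2 * sum(w^2)`. Dividing the first objective by the number of samples suggests `l2 = alpha / n_samples` as the matching Keras value. The sketch below illustrates this with closed-form solutions in NumPy; it uses its own synthetic data and omits the intercept for simplicity, and the function names are hypothetical. It does not account for optimizer convergence in Keras.

```python
import numpy as np

rng = np.random.RandomState(0)
n, p = 800, 7
X = rng.rand(n, p)
true_w = np.array([-1.2, 5, 0, .22, 2, 0, 4])
y = X @ true_w + 0.05 * rng.rand(n)

def ridge_sum_loss(X, y, alpha):
    # Minimizes ||y - Xw||^2 + alpha * ||w||^2 (sum of squared errors)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def ridge_mean_loss(X, y, lam):
    # Minimizes mean((y - Xw)^2) + lam * ||w||^2 (mean squared error)
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

w_sum_style = ridge_sum_loss(X, y, alpha=0.1)       # sum-of-squares scaling
w_mean_style = ridge_mean_loss(X, y, lam=0.1)       # same number, mean scaling
w_matched = ridge_mean_loss(X, y, lam=0.1 / n)      # rescaled penalty

print(np.round(w_sum_style, 3))
print(np.round(w_mean_style, 3))
print(np.allclose(w_sum_style, w_matched))
```

Under these assumptions, the mean-scaled penalty with the same nominal value acts like a sum-scaled penalty that is `n` times larger, which is consistent with the heavy shrinkage seen in the Keras column.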

## In fact, given the zero coefficients, LASSO might have been a better model. 
## LASSO uses $L_{1}$ regularization, which produces sparse coefficients (some are exactly zero).

In [17]:
from sklearn.linear_model import Lasso
# Add code here
# sklearn_lasso_coefs = 
pd.Series(sklearn_lasso_coefs, name='lasso coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | lasso coefficients | real coefficients |
|---|---|---|
| 0 | -0.048288 | -1.2 |
| 1 | 3.815515 | 5.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.22 |
| 4 | 0.746026 | 2.0 |
| 5 | -0.0 | 0.0 |
| 6 | 2.731789 | 4.0 |
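
To see why an $L_{1}$ penalty yields exact zeros, here is a minimal coordinate-descent sketch in NumPy. It is not the scikit-learn implementation; it minimizes the same form of objective that scikit-learn documents for `Lasso`, $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha\lVert w\rVert_1$, but without an intercept, on its own synthetic data, and with an arbitrarily chosen `alpha`. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(rho, alpha):
    # Shrinks rho toward zero and returns exactly 0 when |rho| <= alpha
    return np.sign(rho) * np.maximum(np.abs(rho) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, num_iters=200):
    # Minimizes (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1, no intercept
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(num_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.RandomState(0)
X = rng.randn(500, 7)
true_w = np.array([-1.2, 5, 0, .22, 2, 0, 4])
y = X @ true_w + 0.05 * rng.randn(500)

w_lasso = lasso_coordinate_descent(X, y, alpha=0.1)
print(np.round(w_lasso, 3))
```

The soft-threshold step is what sets small coefficients to exactly zero, whereas an $L_{2}$ penalty only scales them down.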


In [None]:
# now do lasso with keras

#keras_lasso_model = ...
# don't forget to compile the model
plot_model_in_notebook(keras_lasso_model)

In [19]:
# keras_lasso_model.fit(...
# keras_lasso_coefs = ...

In [None]:
# compare all the coefficients
pd.concat([
    pd.Series(sklearn_ridge_coef, name='ridge coefs'),
    pd.Series(keras_ridge_coefs, name='keras L2 coefs'),
    pd.Series(sklearn_lasso_coefs, name='lasso coefs'),
    pd.Series(keras_lasso_coefs, name='keras L1 coefs'),
    pd.Series(lin_reg_coefs, name='ols coefs'),
    pd.Series(coefs, name='real coefs'),
], axis=1)

In [21]:
# TODO(find optimal regularization parameter) ?

# Problem 3: Keras for harder mnist problems (40%)
#### The deep net from lecture has a hard time distinguishing between 9 and 4.
#### We will build an algorithm to do this binary classification task 

In [22]:
# safe to restart here

In [23]:
import numpy as np
import pandas as pd
%pylab inline

# many of these to be removed
from keras.datasets import mnist
from keras.models import Model, Input
from keras.layers import Dense, Softmax, Dropout
from keras.regularizers import l1_l2
from keras.optimizers import RMSprop
import keras.backend as K

Populating the interactive namespace from numpy and matplotlib


In [24]:
from keras.utils import to_categorical

def preprocess_training_data(data):
    data = data.reshape(data.shape[0], data.shape[1] * data.shape[2])
    data = data.astype('float32') / 255
    return data

def preprocess_targets(target, num_classes):
    return to_categorical(target, num_classes)


def subset_to_9_and_4(x, y):  # this is a new function
    mask = (y == 9) | (y == 4)
    new_x = x[mask]
    new_y = (y[mask] == 4).astype('int64')
    return new_x, new_y

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = preprocess_training_data(x_train)
x_test = preprocess_training_data(x_test)

num_classes = np.unique(y_train).shape[0]

y_train_ohe = preprocess_targets(y_train, num_classes)
y_test_ohe = preprocess_targets(y_test, num_classes)

train_frac = 0.8
cutoff = int(x_train.shape[0] * train_frac)
x_train, x_val = x_train[:cutoff], x_train[cutoff:]
y_train, y_val = y_train[:cutoff], y_train[cutoff:]
y_train_ohe, y_val_ohe = y_train_ohe[:cutoff], y_train_ohe[cutoff:]

x_train, y_train = subset_to_9_and_4(x_train, y_train)
x_val, y_val = subset_to_9_and_4(x_val, y_val)
x_test, y_test = subset_to_9_and_4(x_test, y_test)

print(x_train.shape)

(9457, 784)
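
To make the relabeling in `subset_to_9_and_4` concrete, here is the same function applied to a tiny made-up array (the toy data is purely illustrative): rows whose label is neither 4 nor 9 are dropped, and the remaining labels become 1 for a 4 and 0 for a 9.

```python
import numpy as np

def subset_to_9_and_4(x, y):
    mask = (y == 9) | (y == 4)
    new_x = x[mask]
    new_y = (y[mask] == 4).astype('int64')
    return new_x, new_y

toy_x = np.arange(10).reshape(5, 2)   # five fake "images" with two pixels each
toy_y = np.array([4, 7, 9, 4, 1])

new_x, new_y = subset_to_9_and_4(toy_x, toy_y)
print(new_x.tolist())  # [[0, 1], [4, 5], [6, 7]]
print(new_y.tolist())  # [1, 0, 1]
```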


In [None]:
# first try logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Add code here

# sklearn_lr_predictions = ...
accuracy_score(y_test, sklearn_lr_predictions)

In [26]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

def plot_model_in_notebook(model):
    return SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))


In [None]:
K.clear_session()
num_hidden_units = 256
# digit_input = ...
# define model
# model = ...

#NB: you probably want BINARY cross entropy i.e. 'binary_crossentropy' for the loss function
# model.compile(...

In [None]:
plot_model_in_notebook(model)

In [None]:
# how many params does the model have? 
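
The answer depends on the architecture, which is left open above. As a sketch, assume one hidden `Dense` layer with `num_hidden_units = 256` followed by a single sigmoid output unit (an assumption, not something the notebook specifies). The count then follows the same weights-plus-biases rule as in Problem 2; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(num_inputs, num_units):
    # One weight per (input, unit) pair, plus one bias per unit
    return num_inputs * num_units + num_units

num_pixels = 28 * 28          # flattened MNIST image
num_hidden_units = 256

hidden_params = dense_param_count(num_pixels, num_hidden_units)   # 784*256 + 256
output_params = dense_param_count(num_hidden_units, 1)            # 256*1 + 1
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 200960 257 201217
```

For a different architecture, the same helper can be applied layer by layer and compared with `model.count_params()`.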

In [None]:
# Add code here
# model.fit(...

# keras_predictions = ...

In [None]:
from sklearn.metrics import f1_score, accuracy_score
accuracy_score(y_test, keras_predictions)
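
Note that if the model ends in a single sigmoid unit (an assumption about the architecture), `model.predict` returns probabilities with shape `(n, 1)`, while an accuracy metric compares class labels. A minimal sketch of the conversion, using made-up probabilities and a 0.5 threshold:

```python
import numpy as np

# Hypothetical network output: one probability per example, shape (n, 1)
probabilities = np.array([[0.91], [0.12], [0.55], [0.40]])
y_true = np.array([1, 0, 0, 0])

# Flatten to shape (n,) and threshold at 0.5 to obtain 0/1 labels
predicted_labels = (probabilities.ravel() > 0.5).astype('int64')

accuracy = np.mean(predicted_labels == y_true)
print(predicted_labels.tolist(), accuracy)  # [1, 0, 1, 0] 0.75
```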

In [None]:
# DONE! Congrats!