Neural networks are a good tool for classification because they can automatically learn a non-linear change of basis on the data that is tuned for prediction.  They can also give us some really useful insights into how they work and why they're so good at generalizing.  Here, we'll build one with Keras, in particular one that solves a binary classification problem.

In [None]:
from __future__ import division,print_function

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize']= (16,9)

We'll be generating initial guesses, etc. using random numbers.  To keep the results reproducible, let's seed the numpy pseudorandom number generator:

In [None]:
np.random.seed(0)

Now let's make some data to classify.  We can choose any function, but if we want to challenge this thing, it should be something that would fail under ordinary logistic regression.  For example, let's generate some data that is Bernoulli distributed with $\theta(x)$ given by a bell curve centered at $x=0.5$ (a second bump is left commented out in the code below).  I'll generate data from this distribution using a variant of [rejection sampling](https://en.wikipedia.org/wiki/Rejection_sampling).

In [None]:
classes = [0,1]

m_train = 500
m_test = 250
X_train = np.random.rand(m_train)
X_test = np.random.rand(m_test)

X_train.sort()
X_test.sort()

y_pdf = np.exp(-((X_train-0.5)/0.2)**2)# + np.exp(-((X-0.75)/0.1)**2)
y_pdf /= y_pdf.max()
a = np.random.rand(m_train)
y_train = (a<=y_pdf).astype(int)

y_pdf = np.exp(-((X_test-0.5)/0.2)**2)# + np.exp(-((X-0.75)/0.1)**2)
y_pdf /= y_pdf.max()
a = np.random.rand(m_test)
y_test = (a<=y_pdf).astype(int)

import keras.utils

y_train = keras.utils.to_categorical(y_train, 2) 
y_test = keras.utils.to_categorical(y_test, 2)
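
The label-generation step above can be wrapped in a small helper.  This is a minimal sketch rather than part of the notebook's pipeline: the names theta and sample_labels, the seed, and the use of np.random.default_rng are assumptions made for illustration.  It draws one uniform number per point and assigns class 1 when that number falls below $\theta(x)$, so each label is Bernoulli distributed with parameter $\theta(x)$.

```python
import numpy as np

def theta(x):
    """Bell-curve class-1 probability, matching the y_pdf expression above."""
    return np.exp(-((x - 0.5) / 0.2) ** 2)

def sample_labels(x, rng):
    """Draw y ~ Bernoulli(theta(x)) by comparing uniform draws to theta(x)."""
    p = theta(x)
    p = p / p.max()          # rescale so the largest probability is 1, as above
    u = rng.random(x.shape)  # one uniform draw per point
    return (u <= p).astype(int)

rng = np.random.default_rng(0)        # hypothetical seed, for illustration only
x_demo = np.sort(rng.random(2000))
y_demo = sample_labels(x_demo, rng)

# Fraction of class-1 labels near the peak and near the edges of [0, 1]
near_peak = y_demo[np.abs(x_demo - 0.5) < 0.1].mean()
near_edge = y_demo[np.abs(x_demo - 0.5) > 0.4].mean()
print(near_peak, near_edge)
```

Class 1 should dominate near $x=0.5$ and be rare near the edges, which is the pattern the scatter plot of the training data shows.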

In [None]:
plt.plot(X_train,y_train[:,1],'k.')
plt.plot(X_test,y_test[:,1],'r.')
plt.show()

Not only is this data not linearly separable, it is also multimodal:  Naive Bayes would be bound to fail because we could not *a priori* determine a sensible probability model for the data.  Logistic regression with the linear basis would also be bound to fail because it can't deal with multiple peaks like this (although we could enrich the basis set instead).  A neural network will allow us to *learn* good basis functions, that is, how to transform the data to optimize classification.

I confess that I have not implemented a binary classifier: it seemed like a waste of effort when a multiclass method will work fine for the two-class case!  As such, we need the $T$ matrix (the one-hot representation of our class labels), which is what keras.utils.to_categorical produced above.
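
For reference, the one-hot $T$ matrix is easy to build by hand.  The sketch below is a numpy-only illustration; one_hot is a hypothetical helper, not a Keras function, and it assumes the labels are integers in the range 0 to n_classes - 1.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return the matrix T with T[i, k] = 1 when labels[i] == k, else 0."""
    labels = np.asarray(labels, dtype=int)
    T = np.zeros((labels.size, n_classes))
    T[np.arange(labels.size), labels] = 1.0   # one entry set per row
    return T

T_demo = one_hot([0, 1, 1, 0], 2)
print(T_demo)
```

Each row has exactly one nonzero entry, so column 1 of $T$ is just the original 0/1 label.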

In [None]:
import keras 
import keras.models as km
import keras.layers as kl

Now we can create a model.  As a baseline, we start with an empty Keras Sequential model, which we'll set up as plain (softmax) logistic regression.

In [None]:
logistic_model = km.Sequential()

We can build a neural network for our problem with the following syntax:

In [None]:
logistic_model.add(kl.Dense(2,input_shape=(1,),activation='softmax',use_bias=True))

The first argument is the number of nodes in the layer: two, one per class.  The input_shape=(1,) argument says that each input is a single number.  There is no hidden layer yet, so this model is just (softmax) logistic regression.

The activation argument is the activation function applied to the layer's output, here softmax.

The use_bias argument is a boolean value, which states whether to add a bias term to the layer.

The initial guess for the weights is controlled by the layer's kernel_initializer argument; we keep the Keras default here.

Once the model is trained, we can make predictions with its predict method.  Before we train, we need to compile the model, which sets the loss function (categorical cross-entropy), the optimizer (RMSprop with a learning rate of 0.01), and the metrics to report.  Keras computes the gradients for us, so there is no hand-written gradient code to verify with a finite difference.

In [None]:
logistic_model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.RMSprop(lr=0.01),
              metrics=['accuracy'] )
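
To make the loss concrete, here is a minimal numpy sketch of categorical cross-entropy for one-hot targets.  It is an illustration only: the function below and its eps clipping constant are assumptions, and the Keras implementation may differ in details such as clipping and how it averages over the batch.

```python
import numpy as np

def categorical_crossentropy(T, P, eps=1e-12):
    """Mean over examples of -sum_k T[i, k] * log(P[i, k])."""
    P = np.clip(P, eps, 1.0)   # avoid taking log(0)
    return -np.mean(np.sum(T * np.log(P), axis=1))

T_demo = np.array([[1.0, 0.0], [0.0, 1.0]])    # true classes: 0, then 1
P_good = np.array([[0.9, 0.1], [0.2, 0.8]])    # mostly right predictions
P_bad = np.array([[0.1, 0.9], [0.8, 0.2]])     # mostly wrong predictions

loss_good = categorical_crossentropy(T_demo, P_good)
loss_bad = categorical_crossentropy(T_demo, P_bad)
print(loss_good, loss_bad)
```

Because $T$ is one-hot, only the predicted probability of the true class contributes to each example's loss, so confident wrong predictions are penalized heavily.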

Now, we are all set to perform gradient descent.  The learning rate $\eta$ was already set in the optimizer above, so we just iterate for as long as we want.  We'll allow 1000 epochs, each using the full training set as a single batch:

In [None]:
logistic_model.fit(X_train,y_train,batch_size=m_train,epochs=1000,verbose=1,validation_data=(X_test,y_test))

When training is finished, we can plot the predicted probability of class 1 (red) against the test labels (black):

In [None]:
y_pred = logistic_model.predict(X_test)
plt.plot(X_test,y_pred[:,1],'r.')
plt.plot(X_test,y_test[:,1],'k.')
plt.show()
y_pred
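
It helps to see why a single softmax layer is limited here.  With two classes and one input, the softmax of two linear functions of $x$ is exactly a sigmoid of their difference, which is monotone in $x$.  The sketch below checks this numerically; the weights w and b are hypothetical values chosen for illustration, not the trained ones.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical weights and biases for a 1-input, 2-output dense layer
w = np.array([[1.5, -0.5]])   # shape (1, 2)
b = np.array([0.2, -0.3])     # shape (2,)

x = np.linspace(0, 1, 50).reshape(-1, 1)
p1_softmax = softmax(x @ w + b)[:, 1]

# The same quantity written as a sigmoid of the difference of the two logits
dw, db = w[0, 1] - w[0, 0], b[1] - b[0]
p1_sigmoid = 1.0 / (1.0 + np.exp(-(dw * x[:, 0] + db)))

print(np.allclose(p1_softmax, p1_sigmoid))
```

A monotone curve cannot rise and then fall again, so this model cannot reproduce a bump in the class-1 probability.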

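Softmax over a single linear layer can only produce a curve that is monotone in $x$, so the logistic model cannot capture the bump.  Now let's build an actual neural network, with one hidden layer of two sigmoid nodes, and train it for 5000 epochs on the full data.  This takes a minute to train.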

In [None]:
n_hidden = 2
nn_model = km.Sequential()
nn_model.add(kl.Dense(n_hidden,input_shape=(1,),use_bias=True,activation='sigmoid'))
nn_model.add(kl.Dense(2,use_bias=True,activation='softmax'))
nn_model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.RMSprop(lr=0.01),
              metrics=['accuracy'])


nn_model.fit(X_train,y_train,batch_size=m_train,epochs=5000,verbose=0,validation_data=(X_test,y_test))



The results, plotted below, are pretty good for a problem that would have thwarted one of our earlier classifiers.  Let's examine what this thing is doing a little more deeply.  It's particularly interesting to look at the outputs of the hidden layer, that is, the basis functions the model chose to transform the data to before classification.

In [None]:
y_pred = nn_model.predict(X_test)
plt.plot(X_test,y_pred[:,1],'r.')
plt.plot(X_test,y_test[:,1],'k.')
plt.show()

In [None]:
from keras import backend as K

# with a Sequential model
get_1st_layer_output = K.function([nn_model.layers[0].input],
                                  [nn_model.layers[0].output])
layer_output = get_1st_layer_output([X_test.reshape((m_test,1))])[0]

get_2nd_layer_input = K.function([nn_model.layers[0].input],
                                  [nn_model.layers[1].input])

layer_input = get_2nd_layer_input([X_test.reshape((m_test,1))])[0]
print(layer_output)
print(layer_input)


These basis functions represent a transform of our data to a new two-dimensional space (one dimension per hidden node).  It's instructive to see what we get when we scale them by the weights that we found with gradient descent and add them up: the inputs $a_1^{(2)}$ and $a_2^{(2)}$ to the final layer.

In [None]:
plt.plot(X_test,y_pred[:,1],'ro')
plt.plot(X_test,y_test[:,1],'ko')

plt.plot(X_test,layer_output[:,0],'o-')
plt.plot(X_test,layer_output[:,1],'o-')
#plt.plot(X_test,layer_output[:,2],'o-')
    
plt.show()
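
To see how two sigmoid basis functions can produce a bump, here is a numpy sketch of the forward pass through a network with the same shape as nn_model.  The weights W1, b1, W2 and b2 are hypothetical values picked by hand for illustration; they are not the values found by training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical weights (not the trained values): two steep sigmoids that
# switch on at x = 0.3 and x = 0.7
W1 = np.array([[20.0, 20.0]])      # input -> hidden, shape (1, 2)
b1 = np.array([-6.0, -14.0])       # hidden biases
W2 = np.array([[-4.0, 4.0],
               [4.0, -4.0]])       # hidden -> output, shape (2, 2)
b2 = np.array([2.0, -2.0])         # output biases

x = np.linspace(0, 1, 101).reshape(-1, 1)
hidden = sigmoid(x @ W1 + b1)      # the two basis functions
logits = hidden @ W2 + b2          # inputs to the softmax
probs = softmax(logits)            # column 1 is P(class 1 | x)

print(probs[[0, 50, 100], 1])      # class-1 probability at x = 0, 0.5, 1
```

The difference of the two sigmoids is large only between the two switch-on points, so the class-1 probability is high in the middle of the interval and low at both ends.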

Recall that the inputs to the softmax function are (unnormalized) log-probabilities: weighted sums of the hidden-layer curves above, plus a bias.  The decision boundaries occur where the log-probabilities for each class are equal, which we plot at the end of this example.

First, it's interesting to look at the test points in the space of the two hidden-layer outputs, colored by class.

In [None]:
plt.scatter(layer_output[:,0],layer_output[:,1],c=y_test[:,1])
plt.show()

In [None]:
w_and_b = nn_model.get_weights()
w = w_and_b[2]
b = w_and_b[3]
final_log_probs = layer_output @ w + b
plt.plot(X_test,final_log_probs[:,0])
plt.plot(X_test,final_log_probs[:,1])
plt.scatter(X_test,np.zeros_like(X_test),c=y_pred[:,1]>y_pred[:,0])
plt.show()
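
The decision boundaries can also be located numerically, as the points where the two log-probability curves cross.  The sketch below is a self-contained illustration: the curves log_p0 and log_p1 are hypothetical, built from two hand-picked sigmoids rather than from the trained weights, and the crossing points are found by linear interpolation between grid points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0, 1, 1001)

# Hypothetical log-probability curves for the two classes (not trained values)
bump = sigmoid(20 * x - 6) - sigmoid(20 * x - 14)
log_p0 = 2.0 - 4.0 * bump
log_p1 = -2.0 + 4.0 * bump

# The predicted class flips wherever the difference changes sign
d = log_p1 - log_p0
flips = np.where(np.sign(d[:-1]) != np.sign(d[1:]))[0]

# Linear interpolation between the two grid points that bracket each flip
boundaries = x[flips] - d[flips] * (x[flips + 1] - x[flips]) / (d[flips + 1] - d[flips])
print(boundaries)
```

For these curves there are two crossings, one on each side of the bump, so class 1 is predicted only in the interval between them.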

In [None]:
import keras

from keras.datasets import mnist

import keras.models as km
import keras.layers as kl
from keras import backend as K

batch_size = 256
N = 10
n_hidden=300
epochs = 24

rows,cols = 28,28
n = rows*cols

(X_train,y_train),(X_test,y_test) = mnist.load_data()

m_train = X_train.shape[0]
m_test = X_test.shape[0]

X_train = X_train.reshape((m_train,rows*cols))
X_test = X_test.reshape((m_test,rows*cols))

X_train = X_train/255
X_test = X_test/255

y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

model = km.Sequential()
model.add(kl.Dense(N,input_shape=(n,),activation='sigmoid',kernel_regularizer=keras.regularizers.l2(0.001)))
#model.add(kl.Dense(N,activation='sigmoid',kernel_regularizer=keras.regularizers.l2(0.001)))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(X_train,y_train,batch_size=batch_size,epochs=epochs,verbose=1,validation_data=(X_test,y_test))
score = model.evaluate(X_test,y_test,verbose=1)

In [None]:
weights = model.get_weights()
fig,axs = plt.subplots(nrows=1,ncols=10)
fig.set_size_inches(16,2)
for w,ax in zip(weights[0].T,axs):
    ax.imshow(w.reshape((28,28)))
plt.show()



In [None]:
import keras
from keras.datasets import mnist
import keras.models as km
import keras.layers as kl
from keras import backend as K

batch_size = 256
N = 10
n_hidden_1=512
n_hidden_2=512
epochs = 24

rows,cols = 28,28
n = rows*cols

(X_train,y_train),(X_test,y_test) = mnist.load_data()

m_train = X_train.shape[0]
m_test = X_test.shape[0]

X_train = X_train.reshape((m_train,rows*cols))
X_test = X_test.reshape((m_test,rows*cols))

X_train = X_train/255
X_test = X_test/255

y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

model = km.Sequential()
model.add(kl.Dense(n_hidden_1,input_shape=(n,),activation='relu',kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(n_hidden_2,activation='relu'))
model.add(kl.Dropout(0.2))
model.add(kl.Dense(N,activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(X_train,y_train,batch_size=batch_size,epochs=epochs,verbose=1,validation_data=(X_test,y_test))
score = model.evaluate(X_test,y_test,verbose=1)


In [None]:

import matplotlib.pyplot as plt
import numpy as np
plt.imshow(model.get_weights()[0][:,np.random.randint(n_hidden_1)].reshape(28,28))
plt.show()
