# Supervised neural computation - Practical

Dependencies:
- Python (>= 2.6 or >= 3.3)
- NumPy (>= 1.6.1)
- SciPy (>= 0.12)
- scikit-learn (>= 0.18.1)
- Pandas (>= 0.19.2)

A single neuron can adapt to the available data if and only if the distribution of the data matches its activation function (for instance, a linear neuron fits only linear data). The more ambitious goal is to learn “arbitrary” data, not just data that happens to fit the transfer function of a given neuron. We achieve this by combining multiple neurons in a neural network with feedforward connectivity between the neuron layers. In such a structure each neuron represents some aspect of the data, and the neurons higher up in the hierarchy combine these aspects.

The manner in which the neurons of a neural network are structured is intimately linked with the learning algorithm used to train the network. In a layered neural network, the neurons are organized in the form of layers. The simplest form of a layered network has an input layer of (external) source nodes that projects directly to an output layer of neurons (computational nodes), but not vice versa. This network is strictly feedforward! More complex feedforward neural networks additionally contain one or more hidden layers, whose computational nodes are correspondingly called hidden neurons or hidden units; the term “hidden” refers to the fact that this part of the neural network is not seen directly from either the input or the output of the network.

Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function f: R^m → R^o by training on a dataset, where m is the number of input dimensions and o is the number of output dimensions. Given a set of features X = {x1, x2, ..., xm} and a target y, it can learn a non-linear function approximator for either classification or regression.
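To make the mapping from m inputs to o outputs concrete, here is a minimal NumPy sketch of the forward pass of an MLP with one hidden layer. The layer sizes, the `tanh` hidden activation, the linear output and the function name `mlp_forward` are illustrative assumptions, not something prescribed by this practical.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP mapping R^m -> R^o."""
    h = np.tanh(np.dot(W1, x) + b1)   # hidden activations, shape (n_hidden,)
    return np.dot(W2, h) + b2         # linear output layer, shape (o,)

rng = np.random.RandomState(0)
m, n_hidden, o = 3, 5, 2              # illustrative sizes (assumed)

# Small random weights, zero biases
W1 = rng.normal(scale=0.1, size=(n_hidden, m))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(o, n_hidden))
b2 = np.zeros(o)

x = np.array([0.5, -1.0, 2.0])        # one sample with m = 3 features
y_hat = mlp_forward(x, W1, b1, W2, b2)
print(y_hat.shape)
```

With untrained weights the output values are meaningless; the point is only that a vector with m entries goes in and a vector with o entries comes out.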

The error backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous paper published in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams. The backpropagation algorithm searches for the minimum of the error function in weight space, using gradient descent. The particular combination of weights which minimizes the error function is considered to be the solution for learning a representation of data. Since this method requires computation of the gradient of the error function at each iteration step, we must guarantee continuity and differentiability of the error function.
The goal of backpropagation is to compute the partial derivatives of the cost function with respect to any weights in the network.
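As an illustration of what "computing the partial derivatives" means, the following sketch applies the chain rule to a tiny network (two inputs, three `tanh` hidden units, one linear output, no biases, squared error) and compares one analytic derivative with a finite-difference estimate. The network shape, the loss and all variable names are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.array([0.5, -1.0])                   # one input sample
t = np.array([1.0])                         # its target value
W1 = rng.normal(scale=0.5, size=(3, 2))     # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(1, 3))     # hidden -> output weights

def loss(W1, W2):
    """Squared error E = 0.5 * (y - t)^2 for the single sample."""
    h = np.tanh(np.dot(W1, x))
    y = np.dot(W2, h)
    return 0.5 * np.sum((y - t) ** 2)

# Forward pass, keeping the intermediate values
h = np.tanh(np.dot(W1, x))
y = np.dot(W2, h)

# Backward pass: chain rule, from the output layer back to the input layer
delta_out = y - t                                    # dE/dy
grad_W2 = np.outer(delta_out, h)                     # dE/dW2, shape (1, 3)
delta_hid = np.dot(W2.T, delta_out) * (1 - h ** 2)   # tanh'(a) = 1 - tanh(a)^2
grad_W1 = np.outer(delta_hid, x)                     # dE/dW1, shape (3, 2)

# Central finite difference for a single weight as a sanity check
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)
print("analytic: {}  numeric: {}".format(grad_W1[0, 0], numeric))
```

A gradient check like this is a common way to catch sign or indexing mistakes in a hand-written backward pass.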

## Multi-Layer Perceptron classification

MLP trains using Stochastic Gradient Descent. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, such as the neural weights. Stochastic Gradient Descent is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions. 
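The update rule can be sketched for the simplest case named above, a linear classifier with log-loss. The toy data, the learning rate `eta` and the number of epochs are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy, linearly separable data: the label is 1 when x0 + x1 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate (assumed value)

for epoch in range(20):
    # Visit the samples in a fresh random order each epoch
    for i in rng.permutation(len(X)):
        p = 1.0 / (1.0 + np.exp(-(np.dot(w, X[i]) + b)))  # predicted probability
        grad = p - y[i]            # derivative of the log-loss w.r.t. the pre-activation
        w -= eta * grad * X[i]     # one SGD step per sample
        b -= eta * grad

accuracy = np.mean((np.dot(X, w) + b > 0) == (y == 1))
print(accuracy)
```

Each parameter update uses the gradient of the loss on a single sample, which is what distinguishes stochastic from full-batch gradient descent.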

Even though SGD has been around in the machine learning community for a long time, it has only recently received considerable attention in the context of large-scale learning. SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, scikit-learn's SGD-based linear classifiers easily scale to problems with more than 10^5 training examples and more than 10^5 features.

We will use the Multi-Layer Perceptron classifier from scikit-learn. This model optimizes the log-loss function using L-BFGS or stochastic gradient descent.

MLPClassifier trains iteratively since at each time step the partial derivatives of the loss function with respect to the model parameters are computed to update the parameters. It can also have a regularization term added to the loss function that shrinks model parameters to prevent overfitting.

In [1]:
# MLP trains on two arrays: array X of size (n_samples, n_features), which holds the training samples represented as floating point feature vectors; 
# and array y of size (n_samples,), which holds the target values (class labels) for the training samples:
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)  

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that you must apply the same scaling to the test set for meaningful results. You can use StandardScaler for standardization.

In [2]:
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
# Don't cheat - fit only on training data
scaler.fit(X)  
X_train = scaler.transform(X)  

In [3]:
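# after training, the network can predict labels for new samples
# (note: clf above was fit on the unscaled X; in practice, fit the classifier on X_train
# so that training and test data pass through the same scaling)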
X_test = [[2., 2.], [-1., -2.]]
# apply same transformation to test data
X_test = scaler.transform(X_test)  
clf.predict(X_test)

array([1, 0])
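One way to make sure that exactly the same scaling is applied during training and prediction is to chain the scaler and the classifier. The sketch below uses scikit-learn's `make_pipeline` with the toy data from above; the variable names `model` and `pred` are illustrative.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = [[0., 0.], [1., 1.]]
y = [0, 1]

# The pipeline fits the scaler on the training data only and
# re-applies the same transformation inside predict()
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='lbfgs', alpha=1e-5,
                  hidden_layer_sizes=(5, 2), random_state=1),
)
model.fit(X, y)

pred = model.predict([[2., 2.], [-1., -2.]])
print(pred)
```

With this arrangement the classifier never sees unscaled inputs, so there is no risk of training on raw data and predicting on scaled data.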

In [4]:
# Further, the model supports multi-label classification, in which a sample can belong to more than one class.
# For each class, the raw output passes through the logistic function. Values greater than or equal to 0.5 are rounded to 1, otherwise to 0.
# For a predicted output of a sample, the indices where the value is 1 represent the assigned classes of that sample:
X = [[0., 0.], [1., 1.]]
y = [[0, 1], [1, 1]]
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(15,), random_state=1)
clf.fit(X, y)                       
# test for new samples
clf.predict([[1., 2.]])
clf.predict([[0., 0.]])

array([[0, 1]])
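The thresholding described in the comments can be made visible by looking at the per-label probabilities. This is a small sketch on the same toy data; the variable names `proba` and `labels` are illustrative, and the comparison with `predict` is printed rather than assumed.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [[0, 1], [1, 1]]
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(15,), random_state=1)
clf.fit(X, y)

# One logistic output per label, each between 0 and 1
proba = clf.predict_proba([[1., 2.]])
# Threshold each label independently at 0.5
labels = (proba >= 0.5).astype(int)

print("probabilities: {}".format(proba))
print("thresholded:   {}".format(labels))
print("predict():     {}".format(clf.predict([[1., 2.]])))
```

Because each label has its own logistic output, the probabilities in a row do not need to sum to 1, unlike in single-label multi-class classification.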

To test the capabilities of the MLP classifier network we apply it to the Iris dataset. As a reminder, the Iris dataset is a traditional benchmark in classification problems in ML. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. 

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
datatrain = pd.read_csv('./data/iris.csv')

# Map the class names to integer labels
datatrain['class'] = datatrain['class'].map({'Iris-setosa': 0,
                                             'Iris-versicolor': 1,
                                             'Iris-virginica': 2})
datatrain = datatrain.apply(pd.to_numeric)

# Convert the dataframe to a NumPy array
datatrain_array = datatrain.values

# Split x and y (feature and target)
X_train, X_test, y_train, y_test = train_test_split(datatrain_array[:,:4],
                                                    datatrain_array[:,4],
                                                    test_size=0.2)

We build an MLP classifier with one hidden layer. The input layer has 4 neurons, one per Iris feature; the hidden layer has 10 neurons with ReLU activation; and the output layer has 3 neurons, one per Iris class. We use stochastic gradient descent with scikit-learn's default batch size (`batch_size='auto'`, i.e. min(200, n_samples)) and a categorical cross-entropy loss function. We choose a learning rate of 0.01 and train the network for at most 500 epochs. Afterwards we test the classification of a single sample.

In [6]:
from sklearn.neural_network import MLPClassifier
import numpy as np

mlp = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd', learning_rate_init=0.01, max_iter=500)

# Train the model
mlp.fit(X_train, y_train)

# Test the model (mean accuracy on the held-out test set)
print(mlp.score(X_test, y_test))
# test for a single point: sepal length, sepal width, petal length, petal width (cm)
sl = 5.3
sw = 1
pl = 5.2
pw = 0.2
data = [sl, sw, pl, pw]
data = np.array(data).reshape((1, -1))
print(mlp.predict(data))


0.966666666667
[ 2.]
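The layer sizes described for the Iris network can be checked by inspecting the weight matrices of a fitted model. The sketch below loads the data with scikit-learn's bundled `load_iris` instead of the CSV file, and fixes `random_state`; both are assumptions made so that the example is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

iris = load_iris()   # bundled copy of the Iris data (150 samples, 4 features)

mlp = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd',
                    learning_rate_init=0.01, max_iter=500, random_state=0)
mlp.fit(iris.data, iris.target)

# One weight matrix per pair of consecutive layers:
# input (4) -> hidden (10) and hidden (10) -> output (3)
shapes = [w.shape for w in mlp.coefs_]
print(shapes)
```

The first matrix connects the 4 inputs to the 10 hidden neurons and the second connects the hidden neurons to the 3 output neurons; the bias vectors are stored separately in `mlp.intercepts_`.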


# Assignments

In this problem you shall implement an MLP classifier that learns to separate two classes. The two classes are not linearly separable: the decision boundaries are a parabola, a sine curve and a power law (see the figures below).
Your algorithm reads a number of 3-valued training examples: each example consists of two inputs (the "x" and "y" values) and a desired output value of +1 or -1. The exact number of training examples is unknown, but you can safely assume that you will read at most 1000 from the input file.
At some point your program will encounter the training example '0,0,0\n' (note the desired output of zero, which is invalid!). This indicates that the training data has been read completely, and your program should start training the network.

After training, your program continues to read 2-valued evaluation data: for each such example your program should report the corresponding class (+1 or -1) followed by return ('\n') as output.

Note (1): Do not include the example (0,0,0) in your training set!
Note (2): Your program needs to output the string '+1\n'  (with a plus sign) for the positive class, not just a '1\n'.
Note (3): Think about how you compute the "error" of your network for training.

Hint:
- size of network (number of hidden layers and neurons per layer)
- initialization of weights (use small random numbers)
- learning rate (suggestion: choose a small constant value)
- normalization of input and output data (assume that we do not query the network outside of the training domain)
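The input protocol described above (training triples, a '0,0,0' terminator, then evaluation pairs) can be handled separately from the network itself. The following sketch covers only the parsing and the output formatting; the function names `split_protocol` and `format_class`, and the choice of 0 as the decision threshold on the network output, are hypothetical.

```python
def split_protocol(lines):
    """Split raw lines into training triples and evaluation pairs.

    The line '0,0,0' marks the end of the training data and is not
    added to either list.
    """
    train, evaluate = [], []
    reading_train = True
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        values = [float(v) for v in line.split(',')]
        if reading_train and values == [0.0, 0.0, 0.0]:
            reading_train = False         # terminator reached
            continue
        if reading_train:
            train.append(values)
        else:
            evaluate.append(values)
    return train, evaluate

def format_class(output):
    """Turn a real-valued network output into the required '+1' / '-1' string."""
    return '+1' if output >= 0 else '-1'

sample = ['0.5,1.0,+1\n', '-0.3,0.2,-1\n', '0,0,0\n', '0.1,0.9\n']
train, evaluate = split_protocol(sample)
print("{} training examples, {} evaluation examples".format(len(train), len(evaluate)))
print(format_class(0.7))
```

Keeping the parsing in its own function makes it easy to test with a few hand-written lines before any network code exists.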

2-class classification with parabola decision boundary
![title](img/class_parabola.png)

In [7]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/class_parabola_in.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/class_parabola_out.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

2-class classification with sine decision boundary
![title](img/class_sin.png)

In [8]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/class_sin_in.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/class_sin_out.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

2-class classification with powerlaw decision boundary
![title](img/class_powerlaw.png)

In [9]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/class_powerlaw_in.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/class_powerlaw_out.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

## Multi-Layer Perceptron regression


MLPRegressor trains iteratively since at each time step the partial derivatives of the loss function with respect to the model parameters are computed to update the parameters.
It can also have a regularization term added to the loss function that shrinks model parameters to prevent overfitting.

We will use the Multi-Layer Perceptron regressor from scikit-learn. This model optimizes the squared-loss function using L-BFGS or stochastic gradient descent.

In [10]:
# In a first simple example we generate data corresponding to a parabola and learn it with a MLP regressor
# dataset for training (pairs (x,y))
from sklearn.neural_network import MLPRegressor 
 
# create training dataset
train_x = [[x] for x in range(200)]
train_y = [x[0]**2 for x in train_x]

# create neural net regressor
reg = MLPRegressor(hidden_layer_sizes=(50,), solver='lbfgs')
reg.fit(train_x, train_y)

# test prediction (these inputs lie outside the training range, so the network extrapolates)
test_x = [[x] for x in range(201, 220, 2)]

predict = reg.predict(test_x)
print("_Input_\t_output_")
for i in range(len(test_x)):
    print("   {} ----> {}".format(test_x[i], predict[i]))


_Input_	_output_
   [201] ----> 39895.1395791
   [203] ----> 40596.5231093
   [205] ----> 41297.9066394
   [207] ----> 41999.2901695
   [209] ----> 42700.6736997
   [211] ----> 43402.0572298
   [213] ----> 44103.4407599
   [215] ----> 44804.82429
   [217] ----> 45506.2078202
   [219] ----> 46207.5913503


# Assignments


In this problem you shall implement an MLP regressor that learns to regress (approximate) an arbitrary nonlinear function.
Your algorithm reads a number of 2-valued training examples: each example consists of one input ("x") and a desired output value ("y"). The exact number of training examples is unknown, but you can safely assume that you will read at most 1000.
At some point your program will encounter the training example '0,0\n' (note that this input might be a possible training point, but we define (0,0) to be invalid!). This indicates that the training data has been read completely, and your program should start training the network.

After training, your program continues to read 1-valued evaluation data: for each such example your program should compute and print the network's output followed by return ('\n').

Note (1): Do not include the data (0,0) in your training set!
Note (2): Your network might not be able to learn the data without a remaining error. This is normal. We will tolerate small deviations around the required output, typically within 5% of the overall output range.

Recommendation: use "tanh" neurons in your hidden layer(s) and a linear neuron in your output.

Hint: Remember the discussion in class about
- initialization of weights (use small random numbers)
- learning rate (suggestion: choose a small constant value)
- normalization of input and output data (assume that we do not query the network outside of the training domain)
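The normalization hint can be sketched as a linear map from the range seen in training to [-1, +1], together with its inverse for converting network outputs back to the original units. The class name `MinMaxToUnit` and the example values are hypothetical; the target range [-1, +1] is an assumption that matches the output range of `tanh`.

```python
import numpy as np

class MinMaxToUnit(object):
    """Linearly map values from [lo, hi] (seen in training) to [-1, +1] and back."""

    def fit(self, values):
        values = np.asarray(values, dtype=float)
        self.lo = values.min()
        self.hi = values.max()
        return self

    def transform(self, values):
        values = np.asarray(values, dtype=float)
        return 2.0 * (values - self.lo) / (self.hi - self.lo) - 1.0

    def inverse_transform(self, scaled):
        scaled = np.asarray(scaled, dtype=float)
        return (scaled + 1.0) / 2.0 * (self.hi - self.lo) + self.lo

# Example: targets of a parabola sampled at a few points
y_train = np.array([0.0, 4.0, 16.0, 36.0])
scaler_y = MinMaxToUnit().fit(y_train)

y_scaled = scaler_y.transform(y_train)          # values the network is trained on
y_back = scaler_y.inverse_transform(y_scaled)   # back to original units for printing
print(y_scaled)
```

The same kind of object can be fitted separately for the inputs; since the network is not queried outside the training domain, the scaled values stay within [-1, +1].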

Nonlinear function approximation for a parabola.
![title](img/reg_parabola.png)

In [11]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/reg_parabola_in.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/reg_parabola_out.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

Nonlinear function approximation for a sine wave.
![title](img/reg_sin.png)

In [12]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/reg_sin_in.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/reg_sin_out.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

Nonlinear function approximation for a ramp.
![title](img/reg_ramp.png)

In [13]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/reg_ramp_in.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/reg_ramp_out.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

This is exactly the same scenario as in the previous section (non-linear data), but here the training data is noisy! You need to use techniques such as simulated annealing and/or early stopping.
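As a point of reference for early stopping, scikit-learn's `MLPRegressor` can hold out part of the training data and stop once the validation score no longer improves. The sketch below uses synthetic noisy parabola data rather than the assignment files; the noise level, network size and solver are assumptions, and for the assignment itself you would implement the same idea in your own training loop.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)

# Synthetic noisy parabola (stand-in for the assignment data)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.05, size=200)

# early_stopping=True sets aside validation_fraction of the training data
# and stops training when the validation score stops improving
reg = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh', solver='adam',
                   early_stopping=True, validation_fraction=0.2,
                   max_iter=2000, random_state=0)
reg.fit(x, y)

pred = reg.predict(x)
print("stopped after {} iterations".format(reg.n_iter_))
```

Note that in scikit-learn early stopping only takes effect for the `sgd` and `adam` solvers, not for `lbfgs`.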

Nonlinear noisy function approximation for a parabola.
![title](img/reg_parabola_noise.png)

In [14]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/reg_parabola_in_noise.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/reg_parabola_out_noise.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

Nonlinear noisy function approximation for a sine wave.
![title](img/reg_sine_noise.png)

In [15]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/reg_sin_in_noise.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/reg_sin_out_noise.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here

Nonlinear noisy function approximation for an arbitrary nonlinear function.
![title](img/reg_nonlinear_noise.png)

In [16]:
# load the datasets for training and testing
import numpy as np
import csv 
with open('./data/reg_nonlinear_in_noise.txt') as inputfile:
    train_data = list(csv.reader(inputfile))
with open('./data/reg_nonlinear_out_noise.txt') as inputfile:
    test_data = list(csv.reader(inputfile))
    
# add network code here