In [137]:
"""
Name: Amit Prakash
HW Group Number: 5
"""

'\nName: Amit Prakash\nHW Group Number: 5\n'

## EC-p2: Artificial Neural Networks


### 1) Neural Network Playground

Go to TensorFlow's [Neural Network Playground](https://playground.tensorflow.org/). This website is an interactive, exploratory visualization of how the features, number of layers, training time, etc., influence the classification boundaries of an ANN. Right now, we'll only concern ourselves with *classification* problems.



## 2) Training and Testing a Neural Network 

For this problem, you'll be looking at a subset of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), which contains images of hand-written digits: 10 classes where each class refers to a digit.

Each data entry is an 8x8 input matrix in which each element is an integer in the range 0..16. In the dataset, the matrix is flattened into a 64-element vector.
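To make the flattening concrete, here is a small sketch of how a 64-element row maps back to an 8x8 image. The pixel values are made up for illustration (they are not taken from the dataset), and the row-major layout is an assumption about how the rows are stored.

```python
import numpy as np

# Hypothetical flattened sample: 64 made-up intensities in the range 0..16.
rng = np.random.default_rng(0)
flat = rng.integers(0, 17, size=64)

# A row-major reshape recovers the 8x8 image:
# pixel (row, col) sits at index row * 8 + col of the flat vector.
image = flat.reshape(8, 8)

# Flattening the image again gives back the original vector.
restored = image.ravel()
print(image.shape, np.array_equal(restored, flat))
```

With the real data, the same reshape can be applied to any single row of `df.data`.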


For this question, **you have enough experience to do the entire model pipeline yourself**. That means *loading the data, creating splits, scaling the data, training and tuning the model, and evaluating the model.*

In [138]:
#Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

random_state = 42

### Step 1: Load the data. Use `np.unique()` to check the class balance.

In [139]:
from sklearn.datasets import load_digits
df = load_digits()

In [140]:
df.data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [141]:
# Get a distribution of the class label (target)
np.unique(df.target, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180]))

### Step 2: Split the data into X (features) and Y (class)

Assign the variables below to split the dataset into X (features) and Y (target).

In [142]:
X = df.data
Y = df.target

X = X[:200]
Y = Y[:200]

### Step 3: Create your train/test split. Use the provided random_state.

**Note**: You should use a `train_size` of 0.8, or 80%, and make sure to use the `random_state` to ensure test cases work.
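As a rough illustration of what a shuffled 80/20 split does, the sketch below permutes the row indices once and cuts them at the 80% mark. The helper `shuffle_split` is hypothetical, and it uses NumPy's own generator, so it will not reproduce the exact rows that `train_test_split` selects for the same seed.

```python
import numpy as np

def shuffle_split(n_samples, train_size, seed):
    """Return (train_idx, test_idx) taken from one random permutation of the rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(np.floor(train_size * n_samples))
    return order[:n_train], order[n_train:]

train_idx, test_idx = shuffle_split(n_samples=200, train_size=0.8, seed=42)
print(len(train_idx), len(test_idx))

# Indexing features and labels with the SAME index arrays keeps each row
# paired with its label, e.g. X[train_idx] and Y[train_idx].
```

The key point is that features and labels must be split with identical indices, which is why the same `random_state` matters whenever they are split.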

In [143]:
from sklearn.model_selection import train_test_split

train_data_fraction = 0.8

# Split features and labels in a single call so that rows and labels stay aligned
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size = train_data_fraction, random_state = random_state)

In [144]:
assert X_train.shape == (160, 64)
assert y_train.shape == (160,)
assert X_test.shape == (40, 64)
assert y_test.shape == (40,)

### Step 4: Use a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize the image data. 

Pixel data, like other data we've encountered, should often be scaled before classification. While in practice scaling image data can be more complex, in this exercise we'll continue to use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Fit the scaler only on the training X features, and then apply it to both the training and test X features. We do this because in practice we wouldn't be able to see the test data, so it shouldn't influence the feature transformation. We therefore use only X_train to fit the transformation.
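The fit-on-train, apply-to-both idea can be sketched by hand with NumPy. This is only an approximation of what a standard scaler does, written under the assumption that constant columns are divided by 1 instead of 0; the arrays are random stand-ins, not the digits data, and the helper names are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn per-column mean and scale from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant columns have std 0; divide them by 1 to avoid division by zero.
    scale = np.where(std == 0, 1.0, std)
    return mean, scale

def apply_standardizer(X, mean, scale):
    """Apply previously learned statistics to any set of rows."""
    return (X - mean) / scale

rng = np.random.default_rng(0)
X_tr = rng.integers(0, 17, size=(160, 64)).astype(float)  # stand-in training rows
X_te = rng.integers(0, 17, size=(40, 64)).astype(float)   # stand-in test rows
X_tr[:, 0] = 0.0  # a constant column, like an always-blank corner pixel
X_te[:, 0] = 0.0

mean, scale = fit_standardizer(X_tr)
X_tr_s = apply_standardizer(X_tr, mean, scale)
X_te_s = apply_standardizer(X_te, mean, scale)  # reuses the training statistics
```

Because the test rows are scaled with the training statistics, their column means will generally be close to, but not exactly, 0.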

In [145]:
from sklearn.preprocessing import StandardScaler

# Assign these variables the standardized training and test datasets
stdScaler = StandardScaler().fit(X_train)
X_stand_train = stdScaler.transform(X_train)
X_stand_test = stdScaler.transform(X_test)

In [146]:
X_stand_train.shape

(160, 64)

In [147]:
# Go through each attribute
for i in range(X_stand_train.shape[1]):
    # Calculate the mean of that attribute: it should be 0
    np.testing.assert_almost_equal(np.mean(X_stand_train[:, i]), 0)
    # Calculate the standard deviation: it should be 1
    std = np.std(X_stand_train[:, i])
    # However, if the std was already 0, standardization won't change that,
    # so skip this case
    if abs(std) < 0.01:
        continue
    np.testing.assert_almost_equal(std, 1)

### Step 5:  Train an MLP with default hyperparameters.

For the following, you'll be using sklearn's built in Multi-layer Perceptron classifier [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

Use the default hyperparams aside from `max_iter`. `max_iter` is the maximum number of training iterations the ANN runs before it stops, whether or not it has converged. The default `max_iter=200` takes too long for our data currently.

**Use random_state as the `random_state` argument and max_iter=20**. The default parameters will use a single hidden layer.
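For intuition, a single-hidden-layer network's prediction step can be sketched in a few lines of NumPy. This sketch assumes 64 inputs, 100 ReLU hidden units, and a 10-way softmax output; the weights are random and untrained, so the predictions are meaningless and only the shapes and the flow of data are of interest.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
n_features, n_hidden, n_classes = 64, 100, 10

# Random, untrained weights (illustrative only).
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

X_batch = rng.normal(size=(5, n_features))  # five stand-in standardized samples

hidden = relu(X_batch @ W1 + b1)     # hidden layer activations, shape (5, 100)
probs = softmax(hidden @ W2 + b2)    # class probabilities, shape (5, 10)
pred = probs.argmax(axis=1)          # predicted digit per sample
print(probs.shape, pred)
```

Training adjusts `W1`, `b1`, `W2`, and `b2` by backpropagation so that `probs` puts more mass on the correct digit.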



In [148]:
from sklearn.neural_network import MLPClassifier

In [149]:
clf = MLPClassifier(max_iter = 20, verbose=True, random_state = random_state).fit(X_stand_train, y_train)
# Tip: if you pass your MLP the parameter verbose=True, you can see each iteration of its backpropagation

Iteration 1, loss = 2.72340110
Iteration 2, loss = 2.61722937
Iteration 3, loss = 2.51331744
Iteration 4, loss = 2.41173375
Iteration 5, loss = 2.31250102
Iteration 6, loss = 2.21554103
Iteration 7, loss = 2.12100080
Iteration 8, loss = 2.02899668
Iteration 9, loss = 1.93964810
Iteration 10, loss = 1.85290300
Iteration 11, loss = 1.76901269
Iteration 12, loss = 1.68775698
Iteration 13, loss = 1.60922366
Iteration 14, loss = 1.53361925
Iteration 15, loss = 1.46082174
Iteration 16, loss = 1.39076614
Iteration 17, loss = 1.32358090
Iteration 18, loss = 1.25908195
Iteration 19, loss = 1.19745453
Iteration 20, loss = 1.13851469
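The log above can be checked for convergence with a little arithmetic. The sketch below copies the 20 logged loss values and compares each per-iteration improvement against a tolerance of `1e-4`, which I believe is the classifier's default `tol`; treat that threshold as an assumption.

```python
# Loss values copied from the training log above.
losses = [
    2.72340110, 2.61722937, 2.51331744, 2.41173375, 2.31250102,
    2.21554103, 2.12100080, 2.02899668, 1.93964810, 1.85290300,
    1.76901269, 1.68775698, 1.60922366, 1.53361925, 1.46082174,
    1.39076614, 1.32358090, 1.25908195, 1.19745453, 1.13851469,
]

# Improvement between consecutive iterations.
drops = [prev - cur for prev, cur in zip(losses, losses[1:])]

tol = 1e-4  # assumed stopping tolerance
still_improving = all(drop > tol for drop in drops)

print(f"smallest drop: {min(drops):.4f}")
print(f"last drop:     {drops[-1]:.4f}")
print(f"every step improved by more than tol: {still_improving}")
```

Every step still improves the loss by far more than the tolerance, which suggests training was cut off by `max_iter=20` and had not converged.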


### Step 6:  Evaluate the model on the test dataset using a confusion matrix and a classification report

Like all classifiers, the MLP has a [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.predict) function that is used to make predictions on training or test data.

In [150]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [151]:
# Evaluate the classifier and assign mlp_cm to the confusion matrix of the evaluation
y_predicted = clf.predict(X_stand_test)
mlp_cm = confusion_matrix(y_true = y_test, y_pred = y_predicted)
mlp_cm

array([[4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
       [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
       [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]])

In [152]:
#np.testing.assert_almost_equal(mlp_cm, [[146,   0,   3,   0,   3,   0,   0,   0,   1,   0],
#       [  1,  87,  27,   4,  12,   5,   3,   2,  20,   0],
#       [  0,   4, 127,   6,   2,   1,   4,   0,  10,   5],
#       [  0,   2,  19, 115,   0,   2,   2,   0,   9,  17],
#       [  6,   2,   3,   2, 135,   7,   0,   3,   1,   3],
#       [  7,   3,  12,   0,   1, 133,   1,   0,   3,   2],
#       [ 21,   0,  82,   0,  12,   6,  35,   0,  11,   2],
#       [  0,   5,   1,  45,   2,  13,   0,  91,  10,   0],
#       [  7,  10,  21,   5,  12,  21,   2,   0,  53,  21],
#       [ 10,   2,  36,  13,   5,  17,   1,   4,  28,  51]])

np.testing.assert_almost_equal(mlp_cm, [[4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
       [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
       [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]])

In [153]:
# Similarly generate a classification report for the test dataset
mlp_clf_report = classification_report(y_true = y_test, y_pred = y_predicted)
print(mlp_clf_report)

              precision    recall  f1-score   support

           0       0.67      1.00      0.80         4
           1       1.00      1.00      1.00         4
           2       0.57      0.80      0.67         5
           3       0.50      0.50      0.50         2
           4       0.67      1.00      0.80         2
           5       0.43      1.00      0.60         3
           6       1.00      0.12      0.22         8
           7       1.00      0.67      0.80         3
           8       0.00      0.00      0.00         5
           9       0.17      0.25      0.20         4

    accuracy                           0.55        40
   macro avg       0.60      0.63      0.56        40
weighted avg       0.62      0.55      0.50        40
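The numbers in the report above can be recomputed by hand from the test confusion matrix shown earlier (rows are true digits, columns are predicted digits). This is a sketch of the standard definitions, with any undefined ratio (a zero denominator) set to 0.

```python
import numpy as np

# Test confusion matrix copied from the output above.
cm = np.array([
    [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
    [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
])

tp = np.diag(cm).astype(float)   # correctly classified samples per digit
predicted = cm.sum(axis=0)       # column sums: how often each digit was predicted
actual = cm.sum(axis=1)          # row sums: the support of each digit

precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
accuracy = tp.sum() / cm.sum()

for digit in range(10):
    print(f"{digit}: precision={precision[digit]:.2f} recall={recall[digit]:.2f} f1={f1[digit]:.2f}")
print(f"accuracy={accuracy:.2f} macro f1={f1.mean():.2f}")
```

Reading the matrix this way also shows where errors concentrate: row 6 has only one correct prediction out of eight, and row 8 has none.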



In [154]:
#assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.67      0.1      0.80       4\n           1       1.00      1.00      1.00       4\n           2       0.57      0.80      0.67       5\n           3       0.50      0.50      0.50       2\n           4       0.67      1.00      0.80         2\n           5       0.43      1.00      0.60         3\n           6       1.00      0.12      0.22         8\n           7       1.00      0.67      0.80         3\n           8       0.00      0.00      0.00         5\n           9       0.17      0.25      0.20         4\n\n    accuracy                           0.55        40\n   macro avg       0.60      0.63      0.56        40\n weighted avg       0.62      0.55      0.50        40\n'
assert mlp_clf_report == classification_report(y_test, clf.predict(X_stand_test))

In [155]:
# For comparison, generate a classification report for the *training* dataset
y_predicted = clf.predict(X_stand_train)
mlp_clf_report = classification_report(y_true = y_train, y_pred = y_predicted)
print(mlp_clf_report)

              precision    recall  f1-score   support

           0       0.89      0.94      0.91        17
           1       0.93      0.87      0.90        15
           2       0.67      0.67      0.67        15
           3       1.00      0.95      0.97        19
           4       0.82      0.82      0.82        17
           5       0.89      0.94      0.91        17
           6       0.92      0.92      0.92        13
           7       0.94      1.00      0.97        17
           8       0.80      0.29      0.42        14
           9       0.67      1.00      0.80        16

    accuracy                           0.85       160
   macro avg       0.85      0.84      0.83       160
weighted avg       0.86      0.85      0.84       160



In [156]:
#assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.89      1.00      0.94        25\n           1       0.82      0.67      0.74        21\n           2       0.67      0.89      0.76        18\n           3       0.74      0.82      0.78        17\n           4       0.78      0.95      0.86        19\n           5       0.79      0.95      0.86        20\n           6       0.50      0.25      0.33        12\n           7       0.88      0.58      0.70        12\n           8       0.63      0.55      0.59        22\n           9       0.64      0.54      0.58        13\n\n    accuracy                           0.75       179\n   macro avg       0.73      0.72      0.71       179\nweighted avg       0.75      0.75      0.74       179\n'
assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.89      0.94      0.91        17\n           1       0.93      0.87      0.90        15\n           2       0.67      0.67      0.67        15\n           3       1.00      0.95      0.97        19\n           4       0.82      0.82      0.82        17\n           5       0.89      0.94      0.91        17\n           6       0.92      0.92      0.92        13\n           7       0.94      1.00      0.97        17\n           8       0.80      0.29      0.42        14\n           9       0.67      1.00      0.80        16\n\n    accuracy                           0.85       160\n   macro avg       0.85      0.84      0.83       160\nweighted avg       0.86      0.85      0.84       160\n'


How well did the classifier do? What digit did it do best on? Which digits did it confuse the most? Do you think the classifier is likely over-fitting, underfitting or neither?

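The classifier reached 55% accuracy on the test set, well below the 85% it reached on the training set. The best digit was clearly 1, with an f1-score of 1.00. The worst digit was 8, with an f1-score of 0.00: none of the five test 8s was predicted correctly. The most confused digit was 6, where only 1 of 8 test samples was classified correctly and the rest were spread across 0, 2, 5, and 9. I believe the model is overfitting, since the training f1-scores are clearly better than the testing f1-scores, although the very small test set (40 samples) makes these estimates noisy.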

## 3) Hyperparameters

**Hyperparams**:

ANNs have *a lot* of hyperparams. This can include simple things such as the number of layers and nodes, up to tuning the learning rate and the gradient descent algorithm used. 

This process can require a lot of experimentation and intuition built through experience, but it can be automated to some extent using hyperparameter tuning. When we have multiple hyperparameters, we use an approach called grid search, where we try all combinations of the candidate hyperparameter values to find the combination that works best.

For the following, you will practice hyperparameter tuning for the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) with sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) function. You should explore different combinations of the following parameters:

* `activation`: The activation function of the ANN. Defaults to ReLU.
* `max_iter`: The ANN trains until either the loss stops improving by a specified threshold or `max_iter` iterations are reached. Warning: the more you increase this, the longer training will take! Patience is a virtue.
* `hidden_layer_sizes`: A tuple representing the structure of the hidden layers. For example, the tuple `(100,50)` means that there are two hidden layers: the first of size 100 and the second of size 50. The tuple `(100,)` means a single hidden layer of size 100.
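To get a feel for how `hidden_layer_sizes` changes model size, the sketch below counts the weights and biases of a fully connected network with 64 inputs and 10 outputs, which matches the digits data. The helper `count_parameters` is hypothetical and assumes one bias per unit.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

for hidden in [(100, 50), (100,), (50,), (20,)]:
    print(hidden, count_parameters(64, hidden, 10))
```

Smaller hidden layers mean far fewer parameters to fit, which shortens training but also limits what the network can represent.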

Normally we would try many more possible combinations (and larger networks), but we've kept the list short to reduce computation time.

**Try different combinations of these hyperparams and see how they affect the classification scores of your model.**

In [157]:
# import the library
from sklearn.model_selection import GridSearchCV

In [158]:
# The parameter list you will explore
parameters = {'activation':['logistic', 'relu'], 'max_iter':[5, 10], 'hidden_layer_sizes':[(50,),(20,)]}
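The grid above expands to every combination of the listed values. The sketch below enumerates them with `itertools.product`. Sorting the keys and letting the last key vary fastest is, as far as I know, how scikit-learn orders its candidates, but treat the index-to-candidate mapping as an assumption.

```python
from itertools import product

parameters = {
    'activation': ['logistic', 'relu'],
    'max_iter': [5, 10],
    'hidden_layer_sizes': [(50,), (20,)],
}

# Sort the keys, then take the Cartesian product of their value lists.
keys = sorted(parameters)
candidates = [
    dict(zip(keys, values))
    for values in product(*(parameters[key] for key in keys))
]

for index, candidate in enumerate(candidates):
    print(index, candidate)

print(len(candidates), "candidates in total")
```

With 2 x 2 x 2 = 8 candidates and 2 cross-validation folds, the search trains 16 models, plus one final refit of the winner.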

Now it's your turn. First, initialize an MLPClassifier and make sure to **use "random_state" as its random_state**. Then pass the parameter list defined above, along with the training data (**use "X_stand_train"**), to GridSearchCV to create a classifier with the best combination of the parameters. To do so, it uses cross-validation within the training dataset, so you never have to peek at your test dataset. Then fit the final classifier to the whole standardized training dataset.

**Note**: You should use cv=2 in your grid search, to reduce the number of folds tested.
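To see what `cv=2` implies for data volume, here is a plain 2-fold split over 160 training rows. The helper `k_fold_indices` is hypothetical and ignores class labels; for classifiers, scikit-learn's grid search uses stratified folds instead, so the actual fold membership will differ.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs, using each fold once for validation."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=160, n_folds=2))
for train_idx, val_idx in splits:
    print(len(train_idx), "training rows,", len(val_idx), "validation rows")
```

Each candidate is therefore trained on only 80 rows and scored on the other 80, which is one reason the cross-validated scores in this toy setup are low.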

In [171]:
# Assign clf to the optimized (with grid search) MLP model
# TIP: Again, if you want to track the training progress, try passing "verbose = True" to the MLP
mlp = MLPClassifier(max_iter = 20, random_state = random_state)
clf = GridSearchCV(estimator = mlp, param_grid = parameters, cv = 2)
clf.fit(X_stand_train, y_train)

In [172]:
# Now let's see the parameters of the winning model of our grid search
# This model is the one clf actually uses when you call clf.predict
clf.best_estimator_

In [173]:
assert list(clf.cv_results_['rank_test_score']) == [3, 2, 8, 6, 4, 1, 6, 5]
np.testing.assert_almost_equal(round(clf.best_score_,4), 0.2312)
assert clf.best_params_['hidden_layer_sizes'] == (50,)
assert clf.best_index_ == 5

In [174]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Now you will use the estimator with the best parameters found to generate predictions (stored as "y_pred") on the testing dataset. **Remember to use "X_stand_test"**.

In [175]:
y_pred = clf.predict(X_stand_test)

In [176]:
assert list(confusion_matrix(y_test,y_pred)[0]) == [0, 0, 0, 0, 1, 0, 1, 0, 0, 2]

In [177]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       0.67      1.00      0.80         4
           2       0.20      0.20      0.20         5
           3       0.00      0.00      0.00         2
           4       0.00      0.00      0.00         2
           5       1.00      0.67      0.80         3
           6       0.33      0.12      0.18         8
           7       0.00      0.00      0.00         3
           8       0.20      0.20      0.20         5
           9       0.08      0.25      0.12         4

    accuracy                           0.25        40
   macro avg       0.25      0.24      0.23        40
weighted avg       0.27      0.25      0.24        40



Note that in this toy example, we used a very limited set of hyperparameters to reduce training time, and so our tuned model actually does worse than our original (25% vs. 55% test accuracy). However, in practice, the tuned model will generally generalize better to the test dataset.