# Example Code for Machine Learning with scikit-learn

First, import all packages your code needs. scikit-learn provides most of the functions needed to prepare a dataset, train different ML models, and validate them with common performance metrics.

In [6]:
# import packages
import numpy as np
import pandas as pd
import random

from sklearn import datasets
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, PredefinedSplit
from sklearn.metrics import confusion_matrix

As an example, we use the iris dataset, which is part of scikit-learn and a commonly used toy dataset for ML. It contains 3 different types of iris flowers; each sample is described by the length and width of its sepal and petal (2 types of leaves in a flower).

In [2]:
# load data set (iris dataset)

iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
feature_names = iris.feature_names

print(iris.data[1:10])  # print rows 1 to 9 (the slice skips row 0; each row = one sample, each column = one feature)
print(iris.target)  # print labels we want to predict
print(class_names)  # print names of classes (target vector is only numerical)
print(feature_names)  # print feature names

[[4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


To work with scikit-learn, the dataset needs to be converted into 2 arrays, one containing the input features (X) and one containing the labels (y). Both can only contain numerical values! For X, each row corresponds to a sample and each column to a feature. y is 1-dimensional and each row corresponds to the same sample as the corresponding row in X. The iris set is already in the correct format.
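If your own data is not yet in this format, the conversion could look like the following minimal sketch. The table, its column names, and the variable names `X_demo` and `y_demo` are made up for illustration; `LabelEncoder` is one way to turn text labels into integers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw table: two numerical features and a text label
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.3],
    "petal_width": [0.2, 1.4, 2.5, 0.2],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# X: 2-D array, one row per sample, one column per feature
X_demo = df[["petal_length", "petal_width"]].to_numpy()

# y: 1-D array of integer class labels, in the same row order as X
encoder = LabelEncoder()
y_demo = encoder.fit_transform(df["species"])

print(X_demo.shape)      # (4, 2)
print(y_demo)            # [0 1 2 0]
print(encoder.classes_)  # ['setosa' 'versicolor' 'virginica']
```

The encoder keeps the mapping from integers back to the class names, which plays the same role as `iris.target_names` above.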

Before training an ML model, the dataset should be split into 3 sets: training, cross-training (validation), and testing. First, split off the test set. Make sure to always use the same test set, even when you train different ML models. It usually helps to pre-define a train/test split, save it, and re-use it for every ML model you train. The same is true for the cross-training set.

In [23]:
# Split the data into a training set and a test set or use your own pre-defined split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

print("Dataset sizes:\nWhole set: {}\nTraining Set: {}\nTest Set: {}".
      format(len(y), len(y_train), len(y_test)))



Dataset sizes:
Whole set: 150
Training Set: 120
Test Set: 30
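
One possible way to save a split and re-use it is to split the sample indices instead of the data and store those indices in a file. The sketch below assumes a file name (`iris_split.npz`) and a temporary directory chosen only for illustration; adapt both to your project.

```python
import os
import tempfile

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the sample indices rather than the data, so the split itself can be stored
indices = np.arange(len(y))
train_idx, test_idx = train_test_split(indices, random_state=0, test_size=0.2)

# Save the indices once (hypothetical file name and location) ...
split_path = os.path.join(tempfile.gettempdir(), "iris_split.npz")
np.savez(split_path, train_idx=train_idx, test_idx=test_idx)

# ... and reload them whenever another model is trained
split = np.load(split_path)
X_train, X_test = X[split["train_idx"]], X[split["test_idx"]]
y_train, y_test = y[split["train_idx"]], y[split["test_idx"]]

print(len(y_train), len(y_test))  # 120 30
```

Storing indices rather than copies of the data keeps the file small and makes it easy to check that two experiments used the same samples.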


In [24]:
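# Build the test_fold array for PredefinedSplit:
# -1 = sample stays in the training set, 0 = sample belongs to the validation fold
predefined_val = np.full(len(y_train), -1.0)

# Randomly mark 24 of the 120 training samples (20 %) as validation samples
# range(len(y_train)) covers every index, including the last one (119)
random_values = random.sample(range(len(y_train)), 24)
for i in random_values:
    predefined_val[i] = 0
print(predefined_val)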

[ 0. -1. -1. -1. -1. -1. -1. -1.  0. -1.  0.  0. -1. -1. -1.  0. -1. -1.
 -1. -1.  0. -1. -1. -1. -1. -1. -1.  0. -1. -1. -1. -1. -1. -1. -1. -1.
  0. -1. -1. -1.  0. -1. -1. -1. -1. -1. -1. -1. -1.  0. -1. -1. -1. -1.
 -1.  0. -1.  0. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.  0.  0. -1.  0.
  0.  0. -1. -1. -1.  0. -1. -1.  0. -1. -1.  0. -1.  0. -1. -1. -1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.  0. -1. -1. -1.
 -1. -1. -1. -1. -1.  0. -1. -1. -1. -1.  0. -1.]


In the test_fold array, every distinct value other than -1 defines its own validation fold, while samples marked -1 are never used for validation and always stay in the training set. Here's an explanation: https://stackoverflow.com/questions/43952230/predefinedsplit-function-in-sklearn
In short, if you set all your validation samples to 0 and all your training samples to -1, you have manually split the data into one validation set and one training set.
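
The behaviour described above can be checked on a tiny, made-up test_fold array. This is only a sketch; the array values and the names `test_fold_demo` and `ps_demo` are chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1: always used for training; 0 and 1: two separate validation folds
test_fold_demo = np.array([-1, -1, 0, 0, 1, -1, 1])
ps_demo = PredefinedSplit(test_fold=test_fold_demo)

print(ps_demo.get_n_splits())  # 2
for train_index, val_index in ps_demo.split():
    print("train:", train_index, "validation:", val_index)
# train: [0 1 4 5 6] validation: [2 3]
# train: [0 1 2 3 5] validation: [4 6]
```

With only the values -1 and 0, as in this notebook, the same object yields exactly one split.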

In [18]:
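# Define the split from the test_fold array created above and inspect it
ps = PredefinedSplit(test_fold=predefined_val)
print("splits: ", ps.get_n_splits())
for train_index, test_index in ps.split():
    print("TRAIN:", train_index)
    print("TEST:", test_index)  # "TEST" is the validation fold here, not the held-out test set
    # Index into the training data and use new names so the real test set is not overwritten
    X_fit, X_val = X_train[train_index], X_train[test_index]
    y_fit, y_val = y_train[train_index], y_train[test_index]

splits:  1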
TRAIN: [  0   1   2   4   6   8   9  10  12  13  15  16  17  18  19  21  23  24
  25  26  27  28  29  30  31  32  33  34  36  38  39  40  41  42  43  44
  45  47  48  49  50  51  53  54  55  58  59  61  62  63  64  68  69  70
  71  72  73  74  75  77  78  80  81  82  83  84  85  88  89  90  91  92
  93  94  95  96  97  98  99 100 101 102 103 106 108 109 110 111 112 113
 114 115 116 117 118 119]
TEST: [  3   5   7  11  14  20  22  35  37  46  52  56  57  60  65  66  67  76
  79  86  87 104 105 107]


The training set is used for learning, e.g., the weights of an Artificial Neural Network. To optimize hyperparameters such as the number of hidden layers, the number of hidden units, or the learning rate, the cross-training (validation) set is used. GridSearchCV in scikit-learn makes it easy to test a variety of parameter combinations.

You can use different scores to compare the models. By default, GridSearchCV uses the score method of the estimator, which is the accuracy for classifiers.
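
As a rough sketch of how another score could be selected, the scoring argument of GridSearchCV accepts the name of a metric. The macro-averaged F1 score, the small parameter grid, and the names `X_demo`, `y_demo`, and `grid_demo` are picked here only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = datasets.load_iris(return_X_y=True)

# Same kind of grid search as in this notebook, but ranked by macro-averaged F1
grid_demo = GridSearchCV(
    estimator=MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(10,), (25,)]},
    scoring="f1_macro",  # illustrative choice; the default would be accuracy
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 3))
```

On a balanced dataset like iris, accuracy and macro F1 tend to rank models similarly; the choice matters more when classes are imbalanced.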

Usually, to make better use of the training data, you do not validate on one specific subset of it but perform cross-validation: you split the data into *k* splits. With StratifiedKFold and 5 splits, for example, 4 splits are used for training and 1 for validation, and this procedure is repeated 5 times so that each split is used for validation once. This results in 5 different performance values per model, and usually the model with the best average performance is picked as the final model. Note that the code below does not use StratifiedKFold; it passes the single pre-defined split from above as the cv argument, so each model is validated only once.
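
For comparison, a 5-fold stratified cross-validation could be sketched as follows. The classifier settings and the names `X_demo`, `y_demo`, `skf`, and `scores` are assumptions made for this example, not part of the notebook's pipeline.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = datasets.load_iris(return_X_y=True)

# 5 folds; stratification keeps the class proportions equal in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_index, val_index) in enumerate(skf.split(X_demo, y_demo)):
    fold_sizes.append((len(train_index), len(val_index)))
    print(f"fold {fold}: {len(train_index)} training, {len(val_index)} validation samples")

# One accuracy value per fold; their mean is the cross-validation score
scores = cross_val_score(
    MLPClassifier(max_iter=500, random_state=0), X_demo, y_demo, cv=skf
)
print(scores)
print(scores.mean())
```

Passing `cv=skf` to GridSearchCV instead of `cv=ps` would apply the same 5-fold scheme to every parameter combination.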

In [15]:
# Perform cross-validation to optimize hyperparameters

# Define cross-validation object
# This uses pre-defined splits to ensure that you are always using the same splits

#define your predefined split using the test_fold array created earlier
ps = PredefinedSplit(test_fold = predefined_val)

# Define predictor
classifier = MLPClassifier(activation='relu', solver='adam', max_iter=200, early_stopping=False)

# Define parameters we want to optimize and values we want to test
# Here, we test different hidden layer sizes and initial learning rates
params = {'hidden_layer_sizes': [(25,), (50,), (75,)],
          'learning_rate_init': [0.1, 0.01]}

# Perform grid search
# Here, cv=ps uses the predefined split as the "cross-validation"
grid = GridSearchCV(estimator=classifier, cv=ps, param_grid=params,
                    return_train_score=True)
grid.fit(X_train, y_train)

# Analyse results

cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results)




   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.058041           0.0         0.000000             0.0   
1       0.076996           0.0         0.000000             0.0   
2       0.057987           0.0         0.000000             0.0   
3       0.082003           0.0         0.000000             0.0   
4       0.090974           0.0         0.001004             0.0   
5       0.088042           0.0         0.000989             0.0   

  param_hidden_layer_sizes param_learning_rate_init  \
0                    (25,)                      0.1   
1                    (25,)                     0.01   
2                    (50,)                      0.1   
3                    (50,)                     0.01   
4                    (75,)                      0.1   
5                    (75,)                     0.01   

                                              params  split0_test_score  \
0  {'hidden_layer_sizes': (25,), 'learning_rate_i...           0.9583



Once you have chosen a final model, you apply it to the test set to obtain the final performance estimate. This is the performance you would publish. Never make any model decision on the test set!

In [25]:
# Use best estimator and assess performance on the test set

# Calculate predictions
best_classifier = grid.best_estimator_
print(best_classifier)
y_pred = best_classifier.predict(X_test)
pred_score = best_classifier.score(X_test, y_test)

# Calculate confusion matrix (rows = true classes, columns = predicted classes)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Acc: {}'.format(round(pred_score, 3)))

MLPClassifier(hidden_layer_sizes=(25,), learning_rate_init=0.1)
[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
Acc: 1.0
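
For a multi-class problem, per-class metrics can be read off the confusion matrix. The following sketch uses made-up labels and predictions (`y_true_demo`, `y_pred_demo`), not the results above, so that the matrix contains some errors.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for 8 samples and 3 classes
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 1, 2, 2, 1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# [[2 0 0]
#  [0 2 1]
#  [0 1 2]]

# Diagonal entries are the correct predictions of each class
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)     # per true class (row sums)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)  # per predicted class (column sums)
accuracy = np.trace(cm_demo) / cm_demo.sum()

print(recall)
print(precision)
print(accuracy)  # 0.75
```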
