# A first end-to-end project - the iris data set
In the previous tutorials you learned many of the basics needed to train neural network models that approximate difficult regression or classification functions. In this tutorial we stitch everything together by developing a 'real' model for a more complex task. We will stick with classification for now, but much of this transfers to regression by adjusting, e.g., the loss function.

<p>
    <center>
        <img src="https://cdn.educba.com/academy/wp-content/uploads/2019/12/Regression-vs-Classification.jpg" height=200, style="height:200px"> 
        (image: https://www.educba.com/regression-vs-classification/)
    </center>
</p>
        

## iris - one of the most famous data sets
The *iris* data set is one of the oldest data sets used for testing classification methods. It was introduced in 1936 by the British statistician and biologist Ronald Fisher.
<p>
    <center>
        <img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/06/Multi-Class-Classification-Tutorial-with-the-Keras-Deep-Learning-Library.jpg" height=200, style="height:200px"> 
        (image: https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/)
    </center>
</p>


In [137]:
# if you run this in Colab, you need to download the examples: uncomment the following lines
# ! git clone https://github.com/flome/e4_bsc_python
# % cd e4_bsc_python
# ! git checkout machine_learning
# % cd 4.\ Machine\ Learning

In a first step, we will import and inspect the data:


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('iris_dataset.csv', index_col=0)
data.head()

<p>
    <center>
        <img src="https://ars.els-cdn.com/content/image/3-s2.0-B9780128147610000034-f03-01-9780128147610.jpg" height=400, style="height:400px"> 
        (image: https://www.sciencedirect.com/topics/computer-science/iris-virginica)
    </center>
</p>

We will use *seaborn* to quickly visualize the data set:

In [None]:
import seaborn as sns

In [None]:
sns.pairplot(data, hue='species')

## pre-processing - splitting, data scaling and label encoding
As always, we start by preparing the data. As we learned before, we first build the design matrix and the target vector and then set aside a separate test set for later validation.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = data[['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm']].values
X.shape

In [None]:
# note the extra brackets! Without these, we don't get a (150, 1) vector but a (150,) 1D vector 
# that would not comply with machine learning conventions and produces errors along the way
Y = data[['species']].values
Y.shape

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
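
To make explicit what `fit_transform` and `transform` do in the cell above, here is a minimal sketch in plain Python. The helper names and the numbers are made up for illustration; the point is only that the mean and standard deviation are computed on the training data and then reused unchanged for the test data.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return mean and population standard deviation of one feature column."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Shift and rescale values with statistics that were computed elsewhere."""
    return [(value - mean) / std for value in column]

# made-up sepal lengths in cm, for illustration only
train_column = [5.0, 6.0, 7.0]
test_column = [6.5]

# the statistics come from the training data only ...
mean, std = fit_standardizer(train_column)
train_scaled = standardize(train_column, mean, std)
# ... and the very same statistics are reused for the test data
test_scaled = standardize(test_column, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, and that is intended: the test set must not influence the pre-processing.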

What about the target vector? For regression problems with continuous output, we could simply use a `StandardScaler` as well. Strings, however, are not suitable target values for the output of e.g. a sigmoid function, which only produces numbers between 0 and 1.

<p>
    <center>
        <img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" height=200 style='height: 200px'>
        (image: https://en.wikipedia.org/wiki/Sigmoid_function)
    </center>
</p>
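
As a quick reminder of how the sigmoid behaves, the following sketch evaluates it for a few inputs. It is a plain Python illustration, not the Keras implementation.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# large negative inputs approach 0, large positive inputs approach 1
for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

Because every output lies between 0 and 1, it can be read as a probability, which is why the targets have to be numbers in that range as well.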

Instead of *one* target variable we will create *three* variables, each representing the probability of belonging to one of the three classes. This is called *one-hot encoding*.

In [None]:
from sklearn.preprocessing import OneHotEncoder
# the sparse_output parameter determines whether the full matrix is stored or only the non-zero elements
# (it was called `sparse` before scikit-learn 1.2); we stay with the basic matrix version for now
target_scaler = OneHotEncoder(sparse_output=False)
Y_train_scaled = target_scaler.fit_transform(Y_train)
Y_test_scaled = target_scaler.transform(Y_test)
print('before encoding: ')
print(Y_train[:3])
print('after encoding: ')
print(Y_train_scaled[:3])
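
What the encoder does can be sketched in a few lines of plain Python. The helpers below are hypothetical and the label strings are assumptions (the real spelling is whatever appears in the `species` column); like the scikit-learn encoder, the sketch orders the columns by the sorted class names.

```python
def one_hot_encode(labels):
    """Encode string labels as one-hot rows; columns follow the sorted class names."""
    classes = sorted(set(labels))
    index = {name: position for position, name in enumerate(classes)}
    rows = [[1.0 if index[label] == column else 0.0 for column in range(len(classes))]
            for label in labels]
    return classes, rows

def one_hot_decode(classes, row):
    """Return the class whose column holds the largest value."""
    return classes[max(range(len(row)), key=row.__getitem__)]

# label strings are assumptions, for illustration only
labels = ['setosa', 'virginica', 'versicolor', 'setosa']
classes, encoded = one_hot_encode(labels)
print(classes)
print(encoded)

# a model output can be turned back into a label by picking the largest entry
print(one_hot_decode(classes, [0.1, 0.2, 0.7]))
```

With the fitted encoder from above, `target_scaler.categories_` lists the column order and `target_scaler.inverse_transform` maps encoded rows back to the original labels.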

## Building a classification model
Next, we will create a Keras model for the classification. We will keep it simple to start with. We need 4 input nodes (one per feature) and 3 output nodes (one per class).

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

In [None]:
iris_classifier = Sequential()
iris_classifier.add( Input((4,)) )
iris_classifier.add( Dense(32, activation='tanh', name='hidden_layer') )
iris_classifier.add( Dense(3, activation='sigmoid', name='output_layer') )
iris_classifier.summary()
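
The parameter counts reported by `summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per node. A small sketch with a hypothetical helper:

```python
def dense_parameter_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_parameter_count(4, 32)   # 4 features into 32 hidden nodes
output = dense_parameter_count(32, 3)   # 32 hidden nodes into 3 class outputs
total = hidden + output
print(hidden, output, total)
```

For only 150 flowers this is already a fairly flexible model, which is worth keeping in mind when we look at over-training.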

### Compile and train the model!

In [None]:
iris_classifier.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
callbacks = iris_classifier.fit(X_train_scaled, Y_train_scaled, validation_split=.15, epochs=300)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(callbacks.history['loss'], label='training loss')
plt.plot(callbacks.history['val_loss'], label='validation loss')
plt.xlabel('epochs')
plt.ylabel('b. c. e. loss')
plt.legend(loc='best')

We see that both training and validation loss keep decreasing, so the model is not fully trained yet. The validation loss is higher than the training loss, so we have slight over-training, but nothing to worry too much about yet. Let us experiment with a higher learning rate. For this, we need to instantiate an optimizer object:

In [None]:
from tensorflow.keras.optimizers import Adam

In [None]:
opt = Adam(learning_rate=0.01)

If we don't create a new model, the weights will continue to be improved from where we left them after the first optimization round above.

In [None]:
iris_classifier = Sequential()
iris_classifier.add( Input((4,)) )
iris_classifier.add( Dense(32, activation='tanh', name='hidden_layer') )
iris_classifier.add( Dense(3, activation='sigmoid', name='output_layer') )
iris_classifier.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
callbacks = iris_classifier.fit(X_train_scaled, Y_train_scaled, validation_split=.15, epochs=300)

In [None]:
plt.figure()
plt.plot(callbacks.history['loss'], label='training loss')
plt.plot(callbacks.history['val_loss'], label='validation loss')
plt.xlabel('epochs')
plt.ylabel('b. c. e. loss')
plt.legend(loc='best')

plt.figure()
plt.plot(callbacks.history['accuracy'], label='training accuracy')
plt.plot(callbacks.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend(loc='best')

The good news: the training has converged! The bad news: now we do see worrying over-training: the validation loss starts to increase again and the validation accuracy is dropping!
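
To put a number on where the over-training sets in, we can look for the epoch with the lowest validation loss. The sketch below uses made-up loss values and a hypothetical helper; it only illustrates the idea.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# made-up validation losses, shaped like an over-trained run
example_val_loss = [0.60, 0.41, 0.33, 0.30, 0.31, 0.35, 0.42]
epoch = best_epoch(example_val_loss)
print(epoch, example_val_loss[epoch])
```

Applied to the run above, this would be `best_epoch(callbacks.history['val_loss'])`. Everything the model learns after that epoch improves the training loss only.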

What now? The simplest approach is to design a simpler model. If we want to keep a more complex model, there are still several ways to reduce over-training by artificially reducing the model capacity during the training.

## regularization
Regularization is the category of techniques which constrain the model capacity during the training process. In machine learning, two types of regularization are mainly used:

- regularization by penalizing large weight values
- regularization by dropout

### regularization by penalizing large weight values
Over-fitting is very often a result of weights in the neural network becoming too extreme. This leads to sharp decision boundaries instead of the smooth and regular shapes which better approximate the underlying function.

Do you remember the loss function of the binary crossentropy? I am sure you do, but I will put it here again anyway:

<p>
<center>
    $ b. c. e = - \frac{1}{N} \sum_{i = 1}^{N} \left( y_i\cdot \log (p_i) + (1-y_i)\cdot \log(1-p_i) \right)$
</center>
</p>
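
To get a feeling for the formula, here is a small worked example in plain Python. The clipping constant `eps` is an assumption added for numerical safety, and the predictions are made up.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary crossentropy over all entries, following the formula above."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# one-hot target of a single flower and three made-up predictions
y_true = [1, 0, 0]
confident = binary_crossentropy(y_true, [0.9, 0.1, 0.1])
unsure = binary_crossentropy(y_true, [0.5, 0.5, 0.5])
wrong = binary_crossentropy(y_true, [0.1, 0.9, 0.1])
print(confident, unsure, wrong)
```

The loss is small for a confident, correct prediction, equals $\log 2$ for a prediction of 0.5 everywhere, and grows quickly when the model is confidently wrong.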

We can make the optimizer take the size of the weights into account by adding an extra term to this loss. We could, for example, add a *cost* that is proportional to the norm of the weights:

<p>
<center>
    $ {b. c. e}_\mathrm{regularized} = b. c. e + \lambda \cdot \sqrt{\sum_{j = 1}^{M} {w_j}^2}$
</center>
</p>

Here the sum runs over the $M$ weights $w_j$ of the regularized layer, not over the $N$ training samples. $\lambda$ is called the *regularization strength* or *regularization parameter*. With this term, the optimizer can no longer minimize the loss by adjusting the weights however it likes, because making the weights too extreme no longer decreases the loss! Because this way of regularizing uses the $l_2$-norm of the weights, it is called $l_2$-regularization. Among others, it is available in Keras, where the penalty is implemented with the squared norm, $\lambda \cdot \sum_j {w_j}^2$:

In [None]:
from tensorflow.keras import regularizers

In [None]:
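# the regularization strength is passed positionally, because the keyword name differs between Keras versions
l2_regularizer = regularizers.l2(0.01)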
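
To make the size of the penalty concrete, the sketch below computes it for two made-up weight vectors, using the squared form that the Keras `l2` regularizer applies (strength times the sum of the squared weights). The helper name is hypothetical.

```python
def l2_penalty(weights, strength):
    """Penalty in the squared form: strength * sum(w**2)."""
    return strength * sum(w * w for w in weights)

# made-up weights: the second set is the first one scaled by a factor of 10
small_weights = [0.1, -0.2, 0.3]
large_weights = [1.0, -2.0, 3.0]
strength = 0.01

small_penalty = l2_penalty(small_weights, strength)
large_penalty = l2_penalty(large_weights, strength)
print(small_penalty, large_penalty)
```

Weights that are ten times larger cost a hundred times more, so the optimizer only lets a weight grow if the crossentropy gains enough from it.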

We can add the regularizer to some layers. Their weights will then be penalized. We don't want to constrain the output, so we add it only to the hidden layer.

In [None]:
iris_classifier = Sequential()
iris_classifier.add( Input((4,)) )
iris_classifier.add( Dense(32, activation='tanh', name='hidden_layer', kernel_regularizer=l2_regularizer) )
iris_classifier.add( Dense(3, activation='sigmoid', name='output_layer') )
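# a fresh optimizer for the new model: recent Keras versions refuse to reuse one optimizer instance across models
iris_classifier.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['accuracy'])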

callbacks = iris_classifier.fit(X_train_scaled, Y_train_scaled, validation_split=.15, epochs=300)

In [None]:
plt.figure()
plt.plot(callbacks.history['loss'], label='training loss')
plt.plot(callbacks.history['val_loss'], label='validation loss')
plt.xlabel('epochs')
plt.ylabel('b. c. e. loss')
plt.legend(loc='best')

Hurray! Our overall loss is quite a bit higher than before, but the model generalizes a lot better now! We see this by comparing the losses for the training and validation set.

### regularization by dropout
In recent years, regularization by weight penalization has drawn some criticism because it tends to constrain the model capacity quite aggressively. Sometimes the model simply needs large weight values to do its job. Especially in deep learning, another method is often preferred: dropout.

The concept of dropout is very simple. During the training process, randomly chosen nodes of a layer to which *dropout* is applied are simply left out: their output is set to zero.

<p>
    <center>
        <img src="https://miro.medium.com/max/1200/1*iWQzxhVlvadk6VAJjsgXgg.png" height=300, style="height:300px"> 
        (image: https://www.kdnuggets.com/2018/09/dropout-convolutional-networks.html)
    </center>
</p>
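
The mechanism can be sketched in plain Python. The function below is a hypothetical illustration, not the Keras code: during training each value is dropped with probability `rate` and the surviving values are scaled up by `1 / (1 - rate)` so that the expected sum stays the same, which is also what the Keras dropout layer does. Outside of training, nothing is dropped.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training, rescale the rest."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)          # seeded only to make the sketch repeatable
hidden_output = [1.0] * 8       # made-up outputs of a hidden layer

dropped = dropout(hidden_output, rate=0.25, training=True, rng=rng)
unchanged = dropout(hidden_output, rate=0.25, training=False, rng=rng)
print(dropped)
print(unchanged)
```

Because a different random subset of nodes is missing in every training step, the network cannot rely on any single node and has to spread the information over many of them.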

Dropout can be added to your Keras model using a *dropout layer*:

In [None]:
from tensorflow.keras.layers import Dropout

In [None]:
iris_classifier = Sequential()
iris_classifier.add( Input((4,)) )
iris_classifier.add( Dense(32, activation='tanh', name='hidden_layer' ) )
iris_classifier.add( Dropout(rate=0.25) )                    
iris_classifier.add( Dense(3, activation='sigmoid', name='output_layer') )
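# a fresh optimizer for the new model: recent Keras versions refuse to reuse one optimizer instance across models
iris_classifier.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['accuracy'])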

callbacks = iris_classifier.fit(X_train_scaled, Y_train_scaled, validation_split=.15, epochs=300)

In [None]:
plt.figure()
plt.plot(callbacks.history['loss'], label='training loss')
plt.plot(callbacks.history['val_loss'], label='validation loss')
plt.xlabel('epochs')
plt.ylabel('b. c. e. loss')
plt.legend(loc='best')

Whether regularization is necessary for your model and which kind is best to choose is, among all the other nice things, part of the *hyperparameter optimization*.

## hyperparameter optimization - here we go again
You can now try to find the hyperparameters (the model setup) that fit the given problem best. A little bit of boilerplate is given below, enjoy the ride:

In [None]:
opt =  Adam(learning_rate=0.01)

parameter_grid={
    'hidden_nodes': [10, 40, 80],
    'hidden_activation': ['tanh'],
    'optimizer': [opt]
}


In [None]:
from itertools import product
def get_param_combos(p_grid):
    combis = product(*[v for v in p_grid.values()])
    return [{key: value for key, value in zip(p_grid.keys(), combo)} for combo in combis] 

combos = get_param_combos(parameter_grid)
print(combos)
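
The number of combinations is the product of the list lengths, so the grid grows quickly with every setting that is varied. The sketch below uses a hypothetical helper and a made-up grid with two varied settings to show this.

```python
from itertools import product

def expand_grid(grid):
    """Every combination of the listed values, one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

# hypothetical grid: 3 layer sizes times 2 activations
example_grid = {
    'hidden_nodes': [10, 40, 80],
    'hidden_activation': ['tanh', 'relu'],
}
example_combos = expand_grid(example_grid)
print(len(example_combos))
print(example_combos[0])
```

Each combination means one full training run, so it pays to start with a coarse grid and refine it around the promising values.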

In [None]:
def test_parameter_combination(parameters, X, Y):
    # create a Sequential model
    model = Sequential()
    model.add(Input(shape=(4,)))
    
    # let's keep the number of nodes in the hidden layer and the activation variable
    model.add(Dense(parameters['hidden_nodes'], activation=parameters['hidden_activation']))
    model.add(Dense(3, activation='sigmoid'))
   
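    # we can also vary the optimizer; every model gets a fresh copy, because recent
    # Keras versions refuse to reuse one optimizer instance across several models
    optimizer = type(parameters['optimizer']).from_config(parameters['optimizer'].get_config())
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])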
    
    # we want to fit the model and return the trained model and the loss history for later inspection
    loss_history = model.fit(X, Y, validation_split=.15, epochs=300)
    return model, loss_history.history

In [None]:
# create a list that will be filled with the results:
losses = []
models = []
# loop over the parameters
# this will take a while!
for parameter in combos:
    print("Testing parameter configuration: {}".format(parameter))
    model, loss = test_parameter_combination(parameter, X_train_scaled, Y_train_scaled)
    models.append(model)
    
    # we will be happy about this weird looking bit just in a second
    losses.append({**parameter, 'loss_history': loss})
    
res = pd.DataFrame(losses)
res['final_train_loss'] = res['loss_history'].apply(lambda x: x['loss'][-1])
res['final_train_accuracy'] = res['loss_history'].apply(lambda x: x['accuracy'][-1])
res['final_val_loss'] = res['loss_history'].apply(lambda x: x['val_loss'][-1])
res['final_val_accuracy'] = res['loss_history'].apply(lambda x: x['val_accuracy'][-1])


In [None]:
res.head()
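
One possible way to read such a summary table is to pick the run with the lowest final validation loss and to check the gap between validation and training loss as a sign of over-training. The sketch below uses made-up numbers in plain Python dictionaries that stand in for the rows of `res`; it is an illustration, not a result.

```python
# made-up summary rows, standing in for the rows of `res`
example_results = [
    {'hidden_nodes': 10, 'final_train_loss': 0.12, 'final_val_loss': 0.15},
    {'hidden_nodes': 40, 'final_train_loss': 0.05, 'final_val_loss': 0.11},
    {'hidden_nodes': 80, 'final_train_loss': 0.02, 'final_val_loss': 0.19},
]

def generalization_gap(row):
    """Difference between validation and training loss of one run."""
    return row['final_val_loss'] - row['final_train_loss']

best = min(example_results, key=lambda row: row['final_val_loss'])
widest_gap = max(example_results, key=generalization_gap)
print(best['hidden_nodes'], widest_gap['hidden_nodes'])
```

In this made-up example the largest model reaches the lowest training loss but generalizes worst, which is the typical signature of over-training.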

In [None]:
for run, loss in enumerate(losses):
    plt.plot(loss['loss_history']['loss'], label='run {} train'.format(run))
    plt.plot(loss['loss_history']['val_loss'], label='run {} val'.format(run))
plt.xlabel('epoch')
plt.ylabel('b. c. e. loss')
plt.legend(loc=2, bbox_to_anchor=(1,1))

In [None]:
for run, loss in enumerate(losses):
    plt.plot(loss['loss_history']['accuracy'], label='run {} train'.format(run))
    plt.plot(loss['loss_history']['val_accuracy'], label='run {} val'.format(run))
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend(loc=2, bbox_to_anchor=(1,1))