# Finding a good model - hyperparameters and generalization
Neural networks are only a tool, not a ready-to-go solution for every given problem. To get a model that can reliably predict good values, we need to investigate which hyperparameters work best - and what our results actually tell us!

## Shit in, shit out - data scaling
We were not very nice during the preparation of the data. Neural networks work a lot better if they are presented with *normalized data*. In many cases, the best performance is achieved with data that has a mean value of $\mu = 0$ and a standard deviation of $\sigma=1$. What about our data?

In [None]:
# if you run this in Colab, you need to download the examples: uncomment the following line
# ! git clone https://github.com/flome/e4_bsc_python
# % cd e4_bsc_python
# ! git checkout machine_learning
# % cd 4.\ Machine\ Learning

In [None]:
import pandas as pd
# import the data into a pandas DataFrame
data = pd.read_csv('circle_data.csv', index_col=0)
data.head()

In [None]:
print(data.describe())

The mean values are approximately zero, but the data is spread out further than recommended. Scaling data to the mentioned requirements can be done with the equation
<center>
    $x^\prime = \frac{x-\hat{x}}{\sigma}$
</center>
where $\hat{x}$ is the mean value of the feature and $\sigma$ its standard deviation. We don't need to do this by hand: scalers are available, e.g. from the scikit-learn Python package:

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

# get the design matrix and the target vector
X = data[['x','y']].values
Y = data['class'].values

# fit computes the mean value and the standard deviation per feature
# transform applies them to the design matrix
# fit_transform does both in one step
X_scaled = scaler.fit_transform(X)
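To convince yourself that the scaler really applies the equation above, you can compare it with a by-hand version. This is only a minimal sketch on made-up toy data (`X_toy` is a hypothetical array, not the circle data set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[5.0, -20.0], scale=[2.0, 10.0], size=(200, 2))

# scaling by hand: subtract the mean and divide by the standard deviation, per feature
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# scaling with scikit-learn
X_sklearn = StandardScaler().fit_transform(X_toy)

# both versions should agree up to floating point precision
print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(3), X_sklearn.std(axis=0).round(3))
```

Both `np.std` and the `StandardScaler` use the population standard deviation (no $n-1$ correction), which is why the two results match.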

An alternative scaling method that is used a lot is the `MinMaxScaler`, which scales the data to a given range like (-1, 1).
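A short sketch of how this could look, using scikit-learn's `MinMaxScaler` with its `feature_range` argument. The array `X_toy` is a hypothetical example, chosen only so that the result is easy to check by eye:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy data: three samples, two features
X_toy = np.array([[0.0, 10.0],
                  [5.0, 20.0],
                  [10.0, 40.0]])

# map the smallest value of each feature to -1 and the largest to +1
minmax = MinMaxScaler(feature_range=(-1, 1))
X_minmax = minmax.fit_transform(X_toy)
print(X_minmax)
```

Unlike the standard scaling, the result depends strongly on the single smallest and largest value per feature, so outliers can squeeze the rest of the data into a narrow band.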

## Hyperparameter optimization
You tried your best (hopefully) to get some improvements out of your model by varying the *hyperparameters* of your model. To avoid doing monkey-work and to get results you can use for your thesis, a systematic approach to test parameters is important. If you need to test a lot of parameter combinations, it can be extremely useful to collect the results and save them, so that you can analyse which parameters had which impact on your model.

We start by defining a function that returns a Keras model using parameters we pass to it. We will see that dictionaries are very useful for this:

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

In [None]:
def test_parameter_combination(parameters, X, Y):
    # create a Sequential model
    model = Sequential()
    model.add(Input(shape=(2,)))
    
    # let's keep the number of nodes in the hidden layer and the activation variable
    model.add(Dense(parameters['hidden_nodes'], activation=parameters['hidden_activation']))
    model.add(Dense(1, activation='sigmoid'))
   
    # we can also vary the optimizer
    model.compile(loss='binary_crossentropy', optimizer=parameters['optimizer'], metrics=['accuracy'])
    
    # we want to fit the model and return the trained model and the loss history for later inspection
    # verbose=0 makes the progress outputs go away
    loss_history = model.fit(X, Y, epochs=500, verbose=0)
    return model, loss_history.history

Let's try this out: We want to train three networks to start with. One with 10 hidden nodes, one with 40 and one with 80. For the hidden layer, we will use a *tanh* function, which is very similar to the sigmoid function but has an output in the range (-1,1). It tends to converge faster if used in hidden layers than the sigmoid function.
Apart from that, we will keep the activation and the optimizer constant for now:

In [None]:
# set up parameter combinations:
parameters=[
    {'hidden_nodes': 10, 'hidden_activation': 'tanh', 'optimizer': 'adam'},
    {'hidden_nodes': 40, 'hidden_activation': 'tanh', 'optimizer': 'adam'},
    {'hidden_nodes': 80, 'hidden_activation': 'tanh', 'optimizer': 'adam'}
]
# create a list that will be filled with the results:
losses = []
models = []
# loop over the parameters
# this will take a while!
for parameter in parameters:
    print("Testing parameter configuration: {}".format(parameter))
    model, loss = test_parameter_combination(parameter, X_scaled, Y)
    models.append(model)
    
    # we will be happy about this weird looking bit just in a second
    losses.append({**parameter, 'loss_history': loss})
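As a small aside on the *tanh* activation used for the hidden layer: it is just a shifted and rescaled sigmoid, $\tanh(x) = 2\,\sigma(2x) - 1$, which is why the two look so similar. A quick numerical check (the helper `sigmoid` is defined here only for illustration):

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid with output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# rescale the sigmoid so that its output covers (-1, 1)
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0

print(np.allclose(np.tanh(x), tanh_from_sigmoid))
```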

We now have a list of the results. Let's see what is going on there:

In [None]:
print("Length of list: ", len(losses))
print("Content of each loss history:", losses[0].keys())

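Why did we put this weird `{**parameter, 'loss_history': loss}` bit in there? It copies all entries of the parameter dictionary into a new dictionary and adds the loss history to it, which makes it really easy to store results in a data frame: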

In [None]:
res = pd.DataFrame(losses)
res.head()

We can introduce new columns to show the loss and the accuracy after the last training epoch:

In [None]:
res['final_loss'] = res['loss_history'].apply(lambda x: x['loss'][-1])
res['final_accuracy'] = res['loss_history'].apply(lambda x: x['accuracy'][-1])
res.head()
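Once the results live in a data frame, ranking and saving them takes only a few lines. The following is a sketch with made-up numbers: `dummy_results` mimics the layout of the `losses` list, and the file name is an arbitrary choice.

```python
import os
import tempfile
import pandas as pd

# hypothetical results in the same layout as the `losses` list
dummy_results = [
    {'hidden_nodes': 10, 'hidden_activation': 'tanh', 'optimizer': 'adam',
     'loss_history': {'loss': [0.7, 0.5, 0.4], 'accuracy': [0.5, 0.7, 0.8]}},
    {'hidden_nodes': 40, 'hidden_activation': 'tanh', 'optimizer': 'adam',
     'loss_history': {'loss': [0.7, 0.4, 0.2], 'accuracy': [0.5, 0.8, 0.95]}},
]

summary = pd.DataFrame(dummy_results)
summary['final_loss'] = summary['loss_history'].apply(lambda h: h['loss'][-1])
summary['final_accuracy'] = summary['loss_history'].apply(lambda h: h['accuracy'][-1])

# rank the runs, lowest final loss first
summary = summary.sort_values('final_loss').reset_index(drop=True)

# write the scalar columns to a CSV file (the full curves are left out here)
out_path = os.path.join(tempfile.gettempdir(), 'hyperparameter_scan.csv')
summary.drop(columns='loss_history').to_csv(out_path, index=False)

# read the file back to check that nothing got lost
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Only the summary columns go into the CSV file in this sketch. If you also want to keep the full loss curves, a format that can store nested objects would be needed instead.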

We can now compare the loss curves to see which training was more successful:

In [None]:
import matplotlib.pyplot as plt

for run, loss in enumerate(losses):
    plt.plot(loss['loss_history']['loss'], '.', label='run {}'.format(run))
plt.xlabel('epoch')
plt.ylabel('b. c. e loss')
plt.legend(loc=2, bbox_to_anchor=(1,1))

Which model performs best? What about the accuracy of the models? Try to investigate them by looking at the `loss['loss_history']['accuracy']` curves.
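Besides plotting the accuracy curves, it can help to boil each curve down to a few numbers. Here is one possible sketch: the helper `first_epoch_reaching` and the `dummy_losses` values are made up for illustration and only mimic the layout of the `losses` list.

```python
def first_epoch_reaching(curve, threshold):
    """Return the first epoch index at which curve >= threshold, or None if it never does."""
    for epoch, value in enumerate(curve):
        if value >= threshold:
            return epoch
    return None

# hypothetical accuracy curves in the layout of the `losses` list
dummy_losses = [
    {'hidden_nodes': 10, 'loss_history': {'accuracy': [0.5, 0.6, 0.7, 0.8]}},
    {'hidden_nodes': 40, 'loss_history': {'accuracy': [0.5, 0.8, 0.92, 0.97]}},
]

for run, entry in enumerate(dummy_losses):
    acc = entry['loss_history']['accuracy']
    print("run", run,
          "| hidden nodes:", entry['hidden_nodes'],
          "| best accuracy:", max(acc),
          "| first epoch with accuracy >= 0.9:", first_epoch_reaching(acc, 0.9))
```

A run that reaches the same accuracy in fewer epochs is cheaper to train, which matters once the parameter grid gets bigger.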

### excursus: smarter parameter combinations with itertools
The way we defined the parameter combination was a possible one but certainly not the most efficient one. Let's define a *parameter grid* as follows:

In [None]:
parameter_grid={
    'hidden_nodes': [10, 40, 80],
    'hidden_activation': ['tanh'],
    'optimizer': ['adam']
}

How do we get the needed parameter combinations that we want to pass for our training? The *itertools* package has got us covered! This may look painful for a moment, but then it is really enjoyable:

In [None]:
from itertools import product

def get_param_combos(p_grid):
    # build the Cartesian product of all value lists ...
    combis = product(*p_grid.values())
    # ... and turn each combination back into a dictionary
    return [dict(zip(p_grid.keys(), combo)) for combo in combis]

combos = get_param_combos(parameter_grid)
print(combos)

q. e. d
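The grid approach pays off as soon as more than one parameter is varied. In the following sketch the extra values `'relu'` and `'sgd'` are only illustrative choices that were not tested above:

```python
from itertools import product

def get_param_combos(p_grid):
    # one dictionary per element of the Cartesian product of all value lists
    keys = list(p_grid.keys())
    return [dict(zip(keys, combo)) for combo in product(*p_grid.values())]

# hypothetical, larger grid
bigger_grid = {
    'hidden_nodes': [10, 40, 80],
    'hidden_activation': ['tanh', 'relu'],
    'optimizer': ['adam', 'sgd'],
}

combos = get_param_combos(bigger_grid)
print(len(combos))  # 3 * 2 * 2 = 12 combinations
print(combos[0])
```

Every additional parameter multiplies the number of trainings, so it is worth thinking about which values are really interesting before starting a long scan. If you prefer a ready-made solution, scikit-learn's `ParameterGrid` should give you the same kind of list of dictionaries.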

## Generalization and overtraining
You suddenly stumble upon a new chunk of data that also belongs to your data set. That's interesting. Let's have a look at our model's performance on this *unseen* test data:

In [None]:
new_data = pd.read_csv('test_data.csv', index_col=0)
new_data.head()

First of all, we need to do the preprocessing bit:

In [None]:
X_test = new_data[['x', 'y']].values
X_test = scaler.transform(X_test)
Y_test = new_data['class'].values

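We can check our model performances using the `evaluate` method. Because we compiled the models with `metrics=['accuracy']`, it returns the loss followed by the accuracy: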

In [None]:
for model in models:
    print("Performance on new data: ", model.evaluate(X_test, Y_test))

In [None]:
from mlxtend.plotting import plot_decision_regions

In [None]:
Y_for_mlxtend = Y_test.flatten().astype(int) # plot_decision_regions needs a 1D int array
for i, model in enumerate(models):
    plt.figure()
    plot_decision_regions(X_test, Y_for_mlxtend, clf=model)
    plt.title('run {}'.format(i))
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend(loc=2, bbox_to_anchor=(1,1))

We are lucky: our model *generalized* very well to the underlying distribution! This time everything worked out nicely, but this is not always the case. If the model draws the decision boundary too tightly around the data points, it does not approximate the underlying function but only *memorizes* the data points. This is called *over-fitting*, and it is a massive problem for machine learning in both classification and regression if it is not addressed.

<p>
    <center>
        <img src="https://miro.medium.com/max/1125/1*_7OPgojau8hkiPUiHoGK_w.png" width=600px style='width: 600px'>
        <img src="https://miro.medium.com/max/1400/1*JZbxrdzabrT33Yl-LrmShw.png" width=600px style='width: 600px'>
        (images: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76)
    </center>
</p>
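Over-fitting is not specific to neural networks. As a quick illustration, the following sketch swaps the network for a plain polynomial fit with NumPy (so it is not the model used in this notebook) and compares a modest and a very flexible polynomial on made-up noisy data:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_samples(n):
    # hypothetical underlying function: a sine wave plus Gaussian noise
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = noisy_samples(15)   # few points to fit on
x_new, y_new = noisy_samples(200)      # unseen points from the same distribution

def mse(coeffs, x, y):
    # mean squared error of the polynomial on the given points
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

errors = {}
for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))
    print("degree", degree, "| training error:", errors[degree][0],
          "| error on unseen data:", errors[degree][1])
```

The more flexible polynomial can always follow the training points at least as closely as the modest one. Whether that also helps on the unseen points is exactly the question of generalization, so compare the two columns of the output. NumPy may warn that the high-degree fit is poorly conditioned, which is a hint in the same direction.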

### validation data
To reduce the probability of over-fitting the data, it is very important to *anticipate* the arrival of new, unseen data. We do this by keeping some data out of the weight-update process and only using it for *validation* of the training progress. Decisions on the model architecture and other hyperparameters can then be made by training each candidate on the training part of the data set and estimating its *generalization performance* on the validation set. Keras can do this on the fly during the training process if you specify a `validation_split`, the fraction of the training data that should be left out of the weight updates. You should leave out yet another bit of your data that is not considered at all during the optimization phase of your model training. This bit of your data will be called *test data*. This is important to guard against *over-training* with respect to the validation data, which improves the overall *generalization*.

<p>
    <center>
        <img src="https://cdn-media-1.freecodecamp.org/images/augTyKVuV5uvIJKNnqUf3oR1K5n7E8DaqirO" height=400px style='height:400px'>
        (image: https://www.freecodecamp.org/news/how-to-get-a-grip-on-cross-validations-bb0ba779e21c/)
    </center>
</p>

In [None]:
model = Sequential()
model.add(Input(shape=(2,)))
model.add(Dense(50, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# hold out 15% of the data for validation during training
loss_history = model.fit(X_scaled, Y, epochs=500, validation_split=.15)
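One detail is worth knowing here: according to the Keras documentation, `validation_split` takes the validation samples from the *end* of the arrays, before any shuffling. If a data set happens to be sorted (we have not checked this for the circle data), the validation part may not be representative. The sketch below imitates such a split with NumPy on hypothetical toy data sorted by class, and shows how shuffling beforehand helps:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical toy data, sorted by class: first all 0s, then all 1s
X_toy = rng.normal(size=(100, 2))
Y_toy = np.array([0] * 50 + [1] * 50)

# shuffle features and labels with the same permutation
perm = rng.permutation(len(X_toy))
X_shuffled, Y_shuffled = X_toy[perm], Y_toy[perm]

# imitate a split that takes the last 15% of the samples for validation
validation_split = 0.15
split_at = int(len(X_toy) * (1 - validation_split))

print("classes in validation part, unshuffled:", np.unique(Y_toy[split_at:]))
print("classes in validation part, shuffled:  ", np.unique(Y_shuffled[split_at:]))
```

The shuffled arrays could then be passed to `fit` together with `validation_split` in the same way as in the cell above.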

When inspecting the training of a machine learning model, it is always very important to watch both the *training* and *validation* loss. In our very simple model, both converge with more epochs. 

In [None]:
plt.plot(loss_history.history['loss'], '.', label='training data')
plt.plot(loss_history.history['val_loss'], '.', label='validation data')

plt.xlabel('epoch')
plt.ylabel('b. c. e loss')
plt.legend(loc=2, bbox_to_anchor=(1,1))

The goal for a generalizing model is always to have a very similar, if not identical, score for training, validation and test data. On real data sets this is normally not achieved, but by carefully monitoring the loss curves we can test and improve the generalization performance.

<p>
    <center>
        <img src="https://i.stack.imgur.com/rpqa6.jpg" height=400px style='height:400px'>
        (image: https://stats.stackexchange.com/questions/292283/general-question-regarding-over-fitting-vs-complexity-of-models)
    </center>
</p>
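Monitoring the loss curves can also be done in numbers. The following sketch uses made-up curves in the layout of `loss_history.history` (they are not results of the training above) and looks for the epoch with the lowest validation loss and for the gap between the two curves at the end of the training:

```python
import numpy as np

# hypothetical loss curves: the training loss keeps falling,
# the validation loss turns around and rises again
epochs = np.arange(200)
history = {
    'loss': np.exp(-epochs / 40.0) + 0.05,
    'val_loss': np.exp(-epochs / 40.0) + 0.10 + 1e-5 * (epochs - 80.0) ** 2,
}

# epoch with the best validation loss
best_epoch = int(np.argmin(history['val_loss']))

# gap between validation and training loss after the last epoch
final_gap = history['val_loss'][-1] - history['loss'][-1]

print("best validation loss in epoch", best_epoch)
print("gap between validation and training loss at the end:", final_gap)
```

If the validation loss starts rising while the training loss still falls, training beyond that epoch only improves the memorization of the training data. Keras offers an `EarlyStopping` callback that can stop the training automatically based on such a criterion.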

### getting a test set
You have seen how to automatically assign a part of the data as a validation set. How about test data? Of course you could split the data set into two parts by hand, but (like always) we prefer to use ready-made solutions:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# note: this overwrites X_test and Y_test from the cells above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2)
print(X_train.shape)
print(X_test.shape)

It is important to *first* split the data and *then* fit the scaler on the training part only. Otherwise the mean and standard deviation already contain information about the test data, and the effect of scaling is not properly included in the testing process!

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Using this, you can of course also create a dedicated validation set.
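One way to do this is to call `train_test_split` twice. The sketch below uses hypothetical toy data (`X_toy`, `Y_toy`) and fixed `random_state` values so that the split is reproducible. The fractions are just one possible choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: points inside the unit circle belong to class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 2))
Y_toy = (np.hypot(X_toy[:, 0], X_toy[:, 1]) < 1.0).astype(int)

# first split off 20% of the data as test set ...
X_rest, X_test_toy, Y_rest, Y_test_toy = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=0)

# ... then split the rest into training and validation data
# (0.25 of the remaining 80% is 20% of the full data set)
X_train_toy, X_val_toy, Y_train_toy, Y_val_toy = train_test_split(
    X_rest, Y_rest, test_size=0.25, random_state=0)

# fit the scaler on the training data only and apply it to all three parts
scaler_toy = StandardScaler()
X_train_s = scaler_toy.fit_transform(X_train_toy)
X_val_s = scaler_toy.transform(X_val_toy)
X_test_s = scaler_toy.transform(X_test_toy)

print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
```

A Keras model could then be trained with `model.fit(X_train_s, Y_train_toy, validation_data=(X_val_s, Y_val_toy), ...)` instead of `validation_split`, and the test part is only touched once at the very end.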