# Exercises for Machine Learning with Python, Lecture 2: *Kernel Ridge Regression and Molecules*
By Dr. Anders Christensen `anders.christensen @ unibas.ch`




---
# ***Important:***

***The answers to the questions must be entered in the online system.***

https://www.chemie1.unibas.ch/~pythonprakt/

*    You should have received the username and password by email.
*    Use your @unibas.ch or @stud.unibas.ch email to log in after you enter the username and password for the course.



## Before you begin the exercises:


Again, we begin by downloading a dataset. This dataset is taken from the "QM7" dataset introduced in the lecture. 

As there are 7101 molecules in the dataset, all the data is stored in binary format to keep the files small enough for Google Colab. The data format is Numpy's ".npy" format.

The data consists of representations (aka. features) for each of the 7101 molecules, and also the atomization energy of each molecule (i.e. how much energy it takes to break the molecule into its free atoms). 

There are two sets of representations, the Coulomb Matrix representation and the Bag-of-Bonds representation, and later you will compare which of the two is best for predicting the atomization energy with Kernel Ridge Regression.

**To-do:**

*    Download the binary datafiles with the "wget" command:

In [0]:
# Download feature maps and labels:
!wget -O cmat.npy https://www.dropbox.com/s/4eqafm0yr2ywxps/cmat.npy
!wget -O bob.npy https://www.dropbox.com/s/vyiwza2uy4jkczg/bob.npy
!wget -O hof.npy https://www.dropbox.com/s/zy717f8mwxaegff/hof.npy

Next, Numpy can directly read .npy files and convert these into Numpy arrays using the `np.load()` function.
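To illustrate what the ".npy" format does, here is a minimal round-trip sketch. It uses a small made-up array and a temporary file (both hypothetical, not part of the course data): `np.save()` writes the array to disk and `np.load()` restores it with the same shape and type.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a feature array: 4 "molecules" with 3 features each
toy_features = np.arange(12, dtype=np.float64).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.npy")
    np.save(path, toy_features)   # write the array in binary .npy format
    loaded = np.load(path)        # read it back as a Numpy array

print(loaded.shape)   # (4, 3)
print(loaded.dtype)   # float64
print(np.array_equal(loaded, toy_features))
```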

**To-do:**

*   Run the code below to read in the features as numpy arrays and print their sizes and types:


In [0]:
import numpy as np

# Save 'Coulomb matrices' features in 'cmat_features'
cmat_features = np.load("cmat.npy")

# Save 'Bag-of-Bonds' features in 'bob_features'
bob_features = np.load("bob.npy")

# Save atomization energies in 'atomization_energy'
atomization_energy = np.load("hof.npy")

# Print the dimensions and type of each numpy array, just to see what you got
print(cmat_features.shape)
print(type(cmat_features))

print(bob_features.shape)
print(type(bob_features))

print(atomization_energy.shape)
print(type(atomization_energy))

As you can see above, each Numpy array contains 7101 rows. Each row in the Coulomb-matrix representation has 276 features and each row in the Bag-of-bonds representation has 465 features.

The 7101 atomization energies are in units of kcal/mol.

Additionally, you can print all the arrays, just to see what they contain (a lot of numbers).


In [0]:
print(cmat_features)
print(bob_features)
print(atomization_energy)

# Exercises 2.1: Exploratory Data Analysis

As we did during the lecture, let's start by having a look at the data.




### Question 2.1.1: (6.1)

Since we already looked at the representations during the lecture, we will skip plotting them for now.

However, to make sure that we trust our data, we still want to plot a histogram of the labels.

**To-do:**
*    Using matplotlib/pyplot, plot a histogram of the `atomization_energy` array
*    In the web-interface, answer the following question:  *What is the value range of the atomization energies?*

**Note:** The atomization energies are in units of kcal/mol.
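As a rough sketch of what a histogram computes, the snippet below bins a set of made-up numbers with `np.histogram()` and reports their value range. The array `toy_values` is a hypothetical stand-in and has nothing to do with the real atomization energies.

```python
import numpy as np

# Hypothetical stand-in values, NOT the real atomization energies
rng = np.random.default_rng(0)
toy_values = rng.normal(loc=0.0, scale=1.0, size=1000)

# Value range of the data
print("min:", toy_values.min(), "max:", toy_values.max())

# Sort the values into 20 equal-width bins between min and max
counts, bin_edges = np.histogram(toy_values, bins=20)
print(counts.sum())     # every value falls into exactly one bin
print(len(bin_edges))   # one more edge than there are bins
```

The `plt.hist()` function accepts the same `bins` argument and draws the counts as bars.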

In [0]:
import matplotlib.pyplot as plt

# Plot the histogram of the atomization energies







# Exercises 2.2: Training/Validation/Test split:

Before we do any training and optimization, it is a good idea to split the data into a training set, a validation set, and a test set.

Since our representations (the Coulomb Matrix features and the Bag-of-Bonds features) and the atomization energy labels are in Numpy format, you can do the splitting using Numpy's "slice" notation. 

Below are some examples of how to get certain rows out of a matrix, which you can use for inspiration:

```
# Get rows 0-9:
rows1 = my_matrix[:10]

# Get rows 10-19
rows2 = my_matrix[10:20]

# Get rows 20 and onwards
rows3 = my_matrix[20:]
```
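The same slicing can carve a matrix into consecutive, non-overlapping pieces. Below is a minimal sketch on a toy matrix with 10 rows; the toy data and the variable names are illustrative assumptions only.

```python
import numpy as np

# Toy matrix with 10 rows (hypothetical stand-in for a feature array)
my_matrix = np.arange(20).reshape(10, 2)

n_rows = len(my_matrix)
n_train = (6 * n_rows) // 10   # 60% of the rows, using integer arithmetic
n_val = (2 * n_rows) // 10     # 20% of the rows

train_rows = my_matrix[:n_train]
val_rows = my_matrix[n_train:n_train + n_val]
test_rows = my_matrix[n_train + n_val:]   # whatever is left

print(len(train_rows), len(val_rows), len(test_rows))   # 6 2 2
```

Because each slice starts where the previous one ends, no row can end up in two pieces.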

### Question 2.2.1:  (6.2)

In this question, you have to split the three Numpy arrays `cmat_features`, `bob_features`, and `atomization_energy` each into a training, a validation and a test part. You will end up with 9 Numpy arrays in total.

**Note:** It is important that no molecule is in more than one set.

**To-do:**
*    Split the data into Training/Validation/Test sets of roughly 60%/20%/20% of the total dataset each.
 
*    How many molecules are there in the training, validation and test sets with a 60%/20%/20% split? In the web-interface select the right answer.
 * Since the sizes of the splits can vary slightly depending on how you did the splitting, choose the option that is closest to what you found.


In [0]:
# Define the splits here

# Split the coulomb matrix features
cmat_features_training = ???
cmat_features_validation = ???
cmat_features_test = ???

# Split the bag-of-bond features
bob_features_training = ???
bob_features_validation = ???
bob_features_test = ???

# Split the atomization energies
atomization_energy_training = ???
atomization_energy_validation = ???
atomization_energy_test = ???

# Print sizes of the three data splits
print(len(atomization_energy_training))
print(len(atomization_energy_validation))
print(len(atomization_energy_test))

## Exercise 2.3: Kernel Ridge Regression Model

Now that we have defined our training, validation and test splits, we are ready to fit a kernel ridge regression model. 

Just like the previous examples with classifiers and linear regression, Scikit-Learn has built-in support for kernel ridge regression. Below is some boilerplate code you can use:

```
# Import the Machine
from sklearn.kernel_ridge import KernelRidge

# Make a machine, see text for explanation of key-words
machine = KernelRidge(alpha=1e-9, kernel="rbf", gamma=1e-4)

# Fit the machine using the training features and training labels
machine.fit(features_training, labels_training)
```
With the fitted machine, you can now make predictions for a set of test features and get the predicted labels:


```
# Predict y-values using the features_test features
labels_test_predicted = machine.predict(features_test)
```

The keyword arguments in `KernelRidge(alpha=1e-9, kernel="rbf", gamma=1e-4)` have the following meaning:


1.   `alpha=1e-9` is the regularizer, i.e. the small number that is added to the diagonal of the kernel matrix. $10^{-9}$ is a good value for many problems and we will not optimize it here. 
2.   `kernel="rbf"` selects the radial basis function (RBF) kernel, which is another name for the Gaussian kernel function. This keyword tells the machine to use the Gaussian kernel.
3.   `gamma=1e-4` Gamma (i.e. $\gamma$) controls the length scale of the Gaussian/radial basis function: the larger $\gamma$ is, the narrower the kernel. 

The kernel used in Scikit-learn is defined as follows:

\begin{equation}
K\left(\mathbf{x}_i, \mathbf{x}_j\right) = \exp\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j \|^2 \right)
\end{equation}
Later, we will optimize the value of $\gamma$ to ensure the most accurate machine learning model.
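To make the roles of the kernel, `gamma` and `alpha` concrete, here is a minimal Numpy sketch of the textbook kernel ridge regression solution: build the kernel matrix, add `alpha` to its diagonal, and solve a linear system for the regression coefficients. This is only an illustration under stated assumptions, not Scikit-learn's actual implementation; the helper `gaussian_kernel` and the toy data are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, gamma):
    """Kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    # Pairwise squared Euclidean distances via broadcasting
    diff = A[:, None, :] - B[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)


# Toy data (hypothetical): 5 points in 2 dimensions with made-up labels
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0.0, 1.0, 1.0, 2.0, 4.0])

gamma = 1.0
alpha = 1e-9

# Training: solve (K + alpha * I) c = y for the regression coefficients c
K = gaussian_kernel(X_train, X_train, gamma)
coeffs = np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)

# Prediction: kernel between query points and training points, times c
X_query = np.array([[0.5, 0.5], [1.0, 1.0]])
y_query = gaussian_kernel(X_query, X_train, gamma) @ coeffs
print(y_query)
```

With such a small regularizer, the model reproduces the training labels almost exactly, which is why the second query point (identical to a training point) is predicted very close to its label.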

The accuracy of a machine learning model can be assessed, for example, by calculating the mean-absolute-error (MAE) for a test or validation set:
\begin{equation}
\text{MAE} = \frac{1}{N}\sum_{i=1}^N |y_i^{true} - y_i^{predicted} |
\end{equation}
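The MAE formula translates almost directly into Numpy. The following is a small sketch with made-up numbers; the function name is chosen for illustration.

```python
import numpy as np


def mean_absolute_error(y_true, y_predicted):
    """Average of the absolute differences between true and predicted labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.mean(np.abs(y_true - y_predicted))


# Toy numbers: the absolute errors are 0.5, 0.0 and 1.0
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(mae)   # 0.5
```

Scikit-learn also provides `sklearn.metrics.mean_absolute_error`, which computes the same quantity.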






### Question 2.3.1:  (6.3)
Make a kernel ridge regression machine as described above. 

Use the same parameter values as in the example above, that is `alpha=1e-9, kernel="rbf", gamma=1e-4`.


**To-do:**
* First, train your model on the training set with Coulomb Matrix features (i.e. `cmat_features_training` and `atomization_energy_training`). 
* Next predict the atomization energy on the test set (`cmat_features_test`), and calculate the mean-absolute-error (MAE) between the predicted atomization energies for the test set and the true atomization energies (stored in `atomization_energy_test`).
* In the web interface, enter the MAE of the atomization energies you calculated using Coulomb matrix features.
 * You are allowed a margin of +/- 4 kcal/mol since the numbers might vary slightly.

In [0]:
from sklearn.kernel_ridge import KernelRidge

# Implement kernel ridge regression model 
# Train the model on cmat_features_training and atomization_energy_training



# Predict atomization energies on cmat_features_test



# Calculate the MAE between the predicted and true atomization energies for the test set




### Question 2.3.2:   (6.4)





Repeat what you did in Question 2.3.1, but this time train on the training set with Bag-of-Bonds features.


* Train your kernel ridge regression using `bob_features_training` and `atomization_energy_training`. 
* Next predict the atomization energy on the test set (`bob_features_test`), and calculate the mean-absolute-error (MAE) between the predicted atomization energies for the test set and the true atomization energies (stored in `atomization_energy_test`).
* In the web interface, enter the MAE of the atomization energies you calculated using Bag-of-Bonds features.
 * You are allowed a margin of +/- 4 kcal/mol since the numbers might vary slightly.

In [0]:
from sklearn.kernel_ridge import KernelRidge

# Implement kernel ridge regression model 
# Train the model on bob_features_training and atomization_energy_training



# Predict atomization energies on bob_features_test



# Calculate the MAE between the predicted and true atomization energies for the test set




## Exercise 2.4: Hyperparameter Optimization

In order to ensure the most accurate predictions we need to optimize the hyperparameters of the model. Kernel ridge regression models can be especially sensitive to the length-scale parameter, in our case the parameter $\gamma$ (`gamma`).

To set the value of gamma in the machine learning model to, for example, $10^{-8}$ you can change the argument `gamma=` as follows:

```
machine = KernelRidge(alpha=1e-9, kernel="rbf", gamma=1e-8)
```

One common strategy for optimizing such a parameter is to scan its value over a logarithmically spaced range.

To avoid overfitting the hyperparameters to the test set, it is common practice to train the model on the training set and predict on the *validation* set. The optimal value is the one with the lowest prediction error on the validation set.
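The scan described above can be sketched on a toy problem. The snippet below uses made-up 1D data (a sine curve, not the QM7 features) and `np.logspace()` to build the logarithmic grid; all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Toy 1D regression problem (hypothetical data, not the QM7 features)
x_train = np.linspace(0.0, 6.0, 30).reshape(-1, 1)
y_train = np.sin(x_train).ravel()
x_val = np.linspace(0.1, 5.9, 15).reshape(-1, 1)
y_val = np.sin(x_val).ravel()

# Logarithmically spaced grid: 1e0, 1e-1, ..., 1e-9
gammas = np.logspace(0, -9, 10)

val_maes = []
for gamma in gammas:
    machine = KernelRidge(alpha=1e-9, kernel="rbf", gamma=gamma)
    machine.fit(x_train, y_train)              # train on the training set
    y_val_predicted = machine.predict(x_val)   # predict on the validation set
    val_maes.append(np.mean(np.abs(y_val - y_val_predicted)))

# The best gamma is the one with the lowest validation MAE
best_index = int(np.argmin(val_maes))
print("best gamma:", gammas[best_index], "validation MAE:", val_maes[best_index])
```

The test set is never touched during the scan, so it remains available for an unbiased final error estimate.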


### Question 2.4.1: (6.5)
Previously, we trained machines only for `gamma=1e-4`. In this question, you will train machines for $\gamma$ in a range from $1.0$ down to $10^{-9}$ (see code). First, we do this for the Coulomb matrix features.

**To-do:**
* For each value of `gamma`, train a machine on the training set using the Coulomb Matrix features, i.e. `cmat_features_training`.
* For each value of `gamma`, predict the atomization energy for the validation set, using Coulomb Matrix features, i.e. `cmat_features_validation`.
* Next, calculate the mean-absolute-error between the predicted and true atomization energies for the validation set.
* In the web-interface, select which values of `gamma` you found to give the lowest MAE using Coulomb matrix features.
 * Note: As some values of `gamma` will give MAE values that are very close, there are several right answers to this question

In [0]:
# The list of Gamma that we wish to try
gammas = [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]

# Train a kernel ridge regression model for each gamma on cmat_features_training
# Predict the atomization energy for cmat_features_validation
# Calculate the MAE between the true and predicted atomization energies





### Question 2.4.2:  (6.6)
In the previous question, you found the best value of `gamma` for Coulomb matrix features. In this question you will repeat the same process and find the optimal value of `gamma` for the Bag-of-Bonds features.

**To-do:**
* For each value of `gamma`, train a machine on the training set using the Bag-of-Bonds features, i.e. `bob_features_training`.
* For each value of `gamma`, predict the atomization energy for the validation set, using Bag-of-Bonds features, i.e. `bob_features_validation`.
* Next, calculate the mean-absolute-error between the predicted and true atomization energies for the validation set.
* In the web-interface, select which values of `gamma` you found to give the lowest MAE using Bag-of-Bonds features.
 * Note: As some values of `gamma` will give MAE values that are very close, there are several right answers to this question

In [0]:
# The list of Gamma that we wish to try
gammas = [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]

# Train a kernel ridge regression model for each gamma on bob_features_training
# Predict the atomization energy for bob_features_validation
# Calculate the MAE between the true and predicted atomization energies





## Exercise 2.5: Learning Curves

In order to see which of the two representations (Coulomb Matrix and Bag-of-Bonds) gives the most accurate kernel ridge regression model, we can compare learning curves for the machines based on the two representations.




### Question 2.5.1: (6.7)
For both representations, train machines using 500, 1000, 2000, and 4000 molecules from the training set. Use the best value of `gamma` which you found in the previous exercise for each machine.

Again, use Numpy's slice notation to extract only the necessary number of rows from the Numpy arrays in the training set.


**To-do:**
*   Train machines for each training set size with the two representations (Coulomb Matrix and Bag-of-Bonds).
*   For each machine you train, predict the atomization energy for the test set and calculate the mean-absolute-error (MAE) to the true energies for the test set.
*   Plot the MAE values as a function of the training set size.
*   The representation which yields the most accurate predictions is the one with the lowest learning curve. In the web interface, select the representation that gave you the learning curves with the lowest MAE values.
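The steps above can be sketched on a toy problem. The snippet below uses made-up 1D data (a sine curve, not the QM7 features), smaller training set sizes, and an arbitrarily chosen `gamma=1.0`; all of these are assumptions for illustration only.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Toy 1D data in random order (hypothetical, not the QM7 features)
rng = np.random.default_rng(0)
x_all = rng.uniform(0.0, 6.0, size=(200, 1))
y_all = np.sin(x_all).ravel()

# First 160 rows for training, last 40 rows for testing
x_train, y_train = x_all[:160], y_all[:160]
x_test, y_test = x_all[160:], y_all[160:]

train_sizes = [10, 20, 40, 80, 160]
test_maes = []
for n in train_sizes:
    # Train on the first n training rows only, always test on the same test set
    machine = KernelRidge(alpha=1e-9, kernel="rbf", gamma=1.0)
    machine.fit(x_train[:n], y_train[:n])
    y_test_predicted = machine.predict(x_test)
    test_maes.append(np.mean(np.abs(y_test - y_test_predicted)))

for n, mae in zip(train_sizes, test_maes):
    print(n, mae)
```

Plotting `test_maes` against `train_sizes`, for example on logarithmic axes with `plt.loglog()`, gives the learning curve for one representation. Repeating the loop with a second set of features gives the curve to compare against.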

In [0]:
# Implement learning curves for kernel ridge regression here









