# Heat of Formation

In this exercise we will try to predict the heat of formation (HOF) of cubic oxoperovskites from simple data that can be looked up for each element. We have collected such a table for you in the file `periodic_table_groups.csv`, which can be loaded as a Pandas data frame, as done below.

We also load the database by Ivano, which contains the target values (our "answers") that we need in order to train a machine learning model.

If you don't have the database, use the command below to download it.

In [None]:
!wget https://cmr.fysik.dtu.dk/_downloads/a8829848dc5806fc8adea6974bed0e6d/cubic_perovskites.db

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from ase.db import connect
import pandas as pd
con = connect('cubic_perovskites.db')
pt = pd.read_csv('periodic_table_groups.csv', index_col='Symbol')

Inspect the data frame to get an understanding of what it contains.

In [None]:
pt

We can construct a feature vector from this table. In the cell below we construct a small feature vector, which contains the atomic number, Z, of the A and B ions in the perovskite, ABO$_3$, as well as the period and group in the periodic table.

The idea behind including the period and group is that metals close to each other in the periodic table will behave similarly, and thus the Euclidean distance in (period, group) provides a similarity measure between metals.
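
To make the similarity measure concrete, here is a minimal sketch that computes the Euclidean distance in (period, group) space for three metals. The `coords` dictionary and the `pt_distance` helper are hypothetical names introduced only for this illustration; the (period, group) values are taken from the standard periodic table rather than from `periodic_table_groups.csv`.

```python
import numpy as np

# (period, group) pairs from the standard periodic table (illustration only)
coords = {
    'Ca': np.array([4, 2]),
    'Sr': np.array([5, 2]),
    'Ti': np.array([4, 4]),
}

def pt_distance(a, b):
    """Euclidean distance between two elements in (period, group) space."""
    return float(np.linalg.norm(coords[a] - coords[b]))

d_ca_sr = pt_distance('Ca', 'Sr')  # same group, neighbouring periods
d_ca_ti = pt_distance('Ca', 'Ti')  # same period, two groups apart
print(d_ca_sr, d_ca_ti)
```

With this measure Ca is closer to Sr than to Ti, which matches the chemical intuition that elements in the same group behave similarly.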

In [None]:
import numpy as np

def make_fingerprint(row):
    """Return the feature vector for one ABO3 row of the database."""
    A = row.A_ion
    B = row.B_ion

    features = ['AtomicNumber', 'Period', 'Group']  # Initial features to include

    # Construct feature vector, ordered feature by feature: (A, B) for each feature
    symbols = [A, B]
    x = []
    for feat in features:
        for sym in symbols:
            x.append(pt.loc[sym, feat])
    return np.array(x)
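
The ordering of the resulting vector can be checked on a toy example. The sketch below is self-contained and does not touch the real database: `toy_pt`, `ToyRow` and `fingerprint_from_table` are hypothetical stand-ins for `pt`, a database row and `make_fingerprint`, and the table is assumed to have the same column names as the ones used above.

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Toy stand-in for the periodic table data frame (assumed column names)
toy_pt = pd.DataFrame(
    {'AtomicNumber': [38, 22], 'Period': [5, 4], 'Group': [2, 4]},
    index=pd.Index(['Sr', 'Ti'], name='Symbol'),
)

# Toy stand-in for a database row with A_ion and B_ion attributes
ToyRow = namedtuple('ToyRow', ['A_ion', 'B_ion'])

def fingerprint_from_table(row, table, features=('AtomicNumber', 'Period', 'Group')):
    """Same loop order as make_fingerprint, but with the table passed explicitly."""
    x = []
    for feat in features:
        for sym in (row.A_ion, row.B_ion):
            x.append(table.loc[sym, feat])
    return np.array(x)

fp = fingerprint_from_table(ToyRow('Sr', 'Ti'), toy_pt)
print(fp)  # layout: [Z_A, Z_B, period_A, period_B, group_A, group_B]
```

Knowing this layout is useful later, when you want to map a column of X back to the feature and ion it came from.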

We now need to construct the input feature matrix, X, and the target vector with our correct answer, Y.

In [None]:
selection = {'combination': 'ABO3'}  # Our selection from the database
n_samples = con.count(**selection)
def make_X():
    '''Make input matrix using ids from database'''
    n_features = len(make_fingerprint(con.get(id=1)))  # Length of input vector
    X = np.zeros((n_samples, n_features))
    
    for ii, row in enumerate(con.select(**selection)):
        X[ii, :] = make_fingerprint(row)
    return X

def make_Y():  
    Y = np.zeros(n_samples)
    for ii, row in enumerate(con.select(**selection)):
        Y[ii] = row.heat_of_formation_all
    return Y

X = make_X()
Y = make_Y()

We can then split off a part of our data, train a model on that part, and see how well it performs on the remainder. The cell below provides an auxiliary function for performing this operation.

In [None]:
import sklearn.model_selection as ms
from sklearn.metrics import r2_score, mean_absolute_error
def make_comparison_plot(X, y, model):
    X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.33, random_state=42)
    model.fit(X_train, y_train)
    ybar = model.predict(X_test)
    
    n_train, n_test = len(y_train), len(y_test)
    
    r2 = r2_score(y_test, ybar)
    mae = mean_absolute_error(y_test, ybar)
    
    ymax = np.array((y_test, ybar)).max() + 0.1
    ymin = np.array((y_test, ybar)).min() - 0.1
    plt.scatter(ybar, y_test, zorder=0)
    plt.xlim(ymin, ymax)
    plt.ylim(ymin, ymax)
    plt.plot([ymin, ymax], [ymin, ymax], 'k--', zorder=1)
    plt.xlabel('Predicted HOF [eV]')
    plt.ylabel('Actual HOF [eV]')
    plt.title('MAE: {:.3f} eV, $r^2$ score: {:.3f}, trained on: {:d}, tested on: {:d}'.format(mae, r2, n_train, n_test))

Let's train a linear model on the data and see how well it does:

In [None]:
from sklearn import linear_model
linear = linear_model.LinearRegression()
make_comparison_plot(X, Y, linear)

Not very well. It is quite clear that we cannot predict the HOF from simple table data using a linear model. So we have two choices:

1. Use a more complex model
2. Include more complex data

Let's first try a more complex model. Train a kernel ridge regression (KRR) model with a Gaussian (RBF) kernel from the sklearn package, and compare it with the linear model. Does it improve anything? See http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge

In [None]:
from sklearn.kernel_ridge import KernelRidge

# Type your code here.
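
If you are unsure how to get started, the following is a minimal sketch on synthetic data, not on the perovskite data set. The arrays `X_toy` and `y_toy` and the values chosen for `alpha` and `gamma` are assumptions made purely for illustration; the point is only to show how a `KernelRidge` model with an RBF kernel is set up and compared with a linear model.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, clearly non-linear data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) * np.cos(X_toy[:, 1]) + 0.05 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)

# alpha (regularization) and gamma (kernel width) are guesses, not tuned values
krr = KernelRidge(kernel='rbf', alpha=1e-2, gamma=0.5)
lin = LinearRegression()

# .score returns the r^2 score on the held-out data
score_lin = lin.fit(X_tr, y_tr).score(X_te, y_te)
score_krr = krr.fit(X_tr, y_tr).score(X_te, y_te)
print('linear r2: {:.3f}, KRR r2: {:.3f}'.format(score_lin, score_krr))
```

In the notebook itself, a model built this way can be passed to `make_comparison_plot(X, Y, ...)` in place of `linear`. Since the RBF kernel depends on distances between feature vectors, the scale of the features may influence which `gamma` works well.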

Also try modifying the hyperparameters, `alpha` and `gamma`, and see if you can improve it further. One option for doing this is the [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class.
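
A grid search could look roughly like the sketch below. It again uses synthetic data, and the grid values in `param_grid`, the number of folds and the scoring choice are assumptions for illustration rather than recommended settings for the perovskite data.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 2))
y_toy = np.sin(X_toy[:, 0]) * np.cos(X_toy[:, 1]) + 0.05 * rng.normal(size=200)

# Candidate hyperparameters, spaced logarithmically (assumed values)
param_grid = {
    'alpha': [1e-3, 1e-2, 1e-1, 1.0],
    'gamma': [1e-2, 1e-1, 1.0],
}

# 5-fold cross-validation; scikit-learn maximizes the score, hence "neg" MAE
search = GridSearchCV(
    KernelRidge(kernel='rbf'),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
)
search.fit(X_toy, y_toy)

print('best parameters:', search.best_params_)
print('cross-validated MAE: {:.3f}'.format(-search.best_score_))
```

After fitting, `search.best_estimator_` holds a model refitted on all the supplied data with the best parameters found.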

You should find that you can drastically improve the accuracy. Now try to include more data from the table we provided, and see how much you gain. Can you identify which features are most important to the model?
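
One possible way to approach the last question is permutation importance, which measures how much the test score drops when a single feature column is shuffled. This is a suggestion, not the method prescribed by the exercise. The sketch below uses synthetic data in which column 0 dominates the target by construction and column 2 is not used at all; all names and hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 matters most, column 1 a little, column 2 not at all
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(300, 3))
y_toy = (2.0 * np.sin(X_toy[:, 0]) + 0.3 * X_toy[:, 1]
         + 0.05 * rng.normal(size=300))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)

model = KernelRidge(kernel='rbf', alpha=1e-2, gamma=0.5).fit(X_tr, y_tr)

# Shuffle each column of the test set several times and record the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Feature indices sorted from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print('feature {}: {:.3f}'.format(idx, result.importances_mean[idx]))
```

Applied to the perovskite data, the feature indices refer to columns of X, so they need to be mapped back to the (feature, ion) pairs produced by `make_fingerprint`.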