# Prediction of Formation Energy of Solids Using a Neural Network


This exercise follows the work of Jha *et al.* [1]. We will implement, train, and validate a model that predicts the formation energy of a given solid, using only its chemical formula as the input.


1. Jha, D., Ward, L., Paul, A., Liao, W.-K., Choudhary, A., Wolverton, C., & Agrawal, A. (2018). ElemNet: Deep Learning the Chemistry of Materials From Only Elemental Composition. *Scientific Reports*, 8(1), 17593. http://doi.org/10.1038/s41598-018-35934-y

Install necessary packages

In [None]:
!pip install numpy matplotlib pandas scikit-learn tensorflow

In [None]:
# import
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import re

## Descriptor

The descriptor is the elemental composition of the given system. It is an array of length 86, one entry per element (the dataset contains only 86 distinct elements), where each entry holds the count of that element in the formula.

### Load data

In [None]:
df = pd.read_csv('data/inorg/mp.csv')

df.head()

In [None]:
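def parse_formula(formula):
    """Return {element symbol: count} for a formula such as 'Ge4Mg2O12Sn2'.

    Only integer counts are handled; parentheses are not supported.
    """
    # split before every uppercase letter: ['Ge4', 'Mg2', 'O12', 'Sn2']
    split = re.findall(r'[A-Z][^A-Z]*', formula)
    result = {}
    for el in split:
        head = el.rstrip('0123456789')  # element symbol
        tail = el[len(head):]           # trailing digits; '' means a count of 1
        result[head] = int(tail) if tail != '' else 1

    return result

parse_formula('Ge4Mg2O12Sn2')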

In [None]:
# build template for descriptor
elements = ['H', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'K', 'Ca', 'Sc', 'Ti', 'V', 
            'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 
            'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 
            'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 
            'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu']
desc_positions = dict(zip(elements, range(len(elements))))

In [None]:
print(desc_positions)

In [None]:
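def get_descriptor(formula):
    """Return the element-count vector for a formula."""
    elem_counts = parse_formula(formula)
    desc = np.zeros(shape=len(elements))
    for k, v in elem_counts.items():
        # index with [] so an unknown element raises KeyError;
        # .get(k) would return None, and desc[None] = v overwrites every entry
        desc[desc_positions[k]] = v

    return desc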

In [None]:
print(get_descriptor('Ge4Mg2O12Sn2'))
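The descriptor above stores raw atom counts, so a formula and a multiple of it give different vectors even though they describe the same composition. One possible variant, sketched here as an assumption and not as this notebook's method, divides each vector by its total so the entries become element fractions. The helper `to_fractions` is hypothetical, and the 4-entry vectors are toy stand-ins for the 86-entry descriptor.

```python
import numpy as np

def to_fractions(counts):
    """Scale a count vector so its entries sum to 1 (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        raise ValueError("descriptor contains no atoms")
    return counts / total

# toy 4-entry count vectors standing in for the 86-entry descriptor
a = to_fractions([4, 2, 12, 2])  # counts as in Ge4Mg2O12Sn2
b = to_fractions([2, 1, 6, 1])   # the same composition with every count halved

print(a)
print(np.allclose(a, b))  # both count vectors map to the same fractions
```

If this variant were used, it could be applied to the output of `get_descriptor` before stacking the rows into `x`.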

In [None]:
# get datasets
x = []
y = []

for row in df.itertuples():
    x.append(get_descriptor(row.formula))
    y.append(float(row.energy))

x = np.stack(x)
y = np.array(y)

print(x.shape)
print(y.shape)

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=35)

print(X_train.shape)
print(X_test.shape)

## Model

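A reasonable starting point is a model with four hidden layers of [256, 128, 64, 32] neurons. We will use ReLU activations, the Adam optimizer, and mean squared error as the loss function.

The architecture used in the paper is 1024x4-512x3-256x3-128x3-64x2-32x1-1, with dropout values [0.8, 0.9, 0.7, 0.8]. The cell below follows this deeper architecture. It places the dropout layers after the 1024, 512, 256, and 128 blocks and treats the listed values as keep probabilities, so each Keras `Dropout` rate is one minus the listed value.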
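To get a feel for how the two architectures differ in size, the sketch below counts the trainable parameters of a stack of fully connected layers. It assumes every layer is a plain dense layer with a bias term and an 86-entry input; dropout layers add no parameters. The function name `dense_param_count` is made up for this illustration.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Total weights and biases for a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix plus one bias per unit
        fan_in = units
    return total

# hidden layers followed by the single linear output unit
small = [256, 128, 64, 32, 1]
paper = [1024] * 4 + [512] * 3 + [256] * 3 + [128] * 3 + [64] * 2 + [32, 1]

print(dense_param_count(86, small))  # 65537
print(dense_param_count(86, paper))  # 4631361
```

The deeper stack has roughly 70 times as many parameters as the four-hidden-layer model, which is worth keeping in mind when training on a small subset of the data.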

In [None]:
model = Sequential([
    Dense(1024, input_shape=(86,), activation='relu'),
    Dense(1024, activation='relu'),
    Dense(1024, activation='relu'),
    Dense(1024, activation='relu'),
    Dropout(1-0.8),
    Dense(512, activation='relu'),
    Dense(512, activation='relu'),
    Dense(512, activation='relu'),
    Dropout(1-0.9),
    Dense(256, activation='relu'),
    Dense(256, activation='relu'),
    Dense(256, activation='relu'),
    Dropout(1-0.7),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dropout(1-0.8),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),  # second 64-unit layer, matching the 64x2 block listed above
    Dense(32, activation='relu'),
    Dense(1, activation='linear'),
])

In [None]:
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae'])

# print summary 
model.summary()

In [None]:
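# train on the first 1024 training samples only; validation_split holds out 10% of them for validation
model.fit(x=X_train[:1024], y=y_train[:1024], batch_size=32, validation_split=0.1, verbose=2, epochs=100)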
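The `fit` call above passes 1024 samples with `validation_split=0.1` and `batch_size=32`. The arithmetic below estimates how many samples end up in each part and how many batches make up one epoch. It assumes the training part is rounded down, which is my reading of Keras' behavior and not something stated in this notebook. Keras documents that the validation samples are taken from the end of the arrays, before shuffling.

```python
import math

n_samples = 1024
validation_split = 0.1
batch_size = 32

# assumed rounding: training part is floor(n * (1 - split)), the rest is validation
n_train = math.floor(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be smaller

print(n_train, n_val, steps_per_epoch)
```

Because the held-out part is not drawn at random, it is only representative if the rows are already in random order, which is the case here since `train_test_split` shuffles by default.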

## Predict