# Homework 8: Predicting Ground State Energy

In this homework, your goal is to predict the ground state energy of a molecule and to reach the lowest possible error on the test set. You must report your score on the scoreboard:

https://keepthescore.co/board/ffirhduscve/

For this homework, you will use a library called score, which provides a single function.
<ul> 
    <li> test(features,model): features should be a list of mordred features. model should be a scikit-learn estimator. Returns the mean absolute error on the test set.
</ul>
At the end of this notebook, you will find an example of how to define a model and test it. You should use datasets A, B, C, D, F, G, H, I for your work. You are encouraged to use multiple datasets!<br>
One final note: your position on the scoreboard does matter. If you are doing much worse than your peers, you can expect to receive fewer points. 
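
To make the scoring metric concrete, here is a minimal sketch of a mean absolute error computation, assuming the test function averages the absolute differences between predicted and true energies. The helper name and the energy values are hypothetical and only serve as an illustration; they are not part of the score library or of any dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |prediction - truth| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one sample")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical energies, for illustration only (not taken from any dataset).
true_energies = [-10.0, -20.0, -30.0]
predicted_energies = [-11.0, -18.0, -30.0]

# Absolute errors are 1, 2 and 0, so their average is 1.
mae = mean_absolute_error(true_energies, predicted_energies)
print(mae)
```

Computing the same quantity on a held-out part of your own training data can give a rough idea of your error before you call test.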

In [11]:
from score import test

**Rules**:<br>
<ul>
    <li> It is forbidden to modify the score library;
    <li> It is forbidden to import any other function from the score library;
    <li> It is forbidden to use dataset E;
    <li> It is forbidden to use any dataset other than A, B, C, D, F, G, H, I;
    <li> It is forbidden to use features other than the ones computed from mordred;
    <li> You can use any number of features; however, at the end, you will need to provide a brief (and vague) explanation of what your features are doing;
    <li> You can use anything you want for modeling, including all the tools available in pytorch, and you can even use other machine learning libraries if you wish;
    <li> It is forbidden to modify the scoreboard page (be careful, you all have admin access to it).
</ul>
It is very easy to cheat and I rely on your integrity to participate in good faith. If you are caught cheating, you will get 0 for the assignment. 

## Some advice

<ul>
<li> For this project, you will have to work using several notebooks.
<li> You should start by writing a notebook to create the dataset. Note that if you use many molecules, using MOPAC could take a while. If you decide to optimize the geometry of all the molecules in the Solubility datasets, it will take at least 12 hours. Make sure to save the result so that you don't have to compute it multiple times!
<li> You should probably have another separate notebook that creates the graph version of the dataset. Again, you should save the data; for this, make sure to use the save_graphs and load_graphs functions of DGL.
<li> Finally, you should have at least one notebook where you define and train your model.
<li> As an example, my Reference submission was defined using 5 notebooks...
</ul>
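
The advice about saving long computations can be sketched as a small compute-once cache built on pickle. This is only one possible approach: the helper name load_or_compute, the file name, and the placeholder feature values are all hypothetical, and the slow step stands in for whatever expensive calculation (for example a MOPAC run) you actually perform.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the result cached at `path`; compute and save it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []  # records how many times the slow step actually runs


def slow_featurization():
    # Stand-in for an expensive step; the values below are placeholders.
    calls.append(1)
    return {"CCO": [8, 0, 0]}


# A temporary directory keeps this sketch from touching your real files.
cache_path = os.path.join(tempfile.mkdtemp(), "features.pickle")

first = load_or_compute(cache_path, slow_featurization)   # computes and saves
second = load_or_compute(cache_path, slow_featurization)  # loads from disk
print(len(calls))
```

In a real workflow the cache file would live next to your notebooks, so that the training notebook can reload the result without repeating the calculation.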

## Example

In [2]:
import numpy as np
from sklearn.dummy import DummyRegressor

In [3]:
def dummy_E0():
    model = DummyRegressor()
    model.fit(None,np.array([42]))
    return lambda smiles: model.predict(smiles)

In [5]:
test(dummy_E0(),'Reference')

Your score is worse than your previous best score, it will not be reported.


52101.286045329194

## Your turn

In [12]:
import numpy as np
import pandas as pd
import torch  # needed by featurize_atoms and featurize_bonds below
from sklearn.linear_model import LinearRegression
from rdkit import Chem
from rdkit.Chem import AllChem
import mordred
from mordred import Calculator, descriptors
from dgllife.utils import smiles_to_bigraph
import pickle

In [13]:
def featurize_atoms(mol):
    # Node feature: the atomic number of each atom, as a column of floats
    feats = []
    for atom in mol.GetAtoms():
        feats.append(atom.GetAtomicNum())
    return {'atomic': torch.tensor(feats).reshape(-1, 1).float()}
    
def featurize_bonds(mol):
    # Edge feature: index of the bond type in bond_types
    feats = []
    bond_types = [Chem.rdchem.BondType.SINGLE, Chem.rdchem.BondType.DOUBLE,
                Chem.rdchem.BondType.TRIPLE, Chem.rdchem.BondType.AROMATIC]
    for bond in mol.GetBonds():
        btype = bond_types.index(bond.GetBondType())
        # Each bond becomes two directed edges in the bigraph, hence two entries
        feats.extend([btype, btype])
    return {'type': torch.tensor(feats).reshape(-1, 1).float()}

In [14]:
def transform(smile):
    g = smiles_to_bigraph(smile, node_featurizer=featurize_atoms, edge_featurizer=featurize_bonds,explicit_hydrogens=True)
    return g

In [17]:
with open('energy_dict.pickle', 'rb') as handle:
    b = pickle.load(handle)

In [18]:
def atoms_to_energy(rdkit_atoms,energy_dict):
    # Sum the per-element energies (keyed by atomic number) over all atoms
    return sum(energy_dict[atom.GetAtomicNum()] for atom in rdkit_atoms.GetAtoms())

In [15]:
df = pd.read_csv('../../Data/Energy/FeaturizedDataset-HW8.csv')
feat = df.columns[1:4]
featurized = np.array(df[feat])
ener = df.columns[4]
energy = np.array(df[ener])

In [16]:
features = [mordred.BondCount.BondCount("single", False), mordred.BondCount.BondCount("double", False), mordred.BondCount.BondCount("triple", False)]
calc = mordred.Calculator()
calc.register(features)

In [None]:
%run Model_1.ipynb

done


  0%|          | 0/10 [00:00<?, ?it/s]

Train loss:  82436.15338821687
Train loss:  81996.63770244013


In [None]:
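def f():
    # Fit a linear model from the bond-count features to the energy column
    LR = LinearRegression()
    LR.fit(featurized,energy)
    
    def inner(smile):
        # Build the molecule with explicit hydrogens
        mol = Chem.MolFromSmiles(smile)
        mol = Chem.AddHs(mol)
        # Baseline: sum of the per-atom energies loaded from energy_dict.pickle
        base = atoms_to_energy(mol,b)
        # Compute the registered mordred bond-count features for this molecule
        dff = pd.DataFrame(data = {'Mol': mol}, index=[0])
        features = np.array(calc.pandas(dff['Mol']))
        ae_pred = LR.predict(features)
        # Graph-model correction; model is defined by running Model_1.ipynb
        g = transform(smile)
        correction = model(g, g.ndata['atomic'], g.edata['type'])
        return [base+ae_pred[0]+correction.item()]
    return inner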

In [None]:
test(f(),'Charm')