# Homework 4: Predicting Solubility

In this homework, your goal is to predict the solubility of a compound and to reach the lowest possible error on the test set. To make this homework a little more interactive, you should report your score on a scoreboard: https://keepthescore.co/board/cmdaqeorufe/ <br> For this homework, you will use a library called score, which provides you with three functions:
<ul>
    <li> featurize(dataset,features): dataset should be a CSV file with at least two columns: SMILES and Solubility. features should be a list of mordred features. Returns two NumPy arrays: one with the result of the feature calculation, one with the target.
    <li> test(features,model): features should be a list of mordred features. model should be a fitted scikit-learn estimator. Returns the mean absolute error on the test set.
    <li> report(teamname): teamname should be your team name, per the Canvas group assignment. Effect: records your current score on the scoreboard.
</ul>
At the end of this notebook, you will find an example of how to train a model, test it, and report a score. Note that you are allowed to use my featurize function for your featurization. If you want to use 3D features, please contact me. You should use datasets A, B, C, D, F, G, H, I for your work. You are encouraged to use multiple datasets!<br>
One final note: you will not be graded based on the scoreboard. Of course, there might be some anticorrelation between your test score and your grade, but you need not worry about the scoreboard.
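
For reference, the error metric mentioned above is presumably the mean absolute error (MAE). The sketch below shows how that quantity is computed with NumPy; the values are made up for illustration and are not taken from any of the datasets, and how the score library computes it internally is not shown here.

```python
import numpy as np

# Made-up solubility targets and predictions, for illustration only.
y_true = np.array([-2.0, -3.5, -1.0, -4.2])
y_pred = np.array([-2.5, -3.0, -1.0, -5.0])

# Mean absolute error: the average of |prediction - target| over all samples.
mae = np.mean(np.abs(y_pred - y_true))
print(f"MAE = {mae:.3f}")
```

A lower MAE means the predictions are, on average, closer to the measured values.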

In [23]:
from score import report, featurize, test

**Rules**:<br>
<ul>
    <li> It is forbidden to modify the score library;
    <li> It is forbidden to import any other function from the score library;
    <li> It is forbidden to use dataset E;
    <li> It is forbidden to use any dataset other than A, B, C, D, F, G, H, I;
    <li> It is forbidden to use features other than the ones computed from mordred;
    <li> You can use any number of features; however, at the end you will need to provide a brief (and vague) explanation of what your features are doing;
    <li> You can use anything you want for modeling, including all the tools available in pytorch, and you can even use other machine learning libraries if you wish;
    <li> It is forbidden to modify the scoreboard page (be careful, you all have admin access to it).
</ul>
It is very easy to cheat, so I rely on your integrity to participate in good faith. If you are caught cheating, you will get a 0 for the assignment.

## Example

In [24]:
import numpy as np
import pandas as pd
import mordred.AtomCount

In [102]:
features = [mordred.AtomCount.AtomCount("X"), mordred.AtomCount.AtomCount("HeavyAtom")]

In [103]:
X_train, y_train = featurize('../../Data/Solubility/dataset-F.csv',features)

100%|██████████| 1210/1210 [00:02<00:00, 451.27it/s]


In [104]:
from sklearn.linear_model import LinearRegression

In [115]:
model = LinearRegression()

In [116]:
model.fit(X_train,y_train)

LinearRegression()

In [117]:
test(features,model)

100%|██████████| 1291/1291 [00:03<00:00, 405.18it/s]


0.711502668545131

In [108]:
report('Test')

Your score is worse than your previous best score, it will not be reported.


## Your turn

In [25]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures as PF
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR
import mordred
from mordred import Calculator, descriptors

In [26]:
# concatenate datasets B,C,D,F,G,H

In [27]:
def concatenate():
    data_name = ['B', 'C', 'D', 'F', 'G', 'H']
    df_list = []
    
    #load data
    for name in data_name: 
        df_list.append(pd.read_csv('../../Data/Solubility/dataset-'+ name +'.csv'))
        
    #concatenate all datasets
    training_data =  pd.concat(df_list, axis=0, ignore_index=True)
    training_data.to_csv('../../Data/Solubility/dataset-HW4.csv')

In [28]:
concatenate()
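
Before featurizing the merged file, it can help to check what the concatenation produced. The sketch below repeats the same `pd.concat` step on two tiny in-memory frames; the frames `df_b` and `df_c` and their values are hypothetical stand-ins for the real dataset files. It also flags SMILES strings that appear in more than one source, which may or may not matter for the real data.

```python
import pandas as pd

# Hypothetical stand-ins for two dataset files (made-up values); only the
# two columns that featurize requires are included.
df_b = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"], "Solubility": [1.10, -1.64]})
df_c = pd.DataFrame({"SMILES": ["CC(=O)O", "CCO"], "Solubility": [1.22, 1.05]})

# Same call as in concatenate(): stack rows and rebuild a clean 0..n-1 index.
combined = pd.concat([df_b, df_c], axis=0, ignore_index=True)

# Rows whose SMILES string occurs more than once in the merged frame.
duplicated = combined[combined.duplicated(subset="SMILES", keep=False)]

print(combined.shape)
print(duplicated["SMILES"].unique())
```

Whether repeated molecules should be dropped, averaged, or kept is a modeling decision; this check only makes them visible.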

In [29]:
# list of mordred features
features = [
    mordred.HydrogenBond.HBondAcceptor,
    mordred.HydrogenBond.HBondDonor,
    mordred.RingCount.RingCount,
    mordred.Polarizability.APol,
    mordred.Polarizability.BPol,
    mordred.SLogP,
    mordred.TopoPSA.TopoPSA(True),
]

In [30]:
# load X_train and y_train
X_train, y_train = featurize('../../Data/Solubility/dataset-HW4.csv',features)

100%|██████████| 12300/12300 [00:28<00:00, 430.63it/s]


In [31]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state = 0, n_estimators=100)

In [32]:
model.fit(X_train,y_train)

RandomForestRegressor(random_state=0)

In [33]:
test(features,model)

100%|██████████| 1291/1291 [00:03<00:00, 409.00it/s]


0.17453259164424118
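
The test-set score can be complemented with a cross-validated estimate on the training data, which does not touch the test set. The following is a minimal sketch of that idea using `cross_val_score`; the arrays `X` and `y` are synthetic stand-ins for what `featurize` returns, so the numbers it prints say nothing about the real solubility data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and targets (assumption:
# 7 features, matching the length of the feature list above).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn scorers follow a "higher is better" convention, so the
# MAE scorer returns negated values; flip the sign to get the error.
neg_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -neg_mae

print(cv_mae.mean(), cv_mae.std())
```

A large gap between the cross-validated error and the test error would suggest that the training and test sets differ in some systematic way.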

In [34]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import validation_curve
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_absolute_error

In [35]:
degrees = [1,2,3]

In [36]:
pipe = Pipeline([('pf',PF()),('rf', RandomForestRegressor())])

In [None]:
v_train_scores, v_valid_scores = validation_curve(pipe, X_train, y_train, param_name="pf__degree", param_range=degrees, scoring=make_scorer(mean_absolute_error))
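
To make the shape of the `validation_curve` output concrete, here is a small sketch on synthetic data. It swaps the random forest for a `LinearRegression` so that it runs in well under a second, and the data, the step name `lr`, and the choice of `cv=4` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic term, so the degree actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=120)

pipe = Pipeline([("pf", PolynomialFeatures()), ("lr", LinearRegression())])
degrees = [1, 2, 3]

# param_name uses the "<step name>__<parameter>" convention of Pipeline.
train_scores, valid_scores = validation_curve(
    pipe, X, y,
    param_name="pf__degree",
    param_range=degrees,
    cv=4,
    scoring=make_scorer(mean_absolute_error),
)

# One row per degree, one column per cross-validation fold.
print(train_scores.shape, valid_scores.shape)
print(valid_scores.mean(axis=1))
```

Each returned array has one row per parameter value and one column per fold, which is why the scores are averaged over `axis=1` before plotting.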



In [None]:
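import time
import matplotlib.pyplot as plt

def plot():
    # Plot the mean train/validation error across folds for each polynomial degree.
    start = time.time()
    print('starting..')
    plt.grid()
    plt.xlabel('Polynomial degree')
    plt.ylabel('Mean absolute error')
    plt.xticks(degrees)
    plt.plot(degrees, v_train_scores.mean(axis=1), label='train')
    plt.plot(degrees, v_valid_scores.mean(axis=1), label='valid')
    plt.legend()  # without this call the 'train'/'valid' labels are never shown
    plt.show()
    print('time taken:', time.time() - start)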


In [None]:
plot()