<a href="https://colab.research.google.com/github/kevingreenman/chemprop-workshop-acs-fall2023/blob/main/chemprop_colab_demo_acs_fall2023_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is intended to be run in Google Colab rather than as a Jupyter notebook on your local machine. Click the "Open In Colab" button above to launch it.

# Setup

In [None]:
!pip install "chemprop<2"  # this notebook uses the Chemprop v1 Python API (chemprop.args, chemprop.train)
!pip install rdkit-pypi  # should be included in above after Chemprop v1.6 release

import chemprop
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.decomposition import PCA

In [None]:
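def plot_parity(y_true, y_pred, y_pred_unc=None):
    """Plot predicted vs. true values, annotated with MAE and RMSE.

    If y_pred_unc is given, it is drawn as vertical error bars.
    """
    # Pad the axis limits by 10% of the range of the true values
    pad = 0.1 * (max(y_true) - min(y_true))
    axmin = min(min(y_true), min(y_pred)) - pad
    axmax = max(max(y_true), max(y_pred)) + pad

    mae = mean_absolute_error(y_true, y_pred)
    # Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
    rmse = mean_squared_error(y_true, y_pred) ** 0.5

    # Dashed diagonal marks perfect agreement between prediction and truth
    plt.plot([axmin, axmax], [axmin, axmax], '--k')

    plt.errorbar(y_true, y_pred, yerr=y_pred_unc, linewidth=0, marker='o', markeredgecolor='w', alpha=1, elinewidth=1)

    plt.xlim((axmin, axmax))
    plt.ylim((axmin, axmax))

    ax = plt.gca()
    ax.set_aspect('equal')

    # Show the error metrics in a rounded box in the upper-left corner
    at = AnchoredText(
        f"MAE = {mae:.2f}\nRMSE = {rmse:.2f}",
        prop=dict(size=10), frameon=True, loc='upper left')
    at.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
    ax.add_artist(at)

    plt.xlabel('True')
    plt.ylabel('Chemprop Predicted')

    plt.show()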

# Get Datasets

Here we download datasets related to critical properties, solvation, and reaction barriers from the following papers:
* Sayandeep Biswas, Yunsie Chung, Josephine Ramirez, Haoyang Wu, and William H. Green. "Predicting Critical Properties and Acentric Factors of Fluids Using Multitask Machine Learning". J. Chem. Inf. Model. 63, 15, 4574–4588 (2023). https://doi.org/10.1021/acs.jcim.3c00546
* Florence H. Vermeire, Yunsie Chung, and William H. Green. "Predicting Solubility Limits of Organic Solutes for a Wide Range of Solvents and Temperatures". J. Am. Chem. Soc. 144, 24, 10785–10797 (2022). https://doi.org/10.1021/jacs.2c01768
* Kevin Spiekermann, Lagnajit Pattanaik, and William H. Green. "High accuracy barrier heights, enthalpies, and rate coefficients for chemical reactions". Sci. Data 9, 417 (2022). https://doi.org/10.1038/s41597-022-01529-6

In [None]:
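# GitHub no longer supports Subversion access, so clone the repository and copy out the data folder
!git clone --depth 1 https://github.com/kevingreenman/chemprop-workshop-acs-fall2023.git
!cp -r chemprop-workshop-acs-fall2023/data data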

# Basic Regression Exercise

In [None]:
critprop_df = pd.read_csv("data/critprop_data_only_smiles_mean_value_expt.csv")
critprop_df

### Training

Fill in the missing arguments to train a regression model on the critical temperature data in `data/critprop_data_only_smiles_mean_value_expt.csv` (the Tc column is called `Tc (K)`). Choose a number of epochs to train for (5-10 should be sufficient for this exercise, though in practice we typically use 50 or more). Save the results in a directory called `test_checkpoints_critprop`.
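If the layout of the `arguments` list is unfamiliar, the sketch below may help. It uses Python's standard `argparse` module as a stand-in (Chemprop defines its own, much larger parser), and the values such as `my_data.csv` are placeholders for illustration, not answers to the exercise. The point is only the pattern: a flag that takes a value is followed by that value as a separate string, while an on/off switch stands alone.

```python
import argparse

# Hypothetical stand-in parser for illustration; this is not Chemprop's parser.
parser = argparse.ArgumentParser()
parser.add_argument("--data_path", type=str)
parser.add_argument("--epochs", type=int)
parser.add_argument("--save_smiles_splits", action="store_true")

# Value-taking flags are followed by their value as a separate string;
# boolean switches such as --save_smiles_splits have no value after them.
example_arguments = [
    "--data_path", "my_data.csv",  # placeholder path, not a workshop file
    "--epochs", "3",               # given as a string, converted to int by the parser
    "--save_smiles_splits",
]

parsed = parser.parse_args(example_arguments)
print(parsed.data_path, parsed.epochs, parsed.save_smiles_splits)
```

In the exercise cells, each empty string `''` marks a value (or, in later cells, a flag) that you are expected to supply.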

In [None]:
arguments = [
    '--data_path', '',
    '--dataset_type', '',
    '--save_dir', '',
    '--epochs', '',
    '--save_smiles_splits',
    '--target_columns', '',
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

### Prediction

Fill in the missing arguments to make predictions on the test set using this trained model. Save the predictions in a file called `test_preds_critprop.csv`. Look at the examples from the previous notebook or navigate through the file tree in the pane on the left to find where the list of SMILES for the test set is stored.
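As an alternative to clicking through the file tree, the files written during training can be listed programmatically. The following is a minimal sketch using only the standard library; `find_csv_files` is a hypothetical helper, and the directory and file names in the demonstration are made up so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def find_csv_files(root):
    """Return all CSV files below `root` as sorted paths relative to it."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run" / "part_0").mkdir(parents=True)
    (Path(tmp) / "run" / "part_0" / "a.csv").write_text("x\n1\n")
    (Path(tmp) / "run" / "notes.txt").write_text("not a csv\n")
    found = find_csv_files(tmp)
    print(found)
```

After training has finished, calling `find_csv_files('test_checkpoints_critprop')` in the notebook should show which CSV files were saved and where.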

In [None]:
arguments = [
    '--test_path', '',
    '--preds_path', '',
    '--checkpoint_dir', ''
]

args = chemprop.args.PredictArgs().parse_args(arguments)
test_preds = chemprop.train.make_predictions(args=args)

### Visualize Results

In [None]:
test_tc_df = pd.read_csv('test_checkpoints_critprop/fold_0/test_full.csv')
test_tc_df['Tc preds (K)'] = [x[0] for x in test_preds]

plot_parity(test_tc_df['Tc (K)'], test_tc_df['Tc preds (K)'])

# Multi-Molecule Exercise

In [None]:
sol_df = pd.read_csv("data/CombiSolu-Exp.csv")
sol_df

### Training

Fill in the missing arguments to train the model on pairs of molecules (a solvent and a solute). The data come from the file `data/CombiSolu-Exp.csv`. Save the results in a directory called `test_checkpoints_multimolecule`. Tell the model how many molecules you are giving it as input for each label. It will also need to know the names of the SMILES and target columns, since the file contains additional columns that we are not using in this exercise.
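Before filling in the column arguments, it can help to list the column names exactly as they appear in the file's header. The sketch below does this with the standard `csv` module on a small in-memory stand-in; the column names in `toy_csv` are invented for illustration and are not those of `CombiSolu-Exp.csv`.

```python
import csv
import io


def read_header(csv_file):
    """Return the column names from the first row of an open CSV file."""
    return next(csv.reader(csv_file))


# Toy stand-in for a real file; these column names are made up.
toy_csv = io.StringIO("mol_a_smiles,mol_b_smiles,extra_info,label\nCCO,O,foo,1.2\n")

columns = read_header(toy_csv)
# Pick out the columns whose names suggest that they hold SMILES strings.
smiles_like = [c for c in columns if "smiles" in c.lower()]

print(columns)
print(smiles_like)
```

For the real dataset, the `sol_df` DataFrame loaded above exposes the same information through `sol_df.columns`.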

In [None]:
arguments = [
    '--data_path', '',
    '--dataset_type', 'regression',
    '--save_dir', '',
    '--epochs', '5',
    '--save_smiles_splits',
    '--number_of_molecules', '',
    '--smiles_columns', '', '',
    '--target_columns', '',
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

### Prediction

Fill in the missing arguments for prediction. As with the training arguments, you must again specify that the model should expect multiple SMILES as input.

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_multimolecule/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_multimolecule.csv',
    '--checkpoint_dir', '',
    '', '',
]

args = chemprop.args.PredictArgs().parse_args(arguments)
test_preds = chemprop.train.make_predictions(args=args)

### Visualize Results

In [None]:
test_sol_df = pd.read_csv('test_checkpoints_multimolecule/fold_0/test_full.csv')
test_sol_df['logS pred'] = [x[0] for x in test_preds]

plot_parity(test_sol_df['experimental_logS [mol/L]'], test_sol_df['logS pred'])

# Reaction Exercise

In [None]:
rxn_df = pd.read_csv("data/wb97xd3.csv")
rxn_df

### Training

Fill in the missing argument to tell the model we want it to train in reaction mode.

In [None]:
arguments = [
    '--data_path', 'data/wb97xd3.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_reaction',
    '--epochs', '5',
    '',
    '--save_smiles_splits',
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

### Prediction

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_reaction/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_reaction.csv',
    '--checkpoint_dir', 'test_checkpoints_reaction'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
test_preds = chemprop.train.make_predictions(args=args)

### Visualize Results

In [None]:
test_rxn_df = pd.read_csv('test_checkpoints_reaction/fold_0/test_full.csv')
test_rxn_df['dE0 preds'] = [x[0] for x in test_preds]

plot_parity(test_rxn_df['dE0'], test_rxn_df['dE0 preds'])