# Chemprop scikit-learn Estimator Example
This notebook demonstrates how to use the classes in chemprop.sklearn_integration.chemprop_estimator with common scikit-learn workflows, including cross-validation, hyperparameter tuning, and model persistence.

In [17]:
from chemprop.sklearn_integration.chemprop_estimator import (
    ChempropMoleculeTransformer,
    ChempropReactionTransformer,
    ChempropMulticomponentTransformer,
    ChempropRegressor,
    ChempropEnsembleRegressor,
)
import numpy as np

# Sample data
X = np.array([
    "CCO", "CCN", "CCC", "COC", "CNC", "CCCl", "CCBr", "CCF", "CCI", "CC=O",
    "CC#N", "CC(C)O", "CC(C)N", "CC(C)C", "COC(C)", "CN(C)C", "C1CCCCC1", "C1=CC=CC=C1",
    "CC(C)(C)O", "CC(C)(C)N", "COCCO", "CCOC(=O)C", "CCN(CC)CC", "CN1CCCC1", "C(CO)N"
])
y = np.array([
    0.50, 0.60, 0.55, 0.58, 0.52, 0.62, 0.65, 0.57, 0.59, 0.61,
    0.56, 0.60, 0.54, 0.53, 0.62, 0.63, 0.45, 0.40,
    0.64, 0.66, 0.59, 0.51, 0.48, 0.46, 0.49
])

## Step 1: Pipeline 
Chain ChempropMoleculeTransformer, ChempropReactionTransformer, or ChempropMulticomponentTransformer with ChempropRegressor in a Pipeline to obtain a single scikit-learn estimator that encapsulates the full Chemprop functionality.

In [18]:
from sklearn.pipeline import Pipeline

mol_pipeline = Pipeline([
    ("featurizer", ChempropMoleculeTransformer(keep_h=True, add_h=True, ignore_stereo=True, reorder_atoms=True, multi_hot_atom_featurizer_mode='ORGANIC')),
    ("regressor", ChempropRegressor(batch_size=8, message_hidden_dim=100, depth=5, ffn_num_layers=2, epochs=30, patience=5))
])

mol_pipeline.fit(X, y)
y_pred = mol_pipeline.predict(X[:5])
print(y_pred)

score = mol_pipeline.score(X[:5], y[:5], metric="mse")  # supported metrics: "mae", "rmse", "mse", "r2", "accuracy"
print(score)

Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
c:\Users\jxl05\miniconda3\envs\chemprop1\Lib\site-packages\lightning\pytorch\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
Loading `train_dataloader` to estimate number of stepping batches.
c:\Users\jxl05\miniconda3\envs\chemprop1\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
c:\Users\jxl05\miniconda3\envs\chemprop1\Lib\site-packages\lightning\pytorch\loops\fit_loop.py:310: The number of training batches (3) is smaller 

Epoch 5: 100%|██████████| 3/3 [00:00<00:00, 21.74it/s, v_num=60, train_loss_step=0.725, train_loss_epoch=0.963]


c:\Users\jxl05\miniconda3\envs\chemprop1\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00, 91.84it/s]
[0.5582642  0.55745083 0.5562975  0.558699   0.5576476 ]
Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00, 139.70it/s]
0.037725003384300466
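
As a sanity check, the score can be recomputed by hand from the five predictions printed above. The following is a minimal plain-Python sketch that makes no Chemprop calls; the values are copied from the printed output, so they carry its rounding.

```python
import math

# Targets for the first five molecules and the predictions printed above.
y_true = [0.50, 0.60, 0.55, 0.58, 0.52]
y_pred = [0.5582642, 0.55745083, 0.5562975, 0.558699, 0.5576476]

# Mean squared error and its square root, computed from the definitions.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"MSE:  {mse:.6f}")
print(f"RMSE: {rmse:.6f}")
```

On these numbers the printed score (about 0.0377) coincides with the RMSE rather than the MSE (about 0.0014), so it may be worth checking how `score` interprets `metric="mse"`.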


Alternatively, pass the path to a CSV data file instead of X/y arrays, and specify the role of each column through the transformer's arguments.

In [14]:
from pathlib import Path
multicomponent_pipeline = Pipeline([
    ("featurizer", ChempropMulticomponentTransformer(
                    smiles_cols="solvent_smiles",
                    rxn_cols="rxn_smiles",
                    target_cols="target",
                )),
    ("regressor", ChempropRegressor(batch_size=8, message_hidden_dim=100, depth=5, ffn_num_layers=2, epochs=30, patience=5))
])

# Raw string: keeps backslashes in the Windows path from being read as escape sequences (e.g. \U).
data_path = Path(r"C:\Users\jxl05\Downloads\chemprop\chemprop\sklearn_integration\example_model_v2_regression_rxn+mol.pt")

multicomponent_pipeline.fit(X=data_path)
score = multicomponent_pipeline.score(X=data_path)
print(score)
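
Windows paths written as ordinary string literals can fail to parse, because `\U` begins a Unicode escape sequence in Python. The sketch below shows three equivalent ways to write such a path safely. The file location is hypothetical and used only for illustration, and `PureWindowsPath` is used so the snippet behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical file location, for illustration only.
raw_literal = PureWindowsPath(r"C:\Users\example\data\reactions.csv")    # raw string
forward_slashes = PureWindowsPath("C:/Users/example/data/reactions.csv")  # forward slashes
from_parts = PureWindowsPath("C:/", "Users", "example", "data", "reactions.csv")  # joined parts

# All three spellings describe the same path.
same_path = raw_literal == forward_slashes == from_parts
print(same_path)
print(raw_literal.name, raw_literal.suffix)
```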

## Cross-Validation and Hyperparameter Tuning with GridSearchCV

In [20]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(mol_pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
print("Cross-validation MSE scores:", -scores)
print("Average MSE:", -scores.mean())

AttributeError: 'ChempropRegressor' object has no attribute 'accelerator'
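
One possible cause of an `AttributeError` like the one above (an assumption, not verified against the Chemprop source) is scikit-learn's estimator convention. `cross_val_score` and `GridSearchCV` clone the estimator for each fold, and cloning reads every `__init__` argument back through `get_params()` as an attribute of the same name. The sketch below uses two hypothetical toy classes, unrelated to Chemprop, to show how an `__init__` argument that is never stored on `self` produces this kind of error on recent scikit-learn versions.

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone


class ToyRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical estimator that breaks the convention."""

    def __init__(self, depth=3, accelerator="auto"):
        self.depth = depth
        # Deliberate bug: `accelerator` is accepted but never stored on self.


class FixedToyRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical estimator that stores every __init__ argument."""

    def __init__(self, depth=3, accelerator="auto"):
        self.depth = depth
        self.accelerator = accelerator


# Cloning the broken estimator fails when get_params() looks up `accelerator`.
try:
    clone(ToyRegressor())
    clone_error = None
except AttributeError as exc:
    clone_error = str(exc)
print(clone_error)

# Cloning the fixed estimator works and preserves its parameters.
cloned = clone(FixedToyRegressor(depth=5))
print(cloned.get_params())
```

If the same pattern applies here, storing each constructor argument unchanged as an attribute of the same name would be the usual remedy.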

In [21]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'regressor__dropout': [0, 0.2],
    'regressor__depth': [3, 6],
    'regressor__ffn_hidden_dim': [300, 1000, 1700, 2400],
    'regressor__ffn_num_layers': [1, 2]
}

grid = GridSearchCV(mol_pipeline, param_grid, cv=3, scoring='neg_mean_squared_error')
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best score (MSE):", -grid.best_score_)

AttributeError: 'ChempropRegressor' object has no attribute 'accelerator'
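
Before running a grid search, it helps to estimate how many models will be trained. The sketch below counts the fits implied by the grid and `cv=3` used above, assuming `GridSearchCV`'s default `refit=True`, which retrains the best candidate once on the full data.

```python
import math

# Same grid and fold count as in the cell above.
param_grid = {
    "regressor__dropout": [0, 0.2],
    "regressor__depth": [3, 6],
    "regressor__ffn_hidden_dim": [300, 1000, 1700, 2400],
    "regressor__ffn_num_layers": [1, 2],
}
cv_folds = 3

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = math.prod(len(values) for values in param_grid.values())
n_cv_fits = n_candidates * cv_folds
n_total_fits = n_cv_fits + 1  # plus one final refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)  # 32 96 97
```

With the pipeline's `epochs=30`, each of those fits is a full training run, so trimming the grid or trying `RandomizedSearchCV` may be worthwhile on larger datasets.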

## Save and Reload Trained Pipeline