This is a minimal example of using AutoMol:
- Training dataset: QM9
- Training features: RDKit features and fingerprints (RDKit features reduced by PCA to 50 dimensions for the MLP)
- Feature to learn: HOMO energy
- Models: RandomForest, GradientBoosting, GaussianProcess, MLP neural network
- Problem: regression
- Dataset location: local
- No cross-validation (CV)
- No custom features

In [1]:
# Reload automol if it was already imported, so edits to the package take effect.
if 'automol' in globals(): import importlib; importlib.reload(automol)

In [2]:
from automol.pipeline import Pipeline

In [4]:
config_yaml = 'qm9_dataset_example_3.yaml'
pipeline = Pipeline(config_yaml)
pipeline.print_spec()

amount: 50
dataset_class: QM9
dataset_location: data/dsgdb9nsd
dataset_split_test_size: 0.1
features:
- fingerprint
- rdkit
label: homo
mlflow_experiment: qm9_dataset_automol_demo
models_filter:
- git_uri: sklearn
  model_names:
  - RandomForestRegressor
  - GradientBoostingRegressor
  - GaussianProcessRegressor
  - MLPRegressor
  whitelist: 1
pca_preprocessing:
  feature: rdkit
  model_name: MLPRegressor
  n_components: 50
problem: regression
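
The spec above sets `amount: 50` and `dataset_split_test_size: 0.1`. Assuming that `amount` is the number of molecules loaded and that the test fraction is rounded up to a whole number of samples (neither is confirmed by the output above), the sketch below shows the resulting split sizes. `split_indices` is a hypothetical helper written for illustration, not part of AutoMol.

```python
import math
import random


def split_indices(n_samples, test_size, seed=0):
    """Shuffle sample indices and split them into train and test lists.

    Hypothetical helper: AutoMol's own splitting logic may differ.
    """
    # Round the test fraction up to a whole number of samples (assumed convention).
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=50, test_size=0.1)
print(len(train_idx), len(test_idx))  # 45 5
```

Under these assumptions only 5 molecules are held out, so the test metrics of a single run are noisy.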

In [5]:
pipeline.train()

Got the feature homo from the current dataset.
Checking if the feature fingerprint can be generated.
Got the generated feature fingerprint.
Running model GaussianProcessRegressor with feature fingerprint.
Running model GradientBoostingRegressor with feature fingerprint.
Running model MLPRegressor with feature fingerprint.
Running model RandomForestRegressor with feature fingerprint.
Checking if the feature rdkit can be generated.
Got the generated feature rdkit.
Got the feature rdkit from the current dataset.
Running model GaussianProcessRegressor with feature rdkit.
Running model GradientBoostingRegressor with feature rdkit.
Running model MLPRegressor with feature rdkit (PCA to 50 dimensions).
Running model RandomForestRegressor(max_depth=3) with feature rdkit.
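
The log above is consistent with one run per feature and model pair, with the models visited in alphabetical order and PCA applied only to the pair named under `pca_preprocessing`. A minimal sketch of that enumeration in plain Python follows. `plan_runs` is a hypothetical helper, and AutoMol's actual scheduling may differ.

```python
from itertools import product

# Values copied from the printed spec.
features = ["fingerprint", "rdkit"]
model_names = [
    "RandomForestRegressor",
    "GradientBoostingRegressor",
    "GaussianProcessRegressor",
    "MLPRegressor",
]
pca = {"feature": "rdkit", "model_name": "MLPRegressor", "n_components": 50}


def plan_runs(features, model_names, pca):
    """Return (feature, model, n_components or None) for every planned run.

    Hypothetical helper: it only mirrors the order seen in the log.
    """
    runs = []
    for feature, model in product(features, sorted(model_names)):
        use_pca = feature == pca["feature"] and model == pca["model_name"]
        runs.append((feature, model, pca["n_components"] if use_pca else None))
    return runs


runs = plan_runs(features, model_names, pca)
for index, (feature, model, n_components) in enumerate(runs):
    suffix = f" (PCA to {n_components} dimensions)" if n_components else ""
    print(f"{index}: {model} with feature {feature}{suffix}")
```

The positions 0 to 7 printed by this sketch line up with the index values of the statistics table in the last cell.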

In [10]:
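import pandas as pd

def highlight_min(s):
    """Return a yellow background style for the smallest value in a column."""
    is_min = s == s.min()
    return ['background-color: yellow' if v else '' for v in is_min]

# Rank the runs by test MAE, best first.
s = pipeline.get_statistics().sort_values(by='test_mae', ascending=True)
# Highlight the column-wise minimum in the columns at positions 2, 3, 5 and 6.
s.style.apply(highlight_min, subset=pd.IndexSlice[s.columns[[2,3,5,6]]])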

| | model | feature | training_mae | training_mse | training_r2_score | test_mae | test_mse | test_r2_score | training_rmse | training_score | best_cv_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | RandomForestRegressor | fingerprint | 0.009295 | 0.000113 | 0.80999 | 0.008575 | 0.000104 | 0.010463 | 0.010619 | 0.80999 | 0.171034 |
| 1 | GradientBoostingRegressor | fingerprint | 0.000229 | 0.0 | 0.999874 | 0.011162 | 0.000139 | -0.320845 | 0.000274 | 0.999874 | |
| 7 | RandomForestRegressor(max_depth=3) | rdkit | 0.009212 | 0.000124 | 0.777068 | 0.012526 | 0.000194 | 0.622973 | 0.011118 | 0.777068 | |
| 5 | GradientBoostingRegressor | rdkit | 7.5e-05 | 0.0 | 0.999986 | 0.016 | 0.000302 | 0.411685 | 8.9e-05 | 0.999986 | |
| 2 | MLPRegressor | fingerprint | 0.010704 | 0.000212 | 0.642314 | 0.163719 | 0.027596 | -260.519901 | 0.01457 | 0.642314 | |
| 4 | GaussianProcessRegressor | rdkit | 0.0 | 0.0 | 1.0 | 0.2335 | 0.055036 | -106.133884 | 0.0 | 1.0 | |
| 0 | GaussianProcessRegressor | fingerprint | 0.0 | 0.0 | 1.0 | 0.24068 | 0.058032 | -548.957393 | 0.0 | 1.0 | |
| 6 | MLPRegressor | rdkit | 2.163408 | 7.515038 | -13552.890777 | 2.847118 | 12.192142 | -23732.418242 | 2.741357 | -13552.890777 | |
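
Several test R2 scores above are strongly negative. R2 compares the squared error of the model with that of always predicting the mean of the true values, so it drops below zero whenever the model does worse than that baseline. The sketch below computes MAE, MSE, RMSE and R2 in plain Python using the standard textbook definitions. It is an illustration only: `regression_metrics` is a hypothetical helper, the numbers are made up, and how AutoMol computes its statistics is not shown in this notebook.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from two equally long lists of floats.

    Assumes y_true is not constant, otherwise R2 is undefined.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # mean baseline's squared error
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up values: the predictions are reversed, so the model is worse than the mean.
metrics = regression_metrics([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(metrics["r2"])  # -3.0
```

The same definitions are consistent with the table: `training_rmse` matches the square root of `training_mse` up to rounding, and a perfect fit on the training set gives a training R2 of 1.0.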
