# Train a Simple Regression Model

The process of training a machine learning (ML) model can be thought of as fitting a
highly parameterized function to map inputs to outputs. An ML algorithm needs to train on
numerous examples of input and output pairs to accurately map an input to an output,
i.e., make a prediction. After training, the result is referred to as a trained ML model or an artifact.

This tutorial will detail how we can use **[AMPL](https://github.com/ATOMScience-org/AMPL)** tools to train a regression model to predict 
how much a compound will inhibit the **[SLC6A3](https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL238/)** protein as measured by $pK_i$. 
We will train a random forest model using the following inputs:

1. The curated **[SLC6A3](https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL238)** dataset from **Tutorial 1, "Data Curation"**.
2. The split file generated in **Tutorial 2, "Splitting Datasets for Validation and Testing"**.
3. **[RDKit](https://github.com/rdkit/rdkit)** features calculated by the **[AMPL](https://github.com/ATOMScience-org/AMPL)** pipeline.

The tutorial will present the following functions and classes:

* [ModelPipeline](https://ampl.readthedocs.io/en/latest/pipeline.html#module-pipeline.model_pipeline)
* [parameter_parser.wrapper](https://ampl.readthedocs.io/en/latest/pipeline.html#pipeline.parameter_parser.wrapper)
* [compare_models.get_filesystem_perf_results](https://ampl.readthedocs.io/en/latest/pipeline.html#pipeline.compare_models.get_filesystem_perf_results)

We will explain the use of descriptors, how to evaluate model performance,
and where the model is saved as a .tar.gz file.

> **Note**: *Training a random forest model and splitting the dataset are non-deterministic. 
You will obtain a slightly different random forest model by running this tutorial each time.*

## Model Training (Split Data and Train)

**[AMPL](https://github.com/ATOMScience-org/AMPL)** also provides an option to split a dataset and train a model in one step, by setting the `previously_split` parameter to `False` and omitting the `split_uuid` parameter. 
**[AMPL](https://github.com/ATOMScience-org/AMPL)** splits the data by the type of split specified in the `splitter` parameter, 
scaffold in this example, and writes the split file to
`dataset/SLC6A3_Ki_curated_train_valid_test_scaffold_{split_uuid}.csv`.

Although it's convenient, it is not a good idea to use the one-step option if you intend to train multiple models with different parameters on the same dataset and compare their performance. If you do, you will end up with different splits for each model, and won't be able to tell if the differences in performance are due to the parameter settings or to the random variations between splits.
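One way to keep the split fixed across several models is to reuse a single `split_uuid`. The sketch below only builds parameter dictionaries and does not call AMPL. It assumes the two parameters named above (`previously_split` and `split_uuid`) are set as shown. The helper `with_fixed_split`, the use of `rf_estimators` as the varied setting, and the UUID string are placeholders chosen for illustration.

```python
# Hypothetical helper: reuse one existing split for several training runs.
def with_fixed_split(params, split_uuid):
    """Return a copy of params that points at a previously generated split."""
    fixed = dict(params)
    fixed["previously_split"] = "True"   # use an existing split file
    fixed["split_uuid"] = split_uuid     # identifies which split file to reuse
    return fixed

base_params = {
    "prediction_type": "regression",
    "splitter": "random",
    "model_type": "RF",
    "previously_split": "False",
}

# Placeholder value; substitute the split_uuid of your own split file.
shared_split_uuid = "<your-split-uuid>"

# Assumed hyperparameter name (rf_estimators); the values are arbitrary.
param_sets = [
    with_fixed_split({**base_params, "rf_estimators": n}, shared_split_uuid)
    for n in (100, 500)
]
```

Because every dictionary in `param_sets` carries the same `split_uuid`, any difference in performance between the resulting models would come from the varied setting rather than from a different split.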

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
from atomsci.ddm.pipeline import model_pipeline as mp
from atomsci.ddm.pipeline import parameter_parser as parse



In [11]:
import os

# Collect the names of the CSV files in the current directory
curated_cdks = []
for file in os.listdir("./"):
    if file.endswith(".csv"):
        curated_cdks.append(file)

In [4]:

# Directory where the trained model and its results are written
odir = "models/"
params = {
        # Dataset and column settings
        "prediction_type": "regression",
        "dataset_key": "Single_protein/curated_cdk_1.csv",
        "id_col": "compound_id",
        "smiles_col": "base_rdkit_smiles",
        "response_cols": "avg_pIC50",

        # Split the dataset as part of this run (one-step option)
        "previously_split": "False",
        "split_only": "False",
        "splitter": "random",
        "split_strategy" : "train_valid_test",

        # Featurization and model settings
        "featurizer": "computed_descriptors",
        "descriptor_type" : "rdkit_raw",
        "model_type": "RF",
        "transformers": "True",
        "rerun": "True",
        "result_dir": odir
    }

# Parse the parameters, build the pipeline, and train the model
ampl_param = parse.wrapper(params)
pl = mp.ModelPipeline(ampl_param)
pl.train_model()



## Performance of the Model
We evaluate model performance by measuring how accurate 
models are on validation and test sets. 
The validation set is used while optimizing the model and choosing the best
parameter settings. Finally, we use the model's performance on the test set to judge the model.
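The selection protocol described above can be sketched with a toy table. The column names mirror those in the AMPL performance table, but the models and scores are invented purely for illustration and are not results from any real run.

```python
import pandas as pd

# Toy values, invented for illustration only; not results from any real model.
toy_perf = pd.DataFrame({
    "model": ["A", "B", "C"],
    "best_valid_r2_score": [0.55, 0.61, 0.48],
    "best_test_r2_score": [0.52, 0.57, 0.50],
})

# Step 1: choose the model using the validation score only.
chosen = toy_perf.sort_values("best_valid_r2_score", ascending=False).iloc[0]

# Step 2: report the chosen model's test score as its final assessment.
final_test_r2 = chosen["best_test_r2_score"]
print(chosen["model"], final_test_r2)
```

The test score is read only after the choice is made, so it is not used to pick the model.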

**[AMPL](https://github.com/ATOMScience-org/AMPL)** has several popular metrics to evaluate regression models: 
**Mean Absolute Error (MAE)**, **Root Mean Squared Error (RMSE)**, and $R^2$ (R-Squared).
In our tutorials, we will use the $R^2$ metric to compare our models. The best model will have the highest
$R^2$ score.
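For reference, the three metrics can be computed by hand with NumPy. This is a minimal sketch using the standard textbook definitions; AMPL's own implementations may differ in detail, and the function name `regression_metrics` and the sample values are made up for this example.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))                 # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Illustrative values only
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Lower MAE and RMSE indicate smaller errors, while a higher $R^2$ indicates that the predictions explain more of the variance in the measured values.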


In [50]:
perf_df.columns.values

array(['model_uuid', 'model_path', 'ampl_version', 'model_type',
       'dataset_key', 'features', 'splitter', 'split_strategy',
       'split_uuid', 'model_score_type', 'feature_transform_type',
       'weight_transform_type', 'model_choice_score',
       'best_train_r2_score', 'best_train_rms_score',
       'best_train_mae_score', 'best_train_num_compounds',
       'best_valid_r2_score', 'best_valid_rms_score',
       'best_valid_mae_score', 'best_valid_num_compounds',
       'best_test_r2_score', 'best_test_rms_score', 'best_test_mae_score',
       'best_test_num_compounds', 'max_epochs', 'best_epoch',
       'learning_rate', 'layer_sizes', 'dropouts', 'rf_estimators',
       'rf_max_features', 'rf_max_depth', 'xgb_gamma',
       'xgb_learning_rate', 'xgb_max_depth', 'xgb_colsample_bytree',
       'xgb_subsample', 'xgb_n_estimators', 'xgb_min_child_weight',
       'model_parameters_dict', 'feat_parameters_dict'], dtype=object)

In [7]:
# Model Performance
import os
from atomsci.ddm.pipeline import compare_models as cm

odir = "models/"

# Collect performance metrics for every model saved under odir
perf_df = cm.get_filesystem_perf_results(odir, pred_type='regression')

# Save the full performance table
perf_df.to_csv(os.path.join(odir, 'perf_df.csv'))

# Show the most useful columns, sorted with the best validation R^2 first
perf_df[['model_uuid', "split_uuid", 'model_type', 'splitter', 'best_valid_r2_score', 'best_train_r2_score']].sort_values(by="best_valid_r2_score", ascending=False)

Found data for 15 models under models/


| | model_uuid | split_uuid | model_type | splitter | best_valid_r2_score | best_train_r2_score |
|---|---|---|---|---|---|---|
| 5 | 96d385b9-4450-4a5a-8735-bc9324e3afbf | bfd30f7c-f9fc-4d67-99c0-6c7eafccc8c0 | RF | random | 0.612848 | 0.940891 |
| 9 | f3b39149-4268-40c4-9e45-a1a9d0c85d4d | 03b80b96-36b7-4e3a-8263-f500c3ba49ba | RF | random | 0.592412 | 0.935752 |
| 6 | ab837d7f-4c35-4029-91e7-301184eeb89b | 815aed32-aec6-415c-9574-d298fd4bdd76 | RF | random | 0.578656 | 0.939304 |
| 14 | 8c572243-2064-47cc-b9ed-5cd2e34b9119 | f9be6f68-6590-4151-8484-ae4ebc6ddae0 | RF | random | 0.565689 | 0.943218 |
| 8 | eded78be-90c6-4076-8b70-e5569795bd58 | 43a0194a-f87f-4035-ac4e-c93d1ab2f7a2 | RF | random | 0.540056 | 0.942331 |
| 7 | b9f174ba-e2a7-4139-974d-8b71f1899508 | ba5f8462-c47c-4a31-98f2-73f399e222b7 | RF | random | 0.499006 | 0.943851 |
| 0 | 5f309549-3a32-472e-97bd-be53a71642d6 | 6511638d-f8f9-49a8-aedb-f5fbcbef3366 | NN | random | 0.405756 | 0.506961 |
| 11 | a5eaef75-8ad4-44ed-a9f8-2c3d5eaeb020 | 01dbe363-5d0e-449e-b674-172195134d28 | RF | scaffold | 0.324142 | 0.948949 |
| 10 | 12648c3a-a6ed-41d0-90a7-f2b2567c5df5 | 913e2293-ac11-4b26-bcdc-db1113e1e59c | RF | scaffold | 0.3211 | 0.948791 |
| 2 | 6eaa93ac-f80a-46b4-a45a-bb6355a1e375 | 2a90ebd5-39c4-45a0-8a1e-38344d696ccb | RF | fingerprint | 0.24084 | 0.950351 |


## Finding the Top Performing Model
To pick the top performing model, we sort the performance table by `best_valid_r2_score` in descending order and examine the top row. 

In [49]:
# Top performing model
top_model = perf_df.sort_values(by="best_valid_r2_score", ascending=False).iloc[0]
top_model

model_uuid                               96d385b9-4450-4a5a-8735-bc9324e3afbf
model_path                  models/CDK_1_all_curated_model_96d385b9-4450-4...
ampl_version                                                            1.6.1
model_type                                                                 RF
dataset_key                 c:\Desktop\Hackathon\Single_protein\CDK_1_all_...
features                                                            rdkit_raw
splitter                                                               random
split_strategy                                               train_valid_test
split_uuid                               bfd30f7c-f9fc-4d67-99c0-6c7eafccc8c0
model_score_type                                                           r2
feature_transform_type                                          normalization
weight_transform_type                                                    None
model_choice_score                                              

You can find the path to the .tar.gz file ("tarball") where the top performing model is saved by examining `top_model.model_path`. You will need this path to run predictions with the model at a later time.

In [7]:
# Top performing model path
top_model.model_path

'models/CDK_1_all_curated_model_96d385b9-4450-4a5a-8735-bc9324e3afbf.tar.gz'
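Before using a saved model path later, it can be useful to confirm that the file is a readable archive. The sketch below uses only Python's standard `tarfile` module and builds a throwaway archive for the demonstration; it makes no assumption about what an AMPL model tarball actually contains. The helper name `list_tarball_members` is hypothetical.

```python
import os
import tarfile
import tempfile

def list_tarball_members(path):
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()

# Self-contained demo: build a throwaway archive instead of a real AMPL model.
with tempfile.TemporaryDirectory() as tmp:
    payload = os.path.join(tmp, "example.txt")
    with open(payload, "w") as f:
        f.write("placeholder")

    demo_tarball = os.path.join(tmp, "demo_model.tar.gz")
    with tarfile.open(demo_tarball, "w:gz") as tar:
        tar.add(payload, arcname="example.txt")

    # Check that the file is a valid tar archive, then list its contents.
    is_archive = tarfile.is_tarfile(demo_tarball)
    members = list_tarball_members(demo_tarball)

print(is_archive, members)
```

With a real model, you would pass `top_model.model_path` to `list_tarball_members` in place of the demo archive.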

In **Tutorial 4, "Application of a Trained Model"**, we will learn how to use a selected model to make predictions and evaluate those predictions.

If you have specific feedback about a tutorial, please complete the **[AMPL Tutorial Evaluation](https://forms.gle/pa9sHj4MHbS5zG7A6)**.