Merge pull request #210 from chemprop/imputation
Enable target imputation for sklearn multitask models
hesther committed Sep 26, 2021
2 parents 10e8472 + d3a35c1 commit 91c1b1b
Showing 3 changed files with 87 additions and 5 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -296,7 +296,7 @@ Certain portions of the model can be loaded from a previous model and frozen so

When training multitask models (models which predict more than one target simultaneously), not all target values are necessarily known for every molecule in the dataset. Chemprop automatically handles missing entries by masking out the respective values in the loss function, so that partial data can still be utilized. The loss function is rescaled according to the number of non-missing values, and missing values do not contribute to validation or test errors. Training on partial data is therefore possible and encouraged (rather than removing datapoints with missing target entries). No keyword is needed for this behavior; it is the default.
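The masking idea can be illustrated with a minimal sketch (illustrative only, not Chemprop's actual implementation): entries whose target is unknown are excluded from the loss, and the loss is averaged over the non-missing entries only.

```python
def masked_mse(preds, targets):
    """Mean squared error over non-missing (non-None) target entries."""
    errors = [
        (p - t) ** 2
        for pred_row, target_row in zip(preds, targets)
        for p, t in zip(pred_row, target_row)
        if t is not None  # mask out missing targets
    ]
    # Rescale by the number of non-missing values only
    return sum(errors) / len(errors)

# Two molecules, two tasks; the second target of molecule 1 is unknown
preds = [[1.0, 2.0], [3.0, 4.0]]
targets = [[1.5, None], [3.0, 5.0]]
print(masked_mse(preds, targets))  # averages over the 3 known entries
```

Because the denominator counts only known targets, a molecule with partially missing targets still contributes its known values at full weight.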

In contrast, when using `sklearn_train.py` (a utility script provided within Chemprop that trains standard models such as random forests on Morgan fingerprints via the python package scikit-learn), multi-task models cannot be trained on datasets with partially missing targets. However, one can instead train individual models for each task (via the argument `--single_task`), where missing values are automatically removed from the dataset. Thus, the training still makes use of all non-missing values, but by training individual models for each task instead of one model with multiple output values. This restriction only applies to sklearn models (via :code:`sklearn_train` or :code:`python sklearn_train.py`), but NOT to default Chemprop models via `chemprop_train` or `python train.py`. Alternatively, missing target values can be imputed by specifying `--impute_mode <single_task/linear/median/mean/frequent>`. The option `single_task` trains a single-task sklearn model on each task to predict its missing values and is computationally expensive. The option `linear` trains a stochastic gradient descent linear model on each task to predict its missing targets. Both `single_task` and `linear` are applicable to regression and classification tasks. For regression tasks, the options `median` and `mean` impute missing values with the median and mean of the non-missing training targets, respectively. For classification tasks, `frequent` imputes the most frequent value for each task. For all options, models are fitted to the non-missing training targets and used to predict the missing training targets. The test set is not affected by imputation.
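What the simple statistics-based modes compute per task can be sketched with the standard library alone (illustrative, not Chemprop's code): fit a summary statistic on the non-missing training targets of one task, then fill that task's missing entries with it.

```python
from statistics import mean, median, mode

def impute_column(targets, strategy):
    """Fill None entries of one task's target column with a summary statistic."""
    known = [t for t in targets if t is not None]
    if strategy == 'mean':        # regression
        fill = mean(known)
    elif strategy == 'median':    # regression
        fill = median(known)
    elif strategy == 'frequent':  # classification: most frequent class
        fill = mode(known)
    else:
        raise ValueError(f'Unknown strategy: {strategy}')
    return [fill if t is None else t for t in targets]

print(impute_column([1.0, None, 3.0, 2.0], 'median'))  # [1.0, 2.0, 3.0, 2.0]
print(impute_column([0, 1, None, 1], 'frequent'))      # [0, 1, 1, 1]
```

The `single_task` and `linear` modes follow the same fit-on-known, predict-missing pattern, but replace the summary statistic with a fitted model.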

### Weighted Loss Functions in Training

2 changes: 2 additions & 0 deletions chemprop/args.py
@@ -749,6 +749,8 @@ class SklearnTrainArgs(TrainArgs):
    """Number of bits in morgan fingerprint."""
    num_trees: int = 500
    """Number of random forest trees."""
    impute_mode: Literal['single_task', 'median', 'mean', 'linear', 'frequent'] = None
    """How to impute missing data (None means no imputation)."""


class SklearnPredictArgs(Tap):
88 changes: 84 additions & 4 deletions chemprop/sklearn_train.py
@@ -2,11 +2,12 @@
import os
import pickle
from typing import Dict, List, Union
from pprint import pformat
from copy import deepcopy

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import SGDClassifier, SGDRegressor
from tqdm import trange, tqdm

from chemprop.args import SklearnTrainArgs
@@ -55,6 +56,76 @@ def predict(model: Union[RandomForestRegressor, RandomForestClassifier, SVR, SVC

    return preds

def impute_sklearn(model: Union[RandomForestRegressor, RandomForestClassifier, SVR, SVC],
                   train_data: MoleculeDataset,
                   args: SklearnTrainArgs,
                   logger: Logger = None,
                   threshold: float = 0.5) -> List[List[float]]:
    """
    Imputes missing target values in the training data, using the imputation mode
    specified in :code:`args.impute_mode`.

    :param model: The scikit-learn model used to predict missing targets when
                  :code:`args.impute_mode == 'single_task'`.
    :param train_data: The training data.
    :param args: A :class:`~chemprop.args.SklearnTrainArgs` object containing arguments for
                 training the scikit-learn model.
    :param logger: A logger to record output.
    :param threshold: Threshold for binarizing imputed classification targets.
    :return: A list of lists of target values with missing entries imputed.
    """
    num_tasks = train_data.num_tasks()
    new_targets = deepcopy(train_data.targets())

    if logger is not None:
        debug = logger.debug
    else:
        debug = print

    debug('Imputation')

    for task_num in trange(num_tasks):
        impute_train_features = [features for features, targets in zip(train_data.features(), train_data.targets())
                                 if targets[task_num] is None]
        if len(impute_train_features) > 0:
            train_features, train_targets = zip(*[(features, targets[task_num])
                                                  for features, targets in zip(train_data.features(), train_data.targets())
                                                  if targets[task_num] is not None])
            if args.impute_mode == 'single_task':
                model.fit(train_features, train_targets)
                impute_train_preds = predict(
                    model=model,
                    model_type=args.model_type,
                    dataset_type=args.dataset_type,
                    features=impute_train_features
                )
                impute_train_preds = [pred[0] for pred in impute_train_preds]
            elif args.impute_mode == 'median' and args.dataset_type == 'regression':
                impute_train_preds = [np.median(train_targets)] * len(impute_train_features)
            elif args.impute_mode == 'mean' and args.dataset_type == 'regression':
                impute_train_preds = [np.mean(train_targets)] * len(impute_train_features)
            elif args.impute_mode == 'frequent' and args.dataset_type == 'classification':
                impute_train_preds = [np.argmax(np.bincount(train_targets))] * len(impute_train_features)
            elif args.impute_mode == 'linear' and args.dataset_type == 'regression':
                reg = SGDRegressor(alpha=0.01).fit(train_features, train_targets)
                impute_train_preds = reg.predict(impute_train_features)
            elif args.impute_mode == 'linear' and args.dataset_type == 'classification':
                cls = SGDClassifier().fit(train_features, train_targets)
                impute_train_preds = cls.predict(impute_train_features)
            else:
                raise ValueError('Invalid combination of imputation mode and dataset type.')

            # Replace missing targets with the imputed predictions, in order
            ctr = 0
            for i in range(len(new_targets)):
                if new_targets[i][task_num] is None:
                    value = impute_train_preds[ctr]
                    if args.dataset_type == 'classification':
                        value = int(value > threshold)
                    new_targets[i][task_num] = value
                    ctr += 1

    return new_targets
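The replacement step at the end of `impute_sklearn` consumes predictions in the order the missing entries appear, binarizing classification predictions against the threshold. A standalone sketch of that pattern on hypothetical toy data (plain lists rather than Chemprop objects):

```python
def fill_missing(column, preds, classification=False, threshold=0.5):
    """Replace None entries of a target column with predictions, in order."""
    filled, ctr = [], 0
    for t in column:
        if t is None:
            value = preds[ctr]
            if classification:
                value = int(value > threshold)  # binarize imputed class labels
            filled.append(value)
            ctr += 1
        else:
            filled.append(t)
    return filled

print(fill_missing([0, None, 1, None], [0.9, 0.2], classification=True))
# -> [0, 1, 1, 0]
```

Known targets are left untouched; only the `None` slots receive (binarized) predictions, which is why the counter advances only inside the missing-value branch.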


def single_task_sklearn(model: Union[RandomForestRegressor, RandomForestClassifier, SVR, SVC],
train_data: MoleculeDataset,
@@ -136,6 +207,17 @@ def multi_task_sklearn(model: Union[RandomForestRegressor, RandomForestClassifie
    num_tasks = train_data.num_tasks()

    train_targets = train_data.targets()

    if args.impute_mode:
        train_targets = impute_sklearn(model=model,
                                       train_data=train_data,
                                       args=args,
                                       logger=logger)
    elif any(None in sublist for sublist in train_targets):
        raise ValueError('Missing target values not tolerated for multi-task sklearn models. '
                         'Use either --single_task to train multiple single-task models or impute '
                         'targets via --impute_mode <single_task/linear/median/mean/frequent>.')

    if train_data.num_tasks() == 1:
        train_targets = [targets[0] for targets in train_targets]

@@ -182,8 +264,6 @@ def run_sklearn(args: SklearnTrainArgs,
    else:
        debug = info = print

    debug(pformat(vars(args)))

    debug('Loading data')
    data = get_data(path=args.data_path,
                    smiles_columns=args.smiles_columns,
@@ -237,7 +317,7 @@ def run_sklearn(args: SklearnTrainArgs,
        raise ValueError(f'Model type "{args.model_type}" not supported')
    elif args.dataset_type == 'classification':
        if args.model_type == 'random_forest':
            model = RandomForestClassifier(n_estimators=args.num_trees, n_jobs=-1, class_weight=args.class_weight, random_state=args.seed)
        elif args.model_type == 'svm':
            model = SVC()
        else:
