Merge pull request #189 from chemprop/mask_normalization
Normalize data and target weight values. Add errors for negative weight.

Removed pytest for macos-latest-3.8
cjmcgill committed Jul 22, 2021
2 parents bd7b17e + 5d2a98b commit 1423154
Showing 4 changed files with 18 additions and 2 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/tests.yml
@@ -17,6 +17,10 @@ jobs:
      matrix:
        os: [ubuntu-latest, macos-latest] # TODO: fix windows permissions issues and add windows-latest
        python-version: [3.6, 3.7, 3.8]
        exclude:
          # excludes Python 3.8 on macOS
          - os: macos-latest
            python-version: 3.8

    steps:
    - uses: actions/checkout@v2
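To make the effect of the `exclude` entry concrete, the following is a rough sketch of which OS and Python combinations remain in the test matrix. It is a simplification, not how GitHub Actions evaluates a matrix internally: it only drops combinations that match an exclude entry exactly, and the variable names are made up for illustration.

```python
from itertools import product

# Values copied from the workflow matrix; versions kept as strings for display.
os_names = ['ubuntu-latest', 'macos-latest']
python_versions = ['3.6', '3.7', '3.8']
excluded = [{'os': 'macos-latest', 'python-version': '3.8'}]

# Build every OS/version pair, then drop pairs listed under `exclude`.
jobs = [
    {'os': os_name, 'python-version': version}
    for os_name, version in product(os_names, python_versions)
    if {'os': os_name, 'python-version': version} not in excluded
]

for job in jobs:
    print(job['os'], job['python-version'])
print(len(jobs))  # 5 of the 6 combinations remain
```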
4 changes: 2 additions & 2 deletions README.md
@@ -272,9 +272,9 @@ In contrast, when using `sklearn_train.py` (a utility script provided within Che

By default, each task in multitask training and each provided datapoint are weighted equally for training. Weights can be specified in either case to allow some tasks in training or some specified data points to be weighted more heavily than others in the training of the model.

-Using the `--target_weights` argument followed by a list of numbers equal in length to the number of tasks in multitask training, different tasks can be given more weight in parameter updates during training. For instance, in a multitask training with two tasks, the argument `--target_weights 1 2` would give the second task twice as much weight in model parameter updates.
+Using the `--target_weights` argument followed by a list of numbers equal in length to the number of tasks in multitask training, different tasks can be given more weight in parameter updates during training. For instance, in a multitask training with two tasks, the argument `--target_weights 1 2` would give the second task twice as much weight in model parameter updates. Provided weights must be non-negative. Values are normalized to make the average weight equal 1. Target weights are not used with the validation set for the determination of early stopping or in evaluation of the test set.

-Using the `--data_weights_path` argument followed by a path to a data file containing weights will allow each individual datapoint in the training data to be given different weight in parameter updates. Formatting of this file is similar to provided features CSV files: they should contain only a single column with one header row and a numerical value in each row that corresponds to the order of datapoints provided with `--data_path`. Data weights do not need to be provided for validation or test sets if they are provided through the arguments `--separate_test_path` or `--separate_val_path`.
+Using the `--data_weights_path` argument followed by a path to a data file containing weights will allow each individual datapoint in the training data to be given different weight in parameter updates. Formatting of this file is similar to provided features CSV files: they should contain only a single column with one header row and a numerical value in each row that corresponds to the order of datapoints provided with `--data_path`. Data weights should not be provided for validation or test sets if they are provided through the arguments `--separate_test_path` or `--separate_val_path`. Provided weights must be non-negative. Values are normalized to make the average weight equal 1. Data weights are not used with the validation set for the determination of early stopping or in evaluation of the test set.

### Caching

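The README's statement that values are "normalized to make the average weight equal 1" amounts to dividing each weight by the mean of all weights. A minimal sketch of that arithmetic for the `--target_weights 1 2` example follows; `normalize_to_unit_mean` is a hypothetical helper written for illustration, not a Chemprop function.

```python
from typing import List


def normalize_to_unit_mean(weights: List[float]) -> List[float]:
    """Rescale weights so that their arithmetic mean is 1 (hypothetical helper)."""
    avg_weight = sum(weights) / len(weights)
    return [w / avg_weight for w in weights]


# `--target_weights 1 2`: the mean is 1.5, so each weight is divided by 1.5.
normalized = normalize_to_unit_mean([1, 2])
print(normalized)  # roughly [0.667, 1.333]; the 1:2 ratio is preserved
print(sum(normalized) / len(normalized))  # mean is 1 up to floating-point rounding
```

Because only the ratios between weights survive normalization, `--target_weights 1 2` and `--target_weights 10 20` would be expected to behave identically.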
7 changes: 7 additions & 0 deletions chemprop/args.py
@@ -601,6 +601,13 @@ def process_args(self) -> None:
        if not self.bond_feature_scaling and self.bond_features_path is None:
            raise ValueError('Bond descriptor scaling is only possible if additional bond features are provided.')

        # normalize target weights
        if self.target_weights is not None:
            avg_weight = sum(self.target_weights)/len(self.target_weights)
            self.target_weights = [w/avg_weight for w in self.target_weights]
            if min(self.target_weights) < 0:
                raise ValueError('Provided target weights must be non-negative.')


class PredictArgs(CommonArgs):
    """:class:`PredictArgs` includes :class:`CommonArgs` along with additional arguments used for predicting with a Chemprop model."""
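One consequence of the ordering in the `process_args` hunk above is that the non-negativity check runs on the already-normalized values. If every weight is negative, the average is negative too, the division flips all the signs, and the check passes. If the weights sum to zero, the division raises `ZeroDivisionError` before the check is reached. The sketch below contrasts that order with a validate-first variant; both functions are hypothetical standalone helpers written for illustration, not Chemprop code.

```python
from typing import List


def normalize_target_weights(weights: List[float]) -> List[float]:
    """Validate first, then rescale to mean 1 (hypothetical helper, not Chemprop code)."""
    if not weights:
        raise ValueError('At least one target weight is required.')
    if min(weights) < 0:
        raise ValueError('Provided target weights must be non-negative.')
    total = sum(weights)
    if total == 0:
        raise ValueError('At least one target weight must be positive.')
    avg_weight = total / len(weights)
    return [w / avg_weight for w in weights]


def committed_order(weights: List[float]) -> List[float]:
    """Same order of operations as the hunk: normalize, then check."""
    avg_weight = sum(weights) / len(weights)
    normalized = [w / avg_weight for w in weights]
    if min(normalized) < 0:
        raise ValueError('Provided target weights must be non-negative.')
    return normalized


print(committed_order([-1, -3]))  # all-negative input passes: signs flip to [0.5, 1.5]
try:
    normalize_target_weights([-1, -3])
except ValueError as err:
    print(err)  # rejected before any division happens
```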
5 changes: 5 additions & 0 deletions chemprop/data/utils.py
@@ -111,6 +111,11 @@ def get_data_weights(path: str) -> List[float]:
        next(reader) #skip header row
        for line in reader:
            weights.append(float(line[0]))
    # normalize the data weights
    avg_weight=sum(weights)/len(weights)
    weights = [w/avg_weight for w in weights]
    if min(weights) < 0:
        raise ValueError('Data weights must be non-negative for each datapoint.')
    return weights


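For reference, a small end-to-end sketch of the file format that `--data_weights_path` expects, as described in the README change: one header row, then a single numeric column with one value per datapoint. The function `read_data_weights`, the header name `data_weights`, and the file name are assumptions made for this example. The function is an illustrative stand-in for `get_data_weights`, and it checks for negative values before dividing, which differs from the order in the hunk.

```python
import csv
import os
import tempfile
from typing import List


def read_data_weights(path: str) -> List[float]:
    """Read a single-column weights CSV with one header row (illustrative stand-in)."""
    weights = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip header row
        for line in reader:
            weights.append(float(line[0]))
    if min(weights) < 0:
        raise ValueError('Data weights must be non-negative for each datapoint.')
    avg_weight = sum(weights) / len(weights)
    return [w / avg_weight for w in weights]


# Hypothetical weights file: one header row, then one value per datapoint,
# in the same order as the rows of the file given to --data_path.
with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, 'weights.csv')
    with open(weights_path, 'w', newline='') as f:
        f.write('data_weights\n1\n1\n2\n4\n')
    data_weights = read_data_weights(weights_path)

print(data_weights)  # [0.5, 0.5, 1.0, 2.0]
```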