Merge pull request #189 from chemprop/mask_normalization

Normalize data and target weight values. Add errors for negative weights. Remove pytest for macos-latest with Python 3.8.
chemprop · Jul 22, 2021 · 1423154
2 parents bd7b17e + 5d2a98b

Showing 4 changed files with 18 additions and 2 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -17,6 +17,10 @@ jobs:
       matrix:
         os: [ubuntu-latest, macos-latest]  # TODO: fix windows permissions issues and add windows-latest
         python-version: [3.6, 3.7, 3.8]
+        exclude:
+        # excludes Python 3.8 on macOS
+          - os: macos-latest
+            python-version: 3.8
 
     steps:
     - uses: actions/checkout@v2

diff --git a/README.md b/README.md
@@ -272,9 +272,9 @@ In contrast, when using `sklearn_train.py` (a utility script provided within Che
 
 By default, each task in multitask training and each provided datapoint are weighted equally for training. Weights can be specified in either case to allow some tasks in training or some specified data points to be weighted more heavily than others in the training of the model.
 
-Using the `--target_weights` argument followed by a list of numbers equal in length to the number of tasks in multitask training, different tasks can be given more weight in parameter updates during training. For instance, in a multitask training with two tasks, the argument `--target_weights 1 2` would give the second task twice as much weight in model parameter updates.
+Using the `--target_weights` argument followed by a list of numbers equal in length to the number of tasks in multitask training, different tasks can be given more weight in parameter updates during training. For instance, in a multitask training with two tasks, the argument `--target_weights 1 2` would give the second task twice as much weight in model parameter updates. Provided weights must be non-negative. They are normalized so that the average weight equals 1. Target weights apply only to training: they are not used on the validation set when determining early stopping, nor when evaluating the test set.
 
-Using the `--data_weights_path` argument followed by a path to a data file containing weights will allow each individual datapoint in the training data to be given different weight in parameter updates. Formatting of this file is similar to provided features CSV files: they should contain only a single column with one header row and a numerical value in each row that corresponds to the order of datapoints provided with `--data_path`. Data weights do not need to be provided for validation or test sets if they are provided through the arguments `--separate_test_path` or `--separate_val_path`.
+Using the `--data_weights_path` argument followed by a path to a data file containing weights will allow each individual datapoint in the training data to be given different weight in parameter updates. Formatting of this file is similar to provided features CSV files: they should contain only a single column with one header row and a numerical value in each row that corresponds to the order of datapoints provided with `--data_path`. Data weights should not be provided for validation or test sets when those sets are supplied through the arguments `--separate_test_path` or `--separate_val_path`. Provided weights must be non-negative. They are normalized so that the average weight equals 1. Data weights apply only to training: they are not used on the validation set when determining early stopping, nor when evaluating the test set.
 
 ### Caching
 

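The data weights file format described in the README hunk above (a single column, one header row, one numeric value per datapoint in `--data_path` order) can be illustrated with a minimal sketch. The helper names `write_weights_csv` and `read_weights_csv`, the header text `data_weight`, and the file name `weights.csv` are assumptions made for illustration and are not part of Chemprop; the reader only mirrors the parsing visible in the `get_data_weights` hunk (skip the header, take the first column as a float).

```python
import csv
import os
import tempfile
from typing import List


def write_weights_csv(path: str, weights: List[float], header: str = "data_weight") -> None:
    """Write one header row followed by one weight per row (hypothetical helper)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        for w in weights:
            writer.writerow([w])


def read_weights_csv(path: str) -> List[float]:
    """Read the weights back: skip the header row, parse the first column as float."""
    weights = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header row
        for line in reader:
            weights.append(float(line[0]))
    return weights


# Row order must match the order of datapoints in the training data file.
tmp_dir = tempfile.mkdtemp()
weights_path = os.path.join(tmp_dir, "weights.csv")
write_weights_csv(weights_path, [1.0, 2.0, 0.5])
loaded = read_weights_csv(weights_path)
```

A file written this way would then be passed with `--data_weights_path`.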
diff --git a/chemprop/args.py b/chemprop/args.py
@@ -601,6 +601,13 @@ def process_args(self) -> None:
         if not self.bond_feature_scaling and self.bond_features_path is None:
             raise ValueError('Bond descriptor scaling is only possible if additional bond features are provided.')
 
+        # validate target weights, then normalize them to an average of 1
+        if self.target_weights is not None:
+            if min(self.target_weights) < 0:
+                raise ValueError('Provided target weights must be non-negative.')
+            avg_weight = sum(self.target_weights)/len(self.target_weights)
+            self.target_weights = [w/avg_weight for w in self.target_weights]
+
 
 class PredictArgs(CommonArgs):
     """:class:`PredictArgs` includes :class:`CommonArgs` along with additional arguments used for predicting with a Chemprop model."""

diff --git a/chemprop/data/utils.py b/chemprop/data/utils.py
@@ -111,6 +111,11 @@ def get_data_weights(path: str) -> List[float]:
         next(reader) #skip header row
         for line in reader:
             weights.append(float(line[0]))
+    # validate the data weights, then normalize them to an average of 1
+    if min(weights) < 0:
+        raise ValueError('Data weights must be non-negative for each datapoint.')
+    avg_weight = sum(weights)/len(weights)
+    weights = [w/avg_weight for w in weights]
     return weights
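
The normalization applied to both target weights and data weights can be sketched as a standalone helper. This is a minimal illustration rather than the code in this PR: the name `normalize_weights` is hypothetical, and the checks for empty and all-zero input are assumptions added here to avoid a division by zero. The sign check runs on the raw values before dividing, because dividing by a negative average would flip the signs and hide negative inputs.

```python
from typing import List


def normalize_weights(weights: List[float]) -> List[float]:
    """Scale non-negative weights so that their average equals 1 (hypothetical helper)."""
    if not weights:
        raise ValueError("At least one weight is required.")
    # Check the raw values first: a negative average would flip signs after division.
    if min(weights) < 0:
        raise ValueError("Weights must be non-negative.")
    total = sum(weights)
    if total == 0:
        # Assumption beyond the diff: all-zero weights cannot be normalized.
        raise ValueError("At least one weight must be positive.")
    avg_weight = total / len(weights)
    return [w / avg_weight for w in weights]


# The README example `--target_weights 1 2`: the second task keeps twice the weight
# of the first, and the two normalized weights average to 1.
normalized = normalize_weights([1, 2])
```

Because only the ratios between weights survive normalization, `1 2` and `10 20` would give the same result.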