Merge pull request #208 from chemprop/parallel_hyperopt
Add checkpoints for hyperparameter optimization, allowing parallel operation and restarting.
cjmcgill committed Sep 23, 2021
2 parents 9c8ff40 + 87ce645 commit 10e8472
Showing 6 changed files with 304 additions and 23 deletions.
17 changes: 14 additions & 3 deletions README.md
@@ -196,11 +196,22 @@ To train an ensemble, specify the number of models in the ensemble with `--ensem

Although the default message passing architecture works quite well on a variety of datasets, optimizing the hyperparameters for a particular dataset often leads to marked improvement in predictive performance. We have automated hyperparameter optimization via Bayesian optimization (using the [hyperopt](https://github.com/hyperopt/hyperopt) package), which will find the optimal hidden size, depth, dropout, and number of feed-forward layers for our model. Optimization can be run as follows:
```
chemprop_hyperopt --data_path <data_path> --dataset_type <type> --num_iters <n> --config_save_path <config_path>
chemprop_hyperopt --data_path <data_path> --dataset_type <type> --num_iters <int> --config_save_path <config_path>
```
where `<n>` is the number of hyperparameter settings to try and `<config_path>` is the path to a `.json` file where the optimal hyperparameters will be saved.
where `<int>` is the number of hyperparameter trial configurations to try and `<config_path>` is the path to a `.json` file where the optimal hyperparameters will be saved. If installed from source, `chemprop_hyperopt` can be replaced with `python hyperparameter_optimization.py`. Additional training arguments can also be supplied when the job is submitted, and they will be applied to all training iterations in the search (`--epochs`, `--aggregation`, `--num_folds`, `--gpu`, `--seed`, etc.). The argument `--log_dir <dir_path>` can optionally be provided to set a location for the hyperparameter optimization log.
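
As a sketch (all file and directory names here are placeholders, not part of the repository), a single optimization job that also fixes the training length and sets a log location might look like:
```
chemprop_hyperopt --data_path data.csv --dataset_type regression --num_iters 20 --config_save_path optimized_config.json --epochs 30 --num_folds 3 --log_dir hyperopt_log
```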

If installed from source, `chemprop_hyperopt` can be replaced with `python hyperparameter_optimization.py`.
Results of completed trial configurations will be stored there and may serve as checkpoints for other instances of hyperparameter optimization if the directory for hyperopt checkpoint files has been specified, `--hyperopt_checkpoint_dir <path>`. If `--hyperopt_checkpoint_dir` is not specified, then checkpoints will default to being stored with the hyperparameter optimization log. Interrupted hyperparameter optimizations can be restarted by specifying the same directory. Previously completed hyperparameter optimizations can be used as the starting point for new optimizations with a larger selected number of iterations. Note that the `--num_iters <int>` argument will count all previous checkpoints saved in the directory towards the total number of iterations, and if the existing number of checkpoints exceeds this argument then no new trials will be carried out.
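
As a sketch of the restart behavior described above (directory and file names are placeholders), an interrupted or completed search can be continued with a larger trial budget simply by reusing the same checkpoint directory:
```
chemprop_hyperopt --data_path data.csv --dataset_type regression --num_iters 50 --config_save_path optimized_config.json --hyperopt_checkpoint_dir hyperopt_checkpoints
```
Because each new instance reloads every checkpoint saved in the directory, the earlier trials count toward the 50 requested iterations.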

Manual training instances outside of hyperparameter optimization may also be considered in the history of attempted trials. The save directories (`save_dir`) of these training instances can be specified with `--manual_trial_dirs <list-of-directories>`. These directories must contain the files `test_scores.csv` and `args.json` as generated during training. To work appropriately, these training instances must be consistent with the parameter space being searched in hyperparameter optimization (including the hyperparameter optimization default of `ffn_hidden_size` being set equal to `hidden_size`). Manual trials considered with this argument are not added to the checkpoint directory.
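
For instance (directory names are placeholders), previously completed training runs can be folded into the trial history like so, assuming the directories are given as a space-separated list and each contains the required `test_scores.csv` and `args.json`:
```
chemprop_hyperopt --data_path data.csv --dataset_type regression --num_iters 20 --config_save_path optimized_config.json --manual_trial_dirs manual_run_1 manual_run_2
```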

As part of the hyperopt search algorithm, the first trial configurations for the model will be randomly spread through the search space. The number of randomized trials can be altered with the argument `--startup_random_iters <int, default=10>`. After this number of trial iterations has been carried out, subsequent trials will use the directed search algorithm to select parameter configurations. This startup count considers the total number of trials in the checkpoint directory rather than the number that has been carried out by an individual instance of hyperparameter optimization.
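
For example (values illustrative), a search that spends more of its budget on random exploration before the directed search takes over could be launched as:
```
chemprop_hyperopt --data_path data.csv --dataset_type regression --num_iters 30 --config_save_path optimized_config.json --startup_random_iters 15
```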

Parallel instances of hyperparameter optimization that share a checkpoint directory will have access to the shared results of hyperparameter optimization trials, allowing them to reach the desired total number of iterations collectively more quickly. In this way multiple GPUs or other computing resources can be applied to the search. Each instance of hyperparameter optimization is unaware of parallel trials that have not yet completed. This has several implications when running `n` parallel instances (see the example launch commands after this list):
* A parallel search will have different information and search different parameters than a single instance sequential search.
* New trials will not consider the parameters in currently running trials, in rare cases leading to duplication.
* Up to `n-1` extra random search iterations may occur above the number specified with `--startup_random_iters`.
* Up to `n-1` extra total trials will be run above the chosen `num_iters`, though each instance will be exposed to at least that number of iterations.
* The last parallel instance to complete is the only one that is aware of all the trials when reporting results.
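
As a sketch of the parallel workflow (paths, GPU indices, and trial counts are illustrative), two instances sharing a checkpoint directory could be launched as separate jobs:
```
chemprop_hyperopt --data_path data.csv --dataset_type regression --num_iters 40 --config_save_path optimized_config.json --hyperopt_checkpoint_dir hyperopt_checkpoints --gpu 0
chemprop_hyperopt --data_path data.csv --dataset_type regression --num_iters 40 --config_save_path optimized_config.json --hyperopt_checkpoint_dir hyperopt_checkpoints --gpu 1
```
Each instance counts the shared checkpoints toward its own `--num_iters 40`, so together they stop near 40 total trials, subject to the caveats above.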

Once hyperparameter optimization is complete, the optimal hyperparameters can be applied during training by specifying the config path as follows:
```
2 changes: 2 additions & 0 deletions chemprop/__init__.py
@@ -13,5 +13,7 @@
import chemprop.rdkit
import chemprop.sklearn_predict
import chemprop.sklearn_train
import chemprop.spectra_utils
import chemprop.hyperopt_utils

from chemprop._version import __version__
24 changes: 21 additions & 3 deletions chemprop/args.py
@@ -500,7 +500,7 @@ def bond_feature_scaling(self) -> bool:
def process_args(self) -> None:
super(TrainArgs, self).process_args()

global temp_dir # Prevents the temporary directory from being deleted upon function return
global temp_save_dir # Prevents the temporary directory from being deleted upon function return

# Process SMILES columns
self.smiles_columns = chemprop.data.utils.preprocess_smiles_columns(
@@ -518,8 +518,8 @@ def process_args(self) -> None:

# Create temporary directory as save directory if not provided
if self.save_dir is None:
temp_dir = TemporaryDirectory()
self.save_dir = temp_dir.name
temp_save_dir = TemporaryDirectory()
self.save_dir = temp_save_dir.name

# Fix ensemble size if loading checkpoints
if self.checkpoint_paths is not None and len(self.checkpoint_paths) > 0:
@@ -714,6 +714,24 @@ class HyperoptArgs(TrainArgs):
"""Path to :code:`.json` file where best hyperparameter settings will be written."""
log_dir: str = None
"""(Optional) Path to a directory where all results of the hyperparameter optimization will be written."""
hyperopt_checkpoint_dir: str = None
"""Path to a directory where hyperopt completed trial data is stored. Hyperopt job will include these trials if restarted.
Can also be used to run multiple instances in parallel if they share the same checkpoint directory."""
startup_random_iters: int = 10
"""The initial number of trials that will be randomly specified before TPE algorithm is used to select the rest."""
manual_trial_dirs: List[str] = None
"""Paths to save directories for manually trained models in the same search space as the hyperparameter search.
Results will be considered as part of the trial history of the hyperparameter search."""


def process_args(self) -> None:
super(HyperoptArgs, self).process_args()

# Assign log and checkpoint directories if none provided
if self.log_dir is None:
self.log_dir = self.save_dir
if self.hyperopt_checkpoint_dir is None:
self.hyperopt_checkpoint_dir = self.log_dir


class SklearnTrainArgs(TrainArgs):
1 change: 1 addition & 0 deletions chemprop/constants.py
@@ -5,3 +5,4 @@
# Save file names
MODEL_FILE_NAME = 'model.pt'
TEST_SCORES_FILE_NAME = 'test_scores.csv'
HYPEROPT_SEED_FILE_NAME = 'hyperopt_seeds.txt'
205 changes: 205 additions & 0 deletions chemprop/hyperopt_utils.py
@@ -0,0 +1,205 @@
from chemprop.args import HyperoptArgs
import os
import pickle
from typing import List, Dict
import csv
import json

from hyperopt import Trials

from chemprop.constants import HYPEROPT_SEED_FILE_NAME
from chemprop.utils import makedirs

def merge_trials(trials: Trials, new_trials_data: List[Dict]) -> Trials:
"""
Merge a hyperopt trials object with the contents of another hyperopt trials object.
:param trials: A hyperopt trials object containing trials data, organized into hierarchical dictionaries.
    :param new_trials_data: The contents of a hyperopt trials object, `Trials.trials`.
:return: A hyperopt trials object, merged from the two inputs.
"""
max_tid = 0
if len(trials.trials) > 0:
max_tid = max([trial['tid'] for trial in trials.trials])

for trial in new_trials_data:
        tid = trial['tid'] + max_tid + 1  # Trial id needs to be unique among this list of ids.
hyperopt_trial = Trials().new_trial_docs(
tids=[None],
specs=[None],
results=[None],
miscs=[None])
hyperopt_trial[0] = trial
hyperopt_trial[0]['tid'] = tid
hyperopt_trial[0]['misc']['tid'] = tid
for key in hyperopt_trial[0]['misc']['idxs'].keys():
hyperopt_trial[0]['misc']['idxs'][key] = [tid]
trials.insert_trial_docs(hyperopt_trial)
trials.refresh()
return trials


def load_trials(dir_path: str, previous_trials: Trials = None) -> Trials:
"""
Load in trials from each pickle file in the hyperopt checkpoint directory.
Checkpoints are newly loaded in at each iteration to allow for parallel entries
into the checkpoint folder by independent hyperoptimization instances.
:param dir_path: Path to the directory containing hyperopt checkpoint files.
:param previous_trials: Any previously generated trials objects that the loaded trials will be merged with.
:return: A trials object containing the merged trials from all checkpoint files.
"""

# List out all the pickle files in the hyperopt checkpoint directory
hyperopt_checkpoint_files = [os.path.join(dir_path, path) for path in os.listdir(dir_path) if '.pkl' in path]

# Load hyperopt trials object from each file
loaded_trials = Trials()
if previous_trials is not None:
loaded_trials = merge_trials(loaded_trials, previous_trials.trials)

for path in hyperopt_checkpoint_files:
        with open(path, 'rb') as f:
trial = pickle.load(f)
loaded_trials = merge_trials(loaded_trials, trial.trials)

return loaded_trials


def save_trials(dir_path: str, trials: Trials, hyperopt_seed: int) -> None:
"""
Saves hyperopt trial data as a `.pkl` file.
    :param dir_path: Path to the directory containing hyperopt checkpoint files.
    :param trials: A trials object containing information on a completed hyperopt iteration.
    :param hyperopt_seed: The seed assigned to this trial, used to name the checkpoint file.
    """
new_fname = f'{hyperopt_seed}.pkl'
existing_files = os.listdir(dir_path)
if new_fname in existing_files:
raise ValueError(f'When saving trial with unique seed {hyperopt_seed}, found that a trial with this seed already exists.')
pickle.dump(trials, open(os.path.join(dir_path, new_fname), 'wb'))


def get_hyperopt_seed(seed: int, dir_path: str) -> int:
"""
Assigns a seed for hyperopt calculations. Each iteration will start with a different seed.
:param seed: The initial attempted hyperopt seed.
:param dir_path: Path to the directory containing hyperopt checkpoint files.
:return: An integer for use as hyperopt random seed.
"""

    seed_path = os.path.join(dir_path, HYPEROPT_SEED_FILE_NAME)

seeds = []
if os.path.exists(seed_path):
with open(seed_path, 'r') as f:
seed_line = next(f)
seeds.extend(seed_line.split())
else:
makedirs(seed_path, isfile=True)

seeds = [int(sd) for sd in seeds]

while seed in seeds:
seed += 1
seeds.append(seed)

write_line = " ".join(map(str, seeds)) + '\n'

with open(seed_path, 'w') as f:
f.write(write_line)

return seed


def load_manual_trials(manual_trials_dirs: List[str], param_keys: List[str], hyperopt_args: HyperoptArgs) -> Trials:
"""
Function for loading in manual training runs as trials for inclusion in hyperparameter search.
Trials must be consistent in all arguments with trials that would be generated in hyperparameter optimization.
    :param manual_trials_dirs: A list of paths to save directories for the manual trials, each of which must contain `test_scores.csv` and `args.json`.
:param param_keys: A list of the parameters included in the hyperparameter optimization.
:param hyperopt_args: The arguments for the hyperparameter optimization job.
:return: A hyperopt trials object including all the loaded manual trials.
"""
    matching_args = [  # Manual trials must occupy the same space as the hyperparameter optimization search. This is a non-exhaustive list of arguments checked for consistency.
'number_of_molecules',
'aggregation',
'num_folds',
'ensemble_size',
'max_lr',
'init_lr',
'final_lr',
'activation',
'metric',
'bias',
'epochs',
'explicit_h',
'reaction',
'split_type',
'warmup_epochs',
]

manual_trials_data = []
for i, trial_dir in enumerate(manual_trials_dirs):

# Extract trial data from test_scores.csv
with open(os.path.join(trial_dir, 'test_scores.csv')) as f:
            reader = csv.reader(f)
            next(reader)
            read_line = next(reader)
mean_score = float(read_line[1])
std_score = float(read_line[2])
loss = (1 if hyperopt_args.minimize_score else -1) * mean_score

# Extract argument data from args.json
with open(os.path.join(trial_dir, 'args.json')) as f:
trial_args = json.load(f)

# Check for differences in manual trials and hyperopt space
if 'hidden_size' in param_keys:
if trial_args['hidden_size'] != trial_args['ffn_hidden_size']:
raise ValueError(f'The manual trial in {trial_dir} has a hidden_size {trial_args["hidden_size"]} '
f'that does not match its ffn_hidden_size {trial_args["ffn_hidden_size"]}, as it would in hyperparameter search.')
for arg in matching_args:
if arg not in param_keys:
                if getattr(hyperopt_args, arg) != trial_args[arg]:
raise ValueError(f'Manual trial {trial_dir} has different training argument {arg} than the hyperparameter optimization search trials.')

# Construct data dict
param_dict = {key: trial_args[key] for key in param_keys}
vals_dict = {key: [param_dict[key]] for key in param_keys}
idxs_dict = {key: [i] for key in param_keys}
results_dict = {
'loss': loss,
'status': 'ok',
'mean_score': mean_score,
'std_score': std_score,
'hyperparams': param_dict,
'num_params': 0,
}
misc_dict = {
'tid': i,
'cmd': ('domain_attachment', 'FMinIter_Domain'),
'workdir': None,
'idxs': idxs_dict,
'vals': vals_dict,
}
trial_data = {
'state': 2,
'tid': i,
'spec': None,
'result': results_dict,
'misc': misc_dict,
'exp_key': None,
'owner': None,
'version': 0,
'book_time': None,
'refresh_time': None,
}
manual_trials_data.append(trial_data)

trials = Trials()
trials = merge_trials(trials=trials, new_trials_data=manual_trials_data)
return trials
