Introduce distributed Adaptive API (#326)
This PR adds a distributed Adaptive API to Texar-PyTorch with the help of AdaptDL.

`examples/bert/bert_classifier_adaptive.py` is the adaptive version of `examples/bert/bert_classifier_main.py` and demonstrates the use of the above API. It can be trained on a cluster by running `examples/bert/run_bert_adaptive.sh` after setting up an AdaptDL Kubernetes cluster or a microk8s environment.
odp committed Nov 12, 2020
1 parent f016043 commit a061b45
Showing 11 changed files with 552 additions and 28 deletions.
32 changes: 32 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,32 @@
# Copyright 2019 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


FROM python:3.7-slim
WORKDIR /root

FROM pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime

COPY . texar-pytorch
WORKDIR texar-pytorch

RUN python3 setup.py bdist_wheel
ARG TEXAR_VERSION=0.0.0
RUN TEXAR_VERSION=${TEXAR_VERSION} pip install dist/*.whl
RUN pip install -r requirements.txt

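# Note: the adaptdl requirement below is quoted so that the shell inside RUN
# does not treat ">=" as an output redirection.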
RUN pip install tensorflow 'adaptdl>=0.2.4' tensorboard
RUN rm -rf dist

ENV PYTHONUNBUFFERED=true
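
For reference, one plausible way to build this image from the repository root is shown below; the tag `texar-adaptive` and the version value are just illustrative choices, not names used by the PR:

```commandline
docker build -f docker/Dockerfile -t texar-adaptive --build-arg TEXAR_VERSION=0.0.0 .
```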
75 changes: 51 additions & 24 deletions examples/bert/README.md
@@ -15,12 +15,9 @@ To summarize, this example showcases:
* Building and fine-tuning on downstream tasks
* Use of Texar `RecordData` module for data loading and processing
* Use of Texar `Executor` module for simplified training loops and TensorBoard visualization
* Use of the [Hyperopt](https://github.com/hyperopt/hyperopt) library to tune hyperparameters with the
`Executor` module

Future work:

* Train or fine-tune the model with distributed GPU
* Adaptive distributed training using AdaptDL

## Prerequisite

@@ -50,7 +47,7 @@ By default, it will download the MRPC dataset into the `data` directory. FYI, the MRPC dataset is part of the GLUE dataset collection.

We first preprocess the downloaded raw data into [pickled](https://docs.python.org/3/library/pickle.html) files. The
preprocessing step tokenizes raw text with BPE encoding, truncates sequences, adds special tokens, etc. Run the
following command to this end:

```bash
python prepare_data.py --task=MRPC \
@@ -62,7 +59,7 @@ python prepare_data.py --task=MRPC \
- `--task`: Specifies the dataset name to preprocess. BERT provides default support for
`{'CoLA', 'MNLI', 'MRPC', 'XNLI', 'SST'}` data.
- `--max-seq-length`: The maximum length of a sequence. This includes the BERT special tokens that will be automatically added.
Longer sequences will be trimmed.
- `--pretrained-model-name`: The name of a pre-trained model to load selected in the list of: `bert-base-uncased`,
`bert-large-uncased`, `bert-base-cased`, `bert-large-cased`, `bert-base-multilingual-uncased`,
`bert-base-multilingual-cased`, and `bert-base-chinese`.
@@ -88,8 +85,8 @@ python prepare_data.py --task=MRPC \
```
**Note that** the data info `num_classes` and `num_train_data`, as well as `max_seq_length` specified in the command,
are required for BERT training in the following. They should be specified in the data configuration file passed to
BERT training (see below).

- For convenience, the above command automatically writes `num_classes`, `num_train_data` and `max_seq_length` to
`config_data.py`.

Expand All @@ -114,7 +111,7 @@ Here:
- `--output-dir`: The output path where checkpoints are saved.
- `--pretrained-model-name`: The name of a pre-trained model to load selected in the list of: `bert-base-uncased`,
`bert-large-uncased`, `bert-base-cased`, `bert-large-cased`, `bert-base-multilingual-uncased`,
`bert-base-multilingual-cased`, and `bert-base-chinese`.

After convergence, the evaluation performance is around the following. Due to certain randomness (e.g., random
initialization of the classification layer), the evaluation accuracy is reasonable as long as it's `>0.84`.
@@ -181,7 +178,7 @@ the `train_metrics` and `valid_metrics` will be logged into TensorBoard. To run

```commandline
python bert_classifier_using_executor_main.py --do-train --do-test
```

If the logs are in the `runs/` folder, the TensorBoard server can be started with the following command:

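For example (assuming the default `runs/` log directory; the exact command was collapsed in this diff view):

```commandline
tensorboard --logdir runs/
```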
@@ -202,7 +199,7 @@ To run this example, please install `hyperopt` by issuing the following command
```
pip install hyperopt
```

`bert_with_hypertuning_main.py` shows an example of how to tune hyperparameters with Executor using `hyperopt`.
To run this example, run the following command

```commandline
python bert_with_hypertuning_main.py
```

@@ -213,7 +210,7 @@ In this simple example, the hyperparameters to be tuned are provided as a `dict` in
`bert_hypertuning_config_classifier.py`, which is fed into `objective_func()`. We use the TPE
(Tree-structured Parzen Estimator) algorithm for tuning the hyperparameters (provided by the `hyperopt`
library). The example runs for 3 trials to find the best hyperparameter settings. The final model is
saved in the `output_dir` provided by the user. More information about the library can be
found at [Hyperopt](https://github.com/hyperopt/hyperopt).
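
For orientation, a minimal, self-contained sketch of the `hyperopt` TPE workflow is shown below; the search space and the dummy objective are illustrative stand-ins, not the ones defined in `bert_hypertuning_config_classifier.py`:

```python
from hyperopt import Trials, fmin, hp, tpe

# Illustrative search space -- the actual example defines its own in
# bert_hypertuning_config_classifier.py.
space = {"lr": hp.loguniform("lr", -12, -6)}

def objective(params):
    # In the real example this would fine-tune BERT with `params` and return a
    # value to minimize (e.g. 1 - validation accuracy). Here we return a dummy
    # value so the snippet runs on its own.
    return abs(params["lr"] - 2e-5)

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=3, trials=Trials())
print(best)  # best hyperparameters found over the 3 trials
```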

### Hyperparameter tuning with Neural Network Intelligence (NNI)
@@ -224,19 +221,49 @@ To run this example, please install `nni` by issuing the following command
```
python -m pip install --upgrade nni
```

The script used for NNI hyperparameter tuning is `bert_executor_hypertuning_nni.py`. In this
simple example, the hyperparameters to be tuned are provided as a `search_space.json` file; refer to
this [link](https://nni.readthedocs.io/en/latest/Tutorial/QuickStart.html) for how to include additional
hyperparameters to tune in the JSON file. We prepare two configuration
YAML files, `config_tuner.yml` and `config_advisor.yml`, for using built-in NNI tuners and
advisors, respectively. Some built-in advisors need to be installed; please refer to
this [link](https://nni.readthedocs.io/en/latest/Tuner/BuiltinTuner.html) for how to install them if you need to
use them. In the configuration file, you can modify the maximum number of trials, the maximum running
duration, and some other arguments (e.g. maximum or minimum values). In order to run this example, run the
following command
```
nnictl create --config config_tuner.yml --port 9009
```
The port can be set to any available port id. Then you can use the Web UI URLs printed in your
terminal to monitor the auto-tuning progress on the WebUI. More information about NNI can be
found at [NNI](https://nni.readthedocs.io/en/latest/index.html).
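
For reference, an NNI search space file is a small JSON document along the following lines; the parameter names and ranges here are purely illustrative, not the contents of the example's `search_space.json`:

```json
{
    "lr": {"_type": "loguniform", "_value": [1e-5, 1e-3]},
    "hidden_dropout": {"_type": "uniform", "_value": [0.0, 0.3]}
}
```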

## Adaptive distributed training using AdaptDL


A version of the BERT example, `bert_classifier_adaptive.py`, which uses the
`texar.torch.distributed` Adaptive API, can be run on a Kubernetes cluster with
the AdaptDL scheduler. With the help of AdaptDL, the classifier can be trained on a
cluster with multiple replicas in a data-parallel fashion. The number of replicas
is decided automatically by the AdaptDL scheduler. Instructions for setting up
an AdaptDL cluster can be found
[here](https://adaptdl.readthedocs.io/en/latest/installation/index.html).

Once the cluster is set up and ready, the BERT AdaptDL job can be run with
```commandline
./run_bert_adaptive.sh
```
Parameters like the job name, number of replicas, etc. can be changed by modifying
the embedded job manifest in the file `run_bert_adaptive.sh`. Moreover, the
AdaptDL trainer API works locally (without a cluster) by default with a
single replica. This can be used for testing changes locally before they are
run on a cluster. For single-replica training you can directly run the code as
shown below.
```commandline
python bert_classifier_adaptive.py --do-train --do-eval \
--config-downstream=config_classifier \
--config-data=config_data \
--output-dir=output
```
See [here](https://adaptdl.readthedocs.io/en/latest/standalone-training.html)
for full documentation on how to train the model in standalone mode.
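
For orientation, the core loop that AdaptDL expects looks roughly like the sketch below. This is a generic AdaptDL pattern with a toy model and random data, not the actual code of `bert_classifier_adaptive.py` or of the `texar.torch.distributed` wrapper added in this PR:

```python
import adaptdl.torch as adl
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

# Toy model and data so that the sketch is self-contained; the real example
# builds a BERT classifier and Texar RecordData datasets instead.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

adl.init_process_group("gloo")  # use "nccl" on a GPU cluster
model = adl.AdaptiveDataParallel(model, optimizer)
loader = adl.AdaptiveDataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

loss_fn = nn.CrossEntropyLoss()
for epoch in adl.remaining_epochs_until(3):  # resumes at the right epoch after rescaling
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
```

When the AdaptDL scheduler changes the number of replicas, the job is checkpointed and restarted, and the adaptive wrappers above let training resume from where it left off.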
