
Commit

fix typos in desc and docs
ymcui committed Mar 23, 2020
1 parent 7b313d8 commit 6d74898
Showing 7 changed files with 29 additions and 29 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -20,7 +20,7 @@
</a>
</p>

**TextBrewer** is a PyTorch-based model distillation toolkit for natural language processing. It includes various distillation techniques from both NLP and CV field, and provides an easy-to-use distillation framework, which allows users to quickly experiment with the state-of-the-art distillation methods to compress the model with a relatively small sacrifice in the performance, increasing the inference speed and reducing the memory usage.
**TextBrewer** is a PyTorch-based model distillation toolkit for natural language processing. It includes various distillation techniques from both NLP and CV field and provides an easy-to-use distillation framework, which allows users to quickly experiment with the state-of-the-art distillation methods to compress the model with a relatively small sacrifice in the performance, increasing the inference speed and reducing the memory usage.

Paper: [https://arxiv.org/abs/2002.12620](https://arxiv.org/abs/2002.12620)

@@ -124,8 +124,8 @@ See [API documentation](API.md) for detailed usages.

* **Stage 1**: Preparation:
1. Train the teacher model
2. Define and intialize the student model
3. Construct a dataloader, an optimizer and a learning rate scheduler
2. Define and initialize the student model
3. Construct a dataloader, an optimizer, and a learning rate scheduler

* **Stage 2**: Distillation with TextBrewer:
1. Construct a **TrainingConfig** and a **DistillationConfig**, initialize a **distiller**
@@ -158,7 +158,7 @@ _ = textbrewer.utils.display_parameters(student_model,max_level=3)

# Define an adaptor for translating the model inputs and outputs
def simple_adaptor(batch, model_outputs):
# The second and third elements of model outputs are the logits and hidden states
# The second and third elements of model outputs are the logits and hidden states
return {'logits': model_outputs[1],
'hidden': model_outputs[2]}
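
Below is a hedged sketch of how an adaptor like `simple_adaptor` is then wired into a distiller, following the workflow this README describes. `teacher_model`, `student_model`, `dataloader`, and `optimizer` are assumed to come from Stage 1, the temperature and layer mapping are illustrative, and exact signatures may differ between TextBrewer versions:

```python
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Training settings (device, output/log directories, checkpoint frequency, ...)
train_config = TrainingConfig()

# Distillation settings: soften the logits and match hidden states of
# selected teacher/student layers (illustrative mapping).
distill_config = DistillationConfig(
    temperature=4,
    intermediate_matches=[
        {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}])

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor)

# Start distillation; argument order (optimizer, scheduler, dataloader, ...)
# may vary across versions, so verify against the API documentation.
with distiller:
    distiller.train(optimizer, None, dataloader, num_epochs=30, callback=None)
```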

@@ -199,7 +199,7 @@ We have performed distillation experiments on several typical English and Chines
* For English tasks, the teacher model is [**BERT-base-cased**](https://github.com/google-research/bert).
* For Chinese tasks, the teacher model is [**RoBERTa-wwm-ext**](https://github.com/ymcui/Chinese-BERT-wwm) released by the Joint Laboratory of HIT and iFLYTEK Research.

We have tested different student models. To compare with public results, the student models are built with standard transformer blocks except BiGRU which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but does not include the output layer of the each specific task.
We have tested different student models. To compare with public results, the student models are built with standard transformer blocks except for BiGRU which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but does not include the output layer of each specific task.

| Model | \#Layers | Hidden_size | Feed-forward size | \#Params | Relative size |
| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |
@@ -276,7 +276,7 @@ Our results:

**Note**:

1. The equivlent model architectures of public models are shown in the brackets.
1. The equivalent model architectures of public models are shown in the brackets.
2. When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD and HotpotQA is used for data augmentation on CoNLL-2003.


@@ -332,24 +332,24 @@ In TextBrewer, there are two functions that should be implemented by users: **ca

#### **Callback**

At each checkpoint, after saving the student model, the callback function will be called by the distiller. Callback can be used to evaluate the performance of the student model at each checkpoint.
At each checkpoint, after saving the student model, the callback function will be called by the distiller. A callback can be used to evaluate the performance of the student model at each checkpoint.

#### Adaptor
It converts the model inputs and outputs to the specified format so that they could be recognized by the distiller, and distillation losses can be computed. At each training step, batch and model outputs will be passed to the adaptor; adaptor re-organize the data and returns a dictionary.
It converts the model inputs and outputs to the specified format so that they could be recognized by the distiller, and distillation losses can be computed. At each training step, batch and model outputs will be passed to the adaptor; the adaptor re-organizes the data and returns a dictionary.

Fore more details, see the explanations in [API documentation](API.md)
For more details, see the explanations in [API documentation](API.md)
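
A minimal sketch of such a callback, assuming the distiller invokes it with `model` (the current student) and `step` keyword arguments at each checkpoint; `eval_dataloader`, `device`, and `compute_accuracy` are placeholders for your own evaluation code, and the exact calling convention should be checked against the API documentation:

```python
import torch

def evaluate_callback(model, step):
    # Called by the distiller after it saves a student checkpoint.
    model.eval()
    with torch.no_grad():
        acc = compute_accuracy(model, eval_dataloader, device)  # user-defined evaluation
    print(f"step {step}: student accuracy = {acc:.4f}")
    model.train()
```

The callback is then passed to the distiller, e.g. `distiller.train(..., callback=evaluate_callback)`.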

## FAQ

**Q**: How to initialize the student model?

**A**: The student model could be randomly initialized (i.e., with no prior knwledge) or be initialized by pre-trained weights.
**A**: The student model could be randomly initialized (i.e., with no prior knowledge) or be initialized by pre-trained weights.
For example, when distilling a BERT-base model to a 3-layer BERT, you could initialize the student model with [RBT3](https://github.com/ymcui/Chinese-BERT-wwm) (for Chinese tasks) or the first three layers of BERT (for English tasks) to avoid the cold-start problem.
We recommend that users use pre-trained student models whenever possible to fully take the advantage of large-scale pre-training.
We recommend that users use pre-trained student models whenever possible to fully take advantage of large-scale pre-training.

**Q**: How to set training hyperparamters for the distillation experiments?
**Q**: How to set training hyperparameters for the distillation experiments?

**A**: Knowledge distillation usually requires more training epochs and larger learning rate than training on labeled dataset. For example, training SQuAD on BERT-base usually takes 3 epochs with lr=3e-5; however, distillation takes 30~50 epochs with lr=1e-4. **The conclusions are based on our experiments, and you are advised to try on your own data**.
**A**: Knowledge distillation usually requires more training epochs and larger learning rate than training on the labeled dataset. For example, training SQuAD on BERT-base usually takes 3 epochs with lr=3e-5; however, distillation takes 30~50 epochs with lr=1e-4. **The conclusions are based on our experiments, and you are advised to try on your own data**.
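
As a concrete (hypothetical) illustration of these numbers, reusing the names from the sketches above:

```python
from torch.optim import AdamW

# Distillation: larger learning rate and more epochs than ordinary fine-tuning.
optimizer = AdamW(student_model.parameters(), lr=1e-4)   # vs. ~3e-5 for fine-tuning
num_epochs = 30                                          # vs. ~3 epochs for fine-tuning
```

These are then passed to `distiller.train` as in the quick-start sketch above.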

## Known Issues

4 changes: 2 additions & 2 deletions docs/source/Concepts.rst
@@ -6,7 +6,7 @@ Conventions

* ``Model_T``: an instance of :class:`torch.nn.Module`, the teacher model to be distilled.

* ``Model_S``: an instance of :class:`torch.nn.Module`, the student model, usually smaller than the teacher model for the purpose of model compression and faster inference speed.
* ``Model_S``: an instance of :class:`torch.nn.Module`, the student model, usually smaller than the teacher model for model compression and faster inference speed.

* ``optimizer``: an instance of :class:`torch.optim.Optimizer`.

@@ -121,7 +121,7 @@ In TextBrewer, there are two functions that should be implemented by users: :fun
* **labels** is required if and only if ``probability_shift==True``.
* You shouldn't ignore all the keys, otherwise the training won't start :)

In most cases **logits** should be provided, unless you are doing multi-stage training or non-classfification tasks, etc.
In most cases **logits** should be provided, unless you are doing multi-stage training or non-classification tasks, etc.

Example::

4 changes: 2 additions & 2 deletions docs/source/Experiments.md
@@ -8,7 +8,7 @@ We have performed distillation experiments on several typical English and Chines
* For English tasks, the teacher model is [**BERT-base-cased**](https://github.com/google-research/bert).
* For Chinese tasks, the teacher model is [**RoBERTa-wwm-ext**](https://github.com/ymcui/Chinese-BERT-wwm) released by the Joint Laboratory of HIT and iFLYTEK Research.

We have tested different student models. To compare with public results, the student models are built with standard transformer blocks except BiGRU which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but does not include the output layer of the each specific task.
We have tested different student models. To compare with public results, the student models are built with standard transformer blocks except for BiGRU which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but does not include the output layer of each specific task.

| Model | \#Layers | Hidden_size | Feed-forward size | \#Params | Relative size |
| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |
@@ -87,7 +87,7 @@ Our results:

**Note**:

1. The equivlent model architectures of public models are shown in the brackets after their names.
1. The equivalent model architectures of public models are shown in the brackets after their names.
2. When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD and HotpotQA is used for data augmentation on CoNLL-2003.


2 changes: 1 addition & 1 deletion docs/source/Losses.rst
@@ -3,7 +3,7 @@
Intermediate Losses
===================
Here we list the definitions of pre-defined intermediate losses.
Usually users don't need to refer to these functions directly, but refer to them by the names in :obj:`MATCH_LOSS_MAP`.
Usually, users don't need to refer to these functions directly, but refer to them by the names in :obj:`MATCH_LOSS_MAP`.

attention_mse
-------------
2 changes: 1 addition & 1 deletion docs/source/Presets.rst
@@ -54,4 +54,4 @@ then used in :class:`~textbrewer.DistillationConfig`::
intermediate_matches = [{'layer_T':0, 'layer_S':0, 'feature':'hidden','loss': 'my_L1_loss', 'weight' : 1}]
...)

Refer to the source code for more details on inputs and outputs conventions (will be explained in more details in a later version of the documentation).
Refer to the source code for more details on inputs and outputs conventions (will be explained in detail in a later version of the documentation).
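
As a hedged illustration of the customization described above, a user-defined intermediate loss might look like the following; the `(feature_S, feature_T, mask=None)` signature mirrors the built-in losses and the `textbrewer.presets` location of `MATCH_LOSS_MAP` is an assumption, so verify both against the source as advised:

```python
import torch
import textbrewer

def my_L1_loss(feature_S, feature_T, mask=None):
    # Mean absolute error between a student feature and the matching teacher feature.
    return torch.abs(feature_S - feature_T).mean()

# Register under the name referenced by 'loss': 'my_L1_loss' in the config above.
textbrewer.presets.MATCH_LOSS_MAP['my_L1_loss'] = my_L1_loss
```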
18 changes: 9 additions & 9 deletions docs/source/Tutorial.rst
@@ -61,13 +61,13 @@ To start distillation, users need to provide
* **Stage 1**: Preparation:

#. Train the teacher model.
#. Define and intialize the student model.
#. Construct a dataloader, an optimizer and a learning rate scheduler.
#. Define and initialize the student model.
#. Construct a dataloader, an optimizer, and a learning rate scheduler.

* **Stage 2**: Distillation with TextBrewer:

#. Construct a ``TrainingConfig`` and a ``DistillationConfig``, initialize a **distiller**.
#. Define an **adaptor** and a **callback**. The **adaptor** is used for adaptation of model inputs and outputs. The **callback** is called by the distiller during training.
#. Define an **adaptor** and a **callback**. The **adaptor** is used for the adaptation of model inputs and outputs. The **callback** is called by the distiller during training.
#. Call the ``train`` method of the **distiller**.


@@ -127,23 +127,23 @@ Examples

Examples can be found in the `examples <https://github.com/airaria/TextBrewer/tree/master/examples>`_ directory of the repo:

* `examples/random_token_example <https://github.com/airaria/TextBrewer/tree/master/examples/random_tokens_example>`_ : a simple runable toy example which demonstrates the usage of TextBrewer. This example performs distillation on the text classification task with random tokens as inputs.
* `examples/random_token_example <https://github.com/airaria/TextBrewer/tree/master/examples/random_tokens_example>`_ : a simple runnable toy example which demonstrates the usage of TextBrewer. This example performs distillation on the text classification task with random tokens as inputs.
* `examples/cmrc2018\_example <https://github.com/airaria/TextBrewer/tree/master/examples/cmrc2018_example>`_ (Chinese): distillation on CMRC2018, a Chinese MRC task, using DRCD as data augmentation.
* `examples/mnli\_example <https://github.com/airaria/TextBrewer/tree/master/examples/mnli_example>`_ (English): distillation on MNLI, an English sentence-pair classification task. This example also shows how to perform multi-teacher distillation.
* `examples/conll2003_example <https://github.com/airaria/TextBrewer/tree/master/examples/conll2003_example>`_ (English): distillation on CoNLL-2003 English NER task, which is in form of sequence labeling.
* `examples/conll2003_example <https://github.com/airaria/TextBrewer/tree/master/examples/conll2003_example>`_ (English): distillation on CoNLL-2003 English NER task, which is in the form of sequence labeling.

FAQ
===

**Q**: How to initialize the student model?

**A**: The student model could be randomly initialized (i.e., with no prior knwledge) or be initialized by pre-trained weights.
**A**: The student model could be randomly initialized (i.e., with no prior knowledge) or be initialized by pre-trained weights.
For example, when distilling a BERT-base model to a 3-layer BERT, you could initialize the student model with `RBT3 <https://github.com/ymcui/Chinese-BERT-wwm>`_ (for Chinese tasks) or the first three layers of BERT (for English tasks) to avoid the cold-start problem.
We recommend that users use pre-trained student models whenever possible to fully take the advantage of large-scale pre-training.
We recommend that users use pre-trained student models whenever possible to fully take advantage of large-scale pre-training.

**Q**: How to set training hyperparamters for the distillation experiments?
**Q**: How to set training hyperparameters for the distillation experiments?

**A**: Knowledge distillation usually requires more training epochs and larger learning rate than training on labeled dataset. For example, training SQuAD on BERT-base usually takes 3 epochs with lr=3e-5; however, distillation takes 30~50 epochs with lr=1e-4. **The conclusions are based on our experiments, and you are advised to try on your own data**.
**A**: Knowledge distillation usually requires more training epochs and a larger learning rate than training on the labeled dataset. For example, training SQuAD on BERT-base usually takes 3 epochs with lr=3e-5; however, distillation takes 30~50 epochs with lr=1e-4. **The conclusions are based on our experiments, and you are advised to try on your own data**.

Known Issues
============
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -12,7 +12,7 @@
**TextBrewer** is a PyTorch-based toolkit for **distillation of NLP models**.

It includes various distilltion techniques from both NLP and CV, and provides an easy-to-use distillation framework, which allows users to quickly experiment with state-of-the-art distillation methods to compress the model with a relatively small sacrifice in performance, increase the inference speed and reduce the memory usage.
It includes various distillation techniques from both NLP and CV, and provides an easy-to-use distillation framework, which allows users to quickly experiment with state-of-the-art distillation methods to compress the model with a relatively small sacrifice in performance, increase the inference speed and reduce the memory usage.

Main features
-------------
