
Commit

update Experimental Results
airaria committed Mar 24, 2020
1 parent 4d079dd commit b8a87a7
Showing 3 changed files with 128 additions and 3 deletions.
119 changes: 119 additions & 0 deletions docs/source/ExperimentResults.md
@@ -0,0 +1,119 @@
# Experimental Results


## Results on English Datasets

### MNLI

* Single-teacher distillation with `GeneralDistiller` (see the sketch after the table):

| Model (ours) | MNLI (m / mm) |
| :------------- | -------------- |
| **BERT-base-cased** (teacher) | 83.7 / 84.0 |
| T6 (student) | 83.5 / 84.0 |
| T3 (student) | 81.8 / 82.7 |
| T3-small (student) | 81.3 / 81.7 |
| T4-tiny (student) | 82.0 / 82.6 |
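
Below is a minimal sketch of how such a single-teacher run can be set up with TextBrewer's `GeneralDistiller`. The checkpoint paths, hyperparameter values, layer-matching choice, and the `train_dataloader` object are placeholders rather than the exact configuration behind the numbers above, and the exact `train()` signature may differ slightly between TextBrewer versions.

```python
import torch
from transformers import BertForSequenceClassification
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Placeholder checkpoints: a BERT-base teacher fine-tuned on the task and a
# smaller student (e.g. a T6-style 6-layer model) to be distilled into.
# output_hidden_states=True is needed because we match hidden states below.
teacher = BertForSequenceClassification.from_pretrained(
    "path/to/finetuned-bert-base-cased", output_hidden_states=True)
student = BertForSequenceClassification.from_pretrained(
    "path/to/t6-student-init", output_hidden_states=True)

def simple_adaptor(batch, model_outputs):
    # Assumes the model returns (loss, logits, hidden_states) when the batch
    # contains a `labels` key and output_hidden_states=True.
    return {"logits": model_outputs[1], "hidden": model_outputs[2]}

train_config = TrainingConfig(device="cuda" if torch.cuda.is_available() else "cpu")
distill_config = DistillationConfig(
    temperature=8,  # value is illustrative
    intermediate_matches=[
        # Example hidden-state match; T6 shares BERT-base's hidden size,
        # so no projection layer is needed.
        {"layer_T": 8, "layer_S": 4, "feature": "hidden",
         "loss": "hidden_mse", "weight": 1},
    ])

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher, model_S=student,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)  # lr is illustrative

# `train_dataloader` is assumed to already exist and to yield dicts of tensors
# (including a `labels` key) that both models accept as keyword arguments.
with distiller:
    distiller.train(optimizer=optimizer, dataloader=train_dataloader,
                    scheduler=None, num_epochs=30, callback=None)
```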

* Multi-teacher distillation with `MultiTeacherDistiller` (see the sketch after the table):

| Model (ours) | MNLI (m / mm) |
| :------------- | -------------- |
| **BERT-base-cased** (teacher #1) | 83.7 / 84.0 |
| **BERT-base-cased** (teacher #2) | 83.6 / 84.2 |
| **BERT-base-cased** (teacher #3) | 83.7 / 83.8 |
| ensemble (average of #1, #2 and #3) | 84.3 / 84.7 |
| BERT-base-cased (student) | **84.8 / 85.3**|
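
The multi-teacher rows were produced with `MultiTeacherDistiller`, which takes a list of teachers as `model_T` and distills their combined soft labels into a single student. A minimal sketch under the same assumptions as above (placeholder paths, illustrative hyperparameters, a pre-built `train_dataloader`) could look like the following; only logit-based distillation is shown here.

```python
import torch
from transformers import BertForSequenceClassification
from textbrewer import MultiTeacherDistiller, TrainingConfig, DistillationConfig

# Three independently fine-tuned BERT-base teachers (paths are placeholders).
teacher_paths = ["path/to/teacher1", "path/to/teacher2", "path/to/teacher3"]
teachers = [BertForSequenceClassification.from_pretrained(p) for p in teacher_paths]
# The student has the same architecture as the teachers (BERT-base).
student = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=3)  # e.g. 3 classes for MNLI

def simple_adaptor(batch, model_outputs):
    # Assumes the model returns a (loss, logits, ...) tuple when labels are given.
    # The same adaptor is reused for all teachers and for the student.
    return {"logits": model_outputs[1]}

train_config = TrainingConfig(device="cuda" if torch.cuda.is_available() else "cpu")
distill_config = DistillationConfig(temperature=8)  # value is illustrative

# model_T is a list; the distiller combines the teachers' soft labels.
distiller = MultiTeacherDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teachers, model_S=student,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor)

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)  # lr is illustrative

# `train_dataloader` is assumed to already exist, as in the previous sketch.
with distiller:
    distiller.train(optimizer=optimizer, dataloader=train_dataloader,
                    scheduler=None, num_epochs=3, callback=None)
```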

### SQuAD

* Training without Distillation:

| Model (ours) | SQuAD (EM / F1) |
| ------------- | ------------- |
| **BERT-base-cased** | 81.5 / 88.6 |
| T6 | 75.0 / 83.3 |
| T3 | 63.0 / 74.3 |

* Single-teacher distillation with `GeneralDistiller`:

| Model (ours) | SQuAD (EM / F1) |
| ------------- | ------------- |
| **BERT-base-cased** (teacher) | 81.5 / 88.6 |
| T6 (student) | 80.8 / 88.1 |
| T3 (student) | 76.4 / 84.9 |
| T3-small (student) | 72.3 / 81.4 |
| T4-tiny (student) | 73.7 / 82.5 |
|   + DA | 75.2 / 84.0 |

**Note**: When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD.
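
One straightforward way to apply this kind of data augmentation is to concatenate the augmentation data with the original training set before building the dataloader passed to the distiller. The sketch below assumes `squad_features` and `newsqa_features` are already-tokenized PyTorch `Dataset` objects with identical fields; it illustrates the idea rather than the exact preprocessing used for these results.

```python
from torch.utils.data import ConcatDataset, DataLoader

# `squad_features` and `newsqa_features` are assumed to be torch Datasets of
# tokenized QA examples sharing the same fields.
augmented_train_set = ConcatDataset([squad_features, newsqa_features])
train_dataloader = DataLoader(augmented_train_set, batch_size=32, shuffle=True)
# The augmented dataloader is then passed to distiller.train() as usual.
```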

* Multi-teacher distillation with `MultiTeacherDistiller`:

| Model (ours) | SQuAD (EM / F1) |
| :------------- | -------------- |
| **BERT-base-cased** (teacher #1) | 81.1 / 88.6 |
| **BERT-base-cased** (teacher #2) | 81.2 / 88.5 |
| **BERT-base-cased** (teacher #3) | 81.2 / 88.7 |
| ensemble (average of #1, #2 and #3) | 82.3 / 89.4 |
| BERT-base-cased (student) | **83.5 / 90.0**|

### CoNLL-2003 English NER

* Training without Distillation:

| Model (ours) | CoNLL-2003 (F1) |
| ------------- | ----------- |
| **BERT-base-cased** | 91.1 |
| BiGRU | 81.1 |
| T3 | 85.3 |

* Single-teacher distillation with `GeneralDistiller`:

| Model (ours) | CoNLL-2003 (F1) |
| ------------- | ------------- |
| **BERT-base-cased** (teacher) | 91.1 |
| BiGRU | 85.3 |
| T6 (student) | 90.7 |
| T3 (student) | 87.5 |
|   + DA | 90.0 |
| T3-small (student) | 57.4 |
|   + DA | 76.5 |
| T4-tiny (student) | 54.7 |
|   + DA | 79.6 |

**Note**: HotpotQA is used for data augmentation on CoNLL-2003.

## Results on Chinese Datasets

### XNLI

| Model | XNLI |
| :--------------- | ----------------- |
| **RoBERTa-wwm-ext** (teacher) | 79.9 |
| T3 (student) | 78.4 |
| T3-small (student) | 76.0 |
| T4-tiny (student) | 76.2 |

### LCQMC

| Model | LCQMC |
| :--------------- | ----------- |
| **RoBERTa-wwm-ext** (teacher) | 89.4 |
| T3 (student) | 89.0 |
| T3-small (student) | 88.1 |
| T4-tiny (student) | 88.4 |

### CMRC2018 and DRCD

| Model | CMRC2018 (EM / F1) | DRCD (EM / F1) |
| --------------- | ---------------- | ------------ |
| **RoBERTa-wwm-ext** (teacher) | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 (student) | 63.4 / 82.4 | 76.7 / 85.2 |
|   + DA | 66.4 / 84.2 | 78.2 / 86.4 |
| T3-small (student) | 24.4 / 48.1 | 42.2 / 63.2 |
|   + DA | 58.0 / 79.3 | 65.5 / 78.6 |
| T4-tiny (student) | - | - |
|   + DA | 61.8 / 81.8 | 73.3 / 83.5 |

**Note**: In these experiments, CMRC2018 and DRCD serve as each other's augmentation dataset.
6 changes: 3 additions & 3 deletions docs/source/Experiments.md
@@ -74,7 +74,7 @@ Public results:
| BERT<sub>3</sub>-PKD (T3) | 76.7 / 76.3 | - | -|
| TinyBERT (T4-tiny) | 82.8 / 82.9 | 72.7 / 82.1 | -|

Our results:
Our results (see [Experimental Results](ExperimentResults.md) for details):

| Model (ours) | MNLI | SQuAD | CoNLL-2003 |
| :------------- | --------------- | ------------- | --------------- |
@@ -104,7 +104,7 @@ We experiment on the following typical Chinese datasets:
| [**CMRC 2018**](https://github.com/ymcui/cmrc2018) | reading comprehension | EM/F1 | 10K | 3.4K | span-extraction machine reading comprehension |
| [**DRCD**](https://github.com/DRCKnowledgeTeam/DRCD) | reading comprehension | EM/F1 | 27K | 3.5K | span-extraction machine reading comprehension (Traditional Chinese) |

The results are listed below.
The results are listed below (see [Experimental Results](ExperimentResults.md) for details).

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
@@ -117,4 +117,4 @@ The results are listed below.
**Note**:

1. On CMRC2018 and DRCD, the learning rates are 1.5e-4 and 7e-5, respectively, and there is no learning rate decay.
2. In these experiments, CMRC2018 and DRCD serve as each other's augmentation dataset.
6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -49,6 +49,12 @@ Paper: `TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural La
   Losses
   Utils

.. toctree::
   :maxdepth: 2
   :caption: Appendices

   ExperimentResults

Indices and tables
==================

