update exp results (MSRA NER, T12-nano, T3-small and T4-tiny)
airaria committed Apr 26, 2020
1 parent a1cfd0d commit f426f03
Showing 4 changed files with 77 additions and 54 deletions.
README.md: 27 changes (16 additions & 11 deletions)
@@ -218,6 +218,7 @@ We have tested different student models. To compare with public results, the stu
| T3 (student) | 3 | 768 | 3072 | 44M | 41% |
| T3-small (student) | 3 | 384 | 1536 | 17M | 16% |
| T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
+| T12-nano (student) | 12 | 256 | 1024 | 17M | 16% |
| BiGRU (student) | - | 768 | - | 31M | 29% |

#### Chinese models
@@ -251,7 +252,8 @@ distill_config = DistillationConfig(temperature = 8, intermediate_matches = matc
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
-|Electra-small | small_hidden_mse + small_hidden_smmd |
+| T12-nano | small_hidden_mse + small_hidden_smmd |
+| Electra-small | small_hidden_mse + small_hidden_smmd |

The definitions of matches are at [examples/matches/matches.py](examples/matches/matches.py).
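For context, the sketch below shows how such a named match list plugs into a `DistillationConfig`. It is a hand-written illustration rather than the repository's actual definitions: the layer indices and weights are assumptions, and the `proj` field (TextBrewer's match format) bridges the student/teacher hidden-size gap.

```python
from textbrewer import DistillationConfig

# Illustrative stand-ins for the named match lists defined in
# examples/matches/matches.py; each dict ties one student layer's hidden
# states to one teacher layer's hidden states.
L4t_hidden_mse = [
    {"layer_T": 4, "layer_S": 1, "feature": "hidden", "loss": "hidden_mse",
     "weight": 1, "proj": ["linear", 312, 768]},  # T4-Tiny: 312-dim -> 768-dim
    {"layer_T": 8, "layer_S": 2, "feature": "hidden", "loss": "hidden_mse",
     "weight": 1, "proj": ["linear", 312, 768]},
]
L4_hidden_smmd = [
    # 'mmd' compares similarity matrices of hidden states, so no projection
    # is needed even when the student and teacher widths differ.
    {"layer_T": [0, 4], "layer_S": [0, 1], "feature": "hidden",
     "loss": "mmd", "weight": 1},
]

distill_config = DistillationConfig(
    temperature=8,
    intermediate_matches=L4t_hidden_mse + L4_hidden_smmd,
)
```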

@@ -292,13 +294,15 @@ Our results:
| BiGRU | - | - | 85.3 |
| T6 | 83.5 / 84.0 | 80.8 / 88.1 | 90.7 |
| T3 | 81.8 / 82.7 | 76.4 / 84.9 | 87.5 |
-| T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 57.4 |
-| T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 79.6 |
+| T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 78.6 |
+| T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 89.1 |
+| T12-nano | 83.2 / 83.9 | 79.0 / 86.6 | 89.6 |

**Note**:

-1. The equivalent model architectures of public models are shown in the brackets.
+1. The equivalent model structures of public models are shown in brackets after their names.
2. When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD and HotpotQA is used for data augmentation on CoNLL-2003.
+3. When distilling to T12-nano, HotpotQA is used for data augmentation on CoNLL-2003.



@@ -313,25 +317,26 @@ We experiment on the following typical Chinese datasets:
| [**LCQMC**](http://icrc.hitsz.edu.cn/info/1037/1146.htm) | text classification | Acc | 239K | 8.8K | sentence-pair matching, binary classification |
| [**CMRC 2018**](https://github.com/ymcui/cmrc2018) | reading comprehension | EM/F1 | 10K | 3.4K | span-extraction machine reading comprehension |
| [**DRCD**](https://github.com/DRCKnowledgeTeam/DRCD) | reading comprehension | EM/F1 | 27K | 3.5K | span-extraction machine reading comprehension (Traditional Chinese) |
+| [**MSRA NER**](https://faculty.washington.edu/levow/papers/sighan06.pdf) | sequence labeling | F1 | 45K | 3.4K (#Test) | Chinese named entity recognition |

The results are listed below.

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **RoBERTa-wwm-ext** (teacher) | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 | 78.4 | 89.0 | 66.4 / 84.2 | 78.2 / 86.4 |
-| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 65.5 / 78.6 |
-| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 73.3 / 83.5 |
+| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 75.8 / 84.8 |
+| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 77.3 / 86.1 |

-| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
-| :--------------- | ---------- | ----------- | ---------------- | ------------ |
-| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
-| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |
+| Model | XNLI | LCQMC | CMRC 2018 | DRCD | MSRA NER |
+| :---------------------------| ---------- | ----------- | ---------------- | ------------|----------|
+| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 | 95.14 |
+| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 | 93.48 |


**Note**:

-1. When distilling from RoBERTa-wwm-ext, on CMRC2018 and DRCD, the learning rates are 1.5e-4 and 7e-5 respectively, with no learning rate decay.
+1. Learning rate decay is not used in distillation on CMRC2018 and DRCD.
2. CMRC2018 and DRCD take each other as the augmentation dataset in the distillation.
3. The settings of training Electra-base teacher model can be found at [**Chinese-ELECTRA**](https://github.com/ymcui/Chinese-ELECTRA).
4. The Electra-small student model is initialized with the [pretrained weights](https://github.com/ymcui/Chinese-ELECTRA).
README_ZH.md: 30 changes (17 additions & 13 deletions)
@@ -216,6 +216,7 @@ with distiller:
| T3 (student) | 3 | 768 | 3072 | 44M | 41% |
| T3-small (student) | 3 | 384 | 1536 | 17M | 16% |
| T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
+| T12-nano (student) | 12 | 256 | 1024 | 17M | 16% |
| BiGRU (student) | - | 768 | - | 31M | 29% |

#### Chinese models
@@ -249,7 +250,8 @@ distill_config = DistillationConfig(temperature = 8, intermediate_matches = matc
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
-|Electra-small | small_hidden_mse + small_hidden_smmd |
+| T12-nano | small_hidden_mse + small_hidden_smmd |
+| Electra-small | small_hidden_mse + small_hidden_smmd |

The definitions of the various matches are in [examples/matches/matches.py](examples/matches/matches.py). GeneralDistiller is used in all the distillation experiments.
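As a minimal sketch of how these configurations are consumed (assumed usage following the TextBrewer README; `teacher_model`, `student_model`, `matches`, `optimizer`, and `dataloader` are placeholders, and the adaptor assumes a transformers-style output object):

```python
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

def simple_adaptor(batch, model_outputs):
    # Expose the features that the distiller matches on; 'hidden' is needed
    # here because the intermediate matches use hidden states.
    return {"logits": model_outputs.logits,
            "hidden": model_outputs.hidden_states}

train_config = TrainingConfig(device="cuda")
distill_config = DistillationConfig(temperature=8, intermediate_matches=matches)

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor,
)

with distiller:  # the context manager sets up and tears down distillation
    distiller.train(optimizer, dataloader, num_epochs=30,
                    scheduler=None, callback=None)
```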

@@ -287,43 +289,45 @@ Our results:
| BiGRU | - | - | 85.3 |
| T6 | 83.5 / 84.0 | 80.8 / 88.1 | 90.7 |
| T3 | 81.8 / 82.7 | 76.4 / 84.9 | 87.5 |
-| T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 57.4 |
-| T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 79.6 |
+| T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 78.6 |
+| T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 89.1 |
+| T12-nano | 83.2 / 83.9 | 79.0 / 86.6 | 89.6 |

**Note**:

1. The equivalent model structures of the public models are shown in brackets after their names.
-2. In the T4-tiny distillation experiments, SQuAD uses NewsQA as augmentation data; CoNLL-2003 uses HotpotQA passages as augmentation data.
-
+2. In the distillation to T4-tiny, NewsQA is used as augmentation data on SQuAD, and HotpotQA passages are used as augmentation data on CoNLL-2003.
+3. In the distillation to T12-nano, HotpotQA passages are used as augmentation data on CoNLL-2003.

### Results on Chinese datasets

-In the Chinese experiments, we use the following four typical datasets:
+In the Chinese experiments, we use the following typical datasets:

| Dataset | Task type | Metrics | \#Train | \#Dev | Note |
| :------- | ---- | ------- | ------- | ---- | ---- |
| [**XNLI**](https://github.com/google-research/bert/blob/master/multilingual.md) | text classification | Acc | 393K | 2.5K | Chinese translation of MNLI, 3-class classification |
| [**LCQMC**](http://icrc.hitsz.edu.cn/info/1037/1146.htm) | text classification | Acc | 239K | 8.8K | sentence-pair binary classification: whether two sentences have the same meaning |
| [**CMRC 2018**](https://github.com/ymcui/cmrc2018) | reading comprehension | EM/F1 | 10K | 3.4K | span-extraction machine reading comprehension |
| [**DRCD**](https://github.com/DRCKnowledgeTeam/DRCD) | reading comprehension | EM/F1 | 27K | 3.5K | span-extraction machine reading comprehension (Traditional Chinese) |
+| [**MSRA NER**](https://faculty.washington.edu/levow/papers/sighan06.pdf) | sequence labeling | F1 | 45K | 3.4K (#Test) | Chinese named entity recognition |

The results are listed below.

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **RoBERTa-wwm-ext** (teacher) | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 | 78.4 | 89.0 | 66.4 / 84.2 | 78.2 / 86.4 |
-| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 65.5 / 78.6 |
-| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 73.3 / 83.5 |
+| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 75.8 / 84.8 |
+| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 77.3 / 86.1 |

-| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
-| :--------------- | ---------- | ----------- | ---------------- | ------------ |
-| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
-| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |
+| Model | XNLI | LCQMC | CMRC 2018 | DRCD | MSRA NER |
+| :----------------------| ---------- | ----------- | ---------------- | ------------|----------|
+| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 | 95.14 |
+| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 | 93.48 |

**Note**:

-1. When distilling from the RoBERTa-wwm-ext teacher on CMRC2018 and DRCD, the learning rates are 1.5e-4 and 7e-5 respectively, with no learning rate decay.
+1. When distilling from the RoBERTa-wwm-ext teacher on CMRC2018 and DRCD, learning rate decay is not used.
2. CMRC2018 and DRCD take each other as augmentation data during distillation.
3. The training settings of the Electra-base teacher model follow [**Chinese-ELECTRA**](https://github.com/ymcui/Chinese-ELECTRA).
4. The Electra-small student model is initialized with the [pretrained weights](https://github.com/ymcui/Chinese-ELECTRA).
docs/source/ExperimentResults.md: 47 changes (28 additions & 19 deletions)
@@ -21,6 +21,7 @@
| T3 (student) | 81.8 / 82.7 |
| T3-small (student) | 81.3 / 81.7 |
| T4-tiny (student) | 82.0 / 82.6 |
+| T12-nano (student) | 83.2 / 83.9 |

* Multi-teacher distillation with `MultiTeacherDistiller`:

@@ -51,7 +52,8 @@
| T3 (student) | 76.4 / 84.9 |
| T3-small (student) | 72.3 / 81.4 |
| T4-tiny (student) | 73.7 / 82.5 |
-|   + DA | 75.2 / 84.0 |
+|   + DA | 75.2 / 84.0 |
+| T12-nano (student) | 79.0 / 86.6 |

**Note**: When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD.
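The multi-teacher rows use `MultiTeacherDistiller`, which ensembles several teachers fine-tuned on the same task into a single student. Below is a minimal invocation sketch under the same assumptions as the single-teacher case (placeholder models, adaptor, optimizer, and dataloader); to our understanding this distiller works on output logits only, so no intermediate matches are passed.

```python
from textbrewer import MultiTeacherDistiller, TrainingConfig, DistillationConfig

# model_T takes a list of fine-tuned teachers instead of a single model.
distiller = MultiTeacherDistiller(
    train_config=TrainingConfig(device="cuda"),
    distill_config=DistillationConfig(temperature=8),
    model_T=[teacher_1, teacher_2, teacher_3],
    model_S=student_model,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor,
)

with distiller:
    distiller.train(optimizer, dataloader, num_epochs=30, scheduler=None)
```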

@@ -84,10 +86,12 @@
| T6 (student) | 90.7 |
| T3 (student) | 87.5 |
|   + DA | 90.0 |
-| T3-small (student) | 57.4 |
-|   + DA | 76.5 |
-| T4-tiny (student) | 54.7 |
-|   + DA | 79.6 |
+| T3-small (student) | 78.6 |
+|   + DA | - |
+| T4-tiny (student) | 77.5 |
+|   + DA | 89.1 |
+| T12-nano (student) | 78.8 |
+|   + DA | 89.6 |

**Note**: HotpotQA is used for data augmentation on CoNLL-2003.

@@ -118,28 +122,33 @@
| **RoBERTa-wwm-ext** (teacher) | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 (student) | 63.4 / 82.4 | 76.7 / 85.2 |
|   + DA | 66.4 / 84.2 | 78.2 / 86.4 |
-| T3-small (student) | 24.4 / 48.1 | 42.2 / 63.2 |
-|   + DA | 58.0 / 79.3 | 65.5 / 78.6 |
-| T4-tiny (student) | - | - |
-|   + DA | 61.8 / 81.8 | 73.3 / 83.5 |
+| T3-small (student) | 46.1 / 71.0 | 71.4 / 82.2 |
+|   + DA | 58.0 / 79.3 | 75.8 / 84.8 |
+| T4-tiny (student) | 54.3 / 76.8 | 75.5 / 84.9 |
+|   + DA | 61.8 / 81.8 | 77.3 / 86.1 |

**Note**: CMRC2018 and DRCD take each other as the augmentation dataset in these experiments.

## Chinese Datasets (Electra-base as the teacher)

* Training without Distillation:

-| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
-| :------------------------------------|------------|-------------| ----------------| -------------|
-| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
-| Electra-small (initialized with pretrained weights) | 72.5 | 86.3 | 62.9 / 80.2 | 79.4 / 86.4 |
+| Model | XNLI | LCQMC | CMRC 2018 | DRCD | MSRA NER |
+|:---------------------------|------------|--------| --------------| -------------|---------|
+| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 | 95.14 |
+| Electra-small (pretrained) | 72.5 | 86.3 | 62.9 / 80.2 | 79.4 / 86.4 | |

* Single-teacher distillation with `GeneralDistiller`:

-| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
-| :------------------------------------|------------|-------------|-----------------| -------------|
-| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
-| Electra-small (random initialized) | 77.2 | 89.0 | 66.5 / 84.9 | 84.8 / 91.0 |
-| Electra-small (initialized with pretrained weights) | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |
+| Model | XNLI | LCQMC | CMRC 2018 | DRCD | MSRA NER |
+| :---------------------------|------------|-------------|-----------------| -------------|----------|
+| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 | 95.14 |
+| Electra-small (random) | 77.2 | 89.0 | 66.5 / 84.9 | 84.8 / 91.0 | |
+| Electra-small (pretrained) | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 | 93.48 |

-**Note**: A good initialization of the student (Electra-small) improves the performance.
+**Note**:
+
+1. Random: randomly initialized
+2. Pretrained: initialized with pretrained weights
+
+A good initialization of the student (Electra-small) improves the performance.
docs/source/Experiments.md: 27 changes (16 additions & 11 deletions)
@@ -19,6 +19,7 @@ We have tested different student models. To compare with public results, the stu
| T3 (student) | 3 | 768 | 3072 | 44M | 41% |
| T3-small (student) | 3 | 384 | 1536 | 17M | 16% |
| T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
+| T12-nano (student) | 12 | 256 | 1024 | 17M | 16% |
| BiGRU (student) | - | 768 | - | 31M | 29% |

#### Chinese models
@@ -54,7 +55,8 @@ distill_config = DistillationConfig(temperature = 8, intermediate_matches = matc
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
-|Electra-small | small_hidden_mse + small_hidden_smmd |
+| T12-nano | small_hidden_mse + small_hidden_smmd |
+| Electra-small | small_hidden_mse + small_hidden_smmd |

The definitions of `matches` are at [examples/matches/matches.py](https://github.com/airaria/TextBrewer/blob/master/examples/matches/matches.py).
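A detail implicit in these match names: T3-small (384), T4-Tiny (312), and T12-nano (256) all have hidden sizes smaller than the teacher's 768, so their `hidden_mse` matches include a `proj` field that maps student features up to the teacher's width before the loss is computed. An illustrative entry (assumed layer indices, following TextBrewer's match format):

```python
# Illustrative match entry for a T12-nano (256-dim) student distilled from
# a 768-dim teacher; the layer indices are examples, not the repo's values.
match_with_proj = {
    "layer_T": 4,                  # teacher layer supplying the target hidden states
    "layer_S": 4,                  # student layer being supervised
    "feature": "hidden",
    "loss": "hidden_mse",
    "weight": 1,
    "proj": ["linear", 256, 768],  # [mapping type, student dim, teacher dim]
}
```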

@@ -95,13 +97,15 @@ Our results (see [Experimental Results](ExperimentResults.md) for details):
| BiGRU | - | - | 85.3 |
| T6 | 83.5 / 84.0 | 80.8 / 88.1 | 90.7 |
| T3 | 81.8 / 82.7 | 76.4 / 84.9 | 87.5 |
-| T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 57.4 |
-| T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 79.6 |
+| T3-small | 81.3 / 81.7 | 72.3 / 81.4 | 78.6 |
+| T4-tiny | 82.0 / 82.6 | 75.2 / 84.0 | 89.1 |
+| T12-nano | 83.2 / 83.9 | 79.0 / 86.6 | 89.6 |

**Note**:

-1. The equivalent model architectures of public models are shown in the brackets after their names.
+1. The equivalent model structures of public models are shown in brackets after their names.
2. When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD and HotpotQA is used for data augmentation on CoNLL-2003.
+3. When distilling to T12-nano, HotpotQA is used for data augmentation on CoNLL-2003.



@@ -116,25 +120,26 @@ We experiment on the following typical Chinese datasets:
| [**LCQMC**](http://icrc.hitsz.edu.cn/info/1037/1146.htm) | text classification | Acc | 239K | 8.8K | sentence-pair matching, binary classification |
| [**CMRC 2018**](https://github.com/ymcui/cmrc2018) | reading comprehension | EM/F1 | 10K | 3.4K | span-extraction machine reading comprehension |
| [**DRCD**](https://github.com/DRCKnowledgeTeam/DRCD) | reading comprehension | EM/F1 | 27K | 3.5K | span-extraction machine reading comprehension (Traditional Chinese) |
+| [**MSRA NER**](https://faculty.washington.edu/levow/papers/sighan06.pdf) | sequence labeling | F1 | 45K | 3.4K (test) | Chinese named entity recognition |

The results are listed below (see [Experimental Results](ExperimentResults.md) for details).

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **RoBERTa-wwm-ext** (teacher) | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 | 78.4 | 89.0 | 66.4 / 84.2 | 78.2 / 86.4 |
-| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 65.5 / 78.6 |
-| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 73.3 / 83.5 |
+| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 75.8 / 84.8 |
+| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 77.3 / 86.1 |

-| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
-| :--------------- | ---------- | ----------- | ---------------- | ------------ |
-| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
-| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |
+| Model | XNLI | LCQMC | CMRC 2018 | DRCD | MSRA NER |
+| :--------------- | ---------- | ----------- | ---------------- | ------------|----------|
+| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 | 95.14 |
+| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 | 93.48 |


**Note**:

-1. When distilling from RoBERTa-wwm-ext, on CMRC2018 and DRCD, the learning rates are 1.5e-4 and 7e-5 respectively, with no learning rate decay.
+1. Learning rate decay is not used in distillation on CMRC2018 and DRCD.
2. CMRC2018 and DRCD take each other as the augmentation dataset in the distillation.
3. The settings of training Electra-base teacher model can be found at [**Chinese-ELECTRA**](https://github.com/ymcui/Chinese-ELECTRA).
4. The Electra-small student model is initialized with the [pretrained weights](https://github.com/ymcui/Chinese-ELECTRA).
