
Model Distillation #1758

Merged
merged 40 commits on Nov 26, 2021

Conversation

MichelBartels
Contributor

This adds model distillation as described in #1551.
A new method called distil_from is added to the FARMReader. It takes mostly the same parameters as train, plus additional parameters for the teacher model and for configuring distillation.
The classes DistillationTrainer and DistillationDataSilo are also introduced, as they are used by distil_from. The DistillationDataSilo precomputes the teacher's logits, so they don't have to be recomputed each epoch.

The implemented approach has the limitation that the character-to-token mappings of the student and teacher tokenizers need to be the same, as comparing logits would not make sense otherwise.

This PR also includes a benchmark that compares teacher performance, student performance without distillation (baseline), and student performance with distillation. Like the other benchmarks, it can be configured via a JSON file.
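For illustration, a minimal sketch of the idea behind precomputing the teacher's logits in the data silo (the helper name, attribute access, and batching details here are assumptions, not the PR's actual implementation):

import torch
from torch.utils.data import DataLoader, Dataset

def precompute_teacher_logits(teacher_model: torch.nn.Module, dataset: Dataset,
                              tensor_names: list, batch_size: int = 8) -> torch.Tensor:
    # Run the frozen teacher once over the whole training set and cache its logits,
    # so they can be stored next to the dataset tensors instead of being recomputed every epoch.
    teacher_model.eval()
    cached = []
    with torch.no_grad():
        for batch in DataLoader(dataset, batch_size=batch_size):
            inputs = dict(zip(tensor_names, batch))
            outputs = teacher_model(**inputs)  # assumed to return the QA logits as the first element
            cached.append(outputs[0].cpu())
    return torch.cat(cached, dim=0)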

Member

@julian-risch julian-risch left a comment

Let's wait for the new results from the benchmark before merging the branch into master. Further, could you please add a small test case, maybe with tiny models and a small dataset, that runs the code as one of the tests and checks, for example, that the weights of the student model change after training? We can talk about that in a call if you want.
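A minimal sketch of what such a test could look like (the tiny model names, the reader.inferencer.model attribute path, and the exact distil_from arguments are assumptions for illustration):

import torch
from haystack.nodes import FARMReader

def test_distillation_changes_student_weights():
    student = FARMReader(model_name_or_path="prajjwal1/bert-tiny", use_gpu=False)
    teacher = FARMReader(model_name_or_path="prajjwal1/bert-mini", use_gpu=False)
    # Snapshot the student weights before distillation.
    weights_before = [p.detach().clone() for p in student.inferencer.model.parameters()]
    student.distil_from(teacher, data_dir="samples/squad", train_filename="tiny.json", n_epochs=1)
    weights_after = list(student.inferencer.model.parameters())
    # At least one parameter tensor should have changed after training.
    assert any(not torch.equal(before, after) for before, after in zip(weights_before, weights_after))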

teacher_logits = [batch.pop(key) for key in keys]
logits = self.model.forward(**batch)
student_loss = self.model.logits_to_loss(logits=logits, global_step=self.global_step, **batch)
logit_difference_loss = self.distillation_loss_fn(logits[0], teacher_logits[0])
Member

Let's use named arguments in method calls so that this line becomes:
logit_difference_loss = self.distillation_loss_fn(student_logits=logits[0], teacher_logits=teacher_logits[0])
As a result, the code won't break when the order of arguments in the implementation of distillation_loss_fn changes and it's easier to read the code. (In haystack's code base we use named arguments in almost every method that has multiple arguments for these reasons.)

Contributor Author

Okay, I changed that.

def _kl_div(self, student_logits, teacher_logits):
student_log_probs = F.log_softmax(student_logits, dim=-1)
teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
return self.kl(student_log_probs, teacher_log_probs)
Member

We could use KLDivLoss(reduction="batchmean", log_target=True) here directly instead of self.kl and then there would be no need to define self.kl as it isn't used anywhere else. If you make this change, we can get rid of line 694 self.kl = KLDivLoss(reduction="batchmean", log_target=True) in the elif branch.

Contributor Author

@MichelBartels MichelBartels Nov 17, 2021

I have changed it so that it now uses the functional API. I hope this also addresses the issue.
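For reference, a sketch of what the functional-API version could look like (not necessarily the exact code that was committed):

import torch.nn.functional as F

def _kl_div(self, student_logits, teacher_logits):
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    # F.kl_div replaces the separate KLDivLoss module; log_target=True because both
    # arguments are log-probabilities.
    return F.kl_div(student_log_probs, teacher_log_probs, reduction="batchmean", log_target=True)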


# 2. Create a DataSilo that loads several datasets (train/dev/test), provides DataLoaders for them
# and calculates a few descriptive statistics of our datasets
data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False, max_processes=num_processes)
if teacher_model:
Member

There are only very few new comments in the code. At this line, it's definitely worth adding a comment on the if/else statement and its consequences. In general, I would like to see more comments in the code.

Contributor Author

I have now added this comment and a few others.
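For context, a sketch of what the commented if/else could look like (the else branch and the teacher_batch_size and device variables are assumptions based on the surrounding discussion):

# If a teacher model is provided, use the DistillationDataSilo: it runs the teacher over the
# training data once and caches the teacher logits next to each batch for the distillation loss.
# Otherwise fall back to the regular DataSilo.
if teacher_model:
    data_silo = DistillationDataSilo(teacher_model=teacher_model, teacher_batch_size=teacher_batch_size,
                                     device=device, processor=processor, batch_size=batch_size,
                                     distributed=False, max_processes=num_processes)
else:
    data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False, max_processes=num_processes)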

@@ -0,0 +1,99 @@
from haystack.nodes import FARMReader
Member

The FARMReader import is a duplicate here and import os is not needed.

Contributor Author

@MichelBartels MichelBartels Nov 17, 2021

I have now removed the imports.

@MichelBartels
Contributor Author

I have added the test.

Member

@julian-risch julian-risch left a comment

Found one unused import. Other than that, I think it's ready to be merged depending on the benchmark results. 👍
Maybe we could check the distilled model's QA predictions in the test case rather than only the change of weights, but I guess the tiny dataset won't result in any significant changes to the distilled model.

@@ -9,6 +9,9 @@
from tqdm import tqdm
from pathlib import Path

from torch.nn import KLDivLoss, MSELoss
Member

KLDivLoss is not used anymore so there is no need for this import.

@MichelBartels
Contributor Author

I have now done a few tests and the best I could find so far was an increase of about 3 percentage points in EM when distilling deepset/bert-large-uncased-whole-word-masking-squad2 to prajjwal1/bert-medium compared to just training the student without distillation.
Results of teacher
EM: 0.7890170976164407
F1: 0.8320508749659892
Top n accuracy: 0.9773435525983324
Results of student without distillation (baseline)
EM: 0.655689379263876
F1: 0.6964405123508122
Top n accuracy: 0.9535079592352396
Results of student with distillation (temperature: 5, distillation loss weight: 1)
EM: 0.6864313989724585
F1: 0.7276370837908053
Top n accuracy: 0.9530868356775878

@julian-risch
Member

Great to see the performance improvements now that the bug with the distillation loss calculation is fixed. The PR is ready to merge from my side.
@tholor could you please briefly have a look at this PR and maybe give some general feedback (not detailed) before we merge it? Thank you!

Member

@julian-risch julian-risch left a comment

Looks good to me. Great job! 👍

Member

@tholor tholor left a comment

Nice work! I only left a few comments around documentation. Feel free to merge once those are addressed.

@@ -722,3 +727,68 @@ def get_dict_checksum(payload_dict):
"""
checksum = hashlib.md5(json.dumps(payload_dict, sort_keys=True).encode("utf-8")).hexdigest()
return checksum

class DistillationDataSilo(DataSilo):
def __init__(self, teacher_model: "FARMReader", teacher_batch_size: int, device, **kwargs):
Member

Would be great to add a short docstring here explaining the need for a special data silo and type hints for all params

Contributor Author

I have added the type hint for device. Do you also want me to add type hints for kwargs (i.e. write them out)?

Member

Yeah, I think it would be helpful to write them out, as that enables autocomplete in the IDE, which helps when users initialize a DistillationDataSilo but are not sure which params are expected.
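A sketch of what the written-out signature could look like (the import paths and the set of DataSilo parameters listed here are assumptions; the actual class may expose more):

import torch
from haystack.modeling.data_handler.data_silo import DataSilo      # import path assumed
from haystack.modeling.data_handler.processor import Processor     # import path assumed

class DistillationDataSilo(DataSilo):
    def __init__(
        self,
        teacher_model: "FARMReader",
        teacher_batch_size: int,
        device: torch.device,
        processor: Processor,       # assumed DataSilo parameter, as used in distil_from
        batch_size: int,            # assumed DataSilo parameter
        distributed: bool = False,  # assumed DataSilo parameter
        max_processes: int = 1,     # kept at 1 while multiprocessing doesn't work with the teacher
    ):
        ...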

kwargs["max_processes"] = 1 # fix as long as multithreading is not working with teacher attribute
super().__init__(**kwargs)

def _run_teacher(self, batch, corresponding_chunks, teacher_outputs, tensor_names):
Member

also here type hints would be helpful :)

Contributor Author

I have added the type hints.
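For illustration, one possible type-hinted signature (the concrete types are guesses, not necessarily what was committed):

from typing import List
import torch

def _run_teacher(self, batch: dict, corresponding_chunks: List[int],
                 teacher_outputs: List[List[torch.Tensor]], tensor_names: List[str]) -> None:
    ...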

@@ -596,3 +602,111 @@ def _all_ranks_have_data(self, has_data: bool, step: Optional[int] = None):
return False
else:
return True

class DistillationTrainer(Trainer):
Member

A short docstring explaining the purpose of this class, and ideally a short code example, would be helpful for the docs.

Contributor Author

I have added the docstring.

data_silo: DataSilo,
epochs: int,
n_gpu: int,
device,
Member

please add type hints :)

Contributor Author

I have added the type hints. However, for lr_scheduler I needed to import a private class from pytorch. I'm not sure if that's desirable.

Member

You're referring to something like from torch.optim.lr_scheduler import _LRScheduler ?

Contributor Author

Yes, I am.

Member

If mypy doesn't complain about the annotation with _LRScheduler, which I think it doesn't, we can use it. Otherwise, type ignore is the next best solution here in my opinion.
@tholor What do you think?

Member

Yes, the import looks okay to me - if that's the type we expect here, it is what it is ;)
If mypy complains, we can drop it or add the ignore comment.
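The annotation under discussion would then look roughly like this (the parameter name and the Optional default are assumptions for illustration):

from typing import Optional
from torch.optim.lr_scheduler import _LRScheduler  # private, but it is the base class of PyTorch's LR schedulers

def configure(lr_scheduler: Optional[_LRScheduler] = None) -> None:  # hypothetical function, just to show the annotation
    ...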

:param max_grad_norm: Max gradient norm for clipping, default 1.0, set to None to disable
:param distillation_loss_weight: The weight of the distillation loss
:param distillation_loss: Specifies how teacher and student logits should be compared. Can either be a string ("mse" for mean squared error or "kl_div" for KL divergence loss) or a callable loss function (needs to have named parameters student_logits and teacher_logits)
:param temperature: The temperature for distillation
Member

please describe briefly the effects of the params (e.g. a higher temperature results in ...)

Contributor Author

Okay, I have also added this description to the distil_from method.
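A sketch of what such descriptions could look like (the wording is illustrative, not the PR's final docstring):

:param temperature: The temperature for distillation. A higher temperature softens the teacher's output distribution, so the student also learns from the relative scores the teacher assigns to wrong answers; a temperature of 1 leaves the logits unchanged.
:param distillation_loss_weight: The weight of the distillation loss relative to the regular loss on the gold labels. A higher value makes the student match the teacher's logits more closely; a lower value puts more emphasis on the labeled training data.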

"""
Fine-tune a model on a QA dataset using distillation. You need to provide a teacher model that is already finetuned on the dataset
and a student model that will be trained using the teacher's logits. The idea of this is to increase the accuracy of a lightweight student model
using a more complex teacher.
Member

Can you give a short code example here including a reasonable combination of models (to show usage and clarify that the student can be another pretrained model)?

Contributor Author

Okay
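A sketch of the kind of example that could go into the docstring, using the model combination from the benchmark results above (the exact distil_from parameters are assumed to mirror train, and the data paths are placeholders):

from haystack.nodes import FARMReader

student = FARMReader(model_name_or_path="prajjwal1/bert-medium")
teacher = FARMReader(model_name_or_path="deepset/bert-large-uncased-whole-word-masking-squad2")
# Fine-tune the student on SQuAD-style data while distilling from the teacher's logits.
student.distil_from(teacher, data_dir="data/squad20", train_filename="train-v2.0.json",
                    temperature=5, distillation_loss_weight=1.0)
student.save(directory="my_distilled_model")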
