# Chapter 8: Making Transformers Efficient in Production

In [None]:
%%html
<style>
.pad-left {
    padding-left: 20px;
}
</style>

## Background

> (W)hen developing a new machine learning model for your business, do you first make it accurate, then worry about making it fast in production? Or do you first make sure it can be fast, then make it accurate? 
> <p/>
> ...
> <p/>
> While this was a stressful experience for us, it doesn’t have to be for you, because in this article we are going to share the optimizations that made Bert inference fast for us. So you can start with an egg (a known playbook for making certain Bert models fast in production), then focus on the chicken (making your Bert model accurate).

* Blog post from Roblox: [How We Scaled BERT to Serve 1+ Billion Daily Requests on CPUs](https://medium.com/@quocnle/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26)
* And the [video from Databricks on YouTube](https://youtu.be/Nw77sEAn_Js)

#### Key takeaways

1. _Smaller Model_: model distillation
1. _Smaller Inputs_: drop fixed-length padding in favor of dynamically shaped inputs
1. _Smaller Weights_: use quantization, accepting that it may cost some accuracy
1. _Fewer Requests_: use caching
1. _Fewer Threads per Core_: tune threading with [`torch.set_num_threads`](https://pytorch.org/docs/stable/generated/torch.set_num_threads.html)
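
Two of these takeaways (dynamically shaped inputs and caching) can be illustrated without any ML library. The following is only a sketch: the token ids, the fixed length of 128, and the `classify` stand-in are made up for illustration and are not taken from the blog post.

```python
from functools import lru_cache


def pad_batch(batch, pad_id=0, fixed_length=None):
    """Pad token-id lists to fixed_length, or to the longest sequence in the batch."""
    target = fixed_length if fixed_length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


# Hypothetical token ids for three short queries.
batch = [[101, 7592, 102], [101, 2054, 2003, 102], [101, 102]]

static = pad_batch(batch, fixed_length=128)  # every row padded to 128 tokens
dynamic = pad_batch(batch)                   # rows padded to the longest row only

static_tokens = sum(len(row) for row in static)
dynamic_tokens = sum(len(row) for row in dynamic)
print(f"tokens processed: static={static_tokens}, dynamic={dynamic_tokens}")

# Caching: identical requests are answered without calling the model again.
calls = {"model": 0}


@lru_cache(maxsize=1024)
def classify(text):
    """Stand-in for an expensive model call; the body only runs on a cache miss."""
    calls["model"] += 1
    return "greeting" if "hello" in text.lower() else "other"


for text in ["Hello there", "Hello there", "book a flight", "Hello there"]:
    classify(text)
print(f"requests=4, model calls={calls['model']}")
```

With short queries, padding to the longest sequence in the batch processes far fewer tokens than padding to a fixed maximum length, and the cache turns repeated queries into dictionary lookups.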

## Intent Detection as a Case Study

In [None]:
from transformers import pipeline

teacher_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=teacher_ckpt)

In [None]:
query = """Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in Paris and I need a 15 passenger van"""

pipe(query)

### CLINC150

CLINC150 is a dataset for task-oriented dialog systems; it was used to fine-tune the baseline model in this example.

Importantly, it also includes out-of-scope queries, that is, queries that do not belong to any of the supported intent classes.

Please see: [`clinc_oos` at 🤗](https://huggingface.co/datasets/clinc_oos)

In [None]:
from datasets import load_dataset

clinc = load_dataset("clinc_oos", "plus")

In [None]:
sample = clinc["test"][42]
sample

In [None]:
intents = clinc["test"].features["intent"]
intents.int2str(sample["intent"])

----

## Creating a Performance Benchmark

In [None]:
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type
        
    def compute_accuracy(self):
        # tbd
        pass

    def compute_size(self):
        # tbd
        pass

    def time_pipeline(self):
        # tbd
        pass

    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics

#### Implementing `compute_accuracy`

In [None]:
from datasets import load_metric

accuracy_score = load_metric("accuracy")

In [None]:
def compute_accuracy(self):
    """This overrides the PerformanceBenchmark.compute_accuracy() method"""
    preds, labels = [], []
    for example in self.dataset:
        pred = self.pipeline(example["text"])[0]["label"]
        label = example["intent"]
        preds.append(intents.str2int(pred))
        labels.append(label)

    accuracy = accuracy_score.compute(predictions=preds, references=labels)
    print(f"Accuracy on test set - {accuracy['accuracy']:.3f}")
    return accuracy

PerformanceBenchmark.compute_accuracy = compute_accuracy

#### Implementing `compute_size`

In [None]:
list(pipe.model.state_dict().items())[42]

In [None]:
import torch

torch.save(pipe.model.state_dict(), "model.pt")

In [None]:
from pathlib import Path

def compute_size(self):
    """This overrides the PerformanceBenchmark.compute_size() method"""
    state_dict = self.pipeline.model.state_dict()
    tmp_path = Path("model.pt")
    torch.save(state_dict, tmp_path)
    # calculate size in megabytes
    size_mb = Path(tmp_path).stat().st_size / (1024*1024)
    # delete tmp file
    tmp_path.unlink()
    print(f"Model size (MB) - {size_mb:.2f}")
    return {"size_mb": size_mb}

PerformanceBenchmark.compute_size = compute_size

#### Implementing `time_pipeline`

In [None]:
from time import perf_counter

for _ in range(3):
    start_time = perf_counter()
    _ = pipe(query)
    latency = perf_counter() - start_time
    print(f"Latency (ms) - {1000 * latency:.3f}")

In [None]:
import numpy as np

def time_pipeline(self, query="What is the pin number for my account?"):
    """This overrides the PerformanceBenchmark.time_pipeline method"""
    latencies = []

    # warm-up
    for _ in range(10):
        _ = self.pipeline(query)

    # now measure the elapsed time over 100 runs
    for _ in range(100):
        start_time = perf_counter()
        _ = self.pipeline(query)
        latency = perf_counter() - start_time
        latencies.append(latency)

    # compute run stats
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
    return { "time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms }

PerformanceBenchmark.time_pipeline = time_pipeline

In [None]:
pb = PerformanceBenchmark(pipe, clinc["test"])
perf_metrics = pb.run_benchmark()

## Making Models Smaller via Knowledge Distillation

### Creating a Knowledge Distillation Trainer

In addition to the _105_ parameters that [`transformers.TrainingArguments`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments) accepts, we add two more to support training a student model with knowledge distillation:

* `alpha` ... $\alpha$ controls the weighted average of cross-entropy and knowledge-distillation loss for the student model (see below). Ranges from 0.0 to 1.0; $\alpha = 1.0$ means that we only use the cross-entropy of the student and ignore any signal from the teacher.
* `temperature` ... $T$ softens the probability distributions by scaling the logits before applying softmax:

<p class="pad-left">\(p_{i}(x) = \frac{\exp(z_i(x)/T)}{\sum\limits_{j}\exp(z_{j}(x)/T)}\)</p>
<p>Ranges from 1.0 to $\infty$. $T=1$ recovers the original softmax distribution. </p>
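
To make the effect of $T$ concrete, here is a minimal pure-Python sketch of the temperature-scaled softmax. The logits are illustrative values, not outputs of the models used in this notebook, and `softmax_with_temperature` is a helper name introduced only for this example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    # Subtracting the max improves numerical stability and does not change the result.
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # illustrative values
for T in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

As $T$ grows, the distribution flattens: the largest probability shrinks and the smaller ones grow, while the ranking of the classes stays the same.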

In [None]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha
        self.temperature = temperature

During training, the loss is a weighted average of the student's usual cross-entropy loss and the knowledge-distillation loss between the teacher and the student.

<p class="pad-left">\(L_{student} = \alpha L_{CE} + (1 - \alpha) L_{KD}\)</p>
<p>where</p>


<p class="pad-left">\(L_{CE}\)</p>
<p>is the cross-entropy loss of the student against the ground-truth labels.</p>

<p class="pad-left">\(L_{KD} = T^{2}D_{KL}\)</p><p>is the knowledge-distillation loss, where \(T^{2}\) is a normalization factor that accounts for the fact that the gradients produced by soft labels scale as \(\frac{1}{T^{2}}\).</p>

<p class="pad-left">\(D_{KL}(p, q) = \sum\limits_{i} p_i(x) \log\frac{p_i(x)}{q_i(x)}\)</p>
<p>which is the expectation of the log difference between $p_i(x)$ and $q_i(x)$, taken under the probabilities $p_i(x)$. In our case, $p_i(x)$ is the <i>teacher</i> distribution and $q_i(x)$ is the <i>student</i> distribution, so the loss measures how far the student's predictions are from the teacher's.</p>
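
The pieces above can be combined into a small worked example for a single training example. This is a pure-Python sketch with made-up logits; the helper names (`softmax`, `kl_divergence`, `student_loss`) are introduced here for illustration, and the cross-entropy term is assumed to be computed at $T = 1$.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def student_loss(logits_student, logits_teacher, label, alpha=0.5, temperature=2.0):
    """L_student = alpha * L_CE + (1 - alpha) * T^2 * D_KL(teacher, student)."""
    # Cross-entropy of the student (at T = 1) against the ground-truth label.
    loss_ce = -math.log(softmax(logits_student)[label])
    # KL divergence between softened teacher (p) and student (q) distributions.
    p = softmax(logits_teacher, temperature)
    q = softmax(logits_student, temperature)
    loss_kd = temperature ** 2 * kl_divergence(p, q)
    return alpha * loss_ce + (1.0 - alpha) * loss_kd


teacher = [5.0, 1.0, 0.5]  # illustrative logits
student = [2.0, 1.5, 0.5]
print(student_loss(student, teacher, label=0, alpha=0.5, temperature=2.0))
```

Two sanity checks follow directly from the formulas: with $\alpha = 1$ the result is the plain cross-entropy, and when student and teacher logits are identical the distillation term is zero.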

In [None]:
import torch.nn as nn
import torch.nn.functional as F

from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model

    def compute_loss(self, model, inputs, return_outputs=False):
        outputs_stu = model(**inputs)
        # extract cross-entropy loss and logits from student
        loss_ce = outputs_stu.loss
        logits_stu = outputs_stu.logits
        
        # extract logits from teacher
        with torch.no_grad():
            outputs_tea = self.teacher_model(**inputs)
            logits_tea = outputs_tea.logits

        # soften probabilities and compute distillation loss
        loss_fct = nn.KLDivLoss(reduction="batchmean")
        loss_kd = self.args.temperature ** 2 * loss_fct(
            F.log_softmax(logits_stu / self.args.temperature, dim=-1),
            F.softmax(logits_tea / self.args.temperature, dim=-1)
        )

        # return weighted student loss
        loss = self.args.alpha * loss_ce + (1. - self.args.alpha) * loss_kd
        return (loss, outputs_stu) if return_outputs else loss

## Choosing a Good Student Initialization

> A good rule of thumb from the literature is that knowledge distillation works best when teacher and student are of the same _model type_.

So if we are using [BERT (`transformersbook/bert-base-uncased-finetuned-clinc`)](https://huggingface.co/transformersbook/bert-base-uncased-finetuned-clinc) as the teacher, then [DistilBERT (`distilbert-base-uncased`)](https://huggingface.co/distilbert-base-uncased) is a natural choice for the student.

In [None]:
from transformers import AutoTokenizer

student_ckpt = "distilbert-base-uncased"
student_tokenizer = AutoTokenizer.from_pretrained(student_ckpt)

def tokenize_text(batch):
    return student_tokenizer(batch["text"], truncation=True)


clinc_enc = clinc.map(tokenize_text, batched=True, remove_columns=["text"])
clinc_enc = clinc_enc.rename_column("intent", "labels")

In [None]:
from huggingface_hub import notebook_login

notebook_login()

We implement `compute_metrics` to track metrics during training. Here, we can reuse `accuracy_score`, which we used above in `PerformanceBenchmark.compute_accuracy`.

In [None]:
def compute_metrics(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_score.compute(predictions=predictions, references=labels)

#### Training arguments

* [`output_dir`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.output_dir) ... output directory where the model predictions and checkpoints will be written.
* [`evaluation_strategy`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.evaluation_strategy) ... `"no"`: No evaluation is done during training; `"steps"`: Evaluation is done (and logged) every eval_steps; or `"epoch"`: Evaluation is done at the end of each epoch.
* [`num_train_epochs`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.num_train_epochs) ... number of training epochs to perform (if not an integer, the fractional part is run as that fraction of the last epoch before training stops); defaults to 3.0.
* [`learning_rate`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.learning_rate) ... initial learning rate for [`transformers.AdamW`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/optimizer_schedules#transformers.AdamW) optimizer.
* [`weight_decay`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.weight_decay) ... weight decay to apply (if not zero) to all layers except all bias and `LayerNorm` weights in `transformers.AdamW` optimizer.
* [`per_device_train_batch_size`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.per_device_train_batch_size) ... batch size per GPU/TPU core/CPU for _training_; defaults to 8.
* [`per_device_eval_batch_size`](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments.per_device_eval_batch_size) ... batch size per GPU/TPU core/CPU for _evaluation_; defaults to 8.
* `alpha` ... controls the weighted average of cross-entropy and knowledge-distillation loss for the student model (see explanation above).
* `push_to_hub` ... whether to push the model to the Hugging Face Hub each time it is saved.
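
To get a feel for how batch size and epoch count translate into training length, the sketch below estimates the number of optimizer steps. It is an approximation under stated assumptions (no gradient accumulation, the last partial batch is kept); `total_optimizer_steps` is a hypothetical helper, and the dataset size of 10,000 is a round illustrative number, not the size of CLINC150.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_train_epochs, num_devices=1):
    """Estimate optimizer steps, assuming no gradient accumulation."""
    # One step per batch; the final, smaller batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # A fractional epoch count runs only that fraction of the last epoch.
    return math.ceil(steps_per_epoch * num_train_epochs)


print(total_optimizer_steps(10_000, 48, 5))    # full epochs
print(total_optimizer_steps(10_000, 48, 2.5))  # stops halfway through the third epoch
```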

In [None]:
batch_size = 48

finetuned_ckpt = "distilbert-base-uncased-finetuned-clinc"

student_training_args = DistillationTrainingArguments(
    output_dir=finetuned_ckpt,
    evaluation_strategy="epoch",
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    alpha=1,
    push_to_hub=True
)

#### Student model configuration

In [None]:
id2label = pipe.model.config.id2label
label2id = pipe.model.config.label2id

num_labels = intents.num_classes

In [None]:
from transformers import AutoConfig

student_config = AutoConfig.from_pretrained(
    student_ckpt,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
from transformers import AutoModelForSequenceClassification

def student_init():
    return (AutoModelForSequenceClassification.from_pretrained(
        student_ckpt,
        config=student_config
    ).to(device))

In [None]:
teacher_model = (AutoModelForSequenceClassification.from_pretrained(
    teacher_ckpt,
    num_labels=num_labels
).to(device))

In [None]:
distilbert_trainer = DistillationTrainer(
    model_init=student_init,
    teacher_model=teacher_model,
    args=student_training_args,
    train_dataset=clinc_enc["train"],
    eval_dataset=clinc_enc["validation"],
    compute_metrics=compute_metrics,
    tokenizer=student_tokenizer
)

distilbert_trainer.train()