# Neural Architecture Search for Efficient Machine Translation Models

In this tutorial, we will go over the high-level theory and implementation details of the **neural architecture search (NAS)** pipeline for identifying efficient machine translation models. The machine translation models are based on the classical encoder-decoder Transformer architecture. These models are trained from scratch (no pretraining) to convergence on machine translation benchmarks. This tutorial borrows the theory and implementation from [Hardware-Aware Transformers](https://arxiv.org/pdf/2005.14187.pdf), a popular NAS framework for building efficient autoregressive machine translation models.

This notebook was created by Ganesh Jawahar (ganeshjwhr@gmail.com).

## Prerequisites
- [PyTorch](https://pytorch.org/)
- [Transformers](https://arxiv.org/abs/1706.03762)

## Problem Setting (General)
The goal of neural architecture search is to identify the architecture that maximizes the **accuracy on a user-defined task** while satisfying user-defined **hardware constraints**. Specifically, the inputs to the neural architecture search are:
- **Task:** The NLP task (e.g., autocomplete, machine translation) that the Transformer model should solve.
- **Search Space:** Set of candidate Transformer architectures (e.g., varying number of layers, attention heads) that can solve the **task**.
- **Constraint:** Constraint on the footprint metric (e.g., $\leq16$ MB memory or $\leq200$ ms latency) that the architecture must satisfy.
- **Accuracy:** The metric used to quantify the accuracy of the model on the **task**.

The method should output the architecture that maximizes the **accuracy** of the model on the **task** from the **search space**, while satisfying the **constraint**. 
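
To make the objective concrete, here is a minimal sketch of the selection rule: keep only the candidates that meet the constraint, then take the most accurate one. The `Candidate` class, the `select_best` helper, and all numbers are made up for illustration; a real NAS method cannot enumerate and train every candidate like this.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str         # hypothetical architecture label
    accuracy: float   # task metric, higher is better (e.g., BLEU)
    footprint: float  # hardware metric, lower is better (e.g., latency in ms)


def select_best(candidates: Iterable[Candidate], max_footprint: float) -> Optional[Candidate]:
    """Return the most accurate candidate whose footprint meets the constraint."""
    feasible = [c for c in candidates if c.footprint <= max_footprint]
    return max(feasible, key=lambda c: c.accuracy, default=None)


# Made-up numbers, for illustration only.
pool = [
    Candidate("small", accuracy=25.1, footprint=120.0),
    Candidate("medium", accuracy=26.8, footprint=190.0),
    Candidate("large", accuracy=27.9, footprint=340.0),
]
best = select_best(pool, max_footprint=200.0)
print(best.name)  # medium
```

The most accurate candidate overall (`large`) is rejected because it violates the constraint, so the search returns `medium`.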

## Problem Setting (This Tutorial)
In this tutorial, the inputs are:
- **Task:** Machine Translation Task (e.g., WMT 2014 English to German)
- **Search Space:** Set of candidate encoder-decoder Transformer architectures with varying number of decoder layers, embedding size, attention heads (self-attention and cross-attention), feed-forward network (FFN) intermediate size, and arbitrary encoder-decoder attention. For arbitrary encoder-decoder attention, a value of -1 means the decoder layer attends to the last encoder layer only, 1 means the last two encoder layers, and 2 means the last three encoder layers.

| Attributes | Dimensions |
| --- | --- |
| Encoder-Embedding-Size | [640, 512] |
| Decoder-Embedding-Size | [640, 512] |
| \#Encoder-Layers | [6] |
| \#Decoder-Layers | [1, 2, 3, 4, 5, 6] |
| Encoder-QKV-Dim | 512 |
| Decoder-QKV-Dim | 512 |
| \#Encoder-Self-Att-Heads (Per Layer) | [4, 8] |
| \#Decoder-Self-Att-Heads (Per Layer) | [4, 8] |
| \#Decoder-Cross-Att-Heads (Per Layer) | [4, 8] |
| \#Decoder-Arbitrary-Att (Per Layer) | [-1, 1, 2] |
| Encoder-FFN-Intermediate-Size (Per Layer) | [1024, 2048, 3072] |
| Decoder-FFN-Intermediate-Size (Per Layer) | [1024, 2048, 3072] |

- **Constraint:** Latency of $\leq200$ milliseconds, where latency is the time the model takes to encode the source sentence and generate the translation on the target hardware (a Colab GPU in this example)
- **Accuracy:** BLEU score
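
As a rough sketch of what one candidate from this search space looks like, the snippet below draws a random architecture: global attributes are sampled once, and per-layer attributes are sampled once for each layer. The dictionary keys and the `sample_architecture` helper are illustrative names chosen for this example, not HAT's own config keys.

```python
import random

# Choices copied from the search-space table above; the dictionary keys are
# illustrative names, not HAT's config keys.
SEARCH_SPACE = {
    "encoder_embed_dim": [640, 512],
    "decoder_embed_dim": [640, 512],
    "encoder_layers": [6],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "encoder_self_attention_heads": [4, 8],
    "decoder_self_attention_heads": [4, 8],
    "decoder_cross_attention_heads": [4, 8],
    "decoder_arbitrary_attention": [-1, 1, 2],
    "encoder_ffn_dim": [1024, 2048, 3072],
    "decoder_ffn_dim": [1024, 2048, 3072],
}


def sample_architecture(rng):
    """Draw one candidate: global choices once, per-layer choices once per layer."""
    n_enc = rng.choice(SEARCH_SPACE["encoder_layers"])
    n_dec = rng.choice(SEARCH_SPACE["decoder_layers"])

    def per_layer(key, n_layers):
        return [rng.choice(SEARCH_SPACE[key]) for _ in range(n_layers)]

    return {
        "encoder_embed_dim": rng.choice(SEARCH_SPACE["encoder_embed_dim"]),
        "decoder_embed_dim": rng.choice(SEARCH_SPACE["decoder_embed_dim"]),
        "encoder_layers": n_enc,
        "decoder_layers": n_dec,
        "encoder_self_attention_heads": per_layer("encoder_self_attention_heads", n_enc),
        "encoder_ffn_dim": per_layer("encoder_ffn_dim", n_enc),
        "decoder_self_attention_heads": per_layer("decoder_self_attention_heads", n_dec),
        "decoder_cross_attention_heads": per_layer("decoder_cross_attention_heads", n_dec),
        "decoder_arbitrary_attention": per_layer("decoder_arbitrary_attention", n_dec),
        "decoder_ffn_dim": per_layer("decoder_ffn_dim", n_dec),
    }


arch = sample_architecture(random.Random(0))
for name, value in arch.items():
    print(f"{name}: {value}")
```

Because the per-layer attributes are chosen independently for every layer, the number of distinct candidates grows exponentially with depth, which is why exhaustive evaluation is not an option.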





## Hardware-aware Transformers (Solution)

Hardware-aware Transformers (HAT) is a popular NAS framework to solve the problem. HAT has the following stages in the pipeline:
1. **Supernet training** - Train a performance estimator that can quickly provide the accuracy of an architecture from the search space
2. **Collect hardware latency datasets** - Generate a latency dataset with sample architectures and their corresponding latency measured on target hardware
3. **Train latency predictor** - Train a latency estimator on the generated latency dataset
4. **Evolutionary search** - Search for the efficient architecture, scoring each candidate architecture with the accuracy from the performance estimator and the latency from the latency estimator
5. **Train efficient architecture from scratch** - Train the efficient architecture from scratch to convergence

![HAT block diagram](https://drive.google.com/uc?export=view&id=1ysb21_UiSqahCm_dtrY8aoIZVKPnL6eE)

(Picture courtesy: [Hardware-Aware Transformers](https://arxiv.org/pdf/2005.14187.pdf))

## HAT generated sample architectures
Some sample architectures generated by HAT when target hardware is Raspberry Pi (left side) and Titan XP (right side):

<img src="https://drive.google.com/uc?export=view&id=1plfV3ZqxaqWznF8oJ5XLAYtHAXos2-Fy" alt="HAT generated architectures" width="700"/>

(Picture courtesy: [Hardware-Aware Transformers](https://arxiv.org/pdf/2005.14187.pdf))



## HAT Implementation

### 0.1 Installation

Install HAT by running the following commands:

In [1]:
!git clone https://github.com/mit-han-lab/hardware-aware-transformers.git
%cd hardware-aware-transformers
!pip install --editable .

Cloning into 'hardware-aware-transformers'...
remote: Enumerating objects: 282, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 282 (delta 32), reused 23 (delta 23), pack-reused 193
Receiving objects: 100% (282/282), 17.09 MiB | 30.38 MiB/s, done.
Resolving deltas: 100% (100/100), done.
/content/hardware-aware-transformers
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/hardware-aware-transformers
  Preparing metadata (setup.py) ... done
Collecting fastBPE
  Downloading fastBPE-0.1.0.tar.gz (35 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting configargparse
  Downloading ConfigArgParse-1.5.3-py3-none-any.whl

In [14]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.get_device_name(0)
    print(f'GPU type: {device}')
else:
    print('No GPU available')

GPU type: Tesla T4


### 0.2 Download data

Download the preprocessed data for the machine translation task. The syntax is:

`bash configs/[task_name]/get_preprocessed.sh`
- where `[task_name]` can be `wmt14.en-de`, `wmt14.en-fr`, `wmt19.en-de` and `iwslt14.de-en`.

In this tutorial, we will focus on WMT 2014 English to German (`wmt14.en-de`). 


In [2]:
!bash configs/wmt14.en-de/get_preprocessed.sh

--2023-02-13 10:51:14--  https://www.dropbox.com/s/axfwl1vawper8yk/wmt16_en_de.preprocessed.tgz?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/axfwl1vawper8yk/wmt16_en_de.preprocessed.tgz [following]
--2023-02-13 10:51:15--  https://www.dropbox.com/s/raw/axfwl1vawper8yk/wmt16_en_de.preprocessed.tgz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4cda363aacd537ace6c889b135.dl.dropboxusercontent.com/cd/0/inline/B2ZEth56qq7FygA8sEdu5RbjyGq2MDgrPVKyYatMvGupAZ7sft2vpKHy1py4__RXoK3Vn-YnS28H1wc40XxcpWCzj2da9aXN-JozbkrBZ_ywFRemrHjsQOwydgA5rWi2Mw1i7qHiqXz2Vug76fEiaE9zvitkBKWlxURM5d3jcIrHvw/file# [following]
--2023-02-13 10:51:15--  https://uc4cda363aacd537ace6c889b135.dl.dropboxusercontent.com/cd/0/inline/B2ZEth56qq7FygA8sEdu5RbjyGq

### 0.3 Inspect the search space

Look at the search space config:

In [3]:
!cat configs/wmt14.en-de/supertransformer/space0.yml

# model
arch: transformersuper_wmt_en_de
share-all-embeddings: True
max-tokens: 4096
data: data/binary/wmt16_en_de

# training settings
optimizer: adam
adam-betas: (0.9, 0.98)
clip-norm: 0.0
weight-decay: 0.0
dropout: 0.3
attention-dropout: 0.1
criterion: label_smoothed_cross_entropy
label-smoothing: 0.1

ddp-backend: no_c10d
fp16: True

# warmup from warmup-init-lr to max-lr (warmup-updates steps); then cosine anneal to lr (max-update - warmup-updates steps)
update-freq: 16
max-update: 40000
warmup-updates: 10000
lr-scheduler: cosine
warmup-init-lr: 1e-7
max-lr: 0.001
lr: 1e-7
lr-shrink: 1

# logging
keep-last-epochs: 20
save-interval: 10
validate-interval: 10

# SuperTransformer configs
encoder-embed-dim: 640
decoder-embed-dim: 640

encoder-ffn-embed-dim: 3072
decoder-ffn-embed-dim: 3072

encoder-layers: 6
decoder-layers: 6

encoder-attention-heads: 8
decoder-attention-heads: 8

qkv-dim: 512

# SubTransformers search space
encoder-embed-choice: [640, 512]
decoder-embed-choice: [640, 

The `# SubTransformers search space` comment marks the search space for NAS, which defines the possible values each Transformer hyperparameter can take.
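
The comment in the training settings describes the learning-rate schedule: a linear warmup from `warmup-init-lr` to `max-lr` over `warmup-updates` steps, followed by cosine annealing down to `lr` over the remaining steps. The function below is a minimal sketch of that description using the values from the config. It is written from the comment alone and is not fairseq's scheduler code, so treat the exact shape as an assumption.

```python
import math


def learning_rate(step, warmup_init_lr=1e-7, max_lr=1e-3, min_lr=1e-7,
                  warmup_updates=10_000, max_update=40_000):
    """Linear warmup to max_lr, then cosine annealing down to min_lr."""
    if step < warmup_updates:
        # Linear interpolation between warmup_init_lr and max_lr.
        return warmup_init_lr + (max_lr - warmup_init_lr) * step / warmup_updates
    # Fraction of the annealing phase completed, clipped to [0, 1].
    progress = min(1.0, (step - warmup_updates) / (max_update - warmup_updates))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 25_000, 40_000):
    print(step, f"{learning_rate(step):.2e}")
```

Note also that with `max-tokens: 4096` and `update-freq: 16`, gradients are accumulated over 16 batches of up to 4096 tokens each, so one optimizer update sees up to 4096 × 16 = 65,536 tokens per GPU.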

### 1. Supernet training

A typical challenge in the NAS framework is to develop a **performance estimator** that can efficiently compute the accuracy of a candidate architecture. The naive approach of training candidate architectures from scratch to convergence and then evaluating on the validation set is prohibitively expensive given the large search space for all possible candidate architectures.

HAT's performance estimator is based on weight-sharing via a Supernet. The supernet is the largest model in the search space (marked by `# SuperTransformer configs` in the previous config file).


The Supernet is trained with the following steps:  
1. sample a candidate architecture randomly from the search space
2. train the sampled architecture by extracting the common portion of weights (subnet extraction) from different layers in the Supernet (i.e., by weight sharing) for one training step on the task
3. repeat steps 1 and 2 until the training budget is exhausted. 

Once the Supernet training is complete, we can obtain a quick accuracy estimate for a candidate architecture (i.e. subnetwork) by extracting its shared weights from the Supernet and evaluating on the validation set.

Let us understand how subnet extraction works using an ``nn.Linear`` layer. As shown below, assume a linear layer in the Supernet has 640 input features and 1024 output features (weight shape $1024\times 640$). Say the same linear layer in the subnet has only 512 input features and 768 output features (weight shape $768\times 512$). The linear layer weights for the subnet can be constructed by extracting the first 768 rows and the first 512 columns from the corresponding weights of the Supernet.

<img src="https://drive.google.com/uc?export=view&id=1NRi9tF_LbAA5oZduxjn_lFCU_1eQMzHy" alt="HAT generated architectures" width="300"/>

(Picture courtesy: [Hardware-Aware Transformers](https://arxiv.org/pdf/2005.14187.pdf))


Here's a sample implementation of ``LinearSuper``, which generalizes ``nn.Linear``:


In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSuper(nn.Linear): # inherit from nn.Linear
    def __init__(self, super_in_dim, super_out_dim, bias=True, uniform_=None, non_linear='linear'):
        super().__init__(super_in_dim, super_out_dim, bias=bias)

        # super_in_dim and super_out_dim indicate the largest network!
        self.super_in_dim = super_in_dim
        self.super_out_dim = super_out_dim

        # input_dim and output_dim indicate the current sampled size
        self.sample_in_dim = None
        self.sample_out_dim = None

        self.samples = {}

        self._reset_parameters(bias, uniform_, non_linear)
        self.profiling = False

    def profile(self, mode=True):
        self.profiling = mode

    def sample_parameters(self, resample=False):
        if self.profiling or resample:
            return self._sample_parameters()
        return self.samples

    def _reset_parameters(self, bias, uniform_, non_linear):
        nn.init.xavier_uniform_(self.weight) if uniform_ is None else uniform_(
            self.weight, non_linear=non_linear)
        if bias:
            nn.init.constant_(self.bias, 0.)

    def set_sample_config(self, sample_in_dim, sample_out_dim):
        self.sample_in_dim = sample_in_dim
        self.sample_out_dim = sample_out_dim

        self._sample_parameters()

    def _sample_parameters(self):
        self.samples['weight'] = sample_weight(self.weight, self.sample_in_dim, self.sample_out_dim)
        self.samples['bias'] = self.bias
        if self.bias is not None:
            self.samples['bias'] = sample_bias(self.bias, self.sample_out_dim)
        return self.samples

    def forward(self, x):
        self.sample_parameters()
        return F.linear(x, self.samples['weight'], self.samples['bias'])

    def calc_sampled_param_num(self):
        assert 'weight' in self.samples.keys()
        weight_numel = self.samples['weight'].numel() #Returns the total number of elements in the input tensor.

        if self.samples['bias'] is not None:
            bias_numel = self.samples['bias'].numel()
        else:
            bias_numel = 0

        return weight_numel + bias_numel

# weight extraction for subnet
def sample_weight(weight, sample_in_dim, sample_out_dim):
    sub_weight = weight[:, :sample_in_dim] # extract first `sample_in_dim` columns
    sub_weight = sub_weight[:sample_out_dim, :] # extract first `sample_out_dim` rows

    return sub_weight

# bias extraction for subnet
def sample_bias(bias, sample_out_dim):
    sub_bias = bias[:sample_out_dim] # extract first `sample_out_dim` numbers

    return sub_bias


Let us construct the linear layer for Supernet.

In [9]:
linearlayer_supernet = LinearSuper(super_in_dim=640, super_out_dim=1024)
# print the shape of weight matrix
print("Supernet: weight shape = ", linearlayer_supernet.weight.shape)
# print the shape of bias matrix
print("Supernet: bias shape = ", linearlayer_supernet.bias.shape)

Supernet: weight shape =  torch.Size([1024, 640])
Supernet: bias shape =  torch.Size([1024])


To extract the **subnet** weights, we call the `set_sample_config()` method with the subnet's input and output feature sizes as follows:

In [10]:
linearlayer_supernet.set_sample_config(sample_in_dim=512, sample_out_dim=768)
# print the shape of weight matrix
print("Subnet: weight shape = ", linearlayer_supernet.samples['weight'].shape)
# print the shape of bias matrix
print("Subnet: bias shape = ", linearlayer_supernet.samples['bias'].shape)

Subnet: weight shape =  torch.Size([768, 512])
Subnet: bias shape =  torch.Size([768])
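
Slicing a tensor in PyTorch returns a view, so the subnet weights are not a copy: a training step on the subnet writes gradients into the matching block of the Supernet weights and leaves the rest untouched. The standalone sketch below checks this with plain tensors and `F.linear` instead of the `LinearSuper` class; the variable names and the toy loss are chosen for this example only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Supernet parameters for one linear layer: 1024 outputs x 640 inputs.
super_weight = torch.nn.Parameter(torch.randn(1024, 640) * 0.01)
super_bias = torch.nn.Parameter(torch.zeros(1024))

# Subnet parameters: first 768 rows and first 512 columns (views, not copies).
sub_in, sub_out = 512, 768
sub_weight = super_weight[:sub_out, :sub_in]
sub_bias = super_bias[:sub_out]

# One forward/backward pass through the subnet with a toy loss.
x = torch.randn(4, sub_in)
loss = F.linear(x, sub_weight, sub_bias).pow(2).mean()
loss.backward()

# The gradient has the Supernet's shape, but only the shared block is non-zero.
grad = super_weight.grad
touched = grad[:sub_out, :sub_in].abs().sum().item()
untouched = grad[sub_out:, :].abs().sum().item() + grad[:, sub_in:].abs().sum().item()
print("gradient shape:", tuple(grad.shape))
print("shared block updated:", touched > 0)
print("remaining weights untouched:", untouched == 0.0)
```

This is the weight-sharing mechanism behind Supernet training: every sampled subnet updates its own slice of the same underlying parameters.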


Refer to `fairseq/modules` in [HAT's repo](https://github.com/mit-han-lab/hardware-aware-transformers/tree/master/fairseq/modules) to see the subnet extraction implementation for other Transformer layers, e.g., `multihead_attention_super` (self-attention) and `embedding_super` (embedding layer).

### A Toy Example
Let us now train a Supernet for a few steps (change `max-tokens`, `max-update`, and `update-freq` for full training). Before running the following command, change line 198 of `fairseq/modules/multihead_attention_super.py` from `q *= self.scaling` to `q = q * self.scaling` to avoid this error: `RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.`

In [12]:
!mkdir -p baseline/supernet # stores supernet checkpoint
!python -B train.py \
            --configs=configs/wmt14.en-de/supertransformer/space0.yml \
            --save-dir baseline/supernet \
            --no-epoch-checkpoints \
            --max-update 5 \
            --save-interval-updates 5

| Configs: Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformersuper_wmt_en_de', attention_dropout=0.1, beam=5, best_checkpoint_metric='loss', bucket_cap_mb=25, clip_norm=0.0, configs='configs/wmt14.en-de/supertransformer/space0.yml', cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='data/binary/wmt16_en_de', dataset_impl=None, ddp_backend='no_c10d', decoder_arbitrary_ende_attn_all_subtransformer=None, decoder_arbitrary_ende_attn_choice=[-1, 1, 2], decoder_attention_heads=8, decoder_embed_choice=[640, 512], decoder_embed_dim=640, decoder_embed_dim_subtransformer=None, decoder_embed_path=None, decoder_ende_attention_heads_all_subtransformer=None, decoder_ende_attention_heads_choice=[8, 4], decoder_ffn_embed_dim=3072, decoder_ffn_embed_dim_all_subtransformer=None, decoder_ffn_embed_dim_choice=[3072, 2048, 1024], decoder_inpu

The best supernet checkpoint can be accessed at `baseline/supernet/checkpoint_best.pt`.

### 2. Collect hardware latency datasets

In the next step, we will generate the **hardware latency datasets**, which will be subsequently used to train a latency prediction model. This step will sample architectures from the search space and measure the latency of the architecture on the target hardware.
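
Measuring latency reliably usually means discarding a few warmup runs and averaging over several timed runs. The sketch below shows that pattern with the standard library only; `measure_latency_ms` and `dummy_translate` are hypothetical names, and the dummy workload stands in for "encode the source sentence and generate the translation". It is not the measurement code in `latency_dataset.py`.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, repeats=10):
    """Return (mean, standard deviation) of the run time of fn in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs are not timed (caches, lazy initialization, ...)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings), statistics.stdev(timings)


def dummy_translate():
    # Hypothetical stand-in for one translation call on the target hardware.
    return sum(i * i for i in range(50_000))


mean_ms, std_ms = measure_latency_ms(dummy_translate)
print(f"latency: {mean_ms:.2f} ms (std {std_ms:.2f} ms)")
```

On a GPU, kernels run asynchronously, so a real measurement also needs to call `torch.cuda.synchronize()` before reading the clock at the start and end of each timed run.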

Create a small dataset by running the following command (remove `--lat-dataset-size 25` to generate the full dataset):

In [16]:
!mkdir -p baseline/genlatdata # to save latency dataset
!CUDA_VISIBLE_DEVICES=0 python latency_dataset.py \
        --configs=configs/wmt14.en-de/latency_dataset/gpu_titanxp.yml  \
        --lat-dataset-path baseline/genlatdata/wmt14.en-de_gpu_titanxp.csv \
        --lat-dataset-size 25

Namespace(activation_dropout=0.0, activation_fn='relu', adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformersuper_wmt_en_de', attention_dropout=0.0, beam=5, best_checkpoint_metric='loss', bucket_cap_mb=25, clip_norm=25, configs='configs/wmt14.en-de/latency_dataset/gpu_titanxp.yml', cpu=False, criterion='cross_entropy', curriculum=0, data='data/binary/wmt16_en_de', dataset_impl=None, ddp_backend='c10d', decoder_arbitrary_ende_attn_all_subtransformer=None, decoder_arbitrary_ende_attn_choice=[-1, 1, 2], decoder_attention_heads=8, decoder_embed_choice=[640, 512], decoder_embed_dim=640, decoder_embed_dim_subtransformer=None, decoder_embed_path=None, decoder_ende_attention_heads_all_subtransformer=None, decoder_ende_attention_heads_choice=[8, 4, 2], decoder_ffn_embed_dim=3072, decoder_ffn_embed_dim_all_subtransformer=None, decoder_ffn_embed_dim_choice=[3072, 2048, 1024, 512], decoder_input_dim=640, decoder_layer_num_choice=[6, 5, 4, 3, 2, 1], deco

The latency dataset can be accessed at `baseline/genlatdata/wmt14.en-de_gpu_titanxp.csv`.

### 3. Train latency predictor 

After generating the latency dataset, we can train a latency predictor. HAT's predictor is a simple MLP regressor; the run below uses `hidden_layer_num=3` and `hidden_dim=400`, as shown in its log.

Run the following command to train the latency predictor (remove `--bsz 2` for a full run):

In [18]:
!mkdir -p baseline/latpred # stores latency predictor checkpoint
!python latency_predictor.py \
        --configs=configs/wmt14.en-de/latency_predictor/gpu_titanxp.yml \
        --feature-norm 640 6 2048 6 640 6 2048 6 6 2 \
        --feature-dim 10 \
        --lat-dataset-path baseline/genlatdata/wmt14.en-de_gpu_titanxp.csv \
        --ckpt-path baseline/latpred/wmt14.en-de_gpu_titanxp.pt \
        --bsz 2

Namespace(bsz=2, ckpt_path='baseline/latpred/wmt14.en-de_gpu_titanxp.pt', configs='configs/wmt14.en-de/latency_predictor/gpu_titanxp.yml', dataset_path=None, feature_dim=10, feature_norm=[640.0, 6.0, 2048.0, 6.0, 640.0, 6.0, 2048.0, 6.0, 6.0, 2.0], hidden_dim=400, hidden_layer_num=3, lat_dataset_path='baseline/genlatdata/wmt14.en-de_gpu_titanxp.csv', lat_norm=200.0, lr=1e-05, train_steps=5000)
  sample_x_tensor = torch.Tensor(sample_x)
Validation loss at 0 steps: 1.3056495189666748
Validation loss at 100 steps: 0.8925215005874634
Validation loss at 200 steps: 0.3997454047203064
Validation loss at 300 steps: 0.12640422582626343
Validation loss at 400 steps: 0.08265182375907898
Validation loss at 500 steps: 0.0819818377494812
Validation loss at 600 steps: 0.07359975576400757
Validation loss at 700 steps: 0.0808529183268547
Validation loss at 800 steps: 0.0610981248319149
Validation loss at 900 steps: 0.0539892241358757
Validation loss at 1000 steps: 0.03898584097623825
Validation loss at

The latency predictor can be accessed at `baseline/latpred/wmt14.en-de_gpu_titanxp.pt`.

### 4. Evolutionary search

Now we have a latency predictor and a performance estimator that quickly return the latency and accuracy of a candidate architecture. We can run an evolutionary search under the latency constraint (at most 200 milliseconds) to identify the efficient architecture that maximizes the BLEU score while satisfying the constraint.
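
To give a feel for how such a search proceeds, here is a toy evolutionary loop. It keeps the best candidates as parents, creates children by mutation and crossover, and discards any child that violates the latency constraint. The tiny search space, `toy_score`, and `toy_latency_ms` are invented stand-ins for the Supernet-based estimator and the latency predictor, so this is a sketch of the general technique rather than the logic of `evo_search.py`.

```python
import random

# A deliberately tiny, made-up search space.
CHOICES = {
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
}


def toy_score(arch):
    # Hypothetical accuracy proxy: larger models score higher.
    return arch["decoder_layers"] + arch["embed_dim"] / 640 + arch["ffn_dim"] / 3072


def toy_latency_ms(arch):
    # Hypothetical latency model: larger models are slower.
    return 40.0 * arch["decoder_layers"] + 0.05 * arch["embed_dim"] + 0.01 * arch["ffn_dim"]


def sample_feasible(rng, limit):
    while True:
        arch = {key: rng.choice(values) for key, values in CHOICES.items()}
        if toy_latency_ms(arch) <= limit:
            return arch


def mutate(arch, rng, prob=0.3):
    # Resample each attribute independently with probability `prob`.
    return {k: rng.choice(v) if rng.random() < prob else arch[k] for k, v in CHOICES.items()}


def crossover(a, b, rng):
    # Take each attribute from one of the two parents at random.
    return {k: rng.choice([a[k], b[k]]) for k in CHOICES}


def evolve(limit=200.0, iterations=10, population_size=20, parent_size=5,
           mutation_size=8, crossover_size=7, seed=0):
    rng = random.Random(seed)
    population = [sample_feasible(rng, limit) for _ in range(population_size)]
    for _ in range(iterations):
        parents = sorted(population, key=toy_score, reverse=True)[:parent_size]
        children = []
        while len(children) < mutation_size:
            child = mutate(rng.choice(parents), rng)
            if toy_latency_ms(child) <= limit:  # keep only feasible children
                children.append(child)
        while len(children) < mutation_size + crossover_size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if toy_latency_ms(child) <= limit:
                children.append(child)
        population = parents + children
    return max(population, key=toy_score)


best = evolve()
print(best, f"{toy_latency_ms(best):.1f} ms")
```

The keyword arguments mirror the roles of the `--evo-iter`, `--population-size`, `--parent-size`, `--mutation-size`, and `--crossover-size` flags used in the command below.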

Run the following command to start the search (remove `--evo-iter 1 --parent-size 2 --mutation-size 2 --crossover-size 2 --population-size 6` for full search):

In [20]:
!mkdir -p baseline/evosearch # to store best architecture config
!CUDA_VISIBLE_DEVICES=0 python evo_search.py \
        --configs=configs/wmt14.en-de/supertransformer/space0.yml \
        --evo-configs=configs/wmt14.en-de/evo_search/wmt14ende_titanxp.yml \
        --restore-file baseline/supernet/checkpoint_best.pt \
        --ckpt-path baseline/latpred/wmt14.en-de_gpu_titanxp.pt \
        --feature-norm 640 6 2048 6 640 6 2048 6 6 2 \
        --write-config-path baseline/evosearch/wmt14.en-de_gpu_titanxp.yml \
        --evo-iter 1 \
        --parent-size 2 \
        --mutation-size 2 \
        --crossover-size 2 \
        --population-size 6

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformersuper_wmt_en_de', attention_dropout=0.1, beam=5, best_checkpoint_metric='loss', bucket_cap_mb=25, ckpt_path='baseline/latpred/wmt14.en-de_gpu_titanxp.pt', clip_norm=0.0, configs='configs/wmt14.en-de/supertransformer/space0.yml', cpu=False, criterion='label_smoothed_cross_entropy', crossover_size=2, curriculum=0, data='data/binary/wmt16_en_de', dataset_impl=None, ddp_backend='no_c10d', decoder_arbitrary_ende_attn_all_subtransformer=None, decoder_arbitrary_ende_attn_choice=[-1, 1, 2], decoder_attention_heads=8, decoder_embed_choice=[640, 512], decoder_embed_dim=640, decoder_embed_dim_subtransformer=None, decoder_embed_path=None, decoder_ende_attention_heads_all_subtransformer=None, decoder_ende_attention_heads_choice=[8, 4], decoder_ffn_embed_dim=3072, decoder_ffn_embed_dim_all_subtransformer=Non

The config for the efficient architecture can be found at `baseline/evosearch/wmt14.en-de_gpu_titanxp.yml`.

Let us take a look at this config:

In [21]:
!cat baseline/evosearch/wmt14.en-de_gpu_titanxp.yml

encoder-embed-dim-subtransformer: 640
decoder-embed-dim-subtransformer: 512

encoder-ffn-embed-dim-all-subtransformer: [3072, 3072, 2048, 2048, 3072, 1024]
decoder-ffn-embed-dim-all-subtransformer: [2048, 2048, 1024, 2048]

encoder-layer-num-subtransformer: 6
decoder-layer-num-subtransformer: 4

encoder-self-attention-heads-all-subtransformer: [4, 8, 8, 8, 8, 4]
decoder-self-attention-heads-all-subtransformer: [4, 4, 8, 4]
decoder-ende-attention-heads-all-subtransformer: [8, 8, 4, 4]

decoder-arbitrary-ende-attn-all-subtransformer: [1, 1, -1, 2]
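
The latency predictor consumes a fixed-length vector (`--feature-dim 10`) scaled by the values passed to `--feature-norm`. The sketch below turns the config above into such a vector. The ordering of the ten features and the use of per-layer averages are assumptions inferred from the `--feature-norm` values; check HAT's source for the exact definition. The `check_config`, `to_features`, and `mean` helpers are hypothetical.

```python
# Values passed to --feature-norm in the commands of this tutorial.
FEATURE_NORM = [640, 6, 2048, 6, 640, 6, 2048, 6, 6, 2]

# The SubTransformer config printed above, written as a plain dictionary.
config = {
    "encoder-embed-dim-subtransformer": 640,
    "decoder-embed-dim-subtransformer": 512,
    "encoder-ffn-embed-dim-all-subtransformer": [3072, 3072, 2048, 2048, 3072, 1024],
    "decoder-ffn-embed-dim-all-subtransformer": [2048, 2048, 1024, 2048],
    "encoder-layer-num-subtransformer": 6,
    "decoder-layer-num-subtransformer": 4,
    "encoder-self-attention-heads-all-subtransformer": [4, 8, 8, 8, 8, 4],
    "decoder-self-attention-heads-all-subtransformer": [4, 4, 8, 4],
    "decoder-ende-attention-heads-all-subtransformer": [8, 8, 4, 4],
    "decoder-arbitrary-ende-attn-all-subtransformer": [1, 1, -1, 2],
}


def mean(values):
    return sum(values) / len(values)


def check_config(cfg):
    """Each per-layer list should have exactly one entry per layer."""
    n_enc = cfg["encoder-layer-num-subtransformer"]
    n_dec = cfg["decoder-layer-num-subtransformer"]
    for key, value in cfg.items():
        if isinstance(value, list):
            expected = n_enc if key.startswith("encoder") else n_dec
            assert len(value) == expected, f"{key}: expected {expected} entries"


def to_features(cfg):
    # ASSUMED feature order: embedding size, layer count, average FFN size and
    # average heads for the encoder, then the same for the decoder, followed by
    # average cross-attention heads and average arbitrary-attention value.
    return [
        cfg["encoder-embed-dim-subtransformer"],
        cfg["encoder-layer-num-subtransformer"],
        mean(cfg["encoder-ffn-embed-dim-all-subtransformer"]),
        mean(cfg["encoder-self-attention-heads-all-subtransformer"]),
        cfg["decoder-embed-dim-subtransformer"],
        cfg["decoder-layer-num-subtransformer"],
        mean(cfg["decoder-ffn-embed-dim-all-subtransformer"]),
        mean(cfg["decoder-self-attention-heads-all-subtransformer"]),
        mean(cfg["decoder-ende-attention-heads-all-subtransformer"]),
        mean(cfg["decoder-arbitrary-ende-attn-all-subtransformer"]),
    ]


check_config(config)
normalized = [f / n for f, n in zip(to_features(config), FEATURE_NORM)]
print([round(v, 3) for v in normalized])
```

Dividing by the normalization constants keeps all inputs on a comparable scale, which generally makes a small MLP regressor easier to train.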



### 5. Train efficient architecture from scratch

Now we have the efficient architecture. All that is left is to train the architecture from scratch (random initialization) to convergence. The trained architecture should then be ready for deployment on the target hardware.

Run the following command to train the efficient model (remove `--max-update 5 --save-interval-updates 5` for full training):

In [22]:
!mkdir -p baseline/effnet # stores the checkpoint for efficient model
!python -B train.py \
            --configs=baseline/evosearch/wmt14.en-de_gpu_titanxp.yml \
            --save-dir baseline/effnet \
            --sub-configs=configs/wmt14.en-de/subtransformer/common.yml \
            --no-epoch-checkpoints \
            --max-update 5 \
            --save-interval-updates 5

| Configs: Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformersuper_wmt_en_de', attention_dropout=0.1, beam=5, best_checkpoint_metric='loss', bucket_cap_mb=25, clip_norm=0.0, configs='baseline/evosearch/wmt14.en-de_gpu_titanxp.yml', cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='data/binary/wmt16_en_de', dataset_impl=None, ddp_backend='no_c10d', decoder_arbitrary_ende_attn_all_subtransformer=[1, 1, -1, 2], decoder_arbitrary_ende_attn_choice=[-1, 1, 2], decoder_attention_heads=8, decoder_embed_choice=[512, 256, 128], decoder_embed_dim=640, decoder_embed_dim_subtransformer=512, decoder_embed_path=None, decoder_ende_attention_heads_all_subtransformer=[8, 8, 4, 4], decoder_ende_attention_heads_choice=[16, 8, 4, 2, 1], decoder_ffn_embed_dim=3072, decoder_ffn_embed_dim_all_subtransformer=[2048, 2048, 1024, 2048], decoder_ffn

The checkpoint for the efficient model can be accessed at `baseline/effnet/checkpoint_best.pt`.

### 5.1 Get performance of the efficient architecture

Before evaluating, change line 81 of `fairseq/search.py` from `torch.div(self.indices_buf, vocab_size, out=self.beams_buf)` to `self.beams_buf = torch.div(self.indices_buf, vocab_size).type_as(self.beams_buf)`. Otherwise, you will get the error `RuntimeError: result type Float can't be cast to the desired output type Long`.
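
For intuition about why this division needs an integer result, consider the usual beam search bookkeeping: candidates are ranked over a flattened (beam × vocabulary) score matrix, and each flat index is then split back into a beam index (integer quotient) and a token index (remainder). The pure-Python sketch below illustrates that arithmetic under this assumption; `split_flat_index` is a hypothetical helper, not fairseq code.

```python
def split_flat_index(flat_index, vocab_size):
    """Split an index into a flattened (beam x vocab) matrix into (beam, token)."""
    beam_index, token_index = divmod(flat_index, vocab_size)
    return beam_index, token_index


vocab_size = 32768           # matches the embedding size printed in the model summary
beam, token = 2, 1234        # made-up candidate: token 1234 extending beam 2
flat = beam * vocab_size + token
print(split_flat_index(flat, vocab_size))  # (2, 1234)
```

A floating-point quotient cannot be used as an index, which is why the replacement line casts the result back to the integer type of `beams_buf`.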

Get the BLEU score on the validation set by running the following command (expect a BLEU score close to 0, since we only did a trial run of the HAT pipeline):

In [23]:
!bash configs/wmt14.en-de/test.sh baseline/effnet/checkpoint_best.pt baseline/evosearch/wmt14.en-de_gpu_titanxp.yml normal 0 valid


TransformerSuperModel(
  (encoder): TransformerEncoder(
    (embed_tokens): EmbeddingSuper(32768, 640, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttentionSuper	num_heads:4	 qkv_dim:512
          (out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
        )
        (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
        (fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
        (fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
        (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttentionSuper	num_heads:8	 qkv_dim:512
          (out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
        )
        (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwi

Get the BLEU score on the test set by running the following command (again, expect a BLEU score close to 0, since we only did a trial run of the HAT pipeline):

In [24]:
!bash configs/wmt14.en-de/test.sh baseline/effnet/checkpoint_best.pt baseline/evosearch/wmt14.en-de_gpu_titanxp.yml normal


TransformerSuperModel(
  (encoder): TransformerEncoder(
    (embed_tokens): EmbeddingSuper(32768, 640, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttentionSuper	num_heads:4	 qkv_dim:512
          (out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
        )
        (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
        (fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
        (fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
        (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttentionSuper	num_heads:8	 qkv_dim:512
          (out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
        )
        (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwi

If you want to get the performance of the efficient architecture by extracting its weights from the Supernet (instead of using the standalone training done in Step 5), change the checkpoint argument from `baseline/effnet/checkpoint_best.pt` to `baseline/supernet/checkpoint_best.pt`.

That's all.