# BERT Compression 

## 0. Requirements

You can run this notebook directly in Jupyter. Install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/), [transformers](https://huggingface.co/transformers/), and any other required packages as needed.

The code relies on the 🤗 Transformers library. In addition to the dependencies listed in the examples folder, install the extra dependencies listed in `requirements.txt`: `pip install -r requirements.txt`.

Set up an output directory where the checkpoints and the fine-tuned model will be saved. Also set up a directory for the SQuAD training and validation datasets, and download them into it.

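In [4]:
"""Set up output directory and download SQuAD dataset."""
# Each `!` line runs in its own subshell, so shell variables and `cd` do not
# persist between lines. Create the directories that the training command
# refers to by name instead.
!mkdir -p SERIALIZATION_DIR SQUAD_DATA

# Download the SQuAD v1.1 files straight into SQUAD_DATA.
!wget -q -P SQUAD_DATA https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
!wget -q -P SQUAD_DATA https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json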
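
Before starting a long training run, it can help to confirm that the downloaded files parse as SQuAD v1.1 JSON. The sketch below is one way to do that, assuming the standard v1.1 layout (`data` → `paragraphs` → `qas`). The helper names `count_squad_questions` and `summarize_file`, and the tiny `example` dictionary, are hypothetical and only serve as an illustration.

```python
import json
from pathlib import Path


def count_squad_questions(squad):
    """Return (articles, paragraphs, questions) for a parsed SQuAD v1.1 dict."""
    articles = squad["data"]
    paragraphs = [p for a in articles for p in a["paragraphs"]]
    questions = [q for p in paragraphs for q in p["qas"]]
    return len(articles), len(paragraphs), len(questions)


def summarize_file(path):
    """Load a SQuAD JSON file from disk and count its contents."""
    with Path(path).open(encoding="utf-8") as f:
        return count_squad_questions(json.load(f))


# Tiny in-memory stand-in for a SQuAD file (made-up content, v1.1 layout).
example = {
    "version": "1.1",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "BERT is a language model.",
                    "qas": [
                        {"id": "q1", "question": "What is BERT?",
                         "answers": [{"text": "a language model", "answer_start": 8}]},
                    ],
                },
                {
                    "context": "Pruning removes weights.",
                    "qas": [
                        {"id": "q2", "question": "What does pruning remove?",
                         "answers": [{"text": "weights", "answer_start": 16}]},
                        {"id": "q3", "question": "What removes weights?",
                         "answers": [{"text": "Pruning", "answer_start": 0}]},
                    ],
                },
            ],
        }
    ],
}

print(count_squad_questions(example))
```

For the real files, a call such as `summarize_file("SQUAD_DATA/dev-v1.1.json")` would apply, assuming the files sit in the directory passed to `--data_dir`.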

## 1. Fine-pruning

### 1.1 Fine-prune BERT model

This step fine-prunes a pre-trained BERT on SQuAD using movement pruning, with a final threshold of 10% of weights remaining (90% sparsity). Fine-pruning means that the model learns which weights to prune during the fine-tuning step itself, rather than pruning in a separate pass.
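
To make the selection rule concrete, here is a minimal sketch of top-K masking, the method named by `--pruning_method topK`. It is plain Python for illustration only. The names `topk_mask` and `apply_mask` and the toy `weights` and `scores` values are hypothetical and are not taken from `masked_run_squad.py`. The sketch assumes that each weight carries a learned importance score and that only the highest-scoring fraction is kept.

```python
def topk_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the `keep_fraction` highest-scoring entries."""
    n_keep = int(round(keep_fraction * len(scores)))
    # Indices ordered by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


# Toy values (made up): ten weights and their learned importance scores.
weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6, 0.01, -0.5]
scores = [0.2, 1.5, -0.3, 0.8, 0.1, -1.0, 0.4, 0.0, 2.1, -0.6]

mask = topk_mask(scores, keep_fraction=0.10)
pruned = apply_mask(weights, mask)
print(mask)
print(pruned)
```

With `keep_fraction=0.10`, one of the ten toy weights survives. The survivor is chosen by its score rather than by its magnitude, which is the distinction between movement pruning and magnitude pruning.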

In [1]:
# Fine-prune the model on SQuAD, keeping 10% of the weights (90% sparsity) at the end of training.
!python masked_run_squad.py \
    --output_dir SERIALIZATION_DIR \
    --data_dir SQUAD_DATA \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
    --model_name_or_path bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
    --initial_threshold 1 --final_threshold 0.10 \
    --initial_warmup 1 --final_warmup 2 \
    --pruning_method topK --mask_init constant --mask_scale 0
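
The threshold flags above describe a schedule rather than a fixed value. The sketch below assumes a cubic decay from `initial_threshold` to `final_threshold`, held flat for `initial_warmup * warmup_steps` steps at the start and for `final_warmup * warmup_steps` steps at the end. The function name `threshold_at` is hypothetical, and the script's actual implementation may differ. The total of 27,710 steps is also an assumption: 2,771 iterations per epoch (the count shown in this run's progress output) times 10 epochs.

```python
def threshold_at(step, total_steps, warmup_steps=5400,
                 initial_threshold=1.0, final_threshold=0.10,
                 initial_warmup=1, final_warmup=2):
    """Fraction of weights kept at `step` under an assumed cubic schedule."""
    start = initial_warmup * warmup_steps              # end of the dense phase
    end = total_steps - final_warmup * warmup_steps    # start of the final plateau
    if step <= start:
        return initial_threshold
    if step > end:
        return final_threshold
    # Progress through the decay phase, from 0.0 to 1.0.
    progress = (step - start) / (end - start)
    return final_threshold + (initial_threshold - final_threshold) * (1 - progress) ** 3


# Assumed total: 2771 iterations per epoch * 10 epochs.
TOTAL_STEPS = 2771 * 10

for step in (0, 5400, 11155, 16910, 27710):
    print(step, round(threshold_at(step, TOTAL_STEPS), 4))
```

Under these assumptions the model stays dense for the first 5,400 steps, drops to roughly 21% of its weights halfway through the decay phase, and spends the last 10,800 steps at the final 10%.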

08/01/2021 08:06:56 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, alpha_ce=0.5, alpha_distil=0.5, cache_dir='', config_name='', data_dir='SQUAD_DATA', device=device(type='cuda'), do_eval=True, do_lower_case=True, do_train=True, doc_stride=128, eval_all_checkpoints=False, evaluate_during_training=False, final_lambda=0.0, final_threshold=0.1, final_warmup=2, fp16=False, fp16_opt_level='O1', global_topk=False, global_topk_frequency_compute=25, gradient_accumulation_steps=1, initial_threshold=1.0, initial_warmup=1, lang_id=0, learning_rate=3e-05, local_rank=-1, logging_steps=500, mask_init='constant', mask_scale=0.0, mask_scores_learning_rate=0.01, max_answer_length=30, max_grad_norm=1.0, max_query_length=64, max_seq_length=384, max_steps=-1, model_name_or_path='SERIALIZATION_DIR_2/checkpoint-18000', model_type='masked_bert', n_best_size=20, n_gpu=2, no_cuda=False, null_score_diff_threshold=0.0, num_train_epochs=10.0, output_dir='SERIALIZATION_DIR', overw

Iteration:  76%|███████████████████▋      | 2102/2771 [1:05:33<20:46,  1.86s/it]
08/01/2021 10:16:05 - INFO - __main__ - Saving model checkpoint to SERIALIZATION_DIR/checkpoint-21500
08/01/2021 10:16:07 - INFO - __main__ - Saving optimizer and scheduler states to SERIALIZATION_DIR/checkpoint-21500

Iteration:  66%|██████████████████▌         | 1831/2771 [57:02<29:10,  1.86s/it][A08/01/2021 11:34:01 - INFO - __main__ - Saving model checkpoint to SERIALIZATION_DIR/checkpoint-24000
08/01/2021 11:34:02 - INFO - __main__ - Saving optimizer and scheduler states to SERIALIZATION_DIR/checkpoint-24000

Iteration:  66%|██████████████████▌         | 1832/2771 [57:06<42:20,  2.71s/it][A
Iteration:  66%|██████████████████▌         | 1833/2771 [57:09<41:08,  2.63s/it][A
Iteration:  66%|██████████████████▌         | 1834/2771 [57:11<37:31,  2.40s/it][A
Iteration:  66%|██████████████████▌         | 1835/2771 [57:12<34:56,  2.24s/it][A
Iteration:  66%|██████████████████▌         | 1836/2771 [57:14<33:08,  2.13s/it][A
Iteration:  66%|██████████████████▌         | 1837/2771 [57:16<31:52,  2.05s/it][A
Iteration:  66%|██████████

Iteration:  73%|██████████████████▉       | 2020/2771 [1:02:57<23:18,  1.86s/it][A
Iteration:  73%|██████████████████▉       | 2021/2771 [1:02:59<23:16,  1.86s/it][A
Iteration:  73%|██████████████████▉       | 2022/2771 [1:03:01<23:14,  1.86s/it][A
Iteration:  73%|██████████████████▉       | 2023/2771 [1:03:03<23:14,  1.86s/it][A
Iteration:  73%|██████████████████▉       | 2024/2771 [1:03:05<23:13,  1.87s/it][A
Iteration:  73%|███████████████████       | 2025/2771 [1:03:06<23:12,  1.87s/it][A
Iteration:  73%|███████████████████       | 2026/2771 [1:03:08<23:11,  1.87s/it][A
Iteration:  73%|███████████████████       | 2027/2771 [1:03:10<23:09,  1.87s/it][A
Iteration:  73%|███████████████████       | 2028/2771 [1:03:12<23:08,  1.87s/it][A
Iteration:  73%|███████████████████       | 2029/2771 [1:03:14<23:06,  1.87s/it][A
Iteration:  73%|███████████████████       | 2030/2771 [1:03:16<23:04,  1.87s/it][A
Iteration:  73%|███████████████████       | 2031/2771 [1:03:18<23:03,  1.87s

Iteration:  80%|████████████████████▊     | 2214/2771 [1:08:58<17:17,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2215/2771 [1:09:00<17:15,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2216/2771 [1:09:02<17:13,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2217/2771 [1:09:04<17:11,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2218/2771 [1:09:06<17:09,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2219/2771 [1:09:08<17:07,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2220/2771 [1:09:10<17:09,  1.87s/it][A
Iteration:  80%|████████████████████▊     | 2221/2771 [1:09:12<17:06,  1.87s/it][A
Iteration:  80%|████████████████████▊     | 2222/2771 [1:09:13<17:03,  1.87s/it][A
Iteration:  80%|████████████████████▊     | 2223/2771 [1:09:15<17:01,  1.86s/it][A
Iteration:  80%|████████████████████▊     | 2224/2771 [1:09:17<16:59,  1.86s/it][A
Iteration:  80%|████████████████████▉     | 2225/2771 [1:09:19<16:57,  1.86s

Iteration:  87%|██████████████████████▌   | 2405/2771 [1:14:58<11:21,  1.86s/it][A
Iteration:  87%|██████████████████████▌   | 2406/2771 [1:15:00<11:19,  1.86s/it][A
Iteration:  87%|██████████████████████▌   | 2407/2771 [1:15:01<11:17,  1.86s/it][A
Iteration:  87%|██████████████████████▌   | 2408/2771 [1:15:03<11:15,  1.86s/it][A
Iteration:  87%|██████████████████████▌   | 2409/2771 [1:15:05<11:14,  1.86s/it][A
Iteration:  87%|██████████████████████▌   | 2410/2771 [1:15:07<11:12,  1.86s/it][A
Iteration:  87%|██████████████████████▌   | 2411/2771 [1:15:09<11:10,  1.86s/it][A
Iteration:  87%|██████████████████████▋   | 2412/2771 [1:15:11<11:08,  1.86s/it][A
Iteration:  87%|██████████████████████▋   | 2413/2771 [1:15:13<11:06,  1.86s/it][A
Iteration:  87%|██████████████████████▋   | 2414/2771 [1:15:14<11:04,  1.86s/it][A
Iteration:  87%|██████████████████████▋   | 2415/2771 [1:15:16<11:02,  1.86s/it][A
Iteration:  87%|██████████████████████▋   | 2416/2771 [1:15:18<11:01,  1.86s

Iteration:  94%|████████████████████████▍ | 2599/2771 [1:20:59<05:20,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2600/2771 [1:21:01<05:18,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2601/2771 [1:21:03<05:16,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2602/2771 [1:21:05<05:14,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2603/2771 [1:21:06<05:12,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2604/2771 [1:21:08<05:11,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2605/2771 [1:21:10<05:09,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2606/2771 [1:21:12<05:07,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2607/2771 [1:21:14<05:05,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2608/2771 [1:21:16<05:03,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2609/2771 [1:21:18<05:01,  1.86s/it][A
Iteration:  94%|████████████████████████▍ | 2610/2771 [1:21:19<04:59,  1.86s

Iteration:   1%|▏                           | 20/2771 [00:37<1:25:23,  1.86s/it][A
Iteration:   1%|▏                           | 21/2771 [00:39<1:25:21,  1.86s/it][A
Iteration:   1%|▏                           | 22/2771 [00:40<1:25:18,  1.86s/it][A
Iteration:   1%|▏                           | 23/2771 [00:42<1:25:15,  1.86s/it][A
Iteration:   1%|▏                           | 24/2771 [00:44<1:25:14,  1.86s/it][A
Iteration:   1%|▎                           | 25/2771 [00:46<1:25:12,  1.86s/it][A
Iteration:   1%|▎                           | 26/2771 [00:48<1:25:10,  1.86s/it][A
Iteration:   1%|▎                           | 27/2771 [00:50<1:25:09,  1.86s/it][A
Iteration:   1%|▎                           | 28/2771 [00:52<1:25:18,  1.87s/it][A
Iteration:   1%|▎                           | 29/2771 [00:54<1:25:13,  1.86s/it][A
Iteration:   1%|▎                           | 30/2771 [00:55<1:25:09,  1.86s/it][A
Iteration:   1%|▎                           | 31/2771 [00:57<1:25:06,  1.86s

Iteration:   8%|██                         | 211/2771 [06:36<1:19:28,  1.86s/it][A
Iteration:   8%|██                         | 212/2771 [06:38<1:19:25,  1.86s/it][A
Iteration:   8%|██                         | 213/2771 [06:40<1:19:24,  1.86s/it][A
Iteration:   8%|██                         | 214/2771 [06:42<1:19:22,  1.86s/it][A
Iteration:   8%|██                         | 215/2771 [06:44<1:19:21,  1.86s/it][A
Iteration:   8%|██                         | 216/2771 [06:45<1:19:19,  1.86s/it][A
Iteration:   8%|██                         | 217/2771 [06:47<1:19:17,  1.86s/it][A
Iteration:   8%|██                         | 218/2771 [06:49<1:19:15,  1.86s/it][A
Iteration:   8%|██▏                        | 219/2771 [06:51<1:19:14,  1.86s/it][A
Iteration:   8%|██▏                        | 220/2771 [06:53<1:19:11,  1.86s/it][A
Iteration:   8%|██▏                        | 221/2771 [06:55<1:19:11,  1.86s/it][A
Iteration:   8%|██▏                        | 222/2771 [06:57<1:19:09,  1.86s

Iteration:  15%|███▉                       | 405/2771 [12:38<1:13:26,  1.86s/it][A
Iteration:  15%|███▉                       | 406/2771 [12:39<1:13:26,  1.86s/it][A
Iteration:  15%|███▉                       | 407/2771 [12:41<1:13:23,  1.86s/it][A
Iteration:  15%|███▉                       | 408/2771 [12:43<1:13:21,  1.86s/it][A
Iteration:  15%|███▉                       | 409/2771 [12:45<1:13:19,  1.86s/it][A
Iteration:  15%|███▉                       | 410/2771 [12:47<1:13:18,  1.86s/it][A
Iteration:  15%|████                       | 411/2771 [12:49<1:13:15,  1.86s/it][A
Iteration:  15%|████                       | 412/2771 [12:51<1:13:12,  1.86s/it][A
Iteration:  15%|████                       | 413/2771 [12:52<1:13:11,  1.86s/it][A
Iteration:  15%|████                       | 414/2771 [12:54<1:13:09,  1.86s/it][A
Iteration:  15%|████                       | 415/2771 [12:56<1:13:07,  1.86s/it][A
Iteration:  15%|████                       | 416/2771 [12:58<1:13:05,  1.86s

Iteration:  22%|█████▊                     | 596/2771 [18:37<1:07:30,  1.86s/it][A
Iteration:  22%|█████▊                     | 597/2771 [18:39<1:07:28,  1.86s/it][A
Iteration:  22%|█████▊                     | 598/2771 [18:41<1:07:26,  1.86s/it][A
Iteration:  22%|█████▊                     | 599/2771 [18:43<1:07:23,  1.86s/it][A
Iteration:  22%|█████▊                     | 600/2771 [18:44<1:07:22,  1.86s/it][A
Iteration:  22%|█████▊                     | 601/2771 [18:46<1:07:19,  1.86s/it][A
Iteration:  22%|█████▊                     | 602/2771 [18:48<1:07:18,  1.86s/it][A
Iteration:  22%|█████▉                     | 603/2771 [18:50<1:07:17,  1.86s/it][A
Iteration:  22%|█████▉                     | 604/2771 [18:52<1:07:14,  1.86s/it][A
Iteration:  22%|█████▉                     | 605/2771 [18:54<1:07:13,  1.86s/it][A
Iteration:  22%|█████▉                     | 606/2771 [18:56<1:07:11,  1.86s/it][A
Iteration:  22%|█████▉                     | 607/2771 [18:57<1:07:10,  1.86s

Iteration:  29%|███████▋                   | 790/2771 [24:38<1:01:28,  1.86s/it][A
Iteration:  29%|███████▋                   | 791/2771 [24:40<1:01:25,  1.86s/it][A
Iteration:  29%|███████▋                   | 792/2771 [24:42<1:01:24,  1.86s/it][A
Iteration:  29%|███████▋                   | 793/2771 [24:44<1:01:22,  1.86s/it][A
Iteration:  29%|███████▋                   | 794/2771 [24:46<1:01:21,  1.86s/it][A
Iteration:  29%|███████▋                   | 795/2771 [24:48<1:01:19,  1.86s/it][A
Iteration:  29%|███████▊                   | 796/2771 [24:49<1:01:16,  1.86s/it][A
Iteration:  29%|███████▊                   | 797/2771 [24:51<1:01:14,  1.86s/it][A
Iteration:  29%|███████▊                   | 798/2771 [24:53<1:01:12,  1.86s/it][A
Iteration:  29%|███████▊                   | 799/2771 [24:55<1:01:10,  1.86s/it][A
Iteration:  29%|███████▊                   | 800/2771 [24:57<1:01:08,  1.86s/it][A
Iteration:  29%|███████▊                   | 801/2771 [24:59<1:01:07,  1.86s

Iteration:  36%|██████████▎                  | 984/2771 [30:40<55:27,  1.86s/it][A
Iteration:  36%|██████████▎                  | 985/2771 [30:41<55:25,  1.86s/it][A
Iteration:  36%|██████████▎                  | 986/2771 [30:43<55:25,  1.86s/it][A
Iteration:  36%|██████████▎                  | 987/2771 [30:45<55:22,  1.86s/it][A
Iteration:  36%|██████████▎                  | 988/2771 [30:47<55:21,  1.86s/it][A
Iteration:  36%|██████████▎                  | 989/2771 [30:49<55:19,  1.86s/it][A
Iteration:  36%|██████████▎                  | 990/2771 [30:51<55:17,  1.86s/it][A
Iteration:  36%|██████████▎                  | 991/2771 [30:53<55:15,  1.86s/it][A
Iteration:  36%|██████████▍                  | 992/2771 [30:54<55:12,  1.86s/it][A
Iteration:  36%|██████████▍                  | 993/2771 [30:56<55:10,  1.86s/it][A
Iteration:  36%|██████████▍                  | 994/2771 [30:58<55:08,  1.86s/it][A
Iteration:  36%|██████████▍                  | 995/2771 [31:00<55:08,  1.86s

Iteration:  42%|███████████▊                | 1175/2771 [36:39<49:31,  1.86s/it][A
Iteration:  42%|███████████▉                | 1176/2771 [36:41<49:29,  1.86s/it][A
Iteration:  42%|███████████▉                | 1177/2771 [36:43<49:30,  1.86s/it][A
Iteration:  43%|███████████▉                | 1178/2771 [36:45<49:27,  1.86s/it][A
Iteration:  43%|███████████▉                | 1179/2771 [36:46<49:24,  1.86s/it][A
Iteration:  43%|███████████▉                | 1180/2771 [36:48<49:22,  1.86s/it][A
Iteration:  43%|███████████▉                | 1181/2771 [36:50<49:20,  1.86s/it][A
Iteration:  43%|███████████▉                | 1182/2771 [36:52<49:18,  1.86s/it][A
Iteration:  43%|███████████▉                | 1183/2771 [36:54<49:16,  1.86s/it][A
Iteration:  43%|███████████▉                | 1184/2771 [36:56<49:15,  1.86s/it][A
Iteration:  43%|███████████▉                | 1185/2771 [36:58<49:13,  1.86s/it][A
Iteration:  43%|███████████▉                | 1186/2771 [36:59<49:11,  1.86s

Iteration:  49%|█████████████▊              | 1369/2771 [42:40<43:31,  1.86s/it][A
Iteration:  49%|█████████████▊              | 1370/2771 [42:42<43:29,  1.86s/it][A
Iteration:  49%|█████████████▊              | 1371/2771 [42:44<43:27,  1.86s/it][A
Iteration:  50%|█████████████▊              | 1372/2771 [42:46<43:24,  1.86s/it][A
Iteration:  50%|█████████████▊              | 1373/2771 [42:48<43:22,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1374/2771 [42:49<43:20,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1375/2771 [42:51<43:20,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1376/2771 [42:53<43:18,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1377/2771 [42:55<43:16,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1378/2771 [42:57<43:13,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1379/2771 [42:59<43:12,  1.86s/it][A
Iteration:  50%|█████████████▉              | 1380/2771 [43:01<43:11,  1.86s

Iteration:  56%|███████████████▊            | 1561/2771 [48:41<55:04,  2.73s/it][A
Iteration:  56%|███████████████▊            | 1562/2771 [48:43<53:23,  2.65s/it][A
Iteration:  56%|███████████████▊            | 1563/2771 [48:45<48:38,  2.42s/it][A
Iteration:  56%|███████████████▊            | 1564/2771 [48:47<45:18,  2.25s/it][A
Iteration:  56%|███████████████▊            | 1565/2771 [48:49<42:57,  2.14s/it][A
Iteration:  57%|███████████████▊            | 1566/2771 [48:51<41:16,  2.05s/it][A
Iteration:  57%|███████████████▊            | 1567/2771 [48:52<40:04,  2.00s/it][A
Iteration:  57%|███████████████▊            | 1568/2771 [48:54<39:13,  1.96s/it][A
Iteration:  57%|███████████████▊            | 1569/2771 [48:56<38:37,  1.93s/it][A
Iteration:  57%|███████████████▊            | 1570/2771 [48:58<38:12,  1.91s/it][A
Iteration:  57%|███████████████▊            | 1571/2771 [49:00<37:53,  1.89s/it][A
Iteration:  57%|███████████████▉            | 1572/2771 [49:02<37:39,  1.88s

Iteration:  63%|█████████████████▋          | 1755/2771 [54:43<31:32,  1.86s/it][A
Iteration:  63%|█████████████████▋          | 1756/2771 [54:44<31:30,  1.86s/it][A
Iteration:  63%|█████████████████▊          | 1757/2771 [54:46<31:28,  1.86s/it][A
Iteration:  63%|█████████████████▊          | 1758/2771 [54:48<31:27,  1.86s/it][A
Iteration:  63%|█████████████████▊          | 1759/2771 [54:50<31:25,  1.86s/it][A
Iteration:  64%|█████████████████▊          | 1760/2771 [54:52<31:28,  1.87s/it][A
Iteration:  64%|█████████████████▊          | 1761/2771 [54:54<31:25,  1.87s/it][A
Iteration:  64%|█████████████████▊          | 1762/2771 [54:56<31:23,  1.87s/it][A
Iteration:  64%|█████████████████▊          | 1763/2771 [54:57<31:20,  1.87s/it][A
Iteration:  64%|█████████████████▊          | 1764/2771 [54:59<31:17,  1.86s/it][A
Iteration:  64%|█████████████████▊          | 1765/2771 [55:01<31:15,  1.86s/it][A
Iteration:  64%|█████████████████▊          | 1766/2771 [55:03<31:13,  1.86s

Iteration:  70%|██████████████████▎       | 1949/2771 [1:00:44<25:30,  1.86s/it][A
Iteration:  70%|██████████████████▎       | 1950/2771 [1:00:46<25:28,  1.86s/it][A
Iteration:  70%|██████████████████▎       | 1951/2771 [1:00:48<25:27,  1.86s/it][A
Iteration:  70%|██████████████████▎       | 1952/2771 [1:00:49<25:25,  1.86s/it][A
Iteration:  70%|██████████████████▎       | 1953/2771 [1:00:51<25:23,  1.86s/it][A
Iteration:  71%|██████████████████▎       | 1954/2771 [1:00:53<25:22,  1.86s/it][A
Iteration:  71%|██████████████████▎       | 1955/2771 [1:00:55<25:20,  1.86s/it][A
Iteration:  71%|██████████████████▎       | 1956/2771 [1:00:57<25:17,  1.86s/it][A
Iteration:  71%|██████████████████▎       | 1957/2771 [1:00:59<25:15,  1.86s/it][A
Iteration:  71%|██████████████████▎       | 1958/2771 [1:01:01<25:13,  1.86s/it][A
Iteration:  71%|██████████████████▍       | 1959/2771 [1:01:03<25:11,  1.86s/it][A
Iteration:  71%|██████████████████▍       | 1960/2771 [1:01:04<25:09,  1.86s

Iteration:  77%|████████████████████      | 2140/2771 [1:06:43<19:36,  1.86s/it][A
Iteration:  77%|████████████████████      | 2141/2771 [1:06:45<19:34,  1.86s/it][A
Iteration:  77%|████████████████████      | 2142/2771 [1:06:47<19:31,  1.86s/it][A
Iteration:  77%|████████████████████      | 2143/2771 [1:06:49<19:29,  1.86s/it][A
Iteration:  77%|████████████████████      | 2144/2771 [1:06:50<19:27,  1.86s/it][A
Iteration:  77%|████████████████████▏     | 2145/2771 [1:06:52<19:25,  1.86s/it][A
Iteration:  77%|████████████████████▏     | 2146/2771 [1:06:54<19:23,  1.86s/it][A
Iteration:  77%|████████████████████▏     | 2147/2771 [1:06:56<19:22,  1.86s/it][A
Iteration:  78%|████████████████████▏     | 2148/2771 [1:06:58<19:20,  1.86s/it][A
Iteration:  78%|████████████████████▏     | 2149/2771 [1:07:00<19:18,  1.86s/it][A
Iteration:  78%|████████████████████▏     | 2150/2771 [1:07:02<19:16,  1.86s/it][A
Iteration:  78%|████████████████████▏     | 2151/2771 [1:07:04<19:14,  1.86s

Iteration:  84%|█████████████████████▉    | 2334/2771 [1:12:44<13:33,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2335/2771 [1:12:46<13:31,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2336/2771 [1:12:48<13:30,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2337/2771 [1:12:50<13:28,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2338/2771 [1:12:52<13:26,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2339/2771 [1:12:54<13:24,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2340/2771 [1:12:55<13:22,  1.86s/it][A
Iteration:  84%|█████████████████████▉    | 2341/2771 [1:12:57<13:20,  1.86s/it][A
Iteration:  85%|█████████████████████▉    | 2342/2771 [1:12:59<13:18,  1.86s/it][A
Iteration:  85%|█████████████████████▉    | 2343/2771 [1:13:01<13:16,  1.86s/it][A
Iteration:  85%|█████████████████████▉    | 2344/2771 [1:13:03<13:15,  1.86s/it][A
Iteration:  85%|██████████████████████    | 2345/2771 [1:13:05<13:13,  1.86s

Iteration:  91%|███████████████████████▋  | 2528/2771 [1:18:46<07:32,  1.86s/it][A
Iteration:  91%|███████████████████████▋  | 2529/2771 [1:18:47<07:30,  1.86s/it][A
Iteration:  91%|███████████████████████▋  | 2530/2771 [1:18:49<07:28,  1.86s/it][A
Iteration:  91%|███████████████████████▋  | 2531/2771 [1:18:51<07:26,  1.86s/it][A
Iteration:  91%|███████████████████████▊  | 2532/2771 [1:18:53<07:24,  1.86s/it][A
Iteration:  91%|███████████████████████▊  | 2533/2771 [1:18:55<07:22,  1.86s/it][A
Iteration:  91%|███████████████████████▊  | 2534/2771 [1:18:57<07:21,  1.86s/it][A
Iteration:  91%|███████████████████████▊  | 2535/2771 [1:18:59<07:19,  1.86s/it][A
Iteration:  92%|███████████████████████▊  | 2536/2771 [1:19:01<07:17,  1.86s/it][A
Iteration:  92%|███████████████████████▊  | 2537/2771 [1:19:02<07:15,  1.86s/it][A
Iteration:  92%|███████████████████████▊  | 2538/2771 [1:19:04<07:13,  1.86s/it][A
Iteration:  92%|███████████████████████▊  | 2539/2771 [1:19:06<07:11,  1.86s

Iteration:  98%|█████████████████████████▌| 2719/2771 [1:24:45<01:36,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2720/2771 [1:24:47<01:34,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2721/2771 [1:24:49<01:33,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2722/2771 [1:24:50<01:31,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2723/2771 [1:24:52<01:29,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2724/2771 [1:24:54<01:27,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2725/2771 [1:24:56<01:25,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2726/2771 [1:24:58<01:23,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2727/2771 [1:25:00<01:21,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2728/2771 [1:25:02<01:20,  1.86s/it][A
Iteration:  98%|█████████████████████████▌| 2729/2771 [1:25:03<01:18,  1.86s/it][A
Iteration:  99%|█████████████████████████▌| 2730/2771 [1:25:05<01:16,  1.86s

### 1.2 Get sparsity level

In [34]:
# Computes the effective sparsity level of the encoder by counting the remaining (i.e. unpruned) weights.
import os
os.environ['MKL_THREADING_LAYER'] = 'GNU'

!python counts_parameters.py \
    --pruning_method topK \
    --threshold 0.15 \
    --serialization_dir SERIALIZATION_DIR

name                                                         Remaining Weights % Remaining Weight
bert.encoder.layer.0.attention.self.query.mask_scores        15.0                 88473.0
bert.encoder.layer.0.attention.self.key.mask_scores          15.0                 88473.0
bert.encoder.layer.0.attention.self.value.mask_scores        15.0                 88473.0
bert.encoder.layer.0.attention.output.dense.mask_scores      15.0                 88473.0
bert.encoder.layer.0.intermediate.dense.mask_scores          15.0                 353894.0
bert.encoder.layer.0.output.dense.mask_scores                15.0                 353894.0
bert.encoder.layer.1.attention.self.query.mask_scores        15.0                 88473.0
bert.encoder.layer.1.attention.self.key.mask_scores          15.0                 88473.0
bert.encoder.layer.1.attention.self.value.mask_scores        15.0                 88473.0
bert.encoder.layer.1.attention.output.dense.mask_scores      15.0                 88473.0
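
The counts in the table can be reproduced by hand. The sketch below assumes the standard BERT-base shapes (hidden size 768, intermediate size 3072), which the output above does not state, and assumes the count is truncated to a whole number. `remaining_weights` is a hypothetical helper for illustration, not part of `counts_parameters.py`.

```python
def remaining_weights(shape, threshold):
    """Number of weights kept in a matrix of `shape` when a fraction `threshold` remains."""
    rows, cols = shape
    return int(rows * cols * threshold)  # truncation is an assumption

# Assumed BERT-base shapes: 768x768 attention projections, 3072x768 feed-forward layers.
attention = remaining_weights((768, 768), 0.15)
ffn = remaining_weights((3072, 768), 0.15)
print(attention, ffn)  # prints: 88473 353894
```

Under these assumptions the results agree with the 88473.0 and 353894.0 entries reported for the attention and feed-forward matrices.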


### 1.3 Convert back to BERT

In [35]:
# Converts the MaskedBert model back to a standard Bert. The goal is a sparse Bert, not the MaskedBert,
# which carries extra parameters used to compute the adaptive mask for pruning.
!CUDA_VISIBLE_DEVICES=2,3 python bertarize.py \
    --pruning_method topK \
    --threshold 0.1 \
    --model_name_or_path SERIALIZATION_DIR

Load fine-pruned model from SERIALIZATION_DIR
Copied layer bert.embeddings.word_embeddings.weight
Copied layer bert.embeddings.position_embeddings.weight
Copied layer bert.embeddings.token_type_embeddings.weight
Copied layer bert.embeddings.LayerNorm.weight
Copied layer bert.embeddings.LayerNorm.bias
Pruned layer bert.encoder.layer.0.attention.self.query.weight
Copied layer bert.encoder.layer.0.attention.self.query.bias
Pruned layer bert.encoder.layer.0.attention.self.key.weight
Copied layer bert.encoder.layer.0.attention.self.key.bias
Pruned layer bert.encoder.layer.0.attention.self.value.weight
Copied layer bert.encoder.layer.0.attention.self.value.bias
Pruned layer bert.encoder.layer.0.attention.output.dense.weight
Copied layer bert.encoder.layer.0.attention.output.dense.bias
Copied layer bert.encoder.layer.0.attention.output.LayerNorm.weight
Copied layer bert.encoder.layer.0.attention.output.LayerNorm.bias
Pruned layer bert.encoder.layer.0.intermediate.dense.weight
Copied layer ber


Created folder bertarized_SERIALIZATION_DIR

Pruned model saved! See you later!


### 1.3.1 Fix config file

In [36]:
# Renames "MaskedBert" to "Bert" in config.json of the bertarized folder.
# bertarize.py converts the weights back to a standard Bert, so the config should describe Bert, not MaskedBert.
import json

filename = 'bertarized_SERIALIZATION_DIR/config.json'

# Read the existing config.
with open(filename, 'r') as f:
    data = json.load(f)

# Strip the "Masked" / "masked_" prefixes from the architecture name and model type.
data['architectures'][0] = data['architectures'][0].replace("Masked", "")
data['model_type'] = data['model_type'].replace("masked_", "")

# Opening with 'w' truncates the file, so it does not need to be removed first.
with open(filename, 'w') as f:
    json.dump(data, f, indent=2)
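
To see what the rename does without touching a file on disk, the same replacement can be applied to an in-memory dictionary. This is a minimal sketch: `unmask_config` is a hypothetical helper and the example config is illustrative, not a copy of the real `config.json`.

```python
def unmask_config(config):
    """Return a copy of `config` with the Masked/masked_ prefixes removed."""
    fixed = dict(config)
    fixed["architectures"] = [
        name.replace("Masked", "") for name in config.get("architectures", [])
    ]
    fixed["model_type"] = config["model_type"].replace("masked_", "")
    return fixed

# Illustrative config; the real file contains many more keys.
example = {
    "architectures": ["MaskedBertForQuestionAnswering"],
    "model_type": "masked_bert",
    "hidden_size": 768,
}
fixed = unmask_config(example)
print(fixed["architectures"], fixed["model_type"])
```

Keys that do not mention the masked model are copied through unchanged, and the input dictionary is left untouched.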

## 2. Convert to ONNX
Converts the Hugging Face model to ONNX. With the --quantize flag, the script also creates the optimized model, which we will use for quantization. This step does not quantize the model; we do that in the next step.

In [37]:
"""Converts PyTorch to ONNX model and also generates the optimized model for quantization."""
import os
import shutil

output = 'onnx_model/'
if os.path.isdir(output):
    print('Path already exists. Will overwrite.')
    shutil.rmtree(output)
os.mkdir(output)

!python ../../../src/transformers/convert_graph_to_onnx.py --framework pt --model bertarized_SERIALIZATION_DIR/ --quantize onnx_model/bert.onnx

print('ONNX model size (MB):', os.path.getsize("onnx_model/bert.onnx")/(1024*1024))


ONNX opset version set to: 11
Loading pipeline (model: bertarized_SERIALIZATION_DIR/, tokenizer: bertarized_SERIALIZATION_DIR/)
Using framework PyTorch: 1.9.0
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
  input_tensor.shape[chunk_dim] == tensor_shape for input_tensor in input_tensors

2021-08-13 22:43:27.011814820 [W:onnxruntime:, inference_session.cc:1303 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and shoul

## 3. Quantization with ONNX Runtime
Perform dynamic quantization on the optimized ONNX model generated from the conversion.

In [38]:
"""Quantizes the optimized ONNX model."""
import os
import shutil

from onnxruntime.quantization import quantize_dynamic, QuantType

onnx_model = 'onnx_model/bert-optimized.onnx'
model_quant = 'onnx_model/quantized/'
quantized_model = os.path.join(model_quant, 'bert_quantized.onnx')

# Start from an empty output directory.
if os.path.isdir(model_quant):
    print('Path already exists. Will overwrite.')
    shutil.rmtree(model_quant)
os.mkdir(model_quant)

# Dynamic quantization: weights are stored as signed 8-bit integers.
quantize_dynamic(onnx_model, quantized_model, weight_type=QuantType.QInt8)

# Report the size of the file that was just written.
print('Size (MB):', os.path.getsize(quantized_model)/(1024*1024))
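
The idea behind 8-bit weight quantization can be illustrated in a few lines of plain Python. This is a simplified sketch of symmetric int8 quantization with a zero point of 0, not ONNX Runtime's actual implementation; `quantize_symmetric_int8` and `dequantize` are hypothetical helpers.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale and zero point 0."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]

# Two of the five weights were pruned to exactly zero.
weights = [0.0, 0.5, -1.27, 0.0, 0.02]
quantized, scale = quantize_symmetric_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, scale)
```

With a zero point of 0, a weight that pruning set to exactly zero is still exactly zero after quantization, so the sparsity obtained in step 1 is preserved in the integer tensor.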

## 4. Convert to sparse representation
Stores the pruned weights in a sparse representation, removing the zeros that masking introduced during pruning.

In [39]:
"""The model is made up of sparse matrices, which are matrices with lots of zeros.
Sparsify removes the zeros, freeing up space."""
!python sparsify_initializers.py --input onnx_model/quantized/bert_quantized.onnx --output onnx_model/quantized/sparse_bert.onnx

sparsify_initializers.py: initializer=embeddings.position_ids is not converted. sparsity=0.001953125
sparsify_initializers.py: initializer=embeddings.LayerNorm.weight is not converted. sparsity=0.0
sparsify_initializers.py: initializer=embeddings.LayerNorm.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.0.attention.self.query.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.0.attention.self.key.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.0.attention.self.value.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.0.attention.output.dense.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.0.attention.output.LayerNorm.weight is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.0.attention.output.LayerNorm.bias is not converted. sparsity=0.0
sparsify_initializers.py: i

sparsify_initializers.py: initializer=encoder.layer.8.intermediate.dense.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.8.output.dense.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.8.output.LayerNorm.weight is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.8.output.LayerNorm.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.9.attention.self.query.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.9.attention.self.key.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.9.attention.self.value.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.9.attention.output.dense.bias is not converted. sparsity=0.0
sparsify_initializers.py: initializer=encoder.layer.9.attention.output.LayerNorm.weight is not converted. sparsity=0.0
sparsify_

sparsify_initializers.py: initializer=embeddings.word_embeddings.weight_quantized is not converted. sparsity=0.05675004914487913
sparsify_initializers.py: initializer=embeddings.word_embeddings.weight_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=embeddings.word_embeddings.weight_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1613_quantized converted. sparsity=0.9001820882161459
sparsify_initializers.py: initializer=1613_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1613_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1612_quantized converted. sparsity=0.9000566270616319
sparsify_initializers.py: initializer=1612_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1612_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1616_quantized converted. sparsity=0.9000193277994791
sparsify_initializers.py: initializer=1616_scale is not convert

sparsify_initializers.py: initializer=1681_quantized converted. sparsity=0.9000278049045138
sparsify_initializers.py: initializer=1681_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1681_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1687_quantized converted. sparsity=0.9000430636935763
sparsify_initializers.py: initializer=1687_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1687_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1688_quantized converted. sparsity=0.9000765482584635
sparsify_initializers.py: initializer=1688_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1688_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1689_quantized converted. sparsity=0.93773439195421
sparsify_initializers.py: initializer=1689_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1689_zero_point converted. sparsity=1.0
sp

sparsify_initializers.py: initializer=1754_quantized converted. sparsity=0.9401537577311198
sparsify_initializers.py: initializer=1754_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1754_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1756_quantized converted. sparsity=0.9000396728515625
sparsify_initializers.py: initializer=1756_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1756_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1755_quantized converted. sparsity=0.9002821180555556
sparsify_initializers.py: initializer=1755_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1755_zero_point converted. sparsity=1.0
sparsify_initializers.py: initializer=1759_quantized converted. sparsity=0.9000295003255209
sparsify_initializers.py: initializer=1759_scale is not converted. sparsity=0.0
sparsify_initializers.py: initializer=1759_zero_point converted. sparsity=1.0
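
The log reports a `sparsity` value per initializer, i.e. the fraction of entries that are zero. The sketch below shows how such a fraction can be computed and how a coordinate (COO) layout keeps only the non-zero values together with their positions. It is an illustration in plain Python under these assumptions, not the code of `sparsify_initializers.py`; all function names are hypothetical.

```python
def sparsity(matrix):
    """Fraction of entries in a list-of-lists matrix that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0

def to_coo(matrix):
    """Keep only non-zero values together with their (row, column) indices."""
    values, indices = [], []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                values.append(value)
                indices.append((r, c))
    shape = (len(matrix), len(matrix[0]) if matrix else 0)
    return values, indices, shape

def from_coo(values, indices, shape):
    """Rebuild the dense matrix from its COO representation."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for value, (r, c) in zip(values, indices):
        dense[r][c] = value
    return dense

dense = [
    [0, 0, 7, 0, 0],
    [0, -3, 0, 0, 0],
]
values, indices, shape = to_coo(dense)
print(sparsity(dense), values, indices)
```

Because every kept value also needs its index, a sparse layout only saves space when most entries are zero. That is consistent with the log, where the tensors reported with sparsity near 0.0 are left unconverted while those near 0.9 are converted.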


## 5. Convert to ORT format
Following the [deployment](https://onnxruntime.ai/docs/how-to/mobile/) steps, we convert the ONNX model to ORT format in preparation for deployment to mobile.

In [43]:
"""Convert to ORT format to deploy to mobile"""
#!python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_level basic onnx_model/quantized/sparse_bert.onnx
!python convert_onnx_models_to_ort.py --optimization_level basic onnx_model/quantized/sparse_bert.onnx

Converting optimized ONNX model /bert_ort/emxuyu/transformers/examples/research_projects/movement-pruning/onnx_model/quantized/sparse_bert.onnx to ORT format model /bert_ort/emxuyu/transformers/examples/research_projects/movement-pruning/onnx_model/quantized/sparse_bert.basic.ort
Converted 1 models. 0 failures.
Processed /bert_ort/emxuyu/transformers/examples/research_projects/movement-pruning/onnx_model/quantized/sparse_bert.basic.ort
Created config in %s /bert_ort/emxuyu/transformers/examples/research_projects/movement-pruning/onnx_model/quantized/sparse_bert.basic.required_operators.config


In [45]:
print('Sparse model size (MB):', os.path.getsize("onnx_model/quantized/sparse_bert.onnx")/(1024*1024))
print('Sparse model ORT size (MB):', os.path.getsize("onnx_model/quantized/sparse_bert.basic.ort")/(1024*1024))

Sparse model size (MB): 89.27167415618896
Sparse model ORT size (MB): 93.25965118408203
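
To compare every artifact produced by the pipeline in one pass, the size check can be wrapped in a small helper. This is a sketch: `size_mb` and `report_sizes` are hypothetical helpers, and the paths are the ones used in the cells above. Files that do not exist are reported as missing instead of raising an error.

```python
import os

def size_mb(path):
    """Return the file size in MB (1024 * 1024 bytes), or None if the file is missing."""
    if not os.path.isfile(path):
        return None
    return os.path.getsize(path) / (1024 * 1024)

def report_sizes(paths):
    """Return a list of (path, size in MB or None), one entry per path."""
    return [(path, size_mb(path)) for path in paths]

artifacts = [
    "onnx_model/bert.onnx",
    "onnx_model/bert-optimized.onnx",
    "onnx_model/quantized/bert_quantized.onnx",
    "onnx_model/quantized/sparse_bert.onnx",
    "onnx_model/quantized/sparse_bert.basic.ort",
]
for path, size in report_sizes(artifacts):
    print(f"{path}: " + ("missing" if size is None else f"{size:.2f} MB"))
```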
