<h1>Table of contents</h1>

<ul>
    <li style="font-size: 15px;">Introduction</li>
    <li style="font-size: 15px;">Pre-tokenization / pre-encoding</li>
    <li style="font-size: 15px;">Turn Dropout off</li>
    <li style="font-size: 15px;">TorchScript</li>
    <li style="font-size: 15px;">DeepSpeed</li>
    <li style="font-size: 15px;">Layers Fusing</li>
    <li style="font-size: 15px;">Conclusion</li>
    <li style="font-size: 15px;">Feedback</li>
    <li style="font-size: 15px;">References</li>
    <li style="font-size: 15px;">Releases</li>
</ul>

<h1>Introduction</h1>

<p style="font-size: 15px;">
This article is a continuation of <a href="https://www.kaggle.com/code/vad13irt/optimization-approaches-for-transformers">Optimization approaches for Transformers</a>, in which the authors described optimization approaches that can significantly reduce the memory usage and the training or inference time of Deep Learning models, in particular language models (LMs) such as <a href="https://arxiv.org/abs/1810.04805">Transformers</a> in Natural Language Processing (NLP) tasks.<br><br>
In this article, we study less known and less widely used approaches to optimizing language models that can nevertheless substantially reduce memory footprint and run time. It is worth noting that some of the proposed methods also apply beyond language models, and some of them are not model optimizations in the strict sense.<br><br>
Occasionally, the authors quote other sources to provide more context and to avoid omitting something important.
</p>

In [1]:
!pip uninstall -qq -y transformers

In [2]:
import sys
sys.path.append("../input/transformers/src/")
import transformers
import pandas as pd
import numpy as np
import warnings
import os


os.environ["TOKENIZERS_PARALLELISM"] = "true"

warnings.simplefilter("ignore")
transformers.logging.set_verbosity_error()

In [3]:
texts_path = "../input/feedback-prize-english-language-learning/train.csv"
texts = pd.read_csv(texts_path)["full_text"].values

<h1>Pre-tokenization / pre-encoding</h1>

<p style="font-size: 15px;">
The easiest method to implement, yet an effective one, for speeding up training and inference of language models is pre-tokenization / pre-encoding. The idea is to tokenize the sequences once beforehand rather than on the fly, i.e. during batching in training or inference. Such an approach can potentially reduce the data pre-processing cost of training from $ O(n*e) $ to $ O(n) $, where $ n $ is the number of sequences and $ e $ is the number of epochs. With Cross-Validation the cost drops from $ O(n*e*k) $ to $ O(n*k) $, where $ k $ is the number of folds, and for inference from $ O(n*k) $ to $ O(n) $.<br><br>
    Note that different tokenizers follow different tokenization rules (<a href="https://arxiv.org/abs/1609.08144v2">WordPiece</a>, <a href="https://arxiv.org/abs/1808.06226">SentencePiece</a>, <a href="https://arxiv.org/abs/1907.11692">Byte Pair Encoding</a>, etc.), so the authors recommend checking how the tokenizer works before pre-tokenizing. For example, SentencePiece implements the <a href="https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout">"Subword regularization and BPE-dropout"</a> mechanisms, which randomly vary the segmentation and thus act as a small augmentation during training or inference (similar to Test Time Augmentation). With such tokenizers, tokenizing on the fly might help improve the accuracy and the robustness of the language model.
</p>
<pre style="font-size: 15px; white-space: pre-wrap; background: #F6F8FA; padding: 20px; width: 100%;">
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
</pre>

<p style="font-size: 15px;">
Source: <a href="https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout">Subword regularization and BPE-dropout</a>
</p>
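<p style="font-size: 15px;">
The cost argument above can be checked by counting tokenizer calls. The following is a minimal sketch, not the implementation used in this article: <code>CountingTokenizer</code> and <code>run_epochs</code> are hypothetical helpers, and a whitespace split stands in for a real tokenizer.
</p>

```python
from typing import Dict, List


class CountingTokenizer:
    """Hypothetical whitespace tokenizer that counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> Dict[str, List[str]]:
        self.calls += 1
        return {"tokens": text.split()}


def run_epochs(texts: List[str], tokenizer: CountingTokenizer, epochs: int, pre_tokenize: bool) -> int:
    """Iterate over the data for several epochs and return the number of tokenizer calls."""
    # pre-tokenization: every text is tokenized exactly once, before the first epoch
    cache = [tokenizer(text) for text in texts] if pre_tokenize else None

    for _ in range(epochs):
        for index, text in enumerate(texts):
            # on the fly: every text is tokenized again in every epoch
            sample = cache[index] if cache is not None else tokenizer(text)

    return tokenizer.calls


texts = ["a b c", "d e", "f"]  # n = 3
epochs = 5                     # e = 5

on_the_fly_calls = run_epochs(texts, CountingTokenizer(), epochs, pre_tokenize=False)
pre_tokenized_calls = run_epochs(texts, CountingTokenizer(), epochs, pre_tokenize=True)

print(on_the_fly_calls, pre_tokenized_calls)  # 15 3, i.e. n * e versus n
```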

<h3>Implementation</h3>

In [4]:
from torch.utils.data import Dataset
from typing import List, Optional, Dict, Any
from transformers import PreTrainedTokenizer


class TextDataset(Dataset):
    def __init__(
        self, 
        texts: List[str], 
        tokenizer: PreTrainedTokenizer, 
        max_length: Optional[int] = None, 
        texts_pair: Optional[List[str]] = None,
        pre_tokenize: bool = False,
    ) -> None:
        super().__init__()
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.texts_pair = texts_pair
        self.pre_tokenize = pre_tokenize
        
        # pre-tokenization
        if self.pre_tokenize:
            if self.texts_pair is not None:
                self.all_tokenized = [self.tokenize(text=text, text_pair=text_pair) for text, text_pair in zip(self.texts, self.texts_pair)]
            else:
                self.all_tokenized = [self.tokenize(text=text) for text in self.texts]
        
    def __len__(self) -> int:
        return len(self.texts)
    
    def tokenize(self, text: str, text_pair: Optional[str] = None) -> Dict[str, Any]:
        tokenized = self.tokenizer(
            text=text, 
            text_pair=text_pair,
            max_length=self.max_length,
            truncation=True,
            padding=False,
            return_attention_mask=True,
            add_special_tokens=True,
            return_special_tokens_mask=True,
            return_token_type_ids=False,
            return_offsets_mapping=False,
            return_tensors=None,
        )
        
        return tokenized
    
    def __getitem__(self, index: int) -> Dict[str, Any]:
        if not self.pre_tokenize:
            text = self.texts[index]
            text_pair = None

            if self.texts_pair is not None:
                text_pair = self.texts_pair[index]

            tokenized = self.tokenize(text=text, text_pair=text_pair)
        else:
            tokenized = self.all_tokenized[index]
            
        return tokenized

In [5]:
import time
from transformers import AutoTokenizer
import gc

# extensions
gc.enable()

# config
epochs = 5
model_path = "distilbert-base-uncased"
use_fast_tokenizer = False
time_decimals = 2

# tokenization on fly
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=use_fast_tokenizer)

start_time = time.time()
dataset = TextDataset(
    texts=texts, 
    texts_pair=None, 
    max_length=None, 
    tokenizer=tokenizer,
    pre_tokenize=False,
)

for epoch in range(epochs):
    for index in range(len(dataset)):
        sample = dataset[index]
        
end_time = time.time()

fly_difference_time = end_time - start_time
print(f"Tokenization on fly: {fly_difference_time:.{time_decimals}f} seconds.")

# memory clearing
del dataset, tokenizer
gc.collect()

# pre-tokenization
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=use_fast_tokenizer)

start_time = time.time()
dataset = TextDataset(
    texts=texts, 
    texts_pair=None, 
    max_length=None, 
    tokenizer=tokenizer,
    pre_tokenize=True,
)

for epoch in range(epochs):
    for index in range(len(dataset)):
        sample = dataset[index]
        
end_time = time.time()

pre_difference_time = end_time - start_time
print(f"Pre-tokenization: {pre_difference_time:.{time_decimals}f} seconds.")

# memory clearing
del dataset, tokenizer
gc.collect()

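# computing time difference
difference_time = fly_difference_time - pre_difference_time
# note: this value is the on-the-fly time expressed as a percentage of the
# pre-tokenization time (a ratio), not a relative difference between the two
percentage_difference_time = (fly_difference_time / pre_difference_time) * 100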
print(f"Difference: {difference_time:.{time_decimals}f} seconds. Percentage difference: {percentage_difference_time:.{time_decimals}f}%.")

Tokenization on fly: 220.79 seconds.
Pre-tokenization: 44.04 seconds.
Difference: 176.74 seconds. Percentage difference: 501.28%.
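
<p style="font-size: 15px;">
The reported 501.28% is the on-the-fly time as a percentage of the pre-tokenization time. As a small worked example, the same two timings can also be expressed as a speedup factor and as a fraction of time saved. The two constants below are copied from the output above; the variable names are only illustrative.
</p>

```python
# timings copied from the benchmark output above (seconds)
fly_seconds = 220.79
pre_seconds = 44.04

# how many times faster the pre-tokenized run was
speedup = fly_seconds / pre_seconds

# share of the on-the-fly time that pre-tokenization saved
time_saved_fraction = (fly_seconds - pre_seconds) / fly_seconds

print(f"Speedup: {speedup:.2f}x")                 # Speedup: 5.01x
print(f"Time saved: {time_saved_fraction:.1%}")   # Time saved: 80.1%
```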


<h1>Turn Dropout off</h1>
<p style="font-size: 15px;">
<i>Dropout is a regularization technique for neural networks that drops a unit (along with connections) at training time with a specified probability $ p $ (a common value is 0.5). At test time, all units are present, but with weights scaled by $ p $ (i.e. $ w $ becomes $ w*p $).<br><br>
The idea is to prevent co-adaptation, where the neural network becomes too reliant on particular connections, as this could be symptomatic of overfitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks.</i><br>
Source: <a href="https://paperswithcode.com/method/dropout">Dropout</a>
</p>
<center>
    <img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-23_at_6.19.24_PM.png" alt="Dropout Visualization"><br>
    <span style="font-size: 15px;">Visualization of how Dropout works</span>
</center><br><br>
<p style="font-size: 15px;">
Although Dropout is easy to implement, it can become a noticeable bottleneck in large-scale models, because it relies on masking (choosing ~ $ p*100 $ % of the neurons) and element-wise multiplication (multiplying the chosen neurons by zero). Hence, turning Dropout off (setting $ p $ to $ 0.0 $) may noticeably speed up training. However, since Dropout is one of the techniques that prevent overfitting, the authors recommend turning it off only when no overfitting is observed during training and the dataset is relatively large.
</p>
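
<p style="font-size: 15px;">
To make the masking and multiplication steps concrete, here is a minimal pure-Python sketch of "inverted" Dropout, the variant in which surviving values are rescaled by $ 1/(1-p) $ at training time (this is the scaling documented for <code>torch.nn.Dropout</code>). The function <code>inverted_dropout</code> is a hypothetical illustration, and its early return for $ p = 0 $ is an assumption of this sketch rather than a statement about any framework's internals.
</p>

```python
import random
from typing import List


def inverted_dropout(values: List[float], p: float, rng: random.Random) -> List[float]:
    """Zero each value with probability p and rescale the survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")

    if p == 0.0:
        # nothing to drop: both the mask sampling and the multiplication are skipped
        return list(values)

    scale = 1.0 / (1.0 - p)
    # masking (sample which values to drop) and element-wise multiplication in one pass
    return [0.0 if rng.random() < p else value * scale for value in values]


rng = random.Random(0)
activations = [1.0] * 10_000

dropped = inverted_dropout(activations, p=0.5, rng=rng)
untouched = inverted_dropout(activations, p=0.0, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(f"zeroed: {zero_fraction:.2f}, mean after dropout: {mean_after:.2f}")
```

<p style="font-size: 15px;">
With $ p = 0.5 $ roughly half of the values are zeroed while the mean stays close to the original, and with $ p = 0.0 $ the input passes through unchanged.
</p>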

<h3>Implementation</h3>
<p style="font-size: 15px;">
There are many kinds of Dropout: <a href="https://arxiv.org/abs/1506.02142v6">Monte-Carlo Dropout</a>, <a href="https://theaisummer.com/regularization/">DropConnect, Gaussian Dropout</a>, and even <a href="https://github.com/huggingface/transformers/blob/main/src/transformers/models/deberta/modeling_deberta.py#L231">StableDropout</a> from the HuggingFace Transformers library (used in the <a href="https://arxiv.org/abs/2006.03654">DeBERTa</a> model), etc. However, the implementation provided here only turns off <a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html">Dropout from the PyTorch framework</a>; it can nevertheless be easily adapted to any other specific kind.
</p>
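
<p style="font-size: 15px;">
One possible way to adapt the idea to other kinds of Dropout is to map each Dropout class to the attribute that stores its rate. The sketch below assumes a hypothetical <code>CustomDropout</code> class that keeps its rate in <code>drop_prob</code>; for a real class, check the attribute name in its source before adding it to the mapping.
</p>

```python
import torch
from torch import nn


class CustomDropout(nn.Module):
    """Hypothetical Dropout variant that stores its rate in `drop_prob` instead of `p`."""

    def __init__(self, drop_prob: float) -> None:
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.dropout(x, p=self.drop_prob, training=self.training)


# Dropout class -> name of the attribute holding the drop probability (illustrative mapping)
RATE_ATTRIBUTES = {
    nn.Dropout: "p",
    CustomDropout: "drop_prob",
}


def turn_off_dropout(module: nn.Module) -> None:
    """Set the drop probability to zero for every known kind of Dropout."""
    for dropout_class, attribute in RATE_ATTRIBUTES.items():
        if isinstance(module, dropout_class):
            setattr(module, attribute, 0.0)


model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.1), CustomDropout(0.2))

# `apply` calls the function on every submodule recursively
model.apply(turn_off_dropout)

rates = [model[1].p, model[2].drop_prob]
print(rates)  # [0.0, 0.0]
```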

In [6]:
import torch
from torch import nn
from transformers import AutoModel
import gc


# extensions
gc.enable()

# utilities
@torch.no_grad()
def turn_off_dropout(module: nn.Module) -> None:
    if isinstance(module, nn.Dropout):
        module.p = 0.0
        
# initializing model
model_path = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_path)

# turning Dropout off
model.apply(turn_off_dropout)
print(model)

# memory clearing
del model
gc.collect()

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.0, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.0, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

223

<h1>TorchScript</h1>


<i style="font-size: 15px;"><a href="https://pytorch.org/docs/stable/jit.html">TorchScript</a> is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.<br><br>
We provide tools to incrementally transition a model from a pure Python program to a TorchScript program that can be run independently from Python, such as in a standalone C++ program. This makes it possible to train models in PyTorch using familiar tools in Python and then export the model via TorchScript to a production environment where Python programs may be disadvantageous for performance and multi-threading reasons.
</i>


    
<p style="font-size: 15px;">
    Source: <a href="https://pytorch.org/docs/stable/jit.html">TorchScript</a>
</p>
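
<p style="font-size: 15px;">
Before applying this to a full Transformer, the trace, save, and load round trip can be tried on a tiny module. This is a minimal sketch under stated assumptions: <code>TinyEncoder</code> is a hypothetical stand-in for a language model, and an in-memory buffer replaces the file on disk.
</p>

```python
import io

import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a Transformer: embedding, linear layer, masked mean pooling."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.linear(self.embedding(input_ids))
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        # average only over the non-padded positions
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


torch.manual_seed(0)
model = TinyEncoder().eval()  # tracing should be done in evaluation mode

input_ids = torch.tensor([[5, 17, 42, 0]])
attention_mask = torch.tensor([[1, 1, 1, 0]])

# tracing records the operations executed for the example inputs
with torch.no_grad():
    traced_model = torch.jit.trace(model, (input_ids, attention_mask))

# saving to and loading from an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.jit.save(traced_model, buffer)
buffer.seek(0)
loaded_model = torch.jit.load(buffer)

with torch.no_grad():
    eager_output = model(input_ids, attention_mask)
    traced_output = loaded_model(input_ids, attention_mask)

outputs_match = torch.allclose(eager_output, traced_output, atol=1e-6)
print(outputs_match)
```

<p style="font-size: 15px;">
Comparing the eager and the traced outputs on a few inputs is a cheap sanity check, since tracing only records the operations that were executed for the example inputs.
</p>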

<h3>Implementation</h3>

In [7]:
import torch
from transformers import AutoModel, AutoTokenizer
import gc


# extensions
gc.enable()
        
# config
model_path = "distilbert-base-uncased"

# initializing tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# example inputs
text = "It is simple example text!"
tokenized = tokenizer(
    text=text, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    return_tensors="pt",
)

input_ids = tokenized["input_ids"]
attention_mask = tokenized["attention_mask"]
example_inputs = (input_ids, attention_mask)

# initializing model
model = AutoModel.from_pretrained(model_path, torchscript=True)

# model must be in evaluation mode
model.eval()

# wrapping and saving model via TorchScript
traced_model_path = "./traced_model.pt"
traced_model = torch.jit.trace(model, example_inputs=example_inputs)
torch.jit.save(traced_model, traced_model_path)

# loading TorchScript model
model = torch.jit.load(traced_model_path)
model.eval()

# inference...

# memory clearing
del model, traced_model, example_inputs
del tokenized, input_ids, attention_mask, tokenizer
gc.collect()

1066

<h1>DeepSpeed</h1>
<br>
<p style="font-size: 15px;"><i>
DeepSpeed is an open source deep learning optimization library for PyTorch. The library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. DeepSpeed is optimized for low latency, high throughput training. It includes the Zero Redundancy Optimizer (ZeRO) for training models with 1 trillion or more parameters. Features include mixed precision training, single-GPU, multi-GPU, and multi-node training as well as custom model parallelism. The DeepSpeed source code is licensed under MIT License and available on <a href="https://github.com/microsoft/DeepSpeed">GitHub</a>.<br><br>
The team claimed to achieve up to a 6.2x throughput improvement, 2.8x faster convergence, and 4.6x less communication.
</i><br>
Source: <a href="https://en.wikipedia.org/wiki/DeepSpeed">Wikipedia</a>
</p>
<p style="font-size: 15px;">
DeepSpeed functionality is also integrated into popular libraries: <a href="https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#deepspeed">PyTorch Lightning</a>, <a href="https://huggingface.co/docs/transformers/main_classes/deepspeed">HuggingFace Transformers</a>, HuggingFace Accelerate, etc.
</p>
    <br><br>
<center>
    <iframe width="100%" height="500px" src="https://www.youtube-nocookie.com/embed/ovQC7FqXHXk?start=1" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><br>
    <span style="font-size: 15px;">DeepSpeed | PyTorch Developer Day 2020</span>
</center>
<br>
<p style="font-size: 15px;"><i>"I have tried DeepSpeed and I am able to fit a Longformer Large with BS 6 on my RTX 6000 24GB card rented from Jarvislabs . The best part is integrating DeepSpeed with huggingface trainer is super easy and although there is not much to it , I thought it would really help if I share a code example for it."</i> - <a href="https://www.kaggle.com/tanulsingh077">Tanul Singh</a>'s comment on <a href="https://www.kaggle.com/competitions/feedback-prize-2021/discussion/304706">the forum</a> under <a href="https://www.kaggle.com/competitions/feedback-prize-2021/overview">Feedback Prize - Evaluating Student Writing</a> competition.</p>

<p style="font-size: 15px;">
    <a href="">DeepSpeed Compression</a> is a library based on DeepSpeed. It provides easy-to-use, fast, and powerful functionality for compressing large-scale language models such as <a href="https://huggingface.co/docs/transformers/model_doc/bloom">BLOOM</a> (~176 billion parameters), <a href="https://github.com/yandex/YaLM-100B">YaLM</a> (~100 billion parameters), <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>, etc. DeepSpeed Compression uses modern compression techniques, e.g. quantization, pruning, and distillation, thanks to which the accuracy of the compressed model does not differ significantly from that of the original model. A detailed description of the library can be found in its developers' blog post, <a href="https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/">DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization</a>.
<center>
    <img src="https://www.microsoft.com/en-us/research/uploads/prod/2022/07/1400x788_Deepspeed_blog_hero_no_logo_V2-1920x1080.jpg">
    <br>
    <span style="font-size: 15px;">Results of compression by DeepSpeed Compression</span>
</center>
</p>

<h3>Implementation</h3>

<p style="font-size: 15px;">
The implementation below does not cover all the functionality of the DeepSpeed library; if you are interested in learning DeepSpeed's API in depth, please refer to the official documentation, <a href="https://deepspeed.readthedocs.io/en/latest/">DeepSpeed's Documentation</a>. DeepSpeed automatically handles Gradient Clipping, Gradient Accumulation, Automatic Mixed Precision, and other techniques (if they are enabled in the <a href="https://www.deepspeed.ai/docs/config-json/">DeepSpeed Configuration</a>), so users of the library do not need to implement and integrate them themselves.
</p>
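
<p style="font-size: 15px;">
As an illustration of how these features are switched on, here is a sketch of a configuration dictionary that enables Gradient Clipping, Gradient Accumulation, mixed precision, and ZeRO. All numeric values are hypothetical and chosen only for the example. DeepSpeed expects <code>train_batch_size</code> to equal the micro batch size per GPU times the gradient accumulation steps times the number of GPUs, so the sketch derives it instead of hard-coding it.
</p>

```python
import json

# hypothetical values chosen for illustration only
micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8
world_size = 1  # number of GPUs / processes

deepspeed_config = {
    "train_micro_batch_size_per_gpu": micro_batch_size_per_gpu,
    "gradient_accumulation_steps": gradient_accumulation_steps,
    # must be consistent with the two values above and the number of GPUs
    "train_batch_size": micro_batch_size_per_gpu * gradient_accumulation_steps * world_size,
    # Gradient Clipping by global norm
    "gradient_clipping": 1.0,
    # mixed precision training in half precision
    "fp16": {"enabled": True},
    # ZeRO stage 2: partition optimizer states and gradients
    "zero_optimization": {"stage": 2},
}

# the same settings can be stored as a JSON file and passed to DeepSpeed by path
config_json = json.dumps(deepspeed_config, indent=2)
print(config_json)
```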

In [8]:
!conda install -qq -y mpi4py 
!pip install -qq deepspeed

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - mpi4py


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2022.6.15.1|       ha878542_0         150 KB  conda-forge
    certifi-2022.6.15.1        |     pyhd8ed1ab_0         155 KB  conda-forge
    conda-4.14.0               |   py37h89c1867_0        1010 KB  conda-forge
    mpi-1.0                    |            mpich           4 KB  conda-forge
    mpi4py-3.1.3               |   py37h52370cb_2         602 KB  conda-forge
    mpich-4.0.2                |     h846660c_100         6.0 MB  conda-forge
    openssl-1.1.1q             |       h166bdaf_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                    

In [9]:
from torch import nn
from torch.optim import AdamW
from transformers import AutoModel
import deepspeed
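from typing import Iterable, Iterator
import gc
import os


# extensions
gc.enable()

# utilities
# parameter-name fragments that should be excluded from weight decay
no_decay_parameters = ("bias", "LayerNorm.bias", "LayerNorm.weight")


def is_no_decay_parameter(name: str, no_decay_parameters: Iterable[str]) -> bool:
    # `named_parameters` yields full dotted names (e.g. "embeddings.LayerNorm.weight"),
    # so the fragments are matched as substrings rather than compared for equality
    return any(fragment in name for fragment in no_decay_parameters)


def get_decay_module_parameters(
    module: nn.Module, 
    no_decay_parameters: Iterable[str] = no_decay_parameters,
    recurse: bool = True,
) -> Iterator[nn.Parameter]:

    for name, parameter in module.named_parameters(recurse=recurse):
        if not is_no_decay_parameter(name, no_decay_parameters):
            yield parameter


def get_no_decay_module_parameters(
    module: nn.Module, 
    no_decay_parameters: Iterable[str] = no_decay_parameters,
    recurse: bool = True,
) -> Iterator[nn.Parameter]:

    for name, parameter in module.named_parameters(recurse=recurse):
        if is_no_decay_parameter(name, no_decay_parameters):
            yield parameter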

# config
model_path = "distilbert-base-uncased"
weight_decay = 0.01
lr = 1e-5
train_batch_size = 1
gradient_accumulation_steps = 1
device = "cpu"

# initializing model
model = AutoModel.from_pretrained(model_path)

# initializing optimizer
model_parameters = [
    {"params": get_decay_module_parameters(model), "weight_decay": weight_decay},
    {"params": get_no_decay_module_parameters(model), "weight_decay": 0.0},
]

optimizer = AdamW(params=model_parameters, lr=lr, weight_decay=weight_decay)

# initializing learning rate scheduler
scheduler = None

# DeepSpeed config
deepspeed_arguments = {}
deepspeed_config = {
    "zero_optimization": {
        "offload_param": {
            "device": device,
        }
    },
    "train_batch_size": train_batch_size,
    "gradient_accumulation_steps": gradient_accumulation_steps,
}

# wrapping model via DeepSpeed
deepspeed.init_distributed(dist_backend="gloo")
model_engine, optimizer, training_dataloader, scheduler = deepspeed.initialize(
    args=deepspeed_arguments,
    model=model.requires_grad_(False), # WARNING: here is the issue I need to figure out.
    optimizer=optimizer,
    model_parameters=None,
    lr_scheduler=scheduler,
    config=deepspeed_config,
    dist_init_required=False,
)

[2022-09-10 16:16:27,460] [INFO] [comm.py:618:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2022-09-10 16:16:27,505] [INFO] [comm.py:675:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.19.2.2, master_port=29500
[2022-09-10 16:16:27,506] [INFO] [comm.py:635:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
[2022-09-10 16:16:27,514] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.2, git-hash=unknown, git-branch=unknown
[2022-09-10 16:16:32,973] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-09-10 16:16:32,975] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-09-10 16:16:32,976] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-09-10 16:16:32,980] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed

In [10]:
%%script false --no-raise-error

# training
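# `dataloader`, `inputs`, `targets`, and `loss_fn` are placeholders to be defined for the task at hand
for step, batch in enumerate(dataloader, 1):
    
    # prepare inputs and targets for the model and loss function respectively.
    
    # forward pass
    outputs = model_engine(inputs)
    
    # computing loss
    loss = loss_fn(outputs, targets)
    
    # backward pass (the DeepSpeed engine takes the loss as an argument)
    model_engine.backward(loss)
    
    # optimization step
    model_engine.step()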

In [11]:
# memory clearing
del model, model_engine, optimizer, scheduler, model_parameters
gc.collect()

46

<h1>Layers Fusing</h1>

<p style="font-size: 15px;">
In progress...
</p>

<h1>Conclusion</h1>

<p style="font-size: 15px;">
    In this article, the authors described and implemented additional approaches for further optimizing large-scale models, especially Transformer-based models in NLP tasks. The authors showed that even simple methods such as pre-tokenization and turning Dropout off can reduce training and inference time, and then looked at more powerful tools, TorchScript and DeepSpeed, for faster inference and training of large-scale models respectively. The authors plan to extend this work with the Layers Fusing approach as soon as possible and to explore more optimization approaches in the future.
</p>

<h1>Feedback</h1>

<p style="font-size: 15px;">
If you run into problems running the implementations, want to share your own results with any of the described approaches, or want to suggest other optimization methods, you can leave a comment under this article or send a personal message to one of the authors' accounts below.
<ul>
    <li style="font-size: 15px;">Twitter - <a href="https://twitter.com/vad13irt">vad13irt</a></li>
    <li style="font-size: 15px;">Telegram - <a href="https://t.me/vad13irt">Vadim Irtlach</a></li>
    <li style="font-size: 15px;">Discord - vad13irt#0534</li>
    <li style="font-size: 15px;">Google Mail (gmail) - <a href="mailto:vadimirtlach@gmail.com">vadimirtlach@gmail.com</a></li>
</ul>
</p>

<h1>References</h1>
<p style="font-size: 15px;">
While writing this article, the authors referred to the following sources:
</p>
<ul>
    <li><a style="font-size: 15px;" href="https://huggingface.co/blog/hf-bitsandbytes-integration">A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes</a></li>
    <li><a style="font-size: 15px;" href="https://habr.com/ru/company/yandex/blog/672396/">Yandex has released YaLM 100B, currently the largest GPT-like neural network in open access. Here is how it was trained (in Russian)</a></li>
    <li><a style="font-size: 15px;" href="https://paperswithcode.com/method/dropout">Dropout</a></li>
    <li><a style="font-size: 15px;" href="https://huggingface.co/docs/transformers/main/en/serialization">Export 🤗 Transformers Models</a></li>
    <li><a style="font-size: 15px;" href="https://www.kaggle.com/competitions/bms-molecular-translation/discussion/230498#1262312">Tips for Transformer inference</a></li>
    <li><a style="font-size: 15px;" href="https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/">DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization</a></li>
    <li><a style="font-size: 15px;" href="https://jarvislabs.ai/blogs/deepspeed">Training Large NLP Models Efficiently with DeepSpeed Hugging Face</a></li>
    <li><a style="font-size: 15px;" href="https://www.kaggle.com/competitions/feedback-prize-effectiveness/discussion/339147">DeepSpeed Compression</a></li>
    <li><a style="font-size: 15px;" href="https://www.kaggle.com/code/tanulsingh077/longformer-training-with-deepspeed-and-hf-trainer/notebook?scriptVersionId=86784214">LongFormer Training with DeepSpeed and HF-Trainer</a></li>
    <li><a style="font-size: 15px;" href="https://www.kaggle.com/competitions/feedback-prize-effectiveness/discussion/347537">Team Hydrogen: Efficiency Prize 1st Place</a></li>
</ul>

<h1>Releases</h1>
<ul>
    <li style="font-size: 15px;"><b>12.09.2022</b> - initial release.</li>
</ul>