# Architecture analysis and comparison

The main goal of this notebook is to understand which detoxification models exist, what architectures they use, and which techniques are common for this problem.

Prerequisites to run this notebook:
* install all libraries listed in `requirements.txt`

# What a detoxification model pipeline looks like

We can formulate the detoxification task as text-to-text translation or, more precisely, as a sequence-to-sequence task (because text is a sequence of words). Seen this way, it is easy to understand that we are working with a translation task. One of the most popular architectures for seq-to-seq tasks is the transformer.

Transformers are a class of neural networks that solve the problem of sequence transduction (for example, machine translation). Transformers are gaining popularity because they handle a variety of sequence-related tasks (such as text and speech problems). Earlier seq-to-seq models were based on the idea of memory: recurrent layers and memory cells. Transformers replace recurrence with attention, a mechanism that scores the importance of each part of the input to highlight the key parts the model should focus on.
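
To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is only an illustration and not the implementation used by the models below: real models apply learned query, key and value projections and several heads, and the toy `tokens` vectors are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For every query, return a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy example: three 2-dimensional "token" vectors (hypothetical values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended)
```

Each output row is a mixture of all input vectors, so every position can use information from every other position without any recurrent state.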

Knowing all that, we can start looking at existing solutions and try to understand how and why they work.

# How the original model from the dataset authors is designed

As we already know, the scores in this dataset were generated by `SkolkovoInstitute/roberta_toxicity_classifier`. The authors of that classifier also released a detoxification model, so let's look at the architecture they proposed.


In [12]:
from transformers import BartForConditionalGeneration

model_name = 'SkolkovoInstitute/bart-base-detox'

model = BartForConditionalGeneration.from_pretrained(model_name)

model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50266, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50266, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=
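
The shapes printed above are enough to estimate how many parameters one `BartEncoderLayer` holds. The sketch below is plain arithmetic on those printed shapes. The helper names are hypothetical, and `final_layer_norm` is left out because its line is cut off in the output.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameters of a Linear layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


def layer_norm_params(size):
    """LayerNorm with elementwise_affine=True: one weight and one bias per feature."""
    return 2 * size


d_model, d_ff = 768, 3072  # sizes taken from the printed BART encoder layer

# k_proj, v_proj, q_proj and out_proj are all Linear(768, 768, bias=True)
attention = 4 * linear_params(d_model, d_model)
# fc1 expands 768 -> 3072, fc2 projects 3072 -> 768
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
# self_attn_layer_norm only; final_layer_norm is truncated above and excluded
norm = layer_norm_params(d_model)

layer_total = attention + feed_forward + norm
print(f"attention:    {attention:,}")
print(f"feed-forward: {feed_forward:,}")
print(f"layer total:  {layer_total:,}")
```

The feed-forward part holds roughly twice as many parameters as the attention projections, which is worth keeping in mind when sizing our own model.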

# Other detoxification architectures

Let's look at what other architectures exist. One of the most popular choices is the T5 model. Let's compare it with the one proposed above.

In [14]:
from transformers import AutoModelForSeq2SeqLM

model_name = 's-nlp/t5-paraphrase-paws-msrp-opinosis-paranmt'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

# Architectures for other tasks

Out of curiosity, it is also interesting to check how the architecture differs from models built for other tasks, for example whether language model architectures are different. Note that BERT is an encoder-only masked language model, not a seq-to-seq model.

In [16]:
from transformers import BertForMaskedLM

model_name = 'bert-base-uncased'

model = BertForMaskedLM.from_pretrained(model_name)

model

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a

# Model comparison analysis

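As we can see from the different architectures, the models have several things in common:\
3.1. Models embed tokens into a 768-dimensional vector\
3.2. The seq-to-seq models (BART and T5) consist of two main components, an encoder and a decoder, where the decoder differs from the encoder (not symmetrical); BERT is encoder-only\
3.3. Attention layers are the key component the models rely on; none of the printed models contains recurrent layers\
3.4. Each attention block is paired with some kind of normalization step

It is also worth mentioning that 2 of the 3 models considered use `GELUActivation` (the printed T5 block uses `DenseReluDense` instead), and that 2 of the 3 printouts (T5 and BERT) include `Dropout(p=0.1)` modules. None of the models uses `LSTM` layers.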
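
Observations like these can be checked programmatically instead of by reading the printed output. The sketch below counts submodule classes through `model.modules()`, which every PyTorch `nn.Module` provides. The helper names and the `RECURRENT_TYPES` set are assumptions made for this illustration.

```python
from collections import Counter

# Class names treated as recurrent layers (an assumption for this sketch)
RECURRENT_TYPES = {"RNN", "LSTM", "GRU"}


def count_module_types(model):
    """Count how many submodules of each class a model contains."""
    return Counter(type(module).__name__ for module in model.modules())


def has_recurrent_layers(model):
    """Return True if any submodule class name is in RECURRENT_TYPES."""
    return any(name in RECURRENT_TYPES for name in count_module_types(model))


# Usage in this notebook, with `model` loaded in one of the cells above:
# count_module_types(model).most_common(10)
# has_recurrent_layers(model)
```

Running both helpers on each of the three models gives a compact summary of which layer types they share.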

# Outcomes of the session

During this session we looked at the architectures of several popular transformer models and noted aspects we can use when designing our own model. These aspects are listed in 3.1-3.4. It is also worth mentioning that this session helped to identify good practices that could also be considered, such as using `GELU` or `ReLU` activation functions, basing our solution on attention layers rather than `LSTM` layers, and using dropout with a rate close to `0.1`.

# Credits

Notebook created by Polina Zelenskaya\
Innopolis University DS21-03

Github: [github.com/cutefluffyfox](https://github.com/cutefluffyfox)\
Email: p.zelenskaya@innopolis.university
