# Models exploration 
We can approach detoxification as a text-to-text task, specifically sequence-to-sequence translation: the model reads a toxic input sequence and produces a less toxic output sequence. The toxicity scores in this dataset were produced using the SkolkovoInstitute/roberta-toxicity-classifier model.
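
In a sequence-to-sequence setup, the decoder is typically trained with teacher forcing: at each step it sees the target tokens so far and predicts the next one. A minimal sketch of how decoder inputs can be derived from target token IDs by shifting them one position to the right; the token IDs and the `shift_right` helper are hypothetical and only illustrate the idea:

```python
IGNORE_INDEX = -100  # label value conventionally used to exclude padding from the loss


def shift_right(labels, decoder_start_token_id, pad_token_id):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    # Masked label positions must become real pad tokens on the input side.
    return [pad_token_id if tok == IGNORE_INDEX else tok for tok in shifted]


# Hypothetical token IDs for a target sentence, padded to length 6.
labels = [42, 17, 93, 2, IGNORE_INDEX, IGNORE_INDEX]
decoder_input_ids = shift_right(labels, decoder_start_token_id=0, pad_token_id=1)
print(decoder_input_ids)
```

The encoder-decoder models explored below (BART and T5) are trained in this fashion; the exact start and pad token IDs depend on the tokenizer and are not the ones used here.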

## Model from original authors

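This model embeds tokens into 768-dimensional vectors and uses a Transformer encoder-decoder architecture (BART). The encoder and decoder are not symmetrical: each decoder layer additionally attends to the encoder output through cross-attention. The model's core building blocks are stacked Transformer layers (six in the encoder, as the printout shows), each pairing multi-head self-attention with a feed-forward network and applying layer normalization after each of the two.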

In [4]:
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('SkolkovoInstitute/bart-base-detox')
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50266, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50266, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=
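
The shapes in the printout are enough to estimate the size of one encoder layer. A back-of-the-envelope sketch in plain Python; the helper names are made up for illustration, and `final_layer_norm` is assumed to carry a learned scale and shift like `self_attn_layer_norm`, since its line is cut off:

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def layer_norm_params(dim):
    """Learned scale and shift of a LayerNorm with elementwise_affine=True."""
    return 2 * dim


d_model, d_ff = 768, 3072  # read off the printed Linear shapes

# q_proj, k_proj, v_proj, out_proj: four 768 -> 768 projections with bias.
attention = 4 * linear_params(d_model, d_model)
# fc1 (768 -> 3072) and fc2 (3072 -> 768).
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
# self_attn_layer_norm and final_layer_norm (the latter assumed affine).
norms = 2 * layer_norm_params(d_model)

encoder_layer_total = attention + feed_forward + norms
print(f"attention:    {attention:,}")
print(f"feed-forward: {feed_forward:,}")
print(f"layer norms:  {norms:,}")
print(f"total:        {encoder_layer_total:,}")
```

Under these assumptions one encoder layer has 7,087,872 parameters, about two thirds of them in the feed-forward block.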

## T5-small model

T5 (Text-to-Text Transfer Transformer), developed by Google, is pretrained on a large text corpus and can be fine-tuned on smaller, task-specific datasets. It has both an encoder and a decoder and casts every task as mapping input text to output text, which makes it versatile for text transformation tasks such as detoxification. Note that the checkpoint loaded below has a hidden size of 768 and 12 attention heads, which matches the T5-base configuration rather than T5-small (hidden size 512, 8 heads).

In [5]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/t5-paraphrase-paws-msrp-opinosis-paranmt')
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro
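
The printout lists `T5LayerNorm` where BART uses `LayerNorm`. In the Hugging Face implementation this layer is a root-mean-square norm: it rescales by the RMS of the activations, without subtracting the mean and without a bias. A minimal plain-Python sketch of the difference, with illustrative function names and toy inputs:

```python
import math


def layer_norm(x, eps=1e-5):
    """Standard layer norm without the learned scale/shift: center, then rescale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, weight, eps=1e-6):
    """T5-style norm: divide by the root mean square, then apply a learned scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


x = [3.0, 4.0]
print(layer_norm(x))            # centered: values have opposite signs
print(rms_norm(x, [1.0, 1.0]))  # not centered: both values stay positive
```

Because the mean is not removed, the RMS variant preserves the sign pattern of its input, and it has only one learned vector (the scale) per layer.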

## GPT-2 model

GPT-2 is a decoder-only Transformer: a stack of decoder blocks (12 in the base gpt2 checkpoint loaded below), each combining multi-head self-attention with a feed-forward network. Its self-attention is causal, so when predicting the next token the model weighs only the tokens that precede it. This design lets GPT-2 capture long-range dependencies in the left context, which makes it well suited to text generation.
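
GPT-2's self-attention is causal: position i may attend only to positions up to and including i. A minimal plain-Python sketch of how that restriction shapes the attention weights; the score matrix is made up, and the function name is illustrative:

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i only sees positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # future positions are masked out
        peak = max(visible)     # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked)
    return weights


# Made-up raw attention scores for a 3-token sequence.
scores = [
    [0.2, 1.5, -0.3],
    [1.0, 0.1, 2.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The result is lower-triangular: the first token can only attend to itself, and each row still sums to one over the positions it is allowed to see.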

In [6]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('gpt2')
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## BERT

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only Transformer that produces bidirectional contextualized token embeddings. Unlike left-to-right language models, it attends to context on both sides of each token. The base model stacks 12 Transformer encoder layers, each combining self-attention with a feed-forward network. Because BERT has no decoder, it does not generate text; here it is loaded with a two-class sequence classification head (num_labels=2), which is newly initialized and must be fine-tuned before use.

In [7]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,
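
The warning printed when loading the model reflects that `BertForSequenceClassification` adds a fresh linear classifier on top of the pretrained encoder; with a hidden size of 768 and `num_labels=2`, that is a 768-to-2 projection. A minimal plain-Python sketch of what such a head computes, using a toy 3-dimensional input and made-up weights (the model itself returns logits; the softmax here is applied afterwards to get probabilities):

```python
import math


def classification_head(pooled, weight, bias):
    """Linear projection to one logit per label, followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy pooled vector standing in for BERT's 768-dimensional one.
pooled = [0.5, -1.0, 2.0]
weight = [
    [0.1, 0.2, -0.3],   # row for label 0
    [-0.2, 0.4, 0.5],   # row for label 1
]
bias = [0.0, 0.1]
probs = classification_head(pooled, weight, bias)
print([round(p, 3) for p in probs])

# Size of the real head for hidden size 768 and two labels: weights plus biases.
head_params = 768 * 2 + 2
print(head_params)
```

Until the head is fine-tuned, its weights are random, so the resulting probabilities carry no information about the input.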

## Similarities and Differences
**Similarities:**

- *Transformer Architecture:* All four models are built on the Transformer architecture and rely on self-attention, which lets them capture complex relationships in the data and model contextual dependencies between tokens.
- *Deep Learning Techniques:* All are pretrained neural networks that can be fine-tuned for a downstream task. The checkpoints shown here also share a hidden size of 768.
- *Contextualization:* All four produce context-dependent token representations. BERT and the encoders of BART and T5 are bidirectional, considering context on both sides of a token, whereas GPT-2 and the BART and T5 decoders attend only to preceding tokens.

**Differences:**

- *BERT:* BERT is an encoder-only model that produces bidirectional contextualized embeddings. It does not generate text on its own; here it is loaded as a sequence classifier.
- *GPT-2:* GPT-2 is a decoder-only model built from stacked causal self-attention blocks. It excels at generating coherent and contextually rich text.
- *T5:* T5 is an encoder-decoder, text-to-text transformer that converts text from one form to another, making it versatile for various text transformation tasks. As the printout shows, it uses relative position biases and linear layers without bias terms.
- *BART:* The authors' BART model is also an encoder-decoder, but it uses learned absolute position embeddings and linear layers with bias terms.