In [1]:
from transformers import AutoModel, AutoTokenizer
from transformers import pipeline

## BART - The Encoder/Decoder Architecture
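Encoder-decoder models read an input sequence with the encoder and generate an output sequence with the decoder. BART in particular is pre-trained to reconstruct its original input from a corrupted copy. These models are useful for things like: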
- Translations (e.g. French to English)
- Generating captions from images
- Summarization tasks

In [2]:
BART = AutoModel.from_pretrained("facebook/bart-large")

### Analysis
- BART (`facebook/bart-large`) consists of 12 encoder layers and 12 decoder layers.
- The encoder layers use self-attention.
- Each decoder layer additionally features `encoder_attn`, which is an implementation of **cross-attention**.
- Cross-attention is the link between the encoder and decoder components. It lets the decoder determine which parts of the encoder's output are most relevant at each step, so the decoder output is conditioned on the encoder representations.
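
The cross-attention idea described above can be sketched in plain Python. This is a minimal single-head illustration rather than BART's implementation: the learned `q_proj`, `k_proj`, `v_proj` and `out_proj` layers are left out (treated as identity), and the toy vectors are made up. The point is only that the queries come from the decoder while the keys and values come from the encoder.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Single-head attention: queries from the decoder, keys/values from the encoder.

    Assumption: the learned projections are omitted to keep the sketch short.
    """
    d = len(encoder_states[0])
    outputs, weights = [], []
    for query in decoder_states:
        # Score this decoder position against every encoder position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in encoder_states]
        attn = softmax(scores)
        # Weighted average of the encoder states (the "values").
        mixed = [sum(w * value[i] for w, value in zip(attn, encoder_states))
                 for i in range(d)]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights

# Toy data: 3 encoder positions, 2 decoder positions, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]
outputs, weights = cross_attention(decoder_states, encoder_states)

# One output vector per decoder position, one weight per encoder position.
print(len(outputs), len(weights[0]))
```

Each row of `weights` sums to 1 and shows how strongly one decoder position draws on each encoder position.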

In [3]:
print( BART )

BartModel(
  (shared): BartScaledWordEmbedding(50265, 1024, padding_idx=1)
  (encoder): BartEncoder(
    (embed_tokens): BartScaledWordEmbedding(50265, 1024, padding_idx=1)
    (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
    (layers): ModuleList(
      (0-11): 12 x BartEncoderLayer(
        (self_attn): BartSdpaAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=

### Summarization with BART
- This uses the `pipeline` feature of the `transformers` library

In [4]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# `summary` avoids shadowing Python's built-in `sum`
summary = summarizer("""Gaga was best known in the 2010s for pop hits like “Poker Face” and avant-garde experimentation on albums like “Artpop,” and Bennett, a singer who mostly stuck to standards, was in his 80s when the pair met. And yet Bennett and Gaga became fast friends and close collaborators, which they remained until Bennett’s death at 96 on Friday. They recorded two albums together, 2014’s “Cheek to Cheek” and 2021’s “Love for Sale,” which both won Grammys for best traditional pop vocal album.""", min_length=20, max_length=50)
print( summary[0]['summary_text'] )

Device set to use cuda:0


Bennett and Gaga became fast friends and close collaborators. They recorded two albums together, 2014's "Cheek to Cheek" and 2021's "Love for Sale"


## BERT - The Encoder-Only Architecture

Encoder-only models are created by stacking multiple encoder blocks. Because the encoder output is not coupled to a decoder, the model produces a vector encoding of the input rather than generated text. It can also be paired with a classification head (a feedforward layer on top) for label prediction.
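
To make the classification head concrete, here is a minimal sketch of a single linear layer followed by a softmax over the encoder's output vector. The encoding, the weights and the label names are hypothetical values chosen for illustration, not parameters of any trained model.

```python
import math

def classification_head(encoding, weight, bias):
    """Linear layer followed by softmax: one probability per label."""
    logits = [sum(w * x for w, x in zip(row, encoding)) + b
              for row, b in zip(weight, bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-dimensional encoding and a 2-label head (untrained, illustrative values).
encoding = [0.5, -1.0, 0.25, 2.0]
weight = [[1.0, 0.0, 0.0, 1.0],   # row for label 0
          [0.0, 1.0, 1.0, 0.0]]   # row for label 1
bias = [0.0, 0.0]
labels = ["positive", "negative"]

probs = classification_head(encoding, weight, bias)
predicted = labels[probs.index(max(probs))]
print(predicted, probs)
```

In a real model the weights of the head are learned during fine-tuning, and the encoding has hundreds of dimensions instead of four.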

A fundamental distinction in the encoder-only architecture is the absence of the masked self-attention layer that decoder models use to prevent future tokens from influencing the current token during training. In encoder-only models, the self-attention mechanism processes the entire input sequence at once, allowing every token to attend to the full context on both sides. This makes them well suited for generating vector representations of whole documents.


In [5]:
BERT = AutoModel.from_pretrained("bert-base-uncased")

In [6]:
print( BERT )

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

### More about BERT
BERT introduced the encoder-only model. It improved state-of-the-art scores on various NLP tasks. The model itself is pre-trained with two learning objectives:
1. **Masked Language Modeling (MLM)** - Random tokens in the input are masked, and the model is trained to predict these masked tokens, allowing it to learn deep bidirectional representations.
2. **Next Sentence Prediction (NSP)** - Sentences are presented in pairs, and the model is trained to determine whether the second sentence actually follows the first in the original text, helping it understand sentence relationships.
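
The masking step behind the MLM objective can be sketched as follows. This is a simplified illustration, assuming every selected token is replaced by `[MASK]`; BERT's published recipe selects roughly 15% of tokens and does not always substitute `[MASK]`. The function name `mask_tokens` and its parameters are made up for this example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and labels.

    Labels hold the original token at masked positions and None elsewhere,
    so a loss would only be computed where the model has to guess.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, predicting the hidden token forces it to use context from both directions.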

BERT (here `bert-base-uncased`) follows the traditional transformer architecture with 12 stacked encoder blocks. In addition, the network's output is passed to a **pooler layer**: a **feedforward linear layer followed by a non-linearity** (tanh in the Hugging Face implementation), applied to the hidden state of the first (`[CLS]`) token to construct the final representation. This representation can then be used for other tasks such as **classification and similarity assessment**.
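
A minimal sketch of the pooling step, assuming (as in the Hugging Face implementation) that the pooler reads only the first token's hidden state and uses tanh as its non-linearity. The toy hidden states and identity weights are illustrative only.

```python
import math

def pooler(hidden_states, weight, bias):
    """Linear layer plus tanh applied to the first token's hidden state."""
    first = hidden_states[0]  # position 0 holds the [CLS] token
    return [math.tanh(sum(w * x for w, x in zip(row, first)) + b)
            for row, b in zip(weight, bias)]

# Toy sequence of 3 tokens with hidden size 2 (BERT-base uses 768).
hidden_states = [[0.5, -0.5], [1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, purely illustrative
bias = [0.0, 0.0]

pooled = pooler(hidden_states, weight, bias)
print(pooled)
```

The result is a single fixed-size vector per input sequence, regardless of how many tokens the sequence contains.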

## Sentiment Analysis
This next cell uses a fine-tuned version of the BERT model for sentiment analysis. The model rates the text on a scale from 1 to 5 stars.

In [7]:
classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
lbl = classifier("""This restaurant is awesome.""")

print( lbl )

Device set to use cuda:0


[{'label': '5 stars', 'score': 0.8550481200218201}]


## GPT-2 - The Decoder-Only Architecture
These models focus on predicting the next token of their output. Scaling up decoder-only models can considerably improve the network's language understanding and generalization capabilities. As a result, a single model can handle a variety of tasks, in some cases as well as an encoder-only model, just by varying the prompt. Examples include:
- Classification
- Summarization
- Translation

LLMs such as those in the GPT family are pre-trained with the Causal Language Modeling objective. The model learns to predict the next token while its attention mechanism can only attend to previous tokens, those to its left. The model therefore predicts each token from the preceding context alone and cannot peek at future tokens.
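
The difference between causal and unrestricted attention can be shown with a small mask. This is a sketch with uniform toy scores, not GPT-2's code: positions to the right of the current token get a score of negative infinity, so their softmax weight becomes exactly zero.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix, optionally with a causal mask."""
    weights = []
    for i, row in enumerate(scores):
        # With a causal mask, position i may only look at positions 0..i.
        visible = [s if (not causal or j <= i) else -math.inf
                   for j, s in enumerate(row)]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform toy scores for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
causal = attention_weights(scores, causal=True)
full = attention_weights(scores, causal=False)

for row in causal:
    print([round(w, 2) for w in row])
```

The causal weights form a lower-triangular matrix, while `full` spreads attention over all four positions in every row, which is the behavior of an encoder-only model.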

In [8]:
gpt2 = AutoModel.from_pretrained("gpt2")

In [9]:
print( gpt2 )

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)


### Analysis

As you can see from the output above, this is the typical transformer decoder block **without the cross-attention layer**.
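
The shapes in the printout are enough to tally the model's parameters by hand. The sketch below assumes that every `Conv1D` layer carries a bias vector and that every `LayerNorm` has one scale and one shift value per feature (consistent with `elementwise_affine=True` above).

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer (assumed the same for Conv1D)."""
    return n_in * n_out + n_out

def layer_norm_params(size):
    """One scale and one shift value per feature."""
    return 2 * size

# Sizes read from the printed GPT2Model.
vocab, positions, d_model, d_ff, n_blocks = 50257, 1024, 768, 3072, 12

embeddings = vocab * d_model + positions * d_model  # wte + wpe
attention = (linear_params(d_model, 3 * d_model)    # c_attn
             + linear_params(d_model, d_model))     # c_proj
mlp = (linear_params(d_model, d_ff)                 # c_fc
       + linear_params(d_ff, d_model))              # c_proj
block = 2 * layer_norm_params(d_model) + attention + mlp  # ln_1, ln_2, attn, mlp

total = embeddings + n_blocks * block + layer_norm_params(d_model)  # plus ln_f
print(f"{total:,}")  # 124,439,808
```

The total of roughly 124 million matches the size usually quoted for the smallest GPT-2 checkpoint.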

The GPT family also uses a custom `Conv1D` layer where other models use `Linear`. Despite the name, it is not related to PyTorch's convolutional layers: it behaves like a linear layer whose weight matrix is stored transposed. This design choice stems from OpenAI's original implementation, whereas other large open-source language models employ the conventional linear layer.
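
The relationship between the two layer types comes down to how the weight matrix is laid out. The sketch below uses plain nested lists and omits the bias; the function names are made up for illustration. It also helps explain `c_attn` in the printout: its output size of 2304 is 3 × 768, since the Query, Key and Value projections are fused into one layer.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a nested list."""
    return [list(col) for col in zip(*m)]

def conv1d_style(x, weight):
    """Weight stored as (in_features, out_features): y = x @ W."""
    return matmul(x, weight)

def linear_style(x, weight):
    """Weight stored as (out_features, in_features): y = x @ W^T."""
    return matmul(x, transpose(weight))

x = [[1.0, 2.0]]                        # one token, 2 input features
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # (in=2, out=3), the Conv1D-style layout

print(conv1d_style(x, w))
print(linear_style(x, transpose(w)))
```

Both calls produce the same result, which is why a `Conv1D` weight can be moved into a standard linear layer by transposing it.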

The following example uses GPT-2 for text prediction. It generates four possibilities to complete the statement: "This movie was a very"...
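
The generation call in this example samples with `top_p=0.95`, also known as nucleus sampling. The sketch below illustrates the idea on a hypothetical next-token distribution; it is a simplified stand-in, not the library's implementation, and the probabilities are invented.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / cumulative for token, p in kept.items()}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution after "This movie was a very".
next_token_probs = {"good": 0.5, "bad": 0.3, "long": 0.15, "purple": 0.05}
nucleus = top_p_filter(next_token_probs, top_p=0.9)
print(sorted(nucleus))
print(sample(nucleus, random.Random(0)))
```

Cutting off the unlikely tail keeps the output varied across the four returned sequences while avoiding tokens the model considers implausible.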

In [11]:
generator = pipeline("text-generation", model="gpt2")

Device set to use cuda:0


In [12]:
output = generator("This movie was a very", do_sample=True, top_p=0.95, num_return_sequences=4, max_new_tokens=50, return_full_text=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [13]:
for item in output:
  print( ">", item['generated_text'] )

>  emotional moment for me. My wife was so supportive and so kind. It brought back something that didn't happen in my previous films… A lot of emotions, from sadness to happiness… She was a perfect person for this story. She knew how to
>  good one to watch and I had one great time.


"I love the art of the craftsmanship," said producer Jeff Greening. "I just think it's funny. I like to put all these drawings together."

Greening
>  nice project that made me want to see more of it than any other. A very nice, fun show, for people who love to watch movies. I love The Hunger Games but because of how much it inspired me to make this and so much more
>  successful and extremely rewarding experience for me, as I got to go inside the film and learn so much. And the film set the tone for me, as it set the tone for people that know me. That's really what's going on, and


## Transformer Architecture Optimization Techniques

The more input a transformer model receives, the more context it has to generate an appropriate text continuation. For this reason, the input is commonly referred to as the *"context"*.

As the context length expands, the computational resources required for training and inference increase substantially. **See: [Quadratic time and space complexity](https://github.com/bradtraversy/traversy-js-challenges/blob/main/05-complexity/05-quadratic-time-complexity/readme.md)**

This cost stems from two primary operations:
1. The *multiplication* of the input token embeddings with the *learned matrices*, which creates the Query, Key and Value matrices. This costs O(n·d²), where *n* is the number of input tokens and *d* is the embedding dimension.
2. The *multiplication* of the Query, Key and Value matrices with one another in the attention computation, which costs O(n²·d).

In the *attention layer*, the computational cost therefore grows **quadratically** with the context length, that is, with the number of token embeddings.
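
A rough operation count makes the difference between the two terms visible. The sketch below assumes a single attention head, counts only multiply-adds in the matrix products, and ignores the softmax and other constant factors. The hidden size of 768 is taken from the GPT-2 printout earlier in this notebook.

```python
def projection_cost(n, d):
    """Multiply-adds to build Q, K and V: three (n x d) @ (d x d) products."""
    return 3 * n * d * d

def attention_cost(n, d):
    """Multiply-adds for Q @ K^T (n x n scores) plus scores @ V."""
    return 2 * n * n * d

d = 768  # hidden size of the GPT-2 checkpoint printed above
for n in (1024, 2048, 4096):
    print(n, projection_cost(n, d), attention_cost(n, d))
```

Doubling the context length doubles the projection cost but quadruples the attention cost, which is why long contexts are dominated by the attention term.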

Optimization is clearly necessary when models are used as assistants working over long documents or corpora of text.