# PCS5024 - Aprendizado Estatístico - Statistical Learning - 2023/1
### Professors: 
### Anna Helena Reali Costa (anna.reali@usp.br)
### Fabio G. Cozman (fgcozman@usp.br)

In [1]:
#!pip install --quiet torch transformers

In [2]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Transformers

The Transformer architecture, introduced by Vaswani et al. in their 2017 paper "Attention Is All You Need", has reshaped the field of natural language processing (NLP) and deep learning. The architecture is built on self-attention, which lets the model weigh the importance of different words in a sequence. It has led to state-of-the-art results in NLP tasks such as machine translation, question answering, and text summarization. This text gives an overview of the Transformer architecture and its key components.

1. Word Embeddings:
    The first step in the Transformers architecture is to convert the input text into a sequence of word embeddings. The word embeddings are learned during training and are used to represent the meaning of the words.

2. Positional Encoding:
    The Transformer architecture incorporates positional encoding to provide information about the relative positions of words. The positional encoding is added to the input word embeddings before they are fed into the self-attention layers. This enables the model to learn and exploit the order of the words in the sequence during training.

3. Attention Mechanism:
    The core innovation in the Transformers architecture is the attention mechanism, which allows the model to weigh the importance of different words in a sequence with respect to a given word. This is achieved by computing an attention score for each word, which is based on the similarity between the given word's representation and the other words in the sequence. The attention scores are then used to compute a weighted sum of the input representations, which forms the output of the attention layer.

4. Pretraining and Fine-tuning:
    The large-scale pretrained models have demonstrated the power of the Transformers architecture. These models are pretrained on massive amounts of text data in a self-supervised manner, learning to generate or predict masked words in a sentence. After pretraining, the models can be fine-tuned on specific tasks such as text classification, question-answering, and machine translation.

In general, the Transformer can be represented as follows:


<img src='https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png'  width="50%" height="50%">



The Transformer architecture is the "engine" behind state-of-the-art large language models (LLMs) such as GPT, Bard, T5, and others.

## Word embeddings

Embeddings in deep learning are maps from indices to high-dimensional representations. In LLMs, each token is associated with an integer. The embedding module maps each of these integers to a learnable vector of size $K$.

BERT (Devlin et al.) is a very influential Transformer model whose base version uses $K = 768$, which means each token is internally represented by a vector of size 768. The original Transformer of Vaswani et al. uses vectors of size 512.

As training progresses, the parameters associated with each token are learned, so that related tokens end up mapped to similar embeddings.
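As a minimal sketch of the lookup described above, the snippet below builds a toy embedding table in plain Python. The vocabulary, the embedding size of 4, and the random initialisation are illustrative assumptions, not values from any real model. In PyTorch the same role is played by `torch.nn.Embedding`, whose rows are updated during training.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary: each token is associated with an integer index.
vocab = {"[PAD]": 0, "the": 1, "cat": 2, "sat": 3}
K = 4  # embedding size (illustrative; real models use hundreds of dimensions)

# The embedding table holds one learnable row of size K per token index.
# Rows are randomly initialised here; training would adjust them.
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(K)] for _ in vocab]

def embed(tokens):
    """Map a list of tokens to their embedding vectors by index lookup."""
    return [embedding_table[vocab[token]] for token in tokens]

vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))  # 3 4
```

The lookup itself involves no computation: the token index simply selects a row, and learning consists of changing the contents of those rows.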

## The Attention Mechanism

The attention mechanism is a powerful technique used in deep learning models, particularly in sequence-to-sequence tasks like neural machine translation, text summarization, and question answering. The primary purpose of attention is to enable the model to selectively focus on different parts of the input sequence while generating the output. It helps the model to capture long-range dependencies and better understand the context.

The attention mechanism can be understood in three main steps:

1. Query, Key, and Value Vectors:
    For each word in the input sequence, the model generates three vectors: the query (Q), key (K), and value (V) vectors. They are obtained by multiplying the input word embedding by the respective weight matrices ($W_q$, $W_k$, and $W_v$), which are learned during training. The query vector represents the current word as it looks for relevant context, while the key and value vectors represent every word in the sequence (including the current one) as a candidate to attend to.

2. Attention Scores:
    The attention mechanism computes a score for each word in the input sequence with respect to the current word. This score indicates how relevant each word is to the current word. It is calculated as the dot product between the query vector (Q) of the current word and the key vector (K) of each word, divided by the square root of the key vector's dimension (usually denoted $d_k$). The scores are then passed through a softmax function to convert them into weights that sum to one.

3. Weighted Sum of Values:
    The softmax weights are used to compute a weighted sum of the value vectors (V), which forms the output of the attention layer for the current word.
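The steps above can be sketched in plain Python as follows. This is a minimal single-head illustration without masking or batching. The Q, K, and V values are hypothetical and given directly, whereas a real model would obtain them from the embeddings through $W_q$, $W_k$, and $W_v$.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head, without masking."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score of the current word against every word: (q . k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical Q, K, V for a 3-word sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds the attention weights of one word over all three words. Production implementations compute the same quantities with matrix products rather than Python loops.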

 <img src='https://production-media.paperswithcode.com/methods/multi-head-attention_l1A3G7a.png'  width="20%" height="20%"> 

 <img src='https://drive.google.com/uc?id=177C6VCnAXHnE1WLK_tSUi6dHV8FFLvHp'  width="15%" height="15%">

The attention mechanism analyses the fully connected **graph** in which each token is a node. Based on the value of the dot product between the vector representations of each token pair it determines which tokens are related.

<img src='https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png'  width="60%" height="60%">


## Positional Encoding

Positional encoding is a critical component of the Transformer architecture. The purpose of positional encoding is to provide the model with information about the positions of the tokens in a sequence, as the self-attention mechanism in transformers does not inherently capture this information.

The original Transformer paper uses sine and cosine functions of varying frequencies as the basis for creating the positional encodings. The encoding is added to the token embeddings before they are fed into the layers of the Transformer model. The encoding function is defined as:

$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{m}}}\right)$

$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{m}}}\right)$

where:

- $pos$ is the position of the token in the sequence
- $i$ indexes the pairs of encoding dimensions, with $0 \le i < d_m/2$ (dimension $2i$ uses the sine and dimension $2i+1$ the cosine)
- $d_m$ is the dimension of the token embeddings

These functions generate a distinct encoding for each position in the input sequence. The intuition is that a position is represented by a mix of sinusoids of different frequencies. The authors of the original paper hypothesized that this makes it easy for the model to attend by relative position, because for any fixed offset $k$, $PE(pos + k)$ can be written as a linear function of $PE(pos)$.

Adding positional encodings to the token embeddings enables the attention mechanism to consider both the content of the tokens and their positions in the sequence. This is particularly important for natural language processing tasks, where word order is crucial for understanding the meaning of a sentence.
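The encoding function defined above can be sketched in a few lines of plain Python. The function name `positional_encoding` and the small sizes used here are illustrative choices, and the sketch assumes an even $d_m$.

```python
import math

def positional_encoding(num_positions, d_m):
    """Return a num_positions x d_m table of sinusoidal encodings (d_m even)."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_m
        for i in range(d_m // 2):
            angle = pos / (10000 ** (2 * i / d_m))
            row[2 * i] = math.sin(angle)      # even dimensions use the sine
            row[2 * i + 1] = math.cos(angle)  # odd dimensions use the cosine
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_m=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1
```

Row `pos` of this table would be added element-wise to the embedding of the token at position `pos` before the first attention layer.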

In [3]:
def classify_sentiment(text):
    # Load the tokenizer and the model from the same fine-tuned checkpoint
    checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
    tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
    model = DistilBertForSequenceClassification.from_pretrained(checkpoint)

    # Tokenize the input text and convert tokens to tensor
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Run the model and get the classification logits
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

    # Convert the logits to probabilities and find the predicted class
    probs = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(probs, dim=-1).item()

    # Map the predicted class to a sentiment label
    sentiment_label = "positive" if predicted_class == 1 else "negative"
    return sentiment_label

In [4]:
text = "I absolutely love this product! It's amazing."
sentiment = classify_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: positive


In [5]:
text = "I'm not sure how to feel about this."
sentiment = classify_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: negative


In [6]:
text = "I'm not sure how to feel about this, but I'm looking forward to trying it"
sentiment = classify_sentiment(text)
print(f"Sentiment: {sentiment}")

Sentiment: positive
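`classify_sentiment` returns only the most likely label, which hides how confident the model is. This matters for ambiguous inputs such as the second example, because the SST-2 checkpoint has only two classes and no neutral option. The sketch below shows one way the softmax probability could be reported next to the label. The logits are hypothetical values chosen by hand, not actual model outputs, and `label_with_confidence` is an illustrative helper name.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "positive"]  # class 0 and class 1, as in classify_sentiment

def label_with_confidence(logits):
    """Return the predicted label together with its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits, chosen by hand for illustration only.
examples = [("clear-cut", [-4.0, 4.0]), ("ambiguous", [0.3, -0.1])]
for name, logits in examples:
    label, confidence = label_with_confidence(logits)
    print(f"{name}: {label} ({confidence:.2f})")
```

A probability close to 0.5 signals that the binary label should be read with caution.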


Transformers have been extremely successful, but there are also several limitations that represent very active areas of research.

- Resource intensity: Transformers demand substantial computational power and memory during both training and inference stages, which can limit their use in settings with restricted resources or on less powerful devices.

- Handling long sequences: Because the compute and memory cost of self-attention grows quadratically with sequence length, transformers have difficulty processing very long sequences, restricting their applicability in certain domains.

- Lack of interpretability: The complex inner workings of transformers are challenging to comprehend, making it problematic to explain their decisions or detect biases in their outputs.

- Overfitting and generalization issues: Transformers, particularly large-scale models, may sometimes overfit the training data, resulting in poor generalization to new, unseen data.

- Inefficient data utilization: Transformers often require vast amounts of text for pretraining, and fine-tuning on a specific task still needs labeled data, which can be costly or time-consuming to collect.
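To make the quadratic cost noted under "Handling long sequences" concrete, the following back-of-the-envelope sketch computes the memory needed to store one full attention score matrix. It assumes 32-bit floats and a naive implementation that materialises the whole matrix for a single head in a single layer. Both are simplifying assumptions.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=4):
    """Memory for one seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_score

for seq_len in (512, 2048, 8192):
    mib = attention_matrix_bytes(seq_len) / 2**20
    print(f"{seq_len:>5} tokens -> {mib:,.0f} MiB per head per layer")
```

Under these assumptions the matrix takes 1 MiB at 512 tokens, 16 MiB at 2048 tokens, and 256 MiB at 8192 tokens: a 16-fold increase in sequence length costs 256 times the memory.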

A more comprehensive overview of the Transformer architecture can be found at: https://jalammar.github.io/illustrated-transformer/