# Introduction 

The Transformer is a neural network architecture that has become a standard choice for natural language processing (NLP) and other sequence modeling tasks. Unlike recurrent models, which process inputs one token at a time, the Transformer relies on self-attention: each word in a sentence can attend directly to every other word, so contextual information and dependencies are captured in parallel.

At the heart of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when generating representations. By attending to relevant words, the model can effectively focus on the most informative context for each word, regardless of its position in the sequence. This ability to capture long-range dependencies and consider the entire context simultaneously is a key strength of the Transformer, enabling it to handle complex linguistic structures and capture subtle relationships between words.
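
As a rough illustration of the weighting described above, the following sketch computes scaled dot-product attention for one toy sentence. The function name `scaled_dot_product_attention_sketch` and the tensor sizes are chosen here for illustration only, and queries, keys, and values are all set to the same random vectors, which is a simplification (a real layer applies learned projections first).

```python
import torch

def scaled_dot_product_attention_sketch(q, k, v):
    """Return (output, weights) for q, k, v of shape (seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every position with every other position, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # Each row becomes a probability distribution over the positions
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(4, 8)  # 4 "words", each an 8-dimensional vector (random stand-ins)
output, weights = scaled_dot_product_attention_sketch(x, x, x)

print(weights.shape)  # torch.Size([4, 4])
print(output.shape)   # torch.Size([4, 8])
```

Row `i` of `weights` says how much word `i` draws on each word of the sentence, and every row sums to 1.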

Another significant advantage of the Transformer is that it parallelizes well. Recurrent models must process tokens one after another, so each step waits on the previous one, whereas the Transformer computes the representations of all positions in a sequence at once. This speeds up training considerably on modern hardware, although autoregressive generation at inference time still proceeds token by token. The Transformer can also handle input sequences of different lengths, but batching them into a single tensor still requires padding (or truncation) to a common length, usually combined with a mask so that padded positions are ignored.

In [3]:
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers):
        super(Transformer, self).__init__()
        
        self.embedding = nn.Linear(input_dim, hidden_dim)
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(hidden_dim, num_heads) for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(hidden_dim, input_dim)
        
    def forward(self, x):
        x = self.embedding(x)
        
        for layer in self.encoder_layers:
            x = layer(x)
        
        output = self.output_layer(x)
        return output

class EncoderLayer(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super(EncoderLayer, self).__init__()

        self.self_attention = MultiheadAttention(hidden_dim, num_heads)
        self.feed_forward = FeedForward(hidden_dim)

        self.layer_norm1 = nn.LayerNorm(hidden_dim)
        self.layer_norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        # Self-attention sub-layer: normalize, attend, then add the residual
        x_residual = x
        x = self.layer_norm1(x)
        x = self.self_attention(x)
        x = x_residual + x

        # Feed-forward sub-layer: normalize, transform, then add the residual
        x_residual = x
        x = self.layer_norm2(x)
        x = self.feed_forward(x)
        x = x_residual + x

        return x

class MultiheadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super(MultiheadAttention, self).__init__()
        assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by num_heads"

        self.hidden_dim = hidden_dim
        self.num_heads = num_heads

        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.size()
        head_dim = hidden_dim // self.num_heads

        # Project, then split into heads: (batch_size, num_heads, seq_len, head_dim)
        query = self.query(x).view(batch_size, seq_len, self.num_heads, head_dim).transpose(1, 2)
        key = self.key(x).view(batch_size, seq_len, self.num_heads, head_dim).transpose(1, 2)
        value = self.value(x).view(batch_size, seq_len, self.num_heads, head_dim).transpose(1, 2)

        # Scaled dot-product attention, computed independently for each head
        scores = torch.matmul(query, key.transpose(-2, -1)) / head_dim ** 0.5
        attention_weights = nn.functional.softmax(scores, dim=-1)

        # Merge the heads back into (batch_size, seq_len, hidden_dim)
        x = torch.matmul(attention_weights, value).transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)
        x = self.output(x)

        return x

class FeedForward(nn.Module):
    def __init__(self, hidden_dim):
        super(FeedForward, self).__init__()

        self.hidden_dim = hidden_dim

        # Expand to 4x the hidden size, then project back
        self.linear1 = nn.Linear(hidden_dim, hidden_dim * 4)
        self.linear2 = nn.Linear(hidden_dim * 4, hidden_dim)

    def forward(self, x):
        x = self.linear1(x)
        x = nn.functional.relu(x)
        x = self.linear2(x)

        return x

# Example input and output for the Transformer class (run once all classes are defined)
input_dim = 512
hidden_dim = 256
num_heads = 8
num_layers = 4

model = Transformer(input_dim, hidden_dim, num_heads, num_layers)
input_data = torch.randn(10, 20, input_dim)  # Example input tensor of shape (batch_size, sequence_length, input_dim)
output_data = model(input_data)  # Example output tensor

print("Transformer:")
print("Input shape:", input_data.shape)
print("Output shape:", output_data.shape)
print()


Transformer:
Input shape: torch.Size([10, 20, 512])
Output shape: torch.Size([10, 20, 512])



In [9]:
# Example usage
model = Transformer(input_dim=512, hidden_dim=256, num_heads=8, num_layers=4)
input_data = torch.randn(10, 20, 512)  # Input tensor of shape (batch_size, sequence_length, input_dimension)
output = model(input_data)  # Forward pass through the model
print(output.shape)  # Print the shape of the output

torch.Size([10, 20, 512])


# Remark: what does the tensor torch.randn(10, 20, 512) mean?

The input tensor torch.randn(10, 20, 512) stands in for a batch of 10 sentences, each padded or truncated to 20 words, with every word represented by a 512-dimensional vector (a word embedding). Here the values are random; in a real pipeline they would come from an embedding layer. A tensor of this shape can be fed into a Transformer model for further processing, such as positional encoding and self-attention, to capture the relationships and dependencies between the words in each sentence.
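
The remark mentions positional encoding, which the hand-written `Transformer` class above does not include. A minimal sketch of the fixed sinusoidal variant follows, assuming an even embedding size; the function name `sinusoidal_positional_encoding` is made up for this example, and learned position embeddings are an equally common alternative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, dim):
    """Return a (seq_len, dim) tensor of fixed sinusoidal encodings (dim must be even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # One frequency per pair of dimensions, decreasing geometrically
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )  # (dim / 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

x = torch.randn(10, 20, 512)                  # (batch_size, sequence_length, embedding_dim)
pe = sinusoidal_positional_encoding(20, 512)  # (sequence_length, embedding_dim)
x_with_positions = x + pe                     # broadcasts over the batch dimension

print(x_with_positions.shape)  # torch.Size([10, 20, 512])
```

Without some signal of this kind, self-attention treats the sentence as an unordered set of words.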

# MultiheadAttention


In [8]:

# Example input for MultiheadAttention
input_dim = 20
hidden_dim = 20
num_heads = 4

attention = MultiheadAttention(hidden_dim, num_heads)

# Create a random input tensor
input_tensor = torch.randn(30, 5, input_dim)  # Example input tensor of shape (batch_size, sequence_length, input_dim)

# Perform the forward pass through the attention module
output = attention(input_tensor)

# Print the input and output shapes
print("Input shape:", input_tensor.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([30, 5, 20])
Output shape: torch.Size([30, 5, 20])
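
The shape bookkeeping inside `MultiheadAttention.forward` can be followed step by step. This sketch repeats only the reshaping, without the learned projections, and uses the same sizes as the cell above.

```python
import torch

batch_size, seq_len, hidden_dim, num_heads = 30, 5, 20, 4
head_dim = hidden_dim // num_heads  # 5 features per head

x = torch.randn(batch_size, seq_len, hidden_dim)

# Split the feature dimension into heads, then move the head axis ahead of the positions
heads = x.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
print(heads.shape)   # torch.Size([30, 4, 5, 5]) = (batch, heads, seq_len, head_dim)

# One (seq_len x seq_len) score matrix per sentence and per head
scores = heads @ heads.transpose(-2, -1) / head_dim ** 0.5
print(scores.shape)  # torch.Size([30, 4, 5, 5]) = (batch, heads, seq_len, seq_len)

# Undo the split: move heads back next to the features and flatten them
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)
print(torch.equal(merged, x))  # True
```

The two `[30, 4, 5, 5]` shapes coincide only because `seq_len` and `head_dim` both happen to be 5 here.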


# Let's run a Transformer on an NLP example

In [41]:
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers):
        super(Transformer, self).__init__()

        self.embedding = nn.Embedding(input_dim, hidden_dim)
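        # batch_first=True so the encoder reads inputs as (batch_size, sequence_length, hidden_dim)
        self.encoder_layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)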
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers)

    def forward(self, input_tensor):
        embedded = self.embedding(input_tensor)
        encoded = self.encoder(embedded)
        return encoded

# Example input sentences
sentences = ["I love using transformers!", "This is another example sentence.", "Transformers are powerful models."]

# Tokenize input sentences using whitespace tokenizer
tokenized_sentences = [sentence.split() for sentence in sentences]

# Determine the maximum sequence length
max_seq_length = max(len(tokens) for tokens in tokenized_sentences)

# Pad the tokenized sentences
padded_sentences = [tokens + ['<pad>'] * (max_seq_length - len(tokens)) for tokens in tokenized_sentences]

# Build vocabulary from padded sentences
vocab = {}
for tokens in padded_sentences:
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)

# Convert padded sentences to input tensors
batch_size = len(padded_sentences)
input_tensor = torch.tensor([[vocab[token] for token in tokens] for tokens in padded_sentences], dtype=torch.long)




# Define the Transformer model
input_dim = len(vocab)
hidden_dim = 32
num_heads = 4
num_layers = 2

transformer = Transformer(input_dim, hidden_dim, num_heads, num_layers)


# The input tensor input_tensor represents a batch of input sentences encoded as integer indices, with shape [batch_size, sequence_length].
# It is converted into dense word embeddings by the embedding layer and then processed by the Transformer encoder to capture contextual information.


# Run input tensor through the Transformer model
output = transformer(input_tensor)

# Print the input and output shapes
print("Input shape:", input_tensor.shape)
print("Output shape:", output.shape)


Input shape: torch.Size([3, 5])
Output shape: torch.Size([3, 5, 32])


# The meaning of the input tensor

In the context of the Transformer model, the input tensor input_tensor represents a batch of input sentences encoded as integers. It has the shape **[batch_size, sequence_length]**, where:

* batch_size refers to the number of input sentences in the batch.
* sequence_length corresponds to the length of the longest input sentence, counted in tokens or words; shorter sentences are padded up to it.

For example, the three sentences above are padded to five tokens each, so input_tensor has a shape of [3, 5]: a batch size of 3 and a sequence length of 5 tokens.

The values within the input_tensor are integer indices that represent the tokens of the input sentences. Each integer value corresponds to a specific word in the vocabulary.

The input_tensor tensor is created by converting the tokenized and padded sentences into a tensor using the vocabulary. Each element of the tensor represents the index of a word in the vocabulary.
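
The tokenize, pad, and index steps can be gathered into one small helper. This is a sketch in plain Python; the name `encode_batch` and the choice to reserve index 0 for `<pad>` are assumptions made here, not something the cell above does (there, `<pad>` simply gets the next free index when it is first seen).

```python
def encode_batch(sentences, pad_token="<pad>"):
    """Whitespace-tokenize, build a vocabulary, and pad each sentence to the longest one."""
    tokenized = [sentence.split() for sentence in sentences]
    max_len = max(len(tokens) for tokens in tokenized)

    vocab = {pad_token: 0}  # assumption: padding always maps to index 0
    for tokens in tokenized:
        for token in tokens:
            vocab.setdefault(token, len(vocab))

    ids = [
        [vocab[token] for token in tokens] + [vocab[pad_token]] * (max_len - len(tokens))
        for tokens in tokenized
    ]
    return ids, vocab

sentences = [
    "I love using transformers!",
    "This is another example sentence.",
    "Transformers are powerful models.",
]
ids, vocab = encode_batch(sentences)
print(ids)         # [[1, 2, 3, 4, 0], [5, 6, 7, 8, 9], [10, 11, 12, 13, 0]]
print(len(vocab))  # 14
```

Passing `ids` to `torch.tensor(..., dtype=torch.long)` would give a tensor of shape [3, 5], as in the example above.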

During the forward pass of the Transformer model, the input tensor is passed through the embedding layer (nn.Embedding), which maps each word index to its corresponding dense word embedding of size hidden_dim. This embedding layer converts the input tensor of indices into a tensor of dense word embeddings, allowing the model to work with continuous representations of words.
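
A small sketch of the lookup that `nn.Embedding` performs. The indices and the vocabulary size of 14 are made up for illustration, and the embedding weights are randomly initialized rather than trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=14, embedding_dim=32)

input_tensor = torch.tensor([[1, 2, 3, 4, 0],
                             [5, 6, 7, 8, 9]])  # (batch_size=2, sequence_length=5)
embedded = embedding(input_tensor)
print(embedded.shape)  # torch.Size([2, 5, 32])

# The lookup is plain row indexing into the weight matrix:
# the token at batch 0, position 1 has index 2, so it receives row 2.
print(torch.equal(embedded[0, 1], embedding.weight[2]))  # True
```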

The embedded tensor is then fed into the Transformer encoder, which applies self-attention and feed-forward operations to the embeddings to capture contextual information and relationships between the words in the input sentences. The Transformer encoder produces an output tensor of the same shape as the embedded tensor, [batch_size, sequence_length, hidden_dim], which has one more dimension than the integer input tensor.

# Remark

In the Transformer model, the output tensor has the same hidden dimension as the embedded input: the encoder maps a tensor of shape [batch_size, sequence_length, hidden_dim] to another tensor of that shape.

The output shape is determined by the configuration of the Transformer model's encoder layer and encoder. In this minimal implementation, we use the nn.TransformerEncoderLayer and nn.TransformerEncoder classes provided by PyTorch.

The nn.TransformerEncoderLayer applies self-attention and a feed-forward network to its input. It operates on a sequence of word embeddings and transforms it by letting each position attend to the other positions in the sequence; the result of the self-attention step is then passed through a position-wise feed-forward network. One such layer is a single encoder block; the stacking specified by the num_layers parameter is handled by nn.TransformerEncoder.

The nn.TransformerEncoder combines multiple nn.TransformerEncoderLayer instances and applies them sequentially to the input tensor. This helps capture dependencies between words at different positions in the input sequence.

Since self-attention and the feed-forward network both preserve the shape of what they are given, the output tensor retains the same sequence length and hidden dimension as the embedded input. The purpose of the Transformer model is to capture and encode meaningful representations of the input sequence while preserving its structure.

Therefore, the output shape of the Transformer model will be [batch_size, sequence_length, hidden_dim], where batch_size represents the number of input sentences, sequence_length is the maximum length of the input sentences, and hidden_dim represents the hidden dimension size.

In summary, the Transformer model embeds the integer input tensor and processes the embeddings through self-attention and feed-forward operations, resulting in an output tensor of shape [batch_size, sequence_length, hidden_dim].
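
One detail the example leaves out is that `<pad>` positions are attended to like any other token. Below is a hedged sketch of how a padding mask could be passed to PyTorch's encoder through `src_key_padding_mask`. The embeddings are random stand-ins, `dropout=0.0` is set only to make the comparison deterministic, and `batch_first=True` is assumed so the layout is [batch_size, sequence_length, hidden_dim].

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, num_heads, num_layers = 32, 4, 2

layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers)

embedded = torch.randn(3, 5, hidden_dim)  # stand-in for the embedded batch
# True marks a padded position that attention should ignore
padding_mask = torch.tensor([
    [False, False, False, False, True],
    [False, False, False, False, False],
    [False, False, False, False, True],
])

with torch.no_grad():
    out = encoder(embedded, src_key_padding_mask=padding_mask)

    # Overwrite the padded slot of sentence 0 and run the encoder again
    changed = embedded.clone()
    changed[0, 4] = torch.randn(hidden_dim)
    out_changed = encoder(changed, src_key_padding_mask=padding_mask)

print(out.shape)  # torch.Size([3, 5, 32])
# The real tokens of sentence 0 are unaffected by what sits in the padded slot
print(torch.allclose(out[0, :4], out_changed[0, :4], atol=1e-5))  # True
```

The output shape is unchanged by the mask; only which positions may be attended to differs.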

# BERT Models

BERT (Bidirectional Encoder Representations from Transformers) is a popular pre-trained Transformer-based model developed by Google. It is designed to capture bidirectional contextual information from input text, enabling it to understand the meaning and relationships between words in a given sentence.

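BERT models are pre-trained on large amounts of unlabeled text using self-supervised learning. The main training objective, masked language modeling, hides a fraction of the words in a sentence and asks the model to predict them from the surrounding context; the original BERT also used a next-sentence prediction objective. Together these help the model learn contextual representations of words.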
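
A toy sketch of the masking step behind this "predict the missing words" objective. It works on whitespace tokens rather than BERT's WordPiece ids, the helper name `mask_tokens` is invented here, and it implements only the simplest case of replacing each selected token with `[MASK]`; actual BERT pre-training selects about 15% of tokens and sometimes keeps or randomizes a selected token instead.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels); labels hold the original token where masked, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # the model sees the mask...
            labels.append(token)       # ...and is trained to recover this token
        else:
            masked.append(token)
            labels.append(None)        # no prediction target at this position
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)  # higher rate so a short sentence shows masks
print(masked)
print(labels)
```

Because the model sees the words on both sides of each mask, it has to use left and right context together.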

The key features of BERT models include:

Transformer Architecture: BERT models are based on the Transformer architecture, which consists of multiple layers of self-attention and feed-forward neural networks. This architecture allows the model to efficiently process sequential data and capture dependencies between words.

Bidirectional Context: Unlike traditional language models that process text in one direction (either left-to-right or right-to-left), BERT leverages bidirectional context. It considers both the left and right contexts of each word during training, resulting in better contextual understanding.

Pre-training and Fine-tuning: BERT models are pre-trained on large-scale text corpora using unsupervised learning. After pre-training, they can be fine-tuned on specific downstream tasks such as sentiment analysis, named entity recognition, question answering, and more. Fine-tuning involves training the model on task-specific labeled data to adapt it to the specific task.

Large-Scale Training Data: BERT models are trained on massive amounts of publicly available text data, such as Wikipedia articles and books. This extensive training data helps the models learn rich language representations that can be generalized to various natural language processing tasks.

BERT models have achieved state-of-the-art performance on various NLP benchmarks and tasks, showcasing their effectiveness in capturing contextual information and understanding natural language. Due to their versatility, BERT models have become widely adopted in both research and industry for a wide range of NLP applications.

It's worth noting that BERT is just one example of a pre-trained Transformer-based model, and there are other variants and architectures available as well, such as GPT (Generative Pre-trained Transformer) and RoBERTa (Robustly Optimized BERT Approach).

# Example

In [48]:
from transformers import BertTokenizer, BertForSequenceClassification

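# Load the pre-trained BERT encoder and tokenizer.
# Note: "bert-base-uncased" has no trained classification head, so the head is
# newly initialized and the predictions below are not meaningful until the
# model is fine-tuned on labeled sentiment data.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)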

# Example input sentences
sentences = [
    "I love this product! It's amazing.",
    "This movie was terrible. I wouldn't recommend it.",
    "I think I like other movies better.",
]

# Tokenize the input sentences
encoded_inputs = tokenizer.batch_encode_plus(
    sentences,
    add_special_tokens=True,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# Obtain the input tensors
input_ids = encoded_inputs["input_ids"]
attention_mask = encoded_inputs["attention_mask"]

# Run the classifier; with an untrained head these logits are not real sentiment scores
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits

# Get the predicted labels
predicted_labels = logits.argmax(dim=1)

# Define the label mapping
label_mapping = {0: "Negative", 1: "Positive"}

# Print the predicted labels
for i, label_idx in enumerate(predicted_labels):
    label = label_mapping[label_idx.item()]
    print(f"Sentence: {sentences[i]}")
    print(f"Predicted Sentiment: {label}")
    print()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Sentence: I love this product! It's amazing.
Predicted Sentiment: Negative

Sentence: This movie was terrible. I wouldn't recommend it.
Predicted Sentiment: Positive

Sentence: I think I like other movies better.
Predicted Sentiment: Negative



# In the example above
* We import the necessary modules: BertTokenizer for tokenization and BertForSequenceClassification, a BERT encoder with a sequence classification head on top.

* We specify the pre-trained model to load. In this case, we're using the "bert-base-uncased" model, which is trained on lowercased text.

* We create a tokenizer using the BertTokenizer.from_pretrained method and load the pre-trained model using BertForSequenceClassification.from_pretrained.

* We define a list of example input sentences.

* We use the tokenizer's batch_encode_plus method to tokenize and encode the input sentences. We add special tokens, pad the sequences, and truncate if necessary. The resulting encoded inputs include input_ids and attention_mask.

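* We pass the input_ids and attention_mask tensors to the model, which produces one logit per class for each sentence. Because the classification head of "bert-base-uncased" is newly initialized (as the warning in the output says), these logits do not reflect real sentiment until the model is fine-tuned, which is why the predictions above look arbitrary.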

* We obtain the predicted labels by taking the argmax of the logits along the dimension representing the classes.

* We define a label mapping to map the label indices to their corresponding sentiment labels.

* Finally, we print the input sentences and their predicted sentiment labels.
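
To make the argmax step concrete, here is a sketch with made-up logits (not output from the model above) showing how softmax turns them into class probabilities and how argmax picks a label. The label mapping is the same assumed one as in the example.

```python
import torch

# Hypothetical logits for 3 sentences and 2 classes
logits = torch.tensor([[ 2.0, -1.0],
                       [-0.5,  1.5],
                       [ 0.1,  0.0]])

probs = torch.softmax(logits, dim=1)  # each row sums to 1
predicted = logits.argmax(dim=1)      # same result as probs.argmax(dim=1)

label_mapping = {0: "Negative", 1: "Positive"}
print([label_mapping[i.item()] for i in predicted])  # ['Negative', 'Positive', 'Negative']
print(probs)
```

The third row shows why the probabilities are worth checking: the two logits are nearly equal, so the "Negative" label there is close to a coin flip.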