# Recent Methods in Transformers

Here are some of the latest methods in transformers, explained with intuitions, a bit of math, and simple documented implementations where possible:

### 1. FlashAttention

Paper:  https://arxiv.org/abs/2205.14135

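**Intuition**: FlashAttention computes exactly the same attention as a standard transformer, but organizes the work so the GPU spends far less time moving data between slow and fast memory. It is like working through a large problem one page at a time at your desk instead of walking to the filing cabinet for every intermediate result.

**Math**: Standard attention materializes an $n \times n$ score matrix, so its memory use grows as $O(n^2)$ in the sequence length $n$. FlashAttention reduces the extra memory to $O(n)$ by splitting the queries, keys, and values into blocks that fit in fast on-chip memory (tiling), accumulating the softmax incrementally, and running the whole computation in one fused kernel so the full score matrix is never written to GPU main memory. The number of arithmetic operations is still $O(n^2)$; the speedup comes from fewer memory reads and writes.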
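The tiling idea can be illustrated in plain PyTorch. The sketch below is not the FlashAttention kernel: it tiles only the keys and values, runs on whatever device the tensors live on, and the function name `tiled_attention` and the `block_size` parameter are chosen here for illustration. It only shows that attention can be computed exactly without ever holding the full score matrix, by keeping a running maximum and a running softmax denominator per query.

```python
import torch

def tiled_attention(q, k, v, block_size=4):
    """Exact softmax attention, computed one key/value block at a time.

    q, k, v: tensors of shape (seq_len, head_dim). Illustrative only.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5

    out = torch.zeros_like(q)                           # running weighted sum of values
    row_max = torch.full((seq_len, 1), float("-inf"))   # running max score per query
    row_sum = torch.zeros(seq_len, 1)                   # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                  # only (seq_len, block) at a time

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)       # rescales the old accumulators
        probs = torch.exp(scores - new_max)

        row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
        out = out * correction + probs @ v_blk
        row_max = new_max

    return out / row_sum

# Compare against the standard formulation that builds the full score matrix.
torch.manual_seed(0)
q, k, v = (torch.randn(10, 8) for _ in range(3))
reference = torch.softmax((q @ k.T) * 8 ** -0.5, dim=-1) @ v
tiled = tiled_attention(q, k, v)
print(torch.allclose(tiled, reference, atol=1e-5))
```

The two results agree up to floating-point error, which is the sense in which FlashAttention is exact rather than an approximation.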

For a detailed implementation and tutorial, check out the [Flash-Attention-Tutorial repository](https://github.com/galenwilkerson/Flash-Attention-Tutorial).

### 2. MatMul-Free Transformers

Paper:  https://arxiv.org/abs/2406.02528

**Intuition**: Multiplying large matrices is the most expensive operation in a transformer. If every weight is restricted to -1, 0, or 1, multiplying by a weight reduces to subtracting the input, skipping it, or adding it, which is much cheaper for hardware.

**Math**: The method constrains the weights of dense layers to ternary values $\{-1, 0, 1\}$ instead of full precision. For an input $x$ and a ternary weight matrix $W$, each output $y_j = \sum_i x_i W_{ij}$ is the sum of the inputs with $W_{ij} = 1$ minus the sum of the inputs with $W_{ij} = -1$, so matrix multiplication (MatMul) is replaced by additions and negations.

**Simple Implementation**:

    import torch

    class TernaryLinear(torch.nn.Module):
        """Linear layer with weights in {-1, 0, 1}, evaluated without matmul."""

        def __init__(self, input_dim, output_dim):
            super().__init__()
            # Random ternary weights, for illustration only (not trained).
            weights = torch.randint(-1, 2, (input_dim, output_dim)).float()
            self.register_buffer("weights", weights)

        def forward(self, x):
            x = x.unsqueeze(-1)  # (..., input_dim, 1), broadcasts over outputs
            zeros = torch.zeros_like(x)
            # Add inputs where the weight is +1, subtract them where it is -1,
            # and skip them where it is 0. No multiplication by a weight is used.
            added = torch.where(self.weights == 1, x, zeros).sum(dim=-2)
            subtracted = torch.where(self.weights == -1, x, zeros).sum(dim=-2)
            return added - subtracted

    # Example usage
    x = torch.randn(10, 5)
    layer = TernaryLinear(5, 3)
    output = layer(x)
    # The result matches an ordinary matrix product with the same weights.
    assert torch.allclose(output, x @ layer.weights, atol=1e-5)
    print(output.shape)

Output:

    torch.Size([10, 3])
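In practice, ternary weights are obtained by quantizing full-precision weights rather than by drawing them at random. The sketch below assumes an absmean scheme (scale by the mean absolute weight, round, clip to $[-1, 1]$), which is an assumption not stated in the text above; the function name `ternarize` and the `eps` parameter are illustrative.

```python
import torch

def ternarize(weight, eps=1e-5):
    """Quantize a weight matrix to {-1, 0, 1} plus one scale (assumed absmean scheme)."""
    scale = weight.abs().mean()                        # one scale for the whole matrix
    ternary = torch.clamp(torch.round(weight / (scale + eps)), -1, 1)
    return ternary, scale

w = torch.tensor([[0.9, -0.05, -1.4],
                  [0.2, 0.6, -0.3]])
ternary, scale = ternarize(w)
print(ternary)
print(scale)
```

Here the mean absolute weight is 0.575, so small entries such as -0.05 and 0.2 map to 0 while the larger ones map to 1 or -1. The product `ternary * scale` then serves as a coarse approximation of `w`.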


### 3. The Mamba Architecture

Paper:  https://arxiv.org/abs/2312.00752

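**Intuition**: Mamba reads a long sequence one token at a time while keeping a fixed-size summary (the state) of what it has seen so far. At each step it decides, based on the current input, what to write into the state and what to let fade, so the cost per token does not grow with the length of the context.

**Math**: Mamba builds on structured state space models (SSMs) and makes the SSM parameters functions of the input (a selection mechanism), so the model can selectively propagate or forget information along the sequence. Computation scales linearly with input length, $O(n)$, compared with $O(n^2)$ for self-attention.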
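A selective state space recurrence can be sketched as a plain loop. This is a simplified illustration, not the Mamba implementation: it omits the hardware-aware parallel scan, the gating branch, and the convolution, and the projection matrices `W_delta`, `W_B`, and `W_C` are hypothetical names introduced here. The point it shows is that the step size, the write vector, and the read vector all depend on the current input, while the state keeps a fixed size.

```python
import torch
import torch.nn.functional as F

def selective_scan(x, A, W_delta, W_B, W_C):
    """Sequential selective SSM scan for one sequence (illustrative sketch).

    x: (seq_len, d_model) inputs
    A: (d_model, d_state) negative decay rates
    W_delta: (d_model, d_model); W_B, W_C: (d_model, d_state) projections
    """
    seq_len, d_model = x.shape
    h = torch.zeros(d_model, A.shape[1])            # fixed-size hidden state
    outputs = []
    for t in range(seq_len):
        delta = F.softplus(x[t] @ W_delta)          # (d_model,) input-dependent step size
        B = x[t] @ W_B                              # (d_state,) input-dependent write vector
        C = x[t] @ W_C                              # (d_state,) input-dependent read vector
        decay = torch.exp(delta[:, None] * A)       # values in (0, 1): how much state survives
        h = decay * h + (delta[:, None] * B[None, :]) * x[t][:, None]
        outputs.append(h @ C)                       # (d_model,) output for step t
    return torch.stack(outputs)

torch.manual_seed(0)
seq_len, d_model, d_state = 6, 4, 3
x = torch.randn(seq_len, d_model)
A = -torch.rand(d_model, d_state) - 0.1
W_delta = torch.randn(d_model, d_model) * 0.1
W_B = torch.randn(d_model, d_state) * 0.1
W_C = torch.randn(d_model, d_state) * 0.1

y = selective_scan(x, A, W_delta, W_B, W_C)
print(y.shape)
```

The loop does one constant-cost update per token, which is where the $O(n)$ scaling comes from.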

### 4. Tandem Transformers

Paper:  https://arxiv.org/abs/2402.08644

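**Intuition**: Think of two workers: a senior one who reviews the work in large chunks, and a junior one who writes quickly, one word at a time, while consulting the senior worker's notes. The junior worker does most of the step-by-step writing, so the expensive senior worker is needed far less often.

**Math**: A tandem transformer combines a large model that processes tokens in blocks (several tokens at once) with a small autoregressive model that generates tokens one at a time while attending to the large model's representations. Because the large model no longer runs once per generated token, inference cost drops, while the small model's predictions benefit from the large model's richer representations.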
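The division of labor can be sketched as control flow only. Everything in this toy is a stand-in: `large_encode` and `small_next` are hypothetical callables that take the place of real models, and the integer arithmetic has no meaning beyond making the sketch runnable. It shows just the scheduling idea, in which the large model is called once per block and the small model once per token.

```python
def tandem_generate(prompt, num_new_tokens, block_size, large_encode, small_next):
    """Toy schedule: one large-model call per block, one small-model call per token."""
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    calls = {"large": 0, "small": 0}
    while len(tokens) < target_len:
        # The large model summarizes everything generated so far in one pass.
        context = large_encode(tokens)
        calls["large"] += 1
        # The small model writes the next block token by token, using that summary.
        for _ in range(min(block_size, target_len - len(tokens))):
            tokens.append(small_next(tokens, context))
            calls["small"] += 1
    return tokens, calls

# Hypothetical stand-ins for the two models (not real networks).
large_encode = lambda toks: sum(toks) % 7
small_next = lambda toks, ctx: (toks[-1] + ctx + 1) % 10

tokens, calls = tandem_generate([1, 2, 3], num_new_tokens=8, block_size=4,
                                large_encode=large_encode, small_next=small_next)
print(len(tokens), calls)
```

With 8 new tokens and a block size of 4, the large stand-in runs 2 times and the small one 8 times.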

### 5. Advanced Positional Embeddings (ALiBi and RoPE)

Papers: 

https://arxiv.org/abs/2108.12409  

https://arxiv.org/abs/2310.13017 

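**Intuition**: A transformer has no built-in notion of word order, so position information has to be injected somewhere. Instead of adding a position vector to each word embedding, ALiBi and RoPE build position into the attention computation itself, which makes the relative distance between tokens directly visible to the model.

**Math**:
- **ALiBi**: Adds a bias to each attention score before the softmax that decreases linearly with the distance between the query position $i$ and the key position $j$, namely $-m \cdot (i - j)$, where $m$ is a fixed, head-specific slope. No positional embeddings are added to the token embeddings.
- **RoPE**: Rotates each query and key vector by an angle proportional to its position. The dot product between a rotated query at position $i$ and a rotated key at position $j$ then depends only on the relative offset $i - j$.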
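The claim that RoPE encodes relative positions can be checked numerically. The sketch below is a minimal single-head version; the function name `rope`, the pairing of consecutive feature dimensions, and the base of 10000 are assumptions made for illustration. It rotates a query and a key at two different pairs of absolute positions that share the same offset and compares the resulting attention scores.

```python
import torch

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x: (seq_len, dim) with even dim; positions: (seq_len,) integer positions.
    """
    dim = x.shape[-1]
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = positions[:, None].float() * freqs[None, :]                   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

torch.manual_seed(0)
q = torch.randn(1, 8)
k = torch.randn(1, 8)

# Same offset (3) at two different absolute positions.
score_a = rope(q, torch.tensor([5])) @ rope(k, torch.tensor([2])).T
score_b = rope(q, torch.tensor([105])) @ rope(k, torch.tensor([102])).T
print(torch.allclose(score_a, score_b, atol=1e-4))
```

The two scores agree up to floating-point error, because only the difference between the positions enters the dot product. The rotation also leaves the length of each vector unchanged.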

**Simple Implementation for ALiBi**:

    import math
    import torch
    import torch.nn.functional as F

    def alibi_bias(num_heads, seq_length):
        """Build the ALiBi bias, shape (num_heads, seq_length, seq_length).

        Assumes num_heads is a power of two, so the slopes form the geometric
        sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).
        """
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        positions = torch.arange(seq_length)
        # Distance i - j between query i and key j (clamped to 0 for future keys).
        distance = (positions[:, None] - positions[None, :]).clamp(min=0)
        return -slopes[:, None, None] * distance

    def alibi_attention(Q, K, V, bias):
        """
        Causal attention with ALiBi: a linear distance penalty is added to the scores.
        Args:
            Q: Queries matrix (batch_size, num_heads, seq_length, depth)
            K: Keys matrix (batch_size, num_heads, seq_length, depth)
            V: Values matrix (batch_size, num_heads, seq_length, depth)
            bias: Linear bias matrix (num_heads, seq_length, seq_length)
        Returns:
            Output matrix (batch_size, num_heads, seq_length, depth)
        """
        seq_length, depth = Q.shape[-2], Q.shape[-1]
        scores = torch.einsum('bhqd,bhkd->bhqk', Q, K) / math.sqrt(depth) + bias
        # Causal mask: each query attends only to itself and earlier positions.
        future = torch.triu(torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float('-inf'))
        attention = F.softmax(scores, dim=-1)
        # Weighted sum of values: contract over the key position k.
        output = torch.einsum('bhqk,bhkd->bhqd', attention, V)
        return output

    # Example usage
    batch_size, num_heads, seq_length, depth = 2, 4, 8, 16
    Q = torch.randn(batch_size, num_heads, seq_length, depth)
    K = torch.randn(batch_size, num_heads, seq_length, depth)
    V = torch.randn(batch_size, num_heads, seq_length, depth)
    bias = alibi_bias(num_heads, seq_length)

    output = alibi_attention(Q, K, V, bias)
    print(output.shape)

Output:

    torch.Size([2, 4, 8, 16])

### 6. Prompt Learning

Paper: https://arxiv.org/abs/2001.07676

**Intuition**: Prompt learning gives a pretrained model a hint about the task through the wording of its input, for example by phrasing a classification problem as a fill-in-the-blank sentence. The model can then reuse what it learned during pretraining instead of being retrained from scratch.

**Math**: A prompt template maps the task input $x$ to a text $P(x)$, and the prediction is read from the model's output distribution conditioned on $P(x)$. The design problem is choosing a template under which the desired answer is a likely completion.

**Simple Implementation**:

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')

    prompt = "Translate English to French: Hello, how are you?"
    inputs = tokenizer(prompt, return_tensors='pt')
    # Pass the attention mask and set a pad token explicitly. GPT-2 has no pad
    # token, so the end-of-sequence token is reused.
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        pad_token_id=tokenizer.eos_token_id,
        max_length=50,
    )

    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

    Translate English to French: Hello, how are you?

    Hello, how are you? Translate English to Spanish: Hello, how are you?

    Hello, how are you? Translate English to Portuguese: Hello, how are

In this run the base GPT-2 model did not translate the sentence. It repeated the input and continued the prompt pattern with other languages, which shows how strongly the result depends on both the prompt and the model.
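The paper linked in this section frames classification as a cloze (fill-in-the-blank) question: a pattern turns the input into a sentence with one masked slot, and a verbalizer maps each label to a word that could fill it. The sketch below illustrates that structure only. The pattern wording, the verbalizer words, and the `word_score` callable are hypothetical choices made here, and `toy_score` is a stand-in so the example runs without downloading a model.

```python
def make_cloze(review, mask_token="[MASK]"):
    """Pattern: wrap the input in a sentence with one masked slot."""
    return f"{review} All in all, it was {mask_token}."

# Verbalizer: which word stands for which label (illustrative choice).
VERBALIZER = {"positive": "great", "negative": "terrible"}

def classify(review, word_score):
    """Pick the label whose verbalizer word scores highest in the masked slot.

    word_score(cloze_text, word) is a hypothetical callable returning how
    plausible a masked language model finds `word` at the mask position.
    """
    cloze = make_cloze(review)
    return max(VERBALIZER, key=lambda label: word_score(cloze, VERBALIZER[label]))

def toy_score(cloze_text, word):
    """Stand-in scorer (not a real model): favors 'great' when 'loved' appears."""
    return 1.0 if ("loved" in cloze_text) == (word == "great") else 0.0

label = classify("I loved this film.", toy_score)
print(make_cloze("I loved this film."))
print(label)
```

Replacing `toy_score` with the mask-position probabilities of a real masked language model would turn this into a zero-shot classifier; how well it works depends on the chosen pattern and verbalizer.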


These methods target the efficiency, scalability, and task performance of transformer models and related sequence architectures.