# From-Scratch Implementation of a Bernoulli-Beta Transformer

We implement a transformer that outputs the Bayes-optimal posterior predictive distribution (PPD). We do so in two steps:

1. Train an MLP to convert the sufficient statistics into the PPD
2. **Explicitly** define an attention head that computes the sufficient statistics


## Attention Head Implementation

The sufficient statistics for the data are the tuple $(H, N)$ (or $(T, N)$, depending on the final token), where $H$ is the number of heads, $T$ is the number of tails, and $N$ is the length of the sequence. The attention head computes these statistics from the input sequence. We assume that $N$ is encoded purely by the positional embedding. Empirically, we find that a transformer trained by gradient descent implements a counting head, which counts the occurrences of the *final* token in the sequence.
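
The counting behaviour described above can be illustrated with a minimal sketch. It is not the trained head or necessarily the construction used later in this notebook: it assumes a hypothetical token encoding (0 = tail, 1 = head), one-hot value vectors, and equal attention scores, so that attention averages the values and multiplying by $N$ recovers the counts. The function name `count_final_token` is made up for this illustration.

```python
import torch

def count_final_token(tokens):
    """Count occurrences of the final token with a uniform-attention average.

    tokens: 1D LongTensor with the assumed encoding 0 = tail, 1 = head.
    Returns (count of the final token, sequence length N).
    """
    n = tokens.shape[0]
    # Value vectors: one-hot indicator of each token's class, shape (n, 2).
    values = torch.nn.functional.one_hot(tokens, num_classes=2).float()
    # Equal scores give uniform attention weights of 1/n after the softmax.
    scores = torch.zeros(n)
    weights = torch.softmax(scores, dim=0)
    # The attention output is the fraction of tails and heads in the sequence.
    fractions = weights @ values
    # Rescale by N (assumed available from the positional information).
    counts = fractions * n
    return counts[tokens[-1]].round().item(), n

count, length = count_final_token(torch.tensor([1, 0, 1, 1]))
print(count, length)
```

In this sketch the rescaling by $N$ happens outside the attention operation; a real head would have to obtain the same effect through its learned weights and the positional embedding.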


## MLP Implementation

The MLP is trained. Given an embedding vector that encodes the sufficient statistics, i.e.,

$$
E_{H,N} = f(H,N)
$$

the MLP outputs a vector corresponding to the PPD. Mapping this output linearly down to the log-odds of heads, the Bayes-optimal solution under a $\mathrm{Beta}(\alpha, \beta)$ prior is

$$
\log \left( \frac{\alpha + H}{\beta + (N-H)}\right)
$$
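
For reference, this expression follows in one step from the standard Beta-Bernoulli posterior predictive, assuming $\alpha$ is the prior pseudo-count for heads and $\beta$ the pseudo-count for tails. Writing $p$ for the predictive probability of heads:

$$
\begin{aligned}
p &= \frac{\alpha + H}{\alpha + \beta + N}, \qquad
1 - p = \frac{\beta + (N - H)}{\alpha + \beta + N}, \\
\log \left( \frac{p}{1 - p} \right) &= \log \left( \frac{\alpha + H}{\beta + (N-H)} \right)
\end{aligned}
$$

The normaliser $\alpha + \beta + N$ cancels in the ratio, so the log-odds depend only on the two pseudo-count-adjusted totals.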

# Setup

In [1]:
import numpy as np
import torch

# Defining Variables

A simple way to define our embeddings would be to use standard basis vectors in Euclidean space. However, to show that the construction works without a privileged basis, we use random unit vectors as embedding directions instead.



In [2]:
embedding_dim = 128

def random_embedding(dim):
    """
    Generate a random unit embedding vector of given dimension.
    """
    vec = torch.randn(dim)
    return vec / torch.norm(vec)  # normalize to unit vector

pos_embed = random_embedding(embedding_dim)   # direction encoding the sequence length; lies in the null space of the attention head
head_embed = random_embedding(embedding_dim)  # direction encoding the heads token
tail_embed = random_embedding(embedding_dim)  # direction encoding the tails token
bos_embed = random_embedding(embedding_dim)   # direction encoding the beginning-of-sequence (BOS) token
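
Random unit vectors are only approximately orthogonal, which matters if counts are later read off along these directions. The following sketch checks the overlaps, assuming four stand-in vectors drawn the same way as above. The helper name `random_unit` and the fixed seed are choices made for this illustration only.

```python
import torch

torch.manual_seed(0)  # fixed seed so the check is reproducible
dim = 128

def random_unit(d):
    """Draw a Gaussian vector and normalise it to unit length."""
    v = torch.randn(d)
    return v / torch.norm(v)

# Stand-ins for the position, head, tail, and BOS directions.
vecs = torch.stack([random_unit(dim) for _ in range(4)])
gram = vecs @ vecs.T                 # pairwise dot products
off_diag = gram - torch.eye(4)       # remove the unit self-products
max_overlap = off_diag.abs().max().item()
print(max_overlap)
```

In 128 dimensions the pairwise overlaps are typically of order $1/\sqrt{128} \approx 0.09$, so the directions are close to orthogonal but not exactly so.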



In [3]:
def get_residual(last_token, pos_embed, head_embed, tail_embed, sequence_length, n_heads_or_tails):
    """
    Compute the residual vector for the last token in the sequence.
    """
    if last_token == "head":
        return pos_embed * sequence_length + head_embed * n_heads_or_tails
    else:
        return pos_embed * sequence_length + tail_embed * n_heads_or_tails


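def get_PPD(last_token, n_heads_or_tails, sequence_length, alpha, beta):
    """
    Compute the PPD (posterior predictive distribution) given the last token.

    n_heads_or_tails counts occurrences of the last token. alpha is taken to be
    the prior pseudo-count for heads and beta the pseudo-count for tails.
    The returned entries are ordered as [first class, head, tail]; the first
    class (presumably BOS) always gets probability 0.
    """
    # Use the pseudo-count that matches the token being counted.
    prior_count = alpha if last_token == "head" else beta
    prob = (prior_count + n_heads_or_tails) / (sequence_length + alpha + beta)
    complement = 1 - prob

    if last_token == "head":
        return torch.tensor([0, prob, complement])
    else:
        return torch.tensor([0, complement, prob])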
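
As a quick numerical check of the target distribution, the sketch below evaluates the Beta-Bernoulli predictive for one example and compares its log-odds with the closed form from the MLP section. It is standalone: `ppd_heads` is a hypothetical helper, and the uniform prior $\alpha = \beta = 1$ and the counts are example values only.

```python
import math

def ppd_heads(h, n, alpha, beta):
    """Posterior predictive probability of heads under a Beta(alpha, beta) prior."""
    return (alpha + h) / (alpha + beta + n)

# Example: 3 heads in a sequence of length 4, uniform prior.
p = ppd_heads(h=3, n=4, alpha=1.0, beta=1.0)          # (1 + 3) / (1 + 1 + 4)
log_odds = math.log(p / (1 - p))
closed_form = math.log((1.0 + 3) / (1.0 + (4 - 3)))   # log((alpha + H) / (beta + N - H))
print(p, log_odds, closed_form)
```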

# Training the MLP

The MLP takes in the residual vector and outputs logits corresponding to the PPD.

In [6]:
class MLP(torch.nn.Module):
    def __init__(self, embedding_dim, hidden_factor=4):
        super().__init__()
        self.layer1 = torch.nn.Linear(embedding_dim, embedding_dim * hidden_factor)
        self.layer2 = torch.nn.Linear(embedding_dim * hidden_factor, 3)  # output logits for three classes (presumably BOS, head, tail)
        self.activation = torch.nn.ReLU()

    def forward(self, x):
        x = self.activation(self.layer1(x))
        # Return raw logits: CrossEntropyLoss applies log-softmax internally,
        # so a softmax here would be applied twice.
        return self.layer2(x)

# Initialize the MLP
mlp = MLP(embedding_dim)

# Define loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.001)
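
A training loop for this setup might look like the sketch below. It is standalone and makes several assumptions that are not in the notebook: a smaller embedding dimension, a maximum sequence length of 20, a uniform prior, full-batch training for 200 steps, and a dataset enumerating every length and every count of the final token. It also relies on `CrossEntropyLoss` accepting class-probability targets, which is available in PyTorch 1.10 and later.

```python
import torch

torch.manual_seed(0)
dim, alpha, beta, max_len = 32, 1.0, 1.0, 20  # illustrative values only

def unit(d):
    """Random unit vector, as in the embedding cell."""
    v = torch.randn(d)
    return v / v.norm()

pos, head, tail = unit(dim), unit(dim), unit(dim)

# Build (residual, PPD) pairs for every length n and count k of the final token.
xs, ys = [], []
for n in range(1, max_len + 1):
    for k in range(1, n + 1):  # the final token occurs at least once
        for tok, prior in ((head, alpha), (tail, beta)):
            xs.append(pos * n + tok * k)
            p = (prior + k) / (n + alpha + beta)  # PPD of the final token's class
            ys.append([0.0, p, 1 - p] if tok is head else [0.0, 1 - p, p])
X, Y = torch.stack(xs), torch.tensor(ys)

# Same shape as the MLP above: one hidden layer, three output logits.
model = torch.nn.Sequential(
    torch.nn.Linear(dim, 4 * dim), torch.nn.ReLU(), torch.nn.Linear(4 * dim, 3)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()  # soft (probability) targets

losses = []
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)  # the model returns logits, not probabilities
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(losses[0], losses[-1])
```

With soft targets the loss cannot fall below the average entropy of the target distributions, so a plateau above zero is expected rather than a sign of failure.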

