**Architecture of Embedding Models.**

Embedding models are neural networks designed to convert text, images, or
other data into dense vector representations.

Key components of embedding model architecture:

**Stage #1 The input layer processes raw data** (like text tokens or image pixels) and typically includes preprocessing steps:

For text: Tokenization breaks input into subword units
For images: Convolutional layers extract visual features
Positional encodings may be added to maintain sequence order

**Stage #2 The encoder layers then transform this input into increasingly abstract representations**:

Multiple transformer blocks or deep neural network layers
Each layer typically consists of:

Self-attention mechanisms to capture relationships between elements
Feed-forward neural networks to process these relationships
Layer normalization and residual connections
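
As a rough illustration of how these pieces fit together, here is a minimal NumPy sketch of one encoder block. It makes several simplifying assumptions that are not part of the description above: a single attention head, post-layer normalization, no bias terms, no learned scale or shift in the layer norm, and randomly initialized weights. The names `encoder_block` and `params` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (assumption: no learned gain/bias, to keep the sketch short)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(x, params):
    """x has shape (seq_len, d_model); returns the same shape."""
    # Self-attention: every token attends to every other token
    q, k, v = x @ params["W_q"], x @ params["W_k"], x @ params["W_v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores) @ v
    # Residual connection followed by layer normalization
    x = layer_norm(x + attended @ params["W_o"])
    # Position-wise feed-forward network with ReLU
    hidden = np.maximum(0.0, x @ params["W_1"])
    # Second residual connection and layer normalization
    return layer_norm(x + hidden @ params["W_2"])

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 16
params = {
    "W_q": rng.normal(0, 0.1, (d_model, d_model)),
    "W_k": rng.normal(0, 0.1, (d_model, d_model)),
    "W_v": rng.normal(0, 0.1, (d_model, d_model)),
    "W_o": rng.normal(0, 0.1, (d_model, d_model)),
    "W_1": rng.normal(0, 0.1, (d_model, d_ff)),
    "W_2": rng.normal(0, 0.1, (d_ff, d_model)),
}
x = rng.normal(size=(seq_len, d_model))
out = encoder_block(x, params)
print(out.shape)  # (5, 8)
```

Stacking several such blocks, each with its own weights, gives the "multiple transformer blocks" mentioned above.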


**Stage #3 The pooling layer combines the encoded representations**:
Mean pooling takes the average across all encoded elements
Max pooling selects the strongest features
[CLS] token pooling (common in BERT-style models) uses a special token to aggregate sequence information
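
The three pooling strategies can be sketched in a few lines of NumPy. The padding `mask` is an assumption added for illustration (the text above does not mention padding), and the sketch assumes the [CLS] token sits at position 0, as in BERT-style models. The function names are hypothetical.

```python
import numpy as np

def mean_pool(token_vectors, mask):
    # Average only over real tokens (mask == 1), ignoring padding
    mask = mask[:, None].astype(token_vectors.dtype)
    return (token_vectors * mask).sum(axis=0) / mask.sum()

def max_pool(token_vectors, mask):
    # Take the largest value per dimension, ignoring padded rows
    masked = np.where(mask[:, None].astype(bool), token_vectors, -np.inf)
    return masked.max(axis=0)

def cls_pool(token_vectors):
    # Use the vector of the first token (assumed to be [CLS])
    return token_vectors[0]

# Four encoded tokens of dimension 2; the last row is padding
encoded = np.array([[1.0, 4.0],
                    [3.0, 0.0],
                    [5.0, 2.0],
                    [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

print(mean_pool(encoded, mask))  # [3. 2.]
print(max_pool(encoded, mask))   # [5. 4.]
print(cls_pool(encoded))         # [1. 4.]
```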

**Stage #4 The final embedding layer projects the pooled representation into the desired embedding space**:

Linear transformation to achieve target embedding dimension
Normalization may be applied (L2 norm, etc.)

The resulting embedding vector captures semantic meaning in a fixed-dimensional space
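
A minimal sketch of this final stage, assuming a single linear layer followed by L2 normalization. The weights here are random placeholders rather than trained values, and `project_and_normalize` is a hypothetical helper name.

```python
import numpy as np

def project_and_normalize(pooled, W, b, eps=1e-12):
    # Linear transformation to the target embedding dimension
    projected = pooled @ W + b
    # L2 normalization so the embedding has unit length
    norm = np.linalg.norm(projected)
    return projected / max(norm, eps)

rng = np.random.default_rng(0)
pooled = rng.normal(size=8)        # pooled representation, dimension 8
W = rng.normal(0, 0.1, (8, 4))     # projects 8 -> 4 dimensions
b = np.zeros(4)

embedding = project_and_normalize(pooled, W, b)
print(embedding.shape)             # (4,)
print(np.linalg.norm(embedding))   # approximately 1
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, which is one common reason to normalize.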

Training objectives shape how these embeddings capture relationships:

Contrastive learning pulls similar items together and pushes dissimilar items apart
Masked language modeling predicts missing tokens
Next sentence prediction determines if sequences are related
Classification tasks supervise embeddings to capture relevant features
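
To make the contrastive objective concrete, here is a hedged sketch of an InfoNCE-style loss, one common formulation of contrastive learning. Choosing this particular loss, the `temperature` value, and the use of in-batch negatives are assumptions made for illustration. Row `i` of `anchors` is treated as matching row `i` of `positives`, and every other row acts as a negative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce_loss(anchors, positives, temperature=0.1):
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    # Cosine similarity of every anchor with every candidate: (batch, batch)
    logits = a @ p.T / temperature
    # Log-softmax over each row, computed stably
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct match for anchor i is candidate i (the diagonal)
    return -np.mean(np.diag(log_probs))

anchors = np.eye(4)        # four toy embeddings
aligned = np.eye(4)        # each positive matches its anchor
shuffled = np.eye(4)[::-1] # each positive is paired with the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)  # True
```

The loss is low when matching pairs are close and non-matching pairs are far apart, which is the "pull together, push apart" behaviour described above.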

The specific architecture choices depend on the use case:

Text embeddings may prioritize capturing linguistic relationships
Image embeddings focus on visual feature hierarchies
Cross-modal embeddings align representations across different types of data

In [1]:
#1 What is input processing?

# a) Tokenization: converting raw text into token IDs
# b) Token embeddings: converting token IDs into vectors
# c) Position embeddings: adding positional information
# d) Layer normalization: stabilizing the representation
# e) Dropout: applying regularization
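
The five steps can be chained into one small function. This is only a sketch under stated assumptions: whitespace tokenization stands in for a real subword tokenizer, the vocabulary and both lookup tables are hypothetical and randomly initialized, position embeddings are a lookup table rather than sinusoidal, and layer normalization has no learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"[UNK]": 0, "hello": 1, "world": 2}  # hypothetical toy vocabulary
d_model, max_len = 4, 8

token_table = rng.uniform(-0.1, 0.1, (len(vocab), d_model))
position_table = rng.uniform(-0.1, 0.1, (max_len, d_model))

def process_input(text, dropout_rate=0.1, training=False):
    # a) Tokenization: map words to IDs, unknown words to [UNK]
    ids = np.array([vocab.get(w, vocab["[UNK]"]) for w in text.lower().split()])
    # b) Token embeddings: look up one vector per token ID
    x = token_table[ids]
    # c) Position embeddings: add one vector per position
    x = x + position_table[: len(ids)]
    # d) Layer normalization: zero mean, unit variance per token
    x = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)
    # e) Dropout: randomly zero entries, only while training (inverted dropout)
    if training:
        keep = rng.random(x.shape) >= dropout_rate
        x = x * keep / (1.0 - dropout_rate)
    return ids, x

ids, x = process_input("hello brave world")
print(ids)      # [1 0 2]
print(x.shape)  # (3, 4)
```

Dropout is skipped at inference time (`training=False`), so the same input always produces the same output.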



In [3]:
# simple embedding model
import numpy as np

class SimpleEmbeddingModel:
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding_matrix = np.random.uniform(-0.1, 0.1, (vocab_size, embedding_dim))

    def embed(self, input_ids):
        """Convert token IDs to embeddings"""
        return self.embedding_matrix[input_ids]

In [6]:
SimpleEmbeddingModel(5,3).embedding_matrix

array([[ 0.0749917 , -0.05158836, -0.06308906],
       [ 0.04143047, -0.06643029,  0.02618513],
       [-0.08275793, -0.09106689,  0.0705244 ],
       [ 0.08773272, -0.00780366, -0.05142994],
       [ 0.011674  ,  0.01717707,  0.0930771 ]])

In [4]:
# Create a tiny vocabulary
vocab = {
    "hello": 0,
    "world": 1,
    "hi": 2,
    "earth": 3
}

model = SimpleEmbeddingModel(vocab_size=len(vocab), embedding_dim=3)
sentence = "hello world"
input_ids = [vocab[word] for word in sentence.split()]
embeddings = model.embed(input_ids)
print("Embeddings shape:", embeddings.shape)  # Should be (2, 3)
print("Embeddings:", embeddings)

Embeddings shape: (2, 3)
Embeddings: [[-0.02467514 -0.08158735 -0.07707473]
 [ 0.0938968  -0.07724188 -0.04269936]]


In [13]:
import numpy as np

d_model = 4

# For one position, we get a vector of 4 numbers
# Calculate frequencies
freq1 = 1/np.power(10000, 0/d_model)    # = 1
freq2 = 1/np.power(10000, 2/d_model)    # smaller frequency

print("Frequencies:")
print(f"freq1 = {freq1}")
print(f"freq2 = {freq2}")

# Position 0 gets:
pos0 = [
    np.sin(0 * freq1),   # index 0
    np.cos(0 * freq1),   # index 1
    np.sin(0 * freq2),   # index 2
    np.cos(0 * freq2)    # index 3
]

# Position 1 gets:
pos1 = [
    np.sin(1 * freq1),   # index 0
    np.cos(1 * freq1),   # index 1
    np.sin(1 * freq2),   # index 2
    np.cos(1 * freq2)    # index 3
]

print("\nPosition 0 vector:", pos0)
print("Position 1 vector:", pos1)

Frequencies:
freq1 = 1.0
freq2 = 0.01

Position 0 vector: [0.0, 1.0, 0.0, 1.0]
Position 1 vector: [0.8414709848078965, 0.5403023058681398, 0.009999833334166664, 0.9999500004166653]


For each position, you create a vector where:

Elements alternate between sin and cos
Each sin/cos PAIR shares the same frequency
Different pairs use different frequencies



In [None]:
# positional encoding

import numpy as np

def create_position_encodings(max_seq_len: int, embedding_dim: int) -> np.ndarray:
    """
    Build sinusoidal position encodings.

    Args:
        max_seq_len: how many positions we need to encode (e.g., 512)
        embedding_dim: size of each position's encoding vector (e.g., 4);
                       normally even because we use sin/cos pairs
                       (an odd size leaves the last sine without its cosine)
    Returns:
        Array of shape (max_seq_len, embedding_dim)
        Each row is a position's encoding vector
    """
    position_encodings = np.zeros((max_seq_len, embedding_dim))

    for pos in range(max_seq_len):
        for dim in range(0, embedding_dim, 2):
            # Original paper ("Attention Is All You Need"):
            #   freq = 10000 ** (-dim / embedding_dim)
            # computed here as exp(b * ln(a)), because pow(a, b) = exp(b * ln(a)) for all a > 0
            freq = np.exp(dim * -np.log(10000.0) / embedding_dim)

            # First in pair: sine
            position_encodings[pos, dim] = np.sin(pos * freq)

            # Second in pair: cosine (if it exists)
            if dim + 1 < embedding_dim:
                position_encodings[pos, dim + 1] = np.cos(pos * freq)

    return position_encodings
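
The nested loops can also be written with NumPy broadcasting. The sketch below is one possible vectorized variant (the name `position_encodings_vectorized` is hypothetical) that uses the same frequency formula and should produce the same values as the loop version, including for odd `embedding_dim`.

```python
import numpy as np

def position_encodings_vectorized(max_seq_len, embedding_dim):
    positions = np.arange(max_seq_len)[:, None]          # (max_seq_len, 1)
    even_dims = np.arange(0, embedding_dim, 2)[None, :]  # (1, number of pairs)
    # Same frequency as the loop version: 10000 ** (-dim / embedding_dim)
    freqs = np.exp(even_dims * -np.log(10000.0) / embedding_dim)
    angles = positions * freqs                           # broadcast to (max_seq_len, pairs)

    encodings = np.zeros((max_seq_len, embedding_dim))
    encodings[:, 0::2] = np.sin(angles)                  # even indices: sine
    # Odd indices: cosine; drop the last column when embedding_dim is odd
    encodings[:, 1::2] = np.cos(angles[:, : embedding_dim // 2])
    return encodings

pe = position_encodings_vectorized(max_seq_len=3, embedding_dim=4)
print(pe.shape)  # (3, 4)
print(pe[0])     # [0. 1. 0. 1.]
print(pe[1])     # matches the "Position 1 vector" computed by hand earlier
```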


