### Linear Algebra

#### Vectors - The Building Blocks of Embeddings

#### What is a Vector?
A vector is a mathematical object that has both magnitude (length) and direction. In RAG systems, vectors represent text embeddings: numerical representations of words, sentences, or documents.

**Mathematical Definition**:
A vector v in n-dimensional space is written as:

v = [v₁, v₂, v₃, ..., vₙ]

#### Example in RAG Context:

The sentence **The cat sits on the mat** might be represented as a 384-dimensional vector (the values shown are illustrative):

In [2]:
sentence_embedding = [0.23, -0.15, 0.67, 0.89, -0.34, ..., 0.12]

#### Vector Properties

**Magnitude (Length) of a Vector**:
The magnitude of a vector v is calculated using the Euclidean norm:

||v|| = √(v₁² + v₂² + v₃² + ... + vₙ²)

In [3]:
import numpy as np

# Example embedding vector
embedding = np.array([0.5, -0.3, 0.8, 0.1])
magnitude = np.linalg.norm(embedding)
print(f"Vector magnitude: {magnitude:.4f}")

Vector magnitude: 0.9950


#### Unit Vectors (Normalized Vectors):

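A unit vector has a magnitude of 1. Normalizing embeddings matters in RAG systems because it removes the effect of vector length: for unit vectors, the dot product equals the cosine of the angle between them, so similarity scores depend only on direction. A vector is normalized by dividing it by its magnitude: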

v̂ = v / ||v||

In [4]:
# Normalize the vector
normalized_embedding = embedding / np.linalg.norm(embedding)
print(f"Normalized vector: {normalized_embedding}")
print(f"New magnitude: {np.linalg.norm(normalized_embedding):.4f}")

Normalized vector: [ 0.50251891 -0.30151134  0.80403025  0.10050378]
New magnitude: 1.0000
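
As a quick check on why normalization helps, the sketch below compares cosine similarity computed on raw vectors with the plain dot product of their normalized versions. The two vectors are hypothetical values chosen for illustration, not output from a real embedding model.

```python
import numpy as np

# Hypothetical example vectors (illustrative values only)
a = np.array([0.5, -0.3, 0.8, 0.1])
b = np.array([0.4, -0.2, 0.7, 0.2])

# Cosine similarity computed from the raw vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product computed after scaling both vectors to unit length
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
unit_dot = np.dot(a_hat, b_hat)

print(np.isclose(cosine, unit_dot))  # True
```

If embeddings are normalized once at indexing time, retrieval can use the cheaper dot product and still rank by angle.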


### Vector Operations

**Vector Addition**:
u + v = [u₁ + v₁, u₂ + v₂, ..., uₙ + vₙ]

**Scalar Multiplication**:
c × v = [c × v₁, c × v₂, ..., c × vₙ]

In [5]:
# Combining embeddings from different sources
query_embedding = np.array([0.4, -0.2, 0.6])
context_embedding = np.array([0.3, 0.1, -0.4])

# Weighted combination for enhanced retrieval
combined = 0.7 * query_embedding + 0.3 * context_embedding
print(f"Combined embedding: {combined}")

Combined embedding: [ 0.37 -0.11  0.3 ]
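
Addition and scalar multiplication together also give the average (centroid) of several embeddings. The following is a minimal sketch that assumes a document has been split into three chunks; `chunk_embeddings` and its values are hypothetical.

```python
import numpy as np

# Hypothetical chunk embeddings from one document (illustrative values)
chunk_embeddings = np.array([
    [0.4, -0.2,  0.6],
    [0.3,  0.1, -0.4],
    [0.5,  0.0,  0.2],
])

# Vector addition (sum over rows) followed by scalar multiplication by 1/n
n = len(chunk_embeddings)
centroid = (1.0 / n) * chunk_embeddings.sum(axis=0)

# NumPy's mean along axis 0 gives the same vector
same_as_mean = np.allclose(centroid, chunk_embeddings.mean(axis=0))
print(f"Centroid: {centroid}")
print(f"Matches np.mean: {same_as_mean}")
```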


### Matrices - Operations on Multiple Vectors

**Matrix Basics**

A matrix is a rectangular array of numbers. In RAG systems, matrices often represent collections of embeddings.

**Matrix Representation**:

```
A = [a₁₁  a₁₂  a₁₃]
    [a₂₁  a₂₂  a₂₃]
    [a₃₁  a₃₂  a₃₃]
```

### Document Embedding Matrix Example:

In [6]:
# Each row is a document embedding, each column is a dimension
document_embeddings = np.array([
    [0.5, -0.3, 0.8],  # Document 1
    [0.2,  0.7, -0.1], # Document 2
    [-0.4, 0.1,  0.6]  # Document 3
])
print(f"Shape: {document_embeddings.shape}")  # (3, 3)

Shape: (3, 3)
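
Storing embeddings as rows of a matrix means a single matrix-vector product can score every document against a query at once. The sketch below assumes a hypothetical `query` vector; each entry of the result is the dot product of one row with the query (the dot product is covered under Similarity Measures).

```python
import numpy as np

# Each row is a document embedding
document_embeddings = np.array([
    [0.5, -0.3,  0.8],
    [0.2,  0.7, -0.1],
    [-0.4, 0.1,  0.6],
])
query = np.array([0.6, -0.2, 0.4])  # hypothetical query embedding

# One matrix-vector product: shape (3, 3) @ (3,) -> (3,)
scores = document_embeddings @ query

# Equivalent row-by-row loop, shown only for comparison
loop_scores = np.array([np.dot(row, query) for row in document_embeddings])

print(f"Scores shape: {scores.shape}")
print(f"Matches loop: {np.allclose(scores, loop_scores)}")
```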


## Similarity Measures
### Dot Product - The Foundation of Similarity

**Mathematical Definition**
The dot product of two vectors u and v is:

u · v = u₁v₁ + u₂v₂ + ... + uₙvₙ = Σ(i=1 to n) uᵢvᵢ

**Geometric Interpretation**

u · v = ||u|| × ||v|| × cos(θ)

where θ is the angle between the vectors.

In [7]:
def dot_product(u, v):
    """Calculate dot product manually"""
    return sum(u[i] * v[i] for i in range(len(u)))

# Example vectors (sentence embeddings)
sentence1 = np.array([0.5, -0.3, 0.8, 0.1])
sentence2 = np.array([0.4, -0.2, 0.7, 0.2])

# Calculate dot product
dot_prod = np.dot(sentence1, sentence2)
manual_dot = dot_product(sentence1, sentence2)

print(f"NumPy dot product: {dot_prod:.4f}")
print(f"Manual calculation: {manual_dot:.4f}")

NumPy dot product: 0.8400
Manual calculation: 0.8400
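
The geometric form can be rearranged to recover the angle: θ = arccos((u · v) / (||u|| × ||v||)). A minimal sketch using the same two example vectors; the `np.clip` call is a defensive assumption to keep floating-point rounding from pushing the ratio outside [-1, 1].

```python
import numpy as np

u = np.array([0.5, -0.3, 0.8, 0.1])
v = np.array([0.4, -0.2, 0.7, 0.2])

# Rearranged geometric identity: cos(theta) = (u . v) / (||u|| * ||v||)
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Clip before arccos so rounding error cannot produce NaN
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(f"cos(theta): {cos_theta:.4f}")
print(f"Angle in degrees: {theta_deg:.2f}")
```

A small angle means the two vectors point in nearly the same direction.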


### RAG Application - Document Ranking

In [8]:
def rank_documents_by_dot_product(query_embedding, document_embeddings):
    """Rank documents by dot product similarity"""
    scores = []
    for i, doc_emb in enumerate(document_embeddings):
        score = np.dot(query_embedding, doc_emb)
        scores.append((i, score))
    
    # Sort by score (descending)
    return sorted(scores, key=lambda x: x[1], reverse=True)

# Example usage
query = np.array([0.6, -0.2, 0.4])
docs = np.array([
    [0.5, -0.3, 0.8],
    [0.2,  0.7, -0.1],
    [-0.4, 0.1,  0.6]
])

rankings = rank_documents_by_dot_product(query, docs)
print("Document rankings (index, score):")
for idx, score in rankings:
    print(f"Document {idx}: {score:.4f}")

Document rankings (index, score):
Document 0: 0.6800
Document 2: -0.0200
Document 1: -0.0600
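
For larger collections, the Python loop can be replaced with a matrix-vector product and `np.argsort`. This is one possible vectorized sketch, not the only approach; the function name `top_k_by_dot_product` and the parameter `k` are introduced here for illustration.

```python
import numpy as np

def top_k_by_dot_product(query_embedding, document_embeddings, k=2):
    """Return the k highest-scoring (index, score) pairs, best first."""
    # Score all documents in one matrix-vector product
    scores = document_embeddings @ query_embedding
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

query = np.array([0.6, -0.2, 0.4])
docs = np.array([
    [0.5, -0.3,  0.8],
    [0.2,  0.7, -0.1],
    [-0.4, 0.1,  0.6],
])

top_two = top_k_by_dot_product(query, docs, k=2)
for idx, score in top_two:
    print(f"Document {idx}: {score:.4f}")
```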


### Cosine Similarity - Angle-Based Similarity

Cosine similarity measures the cosine of the angle between two vectors:

cos_sim(u, v) = (u · v) / (||u|| × ||v||)


**Why Cosine Similarity?**

- Magnitude Independent: Focuses on direction, not magnitude
- Range: Always between -1 and 1
- Interpretation: 1 = identical direction, 0 = orthogonal, -1 = opposite direction

In [9]:
def cosine_similarity(u, v):
    """Calculate cosine similarity between two vectors"""
    dot_product = np.dot(u, v)
    magnitude_u = np.linalg.norm(u)
    magnitude_v = np.linalg.norm(v)
    
    if magnitude_u == 0 or magnitude_v == 0:
        return 0.0
    
    return dot_product / (magnitude_u * magnitude_v)

# Example calculation
sentence1 = np.array([3, -1, 2])
sentence2 = np.array([1, 0, 1])

cos_sim = cosine_similarity(sentence1, sentence2)
print(f"Cosine similarity: {cos_sim:.4f}")

# Using sklearn for verification
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine
cos_sim_sklearn = sklearn_cosine([sentence1], [sentence2])[0][0]
print(f"Sklearn result: {cos_sim_sklearn:.4f}")

Cosine similarity: 0.9449
Sklearn result: 0.9449
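
Magnitude independence can be checked directly: scaling one vector changes the dot product but leaves the cosine similarity unchanged. A small sketch, using a local helper named `cosine` (introduced here for illustration) and an arbitrary scale factor of 10.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity for non-zero vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, -1.0, 2.0])
v = np.array([1.0, 0.0, 1.0])
scaled_v = 10.0 * v  # same direction as v, ten times the length

dot_original = np.dot(u, v)
dot_scaled = np.dot(u, scaled_v)

print(f"Dot product: {dot_original:.4f} vs {dot_scaled:.4f}")
print(f"Cosine similarity: {cosine(u, v):.4f} vs {cosine(u, scaled_v):.4f}")
```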


### Batch Cosine Similarity Computation

In [10]:
def batch_cosine_similarity(query, documents):
    """Efficiently compute cosine similarity for multiple documents"""
    # Normalize query
    query_norm = query / np.linalg.norm(query)
    
    # Normalize documents
    doc_norms = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    
    # Compute similarities
    similarities = np.dot(doc_norms, query_norm)
    return similarities

# Example with multiple documents
query = np.array([0.6, -0.2, 0.4])
docs = np.array([
    [0.5, -0.3, 0.8],
    [0.2,  0.7, -0.1],
    [-0.4, 0.1,  0.6],
    [0.6, -0.2, 0.4]  # Same as query
])

similarities = batch_cosine_similarity(query, docs)
print("Cosine similarities:")
for i, sim in enumerate(similarities):
    print(f"Document {i}: {sim:.4f}")

Cosine similarities:
Document 0: 0.9179
Document 1: -0.1091
Document 2: -0.0367
Document 3: 1.0000
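
The batch version above divides by each norm, so an all-zero embedding would produce NaN. One possible guard, sketched below, clamps the denominator to a small `eps` so zero vectors score 0.0, matching the single-pair function. The name `safe_batch_cosine_similarity` and the default `eps` are assumptions made for this example.

```python
import numpy as np

def safe_batch_cosine_similarity(query, documents, eps=1e-12):
    """Cosine similarity of query against each row; zero vectors score 0.0."""
    query_norm = np.linalg.norm(query)
    doc_norms = np.linalg.norm(documents, axis=1)
    # Clamp the denominator so 0 / 0 becomes 0 / eps = 0 instead of NaN
    denominator = np.maximum(query_norm * doc_norms, eps)
    return (documents @ query) / denominator

query = np.array([0.6, -0.2, 0.4])
docs_with_zero = np.array([
    [0.5, -0.3, 0.8],
    [0.0,  0.0, 0.0],  # degenerate embedding
    [0.6, -0.2, 0.4],  # same as query
])

safe_similarities = safe_batch_cosine_similarity(query, docs_with_zero)
for i, sim in enumerate(safe_similarities):
    print(f"Document {i}: {sim:.4f}")
```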


### Euclidean Distance - Geometric Distance


The Euclidean distance between two points (vectors) is:

d(u, v) = √[(u₁-v₁)² + (u₂-v₂)² + ... + (uₙ-vₙ)²]

In [11]:
def euclidean_distance(u, v):
    """Calculate Euclidean distance between two vectors"""
    return np.sqrt(np.sum((u - v) ** 2))

# Equivalent one-liner using np.linalg.norm
def euclidean_distance_np(u, v):
    return np.linalg.norm(u - v)

# Example
point1 = np.array([1, 2, 3])
point2 = np.array([4, 6, 8])

dist1 = euclidean_distance(point1, point2)
dist2 = euclidean_distance_np(point1, point2)

print(f"Manual calculation: {dist1:.4f}")
print(f"NumPy calculation: {dist2:.4f}")

Manual calculation: 7.0711
NumPy calculation: 7.0711
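
For unit vectors, Euclidean distance and cosine similarity carry the same ranking information. Expanding ||u - v||² = ||u||² + ||v||² - 2(u · v) with ||u|| = ||v|| = 1 gives d(u, v)² = 2 - 2 × cos_sim(u, v). The sketch below checks this numerically on the two example points after normalizing them.

```python
import numpy as np

point1 = np.array([1.0, 2.0, 3.0])
point2 = np.array([4.0, 6.0, 8.0])

# Normalize both vectors to unit length
u_hat = point1 / np.linalg.norm(point1)
v_hat = point2 / np.linalg.norm(point2)

squared_distance = np.sum((u_hat - v_hat) ** 2)
cosine = np.dot(u_hat, v_hat)

print(f"Squared distance: {squared_distance:.6f}")
print(f"2 - 2 * cosine:   {2 - 2 * cosine:.6f}")
```

With normalized embeddings, the nearest neighbor by Euclidean distance is therefore also the most similar by cosine.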


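### Manhattan Distance (L1 Distance)

The Manhattan distance sums the absolute differences between corresponding components:

d₁(u, v) = |u₁-v₁| + |u₂-v₂| + ... + |uₙ-vₙ|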

In [12]:
def manhattan_distance(u, v):
    """Calculate Manhattan distance"""
    return np.sum(np.abs(u - v))

# Example
point1 = np.array([1, 2, 3])
point2 = np.array([4, 6, 8])

manhattan_dist = manhattan_distance(point1, point2)
print(f"Manhattan distance: {manhattan_dist}")

Manhattan distance: 12
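
Both distances are norms of the difference vector, so `np.linalg.norm` can compute either by changing its `ord` argument (`ord=1` for Manhattan, the default for Euclidean). A short sketch on the same two points:

```python
import numpy as np

point1 = np.array([1, 2, 3])
point2 = np.array([4, 6, 8])

difference = point1 - point2

# ord=1 sums absolute values (Manhattan); the default is the L2 (Euclidean) norm
l1_distance = np.linalg.norm(difference, ord=1)
l2_distance = np.linalg.norm(difference)

print(f"L1 (Manhattan): {l1_distance:.4f}")
print(f"L2 (Euclidean): {l2_distance:.4f}")
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points.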


## Probability and Statistics Fundamentals

**Probability Distributions**
In RAG systems, we often work with probability distributions over vocabulary, documents, or similarity scores.

### Softmax Function - Converting Scores to Probabilities
The softmax function converts a vector of real numbers into a probability distribution:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)

In [13]:
def softmax(x):
    """Compute softmax of vector x"""
    # Subtract max for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Example: Converting similarity scores to probabilities
similarity_scores = np.array([2.5, 1.8, 3.2, 0.9])
probabilities = softmax(similarity_scores)

print("Similarity scores:", similarity_scores)
print("Probabilities:", probabilities)
print("Sum of probabilities:", np.sum(probabilities))

Similarity scores: [2.5 1.8 3.2 0.9]
Probabilities: [0.26937953 0.13376992 0.54246376 0.05438679]
Sum of probabilities: 1.0
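
The max subtraction in the function above matters once scores get large. The sketch below contrasts a naive version with the stable one on deliberately extreme scores; `naive_softmax`, `stable_softmax`, and the score values are illustrative names and numbers introduced here.

```python
import numpy as np

def naive_softmax(x):
    """Softmax without the max-subtraction step."""
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)

def stable_softmax(x):
    """Softmax with the max subtracted first; mathematically identical."""
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

large_scores = np.array([1000.0, 1001.0, 1002.0])

# np.exp(1000) overflows float64 to inf, and inf / inf is NaN.
# errstate only silences the warnings; the result is still NaN.
with np.errstate(over="ignore", invalid="ignore"):
    naive_result = naive_softmax(large_scores)

stable_result = stable_softmax(large_scores)

print("Naive: ", naive_result)
print("Stable:", stable_result)
```

Subtracting a constant from every score does not change the softmax output, because the factor exp(-max) cancels between numerator and denominator.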


### RAG Application - Document Selection:

In [14]:
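def probabilistic_document_selection(query_embedding, doc_embeddings, temperature=1.0):
    """Return a selection probability for each document.

    Applies softmax to temperature-scaled cosine similarities, using the
    cosine_similarity and softmax functions defined in earlier cells.
    """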
    # Compute similarities
    similarities = []
    for doc_emb in doc_embeddings:
        sim = cosine_similarity(query_embedding, doc_emb)
        similarities.append(sim)
    
    # Apply temperature scaling (lower temperature sharpens the distribution)
    scaled_scores = np.array(similarities) / temperature
    
    # Convert to probabilities
    probs = softmax(scaled_scores)
    
    return probs

# Example usage
query = np.array([0.6, -0.2, 0.4])
docs = np.array([
    [0.5, -0.3, 0.8],
    [0.2,  0.7, -0.1],
    [-0.4, 0.1,  0.6]
])

probs = probabilistic_document_selection(query, docs, temperature=0.5)
print("Document selection probabilities:")
for i, p in enumerate(probs):
    print(f"Document {i}: {p:.4f}")

Document selection probabilities:
Document 0: 0.7834
Document 1: 0.1005
Document 2: 0.1161
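
To see what the temperature does, and how the probabilities could drive an actual selection, here is a minimal sketch. It reuses the rounded cosine scores from the batch example as fixed inputs; the temperature values, the seed, and the use of `rng.choice` to draw documents are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Rounded cosine similarities for the three example documents
similarities = np.array([0.9179, -0.1091, -0.0367])

# Lower temperature concentrates probability on the best match
probs_by_temperature = {}
for temperature in (0.1, 0.5, 2.0):
    probs_by_temperature[temperature] = softmax(similarities / temperature)
    print(f"T={temperature}: {np.round(probs_by_temperature[temperature], 4)}")

# Draw two distinct documents according to the T=0.5 probabilities
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
chosen = rng.choice(len(similarities), size=2, replace=False,
                    p=probs_by_temperature[0.5])
print("Sampled document indices:", chosen)
```

As the temperature grows, the distribution approaches uniform; as it shrinks toward zero, selection approaches always picking the top-scoring document.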


### Statistical Measures

In [15]:
def embedding_statistics(embeddings):
    """Calculate statistics for a collection of embeddings"""
    # Convert to numpy array if needed
    embeddings = np.array(embeddings)
    
    # Calculate statistics along each dimension
    means = np.mean(embeddings, axis=0)
    variances = np.var(embeddings, axis=0)
    std_devs = np.std(embeddings, axis=0)
    
    # Overall statistics
    overall_mean = np.mean(embeddings)
    overall_std = np.std(embeddings)
    
    return {
        'dimension_means': means,
        'dimension_variances': variances,
        'dimension_std_devs': std_devs,
        'overall_mean': overall_mean,
        'overall_std': overall_std
    }

# Example with random stand-ins for document embeddings
# (no seed is set, so the printed numbers vary between runs)
doc_embeddings = np.random.randn(100, 384)  # 100 docs, 384-dim embeddings
stats = embedding_statistics(doc_embeddings)

print(f"Overall mean: {stats['overall_mean']:.4f}")
print(f"Overall std: {stats['overall_std']:.4f}")
print(f"Mean of first 5 dimensions: {stats['dimension_means'][:5]}")

Overall mean: 0.0060
Overall std: 0.9968
Mean of first 5 dimensions: [-0.22433486 -0.27268405 -0.05164164  0.10443067 -0.02384187]
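
One way the per-dimension means and standard deviations might be used is to standardize a collection so each dimension has mean 0 and standard deviation 1. The sketch below is an illustration under assumptions: the data is random, the seed is arbitrary, and whether standardization helps retrieval depends on the embedding model.

```python
import numpy as np

# Seeded generator so repeated runs produce the same numbers
rng = np.random.default_rng(42)
doc_embeddings = rng.standard_normal((100, 384))  # random stand-in data

# Per-dimension statistics (axis=0 aggregates over documents)
means = doc_embeddings.mean(axis=0)
stds = doc_embeddings.std(axis=0)

# Broadcasting applies the (384,) statistics to every row
standardized = (doc_embeddings - means) / stds

print(f"Max |mean| after standardizing: {np.abs(standardized.mean(axis=0)).max():.2e}")
print(f"Std of first 3 dimensions: {standardized.std(axis=0)[:3]}")
```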
