# FlatIndex (Brute-Force Search) Examples in FAISS

This notebook demonstrates FlatIndex, FAISS's exact (brute-force) nearest neighbor search implemented by `IndexFlat` and its variants, with different distance metrics.

## Overview

FlatIndex is the simplest index type in FAISS:
- **Exact search**: Returns true nearest neighbors (100% recall)
- **No training required**: Just add vectors and search
- **O(n×d) search complexity**: Compares query to every database vector
- **Best for**: Small datasets, accuracy baselines, or when exact results are required
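
To make the brute-force idea concrete, here is a minimal NumPy sketch of what an exact L2 search does conceptually: compute the squared distance from each query to every database vector, then keep the `k` smallest. The helper name `brute_force_l2` and the toy arrays are illustrative assumptions, not FAISS APIs, and FAISS's own implementation is far more optimized than this.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Exact k-NN by squared L2 distance (illustrative helper, not a FAISS API)."""
    # Pairwise squared distances, shape (nq, nb): O(nq * nb * d) work
    diff = xq[:, None, :] - xb[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # Indices of the k smallest distances per query, in ascending order
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

# Toy data (hypothetical values chosen for easy checking)
xb_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
xq_toy = np.array([[0.9, 0.1]], dtype="float32")

D_toy, I_toy = brute_force_l2(xb_toy, xq_toy, k=2)
print("Neighbor indices:", I_toy)
print("Squared L2 distances:", D_toy)
```

Every query touches every database vector, which is why recall is 100% and why the cost grows with both `n` and `d`.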

## Distance Metrics Covered

1. **L2 (Euclidean)**: `IndexFlatL2` - Squared Euclidean distance
2. **Inner Product**: `IndexFlatIP` - Dot product (equals cosine similarity when vectors are L2-normalized)
3. **L1 (Manhattan)**: `IndexFlat(d, faiss.METRIC_L1)` - Sum of absolute differences
4. **Linf (Chebyshev)**: `IndexFlat(d, faiss.METRIC_Linf)` - Maximum absolute difference
5. **Cosine Similarity**: `IndexFlatIP` on L2-normalized vectors

## Key Differences

| Metric | Formula | Range | Use Case |
|--------|---------|-------|----------|
| L2 (squared) | Σ(xᵢ - yᵢ)² | [0, ∞) | General purpose, geometric similarity |
| Inner Product | Σ(xᵢ × yᵢ) | (-∞, ∞) | Recommendation, when magnitude matters |
| Cosine | (x·y)/(‖x‖‖y‖) | [-1, 1] | Text embeddings, semantic similarity |
| L1 | Σ\|xᵢ - yᵢ\| | [0, ∞) | Sparse data, robust to outliers |
| Linf | max\|xᵢ - yᵢ\| | [0, ∞) | Worst-case difference |
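
As a quick sanity check of the formulas in the table, the sketch below evaluates all five metrics with plain NumPy on one pair of vectors. The vectors `x` and `y` are arbitrary example values, not data from this notebook.

```python
import numpy as np

# Two small example vectors (hypothetical values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

l2_squared = np.sum((x - y) ** 2)          # (-3)^2 + 2^2 + 0^2 = 13
inner_product = np.dot(x, y)               # 4 + 0 + 9 = 13
cosine = inner_product / (np.linalg.norm(x) * np.linalg.norm(y))
l1 = np.sum(np.abs(x - y))                 # 3 + 2 + 0 = 5
linf = np.max(np.abs(x - y))               # max(3, 2, 0) = 3

print(f"L2 (squared):  {l2_squared:.4f}")
print(f"Inner product: {inner_product:.4f}")
print(f"Cosine:        {cosine:.4f}")
print(f"L1:            {l1:.4f}")
print(f"Linf:          {linf:.4f}")
```

Note that the squared L2 distance and the inner product happen to coincide for this pair; that is a property of the chosen values, not of the metrics.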

In [None]:
import numpy as np
import faiss
import time
import matplotlib.pyplot as plt
from matplotlib.patches import Circle
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

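# Plotting style (the 'seaborn-v0_8-*' style names require matplotlib >= 3.6;
# on older versions the default style is kept)
if 'seaborn-v0_8-whitegrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-whitegrid')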
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Print available metric types
print("Available FAISS metric types:")
print(f"  METRIC_L2 = {faiss.METRIC_L2}")
print(f"  METRIC_INNER_PRODUCT = {faiss.METRIC_INNER_PRODUCT}")
print(f"  METRIC_L1 = {faiss.METRIC_L1}")
print(f"  METRIC_Linf = {faiss.METRIC_Linf}")
print(f"  METRIC_Canberra = {faiss.METRIC_Canberra}")
print(f"  METRIC_BrayCurtis = {faiss.METRIC_BrayCurtis}")
print(f"  METRIC_JensenShannon = {faiss.METRIC_JensenShannon}")

## 1. Dataset Generation

We'll create synthetic datasets for our experiments.

In [None]:
def generate_dataset(nb, nq, d):
    """Generate random database and query vectors."""
    xb = np.random.random((nb, d)).astype('float32')
    xq = np.random.random((nq, d)).astype('float32')
    return xb, xq

def normalize_vectors(x):
    """L2 normalize vectors for cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return (x / norms).astype('float32')

# Dataset parameters
nb = 10000    # Database size
nq = 100      # Number of queries
d = 128       # Vector dimension
k = 10        # Number of nearest neighbors

print(f"Generating dataset: {nb:,} database vectors, {nq:,} queries, dimension {d}")
xb, xq = generate_dataset(nb, nq, d)
print(f"Database shape: {xb.shape}, Query shape: {xq.shape}")

# Create normalized versions for cosine similarity
xb_normalized = normalize_vectors(xb)
xq_normalized = normalize_vectors(xq)
print(f"\nNormalized vectors created (L2 norm = 1.0)")
print(f"  Sample norm before: {np.linalg.norm(xb[0]):.4f}")
print(f"  Sample norm after: {np.linalg.norm(xb_normalized[0]):.4f}")

## 2. Basic FlatIndex Usage: L2 vs Inner Product

The two most common distance metrics are L2 (Euclidean) and Inner Product.
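
The two metrics can disagree about which vector is "nearest" because inner product rewards large norms while L2 does not. The toy example below, using made-up 2D vectors and plain NumPy rather than FAISS, shows a case where each metric picks a different top-1 neighbor.

```python
import numpy as np

query = np.array([1.0, 0.0])
database = np.array([
    [0.9, 0.1],   # close to the query, small norm
    [3.0, 3.0],   # far from the query, large norm
])

# Squared L2 distance: lower is better
l2_sq = np.sum((database - query) ** 2, axis=1)
# Inner product: higher is better
ip = database @ query

l2_best = int(np.argmin(l2_sq))
ip_best = int(np.argmax(ip))

print(f"Squared L2 distances: {l2_sq} -> best index {l2_best}")
print(f"Inner products:       {ip} -> best index {ip_best}")
```

L2 selects the first vector because it lies next to the query, while inner product selects the second because its larger magnitude dominates the dot product.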

In [None]:
# IndexFlatL2: Euclidean (L2) distance
print("="*60)
print("IndexFlatL2 - Euclidean Distance")
print("="*60)

index_l2 = faiss.IndexFlatL2(d)
print(f"Index created: dimension={index_l2.d}, is_trained={index_l2.is_trained}")

# Add vectors (no training needed for Flat index)
index_l2.add(xb)
print(f"Vectors added: ntotal={index_l2.ntotal}")

# Search
start = time.time()
distances_l2, indices_l2 = index_l2.search(xq, k)
search_time_l2 = time.time() - start

print(f"\nSearch completed in {search_time_l2*1000:.2f}ms")
print(f"QPS: {nq/search_time_l2:.0f}")

# Show sample results
print(f"\nSample results for query 0:")
print(f"  Nearest neighbor indices: {indices_l2[0][:5]}")
print(f"  Squared L2 distances: {distances_l2[0][:5]}")

In [None]:
# IndexFlatIP: Inner Product (dot product)
print("="*60)
print("IndexFlatIP - Inner Product")
print("="*60)

index_ip = faiss.IndexFlatIP(d)
print(f"Index created: dimension={index_ip.d}")

# Add vectors
index_ip.add(xb)
print(f"Vectors added: ntotal={index_ip.ntotal}")

# Search
start = time.time()
distances_ip, indices_ip = index_ip.search(xq, k)
search_time_ip = time.time() - start

print(f"\nSearch completed in {search_time_ip*1000:.2f}ms")
print(f"QPS: {nq/search_time_ip:.0f}")

# Show sample results
print(f"\nSample results for query 0:")
print(f"  Nearest neighbor indices: {indices_ip[0][:5]}")
print(f"  Inner products (higher = more similar): {distances_ip[0][:5]}")

# Note: IndexFlatIP returns the actual inner products (no sign flip),
# sorted in descending order, so higher values mean more similar vectors.

In [None]:
# Compare L2 and IP results
print("Comparing L2 and Inner Product results:")
print("-" * 50)

# Check how many neighbors are the same
overlap_count = 0
for i in range(nq):
    l2_set = set(indices_l2[i])
    ip_set = set(indices_ip[i])
    overlap_count += len(l2_set & ip_set)

avg_overlap = overlap_count / nq
print(f"Average overlap in top-{k} neighbors: {avg_overlap:.2f} / {k} ({avg_overlap/k*100:.1f}%)")

# Show the first example (among the first 5 queries) where the top-1 neighbor differs
for i in range(min(5, nq)):
    l2_nn = indices_l2[i, 0]
    ip_nn = indices_ip[i, 0]
    if l2_nn != ip_nn:
        print(f"\nQuery {i}: L2 nearest = {l2_nn}, IP nearest = {ip_nn}")
        print(f"  L2 distance to L2-best: {distances_l2[i, 0]:.4f}")
        print(f"  IP score to IP-best: {distances_ip[i, 0]:.4f}")
        break

## 3. Cosine Similarity with Normalized Vectors

Cosine similarity can be computed using Inner Product on L2-normalized vectors.
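
One practical detail is that normalization divides by the vector norm, so an all-zero vector would produce NaNs. The sketch below is a hedged variant of row normalization that guards against this and confirms that the inner product of normalized vectors equals the cosine. The helper name `safe_normalize` and the `eps` threshold are assumptions for illustration only.

```python
import numpy as np

def safe_normalize(x, eps=1e-12):
    """Scale rows to unit L2 norm; all-zero rows stay zero (illustrative helper)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Clamp tiny norms so the division never produces NaN or inf
    return (x / np.maximum(norms, eps)).astype("float32")

vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]], dtype="float32")
unit = safe_normalize(vecs)

# Inner product of unit vectors equals the cosine of the original vectors
cosine_01 = float(unit[0] @ unit[1])
cosine_direct = float(vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1])))

print("Normalized rows:\n", unit)
print(f"Cosine via normalized IP: {cosine_01:.4f}")
print(f"Cosine computed directly: {cosine_direct:.4f}")
```

Whether zero vectors should be kept, dropped, or rejected depends on the application; leaving them as zeros simply gives them an inner product of 0 with every query.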

In [None]:
# Cosine similarity using normalized vectors + Inner Product
print("="*60)
print("Cosine Similarity (Normalized vectors + Inner Product)")
print("="*60)

# Create index with normalized vectors
index_cosine = faiss.IndexFlatIP(d)
index_cosine.add(xb_normalized)
print(f"Index created with {index_cosine.ntotal} normalized vectors")

# Search with normalized queries
distances_cosine, indices_cosine = index_cosine.search(xq_normalized, k)

print(f"\nSample results for query 0:")
print(f"  Nearest neighbor indices: {indices_cosine[0][:5]}")
print(f"  Cosine similarities (range [-1, 1]): {distances_cosine[0][:5]}")

# Verify cosine similarity calculation
q_vec = xq_normalized[0]
nn_idx = indices_cosine[0, 0]
nn_vec = xb_normalized[nn_idx]
manual_cosine = np.dot(q_vec, nn_vec)
print(f"\n  Manual verification: np.dot(q, nn) = {manual_cosine:.6f}")
print(f"  FAISS result: {distances_cosine[0, 0]:.6f}")

In [None]:
# Relationship between L2 distance and cosine similarity for normalized vectors
print("Relationship between L2 and Cosine for normalized vectors:")
print("-" * 60)
print("For unit vectors: ||x - y||² = 2 - 2*cos(x,y)")
print("Therefore: cos(x,y) = 1 - ||x - y||²/2")
print()

# Create L2 index with normalized vectors
index_l2_norm = faiss.IndexFlatL2(d)
index_l2_norm.add(xb_normalized)
distances_l2_norm, indices_l2_norm = index_l2_norm.search(xq_normalized, k)

# Convert squared L2 distance to cosine: cos = 1 - ||x - y||² / 2
# (FAISS already returns squared L2 distances, so no extra squaring is needed)
cosine_from_l2 = 1 - distances_l2_norm / 2

# Compare
print("For query 0, top-3 neighbors:")
print(f"{'Index':>6} {'Sq. L2':>10} {'Cosine (IP)':>12} {'Cosine (from L2)':>16}")
print("-" * 50)
for i in range(3):
    idx = indices_l2_norm[0, i]
    l2_dist = distances_l2_norm[0, i]
    # Only report the IP score when both indexes rank the same vector at position i
    cos_ip = f"{distances_cosine[0, i]:.4f}" if idx == indices_cosine[0, i] else "N/A"
    cos_l2 = cosine_from_l2[0, i]
    print(f"{idx:>6} {l2_dist:>10.4f} {cos_ip:>12} {cos_l2:>16.4f}")

# Note: L2 and IP give same ranking for normalized vectors
print(f"\nL2 neighbors: {indices_l2_norm[0][:5]}")
print(f"IP neighbors:  {indices_cosine[0][:5]}")

## 4. Other Distance Metrics (L1, Linf, etc.)

FAISS supports additional distance metrics through the IndexFlat constructor.
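
For intuition about what these metrics compute, here is a small NumPy reference for exact L1 and Linf search. It is only a sketch: the function `brute_force_search` and the toy points are hypothetical, and it is meant for cross-checking small cases rather than as a replacement for `IndexFlat`.

```python
import numpy as np

def brute_force_search(xb, xq, k, metric="l1"):
    """Exact k-NN under L1 or Linf (illustrative reference, not a FAISS API)."""
    # Absolute coordinate differences, shape (nq, nb, d)
    diff = np.abs(xq[:, None, :] - xb[None, :, :])
    if metric == "l1":
        dist = diff.sum(axis=2)      # sum of absolute differences
    elif metric == "linf":
        dist = diff.max(axis=2)      # largest absolute difference
    else:
        raise ValueError(f"Unsupported metric: {metric!r}")
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

xb_small = np.array([[1.0, 1.0], [0.0, 1.5], [2.0, 0.0]])
xq_small = np.array([[0.0, 0.0]])

D_l1, I_l1 = brute_force_search(xb_small, xq_small, k=1, metric="l1")
D_linf, I_linf = brute_force_search(xb_small, xq_small, k=1, metric="linf")

print(f"L1 nearest:   index {I_l1[0, 0]}, distance {D_l1[0, 0]}")
print(f"Linf nearest: index {I_linf[0, 0]}, distance {D_linf[0, 0]}")
```

In this toy case the point `[0.0, 1.5]` wins under L1, while `[1.0, 1.0]` wins under Linf because its largest single-coordinate difference is smaller.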

In [None]:
# IndexFlat with different metric types
metrics = {
    'L2': faiss.METRIC_L2,
    'Inner Product': faiss.METRIC_INNER_PRODUCT,
    'L1 (Manhattan)': faiss.METRIC_L1,
    'Linf (Chebyshev)': faiss.METRIC_Linf,
}

results = {}

print("Testing different distance metrics:")
print("="*70)

for name, metric in metrics.items():
    # Create index with specific metric
    index = faiss.IndexFlat(d, metric)
    index.add(xb)
    
    # Search
    start = time.time()
    distances, indices = index.search(xq, k)
    search_time = time.time() - start
    
    results[name] = {
        'distances': distances,
        'indices': indices,
        'search_time': search_time
    }
    
    print(f"\n{name}:")
    print(f"  Search time: {search_time*1000:.2f}ms")
    print(f"  Query 0 - Top 3 neighbors: {indices[0][:3]}")
    print(f"  Query 0 - Top 3 distances: {distances[0][:3]}")

In [None]:
# Manual verification of distance calculations
print("Manual verification of distance calculations:")
print("="*60)

q = xq[0]
nn_l2 = results['L2']['indices'][0, 0]
nn_vec = xb[nn_l2]

# L2 (squared Euclidean)
l2_manual = np.sum((q - nn_vec) ** 2)
l2_faiss = results['L2']['distances'][0, 0]
print(f"\nL2 (squared Euclidean) to nearest neighbor:")
print(f"  Manual: Σ(xᵢ - yᵢ)² = {l2_manual:.6f}")
print(f"  FAISS:  {l2_faiss:.6f}")

# Inner Product (verified against the IP-nearest neighbor, not the L2-nearest one)
nn_ip = results['Inner Product']['indices'][0, 0]
ip_manual = np.dot(q, xb[nn_ip])
ip_faiss = results['Inner Product']['distances'][0, 0]
print(f"\nInner Product to IP-nearest neighbor (idx={nn_ip}):")
print(f"  Manual: Σ(xᵢ × yᵢ) = {ip_manual:.6f}")
print(f"  FAISS:  {ip_faiss:.6f}")

# L1 (Manhattan)
nn_l1 = results['L1 (Manhattan)']['indices'][0, 0]
l1_manual = np.sum(np.abs(q - xb[nn_l1]))
l1_faiss = results['L1 (Manhattan)']['distances'][0, 0]
print(f"\nL1 (Manhattan) to L1-nearest neighbor (idx={nn_l1}):")
print(f"  Manual: Σ|xᵢ - yᵢ| = {l1_manual:.6f}")
print(f"  FAISS:  {l1_faiss:.6f}")

# Linf (Chebyshev)
nn_linf = results['Linf (Chebyshev)']['indices'][0, 0]
linf_manual = np.max(np.abs(q - xb[nn_linf]))
linf_faiss = results['Linf (Chebyshev)']['distances'][0, 0]
print(f"\nLinf (Chebyshev) to Linf-nearest neighbor (idx={nn_linf}):")
print(f"  Manual: max|xᵢ - yᵢ| = {linf_manual:.6f}")
print(f"  FAISS:  {linf_faiss:.6f}")

## 5. Visualization: How Different Metrics Rank Neighbors

Let's visualize how different distance metrics produce different neighbor rankings.
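
A simple way to quantify how much two rankings agree is the fraction of shared neighbor ids per query. The helper below is a sketch of that measure; the name `topk_overlap` and the sample index arrays are assumptions made for illustration.

```python
import numpy as np

def topk_overlap(indices_a, indices_b):
    """Mean fraction of neighbor ids shared by two (nq, k) result arrays."""
    shared = [len(set(a) & set(b)) for a, b in zip(indices_a, indices_b)]
    return float(np.mean(shared)) / indices_a.shape[1]

# Hypothetical top-3 results from two metrics for two queries
I_a = np.array([[1, 2, 3], [4, 5, 6]])
I_b = np.array([[3, 2, 9], [7, 8, 4]])

overlap = topk_overlap(I_a, I_b)
print(f"Average top-3 overlap: {overlap:.2f}")
```

This measure ignores the order within the top-k list, so two metrics can score 1.0 while still ranking the same neighbors differently.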

In [None]:
# Compare neighbor rankings across metrics
print("Neighbor ranking comparison across metrics:")
print("="*70)

# For each metric, show top-5 neighbors for query 0
metric_names = list(results.keys())
query_idx = 0

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for ax, name in zip(axes, metric_names):
    indices = results[name]['indices'][query_idx][:10]
    distances = results[name]['distances'][query_idx][:10]
    
    # Create bar chart
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(indices)))
    bars = ax.barh(range(len(indices)), distances, color=colors, edgecolor='black')
    ax.set_yticks(range(len(indices)))
    ax.set_yticklabels([f"#{i+1}: idx={idx}" for i, idx in enumerate(indices)])
    ax.invert_yaxis()
    
    if 'Inner Product' in name:
        ax.set_xlabel('Inner Product (higher = more similar)')
    else:
        ax.set_xlabel('Distance (lower = more similar)')
    ax.set_title(f'{name}')
    ax.grid(True, alpha=0.3)

plt.suptitle(f'Top-10 Neighbors for Query {query_idx} by Different Metrics', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Neighbor overlap matrix between metrics
print("Computing neighbor overlap between different metrics...")

overlap_matrix = np.zeros((len(metric_names), len(metric_names)))

for i, name1 in enumerate(metric_names):
    for j, name2 in enumerate(metric_names):
        total_overlap = 0
        for qi in range(nq):  # 'qi' avoids shadowing the query vector 'q' defined earlier
            set1 = set(results[name1]['indices'][qi])
            set2 = set(results[name2]['indices'][qi])
            total_overlap += len(set1 & set2)
        overlap_matrix[i, j] = total_overlap / (nq * k) * 100

# Plot heatmap
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(overlap_matrix, cmap='YlGn', vmin=0, vmax=100)

ax.set_xticks(range(len(metric_names)))
ax.set_xticklabels(metric_names, rotation=45, ha='right')
ax.set_yticks(range(len(metric_names)))
ax.set_yticklabels(metric_names)

# Add text annotations
for i in range(len(metric_names)):
    for j in range(len(metric_names)):
        text = ax.text(j, i, f'{overlap_matrix[i, j]:.1f}%',
                      ha='center', va='center', color='black', fontsize=11)

ax.set_title(f'Top-{k} Neighbor Overlap Between Metrics (%)', fontsize=14, fontweight='bold')
fig.colorbar(im, ax=ax, label='Overlap %')
plt.tight_layout()
plt.show()

## 6. 2D Visualization: Distance Metric Contours

Let's visualize what "equal distance" looks like for different metrics in 2D.

In [None]:
# 2D visualization of distance metric contours
fig, axes = plt.subplots(2, 2, figsize=(12, 12))

# Create a grid
x = np.linspace(-2, 2, 200)
y = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(x, y)

# Origin point
origin = np.array([0, 0])

# L2 distance
ax1 = axes[0, 0]
Z_l2 = np.sqrt(X**2 + Y**2)
contour1 = ax1.contour(X, Y, Z_l2, levels=[0.5, 1.0, 1.5, 2.0], colors='blue')
ax1.clabel(contour1, inline=True, fontsize=10)
ax1.plot(0, 0, 'ro', markersize=10, label='Origin')
ax1.set_title('L2 (Euclidean) Distance\nCircular contours', fontsize=12)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_aspect('equal')
ax1.grid(True, alpha=0.3)
ax1.legend()

# L1 distance (Manhattan)
ax2 = axes[0, 1]
Z_l1 = np.abs(X) + np.abs(Y)
contour2 = ax2.contour(X, Y, Z_l1, levels=[0.5, 1.0, 1.5, 2.0], colors='green')
ax2.clabel(contour2, inline=True, fontsize=10)
ax2.plot(0, 0, 'ro', markersize=10, label='Origin')
ax2.set_title('L1 (Manhattan) Distance\nDiamond contours', fontsize=12)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_aspect('equal')
ax2.grid(True, alpha=0.3)
ax2.legend()

# Linf distance (Chebyshev)
ax3 = axes[1, 0]
Z_linf = np.maximum(np.abs(X), np.abs(Y))
contour3 = ax3.contour(X, Y, Z_linf, levels=[0.5, 1.0, 1.5, 2.0], colors='purple')
ax3.clabel(contour3, inline=True, fontsize=10)
ax3.plot(0, 0, 'ro', markersize=10, label='Origin')
ax3.set_title('Linf (Chebyshev) Distance\nSquare contours', fontsize=12)
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.set_aspect('equal')
ax3.grid(True, alpha=0.3)
ax3.legend()

# Inner Product (for reference vector [1, 0])
ax4 = axes[1, 1]
ref_vec = np.array([1, 0])
Z_ip = X * ref_vec[0] + Y * ref_vec[1]  # = X for ref=[1,0]
contour4 = ax4.contour(X, Y, Z_ip, levels=[-1, -0.5, 0, 0.5, 1, 1.5], colors='orange')
ax4.clabel(contour4, inline=True, fontsize=10)
ax4.plot(0, 0, 'ro', markersize=10, label='Origin')
ax4.arrow(0, 0, 1, 0, head_width=0.1, head_length=0.05, fc='red', ec='red')
ax4.annotate('ref vector', (0.5, 0.15), fontsize=10)
ax4.set_title('Inner Product (with ref=[1,0])\nLinear contours', fontsize=12)
ax4.set_xlabel('x')
ax4.set_ylabel('y')
ax4.set_aspect('equal')
ax4.grid(True, alpha=0.3)
ax4.legend()

plt.suptitle('Distance Metric Contours (Equal distance from origin)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 7. Performance Comparison Across Metrics

Let's compare search speed across different distance metrics.
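
Timings of a few milliseconds are noisy, so it helps to warm up first and summarize several runs. The following is a minimal, standard-library-only sketch of such a timing helper. The name `time_call` and its defaults are assumptions; it uses `time.perf_counter`, which is intended for measuring short intervals.

```python
import time
import statistics

def time_call(fn, n_runs=5, warmup=1):
    """Return (median, best) wall-clock seconds over n_runs calls of fn()."""
    # Warm-up runs are executed but not recorded
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples)

# Stand-in workload; in this notebook it could be `lambda: index.search(xq, k)`
median_s, best_s = time_call(lambda: sum(range(100_000)))
print(f"median = {median_s * 1000:.3f} ms, best = {best_s * 1000:.3f} ms")
```

Reporting the median alongside the best run makes it easier to spot a single slow outlier than a mean and standard deviation would.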

In [None]:
# Performance comparison
print("Performance Comparison:")
print("="*60)

# Run multiple times for accurate timing
n_runs = 5
performance_results = []

all_metrics = {
    'L2 (IndexFlatL2)': (faiss.IndexFlatL2, xb, xq),
    'IP (IndexFlatIP)': (faiss.IndexFlatIP, xb, xq),
    'Cosine (IP+norm)': (faiss.IndexFlatIP, xb_normalized, xq_normalized),
    'L1 (IndexFlat)': (lambda d: faiss.IndexFlat(d, faiss.METRIC_L1), xb, xq),
    'Linf (IndexFlat)': (lambda d: faiss.IndexFlat(d, faiss.METRIC_Linf), xb, xq),
}

for name, (make_index, db, queries) in all_metrics.items():
    # Create index (each entry is a callable that takes the dimension)
    index = make_index(d)
    index.add(db)
    
    # Warmup
    index.search(queries[:10], k)
    
    # Timed runs
    times = []
    for _ in range(n_runs):
        start = time.time()
        index.search(queries, k)
        times.append(time.time() - start)
    
    avg_time = np.mean(times) * 1000
    std_time = np.std(times) * 1000
    qps = len(queries) / np.mean(times)
    
    performance_results.append({
        'Metric': name,
        'Avg Time (ms)': avg_time,
        'Std (ms)': std_time,
        'QPS': qps
    })
    
    print(f"{name:>20}: {avg_time:.2f} ± {std_time:.2f} ms ({qps:.0f} QPS)")

df_perf = pd.DataFrame(performance_results)

In [None]:
# Visualize performance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors = plt.cm.Set2(np.linspace(0, 1, len(df_perf)))

# Search time comparison
ax1 = axes[0]
bars1 = ax1.bar(range(len(df_perf)), df_perf['Avg Time (ms)'], 
                yerr=df_perf['Std (ms)'], capsize=5,
                color=colors, edgecolor='black')
ax1.set_xticks(range(len(df_perf)))
ax1.set_xticklabels(df_perf['Metric'], rotation=45, ha='right')
ax1.set_ylabel('Search Time (ms)')
ax1.set_title('Search Time by Distance Metric')
ax1.grid(True, alpha=0.3, axis='y')

# QPS comparison
ax2 = axes[1]
bars2 = ax2.bar(range(len(df_perf)), df_perf['QPS'], 
                color=colors, edgecolor='black')
ax2.set_xticks(range(len(df_perf)))
ax2.set_xticklabels(df_perf['Metric'], rotation=45, ha='right')
ax2.set_ylabel('Queries Per Second')
ax2.set_title('Throughput by Distance Metric')
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar, val in zip(bars2, df_perf['QPS']):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
             f'{val:.0f}', ha='center', fontsize=9)

plt.suptitle(f'FlatIndex Performance ({nb:,} vectors, {nq} queries)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Scaling: How FlatIndex Performance Changes with Database Size

Since FlatIndex is brute-force, search time scales linearly with database size.
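
A back-of-the-envelope estimate makes the linear scaling explicit: each search evaluates `nb × nq × d` coordinate terms, and the index stores `nb × d` float32 values. The helper below is an illustrative sketch; the name `flat_index_cost` is hypothetical, and the result is an operation count, not a predicted wall-clock time.

```python
def flat_index_cost(nb, nq, d, bytes_per_value=4):
    """Rough cost model for brute-force search (illustrative, not a FAISS API)."""
    distance_terms = nb * nq * d           # per-coordinate terms evaluated per search call
    memory_bytes = nb * d * bytes_per_value  # raw vector storage for float32
    return distance_terms, memory_bytes

for nb_example in (1_000, 10_000, 100_000):
    terms, mem = flat_index_cost(nb_example, nq=100, d=128)
    print(f"nb={nb_example:>7,}: {terms:>13,} terms, {mem / (1024 * 1024):.2f} MB")
```

Multiplying the database size by 10 multiplies both numbers by 10, which is the behavior the experiment below should roughly reproduce, apart from cache and threading effects.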

In [None]:
# Scaling experiment
db_sizes = [1000, 5000, 10000, 50000, 100000]
nq_test = 100
d_test = 128
k_test = 10

scaling_results = []

print("Scaling experiment: Search time vs database size")
print("="*60)

for nb_test in db_sizes:
    # Generate data
    xb_test = np.random.random((nb_test, d_test)).astype('float32')
    xq_test = np.random.random((nq_test, d_test)).astype('float32')
    
    # Build index
    index = faiss.IndexFlatL2(d_test)
    
    start = time.time()
    index.add(xb_test)
    add_time = time.time() - start
    
    # Search (multiple runs)
    times = []
    for _ in range(3):
        start = time.time()
        index.search(xq_test, k_test)
        times.append(time.time() - start)
    
    avg_time = np.mean(times) * 1000
    qps = nq_test / np.mean(times)
    
    scaling_results.append({
        'db_size': nb_test,
        'add_time_ms': add_time * 1000,
        'search_time_ms': avg_time,
        'qps': qps
    })
    
    print(f"  {nb_test:>7,} vectors: add={add_time*1000:.1f}ms, search={avg_time:.2f}ms, QPS={qps:.0f}")

df_scaling = pd.DataFrame(scaling_results)

In [None]:
# Visualize scaling
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Search time vs database size
ax1 = axes[0]
ax1.plot(df_scaling['db_size'], df_scaling['search_time_ms'], 'o-', 
         linewidth=2, markersize=8, color='#e74c3c')
ax1.set_xlabel('Database Size')
ax1.set_ylabel('Search Time (ms)')
ax1.set_title('Search Time vs Database Size\n(Linear scaling expected)')
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.grid(True, alpha=0.3)

# Add linear reference line
x_ref = np.array([df_scaling['db_size'].min(), df_scaling['db_size'].max()])
y_ref_start = df_scaling['search_time_ms'].iloc[0]
y_ref = y_ref_start * (x_ref / x_ref[0])
ax1.plot(x_ref, y_ref, '--', color='gray', alpha=0.7, label='Linear reference')
ax1.legend()

# QPS vs database size
ax2 = axes[1]
ax2.plot(df_scaling['db_size'], df_scaling['qps'], 'o-', 
         linewidth=2, markersize=8, color='#3498db')
ax2.set_xlabel('Database Size')
ax2.set_ylabel('Queries Per Second')
ax2.set_title('Throughput vs Database Size\n(Inverse scaling)')
ax2.set_xscale('log')
ax2.grid(True, alpha=0.3)

plt.suptitle('FlatIndex Scaling Behavior', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

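# Show scaling factor between the smallest and largest database
size_ratio = df_scaling['db_size'].iloc[-1] / df_scaling['db_size'].iloc[0]
time_ratio = df_scaling['search_time_ms'].iloc[-1] / df_scaling['search_time_ms'].iloc[0]
print("\nScaling analysis:")
print(f"  {size_ratio:.0f}x database size → ~{time_ratio:.1f}x search time")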

## 9. 2D Nearest Neighbor Visualization

Let's visualize nearest neighbor search in 2D to understand how different metrics find neighbors.

In [None]:
# 2D nearest neighbor visualization
np.random.seed(123)

# Create 2D dataset for visualization
nb_2d = 250   # Total points: three clusters of 50 plus 100 background points
d_2d = 2
k_2d = 5

# Generate clustered data for interesting visualization
xb_2d = np.vstack([
    np.random.randn(50, 2) * 0.3 + [1, 1],
    np.random.randn(50, 2) * 0.3 + [-1, 1],
    np.random.randn(50, 2) * 0.3 + [0, -1],
    np.random.randn(100, 2) * 0.8
]).astype('float32')

# Single query point
xq_2d = np.array([[0.0, 0.0]]).astype('float32')

# Find neighbors with different metrics
metrics_2d = {
    'L2': faiss.METRIC_L2,
    'L1': faiss.METRIC_L1,
    'Linf': faiss.METRIC_Linf,
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, (name, metric) in zip(axes, metrics_2d.items()):
    # Create index and search
    index = faiss.IndexFlat(d_2d, metric)
    index.add(xb_2d)
    distances, indices = index.search(xq_2d, k_2d)
    
    # Plot all points
    ax.scatter(xb_2d[:, 0], xb_2d[:, 1], c='lightgray', s=30, alpha=0.6, label='Database')
    
    # Highlight neighbors
    nn_points = xb_2d[indices[0]]
    ax.scatter(nn_points[:, 0], nn_points[:, 1], c='green', s=100, 
               edgecolors='black', linewidth=2, label=f'Top-{k_2d} neighbors', zorder=5)
    
    # Plot query point
    ax.scatter(xq_2d[0, 0], xq_2d[0, 1], c='red', s=200, marker='*', 
               edgecolors='black', linewidth=2, label='Query', zorder=10)
    
    # Draw lines to neighbors
    for i, idx in enumerate(indices[0]):
        ax.plot([xq_2d[0, 0], xb_2d[idx, 0]], [xq_2d[0, 1], xb_2d[idx, 1]], 
                'g--', alpha=0.5, linewidth=1)
    
    ax.set_title(f'{name} Distance\nNearest: {indices[0][:3]}')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(loc='upper right', fontsize=8)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

plt.suptitle('2D Nearest Neighbor Search with Different Metrics', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 10. Practical Use Cases for Each Metric

The cell below summarizes typical applications and characteristics of each distance metric.

In [None]:
print("="*70)
print("Practical Use Cases for Each Distance Metric")
print("="*70)

use_cases = """
L2 (Euclidean) Distance - IndexFlatL2
=====================================
Best for:
  • Image similarity (when using raw pixels or certain embeddings)
  • Geographic distance (latitude/longitude after projection)
  • General-purpose similarity when magnitude matters
  • K-means clustering

Characteristics:
  • Sensitive to scale differences between dimensions
  • Works well when features are normalized/standardized
  • Gives circular "equal distance" contours


Inner Product / Cosine Similarity - IndexFlatIP
===============================================
Best for:
  • Text embeddings (word2vec, BERT, sentence transformers)
  • Document similarity
  • Recommendation systems
  • Any case where direction matters more than magnitude (cosine)

Characteristics:
  • Cosine = IP with normalized vectors
  • Invariant to vector magnitude (when normalized)
  • Linear "equal score" contours
  • Returns similarity (higher = more similar), not distance


L1 (Manhattan) Distance
=======================
Best for:
  • Sparse data (bag-of-words, TF-IDF)
  • When outliers should have less influence
  • Grid-based pathfinding

Characteristics:
  • More robust to outliers than L2
  • Diamond-shaped "equal distance" contours
  • In FAISS, L2 and IP have the most optimized code paths,
    so L1 search may be slower (see the timings in Section 7)


Linf (Chebyshev) Distance
=========================
Best for:
  • Worst-case analysis
  • Games (chessboard distance)
  • When any single large difference is important

Characteristics:
  • Only considers the maximum difference
  • Square "equal distance" contours
  • Good when tolerance in any dimension matters
"""
print(use_cases)

## 11. Using index_factory for FlatIndex

FAISS provides a convenient `index_factory` function for creating indexes.

In [None]:
# Using index_factory for FlatIndex
print("Creating FlatIndex using index_factory:")
print("="*60)

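factory_examples = [
    ("Flat", faiss.METRIC_L2, "L2 distance (default metric)"),
    ("Flat", faiss.METRIC_INNER_PRODUCT, "Inner Product"),
]

# The metric is selected via the third argument of index_factory
# rather than through the factory string.
for factory_str, metric, description in factory_examples:
    print(f"\n'{factory_str}' with metric={metric}: {description}")
    
    # Create index
    index = faiss.index_factory(d, factory_str, metric)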
    index.add(xb)
    
    # Search
    D, I = index.search(xq[:5], k)
    
    print(f"  Created index type: {type(index).__name__}")
    print(f"  Query 0 - Top 3: {I[0][:3]}, distances: {D[0][:3]}")

## 12. Interactive Parameter Explorer

Use this cell to experiment with different distance metrics.

In [None]:
def test_flat_index(metric_name, xb, xq, k, normalize=False):
    """
    Test FlatIndex with a specific metric.
    
    Args:
        metric_name: 'L2', 'IP', 'L1', 'Linf', or 'Cosine'
        xb: Database vectors
        xq: Query vectors
        k: Number of neighbors
        normalize: Whether to L2-normalize vectors (for cosine)
    """
    d = xb.shape[1]
    
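    # Handle normalization for cosine (normalize=True also forces inner product)
    if normalize or metric_name == 'Cosine':
        xb_use = normalize_vectors(xb)
        xq_use = normalize_vectors(xq)
        metric = faiss.METRIC_INNER_PRODUCT
        print("Using L2-normalized vectors with inner product (cosine similarity)")
    else:
        xb_use = xb
        xq_use = xq
        metric_map = {
            'L2': faiss.METRIC_L2,
            'IP': faiss.METRIC_INNER_PRODUCT,
            'L1': faiss.METRIC_L1,
            'Linf': faiss.METRIC_Linf,
        }
        # Fail loudly on typos instead of silently falling back to L2
        if metric_name not in metric_map:
            raise ValueError(f"Unknown metric {metric_name!r}; expected one of {sorted(metric_map)} or 'Cosine'")
        metric = metric_map[metric_name]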
    
    print(f"\nTesting FlatIndex with {metric_name} metric")
    print("-" * 50)
    
    # Create and populate index
    index = faiss.IndexFlat(d, metric)
    index.add(xb_use)
    print(f"Index: {index.ntotal} vectors, dimension {d}")
    
    # Search
    start = time.time()
    distances, indices = index.search(xq_use, k)
    search_time = time.time() - start
    
    # Results
    qps = len(xq) / search_time
    print(f"Search time: {search_time*1000:.2f}ms ({qps:.0f} QPS)")
    print(f"\nTop-{k} neighbors for query 0:")
    for i in range(min(5, k)):
        print(f"  #{i+1}: index={indices[0, i]}, distance={distances[0, i]:.4f}")
    
    return index, distances, indices

# Example: Try different metrics!
# Options: 'L2', 'IP', 'L1', 'Linf', 'Cosine'
my_metric = 'Cosine'
my_k = 10

# Assign the result so the notebook does not echo the full result arrays
_ = test_flat_index(my_metric, xb, xq, my_k)

## 13. Summary

Key takeaways from this notebook:

In [None]:
print("="*70)
print("FlatIndex Summary")
print("="*70)

summary = """
WHEN TO USE FLATINDEX
=====================
✓ Small datasets (< 100K vectors typically)
✓ Need exact/ground truth results
✓ Creating baselines for comparison
✓ When search time is not critical
✓ Index building speed is important (no training needed)

WHEN TO AVOID FLATINDEX  
========================
✗ Large datasets (> 1M vectors)
✗ Real-time applications requiring low latency
✗ Memory-constrained environments (stores full vectors)

DISTANCE METRIC QUICK REFERENCE
===============================
┌────────────┬──────────────────┬────────────────────────────┐
│ Metric     │ FAISS Index      │ Best Use Case              │
├────────────┼──────────────────┼────────────────────────────┤
│ L2         │ IndexFlatL2      │ General purpose            │
│ Cosine     │ IndexFlatIP+norm │ Text/document embeddings   │
│ IP         │ IndexFlatIP      │ Recommendations            │
│ L1         │ IndexFlat(L1)    │ Sparse data, robustness    │
│ Linf       │ IndexFlat(Linf)  │ Worst-case analysis        │
└────────────┴──────────────────┴────────────────────────────┘

PERFORMANCE NOTES
=================
• Search time: O(n × d) per query - linear in database size
• Memory: n × d × 4 bytes for float32 vectors
• No training required - instant index creation
• L2 and IP have the most optimized code paths; other metrics
  may be slower (see the timings in Section 7)

For larger datasets, consider:
• IVFFlat (IndexIVFFlat): Approximate search with inverted file
• IVFPQ (IndexIVFPQ): Compressed vectors for memory efficiency
• HNSW (IndexHNSWFlat): Graph-based approximate search
"""
print(summary)

# Show memory estimate for current dataset
memory_mb = nb * d * 4 / (1024 * 1024)
print(f"\nMemory for current dataset ({nb:,} × {d} float32): {memory_mb:.1f} MB")