# 🔍 Hierarchical Reasoning via Hierarchical Navigable Small World (HNSW)

This notebook demonstrates **Hierarchical Navigable Small World (HNSW)** graphs and how they enable efficient hierarchical reasoning for similarity search and retrieval-augmented generation (RAG) applications.

## What is HNSW?

HNSW is a graph-based algorithm for **Approximate Nearest Neighbor (ANN)** search that achieves:
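- **Roughly logarithmic search cost**: query time scales approximately as O(log N) in practice, versus O(N) for brute force
- **High recall**: often 95% or more of the true nearest neighbors are returned, depending on `M`, `ef`, and the data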
- **Hierarchical structure**: Multiple layers for efficient navigation

## Hierarchical Structure

```
Layer 2 (sparse):     [A] -------- [B]
                       |            |
Layer 1 (medium):     [A] -- [C] -- [B] -- [D]
                       |     |      |      |
Layer 0 (dense):      [A]-[E]-[C]-[F]-[B]-[G]-[D]-[H]
```

- **Higher layers**: Fewer nodes, longer connections (coarse navigation)
- **Lower layers**: More nodes, shorter connections (fine navigation)
- **Search**: Start at top layer, navigate down through layers
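
How a node ends up on a given layer is not visible in the diagram. In the HNSW paper, each inserted node draws its top layer as floor(-ln(U) * mL), with U uniform in (0, 1] and mL = 1/ln(M), so each layer is about M times sparser than the one below it. The sketch below simulates only that sampling step; `sample_top_layer` and `layer_counts` are illustrative names, not hnswlib functions.

```python
import math
import random

def sample_top_layer(M: int, rng: random.Random) -> int:
    """Draw the highest layer a new node joins: floor(-ln(U) * mL), mL = 1/ln(M)."""
    m_l = 1.0 / math.log(M)
    u = 1.0 - rng.random()          # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)

def layer_counts(n_nodes: int, M: int, seed: int = 42) -> list:
    """Count nodes present on each layer (a node on layer L is also on every layer below)."""
    rng = random.Random(seed)
    tops = [sample_top_layer(M, rng) for _ in range(n_nodes)]
    return [sum(1 for t in tops if t >= layer) for layer in range(max(tops) + 1)]

counts = layer_counts(n_nodes=10_000, M=16)
for layer, count in enumerate(counts):
    print(f"Layer {layer}: {count} nodes")
```

With M = 16, roughly one node in 16 should reach layer 1 and one in 256 should reach layer 2, which is why the upper layers behave like a coarse map of the data.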

---

## 📦 1. Installation

Install the required libraries for HNSW and visualization.

In [None]:
# Install dependencies
%pip install hnswlib numpy matplotlib scikit-learn --quiet
# Optional: only needed to embed real text; the cells below use synthetic vectors
%pip install sentence-transformers --quiet

print("✅ Dependencies installed!")

In [None]:
import numpy as np
import hnswlib
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import time
from typing import List, Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported!")

## 🎯 2. Understanding HNSW Basics

Let's start with a simple example to understand how HNSW works.

In [None]:
# Generate sample data: 1000 points in 128 dimensions
np.random.seed(42)
num_elements = 1000
dim = 128

# Create clustered data for more interesting hierarchical structure
data, labels = make_blobs(n_samples=num_elements, n_features=dim, centers=10, random_state=42)
data = data.astype('float32')

# Normalize vectors (for cosine similarity)
data = data / np.linalg.norm(data, axis=1, keepdims=True)

print(f"📊 Data shape: {data.shape}")
print(f"📊 Number of clusters: {len(np.unique(labels))}")
print(f"📊 Data range: [{data.min():.3f}, {data.max():.3f}]")

In [None]:
# Create HNSW index
# Key parameters:
# - M: Number of bi-directional links per element (affects memory and search quality)
# - ef_construction: Size of dynamic candidate list during construction

# Initialize the index
hnsw_index = hnswlib.Index(space='cosine', dim=dim)  # 'cosine', 'l2', or 'ip'

# Initialize index - max_elements is the maximum number of elements
hnsw_index.init_index(
    max_elements=num_elements,
    ef_construction=200,  # Higher = better quality, slower construction
    M=16                   # Max links per node on each upper layer (hnswlib allows 2*M on layer 0)
)

# Add data to the index
start_time = time.time()
hnsw_index.add_items(data, np.arange(num_elements))
build_time = time.time() - start_time

print(f"✅ HNSW index built in {build_time:.3f} seconds")
print(f"📊 Index parameters:")
print(f"   - M (connections): {hnsw_index.M}")
print(f"   - ef_construction: {hnsw_index.ef_construction}")
print(f"   - Max elements: {hnsw_index.max_elements}")
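
To make the memory side of the `M` trade-off concrete, here is a back-of-the-envelope estimate. It assumes float32 vectors and 4-byte neighbor ids with up to 2*M links on layer 0, and it ignores upper-layer links, labels, and allocator overhead, so treat it as a rough lower bound rather than a measurement. `estimate_index_bytes` is a hypothetical helper, not part of hnswlib.

```python
def estimate_index_bytes(n_elements: int, dim: int, M: int) -> int:
    """Rough lower bound: float32 vector plus up to 2*M layer-0 links (4-byte ids) per element."""
    vector_bytes = dim * 4      # assumption: float32 storage
    link_bytes = 2 * M * 4      # assumption: 4-byte neighbor ids, 2*M slots on layer 0
    return n_elements * (vector_bytes + link_bytes)

for M in (16, 32, 64):
    size = estimate_index_bytes(n_elements=1000, dim=128, M=M)
    print(f"M={M:<3} -> about {size / 1024:.0f} KiB")
```

For 128-dimensional vectors the vector data dominates at M = 16, while at M = 64 the link lists are as large as the vectors themselves under these assumptions.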

## 🔎 3. Search: Hierarchical Navigation

The search process navigates through layers:
1. **Start at top layer** from the index's entry point (a node that reached the highest layer when it was inserted)
2. **Greedy search** to find closest node in current layer
3. **Move down** to next layer using the found node as entry point
4. **Repeat** until reaching layer 0
5. **Search layer 0** with a candidate list of size `ef` and **return** the k nearest neighbors found
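
The steps above can be sketched in plain Python on a toy graph shaped like the diagram in the introduction. This is a simplified illustration, not hnswlib's implementation: the graphs are hand-built adjacency dicts, the distance is squared Euclidean on 1-D points, and all function names are made up for this example.

```python
import heapq

def dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_descend(graph, points, entry, query):
    """Upper layers: keep hopping to the closest neighbor until none is closer."""
    current = entry
    while True:
        best = min(graph.get(current, ()), key=lambda n: dist(points[n], query), default=current)
        if dist(points[best], query) >= dist(points[current], query):
            return current
        current = best

def search_base_layer(graph, points, entry, query, ef):
    """Layer 0: best-first search that keeps at most `ef` results."""
    d0 = dist(points[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    results = [(-d0, entry)]     # max-heap (negated distances) of kept results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                # closest open candidate is worse than the worst kept result
        for nb in graph.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(points[nb], query)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-neg_d, n) for neg_d, n in results)

def layered_search(layers, points, entry, query, k, ef):
    """layers[0] is the dense base layer, layers[-1] the sparsest top layer."""
    current = entry
    for graph in reversed(layers[1:]):
        current = greedy_descend(graph, points, current, query)
    return [n for _, n in search_base_layer(layers[0], points, current, query, ef)[:k]]

# Toy 1-D data: A E C F B G D H placed at x = 0..7
names = "AECFBGDH"
points = {n: (float(x),) for x, n in enumerate(names)}
layer0 = {n: [names[j] for j in (i - 1, i + 1) if 0 <= j < len(names)] for i, n in enumerate(names)}
layer1 = {"A": ["C"], "C": ["A", "B"], "B": ["C", "D"], "D": ["B"]}
layer2 = {"A": ["B"], "B": ["A"]}

neighbors = layered_search([layer0, layer1, layer2], points, entry="A", query=(5.2,), k=2, ef=3)
print(neighbors)
```

For the query at x = 5.2 the search jumps A → B on the top layer, B → D on the middle layer, and then explores around D on layer 0, returning G and D, the two points closest to the query.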

In [None]:
# Search for nearest neighbors
# ef (search parameter): Size of dynamic candidate list during search
# Higher ef = better recall but slower search

# Set search parameter
hnsw_index.set_ef(50)  # ef should be >= k (number of neighbors to return)

# Query vector (use first data point as query)
query = data[0:1]  # Shape: (1, 128)
k = 10  # Find 10 nearest neighbors

# Search
start_time = time.perf_counter()  # high-resolution timer; time.time() can report 0 for very short intervals
labels_result, distances = hnsw_index.knn_query(query, k=k)
search_time = time.perf_counter() - start_time

print(f"🔍 Query completed in {search_time*1000:.3f} ms")
print(f"\n📊 Top {k} nearest neighbors:")
print(f"{'Rank':<6} {'ID':<8} {'Distance':<12} {'Similarity':<12}")
print("-" * 40)
for i, (idx, dist) in enumerate(zip(labels_result[0], distances[0])):
    # For cosine space, distance = 1 - similarity
    similarity = 1 - dist
    print(f"{i+1:<6} {idx:<8} {dist:<12.6f} {similarity:<12.6f}")
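
The `1 - dist` conversion above relies on the `cosine` space reporting distance as 1 minus cosine similarity. For unit-length vectors that quantity is also half the squared Euclidean distance, so cosine and L2 rankings agree on normalized data. A small dependency-free check of the identity follows; the helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; inputs are assumed to be unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance(a, b)
print(cosine_distance(a, b), squared_l2(a, b) / 2)
```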

In [None]:
# Compare with brute-force search
def brute_force_search(data: np.ndarray, query: np.ndarray, k: int) -> Tuple[np.ndarray, np.ndarray]:
    """Exact nearest neighbor search using brute force."""
    similarities = cosine_similarity(query, data)[0]
    top_k_indices = np.argsort(similarities)[::-1][:k]
    top_k_similarities = similarities[top_k_indices]
    return top_k_indices, 1 - top_k_similarities  # Convert to distances

# Brute force search
start_time = time.perf_counter()  # same high-resolution timer as the HNSW measurement
bf_indices, bf_distances = brute_force_search(data, query, k)
bf_time = time.perf_counter() - start_time

print(f"⏱️ Search Time Comparison:")
print(f"   HNSW:        {search_time*1000:.3f} ms")
print(f"   Brute Force: {bf_time*1000:.3f} ms")
print(f"   Speedup:     {bf_time/search_time:.1f}x faster")

# Calculate recall (how many of the true nearest neighbors did HNSW find?)
hnsw_set = set(labels_result[0].tolist())
bf_set = set(bf_indices.tolist())
n_correct = len(hnsw_set & bf_set)  # keep the integer count to avoid float round-trips
recall = n_correct / k * 100

print(f"\n📊 Recall: {recall:.1f}% ({n_correct}/{k} correct neighbors)")
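
A single query, and one that is itself stored in the index, gives a noisy recall figure. Averaging recall@k over many queries is the more usual way to report it. The helpers below are a minimal sketch operating on plain lists of ids; `recall_at_k` and `mean_recall` are illustrative names, and the id lists are made-up data.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

def mean_recall(approx_batches, exact_batches):
    """Average recall@k over several queries (one id list per query)."""
    scores = [recall_at_k(a, e) for a, e in zip(approx_batches, exact_batches)]
    return sum(scores) / len(scores)

# Hypothetical results for three queries with k = 4
approx = [[0, 5, 9, 2], [7, 1, 3, 8], [4, 6, 0, 2]]
exact = [[0, 5, 9, 3], [7, 1, 3, 8], [4, 6, 1, 2]]
print(f"Mean recall@4: {mean_recall(approx, exact):.2f}")
```

Since `knn_query` accepts a 2-D array of queries and returns one row of labels per query, those rows could be passed as `approx_batches`, with the brute-force ids for the same queries as `exact_batches`.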

## 🏗️ 4. Visualizing the Hierarchical Structure

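Let's visualize the data the index is built over. The plots below project the vectors to 2-D with PCA and show the cluster structure alongside the neighbors HNSW retrieves for the query; they do not draw the graph layers themselves.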

In [None]:
# Reduce dimensions for visualization
pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)

# Visualize the data with cluster colors
plt.figure(figsize=(12, 5))

# Plot 1: Data points colored by cluster
plt.subplot(1, 2, 1)
scatter = plt.scatter(data_2d[:, 0], data_2d[:, 1], c=labels, cmap='tab10', alpha=0.6, s=20)
plt.colorbar(scatter, label='Cluster')
plt.title('Data Distribution (10 Clusters)')
plt.xlabel('PC1')
plt.ylabel('PC2')

# Plot 2: Highlight query and its neighbors
plt.subplot(1, 2, 2)
plt.scatter(data_2d[:, 0], data_2d[:, 1], c='lightgray', alpha=0.3, s=20, label='All points')
plt.scatter(data_2d[labels_result[0], 0], data_2d[labels_result[0], 1], 
            c='blue', s=100, label='HNSW neighbors', edgecolors='black')
plt.scatter(data_2d[0, 0], data_2d[0, 1], c='red', s=200, marker='*', label='Query', edgecolors='black')
plt.title('Query and Retrieved Neighbors')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()

plt.tight_layout()
plt.show()

## 🧠 5. Hierarchical Reasoning with HNSW

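Now let's implement **hierarchical reasoning**: multi-level semantic search and retrieval with one HNSW index per semantic level. These levels are separate indices that we build ourselves, not the internal layers of a single HNSW graph, and their numbering runs coarse to fine, the opposite of HNSW's layers, where layer 0 is the densest.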

### Structure:
- **Level 0 (Documents)**: Full document embeddings (coarse)
- **Level 1 (Sections)**: Section/paragraph embeddings (medium)
- **Level 2 (Sentences)**: Sentence-level embeddings (fine)

In [None]:
class HierarchicalKnowledgeBase:
    """
    A hierarchical knowledge base with one HNSW index per semantic level.
    
    Intended coarse-to-fine workflow:
    1. Find relevant documents (coarse)
    2. Find relevant sections (medium)
    3. Find specific sentences (fine)
    
    Each level is searched independently; restricting finer levels to the
    parents found at coarser levels is left to the caller.
    """
    
    def __init__(self, dim: int = 384):
        self.dim = dim
        self.levels = {}
        self.metadata = {}
        
    def add_level(self, level_name: str, embeddings: np.ndarray, 
                  metadata: List[Dict], M: int = 16, ef_construction: int = 200):
        """Add a hierarchical level to the knowledge base."""
        n_items = len(embeddings)
        
        # Create HNSW index for this level
        index = hnswlib.Index(space='cosine', dim=self.dim)
        index.init_index(max_elements=n_items, ef_construction=ef_construction, M=M)
        index.add_items(embeddings.astype('float32'), np.arange(n_items))
        
        self.levels[level_name] = {
            'index': index,
            'embeddings': embeddings,
            'count': n_items
        }
        self.metadata[level_name] = metadata
        
        print(f"✅ Added level '{level_name}' with {n_items} items")
    
    def search_level(self, level_name: str, query_embedding: np.ndarray, 
                     k: int = 5, ef: int = 50) -> List[Dict]:
        """Search within a specific level."""
        if level_name not in self.levels:
            raise ValueError(f"Level '{level_name}' not found")
        
        index = self.levels[level_name]['index']
        index.set_ef(ef)
        
        labels, distances = index.knn_query(query_embedding.reshape(1, -1).astype('float32'), k=k)
        
        results = []
        for idx, dist in zip(labels[0], distances[0]):
            results.append({
                'id': int(idx),
                'distance': float(dist),
                'similarity': float(1 - dist),
                'metadata': self.metadata[level_name][idx]
            })
        
        return results
    
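    def hierarchical_search(self, query_embedding: np.ndarray,
                           levels: Optional[List[str]] = None,
                           k_per_level: Optional[List[int]] = None) -> Dict[str, List[Dict]]:
        """Search each requested level independently with the same query."""
        if levels is None:
            levels = list(self.levels.keys())
        if k_per_level is None:
            k_per_level = [5] * len(levels)
        if len(k_per_level) != len(levels):
            # zip() below would otherwise silently skip the unmatched levels
            raise ValueError("k_per_level must have one entry per level")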
        
        results = {}
        for level_name, k in zip(levels, k_per_level):
            level_results = self.search_level(level_name, query_embedding, k=k)
            results[level_name] = level_results
        
        return results

print("✅ HierarchicalKnowledgeBase class defined!")

## 📚 6. Example: Hierarchical Document Search

Let's create a practical example with simulated document hierarchies.

In [None]:
# Simulate hierarchical document structure
np.random.seed(42)

# Configuration
num_documents = 50
sections_per_doc = 5
sentences_per_section = 10
embedding_dim = 128

# Create document topics
topics = ["Machine Learning", "Natural Language Processing", "Computer Vision", 
          "Reinforcement Learning", "Neural Networks"]

# Generate embeddings with hierarchical structure
# Documents within same topic will have similar embeddings

# Level 0: Documents
doc_embeddings = []
doc_metadata = []

for doc_id in range(num_documents):
    topic_id = doc_id % len(topics)
    # Random base vector; the +2 bump on this topic's 25-dim slice is what same-topic documents share
    topic_base = np.random.randn(embedding_dim) * 0.5
    topic_base[topic_id * 25:(topic_id + 1) * 25] += 2
    # Add document-specific variation
    doc_emb = topic_base + np.random.randn(embedding_dim) * 0.2
    doc_emb = doc_emb / np.linalg.norm(doc_emb)
    
    doc_embeddings.append(doc_emb)
    doc_metadata.append({
        'doc_id': doc_id,
        'title': f"Document {doc_id}: {topics[topic_id]} Guide Part {doc_id // len(topics) + 1}",
        'topic': topics[topic_id]
    })

doc_embeddings = np.array(doc_embeddings, dtype='float32')

# Level 1: Sections
section_embeddings = []
section_metadata = []

for doc_id in range(num_documents):
    for sec_id in range(sections_per_doc):
        sec_emb = doc_embeddings[doc_id] + np.random.randn(embedding_dim) * 0.15
        sec_emb = sec_emb / np.linalg.norm(sec_emb)
        
        section_embeddings.append(sec_emb)
        section_metadata.append({
            'section_id': len(section_embeddings) - 1,
            'parent_doc_id': doc_id,
            'title': f"Section {sec_id + 1} of Doc {doc_id}",
            'topic': doc_metadata[doc_id]['topic']
        })

section_embeddings = np.array(section_embeddings, dtype='float32')

# Level 2: Sentences
sentence_embeddings = []
sentence_metadata = []

for sec_idx, sec_emb in enumerate(section_embeddings):
    for sent_id in range(sentences_per_section):
        sent_emb = sec_emb + np.random.randn(embedding_dim) * 0.1
        sent_emb = sent_emb / np.linalg.norm(sent_emb)
        
        sentence_embeddings.append(sent_emb)
        sentence_metadata.append({
            'sentence_id': len(sentence_embeddings) - 1,
            'parent_section_id': sec_idx,
            'parent_doc_id': section_metadata[sec_idx]['parent_doc_id'],
            'text': f"Sentence {sent_id + 1} about {section_metadata[sec_idx]['topic']}",
            'topic': section_metadata[sec_idx]['topic']
        })

sentence_embeddings = np.array(sentence_embeddings, dtype='float32')

print(f"📚 Created hierarchical document structure:")
print(f"   Level 0 (Documents):  {len(doc_embeddings):,} items")
print(f"   Level 1 (Sections):   {len(section_embeddings):,} items")
print(f"   Level 2 (Sentences):  {len(sentence_embeddings):,} items")
print(f"   Total:                {len(doc_embeddings) + len(section_embeddings) + len(sentence_embeddings):,} items")
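
Because sections and sentences are appended in nested loops, the flat ids follow a row-major layout, and the parent ids stored in the metadata can be recovered by integer division. The sketch below restates that arithmetic using the configuration values from the cell above; `parents_of_sentence` is a hypothetical helper that could serve as a consistency check on the metadata.

```python
sections_per_doc = 5
sentences_per_section = 10

def parents_of_sentence(sentence_id: int) -> tuple:
    """Return (section_id, doc_id) implied by the row-major layout."""
    section_id = sentence_id // sentences_per_section
    doc_id = section_id // sections_per_doc
    return section_id, doc_id

print(parents_of_sentence(0))
print(parents_of_sentence(1234))
```

Sentence 1234, for example, falls in section 123, which belongs to document 24.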

In [None]:
# Create hierarchical knowledge base
kb = HierarchicalKnowledgeBase(dim=embedding_dim)

# Add levels
kb.add_level('documents', doc_embeddings, doc_metadata, M=16, ef_construction=100)
kb.add_level('sections', section_embeddings, section_metadata, M=16, ef_construction=100)
kb.add_level('sentences', sentence_embeddings, sentence_metadata, M=16, ef_construction=100)

print("\n✅ Hierarchical knowledge base ready!")

In [None]:
# Create a query embedding (simulate a query about "Neural Networks")
query_topic_id = 4  # Neural Networks
query_embedding = np.random.randn(embedding_dim) * 0.3
query_embedding[query_topic_id * 25:(query_topic_id + 1) * 25] += 2
query_embedding = query_embedding / np.linalg.norm(query_embedding)
query_embedding = query_embedding.astype('float32')

print("🔍 Query: Find information about Neural Networks")
print("=" * 60)

# Perform hierarchical search
results = kb.hierarchical_search(
    query_embedding,
    levels=['documents', 'sections', 'sentences'],
    k_per_level=[3, 5, 10]
)

# Display results
print("\n📄 LEVEL 0: Top Documents")
print("-" * 60)
for r in results['documents']:
    print(f"  [{r['similarity']:.3f}] {r['metadata']['title']}")
    print(f"           Topic: {r['metadata']['topic']}")

print("\n📑 LEVEL 1: Top Sections")
print("-" * 60)
for r in results['sections'][:5]:
    print(f"  [{r['similarity']:.3f}] {r['metadata']['title']}")
    print(f"           Parent Doc: {r['metadata']['parent_doc_id']}, Topic: {r['metadata']['topic']}")

print("\n📝 LEVEL 2: Top Sentences")
print("-" * 60)
for r in results['sentences'][:5]:
    print(f"  [{r['similarity']:.3f}] {r['metadata']['text']}")
    print(f"           Section: {r['metadata']['parent_section_id']}, Doc: {r['metadata']['parent_doc_id']}")
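
The search above queries each level independently, so a section or sentence can appear in the results even when its parent document does not. One way to get coarse-to-fine behavior is to post-filter the finer hits by parent id. The sketch below assumes hits shaped like the dicts returned by `search_level` and uses hand-made data; `filter_by_parent` is a hypothetical helper, not a method of the class above.

```python
def filter_by_parent(child_results, parent_results, parent_key):
    """Keep child hits whose metadata[parent_key] is the id of one of the parent hits."""
    allowed = {p["id"] for p in parent_results}
    return [c for c in child_results if c["metadata"][parent_key] in allowed]

# Hand-made hits in the same shape that search_level returns
doc_hits = [
    {"id": 4, "similarity": 0.91, "metadata": {"doc_id": 4}},
    {"id": 9, "similarity": 0.88, "metadata": {"doc_id": 9}},
]
section_hits = [
    {"id": 21, "similarity": 0.90, "metadata": {"parent_doc_id": 4}},
    {"id": 70, "similarity": 0.87, "metadata": {"parent_doc_id": 14}},
    {"id": 47, "similarity": 0.85, "metadata": {"parent_doc_id": 9}},
]

kept = filter_by_parent(section_hits, doc_hits, parent_key="parent_doc_id")
print([s["id"] for s in kept])
```

On the `results` dict from the cell above, the equivalent call would be `filter_by_parent(results['sections'], results['documents'], 'parent_doc_id')`. Filtering can leave fewer than k hits, so requesting a larger k at the finer level before filtering is a reasonable workaround.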

## 🎓 7. Summary & Best Practices

### Key Concepts

1. **HNSW Structure**:
   - Multiple layers with decreasing density
   - Higher layers for coarse navigation
   - Lower layers for fine-grained search

2. **Hierarchical Reasoning**:
   - Build indices at multiple semantic levels
   - Use appropriate strategy based on query type
   - Combine coarse-to-fine or fine-to-coarse approaches

### Best Practices

| Parameter | Recommendation | Trade-off |
|-----------|---------------|----------|
| **M** | 16-64 | Higher = better quality, more memory |
| **ef_construction** | 100-500 | Higher = better index, slower build |
| **ef** (search) | 50-200 | Higher = better recall, slower search |

### Reasoning Strategies

| Strategy | Use Case |
|----------|----------|
| **Top-Down** | Exploratory queries, need context |
| **Bottom-Up** | Precise fact retrieval |
| **Multi-Hop** | Complex reasoning, multiple evidence |
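
The code in this notebook searches each level independently and does not implement these strategies directly. As a minimal bottom-up sketch, fine-grained hits can be rolled up to rank their parents. It assumes sentence hits shaped like the `search_level` output with a `parent_doc_id` in the metadata, and it sums similarities per parent, which is only one of several possible aggregation rules. `rank_parents` is a hypothetical helper and the hits are made-up data.

```python
from collections import defaultdict

def rank_parents(child_results, parent_key, top_n=3):
    """Bottom-up: sum child similarities per parent and return the best-scoring parents."""
    scores = defaultdict(float)
    for hit in child_results:
        scores[hit["metadata"][parent_key]] += hit["similarity"]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Hand-made sentence hits in the shape returned by search_level
sentence_hits = [
    {"id": 240, "similarity": 0.93, "metadata": {"parent_doc_id": 4}},
    {"id": 245, "similarity": 0.90, "metadata": {"parent_doc_id": 4}},
    {"id": 470, "similarity": 0.92, "metadata": {"parent_doc_id": 9}},
    {"id": 1203, "similarity": 0.60, "metadata": {"parent_doc_id": 24}},
]
ranked = rank_parents(sentence_hits, "parent_doc_id")
for doc_id, score in ranked:
    print(f"doc {doc_id}: {score:.2f}")
```

Summation favors documents with several matching sentences; taking the maximum per parent instead would favor a single strong match.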

---

## Resources

- [HNSW Paper](https://arxiv.org/abs/1603.09320)
- [hnswlib Documentation](https://github.com/nmslib/hnswlib)
- [FAISS (Facebook AI Similarity Search)](https://github.com/facebookresearch/faiss)

In [None]:
print("✅ Notebook complete!")
print("\n📚 Key takeaways:")
print("   1. HNSW search cost grows roughly as O(log N) in practice")
print("   2. Hierarchical structure enables multi-level reasoning")
print("   3. Different strategies suit different query types")
print("   4. Trade-offs between speed, memory, and accuracy are configurable")