
# Phase 5: Advanced Features Walkthrough

In this notebook, we explore the advanced features implemented in Phase 5 of FeedPrism:

1.  **Discovery API (Recommendations)**: Finding similar content using Qdrant's `discover` API.
2.  **Scroll API (Analytics)**: Efficiently iterating through large datasets to generate analytics.
3.  **HNSW Benchmarking**: Evaluating vector search performance with different index configurations.

These features demonstrate the production-readiness of the system, moving beyond basic search to provide insights and personalized recommendations.
    


## 1. Setup and Initialization

First, we set up the environment and initialize our services.
    

In [1]:

import sys
import os
import asyncio
from pprint import pprint
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent.absolute()
os.chdir(project_root)
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from app.services.recommender import RecommendationService
from app.services.analytics import AnalyticsService
from app.database.qdrant_client import QdrantService
from app.services.embedder import EmbeddingService

print("✅ Services imported successfully")
    



✅ Services imported successfully


In [2]:

# Initialize services
recommender = RecommendationService()
analytics = AnalyticsService()
qdrant = QdrantService()
embedder = EmbeddingService()

print("✅ Services initialized")
    

2025-11-27 18:45:58.624 | INFO     | app.database.qdrant_client:__init__:64 - Connecting to Qdrant: localhost:6333
2025-11-27 18:45:58.664 | SUCCESS  | app.database.qdrant_client:__init__:66 - Qdrant client initialized
2025-11-27 18:45:58.664 | INFO     | app.database.qdrant_client:__init__:64 - Connecting to Qdrant: localhost:6333
2025-11-27 18:45:58.672 | SUCCESS  | app.database.qdrant_client:__init__:66 - Qdrant client initialized
2025-11-27 18:45:58.672 | INFO     | app.database.qdrant_client:__init__:64 - Connecting to Qdrant: localhost:6333
2025-11-27 18:45:58.679 | SUCCESS  | app.database.qdrant_client:__init__:66 - Qdrant client initialized


✅ Services initialized



## 2. Discovery API (Recommendations)

The **Discovery API** in Qdrant targets recommendation-style use cases. Standard search finds the nearest neighbors of a query vector. Discovery instead searches around a *target* point and can optionally be constrained by *context*, which is a set of positive and negative example pairs. We use the simpler target-only approach here.

We implemented `RecommendationService.discover_similar` to wrap this functionality.
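
To make the target-based idea concrete, here is a minimal, library-free sketch: look up the stored vector of a known item and rank every other item against it. This is a conceptual stand-in only. It is not Qdrant's implementation and not the body of `RecommendationService.discover_similar`; the names `rank_similar_to_target` and `toy_points`, and the two-dimensional vectors, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_similar_to_target(points, target_id, limit=3):
    """Rank stored points by similarity to the target point's own vector.

    `points` maps item ID -> vector. The target is excluded from the
    results so an item is never recommended to itself.
    """
    target_vector = points[target_id]
    scored = [
        (item_id, cosine(target_vector, vector))
        for item_id, vector in points.items()
        if item_id != target_id
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical toy data: three events embedded in two dimensions.
toy_points = {
    "event-a": [1.0, 0.0],
    "event-b": [0.9, 0.1],
    "event-c": [0.0, 1.0],
}
ranked = rank_similar_to_target(toy_points, "event-a", limit=2)
print([item_id for item_id, _ in ranked])
```

The key difference from plain search is the input: the caller passes an item ID rather than a query vector, so no embedding call is needed at recommendation time.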
    

In [3]:

# Let's first find an event to use as our "target"
# We'll search for something generic to get a valid ID
results = qdrant.search(
    query_vector=embedder.embed_text("AI conference"),
    content_type="events",
    limit=1
)

if results:
    target_event = results[0]
    target_id = target_event['id']
    print(f"Found target event: {target_event['payload'].get('title')} (ID: {target_id})")
    
    # Now, let's ask for recommendations similar to this event
    print("\n🔍 Discovering similar events...")
    recommendations = recommender.discover_similar(
        item_id=target_id,
        content_type="events",
        limit=3
    )
    
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec['payload'].get('title')} (Score: {rec['score']:.3f})")
else:
    print("⚠️ No events found to test recommendations. Run ingestion first.")
    

2025-11-27 18:46:08.137 | INFO     | app.database.qdrant_client:search:169 - Searching feedprism_events using title (limit=1)


Found target event: Unique Event (ID: c0967154-a692-4c3a-90cf-ce6dc0dd84b1)

🔍 Discovering similar events...
1. Duplicate Event (Score: 0.722)
2. Duplicate Event (Score: 0.718)



## 3. Scroll API (Analytics)

The **Scroll API** allows us to iterate through all points in a collection, which is essential for analytics where we need to aggregate data across the entire dataset (or a large filtered subset) rather than just finding the top-k matches.

We implemented `AnalyticsService.get_email_stats` to calculate:
- Total items processed
- Breakdown by content type
- Top organizers, providers, and tags
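
The aggregation pattern behind such statistics can be sketched without a running database. Qdrant's scroll call returns a page of points together with the offset of the next page, and the loop below follows that cursor shape. Everything else is an assumption for illustration: `make_fake_scroll`, `count_by_type`, the `type` payload field, and the toy records are hypothetical and do not reflect how `AnalyticsService.get_email_stats` is actually written.

```python
from collections import Counter


def make_fake_scroll(records):
    """Build a scroll function over an in-memory list (hypothetical stand-in).

    Each call returns one page plus the offset of the next page, or None
    once the data is exhausted.
    """
    def scroll(offset, limit):
        start = offset or 0
        page = records[start:start + limit]
        next_offset = start + limit if start + limit < len(records) else None
        return page, next_offset
    return scroll


def count_by_type(scroll, page_size=2):
    """Walk every page and tally the `type` field of each record."""
    counts = Counter()
    offset = None
    while True:
        page, offset = scroll(offset, page_size)
        counts.update(record["type"] for record in page)
        if offset is None:  # no further pages
            break
    return dict(counts)


# Hypothetical toy payloads.
fake_records = [
    {"type": "events"},
    {"type": "blogs"},
    {"type": "events"},
    {"type": "events"},
]
print(count_by_type(make_fake_scroll(fake_records)))
```

Because only one page is held in memory at a time, the same loop works whether the collection holds four points or several million.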
    

In [5]:

# Get analytics for the last 30 days
print("📊 Generating analytics...")
stats = analytics.get_email_stats(days=30)

print(f"Total Items: {stats['total_items']}")
print(f"Avg per week: {stats['avg_per_week']:.1f}")
print("\nBy Type:")
pprint(stats['by_type'])

print("\nTop Tags:")
pprint(stats['top_tags'])
    

📊 Generating analytics...
Total Items: 4
Avg per week: 0.9

By Type:
{'blogs': 1, 'events': 3}

Top Tags:
{}



## 4. HNSW Benchmarking

To ensure production readiness, we benchmarked different HNSW (Hierarchical Navigable Small World) index configurations. HNSW is the algorithm Qdrant uses for approximate nearest neighbor search.

Key parameters:
- **m**: Number of connections per node. Higher `m` = better recall but higher memory/build time.
- **ef_construct**: Size of the dynamic list during index construction. Higher = better index quality but slower build.

We ran `scripts/benchmark_hnsw.py` to test three configurations:
1.  **High Precision**: m=32, ef_construct=400
2.  **Balanced**: m=16, ef_construct=200
3.  **Fast**: m=8, ef_construct=100

Here are the results:
    

In [6]:

# Read and display the benchmark results
benchmark_file = os.path.join(project_root, "docs", "benchmarks.md")

if os.path.exists(benchmark_file):
    from IPython.display import Markdown
    with open(benchmark_file, "r") as f:
        content = f.read()
    display(Markdown(content))
else:
    print("⚠️ Benchmark results not found. Run `python scripts/benchmark_hnsw.py` first.")
    

# HNSW Benchmark Results

| Configuration | Precision@10 | Latency (p95) | Memory |
|---------------|--------------|---------------|--------|
| High Precision | 0.950 | 4.9ms | 1.46MB |
| Balanced | 0.950 | 1.6ms | 1.46MB |
| Fast | 0.950 | 1.8ms | 1.46MB |



### Evaluation

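On this small corpus, all three configurations reach the same Precision@10 (0.950) and the same memory footprint (1.46MB), so the benchmark separates them only on p95 latency: Balanced (1.6ms) and Fast (1.8ms) are both well ahead of High Precision (4.9ms). For FeedPrism, we choose the **Balanced** configuration, whose `m=16` matches Qdrant's default, because it offers a good mix of speed and accuracy without excessive memory usage. The "High Precision" configuration might become necessary if the corpus grows significantly (millions of vectors), while "Fast" suits very resource-constrained environments.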
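
For readers who want to reproduce the Precision@10 and p95 latency columns, the sketch below shows one common way to compute each metric. How `scripts/benchmark_hnsw.py` actually computes them is not shown in this notebook, so treat this as an assumption: Precision@k is taken as the overlap between approximate results and an exact ground-truth set, and p95 uses the nearest-rank definition. The function names and sample data are hypothetical.

```python
import math


def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved IDs that appear in the relevant set.

    The denominator is always k, so returning fewer than k results
    lowers the score.
    """
    top_k = retrieved[:k]
    hits = sum(1 for item_id in top_k if item_id in relevant)
    return hits / k


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: approximate search returns IDs 0..9,
# while exact search says IDs 1..10 are the true neighbors.
approx_ids = list(range(10))
exact_ids = set(range(1, 11))
print(precision_at_k(approx_ids, exact_ids, k=10))

# Hypothetical latency samples of 1ms through 20ms.
latencies_ms = [float(i) for i in range(1, 21)]
print(percentile_nearest_rank(latencies_ms, 95))
```

Reporting p95 rather than the mean keeps a handful of slow queries visible, which matters more for user-facing search than average speed.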
    