# Phase 4: Educational Walkthrough

This notebook walks through the new features added in Phase 4 of FeedPrism:
- Multi‑Collection Architecture (already implemented in Module 4.1)
- **Named Vectors** (Module 4.2)
- **Grouping API for Deduplication** (Module 4.3)

We will explore the code, run ingestion, and evaluate the impact on search quality and performance.

In [16]:
# Imports
import uuid, time
from datetime import datetime, timedelta
from app.services.embedder import EmbeddingService
from app.services.deduplicator import DeduplicationService
from app.database.qdrant_client import QdrantService
from app.utils.sparse_vector import create_sparse_vector
from qdrant_client.models import PointStruct, SparseVector

# Initialize services
embedder = EmbeddingService()
deduplicator = DeduplicationService()
qdrant = QdrantService()

# Recreate all collections with named vectors
# (recreate=True presumably drops existing collections first, so use it only on scratch data)
qdrant.create_all_collections(recreate=True)


2025-11-27 16:59:36.855 | INFO     | app.services.embedder:__init__:27 - Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
2025-11-27 16:59:40.579 | SUCCESS  | app.services.embedder:__init__:29 - Model loaded: 384D vectors
2025-11-27 16:59:40.580 | INFO     | app.services.embedder:__init__:27 - Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
2025-11-27 16:59:44.167 | SUCCESS  | app.services.embedder:__init__:29 - Model loaded: 384D vectors
2025-11-27 16:59:44.173 | INFO     | app.database.qdrant_client:__init__:64 - Connecting to Qdrant: localhost:6333
2025-11-27 16:59:44.193 | SUCCESS  | app.database.qdrant_client:__init__:66 - ...

## 1️⃣ Named Vectors Overview

The `EmbeddingService` now provides a helper `create_named_vectors` that returns three vectors:
- **title** – embedding of the title alone, suited to short queries that closely match a title
- **description** – semantic description vector
- **full_text** – full‑content vector for deep similarity

These vectors are stored in Qdrant under the same point ID, enabling us to query any of them independently.
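To make "query any of them independently" concrete, here is a minimal pure-Python sketch of the idea. It is not how `QdrantService` or Qdrant itself is implemented: the toy 3-D vectors, the `points` list, and the `search` helper are all made up for illustration and stand in for the 384-D embeddings and the real search call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Each point carries several vectors under one ID, keyed by name.
# The values are invented toy data, not real embeddings.
points = [
    {"id": "p1", "vector": {"title": [1.0, 0.0, 0.0],
                            "description": [0.0, 1.0, 0.0],
                            "full_text": [0.6, 0.8, 0.0]}},
    {"id": "p2", "vector": {"title": [0.0, 1.0, 0.0],
                            "description": [1.0, 0.0, 0.0],
                            "full_text": [0.8, 0.6, 0.0]}},
]

def search(query_vec, vector_name, limit=5):
    """Rank points by similarity between the query and ONE named vector."""
    scored = [(cosine(query_vec, p["vector"][vector_name]), p["id"]) for p in points]
    scored.sort(reverse=True)
    return [(pid, round(score, 3)) for score, pid in scored[:limit]]

query = [1.0, 0.0, 0.0]
by_title = search(query, "title")
by_description = search(query, "description")
print(by_title)
print(by_description)
```

The same query ranks the two points differently depending on which named vector it is compared against, which is the property the rest of this notebook relies on.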

In [17]:
# Example: generate named vectors for a sample event
title = "Machine Learning Workshop"
description = "Deep dive into transformer models and applications."
full_text = title + " " + description + " Hosted by AI Labs."
named_vectors = embedder.create_named_vectors(title, description, full_text)
print(named_vectors)


{'title': [-0.08137575536966324, -0.02513747848570347, 0.03720090910792351, 0.0005046928417868912, 0.03893096745014191, 0.006677915807813406, -0.004934042226523161, -0.08482816070318222, -0.11387281864881516, -0.037183407694101334, -0.036710403859615326, -0.04430778697133064, -0.004945800639688969, -0.057138681411743164, -0.09670320153236389, 0.04357052594423294, -0.005433851853013039, 0.01536980178207159, -0.02133093774318695, -0.09026583284139633, -0.0304908137768507, 0.0037344140000641346, -0.03835592046380043, 0.00478157727047801, 0.038700781762599945, -0.008914574980735779, 0.049543820321559906, -0.0006065599736757576, 0.039620984345674515, -0.052111733704805374, 0.009500845335423946, 0.010326970368623734, 0.0690472200512886, 0.04522533714771271, -0.022263797000050545, 0.028724469244480133, -0.03295515850186348, -0.016104714944958687, -0.005519085098057985, 0.002357347169891, -0.061863016337156296, -0.0560382641851902, 0.02961471676826477, 0.00298119499348104, 0.0764501765370369, 

## 2️⃣ Ingesting Data with Named Vectors and Deduplication

We upsert three sample points (one event, one course, one blog) using the new `upsert_by_type` method.
Each point's payload includes a `canonical_item_id` generated by `DeduplicationService` and an `is_duplicate` flag.
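The notebook does not show how `compute_canonical_id` derives its value. One plausible approach, sketched below purely as an assumption, is to hash a normalized title together with the item type so that trivially different spellings map to the same ID. The function name `canonical_id_sketch`, the normalization rule, and the 16-character truncation are all hypothetical.

```python
import hashlib
import re

def canonical_id_sketch(title: str, item_type: str) -> str:
    """Hypothetical stand-in for compute_canonical_id: hash of type + normalized title."""
    # Lowercase and collapse every run of non-alphanumeric characters to one space.
    normalized = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    key = f"{item_type}:{normalized}"
    # Truncating the hex digest is an arbitrary choice made for readability.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

a = canonical_id_sketch("ML Workshop", "event")
b = canonical_id_sketch("  ml   workshop! ", "event")
c = canonical_id_sketch("ML Workshop", "course")
print(a == b)  # same title after normalization, same type
print(a == c)  # same title, different type
```

Including the type in the hashed key keeps an event and a course with the same title from being treated as duplicates of each other.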

In [18]:
def upsert_sample_data():
    points_by_type = {
        "events": [],
        "courses": [],
        "blogs": []
    }
    # Sample items
    samples = [
        {
            "type": "event",
            "title": "ML Workshop",
            "description": "Learn transformers.",
            "full_text": "ML Workshop details...",
            "date": (datetime.now() + timedelta(days=5)).isoformat()
        },
        {
            "type": "course",
            "title": "Python Basics",
            "description": "Intro to Python programming.",
            "full_text": "Python Basics full content...",
            "date": (datetime.now() + timedelta(days=10)).isoformat()
        },
        {
            "type": "blog",
            "title": "Sparse Vectors Explained",
            "description": "Understanding BM25 style vectors.",
            "full_text": "Sparse Vectors Explained full article...",
            "date": (datetime.now() - timedelta(days=2)).isoformat()
        }
    ]
    for item in samples:
        vectors = embedder.create_named_vectors(item["title"], item["description"], item["full_text"])
        canonical_id = deduplicator.compute_canonical_id(item["title"], item["type"])
        point = PointStruct(
            id=str(uuid.uuid4()),
            vector=vectors,
            payload={
                "content_type": item["type"] + "s",
                "title": item["title"],
                "description": item["description"],
                "canonical_item_id": canonical_id,
                "is_duplicate": False,
                "date": datetime.fromisoformat(item["date"]).timestamp(),
                "date_iso": item["date"]
            }
        )
        points_by_type[item["type"] + "s"].append(point)
    # Upsert per collection
    for ctype, pts in points_by_type.items():
        if pts:
            qdrant.upsert_by_type(ctype, pts)

upsert_sample_data()
print("✅ Sample data upserted.")


2025-11-27 17:00:09.004 | INFO     | app.database.qdrant_client:upsert_by_type:126 - Upserting 1 points to feedprism_events
2025-11-27 17:00:09.032 | SUCCESS  | app.database.qdrant_client:upsert_by_type:133 - Upserted 1 points to feedprism_events
2025-11-27 17:00:09.032 | INFO     | app.database.qdrant_client:upsert_by_type:126 - Upserting 1 points to feedprism_courses
2025-11-27 17:00:09.043 | SUCCESS  | app.database.qdrant_client:upsert_by_type:133 - Upserted 1 points to feedprism_courses
2025-11-27 17:00:09.044 | INFO     | app.database.qdrant_client:upsert_by_type:126 - Upserting 1 points to feedprism_blogs
2025-11-27 17:00:09.054 | SUCCESS  | app.database.qdrant_client:u...

✅ Sample data upserted.


## 3️⃣ Searching with Named Vectors

We can now query any of the three vectors. Below, the same query embedding is compared first against the `title` vector and then against the `full_text` vector of each event.

In [19]:
query = "transformers workshop"
# The query is embedded once; only the named vector it is compared against differs.
title_vec = embedder.embed_text(query)
full_vec = title_vec
# Title search (compares the query against the title vector only)
title_results = qdrant.search(title_vec, content_type="events", vector_name="title", limit=5)
print("--- Title Search Results---")
for r in title_results:
    print(r["payload"]["title"])
# Full‑text search (semantic)
full_results = qdrant.search(full_vec, content_type="events", vector_name="full_text", limit=5)
print("--- Full‑Text Search Results---")
for r in full_results:
    print(r["payload"]["title"])


2025-11-27 17:00:28.199 | INFO     | app.database.qdrant_client:search:169 - Searching feedprism_events using title (limit=5)
2025-11-27 17:00:28.220 | INFO     | app.database.qdrant_client:search:169 - Searching feedprism_events using full_text (limit=5)


--- Title Search Results---
ML Workshop
--- Full‑Text Search Results---
ML Workshop


## 4️⃣ Deduplication via Grouping API

We insert a second event with the same title and then call `search_with_grouping`, which should collapse both points into a single group keyed by their shared `canonical_item_id`.
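Qdrant performs grouping on the server, but the effect can be sketched client-side in a few lines. The snippet below is an illustration only and not the implementation of `search_with_grouping`: the `group_hits` helper and the sample hits and scores are invented, and `source_count` simply mirrors the field name printed in the cell that follows.

```python
from collections import defaultdict

def group_hits(hits, group_key="canonical_item_id", limit=5):
    """Collapse scored hits into one entry per group, keeping the best-scoring hit."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["payload"][group_key]].append(hit)

    grouped = []
    for key, members in groups.items():
        best = max(members, key=lambda h: h["score"])
        grouped.append({
            "group": key,
            "source_count": len(members),  # how many points share this key
            "score": best["score"],
            "payload": best["payload"],
        })
    # Order groups by their best hit, highest score first.
    grouped.sort(key=lambda g: g["score"], reverse=True)
    return grouped[:limit]

# Made-up search hits: two points share one canonical ID.
hits = [
    {"score": 0.91, "payload": {"title": "ML Workshop", "canonical_item_id": "abc"}},
    {"score": 0.88, "payload": {"title": "ML Workshop", "canonical_item_id": "abc"}},
    {"score": 0.40, "payload": {"title": "Python Meetup", "canonical_item_id": "xyz"}},
]
grouped = group_hits(hits)
for g in grouped:
    print(g["group"], g["source_count"], g["payload"]["title"])
```

Grouping on the server has the practical advantage that `limit` counts groups rather than raw points, so duplicates cannot crowd distinct items out of the result page.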

In [20]:
# Insert duplicate event
duplicate = {
    "type": "event",
    "title": "ML Workshop",
    "description": "Another description for same workshop.",
    "full_text": "Duplicate event full content...",
    "date": (datetime.now() + timedelta(days=6)).isoformat()
}
vectors = embedder.create_named_vectors(duplicate["title"], duplicate["description"], duplicate["full_text"])
canonical_id = deduplicator.compute_canonical_id(duplicate["title"], duplicate["type"])
point = PointStruct(
    id=str(uuid.uuid4()),
    vector=vectors,
    payload={
        "content_type": duplicate["type"] + "s",
        "title": duplicate["title"],
        "description": duplicate["description"],
        "canonical_item_id": canonical_id,
        "is_duplicate": True,
        "date": datetime.fromisoformat(duplicate["date"]).timestamp(),
        "date_iso": duplicate["date"]
    }
)
qdrant.upsert_by_type("events", [point])
print("✅ Duplicate inserted.")
# Grouped search
grouped = qdrant.search_with_grouping(title_vec, content_type="events", vector_name="title", limit=5)
print("--- Grouped Search Results---")
for g in grouped:
    print(f"Group ID: {g['id']} | Count: {g['source_count']} | Title: {g['payload']['title']}")


2025-11-27 17:00:39.002 | INFO     | app.database.qdrant_client:upsert_by_type:126 - Upserting 1 points to feedprism_events
2025-11-27 17:00:39.030 | SUCCESS  | app.database.qdrant_client:upsert_by_type:133 - Upserted 1 points to feedprism_events


✅ Duplicate inserted.
--- Grouped Search Results---
Group ID: e0685617-03bf-4b7b-bfd8-fb661e74b290 | Count: 2 | Title: ML Workshop


## 5️⃣ Evaluation Metrics

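We can now compare searching by `title` against searching by `full_text` and observe how many duplicates an ungrouped search returns. The cell below does not compute recall or precision, which would require labeled relevance judgments; as a demonstration it prints the number of results and the number of duplicate `canonical_item_id` values among them.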
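Measuring precision and recall needs a set of items known to be relevant for each query, which this notebook does not include. As a minimal sketch under that assumption, the helpers below compute both metrics over canonical IDs and count duplicates in the same way as the evaluation cell. The function names and the sample IDs are invented for illustration.

```python
def precision_recall(result_ids, relevant_ids):
    """Precision and recall of a result list against a set of relevant IDs."""
    retrieved = set(result_ids)   # duplicates count once
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def duplicate_count(canonical_ids):
    """Number of results that repeat an already-seen canonical ID."""
    return len(canonical_ids) - len(set(canonical_ids))

# Made-up example: canonical IDs returned by a search, and a hand-labeled relevant set.
results = ["abc", "abc", "xyz"]
relevant = {"abc", "def"}

p, r = precision_recall(results, relevant)
print(f"precision={p:.2f} recall={r:.2f} duplicates={duplicate_count(results)}")
```

Computing the metrics over canonical IDs rather than point IDs means a duplicate neither inflates precision nor counts as an extra relevant hit.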

In [21]:
# Simple evaluation
def evaluate(search_fn, vec, ctype, name):
    results = search_fn(vec, content_type=ctype, vector_name=name, limit=10)
    print(f"{name} search returned {len(results)} results.")
    # Count duplicates based on canonical_item_id
    canon_ids = [r['payload'].get('canonical_item_id') for r in results]
    dup_count = len(canon_ids) - len(set(canon_ids))
    print(f"Duplicates in results: {dup_count}")

evaluate(qdrant.search, title_vec, "events", "title")
evaluate(qdrant.search, full_vec, "events", "full_text")


2025-11-27 17:01:59.318 | INFO     | app.database.qdrant_client:search:169 - Searching feedprism_events using title (limit=10)
2025-11-27 17:01:59.376 | INFO     | app.database.qdrant_client:search:169 - Searching feedprism_events using full_text (limit=10)


title search returned 2 results.
Duplicates in results: 1
full_text search returned 2 results.
Duplicates in results: 1


---
**Congratulations!** You have now explored the Phase 4 enhancements: named vectors for richer search semantics and the grouping API for deduplication. Feel free to modify the sample data, experiment with different queries, and integrate these patterns into your production pipelines.