# Advanced Vector Store Queries

This notebook demonstrates detailed querying of a Qdrant vector store organized as a **triple-collection architecture**:
- **`resume_data` collection** (from `resume_ale.md`): work experience, education, skills, continuing studies, personal info
- **`personality` collection** (from `personalities_16.md`): personality traits with fixed-size chunking
- **`projects` collection** (from `portfolio_projects.md`): portfolio projects with hierarchical chunking

**Architecture Benefits:**
- Semantic separation of resume facts, personality traits, and portfolio projects
- Hierarchical chunking for projects (technical summaries + full content)
- Faster queries with focused, smaller collections
- No cross-contamination between different data types

We'll explore:
1. Collection metadata and structure (ALL collections)
2. Filtering by section type within collections
3. Viewing embeddings and payloads
4. Projects collection hierarchical querying (NEW)
5. Semantic search examples (per collection)
6. Complete RAG workflow with all three collections

## 1. Initialize Vector Store Connection

In [1]:
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from pathlib import Path
import json

# Initialize OpenAI embeddings for semantic search
import sys
sys.path.append('..')
from src.core.embeddings import OpenAIEmbeddings

embedder = OpenAIEmbeddings()
print("✅ Embedder initialized for semantic search queries")

# Initialize Qdrant client with local storage
storage_path = "../vector_db/qdrant_storage"
client = QdrantClient(path=storage_path)

# Collection names (triple-collection architecture)
resume_collection = "resume_data"
personality_collection = "personality"
projects_collection = "projects"

print("✅ Connected to Qdrant vector store")
print(f"📂 Storage path: {Path(storage_path).absolute()}")
print(f"\n📦 Collections:")
print(f"   - {resume_collection} (resume content)")
print(f"   - {personality_collection} (personality traits)")
print(f"   - {projects_collection} (portfolio projects)")

✅ Embedder initialized for semantic search queries
✅ Connected to Qdrant vector store
📂 Storage path: c:\Users\Ale\Documents\Data-Projects\GitHub\Resume_Claude_SDK_Agent\notebooks\..\vector_db\qdrant_storage

📦 Collections:
   - resume_data (resume content)
   - personality (personality traits)
   - projects (portfolio projects)


## 2. Explore Collection Structure (Including Projects)

In [2]:
# Get all collections
collections = client.get_collections()
print("📚 Available Collections:")
for collection in collections.collections:
    print(f"   - {collection.name}")

print("\n" + "="*80)

# Explore ALL collections
for collection_name in [resume_collection, personality_collection, projects_collection]:
    if client.collection_exists(collection_name):
        collection_info = client.get_collection(collection_name)
        
        print(f"\n📊 Collection '{collection_name}' Details:")
        print(f"   Total documents: {collection_info.points_count}")
        print(f"   Vector dimensions: {collection_info.config.params.vectors.size}")
        print(f"   Distance metric: {collection_info.config.params.vectors.distance}")
        print(f"   Status: {collection_info.status}")
        
        # Count by section type
        from collections import Counter
        all_records, _ = client.scroll(
            collection_name=collection_name,
            limit=1000,
            with_payload=True,
            with_vectors=False
        )
        
        section_counts = Counter(r.payload.get('section_type', 'unknown') for r in all_records)
        
        print(f"\n   📈 Documents by Section Type:")
        for section, count in sorted(section_counts.items()):
            print(f"      {section:20s}: {count:3d} chunks")
        
        print("   " + "-"*76)
    else:
        print(f"\n❌ Collection '{collection_name}' not found")

print("\n" + "="*80)

📚 Available Collections:
   - resume_data
   - personality
   - projects


📊 Collection 'resume_data' Details:
   Total documents: 35
   Vector dimensions: 1536
   Distance metric: Cosine
   Status: green

   📈 Documents by Section Type:
      continuing_studies  :   7 chunks
      education           :   2 chunks
      personal_info       :   1 chunks
      professional_summary:   1 chunks
      skills              :   5 chunks
      work_experience     :  19 chunks
   ----------------------------------------------------------------------------

📊 Collection 'personality' Details:
   Total documents: 8
   Vector dimensions: 1536
   Distance metric: Cosine
   Status: green

   📈 Documents by Section Type:
                          :   8 chunks
   ----------------------------------------------------------------------------

📊 Collection 'projects' Details:
   Total documents: 4
   Vector dimensions: 1536
   Distance metric: Cosine
   Status: green

   📈 Documents by

## 3. Query Resume Data (from resume_ale.md)

### 3.1 View Work Experience with Full Metadata

In [3]:
# Filter for work experience entries (from resume_data collection)
work_filter = Filter(
    must=[
        FieldCondition(
            key="section_type",
            match=MatchValue(value="work_experience")
        )
    ]
)

work_records, _ = client.scroll(
    collection_name=resume_collection,  # Query resume_data collection
    scroll_filter=work_filter,
    limit=20,
    with_payload=True,
    with_vectors=False  # Set True to see embeddings
)

print(f"💼 Work Experience Chunks from '{resume_collection}' collection (showing {len(work_records)}):\n")

for i, record in enumerate(work_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"CHUNK {i} - ID: {record.id}")
    print(f"{'='*80}")
    print(f"📄 Content (Achievement):")
    print(f"   {payload.get('content', 'N/A')}")
    print(f"\n🏢 Metadata:")
    print(f"   Company:        {metadata.get('company', 'N/A')}")
    print(f"   Position:       {metadata.get('position', 'N/A')}")
    print(f"   Start Date:     {metadata.get('start_date', 'N/A')}")
    print(f"   End Date:       {metadata.get('end_date', 'N/A')}")
    print(f"   Source File:    {payload.get('source_file', 'N/A')}")
    print(f"   Section Type:   {payload.get('section_type', 'N/A')}")
    print()

💼 Work Experience Chunks from 'resume_data' collection (showing 19):

CHUNK 1 - ID: 0b2b312a-3c2d-4a6d-a6af-cad1bc2a98bc
📄 Content (Achievement):
   Quality Assurance Technician: Coordinated supply chain and production teams to ensure food safety compliance by leading cross-functional meetings and implementing compliance checks, maintaining operational continuity.

🏢 Metadata:
   Company:        The Very Good Food Company
   Position:       Quality Assurance Technician
   Start Date:     February-2021
   End Date:       February-2022
   Source File:    resume_ale.md
   Section Type:   work_experience

CHUNK 2 - ID: 120e8829-f4e5-4f0e-aba9-00c0632d6772
📄 Content (Achievement):
   Data Scientist II: Developed a Power BI dashboard to track changes in imported food volumes, collaborating with import inspectors to define metrics and design visualizations in Power BI for stakeholder use.

🏢 Metadata:
   Company:        Canadian Food Inspection Agency
   Position:       Data Sc
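
The `scroll` calls in this notebook pass a fixed `limit` and discard the second return value (`_`), which is the offset of the next page. That is enough for the small collections shown here. For a larger collection, one option is to keep scrolling until that offset comes back as `None`. The sketch below assumes only that `client.scroll(...)` returns `(records, next_page_offset)`; `scroll_all` and `count_section_types` are hypothetical helper names, not functions from this repository.

```python
from collections import Counter


def scroll_all(client, collection_name, page_size=256):
    """Yield every record in a collection by following scroll offsets.

    Hypothetical helper for illustration: it relies only on
    `client.scroll(...)` returning `(records, next_page_offset)`.
    """
    offset = None
    while True:
        records, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        yield from records
        if offset is None:  # no further pages
            break


def count_section_types(client, collection_name):
    """Count records per `section_type`, labelling missing values 'unknown'."""
    return Counter(
        (record.payload or {}).get("section_type", "unknown")
        for record in scroll_all(client, collection_name)
    )
```

With the client from Section 1, `count_section_types(client, resume_collection)` would produce the same kind of per-section tally as the collection summary, without depending on a single page being large enough.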

### 3.2 View Work Experience WITH Embeddings

Each chunk has a 1536-dimensional embedding vector generated by OpenAI's `text-embedding-3-small` model.

In [4]:
# Get one work experience record WITH embeddings
work_with_vector, _ = client.scroll(
    collection_name=resume_collection,
    scroll_filter=work_filter,
    limit=1,  # Only the first record is inspected below
    with_payload=True,
    with_vectors=True  # Include embeddings
)

if work_with_vector:
    record = work_with_vector[0]
    vector = record.vector
    
    print(f"🔢 Embedding Vector Details:")
    print(f"   Vector dimensions: {len(vector)}")
    print(f"   Vector type: {type(vector)}")
    print(f"   First 10 values: {vector[:10]}")
    print(f"   Last 10 values:  {vector[-10:]}")
    print(f"\n📊 Vector Statistics:")
    import numpy as np
    vector_array = np.array(vector)
    print(f"   Min value:  {vector_array.min():.6f}")
    print(f"   Max value:  {vector_array.max():.6f}")
    print(f"   Mean value: {vector_array.mean():.6f}")
    print(f"   Std dev:    {vector_array.std():.6f}")
    
    print(f"\n📄 Associated Content:")
    print(f"   {record.payload.get('content', 'N/A')[:150]}...")

🔢 Embedding Vector Details:
   Vector dimensions: 1536
   Vector type: <class 'list'>
   First 10 values: [-0.019191697239875793, 0.011832953430712223, 0.06691659241914749, 0.00033292826265096664, -0.016301382333040237, 0.029689325019717216, 0.020486559718847275, 0.019688831642270088, 0.03362015634775162, 0.01056121475994587]
   Last 10 values:  [0.024139918386936188, 0.00474011804908514, -0.013596045784652233, -0.0012016488471999764, -0.01553833857178688, -0.006566797848790884, -0.006665068678557873, 0.014879346825182438, 0.003659140085801482, -0.0037920945324003696]

📊 Vector Statistics:
   Min value:  -0.097485
   Max value:  0.109878
   Mean value: 0.000860
   Std dev:    0.025501

📄 Associated Content:
   Quality Assurance Technician: Coordinated supply chain and production teams to ensure food safety compliance by leading cross-functional meetings and ...
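
The printed statistics allow a quick sanity check on the vector's length. For a vector with n components, the sum of squares equals n(σ² + μ²) when σ is the population standard deviation (NumPy's default for `.std()`), so the Euclidean norm can be estimated from the mean and standard deviation alone. The sketch below plugs in the rounded values from the output above; the result comes out very close to 1, which is consistent with a unit-length embedding, in which case cosine similarity reduces to a plain dot product.

```python
import math

# Values copied from the printed output above (rounded to 6 decimals).
n = 1536
mean = 0.000860
std = 0.025501  # population standard deviation (NumPy's default, ddof=0)

# sum(x_i^2) = n * (variance + mean^2), so the L2 norm is its square root.
sum_of_squares = n * (std**2 + mean**2)
l2_norm = math.sqrt(sum_of_squares)

print(f"Estimated L2 norm: {l2_norm:.4f}")
```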


### 3.3 View Education and Skills Entries

In [5]:
# Query education entries (from resume_data collection)
education_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="education"))]
)

education_records, _ = client.scroll(
    collection_name=resume_collection,  # Query resume_data collection
    scroll_filter=education_filter,
    limit=20,
    with_payload=True
)

print(f"🎓 Education Entries from '{resume_collection}' collection ({len(education_records)}):\n")
for i, record in enumerate(education_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"EDUCATION CHUNK {i}")
    print(f"{'='*80}")
    print(f"📝 Degree:        {metadata.get('degree', 'N/A')}")
    print(f"🏫 Institution:   {metadata.get('institution', 'N/A')}")
    # Education metadata stores the date range under 'dates' (there is no 'year' key)
    print(f"📅 Dates:         {metadata.get('dates', 'N/A')}")
    print(f"📂 Source File:   {payload.get('source_file', 'N/A')}")
    print(f"🏷️  Section Type:  {payload.get('section_type', 'N/A')}")
    print(f"\n📄 Content:\n   {payload.get('content', 'N/A')}")
    print(f"\n🔍 Full Metadata: {json.dumps(metadata, indent=2)}")
    print()

# Query skills (from resume_data collection)
skills_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="skills"))]
)

skills_records, _ = client.scroll(
    collection_name=resume_collection,
    scroll_filter=skills_filter,
    limit=20,
    with_payload=True
)

print(f"\n🛠️  Skills Entries from '{resume_collection}' collection ({len(skills_records)}):\n")
for i, record in enumerate(skills_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"SKILL CHUNK {i}")
    print(f"{'='*80}")
    print(f"📂 Category:      {metadata.get('category', 'N/A')}")
    print(f"📄 Skills:        {payload.get('content', 'N/A')}")
    print(f"📁 Source File:   {payload.get('source_file', 'N/A')}")
    print(f"🏷️  Section Type:  {payload.get('section_type', 'N/A')}")
    print(f"\n🔍 Full Metadata: {json.dumps(metadata, indent=2)}")
    print()

🎓 Education Entries from 'resume_data' collection (2):

EDUCATION CHUNK 1
📝 Degree:        BSc in Biotechnology Engineering
🏫 Institution:   Tec de Monterrey
📅 Dates:         August-2012 - May-2017 | Mexico
📂 Source File:   resume_ale.md
🏷️  Section Type:  education

📄 Content:
   BSc in Biotechnology Engineering from Tec de Monterrey. August-2012 - May-2017 | Mexico

🔍 Full Metadata: {
  "degree": "BSc in Biotechnology Engineering",
  "institution": "Tec de Monterrey",
  "dates": "August-2012 - May-2017 | Mexico"
}

EDUCATION CHUNK 2
📝 Degree:        MSc in Food Science
🏫 Institution:   University of British Columbia
📅 Dates:         January-2019 - October-2020 | Canada
📂 Source File:   resume_ale.md
🏷️  Section Type:  education

📄 Content:
   MSc in Food Science from University of British Columbia. January-2019 - October-2020 | Canada

🔍 Full Metadata: {
  "degree": "MSc in Food Science",
  "institution": "University of British Columbia",
  "dates": "January-2019 - October-2020 | Canada"
}

## 4. Query Personality Traits Data (from personalities_16.md)

### 4.1 View Personality Sections

In [6]:
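# Personality chunks are stored with an empty section_type (see the collection summary in Section 2)
personality_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value=""))]
)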

personality_records, _ = client.scroll(
    collection_name=personality_collection,
    scroll_filter=personality_filter,
    limit=20,
    with_payload=True
)

print(f"🧠 Personality Trait Chunks from '{personality_collection}' collection ({len(personality_records)}):\n")
print(f"💡 Note: This collection contains ONLY personality data with simplified fixed-size chunking\n")

for i, record in enumerate(personality_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"PERSONALITY CHUNK {i}")
    print(f"{'='*80}")
    print(f"📝 Chunk Index: {metadata.get('chunk_index', 'N/A')}")
    print(f"📂 Source File: {payload.get('source_file', 'N/A')}")
    print(f"📏 Character Range: {metadata.get('char_start', 'N/A')} - {metadata.get('char_end', 'N/A')}")
    print(f"\n📄 Content:\n   {payload.get('content', 'N/A')}")
    print(f"\n🔍 Full Metadata: {json.dumps(metadata, indent=2)}")
    print()

🧠 Personality Trait Chunks from 'personality' collection (8):

💡 Note: This collection contains ONLY personality data with simplified fixed-size chunking

PERSONALITY CHUNK 1
📝 Chunk Index: 3
📂 Source File: personalities_16.md
📏 Character Range: 900 - 1300

📄 Content:
   also attending to crucial details, which makes me a valuable asset in any organization.

## Strengths
### Innovative Mindset
My ability to see possibilities others overlook often helps me find smarter solutions and effective improvements at work.

### Independent Worker
My talent for working productively on my own allows me to manage tasks effectively without the need for constant direction or sup

🔍 Full Metadata: {
  "chunk_index": 3,
  "char_start": 900,
  "char_end": 1300,
  "overlap_chars": 100
}

PERSONALITY CHUNK 2
📝 Chunk Index: 0
📂 Source File: personalities_16.md
📏 Character Range: 0 - 400

📄 Content:
   # Personality Traits
I'm very analytical, highly curious and constantly s

In [7]:
# Complete workflow simulation
print(f"{'='*80}")
print("COMPLETE PROJECTS RETRIEVAL WORKFLOW")
print(f"{'='*80}\n")

job_tech_requirements = "data visualization matplotlib seaborn statistical analysis"

print(f"📋 Job Requirements: {job_tech_requirements}\n")

# PHASE 1: Search technical summaries
print(f"{'='*80}")
print("PHASE 1: Search Technical Summaries")
print(f"{'='*80}\n")

query_vector = embedder.embed_query(job_tech_requirements)

tech_results = client.query_points(
    collection_name=projects_collection,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="section_type", match=MatchValue(value="project_technical"))]
    ),
    limit=3,
    score_threshold=0.3  # Minimum similarity
).points

print(f"✅ Found {len(tech_results)} matching projects:\n")

matched_project_ids = []
for i, result in enumerate(tech_results, 1):
    metadata = result.payload.get('metadata', {})
    project_id = metadata.get('project_id')
    matched_project_ids.append(project_id)
    
    print(f"{i}. [{result.score:.3f}] {metadata.get('project_title', 'N/A')}")
    print(f"   ID: {project_id}")
    print(f"   Tech: {', '.join(metadata.get('tech_stack', []))}\n")

# PHASE 2: Retrieve full content for matched projects
print(f"\n{'='*80}")
print("PHASE 2: Retrieve Full Project Content")
print(f"{'='*80}\n")

print(f"🔍 Retrieving full content for {len(matched_project_ids)} projects...\n")

for project_id in matched_project_ids:
    # Filter for full content of this project
    full_filter = Filter(
        must=[
            FieldCondition(key="metadata.project_id", match=MatchValue(value=project_id)),
            FieldCondition(key="section_type", match=MatchValue(value="project_full"))
        ]
    )
    
    full_results, _ = client.scroll(
        collection_name=projects_collection,
        scroll_filter=full_filter,
        limit=1,
        with_payload=True,
        with_vectors=False
    )
    
    if full_results:
        record = full_results[0]
        payload = record.payload
        metadata = payload.get('metadata', {})
        
        print(f"{'='*80}")
        print(f"PROJECT: {metadata.get('project_title', 'N/A')}")
        print(f"{'='*80}")
        print(f"🆔 ID: {project_id}")
        print(f"💻 Tech Stack: {', '.join(metadata.get('tech_stack', []))}")
        print(f"🔗 URL: {metadata.get('project_url', 'N/A')}")
        print(f"\n📄 Full Content (first 600 chars):")
        print(payload.get('content', 'N/A')[:600])
        print("...\n")

print(f"\n{'='*80}")
print("✅ WORKFLOW COMPLETE")
print(f"{'='*80}")
print("\n💡 Benefits of Hierarchical Chunking:")
print("   ✓ Fast initial matching using technical summaries")
print("   ✓ Retrieve full context only for relevant projects")
print("   ✓ Efficient token usage (don't embed full content for initial search)")
print("   ✓ Clear separation of filtering vs. detailed context")

COMPLETE PROJECTS RETRIEVAL WORKFLOW

📋 Job Requirements: data visualization matplotlib seaborn statistical analysis

PHASE 1: Search Technical Summaries

✅ Found 2 matching projects:

1. [0.352] Renew Amazon Prime? A Cost-Benefit Analysis
   ID: project_0
   Tech: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export)

2. [0.334] SQL Database of Save-On-Foods products extracted using API\
   ID: project_1
   Tech: Python, Jupyter Notebook, pandas, requests, numpy, json, SQLAlchemy, SQLite, Postman (for request prototyping), and Git


PHASE 2: Retrieve Full Project Content

🔍 Retrieving full content for 2 projects...

PROJECT: Renew Amazon Prime? A Cost-Benefit Analysis
🆔 ID: project_0
💻 Tech Stack: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export)
🔗 URL: https://github.com/aleivaar94/Renew-Amazon-Prime-2022

📄 Full Content (first 600 chars):
# Renew Amazon Prime? A Cost-Benefit Ana

### 5.6 Complete Projects Retrieval Workflow

Demonstrates the two-phase retrieval:
1. **Phase 1**: Search `project_technical` for matching projects
2. **Phase 2**: Retrieve `project_full` content using project_ids

In [8]:
# Simulate job requirements
job_requirements = "Python data visualization pandas matplotlib plotly"

print(f"🔍 Semantic Search: Finding projects matching job requirements")
print(f"{'='*80}\n")
print(f"Query: '{job_requirements}'")
print(f"Collection: {projects_collection}")
print(f"Target: project_technical (for fast matching)\n")

# Generate query embedding
query_vector = embedder.embed_query(job_requirements)

# Search technical summaries first (faster, focused)
tech_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="project_technical"))]
)

tech_results = client.query_points(
    collection_name=projects_collection,
    query=query_vector,
    query_filter=tech_filter,
    limit=5
).points

print(f"✅ Top {len(tech_results)} Matching Projects (by technical summary):\n")

for i, result in enumerate(tech_results, 1):
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"MATCH {i} - Similarity Score: {result.score:.4f}")
    print(f"{'='*80}")
    print(f"📦 Project: {metadata.get('project_title', 'N/A')}")
    print(f"🆔 ID: {metadata.get('project_id', 'N/A')}")
    print(f"💻 Tech Stack: {', '.join(metadata.get('tech_stack', []))}")
    print(f"🔗 URL: {metadata.get('project_url', 'N/A')}")
    print(f"\n📄 Technical Summary (first 300 chars):")
    print(f"   {payload.get('content', 'N/A')[:300]}...")
    print()

print("\n💡 Workflow: Search technical summaries → Get project_id → Retrieve full content using metadata filter")

🔍 Semantic Search: Finding projects matching job requirements

Query: 'Python data visualization pandas matplotlib plotly'
Collection: projects
Target: project_technical (for fast matching)

✅ Top 2 Matching Projects (by technical summary):

MATCH 1 - Similarity Score: 0.3120
📦 Project: SQL Database of Save-On-Foods products extracted using API\
🆔 ID: project_1
💻 Tech Stack: Python, Jupyter Notebook, pandas, requests, numpy, json, SQLAlchemy, SQLite, Postman (for request prototyping), and Git
🔗 URL: https://github.com/aleivaar94/SQL-Database-of-Save-On-Foods-Products-Extracted-Using-API/blob/master/images/save-on-foods-logo.png

📄 Technical Summary (first 300 chars):
   Project: SQL Database of Save-On-Foods products extracted using API\

Technologies: Python, Jupyter Notebook, pandas, requests, numpy, json, SQLAlchemy, SQLite, Postman (for request prototyping), and Git.

Technical Work:
The solution identifies the vendor API via browser developer tools, converts t.

### 5.5 Semantic Search: Find Projects by Technical Requirements

Search for projects matching specific technologies or technical work.
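
Conceptually, the vector searches in this notebook score stored vectors against the query vector by cosine similarity, drop scores below `score_threshold`, and return the best `limit` matches. The toy sketch below reproduces that idea by brute force on three-dimensional vectors. It is an illustration only: Qdrant uses an index rather than scanning every point, and `cosine_similarity`, `top_k`, and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, limit=5, score_threshold=None):
    """Rank (doc_id, vector) pairs by similarity to `query`, best first."""
    scored = [(doc_id, cosine_similarity(query, vector)) for doc_id, vector in documents]
    if score_threshold is not None:
        # Drop anything scoring below the threshold before ranking.
        scored = [(doc_id, score) for doc_id, score in scored if score >= score_threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Made-up 3-d "embeddings" standing in for the 1536-d vectors in the collections.
documents = [
    ("python_etl", [0.9, 0.1, 0.0]),
    ("food_safety", [0.0, 0.2, 0.9]),
    ("dashboards", [0.6, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]

results = top_k(query, documents, limit=2, score_threshold=0.5)
for doc_id, score in results:
    print(f"[{score:.3f}] {doc_id}")
```

A threshold that is too high can return fewer than `limit` results, which is why the recorded outputs sometimes list fewer matches than the requested limit.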

In [9]:
# Query for a specific project using metadata filtering
target_project_id = "project_0"  # Change this to test different projects

print(f"🔍 Searching for project by metadata: project_id = '{target_project_id}'")
print(f"{'='*80}\n")

# Build filter for exact metadata match
metadata_filter = Filter(
    must=[
        FieldCondition(
            key="metadata.project_id",
            match=MatchValue(value=target_project_id)
        )
    ]
)

# Retrieve matching documents
results, _ = client.scroll(
    collection_name=projects_collection,
    scroll_filter=metadata_filter,
    limit=10,
    with_payload=True,
    with_vectors=False
)

print(f"✅ Found {len(results)} chunks for {target_project_id}\n")

for i, record in enumerate(results, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    section_type = payload.get('section_type', 'N/A')
    
    print(f"{'='*80}")
    print(f"CHUNK {i}: {section_type}")
    print(f"{'='*80}")
    print(f"📦 Project Title: {metadata.get('project_title', 'N/A')}")
    print(f"🆔 Project ID: {metadata.get('project_id', 'N/A')}")
    print(f"💻 Tech Stack: {', '.join(metadata.get('tech_stack', []))}")
    print(f"📦 Chunk Type: {metadata.get('chunk_type', 'N/A')}")
    
    if section_type == "project_technical":
        print(f"\n🔧 Technical Summary (first 400 chars):")
        print(f"   {payload.get('content', 'N/A')[:400]}...")
    else:
        print(f"\n📚 Full Content (first 500 chars):")
        print(f"   {payload.get('content', 'N/A')[:500]}...")
    print()

print("\n💡 Use Case: Retrieve full project details after finding relevant technical summary")

🔍 Searching for project by metadata: project_id = 'project_0'

✅ Found 2 chunks for project_0

CHUNK 1: project_technical
📦 Project Title: Renew Amazon Prime? A Cost-Benefit Analysis
🆔 Project ID: project_0
💻 Tech Stack: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export)
📦 Chunk Type: technical_summary

🔧 Technical Summary (first 400 chars):
   Project: Renew Amazon Prime? A Cost-Benefit Analysis

Technologies: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export).

Technical Work:
Cleaned and normalized the raw Amazon order export, including sensitive-data removal, numeric coercion, and datetime parsing. Aggregated orders by order_id and order_date and derived year/month features to enable per-ye...

CHUNK 2: project_full
📦 Project Title: Renew Amazon Prime? A Cost-Benefit Analysis
🆔 Project ID: project_0
💻 Tech Stack: Python, Jupyter Notebook, pandas, numpy, seaborn, matp

### 5.4 Query by Metadata: Retrieve Specific Project by ID

This demonstrates the `search_by_metadata()` pattern from the vector_store module.
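
The filters in this notebook address nested payload fields with a dotted key such as `metadata.project_id`, and every condition in a `must` list has to hold. The pure-Python sketch below mimics those semantics on plain dictionaries to make them concrete. It is not Qdrant's implementation and not the repository's `search_by_metadata()`; `get_nested` and `match_all` are hypothetical names, and the payload shape is taken from the outputs in this notebook.

```python
def get_nested(payload, dotted_key, default=None):
    """Resolve a dotted key such as 'metadata.project_id' in a nested dict."""
    current = payload
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def match_all(payload, conditions):
    """Return True when every (dotted_key, value) pair matches exactly,
    mirroring a `must` list of exact-match conditions."""
    return all(get_nested(payload, key) == value for key, value in conditions.items())


# Payload shape taken from the outputs in this notebook (content omitted).
payloads = [
    {"section_type": "project_technical", "metadata": {"project_id": "project_0"}},
    {"section_type": "project_full", "metadata": {"project_id": "project_0"}},
    {"section_type": "project_full", "metadata": {"project_id": "project_1"}},
]

hits = [
    p for p in payloads
    if match_all(p, {"metadata.project_id": "project_0", "section_type": "project_full"})
]
print(len(hits))
```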

In [10]:
# Filter for full content chunks
full_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="project_full"))]
)

full_records, _ = client.scroll(
    collection_name=projects_collection,
    scroll_filter=full_filter,
    limit=100,
    with_payload=True,
    with_vectors=False
)

print(f"📚 Full Project Content Chunks (project_full)")
print(f"{'='*80}\n")
print(f"💡 These chunks contain ALL sections for complete project context\n")

# Display first project in detail
if full_records:
    record = full_records[0]
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"EXAMPLE FULL PROJECT: {metadata.get('project_title', 'N/A')}")
    print(f"{'='*80}")
    print(f"🆔 Project ID: {metadata.get('project_id', 'N/A')}")
    print(f"🔗 URL: {metadata.get('project_url', 'N/A')}")
    print(f"💻 Tech Stack: {', '.join(metadata.get('tech_stack', []))}")
    print(f"📦 Chunk Type: {metadata.get('chunk_type', 'N/A')}")
    print(f"🏷️  Section Type: {payload.get('section_type', 'N/A')}")
    print(f"\n📄 Full Content:")
    print(payload.get('content', 'N/A'))
    print(f"\n{'='*80}\n")

# Summary of all full projects
print(f"\n📊 Summary of All Full Project Chunks ({len(full_records)} total):\n")
for i, record in enumerate(full_records, 1):
    metadata = record.payload.get('metadata', {})
    print(f"{i}. {metadata.get('project_title', 'N/A')}")
    print(f"   Tech: {', '.join(metadata.get('tech_stack', []))}")
    print()

📚 Full Project Content Chunks (project_full)

💡 These chunks contain ALL sections for complete project context

EXAMPLE FULL PROJECT: SQL Database of Save-On-Foods products extracted using API\
🆔 Project ID: project_1
🔗 URL: https://github.com/aleivaar94/SQL-Database-of-Save-On-Foods-Products-Extracted-Using-API/blob/master/images/save-on-foods-logo.png
💻 Tech Stack: Python, Jupyter Notebook, pandas, requests, numpy, json, SQLAlchemy, SQLite, Postman (for request prototyping), and Git
📦 Chunk Type: full_content
🏷️  Section Type: project_full

📄 Full Content:
# SQL Database of Save-On-Foods products extracted using API\

## Purpose
This project programmatically extracts product data from an e-commerce API to build a clean, queryable dataset for analysis and downstream tooling. It demonstrates a repeatable ETL workflow to turn paginated JSON API results into analytics-ready CSV and relational data.

## Tech Stack
Python, Jupyter Notebook, pandas, requests, num

### 5.3 View Full Project Content (project_full chunks)

In [11]:
# Filter for technical summary chunks
tech_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="project_technical"))]
)

tech_records, _ = client.scroll(
    collection_name=projects_collection,
    scroll_filter=tech_filter,
    limit=100,
    with_payload=True,
    with_vectors=False
)

print(f"🔧 Technical Summary Chunks (project_technical)")
print(f"{'='*80}\n")
print(f"💡 These chunks contain Tech Stack + Technical Highlights for fast matching\n")

for i, record in enumerate(tech_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"PROJECT {i}: {metadata.get('project_title', 'N/A')}")
    print(f"{'='*80}")
    print(f"🆔 Project ID: {metadata.get('project_id', 'N/A')}")
    print(f"🔗 URL: {metadata.get('project_url', 'N/A')}")
    print(f"💻 Tech Stack: {', '.join(metadata.get('tech_stack', []))}")
    print(f"📦 Chunk Type: {metadata.get('chunk_type', 'N/A')}")
    print(f"🏷️  Section Type: {payload.get('section_type', 'N/A')}")
    print(f"\n📄 Content Preview (first 300 chars):")
    print(f"   {payload.get('content', 'N/A')[:300]}...")
    print()

🔧 Technical Summary Chunks (project_technical)

💡 These chunks contain Tech Stack + Technical Highlights for fast matching

PROJECT 1: Renew Amazon Prime? A Cost-Benefit Analysis
🆔 Project ID: project_0
🔗 URL: https://github.com/aleivaar94/Renew-Amazon-Prime-2022
💻 Tech Stack: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export)
📦 Chunk Type: technical_summary
🏷️  Section Type: project_technical

📄 Content Preview (first 300 chars):
   Project: Renew Amazon Prime? A Cost-Benefit Analysis

Technologies: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export).

Technical Work:
Cleaned and normalized the raw Amazon order export, including sensitive-data removal, numeric coercion, and datetime pa...

PROJECT 2: SQL Database of Save-On-Foods products extracted using API\
🆔 Project ID: project_1
🔗 URL: https://github.com/aleivaar94/SQL-Database-of-Save-On-Foods-Products-Extracted-

### 5.2 View Technical Summaries (project_technical chunks)

In [12]:
# View all projects in the collection
all_projects, _ = client.scroll(
    collection_name=projects_collection,
    limit=100,
    with_payload=True,
    with_vectors=False
)

print(f"📂 Projects Collection Overview")
print(f"{'='*80}\n")
print(f"Total chunks in collection: {len(all_projects)}")

# Group by project_id
from collections import defaultdict
projects_by_id = defaultdict(list)
for record in all_projects:
    project_id = record.payload.get('metadata', {}).get('project_id')
    projects_by_id[project_id].append(record)

print(f"Total unique projects: {len(projects_by_id)}")
print(f"\n💡 Each project has 2 chunks: technical_summary + full_content")
print(f"\n{'='*80}\n")

# Display each project's chunks
for project_id, chunks in sorted(projects_by_id.items()):
    tech_chunk = next((c for c in chunks if c.payload.get('section_type') == 'project_technical'), None)
    full_chunk = next((c for c in chunks if c.payload.get('section_type') == 'project_full'), None)
    
    if tech_chunk:
        metadata = tech_chunk.payload.get('metadata', {})
        print(f"📦 Project: {metadata.get('project_title', 'N/A')}")
        print(f"   ID: {project_id}")
        print(f"   URL: {metadata.get('project_url', 'N/A')}")
        print(f"   Tech Stack: {', '.join(metadata.get('tech_stack', []))}")
        print(f"   Chunks: {len(chunks)} (technical + full)")
        print()

📂 Projects Collection Overview

Total chunks in collection: 4
Total unique projects: 2

💡 Each project has 2 chunks: technical_summary + full_content


📦 Project: Renew Amazon Prime? A Cost-Benefit Analysis
   ID: project_0
   URL: https://github.com/aleivaar94/Renew-Amazon-Prime-2022
   Tech Stack: Python, Jupyter Notebook, pandas, numpy, seaborn, matplotlib, CSV data (personal order export)
   Chunks: 2 (technical + full)

📦 Project: SQL Database of Save-On-Foods products extracted using API\
   ID: project_1
   URL: https://github.com/aleivaar94/SQL-Database-of-Save-On-Foods-Products-Extracted-Using-API/blob/master/images/save-on-foods-logo.png
   Tech Stack: Python, Jupyter Notebook, pandas, requests, numpy, json, SQLAlchemy, SQLite, Postman (for request prototyping), and Git
   Chunks: 2 (technical + full)



## 5. Query Portfolio Projects Collection (from portfolio_projects.md)

### 5.1 Understanding Hierarchical Project Chunking

The `projects` collection uses a **hierarchical chunking strategy** with two chunk types per project:

1. **`project_technical`**: Tech Stack + Technical Highlights (for filtering/matching)
2. **`project_full`**: Complete project with all sections (Purpose, Tech Stack, Technical Highlights, Skills Demonstrated, Result/Impact)

This dual-chunk approach enables:
- Fast filtering by technology stack
- Quick matching based on technical work
- Full context retrieval when needed
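A minimal sketch of how the two chunk types could be assembled from one parsed project. The section names come from the list above; the function name `build_project_chunks`, the payload layout, and the sample project text are illustrative assumptions, not the notebook's actual ingestion code.

```python
from typing import Dict, List

# Section names follow the list above; everything else here is hypothetical.
TECHNICAL_SECTIONS = ["Tech Stack", "Technical Highlights"]
FULL_SECTIONS = ["Purpose", "Tech Stack", "Technical Highlights",
                 "Skills Demonstrated", "Result/Impact"]


def build_project_chunks(project_id: str, title: str,
                         sections: Dict[str, str]) -> List[dict]:
    """Return two payload dicts for one project: technical summary and full text."""
    def join(names: List[str]) -> str:
        # Skip sections that this project does not define
        return "\n\n".join(f"{n}:\n{sections[n]}" for n in names if n in sections)

    metadata = {"project_id": project_id, "project_title": title}
    return [
        {"section_type": "project_technical",
         "content": join(TECHNICAL_SECTIONS), "metadata": metadata},
        {"section_type": "project_full",
         "content": join(FULL_SECTIONS), "metadata": metadata},
    ]


chunks = build_project_chunks(
    "project_0",
    "Example project",
    {"Purpose": "Decide whether a subscription pays off.",
     "Tech Stack": "Python, pandas",
     "Technical Highlights": "Cleaned an order export and aggregated costs."},
)
for chunk in chunks:
    print(chunk["section_type"], len(chunk["content"]))
```

Each payload would then be embedded and stored as its own point, so a search can be restricted to either chunk type with a `section_type` filter.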

### 4.2 View All Personality Chunks

With simplified fixed-size chunking, all chunks in the personality collection are treated equally.
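A minimal sketch of fixed-size chunking with overlap. The 400-character window and 100-character overlap are the values quoted in this notebook's Summary, and the metadata keys (`chunk_index`, `char_start`, `char_end`) are the ones the personality cells read. The function `chunk_fixed` itself is a hypothetical stand-in for the real chunker.

```python
from typing import List


def chunk_fixed(text: str, size: int = 400, overlap: int = 100) -> List[dict]:
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        end = min(start + size, len(text))
        chunks.append({
            "content": text[start:end],
            "metadata": {"chunk_index": index, "char_start": start, "char_end": end},
        })
        if end == len(text):
            break  # this window already reaches the end of the text
    return chunks


sample_chunks = chunk_fixed("x" * 1000)
for c in sample_chunks:
    print(c["metadata"])
```

Because every window is cut at a character offset rather than a section boundary, no chunk carries section-specific metadata.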

## 7. Semantic Search Examples

### 7.1 Search for Python-Related Work Experience

This example embeds a free-text query and ranks chunks in the `resume_data` collection by vector similarity, dropping matches that score below 0.5.
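To make the scoring concrete, here is a small pure-Python sketch of similarity ranking with a score threshold. It assumes cosine similarity, which this notebook does not confirm for the collection, and it uses made-up three-dimensional vectors in place of real 1536-dimensional embeddings. `cosine` and `top_matches` are hypothetical helpers, not Qdrant API.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query: Sequence[float],
                docs: List[Tuple[str, Sequence[float]]],
                limit: int = 5,
                score_threshold: float = 0.5) -> List[Tuple[float, str]]:
    """Score every document, drop those below the threshold, return the best `limit`."""
    scored = [(cosine(query, vec), text) for text, vec in docs]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:limit]


# Toy vectors for illustration only
docs = [
    ("ETL pipeline", [0.9, 0.1, 0.0]),
    ("Power BI dashboard", [0.5, 0.5, 0.5]),
    ("Hiking", [0.0, 1.0, 0.0]),
]
results = top_matches([1.0, 0.0, 0.0], docs)
for score, text in results:
    print(f"{score:.4f}  {text}")
```

A threshold can leave fewer results than `limit`, which is why a search asking for 5 results may report a smaller count.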

In [13]:
# Create a query for Python-related achievements
query_text = "Python data analysis ETL pipeline machine learning"
query_vector = embedder.embed_query(query_text)

print(f"🔍 Semantic Search Query: '{query_text}'")
print(f"   Query vector dimensions: {len(query_vector)}")
print(f"   Searching in: {resume_collection} collection")

# Search with vector similarity using query_points (newer API)
results = client.query_points(
    collection_name=resume_collection,  # ← Query resume_data collection
    query=query_vector,
    limit=5,
    score_threshold=0.5  # Drop results that score below 0.5
).points

print(f"\n📊 Top {len(results)} Results (by semantic similarity):\n")

for i, result in enumerate(results, 1):
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"RESULT {i} - Similarity Score: {result.score:.4f}")
    print(f"{'='*80}")
    print(f"📄 Content: {payload.get('content', 'N/A')}")
    print(f"🏷️  Section Type: {payload.get('section_type', 'N/A')}")
    if payload.get('section_type') == 'work_experience':
        print(f"   Company: {metadata.get('company', 'N/A')}")
        print(f"   Position: {metadata.get('position', 'N/A')}")
    print()

🔍 Semantic Search Query: 'Python data analysis ETL pipeline machine learning'
   Query vector dimensions: 1536
   Searching in: resume_data collection

📊 Top 4 Results (by semantic similarity):

RESULT 1 - Similarity Score: 0.6346
📄 Content: Data Analyst: Built an ETL pipeline integrating five data sources totaling over 1M records using SQL and Python, automating ingestion and cleaning and saving 8 hours weekly in data preparation.
🏷️  Section Type: work_experience
   Company: Rubicon Organics
   Position: Data Analyst

RESULT 2 - Similarity Score: 0.5505
📄 Content: Data Scientist II: Extracted and processed millions of import/export transactions by building web-scraping collectors and a PySpark ETL pipeline to load cleaned data into a Microsoft Fabric lakehouse.
🏷️  Section Type: work_experience
   Company: Canadian Food Inspection Agency
   Position: Data Scientist II

RESULT 3 - Similarity Score: 0.5154
📄 Content: Data Scientist II: Automated data categoriza

### 7.2 Search for Personality Traits Matching Job Requirements

**NEW: Direct query to personality collection (no filtering needed!)**

This mimics how `retrieve_personality_traits()` works in the resume generator with the new architecture.
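The cell below notes that retrieved chunks would be deduplicated by removing their 100-character overlaps before going into the prompt. How `retrieve_personality_traits()` does this is not shown in this notebook. One possible approach, assuming each chunk's content equals the source text between `char_start` and `char_end`, is to merge windows whose character ranges overlap. The helper `merge_overlapping` and the flat dict layout are hypothetical.

```python
from typing import List


def merge_overlapping(chunks: List[dict]) -> List[dict]:
    """Merge chunks whose character ranges overlap, keeping each character once."""
    ordered = sorted(chunks, key=lambda c: c["char_start"])
    merged: List[dict] = []
    for chunk in ordered:
        if merged and chunk["char_start"] < merged[-1]["char_end"]:
            prev = merged[-1]
            # Number of leading characters already present in the previous chunk
            skip = prev["char_end"] - chunk["char_start"]
            if chunk["char_end"] > prev["char_end"]:
                prev["content"] += chunk["content"][skip:]
                prev["char_end"] = chunk["char_end"]
        else:
            merged.append(dict(chunk))  # copy so the input list is not mutated
    return merged


# Synthetic source text standing in for personalities_16.md
text = "".join(chr(97 + i % 26) for i in range(1000))
retrieved = [
    {"content": text[300:700], "char_start": 300, "char_end": 700},
    {"content": text[0:400], "char_start": 0, "char_end": 400},
    {"content": text[900:1000], "char_start": 900, "char_end": 1000},
]
merged_chunks = merge_overlapping(retrieved)
for m in merged_chunks:
    print(m["char_start"], m["char_end"], len(m["content"]))
```

Merging reorders chunks by position in the source text, so similarity ranking is lost; whether that trade-off is acceptable depends on how the prompt uses the traits.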

In [14]:
# Simulate a job analysis with soft skills and keywords
job_analysis = {
    'soft_skills': ['analytical thinking', 'problem-solving', 'collaboration'],
    'keywords': ['strategic', 'innovative', 'team player']
}

# Build query (same logic as retrieve_personality_traits)
query_parts = job_analysis.get('soft_skills', []) + job_analysis.get('keywords', [])
query_text = ' '.join(query_parts)
query_vector = embedder.embed_query(query_text)

print(f"🔍 Job Requirements Query: '{query_text}'")
print(f"   Searching in: {personality_collection} collection (NEW!)\n")

# Search the PERSONALITY collection directly (no filtering needed!)
all_results = client.query_points(
    collection_name=personality_collection,  # ← Query personality collection directly!
    query=query_vector,
    limit=10
).points

print(f"✅ Retrieved {len(all_results)} results from personality collection")

print(f"\n📊 Top 5 Personality Trait Chunks by Semantic Similarity:\n")
print(f"💡 Benefits of simplified chunking:")
print(f"   - Pure semantic search without complex filtering")
print(f"   - Fixed-size chunks maintain consistent context windows")
print(f"   - Faster search (smaller collection)\n")

for i, result in enumerate(all_results[:5], 1):  # Top 5
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"TRAIT {i} - Similarity: {result.score:.4f}")
    print(f"{'='*80}")
    print(f"📝 Chunk Index: {metadata.get('chunk_index', 'N/A')}")
    print(f"📏 Characters: {metadata.get('char_start', 'N/A')} - {metadata.get('char_end', 'N/A')}")
    print(f"📄 Content:\n   {payload.get('content', 'N/A')}")
    print()

print("\n💡 These traits would be deduplicated (removing 100-char overlaps) and injected into the cover letter prompt!")

🔍 Job Requirements Query: 'analytical thinking problem-solving collaboration strategic innovative team player'
   Searching in: personality collection (NEW!)

✅ Retrieved 8 results from personality collection

📊 Top 5 Personality Trait Chunks by Semantic Similarity:

💡 Benefits of simplified chunking:
   - Pure semantic search without complex filtering
   - Fixed-size chunks maintain consistent context windows
   - Faster search (smaller collection)

TRAIT 1 - Similarity: 0.5586
📝 Chunk Index: 0
📏 Characters: 0 - 400
📄 Content:
   # Personality Traits
I'm very analytical, highly curious and constantly seek to improve systems. I approach life with a strategic mindset, always looking several steps ahead and planning for contingencies. I value autonomy but also enjoy collaborating with others. This allows me to tackle complex problems with confidence and innovation. I hold high standards for myself and others, always striving

TRAIT 2 - Similarity: 0.4968
📝 Chunk In

### 7.3 Semantic Search with Section Filtering

Combine semantic search with metadata filters for precise results.
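The sketch below illustrates why the filter is applied during the search instead of afterwards. It uses plain Python lists and dot-product scores as a stand-in for the vector store; the `search` helper, the sample points, and their vectors are all hypothetical.

```python
from typing import List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(points: List[dict], query: Sequence[float], limit: int,
           section_type: Optional[str] = None) -> List[dict]:
    """Rank points by dot-product score, optionally restricted to one section_type."""
    candidates = [p for p in points
                  if section_type is None or p["section_type"] == section_type]
    ranked = sorted(candidates, key=lambda p: dot(p["vector"], query), reverse=True)
    return ranked[:limit]


points = [
    {"id": 1, "section_type": "skills", "vector": [0.9, 0.1]},
    {"id": 2, "section_type": "education", "vector": [0.8, 0.2]},
    {"id": 3, "section_type": "work_experience", "vector": [0.7, 0.3]},
    {"id": 4, "section_type": "work_experience", "vector": [0.6, 0.4]},
]
query = [1.0, 0.0]

# Filter applied during the search: the limit counts only matching points
during = search(points, query, limit=2, section_type="work_experience")

# Filter applied after an unfiltered top-2: other sections used up the limit
after = [p for p in search(points, query, limit=2)
         if p["section_type"] == "work_experience"]

print([p["id"] for p in during])
print([p["id"] for p in after])
```

With the filter inside the search, `limit` counts only points that satisfy it. Filtering afterwards can return fewer results than requested, or none.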

In [15]:
# Search for data science achievements ONLY in work experience (resume_data collection)
query_text = "data science machine learning SQL Python dashboard visualization"
query_vector = embedder.embed_query(query_text)

# Apply filter to only search work_experience
work_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="work_experience"))]
)

results = client.query_points(
    collection_name=resume_collection,  # ← Query resume_data collection
    query=query_vector,
    query_filter=work_filter,  # ← Apply filter during search
    limit=5
).points

print(f"🔍 Query: '{query_text}'")
print(f"📦 Collection: {resume_collection}")
print(f"🎯 Filter: section_type = 'work_experience'")
print(f"\n📊 Top {len(results)} Work Achievements:\n")

for i, result in enumerate(results, 1):
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{i}. [Score: {result.score:.4f}] {metadata.get('company', 'N/A')} - {metadata.get('position', 'N/A')}")
    print(f"   {payload.get('content', 'N/A')[:100]}...")
    print()

🔍 Query: 'data science machine learning SQL Python dashboard visualization'
📦 Collection: resume_data
🎯 Filter: section_type = 'work_experience'

📊 Top 5 Work Achievements:

1. [Score: 0.5813] Canadian Food Inspection Agency - Data Scientist
   Data Scientist: Standardized descriptive and statistical reporting in Power BI, reducing report-gene...

2. [Score: 0.5107] Rubicon Organics - Data Analyst
   Data Analyst: Built an ETL pipeline integrating five data sources totaling over 1M records using SQL...

3. [Score: 0.5098] Rubicon Organics - Data Analyst
   Data Analyst: Built three Power BI dashboards for sales and marketing by collaborating with stakehol...

4. [Score: 0.4945] Canadian Food Inspection Agency - Data Scientist II
   Data Scientist II: Automated forecasting and reduced manual effort by 40 hours per month by deployin...

5. [Score: 0.4919] Canadian Food Inspection Agency - Data Scientist II
   Data Scientist II: Implemented daily automated data refreshes, repl

## 8. Complete RAG Workflow Example (Triple-Collection Architecture)

This demonstrates the updated retrieval flow using **separate collections** for resume, personality, and projects data.
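The workflow cell only prints what the augmentation phase would combine. As a rough sketch of that step, the helper below assembles retrieved snippets into one prompt string. The function `build_prompt`, the section headings, and the sample snippets are illustrative assumptions; the real resume generator's prompt format is not shown in this notebook.

```python
from typing import List


def build_prompt(job_title: str, company: str, job_description: str,
                 achievements: List[str], projects: List[str],
                 traits: List[str]) -> str:
    """Combine job context and retrieved snippets into a single prompt string."""
    def bullets(items: List[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- (none retrieved)"

    return (
        f"Target role: {job_title} at {company}\n\n"
        f"Job description:\n{job_description.strip()}\n\n"
        f"Work achievements:\n{bullets(achievements)}\n\n"
        f"Portfolio projects:\n{bullets(projects)}\n\n"
        f"Personality traits:\n{bullets(traits)}\n\n"
        "Use only the context above."
    )


prompt = build_prompt(
    job_title="Senior Data Scientist",
    company="Tech Corp",
    job_description="Strong Python skills, machine learning, SQL, data visualization.",
    achievements=["Built an ETL pipeline using SQL and Python."],
    projects=["Cost-benefit analysis in pandas."],
    traits=[],  # an empty retrieval still yields a well-formed section
)
print(prompt)
```

Keeping each collection's results under its own heading preserves the separation between resume facts, projects, and personality traits inside the prompt as well.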

In [16]:
# Simulate a complete RAG workflow for a Data Scientist job
print("="*80)
print("COMPLETE RAG WORKFLOW: Data Scientist Position")
print("(Using Triple-Collection Architecture)")
print("="*80)

# 1. Job context
job_title = "Senior Data Scientist"
company = "Tech Corp"
job_description = """
Looking for a data scientist with strong Python skills, experience with machine learning,
SQL databases, and data visualization. Must have excellent analytical and problem-solving
abilities with strong communication skills.
"""

print(f"\n📋 Job: {job_title} at {company}")
print(f"📝 Requirements: Python, ML, SQL, data viz, analytical thinking, communication\n")

# 2. PHASE 1: RETRIEVAL
print("="*80)
print("PHASE 1: RETRIEVAL (Vector Similarity Search)")
print("="*80)

# Create query embedding
query_text = f"{job_title} {company} {job_description}"
query_vector = embedder.embed_query(query_text)

# Retrieve work experience from RESUME_DATA collection
print(f"\n🔍 Searching resume_data collection for work achievements...")
work_results = client.query_points(
    collection_name=resume_collection,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="section_type", match=MatchValue(value="work_experience"))]
    ),
    limit=10
).points

print(f"✅ Retrieved {len(work_results)} relevant work achievements:")
for i, result in enumerate(work_results[:5], 1):
    metadata = result.payload.get('metadata', {})
    print(f"   {i}. [{result.score:.3f}] {metadata.get('company')} - {result.payload.get('content', '')[:60]}...")

# Retrieve relevant projects from PROJECTS collection
print(f"\n📂 Searching projects collection for matching portfolio work...")
projects_results = client.query_points(
    collection_name=projects_collection,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="section_type", match=MatchValue(value="project_technical"))]
    ),
    limit=3
).points

print(f"✅ Retrieved {len(projects_results)} relevant projects:")
for i, result in enumerate(projects_results, 1):
    metadata = result.payload.get('metadata', {})
    print(f"   {i}. [{result.score:.3f}] {metadata.get('project_title')} - Tech: {', '.join(metadata.get('tech_stack', [])[:3])}")

# Retrieve personality traits from PERSONALITY collection
job_analysis = {
    'soft_skills': ['analytical', 'problem-solving', 'communication'],
    'keywords': ['data-driven', 'collaborative']
}

personality_query = ' '.join(job_analysis['soft_skills'] + job_analysis['keywords'])
personality_vector = embedder.embed_query(personality_query)

print(f"\n🧠 Searching personality collection for matching traits...")
personality_results = client.query_points(
    collection_name=personality_collection,
    query=personality_vector,
    limit=10
).points

# Filter for personality/strength (exclude weaknesses)
personality_filtered = [
    r for r in personality_results 
    if r.payload.get('section_type') in ['personality', 'strength']
][:5]

print(f"✅ Retrieved {len(personality_filtered)} personality traits:")
for i, result in enumerate(personality_filtered, 1):
    print(f"   {i}. [{result.score:.3f}] {result.payload.get('content', '')[:60]}...")

print(f"\n💡 Architecture Benefits:")
print(f"   ✓ Resume, projects, and personality queries run independently")
print(f"   ✓ No cross-contamination between collections")
print(f"   ✓ Faster searches (smaller, focused collections)")
print(f"   ✓ Hierarchical project retrieval (technical summaries first)")

# 3. PHASE 2: AUGMENTATION
print(f"\n{'='*80}")
print("PHASE 2: AUGMENTATION (Combine Context)")
print("="*80)
print("\n✅ Would combine:")
print(f"   - Job requirements: {job_title}, Python, ML, SQL...")
print(f"   - {len(work_results[:5])} work achievements (from resume_data collection)")
print(f"   - {len(projects_results)} portfolio projects (from projects collection)")
print(f"   - {len(personality_filtered)} personality traits (from personality collection)")
print("   - Into a structured prompt for Claude")

# 4. PHASE 3: GENERATION
print(f"\n{'='*80}")
print("PHASE 3: GENERATION (Claude LLM)")
print("="*80)
print("\n✅ Would call Claude API with augmented prompt to generate:")
print("   - Tailored resume sections")
print("   - Personalized cover letter")
print("   - Using ONLY the retrieved context from all three collections")

print(f"\n{'='*80}")
print("✅ RAG WORKFLOW COMPLETE")
print("="*80)

COMPLETE RAG WORKFLOW: Data Scientist Position
(Using Triple-Collection Architecture)

📋 Job: Senior Data Scientist at Tech Corp
📝 Requirements: Python, ML, SQL, data viz, analytical thinking, communication

PHASE 1: RETRIEVAL (Vector Similarity Search)

🔍 Searching resume_data collection for work achievements...
✅ Retrieved 10 relevant work achievements:
   1. [0.452] Canadian Food Inspection Agency - Data Scientist II: Extracted and processed millions of impor...
   2. [0.448] Canadian Food Inspection Agency - Data Scientist: Standardized descriptive and statistical rep...
   3. [0.448] Canadian Food Inspection Agency - Data Scientist II: Implemented daily automated data refreshe...
   4. [0.441] Canadian Food Inspection Agency - Data Scientist: Analyzed pathogen occurrence trends across 5...
   5. [0.435] Rubicon Organics - Data Analyst: Built an ETL pipeline integrating five data so...

📂 Searching projects collection for matching portfolio work...
✅ Retrieved 2 rel

## Summary

This notebook demonstrated the **triple-collection architecture** with comprehensive querying patterns:

### Collections Overview

1. **`resume_data`**: Work experience, education, skills, continuing studies, personal info
2. **`personality`**: Personality traits with fixed-size chunking (400 chars, 100 overlap)
3. **`projects`**: Portfolio projects with hierarchical chunking (technical summaries + full content)

### Key Features Demonstrated

#### Resume Collection
- Section-based chunking with rich metadata (company, position, dates)
- Filtered queries by section_type (work_experience, education, skills)
- Semantic search for relevant achievements

#### Personality Collection
- Simplified fixed-size chunking without complex section parsing
- Direct semantic search without filtering
- Efficient retrieval of matching traits

#### Projects Collection (NEW)
- **Hierarchical chunking**: Each project has 2 chunks
  - `project_technical`: Tech Stack + Technical Highlights (fast matching)
  - `project_full`: Complete project with all sections (detailed context)
- **Metadata-based retrieval**: Query by `project_id` to get specific projects
- **Two-phase workflow**:
  1. Search technical summaries for matches
  2. Retrieve full content using project_ids
- **Benefits**:
  - Fast filtering by technology stack
  - Efficient token usage (full content is pulled into the prompt only when needed)
  - Clear separation of filtering vs. detailed context
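The two-phase workflow above can be sketched in plain Python. The example ranks toy vectors by dot product instead of calling the vector store, and it assumes `project_id` is stored under each chunk's `metadata`. The helper `two_phase_retrieve` and the sample points are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_phase_retrieve(points: List[dict], query: Sequence[float],
                       limit: int = 3) -> List[dict]:
    """Rank technical summaries, then return the full chunk of each matched project."""
    # Phase 1: search only the project_technical chunks
    summaries = [p for p in points if p["section_type"] == "project_technical"]
    ranked = sorted(summaries, key=lambda p: dot(p["vector"], query), reverse=True)
    matched_ids = [p["metadata"]["project_id"] for p in ranked[:limit]]

    # Phase 2: look up the project_full chunk for each matched id, keeping rank order
    full_by_id = {p["metadata"]["project_id"]: p
                  for p in points if p["section_type"] == "project_full"}
    return [full_by_id[pid] for pid in matched_ids if pid in full_by_id]


points = [
    {"section_type": "project_technical", "vector": [0.9, 0.1],
     "content": "Tech: Python, pandas", "metadata": {"project_id": "project_0"}},
    {"section_type": "project_full", "vector": [0.5, 0.5],
     "content": "Full write-up of project_0", "metadata": {"project_id": "project_0"}},
    {"section_type": "project_technical", "vector": [0.2, 0.8],
     "content": "Tech: SQLAlchemy, SQLite", "metadata": {"project_id": "project_1"}},
    {"section_type": "project_full", "vector": [0.4, 0.6],
     "content": "Full write-up of project_1", "metadata": {"project_id": "project_1"}},
]
full_chunks = two_phase_retrieve(points, query=[1.0, 0.0], limit=1)
print([c["content"] for c in full_chunks])
```

Against the real collection, phase 1 would be a filtered similarity search and phase 2 an exact-match lookup on `project_id`.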

### Query Patterns Explored

1. **Collection structure inspection**: View document counts and section types
2. **Filtered scrolling**: Retrieve documents by section_type or metadata
3. **Semantic search**: Find similar content using embeddings
4. **Combined filters**: Semantic search + metadata filtering
5. **Metadata queries**: Exact match on specific fields (e.g., project_id)
6. **Multi-phase retrieval**: Search summaries → retrieve full content

### Architecture Benefits

- ✅ **Semantic separation**: Resume, projects, and personality data stored independently
- ✅ **Hierarchical retrieval**: Fast matching → detailed context (projects)
- ✅ **No cross-contamination**: Queries return only relevant collection data
- ✅ **Faster searches**: Smaller, focused collections improve performance
- ✅ **Flexible workflows**: Different retrieval patterns per collection
- ✅ **Efficient tokens**: Initial search matches technical summaries only; full project content is fetched afterwards

### Use Cases

- **Resume generation**: Retrieve relevant work achievements + matching projects + personality traits
- **Cover letter**: Search all collections for job-specific content
- **Portfolio showcase**: Find projects by technology stack or technical requirements
- **Skills matching**: Semantic search across resume, projects, and personality

### Next Steps

- Run cells to explore your actual triple-collection database
- Experiment with different query combinations
- Test hierarchical project retrieval workflow
- Try metadata-based queries for specific projects
- Combine results from multiple collections for complete RAG workflows