# Advanced Vector Store Queries

This notebook demonstrates detailed querying of a Qdrant vector store that uses a **dual-collection architecture**:
- **`resume_data` collection** (from `resume_ale.md`): work experience, education, skills, continuing studies, personal info
- **`personality` collection** (from `personalities_16.md`): personality traits, strengths, weaknesses

**New Architecture Benefits:**
- Semantic separation of resume facts vs personality traits
- Faster queries (smaller, focused collections)
- No cross-contamination between resume and personality data

We'll explore:
1. Collection metadata and structure (BOTH collections)
2. Filtering by section type within collections
3. Viewing embeddings and payloads
4. Semantic search examples (separate collection queries)
5. Specific queries for resume vs personality data

## 1. Initialize Vector Store Connection

In [1]:
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from pathlib import Path
import json

# Initialize Qdrant client with local storage
storage_path = "../vector_db/qdrant_storage"
client = QdrantClient(path=storage_path)

# Collection names (dual-collection architecture)
resume_collection = "resume_data"
personality_collection = "personality"

print("✅ Connected to Qdrant vector store")
print(f"📂 Storage path: {Path(storage_path).absolute()}")
print(f"\n📦 Collections:")
print(f"   - {resume_collection} (resume content)")
print(f"   - {personality_collection} (personality traits)")

✅ Connected to Qdrant vector store
📂 Storage path: c:\Users\Ale\Documents\Data-Science-Projects\GitHub\Resume_Claude_SDK_Agent\notebooks\..\vector_db\qdrant_storage

📦 Collections:
   - resume_data (resume content)
   - personality (personality traits)


## 2. Explore Collection Structure

In [3]:
from collections import Counter

# Get all collections
collections_response = client.get_collections()
print("📚 Available Collections:")
for collection in collections_response.collections:
    print(f"   - {collection.name}")

print("\n" + "="*80)

# Explore BOTH collections
for collection_name in [resume_collection, personality_collection]:
    if client.collection_exists(collection_name):
        collection_info = client.get_collection(collection_name)
        
        print(f"\n📊 Collection '{collection_name}' Details:")
        print(f"   Total documents: {collection_info.points_count}")
        print(f"   Vector dimensions: {collection_info.config.params.vectors.size}")
        print(f"   Distance metric: {collection_info.config.params.vectors.distance}")
        print(f"   Status: {collection_info.status}")
        
        # Count by section type (this single scroll call reads at most 1000 points)
        all_records, _ = client.scroll(
            collection_name=collection_name,
            limit=1000,
            with_payload=True,
            with_vectors=False
        )
        
        section_counts = Counter(r.payload.get('section_type', 'unknown') for r in all_records)
        
        print(f"\n   📈 Documents by Section Type:")
        for section, count in sorted(section_counts.items()):
            print(f"      {section:20s}: {count:3d} chunks")
        
        print("   " + "-"*76)
    else:
        print(f"\n❌ Collection '{collection_name}' not found")

print("\n" + "="*80)

📚 Available Collections:
   - resume_data
   - personality


📊 Collection 'resume_data' Details:
   Total documents: 35
   Vector dimensions: 1536
   Distance metric: Cosine
   Status: green

   📈 Documents by Section Type:
      continuing_studies  :   7 chunks
      education           :   2 chunks
      personal_info       :   1 chunks
      professional_summary:   1 chunks
      skills              :   5 chunks
      work_experience     :  19 chunks
   ----------------------------------------------------------------------------

📊 Collection 'personality' Details:
   Total documents: 14
   Vector dimensions: 1536
   Distance metric: Cosine
   Status: green

   📈 Documents by Section Type:
      personality         :  14 chunks
   ----------------------------------------------------------------------------
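
The counting cell above reads at most 1,000 points per collection in a single `scroll` call, which is plenty for 35 and 14 points. For larger collections, a minimal sketch of paginated scrolling might look like the following. It assumes only that the client's `scroll` method returns a `(records, next_offset)` pair, as the calls in this notebook already rely on. `scroll_all` and `FakeClient` are hypothetical names introduced here so the example runs without a database.

```python
from types import SimpleNamespace


def scroll_all(client, collection_name, page_size=256):
    """Collect every point in a collection by following scroll offsets."""
    records = []
    offset = None
    while True:
        page, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        records.extend(page)
        if offset is None:  # no further pages to read
            break
    return records


class FakeClient:
    """Hypothetical stand-in for QdrantClient so the sketch runs offline."""

    def __init__(self, points):
        self._points = points

    def scroll(self, collection_name, limit, offset=None,
               with_payload=True, with_vectors=False):
        start = offset or 0
        page = self._points[start:start + limit]
        has_more = start + limit < len(self._points)
        return page, (start + limit if has_more else None)


points = [SimpleNamespace(id=i, payload={"section_type": "skills"}) for i in range(5)]
all_records = scroll_all(FakeClient(points), "resume_data", page_size=2)
print(len(all_records))
```

With a real `QdrantClient`, the same helper could be called as `scroll_all(client, resume_collection)` before building the `Counter`.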



## 3. Query Resume Data (from resume_ale.md)

### 3.1 View Work Experience with Full Metadata

In [None]:
# Filter for work experience entries (from resume_data collection)
work_filter = Filter(
    must=[
        FieldCondition(
            key="section_type",
            match=MatchValue(value="work_experience")
        )
    ]
)

work_records, _ = client.scroll(
    collection_name=resume_collection,  # Query resume_data collection
    scroll_filter=work_filter,
    limit=20,
    with_payload=True,
    with_vectors=False  # Set True to see embeddings
)

print(f"💼 Work Experience Chunks from '{resume_collection}' collection (showing {len(work_records)}):\n")

for i, record in enumerate(work_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"CHUNK {i} - ID: {record.id}")
    print(f"{'='*80}")
    print(f"📄 Content (Achievement):")
    print(f"   {payload.get('content', 'N/A')}")
    print(f"\n🏢 Metadata:")
    print(f"   Company:        {metadata.get('company', 'N/A')}")
    print(f"   Position:       {metadata.get('position', 'N/A')}")
    print(f"   Start Date:     {metadata.get('start_date', 'N/A')}")
    print(f"   End Date:       {metadata.get('end_date', 'N/A')}")
    print(f"   Source File:    {payload.get('source_file', 'N/A')}")
    print(f"   Section Type:   {payload.get('section_type', 'N/A')}")
    print()

💼 Work Experience Chunks from 'resume_data' collection (showing 19):

CHUNK 1 - ID: 11e55900-961b-449b-88e5-080429d6688a
📄 Content (Achievement):
   Data Scientist II: Implemented a Python algorithm to automatically select sampling plans, reducing inspector manual work by 3 hours per inspector per day and generating approximately CAD 4,000,000 in annual savings by translating business rules into an automated algorithm.

🏢 Metadata:
   Company:        Canadian Food Inspection Agency
   Position:       Data Scientist II
   Start Date:     March-2025
   End Date:       November-2025
   Source File:    resume_ale.md
   Section Type:   work_experience

CHUNK 2 - ID: 1a038c02-a181-4bf5-88a4-fb38dcea017a
📄 Content (Achievement):
   Data Scientist: Standardized descriptive and statistical reporting in Power BI, reducing report-generation time and improving inspection efficiency by creating templated reports and automated data queries.

🏢 Metadata:
   Company:        Canadian Foo

### 3.2 View Work Experience WITH Embeddings

Each chunk has a 1536-dimensional embedding vector generated by OpenAI's `text-embedding-3-small` model.

In [None]:
# Get one work experience record WITH embeddings
work_with_vector, _ = client.scroll(
    collection_name=resume_collection,
    scroll_filter=work_filter,
    limit=1,  # Only the first record is inspected below
    with_payload=True,
    with_vectors=True  # Include embeddings
)

if work_with_vector:
    record = work_with_vector[0]
    vector = record.vector
    
    print(f"🔢 Embedding Vector Details:")
    print(f"   Vector dimensions: {len(vector)}")
    print(f"   Vector type: {type(vector)}")
    print(f"   First 10 values: {vector[:10]}")
    print(f"   Last 10 values:  {vector[-10:]}")
    print(f"\n📊 Vector Statistics:")
    import numpy as np
    vector_array = np.array(vector)
    print(f"   Min value:  {vector_array.min():.6f}")
    print(f"   Max value:  {vector_array.max():.6f}")
    print(f"   Mean value: {vector_array.mean():.6f}")
    print(f"   Std dev:    {vector_array.std():.6f}")
    
    print(f"\n📄 Associated Content:")
    print(f"   {record.payload.get('content', 'N/A')[:150]}...")

🔢 Embedding Vector Details:
   Vector dimensions: 1536
   Vector type: <class 'list'>
   First 10 values: [0.02034231647849083, -0.03654035925865173, 0.06752133369445801, -0.015060896053910255, -0.008598103187978268, -0.007789465133100748, -0.008636008948087692, -0.007606257684528828, -0.010771320201456547, 0.0278222244232893]
   Last 10 values:  [0.0033703807275742292, 0.006867111194878817, 0.008636008948087692, 0.016122233122587204, 0.0037968114484101534, -0.0016251743072643876, 0.022692423313856125, -0.04225137084722519, 0.027266286313533783, 0.02266715280711651]

📊 Vector Statistics:
   Min value:  -0.123924
   Max value:  0.070857
   Mean value: 0.000012
   Std dev:    0.025516

📄 Associated Content:
   Data Scientist II: Implemented a Python algorithm to automatically select sampling plans, reducing inspector manual work by 3 hours per inspector per ...
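
Both collections report `Cosine` as their distance metric, so the similarity scores in the later searches compare the direction of two vectors rather than their magnitude. As a rough illustration of that calculation, here is a plain-Python sketch on toy three-dimensional vectors. The values are made up for the example and are not real embeddings, and `cosine_similarity` is a helper written here, not part of the Qdrant client.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # shares no direction with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Vectors pointing the same way score close to 1 regardless of length, while unrelated directions score close to 0, which is why a longer chunk does not automatically rank higher.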


### 3.3 Query Education & Skills Sections

In [None]:
# Query education entries (from resume_data collection)
education_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="education"))]
)

education_records, _ = client.scroll(
    collection_name=resume_collection,  # ← Query resume_data collection
    scroll_filter=education_filter,
    limit=20,
    with_payload=True
)

print(f"🎓 Education Entries from '{resume_collection}' collection ({len(education_records)}):\n")
for i, record in enumerate(education_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"EDUCATION CHUNK {i}")
    print(f"{'='*80}")
    print(f"📝 Degree:        {metadata.get('degree', 'N/A')}")
    print(f"🏫 Institution:   {metadata.get('institution', 'N/A')}")
    print(f"📅 Year:          {metadata.get('year', 'N/A')}")
    print(f"📂 Source File:   {payload.get('source_file', 'N/A')}")
    print(f"🏷️  Section Type:  {payload.get('section_type', 'N/A')}")
    print(f"\n📄 Content:\n   {payload.get('content', 'N/A')}")
    print(f"\n🔍 Full Metadata: {json.dumps(metadata, indent=2)}")
    print()

# Query skills (from resume_data collection)
skills_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="skills"))]
)

skills_records, _ = client.scroll(
    collection_name=resume_collection,
    scroll_filter=skills_filter,
    limit=20,
    with_payload=True
)

print(f"\n🛠️  Skills Entries from '{resume_collection}' collection ({len(skills_records)}):\n")
for i, record in enumerate(skills_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"SKILL CHUNK {i}")
    print(f"{'='*80}")
    print(f"📂 Category:      {metadata.get('category', 'N/A')}")
    print(f"📄 Skills:        {payload.get('content', 'N/A')}")
    print(f"📁 Source File:   {payload.get('source_file', 'N/A')}")
    print(f"🏷️  Section Type:  {payload.get('section_type', 'N/A')}")
    print(f"\n🔍 Full Metadata: {json.dumps(metadata, indent=2)}")
    print()

🎓 Education Entries from 'resume_data' collection (2):

EDUCATION CHUNK 1
📝 Degree:        MSc in Food Science
🏫 Institution:   University of British Columbia
📅 Year:          N/A
📂 Source File:   resume_ale.md
🏷️  Section Type:  education

📄 Content:
   MSc in Food Science from University of British Columbia. January-2019 - October-2020 | Canada

🔍 Full Metadata: {
  "degree": "MSc in Food Science",
  "institution": "University of British Columbia",
  "dates": "January-2019 - October-2020 | Canada"
}

EDUCATION CHUNK 2
📝 Degree:        BSc in Biotechnology Engineering
🏫 Institution:   Tec de Monterrey
📅 Year:          N/A
📂 Source File:   resume_ale.md
🏷️  Section Type:  education

📄 Content:
   BSc in Biotechnology Engineering from Tec de Monterrey. August-2012 - May-2017 | Mexico

🔍 Full Metadata: {
  "degree": "BSc in Biotechnology Engineering",
  "institution": "Tec de Monterrey",
  "dates": "August-2012 - May-2017 | Mexico"
}



## 4. Query Personality Traits Data (from personalities_16.md)

### 4.1 View Personality Sections

In [None]:
# The personality collection holds only personality chunks, so no section_type
# filter is needed: scroll the collection directly.
personality_records, _ = client.scroll(
    collection_name=personality_collection,
    limit=20,
    with_payload=True
)

print(f"🧠 Personality Trait Chunks from '{personality_collection}' collection ({len(personality_records)}):\n")
print(f"💡 Note: This collection contains ONLY personality data with simplified fixed-size chunking\n")

for i, record in enumerate(personality_records, 1):
    payload = record.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"PERSONALITY CHUNK {i}")
    print(f"{'='*80}")
    print(f"📝 Chunk Index: {metadata.get('chunk_index', 'N/A')}")
    print(f"📂 Source File: {payload.get('source_file', 'N/A')}")
    print(f"📏 Character Range: {metadata.get('char_start', 'N/A')} - {metadata.get('char_end', 'N/A')}")
    print(f"\n📄 Content:\n   {payload.get('content', 'N/A')}")
    print(f"\n🔍 Full Metadata: {json.dumps(metadata, indent=2)}")
    print()

🧠 Personality Trait Chunks from 'personality' collection (14):

💡 Note: This collection contains ONLY personality data (no resume content)

PERSONALITY CHUNK 1
📝 Section:       Big-Picture Focus
📂 Source File:   personalities_16.md
🏷️  Section Type:  personality

📄 Content:
   I prefer focusing on overarching goals and strategies rather than micromanaging small details.

🔍 Full Metadata: {
  "section": "Big-Picture Focus"
}

PERSONALITY CHUNK 2
📝 Section:       Conceptual Thinking
📂 Source File:   personalities_16.md
🏷️  Section Type:  personality

📄 Content:
   I effortlessly grasp abstract, complex ideas, making me particularly suited to roles that require strategic analysis and long-term planning.

🔍 Full Metadata: {
  "section": "Conceptual Thinking"
}

PERSONALITY CHUNK 3
📝 Section:       Reluctance to Delegate Tasks
📂 Source File:   personalities_16.md
🏷️  Section Type:  personality

📄 Content:
   Believing strongly in my own 

### 4.2 View All Personality Chunks

With simplified fixed-size chunking, all chunks in the personality collection are treated equally.
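
Because the chunks carry no section labels, the natural way to view them all is in document order. The sketch below assumes each point's payload stores an integer `chunk_index` under `metadata`, as the printing code in section 4.1 suggests. `chunks_in_order` is a hypothetical helper, and the stand-in records only mimic the shape of what `client.scroll` returns.

```python
from types import SimpleNamespace


def chunks_in_order(records):
    """Sort records by the chunk_index stored in their payload metadata."""
    def index_of(record):
        metadata = (record.payload or {}).get("metadata", {})
        # Chunks without an index are placed last
        return metadata.get("chunk_index", float("inf"))
    return sorted(records, key=index_of)


# Stand-in records shaped like the points returned by client.scroll
records = [
    SimpleNamespace(payload={"content": "second chunk", "metadata": {"chunk_index": 1}}),
    SimpleNamespace(payload={"content": "first chunk", "metadata": {"chunk_index": 0}}),
    SimpleNamespace(payload={"content": "no index", "metadata": {}}),
]

for record in chunks_in_order(records):
    print(record.payload["content"])
```

In the notebook, the same helper could be applied to the `personality_records` list returned in section 4.1.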

## 5. Semantic Search Examples

### 5.1 Search for Python-Related Work Experience

This demonstrates how semantic search works with embeddings.

In [10]:
# Import OpenAI embeddings to create query vectors
import sys
sys.path.append('..')
from src.core.embeddings import OpenAIEmbeddings

# Initialize embedder
embedder = OpenAIEmbeddings()

# Create a query for Python-related achievements
query_text = "Python data analysis ETL pipeline machine learning"
query_vector = embedder.embed_query(query_text)

print(f"🔍 Semantic Search Query: '{query_text}'")
print(f"   Query vector dimensions: {len(query_vector)}")
print(f"   Searching in: {resume_collection} collection")

# Search with vector similarity using query_points (newer API)
results = client.query_points(
    collection_name=resume_collection,  # ← Query resume_data collection
    query=query_vector,
    limit=5,
    score_threshold=0.5  # Drop results scoring below 0.5
).points

print(f"\n📊 Top {len(results)} Results (by semantic similarity):\n")

for i, result in enumerate(results, 1):
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"RESULT {i} - Similarity Score: {result.score:.4f}")
    print(f"{'='*80}")
    print(f"📄 Content: {payload.get('content', 'N/A')}")
    print(f"🏷️  Section Type: {payload.get('section_type', 'N/A')}")
    if payload.get('section_type') == 'work_experience':
        print(f"   Company: {metadata.get('company', 'N/A')}")
        print(f"   Position: {metadata.get('position', 'N/A')}")
    print()

🔍 Semantic Search Query: 'Python data analysis ETL pipeline machine learning'
   Query vector dimensions: 1536
   Searching in: resume_data collection

📊 Top 4 Results (by semantic similarity):

RESULT 1 - Similarity Score: 0.6346
📄 Content: Data Analyst: Built an ETL pipeline integrating five data sources totaling over 1M records using SQL and Python, automating ingestion and cleaning and saving 8 hours weekly in data preparation.
🏷️  Section Type: work_experience
   Company: Rubicon Organics
   Position: Data Analyst

RESULT 2 - Similarity Score: 0.5538
📄 Content: Data Scientist II: Extracted and processed millions of import/export transactions by building web-scraping collectors and a PySpark ETL pipeline to load cleaned data into a Microsoft Fabric lakehouse.
🏷️  Section Type: work_experience
   Company: Canadian Food Inspection Agency
   Position: Data Scientist II

RESULT 3 - Similarity Score: 0.5154
📄 Content: Data Scientist II: Automated data categoriza

### 5.2 Search for Personality Traits Matching Job Requirements

**NEW: Direct query to personality collection (no filtering needed!)**

This mimics how `retrieve_personality_traits()` works in the resume generator with the new architecture.

In [None]:
# Simulate a job analysis with soft skills and keywords
job_analysis = {
    'soft_skills': ['analytical thinking', 'problem-solving', 'collaboration'],
    'keywords': ['strategic', 'innovative', 'team player']
}

# Build query (same logic as retrieve_personality_traits)
query_parts = job_analysis.get('soft_skills', []) + job_analysis.get('keywords', [])
query_text = ' '.join(query_parts)
query_vector = embedder.embed_query(query_text)

print(f"🔍 Job Requirements Query: '{query_text}'")
print(f"   Searching in: {personality_collection} collection (NEW!)\n")

# Search the PERSONALITY collection directly (no filtering needed!)
all_results = client.query_points(
    collection_name=personality_collection,  # ← Query personality collection directly!
    query=query_vector,
    limit=10
).points

print(f"✅ Retrieved {len(all_results)} results from personality collection")

print(f"\n📊 Top 5 Personality Trait Chunks by Semantic Similarity:\n")
print(f"💡 Benefits of simplified chunking:")
print(f"   - Pure semantic search without complex filtering")
print(f"   - Fixed-size chunks maintain consistent context windows")
print(f"   - Faster search (smaller collection)\n")

for i, result in enumerate(all_results[:5], 1):  # Top 5
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{'='*80}")
    print(f"TRAIT {i} - Similarity: {result.score:.4f}")
    print(f"{'='*80}")
    print(f"📝 Chunk Index: {metadata.get('chunk_index', 'N/A')}")
    print(f"📏 Characters: {metadata.get('char_start', 'N/A')} - {metadata.get('char_end', 'N/A')}")
    print(f"📄 Content:\n   {payload.get('content', 'N/A')}")
    print()

print("\n💡 These traits would be deduplicated (removing 100-char overlaps) and injected into the cover letter prompt!")

🔍 Job Requirements Query: 'analytical thinking problem-solving collaboration strategic innovative team player'
   Searching in: personality collection (NEW!)

✅ Retrieved 10 results from personality collection
📊 After filtering weaknesses: 10 Personality/Strength Traits

💡 Benefits of separate collection:
   - No work experience contamination (used to get mixed in old architecture)
   - Faster search (14 docs vs 49)
   - Cleaner semantic space

TRAIT 1 - Similarity: 0.4947
🏷️  Type: personality
📝 Section: Conceptual Thinking
📄 Content:
   I effortlessly grasp abstract, complex ideas, making me particularly suited to roles that require strategic analysis and long-term planning.

TRAIT 2 - Similarity: 0.4716
🏷️  Type: personality
📝 Section: Innovative Mindset
📄 Content:
   My ability to see possibilities others overlook often helps me find smarter solutions and effective improvements at work.

TRAIT 3 - Similarity: 0.4081
🏷️  Type: personality
📝 S
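
The last `print` in the cell above mentions deduplicating the 100-character overlaps before the traits reach the cover letter prompt. The notebook does not show how the resume generator does this, so the following is only one possible sketch: it joins chunks that are already in document order and drops any text repeated at each boundary. `merge_overlapping` and the sample chunks are hypothetical.

```python
def merge_overlapping(chunks, max_overlap=100):
    """Join ordered chunks, dropping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        limit = min(max_overlap, len(merged), len(chunk))
        overlap = 0
        # Find the longest suffix of `merged` that is also a prefix of `chunk`
        for size in range(limit, 0, -1):
            if merged.endswith(chunk[:size]):
                overlap = size
                break
        merged += chunk[overlap:]
    return merged


# Short made-up chunks with small overlaps, standing in for 400-character chunks
chunks = ["I prefer big-picture ", "big-picture goals and ", "goals and strategies."]
print(merge_overlapping(chunks))
```

Search results come back ordered by similarity, not by position, so in practice the chunks would need to be sorted by `chunk_index` first, and only neighbouring chunks actually share text.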

### 5.3 Semantic Search with Section Filtering

Combine semantic search with metadata filters for precise results.

In [14]:
# Search for data science achievements ONLY in work experience (resume_data collection)
query_text = "data science machine learning SQL Python dashboard visualization"
query_vector = embedder.embed_query(query_text)

# Apply filter to only search work_experience
work_filter = Filter(
    must=[FieldCondition(key="section_type", match=MatchValue(value="work_experience"))]
)

results = client.query_points(
    collection_name=resume_collection,  # ← Query resume_data collection
    query=query_vector,
    query_filter=work_filter,  # ← Apply filter during search
    limit=5
).points

print(f"🔍 Query: '{query_text}'")
print(f"📦 Collection: {resume_collection}")
print(f"🎯 Filter: section_type = 'work_experience'")
print(f"\n📊 Top {len(results)} Work Achievements:\n")

for i, result in enumerate(results, 1):
    payload = result.payload
    metadata = payload.get('metadata', {})
    
    print(f"{i}. [Score: {result.score:.4f}] {metadata.get('company', 'N/A')} - {metadata.get('position', 'N/A')}")
    print(f"   {payload.get('content', 'N/A')[:100]}...")
    print()

🔍 Query: 'data science machine learning SQL Python dashboard visualization'
📦 Collection: resume_data
🎯 Filter: section_type = 'work_experience'

📊 Top 5 Work Achievements:

1. [Score: 0.5813] Canadian Food Inspection Agency - Data Scientist
   Data Scientist: Standardized descriptive and statistical reporting in Power BI, reducing report-gene...

2. [Score: 0.5107] Rubicon Organics - Data Analyst
   Data Analyst: Built an ETL pipeline integrating five data sources totaling over 1M records using SQL...

3. [Score: 0.5098] Rubicon Organics - Data Analyst
   Data Analyst: Built three Power BI dashboards for sales and marketing by collaborating with stakehol...

4. [Score: 0.4948] Canadian Food Inspection Agency - Data Scientist II
   Data Scientist II: Automated forecasting and reduced manual effort by 40 hours per month by deployin...

5. [Score: 0.4914] Rubicon Organics - Data Analyst
   Data Analyst: Designed and implemented a reporting tool to pinpoint SKU opportunities a

## 6. Complete RAG Workflow Example (NEW Dual-Collection Architecture)

This demonstrates the updated retrieval flow using **separate collections** for resume and personality data.

In [15]:
# Simulate a complete RAG workflow for a Data Scientist job
print("="*80)
print("COMPLETE RAG WORKFLOW: Data Scientist Position")
print("(Using NEW Dual-Collection Architecture)")
print("="*80)

# 1. Job context
job_title = "Senior Data Scientist"
company = "Tech Corp"
job_description = """
Looking for a data scientist with strong Python skills, experience with machine learning,
SQL databases, and data visualization. Must have excellent analytical and problem-solving
abilities with strong communication skills.
"""

print(f"\n📋 Job: {job_title} at {company}")
print(f"📝 Requirements: Python, ML, SQL, data viz, analytical thinking, communication\n")

# 2. PHASE 1: RETRIEVAL
print("="*80)
print("PHASE 1: RETRIEVAL (Vector Similarity Search)")
print("="*80)

# Create query embedding
query_text = f"{job_title} {company} {job_description}"
query_vector = embedder.embed_query(query_text)

# Retrieve work experience from RESUME_DATA collection
print(f"\n🔍 Searching resume_data collection for work achievements...")
work_results = client.query_points(
    collection_name=resume_collection,  # ← Query resume collection
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="section_type", match=MatchValue(value="work_experience"))]
    ),
    limit=10
).points

print(f"✅ Retrieved {len(work_results)} relevant work achievements:")
for i, result in enumerate(work_results[:5], 1):
    metadata = result.payload.get('metadata', {})
    print(f"   {i}. [{result.score:.3f}] {metadata.get('company')} - {result.payload.get('content', '')[:60]}...")

# Retrieve personality traits from PERSONALITY collection
job_analysis = {
    'soft_skills': ['analytical', 'problem-solving', 'communication'],
    'keywords': ['data-driven', 'collaborative']
}

personality_query = ' '.join(job_analysis['soft_skills'] + job_analysis['keywords'])
personality_vector = embedder.embed_query(personality_query)

print(f"\n🧠 Searching personality collection for matching traits...")
personality_results = client.query_points(
    collection_name=personality_collection,  # ← Query personality collection (NEW!)
    query=personality_vector,
    limit=10
).points

# Keep the top 5 chunks. Fixed-size personality chunks carry an empty section_type,
# so accept '' as well as the older 'personality'/'strength' labels.
personality_filtered = [
    r for r in personality_results
    if (r.payload.get('section_type') or '') in ('', 'personality', 'strength')
][:5]

print(f"✅ Retrieved {len(personality_filtered)} personality traits:")
for i, result in enumerate(personality_filtered, 1):
    print(f"   {i}. [{result.score:.3f}] {result.payload.get('content', '')[:60]}...")

print(f"\n💡 Architecture Benefits:")
print(f"   ✓ Resume and personality queries run independently")
print(f"   ✓ No cross-contamination (personality search can't return work achievements)")
print(f"   ✓ Faster searches (smaller collections)")

# 3. PHASE 2: AUGMENTATION
print(f"\n{'='*80}")
print("PHASE 2: AUGMENTATION (Combine Context)")
print("="*80)
print("\n✅ Would combine:")
print(f"   - Job requirements: {job_title}, Python, ML, SQL...")
print(f"   - {len(work_results[:5])} work achievements (from resume_data collection)")
print(f"   - {len(personality_filtered)} personality traits (from personality collection)")
print("   - Into a structured prompt for Claude")

# 4. PHASE 3: GENERATION
print(f"\n{'='*80}")
print("PHASE 3: GENERATION (Claude LLM)")
print("="*80)
print("\n✅ Would call Claude API with augmented prompt to generate:")
print("   - Tailored resume sections")
print("   - Personalized cover letter")
print("   - Using ONLY the retrieved context")

print(f"\n{'='*80}")
print("✅ RAG WORKFLOW COMPLETE")
print("="*80)

COMPLETE RAG WORKFLOW: Data Scientist Position
(Using NEW Dual-Collection Architecture)

📋 Job: Senior Data Scientist at Tech Corp
📝 Requirements: Python, ML, SQL, data viz, analytical thinking, communication

PHASE 1: RETRIEVAL (Vector Similarity Search)

🔍 Searching resume_data collection for work achievements...
✅ Retrieved 10 relevant work achievements:
   1. [0.454] Canadian Food Inspection Agency - Data Scientist II: Extracted and processed millions of impor...
   2. [0.448] Canadian Food Inspection Agency - Data Scientist: Standardized descriptive and statistical rep...
   3. [0.447] Canadian Food Inspection Agency - Data Scientist II: Implemented daily automated data refreshe...
   4. [0.441] Canadian Food Inspection Agency - Data Scientist: Analyzed pathogen occurrence trends across 5...
   5. [0.435] Rubicon Organics - Data Analyst: Built an ETL pipeline integrating five data so...

🧠 Searching personality collection for matching traits...
✅ Retrieved 5 person
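
The workflow cell only describes the augmentation phase in `print` statements. As a rough sketch of what combining the retrieved context into a structured prompt could look like, the function below assembles a plain-text prompt from a job description, work achievements, and personality traits. `build_prompt` and its template wording are hypothetical and are not taken from the resume generator, and the sample strings are abbreviated from the outputs above.

```python
def build_prompt(job_title, company, job_description, achievements, traits):
    """Assemble retrieved context into one prompt string."""
    achievement_lines = "\n".join(f"- {text}" for text in achievements)
    trait_lines = "\n".join(f"- {text}" for text in traits)
    return (
        f"Job: {job_title} at {company}\n"
        f"Description: {job_description.strip()}\n\n"
        f"Relevant work achievements:\n{achievement_lines}\n\n"
        f"Relevant personality traits:\n{trait_lines}\n\n"
        "Use only the context above to draft tailored resume sections "
        "and a cover letter."
    )


prompt = build_prompt(
    job_title="Senior Data Scientist",
    company="Tech Corp",
    job_description="Python, machine learning, SQL, data visualization.",
    achievements=["Built an ETL pipeline integrating five data sources."],
    traits=["I prefer focusing on overarching goals and strategies."],
)
print(prompt)
```

In the notebook, `achievements` and `traits` could be filled from the `content` payloads of `work_results[:5]` and `personality_filtered`.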

## Summary

This notebook demonstrated the **dual-collection architecture** with **simplified personality chunking**:

### What Changed

**Before (Section-Aware Chunking for Personality):**
- Personality collection used regex to identify sections (Personality Traits, Career Preferences, Strengths)
- Complex metadata with `section_type`, `section_name`, `traits_included`
- Required header parsing and section-based grouping

**After (Simple Fixed-Size Chunking):**
- Personality collection uses simple 400-character chunks with 100-character overlap
- No header identification or section parsing
- Minimal metadata: `chunk_index`, `char_start`, `char_end`, `overlap_chars`
- No meaningful `section_type` value (the field is an empty string for personality chunks)
- No `traits_included` metadata
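
To make the fixed-size scheme concrete, here is a minimal sketch of 400-character chunking with a 100-character overlap. It is not the project's ingestion code: `chunk_text` is a hypothetical helper, and the exact meaning of each metadata field (for example, whether `char_end` is exclusive) is an assumption made for illustration.

```python
def chunk_text(text, chunk_size=400, overlap=100):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "content": text[start:end],
            "metadata": {
                "chunk_index": index,
                "char_start": start,
                "char_end": end,  # assumed exclusive end offset
                "overlap_chars": overlap if index > 0 else 0,
            },
        })
        if end == len(text):  # last chunk reached; skip a trailing all-overlap chunk
            break
    return chunks


sample = "x" * 1000  # placeholder text, 1000 characters long
for chunk in chunk_text(sample):
    print(chunk["metadata"])
```

Each chunk starts 300 characters after the previous one, so the last 100 characters of one chunk reappear as the first 100 of the next.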

### Key Features Demonstrated

1. **Collection Structure**: Viewing BOTH collections with separate document counts
2. **Resume Data Queries**: Querying `resume_data` for work experience, education, skills with full metadata
3. **Personality Data Queries**: Querying `personality` collection with simplified fixed-size chunking
4. **Embeddings**: Inspecting 1536-dimensional vectors and their statistics
5. **Semantic Search**: Using OpenAI embeddings for similarity-based retrieval from specific collections
6. **Section Filtering**: Combining semantic search with metadata filters within resume collection
7. **Complete RAG Flow**: End-to-end retrieval → augmentation → generation workflow using both collections

### Architecture Benefits

- ✅ **Semantic separation**: Resume facts vs personality traits stored independently
- ✅ **Simplified storage**: Personality collection doesn't require complex section metadata
- ✅ **No cross-contamination**: Personality searches retrieve only personality data
- ✅ **Faster queries**: Smaller, focused collections = faster semantic search
- ✅ **Cleaner code**: No complex section_type filtering or header parsing logic
- ✅ **Better relevance**: Semantic matching within focused collections yields better results

### Key Insights

- **Chunking preserves context**: Each 400-char chunk maintains semantic meaning through overlap
- **Embeddings enable semantic matching**: Query "analytical thinking" matches related personality traits
- **Collection isolation prevents noise**: Searching personality collection won't return work achievements
- **Metadata enables filtering**: Can retrieve specific section types within resume collection
- **Similarity scores guide selection**: Higher scores = more relevant to query
- **Simplified approach still works**: Fixed-size chunking is sufficient for personality content

### Next Steps

- Run cells to explore your actual dual-collection vector database
- Compare query results between collections
- Modify queries to test different job requirements
- Experiment with `score_threshold` values
- Try combining multiple filters within each collection
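
As a starting point for the `score_threshold` experiment, the sketch below counts how many of the five scores printed in section 5.3 would survive different thresholds. It works on the printed scores only, so it cannot say how many additional results a lower threshold would admit beyond the top five. `count_above` is a hypothetical helper.

```python
# Scores copied from the filtered work-experience search output in section 5.3
scores = [0.5813, 0.5107, 0.5098, 0.4948, 0.4914]


def count_above(scores, threshold):
    """Number of results that a given score_threshold would keep."""
    return sum(1 for score in scores if score >= threshold)


for threshold in (0.45, 0.50, 0.55):
    kept = count_above(scores, threshold)
    print(f"score_threshold={threshold:.2f} keeps {kept} of {len(scores)} results")
```

Raising the threshold from 0.50 to 0.55 cuts the kept results from three to one, so small changes can noticeably shrink the retrieved context.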