# Upgrade Embeddings to Larger Dimensions

This notebook re-embeds your clinical chunks and DAIC-WOZ conversations with larger, better embedding models.

**Recommended Models:**
- `Alibaba-NLP/gte-large-en-v1.5` (1024d) - Best quality ⭐
- `nomic-ai/nomic-embed-text-v1.5` (768d) - Great for long texts
- `BAAI/bge-large-en-v1.5` (1024d) - Excellent general purpose

**Expected Time:** ~5-10 minutes with GPU

## 1. Setup & Configuration

In [1]:
# Imports
import os
import pickle
import numpy as np
import faiss
import torch
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import shutil
from datetime import datetime

print("✅ Imports successful")

✅ Imports successful


In [2]:
# Configuration - CHANGE THESE AS NEEDED
INPUT_DIR = 'data/RAG'  # Your current data directory
OUTPUT_DIR = 'data/RAG_1024d'  # Where to save upgraded embeddings

# Choose your embedding model
EMBEDDING_MODEL = 'Alibaba-NLP/gte-large-en-v1.5'  # 1024d - Best quality
# EMBEDDING_MODEL = 'nomic-ai/nomic-embed-text-v1.5'  # 768d - Good for long texts
# EMBEDDING_MODEL = 'BAAI/bge-large-en-v1.5'  # 1024d - Excellent general

# Batch sizes (adjust based on GPU memory)
BATCH_SIZE_CHUNKS = 32  # For short clinical chunks
BATCH_SIZE_CONVOS = 16  # For long conversations

print(f"Configuration:")
print(f"  Input: {INPUT_DIR}")
print(f"  Output: {OUTPUT_DIR}")
print(f"  Model: {EMBEDDING_MODEL}")
print(f"  Batch sizes: chunks={BATCH_SIZE_CHUNKS}, convos={BATCH_SIZE_CONVOS}")

Configuration:
  Input: data/RAG
  Output: data/RAG_1024d
  Model: Alibaba-NLP/gte-large-en-v1.5
  Batch sizes: chunks=32, convos=16
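
The batch sizes set how many forward passes each encoding step needs. A minimal sketch of that arithmetic follows, using the dataset sizes this notebook reports once the data is loaded (704 clinical chunks, 189 conversations). `num_batches` is an illustrative helper, not part of `sentence_transformers`.

```python
import math

def num_batches(n_items: int, batch_size: int) -> int:
    """Number of batches needed to cover n_items; the last batch may be partial."""
    return math.ceil(n_items / batch_size)

# Dataset sizes as reported later in this notebook
print(num_batches(704, 32))  # clinical chunks with BATCH_SIZE_CHUNKS = 32
print(num_batches(189, 16))  # conversations with BATCH_SIZE_CONVOS = 16
```

These counts should match the totals shown in the `Batches:` progress bars of the encoding cells. Halving a batch size to fit GPU memory roughly doubles the number of batches.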


In [3]:
# Check GPU availability
if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    # Clear GPU cache
    torch.cuda.empty_cache()
    
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    print(f"   Allocated: {allocated:.2f} GB")
    print(f"   Available: {torch.cuda.get_device_properties(0).total_memory / 1e9 - reserved:.2f} GB")
    USE_GPU = True
else:
    print("⚠️  No GPU detected, using CPU (this will be slow)")
    USE_GPU = False

✅ GPU Available: NVIDIA GeForce RTX 4070 SUPER
   GPU Memory: 12.88 GB
   Allocated: 0.00 GB
   Available: 12.88 GB
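
The cell above divides the byte count by `1e9`, so it reports decimal gigabytes. The sketch below converts the same figure to binary gibibytes, which is one likely reason a card sold as 12 GB shows up as 12.88 GB here. The byte count is reconstructed from the rounded output, so treat it as approximate. `bytes_to_gb` and `bytes_to_gib` are illustrative helpers, not PyTorch functions.

```python
def bytes_to_gb(n_bytes: float) -> float:
    """Decimal gigabytes (10**9 bytes), the unit printed by the cell above."""
    return n_bytes / 1e9

def bytes_to_gib(n_bytes: float) -> float:
    """Binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30

# Assumption: byte count reconstructed from the rounded "12.88 GB" output
approx_total_bytes = 12.88e9

print(f"{bytes_to_gb(approx_total_bytes):.2f} GB")
print(f"{bytes_to_gib(approx_total_bytes):.2f} GiB")
```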


In [4]:
# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"✅ Output directory ready: {OUTPUT_DIR}")

✅ Output directory ready: data/RAG_1024d


## 2. Load Embedding Model

In [5]:
# Load the embedding model
print(f"Loading model: {EMBEDDING_MODEL}...")
print("This may take a minute the first time (downloading model)")

device = 'cuda' if USE_GPU else 'cpu'

# Add trust_remote_code=True for Alibaba models
embed_model = SentenceTransformer(
    EMBEDDING_MODEL, 
    device=device,
    trust_remote_code=True  # Required for Alibaba-NLP models
)

# Get embedding dimension
EMBEDDING_DIM = embed_model.get_sentence_embedding_dimension()

print(f"\n✅ Model loaded successfully")
print(f"   Device: {device}")
print(f"   Embedding dimension: {EMBEDDING_DIM}")
print(f"   Max sequence length: {embed_model.max_seq_length}")

Loading model: Alibaba-NLP/gte-large-en-v1.5...
This may take a minute the first time (downloading model)

✅ Model loaded successfully
   Device: cuda
   Embedding dimension: 1024
   Max sequence length: 8192


## 3. Load Original Data

In [6]:
# Load clinical chunks
chunks_path = os.path.join(INPUT_DIR, 'chunks.pkl')
print(f"Loading chunks from: {chunks_path}")

with open(chunks_path, 'rb') as f:
    chunks = pickle.load(f)

print(f"✅ Loaded {len(chunks)} clinical chunks")
print(f"   First chunk preview (200 chars): {chunks[0][:200]}...")

Loading chunks from: data/RAG\chunks.pkl
✅ Loaded 704 clinical chunks
   First chunk preview (200 chars): Depressive disorders
include disruptive mood dysregulation
disorder, major depressive disorder (including major depressive episode),
persistent depressive disorder, premenstrual dysphoric disorder,
su...


In [8]:
# Load DAIC-WOZ data
daic_path = os.path.join(INPUT_DIR, 'diac_woz_data.pkl')
print(f"Loading DAIC-WOZ data from: {daic_path}")

with open(daic_path, 'rb') as f:
    daic_woz_data = pickle.load(f)

print(f"✅ Loaded {len(daic_woz_data['patient_ids'])} patient conversations")
print(f"   Old embedding shape: {daic_woz_data['embeddings'].shape}")
print(f"   Available keys: {list(daic_woz_data.keys())}")

Loading DAIC-WOZ data from: data/RAG\diac_woz_data.pkl
✅ Loaded 189 patient conversations
   Old embedding shape: (189, 384)
   Available keys: ['patient_ids', 'conversations', 'embeddings', 'mdd_binary', 'phq8_scores']


In [11]:
# Load mandatory context files
phq8_path = os.path.join(INPUT_DIR, 'phq8.txt')
dsm5_path = os.path.join(INPUT_DIR, 'mandatory_context_DSM5_MMD.txt')

with open(phq8_path, 'r', encoding='utf-8') as f:
    phq8 = f.read()

with open(dsm5_path, 'r', encoding='utf-8') as f:
    dsm5 = f.read()

print(f"✅ Loaded mandatory context files")
print(f"   PHQ-8: {len(phq8)} characters")
print(f"   DSM-5: {len(dsm5)} characters")

✅ Loaded mandatory context files
   PHQ-8: 1517 characters
   DSM-5: 15750 characters


## 4. Re-embed Clinical Chunks

In [12]:
# Embed clinical chunks
print(f"\n{'='*80}")
print(f"Embedding {len(chunks)} clinical chunks with {EMBEDDING_MODEL}")
print(f"{'='*80}\n")

chunk_embeddings = embed_model.encode(
    chunks,
    batch_size=BATCH_SIZE_CHUNKS,
    show_progress_bar=True,
    normalize_embeddings=True,  # Important for cosine similarity
    convert_to_numpy=True
)

print(f"\n✅ Chunks embedded successfully")
print(f"   Shape: {chunk_embeddings.shape}")
print(f"   Dimension: {chunk_embeddings.shape[1]}d")
print(f"   Upgrade: 384d → {chunk_embeddings.shape[1]}d")


Embedding 704 clinical chunks with Alibaba-NLP/gte-large-en-v1.5



Batches:   0%|          | 0/22 [00:00<?, ?it/s]


✅ Chunks embedded successfully
   Shape: (704, 1024)
   Dimension: 1024d
   Upgrade: 384d → 1024d


In [13]:
# Create FAISS index for clinical chunks
print(f"\nCreating FAISS index...")

dimension = chunk_embeddings.shape[1]

# Use IndexFlatIP for inner product (cosine similarity with normalized vectors)
chunk_index = faiss.IndexFlatIP(dimension)

# Add embeddings
chunk_index.add(chunk_embeddings.astype('float32'))

print(f"✅ FAISS index created")
print(f"   Total vectors: {chunk_index.ntotal}")
print(f"   Dimension: {dimension}")


Creating FAISS index...
✅ FAISS index created
   Total vectors: 704
   Dimension: 1024
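
`IndexFlatIP` scores every stored vector by inner product. Because the embeddings were L2-normalized at encode time, that inner product equals cosine similarity. The pure-Python sketch below shows this equivalence, and the exhaustive top-k search, on toy 3-d vectors. The helper names and the vectors are made up for illustration. This is not how FAISS is implemented internally.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k_inner_product(query, vectors, k):
    """Exhaustive search: score every vector, return (index, score) pairs, best first."""
    scored = [(i, dot(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors (not real embeddings)
raw_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
raw_query = [2.0, 1.0, 0.0]

unit_vectors = [l2_normalize(v) for v in raw_vectors]
unit_query = l2_normalize(raw_query)

for idx, score in top_k_inner_product(unit_query, unit_vectors, k=2):
    # Inner product of unit vectors matches the cosine of the raw vectors
    assert math.isclose(score, cosine(raw_query, raw_vectors[idx]))
    print(idx, round(score, 4))
```

If vectors were added without normalization, the same index would rank by raw inner product, which favors longer vectors over better-aligned ones.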


In [14]:
# Save chunks and index
output_chunks_path = os.path.join(OUTPUT_DIR, 'chunks.pkl')
output_index_path = os.path.join(OUTPUT_DIR, f'depression_embeddings_{dimension}d.index')

with open(output_chunks_path, 'wb') as f:
    pickle.dump(chunks, f)

faiss.write_index(chunk_index, output_index_path)

print(f"\n✅ Saved clinical chunks and index:")
print(f"   Chunks: {output_chunks_path}")
print(f"   Index: {output_index_path}")


✅ Saved clinical chunks and index:
   Chunks: data/RAG_1024d\chunks.pkl
   Index: data/RAG_1024d\depression_embeddings_1024d.index


In [15]:
# Extract patient-only responses for better retrieval

def extract_patient_responses(conversation):
    """
    Extract only patient/participant responses from DAIC-WOZ conversation
    Format: tab-separated with "speaker\\tvalue"
    
    Args:
        conversation: str - Full conversation with tab-separated format
    
    Returns:
        str - Patient responses only, space-separated
    """
    lines = conversation.split('\n')
    patient_lines = []
    
    for line in lines:
        # Skip empty lines
        if not line.strip():
            continue
            
        # Split by tab
        parts = line.split('\t')
        
        # Need at least 2 parts (speaker and value)
        if len(parts) >= 2:
            speaker = parts[0].strip().lower()
            value = parts[1].strip()
            
            # Check if speaker is participant
            if speaker == 'participant' and value:
                patient_lines.append(value)
    
    return ' '.join(patient_lines)

print("="*80)
print("EXTRACTING PATIENT-ONLY RESPONSES")
print("="*80)

# Extract patient-only from all conversations
patient_only_conversations = []

for conv in tqdm(daic_woz_data['conversations'], desc="Extracting patient responses"):
    patient_only = extract_patient_responses(conv)
    patient_only_conversations.append(patient_only)

# Calculate statistics
full_lengths = [len(conv) for conv in daic_woz_data['conversations']]
patient_lengths = [len(patient) for patient in patient_only_conversations]

avg_full = np.mean(full_lengths)
avg_patient = np.mean(patient_lengths)
reduction = (1 - avg_patient / avg_full) * 100

print(f"\n✅ Extraction complete!")
print(f"   Total conversations: {len(patient_only_conversations)}")
print(f"   Average full conversation: {avg_full:.0f} characters")
print(f"   Average patient-only: {avg_patient:.0f} characters")
print(f"   Average reduction: {reduction:.1f}%")

# Show example
print(f"\n{'='*80}")
print("EXAMPLE: Patient 0")
print(f"{'='*80}")
print(f"\nFull conversation (first 500 chars):")
print(daic_woz_data['conversations'][0][:500])
print(f"\nPatient-only (first 500 chars):")
print(patient_only_conversations[0][:500])

# Count patient words per conversation (responses are already joined into one string)
patient_word_counts = [len(text.split()) for text in patient_only_conversations]
avg_words = np.mean(patient_word_counts)
print(f"\n✅ Statistics:")
print(f"   Average patient words per conversation: {avg_words:.0f}")
print(f"   Shortest patient-only text: {min(patient_lengths)} chars")
print(f"   Longest patient-only text: {max(patient_lengths)} chars")

# Store both versions
daic_woz_data['full_conversations'] = daic_woz_data['conversations'].copy()
daic_woz_data['patient_only_conversations'] = patient_only_conversations

print(f"\n✅ Added to daic_woz_data:")
print(f"   'full_conversations' - original with Ellie + Participant")
print(f"   'patient_only_conversations' - patient responses only")
print(f"\n🚀 Ready to embed patient-only conversations (will be much faster!)")

EXTRACTING PATIENT-ONLY RESPONSES


Extracting patient responses: 100%|██████████| 189/189 [00:00<00:00, 16416.57it/s]


✅ Extraction complete!
   Total conversations: 189
   Average full conversation: 13102 characters
   Average patient-only: 7292 characters
   Average reduction: 44.3%

EXAMPLE: Patient 0

Full conversation (first 500 chars):
speaker	value
Ellie	hi i'm ellie thanks for coming in today
Ellie	i was created to talk to people in a safe and secure environment
Ellie	think of me as a friend i don't judge i can't i'm a computer
Ellie	i'm here to learn about people and would love to learn about you
Ellie	i'll ask a few questions to get us started and please feel free to tell me anything your answers are totally confidential
Ellie	how are you doing today
Participant	good
Ellie	that's good
Ellie	where are you from originally
Pa

Patient-only (first 500 chars):
good atlanta georgia um my parents are from here um i love it i like the weather i like the opportunities um yes um it took a minute somewhat easy congestion that's it um i took up business and administration uh yeah i am here and there i
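
To make the filtering rule concrete, here is a compact restatement of `extract_patient_responses` run on a made-up four-turn transcript in the same `speaker<TAB>value` layout. The sample text is invented for illustration. The reduction is computed the same way as in the cell above.

```python
def extract_patient_responses(conversation: str) -> str:
    """Keep only Participant utterances, joined by single spaces."""
    patient_lines = []
    for line in conversation.split('\n'):
        parts = line.split('\t')
        # Header rows, blank lines, and Ellie's turns all fail this check
        if len(parts) >= 2 and parts[0].strip().lower() == 'participant':
            value = parts[1].strip()
            if value:
                patient_lines.append(value)
    return ' '.join(patient_lines)

# Made-up transcript (not taken from DAIC-WOZ)
sample = (
    "speaker\tvalue\n"
    "Ellie\thow are you doing today\n"
    "Participant\tgood\n"
    "Ellie\twhere are you from originally\n"
    "\n"
    "Participant\tatlanta georgia\n"
)

patient_only = extract_patient_responses(sample)
reduction = (1 - len(patient_only) / len(sample)) * 100

print(patient_only)
print(f"Reduction: {reduction:.1f}%")
```

Note that the header row `speaker\tvalue` is dropped only because its first field is not `participant`. A transcript that labels the speaker differently would yield an empty string.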




## 5. Re-embed DAIC-WOZ Conversations

In [16]:
# Embed PATIENT-ONLY conversations (faster and better retrieval!)
print(f"\n{'='*80}")
print(f"Embedding {len(patient_only_conversations)} patient-only conversations")
print(f"{'='*80}\n")

conversation_embeddings = embed_model.encode(
    patient_only_conversations,  # ✅ Use patient-only
    batch_size=BATCH_SIZE_CONVOS,
    show_progress_bar=True,
    normalize_embeddings=True,
    convert_to_numpy=True
)

print(f"\n✅ Conversations embedded successfully")
print(f"   Old shape: {daic_woz_data['embeddings'].shape}")
print(f"   New shape: {conversation_embeddings.shape}")
print(f"   Upgrade: {daic_woz_data['embeddings'].shape[1]}d → {conversation_embeddings.shape[1]}d")
print(f"   🚀 Using patient-only text for better retrieval!")


Embedding 189 patient-only conversations



Batches:   0%|          | 0/12 [00:00<?, ?it/s]


✅ Conversations embedded successfully
   Old shape: (189, 384)
   New shape: (189, 1024)
   Upgrade: 384d → 1024d
   🚀 Using patient-only text for better retrieval!


In [17]:
# Update DAIC-WOZ data with new embeddings
daic_woz_data['embeddings'] = conversation_embeddings
daic_woz_data['embedding_model'] = EMBEDDING_MODEL
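# Read the dimension from the conversation embeddings themselves rather than
# reusing `dimension`, which was derived from the chunk embeddings
daic_woz_data['embedding_dimension'] = conversation_embeddings.shape[1]
daic_woz_data['upgrade_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

print(f"✅ Updated DAIC-WOZ data dictionary")
print(f"   New keys added: embedding_model, embedding_dimension, upgrade_date")

✅ Updated DAIC-WOZ data dictionary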
   New keys added: embedding_model, embedding_dimension, upgrade_date


In [18]:
# Save updated DAIC-WOZ data
output_daic_path = os.path.join(OUTPUT_DIR, f'diac_woz_data_{dimension}d.pkl')

with open(output_daic_path, 'wb') as f:
    pickle.dump(daic_woz_data, f)

print(f"✅ Saved DAIC-WOZ data:")
print(f"   Path: {output_daic_path}")
print(f"   Embedding shape: {conversation_embeddings.shape}")

✅ Saved DAIC-WOZ data:
   Path: data/RAG_1024d\diac_woz_data_1024d.pkl
   Embedding shape: (189, 1024)
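
Since the pickled dictionary stores several parallel per-patient fields, a quick alignment check before relying on it downstream can catch mismatches early. The sketch below is one possible check. `check_alignment` is a hypothetical helper, and the toy dictionary uses 4-d vectors and made-up values in place of the real data.

```python
def check_alignment(data: dict) -> None:
    """Raise ValueError if per-patient fields disagree in length or dimension."""
    n = len(data['patient_ids'])
    for key in ('embeddings', 'patient_only_conversations', 'mdd_binary', 'phq8_scores'):
        if len(data[key]) != n:
            raise ValueError(f"{key}: expected {n} rows, got {len(data[key])}")
    # Every embedding row should have the recorded dimension
    dim = data['embedding_dimension']
    for i, row in enumerate(data['embeddings']):
        if len(row) != dim:
            raise ValueError(f"embedding {i}: expected {dim} values, got {len(row)}")

# Toy stand-in for daic_woz_data (made-up values, 4-d instead of 1024-d)
toy = {
    'patient_ids': [300, 301],
    'embeddings': [[0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 0.0, 0.0]],
    'patient_only_conversations': ["good", "fine thanks"],
    'mdd_binary': [0, 1],
    'phq8_scores': [2, 15],
    'embedding_dimension': 4,
}

check_alignment(toy)
print("alignment OK")
```

The same function should work on the real dictionary after loading it with `pickle.load`, because `len()` behaves the same way on NumPy arrays and lists.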


## 6. Copy Mandatory Context Files

In [19]:
# Copy mandatory context files to new directory
files_to_copy = ['phq8.txt', 'mandatory_context_DSM5_MMD.txt']

print("Copying mandatory context files...")

for filename in files_to_copy:
    src = os.path.join(INPUT_DIR, filename)
    dst = os.path.join(OUTPUT_DIR, filename)
    
    if os.path.exists(src):
        shutil.copy2(src, dst)
        print(f"  ✅ Copied: {filename}")
    else:
        print(f"  ⚠️  Not found: {filename}")

Copying mandatory context files...
  ✅ Copied: phq8.txt
  ✅ Copied: mandatory_context_DSM5_MMD.txt


## 7. Verify Embeddings (Optional)

In [21]:
# Test retrieval with new embeddings
print("\nTesting retrieval with upgraded embeddings...")
print("="*80)

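# Pick a test conversation by its position in the dataset (index 8 here)
test_idx = 8
test_embedding = conversation_embeddings[test_idx].reshape(1, -1).astype('float32')
faiss.normalize_L2(test_embedding)  # Redundant safeguard: embeddings were normalized at encode time

# Search for top-5 similar chunks
# With IndexFlatIP, `distances` holds inner products (cosine similarities here), so higher is better
k = 5
distances, indices = chunk_index.search(test_embedding, k)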

print(f"\nTest query: Patient {daic_woz_data['patient_ids'][test_idx]}")
print(f"True MDD: {daic_woz_data['mdd_binary'][test_idx]}")
print(f"PHQ-8 Score: {daic_woz_data['phq8_scores'][test_idx]}")
print(f"\nTop-{k} retrieved chunks:\n")

for i, (idx, score) in enumerate(zip(indices[0], distances[0])):
    print(f"Rank {i+1} | Similarity: {score:.4f} | Chunk Index: {idx}")
    print(f"{chunks[idx][:200]}...")
    print("-"*80)

print("\n✅ Retrieval test complete!")


Testing retrieval with upgraded embeddings...

Test query: Patient 308
True MDD: 1
PHQ-8 Score: 22

Top-5 retrieved chunks:

Rank 1 | Similarity: 0.5679 | Chunk Index: 485
with intense and persistent yearning and longing for the Latinos appear less likely to receive treatment for mood
deceased person, and complicated by guilty or angry ru- disorders (663–665).
minations...
--------------------------------------------------------------------------------
Rank 2 | Similarity: 0.5591 | Chunk Index: 489
tion that depression in the context of bereavement differs more likely to prefer counseling than whites, whereas Af-
from other major depressive episodes, and data indicate rican Americans varied acro...
--------------------------------------------------------------------------------
Rank 3 | Similarity: 0.5487 | Chunk Index: 540
Hispanic, or black decreased risk (655). (about one-fifth of the total) received adequate treatment
The impact of major depressive disorders on individu- (976). 

## 8. Create README

In [22]:
# Create README with information about the upgrade
readme_content = f"""# Upgraded Embeddings

## Model Information
- **Model**: {EMBEDDING_MODEL}
- **Embedding dimension**: {dimension}
- **Original dimension**: 384
- **Date created**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
- **Device used**: {device.upper()}

## Files
- `chunks.pkl` - {len(chunks)} clinical text chunks
- `depression_embeddings_{dimension}d.index` - FAISS index for chunks
- `diac_woz_data_{dimension}d.pkl` - {len(daic_woz_data['patient_ids'])} patient conversations with embeddings
- `phq8.txt` - PHQ-8 questionnaire (mandatory context)
- `mandatory_context_DSM5_MMD.txt` - DSM-5 criteria (mandatory context)

## Usage
Update your notebook to load from this directory:

```python
# Load chunks and index
with open('{OUTPUT_DIR}/chunks.pkl', 'rb') as f:
    chunks = pickle.load(f)

index = faiss.read_index('{OUTPUT_DIR}/depression_embeddings_{dimension}d.index')

# Load DAIC-WOZ data
with open('{OUTPUT_DIR}/diac_woz_data_{dimension}d.pkl', 'rb') as f:
    daic_woz_data = pickle.load(f)

# Load mandatory context
with open('{OUTPUT_DIR}/phq8.txt', 'r', encoding='utf-8') as f:
    phq8 = f.read()
    
with open('{OUTPUT_DIR}/mandatory_context_DSM5_MMD.txt', 'r', encoding='utf-8') as f:
    dsm5 = f.read()
```

## Expected Performance Improvements
With {dimension}d embeddings vs 384d:
- Better semantic understanding of clinical concepts
- Potentially improved retrieval accuracy (not measured in this notebook; compare against the 384d baseline)
- More nuanced similarity scores
- Better handling of long conversation contexts
"""

readme_path = os.path.join(OUTPUT_DIR, 'README.md')
with open(readme_path, 'w', encoding='utf-8') as f:
    f.write(readme_content)

print(f"✅ Created README: {readme_path}")

✅ Created README: data/RAG_1024d\README.md


## 9. Summary

In [23]:
# Print summary
print("\n" + "="*80)
print("🎉 EMBEDDING UPGRADE COMPLETE!")
print("="*80)
print(f"\nUpgrade Summary:")
print(f"  Old embeddings: 384 dimensions")
print(f"  New embeddings: {dimension} dimensions")
print(f"  Model used: {EMBEDDING_MODEL}")
print(f"  Device: {device.upper()}")
print(f"\nFiles saved to: {OUTPUT_DIR}")
print(f"  ✅ chunks.pkl")
print(f"  ✅ depression_embeddings_{dimension}d.index")
print(f"  ✅ diac_woz_data_{dimension}d.pkl")
print(f"  ✅ phq8.txt")
print(f"  ✅ mandatory_context_DSM5_MMD.txt")
print(f"  ✅ README.md")
print(f"\nNext steps:")
print(f"  1. Run 'run_large_model_rag.ipynb' with this new data directory")
print(f"  2. Use a larger model (e.g., Llama 70B) for better inference")
print(f"  3. Compare results with old 384d embeddings")
print("="*80)


🎉 EMBEDDING UPGRADE COMPLETE!

Upgrade Summary:
  Old embeddings: 384 dimensions
  New embeddings: 1024 dimensions
  Model used: Alibaba-NLP/gte-large-en-v1.5
  Device: CUDA

Files saved to: data/RAG_1024d
  ✅ chunks.pkl
  ✅ depression_embeddings_1024d.index
  ✅ diac_woz_data_1024d.pkl
  ✅ phq8.txt
  ✅ mandatory_context_DSM5_MMD.txt
  ✅ README.md

Next steps:
  1. Run 'run_large_model_rag.ipynb' with this new data directory
  2. Use a larger model (e.g., Llama 70B) for better inference
  3. Compare results with old 384d embeddings
