# RAG Debugging - Step 1: Embedding Tests (Simplified)

**Problem**: The RAG pipeline retrieves hardly any relevant chunks for German/English queries

**Hypothesis**: The embedding model handles multilingual content poorly

**Test**: Test directly with sentence-transformers (sidesteps dependency issues)

## Setup - Using sentence-transformers Directly

In [1]:
# Simple, direct approach
from sentence_transformers import SentenceTransformer
import numpy as np
import time

print("✅ sentence-transformers loaded")

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("✅ Helper functions ready")

ModuleNotFoundError: No module named 'sentence_transformers'
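
The `ModuleNotFoundError` above means the package cannot be imported in the kernel's active environment, so none of the later cells can run until it is installed. A minimal sketch for checking this up front, using only the standard library. The helper name `is_installed` is an illustrative assumption, not part of the original notebook, and the hint assumes the package is installed from PyPI under the name `sentence-transformers`.

```python
import importlib.util
import sys

def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec(module_name) is not None

if not is_installed("sentence_transformers"):
    # Use the interpreter the kernel runs on, so the package lands in the right environment.
    print(f"Missing package. Try: {sys.executable} -m pip install sentence-transformers")
```

Building the command from `sys.executable` avoids the common case where `pip` on the shell path belongs to a different Python than the notebook kernel.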

## Test Data - Realistic RAG Examples

These are typical passages from technical documentation:

In [None]:
# German tech texts (typical chunk contents); kept in German on purpose
german_docs = [
    "Die Vektorsuche in ChromaDB verwendet Cosinus-Ähnlichkeit für semantische Suchen.",
    "Chunking-Strategien sollten bei technischen Dokumentationen header-bewusst sein.",
    "Der Similarity-Threshold von 0.7 ist oft zu restriktiv für multilinguale Inhalte."
]

# English tech texts (same meaning as the German ones, index by index)
english_docs = [
    "Vector search in ChromaDB uses cosine similarity for semantic searches.",
    "Chunking strategies should be header-aware for technical documentation.",
    "The similarity threshold of 0.7 is often too restrictive for multilingual content."
]

# Typical user queries
queries = {
    "german": "Wie funktioniert Vektorsuche?",
    "english": "How does vector search work?",
    "mixed": "ChromaDB similarity threshold"
}

print("📝 Test Setup:")
print(f"   - {len(german_docs)} German documents")
print(f"   - {len(english_docs)} English documents")
print(f"   - {len(queries)} test queries")

## Test 1: Current Model (all-MiniLM-L6-v2)

This is the model currently used in your RAG system:

In [None]:
# Load the model currently used by the RAG system
print("🤖 Loading current model: all-MiniLM-L6-v2")
current_model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"📐 Max sequence length: {current_model.max_seq_length}")
print(f"📊 Embedding dimension: {current_model.get_sentence_embedding_dimension()}")

# Encode all documents
print("\n🔄 Encoding documents...")
de_embeddings = current_model.encode(german_docs)
en_embeddings = current_model.encode(english_docs)

# encode() returns NumPy arrays, so `+` would add them element-wise; sum the lengths instead
print(f"✅ Encoded {len(de_embeddings) + len(en_embeddings)} documents")

## Test 2: Cross-Language Similarity Matrix

**This is the critical test**: Does the model recognize that German and English texts with the same meaning are similar?

In [None]:
print("📊 Cross-Language Similarity Analysis")
print("=" * 60)

for i, (de_text, de_emb) in enumerate(zip(german_docs, de_embeddings)):
    print(f"\n🇩🇪 German Doc {i+1}: {de_text[:50]}...")

    for j, (en_text, en_emb) in enumerate(zip(english_docs, en_embeddings)):
        similarity = cosine_similarity(de_emb, en_emb)

        # Highlight the expected pairs (same meaning, same index)
        marker = "🎯" if i == j else "  "

        print(f"   {marker} vs EN Doc {j+1}: {similarity:.3f}")
        if i == j:
            print(f"       🇬🇧 {en_text[:50]}...")
            # Rate the similarity of the translation pair
            if similarity > 0.7:
                print("       ✅ Recognized very well!")
            elif similarity > 0.5:
                print("       ⚠️  Recognized okay")
            else:
                print("       ❌ Recognized poorly!")
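
The nested loop above computes one similarity at a time. As a sketch of the same calculation in vectorized form, the function below normalizes each row and takes a single matrix product. The name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration and do not come from the original notebook.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    # Scale every row to unit length; the dot product of unit vectors is the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-D vectors standing in for real embeddings (not model output)
de = np.array([[1.0, 0.0], [0.0, 2.0]])
en = np.array([[2.0, 0.0], [1.0, 1.0]])

sim = cosine_similarity_matrix(de, en)
print(np.round(sim, 3))
```

Applied to the notebook's variables, `cosine_similarity_matrix(de_embeddings, en_embeddings)` would give a 3x3 matrix whose diagonal holds the scores of the translation pairs, so `np.diag(...)` extracts exactly the values the loop marks with 🎯.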

## Test 3: Query-Retrieval Simulation

Let's simulate a real RAG search: which documents would be retrieved for each query?

In [None]:
# Combine all documents for the search
all_docs = german_docs + english_docs
all_embeddings = np.vstack([de_embeddings, en_embeddings])

def test_query(query_text, query_name):
    print(f"\n🔍 Query Test: {query_name}")
    print(f"Query: '{query_text}'")
    print("-" * 50)

    # Query embedding
    query_emb = current_model.encode([query_text])[0]

    # Similarity to every document
    similarities = []
    for i, (doc, doc_emb) in enumerate(zip(all_docs, all_embeddings)):
        sim = cosine_similarity(query_emb, doc_emb)
        lang = "🇩🇪" if i < len(german_docs) else "🇬🇧"
        similarities.append((sim, doc, lang, i+1))

    # Sort by relevance (highest similarity first)
    similarities.sort(key=lambda x: x[0], reverse=True)

    # Show the top results
    print("Top Results:")
    for rank, (score, doc, lang, doc_id) in enumerate(similarities[:3], 1):
        print(f"  {rank}. [{score:.3f}] {lang} Doc{doc_id}: {doc[:60]}...")

    # RAG threshold analysis: how many documents survive each cutoff?
    above_07 = sum(1 for score, _, _, _ in similarities if score >= 0.7)
    above_05 = sum(1 for score, _, _, _ in similarities if score >= 0.5)
    above_03 = sum(1 for score, _, _, _ in similarities if score >= 0.3)

    print("\n📊 Threshold Analysis:")
    print(f"   - Above 0.7: {above_07}/{len(similarities)} docs (current RAG threshold)")
    print(f"   - Above 0.5: {above_05}/{len(similarities)} docs")
    print(f"   - Above 0.3: {above_03}/{len(similarities)} docs")

    if above_07 == 0:
        print("   ❌ At threshold 0.7: NO results!")

    return similarities

# Test all queries
results = {}
for query_name, query_text in queries.items():
    results[query_name] = test_query(query_text, query_name)
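
The threshold counts above can also be read as a retrieval rule: a retriever with a similarity cutoff returns only the documents at or above it, capped at some top-k. A minimal sketch of that rule in plain Python follows. The function name `retrieve`, its defaults, and the scores are made up for illustration; the scores are not measured results.

```python
def retrieve(scored_docs, threshold=0.7, top_k=3):
    """Return up to `top_k` tuples whose first element (the score) is >= `threshold`.

    Each item in `scored_docs` is a tuple that starts with the similarity score.
    Results are ordered from highest to lowest score.
    """
    hits = [item for item in scored_docs if item[0] >= threshold]
    hits.sort(key=lambda item: item[0], reverse=True)
    return hits[:top_k]

# Invented scores for illustration only
example = [(0.62, "doc A"), (0.41, "doc B"), (0.75, "doc C"), (0.18, "doc D")]

for t in (0.7, 0.5, 0.3):
    print(t, [doc for _, doc in retrieve(example, threshold=t)])
```

Because the helper only looks at the first tuple element, it should also accept the 4-tuples returned by `test_query`, for example `retrieve(results["german"], threshold=0.5)`.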

## Analysis: What Do We See?

**Now we can assess this concretely:**

In [None]:
print("🎯 SUMMARY OF FINDINGS")
print("=" * 50)

# Cross-language performance: similarity of each German doc to its English counterpart
cross_lang_scores = []
for i in range(len(german_docs)):
    score = cosine_similarity(de_embeddings[i], en_embeddings[i])
    cross_lang_scores.append(score)

avg_cross_lang = np.mean(cross_lang_scores)
print("📊 Cross-Language Performance:")
print(f"   - Average DE<->EN: {avg_cross_lang:.3f}")
print(f"   - Range: {min(cross_lang_scores):.3f} - {max(cross_lang_scores):.3f}")

if avg_cross_lang < 0.5:
    print("   ❌ PROBLEM: Very poor cross-language performance!")
elif avg_cross_lang < 0.7:
    print("   ⚠️  PROBLEM: Mediocre cross-language performance")
else:
    print("   ✅ Good cross-language performance")

# Query performance: best score and number of hits at the current threshold
print("\n🔍 Query Performance:")
for query_name, query_results in results.items():
    best_score = query_results[0][0]
    above_threshold = sum(1 for score, _, _, _ in query_results if score >= 0.7)

    print(f"   - {query_name}: Best={best_score:.3f}, Above0.7={above_threshold}")

print("\n💡 RECOMMENDATIONS:")
if avg_cross_lang < 0.5:
    print("   1. 🚨 URGENT: Use a multilingual model!")
    print("   2. 📉 Lower the similarity threshold to 0.3")
    print("   3. 🧪 Test: paraphrase-multilingual-MiniLM-L12-v2")
else:
    print("   1. 📉 Adjust the similarity threshold")
    print("   2. 🔧 Implement query enhancement")

## Next Steps

**Based on the results above, our next notebook will test:**

1. **Multilingual Models**: `paraphrase-multilingual-MiniLM-L12-v2`
2. **Optimized Thresholds**: Find the sweet spot for your use case
3. **Query Enhancement**: Add synonyms and translations
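
To make the third item concrete, here is a rough sketch of query enhancement by term substitution: the original query is kept, and variants are added in which known terms are replaced by translations or synonyms. Everything in it is an assumption for illustration, including the `GLOSSARY` contents and the `expand_query` name; the original notebook does not specify how enhancement would be implemented.

```python
# Hypothetical glossary; a real system would need a curated or generated term list.
GLOSSARY = {
    "vektorsuche": ["vector search"],
    "schwellenwert": ["threshold"],
}

def expand_query(query: str, glossary: dict[str, list[str]]) -> list[str]:
    """Return the original query plus lowercase variants with glossary terms replaced."""
    variants = [query]
    lowered = query.lower()
    for term, alternatives in glossary.items():
        if term in lowered:
            for alt in alternatives:
                variants.append(lowered.replace(term, alt))
    return variants

print(expand_query("Wie funktioniert Vektorsuche?", GLOSSARY))
```

Each variant could then be embedded separately and the per-document scores merged, for instance by taking the maximum. Whether this helps more than switching to a multilingual model is exactly what the follow-up tests would have to show.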

---

🎯 **This notebook shows the EXACT problem** - no complex RAG pipeline needed!