# Vietnamese Text Search with FAISS
# Vietnamese Text Search with Google Embeddings

This notebook demonstrates:
- **Real Vietnamese text dataset** (1000+ chunks)
- **Google Embedding API** to create embeddings
- **Comparison of FAISS methods**: Flat, IVF, HNSW
- **Semantic search** with Vietnamese queries
- **Performance analysis** for production

## Prerequisites

### 1. Install dependencies
```bash
pip install google-generativeai python-dotenv faiss-cpu numpy pandas matplotlib seaborn
```

### 2. Setup Google API Key
1. Get an API key from: https://makersuite.google.com/app/apikey
2. Create a `.env` file in the root directory:
```
GOOGLE_API_KEY=your_api_key_here
```
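
A missing or placeholder key otherwise surfaces only when the embedding script makes its first API call. The following is a minimal sketch of an early check, assuming the key is read from the process environment after `load_dotenv()` has run. `require_api_key` is a hypothetical helper for illustration, not part of this repository.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from `env`, or raise with a pointer to the .env file."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # The placeholder from the .env template above counts as "not set".
    if not key or key == "your_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return key
```

Calling `require_api_key()` at the top of a script turns a confusing API error into an immediate, readable failure.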

### 3. Generate the dataset and embeddings
```bash
# Generate Vietnamese text
python data/vietnamese_dataset_generator.py

# Create embeddings (requires API key)
python data/embed_vietnamese_text.py
```

In [1]:
import numpy as np
import faiss
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import time
import os
import sys
from typing import List, Tuple
from dotenv import load_dotenv

# Add parent to path
sys.path.append(os.path.dirname(os.path.abspath('')))

from data.embed_vietnamese_text import embed_query
from utils.benchmark import benchmark_index, print_index_info

# Load .env
load_dotenv()

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Imports complete!")

✓ Imports complete!


## 1. Load Dataset and Embeddings

In [2]:
# Load embeddings
data_dir = '../data'
embeddings_file = os.path.join(data_dir, 'vn.npy')
texts_file = os.path.join(data_dir, 'vietnamese_embeddings_texts.txt')

if not os.path.exists(embeddings_file):
    print("❌ Embeddings not found!")
    print("Please run: python data/embed_vietnamese_text.py")
    print("Make sure you have GOOGLE_API_KEY in .env file")
    raise FileNotFoundError(embeddings_file)

print("Loading embeddings...")
embeddings = np.load(embeddings_file)

print("Loading texts...")
with open(texts_file, 'r', encoding='utf-8') as f:
    texts = [line.strip() for line in f]

print(f"\n✓ Loaded:")
print(f"  Embeddings shape: {embeddings.shape}")
print(f"  Number of texts: {len(texts)}")
print(f"  Embedding dimension: {embeddings.shape[1]}")
print(f"  Memory size: {embeddings.nbytes / (1024**2):.2f} MB")

# Show samples
print(f"\nSample texts:")
for i in range(min(5, len(texts))):
    print(f"  [{i+1}] {texts[i][:100]}...")

Loading embeddings...
Loading texts...


FileNotFoundError: [Errno 2] No such file or directory: '../data/vietnamese_embeddings_texts.txt'

## 2. Build FAISS Indexes

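We will build three index types to compare:
1. **Flat** - exhaustive exact search; 100% recall, used as the baseline
2. **IVF** - approximate and fast; clusters the vectors and scans only `nprobe` of the `nlist` clusters per query
3. **HNSW** - approximate graph-based search; typically high recall at low latency, at the cost of extra memory for the graph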
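
The Flat baseline is conceptually simple: score every stored vector against the query and keep the top k. The sketch below shows that idea in plain Python on toy 2-D vectors, using the fact that the inner product of two unit-length vectors equals their cosine similarity. It is an illustration of the principle only, not how FAISS implements it, and `flat_search` and `l2_normalize` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def flat_search(query: Sequence[float],
                vectors: Sequence[Sequence[float]],
                k: int) -> List[Tuple[int, float]]:
    """Exhaustive search: score every vector, return the k best (id, score) pairs."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = l2_normalize(vec)
        # Inner product of two unit vectors is their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors, purely illustrative.
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_search([2.0, 0.1], docs, k=2)
print(hits)
```

Because every vector is scored, the result is exact, which is why Flat serves as the reference when judging the approximate indexes.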

In [None]:
dimension = embeddings.shape[1]
n_texts = len(embeddings)

print(f"Building FAISS indexes for {n_texts} texts...\n")

indexes = {}
build_times = {}

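# Normalize embeddings for cosine similarity
# (faiss.normalize_L2 works in place and expects a C-contiguous float32 array)
print("Normalizing embeddings for cosine similarity...")
embeddings = np.ascontiguousarray(embeddings, dtype='float32')
faiss.normalize_L2(embeddings)
print("✓ Normalized\n")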

# 1. Flat Index (Baseline)
print("[1/3] Building Flat Index...")
start = time.time()
index_flat = faiss.IndexFlatIP(dimension)  # Inner Product for cosine
index_flat.add(embeddings)
build_times['Flat'] = time.time() - start
indexes['Flat'] = index_flat
print(f"  ✓ Build time: {build_times['Flat']:.3f}s\n")

# 2. IVF Index
print("[2/3] Building IVF Index...")
nlist = int(np.sqrt(n_texts))  # Rule of thumb
start = time.time()
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(embeddings)
index_ivf.add(embeddings)
index_ivf.nprobe = 10
build_times['IVF'] = time.time() - start
indexes['IVF'] = index_ivf
print(f"  ✓ Build time: {build_times['IVF']:.3f}s")
print(f"  ✓ nlist={nlist}, nprobe=10\n")

# 3. HNSW Index
print("[3/3] Building HNSW Index...")
M = 32
start = time.time()
index_hnsw = faiss.IndexHNSWFlat(dimension, M, faiss.METRIC_INNER_PRODUCT)
index_hnsw.hnsw.efConstruction = 40
index_hnsw.add(embeddings)
index_hnsw.hnsw.efSearch = 32
build_times['HNSW'] = time.time() - start
indexes['HNSW'] = index_hnsw
print(f"  ✓ Build time: {build_times['HNSW']:.3f}s")
print(f"  ✓ M={M}, efSearch=32\n")

print("="*70)
print("All indexes built successfully!")
print("="*70)
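
With `nlist` clusters and `nprobe` of them visited per query, an IVF query compares against roughly `nprobe / nlist` of the stored vectors. The following back-of-the-envelope sketch assumes evenly sized clusters (real k-means clusters are uneven) and an illustrative dataset size of 1,000 vectors; `ivf_scan_fraction` is a hypothetical helper.

```python
import math


def ivf_scan_fraction(n_vectors: int, nprobe: int) -> dict:
    """Estimate how much of the dataset one IVF query visits."""
    nlist = max(1, math.isqrt(n_vectors))  # sqrt(n) rule of thumb, as in the cell above
    nprobe = min(nprobe, nlist)            # probing more lists than exist has no effect
    return {
        "nlist": nlist,
        "avg_list_size": n_vectors / nlist,
        "scan_fraction": nprobe / nlist,   # assumes equally sized lists
    }


stats = ivf_scan_fraction(n_vectors=1000, nprobe=10)
print(stats)
```

For 1,000 vectors this gives `nlist = 31`, so `nprobe = 10` scans about a third of the data. At this scale the speedup over Flat is therefore limited, and the benefit of IVF grows as the dataset does.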

## 3. Semantic Search Demo

Test with Vietnamese queries.

In [None]:
def search_vietnamese(query: str, index_name: str = 'HNSW', k: int = 5) -> pd.DataFrame:
    """
    Run a semantic search for a Vietnamese query.
    
    Args:
        query: Query text in Vietnamese
        index_name: Index to search ('Flat', 'IVF', 'HNSW')
        k: Number of results to return
    
    Returns:
        DataFrame with results
    """
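    # Embed query
    print(f"Embedding query: '{query}'...")
    # Coerce to a (1, dim) float32 array, which faiss.normalize_L2 and search expect
    query_emb = np.ascontiguousarray(embed_query(query), dtype='float32').reshape(1, -1)
    faiss.normalize_L2(query_emb)  # Normalize for cosine similarity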
    
    # Search
    index = indexes[index_name]
    start = time.time()
    distances, indices = index.search(query_emb, k)
    search_time = time.time() - start
    
    # Prepare results
    results = []
    for idx, dist in zip(indices[0], distances[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k neighbors are found
            continue
        results.append({
            'rank': len(results) + 1,
            'text': texts[idx],
            'similarity': float(dist),  # Cosine similarity (1 = perfect match)
            'index': int(idx)
        })
    
    df = pd.DataFrame(results, columns=['rank', 'text', 'similarity', 'index'])
    
    print(f"✓ Found {len(df)} results in {search_time*1000:.2f}ms using {index_name}\n")
    
    return df

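# Test queries (kept in Vietnamese; English glosses in comments)
test_queries = [
    "trí tuệ nhân tạo và machine learning",  # artificial intelligence and machine learning
    "sức khỏe và tập thể dục",               # health and exercise
    "kinh doanh và khởi nghiệp",             # business and startups
    "giáo dục trực tuyến",                   # online education
    "bảo vệ môi trường",                     # environmental protection
]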

In [None]:
# Demo search with first query
query = test_queries[0]

print("="*70)
print(f"Query: '{query}'")
print("="*70)

results = search_vietnamese(query, index_name='HNSW', k=10)

print("Top 10 Results:")
print("="*70)
for _, row in results.iterrows():
    print(f"\n[{row['rank']}] Similarity: {row['similarity']:.4f}")
    print(f"    {row['text']}")

print("\n" + "="*70)

## 4. Comparing Index Methods

In [None]:
# Compare all index methods with same query
print("Comparing index methods...\n")

comparison_results = {}

for index_name in ['Flat', 'IVF', 'HNSW']:
    print(f"Testing {index_name}...")
    results = search_vietnamese(query, index_name=index_name, k=10)
    comparison_results[index_name] = results

print("\n" + "="*70)
print("COMPARISON SUMMARY")
print("="*70)
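
Similarity scores alone do not show whether IVF and HNSW return the same documents as the exact Flat search. One common way to quantify that is recall@k: the fraction of the exact top-k that the approximate index also returns. This is a minimal sketch with made-up ID lists; `recall_at_k` is a hypothetical helper, not something the notebook defines.

```python
from typing import Sequence


def recall_at_k(exact_ids: Sequence[int], approx_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    exact = {i for i in exact_ids if i >= 0}      # FAISS pads missing results with -1
    if not exact:
        return 0.0
    found = exact.intersection(i for i in approx_ids if i >= 0)
    return len(found) / len(exact)


# Illustrative ID lists, not real notebook output.
flat_ids = [12, 7, 33, 2, 91]
ivf_ids = [12, 33, 7, 40, -1]
print(f"recall@5 = {recall_at_k(flat_ids, ivf_ids):.2f}")
```

In this notebook the inputs would be the `index` columns of the result frames, for example `comparison_results['Flat']['index']` and `comparison_results['IVF']['index']`.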

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, (index_name, results) in zip(axes, comparison_results.items()):
    # Plot similarity scores
    ranks = results['rank'].values
    similarities = results['similarity'].values
    
    bars = ax.barh(ranks, similarities, alpha=0.7, edgecolor='black', linewidth=1.5)
    
    # Color code by similarity
    colors = plt.cm.RdYlGn(similarities)
    for bar, color in zip(bars, colors):
        bar.set_color(color)
    
    ax.set_xlabel('Similarity Score', fontsize=11)
    ax.set_ylabel('Rank', fontsize=11)
    ax.set_title(f'{index_name} Index', fontsize=13, fontweight='bold')
    ax.invert_yaxis()
    ax.set_xlim([0, 1])
    ax.grid(True, alpha=0.3, axis='x')
    
    # Add value labels
    for i, (rank, sim) in enumerate(zip(ranks, similarities)):
        ax.text(sim + 0.02, rank, f'{sim:.3f}', 
                va='center', fontsize=9, fontweight='bold')

plt.suptitle(f'Search Results Comparison\nQuery: "{query}"', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('vietnamese_search_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: vietnamese_search_comparison.png")

## 5. Performance Benchmarking

In [None]:
# Benchmark with multiple queries
print("Benchmarking with multiple queries...\n")

# Embed all test queries
print("Embedding test queries...")
test_embeddings = []
for q in test_queries:
    emb = embed_query(q)
    faiss.normalize_L2(emb)
    test_embeddings.append(emb[0])
test_embeddings = np.array(test_embeddings, dtype='float32')
print(f"✓ Embedded {len(test_queries)} queries\n")

# Benchmark each index
benchmark_results = {}

for name, index in indexes.items():
    print(f"Benchmarking {name}...")
    
    # Warmup
    index.search(test_embeddings[:2], 10)
    
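    # Measure (perf_counter is monotonic and has higher resolution than time.time)
    times = []
    for _ in range(10):  # 10 iterations
        start = time.perf_counter()
        index.search(test_embeddings, 10)
        times.append(time.perf_counter() - start)
    
    # Each sample is the time for one batch of queries, so the per-query
    # figures below are batch averages, not individual query latencies.
    avg_time = np.mean(times)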
    p95_time = np.percentile(times, 95)
    qps = len(test_queries) / avg_time
    
    benchmark_results[name] = {
        'avg_latency_ms': avg_time / len(test_queries) * 1000,
        'p95_latency_ms': p95_time / len(test_queries) * 1000,
        'qps': qps,
        'build_time_s': build_times[name]
    }
    
    print(f"  Avg latency: {benchmark_results[name]['avg_latency_ms']:.2f}ms")
    print(f"  QPS: {qps:.1f}\n")

# Create comparison table
df_bench = pd.DataFrame(benchmark_results).T
print("\n" + "="*70)
print("PERFORMANCE COMPARISON")
print("="*70)
print(df_bench.to_string())
print("="*70)
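
The loop above times whole batches, so its P95 is a percentile over ten batch runs rather than over individual queries. If tail latency per query matters, one option is to time each call separately and take a nearest-rank percentile over all samples. The sketch below does that with a stand-in workload; `time_each_call` and `percentile` are hypothetical helpers, and in the notebook `fn` would wrap `index.search` on a single query row.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def time_each_call(fn: Callable[[object], object],
                   inputs: Sequence[object],
                   repeats: int = 10) -> List[float]:
    """Return one latency sample (in ms) per call of fn over inputs, repeated."""
    latencies = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


# Stand-in workload, purely illustrative.
samples = time_each_call(lambda n: sum(range(n)), [10_000] * 5, repeats=10)
print(f"p50={percentile(samples, 50):.3f}ms  p95={percentile(samples, 95):.3f}ms")
```

With only a handful of queries the tail estimate is still noisy, so more distinct queries give a more trustworthy P95.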

In [None]:
# Visualize performance
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

index_names = list(benchmark_results.keys())
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

# Plot 1: Average Latency
ax = axes[0, 0]
latencies = [benchmark_results[name]['avg_latency_ms'] for name in index_names]
bars = ax.barh(index_names, latencies, color=colors, edgecolor='black', linewidth=2)
ax.set_xlabel('Avg Latency (ms)', fontsize=12)
ax.set_title('Average Search Latency', fontsize=14, fontweight='bold')
for i, (bar, val) in enumerate(zip(bars, latencies)):
    ax.text(val + max(latencies)*0.02, i, f'{val:.2f}ms', 
            va='center', fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Plot 2: QPS
ax = axes[0, 1]
qps_vals = [benchmark_results[name]['qps'] for name in index_names]
bars = ax.barh(index_names, qps_vals, color=colors, edgecolor='black', linewidth=2)
ax.set_xlabel('Queries Per Second', fontsize=12)
ax.set_title('Throughput (QPS)', fontsize=14, fontweight='bold')
for i, (bar, val) in enumerate(zip(bars, qps_vals)):
    ax.text(val + max(qps_vals)*0.02, i, f'{val:.1f}', 
            va='center', fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Plot 3: Build Time
ax = axes[1, 0]
build_times_vals = [benchmark_results[name]['build_time_s'] for name in index_names]
bars = ax.barh(index_names, build_times_vals, color=colors, edgecolor='black', linewidth=2)
ax.set_xlabel('Build Time (seconds)', fontsize=12)
ax.set_title('Index Build Time', fontsize=14, fontweight='bold')
for i, (bar, val) in enumerate(zip(bars, build_times_vals)):
    ax.text(val + max(build_times_vals)*0.02, i, f'{val:.3f}s', 
            va='center', fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Plot 4: Latency vs QPS scatter
ax = axes[1, 1]
for i, name in enumerate(index_names):
    lat = benchmark_results[name]['avg_latency_ms']
    qps = benchmark_results[name]['qps']
    ax.scatter(lat, qps, s=500, c=[colors[i]], 
               edgecolors='black', linewidth=2, zorder=5)
    ax.annotate(name, (lat, qps), xytext=(10, 10),
                textcoords='offset points', fontsize=11, fontweight='bold')
ax.set_xlabel('Latency (ms)', fontsize=12)
ax.set_ylabel('QPS', fontsize=12)
ax.set_title('Latency vs Throughput', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.suptitle('Vietnamese Text Search Performance', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig('vietnamese_performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: vietnamese_performance_comparison.png")

## 6. Interactive Search

In [None]:
# Helper to test a custom query
def interactive_search(query_text: str, k: int = 5):
    """
    Search with a custom query and show results from all indexes
    """
    print("="*80)
    print(f"Query: '{query_text}'")
    print("="*80)
    
    for index_name in ['Flat', 'IVF', 'HNSW']:
        print(f"\n{index_name} Results:")
        print("-"*80)
        
        results = search_vietnamese(query_text, index_name=index_name, k=k)
        
        for _, row in results.head(3).iterrows():  # Show top 3
            print(f"[{row['rank']}] Similarity: {row['similarity']:.4f}")
            print(f"    {row['text'][:150]}...")
    
    print("\n" + "="*80)

# Test with different queries
print("Testing with different queries:\n")

for q in test_queries[1:3]:  # Test 2 queries
    interactive_search(q, k=5)
    print("\n")

## 7. Summary and Recommendations

In [None]:
print("="*80)
print("VIETNAMESE TEXT SEARCH - SUMMARY")
print("="*80)

print(f"\n📊 Dataset:")
print(f"  Texts: {len(texts):,}")
print(f"  Embedding model: Google text-embedding-004")
print(f"  Dimension: {dimension}")
print(f"  Memory: {embeddings.nbytes / (1024**2):.2f} MB")

print(f"\n⚡ Performance Comparison:")
print(f"\n  {'Index':<10} {'Latency':<15} {'QPS':<15} {'Build Time':<15}")
print(f"  {'-'*55}")
for name in index_names:
    print(f"  {name:<10} "
          f"{benchmark_results[name]['avg_latency_ms']:<15.2f} "
          f"{benchmark_results[name]['qps']:<15.1f} "
          f"{benchmark_results[name]['build_time_s']:<15.3f}")

# Find best
best_latency = min(benchmark_results.items(), key=lambda x: x[1]['avg_latency_ms'])
best_qps = max(benchmark_results.items(), key=lambda x: x[1]['qps'])

print(f"\n🏆 Winners:")
print(f"  Lowest latency: {best_latency[0]} ({best_latency[1]['avg_latency_ms']:.2f}ms)")
print(f"  Highest QPS: {best_qps[0]} ({best_qps[1]['qps']:.1f} QPS)")

print(f"\n💡 Recommendations:")
print(f"\n  For Vietnamese text search ({len(texts)} texts):")
if len(texts) < 10000:
    print(f"    ✓ Use HNSW for best quality (high recall)")
    print(f"    ✓ Use Flat if you need 100% accuracy")
    print(f"    ✓ IVF is a good balance for production")
else:
    print(f"    ✓ Use IVF or IVF+PQ for large scale")
    print(f"    ✓ Consider sharding if >1M texts")

print(f"\n  Production deployment:")
print(f"    • Cache embeddings for frequently queried texts")
print(f"    • Batch queries when possible")
print(f"    • Keep P95 latency below 100ms")
print(f"    • Consider GPU if dataset > 1M")

print(f"\n  Accuracy tips:")
print(f"    • Use same embedding model for queries and documents")
print(f"    • Normalize embeddings for cosine similarity")
print(f"    • Consider reranking top-K results")
print(f"    • A/B test different index types")

print(f"\n" + "="*80)
print("✅ Analysis Complete!")
print("="*80)
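
The recommendation to cache embeddings for frequent queries can be prototyped with the standard library before reaching for an external cache. This is a minimal sketch that memoizes by exact query string; `fake_embed` is a hypothetical stand-in for a remote call such as `embed_query`, and the cache size of 1024 is an arbitrary assumption.

```python
from functools import lru_cache
from typing import Tuple

call_counter = {"n": 0}  # counts how often the (stand-in) embedding backend is hit


def fake_embed(text: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for a remote embedding call such as embed_query."""
    call_counter["n"] += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=1024)
def cached_embed(text: str) -> Tuple[float, ...]:
    """Memoize embeddings by exact query string; tuples keep cached values immutable."""
    return fake_embed(text)


cached_embed("giáo dục trực tuyến")
cached_embed("giáo dục trực tuyến")   # served from the cache, no second backend call
print(call_counter["n"], cached_embed.cache_info())
```

Returning a tuple rather than a NumPy array matters here: FAISS normalizes arrays in place, and a mutable cached array could be silently altered between calls.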

## Next Steps

1. **Scale Up**: Try a larger dataset (10K-100K texts)
2. **Fine-tune**: Optimize index parameters (nprobe, efSearch)
3. **Hybrid Search**: Combine with keyword search
4. **Filtering**: Add metadata filtering (category, date, etc.)
5. **Production**: Build an API service with a caching layer
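
For the hybrid search step, one widely used way to merge a vector ranking with a keyword ranking is Reciprocal Rank Fusion (RRF), which needs only the ranked ID lists and not comparable scores. The sketch below is one possible approach, not something this notebook implements. The ID lists are made up, and `c = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[int]], c: int = 60) -> List[int]:
    """Merge ranked ID lists; each list adds 1 / (c + rank) to a document's score."""
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Higher fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Illustrative ID lists: one from vector search, one from a keyword engine.
vector_hits = [4, 1, 7, 3]
keyword_hits = [1, 9, 4]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists (here IDs 1 and 4) rise to the top, which is the behavior a hybrid setup is usually after.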

**Happy searching! 🚀**