# Semantic Similarity Exploration

This notebook explores how YT Study Buddy uses semantic similarity for auto-categorization and improved cross-referencing.

## Learning Goals
- Understand sentence transformers and embedding models
- Experiment with semantic similarity calculations
- Compare different similarity metrics
- Find optimal thresholds for YouTube content categorization

In [None]:
# Install required packages if not already installed
# !pip install sentence-transformers scikit-learn matplotlib numpy seaborn scipy

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.manifold import TSNE
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

## 1. Loading and Comparing Different Models

Let's explore different sentence transformer models and their characteristics:

In [None]:
# Load different models for comparison
models = {
    'MiniLM': 'all-MiniLM-L6-v2',  # Fast, good quality (default in YT Study Buddy)
    'MPNet': 'all-mpnet-base-v2',   # Higher quality, slower
    'DistilRoBERTa': 'all-distilroberta-v1'  # Alternative architecture (distilled RoBERTa)
}

loaded_models = {}
for name, model_name in models.items():
    print(f"Loading {name} ({model_name})...")
    loaded_models[name] = SentenceTransformer(model_name)
    print(f"  Embedding dimension: {loaded_models[name].get_sentence_embedding_dimension()}")

print("\nAll models loaded!")

## 2. Sample YouTube Video Content

Let's create sample content that represents different subjects YT Study Buddy might encounter:

In [None]:
# Sample video content from different subjects
sample_content = {
    'Machine Learning': [
        "Understanding neural networks and deep learning algorithms",
        "Training models with backpropagation and gradient descent",
        "Transformer architecture and attention mechanisms",
        "Computer vision with convolutional neural networks"
    ],
    'Web Development': [
        "Building React applications with modern JavaScript",
        "Creating RESTful APIs with Node.js and Express",
        "CSS Grid and Flexbox for responsive design",
        "Database design with PostgreSQL and MongoDB"
    ],
    'Data Science': [
        "Exploratory data analysis with pandas and matplotlib",
        "Statistical modeling and hypothesis testing",
        "Data visualization techniques and best practices",
        "Time series forecasting and trend analysis"
    ],
    'Physics': [
        "Quantum mechanics and wave-particle duality",
        "Thermodynamics and statistical mechanics",
        "Electromagnetic theory and Maxwell's equations",
        "Relativity theory and spacetime curvature"
    ]
}

# Flatten for easier processing
all_content = []
content_labels = []
for subject, descriptions in sample_content.items():
    all_content.extend(descriptions)
    content_labels.extend([subject] * len(descriptions))

print(f"Sample content: {len(all_content)} descriptions across {len(sample_content)} subjects")
for subject, descriptions in sample_content.items():
    print(f"  {subject}: {len(descriptions)} examples")

## 3. Encoding Content and Calculating Similarities

Let's encode our sample content using different models and compare the results:

In [None]:
# Encode content with each model
embeddings = {}
for model_name, model in loaded_models.items():
    print(f"Encoding content with {model_name}...")
    embeddings[model_name] = model.encode(all_content)
    print(f"  Shape: {embeddings[model_name].shape}")

print("\nEncoding complete!")

In [None]:
# Calculate similarity matrices for each model
similarities = {}
for model_name, emb in embeddings.items():
    similarities[model_name] = cosine_similarity(emb)

# Visualize similarity matrices
fig, axes = plt.subplots(1, len(loaded_models), figsize=(15, 4))
if len(loaded_models) == 1:
    axes = [axes]

for i, (model_name, sim_matrix) in enumerate(similarities.items()):
    im = axes[i].imshow(sim_matrix, cmap='viridis', vmin=0, vmax=1)
    axes[i].set_title(f'{model_name} Similarity Matrix')
    axes[i].set_xlabel('Content Index')
    axes[i].set_ylabel('Content Index')
    
plt.colorbar(im, ax=axes, shrink=0.6)
plt.tight_layout()
plt.show()

print("With the viridis colormap, brighter (yellow) areas indicate higher similarity between content pieces.")
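
The heatmaps give a visual impression only. A numeric summary can make the models easier to compare: the mean similarity of same-subject pairs against the mean of different-subject pairs. The sketch below is one way to compute it; the function name `within_between_similarity` and the toy values are assumptions for illustration, not part of YT Study Buddy.

```python
import numpy as np

def within_between_similarity(sim_matrix, labels):
    """Return (mean same-label similarity, mean different-label similarity)."""
    sim_matrix = np.asarray(sim_matrix, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Use only the upper triangle so each pair counts once and the
    # diagonal (self-similarity) is excluded.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    within = sim_matrix[same & upper].mean()
    between = sim_matrix[~same & upper].mean()
    return within, between

# Toy matrix standing in for a real similarity matrix (values are made up).
toy_sim = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])
toy_labels = ["A", "A", "B", "B"]

within, between = within_between_similarity(toy_sim, toy_labels)
print(f"within={within:.2f}, between={between:.2f}")
```

In this notebook it could be called as `within_between_similarity(similarities['MiniLM'], content_labels)`. A larger gap between the two means suggests the model separates the subjects more cleanly.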

## 4. Threshold Analysis

Let's find optimal similarity thresholds for categorization:

In [None]:
def analyze_thresholds(similarity_matrix, labels, thresholds=None):
    """Measure pairwise precision at each similarity threshold.

    Precision = same-subject pairs above the threshold / all pairs above it.
    """
    if thresholds is None:
        thresholds = np.arange(0.1, 1.0, 0.05)
    results = []
    
    for threshold in thresholds:
        correct_matches = 0
        total_matches = 0
        
        # Visit each unordered pair of content pieces once
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                sim_score = similarity_matrix[i, j]
                
                if sim_score > threshold:
                    total_matches += 1
                    if labels[i] == labels[j]:  # Same subject
                        correct_matches += 1
        
        # Precision is undefined (NaN) when no pair exceeds the threshold
        precision = correct_matches / total_matches if total_matches > 0 else float('nan')
        results.append((threshold, precision, total_matches))
    
    return results

# Analyze thresholds for each model
threshold_results = {}
for model_name, sim_matrix in similarities.items():
    threshold_results[model_name] = analyze_thresholds(sim_matrix, content_labels)

# Plot threshold analysis
plt.figure(figsize=(12, 8))

for model_name, results in threshold_results.items():
    thresholds, precisions, match_counts = zip(*results)
    plt.subplot(2, 1, 1)
    plt.plot(thresholds, precisions, marker='o', label=f'{model_name} Precision')
    
    plt.subplot(2, 1, 2)
    plt.plot(thresholds, match_counts, marker='s', label=f'{model_name} Total Matches')

plt.subplot(2, 1, 1)
plt.xlabel('Similarity Threshold')
plt.ylabel('Precision (Correct/Total)')
plt.title('Categorization Precision vs Similarity Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(2, 1, 2)
plt.xlabel('Similarity Threshold')
plt.ylabel('Number of Matches')
plt.title('Total Matches vs Similarity Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
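
Precision alone tends to rise as the threshold increases, because fewer pairs are matched, so it cannot identify a best threshold by itself. A common complement is recall (the share of true same-subject pairs that were matched) and the F1 score that balances the two. The sketch below is an illustration under that assumption; `pair_metrics` is a hypothetical helper and the toy matrix is made up.

```python
import numpy as np

def pair_metrics(sim_matrix, labels, threshold):
    """Precision, recall, and F1 for 'same subject' predictions over all pairs."""
    sim_matrix = np.asarray(sim_matrix, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)        # each unordered pair once
    predicted = sim_matrix[iu] > threshold        # pairs the threshold links
    actual = labels[iu[0]] == labels[iu[1]]       # pairs that truly share a subject
    tp = np.sum(predicted & actual)
    precision = tp / predicted.sum() if predicted.any() else float("nan")
    recall = tp / actual.sum() if actual.any() else float("nan")
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(precision), float(recall), float(f1)

# Toy similarity matrix (made-up values) with two subjects.
toy_sim = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])
toy_labels = ["A", "A", "B", "B"]

for t in (0.25, 0.5, 0.75):
    p, r, f1 = pair_metrics(toy_sim, toy_labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

With the notebook's data, the same function could be applied to `similarities[model_name]` and `content_labels`, picking the threshold with the highest F1 rather than the highest precision.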

## 5. Visualization with t-SNE

Let's visualize how well the embeddings cluster similar content:

In [None]:
# Create t-SNE visualization for each model
fig, axes = plt.subplots(1, len(loaded_models), figsize=(5 * len(loaded_models), 4))
if len(loaded_models) == 1:
    axes = [axes]

colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown']
subject_to_color = {subject: colors[i] for i, subject in enumerate(sample_content.keys())}

for i, (model_name, emb) in enumerate(embeddings.items()):
    # Apply t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(all_content)-1))
    embedded = tsne.fit_transform(emb)
    
    # Plot each subject with different colors
    for subject in sample_content.keys():
        mask = [label == subject for label in content_labels]
        axes[i].scatter(embedded[mask, 0], embedded[mask, 1], 
                       c=subject_to_color[subject], label=subject, alpha=0.7, s=60)
    
    axes[i].set_title(f'{model_name} t-SNE Visualization')
    axes[i].set_xlabel('t-SNE Dimension 1')
    axes[i].set_ylabel('t-SNE Dimension 2')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Good clustering means similar subjects are grouped together visually.")
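
t-SNE plots can distort distances, especially with only a handful of points, so a check in the original embedding space is a useful complement. One simple option, sketched below, is the fraction of items whose nearest neighbor (by cosine similarity) has the same subject. The helper name `nearest_neighbor_agreement` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of items whose most similar other item shares their label."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T                  # cosine similarity of unit vectors
    np.fill_diagonal(sims, -np.inf)       # an item may not be its own neighbor
    nearest = sims.argmax(axis=1)
    return float(np.mean(labels[nearest] == labels))

# Toy 2-D "embeddings": two items pointing along each axis.
toy_emb = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
toy_labels = ["A", "A", "B", "B"]

score = nearest_neighbor_agreement(toy_emb, toy_labels)
print(f"nearest-neighbor agreement: {score:.2f}")
```

For the notebook's data, `nearest_neighbor_agreement(embeddings['MiniLM'], content_labels)` would give one number per model to set beside the plots.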

## 6. Real-World Testing

Let's test with actual YouTube video titles and see how well our categorization works:

In [None]:
# Real YouTube video titles for testing
real_videos = [
    "Building a Neural Network from Scratch in Python",
    "React Hooks Tutorial - useState and useEffect",
    "Introduction to Quantum Computing",
    "Data Analysis with Pandas - Complete Guide",
    "Understanding Transformers in NLP",
    "CSS Grid Layout Tutorial",
    "Statistical Significance and P-Values",
    "Einstein's Theory of General Relativity",
    "Machine Learning Model Deployment",
    "JavaScript Async/Await Explained"
]

# Expected categories
expected_categories = [
    "Machine Learning", "Web Development", "Physics", "Data Science",
    "Machine Learning", "Web Development", "Data Science", "Physics",
    "Machine Learning", "Web Development"
]

# Test auto-categorization
def test_categorization(video_titles, existing_subjects, model, threshold=0.6, expected=None):
    """Categorize new video titles by similarity to existing subject names.

    `expected` is an optional list of true categories, one per title.
    """
    subject_names = list(existing_subjects.keys())

    # Encode existing subject names and new titles
    subject_embeddings = model.encode(subject_names)
    title_embeddings = model.encode(video_titles)

    # Similarity of every title to every subject name
    sims = cosine_similarity(title_embeddings, subject_embeddings)

    results = []
    for i, title in enumerate(video_titles):
        best_match_idx = np.argmax(sims[i])
        best_score = sims[i][best_match_idx]

        if best_score > threshold:
            predicted_subject = subject_names[best_match_idx]
        else:
            predicted_subject = "New Subject"

        results.append({
            'title': title,
            'predicted': predicted_subject,
            'confidence': best_score,
            # Avoid reading the global list, which only matches real_videos
            'expected': expected[i] if expected is not None else None
        })

    return results

# Test with MiniLM model (default in YT Study Buddy)
model = loaded_models['MiniLM']
categorization_results = test_categorization(
    real_videos, sample_content, model, expected=expected_categories
)

# Display results
print("Auto-Categorization Results:")
print("=" * 80)
correct = 0

for result in categorization_results:
    is_correct = result['predicted'] == result['expected']
    if is_correct:
        correct += 1
    
    status = "✓" if is_correct else "✗"
    print(f"{status} {result['title'][:50]:<50} | {result['predicted']:<15} | {result['confidence']:.3f}")

accuracy = correct / len(categorization_results)
print("=" * 80)
print(f"Accuracy: {accuracy:.1%} ({correct}/{len(categorization_results)})")
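
The test above compares each title with the subject *name* only. An alternative worth trying is to compare titles with the centroid (mean embedding) of each subject's example descriptions, which gives the model more context per subject. The sketch below works on precomputed embeddings so it runs without a model; `categorize_by_centroid` is a hypothetical helper, and the 3-D toy vectors are made up.

```python
import numpy as np

def categorize_by_centroid(title_embeddings, examples_by_subject, threshold=0.6):
    """Assign each title to the subject whose example centroid is most similar."""
    subjects = list(examples_by_subject)
    centroids = np.stack([
        np.asarray(examples_by_subject[s], dtype=float).mean(axis=0)
        for s in subjects
    ])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    titles = np.asarray(title_embeddings, dtype=float)
    titles = titles / np.linalg.norm(titles, axis=1, keepdims=True)

    sims = titles @ centroids.T           # cosine similarity, titles x subjects
    best = sims.argmax(axis=1)
    results = []
    for i, j in enumerate(best):
        score = float(sims[i, j])
        label = subjects[j] if score > threshold else "New Subject"
        results.append((label, score))
    return results

# Toy example embeddings for two subjects and three titles.
toy_examples = {
    "A": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "B": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
toy_titles = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]

centroid_results = categorize_by_centroid(toy_titles, toy_examples)
for label, score in centroid_results:
    print(f"{label:<12} {score:.3f}")
```

In the notebook, the inputs could be built with `{s: model.encode(d) for s, d in sample_content.items()}` and `model.encode(real_videos)`. Whether this beats name-only matching on real titles is something to measure, not assume.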

## 7. Performance Comparison

Let's compare the performance and speed of different models:

In [None]:
import time

# Performance testing
test_texts = real_videos * 10  # 100 texts for speed testing

performance_results = {}
for model_name, model in loaded_models.items():
    # Time the encoding with a monotonic, high-resolution clock
    start_time = time.perf_counter()
    embeddings_test = model.encode(test_texts)
    end_time = time.perf_counter()
    
    encoding_time = end_time - start_time
    texts_per_second = len(test_texts) / encoding_time
    
    performance_results[model_name] = {
        'encoding_time': encoding_time,
        'texts_per_second': texts_per_second,
        'embedding_dim': model.get_sentence_embedding_dimension()
    }

# Display performance comparison
print("Model Performance Comparison:")
print("=" * 60)
print(f"{'Model':<15} {'Time (s)':<10} {'Texts/sec':<10} {'Embed Dim':<10}")
print("=" * 60)

for model_name, perf in performance_results.items():
    print(f"{model_name:<15} {perf['encoding_time']:<10.2f} {perf['texts_per_second']:<10.1f} {perf['embedding_dim']:<10}")

print("\n💡 Key Insights:")
print("- MiniLM is the smallest model here and is typically the fastest; check the table above")
print("- MPNet typically provides the highest quality but is slower")
print("- Larger embeddings (768 vs. 384 dims) can encode more detail but cost more memory and compute")
print("- Choose based on your speed vs. accuracy requirements")
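
A single timed run can be noisy, and the first call to a model may include one-time setup cost. A slightly more careful measurement, sketched below, does one warm-up call and then keeps the best of several runs. `time_encoder` is a hypothetical helper, and `fake_encode` is a stand-in so the sketch runs without downloading a model.

```python
import time

def time_encoder(encode_fn, texts, repeats=3):
    """Best wall-clock time in seconds over `repeats` runs, after one warm-up call."""
    encode_fn(texts[:1])                  # warm-up: not included in the timing
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        encode_fn(texts)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in encoder (assumption): returns one fake 1-D vector per text.
def fake_encode(texts):
    return [[float(len(t))] for t in texts]

best = time_encoder(fake_encode, ["a", "bb", "ccc"] * 100)
print(f"best of 3 runs: {best:.6f} s")
```

With the real models, `time_encoder(model.encode, test_texts)` could replace the single measurement for each entry in `loaded_models`.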

## 8. Experimentation Section

Try your own experiments here:

In [None]:
# 🧪 Experiment: Try your own video titles
your_video_titles = [
    "Add your own YouTube video titles here",
    "Test how well the categorization works",
    "Experiment with different subjects"
]

# Test with your titles
if your_video_titles[0] != "Add your own YouTube video titles here":
    your_results = test_categorization(your_video_titles, sample_content, loaded_models['MiniLM'])
    
    print("Your Categorization Results:")
    for result in your_results:
        print(f"{result['title'][:50]:<50} | {result['predicted']:<15} | {result['confidence']:.3f}")
else:
    print("Replace the sample titles above with your own to test categorization!")

In [None]:
# 🧪 Experiment: Try different similarity metrics
from scipy.spatial.distance import pdist, squareform

# Compare cosine vs euclidean distance
sample_embeddings = loaded_models['MiniLM'].encode(["Machine learning tutorial", "Deep learning basics", "Web development guide"])

cosine_sim = cosine_similarity(sample_embeddings)
euclidean_dist = squareform(pdist(sample_embeddings, metric='euclidean'))

print("Cosine Similarity:")
print(cosine_sim)
print("\nEuclidean Distance:")
print(euclidean_dist)

print("\n💡 Note: Lower euclidean distance = higher similarity")
print("💡 Note: Higher cosine similarity = higher similarity")
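
The two metrics are closely related when embeddings have unit length: in that case the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so both rank pairs in the same order. The sketch below checks this identity on random unit vectors. Whether a given model returns unit-length embeddings is an assumption to verify, for example with `np.linalg.norm(sample_embeddings, axis=1)`.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise cosine similarity of unit vectors is just the dot product.
cosine = unit @ unit.T

# Pairwise squared Euclidean distance via broadcasting.
diff = unit[:, None, :] - unit[None, :, :]
sq_euclid = np.sum(diff ** 2, axis=-1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_euclid, 2 - 2 * cosine)
print(identity_holds)
```

If the embeddings are not normalized, the two metrics can disagree, because Euclidean distance is also sensitive to vector length.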

## 🎯 Key Takeaways

1. **Model Choice**: MiniLM offers a good speed/accuracy tradeoff for YT Study Buddy and is its default
2. **Threshold Selection**: ~0.6 cosine similarity is the starting threshold used here; tune it against the threshold analysis for your own content
3. **Semantic Understanding**: Embeddings capture meaning better than keyword matching
4. **Visualization**: t-SNE helps validate that similar content clusters together
5. **Performance**: Consider encoding speed for real-time applications

## 🚀 Next Steps

- Experiment with fine-tuning models on your specific domain
- Try different similarity thresholds based on your content
- Explore multi-modal embeddings (text + other features)
- Implement vector databases for large-scale similarity search
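
Before reaching for a vector database, a brute-force top-k search is a useful baseline for the last item, and it is often fast enough for a few thousand notes. The sketch below is a minimal NumPy version; `top_k_similar` is a hypothetical helper and the toy vectors are made up.

```python
import numpy as np

def top_k_similar(query, corpus, k=3):
    """Indices and cosine scores of the k corpus rows most similar to the query."""
    corpus = np.asarray(corpus, dtype=float)
    query = np.asarray(query, dtype=float)
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    k = min(k, len(scores))
    # argpartition finds the k best without sorting the whole array,
    # then only those k are sorted by descending score.
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]

toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top_idx, top_scores = top_k_similar([1.0, 0.1], toy_corpus, k=2)
print(top_idx, np.round(top_scores, 3))
```

The same call would work with `model.encode(...)` outputs for the query and the corpus. A dedicated index becomes worthwhile once the full matrix product is too slow or too large for memory.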