# Transformer-Based Semantic Networks

This notebook demonstrates how to build semantic networks using transformer models (sentence embeddings) instead of traditional co-occurrence methods.

## What You'll Learn:
1. **Setup** - Install dependencies and import libraries
2. **Basic Concepts** - Understanding sentence embeddings and similarity
3. **Core Implementation** - Building document and term networks
4. **Practical Examples** - Real-world use cases with sample data
5. **Visualization & Analysis** - Comparing models and analyzing results

## Prerequisites:
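- Python 3.8+
- `sentence-transformers` package
- `scikit-learn` for cosine similarity
- `networkx` for network analysis
- `pandas`, `numpy`, `matplotlib`, and `seaborn` for data handling and plotting
- The project's local `src.semantic.transformers_enhanced` module (imported in Section 1)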

Let's get started! 🚀

## Section 1: Setup and Imports

First, let's import all necessary libraries and verify our environment.

In [None]:
# Core imports
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add parent directory to path to import our modules
sys.path.insert(0, str(Path.cwd().parent))

# Import our transformer modules
from src.semantic.transformers_enhanced import (
    TransformerEmbeddings,
    TransformerSemanticNetwork
)

# NetworkX for graph analysis
import networkx as nx

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

print("✓ All imports successful!")
print(f"✓ Python version: {sys.version}")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Pandas version: {pd.__version__}")

## Section 2: Basic Concepts - Understanding Sentence Embeddings

### What are Sentence Embeddings?

Sentence embeddings are dense vector representations of text that capture semantic meaning. Traditional co-occurrence methods count how often words appear near each other, so two texts only look related if they share vocabulary. Embeddings instead place text in a continuous vector space where similar meanings end up close together, even when the wording differs.

**Key Advantages:**
- 🎯 Capture semantic similarity (e.g., "car" and "automobile" are close)
- 🌍 Work across different phrasings and synonyms
- 📊 Produce dense vectors (384-768 dimensions) vs. sparse co-occurrence matrices
- 🚀 Pre-trained on massive corpora
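
"Close together" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses hand-made 4-dimensional vectors as stand-ins for real embeddings; the numbers are invented purely for illustration and do not come from any model.

```python
import numpy as np

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings" (real models use hundreds of dimensions)
toy_car = [0.9, 0.1, 0.0, 0.2]
toy_automobile = [0.8, 0.2, 0.1, 0.3]
toy_banana = [0.0, 0.9, 0.8, 0.1]

print(f"car vs automobile: {cosine_similarity_pair(toy_car, toy_automobile):.3f}")
print(f"car vs banana:     {cosine_similarity_pair(toy_car, toy_banana):.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.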

Let's see this in action!

In [None]:
# Initialize the transformer embeddings model
# Using MiniLM (small, fast, 384 dimensions)
embedder = TransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Example sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",  # Semantically similar to #1
    "Python is a programming language.",
    "Machine learning uses neural networks.",
    "The dog played in the park."
]

# Encode sentences to embeddings
embeddings = embedder.encode(sentences)

print(f"Shape of embeddings: {embeddings.shape}")
print(f"  → {len(sentences)} sentences")
print(f"  → {embeddings.shape[1]} dimensions per sentence")
print(f"\nFirst embedding (first 10 dimensions):")
print(embeddings[0][:10])

In [None]:
# Compute similarity matrix (cosine similarity)
similarity_matrix = embedder.compute_similarity_matrix(sentences)

# Visualize the similarity matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    similarity_matrix, 
    annot=True, 
    fmt='.3f',
    cmap='RdYlGn',
    xticklabels=[f"S{i+1}" for i in range(len(sentences))],
    yticklabels=[f"S{i+1}" for i in range(len(sentences))],
    vmin=0, 
    vmax=1,
    cbar_kws={'label': 'Cosine Similarity'}
)
plt.title('Sentence Similarity Matrix\n(Notice how S1 and S2 are highly similar!)', fontsize=14)
plt.tight_layout()
plt.show()

# Show the actual sentences for reference
print("\nSentence Reference:")
for i, sent in enumerate(sentences):
    print(f"S{i+1}: {sent}")
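
Assuming `compute_similarity_matrix` returns pairwise cosine similarities, as the comment in the cell above states, the same kind of matrix can be produced from any array of vectors by normalizing each row and taking a matrix product. This is a minimal sketch with made-up 2-dimensional vectors, not the project's implementation; `cosine_similarity_matrix` is an illustrative name.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for the rows of `vectors`."""
    x = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so the dot product equals the cosine
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

toy_vectors = np.array([
    [1.0, 0.0],
    [0.0, 2.0],   # orthogonal to the first row
    [3.0, 3.0],   # halfway between the first two
])
toy_matrix = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_matrix, 3))
```

The result is symmetric with ones on the diagonal, which is why the heatmap above mirrors itself across the diagonal.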

## Section 3: Core Implementation - Building Networks

Now let's build actual semantic networks! We'll create two types:
1. **Document Network**: Connect similar documents
2. **Term Network**: Connect similar terms/phrases

### 3.1 Document Network

This creates a network where each node is a document, and edges connect similar documents.
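
How `build_document_network` turns similarities into edges is not shown in this notebook. As a rough sketch, assuming it keeps up to `top_k` neighbours per document whose cosine similarity clears `similarity_threshold`, the selection step could look like the following. The function name `edges_from_similarity` and the toy matrix are illustrative and not part of the project's API.

```python
import numpy as np

def edges_from_similarity(sim, threshold=0.3, top_k=5):
    """Return undirected edges (i, j, similarity) with i < j.

    For each row i, keep the top_k most similar other items whose
    similarity is at least `threshold`.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    kept = {}
    for i in range(n):
        # Candidate neighbours above the threshold, excluding the item itself
        candidates = [(sim[i, j], j) for j in range(n)
                      if j != i and sim[i, j] >= threshold]
        candidates.sort(reverse=True)
        for s, j in candidates[:top_k]:
            # Store each undirected edge once, keyed by (smaller, larger) index
            kept[(min(i, j), max(i, j))] = float(s)
    return [(i, j, s) for (i, j), s in sorted(kept.items())]

toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.6],
    [0.2, 0.1, 0.6, 1.0],
])
toy_edges = edges_from_similarity(toy_sim, threshold=0.25, top_k=1)
print(toy_edges)
```

Lower thresholds and larger `top_k` values give denser graphs. An item with no neighbour above the threshold contributes no edge at all, so it will be missing from any graph built from the edge list alone.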

In [None]:
# A slightly larger sample corpus spanning several topics
documents = [
    "Climate change is affecting global temperatures and weather patterns.",
    "Global warming leads to rising sea levels and extreme weather events.",
    "Machine learning algorithms can predict patterns in large datasets.",
    "Neural networks are the foundation of modern AI systems.",
    "Deep learning models require significant computational resources.",
    "Electric vehicles are becoming more popular as battery technology improves.",
    "Tesla and other companies are investing heavily in EV infrastructure.",
    "Renewable energy sources like solar and wind power are growing rapidly.",
    "Solar panels convert sunlight directly into electricity.",
    "The stock market experienced significant volatility last quarter.",
    "Cryptocurrency prices fluctuate based on market sentiment and adoption.",
    "Bitcoin and Ethereum are the most well-known cryptocurrencies.",
]

# Initialize the network builder
network_builder = TransformerSemanticNetwork(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Build document network
doc_edges = network_builder.build_document_network(
    documents=documents,
    similarity_threshold=0.3,  # Keep edges with cosine similarity above 0.3
    top_k=5  # Keep top 5 connections per document
)

print(f"Document Network Created!")
print(f"  Documents: {len(documents)}")
print(f"  Edges: {len(doc_edges)}")
print(f"\nFirst few edges:")
print(doc_edges.head())

In [None]:
# Create NetworkX graph for visualization
G = nx.Graph()

# Add edges from our document network
# itertuples keeps each column's dtype; iterrows can upcast integer ids to float
for row in doc_edges.itertuples(index=False):
    G.add_edge(row.source, row.target, weight=row.similarity)

# Visualize the network
plt.figure(figsize=(14, 10))
pos = nx.spring_layout(G, k=2, iterations=50, seed=42)

# Draw edges with varying thickness based on similarity
edges = G.edges()
weights = [G[u][v]['weight'] for u, v in edges]

nx.draw_networkx_edges(
    G, pos, 
    width=[w*3 for w in weights],
    alpha=0.3,
    edge_color='gray'
)

# Draw nodes
nx.draw_networkx_nodes(
    G, pos,
    node_color='lightblue',
    node_size=800,
    alpha=0.9
)

# Draw labels
nx.draw_networkx_labels(
    G, pos,
    font_size=10,
    font_weight='bold'
)

plt.title('Document Similarity Network\n(Thickness = Similarity Strength)', fontsize=14)
plt.axis('off')
plt.tight_layout()
plt.show()

print(f"\nNetwork Statistics:")
print(f"  Nodes: {G.number_of_nodes()}")
print(f"  Edges: {G.number_of_edges()}")
print(f"  Avg Degree: {sum(dict(G.degree()).values()) / G.number_of_nodes():.2f}")

### 3.2 Term Network

Now let's build a network of terms/phrases instead of documents. This is useful for understanding which concepts are semantically related.

In [None]:
# Define key terms/concepts to analyze
terms = [
    "artificial intelligence",
    "machine learning",
    "deep learning",
    "neural networks",
    "natural language processing",
    "computer vision",
    "climate change",
    "global warming",
    "renewable energy",
    "solar power",
    "electric vehicles",
    "cryptocurrency",
    "blockchain",
    "bitcoin",
    "stock market",
    "financial markets",
]

# Build term network
term_edges = network_builder.build_term_network(
    terms=terms,
    similarity_threshold=0.4,  # Higher threshold for terms
    top_k=5
)

print(f"Term Network Created!")
print(f"  Terms: {len(terms)}")
print(f"  Edges: {len(term_edges)}")
print(f"\nStrongest connections:")
print(term_edges.nlargest(10, 'similarity')[['source', 'target', 'similarity']])

## Section 4: Practical Examples - Real-World Use Cases

Let's demonstrate practical applications with more realistic data scenarios.

### Example 1: Topic Clustering from News Headlines

In [None]:
# Simulate news headlines dataset
news_headlines = [
    "Tech Giants Invest Billions in AI Research and Development",
    "New AI Model Breaks Records in Language Understanding Tasks",
    "Stock Market Hits All-Time High Amid Economic Recovery",
    "Cryptocurrency Regulation Debate Intensifies in Congress",
    "Bitcoin Price Surges Following Institutional Adoption",
    "Climate Summit Reaches Historic Agreement on Emissions",
    "Scientists Warn of Accelerating Global Temperature Rise",
    "Renewable Energy Capacity Doubles in Past Five Years",
    "Solar Panel Efficiency Reaches New Milestone in Labs",
    "Electric Vehicle Sales Outpace Traditional Cars in Europe",
    "Tesla Opens New Gigafactory to Meet Growing Demand",
    "Medical AI Diagnoses Diseases with 95% Accuracy",
    "Quantum Computing Breakthrough Announced by Researchers",
    "Space Agency Plans Mission to Mars by 2030",
    "Ocean Plastic Cleanup Project Exceeds Expectations",
]

# Create DataFrame
df = pd.DataFrame({
    'id': range(len(news_headlines)),
    'text': news_headlines
})

print("Sample Headlines:")
print(df.head(10))

In [None]:
# Build network from news headlines
news_edges = network_builder.build_document_network(
    documents=news_headlines,
    similarity_threshold=0.25,
    top_k=4
)

# Create graph
G_news = nx.Graph()
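# itertuples keeps integer node ids as ints, so they can index news_headlines later
# (iterrows can upcast them to float when mixed with a float column)
for row in news_edges.itertuples(index=False):
    G_news.add_edge(row.source, row.target, weight=row.similarity)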

# Detect communities using greedy modularity maximization (Clauset-Newman-Moore)
from networkx.algorithms import community

communities = community.greedy_modularity_communities(G_news)

print(f"\nDetected {len(communities)} topic clusters:")
for i, comm in enumerate(communities):
    print(f"\n📰 Cluster {i+1}:")
    for doc_id in sorted(comm):
        print(f"   [{doc_id}] {news_headlines[doc_id][:60]}...")
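
To see what modularity-based community detection does independently of any embedding model, here is a minimal sketch on a hand-built toy graph: two tightly connected triangles joined by one weak edge. The graph and its weights are made up for illustration; `greedy_modularity_communities` is the same NetworkX function used in the cell above.

```python
import networkx as nx
from networkx.algorithms import community

# Two triangles with strong internal edges and one weak bridge between them
toy_graph = nx.Graph()
toy_graph.add_weighted_edges_from([
    (0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.9),   # first cluster
    (3, 4, 0.9), (4, 5, 0.9), (3, 5, 0.9),   # second cluster
    (2, 3, 0.2),                             # weak bridge
])

toy_communities = community.greedy_modularity_communities(toy_graph, weight="weight")
toy_clusters = sorted(sorted(c) for c in toy_communities)
print(toy_clusters)
```

Merging across the weak bridge would lower modularity, so the algorithm stops with the two triangles as separate clusters.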

### Example 2: Finding Similar Documents

Let's say you have a query document and want to find the most similar documents in your corpus.
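
The ranking step itself is independent of the model: score every document vector against the query vector, then sort. A minimal sketch with made-up 3-dimensional vectors, where `top_k_similar` is an illustrative helper rather than part of the project's modules:

```python
import numpy as np

def top_k_similar(query_vec, doc_vecs, k=3):
    """Return (index, cosine similarity) for the k documents closest to the query."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Normalize so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    order = np.argsort(-scores)[:k]   # highest scores first
    return [(int(i), float(scores[i])) for i in order]

toy_docs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]
ranked = top_k_similar(toy_query, toy_docs, k=2)
for idx, score in ranked:
    print(f"doc {idx}: {score:.3f}")
```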

In [None]:
# Query document
query = "Artificial intelligence is transforming healthcare and medical diagnostics"

# Encode query and all documents
query_embedding = embedder.encode([query])
doc_embeddings = embedder.encode(news_headlines)

# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

# Get top 5 most similar
top_indices = similarities.argsort()[-5:][::-1]

print(f"Query: '{query}'")
print(f"\n🔍 Top 5 Most Similar Headlines:\n")
for rank, idx in enumerate(top_indices, 1):
    print(f"{rank}. [Similarity: {similarities[idx]:.3f}] {news_headlines[idx]}")


## Section 5: Visualization and Analysis

Let's compare different models and analyze network properties in depth.

### 5.1 Comparing Different Transformer Models

In [None]:
# Compare two models: MiniLM (fast) vs MPNet (accurate)
models = {
    'MiniLM-L6': 'sentence-transformers/all-MiniLM-L6-v2',  # 384 dims, 23M params
    'MPNet-base': 'sentence-transformers/all-mpnet-base-v2',  # 768 dims, 110M params
}

# Test sentences for comparison
test_sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Dogs are loyal pets",
]

results = {}

for name, model_name in models.items():
    print(f"Testing {name}...")
    embedder_test = TransformerEmbeddings(model_name=model_name)
    sim_matrix = embedder_test.compute_similarity_matrix(test_sentences)
    results[name] = sim_matrix

# Visualize side-by-side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, (name, sim_matrix) in enumerate(results.items()):
    sns.heatmap(
        sim_matrix,
        annot=True,
        fmt='.3f',
        cmap='RdYlGn',
        xticklabels=['S1', 'S2', 'S3'],
        yticklabels=['S1', 'S2', 'S3'],
        vmin=0,
        vmax=1,
        ax=axes[idx],
        cbar_kws={'label': 'Similarity'}
    )
    # sim_matrix is (n_sentences x n_sentences), so its shape is not the embedding size
    axes[idx].set_title(name)

plt.suptitle('Model Comparison: Similarity Matrices', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("• MiniLM is smaller and faster (384 dims)")
print("• MPNet is larger and typically more accurate, but slower (768 dims)")
print("• Check the heatmaps: S1 and S2 should score well above the pairs involving S3")

### 5.2 Network Analysis Metrics

Let's analyze the properties of our document network using NetworkX.
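
One caveat when reading weighted path-based metrics: NetworkX's shortest-path routines, and therefore `betweenness_centrality`, treat the `weight` attribute as a distance, so a larger value means "farther apart". Similarity scores mean the opposite. The toy triangle below (made-up weights) shows how the result flips depending on which attribute is used; converting with `1 - similarity` is one simple choice, not the only one.

```python
import networkx as nx

T = nx.Graph()
# A and C are each very similar to B, but only weakly similar to each other
T.add_edge("A", "B", similarity=0.9)
T.add_edge("B", "C", similarity=0.9)
T.add_edge("A", "C", similarity=0.2)

# Convert similarity to a distance-like quantity
for u, v, data in T.edges(data=True):
    data["distance"] = 1.0 - data["similarity"]

bc_similarity = nx.betweenness_centrality(T, weight="similarity")
bc_distance = nx.betweenness_centrality(T, weight="distance")

print("similarity used as weight:", bc_similarity["B"])
print("distance used as weight:  ", bc_distance["B"])
```

With distances, the shortest A to C path runs through B, so B acts as the bridge. With raw similarities, the weak A to C edge looks like the "shortest" route and B gets no credit.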

In [None]:
# Analyze the news headlines network
print("📊 Network Statistics:")
print(f"   Nodes: {G_news.number_of_nodes()}")
print(f"   Edges: {G_news.number_of_edges()}")
print(f"   Density: {nx.density(G_news):.4f}")
print(f"   Average Degree: {sum(dict(G_news.degree()).values()) / G_news.number_of_nodes():.2f}")

# Connected components
num_components = nx.number_connected_components(G_news)
print(f"   Connected Components: {num_components}")

# Clustering coefficient
avg_clustering = nx.average_clustering(G_news, weight='weight')
print(f"   Avg Clustering Coefficient: {avg_clustering:.4f}")

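# Centrality measures
degree_centrality = nx.degree_centrality(G_news)
# Betweenness interprets edge weights as distances, so convert similarity to distance first
for u, v, data in G_news.edges(data=True):
    data['distance'] = max(0.0, 1.0 - data['weight'])
betweenness_centrality = nx.betweenness_centrality(G_news, weight='distance')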

# Top 5 most central nodes
print("\n🏆 Top 5 Most Central Documents (by degree):")
top_central = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
for doc_id, centrality in top_central:
    print(f"   [{doc_id}] {news_headlines[doc_id][:60]}... (centrality: {centrality:.3f})")

In [None]:
# Visualize degree distribution
degrees = [G_news.degree(n) for n in G_news.nodes()]

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(degrees, bins=range(max(degrees)+2), alpha=0.7, color='skyblue', edgecolor='black')
plt.xlabel('Degree')
plt.ylabel('Frequency')
plt.title('Degree Distribution')
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
weights = [G_news[u][v]['weight'] for u, v in G_news.edges()]
plt.hist(weights, bins=20, alpha=0.7, color='lightcoral', edgecolor='black')
plt.xlabel('Edge Weight (Similarity)')
plt.ylabel('Frequency')
plt.title('Edge Weight Distribution')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Summary and Next Steps

### What We Learned:

✅ **Sentence Embeddings** - Transform text into dense vectors that capture semantic meaning  
✅ **Document Networks** - Connect similar documents based on embedding similarity  
✅ **Term Networks** - Identify semantically related concepts  
✅ **Community Detection** - Automatically group similar content  
✅ **Model Comparison** - Trade-offs between speed (MiniLM) and accuracy (MPNet)  
✅ **Network Analysis** - Measure centrality, clustering, and network properties

### Advantages over Co-occurrence Methods:

| Feature | Co-occurrence | Transformers |
|---------|--------------|--------------|
| **Semantic Understanding** | ❌ Word-level only | ✅ Deep semantic meaning |
| **Synonyms** | ❌ Treated as different | ✅ Recognized as similar |
| **Context** | ⚠️ Local window only | ✅ Whole input (up to the model's token limit) |
| **Sparsity** | ⚠️ Very sparse | ✅ Dense representations |
| **Speed** | ✅ Fast | ⚠️ Slower (but GPU helps) |
| **Memory** | ✅ Lower | ⚠️ Higher |
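
The synonym row can be made concrete with a minimal sliding-window co-occurrence counter. This is a generic sketch, not the co-occurrence implementation used elsewhere in this project; the function name and the two toy sentences are made up.

```python
from collections import Counter

def cooccurrence_counts(docs, window=3):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i, left in enumerate(tokens):
            # Pair each token with the next (window - 1) tokens
            for right in tokens[i + 1:i + window]:
                counts[tuple(sorted((left, right)))] += 1
    return counts

toy_corpus = [
    "the car needs fuel",
    "the automobile needs fuel",
]
pair_counts = cooccurrence_counts(toy_corpus, window=3)

print("car / fuel:        ", pair_counts[("car", "fuel")])
print("fuel / needs:      ", pair_counts[("fuel", "needs")])
print("automobile / car:  ", pair_counts[("automobile", "car")])
```

"car" and "automobile" never appear in the same window, so a pure co-occurrence count gives them no direct link even though the two sentences mean the same thing.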

### When to Use Each:

**Use Co-occurrence** when:
- You have very large datasets (millions of documents)
- You care about exact word usage patterns
- You need maximum speed
- Memory is limited

**Use Transformers** when:
- You need semantic understanding
- Dataset is medium-sized (<100K documents)
- Quality is more important than speed
- You have GPU access

### Next Steps:

1. Try on your own data with different similarity thresholds
2. Experiment with multilingual models for non-English text
3. Combine with co-occurrence methods for hybrid approaches
4. Explore BERTopic for automated topic modeling

Happy analyzing! 🎉