## Connecting to Neo4j

In [3]:
from neo4j import GraphDatabase
import os
from dotenv import load_dotenv

# Load credentials from .env
load_dotenv()
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USER = os.getenv('NEO4J_USER')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

In [7]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_episode_embeddings(driver):
    episode_embeddings = {}
    with driver.session() as session:
        result = session.run("""
            MATCH (e:Episode)-[:HAS_SEGMENT]->(s:TranscriptSegment)
            RETURN e.episode_number AS episode, s.embedding AS embedding
        """)
        for record in result:
            ep = record['episode']
            emb = record['embedding']
            if emb is not None:
                if ep not in episode_embeddings:
                    episode_embeddings[ep] = []
                episode_embeddings[ep].append(emb)
    # Average the segment embeddings of each episode.
    # None embeddings were already skipped above, so every list here is non-empty.
    avg_embeddings = {}
    for ep, embs in episode_embeddings.items():
        avg_embeddings[ep] = np.mean(np.array(embs), axis=0)
    return avg_embeddings

avg_embeddings = get_episode_embeddings(driver)


def find_most_similar_episodes(target_episode, avg_embeddings, top_n=5):
    target_emb = avg_embeddings[target_episode].reshape(1, -1)
    all_eps = list(avg_embeddings.keys())
    all_embs = np.stack([avg_embeddings[ep] for ep in all_eps])
    sims = cosine_similarity(target_emb, all_embs)[0]
    sim_scores = list(zip(all_eps, sims))
    sim_scores = sorted(sim_scores, key=lambda x: -x[1])
    # Exclude the episode itself
    sim_scores = [s for s in sim_scores if s[0] != target_episode]
    return sim_scores[:top_n]


In [10]:
# Example usage:
target_episode = 437  # Change as needed
similar_episodes = find_most_similar_episodes(target_episode, avg_embeddings)
print(f"Most similar episodes to {target_episode}:")
for ep, score in similar_episodes:
    print(f"Episode {ep}: similarity {score:.3f}")

Most similar episodes to 437:
Episode 438: similarity 0.973
Episode 140: similarity 0.968
Episode 426: similarity 0.966
Episode 323: similarity 0.965
Episode 259: similarity 0.964


In [12]:
import pandas as pd

# Load the CSV file
episodes_df = pd.read_csv(r'G:\My Drive\Projects\naruhodo_references\references_Link\Podcast_Neo4j\data\processed\naruhodo_episodes.csv')

# Inspect the columns to confirm names (uncomment to check)
# print(episodes_df.columns)

# Build a mapping from episode number to title
episode_to_title = pd.Series(episodes_df['episode_title'].values, index=episodes_df['episode_number']).to_dict()

In [14]:
import ipywidgets as widgets
from IPython.display import display

def show_similar_episodes(target_episode):
    selected_title = episode_to_title.get(target_episode, "Unknown Title")
    print(f"\nSelected episode: {target_episode} | Title: {selected_title}\n")
    similar_episodes = find_most_similar_episodes(target_episode, avg_embeddings)
    print("Most similar episodes:")
    for ep, score in similar_episodes:
        title = episode_to_title.get(ep, "Unknown Title")
        print(f"Episode {ep}: similarity {score:.3f} | Title: {title}")

episode_selector = widgets.Dropdown(
    options=sorted(avg_embeddings.keys()),
    description='Episode:',
    disabled=False,
)

widgets.interact(show_similar_episodes, target_episode=episode_selector)


**How Cosine Similarity Works**

**Definition:**  
- Cosine similarity measures the **cosine of the angle** between two vectors in a multi-dimensional space.

**Range:**  
- **+1**: Vectors point in exactly the **same direction** (most similar).  
- **0**: Vectors are **orthogonal** (no similarity).  
- **-1**: Vectors point in **opposite directions** (most dissimilar; rare in embeddings).

---

**Mathematical Formula**  
Given two vectors **A** and **B**:

cosine_similarity = (A · B) / (||A|| × ||B||)


- **A · B** is the **dot product** of the vectors.  
- **||A||** and **||B||** are the **magnitudes (lengths)** of the vectors.
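
The formula can be checked by hand with a few lines of Python. This is a minimal sketch, not part of the notebook's pipeline: the helper name `cosine_sim` and the sample vectors are made up for illustration, and it uses only the standard library so that each term of the formula stays visible. On real arrays, `sklearn.metrics.pairwise.cosine_similarity` (used in the cells above) does the same job.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors: (A · B) / (||A|| × ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))      # A · B
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||B||
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (hypothetical), one pair per case in the Range list
same = cosine_sim([1.0, 2.0], [2.0, 4.0])         # same direction
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])   # perpendicular
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])   # opposite direction
print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

The three printed values correspond to the +1, 0, and -1 cases listed under **Range**.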

---

**Why Use Cosine Similarity for Embeddings?**  
- Embeddings (like your **episode vectors**) capture the **“meaning”** or **content** of each episode in high-dimensional space.  
- Cosine similarity tells you how close the **direction** of two episode vectors is, reflecting how **similar their content** is, regardless of the vectors' magnitudes.  
- It’s **robust to differences in scale** and especially useful for **text and semantic data**.

---

**Example**  
Suppose you have two episode embeddings:

- **Episode A**: `[0.1, 0.2, 0.3]`  
- **Episode B**: `[0.2, 0.4, 0.6]`  

The cosine similarity is:


$$
\frac{0.1 \times 0.2 + 0.2 \times 0.4 + 0.3 \times 0.6}{\sqrt{0.1^2 + 0.2^2 + 0.3^2} \times \sqrt{0.2^2 + 0.4^2 + 0.6^2}} = 1
$$


Since **B is just a scaled version of A**, they are **perfectly similar** (cosine similarity = 1).
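
To see why this matters, the same two vectors can be compared with a magnitude-sensitive measure. The sketch below is only an illustration using the example values for Episode A and Episode B; the comparison with Euclidean distance is an addition for contrast, not something the notebook computes.

```python
import math

episode_a = [0.1, 0.2, 0.3]
episode_b = [0.2, 0.4, 0.6]   # exactly 2 × episode_a

dot = sum(x * y for x, y in zip(episode_a, episode_b))
norm_a = math.sqrt(sum(x * x for x in episode_a))
norm_b = math.sqrt(sum(y * y for y in episode_b))

cosine = dot / (norm_a * norm_b)               # depends only on direction
euclidean = math.dist(episode_a, episode_b)    # depends on magnitude as well

print(f"cosine similarity:  {cosine:.3f}")
print(f"euclidean distance: {euclidean:.3f}")
```

The cosine similarity comes out as 1 even though the Euclidean distance between the two vectors is not zero: scaling a vector changes its length but not its direction.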

---

**In Practice:**  
- **High cosine similarity (~1)**: Episodes are **very similar** in content.  
- **Low cosine similarity (~0)**: Episodes are **unrelated**.  
- **Negative values**: Rare in practice with embeddings. They indicate vectors pointing in opposing directions, which does not necessarily correspond to **opposite meanings**.

---

**Summary Table**

| Cosine Similarity | Interpretation                     |
|-------------------|-------------------------------------|
| 1                 | Identical direction (most similar)  |
| 0                 | No similarity                       |
| -1                | Opposite direction (most dissimilar) |



### What are the differences between Cosine Similarity and GraphRAG?

**Cosine Similarity**

**What it is:**  
- A mathematical measure of similarity between two vectors, based on the angle between them.  
- Commonly used to compare text/document embeddings, such as those from BERT, sentence-transformers, etc.

**How it works:**  
- Takes two vectors (e.g., episode embeddings) and computes the **cosine of the angle** between them.  
- Values range from **-1** (opposite) to **1** (identical direction), with **0** meaning orthogonal (unrelated).

**Use case:**  
- Find the most similar items (episodes, documents, etc.) based on their content or semantic meaning.  
- Simple, fast, and works well for **“nearest neighbor”** search in embedding space.

**Example:**  
- “Which episodes are most similar in content to episode 42?”

---

**GraphRAG (Graph Retrieval-Augmented Generation)**

**What it is:**  
- A framework that combines a **graph database** (like Neo4j) with a **large language model (LLM)** for advanced retrieval and generation.  
- “RAG” stands for **Retrieval-Augmented Generation**: the LLM is given context retrieved from a knowledge base (in this case, a graph).

**How it works:**  
When you ask a question, GraphRAG:  
1. Retrieves **relevant nodes/edges** from your graph (using embeddings, graph queries, or both).  
2. Feeds this **structured and unstructured context** to the LLM.  
3. The LLM generates an **answer, summary, or explanation**, grounded in both the graph’s structure and the content.
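
The three steps above can be sketched in plain Python. Everything here is a hypothetical stand-in: the in-memory `GRAPH` dictionary replaces a real graph database, the node ids, texts, and two-dimensional embeddings are invented, and `fake_llm` is a stub rather than a real model call. It is meant to show the shape of the flow, not how any particular GraphRAG library is implemented.

```python
import math

# Toy stand-in for a graph database: node id -> text, embedding, outgoing links.
# All ids, texts, and vectors are invented for illustration.
GRAPH = {
    "ep1": {"text": "Episode on memory and sleep", "emb": [0.9, 0.1], "links": ["paper1"]},
    "ep2": {"text": "Episode on classroom learning", "emb": [0.2, 0.9], "links": ["paper1"]},
    "paper1": {"text": "Paper on memory consolidation", "emb": [0.7, 0.6], "links": []},
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_emb, top_k=1):
    """Step 1: pick the top_k nodes by embedding similarity, then add their neighbours."""
    ranked = sorted(GRAPH, key=lambda n: cosine_sim(query_emb, GRAPH[n]["emb"]), reverse=True)
    selected = []
    for node in ranked[:top_k]:
        for candidate in [node, *GRAPH[node]["links"]]:
            if candidate not in selected:
                selected.append(candidate)
    return selected

def build_prompt(question, node_ids):
    """Step 2: turn the retrieved nodes and their links into text context."""
    lines = []
    for n in node_ids:
        links = ", ".join(GRAPH[n]["links"]) or "none"
        lines.append(f"- {n}: {GRAPH[n]['text']} (links to: {links})")
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

def answer(question, query_emb, llm):
    """Step 3: hand the prompt to an LLM; `llm` is any callable that takes a prompt string."""
    return llm(build_prompt(question, retrieve(query_emb)))

def fake_llm(prompt):
    # Stub model: only reports how many context items it was given.
    n_items = sum(line.startswith("- ") for line in prompt.splitlines())
    return f"(stub answer grounded in {n_items} context items)"

print(answer("How is sleep related to memory?", [1.0, 0.0], fake_llm))
```

Note that the retrieval step uses cosine similarity to find an entry point and then follows graph links to pull in related nodes, which is what distinguishes it from a pure nearest-neighbour lookup.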

**Use case:**  
- Complex, relational, or **multi-hop questions** that require understanding both the content and the relationships between items.  
- Summarization, Q&A, and **reasoning over knowledge graphs**.

**Example:**  
- “Which episodes discuss both neuroscience and education, and how are they connected?”  
- “Summarize the main findings from all episodes that reference a specific paper.”

---

**Summary Table**

| Feature/Goal       | Cosine Similarity                  | GraphRAG                                      |
|--------------------|------------------------------------|-----------------------------------------------|
| Main purpose        | Find similar items by content      | Answer/generate using graph + LLM             |
| Uses graph structure| No                                 | Yes                                           |
| Uses LLM            | No                                 | Yes                                           |
| Output              | List of similar items              | Answers, summaries, generated text            |
| Complexity          | Simple, fast                       | More complex, powerful                        |
| Example             | “Find similar episodes”            | “Explain how these episodes are related”      |

---

**When to Use Each**

**Cosine similarity:**  
- When you want to quickly **find similar items based on content/embeddings**.  
- Ideal for **recommendations**, “related episodes,” or clustering.

**GraphRAG:**  
- When you want to **answer complex questions** that require both content and relationships.  
- Best for **advanced Q&A, summarization**, or reasoning over your **knowledge graph**.

---

**In short:**  
- **Cosine similarity** is a **simple, direct** way to measure content similarity.  
- **GraphRAG** is a **powerful, LLM-driven** approach for leveraging both your **graph’s structure and content** for advanced retrieval and generation.
