## Learning Goals

- Describe what a knowledge graph is and why it helps retrieval.
- Construct a tiny research graph (Authors–Papers–Topics) and query it.
- Use graph results to narrow retrieval context before generation.
- Identify scenarios where graph-augmented RAG is worthwhile.


### Dataset: Demo Corpus

We will use a tiny mixed-domain corpus (AI, Climate, Biomedical, Materials) stored in `data/demo_corpus.jsonl`.


In [None]:
from pathlib import Path
import pandas as pd

# Each line of the JSONL file holds one document record
DATA_PATH = Path('data/demo_corpus.jsonl')
df = pd.read_json(DATA_PATH, lines=True)
docs = df.to_dict('records')
print(f'Loaded {len(docs)} docs from {DATA_PATH}')
# `display` is provided by the notebook environment
display(df[['id','title','year','topics']].head())


# Module 5: Graph RAG

*Part of the RCD Workshops series: Retrieval-Augmented Generation (RAG) for Advanced Research Applications*

---

So far, retrieval has returned text snippets. What if your knowledge is not just a set of documents but a **knowledge graph**?


## What is Graph RAG?
A **knowledge graph (KG)** organizes data as entities (nodes) and relationships (edges): facts like (Subject —relation→ Object).
Graphs let you represent links across topics and discover answers even when no single document states them directly.

> **Diagram placeholder:** Network visualization of a small knowledge graph: researchers, papers, citations.

### 5.1 Why Knowledge Graphs?
- **Multi-hop answers**: Answer questions that require tracing connections (e.g., "Which startups were founded by former Google employees?").
- **Structured queries (SPARQL, Cypher)**: Let LLMs generate graph queries from user input.
- **Context beyond text**: Some info is implicit and scattered across documents, but explicit in the graph.
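
To make the multi-hop idea concrete, here is a minimal sketch of the startup question as a two-hop lookup. The people, companies, and relation names (`worked_at`, `founded`) are hypothetical and exist only for illustration.

```python
import networkx as nx

# Hypothetical facts as (subject, relation, object) triples
triples = [
    ("Carol", "worked_at", "Google"),
    ("Dan", "worked_at", "Initech"),
    ("Carol", "founded", "StartupX"),
    ("Dan", "founded", "StartupY"),
]

# Each triple becomes one directed edge labelled with its relation
kg = nx.DiGraph()
for subject, relation, obj in triples:
    kg.add_edge(subject, obj, relation=relation)

# Hop 1: who has a worked_at edge pointing to Google?
former_googlers = {
    person
    for person, _, data in kg.in_edges("Google", data=True)
    if data["relation"] == "worked_at"
}

# Hop 2: what did those people found?
startups = {
    company
    for person in former_googlers
    for _, company, data in kg.out_edges(person, data=True)
    if data["relation"] == "founded"
}
print(startups)
```

No single triple states that StartupX was founded by a former Google employee; the answer only appears once the two hops are chained.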

### 5.2 Approaches
1. **Vector-based retrieval over nodes/edges:** Treat node/edge texts as documents; embed and run semantic search (baseline RAG, but on graph content).
2. **Prompt-to-Graph Query:** Use the LLM to translate the user's question into a graph query (e.g., SPARQL or Cypher), run it, and use the returned subgraph as context for the answer.
3. **Hybrid:** Use vectors to find graph entities, then expand by graph traversal.
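
The hybrid approach can be sketched roughly as follows. To keep the example dependency-free, a simple token-overlap score stands in for real embedding similarity; that scoring function, the `text` node attribute, and the helper names are assumptions made for this sketch, not part of any library.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_node("Alice", text="Alice researcher")
kg.add_node("Bob", text="Bob researcher")
kg.add_node("Paper1", text="On Climate Economics")
kg.add_node("Paper2", text="Advances in Climate Modeling")
kg.add_edge("Alice", "Paper1", relation="authored")
kg.add_edge("Bob", "Paper2", relation="authored")
kg.add_edge("Paper2", "Paper1", relation="cites")

def overlap_score(query, text):
    """Stand-in for vector similarity: Jaccard overlap of lowercase tokens."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if (q | t) else 0.0

def hybrid_retrieve(graph, query, radius=1):
    # Step 1: pick the node whose text best matches the query
    seed = max(graph.nodes, key=lambda n: overlap_score(query, graph.nodes[n]["text"]))
    # Step 2: expand to everything within `radius` hops, ignoring edge direction
    neighborhood = nx.ego_graph(graph, seed, radius=radius, undirected=True)
    return seed, neighborhood

seed, subgraph = hybrid_retrieve(kg, "climate economics")
print(seed, sorted(subgraph.nodes))
```

In a real pipeline, step 1 would typically use an embedding model and a vector index, while step 2 stays a plain graph traversal.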

### 5.3 Hands-on Demo: Building/Querying a Knowledge Graph
Let’s use NetworkX to create and query a tiny toy KG.

In [None]:
import networkx as nx
# Create our mini research collaboration graph
G = nx.DiGraph()
G.add_node('Alice', type='Researcher')
G.add_node('Bob', type='Researcher')
G.add_node('Paper1', type='Paper', title='On Climate Economics')
G.add_node('Paper2', type='Paper', title='Advances in Climate Modeling')
G.add_edge('Alice', 'Paper1', relation='authored')
G.add_edge('Bob', 'Paper2', relation='authored')
G.add_edge('Paper2', 'Paper1', relation='cites')
print('Nodes:', G.nodes(data=True))
print('Edges:', G.edges(data=True))


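- **Example query:** "Whose work does Bob cite?"
We can answer this by traversing the graph: follow Bob's `authored` edges to his papers, follow their `cites` edges to the cited papers, then follow `authored` edges backward to find those papers' authors.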

In [None]:
authors_cited_by_bob = set()
# Hop 1: papers that Bob authored
for _, paper, data in G.out_edges('Bob', data=True):
    if data.get('relation') == 'authored':
        # Hop 2: papers cited by each of Bob's papers
        for _, cited_paper, cdata in G.out_edges(paper, data=True):
            if cdata.get('relation') == 'cites':
                # Hop 3: authors of each cited paper (incoming 'authored' edges)
                for author, _, adata in G.in_edges(cited_paper, data=True):
                    if adata.get('relation') == 'authored':
                        authors_cited_by_bob.add(author)
print("Researchers who have their work cited by Bob's paper(s):", authors_cited_by_bob)
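
A traversal like this can also be used to narrow the retrieval context before generation. The following is a minimal sketch, assuming text chunks are tagged with the paper they came from; the `chunks` list, its `paper` field, and the `papers_cited_by` helper are hypothetical and not part of the demo corpus.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Alice", "Paper1", relation="authored")
kg.add_edge("Bob", "Paper2", relation="authored")
kg.add_edge("Paper2", "Paper1", relation="cites")

# Hypothetical text chunks, each tagged with its source paper
chunks = [
    {"paper": "Paper1", "text": "Placeholder passage from Paper1."},
    {"paper": "Paper2", "text": "Placeholder passage from Paper2."},
    {"paper": "Paper3", "text": "Placeholder passage from an unrelated paper."},
]

def papers_cited_by(graph, author):
    """Return the papers cited by any paper that `author` wrote."""
    cited = set()
    for _, paper, data in graph.out_edges(author, data=True):
        if data.get("relation") != "authored":
            continue
        for _, target, cdata in graph.out_edges(paper, data=True):
            if cdata.get("relation") == "cites":
                cited.add(target)
    return cited

# Keep only the chunks that the graph says are relevant
allowed = papers_cited_by(kg, "Bob")
context = [chunk for chunk in chunks if chunk["paper"] in allowed]
print(context)
```

Only the filtered chunks would then be passed to the generator, which keeps the prompt short and focused on the papers the graph identifies.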


### 5.4 Integrating Graphs with an LLM
- **Linearize the subgraph:** Convert facts to sentences and pass them to the LLM as context (e.g., "Bob authored Paper2. Paper2 cites Paper1. Alice authored Paper1.").
- **Embed graph context as text:** Render each triple as a sentence and embed it in the same vector space as the query, so relevant facts can be retrieved by similarity.
- **LLM as a reasoner:** Some advanced agents let the LLM traverse the graph interactively in a multi-hop reasoning loop; this is beyond the scope of this module.
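
Linearization, the first option above, can be sketched in a few lines. The prompt wording and the `linearize` helper are illustrative assumptions; the resulting string would be sent to whichever LLM client you use.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Bob", "Paper2", relation="authored")
kg.add_edge("Paper2", "Paper1", relation="cites")
kg.add_edge("Alice", "Paper1", relation="authored")

def linearize(graph):
    """Render each edge as a 'Subject relation Object.' sentence."""
    return [f"{u} {data['relation']} {v}." for u, v, data in graph.edges(data=True)]

facts = linearize(kg)
question = "Whose work does Bob cite?"

# Assemble a plain-text prompt: facts first, then the question
prompt = (
    "Answer using only these facts:\n"
    + "\n".join(f"- {fact}" for fact in facts)
    + f"\n\nQuestion: {question}"
)
print(prompt)
```

For larger graphs, you would linearize only the retrieved subgraph rather than the whole graph, so the context stays within the model's input limit.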

#### Applications
- Science: concept/citation networks for tracing idea influence.
- Medicine: biomedical KGs link genes, proteins, drugs, diseases.
- Enterprise: org charts, policy graphs, etc.


### Reflection
In what kinds of research would graph-based RAG be most beneficial compared with text-only retrieval?

In [None]:
from utils import create_answer_box
create_answer_box('📝 **Your Answer:** Graph RAG would be most useful for ...', question_id='mod5_graph_rag_application')