<a href="https://colab.research.google.com/github/YuwenAprilYang/FinAgent/blob/main/S3_Implementing_the_GraphRAG_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Install Dependencies

In [1]:
!pip install python-dotenv neo4j openai tiktoken langchain langchain-openai langchain-community


# Step 2: Load Environment Variables

In [2]:
from google.colab import drive
drive.mount('/content/drive')

from dotenv import load_dotenv
import os

env_path = '/content/drive/MyDrive/rag.env'
load_dotenv(env_path)

NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE") or "neo4j"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
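# Fall back to the public OpenAI base URL so a missing OPENAI_BASE_URL
# does not raise a TypeError when concatenating None with a string.
OPENAI_ENDPOINT = (os.getenv("OPENAI_BASE_URL") or "https://api.openai.com/v1").rstrip("/") + "/embeddings"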

# Constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

Mounted at /content/drive
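
If `rag.env` is missing a key, `os.getenv` silently returns `None` and the failure only shows up later, when the Neo4j or OpenAI call is made. A minimal sketch of an early check is shown here; `REQUIRED_VARS`, `missing_env_vars`, and the stand-in `example_env` mapping are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
import os
from typing import Iterable, Mapping

# Names read by the cell above; adjust if your rag.env uses different keys.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Demonstration on a made-up mapping rather than the real environment.
example_env = {
    "NEO4J_URI": "neo4j+s://example",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "",
}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

In the notebook itself, calling `missing_env_vars(REQUIRED_VARS)` right after `load_dotenv(env_path)` and raising an error on a non-empty result would surface configuration problems before Step 3.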


# Step 3: Connect to Neo4j

In [4]:
from langchain_community.graphs import Neo4jGraph

kg = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE
)

# Check how many chunks exist
kg.query("MATCH (c:Chunk) RETURN count(c) AS count")

[{'count': 40}]

# Step 4: Compute Embeddings

In [5]:
kg.query("""
MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
WITH chunk, genai.vector.encode(
  chunk.text,
  "OpenAI",
  {
    token: $openAiApiKey,
    endpoint: $openAiEndpoint
  }) AS vector
CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
""", params={
    "openAiApiKey": OPENAI_API_KEY,
    "openAiEndpoint": OPENAI_ENDPOINT
})

[]
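
The empty result above does not confirm that every chunk actually received an embedding. One way to check, sketched under the assumption that the runner behaves like `kg.query` (takes a Cypher string, returns a list of dicts), is to compare the total chunk count with the number of chunks that have a non-null `textEmbedding`. `COVERAGE_QUERY`, `embedding_coverage`, and `fake_query` are hypothetical helpers, and the counts in the stand-in runner are illustrative values, not results from this graph.

```python
from typing import Any, Callable

# count(expr) in Cypher counts only rows where expr is not null.
COVERAGE_QUERY = """
MATCH (c:Chunk)
RETURN count(c) AS total, count(c.textEmbedding) AS embedded
"""


def embedding_coverage(run_query: Callable[[str], list[dict[str, Any]]]) -> tuple[int, int]:
    """Return (embedded, total) chunk counts using the given query runner."""
    row = run_query(COVERAGE_QUERY)[0]
    return row["embedded"], row["total"]


# Stand-in runner so the sketch runs without a database; in the notebook,
# pass `kg.query` instead. The numbers are made up for illustration.
def fake_query(_cypher: str) -> list[dict[str, Any]]:
    return [{"total": 3, "embedded": 2}]


embedded, total = embedding_coverage(fake_query)
print(f"{embedded}/{total} chunks have embeddings")
```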

# Step 5: Search Function to Preview Similar Chunks

In [6]:
def neo4j_vector_search(question, top_k=5):
    """Return the top_k chunks most similar to `question`, with their scores."""
    # Embed the question inside Neo4j, then query the vector index with it.
    query = """
    WITH genai.vector.encode(
      $question,
      "OpenAI",
      {
        token: $openAiApiKey,
        endpoint: $openAiEndpoint
      }) AS question_embedding
    CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding) YIELD node, score
    RETURN score, node.text AS text
    """
    return kg.query(query, params={
        'question': question,
        'openAiApiKey': OPENAI_API_KEY,
        'openAiEndpoint': OPENAI_ENDPOINT,
        'index_name': VECTOR_INDEX_NAME,
        'top_k': top_k
    })

# Example search
for r in neo4j_vector_search("What is Apple's business model?"):
    print(r["text"])

Item 1. Business Company Background The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company’s fiscal year is the 52- or 53-week period that ends on the last Saturday of September. Products iPhone iPhone ® is the Company’s line of smartphones based on its iOS operating system. The iPhone line includes iPhone 16 Pro, iPhone 16, iPhone 15, iPhone 14 and iPhone SE ® . Mac Mac ® is the Company’s line of personal computers based on its macOS ® operating system. The Mac line includes laptops MacBook Air ® and MacBook Pro ® , as well as desktops iMac ® , Mac mini ® , Mac Studio ® and Mac Pro ® . iPad iPad ® is the Company’s line of multipurpose tablets based on its iPadOS ® operating system. The iPad line includes iPad Pro ® , iPad Air ® , iPad and iPad mini ® . Wearables, Home and Accessories Wearables includes smartwatches, wireless headphones and spatial computers. The Company’s l
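
To make the `score` returned by the search more concrete, the ranking step can be sketched in plain Python. This assumes the index ranks by cosine similarity, which the notebook does not state, and it scans every vector, whereas a real vector index uses an approximate nearest-neighbour structure. The three-dimensional vectors and chunk names are toy values invented for the example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunks: dict[str, Sequence[float]],
          k: int = 5) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the k best matches."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings: real OpenAI embeddings have far more dimensions.
toy_chunks = {
    "business": [0.9, 0.1, 0.0],
    "risk": [0.1, 0.9, 0.1],
    "legal": [0.0, 0.2, 0.9],
}
for name, score in top_k([1.0, 0.2, 0.0], toy_chunks, k=2):
    print(f"{score:.3f}  {name}")
```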

# Step 6: Setup LangChain GraphRAG

In [7]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Neo4jVector
from langchain.chains import RetrievalQAWithSourcesChain
import textwrap

# Create a vector store from the graph
vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
    database=NEO4J_DATABASE,  # same database as the Neo4jGraph connection in Step 3
)

retriever = vector_store.as_retriever()

# Setup RAG chain
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY),
    chain_type="stuff",
    retriever=retriever
)
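
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The idea can be sketched as below; this is a simplified illustration, not LangChain's actual prompt template, and `build_stuff_prompt` along with the `text` and `source` keys are hypothetical names chosen for the example.

```python
def build_stuff_prompt(question: str, chunks: list[dict[str, str]]) -> str:
    """Concatenate every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(
        f"Content: {chunk['text']}\nSource: {chunk['source']}" for chunk in chunks
    )
    return (
        "Answer the question using only the extracts below, "
        "and list the sources you used.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up chunks standing in for retriever results.
example_chunks = [
    {"text": "The Company designs, manufactures and markets smartphones.", "source": "chunk-a"},
    {"text": "The Company's fiscal year ends in September.", "source": "chunk-b"},
]
print(build_stuff_prompt("What does the company sell?", example_chunks))
```

Because every chunk goes into one prompt, the number and size of retrieved chunks are bounded by the model's context window; this is the main trade-off of the approach.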

# Step 7: Ask Questions with LLM

In [8]:
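def prettychain(question: str):
    """Run the RAG chain on `question` and print the answer wrapped at 80 columns."""
    # invoke() replaces the deprecated direct call qa_chain({...}).
    response = qa_chain.invoke({"question": question}, return_only_outputs=True)
    print("\nAnswer:\n" + textwrap.fill(response["answer"], width=80))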
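
`RetrievalQAWithSourcesChain` returns a `sources` entry alongside `answer`, which `prettychain` does not display. A minimal sketch of a formatter that also shows the sources when any are present follows; `format_response` is a hypothetical helper, and the example dictionary is made up rather than taken from a real chain run. Whether `sources` is populated depends on the metadata attached to the retrieved documents.

```python
import textwrap


def format_response(response: dict, width: int = 80) -> str:
    """Format a chain response, appending the sources when the chain returned any."""
    lines = ["Answer:", textwrap.fill(response.get("answer", "").strip(), width=width)]
    sources = response.get("sources", "").strip()
    if sources:
        lines += ["", "Sources:", sources]
    return "\n".join(lines)


# Made-up response with the two keys the chain returns.
example = {
    "answer": "Apple designs and sells devices and services.\n",
    "sources": "chunk-0001",
}
print(format_response(example))
```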

In [11]:
# Ask real questions
prettychain("What is Apple's primary business?")


Answer:
Apple's primary business is designing, manufacturing, and marketing smartphones,
personal computers, tablets, wearables, and accessories, as well as selling
related services.


In [12]:
prettychain("Where is Apple headquartered?")


Answer:
Apple is headquartered in Cupertino, California, United States.


In [13]:
prettychain("What are the top risks mentioned in Apple's 10-K?")


Answer:
The top risks mentioned in Apple's 10-K include industrial accidents at
suppliers, public health issues like pandemics, global economic conditions, and
the need to continually improve products and services to remain competitive.


In [14]:
prettychain("Where are the primary suppliers for Apple?")


Answer:
The primary suppliers for Apple are located primarily in China mainland, India,
Japan, South Korea, Taiwan, and Vietnam.


In [15]:
prettychain("Where are the top 5 ROI product lines for Apple")


Answer:
The top 5 ROI product lines for Apple are iPhone, Mac, iPad, Wearables, Home and
Accessories, and Services.


## 📊 Stage 3: GraphRAG System Implementation Summary

### 🎯 Objective
The goal of Stage 3 is to build a prototype system that integrates graph-based semantic retrieval with a large language model (LLM) to answer user questions using the structured 10-K knowledge graph created in Stage 2.

---

### ✅ Key Components and What Was Implemented

1. **🔌 Neo4j Connection**
   - Connected to the Neo4j AuraDB graph that contains `:Chunk` nodes from 10-K filings.
   - Confirmed with a Cypher count query that the 40 `:Chunk` nodes uploaded in Stage 2 are present and accessible.

2. **🧠 Embedding & Vector Index**
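   - Used Neo4j's `genai.vector.encode` function with the OpenAI provider to compute a vector embedding for each `Chunk.text` that did not already have one.
   - Stored the embeddings in Neo4j under the `textEmbedding` property.
   - Queried the vector index (`form_10k_chunks`) on `:Chunk(textEmbedding)` for fast semantic search. The creation of this index is not shown in this notebook, so it has to exist before Step 5 runs.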

3. **🔍 Semantic Retrieval**
   - Built `neo4j_vector_search()`, which calls `db.index.vector.queryNodes` to return the top 5 chunks for a user question, and wrapped the same index in a `Neo4jVector` retriever for LangChain.
   - This enables semantic search: retrieving text that is *meaningfully similar*, not just keyword-matched.

4. **🤖 LLM Integration with LangChain**
   - Connected OpenAI's GPT model via `ChatOpenAI` through the LangChain framework.
   - Used `RetrievalQAWithSourcesChain` to create a complete retrieval-augmented generation (RAG) pipeline.

5. **🧪 Query Pipeline (Prototype)**
   - Defined a `prettychain()` function to:
     - Accept a user question
     - Retrieve relevant graph chunks
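     - Generate an answer from those chunks and print it wrapped at 80 columns
   - This completes the end-to-end prototype: a question goes in, chunks are retrieved from the graph by vector similarity, and the LLM answers from them.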
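
The vector index named in item 2 is queried in Step 5, but the statement that creates it does not appear in this notebook. A minimal sketch of how such an index could be defined is given below. It assumes Neo4j 5 `CREATE VECTOR INDEX` syntax, 1536 dimensions (the size of OpenAI's `text-embedding-ada-002` vectors), and cosine similarity; none of these settings are confirmed by the notebook. `vector_index_cypher` is a hypothetical helper that only builds the statement, since Cypher cannot take index names or labels as query parameters.

```python
import re

# Index names, labels, and property keys are interpolated into the statement,
# so restrict them to plain identifiers.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def vector_index_cypher(index_name: str, label: str, prop: str,
                        dimensions: int = 1536, similarity: str = "cosine") -> str:
    """Build a CREATE VECTOR INDEX statement for the given label and property."""
    for value in (index_name, label, prop):
        if not _IDENTIFIER.match(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS { indexConfig: {\n"
        f"  `vector.dimensions`: {int(dimensions)},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "} }"
    )


# Uses the constants defined in Step 2 of this notebook.
print(vector_index_cypher("form_10k_chunks", "Chunk", "textEmbedding"))
```

In the notebook, the resulting string could be passed to `kg.query(...)` once, before the first call to `neo4j_vector_search`.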

---

### 💬 Example Questions Supported by the System

- "What is Apple's primary business?"
- "Where is Apple headquartered?"
- "What are the top risks mentioned in Apple's 10-K?"
- "Where are the primary suppliers for Apple?"
- "Where are the top 5 ROI product lines for Apple"

---

### ✅ Outcome
This notebook serves as a working **prototype GraphRAG system**: it stores chunk embeddings in a Neo4j knowledge graph, retrieves the chunks most similar to a question, and passes them to a generative LLM that answers from the text of the 10-K filing.
