# Basic Pipeline

1. Extract the PDF into text (using PyMuPDF for structure).
2. Chunk the text and embed it with OpenAI’s `text-embedding-ada-002` (ADA).
3. Store embeddings in ChromaDB.
4. Ingest each Rule (and its references) into Neo4j via Cypher.
5. Build a hybrid RAG retriever that combines vector and graph queries.
6. Wire up a LangChain chain using an OpenAI chat model (`gpt-4.1-mini` in the code below) to generate precise, rule-based answers.

In [None]:
# !pip install pymupdf langchain langchain-openai chromadb neo4j openai

# Load and parse the PDF

In [1]:
import fitz  # PyMuPDF

def extract_pages(pdf_path):
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    return pages

pages = extract_pages("/Users/debbieliske/Documents/CodingProjects/tmp_repo/usga/data/Clarifications Rules of Golf 2023.pdf")

# Chunk and embed with ADA

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
import os

# Split into ~500-character chunks with a 200-character overlap
# (chunk_size counts characters by default, not tokens)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=200
)
docs = []
for i, page in enumerate(pages):
    for chunk in splitter.split_text(page):
        docs.append({"page": i+1, "text": chunk})

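# Create embeddings (the API key is read from the OPENAI_API_KEY environment variable)
openai_api_key = os.environ["OPENAI_API_KEY"]
embedder = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=openai_api_key,
)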
texts = [d["text"] for d in docs]
metadatas = [{"page": d["page"]} for d in docs]
embeddings = embedder.embed_documents(texts)
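
The interaction between `chunk_size` and `chunk_overlap` is easier to see in a stripped-down version. The sketch below is not the LangChain splitter: `chunk_text` is a hypothetical helper that cuts fixed-size character windows, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks. It only illustrates that each window advances by `chunk_size - chunk_overlap` characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghij"  # made-up 10-character input
chunks = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the notebook's settings (500 and 200), each chunk would share 200 characters with its neighbour, so roughly 40% of the embedded text is duplicated.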

# Persist to ChromaDB

In [6]:
from chromadb import PersistentClient
from chromadb.config import Settings, DEFAULT_TENANT, DEFAULT_DATABASE

# 1. Initialize a local, persistent ChromaDB client
client = PersistentClient(
    path="./chroma_data", 
    settings=Settings(), 
    tenant=DEFAULT_TENANT, 
    database=DEFAULT_DATABASE,
)

# 2. Get or create your collection
collection = client.get_or_create_collection(
    name="golf_rules", 
    metadata={"description": "RAG store for Rules of Golf"},
)

# 3. Generate unique IDs for each text chunk
ids = [f"chunk_{i}" for i in range(len(texts))]

# 4. Add documents + embeddings (ids are now required)
collection.add(
    ids=ids,                   # list of unique string IDs
    documents=texts,           # list of text chunks
    metadatas=metadatas,       # list of metadata dicts
    embeddings=embeddings,     # list of embedding vectors
)

# ──────────
# No `client.persist()` call is needed: PersistentClient saves writes
# automatically under the `path` directory as soon as you add data.
# You’ll see a `chroma.sqlite3` file and subfolders appear in ./chroma_data.
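
Once the vectors are stored, retrieval amounts to ranking them by closeness to the embedded question. The sketch below shows that idea in plain Python with made-up two-dimensional vectors. `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is used here purely for illustration: the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]


# Made-up vectors standing in for real 1536-dimensional ADA embeddings
stored = {"chunk_0": [1.0, 0.0], "chunk_1": [0.0, 1.0], "chunk_2": [0.7, 0.7]}
print(top_k([1.0, 0.1], stored))  # ['chunk_0', 'chunk_2']
```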

In [7]:
import os

NEO4J_URI = "neo4j+s://fd0fa541.databases.neo4j.io"
NEO4J_USERNAME = "neo4j"
# Read the password from the environment instead of committing it to the notebook
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
AURA_INSTANCEID = "fd0fa541"
AURA_INSTANCENAME = "Free instance"


In [8]:
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    NEO4J_URI, 
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD)
)

In [9]:
import re

# Short aliases used by later cells; the driver created in the previous cell is reused
NEO4J_USER = NEO4J_USERNAME
NEO4J_PASS = NEO4J_PASSWORD

def extract_rule_number(text):
    m = re.match(r"Rule (\d+(\.\d+)*) –", text)
    return m.group(1) if m else None

# 5.1 Create nodes for each rule chunk
with driver.session() as session:
    for d in docs:
        # Only create one node per rule section header
        rule_no = extract_rule_number(d["text"][:100])
        if rule_no:
            session.run(
                """
                MERGE (r:Rule {id:$rid})
                ON CREATE SET r.text = $txt
                """,
                rid=rule_no, txt=d["text"]
            )
    # 5.2 Link sections by parent_of (e.g., Rule 1 → Rule 1.1)
    session.run("""
    MATCH (r1:Rule), (r2:Rule)
    WHERE r2.id STARTS WITH r1.id + '.'
    MERGE (r1)-[:PARENT_OF]->(r2)
    """)
    # 5.3 Detect explicit references such as "see Rule X.Y"
    for d in docs:
        # Resolve the source rule once per chunk, not once per reference
        rule_no = extract_rule_number(d["text"][:100])
        if not rule_no:
            continue
        for ref, _ in re.findall(r"[Rr]ule (\d+(\.\d+)*)", d["text"]):
            if ref == rule_no:
                continue  # the chunk's own header is not a reference
            session.run(
                """
                MATCH (a:Rule {id:$a}), (b:Rule {id:$b})
                MERGE (a)-[:REFERS_TO]->(b)
                """,
                a=rule_no, b=ref
            )
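
The node and edge extraction can be checked without a database. The sketch below mirrors the regular expressions used in the cell above; `direct_parent` and `extract_edges` are hypothetical helpers, and the sample text is made up. One difference worth noting: `direct_parent` returns only the immediate parent, whereas the `STARTS WITH` query above links a rule to every descendant (Rule 1 to both 1.1 and 1.1.1).

```python
import re
from typing import Optional

HEADER_RE = re.compile(r"Rule (\d+(?:\.\d+)*)")
REF_RE = re.compile(r"[Rr]ule (\d+(?:\.\d+)*)")


def direct_parent(rule_id: str) -> Optional[str]:
    """Return the immediate parent ('16.1' -> '16'), or None for top-level rules."""
    return rule_id.rsplit(".", 1)[0] if "." in rule_id else None


def extract_edges(text: str) -> tuple[Optional[str], list[str]]:
    """Return (source rule, referenced rules) for one chunk of text."""
    m = HEADER_RE.match(text)
    if not m:
        return None, []
    source = m.group(1)
    # Non-capturing groups make findall return plain strings, not tuples
    refs = [r for r in REF_RE.findall(text) if r != source]
    # Drop duplicates while keeping first-seen order
    return source, list(dict.fromkeys(refs))


# Made-up chunk text, for illustration only
sample = "Rule 16.1 – Example header. See Rule 14.3 and rule 16.2; Rule 14.3 also applies."
source, refs = extract_edges(sample)
print(source, refs, direct_parent(source))  # 16.1 ['14.3', '16.2'] 16
```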

# Hybrid Retriever and RAG Chain

In [10]:
from langchain.chains.router.multi_retrieval_qa import MultiRetrievalQAChain
from langchain_core.prompts import PromptTemplate

# 1) Re-use your llm and prompt
llm = ChatOpenAI(model_name="gpt-4.1-mini", temperature=0)
prompt = PromptTemplate(
    input_variables=["context", "query"],
    template=(
        "Use the following context to answer the question precisely. "
        "If it’s not in the context, say “I don’t know.”\n\n"
        "{context}\n\nQuestion: {query}\nAnswer:"
    )
)

# 2) Supply retriever_infos as before
retriever_infos = [
    {
        "name": "vector",
        "description": "ADA embeddings + ChromaDB retrieval",
        "retriever": vector_retriever
    },
    {
        "name": "graph",
        "description": "Exact‐match Neo4j retrieval",
        "retriever": graph_retriever
    }
]

# 3) Build the multi‐retrieval QA chain without passing default_chain
multi_chain = MultiRetrievalQAChain.from_retrievers(
    llm=llm,
    retriever_infos=retriever_infos,
    default_retriever=vector_retriever,
    default_prompt=prompt,
    default_chain_llm=llm,
    return_source_documents=True,
    verbose=True,
)

# 4) Ask a question (the router expects it under the "input" key)
result = multi_chain.invoke({"input": "What does Rule 5.3 say about starting the round?"})
print(result["result"])
# Inspect result["source_documents"] for provenance.

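NameError: name 'ChatOpenAI' is not defined

This cell fails because `ChatOpenAI` is never imported, and `vector_retriever` and `graph_retriever` are not defined yet. The consolidated cell below adds the import and defines both retrievers.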

In [16]:
# ─── 1) Install dependencies (run once) ─────────────────────────────────────
# !pip install pymupdf langchain langchain-core langchain-openai chromadb neo4j

# ─── 2) Imports & Setup ─────────────────────────────────────────────────────
import os
import re
from typing import Any, Optional

import fitz  # PyMuPDF
from chromadb import PersistentClient
from chromadb.config import DEFAULT_DATABASE, DEFAULT_TENANT, Settings
from langchain.chains.combine_documents.stuff import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.router.multi_retrieval_qa import MultiRetrievalQAChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from neo4j import GraphDatabase


# ─── 4) Extract text from PDF ────────────────────────────────────────────────
def extract_pages(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    return [page.get_text("text") for page in doc]

pages = extract_pages("/Users/debbieliske/Documents/CodingProjects/tmp_repo/usga/data/Clarifications Rules of Golf 2023.pdf")

# ─── 5) Chunk text ────────────────────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
docs = []
for i, page in enumerate(pages):
    for chunk in splitter.split_text(page):
        docs.append({"page": i+1, "text": chunk})

texts = [d["text"] for d in docs]
metadatas = [{"page": d["page"]} for d in docs]

# ─── 6) Generate embeddings ───────────────────────────────────────────────────
# Reuses `openai_api_key` from the earlier embedding cell
embedder = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)

def batch_embed_texts(texts: list[str], batch_size: int = 1000) -> list[list[float]]:
    """Embed texts in batches so that each API request stays small."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings.extend(embedder.embed_documents(batch))
    return embeddings

embeddings = batch_embed_texts(texts, batch_size=100)

# ─── 7) Persist embeddings in ChromaDB ───────────────────────────────────────
client = PersistentClient(
    path="./chroma_data",
    settings=Settings(),
    tenant=DEFAULT_TENANT,
    database=DEFAULT_DATABASE,
)
collection = client.get_or_create_collection(
    name="golf_rules",
    metadata={"description": "RAG store for Rules of Golf"},
)
ids = [f"chunk_{i}" for i in range(len(texts))]
collection.add(
    ids=ids,
    documents=texts,
    metadatas=metadatas,
    embeddings=embeddings,
)

# ─── 8) Ingest rules into Neo4j ──────────────────────────────────────────────
# NEO4J_URI  = "bolt://localhost:7687"
# NEO4J_USER = "neo4j"
# NEO4J_PASS = "<YOUR_PASSWORD>"

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS))

from typing import Optional

def extract_rule_number(text: str) -> Optional[str]:
    """Return the rule number if the text starts with a 'Rule X.Y' header."""
    m = re.match(r"Rule (\d+(\.\d+)*)", text)
    return m.group(1) if m else None

with driver.session() as session:
    # Create Rule nodes
    for d in docs:
        rid = extract_rule_number(d["text"][:100])
        if rid:
            session.run(
                "MERGE (r:Rule {id:$rid}) ON CREATE SET r.text = $txt",
                rid=rid, txt=d["text"]
            )
    # Parent relationships
    session.run("""
    MATCH (r1:Rule), (r2:Rule)
    WHERE r2.id STARTS WITH r1.id + '.'
    MERGE (r1)-[:PARENT_OF]->(r2)
    """)
    # References
    for d in docs:
        src = extract_rule_number(d["text"][:100])
        if not src:
            continue
        for ref, _ in re.findall(r"[Rr]ule (\d+(\.\d+)*)", d["text"]):
            if ref == src:
                continue  # the chunk's own header is not a reference
            session.run(
                "MATCH (a:Rule {id:$a}), (b:Rule {id:$b}) MERGE (a)-[:REFERS_TO]->(b)",
                a=src, b=ref
            )

# ─── 9) Define the Neo4jRetriever ────────────────────────────────────────────
class Neo4jRetriever(BaseRetriever):
    driver: Any
    limit: int = 5

    def _get_relevant_documents(self, query: str) -> list[Document]:
        # Pass the keywords as a query parameter so that quotes in the
        # question cannot break (or inject into) the Cypher statement.
        keywords = query.split()
        cypher = """
        MATCH (r:Rule)
        WHERE any(w IN $keywords WHERE r.text CONTAINS w)
        RETURN r.id AS id, r.text AS text
        LIMIT $limit
        """
        docs = []
        with self.driver.session() as session:
            for rec in session.run(cypher, keywords=keywords, limit=self.limit):
                docs.append(Document(page_content=rec["text"], metadata={"rule_id": rec["id"]}))
        return docs

vector_retriever = Chroma(
    collection_name="golf_rules",
    client=client,
    embedding_function=embedder
).as_retriever()

graph_retriever  = Neo4jRetriever(driver=driver, limit=5)

# ─── 10) Build QA chains ──────────────────────────────────────────────────────
llm = ChatOpenAI(model_name="gpt-4.1-mini", temperature=0, openai_api_key=openai_api_key)
prompt = PromptTemplate(
    input_variables=["context", "query"],
    template=(
        "Use the following context to answer the question precisely. "
        "If it’s not in the context, say “I don’t know.”\n\n"
        "{context}\n\nQuestion: {query}\nAnswer:"
    )
)
combine_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)
vector_chain  = create_retrieval_chain(vector_retriever, combine_chain)
graph_chain   = create_retrieval_chain(graph_retriever,  combine_chain)

# ─── 11) Multi-retrieval router ─────────────────────────────────────────────
multi_chain = MultiRetrievalQAChain.from_retrievers(
    llm=llm,
    retriever_infos=[
        {"name": "vector", "description": "ADA embeddings + ChromaDB", "retriever": vector_retriever},
        {"name": "graph",  "description": "Exact-match Neo4j",         "retriever": graph_retriever},
    ],
    default_retriever=vector_retriever,
    default_prompt=prompt,
    default_chain_llm=llm,
    verbose=True,
)


Insert of existing embedding ID: chunk_0
Insert of existing embedding ID: chunk_1
Insert of existing embedding ID: chunk_2
Insert of existing embedding ID: chunk_3
Insert of existing embedding ID: chunk_4
Insert of existing embedding ID: chunk_5
Insert of existing embedding ID: chunk_6
Insert of existing embedding ID: chunk_7
Insert of existing embedding ID: chunk_8
Insert of existing embedding ID: chunk_9
Insert of existing embedding ID: chunk_10
Insert of existing embedding ID: chunk_11
Insert of existing embedding ID: chunk_12
Insert of existing embedding ID: chunk_13
Insert of existing embedding ID: chunk_14
Insert of existing embedding ID: chunk_15
Insert of existing embedding ID: chunk_16
Insert of existing embedding ID: chunk_17
Insert of existing embedding ID: chunk_18
Insert of existing embedding ID: chunk_19
Insert of existing embedding ID: chunk_20
Insert of existing embedding ID: chunk_21
Insert of existing embedding ID: chunk_22
Insert of existing embedding ID: chunk_23
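
The `Insert of existing embedding ID` messages above most likely appear because the cell was re-run against a collection that already held `chunk_0`, `chunk_1`, and so on. Positional IDs have a second drawback: after the chunk size changed from 500 to 1500, `chunk_0` no longer refers to the same text. One possible alternative, sketched here with a hypothetical `make_chunk_id` helper and made-up chunks, is to derive each ID from the chunk's page and content so that re-ingesting unchanged text always yields the same ID.

```python
import hashlib


def make_chunk_id(page: int, text: str) -> str:
    """Derive a stable ID from the page number and the chunk's content."""
    digest = hashlib.sha256(f"{page}:{text}".encode("utf-8")).hexdigest()
    return f"p{page}_{digest[:16]}"


# Made-up chunks standing in for the notebook's `docs` list
sample_docs = [
    {"page": 1, "text": "first chunk"},
    {"page": 1, "text": "second chunk"},
]
ids = [make_chunk_id(d["page"], d["text"]) for d in sample_docs]
print(ids)
```

IDs built this way could be passed as `ids=` to `collection.add`. Chroma collections also provide an `upsert` method, which may be a better fit for cells that are expected to be re-run.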

In [17]:
# ─── 12) Run a test query ──────────────────────────────────────────────────
answer = multi_chain.run("What happens if the ball lands on an ant hill?")
print(answer)





> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What happens if the ball lands on an ant hill?'}
> Finished chain.
If a ball lands on an ant hill, the ant hill is generally considered a loose impediment and may be removed under Rule 15.1. However, ant hills are not treated as animal holes, so free relief is not automatically allowed under Rule 16.1.

In some cases, if the ant hills are large, hard, or conical and difficult or impossible to remove, the Committee may adopt a Local Rule allowing players to treat such ant hills as ground under repair, giving them the option to take relief under Rule 16.1.

It's important to note that fire ants are considered a dangerous animal condition, so free relief is available under Rule 16.2 without the need for a special Local Rule.

So, in summary:
- Normally, you can remove the ant hill as a loose impediment and play the ball as it lies.
- If the ant hill is large and difficult to remove, and a Local Rule is in p

In [18]:
answer = multi_chain.run("What happens if the ball lands near an alligator?")
print(answer)





> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What happens if the ball lands near an alligator?'}
> Finished chain.
The rules do not specifically mention alligators, but based on the guidance about outside influences such as animals, if the ball lands near an alligator and the alligator causes the ball to move or stops it, the ball must be played as it lies with no penalty to the player. 

If the alligator picks up the ball while it is in motion, the ball is considered to have come to rest on the animal, and the player must take free relief using the point where the alligator picked up the ball as the reference point.

In summary:
- If the ball is deflected or stopped by the alligator, play it as it lies, no penalty.
- If the alligator picks up the ball, take free relief at the spot where the ball was picked up.
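
The `vector: {...}` line in each trace shows which retriever the router picked. `MultiRetrievalQAChain` leaves that choice to the LLM, based on the retriever descriptions. A deterministic pre-check is one possible complement: the sketch below sends questions that name a rule number to the graph retriever and everything else to the vector retriever. `choose_retriever` is a hypothetical helper, not part of LangChain.

```python
import re

# Matches "Rule 5.3" or "rule 16", but not the plural "rules"
RULE_RE = re.compile(r"\b[Rr]ule\s+(\d+(?:\.\d+)*)")


def choose_retriever(question: str) -> str:
    """Return 'graph' when the question names a rule number, else 'vector'."""
    return "graph" if RULE_RE.search(question) else "vector"


print(choose_retriever("What does Rule 5.3 say about starting the round?"))  # graph
print(choose_retriever("What happens if the ball lands on an ant hill?"))    # vector
```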


In [19]:
answer = multi_chain.run("How far is the drop from an alligator?")
print(answer)



> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'How far is the drop from an alligator?'}
> Finished chain.
The player is allowed to take relief by measuring two club-lengths from the fence (or obstruction) and then has a one club-length relief area in which to drop the ball, no nearer the hole than where the ball originally lay. 

So, when taking relief from an alligator (considered a dangerous animal or obstruction), the drop area is determined by measuring two club-lengths away from the fence or obstruction, and then the ball is dropped within a one club-length relief area from that point, ensuring the ball is not dropped closer to the hole than the original position.
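
This answer talks about a fence that the question never mentioned, which suggests the retrieved chunk covered a different situation. One lightweight check, sketched here under the assumption that the retrieved chunk texts are available as plain strings, is to compare the rule numbers cited in an answer against those present in the sources. `unsupported_citations` is a hypothetical helper and the sample strings are made up.

```python
import re

# Captures numbers such as "16.1" or "16.1b" after the word "Rule"
RULE_RE = re.compile(r"\bRule (\d+(?:\.\d+)*[a-z]?)")


def unsupported_citations(answer: str, source_texts: list[str]) -> list[str]:
    """Return rule numbers cited in the answer but absent from every source."""
    cited = set(RULE_RE.findall(answer))
    supported: set[str] = set()
    for text in source_texts:
        supported |= set(RULE_RE.findall(text))
    return sorted(cited - supported)


# Made-up answer and source text, for illustration only
answer = "Take relief under Rule 16.1 or Rule 16.2."
sources = ["Rule 16.1 covers abnormal course conditions."]
print(unsupported_citations(answer, sources))  # ['16.2']
```

An empty list does not prove the answer is right. It only shows that every cited rule number appears somewhere in the retrieved text.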


In [20]:
# ─── 12) Run a test query ──────────────────────────────────────────────────
answer = multi_chain.run("What happens if the ball lands near an alligator?")
print(answer)



> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What happens if the ball lands near an alligator?'}
> Finished chain.
The rules do not specifically mention alligators, but they do address what happens if a ball is stopped or deflected by an animal. If your ball in motion is stopped or deflected by an animal (such as an alligator), there is no penalty, and you must play the ball as it lies.

However, if the animal picks up the ball while it is in motion (for example, the alligator picks up the ball), the ball is considered to have come to rest on the animal. In that case, you must take free relief using the point where the animal picked up the ball as the reference point.

So, if your ball lands near an alligator and is not picked up or moved by it, just play it as it lies. If the alligator picks up the ball, take free relief from that spot.


In [21]:
answer = multi_chain.run("What is the relief when your ball is in a sand trap with standing water inside and your ball is in standing water?")
print(answer)



> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What is the relief when your ball is in a sand trap with standing water inside and your ball is in standing water?'}
> Finished chain.
When your ball is in a sand trap (bunker) with standing water and the ball is in that standing water, you are entitled to free relief under Rule 16.1, which covers relief from abnormal course conditions (such as temporary water).

Specifically:

- You may take free relief by finding the nearest point of complete relief from the standing water in the bunker, which means a point where the ball is no longer in the water or affected by it.
- The relief must be taken in the bunker (the relief area is within the bunker).
- You drop the ball in the relief area according to Rule 14.3b(3) and Rule 16.1.

You are not allowed to take relief outside the bunker unless the rules specifically allow it, so the ball must be dropped in the bunker at the nearest point of complete relief fro

In [56]:
answer = multi_chain.run("What do the rules say about alligators?")
print(answer)



> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What do the rules say about alligators?'}
> Finished chain.
The rules of golf do not specifically mention alligators. However, alligators would be considered "animals" under the rules, as they are living members of the animal kingdom (reptiles). If an alligator or its presence creates a dangerous animal condition, Rule 16.2 on interference by a dangerous animal condition would apply, allowing relief for the player. 

In general, animals (including alligators) and their holes or disturbances may be considered abnormal course conditions or loose impediments depending on the situation, and the rules provide guidance on relief in those cases. But there is no specific rule that addresses alligators by name.


In [60]:
answer = multi_chain.run("What is the relief from a dangerous animal?")
print(answer)



> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What is the relief from a dangerous animal?'}
> Finished chain.
Relief from a dangerous animal condition is allowed under Rule 16.2 when a dangerous animal (such as poisonous snakes, stinging bees, alligators, fire ants, or bears) near a ball could cause serious physical injury to the player if they had to play the ball as it lies.

Here is how relief works:

1. **When the ball is anywhere except a penalty area:**
   - The player may take relief under Rule 16.1b, c, or d, depending on whether the ball is in the general area, in a bunker, or on the putting green.

2. **When the ball is in a penalty area:**
   - **Free Relief (Playing from Inside Penalty Area):** The player may take free relief under Rule 16.1b, but the nearest point of complete relief and the relief area must be inside the penalty area.
   - **Penalty Relief (Playing from Outside Penalty Area):** The player may take penalty relief under R

In [63]:
answer = multi_chain.run("What is the club-length relief from a dangerous animal?")
print(answer)



> Entering new MultiRetrievalQAChain chain...
vector: {'query': 'What is the club-length relief from a dangerous animal?'}
> Finished chain.
When taking relief from a dangerous animal condition under the Rules of Golf, the player is allowed to take free relief by dropping the ball within a relief area defined by the nearest point of complete relief. This relief area is measured as one club-length from that reference point, subject to certain limits.

Specifically:

- The relief area is one club-length from the nearest point of complete relief.
- The relief area must be in a location allowed by the rules, for example:
  - It must not be nearer the hole than the reference point.
  - It must be in the same area of the course as the ball (general area, bunker, or putting green), depending on where the ball lies.
  - It must not be in a penalty area or bunker if the ball is not originally there.
  - It must be a place where there is no interference from the dangerous anima