# Basic Pipeline

1. Extract the PDF into text (using PyMuPDF for structure).
2. Chunk and embed text with OpenAI's ADA.
3. Store embeddings in ChromaDB.
4. Ingest each Rule (and its references) into Neo4j via Cypher.
5. Build a hybrid RAG retriever that combines vector and graph queries.
6. Wire up a LangChain chain using GPT-4o to generate precise, rule-based answers.

In [None]:
# In a notebook cell (in a terminal, drop the leading "!")
!pip install pymupdf langchain chromadb neo4j openai

# Load and parse the PDF

In [None]:
import fitz  # PyMuPDF

def extract_pages(pdf_path):
    """Return the plain text of each page, in page order."""
    # The context manager closes the PDF once extraction is done
    with fitz.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]

pages = extract_pages("Rules of Golf for 2019 (Final).pdf")
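
Before chunking, it can help to sanity-check what the extraction returned, since pages that contain only images yield empty text. The following is a minimal sketch; `summarize_pages` and `sample_pages` are hypothetical names introduced here for illustration, and the sample list is a stand-in for the real `pages` list.

```python
def summarize_pages(pages):
    """Return basic statistics about a list of per-page text strings."""
    # 1-based numbers of pages whose text is empty or whitespace only
    empty = [i + 1 for i, text in enumerate(pages) if not text.strip()]
    return {
        "page_count": len(pages),
        "empty_pages": empty,
        "total_chars": sum(len(text) for text in pages),
    }

# Stand-in data (an assumption for this sketch); in the notebook,
# pass the `pages` list returned by extract_pages() instead.
sample_pages = ["First page text", "   \n", "Third page text"]
summary = summarize_pages(sample_pages)
print(summary)
```

If many pages show up in `empty_pages`, the PDF likely needs OCR before the rest of the pipeline will be useful.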

# Chunk and embed with ADA

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
import os

# Set your OpenAI key
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"

# Split into chunks of up to 1500 characters with 200 characters of overlap
# (chunk_size is measured in characters by default, not tokens)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=200
)
docs = []
for i, page in enumerate(pages):
    for chunk in splitter.split_text(page):
        docs.append({"page": i+1, "text": chunk})

# Create embeddings
embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
texts = [d["text"] for d in docs]
metadatas = [{"page": d["page"]} for d in docs]
embeddings = embedder.embed_documents(texts)
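
To make the `chunk_size` and `chunk_overlap` settings used above more concrete, here is a simplified fixed-window chunker. It is only an illustration of how overlap works, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences). The function name `sliding_chunks` and the sample string are assumptions made for this sketch.

```python
def sliding_chunks(text, chunk_size=1500, chunk_overlap=200):
    """Cut text into fixed-size character windows that overlap.

    Each window starts (chunk_size - chunk_overlap) characters
    after the previous one, so neighbours share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Synthetic 4000-character string standing in for one page of text
sample = "".join(str(i % 10) for i in range(4000))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.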