# Neo4j Aura Setup for GraphRAG Platform

This notebook guides you through setting up the required database objects in Neo4j Aura for the GraphRAG platform. We'll:
1. Connect to the database
2. Create necessary constraints and indexes
3. Verify the setup

## Prerequisites
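- Neo4j Aura Enterprise account
- Database connection details (URI, username, password), stored in a `.env` file as `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`
- Required Python packages: `neo4j`, `python-dotenv`, `pandas`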

## Install Required Packages

In [1]:
# Install required packages if not already installed
%pip install -q neo4j python-dotenv pandas




## Import Dependencies and Load Environment Variables

In [2]:
from neo4j import GraphDatabase
import os
from dotenv import load_dotenv
import logging
import pandas as pd

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Get database credentials
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

print(f"Using database URI: {NEO4J_URI}")

Using database URI: neo4j+s://f38c24b8.databases.neo4j.io
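
If one of the three variables is missing from the `.env` file, `os.getenv` quietly returns `None`, and the problem typically only surfaces later when the driver is created. A minimal sketch of an early check, assuming the same variable names as above (the helper name `missing_env_vars` is hypothetical, not part of the platform):

```python
import os

# Variable names used by this notebook.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` makes a misconfigured `.env` file obvious before any connection attempt.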


## Create Database Connection Helper

In [3]:
class Neo4jConnection:
    """Thin wrapper around the Neo4j driver for running Cypher queries."""

    def __init__(self, uri, username, password):
        self.driver = GraphDatabase.driver(uri, auth=(username, password))

    def close(self):
        """Close the driver and release its connections."""
        if self.driver is not None:
            self.driver.close()

    def verify_connectivity(self):
        """Raise an exception if the database cannot be reached."""
        self.driver.verify_connectivity()

    def run_query(self, query, parameters=None):
        """Run a query and return its records as a list."""
        with self.driver.session() as session:
            result = session.run(query, parameters or {})
            return list(result)

    def run_query_to_df(self, query, parameters=None):
        """Run a query and return its records as a pandas DataFrame."""
        with self.driver.session() as session:
            result = session.run(query, parameters or {})
            records = list(result)
            if not records:
                return pd.DataFrame()
            return pd.DataFrame([r.values() for r in records], columns=result.keys())

## Connect and Test Database

In [4]:
# Create connection
conn = Neo4jConnection(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)

try:
    # Test connectivity
    conn.verify_connectivity()
    print("Successfully connected to Neo4j Aura!")
    
    # Get database info
    info_df = conn.run_query_to_df("""
    CALL dbms.components()
    YIELD name, versions, edition
    RETURN name, versions, edition
    """)
    
    display(info_df)
    
except Exception as e:
    print(f"Connection failed: {str(e)}")

Successfully connected to Neo4j Aura!


| | name | versions | edition |
|---|---|---|---|
| 0 | Neo4j Kernel | [5.24-aura] | enterprise |


## Create Constraints

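First, we'll create uniqueness constraints on the `id` property of `Content`, `Speaker`, and `Topic` nodes so that duplicate nodes cannot be created. Each uniqueness constraint is backed by an index of the same name.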

In [5]:
constraints = [
    """
    CREATE CONSTRAINT content_id IF NOT EXISTS
    FOR (n:Content) 
    REQUIRE n.id IS UNIQUE
    """,
    """
    CREATE CONSTRAINT speaker_id IF NOT EXISTS
    FOR (n:Speaker) 
    REQUIRE n.id IS UNIQUE
    """,
    """
    CREATE CONSTRAINT topic_id IF NOT EXISTS
    FOR (n:Topic) 
    REQUIRE n.id IS UNIQUE
    """
]

for constraint in constraints:
    try:
        conn.run_query(constraint)
        print("Successfully created constraint")
    except Exception as e:
        print(f"Error creating constraint: {str(e)}")

# Verify constraints
constraints_df = conn.run_query_to_df("""
SHOW CONSTRAINTS
""")

display(constraints_df)

Successfully created constraint
Successfully created constraint
Successfully created constraint


| | id | name | type | entityType | labelsOrTypes | properties | ownedIndex | propertyType |
|---|---|---|---|---|---|---|---|---|
| 0 | 5 | content_id | UNIQUENESS | NODE | [Content] | [id] | content_id | |
| 1 | 3 | n10s_unique_uri | UNIQUENESS | NODE | [Resource] | [uri] | n10s_unique_uri | |
| 2 | 7 | speaker_id | UNIQUENESS | NODE | [Speaker] | [id] | speaker_id | |
| 3 | 9 | topic_id | UNIQUENESS | NODE | [Topic] | [id] | topic_id | |


## Create Vector Index

Now we'll create the vector index required for semantic search. The `vector.dimensions` setting must match the output size of your embedding model (the index below is sized for text-embedding-3-large):
- OpenAI text-embedding-3-large: 3072 dimensions
- OpenAI text-embedding-3-small: 1536 dimensions
- OpenAI text-embedding-ada-002: 1536 dimensions
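
To keep the index definition and the embedding model in step, the dimension count can be looked up rather than hard-coded. The following is only a sketch: the dictionary and the helper `vector_index_statement` are hypothetical names, and the label, property, and index name simply mirror the ones used in this notebook.

```python
# Dimensions taken from the list above; extend for other models as needed.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

def vector_index_statement(model, index_name="video_content"):
    """Build a CREATE VECTOR INDEX statement sized for the given model."""
    dimensions = EMBEDDING_DIMENSIONS[model]  # raises KeyError for unknown models
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        "FOR (n:Content)\n"
        "ON (n.embedding)\n"
        "OPTIONS {\n"
        "    indexConfig: {\n"
        f"        `vector.dimensions`: {dimensions},\n"
        "        `vector.similarity_function`: 'cosine'\n"
        "    }\n"
        "}"
    )

print(vector_index_statement("text-embedding-3-small"))
```

The resulting string can be passed to `conn.run_query` in place of the literal statement.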

In [6]:
vector_indexes = [
    """
    CREATE VECTOR INDEX video_content IF NOT EXISTS
    FOR (n:Content) 
    ON (n.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 3072,
            `vector.similarity_function`: 'cosine'
        }
    }
    """
]

for index in vector_indexes:
    try:
        conn.run_query(index)
        print("Successfully created vector index")
    except Exception as e:
        print(f"Error creating vector index: {str(e)}")

# Verify vector indexes
vector_indexes_df = conn.run_query_to_df("""
SHOW INDEXES
WHERE type = 'VECTOR'
""")

display(vector_indexes_df)

Successfully created vector index


| | id | name | state | populationPercent | type | entityType | labelsOrTypes | properties | indexProvider | owningConstraint | lastRead | readCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | video_content | POPULATING | 0.0 | VECTOR | NODE | [Content] | [embedding] | vector-2.0 | | | |
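
The output above shows the index in the `POPULATING` state immediately after creation; it should reach `ONLINE` before you query it. One way to wait is to poll the state. This is a sketch under assumptions: `wait_until_online` is a hypothetical helper, and the commented usage assumes the `conn` object from this notebook.

```python
import time

def wait_until_online(fetch_state, timeout_s=300.0, interval_s=2.0):
    """Poll fetch_state() until it returns 'ONLINE'; return False on timeout.

    fetch_state is any zero-argument callable returning the index state string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == "ONLINE":
            return True
        if state == "FAILED":
            raise RuntimeError("Index population failed")
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Hypothetical usage with the connection helper from this notebook:
# def video_content_state():
#     rows = conn.run_query(
#         "SHOW INDEXES YIELD name, state "
#         "WHERE name = 'video_content' RETURN state"
#     )
#     return rows[0]["state"] if rows else None
#
# wait_until_online(video_content_state)
```

Passing the state lookup as a callable keeps the polling logic independent of the driver, so it can be exercised without a database.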


## Create Full-text Index

This full-text index over the `title` and `text` properties of `Content` nodes supports keyword-based search.

In [7]:
fulltext_indexes = [
    """
    CREATE FULLTEXT INDEX video_text IF NOT EXISTS
    FOR (n:Content)
    ON EACH [n.title, n.text]
    """
]

for index in fulltext_indexes:
    try:
        conn.run_query(index)
        print("Successfully created full-text index")
    except Exception as e:
        print(f"Error creating full-text index: {str(e)}")

# Verify full-text indexes
fulltext_indexes_df = conn.run_query_to_df("""
SHOW INDEXES
WHERE type = 'FULLTEXT'
""")

display(fulltext_indexes_df)

Successfully created full-text index


| | id | name | state | populationPercent | type | entityType | labelsOrTypes | properties | indexProvider | owningConstraint | lastRead | readCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | video_text | ONLINE | 100.0 | FULLTEXT | NODE | [Content] | [title, text] | fulltext-1.0 | | | |
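
Full-text queries in Neo4j are parsed with Lucene query syntax, so raw user input containing characters such as `:` or `(` can cause a parse error or change the meaning of the search. A minimal sketch of escaping such input before querying `video_text`; the helper `escape_lucene` is a hypothetical name, and it does not handle the uppercase operators `AND`, `OR`, and `NOT`.

```python
# Characters with special meaning in Lucene query syntax.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')

def escape_lucene(text):
    """Backslash-escape Lucene special characters in free-text input."""
    return "".join(
        "\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch
        for ch in text
    )

# Hypothetical usage with the connection helper from this notebook:
# hits_df = conn.run_query_to_df(
#     """
#     CALL db.index.fulltext.queryNodes('video_text', $q)
#     YIELD node, score
#     RETURN node.title AS title, score
#     LIMIT 5
#     """,
#     {"q": escape_lucene("GraphRAG: intro (part 1)")},
# )
```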


## Create Additional Indexes

These range indexes speed up lookups on `Content.title`, `Content.type`, and `Speaker.name`.

In [8]:
indexes = [
    """
    CREATE INDEX content_title IF NOT EXISTS
    FOR (n:Content) 
    ON (n.title)
    """,
    """
    CREATE INDEX content_type IF NOT EXISTS
    FOR (n:Content) 
    ON (n.type)
    """,
    """
    CREATE INDEX speaker_name IF NOT EXISTS
    FOR (n:Speaker) 
    ON (n.name)
    """
]

for index in indexes:
    try:
        conn.run_query(index)
        print("Successfully created index")
    except Exception as e:
        print(f"Error creating index: {str(e)}")

# Verify all indexes
all_indexes_df = conn.run_query_to_df("""
SHOW INDEXES
""")

display(all_indexes_df)

Successfully created index
Successfully created index
Successfully created index


| | id | name | state | populationPercent | type | entityType | labelsOrTypes | properties | indexProvider | owningConstraint | lastRead | readCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | content_id | ONLINE | 100.0 | RANGE | NODE | [Content] | [id] | range-1.0 | content_id | | 0.0 |
| 1 | 12 | content_title | ONLINE | 100.0 | RANGE | NODE | [Content] | [title] | range-1.0 | | | |
| 2 | 13 | content_type | ONLINE | 100.0 | RANGE | NODE | [Content] | [type] | range-1.0 | | | |
| 3 | 0 | index_343aff4e | ONLINE | 100.0 | LOOKUP | NODE | | | token-lookup-1.0 | | 2024-09-28T19:12:23.788000000+00:00 | 8339.0 |
| 4 | 1 | index_f7700477 | ONLINE | 100.0 | LOOKUP | RELATIONSHIP | | | token-lookup-1.0 | | 2024-09-15T22:16:04.983000000+00:00 | 26.0 |
| 5 | 2 | n10s_unique_uri | ONLINE | 100.0 | RANGE | NODE | [Resource] | [uri] | range-1.0 | n10s_unique_uri | 2024-09-21T01:38:33.236000000+00:00 | 705483.0 |
| 6 | 6 | speaker_id | ONLINE | 100.0 | RANGE | NODE | [Speaker] | [id] | range-1.0 | speaker_id | | 0.0 |
| 7 | 14 | speaker_name | ONLINE | 100.0 | RANGE | NODE | [Speaker] | [name] | range-1.0 | | | |
| 8 | 8 | topic_id | ONLINE | 100.0 | RANGE | NODE | [Topic] | [id] | range-1.0 | topic_id | | 0.0 |
| 9 | 10 | video_content | ONLINE | 100.0 | VECTOR | NODE | [Content] | [embedding] | vector-2.0 | | | 0.0 |
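
Reading a long index listing by eye is error-prone; for instance, the listing above has no row for `video_text`. A small check that compares the listing against the names created in this notebook can flag such gaps. This is a sketch: `index_problems` and `EXPECTED_INDEXES` are hypothetical names, and the rows are assumed to be dictionaries with `name` and `state` keys.

```python
# Index names created in this notebook (constraint-backed indexes included).
EXPECTED_INDEXES = {
    "content_id", "speaker_id", "topic_id",
    "video_content", "video_text",
    "content_title", "content_type", "speaker_name",
}

def index_problems(rows, expected=EXPECTED_INDEXES):
    """Map each expected index that is missing or not ONLINE to its status."""
    states = {row["name"]: row["state"] for row in rows}
    problems = {}
    for name in sorted(expected):
        state = states.get(name)
        if state is None:
            problems[name] = "missing"
        elif state != "ONLINE":
            problems[name] = state
    return problems

# Hypothetical usage with the DataFrame from the cell above:
# print(index_problems(all_indexes_df.to_dict("records")))
```

An empty dictionary means every expected index is present and online.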


## Verify Database Setup

Finally, we'll run a test query against each search index and collect basic database statistics with APOC.

In [9]:
# Test queries
test_queries = [
    {
        "name": "Vector Index Test",
        "query": """
        CALL db.index.vector.queryNodes('video_content', 3, [])
        YIELD node, score
        RETURN count(*) as count
        """
    },
    {
        "name": "Full-text Index Test",
        "query": """
        CALL db.index.fulltext.queryNodes('video_text', 'test')
        YIELD node, score
        RETURN count(*) as count
        """
    },
    {
        "name": "Database Statistics",
        "query": """
        CALL apoc.meta.stats()
        YIELD nodeCount, relCount, labels, relTypes
        RETURN nodeCount, relCount, labels, relTypes
        """
    }
]

for test in test_queries:
    print(f"\nRunning {test['name']}:")
    try:
        result_df = conn.run_query_to_df(test['query'])
        display(result_df)
    except Exception as e:
        print(f"Error: {str(e)}")


Running Vector Index Test:
Error: {code: Neo.ClientError.Procedure.ProcedureCallFailed} {message: Failed to invoke procedure `db.index.vector.queryNodes`: Caused by: java.lang.IllegalArgumentException: Index query vector has 0 dimensions, but indexed vectors have 3072.}

Running Full-text Index Test:


| | count |
|---|---|
| 0 | 0 |



Running Database Statistics:


| | nodeCount | relCount | labels | relTypes |
|---|---|---|---|---|
| 0 | 87306 | 185003 | {'InhibitorInteraction': 6832, 'ActivatorInter... | {'(:AgonistInteraction)-[:hasReference]->()': ... |
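
The vector index test fails because it passes an empty list as the query vector, while the index expects 3072 dimensions. A minimal sketch of a test that supplies a placeholder vector of the right length; the helper `vector_index_test` is hypothetical, and the all-ones vector is only a stand-in for a real embedding (an all-zero vector would not be suitable for cosine similarity).

```python
def vector_index_test(index_name="video_content", dimensions=3072, top_k=3):
    """Return a Cypher query and parameters for a smoke test of a vector index."""
    query = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $vector)
    YIELD node, score
    RETURN count(*) AS count
    """
    params = {
        "index_name": index_name,
        "top_k": top_k,
        # Placeholder query vector; its length must equal the index dimensions.
        "vector": [1.0] * dimensions,
    }
    return query, params

# Hypothetical usage with the connection helper from this notebook:
# query, params = vector_index_test()
# display(conn.run_query_to_df(query, params))
```

Because `run_query_to_df` already accepts a `parameters` argument, the vector can be passed as a query parameter instead of being written into the Cypher text.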


## Cleanup

Close the database connection.

In [10]:
conn.close()
print("Database connection closed.")

Database connection closed.


## Next Steps

Your Neo4j Aura database is now configured for the GraphRAG platform. You can:
1. Start ingesting video content
2. Test vector similarity searches
3. Run hybrid queries combining vector and keyword search

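The vector index dimensions must match the embedding model. If you change models, drop the `video_content` index and recreate it with the new dimension count, since the dimensions of an existing index cannot be changed in place.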
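
For the hybrid queries in step 3, the vector and keyword result lists need to be merged somehow. The notebook does not prescribe a method; one common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the platform's actual approach. The function name and the constant `k=60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs; each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda item_id: (-scores[item_id], str(item_id)))

# Example with made-up content IDs from a vector search and a keyword search.
vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only uses rank positions, so it avoids having to put cosine similarity scores and full-text scores on a common scale.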