# Neo4j Aura Setup for GraphRAG Platform

This notebook guides you through setting up the required database objects in Neo4j Aura for the GraphRAG platform using a safe namespacing approach that allows coexistence with existing data. We'll:
1. Connect to the database
2. Analyze existing data
3. Create namespaced constraints and indexes
4. Verify the setup and add test data

## Prerequisites
- Neo4j Aura Enterprise account
- Database connection details
- Required Python packages

## Install Required Packages

In [1]:
# Install required packages if not already installed
!pip install -q neo4j python-dotenv pandas numpy


[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


## Import Dependencies and Load Environment Variables

In [2]:
from neo4j import GraphDatabase
import os
from dotenv import load_dotenv
import logging
import pandas as pd
import numpy as np

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Get database credentials
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

print(f"Using database URI: {NEO4J_URI}")

Using database URI: neo4j+s://f38c24b8.databases.neo4j.io


## Create Enhanced Database Connection Helper

In [12]:
class Neo4jConnection:
    def __init__(self, uri, username, password):
        self.driver = GraphDatabase.driver(uri, auth=(username, password))
        
    def close(self):
        if self.driver is not None:
            self.driver.close()
            
    def verify_connectivity(self):
        self.driver.verify_connectivity()
        
    def run_query(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters or {})
            return list(result)
        
    def run_query_to_df(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters or {})
            records = list(result)
            if not records:
                return pd.DataFrame()
            return pd.DataFrame([r.values() for r in records], columns=result.keys())
            
    def analyze_database(self):
        """Analyze existing database structure"""
        queries = [
            {
                "name": "Label Statistics",
                "query": """
                MATCH (n)
                WITH DISTINCT labels(n) as labels
                UNWIND labels as label
                MATCH (n)
                WHERE label in labels(n)
                WITH label, count(n) as count
                RETURN label, count
                ORDER BY count DESC
                """
            },
            {
                "name": "Relationship Types",
                "query": """
                MATCH ()-[r]->()
                WITH DISTINCT type(r) as relType, count(r) as count
                RETURN relType as relationshipType, count
                ORDER BY count DESC
                """
            },
            {
                "name": "GraphRAG Namespace Statistics",
                "query": """
                MATCH (n)
                WHERE any(label IN labels(n) WHERE label STARTS WITH 'GraphRAG_')
                WITH DISTINCT labels(n) as labels, count(n) as nodeCount
                RETURN labels, nodeCount
                ORDER BY nodeCount DESC
                """
            }
        ]
        
        results = {}
        for query in queries:
            try:
                results[query["name"]] = self.run_query_to_df(query["query"])
            except Exception as e:
                logger.error(f"Error running {query['name']}: {str(e)}")
                results[query["name"]] = pd.DataFrame()
                
        return results

## Connect and Analyze Database

In [13]:
# Create connection
conn = Neo4jConnection(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)

try:
    # Test connectivity
    conn.verify_connectivity()
    print("Successfully connected to Neo4j Aura!")
    
    # Analyze existing database
    analysis = conn.analyze_database()
    print("\nExisting Database Analysis:")
    for name, df in analysis.items():
        print(f"\n{name}:")
        display(df)
    
except Exception as e:
    print(f"Connection failed: {str(e)}")

Successfully connected to Neo4j Aura!

Existing Database Analysis:

Label Statistics:


Unnamed: 0,label,count
0,Resource,2619180
1,Affinity,19198
2,SyntheticOrganicLigand,8083
3,InhibitorInteraction,6832
4,AgonistInteraction,6337
5,AntagonistInteraction,4373
6,Target,3176
7,PeptideLigand,2276
8,ChannelBlockerInteraction,1073
9,AllostericModulatorInteraction,868



Relationship Types:


Unnamed: 0,relationshipType,count
0,hasTaxonomy,29692
1,hasReference,22955
2,hasLigand,21535
3,hasAction,21535
4,hasTarget,21069
5,hasUnits,19198
6,hasAffinity,19198
7,xref,14816
8,hasRef,8591
9,hasTargetFamily,3238



GraphRAG Namespace Statistics:


Unnamed: 0,labels,nodeCount
0,[GraphRAG_Content],1


## Create Namespaced Constraints

In [5]:
constraints = [
    """
    CREATE CONSTRAINT graphrag_content_id IF NOT EXISTS
    FOR (n:GraphRAG_Content) 
    REQUIRE n.id IS UNIQUE
    """,
    """
    CREATE CONSTRAINT graphrag_speaker_id IF NOT EXISTS
    FOR (n:GraphRAG_Speaker) 
    REQUIRE n.id IS UNIQUE
    """,
    """
    CREATE CONSTRAINT graphrag_topic_id IF NOT EXISTS
    FOR (n:GraphRAG_Topic) 
    REQUIRE n.id IS UNIQUE
    """
]

for constraint in constraints:
    try:
        conn.run_query(constraint)
        print(f"Successfully created constraint")
    except Exception as e:
        print(f"Error creating constraint: {str(e)}")

# Verify constraints
constraints_df = conn.run_query_to_df("""
SHOW CONSTRAINTS
WHERE name STARTS WITH 'graphrag'
""")

display(constraints_df)


INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE CONSTRAINT graphrag_content_id IF NOT EXISTS FOR (e:GraphRAG_Content) REQUIRE (e.id) IS UNIQUE` has no effect.} {description: `CONSTRAINT graphrag_content_id FOR (e:GraphRAG_Content) REQUIRE (e.id) IS UNIQUE` already exists.} {position: None} for query: '\n    CREATE CONSTRAINT graphrag_content_id IF NOT EXISTS\n    FOR (n:GraphRAG_Content) \n    REQUIRE n.id IS UNIQUE\n    '
INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE CONSTRAINT graphrag_speaker_id IF NOT EXISTS FOR (e:GraphRAG_Speaker) REQUIRE (e.id) IS UNIQUE` has no effect.} {description: `CONSTRAINT graphrag_speaker_id FOR (e:GraphRAG_Speaker) REQUIRE (e.id) IS UNIQUE` already exists.} {position

Successfully created constraint
Successfully created constraint


INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE CONSTRAINT graphrag_topic_id IF NOT EXISTS FOR (e:GraphRAG_Topic) REQUIRE (e.id) IS UNIQUE` has no effect.} {description: `CONSTRAINT graphrag_topic_id FOR (e:GraphRAG_Topic) REQUIRE (e.id) IS UNIQUE` already exists.} {position: None} for query: '\n    CREATE CONSTRAINT graphrag_topic_id IF NOT EXISTS\n    FOR (n:GraphRAG_Topic) \n    REQUIRE n.id IS UNIQUE\n    '


Successfully created constraint


Unnamed: 0,id,name,type,entityType,labelsOrTypes,properties,ownedIndex,propertyType
0,16,graphrag_content_id,UNIQUENESS,NODE,[GraphRAG_Content],[id],graphrag_content_id,
1,18,graphrag_speaker_id,UNIQUENESS,NODE,[GraphRAG_Speaker],[id],graphrag_speaker_id,
2,20,graphrag_topic_id,UNIQUENESS,NODE,[GraphRAG_Topic],[id],graphrag_topic_id,


## Create Namespaced Vector Index

In [6]:
vector_indexes = [
    """
    CREATE VECTOR INDEX graphrag_video_content IF NOT EXISTS
    FOR (n:GraphRAG_Content) 
    ON (n.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 3072,
            `vector.similarity_function`: 'cosine'
        }
    }
    """
]

for index in vector_indexes:
    try:
        conn.run_query(index)
        print(f"Successfully created vector index")
    except Exception as e:
        print(f"Error creating vector index: {str(e)}")

# Verify vector indexes
vector_indexes_df = conn.run_query_to_df("""
SHOW INDEXES
WHERE type = 'VECTOR' AND name STARTS WITH 'graphrag'
""")

display(vector_indexes_df)

INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE VECTOR INDEX graphrag_video_content IF NOT EXISTS FOR (e:GraphRAG_Content) ON (e.embedding) OPTIONS {indexConfig: {`vector.dimensions`: 3072, `vector.similarity_function`: "cosine"}}` has no effect.} {description: `VECTOR INDEX graphrag_video_content FOR (e:GraphRAG_Content) ON (e.embedding)` already exists.} {position: None} for query: "\n    CREATE VECTOR INDEX graphrag_video_content IF NOT EXISTS\n    FOR (n:GraphRAG_Content) \n    ON (n.embedding)\n    OPTIONS {\n        indexConfig: {\n            `vector.dimensions`: 3072,\n            `vector.similarity_function`: 'cosine'\n        }\n    }\n    "


Successfully created vector index


Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,21,graphrag_video_content,ONLINE,100.0,VECTOR,NODE,[GraphRAG_Content],[embedding],vector-2.0,,,0


## Create Namespaced Full-text Index

In [7]:
fulltext_indexes = [
    """
    CREATE FULLTEXT INDEX graphrag_video_text IF NOT EXISTS
    FOR (n:GraphRAG_Content)
    ON EACH [n.title, n.text]
    """
]

for index in fulltext_indexes:
    try:
        conn.run_query(index)
        print(f"Successfully created full-text index")
    except Exception as e:
        print(f"Error creating full-text index: {str(e)}")

# Verify full-text indexes
fulltext_indexes_df = conn.run_query_to_df("""
SHOW INDEXES
WHERE type = 'FULLTEXT' AND name STARTS WITH 'graphrag'
""")

display(fulltext_indexes_df)

INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE FULLTEXT INDEX graphrag_video_text IF NOT EXISTS FOR (e:GraphRAG_Content) ON EACH [e.title, e.text]` has no effect.} {description: `FULLTEXT INDEX graphrag_video_text FOR (e:GraphRAG_Content) ON EACH [e.title, e.text]` already exists.} {position: None} for query: '\n    CREATE FULLTEXT INDEX graphrag_video_text IF NOT EXISTS\n    FOR (n:GraphRAG_Content)\n    ON EACH [n.title, n.text]\n    '


Successfully created full-text index


Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,22,graphrag_video_text,ONLINE,100.0,FULLTEXT,NODE,[GraphRAG_Content],"[title, text]",fulltext-1.0,,2024-10-22T06:00:36.054000000+00:00,1


## Create Additional Namespaced Indexes

In [8]:
indexes = [
    """
    CREATE INDEX graphrag_content_title IF NOT EXISTS
    FOR (n:GraphRAG_Content) 
    ON (n.title)
    """,
    """
    CREATE INDEX graphrag_content_type IF NOT EXISTS
    FOR (n:GraphRAG_Content) 
    ON (n.type)
    """,
    """
    CREATE INDEX graphrag_speaker_name IF NOT EXISTS
    FOR (n:GraphRAG_Speaker) 
    ON (n.name)
    """
]

for index in indexes:
    try:
        conn.run_query(index)
        print(f"Successfully created index")
    except Exception as e:
        print(f"Error creating index: {str(e)}")

# Verify all indexes
all_indexes_df = conn.run_query_to_df("""
SHOW INDEXES
WHERE name STARTS WITH 'graphrag'
""")

display(all_indexes_df)

INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE RANGE INDEX graphrag_content_title IF NOT EXISTS FOR (e:GraphRAG_Content) ON (e.title)` has no effect.} {description: `RANGE INDEX graphrag_content_title FOR (e:GraphRAG_Content) ON (e.title)` already exists.} {position: None} for query: '\n    CREATE INDEX graphrag_content_title IF NOT EXISTS\n    FOR (n:GraphRAG_Content) \n    ON (n.title)\n    '
INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE RANGE INDEX graphrag_content_type IF NOT EXISTS FOR (e:GraphRAG_Content) ON (e.type)` has no effect.} {description: `RANGE INDEX graphrag_content_type FOR (e:GraphRAG_Content) ON (e.type)` already exists.} {position: None} for query: '\n    CREATE INDEX graphrag_

Successfully created index
Successfully created index


INFO:neo4j.notifications:Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE RANGE INDEX graphrag_speaker_name IF NOT EXISTS FOR (e:GraphRAG_Speaker) ON (e.name)` has no effect.} {description: `RANGE INDEX graphrag_speaker_name FOR (e:GraphRAG_Speaker) ON (e.name)` already exists.} {position: None} for query: '\n    CREATE INDEX graphrag_speaker_name IF NOT EXISTS\n    FOR (n:GraphRAG_Speaker) \n    ON (n.name)\n    '


Successfully created index


Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,15,graphrag_content_id,ONLINE,100.0,RANGE,NODE,[GraphRAG_Content],[id],range-1.0,graphrag_content_id,2024-10-22T06:00:35.103000000+00:00,1
1,23,graphrag_content_title,ONLINE,100.0,RANGE,NODE,[GraphRAG_Content],[title],range-1.0,,,0
2,24,graphrag_content_type,ONLINE,100.0,RANGE,NODE,[GraphRAG_Content],[type],range-1.0,,,0
3,17,graphrag_speaker_id,ONLINE,100.0,RANGE,NODE,[GraphRAG_Speaker],[id],range-1.0,graphrag_speaker_id,,0
4,25,graphrag_speaker_name,ONLINE,100.0,RANGE,NODE,[GraphRAG_Speaker],[name],range-1.0,,,0
5,19,graphrag_topic_id,ONLINE,100.0,RANGE,NODE,[GraphRAG_Topic],[id],range-1.0,graphrag_topic_id,,0
6,21,graphrag_video_content,ONLINE,100.0,VECTOR,NODE,[GraphRAG_Content],[embedding],vector-2.0,,,0
7,22,graphrag_video_text,ONLINE,100.0,FULLTEXT,NODE,[GraphRAG_Content],"[title, text]",fulltext-1.0,,2024-10-22T06:00:36.054000000+00:00,1


## Add Sample Test Data

In [17]:
def create_normalized_test_vector(dimensions=3072):
    """Create a normalized random vector for testing"""
    # Create a random vector (avoid all zeros)
    vector = np.random.random(dimensions)
    # Normalize to unit length
    norm = np.linalg.norm(vector)
    if norm == 0:
        vector = np.ones(dimensions)
        norm = np.sqrt(dimensions)
    return (vector / norm).tolist()

# Enhanced test queries with proper vector testing and namespacing
test_queries = [
    {
        "name": "Vector Index Test",
        "query": """
        CALL db.index.vector.queryNodes('graphrag_video_content', 3, $vector)
        YIELD node, score
        RETURN count(*) as count
        """,
        "params": {"vector": create_normalized_test_vector()}  # Changed from zeros to normalized vector
    },
    {
        "name": "Full-text Index Test",
        "query": """
        CALL db.index.fulltext.queryNodes('graphrag_video_text', 'test')
        YIELD node, score
        RETURN count(*) as count
        """
    },
    {
        "name": "GraphRAG Content Node Check",
        "query": """
        MATCH (n:GraphRAG_Content)
        RETURN 
            count(n) as totalContent,
            count(n.embedding) as withEmbedding,
            count(n.title) as withTitle,
            count(n.text) as withText
        """
    },
    {
        "name": "Namespaced Statistics",
        "query": """
        MATCH (n)
        WHERE any(label IN labels(n) WHERE label STARTS WITH 'GraphRAG_')
        RETURN 
            count(n) as totalNodes,
            collect(distinct labels(n)) as nodeTypes
        """
    }
]

for test in test_queries:
    print(f"\nRunning {test['name']}:")
    try:
        result_df = conn.run_query_to_df(
            test['query'], 
            parameters=test.get('params', {})
        )
        display(result_df)
    except Exception as e:
        print(f"Error: {str(e)}")


Running Vector Index Test:


Unnamed: 0,count
0,1



Running Full-text Index Test:


Unnamed: 0,count
0,1



Running GraphRAG Content Node Check:


Unnamed: 0,totalContent,withEmbedding,withTitle,withText
0,1,1,1,1



Running Namespaced Statistics:


Unnamed: 0,totalNodes,nodeTypes
0,1,[[GraphRAG_Content]]


## Verify Database Setup

In [20]:
def create_normalized_test_vector(dimensions=3072):
    """Create a normalized random vector for testing"""
    # Create a random vector (avoid all zeros)
    vector = np.random.random(dimensions)
    # Normalize to unit length
    norm = np.linalg.norm(vector)
    if norm == 0:
        vector = np.ones(dimensions)
        norm = np.sqrt(dimensions)
    return (vector / norm).tolist()

# Enhanced test queries with proper vector testing and namespacing
test_queries = [
    {
        "name": "Vector Index Test",
        "query": """
        CALL db.index.vector.queryNodes('graphrag_video_content', 3, $vector)
        YIELD node, score
        RETURN count(*) as count
        """,
        "params": {"vector": create_normalized_test_vector()}  # Changed here
    },
    {
        "name": "Full-text Index Test",
        "query": """
        CALL db.index.fulltext.queryNodes('graphrag_video_text', 'test')
        YIELD node, score
        RETURN count(*) as count
        """
    },
    {
        "name": "GraphRAG Content Node Check",
        "query": """
        MATCH (n:GraphRAG_Content)
        RETURN 
            count(n) as totalContent,
            count(n.embedding) as withEmbedding,
            count(n.title) as withTitle,
            count(n.text) as withText
        """
    },
    {
        "name": "Namespaced Statistics",
        "query": """
        MATCH (n)
        WHERE any(label IN labels(n) WHERE label STARTS WITH 'GraphRAG_')
        RETURN 
            count(n) as totalNodes,
            collect(distinct labels(n)) as nodeTypes
        """
    }
]

for test in test_queries:
    print(f"\nRunning {test['name']}:")
    try:
        result_df = conn.run_query_to_df(
            test['query'], 
            parameters=test.get('params', {})
        )
        display(result_df)
    except Exception as e:
        print(f"Error: {str(e)}")


Running Vector Index Test:


Unnamed: 0,count
0,1



Running Full-text Index Test:


Unnamed: 0,count
0,1



Running GraphRAG Content Node Check:


Unnamed: 0,totalContent,withEmbedding,withTitle,withText
0,1,1,1,1



Running Namespaced Statistics:


Unnamed: 0,totalNodes,nodeTypes
0,1,[[GraphRAG_Content]]


## Cleanup

In [11]:
conn.close()
print("Database connection closed.")

Database connection closed.



## Next Steps

Your Neo4j Aura database is now configured with a namespaced setup for the GraphRAG platform. Key points:

1. All GraphRAG objects are prefixed with `GraphRAG_` to isolate them from existing data
2. Existing database data remains untouched
3. Vector index is configured for OpenAI text-embedding-3-large (3072 dimensions)
4. All indexes and constraints are namespaced

To use this setup:

1. Update your GraphRAG configuration:
```python
config = GraphRAGConfig(
    neo4j_uri=NEO4J_URI,
    neo4j_username=NEO4J_USERNAME,
    neo4j_password=NEO4J_PASSWORD,
    vector_index_name="graphrag_video_content",
    fulltext_index_name="graphrag_video_text"
)
```

2. Ensure all queries use the namespaced labels:
   - GraphRAG_Content instead of Content
   - GraphRAG_Speaker instead of Speaker
   - GraphRAG_Topic instead of Topic

Remember to update the vector index dimensions if you change the embedding model:
- text-embedding-3-large: 3072 dimensions
- text-embedding-3-small: 1536 dimensions
- text-embedding-ada-002: 1536 dimensions
