# Create Vector Index 

This notebook creates a vector index on the embeddings stored in Neo4j. The index can be used later on for vector search in the database. The vector index is documented here: [Neo4j Vector Index and Search](https://neo4j.com/labs/genai-ecosystem/vector-search/).

Again, the queries are executed with the [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current/), which makes it possible to query Neo4j from a Python script.

In [33]:
import pandas as pd
from neo4j import Query, GraphDatabase, RoutingControl, Result
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from json import loads, dumps

## Get Credentials

In [5]:
env_file = 'credentials.env'

In [6]:
if os.path.exists(env_file):
    load_dotenv(env_file, override=True)

    # Neo4j
    HOST = os.getenv('NEO4J_URI')
    USERNAME = os.getenv('NEO4J_USERNAME')
    PASSWORD = os.getenv('NEO4J_PASSWORD')
    DATABASE = os.getenv('NEO4J_DATABASE')

    # AI
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    os.environ['OPENAI_API_KEY']=OPENAI_API_KEY
    LLM = os.getenv('LLM')
    EMBEDDINGS_MODEL = os.getenv('EMBEDDINGS_MODEL')
else:
    print(f"File {env_file} not found.")
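
The cell above reads every setting with `os.getenv`, which returns `None` for a missing variable, and assigning `None` to `os.environ` raises a `TypeError`. A minimal sketch of an up-front check is shown here. The helper name `missing_settings` and the choice of which settings count as required are assumptions for illustration, not part of the original notebook.

```python
import os

# Setting names come from the cell above; treating these four as required is an assumption.
REQUIRED_SETTINGS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or empty in the mapping `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_settings(os.environ)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running a check like this right after `load_dotenv` gives one readable message instead of a failure later in the notebook.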

## Setup Connection to Database

Connect to the Neo4j database.

In [7]:
driver = GraphDatabase.driver(
    HOST,
    auth=(USERNAME, PASSWORD)
)

Test the connection by counting all nodes in the database.

In [8]:
driver.execute_query(
    """
    MATCH (n) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

Unnamed: 0,Count
0,773


## Create Vector Index 

Convert the embeddings to a vector property with the [setNodeVectorProperty](https://neo4j.com/docs/operations-manual/5/reference/procedures/#procedure_db_create_setNodeVectorProperty) procedure. The chunks are counted first so that the update can run in batches.

In [9]:
no_chunks = driver.execute_query(
    """
    MATCH (n:Chunk) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

In [10]:
no_chunks

Unnamed: 0,Count
0,610


In [11]:
batch_size = 100
nr_batches = int(no_chunks.iloc[0].Count / batch_size) + 1
print(f'Running {nr_batches} batches with size {batch_size}')

Running 7 batches with size 100
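
The batch count above uses `int(count / batch_size) + 1`, which adds an empty extra batch whenever the count is an exact multiple of the batch size. A small sketch of computing the inclusive id ranges directly follows. It assumes chunk ids are consecutive integers starting at 1, which the notebook does not state explicitly, and `batch_ranges` is a hypothetical helper name.

```python
def batch_ranges(total, batch_size, start=1):
    """Yield inclusive (low, high) id ranges covering `total` consecutive ids.

    Assumes ids run from `start` to `start + total - 1` without gaps.
    """
    last = start + total - 1
    for low in range(start, last + 1, batch_size):
        yield low, min(low + batch_size - 1, last)


ranges = list(batch_ranges(610, 100))
print(len(ranges), ranges[0], ranges[-1])
```

With 610 chunks and a batch size of 100 this still gives 7 ranges, and the last one stops at id 610 rather than 700.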


In [12]:
for batch in range(nr_batches):
    # Pass the id range as query parameters instead of formatting it into the Cypher text
    query = """
        MATCH (c:Chunk)
        WHERE c.id >= $low AND c.id <= $high
        CALL db.create.setNodeVectorProperty(c, "embedding", c.embedding)
        RETURN count(c) AS propertySetCount
    """

    driver.execute_query(
        query,
        low=(batch * batch_size) + 1,
        high=(batch + 1) * batch_size,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
    )

    # Report progress every 10 batches
    if batch % 10 == 0 and batch != 0:
        print(f"Finished: {batch}/{nr_batches} batches ({round(batch/nr_batches*100,2)}%)")

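Create the [vector index](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#create-vector-index) in Neo4j. The dimension must match the length of the stored embeddings (1536 here), and similarity is measured with the cosine function.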

In [13]:
query = """
    CREATE VECTOR INDEX `chunk-embeddings` IF NOT EXISTS
    FOR (c:Chunk) ON (c.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
        } 
    }
"""

In [14]:
driver.execute_query(
    query,
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x15ac5bd10>, keys=[])
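
The index above is configured with the `cosine` similarity function. As a rough illustration of what that measures, the sketch below computes cosine similarity in plain Python. The `normalized_score` mapping onto the range 0 to 1 is an assumption about how Neo4j reports cosine scores and should be checked against the Neo4j documentation; both function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1].

    Assumption: this mirrors how the index reports cosine scores.
    """
    return (1 + cosine_similarity(a, b)) / 2


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(normalized_score([1.0, 0.0], [0.0, 1.0]))
```

Under this mapping, orthogonal vectors land in the middle of the score range rather than at zero.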

In [15]:
schema_result_df = driver.execute_query(
    'SHOW INDEXES',
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,6,chunk-embeddings,ONLINE,100.0,VECTOR,NODE,[Chunk],[embedding],vector-2.0,,2025-05-07T13:19:43.793000000+00:00,58
1,0,index_343aff4e,ONLINE,100.0,LOOKUP,NODE,,,token-lookup-1.0,,2025-05-12T14:18:16.988000000+00:00,2307
2,1,index_f7700477,ONLINE,100.0,LOOKUP,RELATIONSHIP,,,token-lookup-1.0,,2025-05-12T12:24:46.937000000+00:00,13
3,4,unique_chunk,ONLINE,100.0,RANGE,NODE,[Chunk],[id],range-1.0,unique_chunk,2025-05-12T14:36:57.804000000+00:00,5978
4,2,unique_document,ONLINE,100.0,RANGE,NODE,[Document],[id],range-1.0,unique_document,2025-05-12T12:04:21.073000000+00:00,1492
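
The `SHOW INDEXES` output above can also be checked programmatically before running a search. A minimal sketch follows, working on plain dictionaries so it runs without a database. `index_is_ready` is a hypothetical helper, and the sample rows are copied from the output above and reduced to the relevant columns.

```python
def index_is_ready(rows, name):
    """Return True if an index called `name` is ONLINE and fully populated."""
    return any(
        row["name"] == name
        and row["state"] == "ONLINE"
        and row["populationPercent"] == 100.0
        for row in rows
    )


# Two rows taken from the SHOW INDEXES output, reduced to three columns.
rows = [
    {"name": "chunk-embeddings", "state": "ONLINE", "populationPercent": 100.0},
    {"name": "unique_chunk", "state": "ONLINE", "populationPercent": 100.0},
]
print(index_is_ready(rows, "chunk-embeddings"))
```

In the notebook, the rows could come from `schema_result_df.to_dict("records")`.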


## Experiment with the Vector Search

In [16]:
embedding_model = OpenAIEmbeddings(
    model=EMBEDDINGS_MODEL,
    openai_api_key=OPENAI_API_KEY
)

In [17]:
embedding_model.model

'text-embedding-ada-002'

In [21]:
message = "rood staan"  # Dutch for "being overdrawn"

In [22]:
message_vector = embedding_model.embed_query(message)
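
The query vector must have the same number of dimensions as the index (`vector.dimensions` is 1536 in the index definition). A small sketch of a guard is shown below. `check_dimensions` is a hypothetical helper, and the zero vector is only a stand-in for a real embedding.

```python
EXPECTED_DIMENSIONS = 1536  # value of `vector.dimensions` in the index definition


def check_dimensions(vector, expected=EXPECTED_DIMENSIONS):
    """Return `vector` unchanged, or raise if its length does not match the index."""
    if len(vector) != expected:
        raise ValueError(f"Expected {expected} dimensions, got {len(vector)}")
    return vector


# Stand-in for an embedding; in the notebook this would be `message_vector`.
stand_in_vector = [0.0] * 1536
print(len(check_dimensions(stand_in_vector)))
```

A check like this fails early with a clear message if `EMBEDDINGS_MODEL` is ever switched to a model with a different output size.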

In [34]:
similarity_query = """
    CALL db.index.vector.queryNodes("chunk-embeddings", 5, $message_vector) YIELD node, score
    WITH node AS chunk, score
    MATCH (d:Document)<-[:PART_OF]-(chunk)
    RETURN score, d.file_name AS file_name, chunk.id AS chunk_id, chunk.page AS page, chunk.chunk AS chunk
    ORDER BY score DESC
"""

In [35]:
results_df = driver.execute_query(
    similarity_query,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    message_vector=message_vector,
    result_transformer_= lambda r: r.to_df()
)

In [36]:
results_df

Unnamed: 0,score,file_name,chunk_id,page,chunk
0,0.908554,Rabo Beleggersrekening Terms 2020.pdf,557,9,"Pagina 9/24\nAlgemene voorwaarden 2020, Rabo B..."
1,0.906113,Payment and Online Services Terms Sept 2022.pdf,456,61,ongeoorloofd roodstaan. \nHet bedrag van de o...
2,0.896973,Rabo Beleggersrekening Terms 2020.pdf,609,24,www.rabobank.nl
3,0.896454,Rabo Beleggersrekening Terms 2020.pdf,596,20,bestaan. Verdere informatie staat op www.rabob...
4,0.895401,Payment and Online Services Terms Sept 2022.pdf,455,61,rente wijzigt.\n173. U betaalt een maandbedrag...
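
The index lookup above returns the 5 chunks nearest to the query vector. To illustrate the idea on a small scale, the sketch below runs an exact brute-force nearest-neighbour search over toy vectors. It is not how the Neo4j index works internally, and the vectors and the `top_k` helper are made up for illustration.

```python
import heapq
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def top_k(query, vectors, k):
    """Return the k (key, similarity) pairs most similar to `query`, best first.

    `vectors` maps a key to a vector; similarity is the dot product of unit vectors.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), key)
        for key, vec in vectors.items()
    )
    return [(key, score) for score, key in heapq.nlargest(k, scored)]


# Toy data, not taken from the notebook.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for key, score in top_k([1.0, 0.1], toy_vectors, 2):
    print(key, round(score, 3))
```

A brute-force scan compares the query against every stored vector, which is fine for a few hundred chunks but is the cost a vector index is meant to avoid on larger collections.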


In [42]:
results = dumps(loads(results_df.to_json(orient="records")), indent=2)
print(results)

[
  {
    "score": 0.9085540771,
    "file_name": "Rabo Beleggersrekening Terms 2020.pdf",
    "chunk_id": 557,
    "page": 9,
    "chunk": "Pagina 9/24\nAlgemene voorwaarden 2020, Rabo BeleggersRekening  Febru ari 2020\n13. Zonder effectenkrediet rood staan op de beleggersrekening\n 1 U mag niet rood staan op de beleggersr\nekening als u geen effectenkrediet heeft.\n 2 \n U betaalt rent\ne als u zonder effectenkrediet rood staat op de beleggersrekening. Deze rente is variabel. En deze \nnoemen wij \u2018debetrente bij ongeoorloofde roodstand\u2019 . Wij kunnen deze rente altijd wijzigen. Kijk in artikel 9 van \nhoofdstuk 4 voor hoe wij deze rente berekenen. Wij kunnen met u afspreken dat hierbij een andere opslag geldt \ndan de opslag die wij gebruiken om de debetrente over een effectenkrediet te berekenen.\n 3 \n Bij een ongeoorloof\nde roodstand bent u zonder dat een ingebrekestelling nodig is, in verzuim en is het bedrag dat \nu rood staat onmiddellijk opeisbaar. Dit betekent dat u