# Create Vector Index 

This notebook creates a vector index on the embeddings stored in Neo4j. The index is used later for vector search in the database. The vector index is documented here: [Neo4j Vector Index and Search](https://neo4j.com/labs/genai-ecosystem/vector-search/).

As before, the queries are executed through the [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current/), which enables querying Neo4j from a Python script.

In [27]:
import pandas as pd
from neo4j import Query, GraphDatabase, RoutingControl, Result
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from json import loads, dumps

## Get Credentials

In [28]:
env_file = 'credentials.env'

In [29]:
if os.path.exists(env_file):
    load_dotenv(env_file, override=True)

    # Neo4j
    HOST = os.getenv('NEO4J_URI')
    USERNAME = os.getenv('NEO4J_USERNAME')
    PASSWORD = os.getenv('NEO4J_PASSWORD')
    DATABASE = os.getenv('NEO4J_DATABASE')

    # AI
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    os.environ['OPENAI_API_KEY']=OPENAI_API_KEY
    LLM = os.getenv('LLM')
    EMBEDDINGS_MODEL = os.getenv('EMBEDDINGS_MODEL')
else:
    print(f"File {env_file} not found.")

## Setup Connection to Database

Connect to the Neo4j database.

In [30]:
driver = GraphDatabase.driver(
    HOST,
    auth=(USERNAME, PASSWORD)
)

Test the connection by counting all nodes in the database.

In [31]:
driver.execute_query(
    """
    MATCH (n) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

Unnamed: 0,Count
0,1494


## Create Vector Index 

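Convert the embeddings to a vector property with the [setNodeVectorProperty](https://neo4j.com/docs/operations-manual/5/reference/procedures/#procedure_db_create_setNodeVectorProperty) procedure. The chunks are processed in batches selected by `c.id` range, so the loop only covers chunks whose IDs fall between 1 and `nr_batches * batch_size`.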

In [32]:
no_chunks = driver.execute_query(
    """
    MATCH (n:Chunk) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

In [33]:
no_chunks

Unnamed: 0,Count
0,610


In [34]:
batch_size = 100
nr_batches = (int(no_chunks.iloc[0].Count) + batch_size - 1) // batch_size  # ceiling division, avoids an empty extra batch
print(f'Running {nr_batches} batches with size {batch_size}')

Running 7 batches with size 100


In [35]:
for batch in range(nr_batches):
    query = f"""
        MATCH(c:Chunk)
        WHERE c.id >= {(batch*batch_size)+1} AND c.id <= {(batch+1)*batch_size}
        CALL db.create.setNodeVectorProperty(c, "embedding", c.embedding)
        RETURN count(c) AS propertySetCount
    """

    driver.execute_query(
        query,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
    )
    
    if batch % 10 == 0 and batch != 0:
        print(f"Finished: {batch}/{nr_batches} batches ({round(batch/nr_batches*100,2)}%)")
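
The loop above formats the ID bounds directly into the Cypher text. As a minimal sketch of an alternative, the ranges can be computed up front and passed as query parameters. It relies on the same assumption the loop makes, namely that `c.id` holds integers starting at 1. The names `id_ranges`, `SET_VECTOR_QUERY`, `$start_id`, and `$end_id` are hypothetical and not part of the original notebook.

```python
def id_ranges(total, batch_size):
    """Return inclusive (start, end) ID ranges covering 1..total."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_batches = (total + batch_size - 1) // batch_size  # ceiling division
    return [
        (b * batch_size + 1, min((b + 1) * batch_size, total))
        for b in range(n_batches)
    ]


# Hypothetical parameterized variant of the batch query: the bounds are
# passed as $start_id / $end_id instead of being formatted into the text.
SET_VECTOR_QUERY = """
    MATCH (c:Chunk)
    WHERE c.id >= $start_id AND c.id <= $end_id
    CALL db.create.setNodeVectorProperty(c, "embedding", c.embedding)
    RETURN count(c) AS propertySetCount
"""

# 610 chunks in batches of 100, as in the cells above.
for start_id, end_id in id_ranges(610, 100):
    print(start_id, end_id)
```

Each pair could then be supplied as keyword arguments, for example `driver.execute_query(SET_VECTOR_QUERY, start_id=start_id, end_id=end_id, database_=DATABASE, routing_=RoutingControl.WRITE)`, in the same way `message_vector` is passed to the similarity query later in this notebook.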

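Create the [vector index](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#create-vector-index) in Neo4j. The index is configured for 1536-dimensional vectors, which is the output size of `text-embedding-ada-002`, and uses cosine similarity.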

In [36]:
query = """
    CREATE VECTOR INDEX `chunk-embeddings` IF NOT EXISTS
    FOR (c:Chunk) ON (c.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
        } 
    }
"""

In [37]:
driver.execute_query(
    query,
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x16739e390>, keys=[])

In [38]:
schema_result_df = driver.execute_query(
    'SHOW INDEXES',
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,6,chunk-embeddings,ONLINE,100.0,VECTOR,NODE,[Chunk],[embedding],vector-2.0,,2025-05-13T10:14:40.648000000+00:00,62
1,0,index_343aff4e,ONLINE,100.0,LOOKUP,NODE,,,token-lookup-1.0,,2025-05-13T13:52:42.707000000+00:00,3692
2,1,index_f7700477,ONLINE,100.0,LOOKUP,RELATIONSHIP,,,token-lookup-1.0,,2025-05-12T12:24:46.937000000+00:00,13
3,4,unique_chunk,ONLINE,100.0,RANGE,NODE,[Chunk],[id],range-1.0,unique_chunk,2025-05-13T12:31:28.842000000+00:00,10865
4,2,unique_document,ONLINE,100.0,RANGE,NODE,[Document],[id],range-1.0,unique_document,2025-05-13T12:31:28.842000000+00:00,2742


## Experiment with the Vector Search

In [39]:
embedding_model = OpenAIEmbeddings(
    model=EMBEDDINGS_MODEL,
    openai_api_key=OPENAI_API_KEY
)

In [40]:
embedding_model.model

'text-embedding-ada-002'

In [41]:
similarity_query = """ 
    CALL db.index.vector.queryNodes("chunk-embeddings", 5, $message_vector) YIELD node, score
    WITH node as chunk, score ORDER BY score DESC
    MATCH (d:Document)<-[:PART_OF]-(chunk)
    RETURN score, d.file_name as file_name, chunk.id as chunk_id, chunk.page as page, chunk.chunk_eng AS chunk
"""

In [42]:
message = "How can I use the online services safely?"

In [43]:
message_vector = embedding_model.embed_query(message)
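
The index was created with `vector.dimensions` set to 1536, so the query vector should have the same length. A minimal sketch of a guard that could run before the search is shown below. The helper `check_dimensions` and the constant `EXPECTED_DIMENSIONS` are hypothetical names, and the example vector is a stand-in for `message_vector`.

```python
EXPECTED_DIMENSIONS = 1536  # value of `vector.dimensions` in the index definition


def check_dimensions(vector, expected=EXPECTED_DIMENSIONS):
    """Return the vector unchanged, or raise ValueError on a length mismatch."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, index expects {expected}"
        )
    return vector


# Stand-in vector for illustration; in the notebook this would be message_vector.
example_vector = [0.0] * 1536
check_dimensions(example_vector)
```

A check like this is mostly useful when `EMBEDDINGS_MODEL` is changed in the credentials file, because a different model may produce vectors of a different length than the index expects.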

In [44]:
results_df = driver.execute_query(
    similarity_query,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    message_vector=message_vector,
    result_transformer_= lambda r: r.to_df()
)

In [45]:
results_df

Unnamed: 0,score,file_name,chunk_id,page,chunk
0,0.925171,Terms & Conditions for Online Business Service...,128,14,Page | 14\n\nSection 6: How to use our Online ...
1,0.922913,Terms & Conditions for Online Business Service...,132,15,Page | 15\n\niv. switch to using any other rec...
2,0.922852,Terms & Conditions for Online Business Service...,133,15,"6.3. Using Internet, telecommunication and/or ..."
3,0.919052,Terms & Conditions for Online Business Service...,100,4,user. \n\nPlease ensure that you keep your Sec...
4,0.918594,Payment and Online Services Terms Sept 2022.pdf,243,17,18 Terms and Conditions for Payment and Online...
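
The `score` column comes from the `cosine` similarity function configured on the index. As far as the Neo4j documentation describes it, the raw cosine is rescaled to the range [0, 1] as (1 + cos) / 2. Treat that normalization as an assumption to verify against the Neo4j version in use. A minimal pure-Python sketch with hypothetical helper names:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def normalized_score(u, v):
    # Assumed normalization: maps a cosine in [-1, 1] to a score in [0, 1].
    return (1 + cosine_similarity(u, v)) / 2


print(normalized_score([1.0, 0.0], [1.0, 0.0]))  # same direction
print(normalized_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Under that assumption, a score of 0.925 would correspond to a raw cosine of roughly 0.85, so scores close together near the top of the range can still reflect noticeable differences in similarity.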


In [46]:
results = dumps(loads(results_df.to_json(orient="records")), indent=2)
print(results)

[
  {
    "score": 0.9251708984,
    "file_name": "Terms & Conditions for Online Business Services - April 2024.pdf",
    "chunk_id": 128,
    "page": 14,
    "chunk": "Page | 14\n\nSection 6: How to use our Online Services safely\n\n6.1. What you need to do\n\na. Before using any Online Services you and each User must make sure that the Devices are (i) compatible with the Online Services, (ii) free of any viruses, (iii) adequately protected by installing up-to-date security, anti-virus and other software and (iv) adequately locked with a Security Code or other appropriate security measure, if possible.  \nb. It is essential that you and each User take all reasonable steps to protect all Security Resources and Security Codes including by:  \ni. not keeping a written record of any Security Code or, if you do keep a record of a Security Code, keeping it in a secure place separate from any other Security Code or Security Resource and anything which may identify you or your accounts;  \nii

In [47]:
message = "Can I enable push notifications?"

In [48]:
message_vector = embedding_model.embed_query(message)

In [49]:
results_df = driver.execute_query(
    similarity_query,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    message_vector=message_vector,
    result_transformer_= lambda r: r.to_df()
)

In [50]:
results_df

Unnamed: 0,score,file_name,chunk_id,page,chunk
0,0.891418,Payment and Online Services Terms Sept 2022.pdf,185,4,You can often do this by setting it up in Rabo...
1,0.887299,Terms & Conditions for Online Business Service...,140,17,"us, you must ensure that your Device is set up..."
2,0.884811,Terms & Conditions for Online Business Service...,139,17,iii. creating security interests; \niv. check...
3,0.879608,Payment and Online Services Terms Sept 2022.pdf,184,4,contact us through one of the reporting points...
4,0.876083,Payment and Online Services Terms Sept 2022.pdf,320,34,information about this can be found in the rat...


In [51]:
results = dumps(loads(results_df.to_json(orient="records")), indent=2)
print(results)

[
  {
    "score": 0.891418457,
    "file_name": "Payment and Online Services Terms Sept 2022.pdf",
    "chunk_id": 185,
    "page": 4,
    "chunk": "You can often do this by setting it up in Rabo Online Banking. If you set up to receive push notifications from us, you must ensure that your device is set up in such a way that you actually see them.  \n3. We send these notifications without further codes and other security measures. You are responsible for keeping the content of the notification confidential.  \n5. What applies to the information we provide you?  \n1. We provide you with information about:  \na the use of the account, for example about amounts that are credited and debited  \nb the use of a credit if we have given it to you  \nc the use of debit cards, digital cards, and credit cards  \nd other (bank) services determined by us, such as payment requests and online services like iDIN.  \n2. We can do this on paper or via an online service. If we have agreed with you, you 