# Create Vector Index 

This notebook creates a vector index on top of the embeddings stored in Neo4j. The index is used later for vector search in the database. The Neo4j documentation for vector indexes is here: [Neo4j Vector Index and Search](https://neo4j.com/labs/genai-ecosystem/vector-search/).

As before, the queries are executed with the [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current/), which lets a Python script query Neo4j.

In [1]:
import pandas as pd
from neo4j import Query, GraphDatabase, RoutingControl, Result
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from json import loads, dumps

## Get Credentials

In [2]:
env_file = 'credentials.env'

In [3]:
if os.path.exists(env_file):
    load_dotenv(env_file, override=True)

    # Neo4j
    HOST = os.getenv('NEO4J_URI')
    USERNAME = os.getenv('NEO4J_USERNAME')
    PASSWORD = os.getenv('NEO4J_PASSWORD')
    DATABASE = os.getenv('NEO4J_DATABASE')

    # AI
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    os.environ['OPENAI_API_KEY']=OPENAI_API_KEY
    LLM = os.getenv('LLM')
    EMBEDDINGS_MODEL = os.getenv('EMBEDDINGS_MODEL')
else:
    print(f"File {env_file} not found.")
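If one of the settings is missing from `credentials.env`, `os.getenv` returns `None`, and assigning `None` to `os.environ['OPENAI_API_KEY']` raises a `TypeError`. A minimal sketch of an up-front check follows. The helper `missing_env_vars` and the `REQUIRED_VARS` list are hypothetical and not part of the original notebook; the variable names are the ones read in the cell above.

```python
import os

# Variable names as read in the credentials cell; the list itself is an assumption.
REQUIRED_VARS = [
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "OPENAI_API_KEY",
    "EMBEDDINGS_MODEL",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running this right after `load_dotenv` reports every missing setting at once, before any connection is attempted.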

## Setup Connection to Database

Connect to the Neo4j database.

In [4]:
driver = GraphDatabase.driver(
    HOST,
    auth=(USERNAME, PASSWORD)
)

Test the connection

In [5]:
driver.execute_query(
    """
    MATCH (n) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

Unnamed: 0,Count
0,1419


## Create Vector Index 

Count the `Chunk` nodes, then convert their embeddings to a vector property in batches with the [`db.create.setNodeVectorProperty`](https://neo4j.com/docs/operations-manual/5/reference/procedures/#procedure_db_create_setNodeVectorProperty) procedure.

In [6]:
no_chunks = driver.execute_query(
    """
    MATCH (n:Chunk) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

In [7]:
no_chunks

Unnamed: 0,Count
0,610


In [8]:
batch_size = 100
nr_batches = int(no_chunks.iloc[0].Count / batch_size) + 1
print(f'Running {nr_batches} batches with size {batch_size}')

Running 7 batches with size 100
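The batch count above is `int(count / batch_size) + 1`, which schedules one extra, empty batch whenever the count is an exact multiple of the batch size. A small sketch using ceiling division instead is shown below. The helper `batch_ranges` is hypothetical, and it makes the same assumption as the notebook: that chunk ids run contiguously from 1 to the chunk count.

```python
import math


def batch_ranges(total, batch_size):
    """Yield inclusive (first_id, last_id) pairs covering ids 1..total."""
    n_batches = math.ceil(total / batch_size)
    for batch in range(n_batches):
        first_id = batch * batch_size + 1
        last_id = min((batch + 1) * batch_size, total)
        yield first_id, last_id


ranges = list(batch_ranges(610, 100))
print(len(ranges), ranges[0], ranges[-1])
# prints: 7 (1, 100) (601, 610)
```

For the 610 chunks here the result is still 7 batches; for exactly 600 chunks it would be 6 rather than 7.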


In [9]:
for batch in range(nr_batches):
    query = f"""
        MATCH(c:Chunk)
        WHERE c.id >= {(batch*batch_size)+1} AND c.id <= {(batch+1)*batch_size}
        CALL db.create.setNodeVectorProperty(c, "embedding", c.embedding)
        RETURN count(c) AS propertySetCount
    """

    driver.execute_query(
        query,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
    )
    
    if batch % 10 == 0 and batch != 0:
        print(f"Finished: {batch}/{nr_batches} batches ({round(batch/nr_batches*100,2)}%)")
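The loop above interpolates the id bounds into the Cypher text with an f-string. An alternative sketch passes them as query parameters, so the query text stays constant across batches. The names `BATCH_QUERY`, `batch_parameters`, `$first_id` and `$last_id` are illustrative and not from the original notebook; the driver call is left as a comment because it needs a live connection.

```python
# Same statement as in the loop, with the id bounds as parameters.
BATCH_QUERY = """
    MATCH (c:Chunk)
    WHERE c.id >= $first_id AND c.id <= $last_id
    CALL db.create.setNodeVectorProperty(c, "embedding", c.embedding)
    RETURN count(c) AS propertySetCount
"""


def batch_parameters(batch, batch_size):
    """Return the inclusive id bounds for a zero-based batch number."""
    return {
        "first_id": batch * batch_size + 1,
        "last_id": (batch + 1) * batch_size,
    }


for batch in range(3):
    params = batch_parameters(batch, 100)
    # With a live connection this would be:
    # driver.execute_query(BATCH_QUERY, parameters_=params, database_=DATABASE)
    print(params)
```

Keeping values out of the query string also avoids quoting problems if the bounds ever come from untrusted input.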

Create the [vector index](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#create-vector-index) on the `Chunk` embeddings in Neo4j. The dimension of 1536 must match the length of the stored embedding vectors.

In [10]:
query = """
    CREATE VECTOR INDEX `chunk-embeddings` IF NOT EXISTS
    FOR (c:Chunk) ON (c.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
        } 
    }
"""

In [11]:
driver.execute_query(
    query,
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x1528f7610>, keys=[])
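The index is configured with the `cosine` similarity function. As a rough illustration of what that measures, the sketch below computes cosine similarity in plain Python. The `normalized_score` helper reflects the mapping to the range 0 to 1 that, to my understanding of the Neo4j documentation, is applied to cosine scores; treat that formula as an assumption and check it against the documentation for your Neo4j version.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed Neo4j convention)."""
    return (1 + cosine_similarity(a, b)) / 2


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(normalized_score([1.0, 0.0], [0.0, 1.0]))
```

Under this mapping, orthogonal vectors score 0.5 and identical directions score 1.0.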

Create the [vector index](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#create-vector-index) on the `Definition` embeddings in Neo4j.

In [12]:
query = """
    CREATE VECTOR INDEX `definition-embeddings` IF NOT EXISTS
    FOR (def:Definition) ON (def.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
        } 
    }
"""

In [13]:
driver.execute_query(
    query,
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x1528fa5d0>, keys=[])

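List the indexes to check the result. Immediately after creation, both vector indexes are still in the `POPULATING` state.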

In [14]:
schema_result_df  = driver.execute_query(
    'SHOW INDEXES',
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,6,chunk-embeddings,POPULATING,0.0,VECTOR,NODE,[Chunk],[embedding],vector-2.0,,,
1,7,definition-embeddings,POPULATING,0.0,VECTOR,NODE,[Definition],[embedding],vector-2.0,,,
2,0,index_343aff4e,ONLINE,100.0,LOOKUP,NODE,,,token-lookup-1.0,,2025-05-15T09:54:11.633000000+00:00,6929.0
3,1,index_f7700477,ONLINE,100.0,LOOKUP,RELATIONSHIP,,,token-lookup-1.0,,2025-05-15T06:38:36.692000000+00:00,120.0
4,2,unique_chunk,ONLINE,100.0,RANGE,NODE,[Chunk],[id],range-1.0,unique_chunk,2025-05-15T09:24:03.456000000+00:00,2440.0
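Both vector indexes show `POPULATING` with 0.0 percent in this output, so they may not be usable yet. One option is to poll until they report `ONLINE` before running searches. The sketch below is a generic polling loop. `wait_until_online` and its `get_states` argument are hypothetical: `get_states` stands for any callable that returns a `{index_name: state}` mapping, for example one built from the `SHOW INDEXES` DataFrame with `dict(zip(df.name, df.state))`. The loop is demonstrated here with canned snapshots instead of a live database.

```python
import time


def wait_until_online(get_states, index_names, timeout=60.0, interval=1.0,
                      sleep=time.sleep):
    """Poll until every named index is ONLINE.

    Returns True on success and False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = get_states()
        if all(states.get(name) == "ONLINE" for name in index_names):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Demonstration with fake snapshots; no database is contacted.
snapshots = iter([
    {"chunk-embeddings": "POPULATING"},
    {"chunk-embeddings": "ONLINE"},
])
print(wait_until_online(lambda: next(snapshots), ["chunk-embeddings"],
                        sleep=lambda seconds: None))
# prints: True
```

Neo4j also provides a `db.awaitIndexes` procedure that blocks on the server side; the client-side loop is just one way to make the wait explicit in the notebook.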


## Experiment with the Vector Search

Define the Embeddings Model

In [15]:
embedding_model = OpenAIEmbeddings(
    model=EMBEDDINGS_MODEL,
    openai_api_key=OPENAI_API_KEY
)

In [16]:
embedding_model.model

'text-embedding-ada-002'

Define the function to perform a Vector Search

In [17]:
def vector_search_chunks(message, nn=5, embedding_model=embedding_model):
    message_vector = embedding_model.embed_query(message)
    similarity_query = """ 
        CALL db.index.vector.queryNodes("chunk-embeddings", $nn, $message_vector) YIELD node, score
        WITH node AS chunk, score
        MATCH (d:Document)<-[:PART_OF]-(chunk)
        RETURN score, d.file_name AS file_name, chunk.id AS chunk_id, chunk.page AS page, chunk.chunk_eng AS chunk
        ORDER BY score DESC
    """
    results_df = driver.execute_query(
        similarity_query,
        database_=DATABASE,
        routing_=RoutingControl.READ,
        message_vector=message_vector,
        nn = nn, 
        result_transformer_= lambda r: r.to_df()
    )
    return results_df
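Conceptually, the search function embeds the question and asks the index for the `nn` stored vectors most similar to it. The sketch below shows the same idea as an exact brute-force search over a tiny in-memory list. It is only an illustration: the vector index uses an approximate nearest-neighbour structure rather than a full scan, and `cosine`, `top_k` and the sample vectors are all made up for this example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, items, k=5):
    """Return the k best (score, item_id) pairs, highest score first.

    items is an iterable of (item_id, vector) pairs.
    """
    scored = ((cosine(query_vector, vector), item_id) for item_id, vector in items)
    return heapq.nlargest(k, scored)


# Toy two-dimensional "embeddings" (hypothetical data).
items = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for score, item_id in top_k([1.0, 0.0], items, k=2):
    print(item_id, round(score, 4))
```

The full scan costs time proportional to the number of stored vectors per query, which is the cost the index is meant to avoid.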

#### Example

Run the vector search on the chunks. Example questions:

-  "Can I enable push notifications?"
-  "How can I use the online services safely?"
-  "What is meant with the Rabofoon?"

In [18]:
message = "How can I use the online services safely?"

In [21]:
results_df = vector_search_chunks(message, 5, embedding_model)

In [22]:
results_df

Unnamed: 0,score,file_name,chunk_id,page,chunk
0,0.92572,Terms & Conditions for Online Business Service...,128,14,Page | 14\n\nSection 6: How to use our Online ...
1,0.923584,Terms & Conditions for Online Business Service...,132,15,Page | 15\n\niv. switch to using any other rec...
2,0.923233,Terms & Conditions for Online Business Service...,133,15,"6.3. Using Internet, telecommunication and/or ..."
3,0.919617,Payment and Online Services Terms Sept 2022.pdf,243,17,18 Terms and Conditions for Payment and Online...
4,0.917267,Payment and Online Services Terms Sept 2022.pdf,267,23,Chapter 4 \nOnline services


In [23]:
results = dumps(loads(results_df.to_json(orient="records")), indent=2)
print(results)

[
  {
    "score": 0.9257202148,
    "file_name": "Terms & Conditions for Online Business Services - April 2024.pdf",
    "chunk_id": 128,
    "page": 14,
    "chunk": "Page | 14\n\nSection 6: How to use our Online Services safely\n\n6.1. What you need to do\n\na. Before using any Online Services you and each User must make sure that the Devices are (i) compatible with the Online Services, (ii) free of any viruses, (iii) adequately protected by installing up-to-date security, anti-virus and other software and (iv) adequately locked with a Security Code or other appropriate security measure, if possible.  \nb. It is essential that you and each User take all reasonable steps to protect all Security Resources and Security Codes including by:  \ni. not keeping a written record of any Security Code or, if you do keep a record of a Security Code, keeping it in a secure place separate from any other Security Code or Security Resource and anything which may identify you or your accounts;  \nii

### Vector Search on Definitions

Now perform a vector search on the definitions.

In [24]:
def vector_search_definition(message, nn=5, embedding_model=embedding_model):
    message_vector = embedding_model.embed_query(message)
    similarity_query = """ 
        CALL db.index.vector.queryNodes("definition-embeddings", $nn, $message_vector) YIELD node, score
        WITH node as definition, score ORDER BY score DESC
        RETURN score, definition.term as term, definition.description as description
    """
    results_df = driver.execute_query(
        similarity_query,
        database_=DATABASE,
        routing_=RoutingControl.READ,
        message_vector=message_vector,
        nn = nn, 
        result_transformer_= lambda r: r.to_df()
    )
    return results_df

#### Example

In [25]:
message = "What is meant with the Rabofoon?"

In [26]:
results_df = vector_search_definition(message, 5, embedding_model)

In [27]:
results = dumps(loads(results_df.to_json(orient="records")), indent=2)
print(results)

[
  {
    "score": 0.9451599121,
    "term": "rabofoon",
    "description": "A service that allows you to give payment orders using the phone's keys. You give permission for the payment order according to the instructions of Rabofoon. Transfers with Rabofoon require the use of IBAN as a unique identifier."
  },
  {
    "score": 0.9116210938,
    "term": "rabo",
    "description": "Savings Account 2020"
  },
  {
    "score": 0.910446167,
    "term": "rabo alert",
    "description": "An example of a push message or an email message that is sent to the email address specified by the user."
  },
  {
    "score": 0.9073181152,
    "term": "rabo app",
    "description": "To use a digital card, you need the Rabo App, unless we inform you otherwise."
  },
  {
    "score": 0.9063873291,
    "term": "rabo alerts",
    "description": "Notifications that can be received, and there is a concern if someone else gains access to your email address."
  }
]
