# Module 2 - Create Indexes

This notebook creates a vector index on top of the embeddings stored in Neo4j. The index can be used later for vector search in the database. Documentation of the vector index in Neo4j can be found here: [Neo4j Vector Index and Search](https://neo4j.com/labs/genai-ecosystem/vector-search/).

Again, the queries are executed with the [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current/), which enables querying Neo4j from a Python script.

In [1]:
import pandas as pd
from neo4j import Query, GraphDatabase, RoutingControl, Result
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from json import loads, dumps

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


## Get Credentials

In [2]:
env_file = 'credentials.env'

In [3]:
if os.path.exists(env_file):
    load_dotenv(env_file, override=True)

    # Neo4j
    HOST = os.getenv('NEO4J_URI')
    USERNAME = os.getenv('NEO4J_USERNAME')
    PASSWORD = os.getenv('NEO4J_PASSWORD')
    DATABASE = os.getenv('NEO4J_DATABASE')

    # AI
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    os.environ['OPENAI_API_KEY']=OPENAI_API_KEY
    LLM = os.getenv('LLM')
    EMBEDDINGS_MODEL = os.getenv('EMBEDDINGS_MODEL')
else:
    print(f"File {env_file} not found.")
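
If `credentials.env` is missing or incomplete, the cell above either prints a message or fails later with a less obvious error (for example, assigning `None` into `os.environ` raises a `TypeError`). A minimal sketch of an explicit check, assuming the variable names used above; `missing_env_vars` is a hypothetical helper, not part of the notebook:

```python
import os

# Names taken from the credentials cell above.
REQUIRED_VARS = [
    "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "NEO4J_DATABASE",
    "OPENAI_API_KEY", "EMBEDDINGS_MODEL",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes a misconfigured environment visible before any connection is attempted.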

## Setup Connection to Database

Set up the connection to the database with the [Python Driver](https://neo4j.com/docs/python-manual/5/).

In [4]:
driver = GraphDatabase.driver(
    HOST,
    auth=(USERNAME, PASSWORD)
)

Test the connection

In [5]:
driver.execute_query(
    """
    MATCH (n) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

Unnamed: 0,Count
0,1132


## Create Vector Index 

Convert the embeddings to a vector property with the [setNodeVectorProperty](https://neo4j.com/docs/operations-manual/5/reference/procedures/#procedure_db_create_setNodeVectorProperty) procedure.

In [6]:
no_chunks = driver.execute_query(
    """
    MATCH (n:Chunk) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

In [7]:
no_chunks

Unnamed: 0,Count
0,862


In [8]:
batch_size = 100
# Ceiling division, so an exact multiple of batch_size does not add an empty batch
nr_batches = -(-int(no_chunks.iloc[0].Count) // batch_size)
print(f'Running {nr_batches} batches with size {batch_size}')

Running 9 batches with size 100
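
The batching in the next cell relies on `c.id` being a contiguous integer sequence starting at 1, which is an assumption about how the chunks were loaded. Under that assumption, the ID windows can be sketched as follows; `batch_ranges` is a hypothetical helper and uses ceiling division so that an exact multiple of the batch size does not produce an extra empty batch:

```python
import math

def batch_ranges(total, batch_size):
    """Yield inclusive (first_id, last_id) windows covering IDs 1..total."""
    nr_batches = math.ceil(total / batch_size)
    for batch in range(nr_batches):
        first_id = batch * batch_size + 1
        last_id = min((batch + 1) * batch_size, total)
        yield first_id, last_id

ranges = list(batch_ranges(862, 100))
print(len(ranges))   # 9
print(ranges[0])     # (1, 100)
print(ranges[-1])    # (801, 862)
```

If the IDs are not contiguous, some windows will match fewer than `batch_size` nodes, which is harmless but makes the progress percentage approximate.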


In [9]:
for batch in range(nr_batches):
    # Pass the ID window as query parameters instead of formatting it into the Cypher string
    query = """
        MATCH (c:Chunk)
        WHERE c.id >= $first_id AND c.id <= $last_id
        CALL db.create.setNodeVectorProperty(c, "embedding", c.embedding)
        RETURN count(c) AS propertySetCount
    """

    driver.execute_query(
        query,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
        first_id=(batch * batch_size) + 1,
        last_id=(batch + 1) * batch_size,
    )

    # Report progress every 10 batches
    if batch % 10 == 0 and batch != 0:
        print(f"Finished: {batch}/{nr_batches} batches ({round(batch/nr_batches*100,2)}%)")

Create the [vector index](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#create-vector-index) on Embeddings in Neo4j

In [10]:
query = """
    CREATE VECTOR INDEX `chunk-embeddings` IF NOT EXISTS
    FOR (c:Chunk) ON (c.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
        } 
    }
"""

In [11]:
driver.execute_query(
    query,
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x31142a6d0>, keys=[])
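
The index above is configured with `cosine` similarity. As far as the Neo4j documentation describes it, the score returned by a vector query is the cosine similarity rescaled into the range 0 to 1, which is one reason scores for loosely related texts can still look high. A small pure-Python sketch of that calculation (the function names are illustrative, and the rescaling formula is stated here as an assumption about Neo4j's behaviour):

```python
import math

def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalized_score(a, b):
    """Rescale cosine similarity from [-1, 1] to [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2

print(normalized_score([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(normalized_score([1.0, 0.0], [0.0, 1.0]))   # 0.5 (orthogonal)
print(normalized_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0 (opposite)
```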

Show the result

In [12]:
schema_result_df  = driver.execute_query(
    'SHOW INDEXES',
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,2,chunk-embeddings,POPULATING,0.0,VECTOR,NODE,[Chunk],[embedding],vector-2.0,,,
1,0,index_343aff4e,ONLINE,100.0,LOOKUP,NODE,,,token-lookup-1.0,,2025-12-19T15:13:41.330000000+00:00,37584.0
2,1,index_f7700477,ONLINE,100.0,LOOKUP,RELATIONSHIP,,,token-lookup-1.0,,2025-12-10T16:34:56.212000000+00:00,10.0
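
The output above shows `chunk-embeddings` still in the `POPULATING` state; the index should be `ONLINE` before it is queried. One way to wait for that is to poll `SHOW INDEXES`. The following is a sketch, assuming the `driver` and `DATABASE` objects defined earlier; `wait_for_index` is a hypothetical helper, and the filtered `SHOW INDEXES ... YIELD ... WHERE` form is assumed to be available in the Neo4j version in use:

```python
import time

def wait_for_index(driver, index_name, database, timeout_s=60.0, poll_s=1.0):
    """Poll SHOW INDEXES until the named index is ONLINE or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = driver.execute_query(
            "SHOW INDEXES YIELD name, state WHERE name = $index_name RETURN state",
            index_name=index_name,
            database_=database,
        )
        states = [record["state"] for record in result.records]
        if states and states[0] == "ONLINE":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

# Usage in the notebook (needs a live connection):
# wait_for_index(driver, "chunk-embeddings", DATABASE)
```

The function returns `False` instead of raising on timeout, so the caller can decide whether to retry or stop.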


## Experiment with the Vector Search

Define the Embeddings Model

In [13]:
embedding_model = OpenAIEmbeddings(
    model=EMBEDDINGS_MODEL,
    openai_api_key=OPENAI_API_KEY
)

In [14]:
embedding_model.model

'text-embedding-ada-002'

Define the function to perform a Vector Search

In [15]:
def vector_search_chunks(message, nn=5, embedding_model=embedding_model):
    message_vector = embedding_model.embed_query(message)
    similarity_query = """ 
        CALL db.index.vector.queryNodes("chunk-embeddings", $nn, $message_vector) YIELD node, score
        WITH node as chunk, score ORDER BY score DESC
        MATCH (d:Document)<-[:PART_OF]-(chunk)
        RETURN score, d.name as doc_name, d.url as doc_url, chunk.id as chunk_id, chunk.page as page, chunk.text as text
    """
    results_df = driver.execute_query(
        similarity_query,
        database_=DATABASE,
        routing_=RoutingControl.READ,
        message_vector=message_vector,
        nn = nn, 
        result_transformer_= lambda r: r.to_df()
    )
    return results_df

#### Example

Let's run an example of the vector search on the chunks. Example questions:

- "Did we do any project on Offshore Windfarms?"
- "In which project was ALEX II used?"
- "What project did we do in the Netherlands?"
- "In which projects did we build a deepwall?"
- "In which year was the ship STRANDWAY?"
- "What are some features of the ship STRANDWAY?" (the vector search struggles with this one)

In [16]:
message = "What are some features of the ship STRANDWAY?"

In [17]:
results_df = vector_search_chunks(message, 10, embedding_model)

In [18]:
results_df

Unnamed: 0,score,doc_name,doc_url,chunk_id,page,text
0,0.909225,ndurance.pdf,https://boskalis.com/media/wbanffyp/ndurance.pdf,803,0,"accommodation barge, bottom \nstrengthened for..."
1,0.909119,le-havre-port-2000.pdf,https://boskalis.com/media/wivdqakk/le-havre-p...,557,1,managed to achieve this goaol.\n \nEQUIPMENT D...
2,0.907196,bear-27-03.pdf,https://boskalis.com/media/rathdvla/bear-27-03...,758,1,Capstan winches Brattvaag 5 t\nTugger winches ...
3,0.906357,ndurance.pdf,https://boskalis.com/media/wbanffyp/ndurance.pdf,804,0,Two engine rooms.\nBeaching capability.\nCorri...
4,0.904755,waterway.pdf,https://boskalis.com/media/gwfevulh/waterway.pdf,838,0,H.S. <= 2.5 m. Unrestricted navigation. \nFEAT...
5,0.901627,rockpiper.pdf,https://boskalis.com/media/qlmpx404/rockpiper.pdf,722,0,Length overall 158.60 m\nBreadth 36.00 m\nMo...
6,0.90033,ndeavor_mpv.pdf,https://boskalis.com/media/jpwb2uu5/ndeavor_mp...,667,0,2 x Boatlandings\nA - Frame 100 t SWL\nMAIN DA...
7,0.900131,ndurance.pdf,https://boskalis.com/media/wbanffyp/ndurance.pdf,800,0,EQUIPMENT\nSHEET\nNDURANCE \nCABLE LAYING VESS...
8,0.899887,ndeavor_mpv.pdf,https://boskalis.com/media/jpwb2uu5/ndeavor_mp...,669,0,"Main engines 7,280 kW\nStern thrusters 2 x 1..."
9,0.899826,giant-7.pdf,https://boskalis.com/media/okxnnsxz/giant-7.pdf,827,0,CONSTRUCTION / CLASSIFICATION\nYear of constru...


In [19]:
results = dumps(loads(results_df.to_json(orient="records")), indent=2)
print(results)

[
  {
    "score": 0.9092254639,
    "doc_name": "ndurance.pdf",
    "doc_url": "https://boskalis.com/media/wbanffyp/ndurance.pdf",
    "chunk_id": 803,
    "page": 0,
    "text": "accommodation barge, bottom \nstrengthened for loading and unloading \naground\nFEATURES\nCompletely new ship and turntable design.\nDiesel electric propulsion system.\nAccommodation on fore ship, total for 98 persons\nTwo engine rooms.\nBeaching capability."
  },
  {
    "score": 0.9091186523,
    "doc_name": "le-havre-port-2000.pdf",
    "doc_url": "https://boskalis.com/media/wivdqakk/le-havre-port-2000.pdf",
    "chunk_id": 557,
    "page": 1,
    "text": "managed to achieve this goaol.\n \nEQUIPMENT DEPLOYED\n \u0082 TSHD Shoalway\n \u0082 TSHD Strandway\n \u0082 TSHD Willem van Oranje\n \u0082 GD Medusa\n \u0082 WID Terra Plana\n \u0082 Additional equipment such as multicat, survey \nlaunch, hopper barges\nSPECIAL CIRCUMSTANCES"
  },
  {
    "score": 0.9071960449,
    "doc_name": "bear-27-03.pdf",
    "

#### View results in the Neo4j Browser

Create a vector from a search prompt

In [20]:
search_prompt = "Boskalis Ship the BEAR"

query_vector = embedding_model.embed_query(search_prompt)
print(query_vector)

[-0.012081247754395008, -0.02560073882341385, -0.0031538631301373243, -0.0037462827749550343, -0.01857389137148857, 0.014272857457399368, -0.037723079323768616, 0.009903335943818092, 0.01931355893611908, -0.018916331231594086, 0.008307570591568947, 0.0023645414039492607, -0.007321345619857311, -0.024285774677991867, 0.008841775357723236, 0.009006145410239697, 0.007629540748894215, 0.004273638594895601, 0.024299470707774162, 0.00753365783020854, -0.018971120938658714, 0.024422749876976013, 0.024600816890597343, 0.006383062805980444, 0.00819798931479454, -0.0115949846804142, 0.03980511054396629, -0.014204369857907295, -0.010019765235483646, 0.003075102111324668, -0.0017806828254833817, -0.0044037653133273125, -0.023915939033031464, -0.007074789609760046, -0.02131340280175209, -0.022847529500722885, 0.017724642530083656, -0.003708614269271493, 0.003989414311945438, -0.0014784804079681635, 0.009225307032465935, 0.0011523072607815266, 0.0007139852968975902, -0.0101361945271492, -0.007355589

Now take the embedding printed above and set it as a parameter in the Neo4j Browser with the following command:

:params query_vector => [-0.005233455915004015, -0.01934160105884075, 0.008535362780094147, -0.0051453616470098495, -0.017919041216373444, 0.008372224867343903, -0.017671070992946625, 0.007941541261970997, 0.014212552458047867, -0.028790533542633057, 0.0048027727752923965, 0.01440831832587719, -0.016509531065821648, -0.018245315179228783, 0.01389932818710804, 0.003270910121500492, 0.026937291026115417, 0.00868544913828373, 0.033253978937864304, 0.009886141866445541, 0.003680385649204254, 0.015413246117532253, 0.0036281815264374018, -0.009377152658998966, -0.0021419974509626627, -0.010284198448061943, 0.02236943505704403, -0.01917193830013275, 0.012548549100756645, -0.01652258262038231, -0.0016069059493020177, -0.009807836264371872, -0.02051619254052639, -0.017005469650030136, -0.022408589720726013, -0.003332902444526553, -0.007458653766661882, -0.0034650438465178013, 0.01974618248641491, 0.00012276109191589057, 0.0134294917806983, 0.0003668558201752603, -0.008065525442361832, -0.00263956724666059, -0.008365699090063572, -0.008848587051033974, -0.023113343864679337, -0.017149031162261963, -0.023100292310118675, 0.00544553529471159, 0.024783873930573463, 0.003133874386548996, -0.005350915249437094, -0.014695440419018269, 0.02458810806274414, 0.021338405087590218, 0.0034030515234917402, 0.006071983836591244, -0.018075652420520782, -0.022904526442289352, 0.010440810583531857, 0.015321888960897923, -0.0012104813940823078, -0.0021795190405100584, 0.0023377626203000546, -0.007647893391549587, 0.017762428149580956, 0.016470378264784813, -0.009220540523529053, 0.018414979800581932, 0.025123199447989464, 0.019954998046159744, 0.031322430819272995, 0.012391936965286732, 0.008163408376276493, 0.01144573837518692, 0.007151954807341099, 0.004466709215193987, 0.013129319064319134, -0.010192841291427612, 0.005706555210053921, -0.01925024390220642, -0.012411513365805149, -0.01144573837518692, 0.01533493958413601, 0.0009771946351975203, -0.015543756075203419]
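
Copying a 1536-element list out of the printed output by hand is error-prone. A small sketch that builds the Browser command as a single string, ready to copy; `browser_params_command` is a hypothetical helper, and the `:params name => value` form simply mirrors the command shown above:

```python
import json

def browser_params_command(name, vector):
    """Format a vector as a Neo4j Browser parameter command string."""
    return f":params {name} => {json.dumps(list(vector))}"

# A short stand-in vector; in the notebook this would be `query_vector`.
print(browser_params_command("query_vector", [0.25, -0.5, 1.0]))
# :params query_vector => [0.25, -0.5, 1.0]
```

The vector passed to the query must have the same number of dimensions as the index (1536 in this notebook).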

Then run the following search on chunks in the Browser and observe the result:

```
CALL db.index.vector.queryNodes("chunk-embeddings", 3, $query_vector) YIELD node, score
WITH node as chunk, score ORDER BY score DESC
RETURN score, chunk
```

## Create Full-Text Index

In [21]:
schema_result_df  = driver.execute_query(
    'CREATE FULLTEXT INDEX full_text_name IF NOT EXISTS FOR (n:Equipment|Project|Client|Contractor) ON EACH [n.name]',
    database_=DATABASE,
    routing_=RoutingControl.WRITE,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

In [22]:
schema_result_df  = driver.execute_query(
    'SHOW INDEXES',
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

Unnamed: 0,id,name,state,populationPercent,type,entityType,labelsOrTypes,properties,indexProvider,owningConstraint,lastRead,readCount
0,2,chunk-embeddings,ONLINE,100.0,VECTOR,NODE,[Chunk],[embedding],vector-2.0,,2025-12-19T15:15:22.693000000+00:00,1.0
1,3,full_text_name,ONLINE,100.0,FULLTEXT,NODE,"[Equipment, Project, Client, Contractor]",[name],fulltext-1.0,,,
2,0,index_343aff4e,ONLINE,100.0,LOOKUP,NODE,,,token-lookup-1.0,,2025-12-19T15:15:21.625000000+00:00,37593.0
3,1,index_f7700477,ONLINE,100.0,LOOKUP,RELATIONSHIP,,,token-lookup-1.0,,2025-12-10T16:34:56.212000000+00:00,10.0


In [34]:
results_df  = driver.execute_query(
    """
    CALL db.index.fulltext.queryNodes("full_text_name", "s STRANdway") YIELD node, score
    RETURN node.name, score
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
results = dumps(loads(results_df.to_json(orient="records")), indent=2)

print(results)

[
  {
    "node.name": "Strandway",
    "score": 3.0028965473
  }
]
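
Full-text queries use Lucene query syntax, so user input containing characters such as `:` or `(` can cause a parse error, and a misspelled name may not match unless the term is made fuzzy. A minimal sketch of preparing a search string, assuming standard Lucene syntax (`\` to escape, a trailing `~` for fuzzy matching); `to_fuzzy_query` is a hypothetical helper:

```python
import re

# Characters with special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def to_fuzzy_query(text):
    """Escape Lucene special characters and mark every term as fuzzy."""
    terms = []
    for term in text.split():
        escaped = _LUCENE_SPECIAL.sub(r"\\\1", term)
        terms.append(f"{escaped}~")
    return " ".join(terms)

print(to_fuzzy_query("Strandwey"))   # Strandwey~
print(to_fuzzy_query("BEAR (27)"))   # BEAR~ \(27\)~
```

The resulting string can be passed as a query parameter, for example `CALL db.index.fulltext.queryNodes("full_text_name", $q)` with `q=to_fuzzy_query(user_input)`, rather than being formatted into the Cypher text.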
