# Parse, Chunk and Load Documents 

The following notebook executes three steps:
- **Parsing and Chunking**: The first part of the notebook parses the PDF documents with LangChain's [PyPDFLoader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/#using-pypdf) and splits the text into chunks with `RecursiveCharacterTextSplitter`. More documentation can be found in the [LangChain API](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html) reference.
- **Embeddings**: An embedding is created for every chunk using an OpenAI embeddings model: [text-embedding-3-small](https://platform.openai.com/docs/models/embeddings).
- **Load to Database**: The documents and chunks are loaded into Neo4j using the [Python Driver](https://neo4j.com/docs/api/python-driver/current/), which enables querying from a Python script.

In [1]:
%pip install pypdf langchain_community langchain langchain_openai IPython neo4j python-dotenv pandas numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import os
from dotenv import load_dotenv
from neo4j import Query, GraphDatabase, RoutingControl, Result
import ast
from IPython.display import clear_output

## Get Credentials

In [3]:
env_file = 'credentials.env'

In [4]:
if os.path.exists(env_file):
    load_dotenv(env_file, override=True)

    # Neo4j
    HOST = os.getenv('NEO4J_URI')
    USERNAME = os.getenv('NEO4J_USERNAME')
    PASSWORD = os.getenv('NEO4J_PASSWORD')
    DATABASE = os.getenv('NEO4J_DATABASE')

    # AI
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    os.environ['OPENAI_API_KEY']=OPENAI_API_KEY
    LLM = os.getenv('LLM')
    EMBEDDINGS_MODEL = os.getenv('EMBEDDINGS_MODEL')
else:
    print(f"File {env_file} not found.")
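
If a variable is missing from `credentials.env`, `os.getenv` returns `None` and the problem only surfaces later, for example when the driver or the embeddings model is created. A minimal sketch of an early check, assuming the variable names read in the cell above; `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names, not part of any library:

```python
import os

# Names read by the credentials cell above.
REQUIRED_SETTINGS = [
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "OPENAI_API_KEY",
    "EMBEDDINGS_MODEL",
]


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes a misconfigured environment visible before any API or database call is attempted.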

In [5]:
documents_path = "documents/"

## Parse and Chunk Documents

In [6]:
chunk_size = 1000
chunk_overlap = 100

In [7]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap  = chunk_overlap,
    length_function = len,
    is_separator_regex = False,
)
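
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width sliding window in plain Python. This is an illustration only: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its actual chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters, each
    starting `chunk_size - chunk_overlap` characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, which is the role `chunk_overlap = 100` plays for the 1000-character chunks configured above.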

In [8]:
chunk_seq_id = 0  # running id across all documents
chunks_with_metadata = []

for doc_name in os.listdir(documents_path):
    print(f"Parsing: {doc_name}")
    doc_path = os.path.join(documents_path, doc_name)
    loader = PyPDFLoader(doc_path)
    # Note: load_and_split() already applies LangChain's default text splitter,
    # so `pages` can contain more entries than the PDF has pages.
    pages = loader.load_and_split()
    num_chunks = 0
    for page in pages:
        # Split each entry further into chunks of at most `chunk_size` characters
        chunks = text_splitter.split_text(page.page_content)
        for chunk in chunks:
            chunks_with_metadata.append({
                'file': page.metadata['source'],
                'page': page.metadata['page'],  # zero-based page index
                'chunks': chunk,
                'num_chuncks': len(chunks),  # chunks produced from this entry
                'chunk_seq_id': chunk_seq_id
            })
            chunk_seq_id += 1
            num_chunks += 1
    print(f"chunked {len(pages)} pages in {num_chunks} chunks")

Parsing: Rabo SpaarRekening 2020.pdf
chunked 14 pages in 44 chunks
Parsing: Payment and Online Services Terms Sept 2022.pdf
chunked 112 pages in 354 chunks


Create a DataFrame of Chunks

In [9]:
df = pd.DataFrame.from_dict(chunks_with_metadata)

In [10]:
df

| | file | page | chunks | num_chuncks | chunk_seq_id |
|---|---|---|---|---|---|
| 0 | documents/Rabo SpaarRekening 2020.pdf | 0 | Rabo \nSpaarRekening 2020 | 1 | 0 |
| 1 | documents/Rabo SpaarRekening 2020.pdf | 1 | Pagina 2/14\nInhoud\nRabo SpaarRekening Novem... | 1 | 1 |
| 2 | documents/Rabo SpaarRekening 2020.pdf | 2 | Pagina 3/14\nProductkenmerken, Rabo SpaarReken... | 4 | 2 |
| 3 | documents/Rabo SpaarRekening 2020.pdf | 2 | openen, wijzigen of opheffen. Hiervoor zijn ee... | 4 | 3 |
| 4 | documents/Rabo SpaarRekening 2020.pdf | 2 | wij de bijbehorende rente. Wij kunnen de schij... | 4 | 4 |
| ... | ... | ... | ... | ... | ... |
| 393 | documents/Payment and Online Services Terms Se... | 76 | Pas\nDe betaalpas, creditcard of digitale pas.... | 3 | 393 |
| 394 | documents/Payment and Online Services Terms Se... | 77 | Woordenlijst Voorwaarden betalen en online die... | 2 | 394 |
| 395 | documents/Payment and Online Services Terms Se... | 77 | en Rabo Scanner.\nTarieven- en limietenoverzic... | 2 | 395 |
| 396 | documents/Payment and Online Services Terms Se... | 78 | 79 Voorwaarden betalen en online diensten 2022... | 1 | 396 |


## Create embeddings

Load an embedding model

In [11]:
embeddings_model = OpenAIEmbeddings(
    model = EMBEDDINGS_MODEL,
    openai_api_key = OPENAI_API_KEY
)

Add an embedding for every chunk in the DataFrame

In [12]:
df['embedding'] = df['chunks'].apply(lambda x: embeddings_model.embed_query(x))
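
`embed_query` is called once per chunk here. LangChain embedding classes also expose `embed_documents`, which accepts a list of texts. The following is a hedged sketch of batching, not the notebook's method: `embed_in_batches` and `batch_size` are hypothetical, and `embed_fn` stands in for `embeddings_model.embed_documents` so that the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` by calling `embed_fn` on slices of `batch_size` items.

    `embed_fn` takes a list of strings and returns one vector per string,
    in the same order (the contract of LangChain's `embed_documents`).
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder for demonstration: a one-dimensional "vector" = text length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

In the notebook this could be used roughly as `df['embedding'] = embed_in_batches(df['chunks'].tolist(), embeddings_model.embed_documents)`.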

## Create Neo4j Connection

Set up the Python driver for Neo4j with the loaded credentials

In [13]:
driver = GraphDatabase.driver(
    HOST,
    auth=(USERNAME, PASSWORD)
)

Test the Connection

In [14]:
driver.execute_query(
    """
    MATCH (n) RETURN COUNT(n) as Count
    """,
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)

| | Count |
|---|---|
| 0 | 0 |


## Load to Database

Create uniqueness constraints on the `id` property of `Document` and `Chunk` nodes

In [15]:
driver.execute_query(
    'CREATE CONSTRAINT unique_document IF NOT EXISTS FOR (d:Document) REQUIRE d.id IS UNIQUE',
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x14719fa10>, keys=[])

In [16]:
driver.execute_query(
    'CREATE CONSTRAINT unique_chunk IF NOT EXISTS FOR (c:Chunk) REQUIRE c.id IS UNIQUE',
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x147a29610>, keys=[])

In [17]:
schema_result_df  = driver.execute_query(
    'SHOW CONSTRAINTS',
    database_=DATABASE,
    routing_=RoutingControl.READ,
    result_transformer_= lambda r: r.to_df()
)
schema_result_df.head()

| | id | name | type | entityType | labelsOrTypes | properties | ownedIndex | propertyType |
|---|---|---|---|---|---|---|---|---|
| 0 | 5 | unique_chunk | UNIQUENESS | NODE | [Chunk] | [id] | unique_chunk | |
| 1 | 3 | unique_document | UNIQUENESS | NODE | [Document] | [id] | unique_document | |


### Load Document Nodes to database

Create a DataFrame from the documents

In [18]:
document_df = df['file'].drop_duplicates().copy()
document_df = document_df.reset_index().drop('index',axis=1).reset_index()
document_df = document_df.rename(columns={"index": "doc_id", "file": "file_location"})
document_df['file_name'] = document_df['file_location'].apply(lambda x: x.split('/')[-1])
document_df

| | doc_id | file_location | file_name |
|---|---|---|---|
| 0 | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf |
| 1 | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf |


Get number of pages per file

In [19]:
df = pd.merge(df, document_df, left_on='file', right_on='file_location', how='left').copy()

In [20]:
df

| | file | page | chunks | num_chuncks | chunk_seq_id | embedding | doc_id | file_location | file_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | documents/Rabo SpaarRekening 2020.pdf | 0 | Rabo \nSpaarRekening 2020 | 1 | 0 | [0.0005772531148977578, -0.044067852199077606,... | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf |
| 1 | documents/Rabo SpaarRekening 2020.pdf | 1 | Pagina 2/14\nInhoud\nRabo SpaarRekening Novem... | 1 | 1 | [-0.010542603209614754, -0.03696855157613754, ... | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf |
| 2 | documents/Rabo SpaarRekening 2020.pdf | 2 | Pagina 3/14\nProductkenmerken, Rabo SpaarReken... | 4 | 2 | [0.0010430770926177502, -0.02986893244087696, ... | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf |
| 3 | documents/Rabo SpaarRekening 2020.pdf | 2 | openen, wijzigen of opheffen. Hiervoor zijn ee... | 4 | 3 | [0.010642684996128082, -0.031895823776721954, ... | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf |
| 4 | documents/Rabo SpaarRekening 2020.pdf | 2 | wij de bijbehorende rente. Wij kunnen de schij... | 4 | 4 | [-0.022181766107678413, -0.033395808190107346,... | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | documents/Payment and Online Services Terms Se... | 76 | Pas\nDe betaalpas, creditcard of digitale pas.... | 3 | 393 | [-0.009391930885612965, -0.010842716321349144,... | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf |
| 394 | documents/Payment and Online Services Terms Se... | 77 | Woordenlijst Voorwaarden betalen en online die... | 2 | 394 | [0.0028593363240361214, -0.011541321873664856,... | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf |
| 395 | documents/Payment and Online Services Terms Se... | 77 | en Rabo Scanner.\nTarieven- en limietenoverzic... | 2 | 395 | [-0.011058064177632332, -0.03008638136088848, ... | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf |
| 396 | documents/Payment and Online Services Terms Se... | 78 | 79 Voorwaarden betalen en online diensten 2022... | 1 | 396 | [-0.008227122016251087, -0.03939712420105934, ... | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf |


In [21]:
# Page indices are zero-based, so the page count is the highest index plus one
pages_df = df.groupby(['doc_id', 'file_name'])['page'].max() + 1

In [22]:
document_df = pd.merge(document_df, pages_df, on='doc_id', how='left')
document_df

| | doc_id | file_location | file_name | page |
|---|---|---|---|---|
| 0 | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf | 14 |
| 1 | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf | 80 |


### Load the Documents

In [23]:
merge_file_query = """
    MERGE(mergedDocument:Document {id: $doc_id})
    SET mergedDocument.file_location = $file_location,
        mergedDocument.file_name = $file_name,
        mergedDocument.pages = $file_pages
    RETURN mergedDocument
    """

In [24]:
document_df

| | doc_id | file_location | file_name | page |
|---|---|---|---|---|
| 0 | 0 | documents/Rabo SpaarRekening 2020.pdf | Rabo SpaarRekening 2020.pdf | 14 |
| 1 | 1 | documents/Payment and Online Services Terms Se... | Payment and Online Services Terms Sept 2022.pdf | 80 |


In [25]:
for index, row in document_df.iterrows():
    print(row)
    clear_output(wait=True)
    driver.execute_query(
        merge_file_query,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
        doc_id = row.doc_id,
        file_location = row.file_location,
        file_name = row.file_name,
        file_pages = row.page
    )
    print(f"Loaded {row['file_name']}")
    print("Progress: ", np.round((index+1)/document_df.shape[0]*100,2), "%")

Loaded Payment and Online Services Terms Sept 2022.pdf
Progress:  100.0 %


### Load Chunk Nodes to database

Create a DataFrame for the chunks

In [26]:
chunks_df = df[['chunk_seq_id', 'num_chuncks', 'page', 'chunks', 'embedding']]
chunks_df

| | chunk_seq_id | num_chuncks | page | chunks | embedding |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | Rabo \nSpaarRekening 2020 | [0.0005772531148977578, -0.044067852199077606,... |
| 1 | 1 | 1 | 1 | Pagina 2/14\nInhoud\nRabo SpaarRekening Novem... | [-0.010542603209614754, -0.03696855157613754, ... |
| 2 | 2 | 4 | 2 | Pagina 3/14\nProductkenmerken, Rabo SpaarReken... | [0.0010430770926177502, -0.02986893244087696, ... |
| 3 | 3 | 4 | 2 | openen, wijzigen of opheffen. Hiervoor zijn ee... | [0.010642684996128082, -0.031895823776721954, ... |
| 4 | 4 | 4 | 2 | wij de bijbehorende rente. Wij kunnen de schij... | [-0.022181766107678413, -0.033395808190107346,... |
| ... | ... | ... | ... | ... | ... |
| 393 | 393 | 3 | 76 | Pas\nDe betaalpas, creditcard of digitale pas.... | [-0.009391930885612965, -0.010842716321349144,... |
| 394 | 394 | 2 | 77 | Woordenlijst Voorwaarden betalen en online die... | [0.0028593363240361214, -0.011541321873664856,... |
| 395 | 395 | 2 | 77 | en Rabo Scanner.\nTarieven- en limietenoverzic... | [-0.011058064177632332, -0.03008638136088848, ... |
| 396 | 396 | 1 | 78 | 79 Voorwaarden betalen en online diensten 2022... | [-0.008227122016251087, -0.03939712420105934, ... |


In [27]:
merge_chunck_query = """
    MERGE(mergedChunk:Chunk {id: $chunk_seq_id})
        ON CREATE SET
            mergedChunk.page = $page,
            mergedChunk.chunk = $chunk,
            mergedChunk.embedding = $embedding
    RETURN mergedChunk
"""

In [28]:
for index, row in chunks_df.iterrows():
    clear_output(wait=True)
    driver.execute_query(
        merge_chunck_query,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
        chunk_seq_id = row.chunk_seq_id,
        page = row.page,
        chunk = row.chunks,
        embedding = row.embedding
    )
    print("Progress: ", np.round(((index+1)/chunks_df.shape[0])*100,2), "%")

Progress:  100.0 %
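
The loop above issues one query per chunk. A common alternative is to send rows in batches and let Cypher iterate over them with `UNWIND`. The sketch below is one possible shape, not the notebook's method: `unwind_chunk_query`, `batched`, `load_chunks`, and `batch_size` are hypothetical names, and `driver` is assumed to be the Neo4j driver created earlier.

```python
# Cypher that merges a whole batch of chunks per query; $rows is a list of maps.
unwind_chunk_query = """
    UNWIND $rows AS row
    MERGE (c:Chunk {id: row.chunk_seq_id})
        ON CREATE SET
            c.page = row.page,
            c.chunk = row.chunk,
            c.embedding = row.embedding
"""


def batched(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def load_chunks(driver, rows, database, batch_size=100):
    """Send `rows` (a list of dicts) to Neo4j in batches; return the batch count."""
    count = 0
    for batch in batched(rows, batch_size):
        # Keyword arguments without a trailing underscore become query parameters.
        driver.execute_query(unwind_chunk_query, rows=batch, database_=database)
        count += 1
    return count
```

The rows could be built with something like `chunks_df.rename(columns={'chunks': 'chunk'}).to_dict('records')`. With the 398 chunks loaded here and `batch_size=100`, this would issue 4 queries instead of 398.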


### Load File to Chunk Relationship

In [29]:
part_of_df = df[['chunk_seq_id', 'doc_id']].copy()
part_of_df

| | chunk_seq_id | doc_id |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |
| ... | ... | ... |
| 393 | 393 | 1 |
| 394 | 394 | 1 |
| 395 | 395 | 1 |
| 396 | 396 | 1 |


In [30]:
merge_part_of_query = """
    MATCH
        (doc:Document {id: $doc_id}),
        (chunk:Chunk {id: $chunk_id})
    MERGE (doc)<-[r:PART_OF]-(chunk)
    RETURN doc.file_name, type(r), chunk.id
"""

In [31]:
for index, row in part_of_df.iterrows():
    clear_output(wait=True)
    driver.execute_query(
        merge_part_of_query,
        database_=DATABASE,
        routing_=RoutingControl.WRITE,
        doc_id = row.doc_id,
        chunk_id = row.chunk_seq_id
    )
    # print(f"Loaded relationship from document {row['doc_id']} to chunk {row['chunk_seq_id']}")
    print("Progress: ", np.round(((index+1)/part_of_df.shape[0])*100,2), "%")

Progress:  100.0 %


### Load Chunk to Chunk Relationship

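Link the chunks of each document in order with a `NEXT` relationship. The query uses `apoc.nodes.link`, so the APOC library must be available on the Neo4j instance.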

In [32]:
next_query = """
    MATCH (doc:Document)
    WITH doc
    CALL (doc) {
        MATCH (doc)<-[:PART_OF]-(chunks:Chunk)
        WITH chunks ORDER BY chunks.id ASC
        WITH collect(chunks) as chunk_list
        CALL apoc.nodes.link(
            chunk_list,
            "NEXT",
            {avoidDuplicates: true}
        )
        RETURN size(chunk_list) as size_chunk_list
    }
    WITH doc, size_chunk_list
    RETURN doc, size_chunk_list
"""

In [33]:
driver.execute_query(
    next_query,
    database_=DATABASE,
    routing_=RoutingControl.WRITE
)

EagerResult(records=[<Record doc=<Node element_id='4:c6d37cd5-dbfb-40c4-9e15-be5560ce9c92:0' labels=frozenset({'Document'}) properties={'file_location': 'documents/Rabo SpaarRekening 2020.pdf', 'pages': 14, 'file_name': 'Rabo SpaarRekening 2020.pdf', 'id': 0}> size_chunk_list=44>, <Record doc=<Node element_id='4:c6d37cd5-dbfb-40c4-9e15-be5560ce9c92:1' labels=frozenset({'Document'}) properties={'file_location': 'documents/Payment and Online Services Terms Sept 2022.pdf', 'pages': 80, 'file_name': 'Payment and Online Services Terms Sept 2022.pdf', 'id': 1}> size_chunk_list=354>], summary=<neo4j._work.summary.ResultSummary object at 0x155ec38d0>, keys=['doc', 'size_chunk_list'])
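
To make the effect of the query concrete, the following plain-Python sketch computes which `NEXT` pairs result from ordering each document's chunks by id and linking neighbours. It mirrors the logic only; `next_pairs` is a hypothetical helper and no database is involved.

```python
from collections import defaultdict


def next_pairs(chunk_to_doc):
    """Given {chunk_id: doc_id}, return (from_chunk, to_chunk) pairs that
    link consecutive chunks of the same document in ascending id order."""
    by_doc = defaultdict(list)
    for chunk_id, doc_id in chunk_to_doc.items():
        by_doc[doc_id].append(chunk_id)
    pairs = []
    for doc_id in sorted(by_doc):
        ordered = sorted(by_doc[doc_id])
        pairs.extend(zip(ordered, ordered[1:]))
    return pairs


# Chunks 0-2 belong to document 0, chunks 3-4 to document 1.
print(next_pairs({0: 0, 1: 0, 2: 0, 3: 1, 4: 1}))
# [(0, 1), (1, 2), (3, 4)]
```

Chunks 2 and 3 are not linked because they belong to different documents. A document with n chunks therefore ends up with n - 1 `NEXT` relationships.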