## Overview

This Jupyter Notebook automates the construction of a knowledge graph in a Neo4j database based on structured data provided in an Excel file. The process includes establishing a database connection, resetting the graph, reading and categorizing the Excel sheets, and then creating the corresponding nodes and relationships in the graph database. At the end, it also reports how long the entire operation took.

### Connect to Neo4j

This cell sets up the environment for building the knowledge graph in Neo4j. It imports `time` for measuring execution duration, `pandas` for reading Excel data, and the project's helper functions from `helpers.helper_functions`. It then starts a timer for the total runtime, connects to the Neo4j database via `init_connection()`, and opens the Excel file `data/knowledge_graph.xlsx`, which holds the structured data for the graph's nodes and relationships.

In [35]:
# import necessary libraries
import time
import pandas as pd
from helpers.helper_functions import (
    init_connection,
    create_worksheet_lists,
    create_nodes,
    create_relationships,
    reset_db,
)

# set up timer for runtime of the script
start_time = time.time()

# initialize connection to the database
graph = init_connection()

# open the source Excel workbook
excel_file = pd.ExcelFile("data/knowledge_graph.xlsx")

Connected to the database


### Reset Knowledge Graph
The call `reset_db(graph=graph)` runs a helper function that clears the connected Neo4j database. Removing all existing nodes and relationships first ensures that no duplicate or outdated data survives, so the graph rebuilt from the Excel file starts from a clean, consistent state.
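
A reset of this kind usually comes down to a single Cypher statement. The sketch below is not the project's `reset_db()`; it assumes a hypothetical `run_query` callable that sends a Cypher string to the database, for example a thin wrapper around whatever object `init_connection()` returns.

```python
# Hypothetical sketch of a database reset; not the project's reset_db().
# DETACH DELETE removes every node together with its relationships.
RESET_QUERY = "MATCH (n) DETACH DELETE n"


def reset_graph(run_query):
    """Clear the graph by sending one Cypher statement through run_query.

    run_query is an assumed callable that accepts a Cypher string.
    """
    run_query(RESET_QUERY)
    return RESET_QUERY


# Usage with a stand-in that records the query instead of contacting a database
sent = []
reset_graph(sent.append)
print(sent)
```

For a small graph like the one built here, a single statement is likely sufficient; very large graphs are often cleared in batches instead.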

In [36]:
# call the function to reset the database
reset_db(graph=graph)

Database reset completed


### Create Worksheet Lists

This cell calls `create_worksheet_lists()`, which sorts the sheets of the Excel file into two categories: those containing node data and those containing relationship data. The distinction matters because nodes and relationships are imported into Neo4j differently. The function returns two lists, `node_worksheets` for sheets defining nodes and `rel_worksheets` for sheets defining relationships, which the following steps use to build the graph.

In [37]:
# split the worksheets into node sheets and relationship sheets
node_worksheets, rel_worksheets = create_worksheet_lists(excel_file=excel_file)

Node worksheets:  ['Doctor', 'Topic', 'SubTopic', 'Illness', 'Symptom', 'Cause', 'Treatment', 'Patient', 'Drug', 'Diagnosis', 'Hospital', 'Allergy', 'Insurance', 'Department']
Relationship worksheets:  ['REL_Doctor', 'REL_Topic', 'REL_Illness', 'REL_Symptom', 'REL_Patient', 'REL_Hospital'] 
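
The sheet names above suggest that relationship sheets are marked with a `REL_` prefix. Assuming that convention (the actual rule inside `create_worksheet_lists()` is not shown here), the split could be sketched as follows. `split_worksheets` and its `rel_prefix` parameter are hypothetical names.

```python
# Hypothetical sketch: classify sheet names by an assumed "REL_" prefix.
def split_worksheets(sheet_names, rel_prefix="REL_"):
    """Return (node_sheets, rel_sheets), preserving the original sheet order."""
    node_sheets = [name for name in sheet_names if not name.startswith(rel_prefix)]
    rel_sheets = [name for name in sheet_names if name.startswith(rel_prefix)]
    return node_sheets, rel_sheets


# With a real workbook, the names would come from pd.ExcelFile(...).sheet_names
sheet_names = ["Doctor", "Topic", "REL_Doctor", "REL_Topic"]
node_sheets, rel_sheets = split_worksheets(sheet_names)
print("Node worksheets:", node_sheets)
print("Relationship worksheets:", rel_sheets)
```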



### Create Nodes
This line calls the `create_nodes()` function, which reads the worksheets previously identified as containing node data and creates corresponding nodes in the Neo4j database. For each worksheet, it extracts the relevant data from the Excel file and uses it to define nodes with specific labels and properties. The function then inserts these nodes into the graph, building the basic entities of the knowledge graph structure.
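
As a rough illustration of what a helper like this might do internally, the sketch below turns the rows of one worksheet into a single parameterized Cypher statement. It is an assumption, not the project's `create_nodes()`: `build_node_query` and the column names `id` and `name` are hypothetical.

```python
# Hypothetical sketch: turn worksheet rows into one parameterized Cypher statement.
def build_node_query(label, rows):
    """Return (query, params) that would create one node per row.

    label is the worksheet name; rows is a list of dicts (column -> value).
    Labels cannot be passed as Cypher parameters, so the label is interpolated
    into the query text and must come from a trusted source.
    """
    query = f"UNWIND $rows AS row CREATE (n:`{label}`) SET n = row"
    return query, {"rows": rows}


# With pandas, rows could come from excel_file.parse(label).to_dict("records")
rows = [{"id": 1, "name": "Dr. Example"}, {"id": 2, "name": "Dr. Sample"}]
query, params = build_node_query("Doctor", rows)
print(query)
print(f"Would create {len(params['rows'])} nodes with the label 'Doctor'")
```

Passing the rows as a parameter rather than formatting values into the query string keeps the data separate from the Cypher text.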

In [38]:
# call function to create nodes
create_nodes(worksheets=node_worksheets, excel_file=excel_file, graph=graph)

Created 10 nodes with the label 'Doctor'
Created 10 nodes with the label 'Topic'
Created 20 nodes with the label 'SubTopic'
Created 30 nodes with the label 'Illness'
Created 55 nodes with the label 'Symptom'
Created 55 nodes with the label 'Cause'
Created 55 nodes with the label 'Treatment'
Created 20 nodes with the label 'Patient'
Created 60 nodes with the label 'Drug'
Created 30 nodes with the label 'Diagnosis'
Created 5 nodes with the label 'Hospital'
Created 15 nodes with the label 'Allergy'
Created 5 nodes with the label 'Insurance'
Created 10 nodes with the label 'Department'
All nodes have been created successfully. In total: 14 node types.


### Create Relationships

This cell calls `create_relationships()`, which processes the worksheets identified as containing relationship data. For each worksheet, it reads the rows from the Excel file and creates the corresponding relationships between nodes that already exist in the Neo4j database. Each relationship connects two nodes, with the relationship type and properties taken from the Excel sheet. This step links the entities that the previous step added as nodes.
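
One way such a step could be expressed in Cypher is to match both endpoints and merge the relationship between them. The sketch below is an assumption, not the project's `create_relationships()`: `build_relationship_query`, the relationship type `WORKS_AT`, and the row keys `source`, `target`, and `properties` are all hypothetical.

```python
# Hypothetical sketch: build a Cypher statement that links existing nodes.
def build_relationship_query(source_label, target_label, rel_type, key="id"):
    """Return a parameterized query that matches both endpoints by key
    and merges a relationship of type rel_type between them.

    Labels and relationship types cannot be Cypher parameters, so they are
    interpolated here and must come from trusted input.
    """
    return (
        "UNWIND $rows AS row "
        f"MATCH (a:`{source_label}` {{{key}: row.source}}) "
        f"MATCH (b:`{target_label}` {{{key}: row.target}}) "
        f"MERGE (a)-[r:`{rel_type}`]->(b) "
        "SET r += row.properties"
    )


# Each row names the two endpoints and carries optional relationship properties
params = {"rows": [{"source": 1, "target": 7, "properties": {"since": 2020}}]}
query = build_relationship_query("Doctor", "Hospital", "WORKS_AT")
print(query)
```

Because the query only matches nodes, rows that point at a missing node create nothing, which is why the nodes must be loaded first.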

In [39]:
# call function to create relationships
create_relationships(worksheets=rel_worksheets, excel_file=excel_file, graph=graph)

Created 190 relationships from a 'Doctor' node
Created 20 relationships from a 'Topic' node
Created 150 relationships from a 'Illness' node
Created 60 relationships from a 'Symptom' node
Created 180 relationships from a 'Patient' node
Created 10 relationships from a 'Hospital' node
All relationships have been created successfully.


### Timestamp
This block prints the timestamp indicating when the script was executed and displays the total runtime. It uses `time.strftime()` to format the current date and time in the `dd.mm.yyyy hh:mm:ss` format. Then it calculates the total execution time by subtracting the recorded start time from the current time. Finally, it prints the total runtime in seconds, giving the user insight into how long the entire knowledge graph generation process took to complete.

In [40]:
# print when the script was executed
print(f"This script was run on: {time.strftime('%d.%m.%Y %H:%M:%S')}")

# stop the runtime timer
end_time = time.time()

# calculate the total execution time
total_time = end_time - start_time

# print the total execution time
print(f"Total execution time: {total_time:.2f} seconds")

This script was run on: 22.04.2025 17:03:23
Total execution time: 2.60 seconds
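
The cell above measures the runtime with `time.time()`, which follows the system clock. As an alternative sketch, `time.perf_counter()` is intended for measuring intervals, and a small context manager keeps the start and stop logic in one place. The `timed` helper below is a hypothetical name, not part of the project.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Measure the enclosed block with a monotonic clock and print the result."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the block raises, so the duration is always recorded
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} seconds")


with timed("Total execution time") as timing:
    time.sleep(0.05)  # stand-in for the graph-building steps
```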
