# Tests
This notebook runs several tests against the source code and the usability of the Knowledge Graph. The first cell imports the required libraries, connects to the database, and starts the runtime timer.

In [None]:
# importing necessary libraries
import os
import time
import datetime as dt
import pandas as pd
from termcolor import colored
from helpers.helper_functions import init_connection, excel_import, export_to_excel, test_query, reset_db

# set up timer for runtime of the script
start_time = time.time()

# init connection to the neo4j database
graph = init_connection()

## Functional Tests

### Test Creation of Knowledge Graph

This block of code verifies that the knowledge graph has been successfully created in the Neo4j database by checking the presence of both nodes and relationships. It begins by executing a Cypher query to count all nodes in the graph using `MATCH (n) RETURN count(n) as node_count`. The result is extracted, and an assertion is made to ensure that at least one node exists. The number of nodes is then printed to provide feedback.

Next, the code performs a similar check for relationships. It runs the query `MATCH ()-[r]->() RETURN count(r) as relationship_count` to count all relationships in the database. Again, it asserts that there is at least one relationship present, and finally prints the number of relationships. These checks serve as a fundamental validation that the graph was populated and structured correctly, forming the foundation for all subsequent operations and tests.

In [None]:
# query to get the number of nodes in the database
query_node_count = """
MATCH (n)
RETURN count(n) as node_count
"""

# run the query and get the data of the result
node_result = graph.run(query_node_count).data()
# extract the number of nodes from the result
number_of_nodes = node_result[0]['node_count']

# assert the number of nodes is at least 1
assert number_of_nodes > 0, "No nodes found in the database."

# print the number of nodes
print(f"Number of nodes in the database: {number_of_nodes}")


# query to get the number of relationships in the database
query_rel_count = """
MATCH ()-[r]->()
RETURN count(r) as relationship_count
"""

# run the query and get the data of the result
rel_result = graph.run(query_rel_count).data()
# extract the number of relationships from the result
number_of_rels = rel_result[0]['relationship_count']

# assert the number of relationships is at least 1
assert number_of_rels > 0, "No relationships found in the database."

# print the number of relationships
print(f"Number of relationships in the database: {number_of_rels}")

### Get all Doctors
This block queries all nodes labeled `Doctor` in the Neo4j database and retrieves their name, specialization, years of experience, and contact email. To confirm that only valid doctor nodes are returned, it asserts that each one has a non-empty specialization field. It then prints the total number of doctor nodes and displays their data as a pandas DataFrame, which makes the doctor-related data in the graph easy to inspect and verify.

In [None]:
# query to get all Doctor nodes
query_doctors = """
MATCH (d:Doctor)
RETURN d.name as name, d.specialization as specialization, d.yearsOfExperience as years_of_experience, d.contactEmail as contact_email
"""

# run the query and get the data of the result
node_result = graph.run(query_doctors).data()

# extract the number of doctors from the result
number_of_doctors = len(node_result)

# assert that all doctors have a specialization to be sure that there are only doctors returned
assert all(doctor['specialization'] for doctor in node_result), node_result

# print the number of doctors
print(f"Number of doctors in the database: {number_of_doctors}")

# display the doctor details as a DataFrame
pd.DataFrame(node_result)

This is how the result looks in Neo4j when you query all doctors with `MATCH (d:Doctor) RETURN d`. The difference is that this query returns each doctor node in full rather than a selection of its properties.

<img src="../img/doctors.png" height=500 />
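
When the specialization check fails, it helps to see which rows are incomplete instead of the whole result list. The sketch below shows one way to do that on plain dictionaries shaped like the query result. `rows_missing` is a hypothetical helper and the sample rows are invented for illustration.

```python
def rows_missing(rows, field):
    """Return the rows whose `field` is absent, None, or empty."""
    return [row for row in rows if not row.get(field)]


# invented sample rows in the shape returned by the doctor query
sample_rows = [
    {"name": "Dr. A", "specialization": "Cardiology"},
    {"name": "Dr. B", "specialization": ""},
    {"name": "Dr. C", "specialization": None},
]

incomplete = rows_missing(sample_rows, "specialization")
print([row["name"] for row in incomplete])  # ['Dr. B', 'Dr. C']
```

In the notebook this could be used as `assert not rows_missing(node_result, "specialization")`, with the incomplete rows as the assertion message.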

### Test Export Functionality
This block tests whether the export functionality of the knowledge graph works correctly. It starts by counting the number of files currently present in the export directory. Then, it calls the export_to_excel() function, which creates a new Excel file containing the current graph data. After the export is completed, the script recounts the number of files in the directory. It uses an assertion to check that exactly one new file has been created. This confirms that the export process successfully generated and saved a new file representing the graph’s state.

In [None]:
# define the relative export path
export_path = "../data/export"

# number of files in export directory
number_of_files = len(os.listdir(export_path))

# define the current time for the filename
current_time = dt.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

export_to_excel(current_time=current_time, graph=graph, export_path=export_path)

# check if the number of files in the export directory has increased
number_of_files_after = len(os.listdir(export_path))

# assert that the number of files has increased by exactly 1
assert number_of_files_after - number_of_files == 1, "Export failed: expected exactly one new file."
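
Comparing file counts shows that something was added, but not what. A possible refinement is to compare the directory listing by name, which also identifies the new file. The sketch below is an illustration under assumptions: `new_files` is a hypothetical helper, and a placeholder file written to a temporary directory stands in for the real `export_to_excel()` call.

```python
import os
import tempfile


def new_files(directory, before):
    """Return the names in `directory` that were not in the `before` snapshot."""
    return sorted(set(os.listdir(directory)) - set(before))


with tempfile.TemporaryDirectory() as export_dir:
    # snapshot of the directory before the export
    before = os.listdir(export_dir)

    # stand-in for export_to_excel(): create one placeholder file
    with open(os.path.join(export_dir, "export_demo.xlsx"), "w") as handle:
        handle.write("placeholder")

    # names that appeared since the snapshot
    created = new_files(export_dir, before)

assert created == ["export_demo.xlsx"], f"unexpected new files: {created}"
print(created)
```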

### Test Import Functionality
This block tests whether the import functionality for the knowledge graph works correctly. It begins by querying and recording the current number of nodes and relationships in the Neo4j database. Then, it loads an Excel file containing additional graph data and imports that data into the graph using the excel_import() function. After the import process, the script again queries the database to count the updated number of nodes and relationships. It uses assertions to verify that both counts have increased, confirming that the import operation successfully added new data to the graph.

In [None]:
# query to get the number of current nodes
query_node_count = """
MATCH (n)
RETURN count(n) as node_count
"""
# query to get the number of current relationships
query_rel_count = """
MATCH ()-[r]->()
RETURN count(r) as relationship_count
"""

# run the query and get the data of the result
node_result = graph.run(query_node_count).data()
# run the query and get the data of the result
rel_result = graph.run(query_rel_count).data()

# extract the number of nodes from the result
number_of_nodes = node_result[0]['node_count']
# extract the number of relationships from the result
number_of_rels = rel_result[0]['relationship_count']

# define the excel file path
import_file = pd.ExcelFile("../data/import/import_data.xlsx")

# import the data from the excel file into the database
excel_import(excel_file=import_file, graph=graph)

# query to get the number of nodes after importing the data
after_result = graph.run(query_node_count).data()
# query to get the number of relationships after importing the data
after_rel_result = graph.run(query_rel_count).data()

# extract the number of nodes from the result
after_number_of_nodes = after_result[0]['node_count']
# extract the number of relationships from the result
after_number_of_rels = after_rel_result[0]['relationship_count']

# assert that the number of nodes has increased
assert after_number_of_nodes > number_of_nodes, f"Number of nodes before import: {number_of_nodes}, after import: {after_number_of_nodes}"

# assert that the number of relationships has increased
assert after_number_of_rels > number_of_rels, f"Number of relationships before import: {number_of_rels}, after import: {after_number_of_rels}"

## Non-functional Tests - Usability Tests

### Get all illnesses

This block defines and executes a Cypher query that retrieves all nodes labeled as Illness from the Neo4j database. For each illness, it returns key properties such as the illness name, its ICD code, and a description. The `test_query()` helper function is used to run the query and display the results in a readable table format. This helps verify that illness-related data has been correctly imported and structured within the knowledge graph.

In [None]:
# define query to get all illness nodes
query = """
MATCH (i:Illness)
RETURN i.name as name, i.ICDCode as ICD_Code, i.description as description
"""

# call the function to run the test of the specified query
test_query(query=query, graph=graph)

This is how the result looks in Neo4j (when every property of the nodes is returned rather than a selection):

<img src="../img/illnesses.png" height=500>

### Find all symptoms of a specific illness

This block defines and executes a Cypher query that retrieves all symptoms associated with the illness named `Migraine`. It matches Symptom nodes connected to an Illness node via the `SYMPTOM_OF` relationship, filtered specifically to cases where the illness name is `Migraine`. The query returns the name of each symptom found. The result is passed to the `test_query()` helper function, which executes the query and formats the output for display. This helps confirm that the symptom-illness relationships for `Migraine` are correctly stored in the graph.

In [None]:
# define query to get all symptoms of a specific illness
query = """ 
MATCH (s:Symptom)-[r:SYMPTOM_OF]->(i:Illness) 
WHERE i.name = 'Migraine'
RETURN s.name as symptom
"""

# call the function to run the test of the specified query
test_query(query=query, graph=graph)

This is how the result looks in Neo4j (when every property of the nodes is returned rather than a selection):

<img src="../img/specific_symptoms.png" height=500>

### Find all doctors who treated patients with a specific illness

This block defines and executes a Cypher query that finds all doctors who have treated patients diagnosed with the illness “Breast Cancer.” It uses the TREATS relationship between Doctor and Patient, and the HAS relationship between Patient and Illness, filtering the results to only include cases where the illness is “Breast Cancer.” The query returns a distinct list of doctor names. The results are then passed to the `test_query()` helper function for execution and formatted display. This test helps verify that the graph correctly represents treatment relationships between doctors and patients for a specific illness.

In [None]:
# define query to get all doctors who treated patients with a specific illness
query = """ 
MATCH (d:Doctor)-[:TREATS]->(p:Patient)-[:HAS]->(i:Illness)
WHERE i.name = 'Breast Cancer'
RETURN DISTINCT d.name
"""

# call the function to run the test of the specified query
test_query(query=query, graph=graph)

This is how the result looks in Neo4j (when every property of the nodes is returned rather than a selection):

<img src="../img/specific_doctors.png" height=500>

### List illnesses that share at least one symptom

This block defines and executes a Cypher query that identifies pairs of illnesses sharing at least one common symptom. It does this by matching two different Illness nodes that are both connected to the same Symptom node via the SYMPTOM_OF relationship. The WHERE clause ensures that the illnesses compared are not the same. The query returns a distinct list of illness pairs along with the name of the shared symptom. This is useful for analyzing symptom overlap between different conditions, which can support differential diagnosis or uncover related health issues.

In [None]:
# define query to get pairs of illnesses that share at least one symptom
query = """ 
MATCH (i1:Illness)<-[:SYMPTOM_OF]-(s:Symptom)-[:SYMPTOM_OF]->(i2:Illness)
WHERE i1.name <> i2.name
RETURN DISTINCT i1.name AS Illness1, i2.name AS Illness2, s.name AS SharedSymptom
"""

# call the function to run the test of the specified query
test_query(query=query, graph=graph)

This is how the result looks in Neo4j (when every property of the nodes is returned rather than a selection):

<img src="../img/specific_illnesses.png" height=500>
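
Because the pattern is symmetric, a filter of `i1.name <> i2.name` returns every pair twice, once in each order. Using `i1.name < i2.name` in the Cypher query would keep one row per unordered pair. The following pure-Python sketch reproduces that canonical-pair logic on invented sample data. `shared_symptom_pairs` is a hypothetical name.

```python
from itertools import combinations

# invented sample data: symptom -> illnesses it is a symptom of
symptom_to_illnesses = {
    "Fever": ["Influenza", "Malaria"],
    "Headache": ["Migraine", "Influenza"],
    "Rash": ["Measles"],
}


def shared_symptom_pairs(mapping):
    """Return sorted (illness1, illness2, symptom) triples, one per unordered pair."""
    triples = set()
    for symptom, illnesses in mapping.items():
        # combinations over the sorted names yields each pair once, in name order
        for first, second in combinations(sorted(set(illnesses)), 2):
            triples.add((first, second, symptom))
    return sorted(triples)


for triple in shared_symptom_pairs(symptom_to_illnesses):
    print(triple)
# ('Influenza', 'Malaria', 'Fever')
# ('Influenza', 'Migraine', 'Headache')
```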

### Find patients allergic to drugs they were prescribed

This block defines and executes a Cypher query that identifies patients who are allergic to the drugs they have been prescribed. It matches Patient nodes connected to both Drug and Allergy nodes, and filters the results to only include cases where the name of the drug matches the name of the allergy. The query returns the patient’s name along with the name of the conflicting drug. The result is passed to the test_query() helper function and stored in a DataFrame. If no such conflicts are found, a message is printed indicating that there are no issues. Otherwise, the DataFrame is displayed, showing all detected medication-allergy conflicts. This test is useful for verifying that the knowledge graph can reveal potentially dangerous medical contradictions.

In [None]:
# define query to get patients who are allergic to a drug they take
query = """ 
MATCH (p:Patient)-[:TAKES]->(d:Drug),
      (p)-[:HAS]->(a:Allergy)
WHERE d.name = a.name
RETURN p.name AS Patient, d.name AS ConflictMedicament
"""

# call the function to run the test of the specified query and store the result in the df variable
df = test_query(query=query, graph=graph)

# if the DataFrame is empty, report that no patient has a drug-allergy conflict;
# otherwise display the result
if len(df) == 0:
    print("Luckily, no patients have a conflict with their drugs.")
else:
    display(df)

This is how the result looks in Neo4j (when every property of the nodes is returned rather than a selection):

<img src="../img/no_result.png" height=100>
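
The conflict check relies on an exact, case-sensitive match between the drug name and the allergy name, so `Penicillin` and `penicillin` would not be treated as the same substance. The sketch below mirrors that logic in plain Python on invented sample data. `drug_allergy_conflicts` is a hypothetical name.

```python
# invented sample data: patient -> drugs taken and known allergies
patients = {
    "Alice": {"drugs": {"Penicillin", "Ibuprofen"}, "allergies": {"Penicillin"}},
    "Bob": {"drugs": {"Aspirin"}, "allergies": {"Latex"}},
}


def drug_allergy_conflicts(records):
    """Return (patient, drug) pairs where a drug name equals an allergy name."""
    conflicts = []
    for patient, record in sorted(records.items()):
        # set intersection keeps only names present on both sides
        for name in sorted(record["drugs"] & record["allergies"]):
            conflicts.append((patient, name))
    return conflicts


print(drug_allergy_conflicts(patients))  # [('Alice', 'Penicillin')]
```

An empty list corresponds to the "no conflicts" branch in the cell above.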

### Find the most common symptom across all illnesses

This block defines and executes a Cypher query to identify the most common symptom across all illnesses in the knowledge graph. It matches Symptom nodes connected to any Illness node through the SYMPTOM_OF relationship. The query counts how many times each symptom occurs, orders the results in descending order by frequency, and returns the symptom with the highest count. This helps identify which symptom is most frequently associated with illnesses, providing insights into prevalent or general indicators of disease within the graph.

In [None]:
# define query to get the most common symptom across all illnesses
query = """ 
MATCH (s:Symptom)-[:SYMPTOM_OF]->(:Illness)
RETURN s.name, COUNT(*) AS Occurrence
ORDER BY Occurrence DESC
LIMIT 1
"""

# call the function to run the test of the specified query
test_query(query=query, graph=graph)

This is how the result looks in Neo4j (when every property of the nodes is returned rather than a selection):

<img src="../img/most_common_symptom.png" height=500>
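
With `ORDER BY Occurrence DESC LIMIT 1`, only one row is returned even when several symptoms share the highest count, and which one is returned is not determined by the query. The sketch below shows the same counting in plain Python and reports all tied symptoms. The sample edges are invented, and `most_common_symptoms` is a hypothetical name.

```python
from collections import Counter

# invented sample edges: (symptom, illness)
symptom_edges = [
    ("Fever", "Influenza"),
    ("Fever", "Malaria"),
    ("Headache", "Migraine"),
    ("Headache", "Influenza"),
    ("Rash", "Measles"),
]


def most_common_symptoms(edges):
    """Return (names, count) for every symptom that reaches the highest count."""
    counts = Counter(symptom for symptom, _illness in edges)
    if not counts:
        return [], 0
    top = max(counts.values())
    return sorted(name for name, n in counts.items() if n == top), top


print(most_common_symptoms(symptom_edges))  # (['Fever', 'Headache'], 2)
```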

## Restore database to original state

This block resets the Neo4j database and restores it to its original state after all tests have been executed. It begins by calling the reset_db() function to clear all existing data from the database. Then, it re-imports the Excel file that was generated during the export test to restore the graph to its prior structure and content. After restoring the data, the script deletes the exported Excel file to clean up the working directory. Finally, it prints a confirmation message indicating that the file has been deleted and all tests were successfully completed. This ensures the environment is left in a clean and consistent state.

In [None]:
# call the function to reset the database
reset_db(graph=graph)

# path of the file which was created during the export test
export_file = f"../data/export/export_{current_time}.xlsx"

# re-import the exported data; pd.ExcelFile raises FileNotFoundError if the file
# is missing, and the with-block closes the file handle before it is deleted
with pd.ExcelFile(export_file) as file:
    excel_import(excel_file=file, graph=graph)

# delete the export file
os.remove(export_file)

# print status that the export file has been deleted and all tests completed successfully
print("Export file deleted.")
print(colored("--- Tests completed successfully. ---", "green"))

### Timestamp

This block prints a timestamp indicating when the script was executed and displays the total runtime. It uses `time.strftime()` to format the current date and time as `dd.mm.yyyy hh:mm:ss`, then computes the total execution time by subtracting the recorded start time from the current time. Finally, it prints the runtime in seconds, showing how long the entire test run took.

In [None]:
# print statement to print when the script was executed
print(f"This script was run on: {time.strftime('%d.%m.%Y %H:%M:%S')}")

# stop the runtime timer
end_time = time.time()

# calculate the total execution time
total_time = end_time - start_time

# print the total execution time
print(f"Total execution time: {total_time:.2f} seconds")
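
For measuring durations, `time.perf_counter()` is an alternative to `time.time()` because it is a monotonic clock and is not affected by system clock adjustments. The sketch below is one way to measure and format a runtime. `format_runtime` is a hypothetical helper and the summation is only a placeholder workload.

```python
import time


def format_runtime(seconds):
    """Format a duration in seconds as minutes and seconds."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)} min {rest:.2f} s"


start = time.perf_counter()

# placeholder workload standing in for the test cells
sum(range(100_000))

elapsed = time.perf_counter() - start
print(f"Total execution time: {format_runtime(elapsed)}")
```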