# MAIN CODE

## Constant Data

This data should be well defined before running the program. It includes a host/port of the Neo4j Database to upload to as well as the locations, categories, and properties of the data.

For our project, we use AWS S3 buckets to store and share data. We also know the "labels" (what each value in the various datasets represents), so we list them along with their respective data types. The data types and labels are instrumental in building a Cypher command that creates a node for each data item with the appropriate properties.

In [None]:
URI = "bolt://localhost:7687"
CATEGORIES = ["predication", "sentence", "entity"]


S3_URLS = {
    "predication" : "s3://semdb-data/parquet_cleaned/predication/",
    "sentence": "s3://semdb-data/parquet_cleaned/sentence/",
    "entity": "s3://semdb-data/parquet_cleaned/entity/",
}

# column labels and data types as they appear in dataframes.
COL_LABELS = {
    "predication": list({
        "PREDICATION_ID": int,  # Auto-generated primary key for each unique predication
        "SENTENCE_ID": int,     # Foreign key to the SENTENCE table
        "PMID": int,            # The PubMed identifier of the citation to which the predication belongs
        "PREDICATE": str,       # The string representation of each predicate (for example TREATS, PROCESS_OF)
        "SUBJECT_CUI": str,     # The CUI of the subject of the predication
        "SUBJECT_NAME": str,    # The preferred name of the subject of the predication
        "SUBJECT_SEMTYPE": str, # The semantic type of the subject of the predication
        "SUBJECT_NOVELTY": int, # The novelty of the subject of the predication
        "OBJECT_CUI": str,      # The CUI of the object of the predication
        "OBJECT_NAME": str,     # The preferred name of the object of the predication
        "OBJECT_SEMTYPE": str,  # The semantic type of the object of the predication
        "OBJECT_NOVELTY": int,  # The novelty of the object of the predication
        }.items()),
    "sentence": list({
        "SENTENCE_ID": int,               # Auto-generated primary key for each sentence
        "PMID": int,                      # The PubMed identifier of the citation to which the sentence belongs
        "TYPE": str,                      # 'ti' for the title of the citation, 'ab' for the abstract
        "NUMBER": int,                    # The location of the sentence within the title or abstract
        "SENT_START_INDEX": int,          # The character position within the text of the MEDLINE citation of the first character of the sentence  NEW
        "SENT_END_INDEX": int,            # The character position within the text of the MEDLINE citation of the last character of the sentence  NEW
        "SENTENCE": str,                  # The actual string or text of the sentence
        }.items()),
    "entity": list({
        "ENTITY_ID": int,    # Auto-generated primary key for each unique entity
        "SENTENCE_ID": int,  # The foreign key to SENTENCE table
        "CUI": str,          # The CUI of the entity
        "NAME": str,         # The preferred name of the entity
        "TYPE": str,         # The semantic type of the entity
        "TEXT": str,         # The text in the utterance that maps to the entity
        "START_INDEX": int,  # The first character position (in document) of the text denoting the entity
        "END_INDEX": int,    # The last character position (in document) of the text denoting the entity
        "SCORE": int,        # The confidence score
        }.items()),
}
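
To make the role of `COL_LABELS` concrete, the sketch below shows how a list of `(label, type)` pairs could be turned into the property portion of a Cypher `CREATE` command. The row values are made-up sample data and `props_from_row` is a hypothetical helper used only for illustration; the project's own implementation is defined in the Utility Functions section.

```python
# Hypothetical sample: two columns from the "sentence" category.
col_labels = list({"SENTENCE_ID": int, "TYPE": str}.items())

# Made-up row values, keyed by column label.
row = {"SENTENCE_ID": 42, "TYPE": "ti"}


def props_from_row(row: dict, col_labels: list) -> str:
    """Joins label/value pairs, quoting only the str-typed values."""
    parts = []
    for label, dtype in col_labels:
        value = f'"{row[label]}"' if dtype is str else row[label]
        parts.append(f"{label.lower()}:{value}")
    return ", ".join(parts)


cmd = f"CREATE(:Sentence{{{props_from_row(row, col_labels)}}})"
print(cmd)  # CREATE(:Sentence{sentence_id:42, type:"ti"})
```

The data type decides only whether the value is wrapped in quotes, which is why each label is stored together with its type.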

## Authentication Setup

The following section defines some basic functionality for retrieving authorization credentials to AWS and the Neo4j Database. The implementation is not important, but it allows us to avoid using hard-coded credentials.

In [None]:
from dotenv import dotenv_values

# get authentication credentials.
env_v = dotenv_values(".env")
storage_opts = {'key': env_v["aws_access_key_id"],
                'secret': env_v["aws_secret_access_key"],
                'token': env_v["aws_session_token"],
               }
neo4j_auth = (env_v["neo4j_user"], env_v["neo4j_pw"])
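
If a key is missing from the `.env` file, the lookups above fail with a `KeyError`. One possible safeguard, sketched here under the assumption that the file uses exactly the key names shown above, is to check for missing keys before building the credential objects. `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the dictionary stands in for the value returned by `dotenv_values(".env")`.

```python
# Key names taken from the authentication cell above.
REQUIRED_KEYS = (
    "aws_access_key_id",
    "aws_secret_access_key",
    "aws_session_token",
    "neo4j_user",
    "neo4j_pw",
)


def missing_keys(env: dict, required=REQUIRED_KEYS) -> list:
    """Returns the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]


# Stand-in for dotenv_values(".env"); the values are placeholders.
sample_env = {"aws_access_key_id": "placeholder", "neo4j_user": "neo4j"}
print(missing_keys(sample_env))
```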

## Neo4j Database Connection

The next section defines a series of functions to interact with the Neo4j database after authentication.

In a typical distributed environment, such as one created with a Dask cluster, connections to outside resources like the Neo4j `GraphDriver` object we use cannot be shared across processes. This is typically because they contain a resource unique to a specific process (usually a thread lock or something similar). For this reason, the utility function `process_command` creates a new connection for every command. While creating a new connection to execute a single command introduces some overhead, it is necessary to accomplish the project's long-term goals.

In [None]:
from neo4j import GraphDatabase, Result

class GraphDriver:
    def __init__(self, uri):
        """Creates an authorized API to neo4j server at uri.

        Args:
            uri: the address (host and port) of the neo4j server.
        """
        self._driver = GraphDatabase.driver(uri, auth=neo4j_auth)

    def execute_command(self, cmd:str)-> bool:
        """Runs command on neo4j server.

        Args:
            cmd: the Cypher-formatted command to run.

        Returns:
            A boolean value specifying whether the command was executed successfully.
        """
        res = True

        if cmd == "Failed":
            return False

        # Attempt command.
        try:
            with self._driver.session() as session:
                session.run(cmd)
        except Exception:
            res = False

        return res

    def execute_command_with_results(self, cmd:str, t_function:str):
        """Runs command on neo4j server and returns result.

        Args:
            cmd: the Cypher-formatted command to run.
            t_function: the transformation function to run on result before returning
        Returns:
            An object created from running the transformation function on the result of the
            command.

        Note:
            `t_function` parameter should be a member function of the neo4j.Result object.
        """
        with self._driver.session() as session:
            return getattr(session.run(cmd), t_function)()

    def delete_all_data(self):
        """Deletes all nodes in database.

        Warning:
            Do not use when database contains > 4 million nodes.
            May cause server to crash. Use the batch `delete_data`
            instead.
        """
        self.execute_command("MATCH (n) DETACH DELETE n")

    def delete_data(self, batch_size:int, reps:int):
        """Deletes all nodes in database by batches.

        Args:
            batch_size: the number of nodes to delete.
            reps: the number of times to repeat the delete operation over
            {batch_size} nodes.
        """
        cmd = f"MATCH (n) WITH n LIMIT {batch_size} DETACH DELETE n"

        for n in range(reps):
            self.execute_command(cmd)

    def close(self):
        """Closing any open connections.

        Note:
            Calling object is useless after calling this function.
        """
        self._driver.close()

def process_command(cmd:str, uri:str) -> bool:
    """Creates connection to database and executes cmd.

    Args:
        cmd: the Cypher-formatted command to run.
        uri: the address (host and port) of the neo4j server.
    Returns:
        A boolean value specifying whether the command was executed successfully.

    Note:
        Useful for dask df parallelizations since GraphDatabase objects can't be pickled (serialized).
    """

    # create connection.
    driver = GraphDriver(uri)

    # execute command, then close the connection so repeated calls don't leak it.
    try:
        return driver.execute_command(cmd)
    finally:
        driver.close()
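
The serialization problem that motivates `process_command` can be reproduced without a database. In the sketch below, `FakeConnection` is a hypothetical stand-in for any object that owns a process-local resource such as a thread lock; it is not the Neo4j driver, whose internals are not inspected here. The plain string command, by contrast, serializes without trouble, which is why only the command and URI are passed to workers.

```python
import pickle
import threading


class FakeConnection:
    """Hypothetical stand-in for an object that owns a process-local resource."""

    def __init__(self):
        self._lock = threading.Lock()  # lock objects cannot be pickled


def is_picklable(obj) -> bool:
    """Reports whether obj survives pickle.dumps."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


def run_with_fresh_connection(cmd: str) -> str:
    """Creates the connection inside the task, so only the str is shipped."""
    conn = FakeConnection()
    return f"ran {cmd!r} on a new {type(conn).__name__}"


print(is_picklable(FakeConnection()))  # False
print(is_picklable("RETURN 1"))        # True
print(run_with_fresh_connection("RETURN 1"))
```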

## Utility Functions

These functions allow us to track the total time a function has taken to finish execution, create a simple log message, and abstract the creation of Cypher commands to query the Neo4j database.

In [None]:
import time
import math

def track_time(func):
    """Records the elapsed time of the passed function.

    Note:
        Only to be used as a decorator.
    """
    def inner(*args, **kwargs):
        start = time.perf_counter()

        # run passed function with all arguments. Capture any returned values.
        retval = func(*args, **kwargs)

        print(f"Elapsed time: {round(time.perf_counter() - start, 2)} seconds.\n")

        return retval

    return inner

def log(msg:str) -> str:
    """Formats a message with a timestamp for console output.

    Args:
        msg: the message to be displayed.
    Returns:
        A formatted string containing a timestamp and the passed message.

    Note:
        Simple implementation instead of using the slightly more complex logging module.
    """

    return f"{time.strftime('%H:%M:%S', time.localtime())}\t{msg}"

def make_prop_string(label: str, val, dtype:type) -> str:
    """Formats properties of Cypher Command.

    Args:
        label: the label of the passed value.
        val: the integer or string data.
        dtype: the type of the passed value (i.e. int or str).

    Returns:
        A string containing the property combo in the proper format.
    """
    prop_string = ""

    # add quotations around str values.
    if dtype == str:
        prop_string += f"{label.lower()}:\"{val}\""
    else:
        prop_string+= f"{label.lower()}:{val}"

    return prop_string

def build_node_cmd(row, category:str, col_labels: list[tuple[str, type]]) -> str:
    """Builds command to create a node with appropriate properties.

    Args:
        row: the data extracted from a dataframe.
        category: the category of the data (i.e. "predication", "entity", or "sentence").
        col_labels: a list of tuples specifying the labels and data type of each value.

    Returns:
        A string containing a properly formatted Cypher command.
    """
    try:
        # create node properties in form "prop:val, prop:val,..."
        props = ', '.join([
            make_prop_string(label, row[label], dtype) for (label, dtype) in col_labels
        ])

        cmd = f"CREATE(:{category.capitalize()}{{{props}}})"

    # avoid crashing the program when command creation fails.
    except Exception:
        cmd = "Failed"
    return cmd

def build_relation_cmd(cat_one:str, cat_two:str, relation:str, batch_size:int=2500, props:str="") -> str:
    """Builds command to create relations between two node categories.

    Args:
        cat_one: a string containing the category name to start the
        relation from.
        cat_two: a string containing the category name to direct the relation
        to.
        relation: a string containing the desired relation between
        the node categories.
        batch_size: the number of nodes to create a relation between. Defaults to 2500.
        props: a string containing a formatted list of properties to add to relation.
        Defaults to empty.

    Returns:
        A string containing a properly formatted Cypher command.

    Note:
        Relationship is directed from category one Node to
        category two node based on matching "sentence_id" properties.
    """
    # get all nodes of category one and two
    cmd = f"MATCH (a:{cat_one.capitalize()}), (b:{cat_two.capitalize()}) "

    # match sentence IDs
    cmd+= "WHERE a.sentence_id = b.sentence_id "

    # only get nodes that currently do not have the specified relation already
    cmd+= f"AND NOT (a)-[:{relation}{{{props}}}]->(b) "

    cmd+= f"WITH a, b LIMIT {batch_size} "

    # create relation
    cmd+= f"MERGE (a)-[:{relation}{{{props}}}]->(b)"

    return cmd
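
Because `make_prop_string` interpolates string values directly between double quotes, a value that itself contains a double quote would end the string literal early and produce an invalid command. One alternative, shown here only as a sketch and not as the approach used in this project, is to pass the values as Cypher parameters. `build_param_node_cmd` is a hypothetical helper and the row is made-up sample data; the resulting pair is intended for the driver's `session.run(query, parameters)` call.

```python
def build_param_node_cmd(row: dict, category: str, col_labels: list) -> tuple:
    """Returns a (query, parameters) pair instead of one interpolated string.

    Hypothetical alternative to build_node_cmd: values travel as parameters,
    so quotes inside a string never become part of the Cypher text.
    """
    props = {label.lower(): dtype(row[label]) for label, dtype in col_labels}
    query = f"CREATE (:{category.capitalize()} $props)"
    return query, {"props": props}


# Made-up row whose text contains double quotes.
row = {"ENTITY_ID": 7, "TEXT": 'the "gold standard" assay'}
labels = [("ENTITY_ID", int), ("TEXT", str)]

query, params = build_param_node_cmd(row, "entity", labels)
print(query)  # CREATE (:Entity $props)
print(params)

# With an open session, the pair could be sent as: session.run(query, params)
```

The query text is identical for every row of a category, so only the parameter dictionary changes between calls.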

## Database Setup

We start by initializing the driver (the connection to the database). As described above, the `GraphDriver` object provides several methods for updating the database.

In [None]:
# clear existing nodes.
driver = GraphDriver(URI)
driver.delete_all_data()

# distributed
# driver.close()

## Get Data From Buckets

In [None]:
import dask.dataframe as dd
data = {}

for cat in CATEGORIES:
    # read data and store for later use.
    data[cat] = dd.read_parquet(S3_URLS[cat], storage_options=storage_opts)

## Setup Distributed Environment

While this section was not used during testing, it will be crucial for running the program across all of the available data, which is too much for a single client to handle.

In [None]:
from dask.distributed import Client

client_opts = {
    ## define options here such as schedule address, # workers, etc.
}
client = Client(**client_opts)
client

## Command Creation and Execution

It is not feasible to upload the entire dataset to a local Neo4j database. The local server is not accessible to external clients, and all of the data will eventually need to be uploaded to a dedicated server hosting the database. Instead, we aim to verify that the software automating node and relation creation functions as expected with a small subset of the real data.

We'll utilize the first partition of each dataset to create and execute Cypher commands that create nodes and the relations between each type of node. After each step, we'll verify our results.

In [None]:
@track_time
def create_nodes(cat):
    """Creates and Executes Node Creation Commands for Dataframe based on category.

    Args:
        cat: the category of the dataframe (i.e. "predication", "entity", or "sentence").

    Note:
        All commands are attempted regardless of previous failure.
    """

    print(f"\n{'-' *30}{cat.upper()}{'-' *30}")

    # create commands using dask optimizations.
    print(f"\t{log('Building commands...')} ")

    # only take rows whose "Sentence ID" column wasn't null during cleaning.
    cmds = (
        data[cat][data[cat]["SENTENCE_ID"] != -1].apply(
            build_node_cmd, axis=1, category=cat, col_labels=COL_LABELS[cat], meta=('cypher_cmds', 'str')
        ).persist()
    )

    # cmds = data[cat].apply(build_node_cmd, axis=1, category=cat, col_labels=COL_LABELS[cat], meta=('cypher_cmds', 'str')).persist()

    print(f"\t{log('Command building complete.')} ")

    print(f"\t{log('Attempting to execute commands...')} ")

    # non-distributed
    res = cmds.apply(driver.execute_command, meta=('execution_res', 'bool')).persist()

    # distributed.
    # res = client.persist(cmds.apply(process_command, uri=URI, meta=('execution_res', 'bool')))

    print(f"\t{log('Command execution complete.')} ")

    print(f"\nFailed Command Creations: {cmds[cmds=='Failed'].count().compute()}")
    print(f"Failed Command Executions: {res[res==False].count().compute()}")

    return

@track_time
def create_relations(relation_name:str, cat_one:str, cat_two:str, relation:str, batch_size:int, reps:int):
    """Creates and Executes Node Relation Commands based on passed relation values.

    Args:
        relation_name: a display name for the relation, used only in the log output.
        cat_one: a string containing the category name to start the relation from.
        cat_two: a string containing the category name to direct the relation to.
        relation: a string containing the desired relation between the node categories.
        batch_size: the number of nodes to create a relation between in one cmd.
        reps: the number of times to execute the command.
    """

    # create new database connection.
    # driver = GraphDriver(URI)


    execution_failures = 0
    print(f"\n{'-'*30}{relation_name} Relation{'-'*30}")

    print(f"\t{log('Building command...')} ")
    # attempt to create command.
    try:
        cmd = build_relation_cmd(cat_one, cat_two, relation, batch_size )
    except Exception as e:
        print(f"\t{log('Failed to build command. Exiting.')} ")
        return

    # attempt to run command.
    print(f"\t{log('Attempting to execute commands...')} ")
    for i in range(reps):
        if not (driver.execute_command(cmd)):
            execution_failures+=1

    print(f"\t{log(f'Failed Command Executions: {execution_failures}')}")

    # driver.close()

    return

relation_cmds = {
    "Predication -> Sentence": {
        "category_one": "predication",
        "category_two": "sentence",
        "relation": "PREDICATE_OF",
    },
    "Predication -> Entity": {
        "category_one": "predication",
        "category_two": "entity",
        "relation": "PREDICATES",
    },
    "Entity -> Sentence": {
        "category_one": "entity",
        "category_two": "sentence",
        "relation": "SUBJECT_OF",
    },
}

In [None]:
# run code on small portion of dataset.
for cat in CATEGORIES:
    data[cat] = data[cat].partitions[0]

# create and execute commands to create nodes for each dataframe.
for cat in CATEGORIES:
    create_nodes(cat)


------------------------------PREDICATION------------------------------
	21:20:10	Building commands... 
	21:20:15	Command building complete. 
	21:20:15	Attempting to execute commands... 
	22:32:32	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4341.93 seconds.


------------------------------SENTENCE------------------------------
	22:32:32	Building commands... 
	22:32:35	Command building complete. 
	22:32:35	Attempting to execute commands... 
	23:29:55	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 3443.85 seconds.


------------------------------ENTITY------------------------------
	23:29:55	Building commands... 
	23:30:01	Command building complete. 
	23:30:01	Attempting to execute commands... 
	00:58:40	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 1386
Elapsed time: 5324.15 seconds.



## Results of Node Creation

A subset (25) of the created predication, sentence, and entity nodes is shown below, in that order. Each node category has roughly 400,000 nodes.

In [None]:
%%html
<iframe src="https://drive.google.com/file/d/1S1kQSC7eJ2WOghXqorkvoBu19_QnJGGV/preview" width="640" height="480" allow="autoplay"></iframe>

In [None]:
%%html
<iframe src="https://drive.google.com/file/d/18cXuOOekhR7AI_zMXfr_QGw0ndzv0boB/preview" width="640" height="480" allow="autoplay"></iframe>

In [None]:
%%html
<iframe src="https://drive.google.com/file/d/1F98A49WhvydMs5XbNKUEGf4Wj2ZiNnig/preview" width="640" height="480" allow="autoplay"></iframe>

In [None]:
# create and execute commands to make relationships between nodes in database.
batch_size = 3500
expected_nodes = 400000
for key, value in relation_cmds.items():

    create_relations(key, value["category_one"], value["category_two"], value["relation"], batch_size, math.ceil(expected_nodes/batch_size))


------------------------------Predication -> Sentence Relation------------------------------
	01:44:05	Building command... 
	01:44:05	Attempting to execute commands... 
	01:44:49	Failed Command Executions: 0
Elapsed time: 44.19 seconds.


------------------------------Predication -> Entity Relation------------------------------
	01:44:49	Building command... 
	01:44:49	Attempting to execute commands... 
	01:45:43	Failed Command Executions: 0
Elapsed time: 54.52 seconds.


------------------------------Entity -> Sentence Relation------------------------------
	01:45:43	Building command... 
	01:45:43	Attempting to execute commands... 
	01:46:26	Failed Command Executions: 0
Elapsed time: 43.17 seconds.
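
The number of repetitions passed to `create_relations` above follows directly from the batch size and the approximate node count. The short calculation below reproduces it, using only the values from the cell above.

```python
import math

batch_size = 3500
expected_nodes = 400_000  # approximate node count per category, from the text

reps = math.ceil(expected_nodes / batch_size)
print(reps)  # 115

# Upper bound on matched pairs processed per relation type if every batch is full.
print(reps * batch_size)  # 402500
```

Note that the `LIMIT` in the relation command applies to matched pairs of nodes. If a relation type has more matching pairs than this bound, some pairs would remain unlinked after the loop; this is an observation about the arithmetic, not a measured result.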



## Results of Relation Creation

In the image below, a query that finds nodes with matching sentence IDs is executed. As with any other resource-intensive query, the results are limited to avoid potentially crashing the server.

As expected, multiple entities relate to a single predicate (predication node), which relates to a single sentence. This is because there are roughly 15x more entities than predications and 8x more entities than sentences, indicating a one-to-many relationship between entities and sentences/predications.

In [3]:
%%html
<iframe src="https://drive.google.com/file/d/1qRpJWaNN17LgcY_mQ3umNslTQeegFy3J/preview" width="640" height="480" allow="autoplay"></iframe>

## Complex Operations

In the following section, we'll experiment with some features we can use with the Neo4j database.

First, we query all available predication nodes to gather a list of unique subject semantic types. Next, we select two semantic types and construct a query that returns all nodes (predication, sentence, and entity) that connect to a single sentence. For this to work, we needed to do some background work and inspect the database for sets of predication nodes that connected to a single sentence node. Since the entity nodes have a "many-to-one" relation with both predication and sentence nodes, where "many" is an understatement, and predication nodes have more of a "some-to-one" relation with the sentence nodes, we focused on the predication nodes. The semantic types "menp," which stands for "Mental Process," and "hlca," which stands for "Health Care Activity," were two suitable candidates.

Finally, we construct and execute a query that returns all nodes related to the same sentence and connected to predication nodes with a subject semantic type of "menp" or "hlca."

The result was a single interconnected graph where all entity nodes connected to the sentence node via a "SUBJECT_OF" relation, all predication nodes connected to the sentence node via a "PREDICATE_OF" relation, and all predication nodes connected to all entity nodes via a "PREDICATES" relation.


In [None]:
# find all unique subject semtypes among available predication nodes.
cmd = "MATCH (p:Predication) WITH DISTINCT p.subject_semtype as distinct_subject_types RETURN collect(distinct_subject_types)"

# result is a 3D array of strings in this case.
# See documentation here: https://neo4j.com/docs/api/python-driver/current/api.html#neo4j.Result.values
unique_semtypes = driver.execute_command_with_results(cmd, "values")
print(*unique_semtypes[0][0], sep = "\n")

virs
dsyn
mamm
bacs
aapp
gngm
bact
phsu
celc
inch
anim
topp
antb
medd
inbe
fndg
sosy
hlca
lbpr
orch
inpo
genf
emst
bpoc
cell
phsf
tisu
enzy
chem
patf
neop
horm
nusq
imft
orgf
mobd
menp
bdsu
podg
moft
spco
diap
nnon
blor
hcro
prog
ortf
resa
celf
comd
bsoj
cgab
acab
orga
humn
popg
lipd
carb
elii
vita
mbrt
irda
npop
aggp
orgt
chvf
plnt
hops
phpr
emod
fngs
invt
arch
chvs
alga
socb
food
sbst
rcpt
nsba
bdsy
grup
orgm
vtbt
bodm
eico
strd
anst
mnob
anab
biof
fish
dora
opco
geoa
bird
qnco
pros
clnd
rnlw
edac
clna
famg
lbtr
eehu
rept
amph
bmod
ocac
resd
ffas
mcha
shro
rich
ocdi
acty
hcpp
amas
gora
bhvr
phob


In [None]:
# extracted from list above
# full semantic type descriptions can be found here: https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt
semtypes = [
    "menp", # Mental Process
    "hlca" # Health Care Activity
]

# create ids to separate each semtype. Will be of form `p0`, `p1`, ...
ids = [f"`p{i}`" for i in range(len(semtypes))]

# define all variables in query
cmd = "MATCH (s:Sentence), (e:Entity), "
# iterate over `pX` ids and add to variable definitions.
cmd += (", ".join([f"({id}:Predication)" for id in ids]) + "\n")


# ensure all sentence ids match. In other words, all nodes should connect to a single sentence.
cmd += "WHERE (\n(s.sentence_id = e.sentence_id AND "

# ensure each predication variable's sentence id matches the sentence id.
# Final format `pX`.sentence_id = `pX+1`.sentence_id AND ... `pN`.sentence_id = s.sentence_id
cmd += (
    " AND ".join(
        [f"{ids[i]}.sentence_id = {ids[i+1]}.sentence_id" if i != len(ids)-1 else f"{ids[i]}.sentence_id = s.sentence_id" for i in range(len(ids))]
    ) + ")\n"
)

# match all semtypes in list
cmd += (
    "AND ("
    + " AND ".join([f"{id}.subject_semtype = '{semtype}'" for id, semtype in zip(ids ,semtypes)])
    + ")\n)\n"
    )


def get_relations(id:str, pos:int) -> str:
    """Retrieves the relations for a predication node.

    Args:
        id: the variable identifier for the predication node.
        pos: the position of the variable identifier in relation to the others.

    Returns:
        A portion of a cypher command which limits results only to Nodes that have
        a relation to another category of node.
    """
    start = pos*3

    string = f"MATCH ({id})-[`r{start}`]->(s) \n"
    string += f"MATCH ({id})-[`r{start+1}`]->(e) \n"
    string += f"MATCH ({id})-[`r{start+2}`]->(s) \n"

    return string

# ensure only nodes with relations are returned.
# Many nodes will not have relations because we are only working with a portion of the data.
cmd += "".join([get_relations(id, i) for i, id in enumerate(ids)])

# limit to 100 Results.
cmd += "RETURN s, e, "
cmd += (
    ", ".join([f"collect(distinct `r{i}`)" for i in range(len(ids)*3)])
    + ", "
)
cmd += (
    ", ".join(ids)
    + "\n"
)
cmd += "LIMIT 100"

print(cmd)

MATCH (s:Sentence), (e:Entity), (`p0`:Predication), (`p1`:Predication)
WHERE (
(s.sentence_id = e.sentence_id AND `p0`.sentence_id = `p1`.sentence_id AND `p1`.sentence_id = s.sentence_id)
AND (`p0`.subject_semtype = 'menp' AND `p1`.subject_semtype = 'hlca')
)
MATCH (`p0`)-[`r0`]->(s) 
MATCH (`p0`)-[`r1`]->(e) 
MATCH (`p0`)-[`r2`]->(s) 
MATCH (`p1`)-[`r3`]->(s) 
MATCH (`p1`)-[`r4`]->(e) 
MATCH (`p1`)-[`r5`]->(s) 
RETURN s, e, collect(distinct `r0`), collect(distinct `r1`), collect(distinct `r2`), collect(distinct `r3`), collect(distinct `r4`), collect(distinct `r5`), `p0`, `p1`
LIMIT 100
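
The chained `sentence_id` conditions in the `WHERE` clause generalize to any number of semantic types: each predication variable is compared with the next one, and the last is compared with the sentence node. The sketch below isolates that step with three identifiers. `chain_sentence_conditions` is a hypothetical helper written for illustration and mirrors the list comprehension used above.

```python
def chain_sentence_conditions(ids: list) -> str:
    """Links each id to the next one, and the last id to the sentence node `s`."""
    parts = []
    for i, current in enumerate(ids):
        target = ids[i + 1] if i < len(ids) - 1 else "s"
        parts.append(f"{current}.sentence_id = {target}.sentence_id")
    return " AND ".join(parts)


ids = [f"`p{i}`" for i in range(3)]
print(chain_sentence_conditions(ids))
```

Because equality is transitive, this chain is enough to force every predication variable onto the same sentence without comparing each one to `s` directly.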


In [None]:
%%timeit -n 1 -r 3

driver.execute_command_with_results(cmd, "graph")

166 ms ± 54.7 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


**NOTE: The following cell will not display properly unless the code has been executed. Please see the next cell for a static image of the cmd result.**

In [None]:
from yfiles_jupyter_graphs import GraphWidget

w = GraphWidget(graph=driver.execute_command_with_results(cmd, "graph"))
w.show()

GraphWidget(layout=Layout(height='500px', width='100%'))

In [None]:
%%html

<iframe src="https://drive.google.com/file/d/1ipRIVbZKerwQPuMCJhhrj7OO8Fn0U50x/preview" width="640" height="480" allow="autoplay"></iframe>

# Performance Testing

Below are the same functions utilized in the main portion of the code with some slight alterations to make them work using the itertuples method. There are no significant changes that will affect performance of the test.

In [None]:
def build_node_cmd_by_itertuples(row, category:str, col_labels: list[tuple[str, type]]) -> str:
    """Builds command to create a node with appropriate properties.

    Args:
        row: the data extracted from a dataframe.
        category: the category of the data (i.e. "predication", "entity", or "sentence").
        col_labels: a list of tuples specifying the labels and data type of each value.

    Returns:
        A string containing a properly formatted Cypher command.
    """
    try:
        # create node properties in form "prop:val, prop:val,..."

        # note, used integer-based indexing rather than label-based indexing.
        props = ', '.join([
            make_prop_string(label, row[i], dtype) for i, (label, dtype) in enumerate(col_labels)
        ])

        cmd = f"CREATE(:{category.capitalize()}{{{props}}})"

    # avoid crashing the program when command creation fails.
    except Exception:
        cmd = "Failed"
    return cmd

@track_time
def create_nodes_by_itertuples(cat):
    """Creates and Executes Node Creation Commands for Dataframe based on category.

    Args:
        cat: the category of the dataframe (i.e. "predication", "entity", or "sentence").

    Note:
        All commands are attempted regardless of previous failure.
    """

    # distributed
    # driver = GraphDriver(URI)

    # note, added failure checking during each loop instead of at the end.
    failures = {
        "creation" : 0,
        "execution" : 0
    }

    print(f"\n{'-' *30}{cat.upper()}{'-' *30}")

    # build and execute one command per row.
    print(f"\t{log('Building and executing commands...')} ")

    for row in data[cat].itertuples():
        cmd = build_node_cmd_by_itertuples(row[1:],category=cat, col_labels=COL_LABELS[cat])

        if cmd == "Failed":
            failures["creation"] += 1

        if not (driver.execute_command(cmd)):
            failures["execution"] += 1

    print(f"\t{log('Build and execution completed.')} ")

    print(f"\nFailed Command Creations: {failures['creation']}")
    print(f"Failed Command Executions: {failures['execution']}")

    # distributed
    # driver.close()

    return
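
The `row[1:]` slice and the integer-based indexing used above depend on the shape of the rows that `itertuples` yields: a named tuple whose first element is the index, followed by the column values. The sketch below mimics that shape with `collections.namedtuple` so it runs without a dataframe; the `Row` type and its values are made up for illustration.

```python
from collections import namedtuple

# Mimics the shape of a row produced by itertuples(): index first, then columns.
Row = namedtuple("Row", ["Index", "SENTENCE_ID", "TYPE"])
row = Row(Index=0, SENTENCE_ID=42, TYPE="ti")

col_labels = [("SENTENCE_ID", int), ("TYPE", str)]

values = row[1:]  # drop the index so positions line up with col_labels
pairs = [(label, values[i]) for i, (label, _) in enumerate(col_labels)]
print(pairs)  # [('SENTENCE_ID', 42), ('TYPE', 'ti')]
```

Positional indexing only works if the dataframe columns appear in the same order as the entries in `COL_LABELS`; the label-based version used with `apply` does not have this requirement.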

Summary: There are three main bottlenecks in the process: creating the commands for all rows (somewhat time consuming), executing each command on the neo4j server (very, very, very time consuming), and creating the relations between the nodes after the uploads (tba).

Generally, the faster we can iterate over some data, whether over all rows in the dataframe to extract the data to build commands or over a series of commands, the faster a process will go. In the test below, a small portion (~12 MB) of real data is iterated over using two different methods: the [apply](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html) method and the [itertuples](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.itertuples.html) method. Each set of tests is run 3 times to calculate an average time, as performance can vary slightly between runs. The data is stored in a single file.

First, we test command building. The results are below.

_Note_:

- The following tests were executed independently of the other sections of this file.


In [None]:
t_cat = "predication"
temp = "predication_full"
@track_time
def test_build_by_apply(cat):
    cmds = data[cat].apply(build_node_cmd, axis=1, category=cat, col_labels=COL_LABELS[cat], meta=('cypher_cmds', 'str')).persist()


@track_time
def test_build_by_itertuples(cat):
    for row in data[cat].itertuples():
        cmd = build_node_cmd_by_itertuples(row[1:],category=cat, col_labels=COL_LABELS[cat])

# save full dataframe in temp location.
data[temp] = data[t_cat]

# 1 partition roughly = 12 MB.
data[t_cat] = data[temp].partitions[0]

In [None]:
%%timeit -n 1 -r 3

test_build_by_apply(t_cat)

Elapsed time: 5.7 seconds.

Elapsed time: 5.62 seconds.

Elapsed time: 5.59 seconds.

5.64 s ± 44.9 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [None]:
%%timeit -n 1 -r 3

test_build_by_itertuples(t_cat)

Elapsed time: 2.44 seconds.

Elapsed time: 2.4 seconds.

Elapsed time: 2.41 seconds.

2.41 s ± 18.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


It's clear that the itertuples method offers better (~2x) performance during this isolated stage, but what about the command execution stage?

Here, we'll use the functions that combine the command creation and command execution steps and get average run times for both methods.

_Notes_:
1. Each iteration generates ~400,000 nodes, so we'll fully wipe the database between runs. This is fine for testing since we're only looking at relative performance rather than absolute. Also, the action is completed for both test cases, so its impact can safely be ignored.
2. These tests were performed without the "check for null Sentence ID" step in the current `create_nodes` implementation. This was to ensure that there would be no notable discrepancy between operational complexities of the two functions other than the underlying iteration method.

In [None]:
%%timeit -n 1 -r 3

create_nodes(t_cat)

# clear nodes
driver.delete_data()


------------------------------PREDICATION------------------------------
	23:54:32	Building commands... 
	23:54:38	Command building complete. 
	23:54:38	Attempting to execute commands... 
	01:08:01	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4409.08 seconds.


------------------------------PREDICATION------------------------------
	01:08:07	Building commands... 
	01:08:13	Command building complete. 
	01:08:13	Attempting to execute commands... 
	02:24:06	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4559.33 seconds.


------------------------------PREDICATION------------------------------
	02:24:11	Building commands... 
	02:24:17	Command building complete. 
	02:24:17	Attempting to execute commands... 
	03:40:30	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4578.06 seconds.

1h 15min 19s ± 1min 15s per loop (mean ± std. dev. of 3 runs, 1 loop each)

In [None]:
%%timeit -n 1 -r 3

create_nodes_by_itertuples(t_cat)

# clear nodes
driver.delete_data()


------------------------------PREDICATION------------------------------
	03:40:34	Building and executing commands... 
	04:56:33	Build and execution completed. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4558.82 seconds.


------------------------------PREDICATION------------------------------
	04:56:39	Building and executing commands... 
	06:12:42	Build and execution completed. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4563.26 seconds.


------------------------------PREDICATION------------------------------
	06:12:47	Building and executing commands... 
	07:28:56	Build and execution completed. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 4568.75 seconds.

1h 16min 7s ± 4.28 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


Interestingly, once command execution is included, the two methods offer comparable performance: about 1h 16min per run for itertuples versus about 1h 15min for apply.

Let's run the same test again using a larger (~12x) dataset and see how the two methods compare.

In [None]:
data[t_cat] = data[temp].partitions[0:12]

In [None]:
%%timeit -n 1 -r 3

test_build_by_apply(t_cat)

Elapsed time: 68.61 seconds.

Elapsed time: 67.87 seconds.

Elapsed time: 68.16 seconds.

1min 8s ± 304 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [None]:
%%timeit -n 1 -r 3

test_build_by_itertuples(t_cat)


Elapsed time: 29.26 seconds.

Elapsed time: 29.21 seconds.

Elapsed time: 29.28 seconds.

29.2 s ± 29.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [None]:
%%timeit -n 1 -r 3

batch_size = 200000
generated = 5000000
# creates about 5 million nodes in database.
create_nodes(t_cat)

# clear nodes. Attempting to delete all nodes at once may crash server so delete in portions.
driver.delete_data(batch_size, math.ceil(generated/batch_size))


------------------------------PREDICATION------------------------------
	13:27:48	Building commands... 
	13:28:57	Command building complete. 
	13:28:57	Attempting to execute commands... 
	15:43:43	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 8155.02 seconds.


------------------------------PREDICATION------------------------------
	15:45:46	Building commands... 
	15:46:54	Command building complete. 
	15:46:54	Attempting to execute commands... 
	18:00:54	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 8108.19 seconds.


------------------------------PREDICATION------------------------------
	18:03:00	Building commands... 
	18:04:08	Command building complete. 
	18:04:08	Attempting to execute commands... 
	20:18:39	Command execution complete. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 8138.49 seconds.

2h 16min 18s ± 18.5 s per loop (mean ± std. dev. of 3 runs, 1 loop each)
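
The clean-up step in the cell above deletes nodes in portions rather than all at once. The following is a minimal sketch of that idea, not the `driver.delete_data` implementation, which is not reproduced in this section. The function name `plan_batched_delete` is hypothetical, and the Cypher statement is one common way to delete a bounded number of nodes per call.

```python
import math
from typing import Tuple


def plan_batched_delete(total_nodes: int, batch_size: int) -> Tuple[str, int]:
    """Return a Cypher statement that deletes one batch and how many times to run it.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # DETACH DELETE also removes relationships attached to the deleted nodes.
    query = f"MATCH (n) WITH n LIMIT {batch_size} DETACH DELETE n"
    # Round up so that a final, partial batch is not left behind.
    rounds = math.ceil(total_nodes / batch_size)
    return query, rounds


query, rounds = plan_batched_delete(total_nodes=5_000_000, batch_size=200_000)
print(f"Run {rounds} times: {query}")
```

Rounding up with `math.ceil` matters only when the node count is not an exact multiple of the batch size; with 5,000,000 nodes and batches of 200,000 the count divides evenly.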

In [None]:
%%timeit -n 1 -r 3

create_nodes_by_itertuples(t_cat)

# clear nodes. Attempting to delete all nodes at once may crash server so delete in portions.
driver.delete_data(batch_size, math.ceil(generated/batch_size))


------------------------------PREDICATION------------------------------
	20:20:43	Building and executing commands... 
	10:43:50	Build and execution completed. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 51786.79 seconds.


------------------------------PREDICATION------------------------------
	10:45:51	Building and executing commands... 
	01:23:06	Build and execution completed. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 51723.85 seconds.


------------------------------PREDICATION------------------------------
	01:25:07	Building and executing commands... 
	15:50:13	Build and execution completed. 

Failed Command Creations: 0
Failed Command Executions: 0
Elapsed time: 51906.18 seconds.

14h 24min 10s ± 1min 15s per loop (mean ± std. dev. of 3 runs, 1 loop each)


Annnnd what a letdown...

While the itertuples method offered comparable, and in the build-only stage better, performance than the apply method on the small dataset, it suffered badly on the large one: roughly 14h 24min per run versus 2h 16min for apply, over 6x slower. This is because the itertuples method iterates through a dask dataframe linearly; in other words, with no optimizations. Whether the data is split across multiple files or stored in a single file, the itertuples method treats the dataframe as one gigantic list and accesses its items one after another. Meanwhile, the apply method parallelizes the operations with respect to each file, leading to significantly reduced downtime between node commands being executed.
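
To make the contrast concrete, here is a toy illustration using only the standard library. It does not use dask or Neo4j: plain lists stand in for partitions, `run_cmd` is a hypothetical stand-in for building and executing one node command, and the `time.sleep` call is an assumed placeholder for a database round trip. It only shows why handling partitions concurrently can reduce the time spent waiting between commands.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ROUND_TRIP_S = 0.005  # assumed stand-in for one database round trip


def run_cmd(row_id):
    """Hypothetical stand-in for building and executing one node command."""
    time.sleep(ROUND_TRIP_S)
    return f"CREATE (:Predication {{PREDICATION_ID: {row_id}}})"


def run_sequentially(partitions):
    # Linear traversal: every row of every partition, one after another.
    return [run_cmd(row_id) for part in partitions for row_id in part]


def run_partition(part):
    return [run_cmd(row_id) for row_id in part]


def run_by_partition(partitions, workers=4):
    # Each partition is handled by its own worker; map() preserves partition order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(run_partition, partitions))
    return [cmd for part in per_partition for cmd in part]


# Four toy "partitions" of ten row ids each.
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(4)]

start = time.perf_counter()
sequential = run_sequentially(partitions)
t_seq = time.perf_counter() - start

start = time.perf_counter()
parallel = run_by_partition(partitions)
t_par = time.perf_counter() - start

print(f"sequential: {t_seq:.2f} s, per-partition: {t_par:.2f} s")
```

Both functions produce the same commands in the same order; only the waiting overlaps. Threads help here because the simulated work is waiting rather than computing, which is an assumption of this sketch and not a measurement of the notebook's workload.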

This is a great example of why performance testing is required in many scenarios. While one method of accomplishing a task may seem faster in one scenario, it can lead to serious performance drops in others.
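
As a quick check on the comparison, the short script below averages the "Elapsed time" values logged in the runs above and reports how long itertuples takes relative to apply for each dataset size. The dictionary layout and the `slowdown` helper are just one convenient way to arrange the numbers; the values themselves are copied from the logged output.

```python
from statistics import mean

# "Elapsed time" values (seconds) copied from the three logged runs of each test.
elapsed = {
    ("1 partition", "apply"): [4409.08, 4559.33, 4578.06],
    ("1 partition", "itertuples"): [4558.82, 4563.26, 4568.75],
    ("12 partitions", "apply"): [8155.02, 8108.19, 8138.49],
    ("12 partitions", "itertuples"): [51786.79, 51723.85, 51906.18],
}


def slowdown(size):
    """Mean itertuples time divided by mean apply time for one dataset size."""
    return mean(elapsed[(size, "itertuples")]) / mean(elapsed[(size, "apply")])


for size in ("1 partition", "12 partitions"):
    print(f"{size}: itertuples takes {slowdown(size):.2f}x as long as apply")
```

These figures exclude the database clean-up between runs, which the `%%timeit` totals include, so they differ slightly from the per-loop times reported by the notebook.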