# Automated Dependency Graph Generation Framework for FDA Medical Device Report
# Workflow
1. Environment setup
2. Convert text file to Part-of-Speech (POS) tags
  - Remember to replace the file paths.
  - The intermediate POS file is saved to disk.
3. POS to Neo4j Schema
  - The Neo4j schema is based on the POS file generated in step 2.
4. The schema is ready to execute in Neo4j

# Tech Stack
- spaCy for natural language processing
- Neo4j (Cypher) for the graph schema, which is generated automatically

# Dataset used
- FDA Medical Device Report (MDR) data: https://www.fda.gov/medical-devices/medical-device-reporting-mdr-how-report-medical-device-problems/mdr-data-files#download
  - Note: although this notebook is applied to the medical device dataset, the method can be applied to any text-based document.

# 1. Environment setup


In [1]:
# Not necessary on Colab. Elsewhere, also install the model used below:
#   python -m spacy download en_core_web_sm
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.8.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting blis<1.3.0,>=1.2.0 (from thinc<8.4.0,>=8.3.4->spacy)
  Downloading blis-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Downloading spacy-3.8.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.6 MB)
Downloading thinc-8.3.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.9 MB)
Downloading blis-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 400
pd.options.display.max_colwidth = 400

Replace the file paths below: the input text file, the intermediate POS file, and the output schema file.

In [None]:
# Modify the file path
input_file = "examples/originaltext_5.txt"
pos_file = "pos_5.txt"
output_file = "neo4j_pos_5.cypher"
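
When several input files are processed, the three paths can be derived from one another instead of being edited by hand. The following is only a sketch: `derive_paths` is a hypothetical helper, and it assumes the input file name ends in `_<number>` as in the example path above.

```python
from pathlib import Path

def derive_paths(input_file):
    """Hypothetical helper: build the POS and Cypher file names from the input name."""
    stem = Path(input_file).stem          # e.g. "originaltext_5"
    suffix = stem.rsplit("_", 1)[-1]      # e.g. "5" (assumes a "_<number>" ending)
    return f"pos_{suffix}.txt", f"neo4j_pos_{suffix}.cypher"

derived_pos_file, derived_output_file = derive_paths("examples/originaltext_5.txt")
print(derived_pos_file, derived_output_file)  # pos_5.txt neo4j_pos_5.cypher
```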

# 2. Convert text file to Part-of-Speech (POS) tags

In [2]:
def process_sentences_with_heads(input_file, output_file):
    # Load spaCy model
    nlp = spacy.load('en_core_web_sm')

    # Read all sentences from input file
    with open(input_file, 'r', encoding='utf-8') as f:
        # Read lines and filter out empty lines
        sentences = [line.strip() for line in f.readlines() if line.strip()]

    # Process the text
    doc = nlp(' '.join(sentences))

    # Create output
    output = []
    output.append("Token\tPOS\tDependency\tHead_Token\tHead_Index")
    output.append("-" * 60)

    for token in doc:
        # Get token's head (parent) information
        head_text = token.head.text if token.head != token else "ROOT"
        head_idx = token.head.i if token.head != token else token.i

        output.append(f"{token.text}\t{token.pos_}\t{token.dep_}\t{head_text}\t{head_idx}")

        if token.is_sent_end:
            output.append("")  # Add blank line between sentences

    # Write to file
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(output))

    print('\n'.join(output))

In [None]:
try:
    process_sentences_with_heads(input_file, pos_file)
    print(f"POS tags have been saved to {pos_file}")
except Exception as e:
    print(f"An error occurred: {str(e)}")

Token	POS	Dependency	Head_Token	Head_Index
------------------------------------------------------------
INFORMATION	PROPN	compound	RECEIVED	1
RECEIVED	PROPN	nsubj	INDICATED	4
BY	ADP	compound	MEDTRONIC	3
MEDTRONIC	PROPN	dobj	RECEIVED	1
INDICATED	VERB	ROOT	ROOT	4
THAT	SCONJ	mark	PASSED	8
,	PUNCT	punct	PASSED	8
CUSTOMER	PROPN	nsubj	PASSED	8
PASSED	VERB	ccomp	INDICATED	4
AWAY	ADV	advmod	PASSED	8
AT	ADP	prep	PASSED	8
HOME	PROPN	pobj	AT	10
ON	PROPN	dobj	PASSED	8
(	PUNCT	punct	ON	12
B)(6	NOUN	appos	ON	12
)	PUNCT	punct	2022	16
2022	NUM	appos	ON	12
.	PUNCT	punct	INDICATED	4

THE	DET	det	CUSTOMER	19
CUSTOMER	PROPN	nsubjpass	ADMITTED	22
WAS	AUX	auxpass	ADMITTED	22
NOT	PART	neg	ADMITTED	22
ADMITTED	VERB	ROOT	ROOT	22
TO	ADP	prep	ADMITTED	22
A	DET	det	PRIOR	26
HOSPITAL	NOUN	compound	PRIOR	26
PRIOR	NOUN	pobj	TO	23
TO	ADP	prep	PRIOR	26
THE	DET	det	INCIDENT	30
REPORTED	PROPN	compound	INCIDENT	30
INCIDENT	PROPN	pobj	TO	27
.	PUNCT	punct	ADMITTED	22

THE	DET	det	CUSTOMER	33
CUSTOMER	PROPN	nsubj	PASSED	34
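
The intermediate file is tab-separated, with a header row, a dashed separator, and a blank line between sentences. As a rough sketch of how that layout can be read back, the hypothetical function `read_pos_sentences` below groups rows into sentences; the sample rows are copied from the output above.

```python
def read_pos_sentences(text):
    """Parse POS file content into a list of sentences.

    Each sentence is a list of (token, pos, dep, head_token, head_index) tuples.
    """
    sentences, current = [], []
    # Skip the header row and the dashed separator line.
    for line in text.split("\n")[2:]:
        if not line.strip():              # a blank line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, dep, head, head_idx = line.split("\t")
        current.append((token, pos, dep, head, int(head_idx)))
    if current:                           # last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences

sample = "\n".join([
    "Token\tPOS\tDependency\tHead_Token\tHead_Index",
    "-" * 60,
    ".\tPUNCT\tpunct\tINDICATED\t4",
    "",
    "THE\tDET\tdet\tCUSTOMER\t19",
    "CUSTOMER\tPROPN\tnsubjpass\tADMITTED\t22",
])
for sentence in read_pos_sentences(sample):
    print(sentence)
```

Note that `Head_Index` counts tokens across the whole document rather than within a sentence, which is why the second sentence refers to indices 19 and 22.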


# 3. POS to Neo4j Schema

In [5]:
def generate_neo4j_queries(pos_file, output_file):
    # Read enhanced POS file
    with open(pos_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Skip header and separator
    lines = [line.strip() for line in lines[2:]]

    queries = []
    sentence_counter = 1
    token_counter = 0  # Global counter for token IDs
    sentence_start_indices = {}  # Store starting index for each sentence

    # Add constraints
    queries.append("// Create constraints")
    queries.append("CREATE CONSTRAINT IF NOT EXISTS FOR (w:Token) REQUIRE w.id IS UNIQUE;")
    queries.append("")

    # First pass: collect sentence start indices
    current_tokens = []
    for line in lines + [""]:  # Append empty line to process the last sentence
        if not line:  # Sentence boundary
            if current_tokens:
                sentence_start_indices[sentence_counter] = token_counter
                token_counter += len(current_tokens)
                current_tokens = []
                sentence_counter += 1
        else:
            current_tokens.append(line)

    # Reset counters for second pass
    sentence_counter = 1
    token_counter = 0
    current_sentence = []

    # Second pass: generate queries
    for line in lines + [""]:
        if not line:  # Sentence boundary
            if current_sentence:
                queries.extend(create_dependency_queries(current_sentence, sentence_counter, sentence_start_indices))
                queries.append("")
                token_counter += len(current_sentence)
                sentence_counter += 1
                current_sentence = []
        else:
            try:
                token, pos, dep, head_token, head_idx = line.split('\t')
                current_sentence.append((token, pos, dep, head_token, int(head_idx)))
            except ValueError:
                print(f"Skipping malformed line: {line}")

    # End of Schema
    queries.append("// This is end of schema.")

    # Write queries to file
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(queries))

def create_dependency_queries(sentence_tokens, sentence_num, sentence_start_indices):
    queries = []
    start_idx = sentence_start_indices.get(sentence_num, 0)

    # Create token nodes
    queries.append(f"// Create tokens for sentence {sentence_num}")
    create_tokens = []
    for i, (token, pos, dep, _, _) in enumerate(sentence_tokens):
        token_id = start_idx + i
        token = token.replace("\\", "\\\\").replace("'", "\\'")  # Escape backslashes, then single quotes
        create_tokens.append(
            f"(t{token_id}:Token {{id: {token_id}, text: '{token}', pos: '{pos}', position: {i}}})"
        )

    queries.append("CREATE " + ",\n       ".join(create_tokens))
    queries.append(f"WITH [{', '.join(f't{start_idx + i}' for i in range(len(sentence_tokens)))}] as nodes")

    # Create dependency relationships
    queries.append(f"// Create dependency relationships for sentence {sentence_num}")
    for i, (_, _, dep, _, head_idx) in enumerate(sentence_tokens):
        curr_id = start_idx + i

        if curr_id != head_idx and head_idx >= 0:
            queries.append(
                f"MATCH (t1:Token {{id: {curr_id}}}), (t2:Token {{id: {head_idx}}}) WITH t1, t2 CREATE (t1)-[:{dep}]->(t2);"
            )

    return queries
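
The first pass in `generate_neo4j_queries` assigns each sentence a starting token ID that is the running total of the token counts of all earlier sentences, so node IDs line up with the document-wide `Head_Index` values. A compact way to express the same idea, as a sketch with the hypothetical name `sentence_start_offsets`:

```python
from itertools import accumulate

def sentence_start_offsets(sentence_lengths):
    """Map 1-based sentence numbers to the global ID of their first token."""
    # Running totals, starting at 0; the final total is not a start offset.
    starts = list(accumulate(sentence_lengths, initial=0))[:-1]
    return {number: start for number, start in enumerate(starts, start=1)}

# The first two sentences in the sample output contain 18 and 14 tokens.
print(sentence_start_offsets([18, 14]))  # {1: 0, 2: 18}
```

The third sentence would then start at ID 32, which matches the sample output, where `THE` in the third sentence points to `CUSTOMER` at index 33.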

In [11]:
try:
    generate_neo4j_queries(pos_file, output_file)
    print(f"Neo4j queries have been saved to {output_file}")
except Exception as e:
    print(f"An error occurred: {str(e)}")
else:
    # For Colab: enable file download (skipped when not running on Colab)
    try:
        from google.colab import files
        files.download(output_file)
    except ImportError:
        pass

Neo4j queries have been saved to neo4j_pos_5.cypher
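
The generator writes each dependency label directly into the query as the relationship type. That works for plain labels such as `nsubj` or `compound`, but a label containing other characters (for example a UD-style subtype like `nsubj:pass`) would need backtick quoting in Cypher. The sketch below shows one way to handle this; `safe_rel_type` and `relationship_query` are hypothetical helpers, not part of the notebook.

```python
import re

def safe_rel_type(dep):
    """Return a dependency label in a form usable as a Cypher relationship type."""
    # Plain identifiers can be used as-is.
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", dep):
        return dep
    # Otherwise quote with backticks, doubling any backtick inside the label.
    return "`" + dep.replace("`", "``") + "`"

def relationship_query(curr_id, head_id, dep):
    """Build one relationship statement in the same shape the generator emits."""
    return (
        f"MATCH (t1:Token {{id: {curr_id}}}), (t2:Token {{id: {head_id}}}) "
        f"CREATE (t1)-[:{safe_rel_type(dep)}]->(t2);"
    )

print(relationship_query(0, 1, "compound"))
print(relationship_query(5, 7, "nsubj:pass"))
```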



# 4. Execute Schema in Neo4j (Done in Neo4j)

## Set up and tutorials
1. Download the Neo4j Desktop app.
2. Open a Neo4j DBMS in Neo4j Browser. The Example Project's Movie DBMS works for this.
3. The Movie Graph tutorial is a good introduction to Neo4j.

## Import Schema into Neo4j
1. Clean up: Before creating the graph, you may need to remove the existing one (especially if you are testing the schema on the Movie DBMS). Run the following command to delete all existing nodes and relationships.

    ```
    MATCH (n) DETACH DELETE n;
    ```

2. Import our schema: Copy the entire contents of the .cypher file generated in step 3 and execute it. This creates the graph in Neo4j.

3. Visualize our graph: Run the following command to display the graph. If the graph is too big, add a `LIMIT` clause; the second command below limits the result to 25 nodes.

    ```
    MATCH (n) RETURN n;
    // or
    MATCH (n) RETURN n LIMIT 25;
    ```
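
If the .cypher file from the import step is run by a script rather than pasted into Neo4j Browser, it usually has to be sent one statement at a time. The following is a minimal sketch of splitting the file into statements; `split_cypher_statements` is a hypothetical helper, and it assumes the format the generator in step 3 produces (single-quoted strings with backslash escapes, and `//` comments on their own lines).

```python
def split_cypher_statements(script):
    """Split a Cypher script on semicolons that lie outside quoted strings."""
    # Drop whole-line comments first.
    lines = [ln for ln in script.splitlines() if not ln.lstrip().startswith("//")]
    statements, buf = [], []
    in_string, escaped = False, False
    for ch in "\n".join(lines):
        buf.append(ch)
        if in_string:
            if escaped:
                escaped = False           # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == "'":
                in_string = False
        elif ch == "'":
            in_string = True
        elif ch == ";":
            stmt = "".join(buf[:-1]).strip()
            if stmt:
                statements.append(stmt)
            buf = []
    tail = "".join(buf).strip()           # statement without a final semicolon
    if tail:
        statements.append(tail)
    return statements

script = (
    "// Create tokens for sentence 1\n"
    "CREATE (t0:Token {id: 0, text: 'a;b', pos: 'X', position: 0});\n"
    "MATCH (n) RETURN n LIMIT 25;\n"
)
for statement in split_cypher_statements(script):
    print(statement)
```

Each returned statement could then be passed to whatever client is used to talk to the database.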

# Reference
- Download Neo4j Desktop: https://neo4j.com/download/
- spaCy 101: https://spacy.io/usage/spacy-101
- https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/13-POS-Keywords.html