# Part 1. Data Processing

This is the first notebook covering our research.
In this part we will:
- perform EDA and data processing for the CTD knowledge base and the NCBI dataset
- set up a knowledge graph with Neo4j
- create embeddings for disease names and synonyms with two embedding models
- cross-validate the disease IDs between the knowledge base and the dataset

## Tools and libraries

In [10]:
import pandas as pd
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.ollama import OllamaEmbedding
from yfiles_jupyter_graphs import GraphWidget

from io import StringIO
import requests
from tqdm import tqdm

from utils.generic import get_driver, Models

## EDA of the Knowledge Base and NCBI Corpus

Throughout this project we will use the CTD dataset as the knowledge base and the NCBI corpus, which contains disease labels, to run the experiments.
First, let us look at the structure of the knowledge base so that we can design an appropriate structure for the knowledge graph.

In [34]:
df = pd.read_csv("../data/raw/CTD_diseases.csv", sep=',')

In [50]:
# Double quotes outside, single quotes inside: nested same-type quotes fail before Python 3.12
print(f"{len(df[df['DiseaseID'].isna()])} rows with no DiseaseID")
print(f"{len(df[df['ParentIDs'].isna()])} rows with no ParentIDs")
print(f"{len(df[df['DiseaseName'].isna()])} rows with no DiseaseName")

0 rows with no DiseaseID
1 rows with no ParentIDs
0 rows with no DiseaseName


In [35]:
df.head()

Unnamed: 0,DiseaseName,DiseaseID,AltDiseaseIDs,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms,SlimMappings
0,10p Deletion Syndrome (Partial),MESH:C538288,,,MESH:D002872|MESH:D025063,C16.131.260/C538288|C16.320.180/C538288|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,"Chromosome 10, 10p- Partial|Chromosome 10, mon...",Congenital abnormality|Genetic disease (inborn...
1,13q deletion syndrome,MESH:C535484,,,MESH:D002872|MESH:D025063,C16.131.260/C535484|C16.320.180/C535484|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,Chromosome 13q deletion|Chromosome 13q deletio...,Congenital abnormality|Genetic disease (inborn...
2,15q24 Microdeletion,MESH:C579849,DO:DOID:0060395,,MESH:D002872|MESH:D008607|MESH:D025063,C10.597.606.360/C579849|C16.131.260/C579849|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,15q24 Deletion|15q24 Microdeletion Syndrome|In...,Congenital abnormality|Genetic disease (inborn...
3,16p11.2 Deletion Syndrome,MESH:C579850,,,MESH:D001321|MESH:D002872|MESH:D008607|MESH:D0...,C10.597.606.360/C579850|C16.131.260/C579850|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,,Congenital abnormality|Genetic disease (inborn...
4,"17,20-Lyase Deficiency, Isolated",MESH:C567076,,,MESH:D000312,C12.050.351.875.253.090.500/C567076|C12.200.70...,C12.050.351.875.253.090.500|C12.200.706.316.09...,"17-Alpha-Hydroxylase-17,20-Lyase Deficiency, C...",Congenital abnormality|Endocrine system diseas...


Some properties have missing values (e.g. not all diseases have alternative disease IDs, synonyms or a definition). However, the properties we care about are DiseaseName, DiseaseID and ParentIDs (the latter is essential for building the graph correctly). Only one row has no ParentIDs, so let us examine it more closely.

In [51]:
df[df['ParentIDs'].isna()]

Unnamed: 0,DiseaseName,DiseaseID,AltDiseaseIDs,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms,SlimMappings
3887,Diseases,MESH:C,,,,C,,,


This row has the bare tree number "C" and no parent, so it is most likely the root node of the tree rather than corrupted data. Either way, we don't need it, so we will remove it.

In [52]:
df.dropna(subset=['ParentIDs'], inplace=True)

In [53]:
print(f"{len(df[df['ParentIDs'].isna()])} rows with no ParentIDs")

0 rows with no ParentIDs


As we can see, the document is structured as a graph, so we can easily create a knowledge graph database from it using Neo4j's Cypher query language.
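To make the graph structure concrete, the sketch below turns pipe-delimited `ParentIDs` values into child-to-parent edges in plain Python. The helper name `parent_edges` is hypothetical, and the two sample rows are copied from the preview above; this is an illustration only, not part of the notebook's pipeline.

```python
def parent_edges(rows):
    """Yield (child_id, parent_id) pairs from rows with pipe-delimited ParentIDs."""
    for row in rows:
        parents = row.get("ParentIDs")
        if not parents:  # missing (None) or empty ParentIDs: no edges to emit
            continue
        for parent_id in parents.split("|"):
            yield row["DiseaseID"], parent_id


# Two rows taken from the CTD preview shown earlier
sample_rows = [
    {"DiseaseID": "MESH:C538288", "ParentIDs": "MESH:D002872|MESH:D025063"},
    {"DiseaseID": "MESH:C567076", "ParentIDs": "MESH:D000312"},
]

edges = list(parent_edges(sample_rows))
for child, parent in edges:
    print(f"{child} -> {parent}")
```

Each pair corresponds to one relationship between two disease nodes in the graph.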

## Neo4j setup and BAAI embeddings

Since we will be using Neo4j's driver to connect to the database and will work with several embedding models, we moved the reusable code into a separate module.

In [8]:
embed_model = HuggingFaceEmbedding(model_name=Models.BAAI_BGE_SMALL_EN_V1_5.value)



In [2]:
driver = get_driver()

Firstly, we will define a function to create disease nodes.

In [None]:
def create_disease_nodes(tx, disease):
    tx.run("""
        MERGE (d:Disease {DiseaseID: $DiseaseID})
        SET d.DiseaseName = $DiseaseName, d.AltDiseaseIDs = $AltDiseaseIDs,
            d.Definition = $Definition, d.TreeNumbers = $TreeNumbers,
            d.ParentTreeNumbers = $ParentTreeNumbers, d.Synonyms = $Synonyms,
            d.SlimMappings = $SlimMappings
    """, 
    DiseaseID=disease['DiseaseID'],
    DiseaseName=disease['DiseaseName'],
    AltDiseaseIDs=disease['AltDiseaseIDs'],
    Definition=disease['Definition'],
    TreeNumbers=disease['TreeNumbers'],
    ParentTreeNumbers=disease['ParentTreeNumbers'],
    Synonyms=disease['Synonyms'],
    SlimMappings=disease['SlimMappings'])

Now we can iterate over the rows in the DataFrame and write our nodes into the KG.

In [17]:
with driver.session() as session:
    for _, row in df.iterrows():
        session.execute_write(create_disease_nodes, row)
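Writing one transaction per row can be slow for a table of this size. A common alternative is to send rows in batches and let Cypher's `UNWIND` iterate over them. The sketch below is an assumption about how that could look, not the approach used in this notebook: `chunked`, `BATCH_QUERY` and `write_batch` are hypothetical names, and only two properties are set for brevity.

```python
def chunked(items, size):
    """Split a list into consecutive sublists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical batched variant of the node-creation query (two properties only)
BATCH_QUERY = """
    UNWIND $rows AS row
    MERGE (d:Disease {DiseaseID: row.DiseaseID})
    SET d.DiseaseName = row.DiseaseName
"""


def write_batch(tx, rows):
    # `tx` is a transaction object such as the one passed by session.execute_write
    tx.run(BATCH_QUERY, rows=rows)


# Example of the batching step alone, on made-up dictionaries
rows = [{"DiseaseID": f"MESH:D{i:06d}", "DiseaseName": f"disease {i}"} for i in range(5)]
batches = chunked(rows, 2)
print([len(batch) for batch in batches])
```

With a driver available, each batch could then be written with `session.execute_write(write_batch, batch)`.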

We also need to connect the nodes into a hierarchical structure, so we will define another function for this.

In [19]:
def create_hierarchy(tx, disease):
    if pd.notna(disease['ParentIDs']):
        parent_ids = disease['ParentIDs'].split('|')
        for parent_id in parent_ids:
            tx.run("""
                MATCH (d:Disease {DiseaseID: $DiseaseID})
                MATCH (p:Disease {DiseaseID: $ParentID})
                MERGE (d)-[:SUB_CATEGORY_OF]->(p)
            """, DiseaseID=disease['DiseaseID'], ParentID=parent_id)

And write these relationships into the KG.

In [20]:
with driver.session() as session:
    for _, row in df.iterrows():
        session.execute_write(create_hierarchy, row)

The next step is to create embeddings for the disease names. We will need a function to collect disease names first.

In [18]:
def get_disease_names(tx):
    result = tx.run("""
        MATCH (d:Disease) 
        RETURN d.DiseaseID AS DiseaseID, d.DiseaseName AS DiseaseName
    """)
    return result.data()

In [32]:
with driver.session() as session:
    disease_descriptions = session.execute_read(get_disease_names)

And now we can generate embeddings for the disease names and update the KG.

In [35]:
embeddings = []
for record in disease_descriptions:
    disease_id = record['DiseaseID']
    disease_name = record['DiseaseName']
    
    name_embedding = embed_model.get_text_embedding(disease_name)
    
    embeddings.append((disease_id, name_embedding))

In [30]:
# Function to update disease embeddings
def update_disease_embeddings(tx, disease_id, embedding, embedding_model_name, embedding_prop='DiseaseEmbedding'):
    disease_embedding = f"{embedding_prop}-{embedding_model_name.replace('.', '_').replace('/', '-')}"

    query = """
        MATCH (d:Disease {DiseaseID: $DiseaseID})
        CALL apoc.create.setProperty(d, $disease_embedding, $embedding)
        YIELD node
        RETURN node
    """

    tx.run(query, DiseaseID=disease_id, embedding=embedding, disease_embedding=disease_embedding)
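The property name is derived from the model name, so each embedding model gets its own property on the node. The sketch below isolates that naming step; `embedding_property_name` is a hypothetical helper, and the model identifier `BAAI/bge-small-en-v1.5` is an assumption about the value stored in `Models` (it is consistent with the property name used later in this notebook).

```python
def embedding_property_name(model_name, prefix="DiseaseEmbedding"):
    """Build the node property name used to store embeddings for a given model."""
    # Dots and slashes are replaced, mirroring the f-string in the function above
    safe_model_name = model_name.replace(".", "_").replace("/", "-")
    return f"{prefix}-{safe_model_name}"


# Assumed model identifier; the real value lives in utils.generic.Models
print(embedding_property_name("BAAI/bge-small-en-v1.5"))
print(embedding_property_name("BAAI/bge-small-en-v1.5", prefix="SynonymsEmbedding"))
```

Because the resulting name contains hyphens, it has to be wrapped in backticks when referenced directly in a Cypher query.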

In [36]:
with driver.session() as session:
    for disease_id, embedding in embeddings:
        session.execute_write(update_disease_embeddings, disease_id, embedding, Models.BAAI_BGE_SMALL_EN_V1_5.value)

Let us explore a part of the graph to make sure the connections are set up correctly.

*NB: the code below doesn't retain its state, as it is an interface for interacting with the knowledge graph live. A screenshot can be found in `notebooks/media/png/Graph visualization.png`*

In [8]:
gw_session = driver.session()
gw = GraphWidget(graph = gw_session.run("MATCH (s)-[r]->(t) RETURN s,r,t LIMIT 20").graph())

Out of range float values are not JSON compliant: nan
Supporting this message is deprecated in jupyter-client 7, please make sure your message is JSON-compliant
  content = self.pack(content)


In [9]:
gw

GraphWidget(layout=Layout(height='710px', width='100%'))

Now we can test the embeddings using a graph query.

In [16]:
def find_similar_diseases(query_embedding, embedding_model_name):
    disease_embedding = f"DiseaseEmbedding-{embedding_model_name.replace('.', '_').replace('/', '-')}"

    with driver.session() as session:
        query = """
            MATCH (d:Disease)
            WHERE apoc.map.get(d, $disease_embedding, null) IS NOT NULL
            WITH d, gds.similarity.cosine(apoc.map.get(d, $disease_embedding, null), $query_embedding) AS similarity
            RETURN d.DiseaseName AS name, similarity
            ORDER BY similarity DESC
            LIMIT 5
        """
        result = session.run(query, query_embedding=query_embedding, disease_embedding=disease_embedding)
        
        return [record["name"] for record in result]
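The ranking done by this query can be reproduced in plain Python for small inputs, which may help when sanity-checking results. The sketch below computes cosine similarity and returns the top matches; `cosine_similarity`, `top_k` and the toy vectors are illustrative assumptions and do not touch the database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k=5):
    """Return the k candidate names most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Toy 2-dimensional vectors, made up for illustration only
toy = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy, k=2))
```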

In [27]:
query_embedding = embed_model.get_text_embedding("breast carcinoma")
similar_diseases = find_similar_diseases(query_embedding, Models.BAAI_BGE_SMALL_EN_V1_5.value)
print(similar_diseases)

['Carcinoma', 'Carcinoma, Ductal', 'Carcinoma, Ductal, Breast', 'Breast Neoplasms', 'Breast Carcinoma In Situ']


In [28]:
test_query_2 = embed_model.get_text_embedding("Type II human complement C2 deficiency. Allele-specific amino acid substitutions (Ser189 --> Phe; Gly444 --> Arg) cause impaired C2 secretion.")
similar_diseases_2 = find_similar_diseases(test_query_2, Models.BAAI_BGE_SMALL_EN_V1_5.value)
print(similar_diseases_2)

['Complement Component 3 Deficiency, Autosomal Recessive', 'Complement Component C1s Deficiency', 'COMPLEMENT COMPONENT 8 DEFICIENCY, TYPE II', 'COMPLEMENT COMPONENT 2 DEFICIENCY', 'COMPLEMENT COMPONENT C1r/C1s DEFICIENCY']


Now we can create embeddings for the disease synonyms as well.

In [5]:
def get_disease_synonyms(tx):
    result = tx.run("""
        MATCH (d:Disease) 
        RETURN d.DiseaseID AS DiseaseID, d.DiseaseName AS DiseaseName, d.Synonyms AS Synonyms
    """)
    return result.data()

In [20]:
with driver.session() as session:
    disease_synonyms = session.execute_read(get_disease_synonyms)

In [23]:
synonyms_embeddings = []
for record in disease_synonyms:
    disease_id = record['DiseaseID']
    synonyms = record['Synonyms']

    if pd.isna(synonyms):
        continue
    
    rec_synonyms_embeddings = [embed_model.get_text_embedding(name) for name in synonyms.split('|')]
    
    synonyms_embeddings.append((disease_id, rec_synonyms_embeddings))

In [32]:
def update_disease_synonyms_embeddings(tx, disease_id, embedding, embedding_model_name, embedding_prop='SynonymsEmbedding'):
    embedding_name = f"{embedding_prop}-{embedding_model_name.replace('.', '_').replace('/', '-')}"

    query = """
        WITH apoc.convert.toJson($embedding) AS jsonEmbeddings
        MATCH (d:Disease {DiseaseID: $DiseaseID})
        CALL apoc.create.setProperty(d, $embedding_name, jsonEmbeddings)
        YIELD node
        RETURN node;
    """

    tx.run(query, DiseaseID=disease_id, embedding=embedding, embedding_name=embedding_name)

In [33]:
with driver.session() as session:
    for disease_id, embedding in synonyms_embeddings:
        session.execute_write(update_disease_synonyms_embeddings, disease_id, embedding, Models.BAAI_BGE_SMALL_EN_V1_5.value)

We also need to set this property to NULL for any nodes that do not have `Synonyms`.

In [None]:
with driver.session() as session:
    query = """
        MATCH (d:Disease)
        WHERE d.`SynonymsEmbedding-BAAI-bge-small-en-v1_5` = []
        SET d.`SynonymsEmbedding-BAAI-bge-small-en-v1_5` = NULL
    """
    session.run(query)

And now we will create a single embedding for the list of synonyms for each node.

In [12]:
with driver.session() as session:
    disease_synonyms = session.execute_read(get_disease_synonyms)

    synonyms_embeddings = []
    for record in disease_synonyms:
        disease_id = record['DiseaseID']
        synonyms = record['Synonyms']

        if pd.isna(record['Synonyms']):
            synonyms = record['DiseaseName']
        else:
            synonyms = record['Synonyms']

        synonyms_string = ' '.join(synonyms.split('|'))
        
        rec_synonyms_embeddings = embed_model.get_text_embedding(synonyms_string)
        
        synonyms_embeddings.append((disease_id, rec_synonyms_embeddings))
    
    for disease_id, embedding in synonyms_embeddings:
        session.execute_write(update_disease_embeddings, disease_id, embedding, Models.BAAI_BGE_SMALL_EN_V1_5.value, "SynonymsSingleEmbedding")
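The text-preparation step in the cell above can be isolated into a small function, which makes the fallback to the disease name easier to see. This is a sketch only: `synonyms_to_text` is a hypothetical helper, and the sample values are taken from the CTD preview earlier in the notebook.

```python
import math


def synonyms_to_text(synonyms, disease_name):
    """Join pipe-delimited synonyms into one string, falling back to the name."""
    is_missing = synonyms is None or (isinstance(synonyms, float) and math.isnan(synonyms))
    source = disease_name if is_missing else synonyms
    return " ".join(source.split("|"))


print(synonyms_to_text("15q24 Deletion|15q24 Microdeletion Syndrome", "15q24 Microdeletion"))
print(synonyms_to_text(float("nan"), "16p11.2 Deletion Syndrome"))
```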

## Ollama embeddings

Now we can add another embedding from Ollama, served locally.

In [3]:
ollama3_embedding = OllamaEmbedding(
    model_name="llama3",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

In [None]:
# Open a fresh session: the one from the earlier cells is already closed
with driver.session() as session:
    disease_descriptions = session.execute_read(get_disease_names)

llama3_embeddings = []

for record in disease_descriptions:
    disease_id = record['DiseaseID']
    disease_name = record['DiseaseName']
    name_embedding = ollama3_embedding.get_text_embedding(disease_name)
    
    llama3_embeddings.append((disease_id, name_embedding))

In [48]:
len(llama3_embeddings)

13298

In [56]:
with driver.session() as session:
    for id, name_embedding in llama3_embeddings:
        session.execute_write(update_disease_embeddings, id, [float(x) for x in name_embedding], Models.LLAMA3.value)

And we can verify it.

In [35]:
query_embedding_llama3 = ollama3_embedding.get_text_embedding("breast carcinoma")
similar_diseases_llama3 = find_similar_diseases(query_embedding_llama3, Models.LLAMA3.value)
print(similar_diseases_llama3)

['Carcinoma', 'Pancreatic adenoma', 'Adenocarcinoma', 'Urachal cancer', 'familial dilated cardiomyopathy']


In [36]:
llama3_test_query_2 = ollama3_embedding.get_text_embedding("Type II human complement C2 deficiency. Allele-specific amino acid substitutions (Ser189 --> Phe; Gly444 --> Arg) cause impaired C2 secretion.")
similar_diseases_2_llama3 = find_similar_diseases(llama3_test_query_2, Models.LLAMA3.value)
print(similar_diseases_2_llama3)

['Hyperlysinemia Due To Defect In Lysine Transport Into Mitochondria', 'Adrenocortical Unresponsiveness To Acth With Postreceptor Defect', 'Immunodeficiency, Partial Combined, with Absence of HLA Determinants and Beta-2-Microglobulin from Lymphocytes', 'Chylomicronemia, Familial, due to Circulating Inhibitor of Lipoprotein Lipase', 'Ehlers-Danlos Syndrome with Platelet Dysfunction from Fibronectin Abnormality']


We can now create embeddings for `Synonyms` with the LLAMA3 embedding model and save them to the knowledge graph too.

In [None]:
with driver.session() as session:
    synonyms_embeddings = []
    disease_synonyms = session.execute_read(get_disease_synonyms)

    # Wrap the loop with tqdm for a progress bar
    for record in tqdm(disease_synonyms, desc="Processing Disease Synonyms"):
        disease_id = record['DiseaseID']
        synonyms = record['Synonyms']

        if pd.isna(record['Synonyms']):
            synonyms = record['DiseaseName']
        else:
            synonyms = record['Synonyms']

        rec_synonyms_embeddings = [ollama3_embedding.get_text_embedding(name) for name in synonyms.split('|')]
        
        synonyms_embeddings.append((disease_id, rec_synonyms_embeddings))

In [12]:
len(disease_synonyms)

13298

In [13]:
len(synonyms_embeddings)

5197

In [19]:
synonyms_embeddings[5196][0]

'MESH:D005892'

The previous cell stopped partway through: only 5,197 of the 13,298 records were processed. We therefore resume from where it stopped instead of starting over.

In [28]:
def generate_synonyms_embeddings(driver, existing_embeddings: list = None) -> list:
    if existing_embeddings is None:
        existing_embeddings = []

    # Determine how many records have already been processed
    already_processed_count = len(existing_embeddings)

    synonyms_embeddings = existing_embeddings.copy()

    with driver.session() as session:
        disease_synonyms = session.execute_read(get_disease_synonyms)
        
        # Skip the already processed records
        remaining_disease_synonyms = disease_synonyms[already_processed_count:]

        # Progress bar for processing remaining disease synonyms
        for record in tqdm(remaining_disease_synonyms, desc="Processing Remaining Disease Synonyms"):
            disease_id = record['DiseaseID']
            synonyms = record['Synonyms']

            if pd.isna(record['Synonyms']):
                synonyms = record['DiseaseName']
            else:
                synonyms = record['Synonyms']

            rec_synonyms_embeddings = [ollama3_embedding.get_text_embedding(name) for name in synonyms.split('|')]
            
            synonyms_embeddings.append((disease_id, rec_synonyms_embeddings))
    
    return synonyms_embeddings

def update_embeddings_in_db(driver, synonyms_embeddings: list):
    with driver.session() as session:
        # Progress bar for updating embeddings in the database
        for disease_id, embedding in tqdm(synonyms_embeddings, desc="Updating Disease Embeddings"):
            session.execute_write(update_disease_synonyms_embeddings, disease_id, embedding, Models.LLAMA3.value)
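The resume pattern can be demonstrated without a database or an embedding server. In the sketch below, `embed_with_resume` and `flaky_embed` are hypothetical names, and the failure is simulated; the point is only to show how partial results are kept and the remaining records are picked up on the next call.

```python
def embed_with_resume(records, embed_fn, done=None):
    """Embed records in order, keeping partial results if embed_fn raises."""
    results = list(done) if done else []
    # Skip as many records as already have results
    for record in records[len(results):]:
        try:
            results.append((record, embed_fn(record)))
        except Exception as error:
            print(f"Stopped at {record!r}: {error}")
            break
    return results


# Simulated embedding function that fails once, for illustration only
calls = {"failed": False}


def flaky_embed(text):
    if text == "c" and not calls["failed"]:
        calls["failed"] = True
        raise RuntimeError("temporary failure")
    return [float(len(text))]


records = ["a", "b", "c", "d"]
partial = embed_with_resume(records, flaky_embed)
complete = embed_with_resume(records, flaky_embed, done=partial)
print(len(partial), len(complete))
```

Both this sketch and the function above assume that the records come back in the same order on every read; adding an `ORDER BY d.DiseaseID` clause to the query would make that ordering explicit.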

In [21]:
# Resume embedding generation
complete_synonyms_embeddings = generate_synonyms_embeddings(driver, synonyms_embeddings)

Processing Remaining Disease Synonyms: 100%|██████████| 8101/8101 [2:32:37<00:00,  1.13s/it]   


Let us quickly verify that we've got all the embeddings.

In [23]:
len(complete_synonyms_embeddings)

13298

In [33]:
# Update the embeddings in the database
update_embeddings_in_db(driver, complete_synonyms_embeddings)

Updating Disease Embeddings: 100%|██████████| 13298/13298 [12:06<00:00, 18.32it/s]


## NCBI corpus processing

The dataset consists of entities that belong to one of these categories:

- DiseaseClass: This can correspond to the high-level categories in the CTD's MEDIC-Slim, which classify diseases into broad categories such as genetic diseases, neoplasms, etc.
- CompositeMention: This can be related to composite terms in CTD that might combine aspects of multiple diseases or conditions, although CTD primarily focuses on distinct disease terms rather than composite mentions.
- SpecificDisease: These can be mapped directly to specific disease terms in the CTD, which are detailed with their MeSH or OMIM identifiers.
- Modifier: Modifiers like "tumor" or "cancer" can be seen in context with specific diseases in CTD, modifying the understanding or description of the disease, such as in "breast cancer" or "ovarian cancer".

It has been prepared for ML training and is therefore split into development, training and testing chunks. Since we are not developing an ML model for our task, we will process these chunks and combine them. We will also separate the text from the annotations and save them separately.

In [323]:
with open('../data/raw/NCBI_corpus/NCBIdevelopset_corpus.txt', 'r') as f:
    data = f.read()

In [324]:
with open('../data/raw/NCBI_corpus/NCBItrainset_corpus.txt', 'r') as f:
    train_data = f.read()

In [325]:
with open('../data/raw/NCBI_corpus/NCBItestset_corpus.txt', 'r') as f:
    test_data = f.read()

In [26]:
# Split the input data into lines
lines = [line for line in data.split("\n") if line.strip()]

In [27]:
# Separate the abstract and annotations sections
abstracts = []
annotations = []

for line in lines:
    if "|t|" in line or "|a|" in line:
        parts = line.split("|")
        abstracts.append(parts)
    else:
        annotations.append(line)

In [28]:
# Create DataFrame for abstracts
abstracts_df = pd.DataFrame(abstracts, columns=["ID", "Type", "Text"])

In [29]:
abstracts_df.head()

Unnamed: 0,ID,Type,Text
0,8808605,t,Somatic-cell selection is a major determinant ...
1,8808605,a,X-chromosome inactivation in mammals is regard...
2,9050866,t,"The ataxia-telangiectasia gene product, a cons..."
3,9050866,a,The product of the ataxia-telangiectasia gene ...
4,9012409,t,Molecular basis for Duarte and Los Angeles var...


In [30]:
annotations_str = '\n'.join(annotations)
annotations_df = pd.read_csv(StringIO(annotations_str), sep='\t', header=None, names=['ID', 'Start', 'End', 'Description', 'Type', 'MESH ID'])

In [31]:
annotations_df.head()

Unnamed: 0,ID,Start,End,Description,Type,MESH ID
0,8808605,154,171,enzyme deficiency,DiseaseClass,D008661
1,8808605,376,427,glucose-6-phosphate dehydrogenase (G6PD) defic...,SpecificDisease,D005955
2,8808605,572,611,chronic nonspherocytic hemolytic anemia,SpecificDisease,D000746
3,8808605,677,692,G6PD deficiency,SpecificDisease,D005955
4,8808605,1368,1383,G6PD deficiency,SpecificDisease,D005955


In [32]:
abstracts_df.to_csv('../data/processed/ncbi_dev_abstracts.csv', index=False)

In [33]:
abstracts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      200 non-null    object
 1   Type    200 non-null    object
 2   Text    200 non-null    object
dtypes: object(3)
memory usage: 4.8+ KB


Some IDs do not have the same format as the `DiseaseID` property in the knowledge graph. A MESH ID value can be a single ID ("D005955"), several IDs joined with "|" ("D001932|D001943|D011471", meaning three IDs), or an OMIM ID ("OMIM:217000"). We need to re-map them so that every ID starting with "D" or "C" is prefixed with "MESH:", as in "MESH:D005955". The function below also treats "+" as a separator.

In [288]:
def remap_mesh_id(mesh_id) -> str:
    if not isinstance(mesh_id, str):
        return mesh_id
    
    id = mesh_id.strip()

    if '|' in id or '+' in id:
        if '|' in id:
            ids = id.split('|')
        else:
            ids = id.split('+')
            
        remapped_ids = [remap_mesh_id(id) for id in ids]
        return '|'.join(remapped_ids)

    if id.startswith('D') or id.startswith('C'):
        return 'MESH:' + id
    else:
        return id
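For comparison, the same re-mapping can be written without recursion by splitting on both separators at once. This is an alternative sketch rather than the notebook's implementation: `add_mesh_prefix` is a hypothetical name, and the `+`-joined input is a made-up example of that separator.

```python
import re


def add_mesh_prefix(raw_ids):
    """Prefix bare MeSH identifiers (starting with D or C) with 'MESH:'."""
    if not isinstance(raw_ids, str):
        return raw_ids  # e.g. NaN for annotations without an identifier
    parts = [part.strip() for part in re.split(r"[|+]", raw_ids.strip())]
    prefixed = ["MESH:" + part if part[:1] in ("D", "C") else part for part in parts]
    return "|".join(prefixed)


for example in ["D005955", "D001932|D001943|D011471", "OMIM:217000", "D003924+D003920"]:
    print(example, "->", add_mesh_prefix(example))
```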

In [290]:
annotations_df['MESH ID'] = annotations_df['MESH ID'].apply(remap_mesh_id)

In [263]:
annotations_df.head()

Unnamed: 0,ID,Start,End,Description,Type,MESH ID
0,8808605,154,171,enzyme deficiency,DiseaseClass,MESH:D008661
1,8808605,376,427,glucose-6-phosphate dehydrogenase (G6PD) defic...,SpecificDisease,MESH:D005955
2,8808605,572,611,chronic nonspherocytic hemolytic anemia,SpecificDisease,MESH:D000746
3,8808605,677,692,G6PD deficiency,SpecificDisease,MESH:D005955
4,8808605,1368,1383,G6PD deficiency,SpecificDisease,MESH:D005955


In [291]:
annotations_df.to_csv('../data/processed/ncbi_dev_annotations.csv', index=False)

In [152]:
def get_abstractions_and_annotations(data):
    lines = [line for line in data.split("\n") if line.strip()]

    abstracts = []
    annotations = []

    for line in lines:
        if "|t|" in line or "|a|" in line:
            parts = line.split("|")
            abstracts.append(parts)
        else:
            annotations.append(line)

    return abstracts, annotations
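One caveat with `line.split("|")` is that a title or abstract containing a literal `|` would be split into more than three fields. The sketch below is a hedged variant that limits the split to the first two separators; `split_corpus` and the sample text are made up for illustration and follow the same layout as the corpus files.

```python
def split_corpus(data):
    """Separate title/abstract lines from tab-delimited annotation lines."""
    texts, annotations = [], []
    for line in data.split("\n"):
        if not line.strip():
            continue
        parts = line.split("|", 2)  # at most 3 fields: ID, type, text
        if len(parts) == 3 and parts[1] in ("t", "a"):
            texts.append(parts)
        else:
            annotations.append(line.split("\t"))
    return texts, annotations


# Made-up sample in the same layout as the corpus files
sample = (
    "123|t|A title with a | pipe\n"
    "123|a|An abstract.\n"
    "123\t2\t7\ttitle\tSpecificDisease\tD000001\n"
)
texts, annotations = split_corpus(sample)
print(texts)
print(annotations)
```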

Now we can process the train and test corpora in the same manner.

In [292]:
train_abstracts, train_annotations = get_abstractions_and_annotations(train_data)
        
train_abstracts_df = pd.DataFrame(train_abstracts, columns=["ID", "Type", "Text"])
train_abstracts_df.to_csv('../data/processed/ncbi_train_abstracts.csv', index=False)

train_annotations_str = '\n'.join(train_annotations)
train_annotations_df = pd.read_csv(StringIO(train_annotations_str), sep='\t', header=None, names=['ID', 'Start', 'End', 'Description', 'Type', 'MESH ID'])
train_annotations_df['MESH ID'] = train_annotations_df['MESH ID'].apply(remap_mesh_id)
train_annotations_df.to_csv('../data/processed/ncbi_train_annotations.csv', index=False)

In [261]:
train_annotations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5145 entries, 0 to 5144
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           5145 non-null   int64 
 1   Start        5145 non-null   int64 
 2   End          5145 non-null   int64 
 3   Description  5145 non-null   object
 4   Type         5145 non-null   object
 5   MESH ID      5145 non-null   object
dtypes: int64(3), object(3)
memory usage: 241.3+ KB


In [293]:
test_abstracts, test_annotations = get_abstractions_and_annotations(test_data)
        
test_abstracts_df = pd.DataFrame(test_abstracts, columns=["ID", "Type", "Text"])
test_abstracts_df.to_csv('../data/processed/ncbi_test_abstracts.csv', index=False)

test_annotations_str = '\n'.join(test_annotations)
test_annotations_df = pd.read_csv(StringIO(test_annotations_str), sep='\t', header=None, names=['ID', 'Start', 'End', 'Description', 'Type', 'MESH ID'])
test_annotations_df['MESH ID'] = test_annotations_df['MESH ID'].apply(remap_mesh_id)
test_annotations_df.to_csv('../data/processed/ncbi_test_annotations.csv', index=False)

In [156]:
test_annotations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           960 non-null    int64 
 1   Start        960 non-null    int64 
 2   End          960 non-null    int64 
 3   Description  960 non-null    object
 4   Type         960 non-null    object
 5   MESH ID      960 non-null    object
dtypes: int64(3), object(3)
memory usage: 45.1+ KB


We will work with specific disease names and we do not plan to train an ML model, so let us create a separate DataFrame from all three datasets for the category of SpecificDisease.

In [327]:
specific_disease_df = pd.concat([
    annotations_df[annotations_df['Type'] == 'SpecificDisease'],
    train_annotations_df[train_annotations_df['Type'] == 'SpecificDisease'],
    test_annotations_df[test_annotations_df['Type'] == 'SpecificDisease'],
])

In [328]:
specific_disease_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3939 entries, 1 to 959
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           3939 non-null   int64 
 1   Start        3939 non-null   int64 
 2   End          3939 non-null   int64 
 3   Description  3939 non-null   object
 4   Type         3939 non-null   object
 5   MESH ID      3903 non-null   object
dtypes: int64(3), object(3)
memory usage: 215.4+ KB


In [329]:
specific_disease_df.head()

Unnamed: 0,ID,Start,End,Description,Type,MESH ID
1,8808605,376,427,glucose-6-phosphate dehydrogenase (G6PD) defic...,SpecificDisease,MESH:D005955
2,8808605,572,611,chronic nonspherocytic hemolytic anemia,SpecificDisease,MESH:D000746
3,8808605,677,692,G6PD deficiency,SpecificDisease,MESH:D005955
4,8808605,1368,1383,G6PD deficiency,SpecificDisease,MESH:D005955
9,9012409,112,149,Duarte enzyme variant of galactosemia,SpecificDisease,MESH:D005693


Let us now check if there are any rows that don't have IDs.

In [330]:
rows_with_none = specific_disease_df[specific_disease_df['MESH ID'].isna()]
print(len(rows_with_none))

36


Since some disease names might be mentioned multiple times, let us print the unique names that are missing IDs.

In [331]:
disease_names_with_none = rows_with_none['Description'].unique()
print(disease_names_with_none)

['Kniest dysplasia' 'Pyle disease' 'SJS type 2'
 'attenuated adenomatous polyposis coli' 'AAPC'
 'clear-cell renal cell carcinoma' 'Norrie disease' 'ND'
 'renal oncocytomas' 'atelosteogenesis type 2' 'diastrophic dysplasia'
 'DTD' 'AO2' 'achondrogenesis type 1B' 'ACG1B' 'AO-2'
 'adult renal hamartomas' 'VLCAD deficiency']


We can populate these missing IDs using the MyDisease.info API.
[MyDisease.info](https://mydisease.info/about) provides simple-to-use REST web services to query/retrieve disease annotation data. It has annotations from different ontologies, which is helpful, as we might be able to re-map the diseases so their IDs are aligned with the knowledge base.

In [332]:
base_URL = "http://mydisease.info/v1/"
query_endpoint = "query"

def get_disease_info(query, endpoint, fields="mondo", size=10, from_=0, fetch_all=False, facet_size=10, dotfield=False):
    params = {
        "q": query,
        "fields": fields,
        "size": size,
        "from": from_,
        "fetch_all": str(fetch_all).lower(),
        "facet_size": facet_size,
        "dotfield": str(dotfield).lower()
    }
    
    response = requests.get(base_URL + endpoint, params=params)
    
    # Raise an exception for 4xx/5xx responses; otherwise return the parsed JSON body
    response.raise_for_status()
    return response.json()

Let us check the query and make sure it's working as expected.

In [333]:
query = "Pyle disease"
response_data = get_disease_info(query, query_endpoint)

Now we can collect the missing data.

In [335]:
disease_names_with_none_data = []

for disease_name in disease_names_with_none:
    response_data = get_disease_info(disease_name, query_endpoint)
    disease_names_with_none_data.append(response_data)
    
print(len(disease_names_with_none_data))

18


Now we can check whether we got all the information we need to populate the missing data points.

In [341]:
def extract_mesh_omim(dictionary):
    if dictionary['hits']:
        return {
            "MESH": dictionary['hits'][0].get('mondo', {}).get('xrefs', {}).get('mesh', None),
            "OMIM": dictionary['hits'][0].get('mondo', {}).get('xrefs', {}).get('omim', None),
            "DOID": dictionary['hits'][0].get('mondo', {}).get('xrefs', {}).get('doid', None)
        }

    else:
        return None

disease_mesh_dict = {}

for i, disease in enumerate(disease_names_with_none):
    mesh_omim_value = extract_mesh_omim(disease_names_with_none_data[i]) if i < len(disease_names_with_none_data) else None
    disease_mesh_dict[disease] = mesh_omim_value
    print(f"{disease}: {mesh_omim_value}")

Kniest dysplasia: {'MESH': ['C537208'], 'OMIM': ['245190'], 'DOID': None}
Pyle disease: {'MESH': ['C536252'], 'OMIM': ['265900'], 'DOID': ['DOID:0080019']}
SJS type 2: None
attenuated adenomatous polyposis coli: {'MESH': ['C538265'], 'OMIM': None, 'DOID': None}
AAPC: {'MESH': ['C538265'], 'OMIM': None, 'DOID': None}
clear-cell renal cell carcinoma: {'MESH': None, 'OMIM': None, 'DOID': ['DOID:4467']}
Norrie disease: {'MESH': ['C537849'], 'OMIM': ['310600'], 'DOID': ['DOID:0060844']}
ND: None
renal oncocytomas: None
atelosteogenesis type 2: {'MESH': ['C535395'], 'OMIM': ['256050'], 'DOID': None}
diastrophic dysplasia: {'MESH': ['C536170'], 'OMIM': ['222600'], 'DOID': ['DOID:14687']}
DTD: None
AO2: {'MESH': ['C535395'], 'OMIM': ['256050'], 'DOID': None}
achondrogenesis type 1B: None
ACG1B: None
AO-2: None
adult renal hamartomas: None
VLCAD deficiency: {'MESH': None, 'OMIM': ['201475'], 'DOID': ['DOID:0080155']}


We can infer that "ND" is "Norrie disease", and "AO-2" is "AO2" so we can insert the missing values into the specific disease DataFrame.

In [342]:
specific_disease_df.loc[specific_disease_df['Description'] == 'ND', 'MESH ID'] = 'MESH:C537849'
specific_disease_df.loc[specific_disease_df['Description'] == 'AO-2', 'MESH ID'] = 'MESH:C535395'

And now we can update the remaining known values.

In [373]:
for disease, ids in disease_mesh_dict.items():
    if ids is not None:
        id = None

        if 'MESH' in ids and ids['MESH'] is not None:
            id = f"MESH:{ids['MESH'][0]}"
        elif 'OMIM' in ids and ids['OMIM'] is not None:
            id = f"OMIM:{ids['OMIM'][0]}"
        elif 'DOID' in ids and ids['DOID'] is not None:
            id = f"DO:{ids['DOID'][0]}"

        print (f"{disease}: {id}")
        specific_disease_df.loc[specific_disease_df['Description'] == disease, 'MESH ID'] = id

Kniest dysplasia: MESH:C537208
Pyle disease: MESH:C536252
attenuated adenomatous polyposis coli: MESH:C538265
AAPC: MESH:C538265
clear-cell renal cell carcinoma: DO:DOID:4467
Norrie disease: MESH:C537849
atelosteogenesis type 2: MESH:C535395
diastrophic dysplasia: MESH:C536170
AO2: MESH:C535395
VLCAD deficiency: OMIM:201475
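The priority order used above (MESH first, then OMIM, then DOID) can be captured in a small helper. This is a sketch under the assumption that the cross-reference dictionaries have the shape printed earlier; `pick_preferred_id` is a hypothetical name.

```python
def pick_preferred_id(xrefs):
    """Return the first available identifier in the order MESH, OMIM, DOID."""
    if not xrefs:
        return None
    for key, prefix in (("MESH", "MESH:"), ("OMIM", "OMIM:"), ("DOID", "DO:")):
        values = xrefs.get(key)
        if values:
            return prefix + values[0]
    return None


# Inputs copied from the lookup results printed earlier
print(pick_preferred_id({"MESH": ["C537208"], "OMIM": ["245190"], "DOID": None}))
print(pick_preferred_id({"MESH": None, "OMIM": ["201475"], "DOID": ["DOID:0080155"]}))
print(pick_preferred_id({"MESH": None, "OMIM": None, "DOID": ["DOID:4467"]}))
```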


We were able to identify more diseases, so only a few empty values should remain. We will drop the rows that still have no ID.

In [347]:
rows_with_none = specific_disease_df[specific_disease_df['MESH ID'].isna()]
print(len(rows_with_none))

11


In [348]:
specific_disease_df.dropna(subset=['MESH ID'], inplace=True)

In [367]:
specific_disease_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3928 entries, 0 to 3927
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           3928 non-null   int64 
 1   Start        3928 non-null   int64 
 2   End          3928 non-null   int64 
 3   Description  3928 non-null   object
 4   Type         3928 non-null   object
 5   MESH ID      3928 non-null   object
dtypes: int64(3), object(3)
memory usage: 184.3+ KB


Let us now check whether any IDs from the dataset are missing from the knowledge graph.

In [368]:
mesh_ids = specific_disease_df['MESH ID'].str.split('|').explode().unique()
len(mesh_ids)

559

In [355]:
specific_disease_df.to_csv('../data/processed/ncbi_specific_disease.csv', index=False)

In [370]:
def batch_list(lst, batch_size):
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

In [371]:
def check_ids_missing_in_knowledge_base(driver, mesh_ids, batch_size=100):
    # A plain MATCH is used here: with OPTIONAL MATCH every meshId would be
    # returned even when no node matches, so nothing would ever be reported missing.
    id_present_check_query = """
        UNWIND $meshIds AS meshId
        MATCH (d:Disease)
        WHERE (d.DiseaseID IS NOT NULL AND ANY(id IN SPLIT(toString(d.DiseaseID), '|') WHERE id = meshId))
        OR (d.AltDiseaseIDs IS NOT NULL AND ANY(altId IN SPLIT(toString(d.AltDiseaseIDs), '|') WHERE altId = meshId))
        RETURN DISTINCT meshId
    """
    
    present_ids = set()
    
    for batch in batch_list(mesh_ids, batch_size):
        with driver.session() as session:
            result = session.run(id_present_check_query, meshIds=batch)
            present_ids.update([record["meshId"] for record in result])
    
    missing_ids = set(mesh_ids) - present_ids
    
    return list(missing_ids)
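The logic of this check can be tested in memory on a handful of nodes before running it against the database. The sketch below is an illustration only: `find_missing_ids` is a hypothetical helper, the first node is taken from the CTD preview, and `MESH:D000000` is a made-up ID that is deliberately absent.

```python
def find_missing_ids(wanted_ids, nodes):
    """Return the wanted IDs that match neither DiseaseID nor any AltDiseaseIDs entry."""
    known = set()
    for node in nodes:
        known.add(node["DiseaseID"])
        alt_ids = node.get("AltDiseaseIDs")
        if alt_ids:
            known.update(alt_ids.split("|"))
    return sorted(set(wanted_ids) - known)


# Small node list; the last wanted ID is deliberately absent
nodes = [
    {"DiseaseID": "MESH:C579849", "AltDiseaseIDs": "DO:DOID:0060395"},
    {"DiseaseID": "MESH:C538288", "AltDiseaseIDs": None},
]
print(find_missing_ids(["MESH:C538288", "DO:DOID:0060395", "MESH:D000000"], nodes))
```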

In [372]:
missing_ids = check_ids_missing_in_knowledge_base(driver, mesh_ids)
print(len(missing_ids))

0


As we can see, all the IDs from the dataset are present in the knowledge graph, so we can save our dataset now.

In [320]:
specific_disease_df.to_csv('../data/processed/ncbi_specific_disease.csv', index=False)

In [363]:
driver.close()