# Jensenlab Target-Disease Association Data Enrichment

This notebook presents a comprehensive approach to enriching a gene-disease association dataset, initially annotated with Disease Ontology Identifiers (DOID). 

The primary focus is on **mapping these DOIDs to the Unified Medical Language System (UMLS) and Medical Subject Headings (MESH) annotation system**, thereby enhancing the dataset's depth and utility.

Leveraging the power of Python, particularly through the use of specialized libraries such as **ols_client** for ontology-based queries and **Biopython** for accessing NCBI's MedGen database, this notebook demonstrates a meticulous process of data enrichment. The initial dataset, consisting of gene identifiers and names associated with diseases and their respective DOIDs, undergoes a transformative journey. Through intelligent scripting and data processing techniques, each disease annotation is cross-referenced and verified against authoritative biomedical resources.

The notebook's operations involve not only the extraction of UMLS and MESH annotations but also a detailed comparative analysis of disease names using the Levenshtein ratio to ensure high fidelity in disease identification. The resulting output is a curated dataset that harmonizes the original Jensen data with UMLS annotations, offering a more accurate and comprehensive view of each disease's genetic associations.

This work highlights the intersection of computational skills and biomedical knowledge, showcasing an adept application of programming tools to address complex biological questions. The resulting dataset stands as a proof of the rigorous process of data enhancement and validation, essential in the pursuit of advanced biomedical research and analysis.

# Section 1 - Package loading

In [3]:
import pandas as pd

# SQL PACKAGES

import psycopg2
import pandas as pd
from sqlalchemy import create_engine
import pandas.io.sql as sqlio

# JSON
from tqdm import tqdm

# XML
import xml.etree.ElementTree as ET

# Similarity analysis
from Levenshtein import ratio

# Section 2 - Data Loading and Initial Exploration

**Jensen Data Overview:**

The Jensen dataset, known for its rich content in target-disease associations, is a crucial resource in biomedical research. It details the connections between biological targets (like genes or proteins) and various diseases. A key feature of this dataset is the annotation of diseases using the Disease Ontology Identifier (DOID), a respected standard in biomedical informatics. DOID annotations facilitate a consistent and precise description of human diseases, making the dataset invaluable for cross-referencing and harmonizing data across different biomedical studies and databases.

**Data Acquisition from Jensenlab:**
Jensenlab, the provider of this dataset, offers data downloads, allowing researchers to access easily to a comprehensive, curated information about gene-disease associations.

In [11]:
file_path = '/home/alphameld/data/GEOmetadata_FeaturePrediction/TargetCompilationProject/Jensen/Data/human_disease_integrated_full.tsv'
names = ['Gene_identifier','Gene_name','Disease_ID', 'Disease_name', 'Score']

Jensen = pd.read_csv(file_path,header = None, sep = '\t', names = names)
Jensen.head()

Unnamed: 0,Gene_identifier,Gene_name,Disease_ID,Disease_name,Score
0,18S_rRNA,18S_rRNA,DOID:9643,Babesiosis,3.635
1,18S_rRNA,18S_rRNA,DOID:1398,Parasitic infectious disease,3.505
2,18S_rRNA,18S_rRNA,DOID:2789,Parasitic protozoa infectious disease,3.502
3,18S_rRNA,18S_rRNA,DOID:4,Disease,3.397
4,18S_rRNA,18S_rRNA,DOID:0050117,Disease by infectious agent,3.311


# Section 3:  - Data Cleaning and Preprocessing

## 3.1 - Extract unique target-disease associations

**Explanation:** Here, the focus shifts to extracting unique disease identifiers (Disease_ID) and their corresponding names (Disease_names) from the dataset. This is an essential step for isolating specific disease-related data.

In [16]:
Jensen_f = Jensen[['Disease_ID','Disease_name']].copy()
Jensen_f.drop_duplicates(inplace=True)

## 3.2 - Extracting the annotation types availabe in Jensen dataset

**Explanation:** This section is dedicated to identifying and extracting various annotation types present in the dataset. 
Understanding and categorizing these annotations are important for comprehending the dataset's structure and enhancing its usability in analysis.

In [24]:
split_df = Jensen_f['Disease_ID'].str.split(':',expand = True)
split_df.columns = ['Anno_type', 'ID']

Jensen_f = pd.concat([Jensen_f, split_df], axis=1)
Jensen_f.head()

Unnamed: 0,Disease_ID,Disease_name,Anno_type,ID
0,DOID:9643,Babesiosis,DOID,9643
1,DOID:1398,Parasitic infectious disease,DOID,1398
2,DOID:2789,Parasitic protozoa infectious disease,DOID,2789
3,DOID:4,Disease,DOID,4
4,DOID:0050117,Disease by infectious agent,DOID,50117


## 3.3 - Extracting Disease Ontology Identifiers (DOIDs)

In [41]:
DOID_anno = Jensen_f[Jensen_f['Anno_type'] == 'DOID'].copy()

# Section 4 - Annotation Extraction and Segmentation

## 4.1 - Getting DOIDs cross references from the OLS client

Declare functions to obtain cross references/annotations Using the OLS (Ontology Lookup Service) client.

In [49]:
from ols_client import EBIClient
# Initialize the OLS client
ols_client = EBIClient()

# Declare functions to obtain cross references

def extract_cross_references(term, namespace_prefix):
    cross_references = []
    if 'has_dbxref' in term['_embedded']['terms'][0]['annotation']:
        dbxrefs = term['_embedded']['terms'][0]['annotation']['has_dbxref']
        for dbxref in dbxrefs:
            db, ref_id = dbxref.split(':', 1)
            if db.lower() == namespace_prefix.lower():
                cross_references.append(f"{db}:{ref_id}")
    return cross_references

This following cell demonstrates continued querying of the ols_client using the function **extract_cross_references**. 
The code is likely executing queries to OLS to retrieve additional ontology terms or annotations, thereby enriching the Jensen dataset with standardized, ontology-based data. This enrichment is key to enhancing the dataset's utility for biomedical research.

In [58]:
cross_reference_data = []
ontology = 'doid'

for term_id, d_anno in tqdm(zip(DOID_anno['Disease_ID'],DOID_anno['Disease_name']),desc='Processing DOIDs',bar_format='{l_bar}{bar:20}{r_bar}{bar:-20b}',position=0, leave=True,total=len(DOID_anno['Disease_ID'])):
    try:
        term = ols_client.get_term(ontology, term_id)
    except requests.exceptions.HTTPError as error:
        print(f"Error fetching term {term_id}: {error}")
        term = None

    if term:
        mesh_cross_references = extract_cross_references(term, 'MESH')
        msh_cross_references = extract_cross_references(term, 'MSH')
        MeSH_cross_references = extract_cross_references(term, 'MeSH')
        #mondo_cross_references = extract_cross_references(term, 'MONDO')
        #orpha_cross_references = extract_cross_references(term, 'ORPHANET')
        umlscui_cross_references = extract_cross_references(term, 'UMLS_CUI')
        umls_cross_references = extract_cross_references(term, 'UMLS')
        
        combined_mesh_cross_references = mesh_cross_references + msh_cross_references + msh_cross_references
        combined_umls_cross_references = umlscui_cross_references + umls_cross_references
        
    else:
        umls_cross_references = []
        combined_mesh_cross_references = []
        mondo_cross_references = []
        orpha_cross_references = []
        combined_umls_cross_references = []

    cross_reference_data.append({
        "DOID": term_id,
        "DiseaseName": d_anno,
        "UMLS" : ', '.join(combined_umls_cross_references) if combined_umls_cross_references else None,
        "MESH": ', '.join(combined_mesh_cross_references) if combined_mesh_cross_references else None
        #"MONDO": ', '.join(mondo_cross_references) if mondo_cross_references else None,
        #"ORPHANET": ', '.join(orpha_cross_references) if orpha_cross_references else None
    })
 
# Create the pandas DataFrame
cross_reference_df = pd.DataFrame(cross_reference_data, columns=["DOID", "DiseaseName", "UMLS", "MESH"])

Processing DOIDs:  82%|████████████████▍   | 7633/9313 [57:59<11:23,  2.46it/s]  

Error fetching term DOID:3394: 404 Client Error:  for url: https://www.ebi.ac.uk/ols4/api/ontologies/doid/terms?iri=DOID%3A3394


Processing DOIDs: 100%|████████████████████| 9313/9313 [1:10:32<00:00,  2.20it/s]


# Section 5 - Cleaning from data extracted from the OLS client

## 5.1 - Split data between UMLS annotations and MESH annotations

This code will segregate the data on 2 different types of annotations:

**UMLS (Unified Medical Language System)** and 
**MESH (Medical Subject Headings)**

In [80]:
UMLS_anno = cross_reference_df[cross_reference_df['UMLS'].notnull()][['DOID','DiseaseName','UMLS']].copy()
MESH_anno = cross_reference_df[(cross_reference_df['UMLS'].isnull()) & (cross_reference_df['MESH'].notnull())][['DOID','DiseaseName','MESH']].copy()

## 5.2 - Expand annotations

The following operations are transforming the 'UMLS' and 'MESH' columns from potentially having multiple annotations per row (in a comma-separated format) to a format where each annotation has its own row. 

## 5.2.1 - UMLS

In [98]:
# Clean the 'UMLS' column to ensure consistency (replace ', ' with ',')
UMLS_anno['UMLS'] = UMLS_anno['UMLS'].str.replace(', ', ',')

# Then split the 'UMLS' column
UMLS_anno['UMLS'] = UMLS_anno['UMLS'].str.split(',')

# Expand the split lists into separate rows
UMLS_anno = UMLS_anno.explode('UMLS')

# Reset the index
UMLS_anno = UMLS_anno.reset_index(drop=True)
UMLS_anno

Unnamed: 0,DOID,DiseaseName,UMLS
0,DOID:9643,Babesiosis,UMLS_CUI:C0004576
1,DOID:1398,Parasitic infectious disease,UMLS_CUI:C0014238
2,DOID:2789,Parasitic protozoa infectious disease,UMLS_CUI:C0033740
3,DOID:4,Disease,UMLS_CUI:C0012634
4,DOID:0050117,Disease by infectious agent,UMLS_CUI:C0001485
...,...,...,...
6013,DOID:14457,Brucella abortus brucellosis,UMLS_CUI:C0302363
6014,DOID:1992,Rectum malignant melanoma,UMLS_CUI:C0349539
6015,DOID:4513,Gallbladder angiosarcoma,UMLS_CUI:C1333742
6016,DOID:13799,Female breast central part cancer,UMLS_CUI:C0153549


### 5.2.2 - MESH

In [99]:
# Clean the 'UMLS' column to ensure consistency (replace ', ' with ',')
MESH_anno['MESH'] = MESH_anno['MESH'].str.replace(', ', ',')

# Splitting the 'UMLS' column
MESH_anno['MESH'] = MESH_anno['MESH'].str.split(',')

# Expand the split lists into separate rows
MESH_anno = MESH_anno.explode('MESH')

# Reset the index
MESH_anno = MESH_anno.reset_index(drop=True)
MESH_anno

Unnamed: 0,DOID,DiseaseName,MESH
0,DOID:0050686,Organ system cancer,MESH:D009371
1,DOID:1339,Diamond-Blackfan anemia,MESH:D029503
2,DOID:8485,Mucormycosis,MESH:D009091
3,DOID:0080208,non-alcoholic fatty liver disease,MESH:D065626
4,DOID:0060479,Shwachman-Diamond syndrome,MESH:C537330
...,...,...,...
638,DOID:0060377,Orofaciodigital syndrome VII,MESH:C563104
639,DOID:0080629,alopecia-mental retardation syndrome 2,MESH:C563668
640,DOID:0110223,Brugada syndrome 6,MESH:C567735
641,DOID:0070356,Visual impairment and progressive phthisis bulbi,MESH:D005128


## 5.3 - Create an exclusive column for the ID

The code block below is enhancing the **UMLS_anno** and **MESH_anno** data frames by splitting the UMLS/MESH annotations into their constituent parts (annotation type and identifier), renaming these parts for clarity, and then merging this detailed information back into the original DataFrame.

### 5.3.1 - UMLS

In [103]:
split_df = UMLS_anno['UMLS'].str.split(':',expand = True)
split_df.columns = ['Anno_type', 'ID']

UMLS_f = pd.concat([UMLS_anno, split_df], axis=1)
UMLS_f.head()

Unnamed: 0,DOID,DiseaseName,UMLS,Anno_type,ID
0,DOID:9643,Babesiosis,UMLS_CUI:C0004576,UMLS_CUI,C0004576
1,DOID:1398,Parasitic infectious disease,UMLS_CUI:C0014238,UMLS_CUI,C0014238
2,DOID:2789,Parasitic protozoa infectious disease,UMLS_CUI:C0033740,UMLS_CUI,C0033740
3,DOID:4,Disease,UMLS_CUI:C0012634,UMLS_CUI,C0012634
4,DOID:0050117,Disease by infectious agent,UMLS_CUI:C0001485,UMLS_CUI,C0001485


## 5.3.2 - MESH

In [105]:
split_df = MESH_anno['MESH'].str.split(':',expand = True)
split_df.columns = ['Anno_type', 'ID']

MESH_f = pd.concat([MESH_anno, split_df], axis=1)
MESH_f.head()

Unnamed: 0,DOID,DiseaseName,MESH,Anno_type,ID
0,DOID:0050686,Organ system cancer,MESH:D009371,MESH,D009371
1,DOID:1339,Diamond-Blackfan anemia,MESH:D029503,MESH,D029503
2,DOID:8485,Mucormycosis,MESH:D009091,MESH,D009091
3,DOID:0080208,non-alcoholic fatty liver disease,MESH:D065626,MESH,D065626
4,DOID:0060479,Shwachman-Diamond syndrome,MESH:C537330,MESH,C537330


# Section 6 - Query MedGen

## 6.1 - Declaring functions

The following functions enable querying the **MedGen** database via NCBI's Entrez programming utilities. 
Our functions perform both general searches and fetch detailed summaries of specific records, and configures necessary user information for accessing NCBI's resources.

In [108]:
# Declaring functions to query data from Medgen
from Bio import Entrez

def search(term):
    handle = Entrez.esearch(db='medgen', term=term, retmax=500)
    record = Entrez.read(handle)
    return record

def summary(id):
    handle = Entrez.esummary(db='medgen', id=id)
    record = handle.read()
    return record

Entrez.email = "your email"
Entrez.api_key = 'your entrez api key'

## 6.2 - Get annotations and disease names

The code block systematically queries MedGen for additional disease-related information based on a set of identifiers, compiles the results, and creates a new, enriched DataFrame that combines the original data with the newly acquired information from MedGen.

### 6.3.1 - UMLS

In [135]:
data = []

for Jensen_dname, d_anno, doid_anno in tqdm(zip(UMLS_f['DiseaseName'],UMLS_f['ID'],UMLS_f['DOID']),desc='Processing UMLSs',bar_format='{l_bar}{bar:20}{r_bar}{bar:-20b}',position=0, leave=True,total=len(UMLS_f['DiseaseName'])):  
    search_results = search(d_anno) 
    uid_list = search_results['IdList']
    # if uid_list is empty, append (disease, None, None)
    if not uid_list:
        data.append((doid_anno, Jensen_dname, None, None))
    else:
        for uid in uid_list:
            summary_record = summary(uid)
            root = ET.fromstring(summary_record)
            umls_id = None
            disease_name = None
            for elem in root.iter('ConceptId'):
                umls_id = elem.text
            for elem in root.iter('Title'):
                disease_name = elem.text
            data.append((doid_anno, Jensen_dname, umls_id, disease_name))  # Store disease, UMLS and disease_name together

# Create DataFrame after loop is finished
Jensen_qr_res_UMLS = pd.DataFrame(data, columns=["DiseaseId", "Jensen_DiseaseName", "UMLS", "MG_DiseaseName"])

Processing UMLSs: 100%|████████████████████| 6018/6018 [35:07<00:00,  2.86it/s]  


## 6.3.2 - MESH

In [137]:
data = []

for Jensen_dname, d_anno, doid_anno in tqdm(zip(MESH_f['DiseaseName'],MESH_f['ID'],MESH_f['DOID']),desc='Processing MESHs',bar_format='{l_bar}{bar:20}{r_bar}{bar:-20b}',position=0, leave=True,total=len(MESH_f['DiseaseName'])):  
    search_results = search(d_anno) 
    uid_list = search_results['IdList']
    # if uid_list is empty, append (disease, None, None)
    if not uid_list:
        data.append((doid_anno, Jensen_dname, None, None))
    else:
        for uid in uid_list:
            summary_record = summary(uid)
            root = ET.fromstring(summary_record)
            umls_id = None
            disease_name = None
            for elem in root.iter('ConceptId'):
                umls_id = elem.text
            for elem in root.iter('Title'):
                disease_name = elem.text
            data.append((doid_anno, Jensen_dname, umls_id, disease_name))  # Store disease, UMLS and disease_name together

# Create DataFrame after loop is finished
Jensen_qr_res_MESH = pd.DataFrame(data, columns=["DiseaseId", "Jensen_DiseaseName", "UMLS", "MG_DiseaseName"])

Processing MESHs: 100%|████████████████████| 643/643 [03:32<00:00,  3.03it/s]


# Section 7 - Final dataset compilation

## 7.1 - Concatenation MESH and UMLS extracted data

In [143]:
jensen_df = pd.concat([Jensen_qr_res_MESH,Jensen_qr_res_UMLS])

## 7.2 - Remove missing data 

In [147]:
jensen_df_nn = jensen_df[jensen_df['UMLS'].notnull()].copy()

# 7.3 - Keeping the most similar entries

## 7.3.1 - Declaring function SOLVE

The **solve** function is designed to refine a DataFrame by keeping only the most relevant entries for each unique disease, based on the similarity of disease names from two different sources. This is accomplished through a combination of iterating over unique identifiers, calculating similarity scores, and selectively dropping less similar entries

In [150]:
from Levenshtein import ratio

def solve(data):
    for disid in data['DiseaseId'].unique():
        subset = data[data['DiseaseId'] == disid]
        if len(subset) > 1:  
            max_simil = 0
            idx_to_keep = None  
            for idx, row in subset.iterrows():
                simil_score = ratio(row['Jensen_DiseaseName'].lower(), row['MG_DiseaseName'].lower())
                if simil_score > max_simil:  
                    max_simil = simil_score  
                    idx_to_keep = idx 
            idx_to_drop = list(set(subset.index) - {idx_to_keep})
            data = data.drop(idx_to_drop)
    return data

### 7.3.2 - Keeping entries with the most similar disease names between Jensen and MedGen

In [151]:
New_Jensen_linked_data = solve(jensen_df_nn)

In [152]:
New_Jensen_linked_data

Unnamed: 0,DiseaseId,Jensen_DiseaseName,UMLS,MG_DiseaseName
0,DOID:0050686,Organ system cancer,C0027653,Neoplasm by Site
1,DOID:1339,Diamond-Blackfan anemia,C1260899,Diamond-Blackfan anemia
2,DOID:8485,Mucormycosis,C0026718,Mucormycosis
4,DOID:0080208,non-alcoholic fatty liver disease,C0400966,Non-alcoholic fatty liver disease
9,DOID:14049,Phaeohyphomycosis,C0276721,Phaeohyphomycosis
...,...,...,...,...
6046,DOID:14457,Brucella abortus brucellosis,C0302363,Brucella abortus brucellosis
6047,DOID:1992,Rectum malignant melanoma,C0349539,Rectum malignant melanoma
6048,DOID:4513,Gallbladder angiosarcoma,C1333742,Gallbladder angiosarcoma
6049,DOID:13799,Female breast central part cancer,C0153549,Female breast central part cancer
