***
***

<img width="220" align="right" alt="Screen Shot 2020-10-14 at 20 48 36" src="https://user-images.githubusercontent.com/8030363/96350526-7d09a680-1073-11eb-9e45-a510c496bcc1.png">


# OMOP2OBO

### *Ontologizing Health Systems at Scale: Making Translational Discovery a Reality*

<br>

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [OMOP2OBO](https://github.com/callahantiff/OMOP2OBO/wiki)  
**Current Release:** **[`V1.0`](https://github.com/callahantiff/OMOP2OBO/wiki/v1.0)**

<br>

***
***

**Project Goals:** Common data models have solved many challenges of utilizing electronic health records, but have not yet meaningfully integrated clinical and molecular data. Aligning clinical data to open biological ontologies (OBOs), which provide semantically computable representations of biological knowledge, requires extensive manual curation and expertise. To address these limitations, we introduce OMOP2OBO, a health system-scale, disease-agnostic methodology to create interoperability between standardized clinical terminologies and semantically encoded OBOs and present results demonstrating the utility within two health systems.

<br>

***
***

## Notebook Purpose

This notebook serves as a `main` file for the `OMOP2OBO` project. It walks through the program step by step and generates mappings between the Observational Medical Outcomes Partnership (OMOP) common data model and ontologies in the Open Biological and Biomedical Ontologies (OBO) Foundry. There is also a command line version of this file (`omop2obo`) that is automatically installed with OMOP2OBO. Please see the [README](https://github.com/callahantiff/OMOP2OBO) for more information.

**OMOP2OBO Workflow**  
The figure below provides a high-level overview of the `OMOP2OBO` mapping algorithm. The code in this notebook follows the steps shown in this figure. The only step that is not run is querying an OMOP instance. Because these instances are highly likely to contain patient data, we assume that the data has already been obtained and saved in the `resources/clinical_data/` directory. See the project [README](https://github.com/callahantiff/OMOP2OBO) for additional information.

<img width="2000" alt="Screen Shot 2020-09-20 at 22 59 00" src="https://user-images.githubusercontent.com/8030363/96931469-99924e00-147a-11eb-9c19-fe5a95786772.png">

<br>

### Assumptions    
Please make sure that the following dependencies are addressed before running this notebook:

- [x] **OWLTools**. This software relies on [OWLTools](https://github.com/owlcollab/owltools). If you clone the repository, the `owltools` library file is included automatically and placed in the correct directory.
- [x] **Clinical Data**. This program assumes that there is clinical data that needs mapping and that it has been placed in the `resources/clinical_data` directory. Each data source provided in this directory is assumed to have been extracted from the `OMOP` CDM. An example of the expected input clinical data can be found [`here`](https://github.com/callahantiff/OMOP2OBO/tree/master/resources/clinical_data).  
- [x] **UMLS Data**. This program depends on data from the National Library of Medicine's Unified Medical Language System (UMLS), specifically the [MRCONSO.RRF](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) and [MRSTY.RRF](https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.Tf/) files. Please note that using these data requires a license agreement. To get the `MRSTY.RRF` file, you will need to download the UMLS Metathesaurus and run `MetamorphoSys`. Once both data sources are obtained, please place each file in the `resources/mappings` directory.
- [x] **Ontology Data**. Ontology data is automatically downloaded from the user-provided input file `ontology_source_list.txt`. Please update this file as needed ([here](https://github.com/callahantiff/OMOP2OBO/blob/master/resources/ontology_source_list.txt)).
- [x] **Vocabulary Source Code Mapping**. To increase the likelihood of capturing existing database cross-references, `omop2obo` provides a file that maps different clinical vocabulary source code prefixes between the `UMLS`, OBO ontologies, and clinical data. This information is stored in the  `source_code_vocab_map.csv` ([here](https://github.com/callahantiff/OMOP2OBO/blob/master/resources/mappings/source_code_vocab_map.csv)). The current version of this file is updated for ontologies released September 2020, clinical data normalized to `OMOP_v5.0`, and UMLS `2020AA`. 
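
Before running the mapping steps, it can help to confirm that the UMLS files parse as expected. The block below is a minimal sketch, not the `omop2obo` loader: it assumes the standard pipe-delimited RRF layout and the `MRCONSO.RRF` column order given in the UMLS Reference Manual. The `read_rrf` helper and the toy rows are hypothetical and contain no real UMLS content.

```python
import csv
import io

# column order of MRCONSO.RRF as listed in the UMLS Reference Manual
MRCONSO_COLUMNS = ['CUI', 'LAT', 'TS', 'LUI', 'STT', 'SUI', 'ISPREF', 'AUI', 'SAUI',
                   'SCUI', 'SDUI', 'SAB', 'TTY', 'CODE', 'STR', 'SRL', 'SUPPRESS', 'CVF']


def read_rrf(handle, columns):
    """Yield one dictionary per row of a pipe-delimited RRF file (hypothetical helper)."""
    reader = csv.reader(handle, delimiter='|', quoting=csv.QUOTE_NONE)
    for row in reader:
        # each RRF row ends with a trailing pipe, which produces one extra empty field
        yield dict(zip(columns, row[:len(columns)]))


# toy rows (NOT real UMLS content) standing in for resources/mappings/MRCONSO.RRF
toy_file = io.StringIO(
    'C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SNOMEDCT_US|PT|111|toy disorder|0|N||\n'
    'C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||ICD10CM|PT|A00|toy disorder|0|N||\n'
)
rows = list(read_rrf(toy_file, MRCONSO_COLUMNS))
print([(r['CUI'], r['SAB'], r['CODE']) for r in rows])
```

To check a real file, replace `toy_file` with an open handle to the licensed `MRCONSO.RRF` and inspect the first few rows.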

<br>

***
### Table of Contents
***
The three primary steps involved in mapping OMOP CDM concepts to OBO ontologies are `Download Ontology Data`, `Generate Mappings` (exact and fuzzy), and `Aggregate Mapping Results`.

* [Download Ontology Data](#download-ontologies)  
* [Generate Mappings](#mappings)  
* [Aggregate Mapping Results](#aggregate-mapping-results)  

***

***

### Set-Up Environment

In [None]:
# import needed libraries
import click
import glob
import pandas as pd
import pickle

from datetime import date, datetime
from typing import Tuple

from omop2obo import ConceptAnnotator, OntologyDownloader, OntologyInfoExtractor, SimilarStringFinder
from omop2obo.utils import aggregates_mapping_results

# time stamp appended to output file names (e.g. '_14OCT2020')
date_today = '_' + date.today().strftime('%d%b%Y').upper()


***
## Download Ontologies <a class="anchor" id="download-ontologies"></a>
***

The purpose of this step is to download the needed OBO ontologies, process them, and output a master dictionary of processed ontology data. The `V1.0` release parses the following OBO ontologies for each specific clinical domain: 

Ontology | Clinical Domain 
:--: | :--:
[Human Phenotype Ontology (hp)](http://purl.obolibrary.org/obo/hp.owl) | Conditions; Measurements      
[Mondo Disease Ontology (mondo)](http://purl.obolibrary.org/obo/mondo.owl) | Conditions      
[Cell Ontology (cl)](http://purl.obolibrary.org/obo/cl.owl) | Measurements        
[Chemical Entities of Biological Interest (chebi)](http://purl.obolibrary.org/obo/chebi.owl) | Measurements; Drugs        
[NCBI Organism Taxonomy (ncbitaxon)](http://purl.obolibrary.org/obo/ncbitaxon.owl) | Measurements; Drugs     
[Protein Ontology (pr)](http://purl.obolibrary.org/obo/pr.owl) | Measurements; Drugs     
[Uber Anatomy Ontology (uberon)](http://purl.obolibrary.org/obo/uberon/ext.owl) | Measurements       
[Vaccine Ontology (vo)](http://purl.obolibrary.org/obo/vo.owl) | Drugs  

<br>

**Input File:** [`ontology_source_list.txt`](https://github.com/callahantiff/OMOP2OBO/blob/master/resources/ontology_source_list.txt)  

**Purpose:**
This step of the algorithm is designed to complete the following steps:  
1. [Download OBO Ontology Data](#download-data)  
2. [Process Ontologies](#process-ontologies)   
3. [Create Master Ontology Dictionary](#master-ont-dict)  

<br>

*NOTE.* All data from the current release, except for UMLS data, can be downloaded directly from the project Wiki in the current release sub-page.  

***

### Download OBO Ontologies <a class="anchor" id="download-data"></a>
This step downloads each `OBO` ontology in the `ontology_source_list.txt` file using the [OWLTools](https://github.com/owlcollab/owltools) API.


**Output Files:**  
The following content will be downloaded to the `resources/ontologies/` repository.
- `ontology_source_metadata.txt`: a file containing metadata on each downloaded ontology
- An `.owl` file will be downloaded for each ontology in `ontology_source_list.txt`  

In [None]:
# download ontologies
ont = OntologyDownloader('resources/ontology_source_list.txt')
ont.downloads_data_from_url()


***

### Process OBO Ontologies <a class="anchor" id="process-ontologies"></a>
This step processes each of the downloaded ontologies to obtain the following information for all non-deprecated classes: definitions, labels, synonyms, and database Cross-References (DbxRefs). An example of the output for the Cell Ontology is shown below:  

```
{'label': {'osteoclast': 'http://purl.obolibrary.org/obo/CL_0000092'},
 'definition': {'a plasmablast that secretes ige.': 'http://purl.obolibrary.org/obo/CL_0000950'},
 'dbxref': {'bto:0001173': 'http://purl.obolibrary.org/obo/CL_0000558'},
 'dbxref_type': {'bto:0001173': 'DbXref'},
 'synonym': {'multipotent cell': 'http://purl.obolibrary.org/obo/CL_0000048'},
 'synonym_type': {'multipotent cell': 'hasExactSynonym'} 
 }
```     
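
To make the structure concrete, the sketch below looks up a string in a dictionary shaped like the example output above. It assumes, based on that example, that keys are stored lower-cased; the `find_class` helper is hypothetical and is not part of the `omop2obo` API.

```python
# a small dictionary shaped like the Cell Ontology example output
cl_data = {
    'label': {'osteoclast': 'http://purl.obolibrary.org/obo/CL_0000092'},
    'dbxref': {'bto:0001173': 'http://purl.obolibrary.org/obo/CL_0000558'},
    'dbxref_type': {'bto:0001173': 'DbXref'},
    'synonym': {'multipotent cell': 'http://purl.obolibrary.org/obo/CL_0000048'},
    'synonym_type': {'multipotent cell': 'hasExactSynonym'},
}


def find_class(ontology_data, text):
    """Return (uri, evidence) for the first exact match, or None (hypothetical helper)."""
    key = text.strip().lower()  # assumption: keys are lower-cased, as in the example
    for field in ('label', 'synonym', 'dbxref'):
        uri = ontology_data.get(field, {}).get(key)
        if uri is not None:
            # use the recorded type when one exists, otherwise fall back to the field name
            evidence = ontology_data.get(field + '_type', {}).get(key, field)
            return uri, evidence
    return None


print(find_class(cl_data, 'Osteoclast'))
print(find_class(cl_data, 'Multipotent Cell'))
print(find_class(cl_data, 'unknown term'))
```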

<br>

**Output Files:**  
The following content will be written to the `resources/ontologies/` directory.   
- A `.pickle` file for each ontology containing the processed ontology content (i.e. labels, definitions, synonyms, and dbxrefs for non-deprecated classes)  

In [None]:
# process ontologies
ont_explorer = OntologyInfoExtractor('resources/ontologies', ont.data_files)
ont_explorer.ontology_processor()


***

### Create Master Ontology Dictionary <a class="anchor" id="master-ont-dict"></a>
This step parses each of the processed ontology files from the prior step and merges them into a single nested ontology dictionary. The primary keys of the dictionary are identifiers for each ontology (e.g. `hp`, `mondo`, `chebi`), and each ontology's sub-dictionary contains the following keys: `label`, `definition`, `dbxref`, `dbxref_type`, `synonym`, and `synonym_type`.
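
The merge can be pictured with the sketch below. It is an illustration under stated assumptions, not the `ontology_loader` implementation: the `build_master_dictionary` helper, the toy file names, and the `example.org` URIs are all hypothetical.

```python
import os
import pickle
import tempfile

FIELDS = ('label', 'definition', 'dbxref', 'dbxref_type', 'synonym', 'synonym_type')


def build_master_dictionary(pickle_paths):
    """Merge per-ontology pickles into one dictionary keyed by ontology identifier."""
    master = {}
    for ont_id, path in pickle_paths.items():
        with open(path, 'rb') as handle:
            processed = pickle.load(handle)
        # keep the six expected keys, defaulting to an empty dictionary when one is absent
        master[ont_id] = {field: processed.get(field, {}) for field in FIELDS}
    return master


# toy stand-ins for the processed ontology files (file names and URIs are hypothetical)
toy = {
    'hp': {'label': {'toy phenotype': 'http://example.org/HP_TOY_1'}},
    'cl': {'label': {'toy cell': 'http://example.org/CL_TOY_1'},
           'synonym': {'toy cell type': 'http://example.org/CL_TOY_1'}},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = {}
    for ont_id, content in toy.items():
        paths[ont_id] = os.path.join(tmp_dir, ont_id + '_toy.pickle')
        with open(paths[ont_id], 'wb') as handle:
            pickle.dump(content, handle)
    master = build_master_dictionary(paths)

print(sorted(master))
print(master['cl']['synonym'])
```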

For release `V1.0` this step generates the following results:  

Ontology | Classes | Definitions | Labels | Synonyms | DbXRefs  
:--: | :--: | :--: | :--: | :--: | :--:  
Human Phenotype Ontology (hp) | 15,247 | 12,468 | 15,247 | 19,860 | 19,569  
Mondo Disease Ontology (mondo) | 22,288 | 15,271 | 22,288 | 98,181 | 159,918  
Cell Ontology (cl) | 2,238 | 1,859 | 2,238 | 2,124 | 1,376  
Chemical Entities of Biological Interest (chebi) | 126,169 | 48,824 | 126,169 | 269,798 | 231,247
NCBI Organism Taxonomy (ncbitaxon) | 2,241,110 | 0 | 2,241,110 | 263,571 | 18,426
Protein Ontology (pr) | 215,624 | 215,598 | 215,624 | 590,190 | 195,671
Uber Anatomy Ontology (uberon) | 13,898 | 11,026 | 13,898 | 36,771 | 51,322
Vaccine Ontology (vo) | 5,783 | 1,231 | 5,783 | 6 | 0

<br>

**Output Files:**  
The following content will be written to the `resources/ontologies/` directory.  
- `master_ontology_dictionary.pickle`: a file containing the processed ontology content formatted as a dictionary  

In [None]:
# create master dictionary of processed ontologies
ont_explorer.ontology_loader()

# read in ontology data (the with-statement closes the file automatically)
with open('resources/ontologies/master_ontology_dictionary.pickle', 'rb') as handle:
    ont_data = pickle.load(handle)


In [None]:
# compute the counts reported in the table above
for key in ont_data.keys():
    print('\nProcessing Ontology {}'.format(key))
    # classes are the unique URIs that labels point to
    print('# classes: {}'.format(len(set(ont_data[key]['label'].values()))))
    print('# definitions: {}'.format(len(ont_data[key]['definition'])))
    print('# labels: {}'.format(len(ont_data[key]['label'])))
    print('# synonyms: {}'.format(len(ont_data[key]['synonym'])))
    print('# dbXRefs: {}'.format(len(ont_data[key]['dbxref'])))
    

<br><br>

***
## Generate Mappings <a class="anchor" id="mappings"></a>
***

**Purpose:**
This step of the algorithm is designed to complete the following steps to generate mappings:  
1. [Generate Exact Mappings](#exact-map)  
2. [Generate Fuzzy Mappings](#fuzzy-map)    

<br>

**Input Files:**
- `clinical_data`: Clinical data from the OMOP common data model needing mapping  
- `resources/mappings/MRCONSO.RRF`:  UMLS CUI information and mappings    
- `resources/mappings/MRSTY.RRF`: UMLS Semantic Types     
- `source_code_vocab_map.csv`: A file containing information for normalizing clinical vocabulary source code prefixes between the UMLS, OBO ontologies, and clinical data  

The current release (`V1.0`) maps `29,129` condition concepts, `1,697` drug exposure ingredients, and `4,083` measurements.

<br>

**Example Clinical and Ontology Input Data**  

<img width="700" alt="Screen Shot 2020-09-20 at 22 26 23" src="https://user-images.githubusercontent.com/8030363/93732838-5c435380-fb90-11ea-913c-ed2546a565ba.png">

<br>

<br> 

**Output Files:**  
The following content will be written to the `resources/mappings/` directory.  
- A `.csv` file containing mapping results for each processed clinical file     

<br>

*NOTE.* All data from the current release, except for UMLS data, can be downloaded directly from the project Wiki in the current release sub-page.  

***

### Generate Exact Mappings <a class="anchor" id="exact-map"></a>  
This step generates exact mappings by aligning clinical source codes to ontology dbXRefs and by finding exact matches between clinical concept labels and synonyms and ontology class labels and synonyms. This task is performed in the following steps at the OMOP `concept` and `ancestor` levels:  
1. Merge OMOP `source_codes` to UMLS `SAB` codes and then re-merge the OMOP-merged CUIs to UMLS CUIs to obtain additional source code mappings  
2. Map OMOP `source_code` to ontology `dbXRef` codes  
3. Exact match OMOP concept labels and synonyms to ontology labels and synonyms
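
Steps 2 and 3 above can be sketched with plain dictionaries. This is an illustration only, not the `ConceptAnnotator` implementation: the `exact_map` helper, the field names of the toy concept records, the toy source codes, and the `example.org` URIs are all hypothetical.

```python
def exact_map(concepts, ontology_data):
    """Collect dbXRef and exact string matches for each clinical concept (hypothetical helper)."""
    results = {}
    for concept in concepts:
        hits = []
        # step 2: match the clinical source code against ontology dbXRefs
        code = concept['source_code'].lower()
        if code in ontology_data['dbxref']:
            hits.append(('dbxref', code, ontology_data['dbxref'][code]))
        # step 3: match the clinical label and synonyms against ontology labels and synonyms
        for text in [concept['label']] + concept['synonyms']:
            key = text.lower()
            for field in ('label', 'synonym'):
                if key in ontology_data[field]:
                    hits.append((field, key, ontology_data[field][key]))
        results[concept['concept_id']] = hits
    return results


# toy ontology content and toy clinical concepts (all identifiers are made up)
toy_ontology = {
    'dbxref': {'snomed:111': 'http://example.org/ONT_0000001'},
    'label': {'toy disorder': 'http://example.org/ONT_0000001'},
    'synonym': {'toy disease': 'http://example.org/ONT_0000001'},
}
toy_concepts = [
    {'concept_id': 1, 'source_code': 'SNOMED:111', 'label': 'Toy disorder', 'synonyms': ['Toy disease']},
    {'concept_id': 2, 'source_code': 'SNOMED:999', 'label': 'Unmatched concept', 'synonyms': []},
]
matches = exact_map(toy_concepts, toy_ontology)
for concept_id, hits in matches.items():
    print(concept_id, hits)
```

Keeping the evidence type (`dbxref`, `label`, or `synonym`) with each hit makes it possible to combine and rank the evidence later, when results are aggregated.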

<br>

**Output Files:**  
The following content will be written to the `resources/mappings/` directory.  
- A `.csv` file containing mapping results for each processed clinical file  


**Set Input Parameters**  
Set `clinical_domain` in the cell below to one clinical domain at a time and process each domain separately. 
- If you have an OMOP database instance and want to obtain concepts to map, you can run the queries we make available as a GitHub Gist [here](https://gist.github.com/callahantiff/7b84c1bc063ad162bf5bdf5e578d402f).  
- If you have your clinical OMOP data in a Google Cloud Storage Bucket, you can use the [`google_cloud_storage_downloader.py`](https://github.com/callahantiff/OMOP2OBO/blob/master/google_cloud_storage_downloader.py) script to automatically download it into your project repository.

<br>

**The OMOP2OBO Parameters Include:**  
- `clinical_data`: A Pandas DataFrame containing clinical data.
- `ontology_dictionary`: A nested dictionary containing ontology data, where outer keys are ontology identifiers
    (e.g. "hp", "mondo") and inner keys are data types (e.g. "label", "definition", "dbxref", and "synonym").
    - Each inner key points to a third dictionary whose keys are strings of that data type and whose values
    are the ontology URIs for those strings.
- `primary_key`: A string containing the column name of the primary key.
- `concept_codes`: A list of column names containing concept-level codes (optional).
- `concept_strings`: A list of column names containing concept-level labels and synonyms (optional).
- `ancestor_codes`: A list of column names containing ancestor concept-level codes (optional).
- `ancestor_strings`: A list of column names containing ancestor concept-level labels and synonyms (optional).
- `umls_cui_data`: A Pandas DataFrame containing UMLS CUI data from MRCONSO.RRF.
- `umls_tui_data`: A Pandas DataFrame containing UMLS semantic type (TUI) data from MRSTY.RRF.
- `source_code_map`: A dictionary containing clinical vocabulary source code abbreviations.
- `umls_double_merge`: A `bool` specifying whether to merge UMLS SAB codes with OMOP source codes once or twice.
    - Merging once will only align OMOP source codes to UMLS SAB  
    - Merging twice will take the CUIs from merging once and merge them again with the full UMLS SAB set, resulting in a larger set of matches. The default value is `True`, which means that the merge will be performed twice.
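
The difference between merging once and merging twice can be sketched as follows. This is a simplified illustration, not the `omop2obo` implementation: the `umls_merge` helper, the toy CUIs, and the toy source codes are hypothetical, and the real program works on Pandas DataFrames rather than dictionaries.

```python
def umls_merge(source_codes, umls_rows, double_merge=True):
    """Return the set of UMLS source codes reachable from each OMOP source code."""
    code_to_cuis, cui_to_codes = {}, {}
    for cui, code in umls_rows:
        code_to_cuis.setdefault(code, set()).add(cui)
        cui_to_codes.setdefault(cui, set()).add(code)

    expanded = {}
    for code in source_codes:
        # first merge: align the OMOP source code to UMLS rows with the same code
        cuis = code_to_cuis.get(code, set())
        if not double_merge:
            expanded[code] = {code} if cuis else set()
        else:
            # second merge: pull in every code that shares a CUI with the first match
            expanded[code] = set().union(*(cui_to_codes[cui] for cui in cuis))
    return expanded


# toy (CUI, source code) pairs; none of these are real UMLS identifiers
toy_umls = [('C0000001', 'snomed:111'), ('C0000001', 'icd10cm:a00'), ('C0000002', 'snomed:222')]
toy_codes = ['snomed:111', 'snomed:999']

once = umls_merge(toy_codes, toy_umls, double_merge=False)
twice = umls_merge(toy_codes, toy_umls, double_merge=True)
print(once)
print(twice)
```

In this toy example the second merge adds a code from another vocabulary that shares the same CUI, which is what produces the larger set of candidate dbXRef matches.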


In [None]:
# select a clinical domain to run ('CONDITIONS', 'DRUGS', or 'LABS')
clinical_domain = 'CONDITIONS'

# point to clinical data
clinical_data = 'file path to data.csv'

# set umls merge type
umls_merge_type = True


In [None]:
if clinical_domain == 'CONDITIONS':
    onts = ['hp', 'mondo']
    primary_key = 'CONCEPT_ID'
    concept_codes = tuple(['CONCEPT_SOURCE_CODE'])
    concept_strings = tuple(['CONCEPT_SOURCE_LABEL', 'CONCEPT_SYNONYM'])
    ancestor_codes = tuple(['ANCESTOR_SOURCE_CODE'])
    ancestor_strings = tuple(['ANCESTOR_LABEL'])
    outfile = 'resources/mappings/condition_codes/OMOP2OBO_MAPPED_'
elif clinical_domain == 'DRUGS':
    onts = ['chebi', 'pr', 'ncbitaxon', 'vo']
    primary_key = 'INGREDIENT_CONCEPT_ID'
    concept_codes = tuple(['INGREDIENT_SOURCE_CODE'])
    concept_strings = tuple(['INGREDIENT_LABEL', 'INGREDIENT_SYNONYM'])
    ancestor_codes = tuple(['INGRED_ANCESTOR_SOURCE_CODE'])
    ancestor_strings = tuple(['INGRED_ANCESTOR_LABEL'])
    outfile = 'resources/mappings/medication_codes/OMOP2OBO_MAPPED_'
else:
    onts = ['hp', 'uberon', 'cl', 'chebi', 'pr', 'ncbitaxon']
    primary_key = 'CONCEPT_ID'
    concept_codes = tuple(['CONCEPT_SOURCE_CODE'])
    concept_strings = tuple(['CONCEPT_LABEL', 'CONCEPT_SYNONYM'])
    ancestor_codes = tuple(['ANCESTOR_SOURCE_CODE'])
    ancestor_strings = tuple(['ANCESTOR_LABEL'])
    outfile = 'resources/mappings/laboratory_tests/OMOP2OBO_MAPPED_'


**Perform Exact Mapping**

In [None]:
mapper = ConceptAnnotator(clinical_file=clinical_data,
                          ontology_dictionary={k: v for k, v in ont_data.items() if k in onts},
                          umls_expand=umls_merge_type,
                          primary_key=primary_key,
                          concept_codes=concept_codes,
                          concept_strings=concept_strings,
                          ancestor_codes=ancestor_codes,
                          ancestor_strings=ancestor_strings,
                          umls_mrconso_file=glob.glob('resources/mappings/*MRCONSO*')[0]
                          if len(glob.glob('resources/mappings/*MRCONSO*')) > 0 else None,
                          umls_mrsty_file=glob.glob('resources/mappings/*MRSTY*')[0]
                          if len(glob.glob('resources/mappings/*MRSTY*')) > 0 else None)

mappings = mapper.clinical_concept_mapper()
    

In [None]:
print('\nSaving Results: {}'.format('Exact Match'))
mappings.to_csv(outfile + clinical_domain.upper() + date_today + '.csv', sep=',', index=False, header=True)

# get column names -- used later to organize output
start_cols = [i for i in mappings.columns if not any(j for j in ['STR', 'DBXREF', 'EVIDENCE'] if j in i)]
exact_cols = [i for i in mappings.columns if i not in start_cols]


***

### Generate Fuzzy Mappings <a class="anchor" id="fuzzy-map"></a> 
This step builds a Term Frequency-Inverse Document Frequency (TF-IDF)-weighted Bag-of-Words model and uses it to identify mappings between the OMOP clinical concepts and ontology terms. To build this model, clinical labels and synonyms are processed along with ontology labels, definitions, and synonyms. Only those matches between clinical concepts and ontology terms with a score `>=0.2` are exported.  
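
The idea behind this step can be sketched in pure Python. The block below is a minimal illustration, not the `SimilarStringFinder` implementation: tokenization is a simple whitespace split, the smoothed IDF formula is one common variant chosen for the sketch, and the helper names and toy strings are hypothetical. Only the `0.2` threshold comes from the description above.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one L2-normalized TF-IDF vector (a dict of token weights) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # smoothed inverse document frequency (an assumption of this sketch)
        vec = {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similar_terms(clinical_strings, ontology_strings, threshold=0.2):
    """Return ontology strings with cosine similarity >= threshold for each clinical string."""
    vectors = tfidf_vectors(clinical_strings + ontology_strings)
    clin_vecs = vectors[:len(clinical_strings)]
    ont_vecs = vectors[len(clinical_strings):]
    results = {}
    for text, vec in zip(clinical_strings, clin_vecs):
        # vectors are unit length, so the dot product is the cosine similarity
        scored = [(sum(w * other.get(t, 0.0) for t, w in vec.items()), ont)
                  for ont, other in zip(ontology_strings, ont_vecs)]
        results[text] = sorted([s for s in scored if s[0] >= threshold], reverse=True)
    return results


toy_matches = similar_terms(['chronic kidney disease'],
                            ['kidney disease', 'chronic renal disease', 'osteoclast'])
for score, term in toy_matches['chronic kidney disease']:
    print(round(score, 2), term)
```

Strings that share no tokens with the clinical concept score zero and are dropped, while partial overlaps are ranked by how rare the shared tokens are across the corpus.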

<br>

**Output Files:**  
The following content will be written to the `resources/mappings/` directory.  
- A `.csv` file containing mapping results for each processed clinical file  


In [None]:
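# tfidf_mapping is not defined earlier in this notebook; set it to None to skip fuzzy mapping
tfidf_mapping = True
if tfidf_mapping is not None: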
    sim = SimilarStringFinder(clinical_file=outfile + clinical_domain.upper() + date_today + '.csv',
                              ontology_dictionary={k: v for k, v in ont_data.items() if k in onts},
                              primary_key=primary_key,
                              concept_strings=concept_strings)

    sim_mappings = sim.performs_similarity_search()
    
    # get column names -- used later to organize output
    sim_mappings = sim_mappings[[primary_key] + [x for x in sim_mappings.columns if 'SIM' in x]].drop_duplicates()
    sim_cols = [i for i in sim_mappings.columns if not any(j for j in start_cols if j in i)]

    # merge dbXref, exact string, and TF-IDF similarity results
    merged_scores = pd.merge(mappings, sim_mappings, how='left', on=primary_key)
    mappings = merged_scores[start_cols + exact_cols + sim_cols]

    print('\nSaving Results: {}'.format('TF-IDF Cosine Similarity'))
    mappings.to_csv(outfile + clinical_domain.upper() + date_today + '.csv', sep=',', index=False, header=True)
        

<br><br>

***
## Aggregate Mapping Results<a class="anchor" id="aggregate-mapping-results"></a>
***

**Purpose:**
This step is designed to compile and aggregate the mapping results from running the exact and fuzzy mapping steps. The goal is to parse all of the results for a given clinical domain and each utilized ontology and return a single result (i.e. `ontology uri`, `ontology label`, `mapping category`, and `mapping evidence`). An example of the mapping categories and a mapping result for `OMOP_` (Apraxia) is shown below:
<img width="750" alt="Screen Shot 2020-09-20 at 22 37 23" src="https://user-images.githubusercontent.com/8030363/93733253-efc95400-fb91-11ea-8a61-a614113bd7eb.png">
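
As a rough illustration of how several kinds of evidence can be reduced to a single result, consider the sketch below. It is not the `aggregates_mapping_results` implementation: the `aggregate_evidence` helper, the category names, the evidence strings, and the `example.org` URIs are hypothetical, and treating `0.25` as a similarity cut-off is an assumption of this sketch.

```python
def aggregate_evidence(dbxref_uris, string_uris, similarity_scores, threshold=0.25):
    """Combine exact and similarity evidence into one result (hypothetical categories)."""
    exact = sorted(set(dbxref_uris) | set(string_uris))
    if exact:
        # exact evidence takes priority over similarity evidence
        evidence = []
        if dbxref_uris:
            evidence.append('dbxref')
        if string_uris:
            evidence.append('exact string')
        return {'uris': exact, 'category': 'exact', 'evidence': ' | '.join(evidence)}

    # otherwise keep similarity matches at or above the threshold, best score first
    similar = sorted((uri for uri, score in similarity_scores.items() if score >= threshold),
                     key=lambda uri: -similarity_scores[uri])
    if similar:
        return {'uris': similar, 'category': 'similarity',
                'evidence': 'cosine similarity >= {}'.format(threshold)}
    return {'uris': [], 'category': 'unmapped', 'evidence': ''}


# toy inputs with made-up URIs
with_exact = aggregate_evidence(['http://example.org/ONT_1'], ['http://example.org/ONT_1'],
                                {'http://example.org/ONT_2': 0.9})
only_similar = aggregate_evidence([], [], {'http://example.org/ONT_2': 0.4,
                                           'http://example.org/ONT_3': 0.1})
print(with_exact)
print(only_similar)
```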


<br>

**Input Files:** A `.csv` file output from running `omop2obo` to obtain exact and fuzzy mappings. 

<br>

**Output Files:**  
The following content will be written to the `resources/mappings/` directory.  
- A `.csv` file containing mapping results for each processed clinical file  

<br>

*NOTE.* All data from the current release, except for UMLS data, can be downloaded directly from the project Wiki in the current release sub-page.  

In [None]:
# clean up output
if clinical_domain == 'LABS':
    result_type_idx, updated_data = list(mappings.columns).index('RESULT_TYPE'), []
    for idx, row in mappings.iterrows():
        if row['RESULT_TYPE'] == 'Normal/Low/High' or row['RESULT_TYPE'] == 'Negative/Positive':
            for x in row['RESULT_TYPE'].split('/'):
                updated = list(row)
                updated[result_type_idx] = x
                updated_data.append(updated)
        else:
            updated_data.append(list(row))

    # replace values
    data_expanded = pd.DataFrame(updated_data, columns=list(mappings.columns))
else:
    data_expanded = mappings.copy()
data_expanded.fillna('', inplace=True)

# aggregate mapping evidence
updated_mappings = aggregates_mapping_results(data_expanded, onts, ont_data, mapper.source_code_map, 0.25)
updated_mappings.to_csv(outfile + clinical_domain.upper() + date_today + '.csv', sep=',', index=False, header=True)
    