# How to convert MeSH Terminology to OBO

In this example, we show how to convert the MeSH terminology to OBO format so that it can be imported into Information Discovery.

### About MeSH

Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it serves as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings. MeSH is also used by ClinicalTrials.gov registry to classify which diseases are studied by trials registered in ClinicalTrials.gov. [Source: Wikipedia]

### Download MeSH

We use the MeSH ASCII format for conversion to OBO. You can find the downloadable file [here](https://www.nlm.nih.gov/mesh/filelist.html) (in 2018, the required file was named d2018.bin).

### Okay, let's start now

We will consider the following MeSH elements:
 - Unique IDs (UI)
 - Main Headings (MH)
 - Entry Terms (ENTRY), and 
 - Tree Numbers (MN), which encode the hierarchy. 
 
But first, we need to add the top-level concepts of MeSH, because the distribution file does not include them. I don't know why they are missing. One day I'll ask, but for now we hardcode them.

In [5]:
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# define top level terms as they are not part of the MeSH distribution
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

topLevelTerms = {'A': 'Anatomy', 
                'B': 'Organisms',
                'C': 'Diseases',
                'D': 'Chemicals and Drugs',
                'E': 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment',
                'F': 'Psychiatry and Psychology',
                'G': 'Phenomena and Processes',
                'H': 'Disciplines and Occupations',
                'I': 'Anthropology, Education, Sociology, and Social Phenomena',
                'J': 'Technology, Industry, and Agriculture',
                'K': 'Humanities',
                'L': 'Information Science',
                'M': 'Named Groups',
                'N': 'Health Care',
                'V': 'Publication Characteristics',
                'Z': 'Geographicals'}
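
To illustrate how these letters are used later on: the first character of a tree ID names its top-level category. The following is only a minimal sketch. The helper `top_level_name`, the shortened dictionary, and the sample tree IDs are assumptions made for illustration and are not part of the conversion itself.

```python
# Shortened copy of the top-level dictionary, for illustration only.
top_level_terms = {'A': 'Anatomy', 'C': 'Diseases'}

def top_level_name(tree_id):
    """Return the top-level category name for a tree ID, or None if unknown."""
    # tree_id[:1] is the leading letter (and '' for an empty string).
    return top_level_terms.get(tree_id[:1])

# 'A03.159' is a sample tree ID used only to show the lookup.
print(top_level_name('A03.159'))
```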



### Read MeSH into a Python list

The following functions are used to read the MeSH entries. Because it uses `yield`, the `readMeSH` function returns a generator that produces one entry at a time; wrap it in `list()` if you need all entries in a list.

In [6]:
#++++++++++++++++++++++++
# read MeSH into a list
#++++++++++++++++++++++++

from collections import defaultdict

def provideEntry(id, name):
    # build an entry with the same shape as the records read from the file
    entry = defaultdict(list)
    entry['UI'].append(id)
    entry['MH'].append(name)
    return entry

def readMeSH(fin):
    #start with the top level terms
    for id, name in topLevelTerms.items():
        yield provideEntry(id, name)
    #continue with all others
    entry = None
    for line in fin:
        line = line.strip()
        if not line:
            continue
        #in MeSH, new records are marked like this
        if line == "*NEWRECORD":
            #emit the previous record before starting a new one
            if entry:
                yield entry
            entry = defaultdict(list)
            continue
        # e.g., MH = Biliary Tract
        #a key such as MN or ENTRY can occur several times, so values are collected in lists
        key, _, value = line.partition(" = ")
        entry[key].append(value)
    #emit the last record of the file
    if entry:
        yield entry
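
To see what the generator produces, it can help to run the same record loop on a small in-memory sample. The sketch below is a simplified restatement that leaves out the top-level terms. The function name `read_records` and all sample values (headings, IDs, tree IDs) are invented for illustration and are not real MeSH data.

```python
import io
from collections import defaultdict

# Invented sample in the "KEY = value" layout; the values are not real MeSH data.
sample = """*NEWRECORD
MH = Example Heading
ENTRY = Example Synonym|T000|NON
MN = A01.111
MN = C02.222
UI = D999999

*NEWRECORD
MH = Another Heading
MN = A01.111.333
UI = D999998
"""

def read_records(fin):
    """Yield one dict of lists per record (top-level terms omitted)."""
    entry = None
    for line in fin:
        line = line.strip()
        if not line:
            continue
        if line == "*NEWRECORD":
            if entry:
                yield entry
            entry = defaultdict(list)
            continue
        key, _, value = line.partition(" = ")
        entry[key].append(value)
    if entry:
        yield entry

records = list(read_records(io.StringIO(sample)))
for record in records:
    # Every value is a list, because keys such as MN may repeat.
    print(record['UI'][0], record['MH'][0], record['MN'])
```

Note that the first sample record carries two `MN` lines, so its `MN` list has two elements. This is the case the tree ID mapping has to deal with.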

### Create a mapping between Tree IDs and Unique IDs

Another special feature of MeSH is that, in addition to its unique identifier, each entry has one or more so-called tree IDs that describe its position in the hierarchy. OBO has no notion of tree identifiers; it refers to parent nodes by their unique IDs. We therefore build a lookup table that maps each tree ID to the unique ID of its entry, so that we can use a tree ID to determine the parent's unique ID.

Note that this method does not reproduce the MeSH hierarchy exactly. Consider the following example:

`C child_of B child_of A`

`B child_of D`

MeSH allows C to be a child of B without being a descendant of D, because B can have two tree IDs (one below A and one below D) while C is attached only to B's position below A. With our simplification of replacing tree IDs with unique IDs, we lose this distinction. In our example, C automatically becomes a descendant of D.
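
The effect can be reproduced with a few lines of code. This is a sketch under stated assumptions: the tree IDs (`X01`, `X02.200`, and so on) are invented, the labels A to D stand in for unique IDs, and the helper names `tree_ancestors` and `uid_ancestors` are hypothetical.

```python
# Invented tree IDs -> unique IDs (labels A-D follow the example above).
tree_to_uid = {
    'X01': 'A',
    'X02': 'D',
    'X01.100': 'B',
    'X02.200': 'B',      # B has a second position, below D
    'X01.100.300': 'C',  # C exists only below B's first position
}

def tree_ancestors(tree_id):
    """Ancestors according to the tree ID itself (the MeSH view)."""
    parts = tree_id.split('.')
    return {tree_to_uid['.'.join(parts[:i])] for i in range(1, len(parts))}

# Flatten the tree IDs to is_a edges between unique IDs (the OBO view).
is_a = {}
for tree_id, uid in tree_to_uid.items():
    if '.' in tree_id:
        parent_tree = tree_id.rsplit('.', 1)[0]
        is_a.setdefault(uid, set()).add(tree_to_uid[parent_tree])

def uid_ancestors(uid):
    """Ancestors reachable through the flattened is_a edges."""
    found = set()
    stack = [uid]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(tree_ancestors('X01.100.300')))  # ['A', 'B']
print(sorted(uid_ancestors('C')))             # ['A', 'B', 'D']
```

In the tree view, C has only A and B as ancestors. After flattening, D appears as well, because B's second position is merged into the same node.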

In [7]:
#++++++++++++++++++++++++++++
# map tree ids to unique ids
#
# A special feature of MeSH is that it has both unique IDs and tree IDs. 
# Therefore we create a mapping of tree IDs to unique IDs.
# 
#++++++++++++++++++++++++++++

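from collections import defaultdict

# parentList maps every tree ID to the unique ID of the entry that owns it
parentList = {}
with open('data/d2018.bin', "r", encoding='Latin-1') as infile:
    for entry in readMeSH(infile):
        # an entry can have several tree IDs (MN), one per position in the hierarchy
        for tree in entry['MN']:
            parentList[tree] = entry['UI'][0]

def getParent(id):
    parent = None
    if '.' in id:
        # cut off the last segment (e.g. 'A03.159' -> 'A03') and look up its unique ID
        parent = parentList[id.rsplit('.', 1)[0]]
    elif len(id) == 3:
        # tree IDs such as 'A03' sit directly below a hardcoded top-level term
        parent = id[0]
    return parent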
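
The parent lookup can be tried out without the full MeSH file. The following sketch uses an invented lookup table and a hypothetical variant named `get_parent` that returns `None` instead of raising `KeyError` when a parent tree ID is not in the table. Whether that tolerance is desirable depends on how strictly you want to validate the input.

```python
# Invented lookup table: tree ID -> unique ID (not real MeSH data).
parent_list = {'A01': 'D999990', 'A01.111': 'D999991'}

def get_parent(tree_id):
    """Return the parent's unique ID, a top-level letter, or None."""
    if '.' in tree_id:
        # Drop the last dot-separated segment to obtain the parent's tree ID.
        return parent_list.get(tree_id.rsplit('.', 1)[0])
    if len(tree_id) == 3:
        # Three-character tree IDs hang directly below a top-level letter.
        return tree_id[0]
    return None

for tree_id in ('A01.111.222', 'A01.111', 'A01', 'B05.999'):
    print(tree_id, '->', get_parent(tree_id))
```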

### Read MeSH and write to OBO

Finally, we read the MeSH file once more and write each entry directly in OBO format.

In [8]:
#++++++++++++++++++++++++++++
# write list to OBO
#++++++++++++++++++++++++++++

import re

# open both files in one with-statement so the OBO file is flushed and closed at the end
with open('data/d2018.bin', "r", encoding='Latin-1') as infile, \
     open('data/d2018.obo', 'w', encoding='utf-8') as fo:
    for entry in readMeSH(infile):
        print("[Term]", file=fo)
        print("id: " + entry['UI'][0], file=fo)
        print("name: " + entry['MH'][0], file=fo)
        for syn in entry['ENTRY']:
            # keep only the term itself; everything after the first '|' is dropped
            syn = re.sub(r'\|.*', '', syn)
            print("synonym: \"" + syn + "\" EXACT []", file=fo)
        done = []
        for tree in entry['MN']:
            parentId = getParent(tree)
            # skip missing parents and parents already written for this term
            if parentId is not None and parentId not in done:
                print("is_a: " + parentId, file=fo)
                done.append(parentId)
        print('', file=fo)
print('done')

done
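
To check the result, it helps to know what a single stanza in the output file looks like. The sketch below builds one `[Term]` stanza as a string, using the same line layout as the loop above. The helper name `format_term` and all values are invented for illustration.

```python
def format_term(uid, name, synonyms=(), parents=()):
    """Build one OBO [Term] stanza as a string (illustrative helper)."""
    lines = ["[Term]", "id: " + uid, "name: " + name]
    for syn in synonyms:
        lines.append('synonym: "' + syn + '" EXACT []')
    seen = set()
    for parent in parents:
        # Skip missing parents and duplicates, as in the writer loop.
        if parent and parent not in seen:
            lines.append("is_a: " + parent)
            seen.add(parent)
    return "\n".join(lines) + "\n"

stanza = format_term("D999999", "Example Heading",
                     synonyms=["Example Synonym"],
                     parents=["D999991", "D999991", None])
print(stanza)
```

Each term yields an `id` line, a `name` line, one `synonym` line per entry term, and one `is_a` line per distinct parent, followed by a blank line that separates the stanzas.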
