# Extraction and manipulation of MeSH tree/terms

Author: **Pablo Iriarte, University of Geneva - pablo.iriarte@unige.ch**

### Extracting information from MeSH thesaurus

Extracting information from MeSH follows these simple steps:

 1. Download MeSH thesaurus in XML format from https://www.nlm.nih.gov/mesh/filelist.html
 1. Extract and analyze the MeSH XML tree
 1. Select the information to extract, together with its unique identifier (UI), using loops and nested loops
 
 
### Example: extract all the possible entry terms for one particular pharmacological action

This idea was proposed by Mrs Kirsten van Gelderen-Ziesemer during the workshop "[Mining PubMed metadata with Pandas and Jupyter Notebooks](https://www.conftool.com/eahil2019/index.php?page=browseSessions&downloads=show&form_session=39&mode=table&presentations=show)", given at the EAHIL congress in Basel in June 2019.

### Notebook 1: extract all the entry terms and pharmacological actions in two files

**Input**: This notebook uses the gzip-compressed MeSH dump in XML format, "desc2018.gz", located in the same folder as the notebook.

**Output**: This notebook extracts all the pharmacological actions and all the entry terms into two separate TSV files. The second notebook, "[2_mesh_entry_terms_for_a_pharamacological_action.ipynb](2_mesh_entry_terms_for_a_pharamacological_action.ipynb)", uses these files to build all the combinations of the two and to export the result for one particular pharmacological action:

1. mesh2018_pharmacological_actions.tsv
1. mesh2018_entry_terms.tsv
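
The second notebook is not reproduced here. As a rough sketch of how the two files could be combined, the example below assumes that both share the descriptor `ui` column and uses only the standard library `csv` module. The function name `entry_terms_for_action`, the sample rows and every identifier in them (`DX001`, `TX001`, ...) are made up for illustration and are not real MeSH data.

```python
import csv
import io

# Made-up sample rows standing in for the two TSV files (not real MeSH data)
pa_tsv = (
    "name\tui\tpharmacological_action_name\tpharmacological_action_ui\n"
    "Drug A\tDX001\tAction X\tDX900\n"
    "Drug B\tDX002\tAction Y\tDX901\n"
)
et_tsv = (
    "name\tui\tentry_term_name\tentry_term_ui\n"
    "Drug A\tDX001\tDrug A\tTX001\n"
    "Drug A\tDX001\tA-Drug\tTX002\n"
    "Drug B\tDX002\tDrug B\tTX003\n"
)


def entry_terms_for_action(pa_file, et_file, action_ui):
    """Return the sorted entry terms of every descriptor linked to action_ui."""
    # Collect the UIs of the descriptors that have the requested action
    wanted = {
        row["ui"]
        for row in csv.DictReader(pa_file, delimiter="\t")
        if row["pharmacological_action_ui"] == action_ui
    }
    # Keep the entry terms whose descriptor UI is in that set
    terms = {
        row["entry_term_name"]
        for row in csv.DictReader(et_file, delimiter="\t")
        if row["ui"] in wanted
    }
    return sorted(terms)


# In-memory strings replace the files here; hypothetical action UI "DX900"
result = entry_terms_for_action(io.StringIO(pa_tsv), io.StringIO(et_tsv), "DX900")
print(result)
```

With the real files, the two `io.StringIO` objects would be replaced by files opened with `open(..., encoding='utf-8', newline='')`.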
 

### 1. Extract and analyze the MeSH XML tree

[An example of XML record is available here](../mesh_record_example.xml)

We use [Cygwin](https://www.cygwin.com/) and [XMLStarlet](http://xmlstar.sourceforge.net/) to generate the XML tree; [the result is available here](../mesh_xml_tree.md).
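
The exact XMLStarlet command used is not given here. As an alternative sketch in Python, the example below lists the distinct element paths of an XML document with the standard library `xml.etree.ElementTree` (chosen instead of lxml so that it runs without extra dependencies). The function name `unique_element_paths` and the miniature sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature document; a real run would pass the MeSH XML file instead
sample = io.BytesIO(b"""<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>DX001</DescriptorUI>
    <DescriptorName><String>Drug A</String></DescriptorName>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>DX002</DescriptorUI>
    <DescriptorName><String>Drug B</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>""")


def unique_element_paths(source):
    """Return the sorted set of slash-separated element paths in an XML source."""
    paths = set()
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            paths.add("/".join(stack))
        else:
            stack.pop()
            elem.clear()  # drop the content of elements already visited
    return sorted(paths)


paths = unique_element_paths(sample)
for path in paths:
    print(path)
```

Because the document is read as a stream and visited elements are cleared, this approach should also be usable on a large file, although it has not been run against the full MeSH dump here.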


### 2. Select the information to extract with the UID

In our case, we want to extract all the entry terms of the MeSH terms that have a particular "Pharmacological Action".

```
Term information (1):
DescriptorRecordSet/DescriptorRecord/DescriptorName/String
DescriptorRecordSet/DescriptorRecord/DescriptorUI

Pharmacological actions (n):
DescriptorRecordSet/DescriptorRecord/PharmacologicalActionList/PharmacologicalAction/DescriptorReferredTo/DescriptorName/String
DescriptorRecordSet/DescriptorRecord/PharmacologicalActionList/PharmacologicalAction/DescriptorReferredTo/DescriptorUI

Entry terms (n):
DescriptorRecordSet/DescriptorRecord/ConceptList/Concept/TermList/Term/String
DescriptorRecordSet/DescriptorRecord/ConceptList/Concept/TermList/Term/TermUI
```
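
To make the paths above concrete, here is a minimal sketch that evaluates them on a single made-up record. It uses the standard library `xml.etree.ElementTree`, whose limited path syntax is enough for these simple expressions. The record content (`Drug A`, `DX001`, `TX001`, ...) is a placeholder, not real MeSH data.

```python
import xml.etree.ElementTree as ET

# Made-up record following the element layout listed above
xml_text = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>DX001</DescriptorUI>
    <DescriptorName><String>Drug A</String></DescriptorName>
    <PharmacologicalActionList>
      <PharmacologicalAction>
        <DescriptorReferredTo>
          <DescriptorUI>DX900</DescriptorUI>
          <DescriptorName><String>Action X</String></DescriptorName>
        </DescriptorReferredTo>
      </PharmacologicalAction>
    </PharmacologicalActionList>
    <ConceptList>
      <Concept>
        <TermList>
          <Term><TermUI>TX001</TermUI><String>Drug A</String></Term>
          <Term><TermUI>TX002</TermUI><String>A-Drug</String></Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
</DescriptorRecordSet>"""

root = ET.fromstring(xml_text)          # the DescriptorRecordSet element
record = root.find("DescriptorRecord")  # paths below are relative to one record

# Term information: exactly one name and one UI per record
name = record.findtext("DescriptorName/String")
ui = record.findtext("DescriptorUI")

# Pharmacological actions: zero or more per record
pa_path = "PharmacologicalActionList/PharmacologicalAction/DescriptorReferredTo"
action_names = [e.text for e in record.findall(pa_path + "/DescriptorName/String")]
action_uis = [e.text for e in record.findall(pa_path + "/DescriptorUI")]

# Entry terms: one or more per record, spread over the concepts
term_path = "ConceptList/Concept/TermList/Term"
term_names = [e.text for e in record.findall(term_path + "/String")]
term_uis = [e.text for e in record.findall(term_path + "/TermUI")]

print(name, ui, action_names, action_uis, term_names, term_uis)
```

The relative path `DescriptorUI` only matches direct children of the record, so the UI of a referred pharmacological action is not mistaken for the UI of the term itself.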


### 3. Parse the XML file and export all the Entry Terms and Pharmacological Actions

A MeSH term can have more than one pharmacological action and more than one entry term. In those cases we write one row per pharmacological action or entry term, repeating the name and the UI of the MeSH term on each row.
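
As a small illustration of this row layout, the sketch below flattens one made-up record with two pharmacological actions into two tab-separated rows. The helper name `action_rows` and all names and identifiers in the record are hypothetical. The standard library `xml.etree.ElementTree` and `csv` modules are used here so that the example runs without extra dependencies.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up record with two pharmacological actions (placeholder names and UIs)
record = ET.fromstring("""<DescriptorRecord>
  <DescriptorUI>DX001</DescriptorUI>
  <DescriptorName><String>Drug A</String></DescriptorName>
  <PharmacologicalActionList>
    <PharmacologicalAction><DescriptorReferredTo>
      <DescriptorUI>DX900</DescriptorUI>
      <DescriptorName><String>Action X</String></DescriptorName>
    </DescriptorReferredTo></PharmacologicalAction>
    <PharmacologicalAction><DescriptorReferredTo>
      <DescriptorUI>DX901</DescriptorUI>
      <DescriptorName><String>Action Y</String></DescriptorName>
    </DescriptorReferredTo></PharmacologicalAction>
  </PharmacologicalActionList>
</DescriptorRecord>""")


def action_rows(rec):
    """Yield one (name, ui, action_name, action_ui) tuple per pharmacological action."""
    name = rec.findtext("DescriptorName/String")
    ui = rec.findtext("DescriptorUI")
    path = "PharmacologicalActionList/PharmacologicalAction/DescriptorReferredTo"
    for ref in rec.findall(path):
        # The term name and UI are repeated on every row
        yield (name, ui, ref.findtext("DescriptorName/String"), ref.findtext("DescriptorUI"))


rows = list(action_rows(record))

# Write the rows as tab-separated text into an in-memory buffer
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

A record without any pharmacological action yields no rows at all, which matches the behavior of skipping such terms in the output file.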


```python
from lxml import etree

# Path to the MeSH file in gzip-compressed XML format
myfilein = 'desc2018.gz'

# Name of the file with the results
myfileout = 'mesh2018_pharmacological_actions.tsv'

# Parse the XML (the compressed file is passed to lxml directly)
root = etree.parse(myfilein)

# Select all the descriptor records
mesh_terms = root.xpath('/DescriptorRecordSet/DescriptorRecord')

# Create the output file; the with block closes it even if an error occurs
with open(myfileout, mode='w', encoding='utf-8') as file:
    # Write the header line
    file.write('name\tui\tpharmacological_action_name\tpharmacological_action_ui\n')

    for mesh_term in mesh_terms:
        mesh_name = mesh_term.xpath('DescriptorName/String')[0].text
        mesh_ui = mesh_term.xpath('DescriptorUI')[0].text

        # Loop over the pharmacological actions (empty list if the term has none)
        pharmacological_actions = mesh_term.xpath('PharmacologicalActionList/PharmacologicalAction/DescriptorReferredTo')
        for pharmacological_action in pharmacological_actions:
            pa_name = pharmacological_action.xpath('DescriptorName/String')[0].text
            pa_ui = pharmacological_action.xpath('DescriptorUI')[0].text
            # Write one row per (term, pharmacological action) pair
            file.write('\t'.join([mesh_name, mesh_ui, pa_name, pa_ui]) + '\n')
```

```python
# Same process, but for the entry terms (etree is imported in the first cell)

# Path to the MeSH file in gzip-compressed XML format
myfilein = 'desc2018.gz'

# Name of the file with the results
myfileout = 'mesh2018_entry_terms.tsv'

# Parse the XML (the compressed file is passed to lxml directly)
root = etree.parse(myfilein)

# Select all the descriptor records
mesh_terms = root.xpath('/DescriptorRecordSet/DescriptorRecord')

# Create the output file; the with block closes it even if an error occurs
with open(myfileout, mode='w', encoding='utf-8') as file:
    # Write the header line
    file.write('name\tui\tentry_term_name\tentry_term_ui\n')

    for mesh_term in mesh_terms:
        mesh_name = mesh_term.xpath('DescriptorName/String')[0].text
        mesh_ui = mesh_term.xpath('DescriptorUI')[0].text

        # Loop over the entry terms of all the concepts of this term
        entry_terms = mesh_term.xpath('ConceptList/Concept/TermList/Term')
        for entry_term in entry_terms:
            et_name = entry_term.xpath('String')[0].text
            et_ui = entry_term.xpath('TermUI')[0].text
            # Write one row per (term, entry term) pair
            file.write('\t'.join([mesh_name, mesh_ui, et_name, et_ui]) + '\n')
```