MeSH Descriptors
================

This notebook contains code to parse and clean the health and medical terms from the NIH Medical Subject Headings. The original files can be found on their FTP site [here](ftp://nlmpubs.nlm.nih.gov/online/mesh/MESH_FILES/xmlmesh/).

There are some existing resources for dealing with MeSH files. These include:

* [Working with MeSH Files in Python](https://code.tutsplus.com/tutorials/working-with-mesh-files-in-python-linking-terms-and-numbers--cms-28587) - a rudimentary approach to parsing the available .bin files.
* [mesh-tree](https://github.com/scienceai/mesh-tree) - a Java library that parses and provides many useful functions for handling MeSH files.

In [18]:
import os
import pandas as pd
import numpy as np
import json
import xml.etree.ElementTree as ET
import xmltodict
from datetime import datetime

pd.options.display.max_colwidth = 100
pd.options.display.max_columns = 999

In [2]:
%matplotlib inline
#NB I open a standard set of directories

#Paths

#Get the top path
top_path = os.path.dirname(os.getcwd())

#Create the path for external data
ext_data = os.path.join(top_path,'data/external')

#Raw path (for html downloads)

raw_data = os.path.join(top_path,'data/raw')

#And external data
proc_data = os.path.join(top_path,'data/processed')

fig_path = os.path.join(top_path,'reports/figures')

#Get date for saving files
today = datetime.utcnow()

today_str = "_".join([str(x) for x in [today.month, today.day, today.year]])

## Approach 1 - xmltodict

This approach uses the very handy `xmltodict` library, which unsurprisingly parses an XML file into a Python dict.

In [4]:
with open(ext_data + '/desc2018.xml', 'r') as f:
    desc_2018_xml = f.read()

In [None]:
desc_2018_json = xmltodict.parse(desc_2018_xml)

This essentially does everything that we need. From here we can create maps between various attributes of the terms, to use for analysis. 

As there is no Python API for interfacing with the MeSH services, a useful thing to do might be to create a wrapper class for the MeSH tree. An idea of how this might look and be used is shown here. It would essentially serve as a class to download and parse the latest MeSH files, and to provide convenience functions for creating mappings.

In [70]:
class MeSHDescriptors():
    def __init__(self, mesh_descriptor_dict=None, file=None, url=None):
        """MeSHDescriptors
        Read, parse or download MeSH descriptor XML files.
        """
        if mesh_descriptor_dict is not None:
            self.descriptors = mesh_descriptor_dict
        elif file is not None:
            self.descriptors = self.read_mesh_xml(file)
        elif url is not None:
            self.descriptors = self.read_remote_xml(file)
    
    def read_mesh_xml(self, file):
        """read_mesh_xml
        Reads and parses from XML file.
        """
        with open(file, 'rb') as f:
            desc_2018_xml = f.read()
        self.descriptors = xmltodict.parse(desc_2018_xml)
    
    def descriptor_ui_2_tree_number(self):
        """descriptor_ui_2_tree_number
        Create a mapping between DescriptorUI and TreeNumber fields.
        """
        mapper = {}
        for d in self.descriptors['DescriptorRecordSet']['DescriptorRecord']:
            k = d['DescriptorUI']
            if k is not None:
                v = d.get('TreeNumberList')
                if v is not None:
                    v = v.get('TreeNumber')
            mapper[k] = v
        return mapper
    
    def to_json(self, file_path=None):
        """to_json
        Serialize the parsed descriptors as a json.
        """
        with open(file_path, 'w') as f:
            json.dump(self.descriptors, f)

In [66]:
mesh_descriptors = MeSHDescriptors(desc_2018_json)

As an intial example, we can create a mapping between the _DescriptorUI_ and the _TreeNumber_.

In [67]:
dui_tree_number_map = mesh_descriptors.descriptor_ui_2_tree_number()

In [71]:
dui_tree_number_map['D013334']

'M01.848'

From here, it is obvious how we might create further mappings that could be useful to make increased use of the full information available fom the descriptors.

## 2. An Alternate Route - XML to DataFrame

This was the original approach to parsing the MeSH term XML file. It seems irrelavent now that the `xmltodict` method is in use, however I have left it here for interest.

In [3]:
# Adapted from 
# http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe/
# The original did not account for structures where the last children shared names but not parents as 
# occurs in this dataset. This gives messier names, but all the information.

class XML2DataFrame:

    def __init__(self, xml_data):
#         parser = ET.XMLParser(encoding="utf-8")
#         self.root = ET.fromstring(xml_data, parser=parser)
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        return [self.parse_element(child, 'Root') for child in iter(root)]

    def parse_element(self, element, parent_name, parsed=None):
        if parsed is None:
            parsed = dict()
        for key in element.keys():
            parsed[parent_name + key] = element.attrib.get(key)
        if element.text:
            h_key = parent_name + element.tag
#             if h_key in parsed:
#                 h_key = h_key + '_1'
            parsed[h_key] = element.text
        for child in list(element):
            self.parse_element(child, element.tag, parsed)
        return parsed

    def process_data(self):
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)

In [174]:
desc_2018_df.head()

Unnamed: 0,AllowableQualifierAbbreviation,AllowableQualifierQualifierReferredTo,AllowableQualifiersListAllowableQualifier,ConceptCASN1Name,ConceptConceptName,ConceptConceptRelationList,ConceptConceptUI,ConceptListConcept,ConceptListPreferredConceptYN,ConceptNameString,ConceptRegistryNumber,ConceptRelatedRegistryNumberList,ConceptRelationConcept1UI,ConceptRelationConcept2UI,ConceptRelationListConceptRelation,ConceptRelationListRelationName,ConceptScopeNote,ConceptTermList,DateCreatedDay,DateCreatedMonth,DateCreatedYear,DateEstablishedDay,DateEstablishedMonth,DateEstablishedYear,DateRevisedDay,DateRevisedMonth,DateRevisedYear,DescriptorNameString,DescriptorRecordAllowableQualifiersList,DescriptorRecordAnnotation,DescriptorRecordConceptList,DescriptorRecordConsiderAlso,DescriptorRecordDateCreated,DescriptorRecordDateEstablished,DescriptorRecordDateRevised,DescriptorRecordDescriptorName,DescriptorRecordDescriptorUI,DescriptorRecordEntryCombinationList,DescriptorRecordHistoryNote,DescriptorRecordNLMClassificationNumber,DescriptorRecordOnlineNote,DescriptorRecordPharmacologicalActionList,DescriptorRecordPreviousIndexingList,DescriptorRecordPublicMeSHNote,DescriptorRecordSeeRelatedList,DescriptorRecordTreeNumberList,DescriptorReferredToDescriptorName,DescriptorReferredToDescriptorUI,ECINDescriptorReferredTo,ECINQualifierReferredTo,ECOUTDescriptorReferredTo,ECOUTQualifierReferredTo,EntryCombinationECIN,EntryCombinationECOUT,EntryCombinationListEntryCombination,PharmacologicalActionDescriptorReferredTo,PharmacologicalActionListPharmacologicalAction,PreviousIndexingListPreviousIndexing,QualifierNameString,QualifierReferredToQualifierName,QualifierReferredToQualifierUI,RelatedRegistryNumberListRelatedRegistryNumber,RootDescriptorClass,RootDescriptorRecord,SeeRelatedDescriptorDescriptorReferredTo,SeeRelatedListSeeRelatedDescriptor,TermDateCreated,TermEntryVersion,TermListConceptPreferredTermYN,TermListIsPermutedTermYN,TermListLexicalTag,TermListRecordPreferredTermYN,TermListTerm,TermSortVersion,TermString,TermTermUI,TermThesaurusIDlist,ThesaurusIDlistThesaurusID,TreeNumberListTreeNumber
0,TO,\n,\n,"4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrro...",\n,\n,M0353609,\n,N,A-23187,0.0,\n,M0000001,M0353609,\n,NRW,"An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports C...",\n,8,3,1990,1,1,1984,27,5,2016,Calcium Ionophores,\n,,\n,,\n,\n,\n,\n,D000001,,91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n,,use CALCIMYCIN to search A 23187 1975-90\n,\n,\n,91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n,,\n,\n,D061207,,,,,,,,\n,\n,Carboxylic Acids (1973-1974),toxicity,\n,Q000633,52665-69-7 (Calcimycin),1,\n,,,\n,,N,Y,NON,N,\n,,"A23187, Antibiotic",T000003,\n,NLM (1991),D03.633.100.221.173
1,TO,\n,\n,"Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester",\n,\n,M0352200,\n,N,Difos,0.0,\n,M0000002,M0352200,\n,NRW,An organothiophosphate insecticide.\n,\n,7,10,1986,1,1,1991,8,7,2013,Insecticides,\n,"for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropri...",\n,,\n,\n,\n,\n,D000002,,"96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n",,,\n,\n,"96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n",,\n,\n,D007306,,,,,,,,\n,\n,Insecticides (1966-1971),toxicity,\n,Q000633,3383-96-8 (Temefos),1,\n,,,\n,,Y,N,TRD,N,\n,,Difos,T000006,\n,UNK (19XX),D02.886.300.692.800
2,ES,\n,\n,,\n,,M0000003,\n,Y,Abattoirs,,,,,,,Places where animals are slaughtered and dressed for market.\n,\n,29,3,1974,1,1,1966,8,6,2016,Abattoirs,\n,,\n,,\n,\n,\n,\n,D000003,,,WA 707,,,,,,\n,,,,,,,,,,,,,ethics,\n,Q000941,,1,\n,,,\n,,N,Y,NON,N,\n,,Slaughterhouse,T000010,\n,UNK (19XX),J03.540.020
3,,,,,\n,\n,M0511063,\n,N,Acronyms as Topic,,,M0000004,M0511063,\n,NRW,Works about shortened forms of written words or phrases used for brevity.\n,\n,29,6,2007,1,1,1960,30,6,2017,Abbreviations as Topic,,includes acronyms; do not confuse with Publication Type ABBREVIATIONS\n,\n,,\n,\n,\n,\n,D000004,,2008(1963)\n,,,,,2008; see ABBREVIATIONS 1963-2007\n,,\n,,,,,,,,,,,,,,,,,1,\n,,,\n,,Y,N,NON,N,\n,,Acronyms as Topic,T701041,\n,NLM (2008),L01.559.598.400.556.131
4,VI,\n,\n,,\n,,M0000005,\n,Y,Abdomen,,,,,,,That portion of the body that lies between the THORAX and the PELVIS.\n,\n,1,1,1999,1,1,1966,9,8,2016,Abdominal Injuries,\n,GEN: prefer specifics; abdom muscles = ABDOMINAL MUSCLES but RECTUS ABDOMINIS is available; abdo...,\n,,\n,\n,\n,\n,D000005,\n,,,,,,,,\n,\n,D000007,\n,\n,\n,,\n,\n,\n,,,,injuries,\n,Q000293,,1,\n,,,\n,,N,Y,NON,N,\n,,Abdomens,T000012,\n,NLM (1966),A01.923.047


In [175]:
desc_2018_df.columns

Index(['AllowableQualifierAbbreviation',
       'AllowableQualifierQualifierReferredTo',
       'AllowableQualifiersListAllowableQualifier', 'ConceptCASN1Name',
       'ConceptConceptName', 'ConceptConceptRelationList', 'ConceptConceptUI',
       'ConceptListConcept', 'ConceptListPreferredConceptYN',
       'ConceptNameString', 'ConceptRegistryNumber',
       'ConceptRelatedRegistryNumberList', 'ConceptRelationConcept1UI',
       'ConceptRelationConcept2UI', 'ConceptRelationListConceptRelation',
       'ConceptRelationListRelationName', 'ConceptScopeNote',
       'ConceptTermList', 'DateCreatedDay', 'DateCreatedMonth',
       'DateCreatedYear', 'DateEstablishedDay', 'DateEstablishedMonth',
       'DateEstablishedYear', 'DateRevisedDay', 'DateRevisedMonth',
       'DateRevisedYear', 'DescriptorNameString',
       'DescriptorRecordAllowableQualifiersList', 'DescriptorRecordAnnotation',
       'DescriptorRecordConceptList', 'DescriptorRecordConsiderAlso',
       'DescriptorRecordDateCreat

In [176]:
desc_2018_df.head(1)

Unnamed: 0,AllowableQualifierAbbreviation,AllowableQualifierQualifierReferredTo,AllowableQualifiersListAllowableQualifier,ConceptCASN1Name,ConceptConceptName,ConceptConceptRelationList,ConceptConceptUI,ConceptListConcept,ConceptListPreferredConceptYN,ConceptNameString,ConceptRegistryNumber,ConceptRelatedRegistryNumberList,ConceptRelationConcept1UI,ConceptRelationConcept2UI,ConceptRelationListConceptRelation,ConceptRelationListRelationName,ConceptScopeNote,ConceptTermList,DateCreatedDay,DateCreatedMonth,DateCreatedYear,DateEstablishedDay,DateEstablishedMonth,DateEstablishedYear,DateRevisedDay,DateRevisedMonth,DateRevisedYear,DescriptorNameString,DescriptorRecordAllowableQualifiersList,DescriptorRecordAnnotation,DescriptorRecordConceptList,DescriptorRecordConsiderAlso,DescriptorRecordDateCreated,DescriptorRecordDateEstablished,DescriptorRecordDateRevised,DescriptorRecordDescriptorName,DescriptorRecordDescriptorUI,DescriptorRecordEntryCombinationList,DescriptorRecordHistoryNote,DescriptorRecordNLMClassificationNumber,DescriptorRecordOnlineNote,DescriptorRecordPharmacologicalActionList,DescriptorRecordPreviousIndexingList,DescriptorRecordPublicMeSHNote,DescriptorRecordSeeRelatedList,DescriptorRecordTreeNumberList,DescriptorReferredToDescriptorName,DescriptorReferredToDescriptorUI,ECINDescriptorReferredTo,ECINQualifierReferredTo,ECOUTDescriptorReferredTo,ECOUTQualifierReferredTo,EntryCombinationECIN,EntryCombinationECOUT,EntryCombinationListEntryCombination,PharmacologicalActionDescriptorReferredTo,PharmacologicalActionListPharmacologicalAction,PreviousIndexingListPreviousIndexing,QualifierNameString,QualifierReferredToQualifierName,QualifierReferredToQualifierUI,RelatedRegistryNumberListRelatedRegistryNumber,RootDescriptorClass,RootDescriptorRecord,SeeRelatedDescriptorDescriptorReferredTo,SeeRelatedListSeeRelatedDescriptor,TermDateCreated,TermEntryVersion,TermListConceptPreferredTermYN,TermListIsPermutedTermYN,TermListLexicalTag,TermListRecordPreferredTermYN,TermListTerm,TermSortVersion,TermString,TermTermUI,TermThesaurusIDlist,ThesaurusIDlistThesaurusID,TreeNumberListTreeNumber
0,TO,\n,\n,"4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrro...",\n,\n,M0353609,\n,N,A-23187,0,\n,M0000001,M0353609,\n,NRW,"An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports C...",\n,8,3,1990,1,1,1984,27,5,2016,Calcium Ionophores,\n,,\n,,\n,\n,\n,\n,D000001,,91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n,,use CALCIMYCIN to search A 23187 1975-90\n,\n,\n,91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n,,\n,\n,D061207,,,,,,,,\n,\n,Carboxylic Acids (1973-1974),toxicity,\n,Q000633,52665-69-7 (Calcimycin),1,\n,,,\n,,N,Y,NON,N,\n,,"A23187, Antibiotic",T000003,\n,NLM (1991),D03.633.100.221.173


In [177]:
desc_2018_df.drop([
       'AllowableQualifierQualifierReferredTo',
       'AllowableQualifiersListAllowableQualifier',
       'ConceptConceptName', 'ConceptConceptRelationList',
       'ConceptListConcept',
       'ConceptRelatedRegistryNumberList', 'ConceptRelationListConceptRelation',
       'DescriptorRecordAllowableQualifiersList',
       'DescriptorRecordConceptList', 
       'DescriptorRecordDateCreated', 'DescriptorRecordDateEstablished',
       'DescriptorRecordDateRevised', 'DescriptorRecordDescriptorName',
       'DescriptorRecordPharmacologicalActionList',
       'DescriptorRecordPreviousIndexingList',
       'DescriptorRecordTreeNumberList', 'DescriptorReferredToDescriptorName',
       'PharmacologicalActionDescriptorReferredTo',
       'PharmacologicalActionListPharmacologicalAction',
       'QualifierReferredToQualifierName',
       'RootDescriptorRecord',
       'TermDateCreated',
       'TermListTerm',
       'TermThesaurusIDlist','ECINDescriptorReferredTo',
       'ECINQualifierReferredTo',
       'ECOUTDescriptorReferredTo',
       'ECOUTQualifierReferredTo',
       'EntryCombinationECIN',
       'EntryCombinationECOUT'],
        axis=1, inplace=True)

In [178]:
desc_2018_df.head(1)

Unnamed: 0,AllowableQualifierAbbreviation,ConceptCASN1Name,ConceptConceptUI,ConceptListPreferredConceptYN,ConceptNameString,ConceptRegistryNumber,ConceptRelationConcept1UI,ConceptRelationConcept2UI,ConceptRelationListRelationName,ConceptScopeNote,ConceptTermList,DateCreatedDay,DateCreatedMonth,DateCreatedYear,DateEstablishedDay,DateEstablishedMonth,DateEstablishedYear,DateRevisedDay,DateRevisedMonth,DateRevisedYear,DescriptorNameString,DescriptorRecordAnnotation,DescriptorRecordConsiderAlso,DescriptorRecordDescriptorUI,DescriptorRecordEntryCombinationList,DescriptorRecordHistoryNote,DescriptorRecordNLMClassificationNumber,DescriptorRecordOnlineNote,DescriptorRecordPublicMeSHNote,DescriptorRecordSeeRelatedList,DescriptorReferredToDescriptorUI,EntryCombinationListEntryCombination,PreviousIndexingListPreviousIndexing,QualifierNameString,QualifierReferredToQualifierUI,RelatedRegistryNumberListRelatedRegistryNumber,RootDescriptorClass,SeeRelatedDescriptorDescriptorReferredTo,SeeRelatedListSeeRelatedDescriptor,TermEntryVersion,TermListConceptPreferredTermYN,TermListIsPermutedTermYN,TermListLexicalTag,TermListRecordPreferredTermYN,TermSortVersion,TermString,TermTermUI,ThesaurusIDlistThesaurusID,TreeNumberListTreeNumber
0,TO,"4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrro...",M0353609,N,A-23187,0,M0000001,M0353609,NRW,"An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports C...",\n,8,3,1990,1,1,1984,27,5,2016,Calcium Ionophores,,,D000001,,91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n,,use CALCIMYCIN to search A 23187 1975-90\n,91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n,,D061207,,Carboxylic Acids (1973-1974),toxicity,Q000633,52665-69-7 (Calcimycin),1,,,,N,Y,NON,N,,"A23187, Antibiotic",T000003,NLM (1991),D03.633.100.221.173


In [179]:
desc_2018_df.rename(columns={'AllowableQualifierAbbreviation': 'QualifierAbbreviation',
                            'ConceptConceptUI': 'ConceptUI',
                            'ConceptListPreferredConceptYN': 'PreferredConceptYN',
                            'ConceptRelationConcept1UI': 'Concept1UI',
                            'ConceptRelationConcept1UI': 'Concept2UI',
                            'ConceptRelationListRelationName' : 'ConceptRelationName',
                            'PreviousIndexingListPreviousIndexing': 'PreviousIndexing',
                            'EntryCombinationListEntryCombination': 'EntryCombination',
                            'RelatedRegistryNumberListRelatedRegistryNumber': 'RelatedRegistryNumber',
                            'SeeRelatedDescriptorDescriptorReferredTo': 'DescriptorReferredTo',
                            'SeeRelatedListSeeRelatedDescriptor': 'SeeRelatedDescriptor',
                            'TermListConceptPreferredTermYN': 'PreferredTermYN',
                            'TermListIsPermutedTermYN': 'IsPermutedTermYN',
                            'ThesaurusIDlistThesaurusID': 'ThesaurusID',
                            'TreeNumberListTreeNumber': 'TreeNumber'}, inplace=True)

In [180]:
# desc_2018_df['TreeNumber'][pd.isnull(desc_2018_df['TreeNumber'])] = ['U01', 'U02']
desc_2018_df = desc_2018_df[~pd.isnull(desc_2018_df['TreeNumber'])]

MeSH codes resemble the format "A01.343.124.243" with up to 12 levels, and where the first letter denotes the coarsest category. We want to know the position in the hierarchy for each word, so we count the number of splits in the code for each term.

In [25]:
code_splits = []

for c in desc_2018_df['TreeNumber'].str.split('.'):
    code_splits.append(c)

In [26]:
# mesh_tree_codes = ['.'.join(c) for c in code_splits]
code_lengths = [len(c) for c in code_splits]
max_code_length = max(code_lengths)
# desc_2018_df['MeshTreeCode'] = mesh_tree_codes

In [183]:
print(max_code_length)

13


In [184]:
# reset

# for c in desc_2018_df.columns:
#     if 'tree' in c:
#         desc_2018_df.drop(c, axis=1, inplace=True)

In [185]:
desc_2018_df['tree_number_0'] = [c[0][0] for c in code_splits]

In [186]:
code_splits[200]

['D12', '776', '664', '962', '813', '500', '875']

Let's add columns for each code order, so we can group terms together under common codes later.

In [187]:
for i in range(1, max_code_length):
    tree_lvl_codes = []
    for c in code_splits:
        if len(c) >= i:
            tree_lvl_codes.append('.'.join(c[:i]))
        else:
            tree_lvl_codes.append(np.nan)
    desc_2018_df['tree_number_{}'.format(i)] = tree_lvl_codes

We want to map the codes to actual terms, so starting with the 0th level, we map terms obtained manually from the MeSH website.

In [188]:
# from https://meshb.nlm.nih.gov/treeView
tree_0_map = {
    'A': 'anatomy',
    'B': 'organisms',
    'C': 'diseases',
    'D': 'chemicals and drugs',
    'E': 'analytical, diagnostic, and therapeutic techniques, and equipment',
    'F': 'psychiatry and psychology',
    'G': 'phenomena and processes',
    'H': 'disciplines and occupations',
    'I': 'anthropology, education, sociology, and social phenomena',
    'J': 'technology, industry, and agriculture',
    'K': 'humanities',
    'L': 'information science',
    'M': 'named groups',
    'N': 'health care',
    'V': 'publication characteristics',
    'Z': 'geographicals'
}

In [189]:
desc_2018_df['tree_string_0'] = desc_2018_df['tree_number_0'].map(tree_0_map)

Some of the original strings are reversed using commas. To help matching in the documents we should put them in correct order.

In [12]:
# desc_2018_df.to_csv(proc_data + '/mesh_codes_cleaned_{}.csv'.format(today_str), index=False)
desc_2018_df = pd.read_csv(proc_data + '/mesh_codes_cleaned_{}.csv'.format('5_3_2018')).drop('Unnamed: 0', axis=1)

  interactivity=interactivity, compiler=compiler, result=result)


In [13]:
def process_string(string):
    string = string.split(', ')
    string = ' '.join(string[::-1])
    return string.lower()

In [14]:
for c in desc_2018_df.columns:
    if 'String' in c:
        print(c)

ConceptNameString
DescriptorNameString
QualifierNameString
TermString


In [15]:
desc_2018_df['ConceptNameString'][:10]

0                  A-23187
1                    Difos
2                Abattoirs
3        Acronyms as Topic
4                  Abdomen
5           Abdomen, Acute
6       Abdominal Injuries
7      Abdominal Neoplasms
8    Transversus Abdominis
9           Abducens Nerve
Name: ConceptNameString, dtype: object

In [16]:
desc_2018_df['DescriptorNameString'][:10]

0        Calcium Ionophores
1              Insecticides
2                 Abattoirs
3    Abbreviations as Topic
4        Abdominal Injuries
5            Abdomen, Acute
6        Abdominal Injuries
7       Abdominal Neoplasms
8            Abdominal Wall
9          Abducens Nucleus
Name: DescriptorNameString, dtype: object

In [17]:
desc_2018_df['QualifierNameString'][:10]

0          toxicity
1          toxicity
2            ethics
3               NaN
4          injuries
5           nursing
6         pathology
7    ultrastructure
8    ultrastructure
9          injuries
Name: QualifierNameString, dtype: object

In [18]:
desc_2018_df['TermString'][:10]

0       A23187, Antibiotic
1                    Difos
2           Slaughterhouse
3        Acronyms as Topic
4                 Abdomens
5           Acute Abdomens
6        Injury, Abdominal
7     Neoplasms, Abdominal
8    Transverse Abdominals
9       Nerve VIs, Cranial
Name: TermString, dtype: object

In [19]:
desc_2018_df['ConceptStringProcessed'] = desc_2018_df['ConceptNameString'].apply(lambda x: process_string(x))
desc_2018_df['DescriptorStringProcessed'] = desc_2018_df['DescriptorNameString'].apply(lambda x: process_string(x))
# desc_2018_df['QualifierStringProcessed'] = desc_2018_df['QualifierNameString'].apply(lambda x: process_string(x))
desc_2018_df['TermStringProcessed'] = desc_2018_df['TermString'].apply(lambda x: process_string(x))

For each level, take the tree codes and the processed strings, but only for the ones where the next level up is NaN. This means that only ones which finish at this level of the tree are taken. Set the index of the dataframe to the tree codes and convert to a dict that maps codes to strings. Map that dict on to the codes for the next level up.

In [22]:
def expand_string_tree(df, string_column, max_code_length=13):
    for i in range(1, max_code_length - 1):
        tree_name_map = desc_2018_df[['TreeNumber', string_column]][pd.isnull(desc_2018_df['tree_number_{}'.format(i + 1)])].set_index('TreeNumber').to_dict()
        tree_name_map = tree_name_map[string_column]
        tree_name_map.pop(np.nan, None)
        desc_2018_df['tree_{}_{}'.format(string_column, i)] = desc_2018_df['tree_number_{}'.format(i)].map(tree_name_map, na_action='ignore')
    desc_2018_df['tree_{}_{}'.format(string_column, max_code_length - 1)] = np.nan
    return df

In [23]:
for c in ['ConceptStringProcessed', 'DescriptorStringProcessed', 'TermStringProcessed']:
    desc_2018_df = expand_string_tree(desc_2018_df, c)

In [168]:
# desc_2018_df.to_csv(proc_data + '/mesh_codes_cleaned_{}.csv'.format(today_str), index=False)

After this there are some broken codes, due to duplicate entries in the tree, but these are relatively few in number.

In [27]:
desc_2018_df['tree_order'] = code_lengths

Finally export as a json.

In [28]:
reoriented = desc_2018_df.set_index('DescriptorRecordDescriptorUI')

In [29]:
concept_string_dict = reoriented.to_dict(orient='index')

In [30]:
reoriented.to_json(proc_data + '/mesh_codes_processed_DUI_{}.json'.format(today_str), orient='index')

Need to do a second iteration of this where the tree is not built on one of the terms, but rather the tree numbers.

Possible structure that we might want to obtain later:

```
{'A': {'level': 0,
       'term': 'humans',
       'children': {'A01': {...
                           }
                    ...
                   }
       ... 
      }
 ...
}
                   
```

In [6]:
reoriented = desc_2018_df.set_index('TreeNumber')

In [7]:
concept_string_dict = reoriented.to_dict(orient='index')

In [8]:
reoriented.to_json(proc_data + '/mesh_codes_processed_tree_number_{}.json'.format(today_str), orient='index')

In [3]:
desc_2018_df = pd.read_json('../data/processed/mesh_codes_processed_5_4_2018.json')

In [17]:
desc_2018_df.set_index('TreeNumber').to_json('../data/processed/mesh_codes_processed_5_8_2018.json', orient='index')

In [29]:
desc_2018_df_2[desc_2018_df_2['ConceptNameString'].str.contains('Volition')]

Unnamed: 0,Concept2UI,ConceptCASN1Name,ConceptNameString,ConceptRegistryNumber,ConceptRelationConcept2UI,ConceptRelationName,ConceptScopeNote,ConceptTermList,ConceptUI,DateCreatedDay,DateCreatedMonth,DateCreatedYear,DateEstablishedDay,DateEstablishedMonth,DateEstablishedYear,DateRevisedDay,DateRevisedMonth,DateRevisedYear,DescriptorNameString,DescriptorRecordAnnotation,DescriptorRecordConsiderAlso,DescriptorRecordDescriptorUI,DescriptorRecordEntryCombinationList,DescriptorRecordHistoryNote,DescriptorRecordNLMClassificationNumber,DescriptorRecordOnlineNote,DescriptorRecordPublicMeSHNote,DescriptorRecordSeeRelatedList,DescriptorReferredTo,DescriptorReferredToDescriptorUI,DescriptorStringProcessed,EntryCombination,IsPermutedTermYN,PreferredConceptYN,PreferredTermYN,PreviousIndexing,QualifierAbbreviation,QualifierNameString,QualifierReferredToQualifierUI,RelatedRegistryNumber,RootDescriptorClass,SeeRelatedDescriptor,TermEntryVersion,TermListLexicalTag,TermListRecordPreferredTermYN,TermSortVersion,TermString,TermStringProcessed,TermTermUI,ThesaurusID,TreeNumber,tree_ConceptStringProcessed_1,tree_ConceptStringProcessed_10,tree_ConceptStringProcessed_11,tree_ConceptStringProcessed_12,tree_ConceptStringProcessed_2,tree_ConceptStringProcessed_3,tree_ConceptStringProcessed_4,tree_ConceptStringProcessed_5,tree_ConceptStringProcessed_6,tree_ConceptStringProcessed_7,tree_ConceptStringProcessed_8,tree_ConceptStringProcessed_9,tree_DescriptorStringProcessed_1,tree_DescriptorStringProcessed_10,tree_DescriptorStringProcessed_11,tree_DescriptorStringProcessed_12,tree_DescriptorStringProcessed_2,tree_DescriptorStringProcessed_3,tree_DescriptorStringProcessed_4,tree_DescriptorStringProcessed_5,tree_DescriptorStringProcessed_6,tree_DescriptorStringProcessed_7,tree_DescriptorStringProcessed_8,tree_DescriptorStringProcessed_9,tree_TermStringProcessed_1,tree_TermStringProcessed_10,tree_TermStringProcessed_11,tree_TermStringProcessed_12,tree_TermStringProcessed_2,tree_TermStringProcessed_3,tree_TermStringProcessed_4,tree_TermStringProcessed_5,tree_TermStringProcessed_6,tree_TermStringProcessed_7,tree_TermStringProcessed_8,tree_TermStringProcessed_9,tree_number_0,tree_number_1,tree_number_10,tree_number_11,tree_number_12,tree_number_2,tree_number_3,tree_number_4,tree_number_5,tree_number_6,tree_number_7,tree_number_8,tree_number_9,tree_order,tree_string_0
volition,,,Volition,,,,Voluntary activity without external compulsion.\n,\n,M0022838,30,3,1974,1,1,1966,23,5,1995,Volition,,,D014836,,,,,,,,,volition,,N,Y,N,,ES,ethics,Q000941,,1,,,NON,N,,Will,will,T043363,UNK (19XX),F02.463.902,psychologic processes,,,,mental processes,volition,,,,,,,psychological phenomena,,,,mental processes,volition,,,,,,,psychological processe,,,,human information processing,will,,,,,,,F,F02,,,,F02.463,F02.463.902,,,,,,,3,psychiatry and psychology


In [3]:
df = pd.read_json('../data/processed/mesh_codes_processed_5_4_2018.json', orient='index')

In [8]:
df = df.reset_index()

In [9]:
df.rename(columns={'index': 'DescriptorRecordDescriptorUI'}, inplace=True)

In [11]:
df.columns

Index(['DescriptorRecordDescriptorUI', 'Concept2UI', 'ConceptCASN1Name',
       'ConceptNameString', 'ConceptRegistryNumber',
       'ConceptRelationConcept2UI', 'ConceptRelationName', 'ConceptScopeNote',
       'ConceptTermList', 'ConceptUI',
       ...
       'tree_number_2', 'tree_number_3', 'tree_number_4', 'tree_number_5',
       'tree_number_6', 'tree_number_7', 'tree_number_8', 'tree_number_9',
       'tree_order', 'tree_string_0'],
      dtype='object', length=103)

In [4]:
df.head()

Unnamed: 0,Concept2UI,ConceptCASN1Name,ConceptNameString,ConceptRegistryNumber,ConceptRelationConcept2UI,ConceptRelationName,ConceptScopeNote,ConceptTermList,ConceptUI,DateCreatedDay,DateCreatedMonth,DateCreatedYear,DateEstablishedDay,DateEstablishedMonth,DateEstablishedYear,DateRevisedDay,DateRevisedMonth,DateRevisedYear,DescriptorNameString,DescriptorRecordAnnotation,DescriptorRecordConsiderAlso,DescriptorRecordDescriptorUI,DescriptorRecordEntryCombinationList,DescriptorRecordHistoryNote,DescriptorRecordNLMClassificationNumber,DescriptorRecordOnlineNote,DescriptorRecordPublicMeSHNote,DescriptorRecordSeeRelatedList,DescriptorReferredTo,DescriptorReferredToDescriptorUI,DescriptorStringProcessed,EntryCombination,IsPermutedTermYN,PreferredConceptYN,PreferredTermYN,PreviousIndexing,QualifierAbbreviation,QualifierNameString,QualifierReferredToQualifierUI,RelatedRegistryNumber,RootDescriptorClass,SeeRelatedDescriptor,TermEntryVersion,TermListLexicalTag,TermListRecordPreferredTermYN,TermSortVersion,TermString,TermStringProcessed,TermTermUI,ThesaurusID,TreeNumber,tree_ConceptStringProcessed_1,tree_ConceptStringProcessed_10,tree_ConceptStringProcessed_11,tree_ConceptStringProcessed_12,tree_ConceptStringProcessed_2,tree_ConceptStringProcessed_3,tree_ConceptStringProcessed_4,tree_ConceptStringProcessed_5,tree_ConceptStringProcessed_6,tree_ConceptStringProcessed_7,tree_ConceptStringProcessed_8,tree_ConceptStringProcessed_9,tree_DescriptorStringProcessed_1,tree_DescriptorStringProcessed_10,tree_DescriptorStringProcessed_11,tree_DescriptorStringProcessed_12,tree_DescriptorStringProcessed_2,tree_DescriptorStringProcessed_3,tree_DescriptorStringProcessed_4,tree_DescriptorStringProcessed_5,tree_DescriptorStringProcessed_6,tree_DescriptorStringProcessed_7,tree_DescriptorStringProcessed_8,tree_DescriptorStringProcessed_9,tree_TermStringProcessed_1,tree_TermStringProcessed_10,tree_TermStringProcessed_11,tree_TermStringProcessed_12,tree_TermStringProcessed_2,tree_TermStringProcessed_3,tree_TermStringProcessed_4,tree_TermStringProcessed_5,tree_TermStringProcessed_6,tree_TermStringProcessed_7,tree_TermStringProcessed_8,tree_TermStringProcessed_9,tree_number_0,tree_number_1,tree_number_10,tree_number_11,tree_number_12,tree_number_2,tree_number_3,tree_number_4,tree_number_5,tree_number_6,tree_number_7,tree_number_8,tree_number_9,tree_order,tree_string_0
(+)-isomer bay-k-8644,M0002232,"3-Pyridinecarboxylic acid, 1,4-dihydro-2,6-dimethyl-5-nitro-4-(2-(trifluoromethyl)phenyl)-, meth...","Bay-K-8644, (+)-Isomer",98791-67-4,M0330830,NRW,"A dihydropyridine derivative, which, in contrast to NIFEDIPINE, functions as a calcium channel a...",\n,M0330830,13,9,1999,1,1,1987,27.0,5.0,2016.0,Calcium Channel Agonists,a calcium channel agonist\n,,D001498,,87\n,,,87\n,,,D002120,calcium channel agonists,,N,N,Y,Pyridines (1966-1974),TO,toxicity,Q000633,98791-67-4 ((+)-isomer),1,,,NON,N,PYRIDINECARBOXYLIC ACID 03 1 4 DIHYDRO 2 6 DIMETHYL 5 NITRO 4 2 TRIFLUOROMETHYL PHENYL METHYL ESTER,"Bay-K-8644, (+)-Isomer",(+)-isomer bay-k-8644,T361024,NLM (1999),D03.383.725.547.900,heterocyclic compounds,,,,1-ring heterocyclic compounds,pyridines,nicotinic acids,(+)-isomer bay-k-8644,,,,,heterocyclic compounds,,,,1-ring heterocyclic compounds,piperidines,vitamin b complex,calcium channel agonists,,,,,heterocyclic compounds,,,,1 ring heterocyclic cpds,pyridines,nicotinic acids,(+)-isomer bay-k-8644,,,,,D,D03,,,,D03.383,D03.383.725,D03.383.725.547,D03.383.725.547.900,,,,,5,chemicals and drugs
(+)-isomer methoxyhydroxyphenylglycol,M0013606,"1,2-Ethanediol, 1-(4-hydroxy-3-methoxyphenyl)-","Methoxyhydroxyphenylglycol, (+)-Isomer",87171-17-3,M0330144,REL,"Synthesized from endogenous epinephrine and norepinephrine in vivo. It is found in brain, blood,...",\n,M0330144,13,9,1999,1,1,1991,25.0,7.0,2001.0,Methoxyhydroxyphenylglycol,,,D008734,,91(77); was see under GLYCOLS 1977-90; was METHOXYHYDROXYPHENYL GLYCOL see under GLYCOLS 1975-76...,QV 82,use METHOXYHYDROXYPHENYLGLYCOL to search METHOXYHYDROXYPHENYL GLYCOL 1973-76 (as Prov 1973-74)\n,91; was see under GLYCOLS 1977-90; was METHOXYHYDROXYPHENYL GLYCOL see under GLYCOLS 1975-76\n,,,,methoxyhydroxyphenylglycol,,N,N,Y,Glycols (1966-1974),TO,toxicity,Q000633,87171-18-4 ((-)-isomer),1,,,NON,N,,"Methoxyhydroxyphenylglycol, (+)-Isomer",(+)-isomer methoxyhydroxyphenylglycol,T360338,NLM (1999),D02.033.455.250.610,organic chemicals,,,,alcohols,glycols,ethylene glycols,(+)-isomer methoxyhydroxyphenylglycol,,,,,organic chemicals,,,,alcohols,glycols,ethylene glycols,methoxyhydroxyphenylglycol,,,,,organic chemicals,,,,alcohols,glycols,ethanediols,(+)-isomer methoxyhydroxyphenylglycol,,,,,D,D02,,,,D02.033,D02.033.455,D02.033.455.250,D02.033.455.250.610,,,,,5,chemicals and drugs
(+)-isomer oxyphenonium bromide,M0015691,"Ethanaminium, 2-((cyclohexylhydroxyphenylacetyl)oxy)-N,N-diethyl-N-methyl-","Oxyphenonium Bromide, (+)-Isomer",99102-05-3,M0329910,NRW,A quaternary ammonium anticholinergic agent with peripheral side effects similar to those of ATR...,\n,M0329910,13,9,1999,1,1,1991,8.0,7.0,2013.0,Muscarinic Antagonists,,,D010115,,92(64); was OXYPHENONIUM BROMIDE 1964-91 (see under AMMONIUM COMPOUNDS 1964-90)\n,,use OXYPHENONIUM to search OXYPHENONIUM BROMIDE 1966-91\n,92; was OXYPHENONIUM BROMIDE 1964-91 (see under AMMONIUM COMPOUNDS 1964-90)\n,,,D018727,muscarinic antagonists,,N,N,Y,Oxyphenonium Bromide (1966-1991),TO,toxicity,Q000633,S9421HWB3Z,1,,,NON,N,,"Oxyphenonium Bromide, (+)-Isomer",(+)-isomer oxyphenonium bromide,T360104,NLM (1999),D02.675.276.648,organic chemicals,,,,onium compounds,quaternary ammonium compounds,(+)-isomer oxyphenonium bromide,,,,,,organic chemicals,,,,onium compounds,quinolinium compounds,muscarinic antagonists,,,,,,organic chemicals,,,,onium compounds,quaternary ammonium compounds,(+)-isomer oxyphenonium bromide,,,,,,D,D02,,,,D02.675,D02.675.276,D02.675.276.648,,,,,,4,chemicals and drugs
"(+,-)-isomer 2,5-dimethoxy-4-methylamphetamine",M0006747,"Benzeneethanamine, 2,5-dimethoxy-alpha,4-dimethyl-","2,5-Dimethoxy-4-Methylamphetamine, (+,-)-Isomer",26011-50-7,M0330395,NRW,"A psychedelic phenyl isopropylamine derivative, commonly called DOM, whose mood-altering effects...",\n,M0330395,13,9,1999,1,1,1991,28.0,6.0,2010.0,Serotonin Receptor Agonists,,,D004290,,2001(1973)\n,,,2001; see DOM 1991-2000 & AMPHETAMINES 1973-90\n,,,D017366,serotonin receptor agonists,,N,N,Y,,TO,toxicity,Q000633,"50505-89-0 ((S)-isomer, HCl)",1,,,NON,N,DIMETHOXYMETHYLAMPHETAMINE 02 05 04,"2,5-Dimethoxy-4-Methylamphetamine, (+,-)-Isomer","(+,-)-isomer 2,5-dimethoxy-4-methylamphetamine",T360589,NLM (1999),D02.092.471.683.152.391,organic chemicals,,,,amines,ethylamines,phenethylamines,amphetamines,"(+,-)-isomer 2,5-dimethoxy-4-methylamphetamine",,,,organic chemicals,,,,amines,ethylamines,phenethylamines,appetite depressants,serotonin receptor agonists,,,,organic chemicals,,,,amines,ethylamines,phenylethylamines,amphetamines,"(+,-)-isomer 2,5-dimethoxy-4-methylamphetamine",,,,D,D02,,,,D02.092,D02.092.471,D02.092.471.683,D02.092.471.683.152,D02.092.471.683.152.391,,,,6,chemicals and drugs
(+-)-isomer bupropion,M0025361,"1-Propanone, 1-(3-chlorophenyl)-2-((1,1-dimethylethyl)amino)-","Bupropion, (+-)-Isomer",34911-55-2,M0331329,REL,"A unicyclic, aminoketone antidepressant. The mechanism of its therapeutic actions is not well un...",\n,M0331329,13,9,1999,1,1,1992,1.0,7.0,2016.0,Cytochrome P-450 CYP2D6 Inhibitors,do not confuse with Zyban fungicide\n,,D016642,,92; was BUPROPION (NM) 1978-91\n,,use BUPROPION (NM) to search BUPROPION 1980-91\n,92; BUPROPION was indexed under PROPIOPHENONES 1978-91\n,\n,\n,D065690,cytochrome p-450 cyp2d6 inhibitors,,N,N,Y,Propiophenones (1966-1991),TO,toxicity,Q000633,ZG7E5POY8O,1,\n,,NON,N,,"Bupropion, (+-)-Isomer",(+-)-isomer bupropion,T361523,NLM (1999),D02.522.818.110,organic chemicals,,,,ketones,propiophenones,(+-)-isomer bupropion,,,,,,organic chemicals,,,,ketones,propiophenones,cytochrome p-450 cyp2d6 inhibitors,,,,,,organic chemicals,,,,ketones,propiophenones,(+-)-isomer bupropion,,,,,,D,D02,,,,D02.522,D02.522.818,D02.522.818.110,,,,,,4,chemicals and drugs
