# Labelling Health Grant Descriptions with MeSH Terms

In [203]:
import pandas as pd
import numpy as np
import json
import spacy
import textacy

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher, Matcher
from spacy.tokens import Doc, Span, Token

In [214]:
nlp = spacy.load('en_core_web_sm')

1. Import health terms.
2. Filter terms down to the desired level.
3. Add them as patterns to the matcher, labelled by the terms in the highest category level desired.
4. Add a token attribute to hold the category label, and create a callback function to update this attribute.
5. Find matches in documents.

- Stretch goal: associate terms using part of speech in a network or co-occurrence matrix.

**1. Import Health Terms**

In [19]:
with open('../data/processed/mesh_codes_processed_5_8_2018.json', 'r') as f:
    health_terms = json.load(f)

**2. Filter Terms**

In [106]:
def process_string(string):
    """Reverse comma-separated parts ('b, a' -> 'a b') and lowercase the result."""
    parts = string.split(', ')
    return ' '.join(parts[::-1]).lower()

def filter_terms(terms, order, on='ConceptStringProcessed'):
    """Map each processed term name to its ancestor label at tree level `order`."""
    name_fields = ['TermString', 'ConceptNameString', 'DescriptorNameString']
    filtered = {}
    for tree_number, properties in terms.items():
        # Only keep terms at or below the requested tree level.
        if properties['tree_order'] >= order:
            top_parent = properties['tree_{}_{}'.format(on, order)]
            # The name fields may coincide, so de-duplicate them with a set.
            names = {process_string(properties[field]) for field in name_fields}
            for name in names:
                filtered[name] = top_parent
    return filtered

In [107]:
health_terms_filtered = filter_terms(health_terms, 1)

In [128]:
len(health_terms_filtered)

55183

**Import Project Descriptions**

In [155]:
health_grants = pd.read_csv('../data/processed/health_research_grants_4_26_2018.csv')

n_docs = 200
descriptions = health_grants.sample(n=n_docs)['public_description']

In [260]:
%time descriptions_clean = [textacy.preprocess_text(textacy.preprocess.fix_bad_unicode(d), lowercase=True) for d in descriptions]

CPU times: user 499 ms, sys: 4.11 ms, total: 503 ms
Wall time: 504 ms


**3. Create PhraseMatcher**

In [261]:
tokenizer = nlp.tokenizer

%time phrases = [(tokenizer(k), v) for k, v in health_terms_filtered.items()]

for phrase, _ in phrases:
    for token in phrase:
        _ = tokenizer.vocab[token.text]

CPU times: user 5.03 s, sys: 1.21 s, total: 6.24 s
Wall time: 7.09 s


In [239]:
matcher = PhraseMatcher(tokenizer.vocab, max_length=10)

In [262]:
%%time
for pattern, label in phrases:
    # Skip unlabelled terms and patterns of 10 or more tokens (matcher max_length=10).
    if label is not None and len(pattern) < 10:
        matcher.add(label, None, pattern)

CPU times: user 118 ms, sys: 9.36 ms, total: 127 ms
Wall time: 129 ms


**4. Add Token Attribute and Callback**
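
This step is not implemented in the notebook yet. Below is a minimal sketch of how it could look, not the final implementation. It assumes spaCy 3.x, where `PhraseMatcher.add` takes a list of pattern Docs (the cells above use the older `matcher.add(key, None, doc)` form). The attribute name `mesh_category`, the callback `label_tokens` and the two-entry `toy_terms` dictionary are hypothetical stand-ins for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

# Assumption: spaCy 3.x. A blank pipeline gives a tokenizer without a model download.
nlp = spacy.blank("en")

# Hypothetical stand-in for health_terms_filtered (term -> category label).
toy_terms = {
    "mental health": "behavioral disciplines and activities",
    "pregnancy": "reproductive and urinary physiological phenomena",
}

# Custom token attribute for the category label; force=True allows re-running the cell.
Token.set_extension("mesh_category", default=None, force=True)

def label_tokens(matcher, doc, i, matches):
    """on_match callback: copy the match label onto every token in the span."""
    match_id, start, end = matches[i]
    label = doc.vocab.strings[match_id]
    for token in doc[start:end]:
        token._.mesh_category = label

matcher = PhraseMatcher(nlp.vocab)
for term, category in toy_terms.items():
    matcher.add(category, [nlp.make_doc(term)], on_match=label_tokens)

doc = nlp.make_doc("a study of mental health during pregnancy")
matcher(doc)  # running the matcher triggers the callback for each match

labelled = [(t.text, t._.mesh_category) for t in doc if t._.mesh_category]
print(labelled)
```

With overlapping matches, a later match overwrites the label written by an earlier one, so a real version would need a rule for choosing between categories.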

**5. Find Matches**

In [263]:
%%time
matches = []

for text in descriptions_clean:
    doc = tokenizer(text)
#     for w in doc:
#         _ = doc.vocab[w.text]
    matches.append(matcher(doc))

CPU times: user 1.67 s, sys: 16.6 ms, total: 1.68 s
Wall time: 1.77 s


In [279]:
print("Estimated time to execute matching on all documents:", (0.5 + 7.1 + 0.2 + 1.8) * (len(health_grants) / n_docs) / 3600, 'hours')

Estimated time to execute matching on all documents: 0.5132933333333334 hours


It will take roughly half an hour to apply the current matching scheme to around 40,000 documents.

The results here are promising. I originally expected many documents to have no matches, but it looks as if every document has several, perhaps because language in the health and medical fields is strongly standardised. One issue is false matches: words that appear both in the MeSH terms and in the documents, but not in a medical sense. For example, "will" is labelled as "psychologic processes" wherever it occurs, which is clearly wrong for most uses of the word.
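
One rough way to surface false matches like "will" is to count how many documents each matched string appears in and flag the ones that turn up almost everywhere. The sketch below is only a diagnostic under assumptions: `toy_matches` is hypothetical data holding one list of `(label, matched text)` pairs per document, and the 0.9 threshold is an arbitrary choice.

```python
from collections import Counter

# Hypothetical (label, matched text) pairs, one list per document.
toy_matches = [
    [("psychologic processes", "will"), ("persons", "women")],
    [("psychologic processes", "will"), ("behavior and behavior mechanisms", "mood")],
    [("psychologic processes", "will")],
]

def document_frequency(matches_per_doc):
    """Count the number of documents in which each (label, text) pair is matched."""
    counts = Counter()
    for doc_matches in matches_per_doc:
        counts.update(set(doc_matches))  # set() so a document counts once per pair
    return counts

def suspicious(matches_per_doc, threshold=0.9):
    """Return pairs matched in at least `threshold` of all documents."""
    total = len(matches_per_doc)
    freq = document_frequency(matches_per_doc)
    return sorted(pair for pair, n in freq.items() if n / total >= threshold)

flagged = suspicious(toy_matches)
print(flagged)
```

Anything flagged this way still needs a manual look, since a genuinely medical term could also be very common in a health grant corpus.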

In [270]:
for i, match in enumerate(matches):
    if i in [0, 100]:
        print(i, '='*100)
        doc = tokenizer(descriptions_clean[i])  # rebuild the Doc; the matching loop did not keep it
        for match_id, start, end in match:
            string_id = nlp.vocab.strings[match_id]  # get string representation
            span = doc[start:end]  # the matched span
            print('{:<16}\t{:<48}\t{}'.format(match_id, string_id, span.text))

17520808660558581486	persons                                         	women
4041061045944005301	behavioral disciplines and activities           	mental health
3742528541303202448	reproductive and urinary physiological phenomena	pregnancy
3742528541303202448	reproductive and urinary physiological phenomena	postpartum period
1700477306715462547	behavior and behavior mechanisms                	mood
16986002624275872851	health occupations                              	epidemiology
1700477306715462547	behavior and behavior mechanisms                	risk
4508651938417329706	physical sciences                               	neuroscience
1700477306715462547	behavior and behavior mechanisms                	mood
413156896107774725	severe mental disorders                         	mood disorders
12114241926807388188	psychologic processes                           	will
737360591271645834	humanities                                      	awards
12114241926807388188	psychologic processes             

**Possible Improvements**

1. Enhance the list of terms by splitting them into unigrams and filtering out both the rarest and the most common. Manually check that the remaining terms are appropriate for their category.
2. Use POS tagging to avoid tagging words that are not used in a medical sense, e.g. "will". It seems like all of the MeSH terms are nouns, but I could be wrong...
3. Use word vectors to match synonyms with a very close similarity (maybe it would be better to just enhance the initial list of terms).
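
The first improvement could be sketched as follows, using whitespace tokenisation and document frequency. The `toy_terms` and `toy_docs` values are hypothetical, and the `low` and `high` cut-offs are arbitrary assumptions that would need tuning on the real corpus.

```python
from collections import Counter

# Hypothetical term list (term -> category) and cleaned documents.
toy_terms = {
    "mood disorders": "mental disorders",
    "mood": "behavior and behavior mechanisms",
    "will": "psychologic processes",
}
toy_docs = [
    "this project will study mood disorders",
    "we will recruit women with low mood",
    "funding will support the clinic",
]

def unigram_doc_freq(terms, docs):
    """Fraction of documents containing each unigram drawn from the term list."""
    unigrams = {word for term in terms for word in term.split()}
    counts = Counter()
    for text in docs:
        counts.update(unigrams & set(text.split()))
    return {word: counts[word] / len(docs) for word in unigrams}

def keep_unigrams(freqs, low=0.1, high=0.9):
    """Keep unigrams that are neither too rare nor too common."""
    return sorted(word for word, f in freqs.items() if low <= f <= high)

freqs = unigram_doc_freq(toy_terms, toy_docs)
kept = keep_unigrams(freqs)
print(kept)
```

Here "will" is dropped because it occurs in every toy document, while "mood" and "disorders" survive for the manual check.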

**Think About**
- Best way to use the transformed documents - clustering vs network.
- Whether there's a computationally efficient way to find similar words within a document (probably most efficient just to stick with spaCy's PhraseMatcher).
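
For the network option, one possible starting point is to count how often two category labels are matched in the same document and use those counts as edge weights. This is a sketch under assumptions: `toy_doc_labels` is hypothetical data with one list of matched labels per document, and co-occurrence is counted once per document however many times a label repeats.

```python
from collections import Counter
from itertools import combinations

# Hypothetical matched category labels, one list per document.
toy_doc_labels = [
    ["persons", "mental disorders", "persons"],
    ["mental disorders", "health occupations"],
    ["persons", "mental disorders"],
]

def cooccurrence_counts(doc_labels):
    """Count label pairs that appear together in the same document."""
    counts = Counter()
    for labels in doc_labels:
        # Sort the unique labels so each pair has one canonical ordering.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(toy_doc_labels)
for (a, b), n in counts.most_common():
    print('{:<24}{:<24}{}'.format(a, b, n))
```

The same counts could be loaded into a symmetric matrix for clustering, so this representation does not force a choice between the two approaches.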