## Read in samples from the TREC CDS data

The TREC CDS topics give us some sample clinical narratives. We want to extract them and use them for simple NLP tasks.

These are pretty simple examples, and we just want to match on demographics, so they're well suited to regular expressions. We'll bring in fuller NLP as we move into our system.

http://www.trec-cds.org/

### Get the samples

In [1]:
import untangle
import urllib2
import pandas as ps

links = [
    'http://www.trec-cds.org/case-based-topics-imageclef-2013.xml',
    'http://www.trec-cds.org/topics2014.xml',
    'http://trec.nist.gov/data/clinical/topics-2015-A.xml'
]
textSnippets = []

for link in links:
    res = urllib2.urlopen(link)
    xmlData = res.read()
    linkData = untangle.parse(xmlData)
    for topic in linkData.topics.topic:
        description = topic.get_elements(name='description')
        text = ''
        if (len(description) == 0):
            summary = topic.get_elements(name='summary')
            if (len(summary) > 0):
                text = summary[0].cdata
        else: 
            text = description[0].cdata
        
        if (len(text) > 0):
            textSnippets.append(text.strip('\t\n\r'))
        
print ('Found %d snippets.' % len(textSnippets))

textDf = ps.DataFrame(textSnippets, columns=['text'])
textDf.tail(10)


Found 95 snippets.


|    | text |
|---:|:-----|
| 85 | A 32-year-old male presents to your office com... |
| 86 | A 65-year-old male with a history of tuberculo... |
| 87 | An 18-year-old male returning from a recent va... |
| 88 | A 31 yo male with no significant past medical ... |
| 89 | A 10-year-old boy comes to the emergency depar... |
| 90 | A 28 yo female G1P0A0 is admitted to the Ob/Gy... |
| 91 | A 15 yo girl accompanied by her mother is refe... |
| 92 | A previously healthy 8-year-old boy presents w... |
| 93 | A 4-year-old girl presents with persistent fev... |
| 94 | A 47 year old male who fell on his outstretche... |
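
The description-or-summary fallback used in the loop above can be tried offline. The following is a minimal sketch that swaps in the standard library's `xml.etree.ElementTree` for `untangle` and runs on a made-up two-topic document. The sample XML and the helper name `extract_snippets` are assumptions for illustration and are not part of the TREC data or the original notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one topic with a description, one with only a summary.
SAMPLE_XML = """<topics>
  <topic number="1">
    <description>
      A 43-year-old man with painless, gross hematuria.
    </description>
  </topic>
  <topic number="2">
    <summary>A 5-year-old boy with abdominal distension.</summary>
  </topic>
</topics>"""


def extract_snippets(xml_text):
    """Return one snippet per topic, preferring <description> over <summary>."""
    snippets = []
    for topic in ET.fromstring(xml_text).findall('topic'):
        # findtext returns None when the child element is missing,
        # so the `or` chain falls through to <summary>, then to ''.
        text = topic.findtext('description') or topic.findtext('summary') or ''
        text = text.strip()
        if text:
            snippets.append(text)
    return snippets


snippets = extract_snippets(SAMPLE_XML)
print('Found %d snippets.' % len(snippets))
```

Working from an inline string keeps the parsing logic testable without a network connection; the real topic files may carry extra attributes or elements that this sketch ignores.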


### Try to parse out demographics

In [5]:
import nltk.data
import re
import datetime

# Pull out the first sentence, since it has the demographics in this data set.
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
textDf['first_sentence'] = textDf['text'].map(lambda t: sentence_detector.tokenize(t)[0])
now = datetime.datetime.now()

# Ages given in months (or not given at all) just go in the 0 bucket.
# Require at least one digit so a bare " year" or "-year" cannot match.
years_re = re.compile(r'(?i)\b([1-9][0-9]?)[- ](?:year|yo)')
def tokenizeAge(t):
    years = years_re.search(t)
    if years is not None:
        return int(years.group(1))
    
    return 0
    
# Word boundaries keep 'man' from matching inside 'woman' or 'management'.
male_re = re.compile(r'(?i)\b(male|man|boy)\b')
female_re = re.compile(r'(?i)\b(female|woman|girl)\b')
def tokenizeGender(t):
    if female_re.search(t) is not None:
        return 'F'
    elif male_re.search(t) is not None:
        return 'M'
    else:
        return 'U'
    
# Age comes from the first sentence; gender is searched across the full text.
textDf['age'] = textDf['first_sentence'].map(tokenizeAge)
textDf['gender'] = textDf['text'].map(tokenizeGender)
textDf['minBirthYear'] = textDf['age'].map(lambda a: now.year - a)
textDf.head(20)

|    | text | first_sentence | age | gender | minBirthYear |
|---:|:-----|:---------------|----:|:-------|-------------:|
| 0 | A 43-year-old man with painless, gross hematur... | A 43-year-old man with painless, gross hematuria. | 43 | M | 1973 |
| 1 | A woman in her mid-30s presented with dyspnea ... | A woman in her mid-30s presented with dyspnea ... | 0 | F | 2016 |
| 2 | A 29-year-old man was brought to the emergency... | A 29-year-old man was brought to the emergency... | 29 | M | 1987 |
| 3 | A 70-year-old man with a history of alcoholism... | A 70-year-old man with a history of alcoholism... | 70 | M | 1946 |
| 4 | A 48-year-old woman with right cheek swelling ... | A 48-year-old woman with right cheek swelling ... | 48 | F | 1968 |
| 5 | A 55-year-old man with progressive behavioral ... | A 55-year-old man with progressive behavioral ... | 55 | M | 1961 |
| 6 | A 76-year-old man with rectal bleeding and wei... | A 76-year-old man with rectal bleeding and wei... | 76 | M | 1940 |
| 7 | A 56-year-old woman with Hepatitis C, now with... | A 56-year-old woman with Hepatitis C, now with... | 56 | F | 1960 |
| 8 | A 5-year-old boy with abdominal distension and... | A 5-year-old boy with abdominal distension and... | 5 | M | 2011 |
| 9 | A 60-year-old woman with abdominal discomfort ... | A 60-year-old woman with abdominal discomfort ... | 60 | F | 1956 |
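
Row 1 shows the limit of matching a number followed by "year" or "yo": "A woman in her mid-30s" carries no explicit age, so it lands in the 0 bucket and gets the current year as its birth year. One possible extension, sketched here under assumptions, is to fall back to decade phrases. The function name `estimate_age`, the `decade_re` pattern, and the offsets chosen for "early", "mid", and "late" are all hypothetical choices and are not part of the original notebook.

```python
import re

# Explicit ages such as "43-year-old", "15 yo", or "47 year old".
# This variant requires at least one digit before "year"/"yo".
years_re = re.compile(r'(?i)\b([1-9][0-9]?)[- ](?:year|yo)')
# Decade phrases such as "mid-30s" or "late 60s" (assumed phrasing).
decade_re = re.compile(r'(?i)\b(?:(early|mid|late)[- ])?([1-9]0)s\b')
# Assumed offsets within the decade; a bare "30s" is treated like "mid".
DECADE_OFFSETS = {'early': 2, 'mid': 5, 'late': 8}


def estimate_age(sentence):
    """Return an explicit age, else a rough decade-based guess, else 0."""
    years = years_re.search(sentence)
    if years is not None:
        return int(years.group(1))
    decade = decade_re.search(sentence)
    if decade is not None:
        qualifier = (decade.group(1) or 'mid').lower()
        return int(decade.group(2)) + DECADE_OFFSETS[qualifier]
    return 0


for s in ['A 43-year-old man with painless, gross hematuria.',
          'A woman in her mid-30s presented with dyspnea',
          'A 15 yo girl accompanied by her mother']:
    print('%d  %s' % (estimate_age(s), s))
```

Any decade-based value is only a guess, so it may be safer to record it in a separate column or flag those rows rather than mix estimates with explicit ages.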


In [7]:
textDf.to_csv("trec_cds_text.csv", encoding='utf-8')