# Lab 07: Information Extraction

Student: John Wu

In [1]:
import pprint, re
from dateutil.parser import parse as dateParser
pprint.sorted = lambda x, key=None: x # disable sorting of results

This project relies heavily on the spaCy library, especially its named-entity recognition (NER) capabilities. The library performs NLP tasks such as tokenization, POS tagging, sentence segmentation, and rule-based matching, and its pre-trained models achieve about 86% accuracy in NER. Its output is combined with other techniques to perform the information extraction.

In [2]:
import spacy
nlp = spacy.load("en_core_web_md") # load spacy English model

Functions for loading input SGML files.

In [3]:
tagCap = re.compile(r'<P ID=(\d+)>\s+(.+?)\s?</P>', re.DOTALL)
def readFiles(filePath):
    with open(filePath, 'r', encoding='utf-8') as fh:
        matches = tagCap.findall(fh.read())
        ids,txts = zip(*matches)
        ids = [int(s) for s in ids]
        return ids, txts
    
trIDs, trTxts = readFiles('data/obits.train.txt') # get training files
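
To see what the `tagCap` pattern captures, here is a minimal sketch that applies the same regular expression to a small in-memory string. The sample text and the names `tag_cap` and `sample` are made up for illustration; the real corpus is not reproduced here.

```python
import re

# Same pattern as tagCap above: capture the numeric ID and the paragraph body.
tag_cap = re.compile(r'<P ID=(\d+)>\s+(.+?)\s?</P>', re.DOTALL)

# Hypothetical two-document sample (not taken from the actual data files).
sample = (
    "<P ID=1>\n"
    "Jane Doe, 90, of Springfield passed away.\n"
    "</P>\n"
    "<P ID=2>\n"
    "John Roe died on March 5, 2019.\n"
    "He was 81.\n"
    "</P>\n"
)

matches = tag_cap.findall(sample)          # list of (id, body) tuples
ids = [int(doc_id) for doc_id, _ in matches]
texts = [body for _, body in matches]
print(ids)
print(texts)
```

Because of `re.DOTALL`, a body may span several lines. Note that the pattern requires at least one whitespace character after the opening tag, and that `readFiles` would fail on the `zip(*matches)` unpacking if a file contained no matching paragraphs.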

An example of the NER capability on the 19th document.

In [4]:
spacy.displacy.render(nlp(trTxts[18]), style='ent')

## Extraction of Required Relations

__Name of the deceased__

Unfortunately, because of peculiarities such as nicknames, the `PERSON` tagging capability of spaCy does not work well enough to simply take the first `PERSON` entity. However, in obituaries the name of the deceased appears in the first sentence the vast majority of the time. Therefore, parsing the first sentence with word patterns recovers the full name in most cases.

All tokens of the name are written consecutively, in the shape "Xx". Therefore, we take the leading run of tokens and stop at the first token for which any of the following is true:

1. Its first letter is not capitalized
1. It is a number
1. It is punctuation, unless it is a comma or a left or right punctuation mark (like quotation marks)

This way, names such as `John "The Johnny" C. Smith, Jr.` would get captured as the full name. There is a final check, that one of the tokens in this string of consecutive tokens has to be tagged as `PERSON` entity to eliminate spurious matches. If none of the tokens are `PERSON`, we simply take the first entity tagged as `PERSON`.

In [5]:
def extractName(doc):
    sent = next(doc.sents) # get first sentence of document
    for n,tk in enumerate(sent): # loop over tokens of first sentence
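        # stop at a number, a lowercase-initial token, or punctuation other
        # than a comma or left/right punctuation (e.g. quotation marks)
        if (tk.is_digit or tk.shape_[0]=='x' or (tk.is_punct and 
                not (tk.text==',' or tk.is_left_punct or tk.is_right_punct))):
            break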
    if doc[n-1].is_punct: # if span ends with punctuation, reduce span by 1
        n -= 1
    nameSpan = doc[:n]
    
    # see if any PERSON entity was matched in span
    if 'PERSON' not in set((tk.ent_type_ for tk in nameSpan)): # no PERSON
        for e in doc.ents: # loop over all extracted entities
            if e.label_ == 'PERSON': # find first PERSON entity
                nameSpan = e # set that entity to the name
                break
    return nameSpan.text, nameSpan.end
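
The stopping rule can be tried without spaCy on a pre-tokenized list. The sketch below is only an approximation: `name_tokens`, `is_punct`, and `ALLOWED_PUNCT` are hypothetical stand-ins for spaCy's token attributes, and the `PERSON` check is left out.

```python
import string

# Stand-in for "comma or left/right punctuation" (an assumption, not spaCy's exact set).
ALLOWED_PUNCT = {',', '"', "'", '(', ')'}

def is_punct(tok):
    """True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in tok)

def name_tokens(tokens):
    """Return the leading run of tokens that could belong to a name."""
    n = len(tokens)                      # default: no stopping token found
    for i, tok in enumerate(tokens):
        stops = (tok.isdigit() or tok[0].islower()
                 or (is_punct(tok) and tok not in ALLOWED_PUNCT))
        if stops:
            n = i
            break
    if n > 0 and is_punct(tokens[n - 1]):   # drop one trailing punctuation mark
        n -= 1
    return tokens[:n]

full = name_tokens(['John', '"', 'The', 'Johnny', '"', 'C.', 'Smith', ',',
                    'Jr.', 'passed', 'away', '.'])
short = name_tokens(['Mary', 'Smith', ',', '85', ',', 'of', 'Laurel'])
print(full)
print(short)
```

In the first case the run stops at the lowercase "passed"; in the second it stops at the number and the trailing comma is trimmed.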

__Sex of the deceased__

A simple way of doing this is to count the number of female pronouns ("her" and "she") vs. male ones ("his" and "he") and return the one with the higher number.

In [6]:
def isMan(tokens):
    tokens = [tk.lower_ for tk in tokens]
    female = sum((tk=='her' or tk=='she' for tk in tokens))
    male = sum((tk=='his' or tk=='he' for tk in tokens))
    return male >= female
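
A plain-string version of the same count makes the tie-breaking behavior easy to see. This is a sketch on whitespace-split words, and `is_man` is a hypothetical stand-in for `isMan` that takes strings instead of spaCy tokens.

```python
def is_man(words):
    """Count gendered pronouns; ties (including no pronouns at all) go to male."""
    words = [w.lower() for w in words]
    female = sum(w in ('her', 'she') for w in words)
    male = sum(w in ('his', 'he') for w in words)
    return male >= female

mixed = is_man("She loved her garden and his dog".split())   # 2 female, 1 male
no_pronouns = is_man("Born in 1930 in Ohio".split())         # 0 and 0: tie
print(mixed, no_pronouns)
```

Only the four pronouns listed are counted, so words like "him" or "hers" do not contribute, and a text with no pronouns defaults to male.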

__Age at death__

For this, we combine a regular expression with spaCy's matching capability. The regex matches a standalone token that is either a two-digit number or a three-digit number whose first digit is 1 (so the "105" in "105 years old" would be matched, but not "410-105-0001"). Since the age most likely appears in the early part of the obituary, we only check the first half of the document.

One complication is that the regular expression also matches parts of dates (e.g. the "25" in "Jan 25, 2019"). Therefore, we check each match against the NER result and accept it only if the token is not tagged as a date, or if it is tagged as a date but the number is greater than 31.

The function returns the maximum of all matches (as dead people tend to be older).

Note that the function only tries to match the age at death linguistically instead of constructing the age of death from other information provided in the document. In the template filling section, a construction of the age will be described.

In [7]:
ageMatch = spacy.matcher.Matcher(nlp.vocab)
ageMatch.add("age", None, [{"TEXT": {"REGEX": r"^(1?\d\d)$"}}])
def getAgeDoc(doc, nameEnd=None):
    matches = ageMatch(doc)
    spanEnd = nameEnd if nameEnd else next(doc.sents).end

    numPos = list()
    for n,(nx,mBeg,mEnd) in enumerate(matches):
        if mBeg > len(doc)//2: # age unlikely to be in latter 1/2 
            break

        num = int(doc[mBeg:mEnd].text) # the int that was matched
        if (doc[mBeg].ent_type_!='DATE' or # not date, or if date, num > 31
                (doc[mBeg].ent_type_=='DATE' and num > 31) ): 
            numPos.append( (num,mBeg) )

    return max(numPos, key=lambda x: x[0]) if numPos else (None,None)
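
The filtering logic can be checked in isolation with the standard `re` module. In this sketch, `candidate_ages` and the parallel `date_flags` list (True where a token would be tagged `DATE`) are hypothetical substitutes for the spaCy matcher and NER output.

```python
import re

AGE_RE = re.compile(r'^(1?\d\d)$')   # same pattern as in ageMatch

def candidate_ages(tokens, date_flags):
    """Return the largest plausible age in the first half of the tokens, or None."""
    half = len(tokens) // 2
    found = []
    for i, (tok, is_date) in enumerate(zip(tokens, date_flags)):
        if i > half:                       # age unlikely in the latter half
            break
        if AGE_RE.match(tok):
            num = int(tok)
            if not is_date or num > 31:    # reject day-of-month numbers in dates
                found.append(num)
    return max(found) if found else None

tokens = ("Jane Doe , 87 , died Jan 25 , 2019 at home in Laurel "
          "with family nearby .").split()
date_flags = [6 <= i <= 9 for i in range(len(tokens))]   # "Jan 25 , 2019"
age = candidate_ages(tokens, date_flags)
print(age)
```

Here "87" is accepted, "25" is rejected because it sits inside a date and is not greater than 31, and "2019" never matches because the pattern allows at most three digits.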

__Location(s) of residency__

This is perhaps the hardest extraction of all because of the number of complications. While spaCy generally does a good job of tagging location-like entities (e.g. `GPE` and `LOC`), a single location is often split into separately tagged entities, or its parts are tagged incorrectly. For example, "Rockville, Maryland" will get tagged as two separate entities, and "Rockville, MD" might be tagged as `LOC` and `ORG`. To address this problem, the function works as follows. For every entity tagged as `GPE` or `LOC`, it concatenates subsequent tokens if the tagged entity is:

1. immediately followed by another location-like entity ('GPE', 'LOC', 'FAC', 'ORG', or 'NORP').
1. followed by a comma that is itself followed by a location-like entity.

This process continues until the next token no longer fits, so something like "New York, NY, U.S.A." is counted as one location. For deduplication, the token indices of all tokens determined to be part of a location are saved, so that any subsequent match must be an occurrence not yet seen.

Another complication is that obituaries often contain information on survivors and on relatives who predeceased the person, sometimes identifying them by the cities they reside in. The function attempts to avoid capturing these by looking for survivor-related keywords and excluding the sentences they appear in. The end of the document is also likely to contain locations where funeral services are being held. A rough way of dealing with this is to limit the matching to the first four-fifths of the document.

In [8]:
survivor = spacy.matcher.Matcher(nlp.vocab)
survivor.add("surv", None, [{"LOWER":{"REGEX": r'(surviv|pre-?deceas).*'}}])
locLikes = {'GPE', 'LOC', 'FAC', 'ORG', 'NORP'}

def findLocations(doc):
    part = doc[:len(doc)//10*8] # exclude the end (where funeral info are)
    locs,locIdxs = set(),set()
    
    for s in [doc[m[1]].sent for m in survivor(doc)]: # loop over sentences
        locIdxs.update( range(s.start,s.end) ) # s.end is already exclusive
    
    for loc in (e for e in part.ents if (e.label_=='GPE' or e.label_=='LOC')):
        if loc.start in locIdxs:
            continue
        end = loc.end
        while end < len(doc) and ( # stay inside the document
                (doc[end].text==',' and end+1 < len(doc) and
                 doc[end+1].ent_type_ in locLikes) or 
                doc[end].ent_type_ in locLikes):
            end += 1
        locs.add(doc[loc.start:end].text)
        locIdxs.update( range(loc.start,end) )
    return locs 
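
The concatenation rule can be illustrated on hand-labeled tokens. The sketch below works token by token on `(text, label)` pairs rather than on spaCy entity spans, so it is a simplification; `merge_locations` and the labels in the example are assumptions chosen to mimic the mis-tagging described above.

```python
LOC_LIKES = {'GPE', 'LOC', 'FAC', 'ORG', 'NORP'}

def merge_locations(tagged):
    """tagged: list of (text, entity_label) pairs, with '' for untagged tokens."""
    locs, seen = [], set()
    for start, (_, label) in enumerate(tagged):
        if label not in ('GPE', 'LOC') or start in seen:
            continue
        end = start + 1
        while end < len(tagged):
            text, lab = tagged[end]
            comma_then_loc = (text == ',' and end + 1 < len(tagged)
                              and tagged[end + 1][1] in LOC_LIKES)
            if lab in LOC_LIKES or comma_then_loc:
                end += 1
            else:
                break
        locs.append(' '.join(t for t, _ in tagged[start:end]))
        seen.update(range(start, end))     # deduplicate later matches
    return locs

tagged = [('lived', ''), ('in', ''), ('Rockville', 'GPE'), (',', ''),
          ('MD', 'ORG'), ('and', ''), ('New', 'GPE'), ('York', 'GPE'),
          (',', ''), ('NY', 'GPE'), (',', ''), ('U.S.A.', 'GPE'), ('.', '')]
merged = merge_locations(tagged)
print(merged)
```

Tokens are joined with single spaces here, so the commas appear as separate words; the notebook function instead slices the original document and keeps its spacing.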

__Spouse(s) of the deceased__

To find the name of a spouse, we use a regular expression to match any token describing spouses or marriage (e.g. husband, partner, married, etc.). The first `PERSON` entity after such a matched token, within the same sentence, is assumed to be the name of the spouse.

In [9]:
spouses = spacy.matcher.Matcher(nlp.vocab)
spouseRe = r'(husband|wife|spouse|partner|married).*'
spouses.add('sp', None, [{"LOWER": {"REGEX":spouseRe}}] )

def findSpouseName(doc):
    matches = spouses(doc) # search for word related to spouses
    if not matches: # if no match, assumes no spouse can be found
        return None
    
    for x,mtBeg,mtEnd in matches:
        span = doc[mtBeg: doc[mtBeg].sent.end]
        for et in span.ents: # loop over all persons in text span
            if et.label_ == 'PERSON': 
                return et.text # return the first person found
    return None # cannot find person, return empty
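
It can help to see which words the keyword pattern accepts. The sketch below applies the same expression with `re.search` to a few lowercase words, on the assumption that spaCy's `REGEX` operator searches anywhere in the token text; the word list is invented for illustration.

```python
import re

SPOUSE_RE = re.compile(r'(husband|wife|spouse|partner|married).*')

words = ['husband', 'wives', 'married', 'marriage', 'partnership', 'housewife']
hits = {w: bool(SPOUSE_RE.search(w)) for w in words}
print(hits)
```

Under that assumption, "partnership" and "housewife" match because the pattern is not anchored, while "wives" and "marriage" do not match because neither contains one of the listed stems.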

## Extraction of Additional Relations

Three additional relations were extracted from the text (if available): date of birth, date of death, and date of a funeral service.

All three were extracted using the same function. We first build a spaCy matcher from either keywords or a regular expression. Given that matcher as a parameter, the function finds the sentence containing the first match and then extends the span by up to two following sentences. Within this span of the document, the function returns the first `DATE` entity that contains at least two consecutive digits.

One peculiarity of the birth date is that the selected span starts at the beginning of the matched sentence (to handle expressions like "On October 1st 1945, John Smith was born"), while for the other two the span starts at the matched keyword.

In [10]:
bdayMatch = spacy.matcher.Matcher(nlp.vocab) # for birthday matches
bdayMatch.add("bday", None, [{"LOWER":"born"}],[{"LOWER":"birth"}])

deathMatch = spacy.matcher.Matcher(nlp.vocab) # for death matches
deathSyns = r'^(die|pass|.?sleep|heaven|succumb|perish).*' # synonyms for death
deathMatch.add("death", None, [{"LOWER":{"REGEX": deathSyns}}])

funeralMatch = spacy.matcher.Matcher(nlp.vocab) # for funeral service matches
funeralMatch.add("celebration", None, [{"LOWER":"life"},{"LOWER":"celebration"}])
funeralMatch.add("serv", None, [{"LOWER":{"REGEX": r'^(service|visitation)s?'}}]) 
funeralMatch.add("memo", None, [{"LOWER":"memorial"}], [{"LOWER":"viewing"}], 
                 [{"LOWER":"funeral"}])

def findDateAfterMatch(doc, matcher, startAtMatch=False):
    matched = matcher(doc)
    if not matched: # if no match, then nothing is after
        return None
    matchedSent = doc[matched[0][1]].sent # assume 1st match is good
    spanBeg = matched[0][1] if startAtMatch else matchedSent.start
    spanEnd = matchedSent.end
    for n in range(2):
        if spanEnd >= len(doc): # if no more sentences left in doc
            break
        spanEnd = doc[spanEnd].sent.end
        
    for et in doc[spanBeg:spanEnd].ents:
        if et.label_ == 'DATE' and re.findall(r'\d\d', et.text):
            return et.text
    return None
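
The digit filter on `DATE` entities can be tried on plain strings. In this sketch, `first_usable_date` is a hypothetical helper and the candidate strings stand in for the text of `DATE` entities that spaCy might return.

```python
import re

def first_usable_date(date_texts):
    """Return the first candidate containing two consecutive digits, else None."""
    for text in date_texts:
        if re.findall(r'\d\d', text):
            return text
    return None

picked = first_usable_date(['Tuesday', 'March 5', 'March 26, 2019'])
nothing = first_usable_date(['Friday', '5/3'])
print(picked, nothing)
```

The filter skips vague entities such as weekday names, but it also skips a date like "March 5" or "5/3" that has no two adjacent digits, which is worth keeping in mind when a year is not stated.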

## Filling Template and Outputting Result

The following function runs all of the functions above and saves the results in a Python dict. It also attempts to construct the age at death if the birth date of the deceased is available along with either a date of death or a date of funeral. With both the date of birth and the date of death, the age at death can be calculated; the date of the funeral is used as a proxy if the date of death is not available. This constructed age at death overwrites the age obtained from the earlier text matching, unless the two differ by at most two years (in which case the explicitly stated age is likely more accurate).

In [11]:
def fillTemplate(doc):
    info = dict()
    info['name'],nameEnd = extractName(doc)
    info['sex'] = 'male' if isMan(doc) else 'female'
    info['age'] = getAgeDoc(doc)[0] # get age using document parsing
    info['locations'] = list(findLocations(doc))
    info['spouse'] = findSpouseName(doc)
    info['birth date'] = findDateAfterMatch(doc, bdayMatch)
    info['death date'] = findDateAfterMatch(doc, deathMatch, startAtMatch=True)
    info['funeral date'] = findDateAfterMatch(doc, funeralMatch, True)

    # try to calculate age at death using birth and death (or funeral) dates
    if info['birth date'] and (info['death date'] or info['funeral date']):
        bd = dateParser(info['birth date']) # parse birth date
        if info['death date']: # if death date is stated, parse death date
            dd = dateParser(info['death date']) 
        else: # otherwise use funeral service date as proxy for death
            dd = dateParser(info['funeral date'])
        elapsedYrs = (dd.year-bd.year) - ((dd.month,dd.day)<(bd.month,bd.day))
        if not info['age'] or abs(info['age'] - elapsedYrs)>2:
            info['age'] = elapsedYrs            
    
    return info
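
The elapsed-years arithmetic and the two-year reconciliation rule can be checked separately from the parsing. This is a sketch using `datetime.date` values directly; `elapsed_years` and `reconcile` are hypothetical helper names, and the dates are made up.

```python
from datetime import date

def elapsed_years(born, died):
    """Whole years between two dates; subtract one if the birthday was not reached."""
    before_birthday = (died.month, died.day) < (born.month, born.day)
    return (died.year - born.year) - before_birthday

def reconcile(stated_age, computed_age):
    """Keep the stated age only if it is within two years of the computed one."""
    if not stated_age or abs(stated_age - computed_age) > 2:
        return computed_age
    return stated_age

computed = elapsed_years(date(1945, 10, 1), date(2019, 3, 26))
print(computed)
print(reconcile(74, computed), reconcile(25, computed), reconcile(None, computed))
```

The tuple comparison `(month, day) < (month, day)` evaluates to a boolean, which Python treats as 0 or 1 in the subtraction. A stated age of 25 would be replaced here, which is how a stray two-digit number picked up by the text matcher gets corrected.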

Load the obituaries, analyze them with spaCy, and attempt to fill the template.

In [12]:
def parseObitsOutputInfo(obitFiles, outInfoFiles):
    docIDs,txts = readFiles(obitFiles)
    
    with open(outInfoFiles, 'w', encoding='utf-8') as outFH:
        for docID,txt in zip(docIDs,txts):
            doc = nlp(txt)
            out = {'ID': docID}
            out.update(fillTemplate(doc))
            pprint.pprint(out, outFH)
            outFH.write('\n')

In [13]:
parseObitsOutputInfo('data/obits.train.txt', 'obits.train.out')

## Evaluation

This section runs the extraction on the test file and evaluates the results. The evaluation is largely straightforward, except for location of residence and funeral date.

For locations of residence, I consider an extracted location to be correct if it is a location explicitly stated by the obituary (i.e. no inferring that a person lived in Baltimore because the person attended JHU), even if the location is non-specific (e.g. "he moved to **Colombia**" or "his time on **earth**"). If a location was merely mentioned but not explicitly as a residence (e.g. "she was a member of Christian Denomination Church in **Laurel, MD**"), it is not counted as a location.

For the date of the funeral service, some obituaries mention multiple events, possibly on different dates, such as a viewing and a burial. I took the generous route: as long as the extracted date is explicitly stated as the date of any sort of memorial event, it is considered correct.

In [14]:
parseObitsOutputInfo('data/obits.test.txt', 'obits.test.out')

### Performance Statistics


**Relation**|**Precision**|**Recall**|**F-score**
-----|-----|-----|-----
Name|10 / 10 (100.00%) |10 / 10 (100.00%) |1.000
Sex|10 / 10 (100.00%) |10 / 10 (100.00%) |1.000
Age|10 / 10 (100.00%) |10 / 10 (100.00%) |1.000
Location|26 / 44 (59.09%) |26 / 27 (96.30%) |0.732
Spouse|7 / 9 (77.78%) |7 / 8 (87.50%) |0.824
Birth Date|10 / 10 (100.00%) |10 / 10 (100.00%) |1.000
Death Date|10 / 10 (100.00%) |10 / 10 (100.00%) |1.000
Funeral Date|7 / 7 (100.00%) |7 / 10 (70.00%) |0.824

___

The underlying numbers for the performance stats are as follows:

**Relation**|**Performance**|**D100**|**D101**|**D102**|**D103**|**D104**|**D105**|**D106**|**D107**|**D108**|**D109**|**Row Sum**
-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----
Name|Extracted|1|1|1|1|1|1|1|1|1|1|10
Name|Correct|1|1|1|1|1|1|1|1|1|1|10
Name|Total|1|1|1|1|1|1|1|1|1|1|10
Sex|Extracted|1|1|1|1|1|1|1|1|1|1|10
Sex|Correct|1|1|1|1|1|1|1|1|1|1|10
Sex|Total|1|1|1|1|1|1|1|1|1|1|10
Age|Extracted|1|1|1|1|1|1|1|1|1|1|10
Age|Correct|1|1|1|1|1|1|1|1|1|1|10
Age|Total|1|1|1|1|1|1|1|1|1|1|10
Location|Extracted|5|3|4|2|4|2|6|5|4|9|44
Location|Correct|4|3|3|2|2|2|1|4|2|3|26
Location|Total|5|3|3|2|2|2|1|4|2|3|27
Spouse|Extracted|1|1|1|0|1|1|1|1|1|1|9
Spouse|Correct|1|1|1|0|0|1|0|1|1|1|7
Spouse|Total|1|1|1|0|1|1|0|1|1|1|8
Birth Date|Extracted|1|1|1|1|1|1|1|1|1|1|10
Birth Date|Correct|1|1|1|1|1|1|1|1|1|1|10
Birth Date|Total|1|1|1|1|1|1|1|1|1|1|10
Death Date|Extracted|1|1|1|1|1|1|1|1|1|1|10
Death Date|Correct|1|1|1|1|1|1|1|1|1|1|10
Death Date|Total|1|1|1|1|1|1|1|1|1|1|10
Funeral Date|Extracted|1|1|1|1|1|1|0|1|0|0|7
Funeral Date|Correct|1|1|1|1|1|1|0|1|0|0|7
Funeral Date|Total|1|1|1|1|1|1|1|1|1|1|10
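
The summary scores can be reproduced from the row sums in the table above. The sketch below assumes precision is Correct / Extracted, recall is Correct / Total, and the F-score is their harmonic mean; `prf` is a hypothetical helper name.

```python
def prf(extracted, correct, total):
    """Return (precision, recall, F1) from raw counts."""
    precision = correct / extracted if extracted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# (Extracted, Correct, Total) row sums for the relations that are not perfect.
row_sums = {'Location': (44, 26, 27),
            'Spouse': (9, 7, 8),
            'Funeral Date': (7, 7, 10)}
scores = {name: prf(*counts) for name, counts in row_sums.items()}
for name, (p, r, f) in scores.items():
    print(f"{name}: P={p:.2%} R={r:.2%} F={f:.3f}")
```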

### Analysis

In general, the algorithm performed well, largely because obituaries follow a common format. The name was easy to locate because it appears in the first sentence or two; once the linguistic pattern was worked out, extraction was easy. The sex was also very easy, as obituaries are written in the third person with gendered pronouns. Age at death was not difficult either, because it is either explicitly stated or can be constructed from the dates of birth and death.

The dates were also not difficult to extract. Birth and death dates were easily captured by looking for keywords like "passed", "died", and "born". The funeral date performed less than ideally, despite the generous scoring, because there are many ways of describing a memorial service. Even though the matcher has a long list of words describing funeral services, the test set contained phrasings that were not seen in the training set, such as "Mass of Christian Burial will be held on Friday, April 5, 2019" ("burial" was not a search keyword) and "Friends will be received on Tuesday, March 26, 2019" (no funeral-related keywords at all). Given the scoring standard, this could possibly be addressed by taking the last date in the obituary.

There were two errors made with spouses. One obituary mentioned the husbands and wives of the survivors, but no spouse of the deceased person. There is no easy way to correct for this error. The other matched a wrong name because the sentence was: "husband, 'Honest John', favorite son-in-law, Roscoe Keene". However, the husband is mentioned elsewhere following the word "marriage" (which was not searched for). This could have been addressed by adding more keywords for matrimony.

For location of residence, the recall was high but the precision was very low, because there were many false positives. It is relatively easy to do NER for a location, but hard to distinguish whether such a location is a location of residence. For example, a location may be mentioned only as a place of work, not as a place where the person lived. A big source of errors was locations of survivors being extracted. The function does attempt to address this by excluding locations in the same sentence as a survivor keyword. However, the description of survivors sometimes extends past a point where the sentence segmenter splits (such as a semicolon). One way to improve precision would be to also search for family keywords like "child", "son", etc., so that the locations of survivors are not extracted at all.