# Part II: Extracting Variables from Local Raw Files

First run [Part I: get_monthly_death_list_1.ipynb](http://localhost:8888/notebooks/pt1_scrape_summary_pages.ipynb).

This notebook converts and consolidates the local JSON files from [Part I](http://localhost:8888/notebooks/pt1_scrape_summary_pages.ipynb) into one complete CSV file and extracts the relevant variables and columns from each entry.

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import re
import json
import csv

from os import listdir
from os.path import isfile, join
import sys
#reload(sys)
#sys.setdefaultencoding('utf-8')

from bs4 import BeautifulSoup as bs

### Helper functions

Cleaning should remove URL and ref tags instead of matching descriptions and URLs.

TODO: Remove URLs more robustly. See [Robust removal of urls on Stack Overflow](https://stackoverflow.com/questions/6883049/regex-to-find-urls-in-string-in-python) and [here as well](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python).
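
A minimal sketch of what such a removal step could look like, assuming three kinds of residue: `<ref>` tags, bracketed external links, and bare URLs. The function name `strip_refs_and_urls` and all three patterns are illustrative assumptions, not code from this notebook.

```python
import re

# Assumed patterns (hypothetical, not part of the notebook):
# 1. self-closing <ref .../> tags and paired <ref ...>...</ref> tags
REF_RE = re.compile(r'<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>', re.DOTALL | re.IGNORECASE)
# 2. bracketed external links such as [http://example.org Some title]
BRACKETED_URL_RE = re.compile(r'\[https?://[^\]]*\]')
# 3. bare URLs, which run until the next whitespace character
BARE_URL_RE = re.compile(r'https?://\S+')

def strip_refs_and_urls(text):
    """Delete ref tags and URLs from text, then trim surrounding whitespace."""
    for pattern in (REF_RE, BRACKETED_URL_RE, BARE_URL_RE):
        text = pattern.sub('', text)
    return text.strip()
```

Because this deletes the unwanted spans rather than matching the text that precedes them, it does not depend on where in the string the URL appears.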

#### Break apart description and cause of death

Cases (where clauses are comma-separated components of each string):
- If one clause, use that clause as the description
- If multiple clauses, use all clauses combined as the description and the last clause as cause of death.
  - This keeps descriptions from omitting useful comma-separated information such as multiple titles. For example, "John Doe, 80, President of the United States, Governor of Texas." would list "Governor of Texas" as the cause of death, but the description would still read "President of the United States, Governor of Texas."

*Note:* This section used to split entries with more than two clauses into:
- Combined clause from first to second to last as the description
- Last clause as cause of death

However, this would probably still misclassify part of the description as the cause of death (see above).
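
The rule as listed above can be sketched as follows. The helper name `split_description_and_cause` is hypothetical. Note that this sketch keeps the full text as the description, whereas `get_description_and_death` in the next cell returns only the text before the last comma as the description.

```python
def split_description_and_cause(text):
    """Return [description, cause_of_death] following the listed cases.

    One clause: the clause is the description and the cause is empty.
    Several clauses: the description is the whole text and the cause
    is the last comma-separated clause.
    """
    clauses = [clause.strip() for clause in text.split(',')]
    description = text.strip()
    cause = clauses[-1] if len(clauses) > 1 else ''
    return [description, cause]
```

For the example above, `split_description_and_cause('President of the United States, Governor of Texas.')` returns `['President of the United States, Governor of Texas.', 'Governor of Texas.']`.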

In [3]:
desc_death_re = re.compile('(.*?),? ?((?:.+)*?). (?:.*?)?', re.DOTALL|re.MULTILINE)
death_clean_no_url_re = re.compile('\s?(.[^<]+)\.? ?(?=<|\[?http)(?:.*)?$')
# death_clean_no_url_re = re.compile('\s?\w(.[^<])+[.]?(?:<|\[?http)(?:.*)?$')

"""
Input: single text string to be processed

Output: list of two string elements
  - first string is description of person
  - second string is cause of death 
  (last clause of input when the field contains at least one comma)
  
"""
# bad design below: just remove urls and refs instead of extracting
def get_description_and_death(text):
    text_no_url = text
    if ('http' in text_no_url) or ('<ref' in text_no_url):
        # bad design here
        try:
            text_no_url = death_clean_no_url_re.match(text).groups()[0]
        except AttributeError:
            # no match: show the offending text and pause so it can be inspected
            print text
            raw_input("Press enter to continue")
    text_parts = text_no_url.replace('=','').split(',')
    num_parts = len(text_parts)
    if num_parts == 0:
        return ['', '']
    elif num_parts == 1:
        return text_parts + ['']
    else:
        return ([",".join(text_parts[:-1])] + [text_parts[-1]])

In [4]:
"""
Runs get_description_and_death() on the last element of a list

Input: list of length n
Output: list of length (n+1) with last element broken into description and death
"""
def add_description_and_death(entry_list):
    return entry_list[:-1] + get_description_and_death(entry_list[-1])

Testing get description text:

In [5]:
assert(get_description_and_death(
        'English singer-songwriter ("[[Won\'t Somebody Dance with Me]]"), brain haemorrhage.<ref>[http://www.bbc.co.uk/news/entertainment-arts-29457228 Singer Lynsey De Paul dies aged 64]</ref>')
       == (['English singer-songwriter ("[[Won\'t Somebody Dance with Me]]")', ' brain haemorrhage.'])
)

#### Get month, year, name, age rows

Month and year come from the file name so are guaranteed. Name and age are virtually guaranteed from each listing as well.
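
As a small illustration of this parsing step, the sketch below applies the same two patterns to one entry. The file name `2014_10.json` and the helper name `parse_entry` are assumptions made for the example. Unlike the notebook's function, this version returns `None` when either pattern fails to match.

```python
import re

# Same patterns as in the cell below, written as raw strings.
mo_yr_key_re = re.compile(r'(\d+)_(\d+)')
name_age_re = re.compile(r'\s?\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

def parse_entry(file_key, entry):
    """Return [year, month, name, age, remaining text], or None if unparseable."""
    key_match = mo_yr_key_re.match(file_key)
    entry_match = name_age_re.match(entry.replace('\n', ''))
    if key_match is None or entry_match is None:
        return None
    return list(key_match.groups()) + list(entry_match.groups())
```

With these assumptions, `parse_entry('2014_10.json', '[[Oluremi Oyo]], 61, Nigerian journalist, cancer.')` returns `['2014', '10', 'Oluremi Oyo', '61', 'Nigerian journalist, cancer.']`. An entry without an age after the name yields `None`.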

In [6]:
mo_yr_key_re = re.compile('(\d+)_(\d+).*?')
name_age_re = re.compile('\s?\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

"""
Inputs: year-month key string (e.g. "2014_10"), text entry string
Outputs: list of length 5: year, month, name, age, and the remaining entry text
"""
def parse_month_year_name_age(my_key, text_entry):
    return (list(re.match(mo_yr_key_re, my_key).groups()) +
            list(re.match(name_age_re, text_entry.replace('\n', '')).groups()))

#### Convert all links in text blocks into normal text

In other words, removing the [ and ] brackets as well as any text after | in links if present.
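
For comparison, the same conversion can be sketched with a single substitution. The pattern and the name `unlink` are assumptions for illustration. Like the functions in the next cell, it keeps the link target (the text before `|`) rather than the label after it.

```python
import re

# Assumed single-pattern alternative: group 1 captures the link target,
# and an optional "|label" part is matched but discarded.
WIKILINK_RE = re.compile(r'\[\[([^\[\]|]*)(?:\|[^\[\]]*)?\]\]')

def unlink(text):
    """Replace each [[target]] or [[target|label]] with its target text."""
    return WIKILINK_RE.sub(r'\1', text)
```

For example, `unlink('asdf asdf [[ab]] [[cd|efg]]')` returns `'asdf asdf ab cd'`.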

In [7]:
link_re = re.compile('\[\[([^\|\]]*)(?=\||\]\])', re.DOTALL)
link_all_re = re.compile('(\[\[(?:[^\[\]])+\]\])')

"""
Used to be messy, not anymore!
"""

"""
Find the first wikitext link and extract its link target (the text before any |)

Input: text block
Output: one-element tuple holding the link target if a link is found, otherwise the input text unchanged
"""
def extract_link_text(link_block):
    link_present = link_re.search(link_block)
    if link_present:
        return link_present.groups()
    return link_block

"""
Helper function for re.sub: returns the link target of a matched wikitext link

Input: re match object
Output: link target text of the matched link
"""
def link_only(matchobj):
    cleaned_text = extract_link_text(matchobj.groups()[0])[0]
    return cleaned_text

"""
Testing function
"""
def link_only_special(text):
    print text.groups()

"""
Replace every wikitext link with its link target (the text before any |)

Input: text block
Output: text block with link brackets and any text after | removed
"""
def remove_link_text(text_block):
    return re.sub(link_all_re, link_only, text_block)

Testing extract_link_text:

In [8]:
assert(extract_link_text('[[1asdf|2]]') == ('1asdf',))
assert(extract_link_text('[[George Savage (politician)|George Savage]]') == ('George Savage (politician)',))
assert(remove_link_text('asdf asdf [[ab]] [[cd|efg]]') == 'asdf asdf ab cd')
assert(remove_link_text('[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]')
       ==
       'George Savage (politician), 72, British politician, Member of the Legislative Assembly (Northern Ireland) for Upper Bann (Assembly constituency)'
       )
assert(link_all_re.search('[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]').groups()
       ==
       ('[[George Savage (politician)|George Savage]]',)
       )
assert(re.sub(link_all_re, link_only, '[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]')
       == 
       'George Savage (politician), 72, British politician, Member of the Legislative Assembly (Northern Ireland) for Upper Bann (Assembly constituency)'
       )

#### Extracting nationality from description

Nationality is roughly matched to the first $n$ words that have capitalized first letters. Sometimes this will capture titles or description text such as "British Prime Minister" or "Australian Olympic." However, it does a pretty good job of making sure countries such as "New Zealand" and "South African" aren't cut off.
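
The behaviour described above can be checked on a few sample descriptions. This sketch reuses the pattern from the next cell (written as a raw string and without the `re.UNICODE` flag). The helper name `leading_capitalized_words` and the sample strings are assumptions for illustration.

```python
import re

# Same pattern as natl_pattern1: a run of capitalized words followed by a space.
natl_pattern = re.compile(r' ?((?:[A-Z][^\s]+ ?)+) ')

def leading_capitalized_words(description):
    """Return the leading run of capitalized words, or None if there is none."""
    match = natl_pattern.match(description)
    return match.group(1) if match else None
```

Here `leading_capitalized_words('New Zealand rugby player')` returns `'New Zealand'`, while `leading_capitalized_words('British Prime Minister and author')` returns `'British Prime Minister'`, which shows how a title gets absorbed into the nationality. A description that starts with a lowercase word returns `None`.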

In [9]:
natl_pattern1 = re.compile(' ?((?:[A-Z][^\s]+ ?)+) ', re.UNICODE)

natl_unmatched_list = []

def get_nationality_text(desc_text):
    natl_match = natl_pattern1.match(desc_text.strip('['))
    if natl_match:
        return natl_match.groups()[0]
    else:
        natl_unmatched_list.append(desc_text)
"""
def get_nationality_text(desc_text):
    try:
        # get rid of url links
        return natl_pattern1.match(desc_text.strip('[')).groups()[0]
    except AttributeError, e:
        print desc_text     
"""

Testing get_nationality_text:

In [10]:
assert(natl_pattern1.match('Native American asdf asdfasdf ').groups() == ('Native American',))

#### Parse out additional characters and urls from name field

In [11]:
# essentially does the same thing as extract_link_text
def get_wiki_url(name_text):
    return name_text.split('|')[0].strip('[').strip(']')

#### Additional functions for cleaning and removing characters

In [12]:
def remove_end_period(text):
    return re.sub('\.$', '', re.sub('\s$','',text))

def remove_beginning_space(text):
    return re.sub('^ +','',text)

Testing cleaning functions:

In [13]:
assert(remove_beginning_space('   asdf') == 'asdf')
assert(remove_end_period('fasdf. ') == 'fasdf')

In [14]:
def text_clean(text):
    if type(text) != str:
        return text
    
    new_text = text
    url_match = re.match(death_clean_no_url_re, text)
    if url_match:
        new_text = url_match.groups()[0]
    return remove_beginning_space(
    remove_end_period(
        remove_link_text(new_text)
        ).replace('[','').replace(']','')
    )

Testing text_clean:

In [15]:
assert(text_clean('Crops and Livestock. Shot by masked gunmen. http://news.bbc.co.uk/2/low/americas/4935288.stm ')
       ==
       'Crops and Livestock. Shot by masked gunmen'
       )

### Re-read and process scraped file data locally

In [16]:
# dictionary to store entries by month
# keys are month-year string pairs and values are lists of entries (people)
month_year_pages = {}

# get JSON file name list
jsonfiles = [f for f in listdir('../out/raw_pages/') if (isfile(join('../out/raw_pages/', f)) and f[-4:] == 'json')]

# read in files from local folder
for jfilename in jsonfiles:
    
    # dictionary month-year key
    my_key = "_".join(re.match(mo_yr_key_re, jfilename).groups())
    
    # read in file (the with block closes it automatically)
    with open('../out/raw_pages/' + jfilename, 'rb') as infile:
        contents = infile.read()
    
    # get relevant JSON field for contents
    entry_raw = json.loads(contents)['query']['pages'].values()[0]['revisions'][0]['*']
    
    # break raw contents into list of entries
    # store entries into dictionary value
    month_year_pages[my_key] = [
        add_description_and_death(
            parse_month_year_name_age(my_key, 
                                      re.sub(r'^https?:\/\/.*[\r\n]*', '', entry, flags=re.MULTILINE)
                                     )
        )
        for entry in entry_raw.encode('utf-8').rstrip().split('*')
        if re.match(name_age_re, entry.replace('\n', ''))]

# count approximate number of entries
print sum(map(len, month_year_pages.values()))

55505


Seeing which months have fewer than 100 entries on their pages:

In [17]:
for mkey in month_year_pages.keys():
    num_post_entries = len(month_year_pages[mkey])
    if num_post_entries < 100:
        print mkey, ":", num_post_entries

2004_4 : 95
2004_6 : 90
2004_2 : 87


Create dataframe:

In [18]:
df_full = pd.DataFrame(columns=['year','month','name','age','desc','cause_of_death'])

for entry in month_year_pages.values():
    df_sub = pd.DataFrame(entry, columns=['year','month','name','age','desc','cause_of_death'])
    df_full = pd.concat([df_full, df_sub], axis=0)

print df_full.shape
df_full.head()

(55505, 6)


| | year | month | name | age | desc | cause_of_death |
|---|---|---|---|---|---|---|
| 0 | 2014 | 10 | Lynsey de Paul | 64 | English singer-songwriter ("[[Won't Somebody D... | brain haemorrhage. |
| 1 | 2014 | 10 | Maurice Hodgson\|Sir Maurice Hodgson | 94 | British business executive. | |
| 2 | 2014 | 10 | Shlomo Lahat | 86 | Israeli general and politician, Mayor of [[Tel... | lung infection. |
| 3 | 2014 | 10 | José Martínez (infielder)\|José Martínez | 72 | Cuban baseball player ([[Pittsburgh Pirates]])... | [[Chicago Cubs]]) and executive ([[Atlanta Br... |
| 4 | 2014 | 10 | Oluremi Oyo | 61 | Nigerian journalist | cancer. |


### Further Processing

#### Reminder: Strip links, quotes, and brackets before extracting Nationality text

In [19]:
df_full['desc'] = df_full.desc.map(text_clean)
df_full['cause_of_death'] = df_full.cause_of_death.map(text_clean)

#### Extract Nationality

Extracting nationality text as well as possible by taking the first consecutive capitalized words in the description. 'Olympic' and similar capitalized words might throw this off.

In [20]:
df_full['nationality'] = df_full.desc.map(get_nationality_text)

Number of entries with unmatched nationalities:

In [21]:
print len(natl_unmatched_list)

416


Extract the name from the wikitext link:

In [22]:
df_full['name'] = df_full.name.map(get_wiki_url)

In [23]:
df_full.head()

| | year | month | name | age | desc | cause_of_death | nationality |
|---|---|---|---|---|---|---|---|
| 0 | 2014 | 10 | Lynsey de Paul | 64 | English singer-songwriter ("Won't Somebody Dan... | brain haemorrhage | English |
| 1 | 2014 | 10 | Maurice Hodgson | 94 | British business executive | | British |
| 2 | 2014 | 10 | Shlomo Lahat | 86 | Israeli general and politician, Mayor of Tel A... | lung infection | Israeli |
| 3 | 2014 | 10 | José Martínez (infielder) | 72 | Cuban baseball player (Pittsburgh Pirates), co... | Chicago Cubs) and executive (Atlanta Braves) | Cuban |
| 4 | 2014 | 10 | Oluremi Oyo | 61 | Nigerian journalist | cancer | Nigerian |


Clean the text again to remove any remaining URLs; a single pass of `text_clean` does not always strip all of them.
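
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: `death_clean_no_url_re` matches greedily, so when a string contains several bare URLs, one pass cuts the text only at the last one. The sketch below copies that pattern and repeats the match until nothing more is removed. The helper name `strip_url_tails`, the `max_passes` guard, and the sample text are hypothetical.

```python
import re

# Copied from death_clean_no_url_re, written as a raw string.
url_tail_re = re.compile(r'\s?(.[^<]+)\.? ?(?=<|\[?http)(?:.*)?$')

def strip_url_tails(text, max_passes=10):
    """Apply the URL-stripping match repeatedly until it no longer matches."""
    for _ in range(max_passes):
        match = url_tail_re.match(text)
        if match is None:
            break
        # keep only the text before the last URL (or before the first "<")
        text = match.group(1)
    return text
```

With `text = 'Shot. http://a.example/x and http://b.example/y'`, a single match leaves `'Shot. http://a.example/x and '`, and `strip_url_tails(text)` returns `'Shot. '`. Looping to a fixed point avoids having to guess how many passes are needed.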

In [24]:
df_full['desc'] = df_full.desc.map(text_clean)
df_full['cause_of_death'] = df_full.cause_of_death.map(text_clean)

#### Quick fix for cause of death ending in parentheses

Workaround: if the cause of death ends in a closing parenthesis, it is most likely the tail of a parenthetical in the description that was split at a comma. In that case, append the cause of death to the description text and clear cause_of_death.
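
The same rule, sketched as a small pure function that is easy to test in isolation. The name `merge_parenthetical_cause` and the sample values are hypothetical.

```python
def merge_parenthetical_cause(desc, cause):
    """Fold a cause ending in ')' back into the description.

    Returns a (description, cause_of_death) pair. The pair is returned
    unchanged when the cause does not end with a closing parenthesis.
    """
    if cause.endswith(')'):
        return desc + ', ' + cause, ''
    return desc, cause
```

For instance, `merge_parenthetical_cause('Cuban baseball player (Pittsburgh Pirates)', 'executive (Atlanta Braves)')` returns `('Cuban baseball player (Pittsburgh Pirates), executive (Atlanta Braves)', '')`, while a pair such as `('Nigerian journalist', 'cancer')` is left as is.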

In [25]:
full_2_list = [df_full.columns.tolist()] + list(df_full.values.tolist())

In [26]:
for row in full_2_list[1:]:
    if len(row[5]) > 0:
        if row[5][-1] == ')':
            row[4] = row[4] + ", " + row[5]
            row[5] = ''

## Write out file

It would be useful to plot missing data by page_size, i.e. fame or "importance."

In [27]:
with open('../out/celeb_deaths_wikipedia_full_1.csv', 'wb') as df_full_2_outfile:
    out_writer = csv.writer(df_full_2_outfile, delimiter=',')
    for row in full_2_list:
        out_writer.writerow(row)