# Scraping Wikipedia Celebrity Deaths

Creating a notebook to produce the dataset found at the [Kaggle Celebrity Deaths Page](https://www.kaggle.com/hugodarwood/celebrity-deaths).

This notebook is meant to replace the current dataset, which is out of date, ships without a notebook that can be re-run to refresh it, and has bad parses in some fields.

This notebook isn't perfect either, but it does a better job.

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import re
import json
import csv

from os import listdir
from os.path import isfile, join
import sys
#reload(sys)
#sys.setdefaultencoding('utf-8')

from bs4 import BeautifulSoup as bs

### Helper functions

#### Break apart description and cause of death

Cases (where clauses are the comma-separated parts of each string):
- If there is one clause, use it as the description and leave the cause of death empty.
- If there are multiple clauses, use the last clause as the cause of death and everything before it, rejoined with commas, as the description.
  - This keeps comma-separated information, such as multiple titles, in the description.

An alternative for two-clause entries is to use the combined clauses as the description and the second clause as the cause of death (the commented-out branch in the code below).

However, I realized that either approach would probably still misclassify some part of the description as the cause of death.
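The splitting rule can be sketched in isolation as follows. This is a simplified restatement rather than the notebook's own cell: `split_description_and_cause` is a hypothetical name, the sample strings are adapted from entries that appear elsewhere in this notebook, and the URL and `<ref>` stripping is left out. It is written so that it should run under both Python 2 and Python 3.

```python
def split_description_and_cause(text):
    """Split a comma-separated entry into [description, cause_of_death].

    Hypothetical helper for illustration: the last clause is treated as the
    cause of death and everything before it as the description.
    """
    parts = text.split(',')
    if len(parts) == 1:
        # Only one clause: no cause of death can be separated out.
        return [parts[0], '']
    return [','.join(parts[:-1]), parts[-1]]


# Two clauses: description, then cause of death (note the leading space).
print(split_description_and_cause('Nigerian journalist, cancer.'))

# No cause listed: the last clause of the description is misclassified.
print(split_description_and_cause('British politician, MLA for Upper Bann'))
```

The second call shows the misclassification risk: when an entry lists no cause of death, the tail of the description ends up in the cause-of-death slot.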

In [3]:
# not used in the cells below
desc_death_re = re.compile('(.*?),? ?((?:.+)*?). (?:.*?)?', re.DOTALL|re.MULTILINE)
death_clean_no_url_re = re.compile(r'\s?(.[^<]+)\.? ?(?=<|\[?http)(?:.*)?$')
# death_clean_no_url_re = re.compile('\s?\w(.[^<])+[.]?(?:<|\[?http)(?:.*)?$')

def get_description_and_death(text):
    """
    Input: single text string to be processed

    Output: list of strings of length two
      - first string is description of person
      - second string is cause of death
      (last clause of input when the field contains at least one comma)
    """
    text_no_url = text
    if ('http' in text_no_url) or ('<ref' in text_no_url):
        try:
            # keep only the text that precedes the <ref> tag or URL
            text_no_url = death_clean_no_url_re.match(text).groups()[0]
        except AttributeError:
            # no match: show the offending text and pause for inspection
            print text
            raw_input("Press enter to continue")
    text_parts = text_no_url.replace('=','').split(',')
    num_parts = len(text_parts)
    if num_parts == 0:
        return ['', '']
    elif num_parts == 1:
        return text_parts + ['']
#    elif num_parts == 2:
#        return [",".join(text_parts)] + [text_parts[1]]
    else:
        return ([",".join(text_parts[:-1])] + [text_parts[-1]])
                #[death_clean_no_url_re.match(text_parts[-1]).groups()[0]])

In [4]:
def add_description_and_death(entry_list):
    """
    Runs get_description_and_death() on the last element of a list
    """
    return entry_list[:-1] + get_description_and_death(entry_list[-1])

Testing functions:

In [5]:
get_description_and_death('English singer-songwriter ("[[Won\'t Somebody Dance with Me]]"), brain haemorrhage.<ref>[http://www.bbc.co.uk/news/entertainment-arts-29457228 Singer Lynsey De Paul dies aged 64]</ref>')

['English singer-songwriter ("[[Won\'t Somebody Dance with Me]]")',
 ' brain haemorrhage.']

#### Function for cleaning a particular list entry

As far as I know this function isn't actually called for any cleaning and was written as an afterthought in case it was needed.

#### Get month, year, name, age rows

Month and year come from the file name so are guaranteed. Name and age are virtually guaranteed from each listing as well.

In [6]:
mo_yr_key_re = re.compile(r'(\d+)_(\d+).*?')
name_age_re = re.compile(r'\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

def parse_month_year_name_age(my_key, text_entry):
    """
    Inputs: year-month key string (from the file name), text entry string

    Outputs: list of year, month, name, age, and the remaining entry text
    """
    return (list(re.match(mo_yr_key_re, my_key).groups()) +
            list(re.match(name_age_re, text_entry.replace('\n', '')).groups()))
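As a small worked example, the two patterns can be applied to a single key and entry. The key and the entry text here are hypothetical, modeled on rows that appear in this notebook's output; the patterns are the same as in the cell above.

```python
import re

# Same patterns as in the cell above, written as raw strings.
mo_yr_key_re = re.compile(r'(\d+)_(\d+).*?')
name_age_re = re.compile(r'\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

# Hypothetical inputs: a file-name key and one bullet entry of wikitext.
sample_key = '2014_10'
sample_entry = '[[Oluremi Oyo]], 61, Nigerian journalist, cancer.\n'

key_parts = list(mo_yr_key_re.match(sample_key).groups())
entry_parts = list(name_age_re.match(sample_entry.replace('\n', '')).groups())
print(key_parts + entry_parts)

# An entry that does not start with "[[" is not matched and would be skipped.
print(name_age_re.match('Unlinked Name, 61, Nigerian journalist'))
```

The first `print` should show the five-element list `['2014', '10', 'Oluremi Oyo', '61', 'Nigerian journalist, cancer.']`; the last element is the text that is later split into description and cause of death.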

#### Convert all links in text blocks into normal text

In other words, removing the [ and ] brackets as well as any text after | in links if present.

In [7]:
#link_re = re.compile('\[\[([^\|(?:\]\])]*)(?=\||\]\])', re.DOTALL)
link_re = re.compile(r'\[\[([^\|\]]*)(?=\||\]\])', re.DOTALL)

#link_all_re = re.compile('(\[\[([^(\]\])]*)(\]\]))')
link_all_re = re.compile(r'(\[\[(?:[^\[\]])+\]\])')

# Find links and convert them to plain text: keep the link target (the part
# before any |) and drop the brackets.
# extract_link_text needs to check whether the block contains a link;
# it essentially does the same thing as get_wiki_url.
def extract_link_text(link_block):
    link_present = link_re.search(link_block)
    if link_present:
        return link_present.groups()
    return link_block

def link_only(matchobj):
    cleaned_text = extract_link_text(matchobj.groups()[0])[0]
    return cleaned_text

def link_only_special(text):
    print text.groups()

def remove_link_text(text_block):
    return re.sub(link_all_re, link_only, text_block)

In [8]:
print extract_link_text('[[1asdf|2]]')
print extract_link_text('[[George Savage (politician)|George Savage]]')

('1asdf',)
('George Savage (politician)',)


In [9]:
#remove_link_text('asdf asdf [[ab]] [[cd|efg]]')
#remove_link_text('[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]')
#link_all_re.search('[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]').groups()
#re.sub(link_all_re, link_only, '[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]')
a = re.sub(link_all_re, link_only, '[[George Savage (politician)|George Savage]], 72, British politician, [[Member of the Legislative Assembly (Northern Ireland)|MLA]] for [[Upper Bann (Assembly constituency)|Upper Bann]]')

In [10]:
a

'George Savage (politician), 72, British politician, Member of the Legislative Assembly (Northern Ireland) for Upper Bann (Assembly constituency)'
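For comparison, the same transformation can be sketched as a single substitution. This is an alternative formulation, not the notebook's code: `wiki_link_re` and `strip_wiki_links` are hypothetical names, and the pattern is only checked against the examples used above.

```python
import re

# Hypothetical single-pass pattern: capture the link target and drop an
# optional "|label" part together with the surrounding brackets.
wiki_link_re = re.compile(r'\[\[([^\[\]|]*)(?:\|[^\[\]]*)?\]\]')

def strip_wiki_links(text):
    """Replace each [[target|label]] or [[target]] with its target text."""
    return wiki_link_re.sub(r'\1', text)

print(strip_wiki_links('[[George Savage (politician)|George Savage]], 72, British politician'))
print(strip_wiki_links('asdf asdf [[ab]] [[cd|efg]]'))
```

Like the functions above, this keeps the link target, so disambiguation suffixes such as "(politician)" remain in the text.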

#### Extracting nationality from description

Nationality is roughly matched to the first $n$ words that have capitalized first letters. Sometimes this will capture titles or description text such as "British Prime Minister" or "Australian Olympic." However, it does a pretty good job of making sure countries such as "New Zealand" and "South African" aren't cut off.

In [11]:
natl_pattern1 = re.compile(' ?((?:[A-Z][^\s]+ ?)+) ', re.UNICODE)

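def get_nationality_text(desc_text):
    try:
        # strip stray '[' characters from both ends before matching
        return natl_pattern1.match(desc_text.strip('[')).groups()[0]
    except AttributeError:
        # no leading capitalized word: print the description for review
        print desc_text
        return None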

In [12]:
natl_pattern1.match('Native American asdf asdfasdf ').groups()

('Native American',)
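A few more cases make the behavior of the pattern easier to see. The sketch below reuses the same regular expression; `leading_capitalized_words` is a hypothetical wrapper, and the descriptions are made up for illustration, except the last one, which is one of the descriptions this notebook fails to match.

```python
import re

# Same pattern as natl_pattern1 above, written as a raw string.
natl_pattern = re.compile(r' ?((?:[A-Z][^\s]+ ?)+) ', re.UNICODE)

def leading_capitalized_words(desc_text):
    """Return the leading run of capitalized words, or None if there is none."""
    match = natl_pattern.match(desc_text)
    return match.group(1) if match else None

for desc in ['New Zealand rugby player',
             'British Prime Minister of the United Kingdom',
             'former German Nazi SS officer']:
    print(leading_capitalized_words(desc))
```

The first description yields "New Zealand" intact, the second over-captures "British Prime Minister", and the third yields nothing because it starts with a lowercase word.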

#### Parse out additional characters and urls from name field

In [13]:
# essentially does the same thing as extract_link_text
def get_wiki_url(name_text):
    return name_text.split('|')[0].strip('[').strip(']')

#### Additional functions for cleaning and removing characters

In [14]:
def remove_end_period(text):
    return re.sub('\. ?$','',text)

def remove_beginning_space(text):
    return re.sub('^ +','',text)

In [15]:
remove_beginning_space('   asdf')

'asdf'

In [16]:
def text_clean(text):
    if not isinstance(text, str):
        return text
    no_links = remove_link_text(text)
    no_period = remove_end_period(no_links)
    no_brackets = no_period.replace('[','').replace(']','')
    return remove_beginning_space(no_brackets)

### Re-read and process scraped file data locally

In [17]:
jsonfiles = [f for f in listdir('../out/raw_pages/') if (isfile(join('../out/raw_pages/', f)) and f[-4:] == 'json')]

In [18]:
month_year_pages = {}

for jfilename in jsonfiles:
    my_key = "_".join(re.match(mo_yr_key_re, jfilename).groups())
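    with open('../out/raw_pages/' + jfilename, 'rb') as infile:
        contents = infile.read()
    # the with block closes the file; the single page is stored under its page id,
    # so take the first value instead of indexing by key
    month_year_pages[my_key] = list(json.loads(contents)['query']['pages'].values())[0]['revisions'][0]['*']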
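The chain of lookups is easier to follow on a small example. The payload below is hypothetical and heavily shortened: its nesting is inferred from the lookups in the cell above, and the page id, title and wikitext are made up.

```python
import json

# Hypothetical payload with the nesting that the lookup above expects.
sample_contents = json.dumps({
    'query': {
        'pages': {
            '12345': {
                'title': 'Example page',
                'revisions': [{'*': '*[[Example Person]], 80, Example writer.\n'}],
            }
        }
    }
})

pages = json.loads(sample_contents)['query']['pages']
# The single key is a page id that is not known in advance, so take the
# first (and only) value rather than indexing by key.
first_page = next(iter(pages.values()))
wikitext = first_page['revisions'][0]['*']
print(wikitext)
```

Using `next(iter(...))` avoids relying on `dict.values()` returning a list, which holds in Python 2 but not in Python 3.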

In [19]:
counter = 0
for value in month_year_pages.values():
    new_entries = len(value.encode('utf-8').rstrip().split('*'))
    counter += new_entries
counter

57927

Should combine cell below with cell above (later).

In [20]:
for my_key in month_year_pages.keys():
    month_year_pages[my_key] = [
        add_description_and_death(
            parse_month_year_name_age(my_key, entry)
        )
        for entry in month_year_pages[my_key].encode('utf-8').rstrip().split('*')
        if re.match(name_age_re, entry.replace('\n', ''))]

Make dataframe:

In [21]:
df_full = pd.DataFrame(columns=['year','month','name','age','desc','cause_of_death'])
print df_full.shape
df_full.head()

(0, 6)


|   | year | month | name | age | desc | cause_of_death |
|---|---|---|---|---|---|---|


In [22]:
for entry in month_year_pages.values():
    df_sub = pd.DataFrame(entry, columns=['year','month','name','age','desc','cause_of_death'])
    df_full = pd.concat([df_full, df_sub], axis=0)

In [23]:
print df_full.shape
df_full.head()

(49051, 6)


|   | year | month | name | age | desc | cause_of_death |
|---|---|---|---|---|---|---|
| 0 | 2014 | 10 | Lynsey de Paul | 64 | English singer-songwriter ("[[Won't Somebody D... | brain haemorrhage. |
| 1 | 2014 | 10 | Maurice Hodgson\|Sir Maurice Hodgson | 94 | British business executive. | |
| 2 | 2014 | 10 | Shlomo Lahat | 86 | Israeli general and politician, Mayor of [[Tel... | lung infection. |
| 3 | 2014 | 10 | José Martínez (infielder)\|José Martínez | 72 | Cuban baseball player ([[Pittsburgh Pirates]])... | [[Chicago Cubs]]) and executive ([[Atlanta Br... |
| 4 | 2014 | 10 | Oluremi Oyo | 61 | Nigerian journalist | cancer. |


In [24]:
df_cod_nons = df_full[df_full.cause_of_death.map(lambda x: type(x) != str)]
df_cod_nons.head()

|   | year | month | name | age | desc | cause_of_death |
|---|---|---|---|---|---|---|


In [25]:
df_full.iloc[4].values[5]

' cancer.'

### Further Processing

#### Reminder: Strip links, quotes, and brackets before extracting Nationality text

In [26]:
df_full['desc'] = df_full.desc.map(text_clean)
df_full['cause_of_death'] = df_full.cause_of_death.map(text_clean)

#### Extract Nationality

Extracting nationality text as well as possible by taking the first consecutive capitalized words in the description. 'Olympic' and similar capitalized words might throw this off.

In [27]:
df_full['nationality'] = df_full.desc.map(get_nationality_text)

ni-Vanuatu international footballer
chief of web security at AOL
<!-- date of birth unknown -->British artist. 
former Canadian Football League player and NHL referee. 
former Soviet FIFA World Cup footballer & title-winning coach for Dynamo Kiev 
former German Nazi SS officer. 
former Second Deputy Prime Minister of Singapore
former professional wrestling promoter
last American survivor of the ''RMS Titanic'' sinking
former CEO of Firestone Tire and Rubber Company
fled the U.S. with granddaughter in Elizabeth Morgan case custody battle 
lawyer & professor at New York Law School 
founder of Ninety Nine Restaurant & Pub
first Asian-American to be editor of a major American daily newspaper, the ''St. Louis Post-Dispatch''
first Australian Defence Force serviceperson killed in Iraq. 
humanitarian worker and charity activist
scholar and journalist
military band conductor
football player (Associação Chapecoense de Futebol)
football player (Associação Chapecoense de Futebol)
football player 

Get the wiki page name (the link target) from the name field:

In [28]:
df_full['name'] = df_full.name.map(get_wiki_url)

## Write out file

In [29]:
df_full.to_csv('../out/celeb_deaths_wikipedia_full_1.csv', index=False)

It would be useful to plot missing data by page_size, i.e. fame or "importance".

In [30]:
df_full.head()

|   | year | month | name | age | desc | cause_of_death | nationality |
|---|---|---|---|---|---|---|---|
| 0 | 2014 | 10 | Lynsey de Paul | 64 | English singer-songwriter ("Won't Somebody Dan... | brain haemorrhage | English |
| 1 | 2014 | 10 | Maurice Hodgson | 94 | British business executive | | British |
| 2 | 2014 | 10 | Shlomo Lahat | 86 | Israeli general and politician, Mayor of Tel A... | lung infection | Israeli |
| 3 | 2014 | 10 | José Martínez (infielder) | 72 | Cuban baseball player (Pittsburgh Pirates), co... | Chicago Cubs) and executive (Atlanta Braves) | Cuban |
| 4 | 2014 | 10 | Oluremi Oyo | 61 | Nigerian journalist | cancer | Nigerian |


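Quick workaround for the rows whose cause_of_death ends in a closing parenthesis (row 3 above, for example, where a comma inside parentheses split the description). One way is to re-read the CSV that was just written:

full_2_list = []
with open('../out/celeb_deaths_wikipedia_full_1.csv', 'rb') as df_full_1_infile:
    in_reader = csv.reader(df_full_1_infile)
    for row in in_reader:
        full_2_list.append(row)

The correct way to do it is to build the same list straight from the dataframe: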

In [66]:
full_2_list = [df_full.columns.tolist()] + list(df_full.values.tolist())

In [67]:
for row in full_2_list[1:]:
    # a cause of death ending in ')' is really the tail of the description
    if row[5].endswith(')'):
        row[4] = row[4] + ", " + row[5]
        row[5] = ''
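The effect of this loop can be checked on a couple of rows. The sketch below wraps the same logic in a hypothetical helper, `merge_parenthesis_rows`, and runs it on made-up rows that follow the column order used above.

```python
def merge_parenthesis_rows(rows, desc_idx=4, cause_idx=5):
    """Fold a cause of death ending in ')' back into the description.

    Hypothetical helper: `rows` is a list of lists without the header row,
    and the two index arguments default to the column order used above.
    """
    for row in rows:
        if row[cause_idx].endswith(')'):
            row[desc_idx] = row[desc_idx] + ', ' + row[cause_idx]
            row[cause_idx] = ''
    return rows


# Made-up rows: the first was split on a comma inside parentheses.
sample_rows = [
    ['2014', '10', 'Example Player', '72', 'Example player (Team A', 'Team B)', 'Example'],
    ['2014', '10', 'Example Writer', '61', 'Example journalist', 'cancer', 'Example'],
]
for row in merge_parenthesis_rows(sample_rows):
    print(row)
```

The first row gets its description restored to "Example player (Team A, Team B)" with an empty cause of death, while the second row is left unchanged. A genuine cause of death that happens to end in a parenthesis would also be folded into the description, so this remains a heuristic.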

In [68]:
# this overwrites the file written by to_csv above
with open('../out/celeb_deaths_wikipedia_full_1.csv', 'wb') as df_full_2_outfile:
    out_writer = csv.writer(df_full_2_outfile, delimiter=',')
    out_writer.writerows(full_2_list)