# Scraping Wikipedia Celebrity Deaths

Creating a notebook to produce the dataset found at the [Kaggle Celebrity Deaths Page](https://www.kaggle.com/hugodarwood/celebrity-deaths).

The goal is to replace the current dataset, which is incomplete (not up to date), has no accompanying notebook that can be rerun to refresh it, and contains bad parses in some of the fields.

This notebook isn't perfect either but does a better job.

In [3]:
%matplotlib inline

In [4]:
import numpy as np
import pandas as pd
import re
import json

from os import listdir
from os.path import isfile, join
import sys
# Python 2 only: make UTF-8 the default encoding for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf-8')

from bs4 import BeautifulSoup as bs

### Helper functions

In [5]:
desc_death_re = re.compile('(.*?),? ?((?:.+)*?). (?:.*?)?', re.DOTALL|re.MULTILINE)
death_clean_no_url_re = re.compile('\s?(.[^<]+)\.? ?(?=<|\[?http)(?:.*)?$')
# death_clean_no_url_re = re.compile('\s?\w(.[^<])+[.]?(?:<|\[?http)(?:.*)?$')

"""
Input: single text string to be processed

Output: list of strings of length two
  - first string is description of person
  - second string is cause of death 
  (last clause of input when more than one comma in field)
  
"""
def get_description_and_death(text):
    text_no_url = text
    if ('http' in text_no_url) or ('<ref' in text_no_url):
        try:
            text_no_url = death_clean_no_url_re.match(text).groups()[0]
        except AttributeError:
            # Pattern did not match: show the raw text and pause for inspection
            print text
            raw_input("Press enter to continue")
    text_parts = text_no_url.replace('=','').split(',')
    num_parts = len(text_parts)
    if num_parts == 0:
        return ['', '']
    elif num_parts == 1:
        return text_parts + ['']
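    elif num_parts == 2:
        # One comma: keep the full text as the description and also
        # report the second part as the cause of death
        return [",".join(text_parts)] + [text_parts[1]]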
    else:
        return ([",".join(text_parts[:-1])] + [text_parts[-1]])
                #[death_clean_no_url_re.match(text_parts[-1]).groups()[0]])

In [6]:
"""
Runs get_description_and_death() on the last element of a list
"""
def add_description_and_death(entry_list):
    return entry_list[:-1] + get_description_and_death(entry_list[-1])

Testing functions:

In [7]:
get_description_and_death('English singer-songwriter ("[[Won\'t Somebody Dance with Me]]"), brain haemorrhage.<ref>[http://www.bbc.co.uk/news/entertainment-arts-29457228 Singer Lynsey De Paul dies aged 64]</ref>')

['English singer-songwriter ("[[Won\'t Somebody Dance with Me]]"), brain haemorrhage.',
 ' brain haemorrhage.']
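
The test above exercises the two-part case, where the whole text is kept as the description. As a minimal sketch of how the comma count drives the split, the standalone function below mirrors the branching of `get_description_and_death` without the URL handling. The name `split_description_and_death` and the sample strings are illustrative assumptions, not part of the notebook.

```python
def split_description_and_death(text):
    """Mirror the comma-count branches of get_description_and_death (no URL handling)."""
    parts = text.replace('=', '').split(',')
    if len(parts) == 1:
        # No comma: everything is description, cause of death is empty.
        return parts + ['']
    if len(parts) == 2:
        # One comma: the full text is kept as the description,
        # and the second part is also reported as the cause of death.
        return [','.join(parts), parts[1]]
    # Several commas: the last clause is the cause, the rest is the description.
    return [','.join(parts[:-1]), parts[-1]]

# Made-up sample descriptions, one per branch.
print(split_description_and_death('British business executive.'))
print(split_description_and_death('Nigerian journalist, cancer.'))
print(split_description_and_death('American actor, director, heart attack.'))
```

The leading space and trailing period in the cause of death are left in place here; in the notebook they are removed later by `text_clean`.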

In [8]:
"""
This is for input as a list cleaning a particular list entry. 
See below for the pandas dataframe version

Inputs: list, index number, character pattern to be replaced

Outputs: list with entry in index number cleaned of the character pattern to be replaced
"""
def clean_text(text_list, idx_num, chars):
    text_list_out = list(text_list)  # copy so the input list is not modified
    text_list_out[idx_num] = text_list[idx_num].replace(chars, '')
    return text_list_out

In [9]:
mo_yr_key_re = re.compile('(\d+)_(\d+).*?')
name_age_re = re.compile('\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

"""
Inputs: year-month key string, text entry string

Outputs: list of year, month, name, age, and the remaining entry text
"""
def parse_month_year_name_age(my_key, text_entry):
    return (list(re.match(mo_yr_key_re, my_key).groups()) +
            list(re.match(name_age_re, text_entry.replace('\n', '')).groups()))
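
As a quick check of the two patterns, the sketch below applies them to a hypothetical file name and a single wikitext entry. The key `2014_10.json`, the entry string, and the helper name `parse_entry` are assumptions made up for illustration; the regexes are copied from the cell above and written as raw strings. It assumes the key is year first, as the dataframe columns further down suggest.

```python
import re

mo_yr_key_re = re.compile(r'(\d+)_(\d+).*?')
name_age_re = re.compile(r'\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

def parse_entry(my_key, text_entry):
    """Return [year, month, name, age, remaining text] for one entry."""
    return (list(mo_yr_key_re.match(my_key).groups()) +
            list(name_age_re.match(text_entry.replace('\n', '')).groups()))

# Hypothetical inputs, shaped like the file names and entries used in this notebook.
parsed = parse_entry('2014_10.json', '[[Oluremi Oyo]], 61, Nigerian journalist, cancer.\n')
print(parsed)
```

All five values come back as strings, so the age would still need converting before any numeric use.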

In [10]:
"""
Find links and convert them to the link target (the text before any '|')
"""
link_re = re.compile('\[\[([^\|(?:\]\])]*)(?=\||\]\])', re.DOTALL)

# need to check if contains link
def extract_link_text(link_block):
    link_present = link_re.search(link_block)
    if link_present:
        return link_present.groups()
    return link_block

In [11]:
extract_link_text('asdf asdf [[1asdf|2]]')

('1asdf',)

In [12]:
link_all_re = re.compile('(\[\[([^(\]\])]*)(\]\]))')

"""
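Replace every [[...]] link in a text block with its extracted link text
(links containing parentheses are not matched and are left unchanged)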
"""
def link_only(matchobj):
    cleaned_text = extract_link_text(matchobj.groups()[0])[0]
    return cleaned_text

def remove_link_text(text_block):
    return re.sub(link_all_re, link_only, text_block)

In [13]:
remove_link_text('asdf asdf [[ab]] [[cd|efg]]')

'asdf asdf ab cd'
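
Note that the helpers above keep the part of a link before `|`, which is why `[[cd|efg]]` becomes `cd`. If the displayed label is wanted instead, one possible alternative is a single substitution. This is a sketch and not the notebook's method; the pattern `display_link_re` and the function name are hypothetical.

```python
import re

# Optional "target|" prefix, then the label up to the closing brackets.
display_link_re = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]')

def replace_links_with_label(text_block):
    """Replace [[target|label]] with label and [[target]] with target."""
    return display_link_re.sub(lambda m: m.group(1), text_block)

print(replace_links_with_label('asdf asdf [[ab]] [[cd|efg]]'))
print(replace_links_with_label('[[Maurice Hodgson|Sir Maurice Hodgson]], 94'))
```

Because the character classes only exclude `]` and `|`, this variant also handles link targets that contain parentheses.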

In [14]:
def remove_end_period(text):
    return re.sub('\. ?$','',text)

def remove_beginning_space(text):
    return re.sub('^ +','',text)

In [15]:
remove_beginning_space('   asdf')

'asdf'

In [16]:
"""
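Clean a single dataframe entry: replace links with their text, drop a
trailing period, strip remaining brackets, and remove leading spaces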
"""
def text_clean(text):
    return remove_beginning_space(
        remove_end_period(
            remove_link_text(text)
        ).replace('[','').replace(']','')
    )

In [17]:
natl_pattern1 = re.compile(' ?((?:[A-Z][^\s]+ ?)+) ', re.UNICODE)

def get_nationality_text(desc_text):
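    try:
        # strip '[' from both ends, then take the leading run of capitalized words
        return natl_pattern1.match(desc_text.strip('[')).groups()[0]
    except AttributeError:
        # no leading capitalized words: print the text for inspection (returns None)
        print desc_text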

In [18]:
natl_pattern1.match('Native American asdf asdfasdf ').groups()

('Native American',)

In [19]:
def get_wiki_url(name_text):
    return name_text.split('|')[0].strip('[').strip(']')

### Re-read and process scraped file data locally

In [20]:
jsonfiles = [f for f in listdir('../out/raw_pages/') if (isfile(join('../out/raw_pages/', f)) and f[-4:] == 'json')]

In [21]:
month_year_pages = {}

for jfilename in jsonfiles:
    my_key = "_".join(re.match(mo_yr_key_re, jfilename).groups())
    with open('../out/raw_pages/' + jfilename, 'rb') as infile:
        contents = infile.read()  # the with block closes the file on exit
    month_year_pages[my_key] = json.loads(contents)['query']['pages'].values()[0]['revisions'][0]['*']
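
The indexing chain in the loop above assumes each saved file holds a response with a single page nested under `query` and `pages`, whose first revision stores the wikitext under the `*` key. A minimal sketch of that shape, using a made-up page id and made-up wikitext rather than real scraped data:

```python
import json

# Hypothetical payload mimicking the nesting the notebook indexes into.
raw = json.dumps({
    'query': {
        'pages': {
            '12345': {  # page id is made up
                'revisions': [
                    {'*': '*[[Some Person]], 80, Example description, example cause.'}
                ]
            }
        }
    }
})

pages = json.loads(raw)['query']['pages']
# Take the only page without relying on its id; this form also works
# where dict.values() returns a view that cannot be indexed directly.
page = next(iter(pages.values()))
wikitext = page['revisions'][0]['*']
print(wikitext)
```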

In [22]:
counter = 0
for value in month_year_pages.values():
    new_entries = len(value.encode('utf-8').rstrip().split('*'))
    counter += new_entries
counter

57927

Should combine cell below with cell above (later).

In [23]:
for my_key in month_year_pages.keys():
    month_year_pages[my_key] = [
        add_description_and_death(
            parse_month_year_name_age(my_key, entry)
        )
        for entry in month_year_pages[my_key].encode('utf-8').rstrip().split('*')
        if re.match(name_age_re, entry.replace('\n', ''))]
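
Splitting on `*` also produces fragments that are not person entries (page headers, section headings, other bullet lines), so the raw fragment count is an upper bound on the rows that survive the `name_age_re` filter. A small sketch with made-up wikitext shows the difference:

```python
import re

name_age_re = re.compile(r'\[\[(.*?)\]\], (\d+), (.+)?$', re.MULTILINE)

# Made-up page text: headings, two entries, and one bullet that is not an entry.
page_text = (
    "==October 2014==\n"
    "===1===\n"
    "*[[Person One]], 64, English singer, illness.\n"
    "*[[Person Two|P. Two]], 94, British executive.\n"
    "*See also: other lists\n"
)

fragments = page_text.rstrip().split('*')
entries = [f for f in fragments if name_age_re.match(f.replace('\n', ''))]

print(len(fragments))  # all pieces produced by the split
print(len(entries))    # pieces that look like "[[name]], age, text"
```

Here the split yields four fragments but only two pass the filter; the heading block and the "See also" line are dropped.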

Make dataframe:

In [24]:
df_full = pd.DataFrame(columns=['year','month','name','age','desc','cause_of_death'])
print df_full.shape
df_full.head()

Unnamed: 0,year,month,name,age,desc,cause_of_death


In [25]:
for entry in month_year_pages.values():
    df_sub = pd.DataFrame(entry, columns=['year','month','name','age','desc','cause_of_death'])
    df_full = pd.concat([df_full, df_sub], axis=0)

In [26]:
print df_full.shape
df_full.head()

Unnamed: 0,year,month,name,age,desc,cause_of_death
0,2014,10,Lynsey de Paul,64,"English singer-songwriter (""[[Won't Somebody D...",brain haemorrhage.
1,2014,10,Maurice Hodgson|Sir Maurice Hodgson,94,British business executive.,
2,2014,10,Shlomo Lahat,86,"Israeli general and politician, Mayor of [[Tel...",lung infection.
3,2014,10,José Martínez (infielder)|José Martínez,72,Cuban baseball player ([[Pittsburgh Pirates]])...,[[Chicago Cubs]]) and executive ([[Atlanta Br...
4,2014,10,Oluremi Oyo,61,"Nigerian journalist, cancer.",cancer.


In [27]:
df_full.iloc[4].values[5]

' cancer.'

### Further Processing

#### Reminder: Strip links, quotes, and brackets before extracting Nationality text

In [28]:
df_full['desc'] = df_full.desc.map(text_clean)
df_full['cause_of_death'] = df_full.cause_of_death.map(text_clean)

#### Extract Nationality

Nationality is approximated by taking the leading run of consecutive capitalized words in the description. Capitalized words that are not nationalities, such as 'Olympic', are captured as well and can throw this off.

In [29]:
df_full['nationality'] = df_full.desc.map(get_nationality_text)
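
To illustrate both the heuristic and its failure mode, the sketch below applies the same pattern to a few made-up descriptions. The sample strings are illustrative and the helper name `leading_capitalized` is hypothetical; unlike `get_nationality_text`, it returns `None` silently when nothing matches.

```python
import re

natl_pattern1 = re.compile(r' ?((?:[A-Z][^\s]+ ?)+) ', re.UNICODE)

def leading_capitalized(desc_text):
    """Return the leading run of capitalized words, or None if there is none."""
    match = natl_pattern1.match(desc_text.strip('['))
    return match.group(1) if match else None

print(leading_capitalized('Nigerian journalist, cancer'))
print(leading_capitalized('Native American painter'))
print(leading_capitalized('American Olympic swimmer'))  # 'Olympic' is swept in
print(leading_capitalized('politician and diplomat'))   # no capitalized lead
```

The third call returns `American Olympic`, which is the kind of over-capture described above; a lookup against a list of known nationalities would be one way to tighten it, though that is not attempted here.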

Extract the Wikipedia page title from the name field (the part before any `|`):

In [30]:
df_full['name'] = df_full.name.map(get_wiki_url)
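
For piped names, `get_wiki_url` keeps the page title that precedes `|` and discards the display name. A short sketch using names of the form seen in the dataframe preview:

```python
def get_wiki_url(name_text):
    """Keep the page title (text before '|') and drop stray brackets."""
    return name_text.split('|')[0].strip('[').strip(']')

print(get_wiki_url('Maurice Hodgson|Sir Maurice Hodgson'))
print(get_wiki_url('Shlomo Lahat'))
```

The result is a page title rather than a full URL, so names without a pipe pass through unchanged.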

## Write out file

In [31]:
df_full.to_csv('../out/celeb_deaths_wikipedia_full_1.csv', index=False)

It would be useful to plot missing data by page_size, i.e., as a proxy for fame or "importance."