# Scraping Wikipedia Celebrity Deaths

Creating a notebook to produce the dataset found at the [Kaggle Celebrity Deaths Page](https://www.kaggle.com/hugodarwood/celebrity-deaths).

The goal is to replace the current dataset, which is incomplete: it is out of date, there is no notebook to re-run for fresh data, and some fields are badly parsed.

This notebook isn't perfect either, but it does a better job.

In [1]:
%matplotlib inline

In [85]:
import numpy as np
import pandas as pd
import re
import requests
import json

from bs4 import BeautifulSoup as bs

In [3]:
month_to_num = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}

Iterators:

In [16]:
year_list = range(2004,2017)
type(year_list)

list

### Scrape all monthly death pages

Scrape every monthly death page and store the raw JSON response in the '../out/raw_pages' directory.

In [86]:
for year in year_list:
    for month in month_to_num.keys():
        filename = month + '_' + str(year) + '_deaths.json'
        url = ('https://en.wikipedia.org/w/api.php?action=query&titles=Deaths_in_' +  
               month + '_' + str(year) + 
               '&prop=revisions&rvprop=content&format=json')
        content = requests.get(url).json()
        with open("../out/raw_pages/" + filename, "wb") as outfile:
            json.dump(content, outfile)
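The same query URL is assembled by hand here and again in `get_page_entries` below. A minimal sketch of a shared helper follows, written for Python 3 with the standard library only (the notebook cells themselves are Python 2). The names `API_URL` and `deaths_page_url` are made up for this illustration.

```python
from urllib.parse import urlencode

API_URL = 'https://en.wikipedia.org/w/api.php'

def deaths_page_url(month, year):
    """Build the API query URL for one 'Deaths in <month> <year>' page."""
    params = {
        'action': 'query',
        'titles': 'Deaths_in_%s_%d' % (month, year),
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    }
    # urlencode escapes any characters that are not URL-safe
    return API_URL + '?' + urlencode(params)

print(deaths_page_url('February', 2004))
```

Keeping the URL in one place means the scraping loop and the parsing helper cannot drift apart.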

### Helper functions

#### Page entries for given (month, year) pair

In [17]:
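def get_page_entries(month, year):
    """
    Return a list of the entries from the wiki text obtained from the API url
    """
    url = ('https://en.wikipedia.org/w/api.php?action=query&titles=Deaths_in_' +
           month + '_' + str(year) +
           '&prop=revisions&rvprop=content&format=json')
    print url
    # the API returns the page as wikitext, keyed by page id under query.pages
    pages = requests.get(url).json()['query']['pages']
    # name the parser explicitly so the result does not depend on which parsers are installed
    page_mkdown = bs(pages.values()[0]['revisions'][0]['*'], 'html.parser')
    # every entry is a '*' bullet; the first chunk is the page header, not a person
    page_txt_list = page_mkdown.text.split('*')

    # disabled sanity check: report chunks that do not start with a wiki link
    # for i in xrange(0, len(page_txt_list)):
    #     first_two_chars = page_txt_list[i][:2]
    #     if first_two_chars != '[[':
    #         print "Error: ", page_txt_list[i][:40]
    #         print ""

    return page_txt_list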
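Splitting the whole page on `*` keeps day headings attached to the preceding entry and leaves a leading space on every chunk. A hedged alternative, sketched in Python 3 with the standard library only, is to keep just the lines that start with a bullet. The function names and the sample payload (including the page id `'123'`) are invented for illustration. Only the `query -> pages -> revisions -> '*'` layout is taken from the code above.

```python
def wikitext_from_payload(payload):
    """Pull the wikitext out of an API response shaped like the one used above."""
    pages = payload['query']['pages']
    page = next(iter(pages.values()))  # a single title was requested, so one page
    return page['revisions'][0]['*']

def split_entries(wikitext):
    """Return one string per bullet line, without the bullet or outer whitespace."""
    return [line.lstrip('*').strip()
            for line in wikitext.splitlines()
            if line.startswith('*')]

# Hypothetical payload with the same shape as the real response
sample = {'query': {'pages': {'123': {'revisions': [{'*': (
    "== February 2004 ==\n"
    "===[[February 1|1]]===\n"
    "* [[Art Albrecht]], 82, American football player.\n"
    "* [[Buzz Gardner]], 72, American trumpeter ([[The Mothers of Invention]])\n"
)}]}}}}

entries = split_entries(wikitext_from_payload(sample))
```

Headings such as `===[[February 1|1]]===` never start with `*`, so they are dropped without any extra filtering.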

#### Extract Nationality

Extract the nationality as well as possible by taking the leading run of capitalized words in the description. Any capitalized word that directly follows the nationality, such as 'Olympic', is captured along with it and will throw the result off.

In [18]:
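# leading run of capitalized (optionally hyphenated) words, followed by a space
natl_pattern1 = re.compile(r'((?:[A-Z][a-z\-]+ ?)+) ')

def get_nationality_text(desc_text):
    try:
        # drop any '[' left over from wiki links before matching
        return natl_pattern1.match(desc_text.strip('[')).groups()[0]
    except AttributeError:
        # no leading capitalized word: print the description for inspection;
        # the function then returns None
        print desc_text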
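One way to limit the 'Olympic' problem is to cut the match at the first word known not to be a nationality. The sketch below is in Python 3 and reuses the same pattern. The stop list `NOT_NATIONALITY` and the function name are assumptions made for this example, not part of the notebook.

```python
import re

# Same pattern as above: the leading run of capitalized words
NATL_PATTERN = re.compile(r'((?:[A-Z][a-z\-]+ ?)+) ')

# Hypothetical stop list; extend it as bad parses turn up
NOT_NATIONALITY = {'Olympic', 'Muslim', 'Catholic', 'World', 'War'}

def nationality_from_description(desc_text):
    """Return the leading capitalized words, cut at the first stop word, or None."""
    match = NATL_PATTERN.match(desc_text.strip('['))
    if match is None:
        return None
    words = []
    for word in match.group(1).split():
        if word in NOT_NATIONALITY:
            break
        words.append(word)
    return ' '.join(words) or None
```

Two-word nationalities such as 'New Zealand' still come through, while descriptions that start with a lowercase word yield `None` rather than a printed message.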

#### Regex for creating columns

Compile regex to extract data fields.

In [71]:
# fully populated entry: name, age, description, cause of death, optional [http...] link
# (some entries of this shape might still have no cause of death)
row_pattern_full = re.compile(r'\[\[(.*?)\]\], (\d+), (?:(.+), )+(.*?).($| \[{1}(h.+)\]{1})')

# entry with no cause of death and no extra commas
row_pattern_m1 = re.compile(r'\[\[(.*?)\]\], (\d+), (.+)[\.]?($| \[{1}(h.+)\]{1})')
#row_pattern_m1 = re.compile('\[\[(.*?)\]\], (\d+), (.+). ?\[{1}(h.+)\]{1}') # original

def get_entry_dict(page_entries_list):
    """
    Convert the page entry source into a dictionary of people
    """
    count_errors = 0
    entry_dict = {}

    for entry in page_entries_list:
        entry_str = str(entry.encode('utf-8'))
        print entry_str
        try:
            entry_items = row_pattern_full.match(entry_str).groups()
            entry_dict[entry_items[0]] = entry_items[1:]
        except AttributeError, e1:
            try:
                entry_items = row_pattern_m1.match(entry_str).groups()
                entry_dict[entry_items[0]] = list(entry_items[1:3]) + ['', entry_items[3]]
            except AttributeError, e2:
                print "Error: ", entry[:20]
                print ""
                count_errors += 1

    print "Total entries: ", len(page_entries_list)
    print "Number of skipped entries: ", count_errors
    
    return entry_dict

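Quick check of the fallback pattern on a made-up entry. The greedy `(.+)` swallows the cause of death and the link into the description:

In [73]:
row_pattern_m1.match('[[Buzz Gardner]], 72, American trumpeter, died by asdf. [http://]').groups()

('Buzz Gardner', '72', 'American trumpeter, died by asdf. [http://]', '', None)

Actual loop: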

In [59]:
for year in year_list:
    for month in month_to_num.keys():
        entry_dict = {}
        
        page_entries = get_page_entries(month, year)
        entry_dict = get_entry_dict(page_entries)
        
        df_entries = pd.DataFrame(entry_dict).T
        df_entries.reset_index(inplace=True)
        
        df_entries.columns = ['name', 'age', 'description', 'death', 'obituary_url']
        del df_entries['obituary_url']
        df_entries['nationality_text'] = df_entries.description.apply(get_nationality_text)
        df_entries['name'] = df_entries.name.apply(lambda x: unicode(x, errors='ignore'))
        
        df_entries.to_csv('../out/celeb_deaths_wikipedia_' +
                          str(year) + '_' + str(month_to_num[month]) + '.csv', 
                          index=False)

https://en.wikipedia.org/w/api.php?action=query&titles=Deaths_in_February_2004&prop=revisions&rvprop=content&format=json
{{Deaths in month TOC}}
The following is a list of notable '''deaths in February 2004'''.

== February 2004 ==
===[[February 1|1]]===

Error:  {{Deaths in month TO

 [[Art Albrecht]], 82, American football player.

Error:   [[Art Albrecht]], 8

 [[Buzz Gardner]], 72, American trumpeter ([[The Mothers of Invention]])

Error:   [[Buzz Gardner]], 7

 [[Ally MacLeod]], 72, Scottish football player and manager.

Error:   [[Ally MacLeod]], 7

 [[Bob Stokoe]], 73, footballer, F.A. Cup winning manager.

===[[February 2|2]]===

Error:   [[Bob Stokoe]], 73,

 [[Alan Bullock]], 89, British historian.

Error:   [[Alan Bullock]], 8

 [[Henry Cockburn (footballer)|Henry Cockburn]], 82, English footballer.

Error:   [[Henry Cockburn (f

 [[Walter Freud]], 82, World War II Special Operations agent and chemical engineer.

===[[February 3|3]]===

Error:   [[Walter Freud]], 8

 [[Corne

ValueError: Length mismatch: Expected axis has 1 elements, new values have 5 elements
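The `ValueError` above appears to follow from every entry being skipped. Each chunk produced by `split('*')` starts with a space and ends with blank lines or a day heading, while both patterns are applied with `match`, which only succeeds at the very start of the string. With nothing matched, `entry_dict` stays empty and the resulting frame has a single column. Below is a minimal sketch of a more forgiving parser, in Python 3. The names `ENTRY_PATTERN`, `URL_PATTERN`, and `parse_entry` are invented here. Like the original full pattern, it treats the text after the last comma as the cause of death, so descriptions that contain commas will still be split wrongly.

```python
import re

ENTRY_PATTERN = re.compile(r'\[\[(?P<name>.*?)\]\], (?P<age>\d+), (?P<rest>.*)')
# trailing external link such as " [http://example.org Obituary]"
URL_PATTERN = re.compile(r'\s*\[(https?://[^\s\]]*)[^\]]*\]\s*$')

def parse_entry(entry):
    """Return (name, age, description, death, url), or None if entry is not a person."""
    lines = entry.strip().splitlines()
    if not lines:
        return None
    # only the first line belongs to the person; later lines are headings
    match = ENTRY_PATTERN.match(lines[0].strip())
    if match is None:
        return None
    rest = match.group('rest')
    url = ''
    url_match = URL_PATTERN.search(rest)
    if url_match:
        url = url_match.group(1)
        rest = rest[:url_match.start()]
    rest = rest.rstrip().rstrip('.')
    description, sep, death = rest.rpartition(', ')
    if not sep:  # no comma, so no cause of death
        description, death = rest, ''
    return (match.group('name'), match.group('age'), description, death, url)
```

Every successful parse has the same five fields, so the rows line up with the five column names assigned in the loop.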

## Further attributes outside of the death summary Wiki page

### Article length: Fame Score

Get the Wikipedia page title for each person; it is used to look up the fame score, i.e. the size of the article:

In [10]:
def get_wiki_url(name_text):
    return name_text.strip('[').strip(']').split('|')[0]

In [11]:
df_entries['name'] = df_entries.name.map(get_wiki_url)

Get page metadata:

In [12]:
base_url_prefix = 'https://en.wikipedia.org/w/api.php?action=query&format=json&titles=' 
base_url_suffix = '&prop=revisions&rvprop=size'

In [13]:
def get_page_size(wiki_name):
    size_url = base_url_prefix + wiki_name.replace(' ','_') + base_url_suffix
    size_page = requests.get(size_url)
    return size_page.json()['query']['pages'].values()[0]['revisions'][0]['size']

In [14]:
%%time
df_entries['page_size'] = df_entries.name.map(get_page_size)

CPU times: user 6.56 s, sys: 340 ms, total: 6.9 s
Wall time: 40.6 s
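`get_page_size` assumes that every title resolves to a page with a revision and that the name needs no escaping. A hedged Python 3 sketch of the two pieces that could be hardened is shown below: URL building and response parsing. The function names are made up, and the shape of the "missing page" response in the example is an assumption about how the API reports unknown titles.

```python
from urllib.parse import urlencode

API_URL = 'https://en.wikipedia.org/w/api.php'

def page_size_url(title):
    """Build the size query for one article, escaping non-ASCII characters."""
    params = {
        'action': 'query',
        'format': 'json',
        'titles': title.replace(' ', '_'),
        'prop': 'revisions',
        'rvprop': 'size',
    }
    return API_URL + '?' + urlencode(params)

def page_size_from_payload(payload):
    """Return the article size in bytes, or None when the page has no revisions."""
    page = next(iter(payload['query']['pages'].values()))
    revisions = page.get('revisions')
    if not revisions:
        return None
    return revisions[0]['size']

# Hypothetical responses: one found page, one missing page
found = {'query': {'pages': {'1': {'revisions': [{'size': 22671}]}}}}
missing = {'query': {'pages': {'-1': {'title': 'No Such Page', 'missing': ''}}}}
```

Returning `None` for a missing page keeps one bad title from aborting the whole `map` call.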


Clean up wiki link text

In [15]:
def remove_wiki_url_delims(text):
    return text.replace('[[','').replace(']]','')

In [16]:
df_entries['description'] = df_entries.description.map(remove_wiki_url_delims)
df_entries['death'] = df_entries.death.map(remove_wiki_url_delims)

In [17]:
df_entries.head()

| | name | age | description | death | nationality_text | page_size |
|---|---|---|---|---|---|---|
| 0 | Adele Faccio | 86 | Italian civil right activist. | | Italian | 3028 |
| 1 | Adelina Tattilo | 78 | Italian founder of ''Playmen'' magazine. | | Italian | 2600 |
| 2 | Ahmad Abu Laban | 60 | Egyptian-born Danish Muslim leader, key figure... | cancer | Egyptian-born Danish Muslim | 8303 |
| 3 | Aida Mason | 111 | British oldest person. | | British | 2101 |
| 4 | Alan MacDiarmid | 79 | New Zealand recipient of Nobel Prize in Chemis... | injuries from a fall | New Zealand | 21321 |


Sandbox for checking particular observations in February 2007:

In [18]:
#df_entries[df_entries.name == 'Lew Burdette']['description'][86]
df_entries[df_entries.name == 'Lew Burdette']
df_entries[df_entries.name == 'Fred Mustard Stewart'].values 
df_entries[df_entries.name == 'Aida Mason'].values
df_entries[df_entries.name == 'Ian Richardson'].values

array([['Ian Richardson', '72',
        "British actor (''House of Cards (UK TV series)|House of Cards'', ''Tinker, Tailor, Soldier, Spy'') and member of the Royal Shakespeare Company|RSC",
        'in his sleep', 'British', 22671]], dtype=object)
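The description printed above still contains piped link text such as `House of Cards (UK TV series)|House of Cards`, because removing `[[` and `]]` keeps both the link target and its label. A possible replacement, sketched in Python 3, keeps only the label and also drops the `''` italics markers. `WIKI_LINK` and `clean_wiki_markup` are names invented for this example, and the function has to run on the raw text, before the delimiters are removed.

```python
import re

# [[target|label]] or [[target]]; group 1 is the text a reader would see
WIKI_LINK = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]')

def clean_wiki_markup(text):
    """Replace wiki links by their visible label and strip italics quotes."""
    text = WIKI_LINK.sub(r'\1', text)
    return text.replace("''", '')

raw = ("British actor (''[[House of Cards (UK TV series)|House of Cards]]'') "
       "and member of the [[Royal Shakespeare Company|RSC]]")
print(clean_wiki_markup(raw))
```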

In [19]:
df_entries.loc[100:101]

| | name | age | description | death | nationality_text | page_size |
|---|---|---|---|---|---|---|
| 100 | Ian Richardson | 72 | British actor (''House of Cards (UK TV series)... | in his sleep | British | 22671 |
| 101 | Ian Stevenson | 88 | Canadian psychiatrist and reincarnation resear... | | Canadian | 65916 |


### Birth and death data

Get the birth and death dates from each person's own article. This [Stack Overflow thread](https://stackoverflow.com/questions/12250580/parse-birth-and-death-dates-from-wikipedia) on parsing them from Wikipedia is a helpful reference.

In [20]:
text_for_search = '         = Adele Faccio\n| honorific-suffix    =\n| image               =\n| caption             =\n| constituency_MP     =<!-- Can be repeated up to eight times by adding a number -->\n| parliament          =<!-- Can be repeated up to eight times by adding a number -->\n| majority            =<!-- Can be repeated up to eight times by adding a number -->\n| term_start          =<!-- Can be repeated up to eight times by adding a number -->\n| term_end            =<!-- Can be repeated up to eight times by adding a number -->\n| predecessor         =<!-- Can be repeated up to eight times by adding a number -->\n| successor           =<!-- Can be repeated up to eight times by adding a number -->\n| birth_date          ={{birth date|1920|11|13|df=y}} \n| birth_place         =Pontebba, Udine\n| death_date          ={{death date and age|2007|02|08|1920|11|13}}\n| death_place         =Rome\n| nationality         ={{ITA}}\n| party               = Radical Party (Partito Radicale)\n| otherparty          = <!--For additional political affiliations -->\n| spouse              =\n| partner             = <!--For those with a domestic partner and not married -->\n| relations           =\n| children            =\n| residence           =\n| alma_mater          =\n| occupation          =\n| profession          =\n| religion            =\n| signature           =\n| website             =\n| footnotes           =\n}}\n\n\'\'\'Adele Faccio\'\'\' (November 13, 1920 in [[Pontebba]], [[Udine]] \xe2\x80\x93 February 8, 2007 in [[Rome]]) was an [[Italy|Italian]] [[politician]] and deputy of the [[Radical Party (Italy)|Radical Party]] (\'\'Partito Radicale\'\').<ref>{{It icon}} "[http://www.corriere.it/Primo_Piano/Cronache/2007/02_Febbraio/09/adelefaccio.shtml Morta la radicale Adele Faccio]." (February 9, 2007). \'\'Corriere della Sera.\'\' Retrieved June 21, 2007.</ref>'

In [8]:
base_bday_prefix = 'https://en.wikipedia.org/w/api.php?action=query&format=json&titles='
base_bday_suffix = '&prop=revisions&rvprop=content&rvsection=0'

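def check_if_number(elem):
    # True when elem parses as an integer (surrounding whitespace is allowed)
    try:
        int(elem)
        return True
    except ValueError:
        return False

    
def get_only_ints(date_list):
    # keep only the numeric fields of the date template, e.g. drop 'df=y'
    return [e for e in date_list if check_if_number(e)]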


no_birth_death_urls = []

def get_life_death(name, extract_bday=True):
    try:        
        entry_url = base_bday_prefix + name.replace(' ','_') + base_bday_suffix
        entry_text = requests.get(entry_url).json()['query']['pages']
        entry_text = str(entry_text.values()[0]['revisions'][0]['*'].encode('utf-8'))
        
        if extract_bday:
            pattern = re.compile('.*?\{{2}(?:B|b)irth (?:D|d)ate(.+?)\}{2}', 
                                 re.MULTILINE|re.DOTALL)
        else:
            pattern = re.compile('.*?\{{2}(?:D|d)eath (?:D|d)ate(.+?)\}{2}', 
                                 re.MULTILINE|re.DOTALL)

        date_re = re.match(pattern, entry_text)
        date_data = "-".join(get_only_ints(date_re.groups()[0].strip(" ").split('|'))[:3])
        
        #print date_data
        return date_data
    
    except AttributeError, e:
        no_birth_death_urls.append(entry_url)
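For reference, here is a compact Python 3 sketch of the same idea that pulls both dates out of the infobox in one pass and zero-pads them. It is checked against a fragment of the `text_for_search` sample above. `DATE_TEMPLATE` and `extract_dates` are names invented here, and the sketch only covers templates whose name starts with "birth date" or "death date".

```python
import re

# {{birth date|...}} / {{death date and age|...}}; group 2 holds the arguments
DATE_TEMPLATE = re.compile(r'\{\{\s*(birth|death)[ _]date[^|}]*\|([^}]*)\}\}',
                           re.IGNORECASE)

def extract_dates(wikitext):
    """Return a dict with 'birth' and/or 'death' as YYYY-MM-DD strings."""
    dates = {}
    for kind, args in DATE_TEMPLATE.findall(wikitext):
        # keep only the numeric arguments, dropping options such as df=y
        numbers = [a.strip() for a in args.split('|') if a.strip().isdigit()]
        kind = kind.lower()
        if len(numbers) >= 3 and kind not in dates:
            year, month, day = (int(n) for n in numbers[:3])
            dates[kind] = '%04d-%02d-%02d' % (year, month, day)
    return dates

infobox = ('| birth_date          ={{birth date|1920|11|13|df=y}} \n'
           '| death_date          ={{death date and age|2007|02|08|1920|11|13}}\n')
print(extract_dates(infobox))
```

Using one request per person for both dates would also halve the number of API calls compared with mapping `get_life_death` twice.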

Birthday

In [22]:
%%time
df_entries['birthday'] = df_entries.name.map(get_life_death)

1920-11-13
1912-11-07
1931-3-30
1968-08-07
1955-9-25
1967-11-28
1937-06-13
1919-12-1
1932-7-31
1934-3-4
1963-1-17
1919-8-15
1922-12-17
1945-1-25
1914-2-6
1915-4-9
1946-3-29
1916-10-14
1924-4-23
1914-4-7
1906-5-19
1914-8-30
1937-4-20
1921-6-25
1908-11-26
1922-12-09
1941-7-27
1982-4-14
1960-7-30
1931-11-13
1954-9-18
1927-4-26
1936-2-6
1914-08-31
1988-12-23
1923-7-4
1926-5-15
1930-04-09
1920-09-05
1922-11-9
1913-3-30
1932-09-17
1928-01-26
1925-12-26
1928-10-20
1933-6-25
1930-9-9
1922-7-31
1927-6-2
1934-4-7
1918-10-31
1946-9-29
1959-09-26
1927-02-11
1921-4-23
1932-8-31
1927-10-8
1915-9-7
1921-12-6
1974-12-30
1926-12-25
1934-9-7
1932-1-15
1922-3-27
1945-08-12
1913-11-12
1935-4-17
1928-3-2
1915-11-19
1930-08-15
1932-03-11
1926-11-22
 1938 - 9 - 3
1918-02-06
1952-11-2
1940-6-30
1919-09-29
1910-09-03
1919-6-25
1965-1-24
1925-2-20
1924-4-21
1921-9-30
1905-01-17
1958-08-26
1907-8-31
1915-2-4
1943-07-31
1930-12-23
1941-12-05
1935-2-10
1938-2-22
1915-06-15
1930-10-09
1944-12-27
1975-6-21
1925-2-6


Deathday

In [23]:
%%time
df_entries['deathday'] = df_entries.name.map(lambda text: get_life_death(text, extract_bday=False))

2007-02-08
2007-02-24
2007-2-20
2007-02-09
2007-2-26
2007-2-8
2007-02-19
2007-2-5
2007-2-18
2007-2-4
2007-2-21
2007-2-9
2007-2-6
2007-2-27
2007-02-10
2007-2-28
2007-2-18
2007-2-27
2007-2-27
2007-2-24
2007-2-15
2007-2-10
2007-2-19
2007-2-28
2007-2-11
2007-02-13
2007-2-24
2007-2-15
2007-2-26
2007-2-22
2007-2-20
2007-2-23
2007-02-20
2007-02-13
2007-2-13
2007-2-12
2007-02-20
2007-02-22
2007-2-9
2007-2-6
2007-02-7
2007-2-14
2007-02-16
2007-2-12
2007-02-03
2007-2-22
2007-2-24
2007-2-1
2007-2-2
2007-2-6
2007-2-9
2007-2-8
2007-02-06
2007-2-9
2007-02-08
2007-02-04
2007-02-27
2007-2-19
2007-2-21
2007-02-11
2007-2-23
2007-2-8
2007-2-13
2007-2-20
2007-2-21
2007-2-6
2007-2-18
2007-2-11
2007-2-4
2007-02-07
2007-2-20
2007-2-24
2007-2-18
2007-2-6
2007-02-26
2007-02-05
2007-02-24
2007-2-6
 2007 - 02 - 5 
2007-02-22
2007-2-17
2007-2-25
2007-02-02
2007-02-17
2007-2-27
2007-2-17
2007-2-16
2007-2-25
2007-02-24
2007-2-3
2007-02-12
2007-02-12
2007-2-1
2007-2-15
2007-2-15
2007-02-14
2007-2-8
2007-02-16
2007-2
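The printed dates above mix padded and unpadded fields (`2007-02-08`, `2007-2-20`), some carry stray spaces (` 2007 - 02 - 5 `), and the last one, `2007-2`, has no day. A small normalizer, sketched in Python 3 with an invented name, brings them to one format and returns `None` for anything that is not a complete, valid date.

```python
from datetime import date

def normalize_date(raw):
    """Return raw as YYYY-MM-DD, or None if it is not a full valid date."""
    if raw is None:
        return None
    parts = [p.strip() for p in raw.split('-')]
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    year, month, day = (int(p) for p in parts)
    try:
        # date() rejects impossible values such as month 13
        return date(year, month, day).isoformat()
    except ValueError:
        return None
```

It could be applied to the `birthday` and `deathday` columns before writing the CSV, so the values sort and parse consistently.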

## Write out file

In [28]:
df_entries.to_csv('../out/celeb_deaths_wikipedia_' +
                  month_select + '_' + str(year_select) + '.csv', 
                  index=False)

A useful next step would be to plot the share of missing data against page_size, i.e. fame or "importance."
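As a starting point for that plot, the numbers behind it could be computed as the share of empty values per page-size bucket. The sketch below uses Python 3 and the standard library only. The function name and the bucket boundaries are assumptions, and the sample rows are the `page_size` and `death` values from the `head()` output earlier in the notebook.

```python
from bisect import bisect_right
from collections import defaultdict

def missing_share_by_size(rows, edges):
    """rows: (page_size, value) pairs; edges: ascending bucket boundaries in bytes.

    Returns {bucket_index: share of rows whose value is empty}.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for page_size, value in rows:
        bucket = bisect_right(edges, page_size)  # 0 is the smallest-pages bucket
        totals[bucket] += 1
        if not value:
            missing[bucket] += 1
    return {b: missing[b] / totals[b] for b in sorted(totals)}

rows = [(3028, ''), (2600, ''), (8303, 'cancer'),
        (2101, ''), (21321, 'injuries from a fall')]
shares = missing_share_by_size(rows, edges=[5000, 20000])
```

The resulting dictionary maps directly onto a bar chart with one bar per bucket.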