### Using Existing Datasets

In [1]:
%matplotlib inline

In [2]:
import string

import numpy as np
import pandas as pd
import nltk

from functools import reduce
from collections import Counter

Files were obtained from [Kaggle](https://www.kaggle.com/hugodarwood/celebrity-deaths) and journalist DH Montgomery's [GitHub](https://github.com/dhmontgomery/personal-work/tree/master/wikipedia-deaths). DH Montgomery's article can be [found here](http://dhmontgomery.com/2016/12/wikipediadeaths/).

Initial file read and check.

In [3]:
df_mont = pd.read_csv('../data/montgomery_dh_1.csv')
df_wiki = pd.read_csv('../data/wikipedia_celebrity_deaths_4.csv')

In [4]:
print df_mont.shape
print df_wiki.shape

(158512, 4)
(21458, 9)


### Kaggle dataset from Wikpedia

In [5]:
df_wiki.head()

Unnamed: 0,age,birth_year,cause_of_death,death_month,death_year,famous_for,name,nationality,fame_score
0,85,1921,natural causes,January,2006,businessman chairman of IBM (1973���1981),Frank Cary,American,1.0
1,49,1957,murdered,January,2006,musician (House of Freaks Gutterball),Bryan Harvey,American,2.0
2,64,1942,Alzheimer's disease,January,2006,baseball player (Oakland Athletics),Paul Lindblad,American,1.0
3,86,1920,Alzheimer's disease,January,2006,politician Representative from Oregon (1957��...,Charles O. Porter,American,2.0
4,82,1924,cancer,January,2006,nightclub owner (Tropicana Club),Ofelia Fox,Cuban,


### Initial Cleaning

#### Nationality Field

In [6]:
df_wiki['nationality'] = df_wiki.nationality.map(lambda text: text.replace("-born", ""))

In [7]:
keyword = "parliament"
df_wiki[df_wiki.famous_for.map(lambda text: keyword in str(text))]

Unnamed: 0,age,birth_year,cause_of_death,death_month,death_year,famous_for,name,nationality,fame_score
882,62,1945,cancer,August,2007,member of parliament,Bahaedin Adab,Iranian,2
6521,59,1952,,April,2011,parliamentary leader of One Nation (2001���2004),Bill Flynn,Australian,2
8287,88,1924,,February,2012,member of parliament (1973���1977),Reidar T. Larsen,Norwegian,3
13458,79,1935,stroke,April,2014,politician Senator for Tasmania (1975���2005)...,Brian Harradine,Australian,14


#### Months to Numerical

In [8]:
month_to_num = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}

In [9]:
df_wiki['death_month'] = df_wiki.death_month.map(lambda text: month_to_num[text])

In [10]:
df_wiki['death_month'].value_counts()

1     2063
12    1902
4     1832
10    1819
3     1819
11    1781
7     1756
8     1729
6     1728
5     1718
2     1683
9     1628
dtype: int64

# Feature exploration to find potential features for cleaning

### Finding keywords in famous_for descriptions:

In [11]:
keyword = "born"
df_wiki[df_wiki.nationality.map(lambda text: keyword in str(text))]

Unnamed: 0,age,birth_year,cause_of_death,death_month,death_year,famous_for,name,nationality,fame_score
10201,64,1949,,1,2013,Australian singer (The Marbles),Trevor Gordon,British���born,7


This is more efficient:

In [12]:
all_text_string = ""
word_counter = {}
df_wiki['fame_text'] = df_wiki.famous_for.map(
    lambda text: " ".join([w.translate(None, string.punctuation) for w in str(text).split(" ")]))

for idx, row in df_wiki.iterrows():
    all_text_string += row['fame_text']

In [13]:
def get_word_counts(text_block):
    word_list = nltk.word_tokenize(text_block)
    word_counts = {}
    for word in word_list:
        if word in word_counts.keys():
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    print "Number of words:", len(word_counts)
    return word_counts

In [14]:
%%time
word_counter = get_word_counts(all_text_string)

Number of words: 16526
CPU times: user 51.3 s, sys: 565 ms, total: 51.8 s
Wall time: 52.6 s


In [15]:
word_count_table = pd.DataFrame.from_dict(word_counter, orient='index')
word_count_table.reset_index(inplace=True)
word_count_table.columns = ['word', 'count']

In [16]:
word_count_table['part_of_speech'] = word_count_table.word.map(
    lambda word: nltk.pos_tag([word])[0][1])

In [17]:
noun_type_list = ["NN", "NP"]
word_count_table[word_count_table.part_of_speech.map(
        lambda e: e in noun_type_list)].sort('count', ascending=False).head(300)

Unnamed: 0,word,count,part_of_speech
702,player,2649,NN
10984,politician,1788,NN
965,actor,1551,NN
7978,footballer,1243,NN
13385,football,1221,NN
16217,member,1216,NN
15032,Olympic,1147,NN
14201,actress,881,NN
13530,baseball,776,NN
5067,Bishop,770,NN


Notes:
- "MP" (more frequently used) stands for "member of parlianemt" (less frequently used)

### Nationality mapping

Using list compiled from [here](https://github.com/Dinu/country-nationality-list/blob/master/countries.csv) which in turn grabs from [Wikipedia's page](https://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations) for adjectival and demonymic forms for countries and nations.

In [18]:
natl_mapping = pd.read_csv('../data/countries.csv')
print natl_mapping.shape
natl_mapping.head()

(249, 5)


Unnamed: 0,num_code,alpha_2_code,alpha_3_code,en_short_name,nationality
0,4,AF,AFG,Afghanistan,Afghan
1,248,AX,ALA,Åland Islands,Åland Island
2,8,AL,ALB,Albania,Albanian
3,12,DZ,DZA,Algeria,Algerian
4,16,AS,ASM,American Samoa,American Samoan


In [19]:
text_filter = (lambda text: " or " in text)
idx_with_text = natl_mapping.nationality.map(text_filter)
natl_mapping[idx_with_text]['nationality_text_list'] = \
   natl_mapping[idx_with_text].nationality.map(lambda text: text.replace(" or ", ", "))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


List of nationalities:

In [20]:
"""
Run once
"""
natl_mapping.nationality = \
    natl_mapping.nationality.map(lambda text: text.replace(" or ", ", ").split(", "))

In [21]:
nationality_list = [label 
                    for label_list in natl_mapping.nationality.values 
                    for label in label_list]
Counter(nationality_list)
# Chinese, American, Dominican, Congolese, Channel Island

Counter({'Chinese': 3, 'American': 2, 'Dominican': 2, 'Congolese': 2, 'Channel Island': 2, 'Cook Island': 1, 'South Sudanese': 1, 'Zimbabwean': 1, 'Salvadoran': 1, 'Panamanian': 1, 'Serbian': 1, 'Chadian': 1, 'Grenadian': 1, 'Palauan': 1, 'Kittitian': 1, 'Malinese': 1, 'Saint Helenian': 1, 'Saint-Martinoise': 1, 'Central African': 1, 'Kyrgyz': 1, 'Aruban': 1, 'Kenyan': 1, 'New Zealand': 1, 'Argentine': 1, 'Belgian': 1, 'Bulgarian': 1, 'Azeri': 1, 'UK': 1, 'Timorese': 1, 'Polish': 1, 'Herzegovinian': 1, 'Seychellois': 1, 'French': 1, 'Peruvian': 1, 'Fijian': 1, 'Liberian': 1, 'Armenian': 1, 'Guyanese': 1, 'Monacan': 1, 'Qatari': 1, 'Haitian': 1, 'Vincentian': 1, 'Samoan': 1, 'Mauritanian': 1, 'Finnish': 1, 'Canadian': 1, 'Manx': 1, 'Albanian': 1, 'Omani': 1, 'Comoran': 1, 'Romanian': 1, 'Maltese': 1, 'Andorran': 1, 'Bahraini': 1, 'Martinican': 1, 'Vatican': 1, 'Wallis and Futuna': 1, 'Kirgiz': 1, 'NZ': 1, 'Emirian': 1, 'Italian': 1, 'Malian': 1, 'Ivorian': 1, 'Norfolk Island': 1, 'Saudi

In [22]:
natl_mapping.head()

Unnamed: 0,num_code,alpha_2_code,alpha_3_code,en_short_name,nationality
0,4,AF,AFG,Afghanistan,[Afghan]
1,248,AX,ALA,Åland Islands,[Åland Island]
2,8,AL,ALB,Albania,[Albanian]
3,12,DZ,DZA,Algeria,[Algerian]
4,16,AS,ASM,American Samoa,[American Samoan]


In [23]:
nationality_map_dict = {}
for idx, row, in natl_mapping.iterrows():
    for label in row['nationality']:
        row_info_entry = [row['num_code'], row['alpha_2_code'],
                          row['alpha_3_code'], row['en_short_name']]
        if label in nationality_map_dict.keys():
            nationality_map_dict[label].append(row_info_entry)
        else:
            nationality_map_dict[label] = [row_info_entry]

In [24]:
nationality_map_dict['British']

[[826, 'GB', 'GBR', 'United Kingdom of Great Britain and Northern Ireland']]

### Nationalities from Kaggle Wikipedia page scrapes

Checking for those with nationalities matched and not matched:

In [25]:
print len(df_wiki[df_wiki.nationality.map(lambda text: text in nationality_list)])
print len(df_wiki[df_wiki.nationality.map(lambda text: text not in nationality_list)])

19515
1943


In [26]:
df_wiki[df_wiki.nationality.map(lambda text: text not in nationality_list)].nationality.value_counts()

English                 628
Scottish                237
New                     183
South                   136
Welsh                    86
Northern                 43
Soviet                   34
Sri                      34
Puerto                   34
dies                     28
Hong                     22
football                 21
Argentinian              17
2012                     12
Solomon                  12
Costa                    12
North                    11
Luxembourgian            10
Saint                     9
Papua                     9
Dies                      9
Cook                      8
East                      7
Burkinab̩                 7
Painter                   7
United                    7
dead                      6
Yugoslav                  6
politician                6
former                    6
                       ... 
sailor\rCamille           1
humanitarian              1
Indiana"                  1
Ceramic                   1
Abstract            

The dataset from Kaggle has some serious issues due to missed delimiters and/or bad parsing. For instance, a New-Zealand nationality clearly throws off the nationality and famous_for columns. Whatever parsing method was used was not particularly robust at separating fields or did not fully utilize Wikipedia HTML tags/metadata.

In [27]:
df_wiki[(df_wiki.famous_for.map(lambda x: "Zealand" in str(x)))]

Unnamed: 0,age,birth_year,cause_of_death,death_month,death_year,famous_for,name,nationality,fame_score,fame_text
522,76,1931,kidney failure,2,2007,Zealand-born American politician and Lieutena...,Leo T. McCarthy,New,5,Zealandborn American politician and Lieutenan...
676,43,1964,cancer,5,2007,Zealand film director (In My Father's Den),Brad McGann,New,5,Zealand film director In My Fathers Den
895,68,1939,lung disease,8,2007,Zealand politician mayor of Wanganui (1986���...,Chas Poynter,New,8,Zealand politician mayor of Wanganui 1986���2004
917,92,1915,,8,2007,Zealand industrialist (Fletcher Challenge),Sir James Fletcher,New,4,Zealand industrialist Fletcher Challenge
1156,67,1940,cancer,12,2007,Zealand public servant Chief Ombudsman (2003�...,John Belgrave,New,6,Zealand public servant Chief Ombudsman 2003��...
1285,88,1920,heart failure,1,2008,Zealand mountaineer and the first person (wit...,Sir Edmund Hillary,New,117,Zealand mountaineer and the first person with...
1521,91,1917,,3,2008,Zealand cricket captain (1952���1953),Merv Wallace,New,10,Zealand cricket captain 1952���1953
1565,68,1940,,4,2008,Zealand rugby union player (All Blacks),Tony Davies,New,2,Zealand rugby union player All Blacks
1597,78,1930,after long illness,4,2008,Zealand district judge (1975���1990),Augusta Wallace,New,7,Zealand district judge 1975���1990
1758,82,1926,stroke,6,2008,Zealand Mayor of Auckland (1980���1983),Colin Kay,New,8,Zealand Mayor of Auckland 1980���1983


DH Montgomery's dataset is fine but the Kaggle dataset is not. Since the resulting dataset from Kaggle wasn't produced with a particularly robust parser, maybe it's a better idea to build a custom scraper.

# Write data out

In [29]:
df_wiki.to_csv('../out/cleaned_celeb_deaths_wiki_1.csv', index=False)