# The Bachelor & Race
* Filename: clean_data.ipynb
* Author: Angelina Li
* Date: 08/20/2018
* Description: Clean data for use in other notebooks

### Questions
1. Does the Bachelor have a representation problem?
    * How many POC are there on the show in total across seasons? Disaggregated by race? What percentage?
    * How has the percentage of POC changed over time? Has it differentially changed for different racial categories?
    * How has the total number of POC changed over time?
    * How does the number of POC compare to overall US race demographics? Which groups are represented more and which are represented less?

In [12]:
import csv
import pandas as pd
import os
import mwclient

input_dir = "../input"
intermed_dir = "../intermediate"
output_dir = "../output"
for file_dir in [input_dir, intermed_dir, output_dir]:
    if not os.path.exists(file_dir):
        os.makedirs(file_dir)

In [2]:
def is_data_file(fl):
    # compare the extension case-insensitively so e.g. "X.CSV" also counts
    data_exts = {".csv", ".xlsx", ".html", ".htm", ".pdf", ".png"}
    return os.path.splitext(fl)[1].lower() in data_exts

def get_all_files(path):
    files = []
    for fl in os.listdir(path):
        filepath = os.path.join(path, fl)
        if is_data_file(filepath):
            files.append(filepath)
        elif os.path.isdir(filepath):
            files += get_all_files(filepath)
    return files

all_files = get_all_files(input_dir)
for fl in all_files:
    print(fl)

../input/538/bachelorette.csv
../input/race/bachelorette_2015.pdf
../input/race/bachelorette_2016.pdf
../input/race/bachelorette_2017.pdf
../input/race/bachelorette_2018.pdf
../input/race/Minorities-Bachelor-2016.png
../input/race/splinter_black_contestants.pdf
../input/season_year.csv
../input/wikipedia/bachelor/The Bachelor (season 1) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 10) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 11) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 12) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 13) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 14) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 15) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 16) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 17) - Wikipedia.htm
../input/wikipedia/bachelor/The Bachelor (season 18) - Wikipedia.htm
../input/wikipedia/bache

### Reformat 538 Bachelorette Data
* To get race data (even via visual inspection), we need a cleaned, contestant-level dataset of Bachelor/ette contestants and leads.
* **Objective: Get unique list of contestant names and ids**

In [3]:
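# build the path explicitly: os.listdir returns entries in arbitrary
# order, so all_files[0] is not guaranteed to be the 538 file
path_538 = os.path.join(input_dir, "538", "bachelorette.csv")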
df_538 = pd.read_csv(path_538)
print(df_538.columns)
df_538.head()

Index(['SHOW', 'SEASON', 'CONTESTANT', 'ELIMINATION-1', 'ELIMINATION-2',
       'ELIMINATION-3', 'ELIMINATION-4', 'ELIMINATION-5', 'ELIMINATION-6',
       'ELIMINATION-7', 'ELIMINATION-8', 'ELIMINATION-9', 'ELIMINATION-10',
       'DATES-1', 'DATES-2', 'DATES-3', 'DATES-4', 'DATES-5', 'DATES-6',
       'DATES-7', 'DATES-8', 'DATES-9', 'DATES-10'],
      dtype='object')


Unnamed: 0,SHOW,SEASON,CONTESTANT,ELIMINATION-1,ELIMINATION-2,ELIMINATION-3,ELIMINATION-4,ELIMINATION-5,ELIMINATION-6,ELIMINATION-7,...,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
0,SHOW,SEASON,ID,1,2,3,4,5,6,7,...,1.0,2,3,4,5,6,7,8,9,10
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,,D6,D13,D1,D7,D1,D1,D1,D1
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,,D1,D6,D13,D9,D7,D1,D1,D1,D1
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,,D10,D8,D13,D9,D1,D3,D1,D1,
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,,D8,D8,D1,D9,D7,D1,D1,,


In [5]:
# first fixes: lowercase the column names, then drop the non-data
# header rows embedded in the file
df_538.columns = df_538.columns.str.lower()
df_538 = df_538[~(df_538["contestant"] == "ID") &
                ~(df_538["season"] == "SEASON")]
print(df_538.columns)
df_538.head()

Index(['show', 'season', 'contestant', 'elimination-1', 'elimination-2',
       'elimination-3', 'elimination-4', 'elimination-5', 'elimination-6',
       'elimination-7', 'elimination-8', 'elimination-9', 'elimination-10',
       'dates-1', 'dates-2', 'dates-3', 'dates-4', 'dates-5', 'dates-6',
       'dates-7', 'dates-8', 'dates-9', 'dates-10'],
      dtype='object')


Unnamed: 0,show,season,contestant,elimination-1,elimination-2,elimination-3,elimination-4,elimination-5,elimination-6,elimination-7,...,dates-1,dates-2,dates-3,dates-4,dates-5,dates-6,dates-7,dates-8,dates-9,dates-10
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,,D6,D13,D1,D7,D1,D1,D1,D1
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,,D1,D6,D13,D9,D7,D1,D1,D1,D1
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,,D10,D8,D13,D9,D1,D3,D1,D1,
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,,D8,D8,D1,D9,D7,D1,D1,,
5,Bachelorette,13,13_ADAM_G,,,,,,,ED,...,,D10,D8,D13,D9,D7,D3,,,


In [6]:
# second fix: derive a readable contestant name from the ID,
# e.g. "13_BRYAN_A" -> "Bryan A" (the leading token is the season number)
def get_name(cid):
    name_parts = cid.split("_")[1:]
    return " ".join(part.capitalize() for part in name_parts)

df_538 = df_538.rename(columns={"contestant": "cid"})
df_538["name"] = df_538["cid"].apply(get_name)
df_538.head()

Unnamed: 0,show,season,cid,elimination-1,elimination-2,elimination-3,elimination-4,elimination-5,elimination-6,elimination-7,...,dates-2,dates-3,dates-4,dates-5,dates-6,dates-7,dates-8,dates-9,dates-10,name
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,D6,D13,D1,D7,D1,D1,D1,D1,Bryan A
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,D1,D6,D13,D9,D7,D1,D1,D1,D1,Peter K
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,D10,D8,D13,D9,D1,D3,D1,D1,,Eric B
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,D8,D8,D1,D9,D7,D1,D1,,,Dean U
5,Bachelorette,13,13_ADAM_G,,,,,,,ED,...,D10,D8,D13,D9,D7,D3,,,,Adam G


In [7]:
# third (final) fix: keep only show, season, cid, and name for each
# contestant; .copy() avoids working on a view of df_538
df_538_names = df_538[["show", "season", "cid", "name"]].copy()
df_538_names.head()

Unnamed: 0,show,season,cid,name
1,Bachelorette,13,13_BRYAN_A,Bryan A
2,Bachelorette,13,13_PETER_K,Peter K
3,Bachelorette,13,13_ERIC_B,Eric B
4,Bachelorette,13,13_DEAN_U,Dean U
5,Bachelorette,13,13_ADAM_G,Adam G


In [8]:
df_538_names.to_csv(os.path.join(intermed_dir, "538_cand_names.csv"))
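
The objective for this section is a *unique* list of contestant names and ids, so it may be worth checking for duplicates before relying on the saved file. Below is a minimal sketch on toy rows (the values are made up and only the column layout mirrors `df_538_names`). It assumes the natural key is `(show, season, cid)`, since a `cid` such as `13_BRYAN_A` encodes the season but not the show; `find_duplicate_ids` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy rows shaped like df_538_names; the third row deliberately repeats the first.
names = pd.DataFrame({
    "show": ["Bachelorette", "Bachelorette", "Bachelorette"],
    "season": ["13", "13", "13"],
    "cid": ["13_BRYAN_A", "13_PETER_K", "13_BRYAN_A"],
    "name": ["Bryan A", "Peter K", "Bryan A"],
})

KEY = ["show", "season", "cid"]  # assumed natural key

def find_duplicate_ids(df, key=KEY):
    """Return every row whose key appears more than once."""
    return df[df.duplicated(subset=key, keep=False)]

dupes = find_duplicate_ids(names)
deduped = names.drop_duplicates(subset=KEY)  # keeps the first occurrence
print(len(dupes), len(deduped))
```

If `find_duplicate_ids(df_538_names)` comes back empty, the list is already unique on that key and no de-duplication step is needed.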

### Merge in Race Data

* [karenx](http://www.karenx.com/blog/minorities-on-the-bachelor-when-do-they-get-eliminated/)'s fantastic blog post lists contestants by first name, season year, and lead. We want to match this data up to the data we already have from 538.
* **Objective: Match karenx's data with 538 df**
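
One way the match could work, sketched on toy data: derive a first name from the 538 `name` column, attach a year to each season, and left-join karenx's rows on first name and year. Everything below is an assumption for illustration. The values are made up, and the `show`/`season`/`year` layout of the season lookup is hypothetical (the real `season_year.csv` may look different).

```python
import pandas as pd

# Toy stand-ins with made-up values; only the column layout mirrors the notebook.
names_538 = pd.DataFrame({
    "show": ["Bachelorette", "Bachelorette"],
    "season": [99, 99],
    "cid": ["99_ALEX_A", "99_SAM_B"],
    "name": ["Alex A", "Sam B"],
})
# Hypothetical season-to-year lookup (column names are assumed).
season_year = pd.DataFrame({"show": ["Bachelorette"], "season": [99], "year": [2099]})
kx = pd.DataFrame({
    "f_name": ["Alex", "Jo"],
    "year": [2099, 2098],
    "lead": ["Lead One", "Lead Two"],
})

# Derive the first name so both sides share a key, then attach the year.
names_538["f_name"] = names_538["name"].str.split().str[0]
with_year = names_538.merge(season_year, on=["show", "season"], how="left")

# Left-join from karenx's list; indicator=True adds a _merge column that
# flags rows with no 538 counterpart.
matched = kx.merge(with_year, on=["f_name", "year"], how="left", indicator=True)
unmatched = matched[matched["_merge"] == "left_only"]
print(matched[["f_name", "year", "lead", "cid", "_merge"]])
```

First name plus year will not always identify one person: two shows can air in the same year and a season can have several contestants with the same first name, so any key that matches more than one row would still need a manual check against `lead`. The join keys also need matching dtypes on both sides.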

In [24]:
path_kx = os.path.join(input_dir, "race", "karenx_data.csv")
df_kx = pd.read_csv(path_kx)
df_kx.columns = df_kx.columns.str.strip()
df_kx.head()

Unnamed: 0,f_name,year,lead
0,Julie,2009,Jason Mesnick
1,Greg,2009,Jillian Harris
2,Channy,2010,Jake Pavelka
3,Roberto,2010,Ali Fedotowsky
4,Dianna,2012,Ben Flajnik


In [27]:
# grab season, year, show, lead data from wikipedia; as a first check,
# confirm that every year in karenx's data appears in the page text
def get_page(page_name):
    return mwclient.Site("en.wikipedia.org").pages[page_name]

bachelor = get_page("The_Bachelor_(U.S._TV_series)")
bach_text = bachelor.text()
for year in df_kx.year.unique():
    print(year, str(year) in bach_text)

2009 True
2010 True
2012 True
2013 True
2014 True
2015 True
2016 True
