# The Bachelor & Race
* Filename: clean_data.ipynb
* Author: Angelina Li
* Date: 08/20/2018
* Description: Clean data for use in other notebooks

### Questions
1. Does the Bachelor have a representation problem?
    * How many people of color (POC) have appeared on the show across all seasons, in total and disaggregated by race? What percentage of all contestants is that?
    * How has the percentage of POC changed over time? Has it changed differently for different racial categories?
    * How has the total number of POC changed over time?
    * How does the share of POC compare to overall US race demographics? Which groups are overrepresented and which are underrepresented?

In [10]:
import csv
import requests
import pandas as pd
import os

from bs4 import BeautifulSoup

input_dir = "../input"
intermed_dir = "../intermediate"
output_dir = "../output"
for file_dir in [input_dir, intermed_dir, output_dir]:
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(file_dir, exist_ok=True)

In [2]:
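def is_data_file(fl):
    """Return True if the path has one of the data extensions we care about."""
    data_exts = [".csv", ".xlsx", ".html", ".htm", ".pdf", ".png"]
    return os.path.splitext(fl)[1].lower() in data_exts

def get_all_files(path):
    """Recursively collect the data files under path."""
    files = []
    for fl in os.listdir(path):
        filepath = os.path.join(path, fl)
        # check for directories first so a folder is never mistaken for a file
        if os.path.isdir(filepath):
            files += get_all_files(filepath)
        elif is_data_file(filepath):
            files.append(filepath)
    return files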

all_files = get_all_files(input_dir)
for fl in all_files:
    print(fl)

../input/538/bachelorette.csv
../input/race/bachelorette_2015.pdf
../input/race/bachelorette_2016.pdf
../input/race/bachelorette_2017.pdf
../input/race/bachelorette_2018.pdf
../input/race/karenx_data.csv
../input/race/Minorities-Bachelor-2016.png
../input/race/splinter_black_contestants.pdf
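
The recursive walk above can also be written with `pathlib`. This is a minimal sketch rather than the notebook's own method: `find_data_files` and `DATA_EXTS` are hypothetical names, and the result is sorted so that its order does not depend on how the file system lists directories.

```python
from pathlib import Path

# Hypothetical constant: the same extensions the notebook treats as data files.
DATA_EXTS = {".csv", ".xlsx", ".html", ".htm", ".pdf", ".png"}

def find_data_files(root):
    """Return all data files under root (recursively), sorted by path."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in DATA_EXTS
    )
```

Note that sorting changes the order relative to the listing printed above, so any code that indexes into the list by position would need to be checked.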


### Reformat 538 Bachelorette Data
* To get race data (even via visual inspection), we need a cleaned, contestant-level dataset of bachelor/ette contestants & lead roles.
* **Objective: Get unique list of contestant names and ids**

In [3]:
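# build the path explicitly; all_files[0] depends on directory listing order
path_538 = os.path.join(input_dir, "538", "bachelorette.csv")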
df_538 = pd.read_csv(path_538)
print(df_538.columns)
df_538.head()

Index(['SHOW', 'SEASON', 'CONTESTANT', 'ELIMINATION-1', 'ELIMINATION-2',
       'ELIMINATION-3', 'ELIMINATION-4', 'ELIMINATION-5', 'ELIMINATION-6',
       'ELIMINATION-7', 'ELIMINATION-8', 'ELIMINATION-9', 'ELIMINATION-10',
       'DATES-1', 'DATES-2', 'DATES-3', 'DATES-4', 'DATES-5', 'DATES-6',
       'DATES-7', 'DATES-8', 'DATES-9', 'DATES-10'],
      dtype='object')


Unnamed: 0,SHOW,SEASON,CONTESTANT,ELIMINATION-1,ELIMINATION-2,ELIMINATION-3,ELIMINATION-4,ELIMINATION-5,ELIMINATION-6,ELIMINATION-7,...,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
0,SHOW,SEASON,ID,1,2,3,4,5,6,7,...,1.0,2,3,4,5,6,7,8,9,10
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,,D6,D13,D1,D7,D1,D1,D1,D1
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,,D1,D6,D13,D9,D7,D1,D1,D1,D1
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,,D10,D8,D13,D9,D1,D3,D1,D1,
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,,D8,D8,D1,D9,D7,D1,D1,,


In [4]:
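# first fixes: drop non-data rows; lowercase column names
df_538.columns = df_538.columns.str.lower()
# the raw file contains header-like rows (contestant == "ID",
# season == "SEASON"), such as row 0 above; drop them
df_538 = df_538[(df_538["contestant"] != "ID") &
                (df_538["season"] != "SEASON")]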
print(df_538.columns)
df_538.head()

Index(['show', 'season', 'contestant', 'elimination-1', 'elimination-2',
       'elimination-3', 'elimination-4', 'elimination-5', 'elimination-6',
       'elimination-7', 'elimination-8', 'elimination-9', 'elimination-10',
       'dates-1', 'dates-2', 'dates-3', 'dates-4', 'dates-5', 'dates-6',
       'dates-7', 'dates-8', 'dates-9', 'dates-10'],
      dtype='object')


Unnamed: 0,show,season,contestant,elimination-1,elimination-2,elimination-3,elimination-4,elimination-5,elimination-6,elimination-7,...,dates-1,dates-2,dates-3,dates-4,dates-5,dates-6,dates-7,dates-8,dates-9,dates-10
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,,D6,D13,D1,D7,D1,D1,D1,D1
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,,D1,D6,D13,D9,D7,D1,D1,D1,D1
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,,D10,D8,D13,D9,D1,D3,D1,D1,
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,,D8,D8,D1,D9,D7,D1,D1,,
5,Bachelorette,13,13_ADAM_G,,,,,,,ED,...,,D10,D8,D13,D9,D7,D3,,,


In [5]:
# second fixes: extract contestant name from the contestant id
def get_name(cid):
    # ids look like "13_BRYAN_A": season number first, then the name parts
    name = cid.split("_")[1:]
    return " ".join(part.capitalize() for part in name)

df_538 = df_538.rename(columns={"contestant": "cid"})
df_538["name"] = df_538["cid"].map(get_name)
df_538.head()

Unnamed: 0,show,season,cid,elimination-1,elimination-2,elimination-3,elimination-4,elimination-5,elimination-6,elimination-7,...,dates-2,dates-3,dates-4,dates-5,dates-6,dates-7,dates-8,dates-9,dates-10,name
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,D6,D13,D1,D7,D1,D1,D1,D1,Bryan A
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,D1,D6,D13,D9,D7,D1,D1,D1,D1,Peter K
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,D10,D8,D13,D9,D1,D3,D1,D1,,Eric B
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,D8,D8,D1,D9,D7,D1,D1,,,Dean U
5,Bachelorette,13,13_ADAM_G,,,,,,,ED,...,D10,D8,D13,D9,D7,D3,,,,Adam G
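
Matching against a source that lists only first names needs the id split into its parts, not just a display name. The following is a sketch under one assumption: that every id follows the `<season>_<FIRST>_<REST>` pattern seen in the sample rows, which has not been checked against the full file. `ContestantId` and `parse_cid` are hypothetical names.

```python
from typing import NamedTuple

class ContestantId(NamedTuple):
    season: int
    first_name: str
    rest: str  # whatever follows the first name, e.g. a last initial

def parse_cid(cid):
    """Split an id such as "13_BRYAN_A" into season, first name, and remainder."""
    season, *name_parts = cid.split("_")
    if not name_parts:
        raise ValueError(f"unexpected contestant id: {cid!r}")
    first, *rest = (part.capitalize() for part in name_parts)
    return ContestantId(int(season), first, " ".join(rest))
```

A function like this could be applied to the `cid` column with `Series.map`, in the same way `get_name` is applied to build the `name` column.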


In [6]:
# third (final) fix: extract a simpler table of show, season, cid,
# and name for each contestant.
df_538_names = df_538[["show", "season", "cid", "name"]]
df_538_names.head()

Unnamed: 0,show,season,cid,name
1,Bachelorette,13,13_BRYAN_A,Bryan A
2,Bachelorette,13,13_PETER_K,Peter K
3,Bachelorette,13,13_ERIC_B,Eric B
4,Bachelorette,13,13_DEAN_U,Dean U
5,Bachelorette,13,13_ADAM_G,Adam G


In [7]:
df_538_names.to_csv(os.path.join(intermed_dir, "538_cand_names.csv"))

### Merge in Race Data

* [karenx](http://www.karenx.com/blog/minorities-on-the-bachelor-when-do-they-get-eliminated/)'s fantastic blog post lists contestants by first name, season year, and lead. We want to match this data up to the data we already have from 538.
* **Objective: Match karenx's data with 538 df**
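
One way the match could proceed, sketched under assumptions: karenx's rows carry a lead and a year but no season number, so each row is first looked up in a table of seasons keyed by lead and year, and the resulting (show, season, first name) key is then looked up among the 538 ids. `match_contestants` is a hypothetical helper working on plain dicts, not the notebook's implementation.

```python
def match_contestants(kx_rows, season_rows, cid_rows):
    """Add a "cid" to each karenx-style row; None when there is no unique match."""
    # (lead, year) -> (show, season number)
    season_by_lead = {
        (s["lead"], s["year"]): (s["show"], s["season"]) for s in season_rows
    }
    # (show, season, first name) -> every id sharing that key
    ids_by_key = {}
    for c in cid_rows:
        key = (c["show"], c["season"], c["first_name"])
        ids_by_key.setdefault(key, []).append(c["cid"])

    matched = []
    for row in kx_rows:
        show_season = season_by_lead.get((row["lead"], row["year"]))
        candidates = []
        if show_season is not None:
            candidates = ids_by_key.get(show_season + (row["f_name"],), [])
        # keep the id only when the first name is unambiguous within the season
        cid = candidates[0] if len(candidates) == 1 else None
        matched.append({**row, "cid": cid})
    return matched


# Toy inputs. The lead, year, and season number appear in this notebook's
# tables; the id "05_GREG_X" is a made-up placeholder.
kx_rows = [{"f_name": "Greg", "year": 2009, "lead": "Jillian Harris"}]
season_rows = [{"show": "Bachelorette", "season": 5, "year": 2009,
                "lead": "Jillian Harris"}]
cid_rows = [{"show": "Bachelorette", "season": 5, "first_name": "Greg",
             "cid": "05_GREG_X"}]

print(match_contestants(kx_rows, season_rows, cid_rows))
```

Rows left with `cid` set to `None` are the ones that would need manual review, for example two contestants with the same first name in one season. In pandas the same logic corresponds roughly to two `DataFrame.merge` calls, provided the key columns have matching types.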

In [8]:
path_kx = os.path.join(input_dir, "race", "karenx_data.csv")
df_kx = pd.read_csv(path_kx)
df_kx.columns = df_kx.columns.str.strip()
df_kx.head()

Unnamed: 0,f_name,year,lead
0,Julie,2009,Jason Mesnick
1,Greg,2009,Jillian Harris
2,Channy,2010,Jake Pavelka
3,Roberto,2010,Ali Fedotowsky
4,Dianna,2012,Ben Flajnik


In [32]:
# grab season, year, show, lead data from wikipedia
def get_page_soup(url):
    """Fetch url and return it as a BeautifulSoup object, or None on failure."""
    try:
        resp = requests.get(url)
        # treat HTTP error statuses (404, 500, ...) as failures too
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")
    except requests.exceptions.RequestException as e:
        print("Couldn't get soup object for url", url, "-", e)
        return None

br_wiki = "https://en.wikipedia.org/wiki/The_Bachelor_(U.S._TV_series)"
be_wiki = "https://en.wikipedia.org/wiki/The_Bachelorette"
br_soup = get_page_soup(br_wiki)
be_soup = get_page_soup(be_wiki)
print(br_soup.prettify()[:500])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Bachelor (U.S. TV series) - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"The_Bachelor_(


In [35]:
# get bachelor dataframe object
def get_seasons(soup):
    return soup.find(id="Seasons").find_parent("h2").find_next_sibling("table", class_="wikitable")

def get_season_data(soup):
    """ returns headers and data associated with seasons data for soup obj """
    seasons = get_seasons(soup)
    header_row = seasons.find("tr")
    headers = [header.text.strip() for header in header_row.find_all("th")]
    data_rows = header_row.find_next_siblings("tr")
    return headers, data_rows

def get_data_df(soup):
    headers, data_rows = get_season_data(soup)
    all_data = []
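    for row in data_rows:
        data = [val.text.strip() for val in row.find_all("td")]
        # Some rows lack their own "Number of contestants" cell (presumably
        # because one cell spans several rows), so the 4th value is not a
        # number. In that case, reuse the previous row's count.
        if not data[3].isnumeric():
            num_contest = all_data[-1].get("Number of contestants")
            data = data[:3] + [num_contest] + data[3:]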
        data_dict = dict(zip(headers, data))
        all_data.append(data_dict)
    return pd.DataFrame(all_data)

br_df = get_data_df(br_soup)
br_df.head()

Unnamed: 0,#,Bachelor,Number of contestants,Original Run,Proposal,Relationship notes,Runner(s)-up,Still together,Winner
0,1,Alex Michel,25,"March 25–April 25, 2002",No,"Michel did not propose to Marsh, but instead t...",Trista Rehn,No,Amanda Marsh
1,2,Aaron Buerge,25,"September 25–November 20, 2002",Yes,Buerge and Eksterowicz broke up after several ...,Brooke Smith,No,Helene Eksterowicz
2,3,Andrew Firestone,25,"March 24–May 21, 2003",Yes,Schefft and Firestone broke up after several m...,Kirsten Buschbacher,No,Jen Schefft
3,4,Bob Guiney,25,"September 24–November 20, 2003",No,Guiney did not propose to Gardinier but she ac...,Kelly Jo Kuharski,No,Estella Gardinier
4,5,Jesse Palmer,25,"April 7–May 26, 2004",No,Palmer did not propose to Bowlin. They continu...,Tara Huckeby[21],No,Jessica Bowlin
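
The repair step inside `get_data_df` can be looked at in isolation. The sketch below works on plain lists of cell text and uses invented rows; `fill_missing_cell` is a hypothetical helper. Unlike the notebook's check, it detects a short row by its length rather than by testing whether the fourth value is numeric.

```python
def fill_missing_cell(rows, col, expected_len):
    """Re-insert a cell at index col in rows that are one cell short,
    copying the value from the nearest complete row above."""
    filled = []
    last_value = None
    for row in rows:
        if len(row) == expected_len:
            last_value = row[col]
            filled.append(list(row))
        elif len(row) == expected_len - 1 and last_value is not None:
            filled.append(row[:col] + [last_value] + row[col:])
        else:
            raise ValueError(f"cannot repair row: {row!r}")
    return filled


# Invented rows for illustration: the second one is missing its count cell.
rows = [
    ["1", "Lead A", "2002", "25", "No"],
    ["2", "Lead B", "2002", "Yes"],
]
print(fill_missing_cell(rows, col=3, expected_len=5))
```

Checking the row length fails loudly on rows that are malformed in some other way, whereas the numeric test would shift their values silently.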


In [36]:
be_df = get_data_df(be_soup)
be_df.head()

Unnamed: 0,#,Bachelorette,Number of contestants,Original run,Proposal,Relationship,Runner-up,Still together,Winner
0,1,Trista Rehn,25,"January 8–February 19, 2003",Yes,"Rehn and Sutter were married on December 6, 20...",Charlie Maher,Yes,Ryan Sutter
1,2,Meredith Phillips,25,"January 14–February 26, 2004",Yes,Phillips and McKee were engaged at the end of ...,Matthew Hickl,No,Ian Mckee
2,3,Jen Schefft,25,"January 10–February 28, 2005",Yes[a],"During the first live final rose ceremony, Sch...",John Paul Merritt,No,Jerry Ferris
3,4,DeAnna Pappas,25,"May 19–July 7, 2008",Yes,Pappas chose Csincsak and their wedding was se...,Jason Mesnick,No,Jesse Csincsak
4,5,Jillian Harris,30,"May 18–July 28, 2009",Yes,"Harris, the first Canadian bachelorette, chose...",Kiptyn Locke,No,Ed Swiderski


In [37]:
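for df in [br_df, be_df]:
    # The new names are assigned by position and assume the alphabetical
    # column order shown above. Newer pandas versions keep the scraped header
    # order instead, so sort the columns first.
    df.sort_index(axis=1, inplace=True)
    df.columns = ["season_num", "lead", "num_contestants", "original_run", 
                  "proposal", "notes", "runner_up", "still_together", 
                  "winner"]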
br_df.head()

Unnamed: 0,season_num,lead,num_contestants,original_run,proposal,notes,runner_up,still_together,winner
0,1,Alex Michel,25,"March 25–April 25, 2002",No,"Michel did not propose to Marsh, but instead t...",Trista Rehn,No,Amanda Marsh
1,2,Aaron Buerge,25,"September 25–November 20, 2002",Yes,Buerge and Eksterowicz broke up after several ...,Brooke Smith,No,Helene Eksterowicz
2,3,Andrew Firestone,25,"March 24–May 21, 2003",Yes,Schefft and Firestone broke up after several m...,Kirsten Buschbacher,No,Jen Schefft
3,4,Bob Guiney,25,"September 24–November 20, 2003",No,Guiney did not propose to Gardinier but she ac...,Kelly Jo Kuharski,No,Estella Gardinier
4,5,Jesse Palmer,25,"April 7–May 26, 2004",No,Palmer did not propose to Bowlin. They continu...,Tara Huckeby[21],No,Jessica Bowlin
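
Renaming by position depends on the column order staying fixed. A name-based mapping is a possible alternative, sketched here on plain dicts. The header spellings are copied from the two tables above; `CANONICAL` and `rename_keys` are hypothetical names.

```python
# Scraped header -> canonical column name, covering both wiki tables.
CANONICAL = {
    "#": "season_num",
    "Bachelor": "lead",
    "Bachelorette": "lead",
    "Number of contestants": "num_contestants",
    "Original Run": "original_run",
    "Original run": "original_run",
    "Proposal": "proposal",
    "Relationship notes": "notes",
    "Relationship": "notes",
    "Runner(s)-up": "runner_up",
    "Runner-up": "runner_up",
    "Still together": "still_together",
    "Winner": "winner",
}

def rename_keys(record):
    """Return a copy of record with scraped headers replaced by canonical names."""
    unknown = set(record) - set(CANONICAL)
    if unknown:
        raise KeyError(f"unmapped headers: {sorted(unknown)}")
    return {CANONICAL[key]: value for key, value in record.items()}
```

The same dict could be passed to `DataFrame.rename(columns=CANONICAL)`, which by default ignores keys that are not present in the frame. Raising on unmapped headers, as `rename_keys` does, would surface any change in the headers on the Wikipedia page.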
