# The Bachelor & Race
* Filename: clean_data.ipynb
* Author: Angelina Li
* Date: 08/20/2018
* Description: Clean data for use in other notebooks

### Questions
1. Does the Bachelor have a representation problem?
    * How many POC are there on the show in total across seasons? Disaggregated by race? What percentage?
    * How has the percentage of POC changed over time? Has it differentially changed for different racial categories?
    * How has the total number of POC changed over time?
    * How does the number of POC compare to overall US race demographics? Which groups are represented more and which are represented less?

In [1]:
import csv
import re
import requests
import pandas as pd
import os

from bs4 import BeautifulSoup

In [2]:
# name key directories

input_dir = "../input"
intermed_dir = "../intermediate"
output_dir = "../output"
for file_dir in [input_dir, intermed_dir, output_dir]:
    os.makedirs(file_dir, exist_ok=True)  # no-op if the directory already exists

### Reformat 538 Bachelorette Data
* To get race data (even via visual inspection), we need a cleaned, candidate-level dataset of Bachelor/Bachelorette contestants and leads.
* **Objective: Get unique list of contestant names and ids**

In [3]:
# import 538 data
path_538 = os.path.join(input_dir, "538", "bachelorette.csv")
df_538 = pd.read_csv(path_538)
print("Columns:", df_538.columns)
df_538.head()

Columns: Index(['SHOW', 'SEASON', 'CONTESTANT', 'ELIMINATION-1', 'ELIMINATION-2',
       'ELIMINATION-3', 'ELIMINATION-4', 'ELIMINATION-5', 'ELIMINATION-6',
       'ELIMINATION-7', 'ELIMINATION-8', 'ELIMINATION-9', 'ELIMINATION-10',
       'DATES-1', 'DATES-2', 'DATES-3', 'DATES-4', 'DATES-5', 'DATES-6',
       'DATES-7', 'DATES-8', 'DATES-9', 'DATES-10'],
      dtype='object')


Unnamed: 0,SHOW,SEASON,CONTESTANT,ELIMINATION-1,ELIMINATION-2,ELIMINATION-3,ELIMINATION-4,ELIMINATION-5,ELIMINATION-6,ELIMINATION-7,...,DATES-1,DATES-2,DATES-3,DATES-4,DATES-5,DATES-6,DATES-7,DATES-8,DATES-9,DATES-10
0,SHOW,SEASON,ID,1,2,3,4,5,6,7,...,1.0,2,3,4,5,6,7,8,9,10
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,,D6,D13,D1,D7,D1,D1,D1,D1
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,,D1,D6,D13,D9,D7,D1,D1,D1,D1
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,,D10,D8,D13,D9,D1,D3,D1,D1,
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,,D8,D8,D1,D9,D7,D1,D1,,


In [4]:
# 1. drop non data rows; clean columns
def clean_col_name(name):
    name = name.lower()
    changemap = {"elimination": "e", "dates": "d"}
    for curr, new in changemap.items():
        name = name.replace(curr, new)
    return name
    
df_538.columns = map(clean_col_name, df_538.columns)
df_538 = df_538[~(df_538["contestant"] == "ID") & 
                ~(df_538["season"] == "SEASON")]
print(df_538.columns)
df_538.head()

Index(['show', 'season', 'contestant', 'e-1', 'e-2', 'e-3', 'e-4', 'e-5',
       'e-6', 'e-7', 'e-8', 'e-9', 'e-10', 'd-1', 'd-2', 'd-3', 'd-4', 'd-5',
       'd-6', 'd-7', 'd-8', 'd-9', 'd-10'],
      dtype='object')


Unnamed: 0,show,season,contestant,e-1,e-2,e-3,e-4,e-5,e-6,e-7,...,d-1,d-2,d-3,d-4,d-5,d-6,d-7,d-8,d-9,d-10
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,,,D6,D13,D1,D7,D1,D1,D1,D1
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,,D1,D6,D13,D9,D7,D1,D1,D1,D1
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,,D10,D8,D13,D9,D1,D3,D1,D1,
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,,D8,D8,D1,D9,D7,D1,D1,,
5,Bachelorette,13,13_ADAM_G,,,,,,,ED,...,,D10,D8,D13,D9,D7,D3,,,
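
The same two steps (shorten the column names, drop the embedded header rows) can also be written with vectorized pandas calls. This is only a sketch on a toy frame: `toy`, `shorten`, and `is_header` are illustrative names rather than part of the notebook, and the toy rows merely mimic the shape of the 538 file.

```python
import pandas as pd

# Toy stand-in for the 538 file: a header row is repeated inside the data.
toy = pd.DataFrame({
    "SHOW": ["SHOW", "Bachelorette", "Bachelorette"],
    "SEASON": ["SEASON", "13", "13"],
    "CONTESTANT": ["ID", "13_BRYAN_A", "13_PETER_K"],
    "ELIMINATION-1": ["1", "R1", None],
    "DATES-1": ["1", None, None],
})

def shorten(name):
    """Lowercase a column name and abbreviate the two long prefixes."""
    return name.lower().replace("elimination", "e").replace("dates", "d")

# A row is an embedded header if either marker value shows up in it.
is_header = toy["CONTESTANT"].eq("ID") | toy["SEASON"].eq("SEASON")

clean = (
    toy.loc[~is_header]
    .rename(columns=shorten)
    .reset_index(drop=True)
)
print(clean)
```

Building the mask once and negating it reads a little more directly than combining two negated comparisons, but the filter is equivalent.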


In [5]:
# 2. extract contestant name; clean errors
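def clean_name_errors(row):
    """Return the corrected contestant id for known typos in the 538 data."""
    cid = row["cid"]
    # only the typos identified so far
    changemap = {
        "06_ROBERT_M": "06_ROBERTO_M",
        "17_SLEMA_A": "17_SELMA_A",
        "09_JUAN_G": "09_JUAN_PABLO_G"
    }
    # NOTE: not yet applied to df_538, so the outputs below still use the raw ids
    return changemap.get(cid, cid)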
    
def get_name(row):
    cid = row["cid"]
    name = cid.split("_")[1:]
    return " ".join(map(lambda x: x.capitalize(), name))

df_538 = df_538.rename(columns={"contestant": "cid"})
df_538["name"] = df_538.apply(get_name, axis=1)

get_fname = lambda row: " ".join(row["name"].split()[:-1])
df_538["f_name"] = df_538.apply(get_fname, axis=1)
df_538.head()

Unnamed: 0,show,season,cid,e-1,e-2,e-3,e-4,e-5,e-6,e-7,...,d-3,d-4,d-5,d-6,d-7,d-8,d-9,d-10,name,f_name
1,Bachelorette,13,13_BRYAN_A,R1,,,R,R,,R,...,D6,D13,D1,D7,D1,D1,D1,D1,Bryan A,Bryan
2,Bachelorette,13,13_PETER_K,,R,,,,R,R,...,D6,D13,D9,D7,D1,D1,D1,D1,Peter K,Peter
3,Bachelorette,13,13_ERIC_B,,,R,,,R,R,...,D8,D13,D9,D1,D3,D1,D1,,Eric B,Eric
4,Bachelorette,13,13_DEAN_U,,R,,R,,,R,...,D8,D1,D9,D7,D1,D1,,,Dean U,Dean
5,Bachelorette,13,13_ADAM_G,,,,,,,ED,...,D8,D13,D9,D7,D3,,,,Adam G,Adam
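
One way the id corrections could be applied before the names are derived is sketched below. It assumes the keys of the correction map are the misspelled ids as they appear in the 538 file; `CID_FIXES` and `add_name_columns` are hypothetical names, and the three-row frame is toy input.

```python
import pandas as pd

# Corrections copied from the changemap in the cell above.
CID_FIXES = {
    "06_ROBERT_M": "06_ROBERTO_M",
    "17_SLEMA_A": "17_SELMA_A",
    "09_JUAN_G": "09_JUAN_PABLO_G",
}

def title_join(words):
    """Join words with spaces, capitalizing each one."""
    return " ".join(word.capitalize() for word in words)

def add_name_columns(df):
    """Hypothetical helper: fix known cid typos, then derive name and f_name."""
    out = df.copy()
    out["cid"] = out["cid"].replace(CID_FIXES)
    # Drop the leading season number: "09_JUAN_PABLO_G" -> ["JUAN", "PABLO", "G"]
    parts = out["cid"].str.split("_").str[1:]
    out["name"] = parts.apply(title_join)
    # Everything except the trailing initial counts as the first name.
    out["f_name"] = parts.apply(lambda words: title_join(words[:-1]))
    return out

toy = pd.DataFrame({"cid": ["13_BRYAN_A", "09_JUAN_G", "17_SLEMA_A"]})
result = add_name_columns(toy)
print(result)
```

Correcting `cid` first means `name` and `f_name` inherit the fix, which matters later because the merge with the race data keys on `f_name`.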


In [6]:
# 3. extract a simpler list of show, season, cid, and name of candidates.
df_538_names = df_538[["show", "season", "cid", "name"]]
df_538_names.head()

Unnamed: 0,show,season,cid,name
1,Bachelorette,13,13_BRYAN_A,Bryan A
2,Bachelorette,13,13_PETER_K,Peter K
3,Bachelorette,13,13_ERIC_B,Eric B
4,Bachelorette,13,13_DEAN_U,Dean U
5,Bachelorette,13,13_ADAM_G,Adam G


In [7]:
df_538_names.to_csv(os.path.join(intermed_dir, "538_cand_names.csv"))

### Merge in Race Data

* [karenx](http://www.karenx.com/blog/minorities-on-the-bachelor-when-do-they-get-eliminated/)'s fantastic blog post lists candidates by first name, season year, and lead. We want to match this data to the data we already have from 538.
* **Objective: Match karenx's data with 538 df**

In [8]:
# import karenx data
path_kx = os.path.join(input_dir, "race", "karenx_data.csv")
df_kx = pd.read_csv(path_kx)
df_kx.columns = [col.strip().lower() for col in df_kx.columns]

# strip stray whitespace from the string columns
for col in ["f_name", "lead"]:
    df_kx[col] = df_kx[col].str.strip()

def clean_fname(row):
    """Return the corrected first name for known misspellings."""
    f_name = row.f_name
    changemap = {"Kupa": "Kupah"}
    # NOTE: not yet applied to df_kx
    return changemap.get(f_name, f_name)

df_kx.head()

Unnamed: 0,f_name,year,lead
0,Julie,2009,Jason Mesnick
1,Greg,2009,Jillian Harris
2,Channy,2010,Jake Pavelka
3,Roberto,2010,Ali Fedotowsky
4,Dianna,2012,Ben Flajnik


In [9]:
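# grab season, year, show, lead data from wikipedia
def get_page_soup(url):
    """Fetch url and return a BeautifulSoup object, or None if the request fails."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # treat HTTP error codes as failures too
        return BeautifulSoup(resp.text, "html.parser")
    except requests.exceptions.RequestException as e:
        print("Couldn't fetch soup object for url", url, "-", e)
        return None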

wiki_br = "https://en.wikipedia.org/wiki/The_Bachelor_(U.S._TV_series)"
wiki_be = "https://en.wikipedia.org/wiki/The_Bachelorette"
soup_br = get_page_soup(wiki_br)
soup_be = get_page_soup(wiki_be)
print(soup_br.prettify()[:500])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Bachelor (U.S. TV series) - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"The_Bachelor_(


In [10]:
# get bachelor dataframe object
def get_seasons(soup):
    return soup.find(id="Seasons").find_parent("h2").find_next_sibling("table", class_="wikitable")

def get_season_data(soup):
    """ returns headers and data associated with seasons data for soup obj """
    seasons = get_seasons(soup)
    header_row = seasons.find("tr")
    headers = [header.text.strip() for header in header_row.find_all("th")]
    data_rows = header_row.find_next_siblings("tr")
    return headers, data_rows

def get_data_df(soup):
    headers, data_rows = get_season_data(soup)
    all_data = []
    for row in data_rows:
        data = [val.text.strip() for val in row.find_all("td")]
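        # Some rows omit the contestant count (presumably a cell merged with
        # the row above), which leaves data[3] non-numeric; reuse the previous
        # row's count so every row lines up with the headers.
        if not data[3].isnumeric():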
            num_contest = all_data[-1].get("Number of contestants")
            data = data[:3] + [num_contest] + data[3:]
        data_dict = dict(zip(headers, data))
        all_data.append(data_dict)
    return pd.DataFrame(all_data)

df_br = get_data_df(soup_br)
df_br.head()

Unnamed: 0,#,Bachelor,Number of contestants,Original Run,Proposal,Relationship notes,Runner(s)-up,Still together,Winner
0,1,Alex Michel,25,"March 25–April 25, 2002",No,"Michel did not propose to Marsh, but instead t...",Trista Rehn,No,Amanda Marsh
1,2,Aaron Buerge,25,"September 25–November 20, 2002",Yes,Buerge and Eksterowicz broke up after several ...,Brooke Smith,No,Helene Eksterowicz
2,3,Andrew Firestone,25,"March 24–May 21, 2003",Yes,Schefft and Firestone broke up after several m...,Kirsten Buschbacher,No,Jen Schefft
3,4,Bob Guiney,25,"September 24–November 20, 2003",No,Guiney did not propose to Gardinier but she ac...,Kelly Jo Kuharski,No,Estella Gardinier
4,5,Jesse Palmer,25,"April 7–May 26, 2004",No,Palmer did not propose to Bowlin. They continu...,Tara Huckeby[21],No,Jessica Bowlin
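
The row repair inside `get_data_df` can be looked at in isolation. The sketch below works on plain lists of cell text instead of BeautifulSoup rows; `fill_spanned_cell` is a hypothetical name, the rows are placeholder data, and the assumption that the gap comes from a merged table cell is a guess about the page layout.

```python
def fill_spanned_cell(rows, col=3):
    """Re-insert a cell that later rows are missing.

    rows: list of lists of cell text, in table order.
    col:  position where a numeric cell is expected (3 mirrors the cell above).
    """
    fixed = []
    for cells in rows:
        if len(cells) <= col or not cells[col].isnumeric():
            if not fixed:
                raise ValueError("first row has no numeric cell to copy from")
            # Copy the value from the previous (already repaired) row.
            cells = cells[:col] + [fixed[-1][col]] + cells[col:]
        fixed.append(cells)
    return fixed

# Placeholder rows: the second one lacks its count at position 3.
rows = [
    ["1", "Lead A", "2002", "25", "Winner A"],
    ["2", "Lead B", "2002", "Winner B"],
]
fixed = fill_spanned_cell(rows)
print(fixed[1])
```

Raising on a malformed first row makes the failure explicit, where indexing into an empty list would only surface as an `IndexError`.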


In [11]:
df_be = get_data_df(soup_be)
df_be.head()

Unnamed: 0,#,Bachelorette,Number of contestants,Original run,Proposal,Relationship,Runner-up,Still together,Winner
0,1,Trista Rehn,25,"January 8–February 19, 2003",Yes,"Rehn and Sutter were married on December 6, 20...",Charlie Maher,Yes,Ryan Sutter
1,2,Meredith Phillips,25,"January 14–February 26, 2004",Yes,Phillips and McKee were engaged at the end of ...,Matthew Hickl,No,Ian Mckee
2,3,Jen Schefft,25,"January 10–February 28, 2005",Yes[a],"During the first live final rose ceremony, Sch...",John Paul Merritt,No,Jerry Ferris
3,4,DeAnna Pappas,25,"May 19–July 7, 2008",Yes,Pappas chose Csincsak and their wedding was se...,Jason Mesnick,No,Jesse Csincsak
4,5,Jillian Harris,30,"May 18–July 28, 2009",Yes,"Harris, the first Canadian bachelorette, chose...",Kiptyn Locke,No,Ed Swiderski


In [12]:
wikiframes = [df_br, df_be]
for df in wikiframes:
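    # NOTE: the rename below is positional, so it assumes the column order
    # shown in the outputs above, with "show" appended last.
    df["show"] = "Bachelorette" if "Bachelorette" in df.columns else "Bachelor"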
    df.columns = ["season", "lead", "num_contestants", "original_run", 
                  "proposal", "notes", "runner_up", "still_together", 
                  "winner", "show"]
    
    clean_fn = lambda x: re.sub(r"\[\d+\]", "", x) # remove numeric footnotes, e.g. [21]
    get_year = lambda row: int(clean_fn(row["original_run"].split()[-1]))
    clean_lead = lambda row: re.sub(r"[^a-zA-Z0-9\s]+", "", clean_fn(row["lead"]))
    
    df["year"] = df.apply(get_year, axis=1)
    df["lead"] = df.apply(clean_lead, axis=1)
    
df_wiki = pd.concat(wikiframes)
df_wiki.head()

Unnamed: 0,season,lead,num_contestants,original_run,proposal,notes,runner_up,still_together,winner,show,year
0,1,Alex Michel,25,"March 25–April 25, 2002",No,"Michel did not propose to Marsh, but instead t...",Trista Rehn,No,Amanda Marsh,Bachelor,2002
1,2,Aaron Buerge,25,"September 25–November 20, 2002",Yes,Buerge and Eksterowicz broke up after several ...,Brooke Smith,No,Helene Eksterowicz,Bachelor,2002
2,3,Andrew Firestone,25,"March 24–May 21, 2003",Yes,Schefft and Firestone broke up after several m...,Kirsten Buschbacher,No,Jen Schefft,Bachelor,2003
3,4,Bob Guiney,25,"September 24–November 20, 2003",No,Guiney did not propose to Gardinier but she ac...,Kelly Jo Kuharski,No,Estella Gardinier,Bachelor,2003
4,5,Jesse Palmer,25,"April 7–May 26, 2004",No,Palmer did not propose to Bowlin. They continu...,Tara Huckeby[21],No,Jessica Bowlin,Bachelor,2004
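
The three string cleaners can be tried out on their own. The sketch below is a variation rather than the notebook's exact logic: its footnote pattern removes any bracketed marker, so it also strips lettered footnotes such as `[a]`, and it picks the last four-digit number instead of the last whitespace-separated token. The function names are illustrative.

```python
import re

# Matches any bracketed marker, e.g. [21] or [a] (broader than the cell above).
FOOTNOTE = re.compile(r"\[[^\]]+\]")

def strip_footnotes(text):
    """Remove bracketed footnote markers from a string."""
    return FOOTNOTE.sub("", text)

def run_year(original_run):
    """Return the last four-digit year mentioned in an 'original run' string."""
    years = re.findall(r"\b(\d{4})\b", strip_footnotes(original_run))
    return int(years[-1])

def clean_lead(lead):
    """Drop footnotes and punctuation so lead names compare cleanly."""
    return re.sub(r"[^a-zA-Z0-9\s]+", "", strip_footnotes(lead))

print(run_year("March 25–April 25, 2002"))
print(strip_footnotes("Tara Huckeby[21]"))
print(clean_lead("Charlie O'Connell"))
```

Raw string literals (`r"..."`) keep the backslashes in the patterns from being read as string escapes.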


In [13]:
for df in [df_kx, df_wiki]:
    df.year = df.year.astype(int)
    df.lead = df.lead.astype(str)
    print(df.lead.unique())

['Jason Mesnick' 'Jillian Harris' 'Jake Pavelka' 'Ali Fedotowsky'
 'Ben Flajnik' 'Emily Maynard' 'Sean Lowe' 'Desiree Hartsock'
 'Juan Pablo Galavis' 'Andi Dorfman' 'Chris Soules' 'Kaitlyn Bristowe'
 'Ben Higgins']
['Alex Michel' 'Aaron Buerge' 'Andrew Firestone' 'Bob Guiney'
 'Jesse Palmer' 'Byron Velvick' 'Charlie OConnell' 'Travis Lane Stork'
 'Lorenzo Borghese' 'Andrew Baldwin' 'Brad Womack' 'Matt Grant'
 'Jason Mesnick' 'Jake Pavelka' 'Ben Flajnik' 'Sean Lowe'
 'Juan Pablo Galavis' 'Chris Soules' 'Ben Higgins' 'Nick Viall'
 'Arie Luyendyk Jr' 'Trista Rehn' 'Meredith Phillips' 'Jen Schefft'
 'DeAnna Pappas' 'Jillian Harris' 'Ali Fedotowsky' 'Ashley Hebert'
 'Emily Maynard' 'Desiree Hartsock' 'Andi Dorfman' 'Kaitlyn Bristowe'
 'Joelle JoJo Fletcher' 'Rachel Lindsay' 'Becca Kufrin']
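
Rather than comparing the two printed lists by eye, a set difference shows whether any karenx lead lacks a Wikipedia counterpart. This is a sketch: `unmatched_keys` is a hypothetical helper, and the names are copied from the output above.

```python
def unmatched_keys(left_values, right_values):
    """Return the values in left_values that never appear in right_values."""
    return sorted(set(left_values) - set(right_values))

# Lead names copied from the printed output above.
kx_leads = [
    "Jason Mesnick", "Jillian Harris", "Jake Pavelka", "Ali Fedotowsky",
    "Ben Flajnik", "Emily Maynard", "Sean Lowe", "Desiree Hartsock",
    "Juan Pablo Galavis", "Andi Dorfman", "Chris Soules", "Kaitlyn Bristowe",
    "Ben Higgins",
]
wiki_leads = [
    "Alex Michel", "Aaron Buerge", "Andrew Firestone", "Bob Guiney",
    "Jesse Palmer", "Byron Velvick", "Charlie OConnell", "Travis Lane Stork",
    "Lorenzo Borghese", "Andrew Baldwin", "Brad Womack", "Matt Grant",
    "Jason Mesnick", "Jake Pavelka", "Ben Flajnik", "Sean Lowe",
    "Juan Pablo Galavis", "Chris Soules", "Ben Higgins", "Nick Viall",
    "Arie Luyendyk Jr", "Trista Rehn", "Meredith Phillips", "Jen Schefft",
    "DeAnna Pappas", "Jillian Harris", "Ali Fedotowsky", "Ashley Hebert",
    "Emily Maynard", "Desiree Hartsock", "Andi Dorfman", "Kaitlyn Bristowe",
    "Joelle JoJo Fletcher", "Rachel Lindsay", "Becca Kufrin",
]

missing_leads = unmatched_keys(kx_leads, wiki_leads)
print(missing_leads)
```

An empty result means every karenx lead name has an exact match, so the `lead` half of the merge key should not drop rows.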


In [14]:
# merge wikipedia season data onto karenx data
df_kx_merged = pd.merge(df_kx, df_wiki, how="left", on=["year", "lead"])
df_kx_merged.head()

Unnamed: 0,f_name,year,lead,season,num_contestants,original_run,proposal,notes,runner_up,still_together,winner,show
0,Julie,2009,Jason Mesnick,13,25,"January 5–March 3, 2009",Yes,"On the season's finale, Mesnick had called off...",Molly Malaney,No[a],Melissa Rycroft,Bachelor
1,Greg,2009,Jillian Harris,5,30,"May 18–July 28, 2009",Yes,"Harris, the first Canadian bachelorette, chose...",Kiptyn Locke,No,Ed Swiderski,Bachelorette
2,Channy,2010,Jake Pavelka,14,25,"January 4–March 1, 2010",Yes,Pavelka and Girardi ended their relationship i...,Tenley Molzahn,No,Vienna Girardi,Bachelor
3,Roberto,2010,Ali Fedotowsky,6,25,"May 24–August 2, 2010",Yes,Fedotowsky and Martinez got engaged in the sea...,Chris Lambton,No,Roberto Martinez,Bachelorette
4,Dianna,2012,Ben Flajnik,16,25,"January 2–March 12, 2012",Yes,Flajnik and Robertson originally broke up in F...,Lindzi Cox,No,Courtney Robertson,Bachelor


In [15]:
# merge karenx data with 538 data
df_kx_538 = pd.merge(df_kx_merged, df_538, how="left", on=["f_name", "season", "show"])
df_kx_538.head()

Unnamed: 0,f_name,year,lead,season,num_contestants,original_run,proposal,notes,runner_up,still_together,...,d-2,d-3,d-4,d-5,d-6,d-7,d-8,d-9,d-10,name
0,Julie,2009,Jason Mesnick,13,25,"January 5–March 3, 2009",Yes,"On the season's finale, Mesnick had called off...",Molly Malaney,No[a],...,,,,,,,,,,Julie D
1,Greg,2009,Jillian Harris,5,30,"May 18–July 28, 2009",Yes,"Harris, the first Canadian bachelorette, chose...",Kiptyn Locke,No,...,,,,,,,,,,
2,Channy,2010,Jake Pavelka,14,25,"January 4–March 1, 2010",Yes,Pavelka and Girardi ended their relationship i...,Tenley Molzahn,No,...,,,,,,,,,,Channy C
3,Roberto,2010,Ali Fedotowsky,6,25,"May 24–August 2, 2010",Yes,Fedotowsky and Martinez got engaged in the sea...,Chris Lambton,No,...,,,,,,,,,,
4,Dianna,2012,Ben Flajnik,16,25,"January 2–March 12, 2012",Yes,Flajnik and Robertson originally broke up in F...,Lindzi Cox,No,...,,,,,,,,,,Dianna M


In [27]:
# check the merge: list karenx rows with no matching 538 contestant
df_kx_missing = df_kx_538[df_kx_538.cid.isnull()]
df_kx_missing[["show", "season", "f_name", "year", "lead"]]

Unnamed: 0,show,season,f_name,year,lead
1,Bachelorette,5,Greg,2009,Jillian Harris
3,Bachelorette,6,Roberto,2010,Ali Fedotowsky
5,Bachelorette,8,Lerone,2012,Emily Maynard
6,Bachelorette,8,Alejandro,2012,Emily Maynard
12,Bachelor,17,Selma,2013,Sean Lowe
14,Bachelorette,9,Diogo,2013,Desiree Hartsock
15,Bachelorette,9,Mike,2013,Desiree Hartsock
16,Bachelorette,9,Will,2013,Desiree Hartsock
17,Bachelorette,9,Mikey,2013,Desiree Hartsock
18,Bachelorette,9,Juan Pablo,2013,Desiree Hartsock
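
An alternative to testing `cid` for nulls is pandas' `indicator=True` option, which labels each merged row with where it came from. The sketch below uses two toy frames; the rows are illustrative only and show how a misspelled first name on the right-hand side surfaces as `left_only`.

```python
import pandas as pd

# Toy frames; the rows are illustrative, not the real data.
left = pd.DataFrame({
    "f_name": ["Julie", "Selma"],
    "season": ["13", "17"],
    "show": ["Bachelor", "Bachelor"],
})
right = pd.DataFrame({
    "f_name": ["Julie", "Slema"],  # "Slema" mimics a typo in the source file
    "season": ["13", "17"],
    "show": ["Bachelor", "Bachelor"],
    "cid": ["13_JULIE_D", "17_SLEMA_A"],
})

merged = left.merge(
    right, how="left", on=["f_name", "season", "show"], indicator=True
)
# Rows flagged "left_only" found no partner in the right-hand frame.
missing = merged.loc[merged["_merge"] == "left_only", ["show", "season", "f_name"]]
print(missing)
```

The `_merge` column also separates real non-matches from matched rows whose right-hand columns happen to be empty, which a null check on one column cannot do.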

