# Clean Data (Pt. 2)
* **Filename**: clean_race_data.ipynb
* **Author**: Angelina Li
* **Date**: 08/22/2018
* **Description**: Contribute additional racial demographic data to existing dataset
* **Input**: master_dataset.csv, independently collected data
* **Output**: Person-level (leads + contestants) dataset (master_dataset) with full racial demographic data flags; U.S. yearly racial demographics data.

### Sections
* [Helper Datasets](#hand-coding-input)
* [U.S. Demographic Data](#us-demographics)

In [1]:
import re
import pandas as pd
import os

In [2]:
# name key directories

input_dir = "../input"
intermed_dir = "../intermediate"
output_dir = "../output"

<a id="hand-coding-input"></a>
### Create Helper Datasets
* To speed up manual coding of race-based flags for all Bachelor/ette contestants and leads, I'm going to assume that the karenx dataset identified every POC in the seasons it examined. For those seasons, I will only independently categorize the people that the karenx dataset already identified.
* **Objective: Save two datasets: 1. People in the karenx dataset; 2. People in years not in the karenx dataset + leads**

In [3]:
# import the master dataset
master_path = os.path.join(intermed_dir, "master_dataset.csv")
df_master = pd.read_csv(master_path)
df_master.head()

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
0,BA_1_ALEX_M_L,,,,,,,,,,...,,,Alex Michel,True,Alex Michel,25,,1,Bachelor,2002
1,BA_2_AARON_B_L,,,,,,,,,,...,,,Aaron Buerge,True,Aaron Buerge,25,,2,Bachelor,2002
2,BA_3_ANDREW_F_L,,,,,,,,,,...,,,Andrew Firestone,True,Andrew Firestone,25,,3,Bachelor,2003
3,BA_4_BOB_G_L,,,,,,,,,,...,,,Bob Guiney,True,Bob Guiney,25,,4,Bachelor,2003
4,BA_5_JESSE_P_L,,,,,,,,,,...,,,Jesse Palmer,True,Jesse Palmer,25,,5,Bachelor,2004


In [5]:
# get people the karenx dataset flagged as POC (poc_flag is NaN for everyone else)
df_kx_poc = df_master[df_master.poc_flag == True]
df_kx_poc.head()

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
105,BE_11_IAN_T,,,D7,,D5,D9,,,,...,,,Kaitlyn Bristowe,False,Ian T,26,True,11,Bachelorette,2015
107,BE_11_JONATHAN_H,,,D7,D6,D8,,,,,...,,,Kaitlyn Bristowe,False,Jonathan H,26,True,11,Bachelorette,2015
114,BE_11_KUPAH_J,,,D8,,,,,,,...,,,Kaitlyn Bristowe,False,Kupah J,26,True,11,Bachelorette,2015
116,BE_11_DAVID_X,,,,,,,,,,...,,,Kaitlyn Bristowe,False,David X,26,True,11,Bachelorette,2015
130,BE_10_MARQUEL_M,,,D14,D12,D6,D9,,,,...,,,Andi Dorfman,False,Marquel M,25,True,10,Bachelorette,2014


In [6]:
# get dataset of people to review
# karenx coverage (range end is exclusive): Bachelor 2009-2016, Bachelorette 2009-2015
kx_ba_years = range(2009, 2017)
kx_be_years = range(2009, 2016)

# rows from show-years outside the karenx coverage
is_not_ba_years = ~df_master.year.isin(kx_ba_years) & (df_master.show == "Bachelor")
is_not_be_years = ~df_master.year.isin(kx_be_years) & (df_master.show == "Bachelorette")
# leads are always reviewed, regardless of year
is_lead = df_master.lead_flag == True

df_review = df_master[is_not_ba_years | is_not_be_years | is_lead]
df_review.head()

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
0,BA_1_ALEX_M_L,,,,,,,,,,...,,,Alex Michel,True,Alex Michel,25,,1,Bachelor,2002
1,BA_2_AARON_B_L,,,,,,,,,,...,,,Aaron Buerge,True,Aaron Buerge,25,,2,Bachelor,2002
2,BA_3_ANDREW_F_L,,,,,,,,,,...,,,Andrew Firestone,True,Andrew Firestone,25,,3,Bachelor,2003
3,BA_4_BOB_G_L,,,,,,,,,,...,,,Bob Guiney,True,Bob Guiney,25,,4,Bachelor,2003
4,BA_5_JESSE_P_L,,,,,,,,,,...,,,Jesse Palmer,True,Jesse Palmer,25,,5,Bachelor,2004


In [7]:
# check there are no already identified POC in df_review
df_review[df_review.poc_flag == True]

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
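
The visual check above could also be written as an assertion, so that it fails loudly if an already-flagged row ever ends up in the review set. A minimal sketch on a toy frame (`df_review_toy` and its rows are invented for illustration; they stand in for `df_review`):

```python
import pandas as pd

# Toy stand-in for df_review; poc_flag is missing (NaN) for unreviewed people.
df_review_toy = pd.DataFrame(
    {
        "cid": ["BA_1_ALEX_M_L", "BA_2_AARON_B_L"],
        "poc_flag": [float("nan"), float("nan")],
    }
)

# Comparing NaN to True yields False, so unreviewed rows are not counted.
n_flagged = int((df_review_toy.poc_flag == True).sum())
assert n_flagged == 0, f"{n_flagged} already-flagged rows found in review set"
```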


In [8]:
# save it all!
kx_poc_path = os.path.join(intermed_dir, "karenx_poc.csv")
review_path = os.path.join(intermed_dir, "review_poc.csv")

df_kx_poc.to_csv(kx_poc_path)
df_review.to_csv(review_path)
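
One detail worth keeping in mind when these files are read back: by default `to_csv` writes the DataFrame index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. This may be where the `Unnamed: 0` column in the outputs above comes from, though that is an assumption. A minimal sketch of the round trip, using a toy frame and in-memory buffers rather than the real files:

```python
import io
import pandas as pd

# Toy frame for illustration only.
df_toy = pd.DataFrame({"cid": ["BA_1_ALEX_M_L"], "year": [2002]})

# Default behaviour: the index is written as an unnamed first column.
buf_default = io.StringIO()
df_toy.to_csv(buf_default)
buf_default.seek(0)
cols_default = list(pd.read_csv(buf_default).columns)

# With index=False the index is not written and the columns round-trip unchanged.
buf_clean = io.StringIO()
df_toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
cols_clean = list(pd.read_csv(buf_clean).columns)
```

Whether to pass `index=False` here depends on whether downstream notebooks expect the extra column, so treat this as an option rather than a required change.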

<a id="us-demographics"></a>
### Grab yearly U.S.-based racial demographics data
* It might be interesting to normalize Bachelorette race data with U.S.-wide yearly race data.
* **Objective: Pull and import yearly U.S.-wide race data for years in dataset; interpolate data for missing years**
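
The interpolation step in the objective could look roughly like the sketch below. The `pct_white` column and its values are made-up placeholders (not IPUMS figures), and linear interpolation is an assumption, since the method is not specified here.

```python
import pandas as pd

# Hypothetical placeholder values, NOT real IPUMS data.
df_known = pd.DataFrame(
    {"year": [2002, 2004, 2007], "pct_white": [70.0, 68.0, 65.0]}
)

# Reindex onto every year in the range so missing years show up as NaN.
all_years = pd.Index(
    range(int(df_known.year.min()), int(df_known.year.max()) + 1), name="year"
)
df_yearly = df_known.set_index("year").reindex(all_years)

# Fill the gaps linearly between the nearest known years.
df_yearly["pct_white"] = df_yearly["pct_white"].interpolate(method="linear")
```

Reindexing first makes the missing years explicit rows, so the interpolated series lines up one-to-one with the years in `df_master`.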

In [9]:
# determine which years of race data to source from IPUMS
df_master.year.unique()

array([2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
       2013, 2014, 2015, 2016, 2017, 2018])