# Clean Data (Pt. 2)
* **Filename**: clean_race_data.ipynb
* **Author**: Angelina Li
* **Date**: 08/22/2018
* **Description**: Add racial demographic data to the existing dataset
* **Input**: master_dataset.csv, independently collected data
* **Output**: Person-level (leads + contestants) dataset (master_dataset) with full racial demographic data flags; U.S. yearly racial demographics data.

### Sections
* [Helper Datasets](#hand-coding-input)
* [U.S. Demographic Data](#us-demographics)

In [1]:
import re
import pandas as pd
import os

In [2]:
# name key directories

input_dir = "../input"
intermed_dir = "../intermediate"
output_dir = "../output"

<a id="hand-coding-input"></a>
### Create Helper Datasets
* To speed up manual coding of race-based flags for all Bachelor/ette contestants and leads, I'm going to assume that the karenx dataset identified every POC in the seasons it examined. For those years, I will only independently categorize the people that the karenx dataset already identified.
* **Objective: Save two datasets: (1) people in the karenx dataset; (2) people in years not covered by the karenx dataset, plus all leads.**

In [3]:
# import the master dataset
master_path = os.path.join(intermed_dir, "master_dataset.csv")
df_master = pd.read_csv(master_path)
df_master.head()

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
0,BA_1_ALEX_M_L,,,,,,,,,,...,,,Alex Michel,True,Alex Michel,25,,1,Bachelor,2002
1,BA_2_AARON_B_L,,,,,,,,,,...,,,Aaron Buerge,True,Aaron Buerge,25,,2,Bachelor,2002
2,BA_3_ANDREW_F_L,,,,,,,,,,...,,,Andrew Firestone,True,Andrew Firestone,25,,3,Bachelor,2003
3,BA_4_BOB_G_L,,,,,,,,,,...,,,Bob Guiney,True,Bob Guiney,25,,4,Bachelor,2003
4,BA_5_JESSE_P_L,,,,,,,,,,...,,,Jesse Palmer,True,Jesse Palmer,25,,5,Bachelor,2004


In [4]:
# get people in the karenx dataset
df_kx_poc = df_master[df_master.poc_flag == True]
df_kx_poc.head()

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
105,BE_11_IAN_T,,,D7,,D5,D9,,,,...,,,Kaitlyn Bristowe,False,Ian T,26,True,11,Bachelorette,2015
107,BE_11_JONATHAN_H,,,D7,D6,D8,,,,,...,,,Kaitlyn Bristowe,False,Jonathan H,26,True,11,Bachelorette,2015
114,BE_11_KUPAH_J,,,D8,,,,,,,...,,,Kaitlyn Bristowe,False,Kupah J,26,True,11,Bachelorette,2015
116,BE_11_DAVID_X,,,,,,,,,,...,,,Kaitlyn Bristowe,False,David X,26,True,11,Bachelorette,2015
130,BE_10_MARQUEL_M,,,D14,D12,D6,D9,,,,...,,,Andi Dorfman,False,Marquel M,25,True,10,Bachelorette,2014


In [5]:
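# get dataset of people to review: everyone outside the karenx years, plus all leads
kx_ba_years = range(2009, 2017)  # Bachelor: 2009-2016 (range end is exclusive)
kx_be_years = range(2009, 2016)  # Bachelorette: 2009-2015

# ~ binds tighter than &, so each mask reads (not in karenx years) & (show matches)
is_not_ba_years = ~df_master.year.isin(kx_ba_years) & (df_master.show == "Bachelor")
is_not_be_years = ~df_master.year.isin(kx_be_years) & (df_master.show == "Bachelorette")
is_lead = df_master.lead_flag == True

df_review = df_master[is_not_ba_years | is_not_be_years | is_lead]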
df_review.head()

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year
0,BA_1_ALEX_M_L,,,,,,,,,,...,,,Alex Michel,True,Alex Michel,25,,1,Bachelor,2002
1,BA_2_AARON_B_L,,,,,,,,,,...,,,Aaron Buerge,True,Aaron Buerge,25,,2,Bachelor,2002
2,BA_3_ANDREW_F_L,,,,,,,,,,...,,,Andrew Firestone,True,Andrew Firestone,25,,3,Bachelor,2003
3,BA_4_BOB_G_L,,,,,,,,,,...,,,Bob Guiney,True,Bob Guiney,25,,4,Bachelor,2003
4,BA_5_JESSE_P_L,,,,,,,,,,...,,,Jesse Palmer,True,Jesse Palmer,25,,5,Bachelor,2004
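
To make the selection rule easier to check by eye, here is a minimal sketch of the same masks applied to a tiny made-up table. The names `df_toy` and `df_review_toy` and every row in it are hypothetical; only the year ranges and the mask logic mirror the cell above.

```python
import pandas as pd

# Hypothetical rows, chosen to hit each branch of the rule.
df_toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "show": ["Bachelor", "Bachelor", "Bachelorette", "Bachelorette", "Bachelor"],
    "year": [2008, 2016, 2016, 2015, 2012],
    "lead_flag": [False, False, False, False, True],
})

kx_ba_years = range(2009, 2017)  # 2009..2016
kx_be_years = range(2009, 2016)  # 2009..2015

# Same structure as the notebook cell: (year not covered) & (show matches).
is_not_ba_years = ~df_toy["year"].isin(kx_ba_years) & (df_toy["show"] == "Bachelor")
is_not_be_years = ~df_toy["year"].isin(kx_be_years) & (df_toy["show"] == "Bachelorette")
is_lead = df_toy["lead_flag"]

df_review_toy = df_toy[is_not_ba_years | is_not_be_years | is_lead]
reviewed = df_review_toy["name"].tolist()
print(reviewed)
```

Rows A and C are kept because their years fall outside the covered ranges for their show, and E is kept because it is a lead even though 2012 is a covered year.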


In [6]:
# check there are no already identified POC in df_review
df_review[df_review.poc_flag == True]

Unnamed: 0,cid,d1,d10,d2,d3,d4,d5,d6,d7,d8,...,e8,e9,lead,lead_flag,name,num_contestants,poc_flag,season,show,year


In [7]:
# save it all!
kx_poc_path = os.path.join(intermed_dir, "karenx_poc.csv")
review_path = os.path.join(intermed_dir, "review_poc.csv")

df_kx_poc.to_csv(kx_poc_path)
df_review.to_csv(review_path)

<a id="us-demographics"></a>
### Grab yearly U.S.-based racial demographics data
* It might be interesting to normalize Bachelor/ette race data against U.S.-wide yearly race data.
* **Objective: Pull and import yearly U.S.-wide race data for the years in the dataset; interpolate data for missing years**
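
As a rough sketch of the interpolation step named in the objective: assuming the yearly data ends up in a DataFrame indexed by year, missing years could be added with `reindex` and filled with `interpolate`. The frame `df_toy`, its numbers, and the year span below are made up for illustration. Note that `limit_direction="both"` simply holds the nearest observed value flat for years before the first or after the last observation, which is a crude assumption rather than a real extrapolation.

```python
import pandas as pd

# Made-up stand-in for a year-indexed demographics table with a gap at 2007.
df_toy = pd.DataFrame(
    {"total": [100.0, 110.0, 130.0], "nh_white_alone": [70.0, 72.0, 76.0]},
    index=pd.Index([2005, 2006, 2008], name="year"),
)

# Reindex to every year of interest; years without data become NaN rows.
all_years = pd.Index(range(2004, 2010), name="year")
df_filled = df_toy.reindex(index=all_years)

# Linear interpolation fills the interior gap; limit_direction="both" also
# fills the leading and trailing years with the nearest observed value.
df_filled = df_filled.interpolate(method="linear", limit_direction="both")
print(df_filled)
```

If the early and late years matter for the analysis, fitting a trend and extrapolating from it may be more defensible than flat edges.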

In [8]:
# determine which years of race data to source from IPUMS
df_master.year.unique()

array([2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
       2013, 2014, 2015, 2016, 2017, 2018])

In [9]:
census_dir = os.path.join(input_dir, "race", "census")
census_fls = os.listdir(census_dir)
print(census_fls)

['ACS_05_EST_B03002.txt', 'ACS_05_EST_B03002_metadata.csv', 'ACS_05_EST_B03002_with_ann.csv', 'ACS_06_EST_B03002.txt', 'ACS_06_EST_B03002_metadata.csv', 'ACS_06_EST_B03002_with_ann.csv', 'ACS_07_1YR_B03002.txt', 'ACS_07_1YR_B03002_metadata.csv', 'ACS_07_1YR_B03002_with_ann.csv', 'ACS_08_1YR_B03002.txt', 'ACS_08_1YR_B03002_metadata.csv', 'ACS_08_1YR_B03002_with_ann.csv', 'ACS_09_1YR_B03002.txt', 'ACS_09_1YR_B03002_metadata.csv', 'ACS_09_1YR_B03002_with_ann.csv', 'ACS_10_1YR_B03002.txt', 'ACS_10_1YR_B03002_metadata.csv', 'ACS_10_1YR_B03002_with_ann.csv', 'ACS_11_1YR_B03002.txt', 'ACS_11_1YR_B03002_metadata.csv', 'ACS_11_1YR_B03002_with_ann.csv', 'ACS_12_1YR_B03002.txt', 'ACS_12_1YR_B03002_metadata.csv', 'ACS_12_1YR_B03002_with_ann.csv', 'ACS_13_1YR_B03002.txt', 'ACS_13_1YR_B03002_metadata.csv', 'ACS_13_1YR_B03002_with_ann.csv', 'ACS_14_1YR_B03002.txt', 'ACS_14_1YR_B03002_metadata.csv', 'ACS_14_1YR_B03002_with_ann.csv', 'ACS_15_1YR_B03002.txt', 'ACS_15_1YR_B03002_metadata.csv', 'ACS_15_1Y

In [13]:
# now we have a repository of datasets - time to compile them!
def is_data_file(fl):
    fn, ext = os.path.splitext(fl)
    return "metadata" not in fn and ext == ".csv"

def get_fp(fl):
    return os.path.join(census_dir, fl)

census_data_fls = list(map(get_fp, filter(is_data_file, census_fls)))
print(census_data_fls)

['../input/race/census/ACS_05_EST_B03002_with_ann.csv', '../input/race/census/ACS_06_EST_B03002_with_ann.csv', '../input/race/census/ACS_07_1YR_B03002_with_ann.csv', '../input/race/census/ACS_08_1YR_B03002_with_ann.csv', '../input/race/census/ACS_09_1YR_B03002_with_ann.csv', '../input/race/census/ACS_10_1YR_B03002_with_ann.csv', '../input/race/census/ACS_11_1YR_B03002_with_ann.csv', '../input/race/census/ACS_12_1YR_B03002_with_ann.csv', '../input/race/census/ACS_13_1YR_B03002_with_ann.csv', '../input/race/census/ACS_14_1YR_B03002_with_ann.csv', '../input/race/census/ACS_15_1YR_B03002_with_ann.csv', '../input/race/census/ACS_16_1YR_B03002_with_ann.csv']
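
For reference, a small sketch of how the filter treats the three kinds of file in the census folder, and how a survey year can be read back out of a file name. The three sample names are taken from the directory listing above; `year_from_name` is a hypothetical helper that assumes the two-digit year is always the second underscore-separated token.

```python
import os

def is_data_file(fl):
    # Keep CSVs whose base name does not mention "metadata".
    fn, ext = os.path.splitext(fl)
    return "metadata" not in fn and ext == ".csv"

def year_from_name(fl):
    # Hypothetical helper: "ACS_05_EST_..." -> 2005 (assumes years 2000-2099).
    return int(os.path.basename(fl).split("_")[1]) + 2000

sample = [
    "ACS_05_EST_B03002.txt",
    "ACS_05_EST_B03002_metadata.csv",
    "ACS_05_EST_B03002_with_ann.csv",
]

kept = [fl for fl in sample if is_data_file(fl)]
years = [year_from_name(fl) for fl in kept]
print(kept, years)
```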


In [43]:
def get_colname(colname):
    """Convert a raw ACS column header into a short snake_case name."""
    new_colname = colname
    if "Total" in colname:
        return "total"
    elif "Estimate;" in colname:
        # prefix is "nh" for "Not Hispanic or Latino" columns, "h" otherwise
        hl = "Not Hispanic" not in colname
        hl_stub = "h" if hl else "nh"
        # everything after the first " - " names the race category
        race_stub = colname.split(" - ")[1:]
        new_colname = " ".join([hl_stub] + race_stub)

    new_colname = new_colname.lower()
    # raw strings keep "\s" from being read as a string escape
    new_colname = re.sub(r"[^a-zA-Z0-9\s]+", "", new_colname)
    new_colname = re.sub(r"\s+", "_", new_colname)

    return new_colname

census_dfs = []
for data_file in census_data_fls:
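    # expects files named like '../input/race/census/ACS_{YEARSTUB}_{EST or 1YR}_B03002_with_ann.csv'
    year_stub = os.path.basename(data_file).split("_")[1]
    year = int(year_stub) + 2000  # two-digit stub, e.g. "05" -> 2005
    # skip the first line of the file so the next line (the "Estimate; ..." labels) becomes the header
    df_year = pd.read_csv(data_file, skiprows=[0])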
    
    # filter out margin of error
    df_year = df_year.filter(regex=r"^(?!Margin of Error).*$", axis=1)
    df_year["year"] = year
    df_year.columns = map(get_colname, df_year.columns)
    df_year = df_year.drop(["id", "id2", "geography"], axis=1) \
                     .set_index("year")
    
    census_dfs.append(df_year)
    
df_census = pd.concat(census_dfs)
df_census.head()

year,total,nh,nh_white_alone,nh_black_or_african_american_alone,nh_american_indian_and_alaska_native_alone,nh_asian_alone,nh_native_hawaiian_and_other_pacific_islander_alone,nh_some_other_race_alone,nh_two_or_more_races,nh_two_or_more_races_two_races_including_some_other_race,...,h,h_white_alone,h_black_or_african_american_alone,h_american_indian_and_alaska_native_alone,h_asian_alone,h_native_hawaiian_and_other_pacific_islander_alone,h_some_other_race_alone,h_two_or_more_races,h_two_or_more_races_two_races_including_some_other_race,h_two_or_more_races_two_races_excluding_some_other_race_and_three_or_more_races
2005,288378137,246507434,192615561,34364572,2046735,12312949,355513,777679,4034425,215466,...,41870703,22717833,597997,310809,158866,41517,16520922,1522759,1030575,492184
2006,299398485,255146207,198176991,36434530,2035551,12945401,387230,768782,4397722,231584,...,44252278,23154516,616953,333880,154694,38964,18238347,1714924,1158753,556171
2007,301621159,256193722,198553437,36657280,2019204,13077192,401932,715275,4769402,209750,...,45427437,24452046,677290,346143,156095,32743,18023509,1739611,1147301,592310
2008,304059728,257168272,198942886,36701103,1993622,13239894,402725,701823,5186219,238046,...,46891456,29239524,884947,449800,174082,25085,14290365,1827653,1006533,821120
2009,307006556,258649796,199325978,37144530,1975193,13627633,426897,676733,5472832,236409,...,48356760,30447153,949195,482359,146978,27104,14271630,2032341,1084815,947526
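
To show how the column names in the table above come about, here is a sketch that runs the renaming logic on a few sample headers. The sample header strings are my reconstruction from the function's logic and the resulting column names; they are not copied from the ACS files, so treat them as assumptions.

```python
import re

def get_colname(colname):
    # Same logic as the notebook's helper, repeated so this block runs alone.
    new_colname = colname
    if "Total" in colname:
        return "total"
    elif "Estimate;" in colname:
        hl_stub = "nh" if "Not Hispanic" in colname else "h"
        race_stub = colname.split(" - ")[1:]
        new_colname = " ".join([hl_stub] + race_stub)
    new_colname = new_colname.lower()
    new_colname = re.sub(r"[^a-zA-Z0-9\s]+", "", new_colname)
    return re.sub(r"\s+", "_", new_colname)

# Assumed header strings (reconstructed, not read from the source files).
samples = [
    "Id2",
    "Geography",
    "Estimate; Total:",
    "Estimate; Not Hispanic or Latino:",
    "Estimate; Not Hispanic or Latino: - White alone",
    "Estimate; Hispanic or Latino:",
    "Estimate; Hispanic or Latino: - Two or more races: - Two races including Some other race",
]

renamed = {s: get_colname(s) for s in samples}
for raw, clean in renamed.items():
    print(f"{raw!r} -> {clean!r}")
```

One thing to watch: any header containing the word "Total" collapses to `total`, so this scheme only works if exactly one retained column mentions it.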
