# Cleaning: 

This notebook cleans the collected and merged ACS data to prepare it for the EDA process. The notable outliers with respect to data consistency are the Census Tracts containing the National Mall and Georgetown University, along with two DC aggregate observations included in the ACS data download. The National Mall Census Tract, which has a population of only 60, and the two DC aggregate observations are dropped. The Georgetown University Census Tract is kept, but its missing values are converted to zeros. This should not present an issue for clustering: it is the only Census Tract in DC that covers a university rather than the surrounding residential areas, so converting its missing values to zeros will not interfere with the modeling process. That is, the models will still capture the Census Tract as an outlier.

In [40]:
# Import Libraries:

import numpy as np
import pandas as pd

In [41]:
# Import data, set index to 'geo_id':

df = pd.read_csv('../data/outputs/01_merged.csv', index_col=False)
df = df.set_index('geo_id')

## Drop Unneeded Observations:

In [42]:
# Drop observations aggregating DC data (the ACS download includes DC-wide aggregates):
# Aggregated DC observations by geo_id index = ['0400000US11', '1600000US1150000']
# Census Tracts to drop -- DC Mall/White House, Georgetown University = ['1400000US11001006202', '1400000US11001000201']

df = df.drop(index = ['0400000US11', '1600000US1150000', '1400000US11001006202', '1400000US11001000201'])
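
`DataFrame.drop` raises a `KeyError` when any listed label is absent from the index, so it can help to confirm that the labels exist before dropping them. The following is a minimal sketch on a toy frame; the `toy` data (with made-up population values) and the names `to_drop`, `present`, and `missing` are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the merged ACS frame (population values are made up).
toy = pd.DataFrame(
    {"total_pop": [700000, 700000, 60, 3500]},
    index=pd.Index(
        [
            "0400000US11",
            "1600000US1150000",
            "1400000US11001006202",
            "1400000US11001000100",
        ],
        name="geo_id",
    ),
)

to_drop = ["0400000US11", "1600000US1150000", "1400000US11001006202"]

# Split the requested labels into those present in the index and those missing.
present = toy.index.intersection(to_drop)
missing = sorted(set(to_drop) - set(toy.index))

print("Missing labels:", missing)

# Dropping only the labels that exist avoids a KeyError.
toy_clean = toy.drop(index=present)
print(toy_clean)
```

Printing `missing` first makes a silently renamed or absent `geo_id` visible before any rows are removed.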

## Drop Duplicate Name Columns from Merge:

In [43]:
df = df.drop(columns=['name_y', 'name_x.1', 'name_y.1'])

## Remove Non-Numeric Characters from Variables:

In [44]:
# Remove + character from 'median_hsld_income' and 'median_rent' columns:

df['median_hsld_income'] = df['median_hsld_income'].str.rstrip('+')
df['median_rent'] = df['median_rent'].str.rstrip('+')

In [45]:
# Replace the ACS missing-data marker ('-') with 0. For clustering purposes, this will suffice.

df = df.replace('-', 0)

In [46]:
# Remove commas from ACS 'median_hsld_income' and 'median_rent' columns:

df['median_hsld_income'] = df['median_hsld_income'].replace(',','', regex=True)
df['median_rent'] = df['median_rent'].replace(',','', regex=True)
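
The three string-cleaning steps above (strip the trailing `+`, treat `-` as zero, remove commas) could also be gathered into one function and applied per column. This is only a sketch: `clean_acs_number` is a hypothetical helper, and the toy values are illustrative rather than taken from the data:

```python
import pandas as pd


def clean_acs_number(series: pd.Series) -> pd.Series:
    """Hypothetical helper: strip ACS formatting and return numeric values."""
    cleaned = (
        series.astype(str)
        .str.rstrip("+")                    # top-coded values such as '250,000+'
        .str.replace(",", "", regex=False)  # thousands separators
        .replace("-", "0")                  # missing-data marker, treated as zero
    )
    return pd.to_numeric(cleaned)


# Illustrative values only.
toy = pd.Series(["250,000+", "3,500+", "-", "1,234"], name="median_hsld_income")
print(clean_acs_number(toy).tolist())
```

Keeping the steps in one place makes the order explicit, which matters because the `-` replacement matches whole cells only.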

## Convert Columns to Numeric:

In [47]:
# Convert columns to numeric:

numeric_cols = ['median_age',
                'median_hsld_income',
                'avg_wrk_commute_mins',
                'per_cap_income',
                'avg_household_size',
                'avg_family_size',
                'median_rent',
                'median_home_value']

df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
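
By default `pd.to_numeric` raises an error on the first string it cannot parse. When debugging a new data pull, one option is to pass `errors="coerce"` and then inspect which cells turned into `NaN`. A small sketch with made-up values follows; the `toy`, `converted`, and `failed` names are assumptions for illustration:

```python
import pandas as pd

# Made-up values; 'N/A' stands in for any string that cannot be parsed.
toy = pd.DataFrame(
    {
        "median_rent": ["1500", "2000", "N/A"],
        "median_age": ["33.5", "41.2", "29.0"],
    }
)

# errors="coerce" turns unparseable strings into NaN instead of raising.
converted = toy.apply(pd.to_numeric, errors="coerce")

# Cells that were present in the input but became NaN failed to parse.
failed = converted.isna() & toy.notna()
print(toy[failed.any(axis=1)])
```

This only surfaces problem rows; how to handle them (drop, fix upstream, or zero-fill as in this notebook) remains a separate decision.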

In [48]:
# Drop any duplicate rows, then confirm that the fields are numeric with no null values (177 entries in the output below):

df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 177 entries, 1400000US11001000100 to 1400000US11001011100
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name_x                    177 non-null    object 
 1   total_pop                 177 non-null    int64  
 2   pct_male                  177 non-null    float64
 3   median_age                177 non-null    float64
 4   pct_hisp_latino           177 non-null    float64
 5   pct_white                 177 non-null    float64
 6   pct_black                 177 non-null    float64
 7   pct_american_ind          177 non-null    float64
 8   pct_asian                 177 non-null    float64
 9   pct_hawaiian_pacisldr     177 non-null    float64
 10  pct_other_race            177 non-null    float64
 11  pct_unemployed            177 non-null    float64
 12  avg_wrk_commute_mins      177 non-null    float64
 13  median_hsld_income        177 non-
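
The properties checked by eye in the `df.info()` output can also be written as assertions, so the notebook fails loudly if a later data pull breaks them. This is a minimal sketch, assuming a hypothetical helper `check_clean` and made-up toy values:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_clean(frame: pd.DataFrame, numeric_cols: list[str]) -> None:
    """Hypothetical validation helper; raises AssertionError on the first failed check."""
    assert not frame.isna().any().any(), "null values remain"
    assert not frame.duplicated().any(), "duplicate rows remain"
    for col in numeric_cols:
        assert is_numeric_dtype(frame[col]), f"{col} is not numeric"


# Made-up values for illustration.
toy = pd.DataFrame(
    {
        "name_x": ["Tract A", "Tract B"],
        "median_rent": [1500, 2000],
        "median_age": [33.5, 41.2],
    }
)

check_clean(toy, ["median_rent", "median_age"])
print("all checks passed")
```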

## Export Clean Data for EDA:

In [49]:
# Export with the 'geo_id' index written as the first column:

df.to_csv('../data/outputs/02_merged_clean.csv', index=True)
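
Because the export writes the `geo_id` index as the first column, a later notebook can restore it directly with `index_col`. The sketch below shows the round trip with an in-memory buffer instead of a file; the toy values are made up and the `buffer` and `restored` names are assumptions for illustration:

```python
import io

import pandas as pd

# Toy frame indexed by geo_id (rent values are made up).
toy = pd.DataFrame(
    {"median_rent": [1500, 2000]},
    index=pd.Index(
        ["1400000US11001000100", "1400000US11001011100"], name="geo_id"
    ),
)

buffer = io.StringIO()          # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=True)  # the index name becomes the 'geo_id' header
buffer.seek(0)

# Reading with index_col restores geo_id as the index.
restored = pd.read_csv(buffer, index_col="geo_id")
print(restored)
```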

In [50]:
# EDA conducted in notebook '03_eda.ipynb'