## Merge Census Data

So far I have spent 3:58 on this project. 

Next steps. Quickly merge census data. Maybe just pick a few columns.

Drop all unknown zip codes. 

Decide how to aggregate to zip level.

Put the data up in Looker. Maybe multiple datasets.

Create some vizzes. The viz is the final product that they really care about btw.

In [1]:
import pandas as pd

# Load Data

## Reading

In [2]:
reading = pd.read_csv("../data/interim/zipcode_scores.csv")

## Income

In [3]:
na_values = ['-', 'N', ' (X)', '(X)', 'median-', 'median+', '**', '***', '*****']

data_cols = {"GEO_ID": "GEO_ID",
             "NAME": "GEO_NAME",
             "S1901_C01_001E": "total_households",
             "S1901_C02_001E": "total_families",
             "S1901_C01_012E": "median_household_income",
             "S1901_C02_012E": "median_family_income",
             }

# skiprows=[1] drops the second header row so only the column codes are used as names.
income = pd.read_csv("../data/raw/ACSST5Y2022.S1901_2024-12-22T150557/ACSST5Y2022.S1901-Data.csv",
                     na_values=na_values,
                     usecols=list(data_cols),
                     skiprows=[1])

income = income.rename(columns=data_cols)
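
To make the load options above concrete, here is a minimal sketch on an in-memory CSV. The two-header-row layout (column codes, then descriptive labels) is inferred from the use of `skiprows=[1]`, and every value in the sample is made up for illustration; it is not taken from the real file.

```python
import io

import pandas as pd

# Made-up sample: a code row, a label row, then two data rows.
raw = io.StringIO(
    "GEO_ID,NAME,S1901_C01_012E,S1901_C01_012M\n"
    "Geography,Geographic Area Name,Median income (dollars),Margin of Error\n"
    "860Z200US00601,ZCTA5 00601,17526,2114\n"
    "860Z200US00602,ZCTA5 00602,-,**\n"
)

cols = {"GEO_ID": "GEO_ID",
        "NAME": "GEO_NAME",
        "S1901_C01_012E": "median_household_income"}

demo = pd.read_csv(raw,
                   na_values=["-", "**"],   # placeholder symbols become NaN
                   usecols=list(cols),      # keep only the columns we rename
                   skiprows=[1])            # drop the descriptive label row
demo = demo.rename(columns=cols)
print(demo)
```

Without `skiprows=[1]` the label row would be read as data and force every column to strings.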

In [4]:
# XXX: Many income values are already masked in the source. Anything that fails
# numeric parsing (including values still carrying '+' or '-') counts as NaN here.
print("Percent nan median household income: ", pd.to_numeric(income['median_household_income'], errors='coerce').isna().mean())
print("Percent nan median family income: ", pd.to_numeric(income['median_family_income'], errors='coerce').isna().mean())

Percent nan median household income:  0.11546184738955824
Percent nan median family income:  0.13253012048192772


In [5]:
# XXX: For simplicity, strip the '+'/'-' markers from open-ended medians and
# keep the boundary value as if it were the estimate.
for col in ['median_household_income', 'median_family_income']:
    income[col] = income[col].str.replace(r'[+-]', '', regex=True)
    income[col] = pd.to_numeric(income[col], errors='coerce')
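
The truncation step can also be wrapped in a small helper so both columns get identical treatment. This is only a sketch: `truncate_open_ended` is a hypothetical name, the sample values are made up, and stripping thousands separators is an assumption I have not checked against the raw file.

```python
import pandas as pd


def truncate_open_ended(values: pd.Series) -> pd.Series:
    """Drop '+', '-' and ',' characters, then convert to numbers (NaN on failure)."""
    # astype(str) lets this work whether the column was read as text or numbers.
    # Note: this would also strip a leading minus sign, which is fine for incomes.
    cleaned = values.astype(str).str.replace(r"[+\-,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values in the shapes this cell is meant to handle.
sample = pd.Series(["250,000+", "2,500-", "61234", None])
truncated = truncate_open_ended(sample)
print(truncated.tolist())
```

Keeping the boundary value understates incomes above the top bracket and overstates those below the bottom one, so a flag column for truncated rows may be worth adding if that matters for the viz.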


## Poverty

In [6]:
data_cols = {"GEO_ID": "GEO_ID",
             "NAME": "GEO_NAME",
             "S1701_C03_001E": "percent_below_poverty",
             "S1701_C03_002E": "percent_below_poverty_minors",
             "S1701_C03_014E": "percent_below_poverty_black",
             "S1701_C03_020E": "percent_below_poverty_hispanic",
             }

# skiprows=[1] drops the second header row so only the column codes are used as names.
poverty = pd.read_csv("../data/raw/ACSST5Y2022.S1701_2024-12-22T150853/ACSST5Y2022.S1701-Data.csv",
                      na_values=na_values,
                      usecols=list(data_cols),
                      skiprows=[1])

poverty = poverty.rename(columns=data_cols)


# Validate & Clean

## Income

In [7]:
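# The last 5 characters of GEO_ID and GEO_NAME should be the same all-digit zip code.
assert (income['GEO_ID'].str.slice(-5) == income['GEO_NAME'].str.slice(-5)).all()
assert pd.to_numeric(income['GEO_ID'].str.slice(-5), errors='coerce').notna().all()

# Note: the numeric zip loses leading zeros (e.g. 00601 becomes 601).
income['zip'] = pd.to_numeric(income['GEO_ID'].str.slice(-5))
income = income.drop(columns=['GEO_ID','GEO_NAME'])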


## Poverty

In [8]:
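# Same checks as for income: both fields must end in the same all-digit zip code.
assert (poverty['GEO_ID'].str.slice(-5) == poverty['GEO_NAME'].str.slice(-5)).all()
assert pd.to_numeric(poverty['GEO_ID'].str.slice(-5), errors='coerce').notna().all()

# Note: the numeric zip loses leading zeros (e.g. 00601 becomes 601).
poverty['zip'] = pd.to_numeric(poverty['GEO_ID'].str.slice(-5))
poverty = poverty.drop(columns=['GEO_ID','GEO_NAME'])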
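
The income and poverty cells repeat the same validation, so it could be factored into one function. A minimal sketch, assuming both frames carry `GEO_ID` and `GEO_NAME` columns as above; `add_zip_column` is a hypothetical name and the sample rows are made up.

```python
import pandas as pd


def add_zip_column(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the trailing 5-digit code, add it as numeric `zip`, drop the geo columns."""
    geo_zip = df["GEO_ID"].str.slice(-5)
    # Both geography fields should end in the same code ...
    assert (geo_zip == df["GEO_NAME"].str.slice(-5)).all()
    # ... and that code should be exactly five digits.
    assert geo_zip.str.fullmatch(r"\d{5}").all()
    return df.assign(zip=pd.to_numeric(geo_zip)).drop(columns=["GEO_ID", "GEO_NAME"])


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "GEO_ID": ["860Z200US00601", "860Z200US10001"],
    "GEO_NAME": ["ZCTA5 00601", "ZCTA5 10001"],
    "median_household_income": [17526.0, 106509.0],
})
demo_clean = add_zip_column(demo)
print(demo_clean)
```

With this in place, each dataset would need a single call such as `income = add_zip_column(income)`.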

# Merge

In [9]:
scores = reading.merge(income, how='left', on='zip').merge(poverty, how='left', on='zip')
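
A left merge silently keeps rows whose zip has no census match, and silently duplicates rows if the census side has repeated zips. The sketch below shows one way to guard against both and to drop unknown zip codes, as planned in the notes at the top. The frames and the `score` column are made up; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Made-up stand-ins for the reading and income tables.
reading_demo = pd.DataFrame({"zip": [601, 601, 10001, 99999],
                             "score": [1.0, 2.0, 3.0, 4.0]})
income_demo = pd.DataFrame({"zip": [601, 10001],
                            "median_household_income": [17000.0, 100000.0]})

merged = reading_demo.merge(income_demo,
                            how="left",
                            on="zip",
                            validate="many_to_one",  # raises if income has duplicate zips
                            indicator=True)          # adds a `_merge` column

# Zips present in reading but missing from the census table.
unmatched = merged.loc[merged["_merge"] == "left_only", "zip"].unique()
print("Unmatched zips:", unmatched)

# Keep only rows with a census match and remove the helper column.
matched_only = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

When chaining two merges as in the cell above, the `_merge` column would need to be dropped or renamed after the first merge, since a second `indicator=True` call fails if that column already exists.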

# Export

In [10]:
scores.to_csv("../data/final/scores.csv", index=False)