# 2 Data Collection: US Census
This notebook queries the US Census Bureau API. The data comes from the American Community Survey (ACS) 5-Year Data for 2016, which is what the `acs5` call in section 2.1 requests, and is used to examine income levels for every zip code (ZIP Code Tabulation Area, ZCTA) in the US. Documentation for the Census Bureau API is available at https://www.census.gov/developers/.
  
To gauge the wealth of a particular zip code, the percentage of households with an income of $200,000 or more is calculated from the data and stored in the DataFrame.  
  
The results are exported to "incomebyzip.csv", which contains the following information:
* ZipCode: Location of the business
* Households: Number of individual households
* Median family income
* Percentage of households with income of $200,000 or more
 
 
## Dependency
#### 1 Census
The Census library is a wrapper around the Census API and is used to pull the data. It is installed with the following command:
`pip install census`

The `us` package imported below is installed the same way (`pip install us`).

#### 2 secrets.py
The Census API requires an API key. The key is stored in secrets.py, which contains a single variable, `censusKey`.
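
As an alternative to keeping the key in a file, the sketch below reads it from an environment variable. The function name `load_census_key` and the variable name `CENSUS_API_KEY` are assumptions made for illustration; they are not part of this notebook or of the Census library.

```python
import os


def load_census_key(env=os.environ, var_name="CENSUS_API_KEY"):
    """Return the API key from an environment mapping, or raise a clear error.

    Both the function and the variable name are hypothetical.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} or define censusKey in secrets.py")
    return key


# Usage with an explicit mapping, so no real key is needed to try it out.
example_key = load_census_key({"CENSUS_API_KEY": "abc123"})
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real environment.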


In [1]:
#-- Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import os
from census import Census
from us import states


# Census  API Keys
from secrets import censusKey


#-- Configuration Settings

# Folder that is to contain output of different processing
outputDirectory = "AnalysisData"

# Name of the file that contains the census data from API call
censusFileName = 'incomebyzip.csv'

## 2.1 Get Census Data
Call the Census API, requesting 2016 data summarized by zip code (ZIP Code Tabulation Area).

In [2]:
#-- Get Census Data

#- Get from API
c = Census(censusKey, year=2016)

census_data = c.acs5.get(("NAME","B01003_001E",
                          "B19001_017E",
                          "B19113_001E",
                          "B25002_002E"),{'for':'zip code tabulation area:*'})

#- Convert to DataFrame
census_df = pd.DataFrame(census_data)

#- Rename Columns
census_df = census_df.rename(columns={"B01003_001E": "Population",
                                     "B25002_002E": "Households",
                                     "B19113_001E": "Median family income",
                                     "B19001_017E":"Households with household income $200,000 or more",
                                     "NAME": "Name", "zip code tabulation area": "Zipcode"})

#- Preview Data
census_df.head()

Unnamed: 0,Population,"Households with household income $200,000 or more",Median family income,Households,Name,Zipcode
0,17423.0,146.0,82512.0,7190.0,ZCTA5 01001,1001
1,29970.0,722.0,94489.0,9561.0,ZCTA5 01002,1002
2,11296.0,0.0,-666666666.0,26.0,ZCTA5 01003,1003
3,5228.0,89.0,99127.0,1840.0,ZCTA5 01005,1005
4,14888.0,350.0,92100.0,5611.0,ZCTA5 01007,1007
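
For readers who want to see what the wrapper call amounts to, the sketch below builds a comparable request URL by hand without sending it. The endpoint path is an assumption based on the public layout of the Census Data API for the 2016 ACS 5-Year dataset, and `build_census_url` is a hypothetical helper, not a function of the `census` package.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed endpoint for the 2016 ACS 5-Year dataset.
BASE_URL = "https://api.census.gov/data/2016/acs/acs5"


def build_census_url(variables, geography, key):
    """Assemble a query URL from variable codes, a geography clause, and a key."""
    params = {"get": ",".join(variables), "for": geography, "key": key}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_census_url(
    ("NAME", "B01003_001E", "B19001_017E", "B19113_001E", "B25002_002E"),
    "zip code tabulation area:*",
    "YOUR_KEY",  # placeholder, not a real key
)

# Decode the query string again to confirm the parameters survived encoding.
decoded = parse_qs(urlsplit(url).query)
```

The variable codes and the geography clause are the ones used in the cell above; only the URL assembly is new.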


In [3]:
#- Sort Median Family Income and Preview
census_df.sort_values("Median family income", ascending=True).head()

Unnamed: 0,Population,"Households with household income $200,000 or more",Median family income,Households,Name,Zipcode
11747,30.0,0.0,-666666666.0,21.0,ZCTA5 36865,36865
6237,382.0,0.0,-666666666.0,179.0,ZCTA5 20687,20687
6244,0.0,0.0,-666666666.0,0.0,ZCTA5 20701,20701
8553,80.0,0.0,-666666666.0,53.0,ZCTA5 27982,27982
8549,169.0,0.0,-666666666.0,108.0,ZCTA5 27978,27978


## 2.2 Clean Data
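Remove nonsensical data, that is, rows where "Median family income" is not greater than 0. The previews above show that the raw data uses -666666666 as a placeholder in this column. A new DataFrame is created.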

In [5]:
#- Create New DataFrame
cleaned_census_df = census_df[census_df['Median family income'] > 0 ] 


#- Preview Data
cleaned_census_df.sort_values("Median family income", ascending=True).head()

Unnamed: 0,Population,"Households with household income $200,000 or more",Median family income,Households,Name,Zipcode
6857,5486.0,0.0,2499.0,31.0,ZCTA5 22904,22904
32590,89.0,0.0,2499.0,43.0,ZCTA5 98939,98939
29387,98.0,3.0,2499.0,54.0,ZCTA5 87064,87064
11901,1107.0,0.0,2499.0,577.0,ZCTA5 37228,37228
13452,321.0,0.0,2499.0,104.0,ZCTA5 42151,42151
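
The filter above drops the affected rows. A possible alternative, sketched below on a three-row sample taken from the previews, keeps the rows and marks the placeholder as missing instead. Treating -666666666 as a "no estimate available" marker is an assumption here, and `MISSING_SENTINEL` is a name introduced only for this example.

```python
import pandas as pd

MISSING_SENTINEL = -666666666  # placeholder value seen in the previews (assumed meaning: no estimate)
col = "Median family income"

sample = pd.DataFrame({
    "Zipcode": ["01001", "01003", "36865"],
    col: [82512.0, -666666666.0, -666666666.0],
})

# Option A (what the notebook does): keep only rows with a positive income.
dropped = sample[sample[col] > 0]

# Option B: keep every row and replace the placeholder with NaN.
masked = sample.copy()
masked[col] = masked[col].mask(masked[col] == MISSING_SENTINEL)
```

Option B preserves the other columns for those zip codes, at the cost of having to handle NaN in later steps.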


## 2.3 Calculate High Incomes
Using the cleaned census DataFrame, calculate the percentage of households with income over $200,000.

In [6]:
#- Create Copy of DataFrame
income_df = cleaned_census_df.copy()

#- Calculate % "rich"(over200k)
income_df["Percent of households with income over $200,000"] = income_df["Households with household income $200,000 or more"]/income_df["Households"]*100

#- Preview Data
income_df.head()


Unnamed: 0,Population,"Households with household income $200,000 or more",Median family income,Households,Name,Zipcode,"Percent of households with income over $200,000"
0,17423.0,146.0,82512.0,7190.0,ZCTA5 01001,1001,2.030598
1,29970.0,722.0,94489.0,9561.0,ZCTA5 01002,1002,7.551511
3,5228.0,89.0,99127.0,1840.0,ZCTA5 01005,1005,4.836957
4,14888.0,350.0,92100.0,5611.0,ZCTA5 01007,1007,6.237747
5,1194.0,24.0,72000.0,530.0,ZCTA5 01008,1008,4.528302
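
As a quick sanity check of the calculation, the sketch below recomputes the percentage for the first two preview rows using plain arithmetic. The helper `percent_high_income` is hypothetical, and returning `None` when there are no households is a design choice made for this example, not something the notebook does.

```python
def percent_high_income(high_income_households, households):
    """Share of households at $200,000 or more, as a percentage.

    Returns None when the household count is zero or negative,
    since the ratio is undefined in that case.
    """
    if households <= 0:
        return None
    return high_income_households / households * 100


# Values taken from the preview rows for ZCTA5 01001 and ZCTA5 01002.
row_01001 = percent_high_income(146.0, 7190.0)
row_01002 = percent_high_income(722.0, 9561.0)
```

Both results agree with the last column of the preview when rounded to six decimal places.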


## 2.4 Create Zipcode column
Create a new string column named "ZipCode", copied from "Zipcode", for use with the plots. This resolves an issue with the charts.

In [7]:
#- Preview the Columns
income_df.dtypes

Population                                           float64
Households with household income $200,000 or more    float64
Median family income                                 float64
Households                                           float64
Name                                                  object
Zipcode                                               object
Percent of households with income over $200,000      float64
dtype: object

In [9]:
#- Create New Zipcode Column
income_df['ZipCode'] = income_df['Zipcode'].astype(str)

#- Preview Column types
income_df.dtypes

Population                                           float64
Households with household income $200,000 or more    float64
Median family income                                 float64
Households                                           float64
Name                                                  object
Zipcode                                               object
Percent of households with income over $200,000      float64
ZipCode                                               object
dtype: object
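
The previews in this notebook show 1001 for ZCTA5 01001, which suggests that leading zeros can be lost once a zip code is treated as a number. The sketch below shows one way to restore a five-character string; `normalize_zip` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def normalize_zip(value):
    """Return a five-character zip code string, restoring leading zeros."""
    text = str(value).strip()
    if text.endswith(".0"):
        # A code that was read back as a float, e.g. 1001.0
        text = text[:-2]
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"Not a zip code: {value!r}")
    return text.zfill(5)


padded_from_int = normalize_zip(1001)
padded_from_float = normalize_zip(98939.0)
unchanged = normalize_zip("01001")
```

On a DataFrame column the same idea could be applied with `income_df['Zipcode'].map(normalize_zip)`.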

## 2.5 Export to CSV
Export the completed income DataFrame to a CSV file.

In [12]:
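# Create Path
censusFilePath = os.path.join('.', outputDirectory, censusFileName)

# Make sure the output folder exists
os.makedirs(outputDirectory, exist_ok=True)

# Export to disk, using the path built above so it matches the message below
income_df.to_csv(censusFilePath)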

print(f"Completed export census information to disk. Path: {censusFilePath}")

Completed export census information to disk. Path: ./AnalysisData/incomebyzip.csv
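
When the exported file is read back, two details are worth watching: the DataFrame index is written as an extra unnamed column by default, and zip codes are parsed as integers unless a string dtype is requested. The sketch below illustrates both with an in-memory buffer and a two-row sample; writing with `index=False` is a suggestion for this example, not what the export cell above does.

```python
import io

import pandas as pd

income_sample = pd.DataFrame({
    "ZipCode": ["01001", "01002"],
    "Median family income": [82512.0, 94489.0],
})

# Write without the index so no "Unnamed: 0" column appears on reload.
buffer = io.StringIO()
income_sample.to_csv(buffer, index=False)
buffer.seek(0)

# Read the zip code column as text to keep its leading zeros.
restored = pd.read_csv(buffer, dtype={"ZipCode": str})
```

The same `dtype` argument applies when loading "incomebyzip.csv" from disk in a later notebook.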
