# Explore the Census Wrapper and API

### Requirements
* Install the `census` module before getting started by running the following from the command line:
    * **`pip install census`**

### Documentation
* [Python wrapper for census API](https://github.com/datamade/census)
* [List of available fields and labels](https://gist.github.com/afhaque/60558290d6efd892351c4b64e5c01e9b)
* [Census API Docs](https://www.census.gov/data/developers/data-sets.html)


### Import Dependencies

In [3]:
import pandas as pd
from census import Census #<-- Python wrapper for census API
import requests

# Census API Key
from config import api_key

# provide the api key and the year to establish a session
c = Census(api_key, year=2017)

# Set an option to allow up to 300 characters to print in each column
pd.set_option('max_colwidth', 300)
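The `config` module imported above is not shown in this notebook; presumably it is a local file that defines `api_key`. As one possible alternative, here is a minimal sketch that reads the key from an environment variable. The helper `load_api_key` and the variable name `CENSUS_API_KEY` are hypothetical and are not part of the `census` package.

```python
import os


def load_api_key(env_var="CENSUS_API_KEY"):
    """Return the Census API key stored in an environment variable.

    Both this helper and the default variable name are illustrative
    assumptions, not something the census wrapper provides.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Census API key")
    return key


# Usage (left commented out so the sketch does not fail without a key):
# api_key = load_api_key()
# c = Census(api_key, year=2017)
```

Either approach keeps the key itself out of the notebook.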

### Gather all of the available tables for the 2017 ACS5 data

Before querying for data, we need to know which tables exist. The wrapper's `tables()` method returns one entry per available table, including its description, its table ID (`name`), and a URL that lists the table's variables.

**NOTE:** We're using the wrapper's `acs5` methods to pull our data from the 5-year American Community Survey.

In [4]:
# query for all tables
tables = c.acs5.tables()

# The tables variable contains a list of dicts, so we can convert directly to a dataframe
table_df = pd.DataFrame(tables)
print(f"Number of available tables: {len(table_df)}")
table_df.head(15)

Number of available tables: 1127


Unnamed: 0,description,name,variables
0,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY SOCIAL SECURITY INCOME BY SUPPLEMENTAL SECURITY INCOME (SSI) AND CASH PUBLIC ASSISTANCE INCOME,B17015,https://api.census.gov/data/2017/acs/acs5/groups/B17015.json
1,SEX BY AGE BY COGNITIVE DIFFICULTY,B18104,https://api.census.gov/data/2017/acs/acs5/groups/B18104.json
2,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY WORK EXPERIENCE OF HOUSEHOLDER AND SPOUSE,B17016,https://api.census.gov/data/2017/acs/acs5/groups/B17016.json
3,SEX BY AGE BY AMBULATORY DIFFICULTY,B18105,https://api.census.gov/data/2017/acs/acs5/groups/B18105.json
4,POVERTY STATUS IN THE PAST 12 MONTHS BY HOUSEHOLD TYPE BY AGE OF HOUSEHOLDER,B17017,https://api.census.gov/data/2017/acs/acs5/groups/B17017.json
5,SEX BY AGE BY SELF-CARE DIFFICULTY,B18106,https://api.census.gov/data/2017/acs/acs5/groups/B18106.json
6,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER,B17018,https://api.census.gov/data/2017/acs/acs5/groups/B17018.json
7,SEX BY AGE BY INDEPENDENT LIVING DIFFICULTY,B18107,https://api.census.gov/data/2017/acs/acs5/groups/B18107.json
8,GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR CURRENT RESIDENCE IN PUERTO RICO,B07001PR,https://api.census.gov/data/2017/acs/acs5/groups/B07001PR.json
9,PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE),B19301A,https://api.census.gov/data/2017/acs/acs5/groups/B19301A.json


### Execute a string search against the *description* column to filter to an area of interest

In [3]:
table_df[table_df['description'].str.contains("MEDIAN VALUE")]

Unnamed: 0,description,name,variables
65,MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS),B25097,https://api.census.gov/data/2017/acs/acs5/groups/B25097.json
133,MEDIAN VALUE (DOLLARS) FOR MOBILE HOMES,B25083,https://api.census.gov/data/2017/acs/acs5/groups/B25083.json
303,MEDIAN VALUE (DOLLARS),B25077,https://api.census.gov/data/2017/acs/acs5/groups/B25077.json
934,MEDIAN VALUE BY YEAR STRUCTURE BUILT,B25107,https://api.census.gov/data/2017/acs/acs5/groups/B25107.json
936,MEDIAN VALUE BY YEAR HOUSEHOLDER MOVED INTO UNIT,B25109,https://api.census.gov/data/2017/acs/acs5/groups/B25109.json
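The search above is case-sensitive and treats the pattern as a regular expression. A more forgiving variant is sketched below on a small stand-in DataFrame; three rows are copied from the listings above and the fourth (`X00000`, with a missing description) is a made-up row included only to show how nulls are handled.

```python
import pandas as pd

# Three real rows from the table listing, plus one hypothetical row
# with a missing description
sample_tables = pd.DataFrame({
    "name": ["B25077", "B25083", "B18104", "X00000"],
    "description": [
        "MEDIAN VALUE (DOLLARS)",
        "MEDIAN VALUE (DOLLARS) FOR MOBILE HOMES",
        "SEX BY AGE BY COGNITIVE DIFFICULTY",
        None,
    ],
})

# case=False ignores capitalization, regex=False matches the text literally,
# and na=False treats missing descriptions as non-matches
mask = sample_tables["description"].str.contains(
    "median value", case=False, regex=False, na=False
)
matches = sample_tables[mask]
print(matches["name"].tolist())
```

Passing `regex=False` matters when the search text contains characters such as parentheses, which are common in the table descriptions.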


### Use the provided URL for your table of interest to retrieve all available variables

Note: I couldn't find a wrapper function for this, so we're using `requests` to make the API call directly.

In [5]:
# Determine which table you're interested in
table_id = 'B19013'

# Capture the variables URL from the table_df
url = table_df.loc[table_df['name']==table_id, 'variables'].values[0]

# Make the API call
response = requests.get(url).json()

# convert the response to a DataFrame
variables = pd.DataFrame(response['variables']).transpose()

print(f"Number of available variables: {len(variables)}")
variables

Number of available variables: 4


Unnamed: 0,concept,group,label,limit,predicateOnly,predicateType
B19013_001MA,,B19013,Annotation of Margin of Error!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars),0,True,string
B19013_001EA,,B19013,Annotation of Estimate!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars),0,True,string
B19013_001E,MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS),B19013,Estimate!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars),0,True,int
B19013_001M,MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS),B19013,Margin of Error!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars),0,True,int
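The `.transpose()` call is needed because the `variables` object in the response is keyed by variable ID, so pandas would otherwise turn each variable into a column. The sketch below uses a trimmed, hand-copied stand-in for the response (only some of the real fields are included) and shows that `pd.DataFrame.from_dict(..., orient="index")` gives the same layout in one step.

```python
import pandas as pd

# Trimmed stand-in for the JSON returned by the groups URL; the real
# response carries more fields per variable
sample_response = {
    "variables": {
        "B19013_001E": {
            "label": "Estimate!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars)",
            "group": "B19013",
            "predicateType": "int",
        },
        "B19013_001EA": {
            "label": "Annotation of Estimate!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars)",
            "group": "B19013",
            "predicateType": "string",
        },
    }
}

# One row per variable ID, one column per attribute
variables_sample = pd.DataFrame.from_dict(sample_response["variables"], orient="index")

# Same layout as the transpose used in the cell above
transposed = pd.DataFrame(sample_response["variables"]).transpose()

print(variables_sample[["group", "predicateType"]])
```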


### Filter to only fields that will contain an integer

Many of the available variables for a table are annotation (notes) fields that are typically null. Luckily, the API tells us the data type of each variable, so we can filter to only the ones that will contain an integer. The output shown below appears to have been captured with `table_id` set to `B17001`.

In [5]:
variables[variables['predicateType']=='int']
# variables

Unnamed: 0,concept,group,label,limit,predicateOnly,predicateType
B17001_050M,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!12 to 14 years,0,True,int
B17001_050E,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!12 to 14 years,0,True,int
B17001_051E,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!15 years,0,True,int
B17001_051M,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!15 years,0,True,int
B17001_052E,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!16 and 17 years,0,True,int
B17001_052M,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!16 and 17 years,0,True,int
B17001_053E,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!18 to 24 years,0,True,int
B17001_053M,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!18 to 24 years,0,True,int
B17001_054E,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Estimate!!Total!!Income in the past 12 months at or above poverty level!!Female!!25 to 34 years,0,True,int
B17001_054M,POVERTY STATUS IN THE PAST 12 MONTHS BY SEX BY AGE,B17001,Margin of Error!!Total!!Income in the past 12 months at or above poverty level!!Female!!25 to 34 years,0,True,int
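Judging by the labels above, IDs ending in `E` are estimates and IDs ending in `M` are margins of error. If only the estimates are wanted, the integer filter can be combined with a check on the ID suffix. This is a minimal sketch on a stand-in DataFrame built from the four `B19013` variable IDs listed earlier.

```python
import pandas as pd

# Stand-in for the `variables` DataFrame, indexed by variable ID
variables_sample = pd.DataFrame(
    {"predicateType": ["string", "string", "int", "int"]},
    index=["B19013_001MA", "B19013_001EA", "B19013_001E", "B19013_001M"],
)

is_int = variables_sample["predicateType"] == "int"
is_estimate = variables_sample.index.str.endswith("E")

# Keep integer-typed variables whose ID ends in "E"
estimates = variables_sample[is_int & is_estimate]
print(estimates.index.tolist())
```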


### Use the wrapper to query for your selected fields

Once we've identified which fields we want, we can begin to query for the actual content.

There are a number of convenient methods that the wrapper provides, but the standard `get()` function requires a tuple of field IDs and a geographic reference stored in a dictionary, as seen below.

In this code, we're requesting `NAME` plus seven variables for ALL zip code tabulation areas.


In [6]:
census_data = c.acs5.get(("NAME", "B19013_001E", "B01003_001E", "B01002_001E", "B19301_001E", "B17001_002E","B25033_001E","B25033_002E"), 
                         {'for': 'zip code tabulation area:*'})

census_data[:5]

[{'NAME': 'ZCTA5 00601',
  'B19013_001E': 11757.0,
  'B01003_001E': 17599.0,
  'B01002_001E': 38.9,
  'B19301_001E': 7041.0,
  'B17001_002E': 11282.0,
  'B25033_001E': 17525.0,
  'B25033_002E': 9779.0,
  'zip code tabulation area': '00601'},
 {'NAME': 'ZCTA5 00602',
  'B19013_001E': 16190.0,
  'B01003_001E': 39209.0,
  'B01002_001E': 40.9,
  'B19301_001E': 8978.0,
  'B17001_002E': 20428.0,
  'B25033_001E': 39116.0,
  'B25033_002E': 30453.0,
  'zip code tabulation area': '00602'},
 {'NAME': 'ZCTA5 00603',
  'B19013_001E': 16645.0,
  'B01003_001E': 50135.0,
  'B01002_001E': 40.4,
  'B19301_001E': 10897.0,
  'B17001_002E': 25176.0,
  'B25033_001E': 48160.0,
  'B25033_002E': 28468.0,
  'zip code tabulation area': '00603'},
 {'NAME': 'ZCTA5 00606',
  'B19013_001E': 13387.0,
  'B01003_001E': 6304.0,
  'B01002_001E': 42.8,
  'B19301_001E': 5960.0,
  'B17001_002E': 4092.0,
  'B25033_001E': 6288.0,
  'B25033_002E': 4354.0,
  'zip code tabulation area': '00606'},
 {'NAME': 'ZCTA5 00610',
  'B190
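For readers curious about what such a query looks like as a plain HTTP request, here is a rough sketch that builds the equivalent URL without sending it. The base URL matches the one in the table listings above, and `get`, `for`, and `key` are the Census API's query parameters; the helper `build_acs5_url` is hypothetical, and the assumption that the wrapper sends exactly this request is mine.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/2017/acs/acs5"


def build_acs5_url(fields, geography, api_key):
    """Build (but do not send) an ACS5 query URL. Hypothetical helper."""
    params = {
        "get": ",".join(fields),  # comma-separated field IDs
        "for": geography,         # e.g. "zip code tabulation area:*"
        "key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


query_url = build_acs5_url(
    ("NAME", "B19013_001E"),
    "zip code tabulation area:*",
    "YOUR_KEY",  # placeholder, not a real key
)
print(query_url)
```

Passing the resulting URL to `requests.get` would return the raw rows, which the wrapper otherwise converts to the list of dicts shown above.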

### Format the response

In [7]:
# Convert to DataFrame
census_pd = pd.DataFrame(census_data)

# Rename columns to be more user-friendly
# (Note: B25033 appears to count people living in occupied housing units rather than households)
census_pd = census_pd.rename(columns={"B01003_001E": "Population",
                                      "B01002_001E": "Median Age",
                                      "B19013_001E": "Household Income",
                                      "B19301_001E": "Per Capita Income",
                                      "B17001_002E": "Poverty Count",
                                      "B25033_001E": "Total Households",
                                      "B25033_002E": "Total Owner Occupied",
                                      "NAME": "Name", 
                                      "zip code tabulation area": "Zipcode"})

# Since Census doesn't provide the poverty rate, we can divide Poverty Count by Population to calculate it ourselves
census_pd["Poverty Rate"] = 100 * census_pd["Poverty Count"].astype(int) / census_pd["Population"].astype(int)
census_pd["% Owner Occupied"] = 100 * (census_pd["Total Owner Occupied"].astype(int) / census_pd["Total Households"].astype(int))

# Reorder columns and only include ones we're interested in for the final DataFrame
census_pd = census_pd[["Zipcode", "Population", "Median Age", "Household Income",
                       "Per Capita Income", "Poverty Rate","Total Households","Total Owner Occupied", "% Owner Occupied"]]

# Preview the result
print(f"Total number of zip codes in response: {len(census_pd)}")

census_pd.head()

Total number of zip codes in response: 33120


Unnamed: 0,Zipcode,Population,Median Age,Household Income,Per Capita Income,Poverty Rate,Total Households,Total Owner Occupied,% Owner Occupied
0,601,17599.0,38.9,11757.0,7041.0,64.105915,17525.0,9779.0,55.800285
1,602,39209.0,40.9,16190.0,8978.0,52.100283,39116.0,30453.0,77.853052
2,603,50135.0,40.4,16645.0,10897.0,50.216416,48160.0,28468.0,59.111296
3,606,6304.0,42.8,13387.0,5960.0,64.911168,6288.0,4354.0,69.243003
4,610,27590.0,41.4,18741.0,9266.0,45.498369,27474.0,21493.0,78.230327
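Two edge cases are worth guarding against in the calculations above: a zip code with zero population, which makes the division meaningless, and the large negative placeholder (`-666666666`) that the Census API commonly returns when a median could not be computed. The sketch below handles both on a small stand-in DataFrame. The first row is copied from the output above; the other two rows are made up for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Zipcode": ["00601", "99999", "88888"],        # last two are hypothetical
    "Population": [17599.0, 0.0, 120.0],
    "Poverty Count": [11282.0, 0.0, 30.0],
    "Household Income": [11757.0, -666666666.0, 45000.0],
})

# Treat the placeholder value as missing rather than as a real income
sample["Household Income"] = sample["Household Income"].replace(-666666666, np.nan)

# Replace zero populations with NaN so the rate comes out missing
population = sample["Population"].replace(0, np.nan)
sample["Poverty Rate"] = 100 * sample["Poverty Count"] / population

print(sample[["Zipcode", "Household Income", "Poverty Rate"]])
```

Leaving these as `NaN` keeps them out of later means and medians instead of skewing them.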


### Filter to zip codes that begin with 77

In [21]:
# Treat zip codes as strings so they can be matched by prefix
census_pd["Zipcode"] = census_pd["Zipcode"].astype(str)

# Keep only the zip codes whose first two characters are "77"
census_pd2 = census_pd[census_pd["Zipcode"].str.startswith("77")]
census_pd2.head()

Unnamed: 0,Zipcode,Population,Median Age,Household Income,Per Capita Income,Poverty Rate,Total Households,Total Owner Occupied,% Owner Occupied
26654,77002,12370.0,34.1,72306.0,34779.0,6.98464,4457.0,1213.0,27.215616
26655,77003,9646.0,34.1,59575.0,37760.0,27.628032,9597.0,4731.0,49.296655
26656,77004,37642.0,28.3,48592.0,31067.0,19.733277,28125.0,9997.0,35.544889
26657,77005,28233.0,36.5,180758.0,100896.0,3.71551,25438.0,20433.0,80.324711
26658,77006,21945.0,34.3,82878.0,68705.0,7.304625,21280.0,8945.0,42.034774


### Save to CSV and JSON

In [22]:
census_pd2.to_csv("../Clean Data Files/census_data.csv", encoding="utf-8", index=False)

In [23]:
census_pd2.to_json("../static/data/census_data.json", orient="records")
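One thing to watch when reading the saved CSV back in: pandas infers the `Zipcode` column as an integer unless told otherwise, which drops leading zeros (`00601` becomes `601`). The sketch below demonstrates this with an in-memory buffer instead of the real file paths, using two zip codes from the output above, and also shows the shape that `orient="records"` produces.

```python
import io
import json

import pandas as pd

sample = pd.DataFrame({
    "Zipcode": ["00601", "77002"],
    "Population": [17599.0, 12370.0],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: Zipcode is inferred as an integer column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: Zipcode stays a string with its leading zeros
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"Zipcode": str})

print(default_read["Zipcode"].tolist())
print(string_read["Zipcode"].tolist())

# orient="records" yields a list with one JSON object per row
records = json.loads(sample.to_json(orient="records"))
print(records[0])
```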