# Explore the Census Wrapper and API

### Requirements
* Install the `census` module before getting started. To do this, run the following command from the command line:
    * **`pip install census`**

### Documentation
* [Python wrapper for census API](https://github.com/datamade/census)
* [List of available fields and labels](https://gist.github.com/afhaque/60558290d6efd892351c4b64e5c01e9b)
* [Census API Docs](https://www.census.gov/data/developers/data-sets.html)


### Import Dependencies

In [22]:
import pandas as pd
from census import Census #<-- Python wrapper for census API
import requests

# Census API Key
from config import api_key

# provide the api key and the year to establish a session
c = Census(api_key, year=2018)

# Set an option to allow up to 300 characters to print in each column
pd.set_option('max_colwidth', 300)
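
The `config.py` file imported above is not shown in this notebook. As one possible alternative, the sketch below reads the key from an environment variable and fails early when it is missing. The variable name `CENSUS_API_KEY` and the helper `load_api_key` are assumptions made for illustration; the `census` package does not require either.

```python
import os


def load_api_key(environ=os.environ, env_var="CENSUS_API_KEY"):
    """Return the API key stored under env_var in the given mapping.

    Both the helper and the variable name are hypothetical.
    """
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your Census API key")
    return key


# Demonstration with a placeholder mapping instead of the real environment
demo_key = load_api_key({"CENSUS_API_KEY": "placeholder-key"})
print(demo_key)
```

In the notebook, the returned value could be passed to `Census(...)` in place of the imported `api_key`.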

### Gather all of the available tables for the 2018 ACS5 data

The wrapper provides a number of convenient methods. Here we use `tables()`, which returns every table available for the dataset and year of the session, so we can browse what is on offer before requesting any data.

**NOTE:** We're using the `acs5` function set to pull our data from the 5-year American Community Survey.

In [23]:
# query for all tables
tables = c.acs5.tables()

# The tables variable contains a list of dicts, so we can convert directly to a dataframe
table_df = pd.DataFrame(tables)
print(f"Number of available tables: {len(table_df)}")
table_df.head(15)

Number of available tables: 1135


Unnamed: 0,description,name,variables
0,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY SOCIAL SECURITY INCOME BY SUPPLEMENTAL SECURITY INCOME (SSI) AND CASH PUBLIC ASSISTANCE INCOME,B17015,https://api.census.gov/data/2018/acs/acs5/groups/B17015.json
1,SEX BY AGE BY COGNITIVE DIFFICULTY,B18104,https://api.census.gov/data/2018/acs/acs5/groups/B18104.json
2,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY WORK EXPERIENCE OF HOUSEHOLDER AND SPOUSE,B17016,https://api.census.gov/data/2018/acs/acs5/groups/B17016.json
3,SEX BY AGE BY AMBULATORY DIFFICULTY,B18105,https://api.census.gov/data/2018/acs/acs5/groups/B18105.json
4,POVERTY STATUS IN THE PAST 12 MONTHS BY HOUSEHOLD TYPE BY AGE OF HOUSEHOLDER,B17017,https://api.census.gov/data/2018/acs/acs5/groups/B17017.json
5,SEX BY AGE BY SELF-CARE DIFFICULTY,B18106,https://api.census.gov/data/2018/acs/acs5/groups/B18106.json
6,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY HOUSEHOLD TYPE BY EDUCATIONAL ATTAINMENT OF HOUSEHOLDER,B17018,https://api.census.gov/data/2018/acs/acs5/groups/B17018.json
7,SEX BY AGE BY INDEPENDENT LIVING DIFFICULTY,B18107,https://api.census.gov/data/2018/acs/acs5/groups/B18107.json
8,GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR CURRENT RESIDENCE IN PUERTO RICO,B07001PR,https://api.census.gov/data/2018/acs/acs5/groups/B07001PR.json
9,PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2018 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE),B19301A,https://api.census.gov/data/2018/acs/acs5/groups/B19301A.json


### Execute a string search against the *description* column to filter to an area of interest

In [24]:
table_df[table_df['description'].str.contains("MEDIAN VALUE")]

Unnamed: 0,description,name,variables
65,MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS),B25097,https://api.census.gov/data/2018/acs/acs5/groups/B25097.json
133,MEDIAN VALUE (DOLLARS) FOR MOBILE HOMES,B25083,https://api.census.gov/data/2018/acs/acs5/groups/B25083.json
306,MEDIAN VALUE (DOLLARS),B25077,https://api.census.gov/data/2018/acs/acs5/groups/B25077.json
942,MEDIAN VALUE BY YEAR STRUCTURE BUILT,B25107,https://api.census.gov/data/2018/acs/acs5/groups/B25107.json
944,MEDIAN VALUE BY YEAR HOUSEHOLDER MOVED INTO UNIT,B25109,https://api.census.gov/data/2018/acs/acs5/groups/B25109.json
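
The search above is case-sensitive and treats the search term as a regular expression. A minimal sketch of a more forgiving lookup, using a few rows copied from the listing above, could look like this. The helper name `search_tables` is hypothetical.

```python
import pandas as pd

# A few rows copied from the table listing above
sample_tables = pd.DataFrame({
    "name": ["B25097", "B25077", "B18104"],
    "description": [
        "MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS)",
        "MEDIAN VALUE (DOLLARS)",
        "SEX BY AGE BY COGNITIVE DIFFICULTY",
    ],
})


def search_tables(df, term):
    """Return rows whose description contains term, ignoring case."""
    # regex=False treats the term literally, so "(" and ")" need no escaping;
    # na=False treats missing descriptions as non-matches
    mask = df["description"].str.contains(term, case=False, regex=False, na=False)
    return df[mask]


matches = search_tables(sample_tables, "median value (dollars)")
print(matches["name"].tolist())
```

Passing `regex=False` matters for descriptions like these, because parentheses would otherwise be interpreted as a regular-expression group.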


### Use the provided URL for your table of interest to retrieve all available variables

Note: I couldn't find a wrapper function for this, so we use `requests` to call the API directly.

In [25]:
# Determine which table you're interested in
table_id = 'B25077'

# Capture the variables URL from the table_df
url = table_df.loc[table_df['name']==table_id, 'variables'].values[0]

# Make the API call
response = requests.get(url).json()

# convert the response to a DataFrame
variables = pd.DataFrame(response['variables']).transpose()

print(f"Number of available variables: {len(variables)}")
variables

Number of available variables: 4


Unnamed: 0,concept,group,label,limit,predicateOnly,predicateType
B25077_001M,MEDIAN VALUE (DOLLARS),B25077,Margin of Error!!Median value (dollars),0,True,int
B25077_001E,MEDIAN VALUE (DOLLARS),B25077,Estimate!!Median value (dollars),0,True,int
B25077_001EA,,B25077,Annotation of Estimate!!Median value (dollars),0,True,string
B25077_001MA,,B25077,Annotation of Margin of Error!!Median value (dollars),0,True,string
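
The conversion step can be tried without a network call. The sketch below assumes the response has the shape implied by the code above: a `variables` key that maps each variable ID to a dict of attributes. The sample payload is a trimmed copy of the output shown, and `variables_to_frame` is a hypothetical helper.

```python
import pandas as pd

# Trimmed, hand-copied sample of the response shape (an assumption)
sample_response = {
    "variables": {
        "B25077_001E": {
            "label": "Estimate!!Median value (dollars)",
            "concept": "MEDIAN VALUE (DOLLARS)",
            "group": "B25077",
            "predicateType": "int",
        },
        "B25077_001EA": {
            "label": "Annotation of Estimate!!Median value (dollars)",
            "group": "B25077",
            "predicateType": "string",
        },
    }
}


def variables_to_frame(payload):
    """Build one row per variable ID from a group payload."""
    frame = pd.DataFrame.from_dict(payload["variables"], orient="index")
    frame.index.name = "variable"
    return frame.sort_index()


sample_variables = variables_to_frame(sample_response)
print(sample_variables[["label", "predicateType"]])
```

`from_dict(..., orient="index")` gives the same layout as building the frame and calling `transpose()`, and attributes missing from a variable (such as `concept` on the annotation row) come through as `NaN`.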


### Filter to only fields that will contain a number

Many of the available variables for a table are annotation (notes) fields that are typically null. Luckily, the API reports the data type of each variable in `predicateType`. We can use this to keep only the variables typed as `int` or `float`.

In [26]:
variables[(variables['predicateType']=='float') | (variables['predicateType']=='int') ]
# variables

Unnamed: 0,concept,group,label,limit,predicateOnly,predicateType
B25077_001M,MEDIAN VALUE (DOLLARS),B25077,Margin of Error!!Median value (dollars),0,True,int
B25077_001E,MEDIAN VALUE (DOLLARS),B25077,Estimate!!Median value (dollars),0,True,int
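
An equivalent way to express this filter is `isin`, which also makes it easy to turn the surviving variable IDs into the tuple of field IDs used in the next step. This is a sketch on a hand-built copy of the output above; `NUMERIC_TYPES` is a name introduced here for illustration.

```python
import pandas as pd

# Hand-built copy of the variables shown above
sample_variables = pd.DataFrame(
    {"predicateType": ["int", "int", "string", "string"]},
    index=["B25077_001M", "B25077_001E", "B25077_001EA", "B25077_001MA"],
)

# Data types we consider numeric (illustrative constant)
NUMERIC_TYPES = ["int", "float"]

numeric_variables = sample_variables[
    sample_variables["predicateType"].isin(NUMERIC_TYPES)
]

# Tuple of field IDs, in the form expected by the wrapper's get() call
field_ids = tuple(numeric_variables.index)
print(field_ids)
```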


### Use the wrapper to query for your selected fields

Once we've identified which fields we want, we can begin to query for the actual content.

There are a number of convenient methods that the wrapper provides, but the standard `get()` function requires a tuple of field IDs and a geographic reference stored in a dictionary, as seen below.

In this code, we're saying we want data for these 8 fields (`NAME` plus seven ACS variables) for ALL zip codes.


In [29]:
census_data = c.acs5.get(("NAME", "B19013_001E", "B01003_001E", "B01002_001E", "B17001_002E","B25033_001E",
                          "B25033_002E", "B25077_001E"), 
                         {'for': 'zip code tabulation area:*'})

census_data[:1]

APIKeyError: ' <html>     <head>         <title>Invalid Key</title>     </head>     <body>         <p>             A valid <em>key</em> must be included with each data API request.             You included a key with this request, however, it is not valid.             Please check your key and try again.         </p>         <p>             If you do not have a key you my sign up for one <a href="key_signup.html">here</a>.         </p>     </body> </html>'

In [111]:
# Convert to DataFrame
census_pd = pd.DataFrame(census_data)


census_pd.head()

Unnamed: 0,B01002_001E,B01003_001E,B17001_002E,B19013_001E,B25033_001E,B25033_002E,B25077_001E,NAME,zip code tabulation area
0,38.9,17599.0,11282.0,11757.0,17525.0,9779.0,82500.0,ZCTA5 00601,601
1,40.9,39209.0,20428.0,16190.0,39116.0,30453.0,87300.0,ZCTA5 00602,602
2,40.4,50135.0,25176.0,16645.0,48160.0,28468.0,122300.0,ZCTA5 00603,603
3,42.8,6304.0,4092.0,13387.0,6288.0,4354.0,92700.0,ZCTA5 00606,606
4,41.4,27590.0,12553.0,18741.0,27474.0,21493.0,90300.0,ZCTA5 00610,610


### Format the response

In [112]:

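# Rename columns to be more user-friendly
# Note: table B25033 counts people living in occupied housing units, so "Total Households"
# and "Total Owner Occupied" hold population counts rather than household counts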
census_pd = census_pd.rename(columns={"B01003_001E": "Population",
                                      "B01002_001E": "Median Age",
                                      "B19013_001E": "Median Income",
                                      "B17001_002E": "Poverty Count",
                                      "B25033_001E": "Total Households",
                                      "B25033_002E": "Total Owner Occupied",
                                      "B25077_001E" : "Median Home Value",
                                      "NAME": "Name", 
                                      "zip code tabulation area": "Zipcode"})

# Since Census doesn't provide the poverty rate, we can divide Poverty Count by Population to calculate it ourselves
census_pd["Poverty Rate"] = 100 * census_pd["Poverty Count"].astype(int) / census_pd["Population"].astype(int)
census_pd["% Owner Occupied"] = 100 * (census_pd["Total Owner Occupied"].astype(int) / census_pd["Total Households"].astype(int))

# Reorder columns and only include ones we're interested in for the final DataFrame
census_pd = census_pd[["Zipcode", "Population", "Median Age", "Median Income",
                        "Poverty Rate","Total Households","Total Owner Occupied", "% Owner Occupied", "Median Home Value"]]

# Visualize
print("Total number of zip codes in response: " + str(len(census_pd)))
census_pd.head()


Total number of zip codes in response: 33120


Unnamed: 0,Zipcode,Population,Median Age,Median Income,Poverty Rate,Total Households,Total Owner Occupied,% Owner Occupied,Median Home Value
0,601,17599.0,38.9,11757.0,64.105915,17525.0,9779.0,55.800285,82500.0
1,602,39209.0,40.9,16190.0,52.100283,39116.0,30453.0,77.853052,87300.0
2,603,50135.0,40.4,16645.0,50.216416,48160.0,28468.0,59.111296,122300.0
3,606,6304.0,42.8,13387.0,64.911168,6288.0,4354.0,69.243003,92700.0
4,610,27590.0,41.4,18741.0,45.498369,27474.0,21493.0,78.230327,90300.0
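
The rate calculations above call `astype(int)`, which raises an error on missing values, and they divide by zero for areas with no population. ACS responses can also carry large negative placeholder values such as `-666666666` when an estimate is unavailable; treat that exact sentinel as an assumption to verify against the ACS documentation. A defensive sketch, using two rows from the output above plus one made-up row (`99999`) to exercise the edge cases:

```python
import numpy as np
import pandas as pd

# Placeholder the API is assumed to use for unavailable estimates
MISSING_SENTINEL = -666666666

sample = pd.DataFrame({
    "Zipcode": ["00601", "00602", "99999"],  # 99999 is a made-up row
    "Population": [17599.0, 39209.0, 0.0],
    "Poverty Count": [11282.0, 20428.0, 0.0],
    "Median Home Value": [82500.0, 87300.0, -666666666.0],
})

# Turn sentinel values into NaN so they cannot skew later statistics
cleaned = sample.replace(MISSING_SENTINEL, np.nan)

# Mask zero populations so the division yields NaN instead of inf or an error
population = cleaned["Population"].where(cleaned["Population"] > 0)
cleaned["Poverty Rate"] = 100 * cleaned["Poverty Count"] / population

print(cleaned)
```

Keeping the columns as floats avoids the `astype(int)` conversion altogether, since the division works the same way on floats.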


In [113]:
# Store the zip code as a string so we can use string methods on it
census_pd["Zipcode"] = census_pd["Zipcode"].astype(str)

# Record the length of each zip code; values shorter than 5 characters have lost their leading zeros
census_pd["Zip Length"] = census_pd["Zipcode"].str.len()

# Capture the first two characters of each zip code
census_pd["first2"] = census_pd["Zipcode"].str[:2]

# Keep only the zip codes that start with "77"
census_pd2 = census_pd[census_pd["first2"] == "77"]
census_pd2.head()

Unnamed: 0,Zipcode,Population,Median Age,Median Income,Poverty Rate,Total Households,Total Owner Occupied,% Owner Occupied,Median Home Value,Zip Length,first2
26654,77002,12370.0,34.1,72306.0,6.98464,4457.0,1213.0,27.215616,233000.0,5,77
26655,77003,9646.0,34.1,59575.0,27.628032,9597.0,4731.0,49.296655,272800.0,5,77
26656,77004,37642.0,28.3,48592.0,19.733277,28125.0,9997.0,35.544889,247300.0,5,77
26657,77005,28233.0,36.5,180758.0,3.71551,25438.0,20433.0,80.324711,940800.0,5,77
26658,77006,21945.0,34.3,82878.0,7.304625,21280.0,8945.0,42.034774,430600.0,5,77
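
The earlier outputs show zip codes such as `601`, which means the leading zeros of `00601` were lost at some point. A prefix filter like the one above would then misclassify those rows, since `601` starts with `60` rather than `00`. A minimal sketch of restoring the padding before filtering, on a small made-up series:

```python
import pandas as pd

# Zip codes as they appear once leading zeros are lost
zips = pd.Series([601, 602, 77002])

# Convert to strings and left-pad with zeros to five characters
padded = zips.astype(str).str.zfill(5)

# Prefix checks now behave as expected
starts_with_77 = padded[padded.str.startswith("77")]

print(padded.tolist())
print(starts_with_77.tolist())
```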


### Save to a CSV

In [None]:
census_pd.to_csv("census_data.csv", encoding="utf-8", index=False)
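
When the saved file is read back, pandas infers column types by default, which turns zip codes into integers and drops any leading zeros. The sketch below uses an in-memory buffer in place of `census_data.csv` to show the difference that passing `dtype` makes. The two-row frame is a stand-in for the real data.

```python
import io

import pandas as pd

# Stand-in for the real DataFrame
frame = pd.DataFrame({
    "Zipcode": ["00601", "77002"],
    "Population": [17599.0, 12370.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default read: Zipcode is inferred as an integer column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: Zipcode stays a string and keeps its leading zeros
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"Zipcode": str})

print(default_read["Zipcode"].tolist())
print(string_read["Zipcode"].tolist())
```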