Author: Dana Chermesh, Regional Planning intern; NYC DCP<br>
Summer 2018

### _US Metros comparison  Notebook no.1_
# POP00 from SF1 using Census API

----

A user guide for Census Data API:

# [Census Data API User Guide](https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf)

The Census Data API is an API that gives the public access to raw statistical data from various Census Bureau data
programs. Spatially, the data are aggregated and usually associated with a
certain Census geographic boundary/area defined by a FIPS code.
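
As a rough illustration of how a request is put together, the sketch below assembles a query URL from a year, a dataset, a list of variables, and a geography clause. The helper name `build_census_url` is hypothetical (it is not part of any Census library); the URL pattern simply mirrors the requests used later in this notebook.

```python
def build_census_url(year, dataset, variables, geography):
    """Assemble a Census Data API query URL (hypothetical helper)."""
    base = 'https://api.census.gov/data/{}/{}'.format(year, dataset)
    # variables are comma-separated in `get`; the geography goes in `for`
    return '{}?get={}&for={}'.format(base, ','.join(variables), geography)


# total population (P001001) and name for every county, 2000 SF1
url = build_census_url(2000, 'sf1', ['P001001', 'NAME'], 'county:*')
print(url)
```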

## _get your API key from:_ 
https://api.census.gov/data/key_signup.html

**Recommended:** In order to keep your API key confidential, please save your API key in a .py file named **censusAPI.py** as follows:

```python
myAPI = 'XXXXXXXXXXXXXXX'
```
Then import it into this notebook as in the following cell:
```python
from censusAPI import myAPI
```
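
The imported key is passed to the API as a `key` query parameter, as the request URL in the last section of this notebook does. A minimal sketch, assuming a hypothetical helper named `add_key` and a placeholder key:

```python
def add_key(url, api_key):
    """Append the API key as a `key` query parameter (hypothetical helper)."""
    separator = '&' if '?' in url else '?'
    return url + separator + 'key=' + api_key


# placeholder key; in the notebook this would be `myAPI`
keyed = add_key(
    'https://api.census.gov/data/2000/sf1?get=P001001,NAME&for=county:*',
    'XXXXXXXXXXXXXXX',
)
print(keyed)
```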

### The complete list of all available datasets for the API is located here:
https://api.census.gov/data.html

---- 

### More on the 2000 Decennial: Summary File 1 (SF1)
- [geographies](https://api.census.gov/data/2000/sf1/geography.html)
- [variables](https://api.census.gov/data/2000/sf1/variables.html)

----

## Obtaining data for pop 2000, SF1, all counties in the US

In [6]:
import pandas as pd

# reading in my api key saved in censusAPI.py as
# myAPI = 'XXXXXXXXXXXXXXX'
# request an api key in: https://api.census.gov/data/key_signup.html
from censusAPI import myAPI

In [8]:
# total POP for all counties in the US, 2000
Pop00 = pd.read_json('https://api.census.gov/data/2000/sf1?get=P001001,NAME&for=county:*')
Pop00.columns = Pop00.iloc[0]
Pop00 = Pop00[1:]

Pop00['STCO'] = Pop00[['state', 'county']].apply(lambda x: ''.join(x), axis=1)

print(Pop00.shape)
Pop00.head()

(3141, 5)


Unnamed: 0,P001001,NAME,state,county,STCO
1,43671,Autauga County,1,1,1001
2,140415,Baldwin County,1,3,1003
3,29038,Barbour County,1,5,1005
4,20826,Bibb County,1,7,1007
5,51024,Blount County,1,9,1009
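
The API response is a list of rows whose first row holds the column names, which is why the cell above promotes row 0 to the header. The sketch below shows the same idea on a tiny hand-written sample instead of a live request. It assumes the API returns the state and county codes as zero-padded strings, which the rendered table above does not show.

```python
import pandas as pd

# A tiny stand-in for the API response: the first row is the header.
rows = [
    ['P001001', 'NAME', 'state', 'county'],
    ['43671', 'Autauga County', '01', '001'],
    ['140415', 'Baldwin County', '01', '003'],
]

# Build the frame directly with the first row as column names.
sample = pd.DataFrame(rows[1:], columns=rows[0])

# State + county codes form the 5-character county FIPS key.
sample['STCO'] = sample['state'] + sample['county']
print(sample)
```

Plain string concatenation gives the same key as the row-wise `apply` with `''.join`, as long as both columns are strings.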


## Reading in geo-coded dataset
This dataset was created in a different notebook; please refer to [notebook no.0: 0-US_Metro_Comparison_Geographies.ipynb](https://github.com/NYCPlanning/rp-USmetros_comparison/blob/master/0-US_Metro_Comparison_Geographies.ipynb)

In [10]:
data = pd.read_csv('data/USmetros_full_correct.csv').drop('Unnamed: 0', axis=1)
data['STCO'] = data['STCO'].apply(lambda x: '{0:0>5}'.format(x))
# data['STCOzip'] = data['STCOzip'].str.replace("'", "")

print(data.shape)
data.head()

(274, 4)


Unnamed: 0,CSA,CSA_name,County_name,STCO
0,348,"Los Angeles-Long Beach, CA",Riverside,6065
1,348,"Los Angeles-Long Beach, CA",San Bernardino,6071
2,348,"Los Angeles-Long Beach, CA",Ventura,6111
3,176,"Chicago-Naperville, IL-IN-WI",Cook,17031
4,488,"San Jose-San Francisco-Oakland, CA",Alameda,6001
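
The padding step matters because `read_csv` parses `STCO` as an integer, which drops the leading zero of states such as California (06). A minimal sketch of an equivalent way to restore the 5-character key, using `str.zfill` on two codes taken from the table above:

```python
import pandas as pd

# STCO as read from the CSV: integers, leading zeros lost
codes = pd.Series([6065, 17031])

# cast to str and left-pad with zeros to 5 characters
padded = codes.astype(str).str.zfill(5)
print(padded.tolist())
```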


## Getting data for pop 2010, SF1, all counties in the US

In [12]:
# total POP for all counties in the US, 2010
POP10 = pd.read_json('https://api.census.gov/data/2010/sf1?get=P0010001,NAME&for=county:*')
POP10.columns = POP10.iloc[0]
POP10 = POP10[1:]

POP10['state'] = POP10['state'].apply(lambda x: '{0:0>2}'.format(x))
POP10['county'] = POP10['county'].apply(lambda x: '{0:0>3}'.format(x))

POP10['STCO'] = POP10[['state', 'county']].apply(lambda x: ''.join(x), axis=1)

print(POP10.shape)
POP10.head()

(3221, 5)


Unnamed: 0,P0010001,NAME,state,county,STCO
1,54571,Autauga County,1,1,1001
2,182265,Baldwin County,1,3,1003
3,27457,Barbour County,1,5,1005
4,22915,Bibb County,1,7,1007
5,57322,Blount County,1,9,1009


### Merging datasets

In [13]:
# merging 2000 pop with US metros' counties
POP00 = Pop00.merge(data, on='STCO')
POP00 = POP00.drop(['NAME'], axis=1)

# merging 2010 pop with US metros' counties + 2000
POP00 = POP00.merge(POP10, on='STCO').set_index('County_name')
POP00 = POP00.drop(['state_x', 'county_x', 'NAME', 'state_y', 'county_y'], axis=1)

POP00.columns = ['2000', 'STCO', 'CSA', 'CSA_Name', '2010']

# converting 2000 and 2010 to int
POP00['2000'] = POP00['2000'].astype(int)
POP00['2010'] = POP00['2010'].astype(int)

print(POP00.shape)
POP00.head()

(273, 5)


County_name,2000,STCO,CSA,CSA_Name,2010
Alameda,1443741,6001,488,"San Jose-San Francisco-Oakland, CA",1510271
Contra Costa,948816,6013,488,"San Jose-San Francisco-Oakland, CA",1049025
Los Angeles,9519338,6037,348,"Los Angeles-Long Beach, CA",9818605
Marin,247289,6041,488,"San Jose-San Francisco-Oakland, CA",252409
Napa,124279,6055,488,"San Jose-San Francisco-Oakland, CA",136484
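
The merged table has 273 rows while the geography table has 274, so one county appears to drop out of the inner joins. One way to locate such a row is a left merge with `indicator=True`. The sketch below uses toy data; the code `99999` and the county name `Hypothetical` are invented for illustration.

```python
import pandas as pd

# toy geography table with one code that has no population record
metros = pd.DataFrame({
    'STCO': ['06001', '06013', '99999'],
    'County_name': ['Alameda', 'Contra Costa', 'Hypothetical'],
})
pop = pd.DataFrame({
    'STCO': ['06001', '06013'],
    'P001001': ['1443741', '948816'],
})

# a left merge keeps every geography row; `_merge` records where each row matched
check = metros.merge(pop, on='STCO', how='left', indicator=True)
unmatched = check[check['_merge'] == 'left_only']
print(unmatched[['STCO', 'County_name']])
```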


In [15]:
POP00[POP00['CSA']==408].shape

(31, 5)

In [16]:
POP00['2000-2010_NET'] = POP00['2010'] - POP00['2000']
POP00['2000-2010_%'] = (POP00['2010'] - POP00['2000'])/POP00['2000']
POP00.head()

County_name,2000,STCO,CSA,CSA_Name,2010,2000-2010_NET,2000-2010_%
Alameda,1443741,6001,488,"San Jose-San Francisco-Oakland, CA",1510271,66530,0.046082
Contra Costa,948816,6013,488,"San Jose-San Francisco-Oakland, CA",1049025,100209,0.105615
Los Angeles,9519338,6037,348,"Los Angeles-Long Beach, CA",9818605,299267,0.031438
Marin,247289,6041,488,"San Jose-San Francisco-Oakland, CA",252409,5120,0.020705
Napa,124279,6055,488,"San Jose-San Francisco-Oakland, CA",136484,12205,0.098206
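
As a worked check of the two new columns, the arithmetic for Alameda can be reproduced by hand. Note that `2000-2010_%` holds a fraction rather than a percentage; multiplying by 100 gives percent.

```python
pop_2000 = 1443741  # Alameda, 2000
pop_2010 = 1510271  # Alameda, 2010

net = pop_2010 - pop_2000   # absolute change
share = net / pop_2000      # change relative to the 2000 base

print(net, round(share, 6), round(share * 100, 2))
```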


In [17]:
# checking for total pop in NYC Metro for 2000
POP00[POP00['CSA']==408]['2000'].sum()

21491898

### Exporting all counties 2010-2000 to .csv

In [18]:
POP00.to_csv('SF1_POP00-10_NEW.csv')

## Groupby CSAs to sum

In [19]:
# summing only the two population columns, so the string column STCO is not aggregated
CSAs10_00 = POP00.groupby(['CSA', 'CSA_Name'])[['2000', '2010']].sum()

CSAs10_00['2000-2010_NET'] = CSAs10_00['2010'] - CSAs10_00['2000']
CSAs10_00['2000-2010_%'] = (CSAs10_00['2010'] - CSAs10_00['2000'])/CSAs10_00['2000']

CSAs10_00['2000-2010_NET'] = CSAs10_00['2000-2010_NET'].astype(int)
CSAs10_00['2000-2010_%'] = CSAs10_00['2000-2010_%'].astype(float)

CSAs10_00

CSA,CSA_Name,2000,2010,2000-2010_NET,2000-2010_%
122,"Atlanta--Athens-Clarke County--Sandy Springs, GA",4778990,5910296,1131306,0.236725
148,"Boston-Worcester-Providence, MA-RI-NH-CT",7630016,7893376,263360,0.034516
176,"Chicago-Naperville, IL-IN-WI",9465353,9840929,375576,0.039679
206,"Dallas-Fort Worth, TX-OK",5565005,6817483,1252478,0.225063
216,"Denver-Aurora, CO",2629980,3034985,405005,0.153995
220,"Detroit-Warren-Ann Arbor, MI",5456428,5318744,-137684,-0.025233
288,"Houston-The Woodlands, TX",4878216,6114562,1236346,0.253442
348,"Los Angeles-Long Beach, CA",16373645,17877006,1503361,0.091816
370,"Miami-Fort Lauderdale-Port St. Lucie, FL",5475847,6166766,690919,0.126176
378,"Minneapolis-St. Paul, MN-WI",3335000,3684928,349928,0.104926
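
The change columns are recomputed after grouping because a rate cannot be summed across counties: it has to be derived from the summed populations. A small sketch with three county rows taken from the tables above (so the totals here cover only part of each CSA):

```python
import pandas as pd

counties = pd.DataFrame({
    'CSA': [488, 488, 348],
    'CSA_Name': ['San Jose-San Francisco-Oakland, CA',
                 'San Jose-San Francisco-Oakland, CA',
                 'Los Angeles-Long Beach, CA'],
    '2000': [1443741, 948816, 9519338],   # Alameda, Contra Costa, Los Angeles
    '2010': [1510271, 1049025, 9818605],
})

# sum the populations per CSA, then derive the change from the sums
totals = counties.groupby(['CSA', 'CSA_Name'])[['2000', '2010']].sum()
totals['2000-2010_NET'] = totals['2010'] - totals['2000']
totals['2000-2010_%'] = totals['2000-2010_NET'] / totals['2000']
print(totals)
```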


### Exporting CSAs' 2000-2010 table to .csv

In [20]:
CSAs10_00.to_csv('SF1_POP00-10_CSAs_FULLYEARS_NEW.csv')

------

## Obtaining PLACES 2000 data in order to define cities within major metros

### 2000

In [21]:
# total POP for all places in the US, 2000
POP00_place = pd.read_json('https://api.census.gov/data/2000/sf1?get=P001001,NAME&for=place:*')
POP00_place.columns = POP00_place.iloc[0]
POP00_place = POP00_place[1:]

POP00_place.rename(columns={'P001001':'2000'}, inplace=True)
POP00_place['GEOID'] = POP00_place[['state', 'place']].apply(lambda x: ''.join(x), axis=1)

print(POP00_place.shape)
POP00_place.head()

(25150, 5)


Unnamed: 0,2000,NAME,state,place,GEOID
1,984,Altoona town,1,1660,101660
2,7411,Boaz city,1,7912,107912
3,3158,Calera city,1,11416,111416
4,4927,Childersburg city,1,14464,114464
5,53929,Decatur city,1,20104,120104


In [22]:
POP00_place[POP00_place['NAME'] == 'San Francisco city']

Unnamed: 0,2000,NAME,state,place,GEOID
2330,776733,San Francisco city,6,67000,667000
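
The place `GEOID` built above is the 2-digit state FIPS code followed by the 5-digit place code, 7 characters in total. A minimal sketch with a hypothetical helper, `place_geoid`, that also pads codes that have lost their leading zeros:

```python
def place_geoid(state, place):
    """Build a 7-character place GEOID from state and place codes (hypothetical helper)."""
    return str(state).zfill(2) + str(place).zfill(5)


# San Francisco city: state 6 (California), place 67000
geoid = place_geoid(6, 67000)
print(geoid)
```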


### Reading in my Geocoded places table
Created by Dara Goldberg

In [23]:
cities = pd.read_excel('data/CSA Population+Change_2010-2017.xlsx', 
             sheet_name='Cities_pop+geoinfo')

# zero-padding GEOID to 7 characters to ensure a match with the 2000 places table
cities['GEOID'] = cities['GEOID'].apply(lambda x: '{0:0>7}'.format(x))
# format() already returns str; the cast is kept as a safeguard
cities.GEOID = cities.GEOID.astype(str)

print(cities.shape)
cities

(19, 15)


Unnamed: 0,GEOID,NAMELSAD,NAME,CSA,ALAND_mi,Dec,Est,2010,2011,2012,2013,2014,2015,2016,2017
0,644000,"Los Angeles city, California",Los Angeles,348,468.65867,3792621,3792724,3796060,3824592,3859267,3891783,3922668,3953459,3981116,3999759
1,653000,"Oakland city, California",Oakland,488,55.89604,390724,390822,391571,396480,401906,407567,413933,418929,421566,425195
2,667000,"San Francisco city, California",San Francisco,488,46.90564,805235,805193,805770,816294,830406,841270,853258,866320,876103,884363
3,668000,"San Jose city, California",San Jose,488,177.5141,945942,952574,955255,971352,985722,1003735,1016708,1027560,1031942,1035317
4,820000,"Denver city, Colorado",Denver,216,153.30483,600158,599813,603218,619356,633798,648049,663271,681618,694777,704621
5,1150000,"Washington city, District of Columbia",Washington,548,61.13988,601723,601766,605040,620336,635630,650114,660797,672736,684336,693972
6,1245000,"Miami city, Florida",Miami,370,35.98691,399457,399527,400864,410932,416157,421149,431645,442277,456632,463347
7,1304000,"Atlanta city, Georgia",Atlanta,122,133.43344,420003,420425,422849,431729,443008,447812,455589,463479,472967,486290
8,1714000,"Chicago city, Illinois",Chicago,176,227.3401,2695598,2695620,2697661,2706670,2717989,2724482,2726533,2725154,2720275,2716450
9,2507000,"Boston city, Massachusetts",Boston,148,48.34364,617594,617725,620702,630072,641955,652039,661103,669255,678430,685094


### Merging 2000 + cities datasets

In [24]:
POP_place = cities.merge(POP00_place, on='GEOID')
POP_place = POP_place.drop(['NAME_x', 'Dec', 'Est', 2011, 2012, 2013, 
                           2014, 2015, 2016, 2017, 'state', 'place'], axis=1)
POP_place.columns = ['GEOID', 'NAMELSAD', 'CSA', 'ALAND_mi',
                     '2010', '2000', 'NAME']

POP_place['2000'] = POP_place['2000'].astype(int)
POP_place['2010'] = POP_place['2010'].astype(int)

POP_place['2000-2010_NET'] = POP_place['2010'] - POP_place['2000']
POP_place['2000-2010_%'] = (POP_place['2010'] - POP_place['2000'])/POP_place['2000']

print(POP_place.shape)
POP_place

(19, 9)


Unnamed: 0,GEOID,NAMELSAD,CSA,ALAND_mi,2010,2000,NAME,2000-2010_NET,2000-2010_%
0,644000,"Los Angeles city, California",348,468.65867,3796060,3694820,Los Angeles city,101240,0.027401
1,653000,"Oakland city, California",488,55.89604,391571,399484,Oakland city,-7913,-0.019808
2,667000,"San Francisco city, California",488,46.90564,805770,776733,San Francisco city,29037,0.037384
3,668000,"San Jose city, California",488,177.5141,955255,894943,San Jose city,60312,0.067392
4,820000,"Denver city, Colorado",216,153.30483,603218,554636,Denver city,48582,0.087593
5,1150000,"Washington city, District of Columbia",548,61.13988,605040,572059,Washington city,32981,0.057653
6,1245000,"Miami city, Florida",370,35.98691,400864,362470,Miami city,38394,0.105923
7,1304000,"Atlanta city, Georgia",122,133.43344,422849,416474,Atlanta city,6375,0.015307
8,1714000,"Chicago city, Illinois",176,227.3401,2697661,2896016,Chicago city,-198355,-0.068492
9,2507000,"Boston city, Massachusetts",148,48.34364,620702,589141,Boston city,31561,0.053571


### Exporting Places 2000-2010 table to .csv

In [25]:
POP_place.to_csv('SF1_POP00-10_Places.csv')

----



# _Another approach to collect census data and export to excel file:_
* Example for the variable used in this notebook, **population for all counties from SF1 Decennial Census 2000**:

### Census/Collect_Census_into_Excel.py
from:
https://github.com/xbwei/Data-Mining-on-Social-Media/blob/master/Census/Collect_Census_into_Excel.py

- Tutorial on this package:
https://www.youtube.com/watch?v=5vvAOsIB2fY

The code is as below:

In [6]:
from urllib import request
import json
# from pprint import pprint
import xlwt
# import xlrd
# from xlutils.copy import copy

census_api_key = myAPI  # get your key from https://api.census.gov/data/key_signup.html

# build the URL of the census query
url_str = 'https://api.census.gov/data/2000/sf1?get=P001001,NAME&for=county:*&in=state:*&key=' + census_api_key

response = request.urlopen(url_str)  # send the request and get the response

file = xlwt.Workbook()  # create a new workbook
sheet_Co = file.add_sheet('SF1_2000_Co')  # add a new sheet named SF1_2000_Co
html_str = response.read().decode("utf-8")  # decode the response into a string
i = 0  # current row number in the sheet
if html_str:
    json_data = json.loads(html_str)  # parse the string as JSON (a list of rows)
    for row in json_data:
        cl1, cl2, cl3, cl4 = row

        # write format: (row_num, col_num, value)
        sheet_Co.write(i, 0, cl1)
        sheet_Co.write(i, 1, cl2)
        sheet_Co.write(i, 2, cl3)
        sheet_Co.write(i, 3, cl4)
        i = i + 1

# save the workbook; note that xlwt writes the legacy .xls format regardless of the extension
file.save('SF1_2000_Co.xlsx')

In [8]:
# reading in the Excel file created in the previous cell
POP00all = pd.read_excel('SF1_2000_Co.xlsx')

POP00all['state'] = POP00all['state'].apply(lambda x: '{0:0>2}'.format(x))
POP00all['county'] = POP00all['county'].apply(lambda x: '{0:0>3}'.format(x))

POP00all['STCO'] = POP00all[['state', 'county']].apply(lambda x: ''.join(x), axis=1)

print(POP00all.shape)
POP00all.tail(10)

(3141, 5)


Unnamed: 0,P001001,NAME,state,county,STCO
3131,2407,Niobrara County,56,27,56027
3132,25786,Park County,56,29,56029
3133,8807,Platte County,56,31,56031
3134,26560,Sheridan County,56,33,56033
3135,5920,Sublette County,56,35,56035
3136,37613,Sweetwater County,56,37,56037
3137,18251,Teton County,56,39,56039
3138,19742,Uinta County,56,41,56041
3139,8289,Washakie County,56,43,56043
3140,6644,Weston County,56,45,56045
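
If a spreadsheet file is not strictly required, the same rows can be written with the standard library alone. The sketch below swaps in `csv` for `xlwt` (this is a different technique from the script above, not part of it) and runs on a hand-written sample string rather than a live response. The function name `json_rows_to_csv` and the output file name are hypothetical.

```python
import csv
import json
import os
import tempfile


def json_rows_to_csv(json_text, path):
    """Parse a Census-style JSON list of rows and write it to a CSV file."""
    rows = json.loads(json_text)
    with open(path, 'w', newline='') as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)


# hand-written sample in the same shape as the API response
sample_text = '[["P001001","NAME","state","county"],["43671","Autauga County","01","001"]]'
out_path = os.path.join(tempfile.gettempdir(), 'SF1_2000_Co_sample.csv')
written = json_rows_to_csv(sample_text, out_path)
print(written, out_path)
```

When reading such a file back with pandas, passing `dtype=str` keeps the zero-padded state and county codes intact.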
