Author: Dana Chermesh Reshef, DRAW Brooklyn<br>
April 2019

### _Digital CEQR -- #1_
# Data Sources and Data Acquisition

### Data Sources:
- [Bureau of Labor Statistics, Quarterly Census of Employment and Wages](https://www.bls.gov/cew/datatoc.htm) (BLS-QCEW)

---- 
# 0 - Imports

In [1]:
from __future__ import print_function, division  # must come before any other statement; a no-op on Python 3

import requests
import pandas as pd
import numpy as np

import matplotlib.pylab as pl
import seaborn as sns
sns.set_style('whitegrid')
# import json

# Spatial
import geopandas as gpd
import fiona
import shapely

import statsmodels.formula.api as smf
import statsmodels.api as sm

%pylab inline

Populating the interactive namespace from numpy and matplotlib


# 1 - Data acquisition

## Download directly from the BLS-QCEW website

The following code is the sample code for Python 3 from the [BLS-QCEW website](https://data.bls.gov/cew/doc/access/data_access_examples.htm#PYTHON) ([Download](https://data.bls.gov/cew/doc/access/qcew_python_3x_example.zip)).

In [2]:
import urllib.request
import urllib

# *******************************************************************************
# qcewCreateDataRows : This function takes a raw csv string and splits it into
# a two-dimensional array containing the data and the header row of the csv file.
# A try/except block is used to handle both bytes and str input.
def qcewCreateDataRows(csv):
    dataRows = []
    try:
        dataLines = csv.decode().split('\r\n')
    except AttributeError:  # csv is already a str, which has no .decode()
        dataLines = csv.split('\r\n')
    for row in dataLines:
        dataRows.append(row.split(','))
    return dataRows
# *******************************************************************************


# *******************************************************************************
# qcewGetAreaData : This function takes a year, quarter, and area argument and
# returns an array containing the associated area data. Use 'a' for annual
# averages. 
# For all area codes and titles see:
# http://www.bls.gov/cew/doc/titles/area/area_titles.htm
#
def qcewGetAreaData(year,qtr,area):
    urlPath = "http://data.bls.gov/cew/data/api/[YEAR]/[QTR]/area/[AREA].csv"
    urlPath = urlPath.replace("[YEAR]",year)
    urlPath = urlPath.replace("[QTR]",qtr.lower())
    urlPath = urlPath.replace("[AREA]",area.upper())
    httpStream = urllib.request.urlopen(urlPath)
    csv = httpStream.read()
    httpStream.close()
    return qcewCreateDataRows(csv)
# *******************************************************************************


# *******************************************************************************
# qcewGetIndustryData : This function takes a year, quarter, and industry code
# and returns an array containing the associated industry data. Use 'a' for 
# annual averages. Some industry codes contain hyphens. The CSV files use
# underscores instead of hyphens. So 31-33 becomes 31_33. 
# For all industry codes and titles see:
# http://www.bls.gov/cew/doc/titles/industry/industry_titles.htm
#
def qcewGetIndustryData(year,qtr,industry):
    urlPath = "http://data.bls.gov/cew/data/api/[YEAR]/[QTR]/industry/[IND].csv"
    urlPath = urlPath.replace("[YEAR]",year)
    urlPath = urlPath.replace("[QTR]",qtr.lower())
    urlPath = urlPath.replace("[IND]",industry)
    httpStream = urllib.request.urlopen(urlPath)
    csv = httpStream.read()
    httpStream.close()
    return qcewCreateDataRows(csv)
# *******************************************************************************


# *******************************************************************************
# qcewGetSizeData : This function takes a year and establishment size class code
# and returns an array containing the associated size data. Size data
# is only available for the first quarter of each year.
# For all establishment size classes and titles see:
# http://www.bls.gov/cew/doc/titles/size/size_titles.htm
#
def qcewGetSizeData(year,size):
    urlPath = "http://data.bls.gov/cew/data/api/[YEAR]/1/size/[SIZE].csv"
    urlPath = urlPath.replace("[YEAR]",year)
    urlPath = urlPath.replace("[SIZE]",size)
    httpStream = urllib.request.urlopen(urlPath)
    csv = httpStream.read()
    httpStream.close()
    return qcewCreateDataRows(csv)
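
The `qcewCreateDataRows()` helper splits each line on every comma and leaves the quote characters in the values, which is why later cells strip `"` by hand. As a minimal sketch of an alternative, assuming the response is ordinary quoted CSV, Python's built-in `csv` module can do the parsing instead. The name `parse_qcew_csv` and the sample bytes are hypothetical and only illustrate the idea; they are not part of the BLS sample code.

```python
import csv
import io


def parse_qcew_csv(raw):
    """Parse raw QCEW CSV content (bytes or str) into a list of rows."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # newline="" lets the csv module deal with the \r\n line endings itself
    reader = csv.reader(io.StringIO(text, newline=""))
    # drop empty rows, e.g. a blank line at the end of the file
    return [row for row in reader if row]


# Made-up sample in the same quoted style as the API output
sample = (b'"area_fips","own_code","area_title"\r\n'
          b'"06041","5","Marin County, California"\r\n')
rows = parse_qcew_csv(sample)
print(rows[0])
print(rows[1])
```

The quoted field that contains a comma stays in one piece, and the quotes are removed during parsing, so no separate cleaning step is needed.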

## Getting `annual_avg_emplvl` by industry and ownership for every county in the major US metros of this analysis

Using the `qcewGetAreaData()` function provided by the BLS QCEW.

In [3]:
# define the industries to extract,
# using a Python dictionary in which
# each value is a list of industry
# codes (as str)

industry = {'office':['1022', '1023', '1024'],
            'health':['1025']}
            # continue adding industries as needed
            # in the format of
            # '<IndustryName>' : [<list of ind_codes as str>] 
    
ownership = '5' # could become a list if you want to include more ownership codes

In [4]:
list(industry.keys()) # dicts are not sorted: key order is only guaranteed to be insertion order from Python 3.7 on

['health', 'office']
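
Since the later cells filter rows by `industry_code`, it can be handy to look up the group name from a code as well. The sketch below inverts the dictionary defined above; `code_to_group` is a hypothetical name introduced here for illustration, and `sorted()` is used wherever a predictable key order matters.

```python
industry = {'office': ['1022', '1023', '1024'],
            'health': ['1025']}

# Invert the mapping: industry code -> group name
code_to_group = {code: group
                 for group, codes in industry.items()
                 for code in codes}

# sorted() gives the same alphabetical order on every Python version
group_names = sorted(industry)

print(group_names)
print(code_to_group['1023'])
```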

In [5]:
# TEST ONE COUNTY
# this one is a test of one STCO -- will not be stored

test = qcewGetAreaData("2017","A",'06041') # using the BLS area code (available only from 2013 on!)
test = pd.DataFrame(test) # put it in a pandas DataFrame
test.columns = test.iloc[0] # promote the first row to column headers
test = test[1:] # drop the header row from the data
test.columns = [i.replace('"', '') for i in test.columns] # strip quotes from the column names
test = test.replace({'"':''}, regex=True) # strip quotes from the values
test = test[['area_fips', 'own_code', 'industry_code', 'annual_avg_emplvl']] # selecting only relevant columns

# select the relevant rows by ownership and industry
test = test.loc[(test['own_code'] == ownership) & test['industry_code'].isin(industry['office'])] 

# sum all rows and append the total as a new row
# (DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions)
test.annual_avg_emplvl = test.annual_avg_emplvl.astype(int)
test = test.append(test.sum(numeric_only=True), ignore_index=True)

# assign the FIPS, ownership, and industry labels to the new row
test['area_fips'][-1:] = test['area_fips'][:1]
test['own_code'][-1:] = test['own_code'][:1]
test['industry_code'][-1:] = 'office'

test = test[-1:] # dropping all rows but the sum
test.annual_avg_emplvl = test.annual_avg_emplvl.astype(int)

print(test.shape)
print(test.dtypes)
test.head()

(1, 4)
area_fips            object
own_code             object
industry_code        object
annual_avg_emplvl     int64
dtype: object


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


|   | area_fips | own_code | industry_code | annual_avg_emplvl |
|---|-----------|----------|---------------|-------------------|
| 3 | 6041 | 5 | office | 25937 |
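
The test cell above builds the total row by appending and then assigning into slices, which is what triggers the copy-of-a-slice warnings. A possible alternative, sketched here under the assumption that the table has the same four columns, computes the sum first and builds the one-row result directly. The function name `sum_group_employment` and the demo numbers are hypothetical; they are not real QCEW values.

```python
import pandas as pd


def sum_group_employment(df, group_name, codes, own_code):
    """Return a one-row DataFrame with the summed annual_avg_emplvl of one group."""
    mask = (df['own_code'] == own_code) & df['industry_code'].isin(codes)
    subset = df.loc[mask]
    total = subset['annual_avg_emplvl'].astype(int).sum()
    return pd.DataFrame([{
        'area_fips': subset['area_fips'].iloc[0],  # raises IndexError if nothing matched
        'own_code': own_code,
        'industry_code': group_name,
        'annual_avg_emplvl': int(total),
    }])


# Made-up numbers, only to exercise the function
demo = pd.DataFrame({
    'area_fips': ['06041'] * 4,
    'own_code': ['5', '5', '5', '1'],
    'industry_code': ['1022', '1023', '1025', '1022'],
    'annual_avg_emplvl': ['100', '250', '75', '40'],
})
result = sum_group_employment(demo, 'office', ['1022', '1023', '1024'], '5')
print(result)
```

Because the result is built as a new DataFrame, no values are written into a slice of an existing one, and the county FIPS code stays a string with its leading zero.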


# QCEW data before 2013
### _**Open data from the BLS QCEW is available only from 2013 onward; data for earlier years must be downloaded to the local machine.**_

Download the .zip files for the selected year from https://www.bls.gov/cew/datatoc.htm, under _**CSVs By Area**_, and unzip them to a local folder. Then run the following code to read in only the _**selected counties and `annual_avg_emplvl`**_, using Python's _streaming_ approach.
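
As a rough sketch of what reading a file in a streaming fashion could look like, the function below walks through one CSV row by row with `csv.DictReader` and keeps only the matching rows, so the whole file never has to sit in memory. It assumes the downloaded files carry the same column names as the API output (`area_fips`, `own_code`, `industry_code`, `annual_avg_emplvl`); check the header of your own files first. The name `stream_emplvl` and the demo file are hypothetical.

```python
import csv
import os
import tempfile


def stream_emplvl(path, own_code, industry_codes):
    """Yield (area_fips, industry_code, annual_avg_emplvl) for matching rows."""
    wanted = set(industry_codes)
    with open(path, newline='') as handle:
        for row in csv.DictReader(handle):  # reads one row at a time
            if row['own_code'] == own_code and row['industry_code'] in wanted:
                yield (row['area_fips'],
                       row['industry_code'],
                       int(row['annual_avg_emplvl']))


# Demo on a tiny made-up file; the numbers are not real QCEW data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', newline='') as f:
        f.write('area_fips,own_code,industry_code,annual_avg_emplvl\n'
                '06041,5,1022,100\n'
                '06041,5,1025,75\n'
                '06041,1,1022,40\n')
    matches = list(stream_emplvl(demo_path, '5', ['1022', '1023', '1024']))

print(matches)
```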

## Example: year 2000

In [6]:
# check the notebook's working directory,
# so you can pass the right path in the next cell
!pwd

/Users/danachermesh/Desktop/digitalceqr


In [None]:
# creating a list of all file names in the unzipped folder

import os

mypath = '/Users/danachermesh/Desktop/digitalceqr/data/2000.annual.by_area/' # change to your local path
filesList = os.listdir(mypath)

print(len(filesList))
filesList
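
With the file names listed, only the files for the counties of interest need to be opened. The sketch below assumes that each file name contains the five-digit county FIPS code as a separate, space-delimited token (for example `2000.annual 06041 Marin County, California.csv`). That naming pattern is an assumption, so compare it with the names printed from your own folder. `select_county_files` and the demo names are hypothetical.

```python
def select_county_files(file_names, fips_codes):
    """Keep the file names that contain one of the FIPS codes as a token."""
    wanted = set(fips_codes)
    return [name for name in file_names
            if wanted.intersection(name.split())]


# Hypothetical file names following the assumed pattern
demo_names = [
    '2000.annual 06041 Marin County, California.csv',
    '2000.annual 36047 Kings County, New York.csv',
    '2000.annual 36061 New York County, New York.csv',
]
selected = select_county_files(demo_names, ['06041', '36047'])
print(selected)
```

Matching on whole tokens rather than substrings avoids picking up a file whose name merely contains the same digits somewhere else, such as in the year.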