# Converting SAS to CSV

This is a quick notebook to convert the SAS data files that hold the TEL data to CSV.  Once this is done, we can merge these data with the Reuters debt data.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sas7bdat import SAS7BDAT
import sys

Let's take a look at the files...

In [2]:
!ls ../../debt_data/

1984to1985.csv	1994to1995.csv	2004.csv	2012to2013.csv
1984to1985.txt	1994to1995.txt	2004.txt	2012to2013.txt
1986to1987.csv	1996to1997.csv	2005.csv	2014to2015.csv
1986to1987.txt	1996to1997.txt	2005.txt	2014to2015.txt
1988to1989.csv	1998to1999.csv	2006to2007.csv	co_pums11.sas7bdat
1988to1989.txt	1998to1999.txt	2006to2007.txt	costat12.sas7bdat
1990to1991.csv	2000to2001.csv	2008to2009.csv	debt_data_compile.ipynb
1990to1991.txt	2000to2001.txt	2008to2009.txt	loc_tel_yr70_13.sas7bdat
1992to1993.csv	2002to2003.csv	2010to2011.csv	tel_cl_jm.sas7bdat
1992to1993.txt	2002to2003.txt	2010to2011.txt


In [3]:
files=!ls ../../debt_data/*.sas7bdat
files

['../../debt_data/co_pums11.sas7bdat',
 '../../debt_data/costat12.sas7bdat',
 '../../debt_data/loc_tel_yr70_13.sas7bdat',
 '../../debt_data/tel_cl_jm.sas7bdat']

Supposedly we can read them right into pandas DFs.

In [4]:
#Create container for each set
df_list=[]

#For each file...
for file_in in files:
    #...create a sas7bdat object...
    with SAS7BDAT(file_in) as f:
        #...and throw a DF version in df_list
        df_list.append(f.to_data_frame())

[costat12.sas7bdat] column count mismatch


Just what is contained in these guys?

In [5]:
for df in df_list:
    print df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28778 entries, 0 to 28777
Data columns (total 41 columns):
STCOU         28778 non-null object
POP_TH18      28778 non-null float64
POP_OV65      28778 non-null float64
TOT_EMP       28778 non-null float64
MFG_EMP       28778 non-null float64
RETL_EMP      28778 non-null float64
T_PUB_SCH     28778 non-null float64
PUB_SCHL      28778 non-null float64
PVT_SCHL      28778 non-null float64
HSLD_PERS     28778 non-null float64
HSG_UNITS     28778 non-null float64
CH_HS_UNT     28778 non-null float64
PRE_1940      28778 non-null float64
VACANT        28778 non-null float64
MDHOMEVAL     28778 non-null float64
MED_INC       28778 non-null float64
PC_INC        28778 non-null float64
LANDAREA      28778 non-null float64
GEN_REV       28778 non-null float64
IGR_ST        28778 non-null float64
TAX_REV       28778 non-null float64
PT_REV        28778 non-null float64
D_GEN_EXP     28778 non-null float64
PC_GEN_EXP    28778 non-null float64
RES_

First, what is the year coverage of the sets that have a year variable?

In [6]:
for i in [0,2,3]:
    print df_list[i]['YEAR'].describe()

count    28778.000000
mean      2005.112517
std          6.153490
min       1990.000000
25%       2005.000000
50%       2007.000000
75%       2009.000000
max       2011.000000
Name: YEAR, dtype: float64
count    2244.000000
mean     1991.500000
std        12.701255
min      1970.000000
25%      1980.750000
50%      1991.500000
75%      2002.250000
max      2013.000000
Name: YEAR, dtype: float64
count     663.000000
mean     2004.307692
std         5.301090
min      1990.000000
25%      2002.000000
50%      2005.000000
75%      2008.000000
max      2011.000000
Name: YEAR, dtype: float64


## Capturing COSTAT Data of Interest

It turns out many of the variables needed for this analysis come out of raw COSTAT data in the second file in the list.  These data are not arranged in long format.  Each variable name includes three parts:

1. Three letters representing the concept family;
2. Three numbers representing the specific concept within the family; and,
3. Three numbers that serve as a year abbreviation.

Any variable that has a `D` appended to the end of the above sequence is a data value.  We require these variables to be arranged in time series fashion within concept.  Each family and concept will be mapped to a new variable descriptor.  The year portion of the variable name will be converted to an actual year and come into play in the DF index.
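
As a rough illustration of the naming scheme just described, the sketch below splits a COSTAT variable name into its parts with a regular expression. The pattern and the `parse_costat_name` helper are assumptions made for this example (they do not come from the SAS programs or the COSTAT reference files); the three-letter family, three-digit concept, three-digit year abbreviation, and one-letter suffix follow the list above.

```python
import re

# Assumed layout: 3 letters (family), 3 digits (concept),
# 3 digits (year abbreviation), 1 letter (suffix such as D).
COSTAT_NAME = re.compile(r'^([A-Z]{3})(\d{3})(\d{3})([A-Z])$')

def parse_costat_name(name):
    """Split a COSTAT variable name into its parts, or return None
    when the name does not follow the assumed layout (e.g. STCOU)."""
    match = COSTAT_NAME.match(name)
    if match is None:
        return None
    family, concept, year_abbr, suffix = match.groups()
    return {'family': family,
            'concept': concept,
            'year_abbr': year_abbr,
            'suffix': suffix}

parsed = parse_costat_name('AGE290202D')
```

Under these assumptions, `parsed` holds family `AGE`, concept `290`, year abbreviation `202`, and suffix `D`, while a non-COSTAT column such as `STCOU` comes back as `None`.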

In previous updates to the TEL set, the conceptual variables in the final set have been constructed either as a straight mapping from a COSTAT variable (less the year component and `D` suffix) or as some combination of COSTAT variables. These mappings from family and concept to variable are captured in two locations.  The first is `costat02and07.sas`...

In [7]:
!head -31 costat02and07.sas

/****EXTRACTS COUNTY DEMOGRAPHIC DATA TO USE WITH pums****/

/*********************************************************/
/*Program adds 2002 and 2007 data to the EFFECT97 set in the effect library below.
This involves first making a 2002 subset commensurate in type to the 2007 subset.
We then need to make the Costat match the EFFECT97 set to the extent possible*/

LIBNAME IN1 'Z:\PublicFinance\Data\Demographics\Counties\COSTATS08\Data\';
libname effect 'Z:\PublicFinance\Data\Government Finances\TELS\LOCAL\DATA\';

/*Corrected for 2002 variables, or the most recent*/
	DATA X2002;
  SET IN1.COSTAT08;
  YEAR =2002;
  KEEP STCOU YEAR
         HSG495200D           IPE010202D           PEN020202D            PST045202D
         IPE120202D           SPR010202D           SPR030202D            EMN010202D
         AGE290202D           AGE770202D           LOG315202D            LOG310202D 
         LOG130202D           LOG010202D           LOG220202D            

...and the second is in `TEL_UPDATE.sas`.

In [8]:
!head -75 TEL_UPDATE.sas

*THIS HAS BEEN DONE AND FILES ARE NOW IN TEL LIBRARY--USED TO IMPORT MISSING DATA;
DATA COSTATS;
	SET 'Y:\Dropbox\Data\costat12';
RUN;
DATA AGE;
	SET 'Y:\Dropbox\Data\AGE02';
RUN;

DATA COSTATCOMPLETE;
	MERGE COSTATS AGE;
	BY STCOU;
RUN;


DATA DEMOGRAPHIC;
	SET COSTATCOMPLETE;
 		KEEP STCOU AREANAME AGE290202D AGE290207D AGE770202D AGE770207D BZA110202D BZA110207D BZN230202D BZN230207D BZN430202D BZN430207D BZN630202D BZN630207D 
 			BZN675202D BZN675207D BZN695202D BZN695207D BZN830202D BZN830207D BZN870202D BZN870207D EDU300200D EDU314200D EDU334200D EDU354200D EDU374200D EDU404209D
 			EDU410209D EDU416209D EDU422209D EDU428209D EDU434209D EDU452209D EDU458209D EDU464209D EDU470209D EDU476209D EDU482209D HSD310200D HSD310209D HSG050200D
 			HSG050209D HSG170200D HSG170209D LND010200D LOG020197D LOG020202D PEN020202D PEN020207D PVY410200D PVY410209D PVY420199D PVY420209D SPR010202D SPR010207D
 			SPR030202D SPR030207D;
RUN;

DATA DEMOGRA

We will need to confirm the concepts being mapped here, so all reference information for the COSTATs database (a.k.a. USA Counties) has been deposited in **`../doc/COSTAT`** (including `csv` versions of family and specific variable descriptions).

In [9]:
!ls ../doc/COSTAT/

Flag_Reference.xls	Mastdata.xls	Ref.zip
Footnote_Reference.xls	mastgroups.csv	Source.xls
mastdata.csv		Mastgroups.xls	Unit_Reference.xls


A dictionary mapping variables to their descriptions would be useful for the confirmation process, so let's go ahead and do that.

In [10]:
#Read in variable mapping set
mastdata=pd.read_csv('../doc/COSTAT/mastdata.csv')

#Generate COSTAT variable description map
costat_map=dict(zip(mastdata['Item_Id'],mastdata['Item_Description']))

The variables identified in the first `KEEP` statement of each file will need to be checked with `costat_map` to confirm their "identity".  We can grab these directly from the files and throw them in a list of unique elements. 

In [11]:
#Define function that converts file lines containing COSTAT variables to a nice list
def read2list(line_list):
    #Capture the lines as one string
    ll='\r\n'.join(line_list).replace('\r\n','')
    #Remove semi-colons
    ll=ll.replace(';','')
    #Convert to list
    ll=[elem for elem in ll.split() if elem != '']
    return ll

#Capture the relevant variable lines from costat02* and convert them into a list
#(the with block closes the file on exit, so no explicit close is needed)
with open('costat02and07.sas') as f:
    costat02_vars=read2list(f.readlines()[15:22])

#Capture the relevant variable lines from TEL_UPDATE and convert them into a list
with open('TEL_UPDATE.sas') as f:
    tel_up_vars=read2list(f.readlines()[16:21])[3:] #first three (KEEP,STCOU,AREANAME) are not required for this exercise

#Capture the set of unique COSTAT variables
costat_vars=sorted(set(costat02_vars+tel_up_vars))

#For each COSTAT variable
for var in costat_vars:
    #...print the description
    try:
        print var,'|',costat_map[var]
    except:
        print '*** ',sys.exc_info()[0],' ***'

AGE290202D | Resident population under 18 years (July 1 - estimate) 2002
AGE290207D | Resident population under 18 years (July 1 - estimate) 2007
AGE770202D | Resident population 65 years and over (July 1 - estimate) 2002
AGE770207D | Resident population 65 years and over (July 1 - estimate) 2007
BZA110202D | Private nonfarm employment for pay period including March 12, 2002
BZA110207D | Private nonfarm employment for pay period including March 12, 2007
BZN230202D | Private nonfarm employment for pay period including March 12, 2002 - manufacturing (NAICS 31)
BZN230207D | Private nonfarm employment for pay period including March 12, 2007 - manufacturing (2002 NAICS 31)
BZN430202D | Private nonfarm employment for pay period including March 12, 2002 - retail trade (NAICS 44)
BZN430207D | Private nonfarm employment for pay period including March 12, 2007 - retail trade (2002 NAICS 44)
BZN630202D | Private nonfarm employment for pay period including March 12, 2002 - professional, scientific
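
The slices `[15:22]` and `[16:21]` used above depend on the exact line layout of the two SAS programs. A less position-dependent alternative, sketched here under the assumption that every variable of interest is three letters, six digits, and a trailing `D`, is to pull matching tokens straight out of the text. `extract_costat_vars` is a hypothetical helper, and unlike the cell above it scans whatever text it is handed rather than only the first `KEEP` statement.

```python
import re

# Assumed token shape for a COSTAT data variable, e.g. HSG495200D
COSTAT_TOKEN = re.compile(r'\b[A-Z]{3}\d{6}D\b')

def extract_costat_vars(sas_text):
    """Return the sorted, de-duplicated COSTAT data variables in sas_text."""
    return sorted(set(COSTAT_TOKEN.findall(sas_text)))

# A few lines copied from the costat02and07.sas listing shown earlier
sample = """
  KEEP STCOU YEAR
         HSG495200D           IPE010202D           PEN020202D            PST045202D
         IPE120202D           SPR010202D           SPR030202D            EMN010202D
"""
sample_vars = extract_costat_vars(sample)
```

Keywords and identifiers such as `KEEP`, `STCOU`, and `YEAR` do not match the pattern, so no positional trimming is needed.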

Now let's do this mapping explicitly.  For each family and concept above, we will explicitly map in a name.  If the name already exists in our SAS files above, we will use it. Otherwise, a new name will be generated.  The explicit mapping will cover our validation objective.  Note that some of these will only be input variables into derived variables we need later.

*Note that there are two variables for resident population.  In the dict, `RES_POP1` captures years up through 2001, and `RES_POP2` captures years from 2002 on.*

In [12]:
costat_concept_map={'AGE290':'POP_TH18',
				    'AGE770':'POP_OV65',
				    'BZA110':'TOT_EMP_PNFARM',
				    'BZN230':'MANU_EMP_PNFARM',
				    'BZN430':'RETL_EMP_PNFARM',
				    'BZN630':'PROF_SERV_EMP_PNFARM',
				    'BZN675':'SUPP_SERV_EMP_PNFARM',
				    'BZN695':'EDUC_SERV_EMP_PNFARM',
				    'BZN830':'FOOD_SERV_EMP_PNFARM',
				    'BZN870':'OTH_SERV_EMP_PNFARM',
				    'EDU010':'PUB_SCHL_TOT',
				    'EDU300':'PUB_SCHL_OV3',
				    'EDU314':'PRV_SCHL_PREK',
				    'EDU334':'PRV_SCHL_KIND',
				    'EDU354':'PRV_SCHL_1_8',
				    'EDU362':'PUB_SCHL_ELEM_HS',
				    'EDU364':'PRV_SCHL_ELEM_HS',
				    'EDU374':'PRV_SCHL_9_12',
				    'EDU404':'PUB_SCHL_OV3_M',
				    'EDU410':'PRV_SCHL_PREK_M',
				    'EDU416':'PRV_SCHL_KIND_M',
				    'EDU422':'PRV_SCHL_1_4_M',
				    'EDU428':'PRV_SCHL_5_8_M',
				    'EDU434':'PRV_SCHL_9_12_M',
				    'EDU452':'PUB_SCHL_OV3_F',
				    'EDU458':'PRV_SCHL_PREK_F',
				    'EDU464':'PRV_SCHL_KIND_F',
				    'EDU470':'PRV_SCHL_1_4_F',
				    'EDU476':'PRV_SCHL_5_8_F',
				    'EDU482':'PRV_SCHL_9_12_F',
				    'EMN010':'TOT_EMP',
				    'EMN240':'MFG_EMP',
				    'EMN260':'RETL_EMP',
				    'HSD310':'HSLD_PERS',
				    'HSG030':'HSG_UNITS',
				    'HSG045':'CH_HS_UNT',
				    'HSG050':'HSG_UNITS_ACS',
				    'HSG170':'PRE_1940',
				    'HSG190':'VACANT',
				    'HSG495':'MDHOMEVAL',
				    'IPE010':'MED_INC',
				    'IPE120':'PERS_POVT',
				    'LND010':'TOT_AREA',
				    'LND110':'LANDAREA',
				    'LOG010':'GEN_REV',
				    'LOG020':'RES_POP1',
				    'LOG130':'IGR_ST',
				    'LOG220':'TAX_REV',
				    'LOG230':'PT_REV',
				    'LOG310':'D_GEN_EXP',
				    'LOG315':'PC_GEN_EXP',
				    'PEN020':'PC_INC',
				    'PST045':'RES_POP2',
				    'PVY410':'POV_EST_FAM_DENOM',
				    'PVY420':'POV_EST_FAM_NUMER',
				    'SPR010':'SS_PERS',
				    'SPR030':'SS_PMT'}

We also need a mapping to convert the year components in the variable labels to more functional year integers.

In [13]:
#Capture year abbreviations
yr_abbr=sorted(list(set([var[6:9] for var in df_list[1].columns])))

#Create year mapping
yr_map={}

#For each year abbreviation...
for yr in yr_abbr:
    #...if it's last century...
    if yr.startswith('1'):
        yr_map.update({yr:int(yr[0]+'9'+yr[1:])})
    elif yr.startswith('2'):
        yr_map.update({yr:int(yr[0]+'0'+yr[1:])})
    else:
        yr_map.update({yr:0})

yr_map

{u'': 0,
 u'130': 1930,
 u'140': 1940,
 u'150': 1950,
 u'160': 1960,
 u'169': 1969,
 u'170': 1970,
 u'171': 1971,
 u'172': 1972,
 u'173': 1973,
 u'174': 1974,
 u'175': 1975,
 u'176': 1976,
 u'177': 1977,
 u'178': 1978,
 u'179': 1979,
 u'180': 1980,
 u'181': 1981,
 u'182': 1982,
 u'183': 1983,
 u'184': 1984,
 u'185': 1985,
 u'186': 1986,
 u'187': 1987,
 u'188': 1988,
 u'189': 1989,
 u'190': 1990,
 u'191': 1991,
 u'192': 1992,
 u'193': 1993,
 u'194': 1994,
 u'195': 1995,
 u'196': 1996,
 u'197': 1997,
 u'198': 1998,
 u'199': 1999,
 u'200': 2000,
 u'201': 2001,
 u'202': 2002,
 u'203': 2003,
 u'204': 2004,
 u'205': 2005,
 u'206': 2006,
 u'207': 2007,
 u'208': 2008,
 u'209': 2009,
 u'210': 2010,
 u'me': 0}
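
The same rule can be written as a small standalone function, which may be easier to test than the loop. This is a sketch rather than a replacement: `abbr_to_year` is a hypothetical name, and it returns `None` for anything that is not a three-digit abbreviation starting with `1` or `2`, whereas the cell above maps such leftovers (the `''` and `'me'` keys) to `0`.

```python
def abbr_to_year(abbr):
    """Convert a three-digit COSTAT year abbreviation to a four-digit year.

    The leading digit picks the century (1 -> 1900s, 2 -> 2000s) and the
    last two digits are the year within that century.
    """
    if len(abbr) != 3 or not abbr.isdigit():
        return None
    century = {'1': 1900, '2': 2000}.get(abbr[0])
    if century is None:
        return None
    return century + int(abbr[1:])

years = [abbr_to_year(a) for a in ['169', '197', '202', '210', 'me', '']]
```

Returning `None` instead of `0` is a design choice for this sketch: it keeps non-year columns from silently showing up as a year-zero observation later on.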

Now comes the interesting part. We need to roll through these keys, grab all of the variables assigned to them, and arrange them in a time series with their new labels.

In [14]:
#Capture COSTAT data columns
cd_cols=[var for var in df_list[1].columns if var.endswith('D')]

def vars2df(var_str):
    #Capture associated variables
    varlist=[var for var in cd_cols if var_str in var]
    #Capture associated sub-DF
    df=df_list[1].set_index('STCOU')[varlist]
    #Generate new multiindex for columns
    col_idx=pd.MultiIndex.from_tuples([(var[:6],yr_map[var[6:9]]) for var in varlist])
    #Set to column
    df.columns=col_idx
    df.columns.names=['VAR','YEAR']
    #Stack the years into the index
    df=df.stack('YEAR')
    #Rename the variable
    df.columns=[costat_concept_map[var_str]]
    return df
    
#Create a container to hold all of the new single-variable DFs
costat_var_dfs=[]

#For each variable concept...
for cvar in costat_concept_map.keys():
    #...capture the single-variable DF 
    costat_var_dfs.append(vars2df(cvar))
    
#Capture RES_POP* DFs
res_pop1=[df for df in costat_var_dfs if df.columns in ['RES_POP1']][0].rename(columns={'RES_POP1':'RESPOP'})
res_pop2=[df for df in costat_var_dfs if df.columns in ['RES_POP2']][0].rename(columns={'RES_POP2':'RESPOP'})

#Generate consolidated RESPOP variable
respop=pd.concat([res_pop1.loc[(slice(None),slice(1972,1997)),:],res_pop2]).sortlevel(0)

#Throw respop in costat_var_dfs
costat_var_dfs.append(respop)

costat_var_dfs[0]

STCOU,YEAR,D_GEN_EXP
00000,1972,106499000
00000,1977,170938144
00000,1982,264355426
00000,1987,392014582
00000,1992,566959282
00000,1997,723603907
00000,2002,986370919
01000,1972,1060272
01000,1977,1905744
01000,1982,3153193


Depending on the variable, we end up having different time periods being covered.  We need to make sure that all variables can be joined together, so all years must be represented in the original set (to which all others will be joined).  We can execute this by taking the union of all the tuples underlying the MultiIndex in every variable-specific DF.  With this in hand, we can reindex the original DF, and join all the variables together.

In [15]:
#Capture union of all tuples from all indices
idx_tups=sorted(set.union(*[set(var_df.index.values) for var_df in costat_var_dfs]))

#Generate a new MultiIndex
idx_all_vars=pd.MultiIndex.from_tuples(idx_tups)

#Capture first variable DF
first_var=costat_var_dfs[0].reindex(idx_all_vars)

#Join all variables together
cvar_df=first_var.join(costat_var_dfs[1:])
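
Stripped of pandas, the alignment logic amounts to the following sketch. Each variable is represented as a plain dict keyed by `(STCOU, YEAR)` tuples; `align_on_union` is a hypothetical helper, and `None` stands in for the missing values that the reindex and join would produce. The sample values are copied from outputs shown elsewhere in this notebook.

```python
def align_on_union(series_by_name):
    """Align several {(fips, year): value} dicts on the union of their keys.

    Returns the sorted key list and a {key: {name: value_or_None}} table.
    """
    all_keys = sorted(set().union(*[set(s) for s in series_by_name.values()]))
    table = {}
    for key in all_keys:
        table[key] = {name: s.get(key) for name, s in series_by_name.items()}
    return all_keys, table

series = {
    'D_GEN_EXP': {('00000', 1972): 106499000, ('00000', 1977): 170938144},
    'LANDAREA': {('30113', 1980): 244.95, ('30113', 1990): 245.39},
}
keys, aligned = align_on_union(series)
```

Every variable ends up with an entry for every `(STCOU, YEAR)` pair that appears in any variable, which is the property the join relies on.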

Let's check our year coverage of each variable.

In [16]:
#Create dict to hold years of observation
var_yr_obs={}

#For each variable...
for var in cvar_df.columns:
    #...capture the years in which the variable is observed...
    yr_list=list(cvar_df[cvar_df[var].notnull()][var].ix['00000'].index.values)
    #...and update var_yr_obs
    var_yr_obs.update({var:yr_list})
    
var_yr_obs

{'CH_HS_UNT': [1980, 1990, 2000],
 'D_GEN_EXP': [1972, 1977, 1982, 1987, 1992, 1997, 2002],
 'EDUC_SERV_EMP_PNFARM': [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009],
 'FOOD_SERV_EMP_PNFARM': [1998,
  1999,
  2000,
  2001,
  2002,
  2003,
  2004,
  2005,
  2006,
  2007,
  2008,
  2009],
 'GEN_REV': [1972, 1977, 1982, 1987, 1992, 1997, 2002],
 'HSG_UNITS': [1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010],
 'HSG_UNITS_ACS': [1980, 1990, 2000, 2009],
 'HSLD_PERS': [1970, 1980, 1990, 2000, 2009, 2010],
 'IGR_ST': [1977, 1982, 1987, 1992, 1997, 2002],
 'LANDAREA': [1980, 1990, 2000, 2010],
 'MANU_EMP_PNFARM': [1998,
  1999,
  2000,
  2001,
  2002,
  2003,
  2004,
  2005,
  2006,
  2007,
  2008,
  2009],
 'MDHOMEVAL': [2000, 2009],
 'MED_INC': [1995,
  1997,
  1998,
  1999,
  2000,
  2001,
  2002,
  2003,
  2004,
  2005,
  2006,
  2007,
  2008,
  2009],
 'MFG_EMP': [2001, 2002, 2003, 2004, 2005, 2006, 2007],
 'OTH_SERV_EMP_PNFARM': [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009],
 'PC_GEN_E

Now we have an interesting situation with the schooling variables.  There are only two observations for each of the variable concepts.  The 2000 observation is a count of total students in each grade group, while the 2009 observation is split by sex.  Probably the best thing to do is combine the male and female counts for each group and insert the sum into the original variable. For school enrollment over three (`PUB_SCHL_OV3`), the 2000 observation would just be the value of `PUB_SCHL_OV3`, and the edited 2009 value would be `PUB_SCHL_OV3_F + PUB_SCHL_OV3_M`.  On the private school side, things are slightly different because the actual grade groups change.  Since we are only concerned with total private enrollment, we will create a new variable (`PRV_SCHL`).

In [17]:
#Capture FIPS codes in index
idx_fips=sorted(set([idx[0] for idx in cvar_df.index]))

#Create container for assignment by FIPS subset
fips_dfs=[]

#For each FIPS code...
for fips in idx_fips:
    #...subset to that county...
    fips_tmp=cvar_df.ix[fips]
    #...add fips back in...
    fips_tmp['STCOU']=fips
    #Generate new PUB_SCHL_OV3 at 2009
    fips_tmp.ix[2009,'PUB_SCHL_OV3']=fips_tmp[['PUB_SCHL_OV3_F','PUB_SCHL_OV3_M']].sum(axis=1).ix[2009]
    #Generate lists to capture private school enrollment
    pre2003_prv_schl=['PRV_SCHL_PREK','PRV_SCHL_KIND','PRV_SCHL_1_8','PRV_SCHL_9_12']
    pst2002_prv_schl=['PRV_SCHL_PREK_F','PRV_SCHL_KIND_F','PRV_SCHL_1_4_F','PRV_SCHL_5_8_F','PRV_SCHL_9_12_F',\
                     'PRV_SCHL_PREK_M','PRV_SCHL_KIND_M','PRV_SCHL_1_4_M','PRV_SCHL_5_8_M','PRV_SCHL_9_12_M']
    #Create new private school variable
    fips_tmp['PRV_SCHL']=np.NaN
    fips_tmp.ix[2000,'PRV_SCHL']=fips_tmp[pre2003_prv_schl].sum(axis=1).ix[2000]
    fips_tmp.ix[2009,'PRV_SCHL']=fips_tmp[pst2002_prv_schl].sum(axis=1).ix[2009]
    #Throw it in the list
    fips_dfs.append(fips_tmp)
    
#Concatenate back together
cvar_df=pd.concat(fips_dfs)

#Move FIPS back to index
cvar_df.set_index('STCOU',append=True,inplace=True)

#Swap and sort index levels
cvar_df=cvar_df.swaplevel(0,1)
cvar_df.sortlevel(0,inplace=True)

cvar_df[['PUB_SCHL_OV3','PUB_SCHL_OV3_F','PRV_SCHL']+pre2003_prv_schl].ix['00000'].ix[2000:]

Unnamed: 0,PUB_SCHL_OV3,PUB_SCHL_OV3_F,PRV_SCHL,PRV_SCHL_PREK,PRV_SCHL_KIND,PRV_SCHL_1_8,PRV_SCHL_9_12
2000,76632927.0,,8108420.0,2290976.0,621446.0,3663675.0,1532323.0
2001,,,,,,,
2002,,,,,,,
2003,,,,,,,
2004,,,,,,,
2005,,,,,,,
2006,,,,,,,
2007,,,,,,,
2008,,,,,,,
2009,79887998.0,40465637.0,8107475.0,,,,
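
As a quick arithmetic check on the 2000 row above, the four private-school components should add up to the displayed `PRV_SCHL` total. The sketch below does this in plain Python; `sum_present` is a hypothetical helper, and returning `None` when every component is missing is a choice made for this sketch, not necessarily what the pandas `sum` call in the cell does.

```python
def sum_present(record, columns):
    """Sum the non-missing values of the given columns.

    Returns None when every listed column is missing.
    """
    present = [record[c] for c in columns if record.get(c) is not None]
    return sum(present) if present else None

# National (STCOU 00000) private enrollment components for 2000,
# copied from the output above
row_2000 = {'PRV_SCHL_PREK': 2290976.0,
            'PRV_SCHL_KIND': 621446.0,
            'PRV_SCHL_1_8': 3663675.0,
            'PRV_SCHL_9_12': 1532323.0}
components = ['PRV_SCHL_PREK', 'PRV_SCHL_KIND', 'PRV_SCHL_1_8', 'PRV_SCHL_9_12']
prv_schl_2000 = sum_present(row_2000, components)
```

The result agrees with the 8108420.0 shown for `PRV_SCHL` in 2000.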


Now we are in a position to fill in missing values.  For expediency, we will use a combination of linear interpolation (between observed values) and padding for exterior values.

In [18]:
#Capture FIPS codes in index
idx_fips=sorted(set([idx[0] for idx in cvar_df.index]))

#Create container to hold sub-DFs after interpolation and padding
cvar_df_list=[]

#For each FIPS code...
for code in idx_fips:
    #...perform interpolation/padding on associated subset...
    tmp_sub=cvar_df.ix[code].interpolate().fillna(method='bfill')
    #...redefine FIPS code variable...
    tmp_sub['STCOU']=code
    #...and throw it in cvar_df_list
    cvar_df_list.append(tmp_sub)

#Concatenate back together
cvar_df2=pd.concat(cvar_df_list)

#Set index
cvar_df2.set_index('STCOU',append=True,inplace=True)

#Swap and sort index levels
cvar=cvar_df2.swaplevel(0,1)
cvar.sortlevel(0,inplace=True)

cvar

STCOU,YEAR,D_GEN_EXP,FOOD_SERV_EMP_PNFARM,PRV_SCHL_KIND,OTH_SERV_EMP_PNFARM,PRE_1940,MED_INC,PERS_POVT,PRV_SCHL_9_12,PC_GEN_EXP,MANU_EMP_PNFARM,...,RES_POP2,PRV_SCHL_9_12_M,HSLD_PERS,PRV_SCHL_1_4_F,PRV_SCHL_5_8_F,TOT_EMP_PNFARM,RETL_EMP,IGR_ST,RESPOP,PRV_SCHL
00000,1940,106499000.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,509.00,16945834,...,282171957,849224,3.110000,886268,885973,74844180,18549500,6.027697e+07,2.092839e+08,8108420.000000
00000,1950,106499000.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,509.00,16945834,...,282171957,849224,3.110000,886268,885973,74844180,18549500,6.027697e+07,2.092839e+08,8108420.000000
00000,1960,106499000.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,509.00,16945834,...,282171957,849224,3.110000,886268,885973,74844180,18549500,6.027697e+07,2.092839e+08,8108420.000000
00000,1969,106499000.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,509.00,16945834,...,282171957,849224,3.110000,886268,885973,74844180,18549500,6.027697e+07,2.092839e+08,8108420.000000
00000,1970,106499000.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,509.00,16945834,...,282171957,849224,3.110000,886268,885973,74844180,18549500,6.027697e+07,2.092839e+08,8108420.000000
00000,1972,106499000.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,509.00,16945834,...,282171957,849224,3.038000,886268,885973,74844180,18549500,6.027697e+07,2.092839e+08,8108420.000000
00000,1975,138718572.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,643.50,16945834,...,282171957,849224,2.966000,886268,885973,74844180,18549500,6.027697e+07,2.145219e+08,8108420.000000
00000,1977,170938144.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,778.00,16945834,...,282171957,849224,2.894000,886268,885973,74844180,18549500,6.027697e+07,2.197599e+08,8108420.000000
00000,1979,194292464.5,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,868.75,16945834,...,282171957,849224,2.822000,886268,885973,74844180,18549500,6.904845e+07,2.227362e+08,8108420.000000
00000,1980,217646785.0,9466088,621446,5420087,18832498.000000,34076.0,13.80,1532323,959.50,16945834,...,282171957,849224,2.750000,886268,885973,74844180,18549500,7.781994e+07,2.257125e+08,8108420.000000
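
To make the fill rule concrete, here is a minimal pure-Python sketch of what the cell above appears to do for a single column: interior gaps are filled linearly by row position (not by the size of the gap in years), and gaps at either end repeat the nearest observed value. `interpolate_and_pad` is a hypothetical helper written for this note, and the row-position behavior is inferred from the output above, where the 1975 value of `D_GEN_EXP` is the plain midpoint of the 1972 and 1977 observations.

```python
def interpolate_and_pad(values):
    """Fill None entries: linear by position between observed values,
    nearest observed value before the first and after the last."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    if not known:
        return out
    # Interior gaps: equal steps between neighbouring observations
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / float(right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    # Exterior gaps: repeat the boundary observations
    for i in range(known[0]):
        out[i] = out[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    return out

# D_GEN_EXP for STCOU 00000: observed only in 1972 and 1977 in this window
years = [1940, 1950, 1960, 1969, 1970, 1972, 1975, 1977]
observed = [None, None, None, None, None, 106499000.0, None, 170938144.0]
filled = interpolate_and_pad(observed)
```

Here `filled` reproduces the first eight `D_GEN_EXP` rows shown above, including 138718572.0 for 1975. A year-weighted interpolation would give a different 1975 value, because 1975 is three-fifths of the way from 1972 to 1977 rather than halfway.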


### Derived Variables

There are a number of variables that must be calculated and included into the set at this point.

Derived Concept|Variable|Calculation From Existing Variables
---------------|--------|-----------
Population Density|`DENSITY`|$\frac{\text{RESPOP}}{\text{TOT_AREA}}$
Population Growth|`POPGROWTH`|$\frac{\text{RESPOP}_t - \text{RESPOP}_{t-1}}{\text{RESPOP}_{t-1}}$
Child Proportion|`PYOUNG`|$\frac{\text{POP_TH18}}{\text{RESPOP}}$
Senior Proportion|`POP65`|$\frac{\text{POP_OV65}}{\text{RESPOP}}$
Population$^2$|`RESPOP2`|$\text{RESPOP}^2$
Pre-1940 Housing Proportion|`PRE1940`|$\frac{\text{PRE_1940}}{\text{HSG_UNITS}}$
Ratio of Private to Public School Enrollment|`PVT_SCH`|$\frac{\text{PRV_SCHL}}{\text{PUB_SCHL_OV3}}$
Poverty Rate|`POVERTY`|$\frac{\text{POV_EST_FAM_NUMER}}{\text{POV_EST_FAM_DENOM}}$
Average Social Security Benefit per Recipient|`PC_SSI`|$\frac{\text{SS_PMT}}{\text{SS_PERS}}$
Income Inequality (?)|`DIVERSITY`|$\frac{\text{PC_INC}}{\text{POVERTY}}$
Private Nonfarm Employment to Population Ratio|`EMP_RES`|$\frac{\text{TOT_EMP_PNFARM}}{\text{RESPOP}}$
Private Nonfarm Manufacturing Employment to Population Ratio|`MANU_RES`|$\frac{\text{MANU_EMP_PNFARM}}{\text{RESPOP}}$
Private Nonfarm Retail Employment to Population Ratio|`RETL_RES`|$\frac{\text{RETL_EMP_PNFARM}}{\text{RESPOP}}$
Private Nonfarm Service Employment to Population Ratio|`SERV_RES`|$\frac{\text{PROF_SERV_EMP_PNFARM}+\text{SUPP_SERV_EMP_PNFARM}+\text{EDUC_SERV_EMP_PNFARM}+\text{FOOD_SERV_EMP_PNFARM}+\text{OTH_SERV_EMP_PNFARM}}{\text{RESPOP}}$
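
A few of the ratios in the table can be sketched in plain Python as follows. The helper names (`safe_ratio`, `derive_ratios`) and the sample numbers are invented for illustration; the formulas are the ones listed in the table. Returning `None` for a missing or zero denominator is an assumption of this sketch.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when either input is missing or the
    denominator is zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / float(denominator)

def derive_ratios(current, previous=None):
    """Compute a handful of derived variables for one county-year.

    current and previous are dicts of base variables for years t and t-1.
    """
    respop = current.get('RESPOP')
    derived = {
        'DENSITY': safe_ratio(respop, current.get('TOT_AREA')),
        'PYOUNG': safe_ratio(current.get('POP_TH18'), respop),
        'POP65': safe_ratio(current.get('POP_OV65'), respop),
        'POPGROWTH': None,
    }
    if previous is not None and respop is not None:
        prev_pop = previous.get('RESPOP')
        if prev_pop is not None:
            derived['POPGROWTH'] = safe_ratio(respop - prev_pop, prev_pop)
    return derived

# Invented values for a single hypothetical county
this_year = {'RESPOP': 1000, 'TOT_AREA': 250, 'POP_TH18': 200, 'POP_OV65': 150}
last_year = {'RESPOP': 800}
ratios = derive_ratios(this_year, last_year)
```

Guarding the division in one place keeps the zero and missing denominators discussed in the following sections from raising errors or producing infinities.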

### Land Area

Before we generate these variables, we need to ensure that our denominators are present and nonzero.  To start, let's look at `LANDAREA`.  There are many instances in which this value is missing or zero.  (Note that we are referring to the data before interpolation for this investigation.)

In [19]:
print 'Instances of missing or zero values for LANDAREA'
len(cvar_df[(cvar_df['LANDAREA'].isnull()) | (cvar_df['LANDAREA']==0)]['LANDAREA'])

Instances of missing or zero values for LANDAREA


115150

Data quality seems to get better over time, however.  There are only three instances in which this occurs in 2010.

In [20]:
cvar_df[(cvar_df['LANDAREA'].isnull()) | (cvar_df['LANDAREA']==0)]['LANDAREA'].xs(2010,level=1)

STCOU
30113    0
51560    0
51780    0
Name: LANDAREA, dtype: float64

Do all the years have zero values for these counties?

In [21]:
cvar_df.ix['30113']['LANDAREA']

1940       NaN
1950       NaN
1960       NaN
1969       NaN
1970       NaN
1972       NaN
1975       NaN
1977       NaN
1979       NaN
1980    244.95
1981       NaN
1982       NaN
1983       NaN
1984       NaN
1985       NaN
1986       NaN
1987       NaN
1988       NaN
1989       NaN
1990    245.39
1991       NaN
1992       NaN
1993       NaN
1994       NaN
1995       NaN
1996       NaN
1997       NaN
1998       NaN
1999       NaN
2000      0.00
2001       NaN
2002       NaN
2003       NaN
2004       NaN
2005       NaN
2006       NaN
2007       NaN
2008       NaN
2009       NaN
2010      0.00
Name: LANDAREA, dtype: float64

In fact, they do not.  Lest there be concern that this is a function of the reshaping of the data, note that the input data corroborate this behavior.

In [22]:
df_list[1][df_list[1]['STCOU']=='30113'][[var for var in df_list[1].columns if 'LND110' in var]+['STCOU']]

Unnamed: 0,LND110180F,LND110180D,LND110190F,LND110190D,LND110200F,LND110200D,LND110210F,LND110210D,STCOU
1682,0,244.95,0,245.39,5,0,5,0,30113


It seems to me that the land area associated with a county rarely changes all that much, and it certainly never reaches zero.  These constraints afford a simple solution: each county will be assigned the maximum land area observed for that county.  *It is critical to note that if we ever change how we impute exterior missing values above so that the boundary values are not simply repeated, we will need to revisit how this is done (imputed values could end up being the maximum in each county).*

In [23]:
#Create dictionary to hold FIPS and max area
max_area={}

#For each FIPS code...
for code in idx_fips:
    #...update max_area with the max area for that county...
    max_area.update({code:cvar_df.ix[code]['LANDAREA'].max()})
    #...and update the value in cvar
    cvar.ix[code,'LANDAREA']=max_area[code]
    
print 'Number of counties with zero land area:',len([code for code in idx_fips if max_area[code]==0])
#NaN never compares equal to anything (itself included), so test with pd.isnull rather than ==
print 'Number of counties with all missing values for land area:',len([code for code in idx_fips if pd.isnull(max_area[code])])

Number of counties with zero land area: 0
Number of counties with all missing values for land area: 0
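
One detail worth keeping in mind when counting all-missing counties: NaN never compares equal to anything, including itself, so a missing-value check needs an `isnan`-style predicate rather than `==`. The sketch below illustrates both the max-area rule and that pitfall in plain Python. `max_observed` is a hypothetical helper, the `30113` values are the observed ones shown earlier, and `no_obs` is an invented county with no observations.

```python
import math

def max_observed(values):
    """Largest non-missing value, or NaN when nothing was observed."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    return max(observed) if observed else float('nan')

nan = float('nan')
areas = {
    '30113': [nan, 244.95, 245.39, 0.0, 0.0],  # 1980, 1990, 2000, 2010 observed
    'no_obs': [nan, nan],                      # hypothetical: never observed
}
max_area = {code: max_observed(vals) for code, vals in areas.items()}

# An equality test against NaN finds nothing, even for the all-missing county
found_with_eq = [code for code in sorted(max_area) if max_area[code] == nan]
# math.isnan identifies it correctly
found_with_isnan = [code for code in sorted(max_area) if math.isnan(max_area[code])]
```

Taking the maximum recovers 245.39 for county 30113 despite the zeroes recorded in 2000 and 2010.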


Did we successfully remove problematic values from the `cvar` set?

In [24]:
print 'Instances of missing or zero values for LANDAREA'
len(cvar[(cvar['LANDAREA'].isnull()) | (cvar['LANDAREA']==0)]['LANDAREA'])

Instances of missing or zero values for LANDAREA


0

### Resident Population

How many problematic values do we have?  (Now we are looking post-interpolation because we know these things have holes, and we interpolated for that purpose.)

In [25]:
print 'Instances of missing or zero values for RESPOP'
print len(cvar[(cvar['RESPOP'].isnull()) | (cvar['RESPOP']==0)]['RESPOP']),len(cvar[(cvar['RESPOP']==0)]['RESPOP'])

Instances of missing or zero values for RESPOP
374 374


What do these look like?

In [26]:
#Capture counties featuring zero or missing values for RESPOP
bad_respop=sorted(set([item[0] for item in cvar[(cvar['RESPOP'].isnull()) | (cvar['RESPOP']==0)]['RESPOP'].index]))
print 'Number of problem counties:',len(bad_respop)

print bad_respop

Number of problem counties: 23
[u'02013', u'02050', u'02068', u'02164', u'02230', u'02275', u'02282', u'02290', u'04012', u'08014', u'15005', u'30113', u'35006', u'36005', u'36061', u'36081', u'36085', u'51083', u'51560', u'51683', u'51685', u'51735', u'51780']


Let's compare pre- and post-interpolation `RESPOP` values, throwing in the components that went into creating pre-interpolation `RESPOP` for good measure.

In [27]:
test_cty=bad_respop[11]
DataFrame(cvar.ix[test_cty]['RESPOP']).join(DataFrame(cvar_df.ix[test_cty][['RESPOP','RES_POP1','RES_POP2']]),rsuffix='_pre')

Unnamed: 0,RESPOP,RESPOP_pre,RES_POP1,RES_POP2
1940,85.0,,,
1950,85.0,,,
1960,85.0,,,
1969,85.0,,,
1970,85.0,,,
1972,85.0,85.0,85.0,
1975,156.0,,,
1977,227.0,227.0,227.0,
1979,186.25,,,
1980,145.5,,,


So, these zeroes are occurring in the observed data, and they come in blocks.  Either we start with zeroes and then move into larger numbers, or the opposite happens.  It's as though the FIPS codes are coming in and out of use. Do these zero population values align with zero or missing data in all other fields?

In [28]:
cvar_df[(cvar_df['RESPOP']==0)][:20].T

STCOU,02013,02013,02013,02013,02050,02050,02068,02068,02068,02068,02164,02164,02164,02164,02230,02230,02230,02230,02230,02230
Unnamed: 0_level_1,1972,1977,1982,1987,1972,1977,1972,1977,1982,1987,1972,1977,1982,1987,1972,1977,1982,1987,1992,1997
D_GEN_EXP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FOOD_SERV_EMP_PNFARM,,,,,,,,,,,,,,,,,,,,
PRV_SCHL_KIND,,,,,,,,,,,,,,,,,,,,
OTH_SERV_EMP_PNFARM,,,,,,,,,,,,,,,,,,,,
PRE_1940,,,,,,,,,,,,,,,,,,,,
MED_INC,,,,,,,,,,,,,,,,,,,,0.0
PERS_POVT,,,,,,,,,,,,,,,,,,,,0.0
PRV_SCHL_9_12,,,,,,,,,,,,,,,,,,,,
PC_GEN_EXP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MANU_EMP_PNFARM,,,,,,,,,,,,,,,,,,,,


It certainly looks that way.  The best path forward here is a little unclear.  I am inclined to drop all records with a zero value for `RESPOP`, but the imputation routine is a wrinkle.  The zeroes in the raw data are observed, and therefore influence the interpolation process for the values between the zero and the last non-zero value.  **This is just another deficiency in our quick and dirty interpolation routine that really should be revisited.**  Precisely when values should cease to exist for a given county is also unknown given the sparsity of the data.  For the time being, we will proceed (unhappily) by dropping records with zero population.  As can be seen below, we still retain 99.7% of the data.

In [29]:
#Capture good population subset
cvar_gp=cvar[cvar['RESPOP']>0]

len(cvar),len(cvar_gp),len(cvar_gp)/float(len(cvar))

(127920, 127546, 0.9970762976860538)

### Housing Units

Did we get lucky enough to drop problematic housing unit counts with our population fix?

In [30]:
print 'Instances of missing or zero values for HSG_UNITS'
print len(cvar_gp[(cvar_gp['HSG_UNITS'].isnull()) |\
                  (cvar_gp['HSG_UNITS']==0)]['HSG_UNITS']),\
      len(cvar_gp[(cvar_gp['HSG_UNITS']==0)]['HSG_UNITS'])

Instances of missing or zero values for HSG_UNITS
141 141
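
Since we run this same zero-or-missing check on several columns, it could be wrapped in a small helper.  The sketch below is only illustrative: `count_bad` is a hypothetical name, and the toy frame stands in for `cvar_gp`.

```python
import numpy as np
import pandas as pd

def count_bad(df, col):
    """Return (zero-or-missing count, zero-only count) for one column."""
    is_zero = df[col] == 0
    is_missing = df[col].isnull()
    return int((is_zero | is_missing).sum()), int(is_zero.sum())

# Toy frame (hypothetical values) standing in for cvar_gp
toy = pd.DataFrame({'HSG_UNITS': [0.0, 1657.0, np.nan, 0.0, 2278.0]})
bad, zeroes = count_bad(toy, 'HSG_UNITS')
```

When the two counts match, as they do for `HSG_UNITS` above, every problem record is a zero rather than a missing value.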


Negative.  Where are these guys?

In [31]:
#Capture counties featuring zero or missing values for HSG_UNITS
bad_hsg=sorted(set([item[0] for item in cvar_gp[(cvar_gp['HSG_UNITS'].isnull()) | \
                                                (cvar_gp['HSG_UNITS']==0)]['HSG_UNITS'].index]))
print 'Number of problem counties:',len(bad_hsg)

print bad_hsg

Number of problem counties: 44
[u'02016', u'02020', u'02060', u'02068', u'02070', u'02090', u'02100', u'02105', u'02110', u'02122', u'02130', u'02150', u'02170', u'02180', u'02185', u'02188', u'02195', u'02198', u'02220', u'02230', u'02240', u'02261', u'02270', u'02275', u'08014', u'32510', u'35006', u'35028', u'51515', u'51550', u'51570', u'51580', u'51595', u'51600', u'51610', u'51620', u'51640', u'51678', u'51720', u'51775', u'51780', u'51810', u'51820', u'55078']


In [32]:
test_cty=bad_hsg[36]

cvar_gp.ix[test_cty][['RESPOP','HSG_UNITS']]

Unnamed: 0,RESPOP,HSG_UNITS
1940,6111.0,0.0
1950,6111.0,0.0
1960,6111.0,1657.0
1969,6111.0,1967.5
1970,6111.0,2278.0
1972,6111.0,2384.6
1975,6294.5,2491.2
1977,6478.0,2597.8
1979,6516.0,2704.4
1980,6554.0,2811.0


Spot checks reveal a pattern of zeroes occurring in very early years.  Just what is the temporal distribution?

In [33]:
#Tabulate the years in which zero or missing HSG_UNITS values occur
Series(list([item[1] for item in cvar_gp[(cvar_gp['HSG_UNITS'].isnull()) | \
                                         (cvar_gp['HSG_UNITS']==0)]['HSG_UNITS'].index])).value_counts()



1940    39
1950    35
1960    29
1970    12
1969    12
2000     3
1999     3
1998     3
1990     1
1989     1
1988     1
1980     1
1979     1
dtype: int64

Only 10 observations occur within our statistical analysis window (1990 on).  Given the size of our set, we can probably drop these without fear of much bias.  If these few observations did materially change the estimates, it could only be due to extremely high leverage within the subset.  

In [34]:
cvar_gh=cvar_gp[cvar_gp['HSG_UNITS']>0]
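
Because `cvar_gh` is built by boolean-masking `cvar_gp`, the pandas version used here treats it as a possible view of the parent frame, and that is what produces the `SettingWithCopyWarning` messages when derived columns are assigned to it further down.  A minimal sketch of one common way to avoid the ambiguity, assuming we never want changes to flow back to the parent, is to take an explicit copy.  The frames and values below are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-in for cvar_gp (hypothetical values)
toy_gp = pd.DataFrame({'RESPOP': [100.0, 200.0, 300.0],
                       'HSG_UNITS': [0.0, 80.0, 120.0]})

# Take an explicit copy of the filtered rows so later assignments are unambiguous
toy_gh = toy_gp[toy_gp['HSG_UNITS'] > 0].copy()

# Assign the derived column through .loc; the parent frame is left untouched
toy_gh.loc[:, 'RESPOP2'] = toy_gh['RESPOP'] ** 2
```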

### Poverty

The poverty measure is constructed as the ratio of people in poverty (`POV_EST_FAM_NUMER`) to the number of people for whom a poverty determination could be estimated (`POV_EST_FAM_DENOM`).  Have we taken care of problematic values while fixing issues with the above variables?

In [35]:
print 'Instances of missing or zero values for POV_EST_FAM_NUMER'
print len(cvar_gh[(cvar_gh['POV_EST_FAM_NUMER'].isnull()) |\
                  (cvar_gh['POV_EST_FAM_NUMER']==0)]['POV_EST_FAM_NUMER']),\
      len(cvar_gh[(cvar_gh['POV_EST_FAM_NUMER']==0)]['POV_EST_FAM_NUMER'])
print '\nInstances of missing or zero values for POV_EST_FAM_DENOM'
print len(cvar_gh[(cvar_gh['POV_EST_FAM_DENOM'].isnull()) |\
                  (cvar_gh['POV_EST_FAM_DENOM']==0)]['POV_EST_FAM_DENOM']),\
      len(cvar_gh[(cvar_gh['POV_EST_FAM_DENOM']==0)]['POV_EST_FAM_DENOM'])

Instances of missing or zero values for POV_EST_FAM_NUMER
86 86

Instances of missing or zero values for POV_EST_FAM_DENOM
0 0


Nice.  There are no missing or zero values in the denominator, and no missing values in the numerator.  Zeroes in the numerator are fishy, but not numerically problematic.  Let's see where they are.

In [36]:
#Capture counties featuring zero or missing values for POV_EST_FAM_NUMER
bad_pov=sorted(set([item[0] for item in cvar_gh[(cvar_gh['POV_EST_FAM_NUMER'].isnull()) | \
                                                (cvar_gh['POV_EST_FAM_NUMER']==0)]['POV_EST_FAM_NUMER'].index]))
print 'Number of problem counties:',len(bad_pov)

print bad_pov

Number of problem counties: 10
[u'02060', u'02230', u'08053', u'08111', u'15005', u'30113', u'32009', u'32029', u'48269', u'48301']


They can't possibly cover all years within each county affected.

In [37]:
test_cty=bad_pov[0]

cvar_gh.ix[test_cty][['POV_EST_FAM_NUMER','POV_EST_FAM_DENOM']].\
    join(cvar_df.ix[test_cty][['POV_EST_FAM_NUMER','POV_EST_FAM_DENOM']],rsuffix='_pre')


Unnamed: 0,POV_EST_FAM_NUMER,POV_EST_FAM_DENOM,POV_EST_FAM_NUMER_pre,POV_EST_FAM_DENOM_pre
1969,16.1,149.0,16.1,
1970,12.88,149.0,,149.0
1972,9.66,154.4,,
1975,6.44,159.8,,
1977,3.22,165.2,,
1979,0.0,170.6,0.0,
1980,1.1,176.0,,176.0
1981,2.2,186.9,,
1982,3.3,197.8,,
1983,4.4,208.7,,


The fact that observed zeroes occur between observed nonzero values is interesting.  I guess it leads me to believe we shouldn't mess with them.  

On an unrelated note, the disparity in measurement years for the ratio is no cause for alarm.  That's how it is in the raw data:

    PVY410200D | Families for whom poverty status has been determined 2000
    PVY420199D | Families below poverty level 1999

### Income Inequality

The downside of leaving a few poverty values as zeroes is that it leads to complications with our measure of economic inequality (`DIVERSITY`).  `DIVERSITY` is the ratio of `PC_INC` to `POVERTY`.  So, the question here is really what should we do with `DIVERSITY` when `POVERTY` is zero?  I guess my inclination is to use the lowest nonzero value.  It gets the point across about a negligible poverty rate, and preserves the calculation.

We can handle this on the fly.
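
As a quick illustration of the lowest-nonzero-value idea, here is a sketch on made-up numbers.  The toy frame and the `pov_floor` name are hypothetical; only `PC_INC`, `POVERTY`, and `DIVERSITY` come from our set.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical) for per-capita income and the poverty ratio
toy = pd.DataFrame({'PC_INC': [20000.0, 30000.0, 25000.0],
                    'POVERTY': [0.10, 0.0, 0.05]})

# Smallest strictly positive poverty ratio in the set
min_nonzero_pov = toy.loc[toy['POVERTY'] > 0, 'POVERTY'].min()

# Replace zero poverty with that floor, then divide once
pov_floor = toy['POVERTY'].where(toy['POVERTY'] > 0, min_nonzero_pov)
toy['DIVERSITY'] = toy['PC_INC'] / pov_floor
```

Flooring the denominator first means the division by zero is never computed at all.  An `np.where` over two precomputed ratios gives the same result, but it evaluates both branches for every row, including the infinite one that gets discarded.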

### Finally ... Calculating Derivative Variables

Note that all of the denominator adjustments above were made retroactively, after the calculations below produced a number of infinite values.  In the original version, all of these were calculated within `cvar`, and now we are using our set without problematic housing and population figures, `cvar_gh`.

In [38]:
#Generate new variables
cvar_gh['DENSITY']=cvar_gh['RESPOP']/cvar_gh['LANDAREA']
cvar_gh['POPGROWTH']=cvar_gh['RESPOP']/cvar_gh['RESPOP'].shift()-1
cvar_gh['PYOUNG']=cvar_gh['POP_TH18']/cvar_gh['RESPOP']
cvar_gh['POP65']=cvar_gh['POP_OV65']/cvar_gh['RESPOP']
cvar_gh['RESPOP2']=cvar_gh['RESPOP']**2
cvar_gh['PRE1940']=cvar_gh['PRE_1940']/cvar_gh['HSG_UNITS']
cvar_gh['PVT_SCH']=cvar_gh['PRV_SCHL']/cvar_gh['PUB_SCHL_OV3']
cvar_gh['POVERTY']=cvar_gh['POV_EST_FAM_NUMER']/cvar_gh['POV_EST_FAM_DENOM']
cvar_gh['PC_SSI']=cvar_gh['SS_PMT']/cvar_gh['SS_PERS']
### Don't like this inequality measure, but no obvious alternative comes to mind
min_nonzero_pov=cvar_gh['POVERTY'][cvar_gh['POVERTY']>0].min()
cvar_gh['DIVERSITY']=np.where(cvar_gh['POVERTY']>0,\
                              cvar_gh['PC_INC']/cvar_gh['POVERTY'],
                              cvar_gh['PC_INC']/min_nonzero_pov)
cvar_gh['EMP_RES']=cvar_gh['TOT_EMP_PNFARM']/cvar_gh['RESPOP']
cvar_gh['MANU_RES']=cvar_gh['MANU_EMP_PNFARM']/cvar_gh['RESPOP']
cvar_gh['RETL_RES']=cvar_gh['RETL_EMP_PNFARM']/cvar_gh['RESPOP']
nonfarm_serv_emp=cvar_gh[['PROF_SERV_EMP_PNFARM','SUPP_SERV_EMP_PNFARM','EDUC_SERV_EMP_PNFARM',\
                       'FOOD_SERV_EMP_PNFARM','OTH_SERV_EMP_PNFARM']].sum(axis=1)
cvar_gh['SERV_RES']=nonfarm_serv_emp/cvar_gh['RESPOP']

#Sort index
cvar_gh.sortlevel(0,inplace=True)

#Write to disk
cvar_gh.to_csv('../data/costat_mod_vars1940_2010.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from IPython.kernel.zmq import kernelapp as app

In [39]:
cvar_gh[['DENSITY','POPGROWTH','PYOUNG','POP65','RESPOP2','PRE1940','PVT_SCH','POVERTY','PC_SSI','DIVERSITY',\
     'EMP_RES','MANU_RES','RETL_RES','SERV_RES']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
DENSITY,127405,223.95,1906.109,0.013034,16.16254,39.01984,95.64201,110638.6
POPGROWTH,127404,1.243808,386.942,-0.999963,-0.002055149,0.002072497,0.01323712,138025.6
PYOUNG,127405,0.2832753,0.1157505,0.0,0.2343727,0.260479,0.299647,10.63735
POP65,127405,0.1625568,0.06571715,0.0,0.1331831,0.1552759,0.1804389,6.466156
RESPOP2,127405,21639670000000.0,1201813000000000.0,121.0,112211600.0,534395700.0,3444399000.0,9.425302e+16
PRE1940,127405,0.2218185,0.1450791,0.002534,0.1028987,0.1880092,0.323802,2.831754
PVT_SCH,127343,0.07588627,0.04238231,0.0,0.04536906,0.06991212,0.09954163,0.3405466
POVERTY,127405,0.1032074,0.07043514,0.0,0.06076257,0.09423394,0.1361964,3.553846
PC_SSI,127241,0.5172212,0.3315889,0.01404,0.2828775,0.511386,0.7333718,48.8329
DIVERSITY,127405,124874100.0,6067435000.0,0.0,159805.4,261368.8,459525.0,481611300000.0
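
One thing worth flagging in the table above: the `POPGROWTH` count is exactly one short of the other counts.  That is what an ungrouped `shift()` produces, since only the very first row of the whole set lacks a predecessor, and it means the first year of each county is being compared with the last year of the county before it.  This may be contributing to the extreme maximum.  The sketch below contrasts the two approaches on a hypothetical two-county panel; it is an illustration of the issue, not a change to the calculation above.

```python
import pandas as pd

# Toy two-county panel (hypothetical values) indexed like cvar_gh: (STCOU, YEAR)
idx = pd.MultiIndex.from_tuples(
    [('01001', 1990), ('01001', 1991), ('01003', 1990), ('01003', 1991)],
    names=['STCOU', 'YEAR'])
toy = pd.DataFrame({'RESPOP': [100.0, 110.0, 5000.0, 5100.0]}, index=idx)

# Ungrouped shift: the first row of 01003 is divided by the last row of 01001
naive = toy['RESPOP'] / toy['RESPOP'].shift() - 1

# Grouped shift: each county's first row has no predecessor, so it becomes NaN
by_county = toy['RESPOP'] / toy.groupby(level='STCOU')['RESPOP'].shift() - 1
```

Note that even the grouped version treats every pair of consecutive records alike, so a 1940 to 1950 change and a 1990 to 1991 change are not on the same footing.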


In the end, after all our adjustments for bad denominator values, what is the proportion of records retained?

In [40]:
len(cvar_gh)/float(len(cvar))

0.9959740462789243

## Joining Data Together

It looks like we will still be missing the cumulative TEL variable (`TYPE2_y`), which identifies the number of years since the passage of a Type 2 TEL.  We should be able to construct this variable so long as TEL passage occurs in the portion of the data set in which consecutive observations are separated by a single year (that is, after 1969).  This should be the case, but let's confirm it in the TEL set (`df_list[2]`).

In [41]:
# df_list[2][df_list[2]['TYPE2']==1]['YEAR'].value_counts()

Turns out we are good to go.  To define the cumulative measure, it must be done on a state-by-state basis.  Consequently, we will go with the "split-apply-combine" approach.

In [42]:
# #Capture the tel set as third (makes sense further down as a retrofit)
# third_tmp=df_list[2].set_index(['FIPSST','YEAR'])

# #Sort index
# third_tmp.sortlevel(0,inplace=True)

# #Capture state FIPS
# st_fips_int=sorted(set([idx[0] for idx in third_tmp.index]))

# #Create container for processed subsets
# third_tmp_list=[]

# #For each state...
# for st in st_fips_int:
#     #...capture the subset...
#     tmp=third_tmp.ix[st]
#     #...redefine state variable...
#     tmp['FIPSST']=st
#     #...define years since TYPE2 enactment (0 in year of enactment)...
#     tmp['TYPE2_y']=tmp['TYPE2'].cumsum().apply(lambda x: max(0,x-1))
#     #...and throw the subset in third_tmp_list
#     third_tmp_list.append(tmp)
    
# #Concatenate back together
# third_tmp=pd.concat(third_tmp_list).reset_index()
    
# third_tmp[third_tmp['FIPSST']==55][['FIPSST','YEAR','TYPE2','TYPE2_y']].tail(30)
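
For reference, the same split-apply-combine logic can be written without the explicit loop.  This is a sketch on a hypothetical state-year panel, written with current pandas idioms rather than the `.ix`/`sortlevel` calls used in this notebook.  It assumes `TYPE2` equals 1 in every year a Type 2 TEL is in force, which is what the cumulative sum in the cell above relies on.

```python
import pandas as pd

# Toy state-year panel (hypothetical values); TYPE2 is assumed to be 1 in each
# year a Type 2 TEL is in force
toy = pd.DataFrame({
    'FIPSST': [1, 1, 1, 1, 2, 2, 2, 2],
    'YEAR':   [1970, 1971, 1972, 1973, 1970, 1971, 1972, 1973],
    'TYPE2':  [0, 1, 1, 1, 0, 0, 0, 1]})

# Sort so the cumulative sum runs in time order within each state
toy = toy.sort_values(['FIPSST', 'YEAR'])

# Split by state, accumulate TYPE2, subtract one, and floor at zero
# (0 in the year of enactment, matching the definition above)
toy['TYPE2_y'] = (toy.groupby('FIPSST')['TYPE2'].cumsum() - 1).clip(lower=0)
```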

From Dan's text, it sounds like we want the first DF (which contains selected variables from COStats and PUMS sets) and the third (which contains TEL-related variables).  Note that the third set is now called `third_tmp`.  Since the TEL data house only one observation per year-state, we will try to match on year and FIPS code.  Do the first and third have this information?

In [43]:
#Generate a consistent FIPS code name in the third set
df_list[2]['FIPSST']=df_list[2]['FIPST_N']

print df_list[0][['YEAR','FIPSST']].head()
print df_list[2][['YEAR','FIPSST']].head()

   YEAR  FIPSST
0  1990       0
1  1990       1
2  1990       1
3  1990       1
4  1990       1
   YEAR  FIPSST
0  1970       1
1  1971       1
2  1972       1
3  1973       1
4  1974       1


It appears they do, which makes our job easy.  Nevertheless, since we are joining the third set to the first, let's confirm that the first holds a useful value of the entire FIPS code (state and county).  It is formatted as a 5-digit number string in the debt data to which we will join the output of this script.

In [44]:
print df_list[0][['YEAR','STCOU']].head()

   YEAR  STCOU
0  1990  00000
1  1990  01000
2  1990  01001
3  1990  01003
4  1990  01005


Looking good.  We want to join the third set to the first.

In [45]:
sorted(set(zip(df_list[0]['FIPSST'],df_list[0]['YEAR'])))

[(0.0, 1990.0),
 (0.0, 2000.0),
 (0.0, 2005.0),
 (0.0, 2006.0),
 (0.0, 2007.0),
 (0.0, 2008.0),
 (0.0, 2009.0),
 (0.0, 2010.0),
 (0.0, 2011.0),
 (1.0, 1990.0),
 (1.0, 2000.0),
 (1.0, 2005.0),
 (1.0, 2006.0),
 (1.0, 2007.0),
 (1.0, 2008.0),
 (1.0, 2009.0),
 (1.0, 2010.0),
 (1.0, 2011.0),
 (2.0, 1990.0),
 (2.0, 2000.0),
 (2.0, 2005.0),
 (2.0, 2006.0),
 (2.0, 2007.0),
 (2.0, 2008.0),
 (2.0, 2009.0),
 (2.0, 2010.0),
 (2.0, 2011.0),
 (4.0, 1990.0),
 (4.0, 2000.0),
 (4.0, 2005.0),
 (4.0, 2006.0),
 (4.0, 2007.0),
 (4.0, 2008.0),
 (4.0, 2009.0),
 (4.0, 2010.0),
 (4.0, 2011.0),
 (5.0, 1990.0),
 (5.0, 2000.0),
 (5.0, 2005.0),
 (5.0, 2006.0),
 (5.0, 2007.0),
 (5.0, 2008.0),
 (5.0, 2009.0),
 (5.0, 2010.0),
 (5.0, 2011.0),
 (6.0, 1990.0),
 (6.0, 2000.0),
 (6.0, 2005.0),
 (6.0, 2006.0),
 (6.0, 2007.0),
 (6.0, 2008.0),
 (6.0, 2009.0),
 (6.0, 2010.0),
 (6.0, 2011.0),
 (8.0, 1990.0),
 (8.0, 2000.0),
 (8.0, 2005.0),
 (8.0, 2006.0),
 (8.0, 2007.0),
 (8.0, 2008.0),
 (8.0, 2009.0),
 (8.0, 2010.0),
 (8.0, 2

In [46]:
#Capture sets
first=df_list[0].set_index(['YEAR','FIPSST'])
third=df_list[2].set_index(['YEAR','FIPSST'])

#Sort indices
first.sortlevel(0,inplace=True)
third.sortlevel(0,inplace=True)

#Join sets together
data=first.join(third,rsuffix='_TEL')

data.head().T

YEAR,1990,1990,1990,1990,1990
FIPSST,0,1,1.1,1.2,1.3
STCOU,00000,01000,01001,01003,01005
POP_TH18,6.360443e+07,1058788,10098,25533,7464
POP_OV65,3.124183e+07,522989,3372,14879,3726
TOT_EMP,1.393809e+08,2061101,11471,40809,12163
MFG_EMP,1.96942e+07,396248,2350,5586,3698
RETL_EMP,2.28855e+07,321969,2163,8007,1577
T_PUB_SCH,4.07376e+07,728252,6847,17054,5156
PUB_SCHL,3.837969e+07,680875,6567,15507,4848
PVT_SCHL,4187099,57284,604,1761,466
HSLD_PERS,2.63,2.62,2.88,2.62,2.7
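
The join above attaches one state-level TEL record to many county records, and In [45] shows the keys in the first set arriving as floats.  A sketch of a more defensive version of this kind of join follows, using `DataFrame.merge` with its `validate` and `indicator` options.  The two toy frames and their values are hypothetical; only the key names come from our data.

```python
import pandas as pd

# Toy county-level covariates (hypothetical values); keys arrive as floats
counties = pd.DataFrame({'YEAR': [1990.0, 1990.0, 1990.0],
                         'FIPSST': [1.0, 1.0, 2.0],
                         'STCOU': ['01001', '01003', '02013']})

# Toy state-level TEL data: one row per year-state
tel = pd.DataFrame({'YEAR': [1990, 1990], 'FIPSST': [1, 4], 'TYPE2': [1, 0]})

# Cast the keys to integers so both sides share one dtype
keys = ['YEAR', 'FIPSST']
counties[keys] = counties[keys].astype(int)

# Many county rows to one state row; validate raises if the TEL side has
# duplicate keys, and indicator flags county rows that found no TEL match
merged = counties.merge(tel, on=keys, how='left', validate='m:1', indicator=True)
```

Rows flagged `left_only` in `_merge` are the ones that would come out of the join with missing TEL values.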


In [47]:
print len(data.columns)
print sorted(data.columns)

100
[u'ASMT_L', u'ASMT_L2', u'ASMT_L3', u'BOTH', u'CB_E', u'CB_E2', u'CB_E3', u'CB_E4', u'CB_G', u'CB_G2', u'CFDISC_L', u'CGEXP_L', u'CH_HS_UNT', u'CLEVY_L', u'CLEVY_L2', u'CLEVY_L3', u'CLEVY_L4', u'CRATE_L', u'CRATE_L2', u'CREVU_L', u'D_GEN_EXP', u'FFDISC_L', u'FIPSCO', u'FIPST_N', u'GEN_REV', u'GEXP_L', u'GP_GEXP', u'GP_LEVY', u'GP_LMT', u'GP_RATE', u'GP_REVU', u'HOME_STEAD', u'HOME_STEAD2', u'HOME_STEAD3', u'HSG_UNITS', u'HSLD_PERS', u'IGR_ST', u'LANDAREA', u'LEVY_L', u'LIMITS', u'MDHOMEVAL', u'MED_INC', u'MFDISC_L', u'MFG_EMP', u'MGEXP_L', u'MGEXP_L2', u'MLEVY_L', u'MLEVY_L2', u'MLEVY_L3', u'MLEVY_L4', u'MRATE_L', u'MRATE_L2', u'MREVU_L', u'PC_GEN_EXP', u'PC_INC', u'PERS_POVT', u'POP_OV65', u'POP_TH18', u'POVT_PCT', u'PRE_1940', u'PT_REV', u'PUB_SCHL', u'PVT_SCHL', u'RATE_L', u'RATE_L2', u'RES_POP', u'RETL_EMP', u'REVU_L', u'SC_LMT', u'SFDISC_L', u'SGEXP_L', u'SGEXP_L2', u'SLEVY_L', u'SLEVY_L2', u'SLEVY_L3', u'SLEVY_L4', u'SPC_RATE', u'SRATE_L', u'SRATE_L2', u'SREVU_L', u'SS_PERS',

It turns out we are still missing some variables.  (Hence the motivation behind the entire COSTAT set building exercise above.)  We will pull from our `cvar` set the variables that remain.  Note that there is potential for divergence here: the derivative variables we pull may have been computed from input values that differ from those of the same variables already in the data set.  This could happen because our interpolation routines differed across implementations.  For now, we will acknowledge this potential conflict, but in the interest of disturbing as little of the existing data as possible, we will not overwrite any of the set as it currently exists.

In [48]:
#Capture variables of interest
cvar_join_vars=[var for var in cvar_gh.columns if var not in data.columns]

#Reset data index
data_out=data.reset_index()

#Convert year to integer in data set
data_out['YEAR']=data_out['YEAR'].astype(int)

#Set index to FIPS and year
data_out.set_index(['STCOU','YEAR'],inplace=True)

#Sort index of both sets
data_out.sortlevel(0,inplace=True)
cvar_gh.sortlevel(0,inplace=True)

#Set year index level name in cvar
cvar_gh.index.names=['STCOU','YEAR']

#Join missing COSTAT vars in with output data
data_out=data_out.join(cvar_gh[cvar_join_vars])

data_out.head().T

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


STCOU,00000,00000,00000,00000,00000
YEAR,1990,2000,2005,2006,2007
FIPSST,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
POP_TH18,6.360443e+07,7.229381e+07,7.374917e+07,7.401009e+07,7.434013e+07
POP_OV65,3.124183e+07,3.499175e+07,3.670370e+07,3.720592e+07,3.786714e+07
TOT_EMP,1.393809e+08,1.667588e+08,1.742284e+08,1.778176e+08,1.809438e+08
MFG_EMP,1.969420e+07,1.911480e+07,1.481920e+07,1.477420e+07,1.451200e+07
RETL_EMP,2.288550e+07,2.722230e+07,1.898080e+07,1.912180e+07,1.928200e+07
T_PUB_SCH,4.073760e+07,4.681869e+07,4.869329e+07,4.897856e+07,4.914070e+07
PUB_SCHL,3.837969e+07,4.483859e+07,4.483859e+07,4.434209e+07,4.434209e+07
PVT_SCHL,4.187099e+06,5.195998e+06,5.195998e+06,5.234904e+06,5.234904e+06
HSLD_PERS,2.630000e+00,2.590000e+00,2.600000e+00,2.600000e+00,2.600000e+00
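
If we ever want to size up the potential conflict between the two sets, one option is to compare the columns they share on the index values they have in common.  The sketch below does this on two hypothetical frames; `existing` and `rebuilt` are stand-ins for `data_out` and `cvar_gh`, and the values are made up.

```python
import pandas as pd

# Toy frames (hypothetical values) sharing a (STCOU, YEAR) index
idx = pd.MultiIndex.from_tuples([('01001', 1990), ('01001', 2000)],
                                names=['STCOU', 'YEAR'])
existing = pd.DataFrame({'PC_INC': [15000.0, 21000.0],
                         'MED_INC': [30000.0, 41000.0]}, index=idx)
rebuilt = pd.DataFrame({'PC_INC': [15000.0, 21500.0],
                        'DENSITY': [50.0, 55.0]}, index=idx)

# Columns present in both sets, and the records present in both sets
shared = [c for c in rebuilt.columns if c in existing.columns]
common = existing.index.intersection(rebuilt.index)

# Largest absolute disagreement per shared column
max_abs_diff = (existing.loc[common, shared] - rebuilt.loc[common, shared]).abs().max()
```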


In [49]:
print data_out.describe().ix[['count','mean','min','max']].T.to_string()

                      count          mean           min           max
FIPSST                28778  3.027285e+01  0.000000e+00  5.600000e+01
POP_TH18              28778  6.831658e+04  0.000000e+00  7.454822e+07
POP_OV65              28778  3.512233e+04  0.000000e+00  4.026798e+07
TOT_EMP               28778  1.629270e+05  0.000000e+00  1.809438e+08
MFG_EMP               28778  1.466016e+04  0.000000e+00  1.969420e+07
RETL_EMP              28778  1.923648e+04  0.000000e+00  2.722230e+07
T_PUB_SCH             28778  4.492620e+04  0.000000e+00  4.918340e+07
PUB_SCHL              28778  4.108445e+04  0.000000e+00  4.483859e+07
PVT_SCHL              28778  4.794133e+03  0.000000e+00  5.234904e+06
HSLD_PERS             28778  2.505594e+00  0.000000e+00  4.730000e+00
HSG_UNITS             28778  1.172043e+05  0.000000e+00  1.317047e+08
CH_HS_UNT             28778  1.169294e+01 -1.288333e+02  1.841000e+02
PRE_1940              28778  1.711621e+04  0.000000e+00  1.883250e+07
VACANT              

Ok, let's write this to disk.

In [50]:
data_out.to_csv('../data/tel_data.csv')

In [51]:
print sorted(data_out.columns)

[u'ASMT_L', u'ASMT_L2', u'ASMT_L3', u'BOTH', u'CB_E', u'CB_E2', u'CB_E3', u'CB_E4', u'CB_G', u'CB_G2', u'CFDISC_L', u'CGEXP_L', u'CH_HS_UNT', u'CLEVY_L', u'CLEVY_L2', u'CLEVY_L3', u'CLEVY_L4', u'CRATE_L', u'CRATE_L2', u'CREVU_L', 'DENSITY', 'DIVERSITY', u'D_GEN_EXP', 'EDUC_SERV_EMP_PNFARM', 'EMP_RES', u'FFDISC_L', u'FIPSCO', 'FIPSST', u'FIPST_N', 'FOOD_SERV_EMP_PNFARM', u'GEN_REV', u'GEXP_L', u'GP_GEXP', u'GP_LEVY', u'GP_LMT', u'GP_RATE', u'GP_REVU', u'HOME_STEAD', u'HOME_STEAD2', u'HOME_STEAD3', u'HSG_UNITS', 'HSG_UNITS_ACS', u'HSLD_PERS', u'IGR_ST', u'LANDAREA', u'LEVY_L', u'LIMITS', 'MANU_EMP_PNFARM', 'MANU_RES', u'MDHOMEVAL', u'MED_INC', u'MFDISC_L', u'MFG_EMP', u'MGEXP_L', u'MGEXP_L2', u'MLEVY_L', u'MLEVY_L2', u'MLEVY_L3', u'MLEVY_L4', u'MRATE_L', u'MRATE_L2', u'MREVU_L', 'OTH_SERV_EMP_PNFARM', u'PC_GEN_EXP', u'PC_INC', 'PC_SSI', u'PERS_POVT', 'POP65', 'POPGROWTH', u'POP_OV65', u'POP_TH18', 'POVERTY', u'POVT_PCT', 'POV_EST_FAM_DENOM', 'POV_EST_FAM_NUMER', 'PRE1940', u'PRE_1940', '

### An Alternate Join

So, it turns out that the `first` set of COSTATs/PUMS data is very limited in temporal scope.  Here are the years covered.

In [52]:
sorted(set([item[0] for item in first.index]))

[1990.0, 2000.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0]

Moreover, I am not sure there are any columns of interest to us that do not appear in the full COSTATs set (the second file) or the TEL set (`third`).

In [53]:
print sorted(third.columns)

[u'ASMT_L', u'ASMT_L2', u'ASMT_L3', u'BOTH', u'CB_E', u'CB_E2', u'CB_E3', u'CB_E4', u'CB_G', u'CB_G2', u'CFDISC_L', u'CGEXP_L', u'CLEVY_L', u'CLEVY_L2', u'CLEVY_L3', u'CLEVY_L4', u'CRATE_L', u'CRATE_L2', u'CREVU_L', u'FFDISC_L', u'FIPST_N', u'GEXP_L', u'GP_GEXP', u'GP_LEVY', u'GP_LMT', u'GP_RATE', u'GP_REVU', u'HOME_STEAD', u'HOME_STEAD2', u'HOME_STEAD3', u'LEVY_L', u'LIMITS', u'MFDISC_L', u'MGEXP_L', u'MGEXP_L2', u'MLEVY_L', u'MLEVY_L2', u'MLEVY_L3', u'MLEVY_L4', u'MRATE_L', u'MRATE_L2', u'MREVU_L', u'RATE_L', u'RATE_L2', u'REVU_L', u'SC_LMT', u'SFDISC_L', u'SGEXP_L', u'SGEXP_L2', u'SLEVY_L', u'SLEVY_L2', u'SLEVY_L3', u'SLEVY_L4', u'SPC_RATE', u'SRATE_L', u'SRATE_L2', u'SREVU_L', u'TREND', u'TYPE1', u'TYPE2', u'TYPE2_Y']


The question becomes, what are we buying with this observational sparsity?  We have already processed a COSTAT set to give us full coverage in the study window.  Why not just use that?  Consequently, we will create an alternative output set consisting of a join between the COSTAT (`cvar_gh`, which will be converted to `second`) and TEL (`third`) sets.

The first step is establishing an integer `FIPSST` variable in `second` that can be used as a join key.

In [54]:
#Reset index
second=cvar_gh.reset_index()

#Generate FIPSST
second['FIPSST']=second['STCOU'].apply(lambda x: int(x[:2]))

#Set index
second.set_index(['YEAR','FIPSST'],inplace=True)

second.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,STCOU,D_GEN_EXP,FOOD_SERV_EMP_PNFARM,PRV_SCHL_KIND,OTH_SERV_EMP_PNFARM,PRE_1940,MED_INC,PERS_POVT,PRV_SCHL_9_12,PC_GEN_EXP,...,RESPOP2,PRE1940,PVT_SCH,POVERTY,PC_SSI,DIVERSITY,EMP_RES,MANU_RES,RETL_RES,SERV_RES
YEAR,FIPSST,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1940,0,0,106499000,9466088,621446,5420087,18832498,34076,13.8,1532323,509,...,4.379975e+16,0.503022,0.105809,2.091126e-07,0.100894,146246600000.0,0.35762,0.080971,0.068045,0.152609
1950,0,0,106499000,9466088,621446,5420087,18832498,34076,13.8,1532323,509,...,4.379975e+16,0.408186,0.105809,2.091126e-07,0.100894,146246600000.0,0.35762,0.080971,0.068045,0.152609
1960,0,0,106499000,9466088,621446,5420087,18832498,34076,13.8,1532323,509,...,4.379975e+16,0.322881,0.105809,2.091126e-07,0.100894,146246600000.0,0.35762,0.080971,0.068045,0.152609
1969,0,0,106499000,9466088,621446,5420087,18832498,34076,13.8,1532323,509,...,4.379975e+16,0.296503,0.105809,2.091126e-07,0.100894,146246600000.0,0.35762,0.080971,0.068045,0.152609
1970,0,0,106499000,9466088,621446,5420087,18832498,34076,13.8,1532323,509,...,4.379975e+16,0.274109,0.105809,0.02216304,0.100894,1379865.0,0.35762,0.080971,0.068045,0.152609


The keys are well aligned, so let's join the sets and write to disk.

In [61]:
DataFrame(data_out_alt.describe().T)['count'].value_counts()

127405    70
114768    61
127343     1
127404     1
127241     1
dtype: int64

In [55]:
#Join third to second
data_out_alt=second.join(third)

#Write to disk
data_out_alt.to_csv('../../debt_data/tel_data_alt.csv')

data_out_alt.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
D_GEN_EXP,127405,509952.090407,11568008.039997,0.0,11317.80,30650.800000,92745.00,9.863709e+08
FOOD_SERV_EMP_PNFARM,127405,9240.294047,178870.207797,0.0,162.00,597.000000,2076.00,1.192633e+07
PRV_SCHL_KIND,127405,581.807975,11278.378273,0.0,6.00,29.000000,105.00,6.214460e+05
OTH_SERV_EMP_PNFARM,127405,5067.619827,97971.126064,0.0,96.00,292.000000,1012.00,5.519773e+06
PRE_1940,127405,17186.637691,335475.680024,2.0,765.90,1635.444444,3928.00,1.883250e+07
MED_INC,127405,33257.900118,9557.860729,0.0,26829.00,31477.000000,37624.00,1.142000e+05
PERS_POVT,127405,15.133350,6.730214,0.0,10.40,13.900000,18.50,6.200000e+01
PRV_SCHL_9_12,127405,1431.374860,27786.077063,0.0,25.00,84.000000,257.00,1.532323e+06
PC_GEN_EXP,127405,1762.005226,1803.733095,0.0,748.00,1522.400000,2416.00,1.985100e+05
MANU_EMP_PNFARM,127405,15058.781743,292124.751471,0.0,219.00,1295.000000,4430.00,1.694583e+07


In [64]:
data_out_alt[data_out_alt['RESPOP'].isnull()]

Unnamed: 0_level_0,Unnamed: 1_level_0,STCOU,D_GEN_EXP,FOOD_SERV_EMP_PNFARM,PRV_SCHL_KIND,OTH_SERV_EMP_PNFARM,PRE_1940,MED_INC,PERS_POVT,PRV_SCHL_9_12,PC_GEN_EXP,...,LEVY_L,REVU_L,GEXP_L,GP_RATE,GP_LEVY,GP_REVU,GP_GEXP,GP_LMT,SC_LMT,TREND
YEAR,FIPSST,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1


## Data Checks

In [65]:
sorted(data_out_alt.columns)

[u'ASMT_L',
 u'ASMT_L2',
 u'ASMT_L3',
 u'BOTH',
 u'CB_E',
 u'CB_E2',
 u'CB_E3',
 u'CB_E4',
 u'CB_G',
 u'CB_G2',
 u'CFDISC_L',
 u'CGEXP_L',
 'CH_HS_UNT',
 u'CLEVY_L',
 u'CLEVY_L2',
 u'CLEVY_L3',
 u'CLEVY_L4',
 u'CRATE_L',
 u'CRATE_L2',
 u'CREVU_L',
 'DENSITY',
 'DIVERSITY',
 'D_GEN_EXP',
 'EDUC_SERV_EMP_PNFARM',
 'EMP_RES',
 u'FFDISC_L',
 u'FIPST_N',
 'FOOD_SERV_EMP_PNFARM',
 'GEN_REV',
 u'GEXP_L',
 u'GP_GEXP',
 u'GP_LEVY',
 u'GP_LMT',
 u'GP_RATE',
 u'GP_REVU',
 u'HOME_STEAD',
 u'HOME_STEAD2',
 u'HOME_STEAD3',
 'HSG_UNITS',
 'HSG_UNITS_ACS',
 'HSLD_PERS',
 'IGR_ST',
 'LANDAREA',
 u'LEVY_L',
 u'LIMITS',
 'MANU_EMP_PNFARM',
 'MANU_RES',
 'MDHOMEVAL',
 'MED_INC',
 u'MFDISC_L',
 'MFG_EMP',
 u'MGEXP_L',
 u'MGEXP_L2',
 u'MLEVY_L',
 u'MLEVY_L2',
 u'MLEVY_L3',
 u'MLEVY_L4',
 u'MRATE_L',
 u'MRATE_L2',
 u'MREVU_L',
 'OTH_SERV_EMP_PNFARM',
 'PC_GEN_EXP',
 'PC_INC',
 'PC_SSI',
 'PERS_POVT',
 'POP65',
 'POPGROWTH',
 'POP_OV65',
 'POP_TH18',
 'POVERTY',
 'POV_EST_FAM_DENOM',
 'POV_EST_FAM_NUMER',
 'P

In [56]:
print third.head().T.to_string()

YEAR        1970            
FIPSST         1  2  4  5  6
FIPST_N        1  2  4  5  6
RATE_L         0  0  0  0  0
RATE_L2        0  0  0  0  0
MRATE_L        0  0  0  0  0
CRATE_L        0  0  0  0  0
SRATE_L        0  0  0  0  0
SRATE_L2       0  0  0  0  0
MRATE_L2       0  0  0  0  0
CRATE_L2       0  0  0  0  0
MLEVY_L        0  0  0  0  0
CLEVY_L        0  0  0  0  0
CLEVY_L2       0  0  0  0  0
MLEVY_L2       0  0  0  0  0
MLEVY_L3       0  0  0  0  0
SLEVY_L        0  0  0  0  0
CLEVY_L3       0  0  0  0  0
CLEVY_L4       0  0  0  0  0
MLEVY_L4       0  0  0  0  0
SLEVY_L2       0  0  0  0  0
SLEVY_L3       0  0  0  0  0
SLEVY_L4       0  0  0  0  0
ASMT_L         0  0  0  0  0
ASMT_L2        0  0  0  0  0
ASMT_L3        0  0  0  0  0
CREVU_L        0  0  0  0  0
MREVU_L        0  0  0  0  0
SREVU_L        0  0  0  0  0
CGEXP_L        0  0  0  0  0
MGEXP_L        0  0  0  0  0
SGEXP_L        0  0  0  0  0
SGEXP_L2       0  0  0  0  0
MGEXP_L2       0  0  0  0  0
CFDISC_L      

It would be useful to understand how we are impacting the universe of states with this join (or future joins).  To aid in this effort, let's generate a coverage set.  The index will be the union of the indices from `first` and `third`.  The variables are just boolean values indicating that the data set in question contains a record for the year and state identified in the index.

In [57]:
#Capture union of indices
u_idx=list(set(first.index.values).union(set(third.index.values)))

#Generate state coverage DataFrame
st_cov=DataFrame({'covariates':[idx in first.index for idx in u_idx],
                  'tel':[idx in third.index for idx in u_idx]},
                  index=pd.MultiIndex.from_tuples(u_idx,names=['Year','FIPS'])).sortlevel(0)

print 'Number of year-states represented in the COSTAT/PUMS set:',st_cov['covariates'].sum()
print 'Number of year-states represented in the institutional set:',st_cov['tel'].sum()

#Write to disk
st_cov.to_csv('../data/state_coverage.csv')

st_cov

Number of year-states represented in the COSTAT/PUMS set: 468
Number of year-states represented in the institutional set: 2244


Unnamed: 0_level_0,Unnamed: 1_level_0,covariates,tel
Year,FIPS,Unnamed: 2_level_1,Unnamed: 3_level_1
1970,1,False,True
1970,2,False,True
1970,4,False,True
1970,5,False,True
1970,6,False,True
1970,8,False,True
1970,9,False,True
1970,10,False,True
1970,11,False,True
1970,12,False,True
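
The same coverage table can also be built with an outer merge and pandas' `indicator` option instead of testing each key for membership.  This is just a sketch: the key frames below are hypothetical stand-ins for the de-duplicated indices of `first` and `third`, and `st_cov_alt` is an illustrative name.

```python
import pandas as pd

# Toy year-state keys (hypothetical values) standing in for first.index and third.index
cov_keys = pd.DataFrame({'Year': [1990, 1990, 2000], 'FIPS': [0, 1, 1]})
tel_keys = pd.DataFrame({'Year': [1970, 1990, 2000], 'FIPS': [1, 1, 1]})

# Outer merge of the de-duplicated keys; _merge records which side each key came from
both = cov_keys.drop_duplicates().merge(tel_keys.drop_duplicates(),
                                        on=['Year', 'FIPS'], how='outer',
                                        indicator=True)

# Convert the indicator into the two boolean coverage columns
st_cov_alt = pd.DataFrame(
    {'covariates': (both['_merge'] != 'right_only').values,
     'tel': (both['_merge'] != 'left_only').values},
    index=pd.MultiIndex.from_frame(both[['Year', 'FIPS']])).sort_index()
```

Filtering `both` on `_merge == 'left_only'` lists the year-states that appear in the covariates but not in the TEL set, which is a direct way to see why a subset check between the two indices might fail.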


In [58]:
set(first.index).issubset(set(third.index))

False

In [59]:
print sorted(df_list[1].columns)

[u'AFN110197D', u'AFN110197F', u'AFN110202D', u'AFN110202F', u'AFN110207D', u'AFN110207F', u'AFN120197D', u'AFN120197F', u'AFN120202D', u'AFN120202F', u'AFN120207D', u'AFN120207F', u'AFN130197D', u'AFN130197F', u'AFN130202D', u'AFN130202F', u'AFN130207D', u'AFN130207F', u'AFN140197D', u'AFN140197F', u'AFN140202D', u'AFN140202F', u'AFN140207D', u'AFN140207F', u'AFN210197D', u'AFN210197F', u'AFN210202D', u'AFN210202F', u'AFN210207D', u'AFN210207F', u'AFN220197D', u'AFN220197F', u'AFN220202D', u'AFN220202F', u'AFN220207D', u'AFN220207F', u'AFN230197D', u'AFN230197F', u'AFN230202D', u'AFN230202F', u'AFN230207D', u'AFN230207F', u'AFN240197D', u'AFN240197F', u'AFN240202D', u'AFN240202F', u'AFN240207D', u'AFN240207F', u'AFN310197D', u'AFN310197F', u'AFN310202D', u'AFN310202F', u'AFN310207D', u'AFN310207F', u'AFN320197D', u'AFN320197F', u'AFN320202D', u'AFN320202F', u'AFN320207D', u'AFN320207F', u'AFN330197D', u'AFN330197F', u'AFN330202D', u'AFN330202F', u'AFN330207D', u'AFN330207F', u'AFN3401