# Converting SAS to CSV

This is a quick notebook to convert the SAS data files that hold the TEL data to CSV.  Once that is done, we can merge these data with the Reuters debt data.

In [3]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sas7bdat import SAS7BDAT

Let's take a look at the files...

In [4]:
files=!ls ../debt_data/*.sas7bdat
files

['../debt_data/co_pums11.sas7bdat',
 '../debt_data/costat12.sas7bdat',
 '../debt_data/tel_cl_jm.sas7bdat']
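The `!ls` shell escape only works inside IPython on a system that has `ls`. A minimal sketch of a portable alternative using the standard library is below; the helper name `list_sas_files` is hypothetical, and the directory is assumed to be the same `../debt_data` used above.

```python
import glob
import os


def list_sas_files(directory):
    """Return the sorted paths of all .sas7bdat files directly inside `directory`."""
    pattern = os.path.join(directory, '*.sas7bdat')
    # glob returns an empty list if the directory is missing or has no matches
    return sorted(glob.glob(pattern))


# Assumed location of the SAS files, taken from the cell above
files = list_sas_files('../debt_data')
```

Sorting makes the order of `files` deterministic, which matters here because later cells refer to the data sets by position.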

Supposedly the `sas7bdat` package can read them right into pandas DFs.

In [5]:
#Create container for each set
df_list=[]

#For each file...
for file_in in files:
    #...create a sas7bdat object...
    with SAS7BDAT(file_in) as f:
        #...and throw a DF version in df_list
        df_list.append(f.to_data_frame())

[costat12.sas7bdat] column count mismatch
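As an aside, pandas also provides its own reader, `pd.read_sas`, which handles the `sas7bdat` format. The sketch below is an alternative to the loop above, not what this notebook ran, and there is no guarantee it avoids the column count warning for `costat12.sas7bdat`. The helper `load_sas_frames` and its `reader` parameter are hypothetical names introduced for illustration.

```python
import os

import pandas as pd


def load_sas_frames(paths, reader=pd.read_sas):
    """Read each path with `reader` and key the resulting frames by file name.

    `reader` defaults to pandas.read_sas, which infers the SAS format from
    the file extension. It is a parameter so another loader can be swapped in.
    """
    frames = {}
    for path in paths:
        frames[os.path.basename(path)] = reader(path)
    return frames
```

Keying the frames by file name rather than by list position makes it harder to pick up the wrong data set if the directory contents change.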


Just what is contained in these guys?

In [6]:
for df in df_list:
    print df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28778 entries, 0 to 28777
Data columns (total 41 columns):
STCOU         28778 non-null object
POP_TH18      28778 non-null float64
POP_OV65      28778 non-null float64
TOT_EMP       28778 non-null float64
MFG_EMP       28778 non-null float64
RETL_EMP      28778 non-null float64
T_PUB_SCH     28778 non-null float64
PUB_SCHL      28778 non-null float64
PVT_SCHL      28778 non-null float64
HSLD_PERS     28778 non-null float64
HSG_UNITS     28778 non-null float64
CH_HS_UNT     28778 non-null float64
PRE_1940      28778 non-null float64
VACANT        28778 non-null float64
MDHOMEVAL     28778 non-null float64
MED_INC       28778 non-null float64
PC_INC        28778 non-null float64
LANDAREA      28778 non-null float64
GEN_REV       28778 non-null float64
IGR_ST        28778 non-null float64
TAX_REV       28778 non-null float64
PT_REV        28778 non-null float64
D_GEN_EXP     28778 non-null float64
PC_GEN_EXP    28778 non-null float64
RES_

From Dan's text, it sounds like we want the first DF (which contains selected variables from COStats and PUMS sets) and the third (which contains TEL-related variables).  Do all of these DFs have year and county variables?

In [7]:
for i,df in enumerate(df_list):
    print i,'|','YEAR' in df.columns
    print i,'|','STCOU' in df.columns

0 | True
0 | True
1 | False
1 | True
2 | True
2 | False
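The same check can be written so that it reports which key columns each set lacks. This is a minimal sketch on toy frames that mimic the result above; `missing_keys` and `toy_frames` are hypothetical names, and the toy frames carry only column labels, not the real data.

```python
import pandas as pd


def missing_keys(frames, keys=('YEAR', 'STCOU')):
    """Map each frame's position to the list of key columns it lacks."""
    return {i: [k for k in keys if k not in df.columns]
            for i, df in enumerate(frames)}


# Toy stand-ins for df_list: empty frames with only the relevant column labels
toy_frames = [
    pd.DataFrame(columns=['YEAR', 'STCOU', 'POP_TH18']),
    pd.DataFrame(columns=['STCOU']),
    pd.DataFrame(columns=['YEAR', 'FIPSST', 'FIPSCO']),
]

print(missing_keys(toy_frames))
```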


Only the first set has both: the second lacks `YEAR` and the third lacks `STCOU`.  Can the third be effectively merged on FIPS codes?

In [8]:
print df_list[2][[var for var in df_list[2].columns if 'fips' in var.lower()]+['YEAR']].head()
print df_list[2][[var for var in df_list[2].columns if 'fips' in var.lower()]+['YEAR']].tail()

   FIPSST  FIPSCO  YEAR
0       1       3  1990
1       1       3  2000
2       1       3  2001
3       1       3  2002
4       1       3  2003
     FIPSST  FIPSCO  YEAR
658      56      21  2007
659      56      21  2008
660      56      21  2009
661      56      21  2010
662      56      21  2011


FIPS codes are available in the third set.  How are they formatted in the first?

In [9]:
print df_list[0][['YEAR','STCOU']].head()

   YEAR  STCOU
0  1990  00000
1  1990  01000
2  1990  01001
3  1990  01003
4  1990  01005


We can define a new variable in the third set, `STCOU`, that concatenates the zero-padded versions of `FIPSST` (two digits) and `FIPSCO` (three digits).

In [10]:
#Define consolidated FIPS variable
df_list[2]['STCOU']=df_list[2]['FIPSST'].apply(lambda x: str(int(x)).zfill(2))+\
                    df_list[2]['FIPSCO'].apply(lambda x: str(int(x)).zfill(3))

print df_list[2][[var for var in df_list[2].columns if 'fips' in var.lower()]+['YEAR','STCOU']].head()
print df_list[2][[var for var in df_list[2].columns if 'fips' in var.lower()]+['YEAR','STCOU']].tail()

   FIPSST  FIPSCO  YEAR  STCOU
0       1       3  1990  01003
1       1       3  2000  01003
2       1       3  2001  01003
3       1       3  2002  01003
4       1       3  2003  01003
     FIPSST  FIPSCO  YEAR  STCOU
658      56      21  2007  56021
659      56      21  2008  56021
660      56      21  2009  56021
661      56      21  2010  56021
662      56      21  2011  56021
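The padding can also be done with pandas' vectorized string methods instead of `apply` with a lambda. A minimal sketch on a toy frame follows; `build_stcou` and `toy` are hypothetical names, and the two rows reuse FIPS values shown in the output above.

```python
import pandas as pd


def build_stcou(df, state_col='FIPSST', county_col='FIPSCO'):
    """Concatenate a 2-digit state code and a 3-digit county code into one string."""
    state = df[state_col].astype(int).astype(str).str.zfill(2)
    county = df[county_col].astype(int).astype(str).str.zfill(3)
    return state + county


# Toy rows: the codes are stored as floats, as they often are after a SAS import
toy = pd.DataFrame({'FIPSST': [1.0, 56.0], 'FIPSCO': [3.0, 21.0]})
toy['STCOU'] = build_stcou(toy)

# Sanity check: every combined code should be exactly five characters long
all_five_wide = bool((toy['STCOU'].str.len() == 5).all())
```

The width check is cheap and catches codes that were already too long to pad, which would otherwise fail to match silently in the join.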


We want to join the third set to the first.

In [11]:
#Capture sets
first=df_list[0].set_index(['YEAR','STCOU'])
third=df_list[2].set_index(['YEAR','STCOU'])

#Sort indices
first.sortlevel(0,inplace=True)
third.sortlevel(0,inplace=True)

#Join sets together
data=first.join(third,rsuffix='_TEL')

data.head().T

YEAR               1990     1990   1990   1990   1990
STCOU             00000    01000  01001  01003  01005
POP_TH18   6.360443e+07  1058788  10098  25533   7464
POP_OV65   3.124183e+07   522989   3372  14879   3726
TOT_EMP    1.393809e+08  2061101  11471  40809  12163
MFG_EMP     1.96942e+07   396248   2350   5586   3698
RETL_EMP    2.28855e+07   321969   2163   8007   1577
T_PUB_SCH   4.07376e+07   728252   6847  17054   5156
PUB_SCHL   3.837969e+07   680875   6567  15507   4848
PVT_SCHL        4187099    57284    604   1761    466
HSLD_PERS          2.63     2.62   2.88   2.62    2.7
HSG_UNITS  1.022637e+08  1670379  12732  50933  10705
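`join` defaults to a left join, so every row of the first set is kept whether or not the third set has a matching (`YEAR`, `STCOU`) pair. One way to see how many rows actually matched is `merge` with `indicator=True`. This is a minimal sketch on toy frames, not the real data: `first_toy` and `third_toy` are hypothetical, and their values are illustrative only.

```python
import pandas as pd

# Toy version of the first set: three county-year rows (values are illustrative)
first_toy = pd.DataFrame({
    'YEAR': [1990, 1990, 2000],
    'STCOU': ['01001', '01003', '01003'],
    'POP_TH18': [10098.0, 25533.0, 30000.0],
})

# Toy version of the third set: only county 01003 is covered
third_toy = pd.DataFrame({
    'YEAR': [1990, 2000],
    'STCOU': ['01003', '01003'],
    'LIMITS': [1.0, 1.0],
})

# A left merge keeps every row of first_toy; the _merge column records
# whether each row found a partner in third_toy
merged = first_toy.merge(third_toy, on=['YEAR', 'STCOU'],
                         how='left', indicator=True)
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Rows labelled `left_only` are the ones that end up with missing values in the TEL columns after the join.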


In [13]:
print len(data.columns)
print sorted(data.columns)

194
[u'ACQ_ValAss', u'ASMT_L', u'ASMT_L2', u'ASMT_L3', u'Area', u'BOTH', u'BURDEN05', u'BURDEN06', u'BURDEN99', u'CB', u'CB_E', u'CB_E2', u'CB_E3', u'CB_E4', u'CB_G', u'CB_G2', u'CB_share', u'CFDISC_L', u'CGEXP_L', u'CH_HS_UNT', u'CLEVY_L', u'CLEVY_L2', u'CLEVY_L3', u'CLEVY_L4', u'CRATE_L', u'CRATE_L2', u'CREVU_L', u'CV_99BURDEN', u'County', u'D_GEN_EXP', u'Def', u'Density', u'Dillon_all', u'EAST', u'EDU_MAND', u'EmptoResPop', u'FFDISC_L', u'FIPSCO', u'FIPSCO_TEL', u'FIPSST', u'FIPSST_TEL', u'GEN_REV', u'GEXP_L', u'GL', u'GP_GEXP', u'GP_LEVY', u'GP_LMT', u'GP_RATE', u'GP_REVU', u'GST', u'HOME_STEAD', u'HOME_STEAD2', u'HOME_STEAD3', u'HSG_UNITS', u'HSLD_PERS', u'IGR', u'IGR_ST', u'IIT', u'LANDAREA', u'LEVY_L', u'LIMITS', u'LIT', u'LST', u'MDHOMEVAL', u'ME', u'MED_INC', u'MFDISC_L', u'MFG_EMP', u'MGEXP_L', u'MGEXP_L2', u'MIDDLE', u'MLEVY_L', u'MLEVY_L2', u'MLEVY_L3', u'MLEVY_L4', u'MRATE_L', u'MRATE_L2', u'MREVU_L', u'M_H_INC05', u'M_H_INC06', u'M_H_INC99', u'M_P_TAX05', u'M_P_TAX06', u'

Ok, let's write this to disk.

In [14]:
data.to_csv('tel_data.csv')
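One thing to watch when reading this file back: by default `pd.read_csv` parses `STCOU` as an integer, which drops the leading zeros. A minimal sketch of the problem and a fix, using an in-memory buffer and a toy frame (`toy` is hypothetical) rather than the real `tel_data.csv`:

```python
import io

import pandas as pd

# Toy frame with the same index layout as `data`
toy = pd.DataFrame({
    'YEAR': [1990, 1990],
    'STCOU': ['01001', '01003'],
    'POP_TH18': [10098.0, 25533.0],
}).set_index(['YEAR', 'STCOU'])

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# Default parsing turns '01001' into the integer 1001
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing STCOU to str preserves the zero padding
safe = pd.read_csv(io.StringIO(csv_text), dtype={'STCOU': str})
```

Passing `dtype={'STCOU': str}` keeps the codes comparable with the zero-padded strings used for the join above.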