# Converting SAS to CSV

This is a quick notebook to convert the SAS data files that hold the TEL data to CSV. Once that is done, we can merge these data with the Reuters debt data.

In [3]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sas7bdat import SAS7BDAT

Let's take a look at the files...

In [None]:
files=!ls ../debt_data/*.sas7bdat
files

['../debt_data/co_pums11.sas7bdat',
 '../debt_data/costat12.sas7bdat',
 '../debt_data/tel_cl_jm.sas7bdat']

Supposedly we can read them right into pandas DFs.

In [None]:
#Create container for each set
df_list=[]

#For each file...
for file_in in files:
    #...create a sas7bdat object...
    with SAS7BDAT(file_in) as f:
        #...and throw a DF version in df_list
        df_list.append(f.to_data_frame())

[costat12.sas7bdat] column count mismatch


Just what is contained in these guys?

In [None]:
for df in df_list:
    #info() prints its own summary and returns None, so no print is needed
    df.info()

From Dan's text, it sounds like we want the first DF (which contains selected variables from COStats and PUMS sets) and the third (which contains TEL-related variables).  Do all of these DFs have year and county variables?

In [None]:
for i,df in enumerate(df_list):
    #Report whether each merge key is present in set i
    print('{} | YEAR: {}'.format(i,'YEAR' in df.columns))
    print('{} | STCOU: {}'.format(i,'STCOU' in df.columns))

It would appear only the first does.  Can the third be effectively merged on FIPS codes?

In [None]:
#Pull the FIPS-related columns (plus YEAR) from the third set
fips_cols=[var for var in df_list[2].columns if 'fips' in var.lower()]
print(df_list[2][fips_cols+['YEAR']].head())
print(df_list[2][fips_cols+['YEAR']].tail())

FIPS codes are available in the third set.  How are they formatted in the first?

In [None]:
print(df_list[0][['YEAR','STCOU']].head())
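Before joining on these keys, it can help to compare their dtypes across the two sets, since a string key on one side and a numeric key on the other will not line up. The following is a minimal sketch on toy frames; the frames `a` and `b` and their values are made up for illustration and are not the actual data.

```python
import pandas as pd

# Hypothetical stand-ins for the first and third sets.
a = pd.DataFrame({'YEAR': [2000.0], 'STCOU': ['01001']})
b = pd.DataFrame({'YEAR': [2000.0], 'STCOU': [1001]})

keys = ['YEAR', 'STCOU']

# Put the key dtypes side by side and flag any that differ.
report = pd.DataFrame({'a': a[keys].dtypes.astype(str),
                       'b': b[keys].dtypes.astype(str)})
report['match'] = report['a'] == report['b']

print(report)
```

Any key with `match` equal to `False` would need to be converted on one side before a join.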

We can define a new variable in the third set, `STCOU`, that concatenates the zero-padded versions of `FIPSST` (two digits) and `FIPSCO` (three digits).

In [None]:
#Define consolidated FIPS variable
df_list[2]['STCOU']=df_list[2]['FIPSST'].apply(lambda x: str(int(x)).zfill(2))+\
                    df_list[2]['FIPSCO'].apply(lambda x: str(int(x)).zfill(3))

#Check the new variable against the original FIPS columns
fips_cols=[var for var in df_list[2].columns if 'fips' in var.lower()]
print(df_list[2][fips_cols+['YEAR','STCOU']].head())
print(df_list[2][fips_cols+['YEAR','STCOU']].tail())
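To make the zero-padding step concrete, here is a small sketch on a toy frame. It assumes the FIPS columns arrive as floats (which the `int(x)` call above suggests); the values are invented for illustration. It uses vectorized string methods instead of `apply`, which should give the same result when there are no missing values.

```python
import pandas as pd

# Toy stand-in for the TEL set; the values are made up.
toy = pd.DataFrame({'FIPSST': [1.0, 6.0, 48.0],
                    'FIPSCO': [1.0, 37.0, 201.0]})

# Cast to int first so 1.0 becomes '1' rather than '1.0', then pad.
state = toy['FIPSST'].astype(int).astype(str).str.zfill(2)
county = toy['FIPSCO'].astype(int).astype(str).str.zfill(3)
toy['STCOU'] = state + county

print(toy['STCOU'].tolist())  # ['01001', '06037', '48201']
```

Note that `astype(int)` raises an error on missing values, so rows with a missing state or county code would need to be handled first.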

We want to join the third set to the first.

In [None]:
#Capture sets
first=df_list[0].set_index(['YEAR','STCOU'])
third=df_list[2].set_index(['YEAR','STCOU'])

#Sort indices
first.sort_index(level=0,inplace=True)
third.sort_index(level=0,inplace=True)

#Join sets together
data=first.join(third,rsuffix='_TEL')

data.head().T
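For reference, `join` defaults to a left join on the index, and `rsuffix` is only applied to column names that appear in both frames. A minimal sketch with toy frames (the names `left`, `right`, `POP`, and `TEL` and all values are hypothetical) shows both behaviors:

```python
import pandas as pd

# Toy frames sharing the (YEAR, STCOU) index and one overlapping column.
left = pd.DataFrame({'YEAR': [2000, 2000, 2001],
                     'STCOU': ['01001', '01003', '01001'],
                     'POP': [10, 20, 11]}).set_index(['YEAR', 'STCOU'])
right = pd.DataFrame({'YEAR': [2000, 2001],
                      'STCOU': ['01001', '01001'],
                      'POP': [99, 98],
                      'TEL': [1, 0]}).set_index(['YEAR', 'STCOU'])

# Left join: every row of `left` is kept; unmatched rows get NaN.
joined = left.join(right, rsuffix='_TEL')

print(joined)
```

The row for `(2000, '01003')` has no match in `right`, so its `POP_TEL` and `TEL` values are missing. In the real data, counting such rows is one way to see how many county-years lack TEL information.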

In [None]:
print(len(data.columns))
print(sorted(data.columns))
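The remaining step is the CSV export itself. The sketch below is not the original export cell; it uses a toy frame and an in-memory buffer in place of a real output path to illustrate one thing worth watching for: zero-padded codes like `STCOU` lose their leading zeros when the CSV is read back, unless the column is read as a string.

```python
import io
import pandas as pd

# Toy frame indexed like the joined data; the values are made up.
toy = pd.DataFrame({'YEAR': [2000, 2001],
                    'STCOU': ['01001', '01001'],
                    'TEL': [1, 0]}).set_index(['YEAR', 'STCOU'])

# Write to an in-memory buffer (a file path would work the same way).
buf = io.StringIO()
toy.to_csv(buf)

# Reading back without a dtype hint parses STCOU as an integer.
buf.seek(0)
naive = pd.read_csv(buf)

# Forcing STCOU to str keeps the zero padding.
buf.seek(0)
careful = pd.read_csv(buf, dtype={'STCOU': str})

print(naive['STCOU'].tolist())    # [1001, 1001]
print(careful['STCOU'].tolist())  # ['01001', '01001']
```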