# Debt Data Time Series

This notebook extracts the relevant subset of variables from each year's file, so that we end up with a single set across all years that can be readily merged with the TEL/ACS data.

In [402]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import glob
import requests

## What are the common variables?

The first thing we will do is see if we can get what we need from just the common variables across all sets.  There are a few reasons why variables may not align every year:

1. Reuters doesn't offer the entire set of variables every year;
2. Variable names with long words may have been hyphenated in different ways across years;
3. If variables appeared more than once, the parsing routine appended the variable position to the name to create a unique variable.  If the position varies across years, so will the variable name.

To grab the columns in lightweight fashion, we will just read in the first couple of rows of each file.

In [403]:
#Grab list of files
files=glob.glob('../../debt_data/*.csv')

#Create a dictionary to hold columns from each year
col_dict={}

#For each file...
for f in files[:-1]:
    #...read in the first couple rows...
    tmp_df=pd.read_csv(f,nrows=2)
    #...capture the columns...
    #(key on the file name without directory or extension, e.g. '1988to1989')
    col_dict.update({f.split('/')[-1][:-4]:list(tmp_df.columns)})
    #...and dump the partial data set
    del tmp_df
    
#Create a container for the variable sets within each file
var_sets=[]

#For each file...
for f in col_dict.keys():
    #...add the variable set to var_sets
    var_sets.append(set(col_dict[f]))
    
#Capture the intersection of variables across all years
common_vars=sorted(list(set.intersection(*var_sets)))

print 'There are '+str(len(common_vars))+' variables common to all sets.'
print common_vars

There are 254 variables common to all sets.
['# of Mgrs', '$ Amount of Highest Cpn Maturity', '144A FLAG', '501c3', '8-Digit CUSIP25', '8-Digit CUSIP26', 'Accumulator Amt ($ Mil)', 'All Use of Proceeds (Code)', 'All Use of Proceeds (Desc)', 'All Use of Proceeds (Number)', 'Amount at Maturity ($ mils)', 'Amount of Final Maturity ($mils)', 'Amount of Issue ($ mils)', 'Amount of Maturity ($ mils)', 'Ant- ici- pa- tion Type', 'Asset Backed Indicator Flag (Y/N)', 'Auction Rate', 'Aver- age Life', 'Average Take Down', 'Bank Qual', 'Beginning Price/ Yield', 'Beginning Serial Coupon', 'Beginning Serial Maturity', 'Bid', 'Bk Elig', 'Bk En- try', 'Bnk Mgd', 'Bond Buyer ALL UOP', 'Bond Buyer GO Index', 'Bond Buyer Region118', 'Bond Buyer Region119', 'Bond Buyer Rev. Index', 'Bond Buyer UOP142', 'Bond Buyer UOP143', 'Bond Buyer UOP30', 'Bond Counsel Deal(Y/N)', 'CD-ROM Number', 'CUSIP of Insti- tutional Backer', 'Call Date', 'Call Issue', 'Call Price', 'Callable at Par', 'Co-Managers', 'Comb. Gros
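
Inconsistent hyphenation (reason 2 above) can shrink the common set artificially. The following is a minimal sketch of one way to normalize names before intersecting. It assumes the parsing artifacts always look like a lowercase letter, a hyphen, a space, and another lowercase letter. `normalize_name` and the `year_a`/`year_b` keys are hypothetical and not part of the original workflow, and the block is written for Python 3.

```python
import re

def normalize_name(name):
    """Rejoin words split as 'Aver- age' (hypothetical helper)."""
    # Remove '- ' only when it sits between two lowercase letters,
    # so 'Co-Managers' and 'Price/ Yield' are left untouched.
    return re.sub(r'(?<=[a-z])- (?=[a-z])', '', name)

# Toy column lists standing in for two years of data
cols_by_year = {
    'year_a': ['Aver- age Life', 'Bk En- try', 'Co-Managers'],
    'year_b': ['Average Life', 'Bk Entry', 'Co-Managers'],
}

# Intersection on the raw names versus the normalized names
raw_common = set.intersection(*[set(c) for c in cols_by_year.values()])
norm_common = set.intersection(
    *[set(normalize_name(n) for n in c) for c in cols_by_year.values()])

print(sorted(raw_common))
print(sorted(norm_common))
```

Whether this pattern is safe depends on the real column names; a legitimate name containing a hyphen followed by a space would be altered too.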

## Data Input

Ok, we are looking for aggregations of debt by county.  In particular, we want to capture activity by the following concepts:

1. Type of Debt (General Obligation or Revenue; latter can be split by )
2. Issuer Type (General purpose gov, school district, special district, or private entity)
3. Purpose of the Issue
4. Volume of Issue

For the latter two, we also want variables that split out GO versus revenue bonds.  For example, we would want to know the volume of GO debt issued by general purpose jurisdictions, or the revenue debt issued in service of transportation infrastructure.  The following table maps concepts to variables.

Concept|Variable|Possible Values
-------|--------|---------------
Debt Type|`Security Type`| GO<br>RV
Issuer Type|`Issuer Type Description`|District<br>City, Town Vlg<br>Local Authority<br>State Authority<br>County/Parish<br>College or Univ<br>State/Province<br>Direct<br>Indian Tribe<br>Co-op Utility
Purpose|`Bond Buyer UOP30`|Development<br>Education<br>Electric Power<br>Environmental Facilities<br>General Purpose<br>Healthcare<br>Housing<br>Public Facilities<br>Transportation<br>Utilities
Volume|`Amount of Maturity ($ mils)`|Continuous
County|`County`|Any county in the US
State|`State`|Any state in the US
Issue Date|`Sale Date`|Continuous (we only need the year)

Fortunately, all of these variables appear in the common set.

In [404]:
#Define required variables
req_vars=['Security Type','Issuer Type Description','Bond Buyer UOP30',\
          'Amount of Maturity ($ mils)','County','State','Sale Date','Issuer','Net Interest Cost',\
          'Coupon of Fin Maty','Coupon Maturity','SDC Est. Gross Spread','True Interest Cost','Coupon Rate']
print 'All the requisite variables are in the common set:',np.array([var in common_vars for var in req_vars]).all()

All the requisite variables are in the common set: True


That makes things easier.  Let's just go ahead and read the data in from all years, keeping only the variables in `req_vars`.

In [405]:
#Create a container for DFs from all years
df_list=[]

#For each file...
for f in files[:-1]:
    print f
    #...throw the subset into df_list
    df_list.append(pd.read_csv(f,usecols=req_vars))
    
#Concatenate all the years together
debt=pd.concat(df_list)

#Convert sale date to datetime (vectorized)
debt['Sale Date']=pd.to_datetime(debt['Sale Date'])

#Generate a year variable
debt['Year']=debt['Sale Date'].dt.year

#Jettison Sale Date
debt.pop('Sale Date')

print debt.info()
debt.head()

../../debt_data/1988to1989.csv
../../debt_data/2004.csv
../../debt_data/2014to2015.csv
../../debt_data/1990to1991.csv
../../debt_data/2006to2007.csv
../../debt_data/1992to1993.csv
../../debt_data/2010to2011.csv
../../debt_data/2012to2013.csv
../../debt_data/2000to2001.csv
../../debt_data/2005.csv
../../debt_data/2008to2009.csv
../../debt_data/1998to1999.csv
../../debt_data/1986to1987.csv
../../debt_data/1994to1995.csv
../../debt_data/1996to1997.csv
../../debt_data/1984to1985.csv
../../debt_data/2002to2003.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 465391 entries, 0 to 36385
Data columns (total 14 columns):
Bond Buyer UOP30               465388 non-null object
Amount of Maturity ($ mils)    465391 non-null object
Coupon Maturity                170258 non-null float64
County                         461215 non-null object
Coupon of Fin Maty             397978 non-null float64
Coupon Rate                    0 non-null float64
Issuer                         465357 non-null object

Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year
0,Utilities,0.48,,Callaway,7.75,,Callaway Co Pub Wtr Supp Dt #2,District,,,MO,RV,,1988
1,Utilities,0.05,,Cass,,,Cleveland-Missouri,"City, Town Vlg",,,MO,RV,,1988
2,General Purpose,5.175,,Gunnison,,,Skyland Metropolitan Dt,District,,,CO,GO,,1988
3,Education,0.273,,Clermont/Warren,,,Clermont Co (Goshen) LSD,District,,,OH,GO,,1988
4,Transportation,0.22,,Bartholomew,,,Flat Rock-Hawcreek School Corp,District,,,IN,GO,,1988


## Identification of FIPS codes

We need to merge in FIPS codes, which are conveniently held by Census on a public site.

In [406]:
#Define names for fields
fips_names=['state','fips_st','fips_co','county','unknown']

#Capture dtypes of fips code variables (to keep the zeroes)
fips_dtypes={'fips_st':str,
             'fips_co':str}

#Read in fips
fips=pd.read_csv('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
                 names=fips_names,dtype=fips_dtypes)

#Remove 'County' and 'Parish' from the county names
fips['county']=fips['county'].str.replace(' County','')
fips['county']=fips['county'].str.replace(' Parish','')

#Create composite name and fips variables
fips['st_cou']=fips.apply(lambda row: (row['state']+'_'+row['county']).lower(),axis=1)
fips['fips']=fips.apply(lambda row: (row['fips_st']+row['fips_co']).lower(),axis=1)

#Capture dict to map composite names to composite fips codes
fips_dict=dict(zip(fips['st_cou'],fips['fips']))

#Fix the counties with 'St' (insert a period)
debt['County']=debt['County'].str.replace('St ','St. ')

#Generate composite name for the debt data
debt['st_cou']=debt.apply(lambda row: (str(row['State'])+'_'+str(row['County'])).lower(),axis=1)

#Map in fips codes
debt['FIPS']=debt['st_cou'].map(fips_dict)

#Write in temporary set (before FIPS improvement)
debt.to_csv('debt_ts_pre_fips.csv')

debt

Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year,st_cou,FIPS
0,Utilities,0.48,,Callaway,7.750,,Callaway Co Pub Wtr Supp Dt #2,District,,,MO,RV,,1988,mo_callaway,29027
1,Utilities,0.05,,Cass,,,Cleveland-Missouri,"City, Town Vlg",,,MO,RV,,1988,mo_cass,29037
2,General Purpose,5.175,,Gunnison,,,Skyland Metropolitan Dt,District,,,CO,GO,,1988,co_gunnison,08051
3,Education,0.273,,Clermont/Warren,,,Clermont Co (Goshen) LSD,District,,,OH,GO,,1988,oh_clermont/warren,
4,Transportation,0.22,,Bartholomew,,,Flat Rock-Hawcreek School Corp,District,,,IN,GO,,1988,in_bartholomew,18005
5,Education,1.798,,Lake,,,Crown Point Comm School Corp,District,,,IN,GO,,1988,in_lake,18089
6,Healthcare,0.09,,Carver,9.000,,Chaska City-Minnesota,"City, Town Vlg",,,MN,RV,,1988,mn_carver,27019
7,General Purpose,1.32,,Platte,7.550,,Columbus City-Nebraska,"City, Town Vlg",,,NE,GO,,1988,ne_platte,31141
8,General Purpose,3.68,,Grundy,7.600,,Grundy Co-Illinois,County/Parish,,,IL,GO,,1988,il_grundy,17063
9,General Purpose,0.955,,St. Croix,6.700,,Hudson City-Wisconsin,"City, Town Vlg",6.620,,WI,GO,,1988,wi_st. croix,55109
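
As a plain-Python illustration of the key construction above, here is a small sketch that does not touch the network. The three sample rows reuse codes that appear in the output above; the last field is a placeholder, and `fips_lookup` and `debt_key` are hypothetical names. Unlike the cell above, this variant strips the suffix only at the end of the name and adds the period only when the county starts with `St `, which is an assumption about what is intended. Written for Python 3.

```python
import csv
import io

# Sample rows in the layout of the county file (state, state code,
# county code, county name, extra field); 'X' is a placeholder value.
sample = io.StringIO(
    "MO,29,027,Callaway County,X\n"
    "CO,08,051,Gunnison County,X\n"
    "WI,55,109,St. Croix County,X\n"
)

fips_lookup = {}
for state, fips_st, fips_co, county, _ in csv.reader(sample):
    # Strip the suffix only when it ends the name
    for suffix in (' County', ' Parish'):
        if county.endswith(suffix):
            county = county[:-len(suffix)]
    # Codes stay strings so the leading zero in '08051' survives
    fips_lookup[(state + '_' + county).lower()] = fips_st + fips_co

def debt_key(state, county):
    """Build the matching key for a debt record, adding the period after 'St'."""
    if county.startswith('St '):
        county = 'St. ' + county[3:]
    return (state + '_' + county).lower()

print(fips_lookup[debt_key('WI', 'St Croix')])
```

Keeping the codes as strings is the important design choice: converting `'08051'` to an integer would drop the leading zero and break later merges on FIPS.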


In [407]:
dict(zip(debt[debt['FIPS'].notnull()]['FIPS'].apply(lambda x: x[:2]),debt[debt['FIPS'].notnull()]['State']))

{'01': 'AL',
 '02': 'AK',
 '04': 'AZ',
 '05': 'AR',
 '06': 'CA',
 '08': 'CO',
 '09': 'CT',
 '10': 'DE',
 '11': 'DC',
 '12': 'FL',
 '13': 'GA',
 '15': 'HI',
 '16': 'ID',
 '17': 'IL',
 '18': 'IN',
 '19': 'IA',
 '20': 'KS',
 '21': 'KY',
 '22': 'LA',
 '23': 'ME',
 '24': 'MD',
 '25': 'MA',
 '26': 'MI',
 '27': 'MN',
 '28': 'MS',
 '29': 'MO',
 '30': 'MT',
 '31': 'NE',
 '32': 'NV',
 '33': 'NH',
 '34': 'NJ',
 '35': 'NM',
 '36': 'NY',
 '37': 'NC',
 '38': 'ND',
 '39': 'OH',
 '40': 'OK',
 '41': 'OR',
 '42': 'PA',
 '44': 'RI',
 '45': 'SC',
 '46': 'SD',
 '47': 'TN',
 '48': 'TX',
 '49': 'UT',
 '50': 'VT',
 '51': 'VA',
 '53': 'WA',
 '54': 'WV',
 '55': 'WI',
 '56': 'WY'}

In [408]:
print 'Total number of issues:',len(debt)
print 'Number of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])
print 'Proportion of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])/float(len(debt))

Total number of issues: 465391
Number of issues that still do not have a FIPS code: 101639
Proportion of issues that still do not have a FIPS code: 0.218394855079
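
The same three-line report is repeated after each matching step below. A small helper could replace the repetition; this is only a sketch, with `report_missing` a hypothetical name, `None` standing in for a missing code, and Python 3 assumed.

```python
def report_missing(codes):
    """Print and return (total, missing, share) for a list of codes.

    A value of None is treated as a missing FIPS code.
    """
    total = len(codes)
    missing = sum(1 for c in codes if c is None)
    share = missing / float(total) if total else 0.0
    print('Total number of issues:', total)
    print('Number of issues that still do not have a FIPS code:', missing)
    print('Proportion of issues that still do not have a FIPS code:', share)
    return total, missing, share

# Toy input: two matched records and two misses
summary = report_missing(['29027', None, '08051', None])
```

With a DataFrame, the list passed in could be built from the `FIPS` column, with nulls mapped to `None`.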


### State Issuers

Ok, that got about three quarters of the records.  Let's try to get the rest.  Many issuers come from the `State Authority` or the `State` outright.  Pretty much, if `State` is anywhere in the description, no single county can be affiliated with the issue. So, let's allocate the state FIPS to all of them.

In [409]:
#Capture states
states=sorted(set(debt['State']))[1:]

#For each state...
for st in states:
    #...capture the keys associated with that state...
    st_keys=[item for item in fips_dict.items() if str(st).lower()+'_' in item[0]]
    try:
        #...extract the state portion of the value associated with the first member of the list...
        st_key_part=st_keys[0][1][:2]
        #...and assign the state fips code
        state_in_desc=((debt['Issuer Type Description'].apply(lambda x: 'State' in str(x))) |\
                       (debt['County'].apply(lambda x: 'State' in str(x))))
        st_mask=(debt['State']==st) & (state_in_desc)
        debt.ix[st_mask,'FIPS']=st_key_part+'000'
#         debt.ix[(debt['State']==st) & (debt['County']=='State'),'FIPS']=st_key_part+'000'
    #An empty st_keys list means the state has no entries in the FIPS data
    except IndexError:
        print 'Problem state >>> ',st

debt[debt['County']=='State Authority']

Problem state >>>  MR
Problem state >>>  TT


Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year,st_cou,FIPS
10,Education,12,,State Authority,,,Massachusetts Hlth & Ed Facs Au,State Authority,,,MA,RV,,1988,ma_state authority,25000
46,Housing,43,,State Authority,10.550,,Alaska Housing Finance Corp,State Authority,,,AK,RV,,1988,ak_state authority,02000
54,Housing,1.865,,State Authority,8.500,,Maryland Dept of Hsg & Comm Dev,State Authority,8.343,8.267,MD,RV,,1988,md_state authority,24000
57,Transportation,150,,State Authority,,,Port Authority of NY & NJ,State Authority,,,NY,RV,,1988,ny_state authority,36000
62,Healthcare,11.16,,State Authority,5.100,,California Health Facs Fin Auth,State Authority,,,CA,RV,,1988,ca_state authority,06000
65,Healthcare,9.18,,State Authority,,,Illinois Health Facilities Auth,State Authority,,,IL,RV,,1988,il_state authority,17000
111,General Purpose,25,,State Authority,7.000,,Florida Dept of Nat Resources,State Authority,7.780,,FL,RV,,1988,fl_state authority,12000
124,Housing,6.58,,State Authority,8.800,,Massachusetts Housing Fin Auth,State Authority,8.841,8.948,MA,RV,,1988,ma_state authority,25000
127,Education,275,,State Authority,,,Nebraska Higher Ed Loan Prog Inc,State Authority,,,NE,RV,,1988,ne_state authority,31000
170,Housing,20.4,,State Authority,8.600,,California Housing Finance Agcy,State Authority,8.270,8.32,CA,RV,,1988,ca_state authority,06000
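
The state prefix can also be derived in a single pass over the mapping rather than by rescanning it for every state. The following is a sketch under stated assumptions: the toy `fips_dict` reuses codes shown in the output above, `state_prefix` and `state_fips` are hypothetical names, and Python 3 is assumed.

```python
# Toy stand-in for the composite-name -> FIPS mapping
fips_dict = {'mo_callaway': '29027', 'mo_cass': '29037',
             'co_gunnison': '08051', 'wi_st. croix': '55109'}

# One pass: state abbreviation -> two-digit state prefix
state_prefix = {}
for key, code in fips_dict.items():
    state_prefix.setdefault(key.split('_', 1)[0], code[:2])

def state_fips(state):
    """Return the state-level code (prefix + '000'), or None if unknown."""
    prefix = state_prefix.get(str(state).lower())
    return prefix + '000' if prefix is not None else None

print(state_fips('MO'), state_fips('MR'))
```

Returning `None` for abbreviations with no entries (such as `MR` and `TT` above) makes the problem states easy to list without a `try`/`except`.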


In [410]:
print 'Total number of issues:',len(debt)
print 'Number of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])
print 'Proportion of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])/float(len(debt))

Total number of issues: 465391
Number of issues that still do not have a FIPS code: 51254
Proportion of issues that still do not have a FIPS code: 0.110131051095


### Using Google to Capture Colleges and Some of the Remaining Misses

Ok, we still have a number of misses.  I considered just taking the first element from hybrid county descriptions, but sometimes the subsequent positions are meaningful.  For example, we wouldn't necessarily get Bronx County if we only captured New York from `New York/Bronx/Kings`.  There are too many to do by hand, so we are going to use the Google Maps API to narrow the current gap.  There are a couple of lists we have to sort through:

1. Issues with some version of `College or University` as the county description.  We will key on issuer names in this scenario.
2. County descriptions that did not have a direct match in the FIPS data.

The API does not return FIPS codes, but it does provide regular county names.  We can construct new keys from the state_county combination that are, hopefully, members of the FIPS mapping dictionary.  First, let's identify our lists.

In [411]:
#Capture list of college issuers
college_pairs=zip(debt[debt['County'].apply(lambda x: 'College' in str(x))]['Issuer'],
                  debt[debt['County'].apply(lambda x: 'College' in str(x))]['State'])
college_issues=list(set(college_pairs))

#Create masks for other county descriptions (excluding authorities)
no_st_auth=((debt['Issuer Type Description'].apply(lambda x: 'State' not in str(x))) &\
            (debt['County'].apply(lambda x: 'State' not in str(x))))
no_loc_auth=(debt['County'].apply(lambda x: 'Local' not in str(x)))
no_college=(debt['County'].apply(lambda x: 'College' not in str(x)))

#Capture list of random issuers
rando_pairs=zip(debt[no_st_auth & no_loc_auth & no_college]['County'],
                debt[no_st_auth & no_loc_auth & no_college]['State'])
rando_issues=list(set(rando_pairs))

len(college_issues)

381

Now we need a function to implement the geocoding...

In [412]:
def county_id(point_of_interest,state=None):
    '''Function returns state_county concatenation of a given point of interest.'''
    #Set base URL
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    #Set parameters for call to API (which are appended to the base)
    params = {'sensor': 'false',
              'address': point_of_interest,
              'key':'Get your own key son!'}
    #Make call to API
    r = requests.get(url, params=params)
    #Capture results
    results = r.json()['results']
#     print results
#     print len(results)
    #If a state is provided...
    if state is not None:
        #...for each hit...
        for r in results:
            #...capture the state and county...
            res_st=[comp['short_name'] for comp in r['address_components'] \
                    if comp['types'][0]=='administrative_area_level_1']
            res_co=[comp['short_name'] for comp in r['address_components'] \
                    if comp['types'][0]=='administrative_area_level_2']
            #...if the state matches...
            if res_st[0]==state:
                #...return the county...
                return (res_st[0]+'_'+res_co[0].replace(' County','')).lower()
    else:
        #Capture the state and county
        res_st=[comp['short_name'] for comp in results[0]['address_components'] \
                if comp['types'][0]=='administrative_area_level_1']
        res_co=[comp['short_name'] for comp in results[0]['address_components'] \
                if comp['types'][0]=='administrative_area_level_2']
        #Capture county from the first hit
        return (res_st[0]+'_'+res_co[0].replace(' County','')).lower()
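
The parsing step can be exercised offline on a hand-written stand-in for one element of the `results` list. The sketch below is a more defensive variant, not the function above: it checks for the type anywhere in `types` rather than only in the first position, and it returns `None` instead of raising when a component is missing. `key_from_result` and `fake_hit` are hypothetical, and the sample dictionary only mimics the shape the function above relies on. Written for Python 3.

```python
def key_from_result(result):
    """Return 'st_county' for one geocoder hit, or None if a part is missing."""
    state = county = None
    for comp in result.get('address_components', []):
        types = comp.get('types', [])
        if 'administrative_area_level_1' in types:
            state = comp['short_name']
        elif 'administrative_area_level_2' in types:
            county = comp['short_name']
    if state is None or county is None:
        return None
    return (state + '_' + county.replace(' County', '')).lower()

# Hand-written stand-in for one hit (not real API output)
fake_hit = {'address_components': [
    {'long_name': 'Summit County', 'short_name': 'Summit County',
     'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'Ohio', 'short_name': 'OH',
     'types': ['administrative_area_level_1', 'political']},
]}

print(key_from_result(fake_hit))
```

Separating the parsing from the HTTP call also makes it possible to test the key construction without spending API quota.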

Now, let's roll through these places and capture (hopefully) better keys.  Note that this takes a while, so we will store our results and read them back the next time.

In [413]:
# print 'Number of College Issues:',len(college_issues)
# print 'Number of Random Issues:',len(rando_issues)

# #Create dicts for colleges and the randos
# college_map={}
# rando_map={}

# #Create containers for misses
# college_miss=[]
# rando_miss=[]

# print '***INITIALIZING COLLEGE LOOP***'
# #For each college...
# for i,college in enumerate(college_issues):
#     if i%50==0:
#         print i,'|',college
#     try:
#         #...capture the new key...
#         college_map.update({college[0]:county_id(college[0]+' '+college[1])})
#     except:
#         college_miss.append(college)

# print '\n\n***INITIALIZING RANDO LOOP***'
# #For each random issuer...
# for i,rando in enumerate(rando_issues):
#     if i%50==0:
#         print i,'|',rando
#     try:
#         #...capture the new key...
#         rando_map.update({rando[0]:county_id(rando[0]+' '+rando[1])})
#     except:
#         rando_miss.append(rando)

We can capture the successful API calls in two series.

In [414]:
# #Capture in series
# college_new_keys=Series(college_map)
# rando_new_keys=Series(rando_map)

# #Remove special characters
# ##Tildes
# college_new_keys=college_new_keys.apply(lambda x: x.encode('utf-8').replace('\xc3\xb1','n'))
# rando_new_keys=rando_new_keys.apply(lambda x: x.encode('utf-8').replace('\xc3\xb1','n'))

# #Write to disk
# college_new_keys.to_csv('../data/g_api_college.csv')
# rando_new_keys.to_csv('../data/g_api_rando.csv')

# len(college_new_keys),len(rando_new_keys)
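
The store-and-reload pattern used here (run the slow API loop once, write the results, read them back later) can also be wrapped in a small cache. This is a sketch only: `cached_lookup` is a hypothetical helper, it persists to JSON rather than the CSV files used in this notebook, and it assumes Python 3.

```python
import json
import os

def cached_lookup(items, lookup_fn, cache_path):
    """Look up each item at most once, persisting results to a JSON file.

    Failed lookups are stored as None so they are not retried.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    for item in items:
        if item not in cache:
            try:
                cache[item] = lookup_fn(item)
            except Exception:
                cache[item] = None  # remember the miss
    with open(cache_path, 'w') as fh:
        json.dump(cache, fh)
    return cache
```

A call such as `cached_lookup(names, county_id, 'geocode_cache.json')` would then only hit the API for names not seen in an earlier run; the file name here is illustrative.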

In [415]:
#Read from disk
college_new_keys=pd.read_csv('../data/g_api_college.csv',names=['desc','key'])
rando_new_keys=pd.read_csv('../data/g_api_rando.csv',names=['desc','key'])

#Set indices
college_new_keys.set_index('desc',inplace=True)
rando_new_keys.set_index('desc',inplace=True)

college_new_keys.head()

desc,key
Akron University,oh_summit
Alabama State Board of Education,al_clarke
Alabama State University,al_madison
Arizona State University,az_maricopa
Arizona Western College,az_maricopa


We need to map in these new keys, but I would like to preserve the ability to compare the old and new keys.  Consequently, we will create a new `st_cou_g1` variable that will hold the new keys, and a composite variable `st_cou_final` that holds the keys from `st_cou_g1` where they exist, and `st_cou` where they don't.  Note that these should only be assigned where the FIPS code is currently missing.

In [416]:
#Define mask
nofips=(debt['FIPS'].isnull())

#Generate new var
debt['st_cou_g1']=''

#Fill in random new keys
debt.ix[nofips,'st_cou_g1']=debt.ix[nofips]['County'].map(rando_new_keys['key'])

#Fill in college new keys
no_st_cou_g1=(debt['st_cou_g1'].isnull())
debt.ix[nofips & no_st_cou_g1,'st_cou_g1']=debt.ix[nofips & no_st_cou_g1]['Issuer'].map(college_new_keys['key'])

#Create composite variable
debt['st_cou_final']=np.where(debt['st_cou_g1'].notnull(),debt['st_cou_g1'],debt['st_cou'])

#For records without FIPS, use st_cou_final to map in a code
debt.ix[nofips,'FIPS']=debt.ix[nofips]['st_cou_final'].map(fips_dict)

debt[debt['FIPS'].isnull()]

Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year,st_cou,FIPS,st_cou_g1,st_cou_final
66,Education,1.05,,Marion/Polk,7.100,,Marion Co (Salem-Keizer) SD #24-J,District,7.132,7.217,OR,RV,,1988,or_marion/polk,,,or_marion/polk
80,Utilities,0.5,,,,,Camano Vista Water Dt,District,,,CA,RV,,1988,ca_nan,,,ca_nan
116,Education,3.6,,College or University,,,Indiana Vo-Tech College,College or Univ,,,IN,GO,,1988,in_college or university,,,in_college or university
162,General Purpose,2,,,7.700,,Stonegate Metropolitan Dt,District,,,CA,GO,,1988,ca_nan,,,ca_nan
210,Education,24.49,,Douglas/Durango/Eagle,5.250,,Colorado SD,District,,,CO,GO,,1988,co_douglas/durango/eagle,,,co_douglas/durango/eagle
228,Education,4.9,,Anoka/Ramsey/Washington,6.600,,Anoka Co (NE Metro) ISD #916,District,6.573,,MN,GO,9.832,1988,mn_anoka/ramsey/washington,,,mn_anoka/ramsey/washington
231,Education,5,,DuPage/Will,6.500,,DuPage Co (Naperville) CUSD #203,District,6.374,,IL,GO,7.912,1988,il_dupage/will,,,il_dupage/will
265,Education,29.1,,College or University,5.250,,University of Missouri Curators,College or Univ,4.740,,MO,GO,,1988,mo_college or university,,,mo_college or university
287,Education,2.744,,Monroe/Livingston/Ontario,4.880,,Monroe Co (Honeoye Falls-Lima) CSD,District,,,NY,GO,,1988,ny_monroe/livingston/ontario,,,ny_monroe/livingston/ontario
338,General Purpose,2.935,,,,,Springfield-Indiana,"City, Town Vlg",,,IN,RV,,1988,in_nan,,,in_nan


In [417]:
print 'Total number of issues:',len(debt)
print 'Number of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])
print 'Proportion of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])/float(len(debt))

Total number of issues: 465391
Number of issues that still do not have a FIPS code: 12763
Proportion of issues that still do not have a FIPS code: 0.0274242518656


### Dealing with composite `County` descriptions

We are down to about 12,800 records with no FIPS codes out of 465,000.  How many of these are because of compound county descriptions (e.g. `New York/Bronx/Kings`)?

In [418]:
debt[debt['FIPS'].isnull()]['st_cou_final'].apply(lambda x: '/' in str(x)).sum()

4440

Looks like we can take out roughly 35% of our misses (4,440 of 12,763) if one of the locations is an actual county name in the FIPS data.  Our approach will be to roll through the compound county descriptions, individually pair them with the associated state, and see if they show up in our keys in `fips_dict`.  While it is not always the case that a county name will show up in the `Issuer` variable, when it does, it appears to correspond with the first jurisdiction mentioned in the `County` variable.  Consequently, our rule will be to take the first match we find.

In [419]:
#Build mask to capture compound counties that do not have FIPS yet
nofips_compound=(debt['FIPS'].isnull()) & (debt['st_cou_final'].apply(lambda x: '/' in str(x)))

#Define function that returns FIPS for county description components
def composite_match(s,delim='/'):
    '''Function returns the FIPS code of the first jurisdiction in s found in fips_dict (None if no match).'''
    #If the first jurisdiction is a county in the FIPS set...
    first=s.split(delim)[0]
    if first in fips_dict:
        #...return the appropriate FIPS code...
        return fips_dict[first]
    #...otherwise, capture the state prefix (e.g. 'ny_')...
    s_st=s[:3]
    #...and for each jurisdiction after the prefix...
    for j in s[3:].split(delim):
        #...if one of them shows up in FIPS...
        if s_st+j in fips_dict:
            #...return the appropriate FIPS code
            return fips_dict[s_st+j]
    #No component matched
    return None
            
#Assign FIPS codes to composite county records
debt.ix[nofips_compound,'FIPS']=debt.ix[nofips_compound]['st_cou_final'].apply(lambda x: composite_match(x))

#Redefine mask for backslash as the delimiter
nofips_compound=(debt['FIPS'].isnull()) & (debt['st_cou_final'].apply(lambda x: '\\' in str(x)))

#Assign FIPS codes to composite county records
debt.ix[nofips_compound,'FIPS']=debt.ix[nofips_compound]['st_cou_final'].apply(lambda x: \
                                                                               composite_match(x,delim='\\'))

debt[debt['FIPS'].isnull()]

Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year,st_cou,FIPS,st_cou_g1,st_cou_final
80,Utilities,0.5,,,,,Camano Vista Water Dt,District,,,CA,RV,,1988,ca_nan,,,ca_nan
116,Education,3.6,,College or University,,,Indiana Vo-Tech College,College or Univ,,,IN,GO,,1988,in_college or university,,,in_college or university
162,General Purpose,2,,,7.700,,Stonegate Metropolitan Dt,District,,,CA,GO,,1988,ca_nan,,,ca_nan
265,Education,29.1,,College or University,5.250,,University of Missouri Curators,College or Univ,4.740,,MO,GO,,1988,mo_college or university,,,mo_college or university
338,General Purpose,2.935,,,,,Springfield-Indiana,"City, Town Vlg",,,IN,RV,,1988,in_nan,,,in_nan
347,Education,2.36,,College or University,,,Indiana State University Bd of Trustees,College or Univ,,,IN,GO,,1988,in_college or university,,,in_college or university
415,Education,1,,St. Clair\Washington,7.250,,St Clair Co (Freeburg) CHSD #77,District,7.044,,IL,GO,,1988,il_st. clair\washington,,il_st clair,il_st clair
427,Utilities,7.815,,Fairbanks No Star,7.800,,Fairbanks-Alaska,"City, Town Vlg",7.653,,AK,RV,,1988,ak_fairbanks no star,,ak_fairbanks north star,ak_fairbanks north star
445,General Purpose,1.5,,,,,Rohstown-Texas,"City, Town Vlg",,,TX,GO,,1988,tx_nan,,,tx_nan
449,Education,90.5,,,,,Western Loan Marketing Assoc,Local Authority,,,AZ,RV,,1988,az_nan,,,az_nan
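
To make the first-match rule concrete, here is a self-contained sketch on toy data. The composite keys are made up for illustration (built from two Missouri codes shown earlier), `first_match` is a hypothetical name, and the `strip()` call is an addition not present in the function above. Written for Python 3.

```python
# Toy mapping reusing two codes from the output above
fips_dict = {'mo_callaway': '29027', 'mo_cass': '29037'}

def first_match(key, lookup, delim='/'):
    """Return the code of the first jurisdiction in `key` found in `lookup`."""
    # Split 'mo_callaway/cass' into the state and the county list
    state, _, counties = key.partition('_')
    for county in counties.split(delim):
        code = lookup.get(state + '_' + county.strip())
        if code is not None:
            return code
    return None

print(first_match('mo_callaway/cass', fips_dict))
print(first_match('mo_nowhere/cass', fips_dict))
print(first_match('mo_callaway\\cass', fips_dict, delim='\\'))
```

Because the rule stops at the first hit, a compound description is assigned entirely to one county; the other jurisdictions in the description receive none of the issue.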


In [420]:
print 'Total number of issues:',len(debt)
print 'Number of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])
print 'Proportion of issues that still do not have a FIPS code:',len(debt[debt['FIPS'].isnull()])/float(len(debt))

Total number of issues: 465391
Number of issues that still do not have a FIPS code: 8359
Proportion of issues that still do not have a FIPS code: 0.0179612411929


So, we still have over 8000 misses, but that represents under 2% of debt issues.  That's a decent hit rate.  Here's where those misses occur.

In [421]:
print 'Number of counties without FIPS:',len(debt.ix[(debt['FIPS'].isnull())]['County'].value_counts())
debt.ix[(debt['FIPS'].isnull())]['County'].value_counts()

Number of counties without FIPS: 201


College or University       1672
Local Authority              677
College or Univ              325
Anchorage                    166
La Salle                     141
Direct Issuer                136
Baton Rouge                  108
Bossier/Caddo                 75
North Slope                   57
Matanuska-Susitna             36
Fairbanks No Star             32
St. Marys                     29
Hampton Indep City            29
Valdez/Cordova                27
Barron/Dunn/St. Croix         23
Orleans Parish                22
Plaquemine                    20
Perham/Dent                   20
East Baton Rouge Parish       19
District                      19
Northwest                     18
Fairbanks                     18
Goge                          16
Sanilac/Lapeer/St. Clair      15
Marshall/St. Joseph           13
County/Parish                 12
Ft Pierce                     12
Saipan                        12
James                         11
Quachita                      11
          

In [422]:
#Subset to debt issues with FIPS codes
debt_fips=debt[debt['FIPS'].notnull()]

In [423]:
debt_fips.head()

Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year,st_cou,FIPS,st_cou_g1,st_cou_final
0,Utilities,0.48,,Callaway,7.75,,Callaway Co Pub Wtr Supp Dt #2,District,,,MO,RV,,1988,mo_callaway,29027,,
1,Utilities,0.05,,Cass,,,Cleveland-Missouri,"City, Town Vlg",,,MO,RV,,1988,mo_cass,29037,,
2,General Purpose,5.175,,Gunnison,,,Skyland Metropolitan Dt,District,,,CO,GO,,1988,co_gunnison,8051,,
3,Education,0.273,,Clermont/Warren,,,Clermont Co (Goshen) LSD,District,,,OH,GO,,1988,oh_clermont/warren,39155,oh_trumbull,oh_trumbull
4,Transportation,0.22,,Bartholomew,,,Flat Rock-Hawcreek School Corp,District,,,IN,GO,,1988,in_bartholomew,18005,,


In [424]:
all_cty_year=set(zip(debt_fips['Year'],
                     debt_fips['FIPS']))
nic_cty_year=set(zip(debt_fips[debt_fips['Net Interest Cost'].notnull()]['Year'],
                     debt_fips[debt_fips['Net Interest Cost'].notnull()]['FIPS']))
tic_cty_year=set(zip(debt_fips[debt_fips['True Interest Cost'].notnull()]['Year'],
                     debt_fips[debt_fips['True Interest Cost'].notnull()]['FIPS']))
all_cty=set(debt_fips['FIPS'])
nic_cty=set(debt_fips[debt_fips['Net Interest Cost'].notnull()]['FIPS'])
tic_cty=set(debt_fips[debt_fips['True Interest Cost'].notnull()]['FIPS'])

len(all_cty_year),len(nic_cty_year),len(tic_cty_year),len(all_cty),len(nic_cty),len(tic_cty)

(61292, 29320, 22847, 3073, 2746, 2436)

In [425]:
print '*** TOTAL COUNTS OF YEAR-COUNTIES ***'
print 'Entire Set:',len(all_cty_year)
print 'Subset with NIC values:',len(nic_cty_year)
print 'Subset with TIC values:',len(tic_cty_year)
print '\n*** TOTAL COUNTS OF ALL COUNTIES ***'
print 'Entire Set:',len(all_cty)
print 'Subset with NIC values:',len(nic_cty)
print 'Subset with TIC values:',len(tic_cty)
print '\n*** OVERLAP ACROSS YEAR-COUNTIES ***'
print 'Overlapping year-counties between NIC and TIC:',len(nic_cty_year & tic_cty_year)
print 'Total year-counties with NIC or TIC:',len(nic_cty_year | tic_cty_year)
print 'Proportion of year-counties with NIC or TIC:',len(nic_cty_year | tic_cty_year)/float(len(all_cty_year))
print '\n*** OVERLAP ACROSS COUNTIES ***'
print 'Overlapping counties between NIC and TIC:',len(nic_cty & tic_cty)
print 'Total counties with NIC or TIC:',len(nic_cty | tic_cty)
print 'Proportion of counties with NIC or TIC:',len(nic_cty | tic_cty)/float(len(all_cty))

*** TOTAL COUNTS OF YEAR-COUNTIES ***
Entire Set: 61292
Subset with NIC values: 29320
Subset with TIC values: 22847

*** TOTAL COUNTS OF ALL COUNTIES ***
Entire Set: 3073
Subset with NIC values: 2746
Subset with TIC values: 2436

*** OVERLAP ACROSS YEAR-COUNTIES ***
Overlapping year-counties between NIC and TIC: 13037
Total year-counties with NIC or TIC: 39130
Proportion of year-counties with NIC or TIC: 0.638419369575

*** OVERLAP ACROSS COUNTIES ***
Overlapping counties between NIC and TIC: 2369
Total counties with NIC or TIC: 2813
Proportion of counties with NIC or TIC: 0.915392124959
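
The overlap figures above can be cross-checked with inclusion-exclusion, since the number of year-counties with NIC or TIC must equal the NIC count plus the TIC count minus the overlap. A short sketch using the counts printed above, followed by the same identity on toy `(year, FIPS)` pairs (Python 3 assumed):

```python
# Counts copied from the output above
n_all, n_nic, n_tic, n_both = 61292, 29320, 22847, 13037

# Inclusion-exclusion: |NIC or TIC| = |NIC| + |TIC| - |NIC and TIC|
n_either = n_nic + n_tic - n_both
share = n_either / float(n_all)
print(n_either, share)

# The same identity on toy (year, FIPS) pairs
nic = {(1988, '29027'), (1988, '29037')}
tic = {(1988, '29037'), (1989, '08051')}
assert len(nic | tic) == len(nic) + len(tic) - len(nic & tic)
```

The county-level counts pass the same check: 2746 + 2436 - 2369 = 2813.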


## Dropping Issues by States and Colleges/Universities

We want to focus on local governments, and state issues are probably so large that they would dominate the totals.  We must drop them now, while we still have access to the Purpose of each debt issue.

In [426]:
debt_fips.head().T

Unnamed: 0,0,1,2,3,4
Bond Buyer UOP30,Utilities,Utilities,General Purpose,Education,Transportation
Amount of Maturity ($ mils),0.48,0.05,5.175,0.273,0.22
Coupon Maturity,,,,,
County,Callaway,Cass,Gunnison,Clermont/Warren,Bartholomew
Coupon of Fin Maty,7.75,,,,
Coupon Rate,,,,,
Issuer,Callaway Co Pub Wtr Supp Dt #2,Cleveland-Missouri,Skyland Metropolitan Dt,Clermont Co (Goshen) LSD,Flat Rock-Hawcreek School Corp
Issuer Type Description,District,"City, Town Vlg",District,District,District
Net Interest Cost,,,,,
True Interest Cost,,,,,


In [427]:
#Define issuers to avoid
issuers_to_drop=['College or Univ','State Authority','State/Province']

print 'Observations before subset:',len(debt_fips)
debt_fips=debt_fips[~debt_fips['Issuer Type Description'].isin(issuers_to_drop)]
print 'Observations after subset:',len(debt_fips)

Observations before subset: 457032
Observations after subset: 396149


In [428]:
debt_fips[debt_fips['Issuer Type Description'].isin(issuers_to_drop)]

Unnamed: 0,Bond Buyer UOP30,Amount of Maturity ($ mils),Coupon Maturity,County,Coupon of Fin Maty,Coupon Rate,Issuer,Issuer Type Description,Net Interest Cost,True Interest Cost,State,Security Type,SDC Est. Gross Spread,Year,st_cou,FIPS,st_cou_g1,st_cou_final


In [429]:
#Write to disk
debt_fips.to_csv('../data/debt_w_fips.csv')

## Aggregating by County and Year

At this point, we will drop the debt issues we do not have FIPS codes for because they cannot be merged with the institutional data.  Our goal in this section is to generate an output set that captures total volumes of GO and revenue debt issued by county and year.  We will also want the GO and RV debt issued by type of issuer, and the same breakout by purpose.  It is useful to review the table from the beginning of the Notebook here.

Concept|Variable|Possible Values
-------|--------|---------------
Debt Type|`Security Type`| GO<br>RV
Issuer Type|`Issuer Type Description`|District<br>City, Town Vlg<br>Local Authority<br>State Authority<br>County/Parish<br>College or Univ<br>State/Province<br>Direct<br>Indian Tribe<br>Co-op Utility
Purpose|`Bond Buyer UOP30`|Development<br>Education<br>Electric Power<br>Environmental Facilities<br>General Purpose<br>Healthcare<br>Housing<br>Public Facilities<br>Transportation<br>Utilities
Volume|`Amount of Maturity (M)`|Continuous
County|`County`|Any county in the US
State|`State`|Any state in the US
Issue Date|`Sale Date`|Continuous (we only need the year)

We can read this back in here to avoid having to execute the entire Notebook.  We also no longer need FIPS components, so we can drop those.  

*Note:  For the time being we are dropping issues classified as `S` or `T` (as opposed to `GO` and `RV`).  We don't currently know what they mean, and there are only seven issues affected in the entire data set.  We can revisit this later.*

In [430]:
#Read in data
debt_fips=pd.read_csv('../data/debt_w_fips.csv',dtype={'FIPS':str})

# print debt_fips[debt_fips['Security Type'].isin(['S','T'])]

debt_fips.columns

Index([u'Unnamed: 0', u'Bond Buyer UOP30', u'Amount of Maturity ($ mils)',
       u'Coupon Maturity', u'County', u'Coupon of Fin Maty', u'Coupon Rate',
       u'Issuer', u'Issuer Type Description', u'Net Interest Cost',
       u'True Interest Cost', u'State', u'Security Type',
       u'SDC Est. Gross Spread', u'Year', u'st_cou', u'FIPS', u'st_cou_g1',
       u'st_cou_final'],
      dtype='object')

In [431]:
#Read in data
debt_fips=pd.read_csv('../data/debt_w_fips.csv',dtype={'FIPS':str})

#Drop unnecessary variables
for var in ['Unnamed: 0','st_cou','st_cou_g1','st_cou_final','Coupon Maturity','Coupon of Fin Maty',\
            'Coupon Rate','SDC Est. Gross Spread']:
    debt_fips.pop(var)
    
#Rename variables
debt_fips.columns=['Purpose','Amount','County','Issuer','Issuer_Type','NIC','TIC','State',\
                   'Security_Type','Year','FIPS']

#Subset to exclude S and T Security Types
debt_fips=debt_fips[debt_fips['Security_Type'].isin(['RV','GO'])]

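#Convert Amount to float (stripping thousands separators)
def to_float(x):
    try:
        return float(x.replace(',',''))
    except (AttributeError,ValueError):
        #Values without a .replace method (e.g. NaN or already-numeric entries)
        #and strings that cannot be parsed are set to missing
        return np.nan
debt_fips['Amount']=debt_fips['Amount'].apply(to_float)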

#Convert NIC and TIC to float
for var in ['NIC','TIC']:
    debt_fips[var]=debt_fips[var].astype(float)

#Retroactively fix missing Purpose values (see validation effort below)
# debt_fips.ix[426160,'Purpose']='Housing'
# debt_fips.ix[328463,'Purpose']='General Purpose'
# debt_fips.ix[437185,'Purpose']='General Purpose'
    
debt_fips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 396143 entries, 0 to 396148
Data columns (total 11 columns):
Purpose          396140 non-null object
Amount           363376 non-null float64
County           396143 non-null object
Issuer           396143 non-null object
Issuer_Type      396143 non-null object
NIC              92091 non-null float64
TIC              64698 non-null float64
State            396143 non-null object
Security_Type    396143 non-null object
Year             396143 non-null int64
FIPS             396143 non-null object
dtypes: float64(3), int64(1), object(7)
memory usage: 36.3+ MB


In [432]:
debt_fips.ix[[426160,328463,437185]]

Unnamed: 0,Purpose,Amount,County,Issuer,Issuer_Type,NIC,TIC,State,Security_Type,Year,FIPS
426160,,,,,,,,,,,
328463,General Purpose,9.567,Mercer,Princeton-New Jersey,"City, Town Vlg",4.743,,NJ,GO,1996.0,34021.0
437185,,,,,,,,,,,


Our approach will be to build this up incrementally.  We will use the appropriate subsets of `debt_fips` to construct three components of the data set (by year and county):

1. Total GO and Revenue debt issue volume;
2. GO and Revenue debt by issuer type; and,
3. GO and Revenue debt by purpose.

These components will then be joined together.
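
The build-up described above can be illustrated on a toy frame before running it on the full data.  This is only a minimal sketch: `toy` and its values are invented for illustration, and only two of the three components are shown (the issuer type component follows the same pattern as the purpose component).

```python
import pandas as pd

# Toy stand-in for debt_fips; every value here is invented for illustration
toy = pd.DataFrame({
    'Year': [1990, 1990, 1990, 1991],
    'FIPS': ['01001', '01001', '01003', '01001'],
    'Security_Type': ['GO', 'RV', 'GO', 'GO'],
    'Purpose': ['Education', 'Utilities', 'Education', 'Housing'],
    'Amount': [1.5, 2.0, 0.5, 3.0],
})

# Component 1: total volume by year and county, one column per security type
toy_total = (toy.groupby(['Year', 'FIPS', 'Security_Type'])['Amount']
                .sum()
                .unstack('Security_Type')
                .fillna(0))

# Component 3: volume by security type and purpose, with the two column
# levels flattened into names such as 'GO_Education'
toy_purpose = (toy.groupby(['Year', 'FIPS', 'Security_Type', 'Purpose'])['Amount']
                  .sum()
                  .unstack(['Security_Type', 'Purpose'])
                  .fillna(0))
toy_purpose.columns = ['_'.join(col) for col in toy_purpose.columns]

# Join the components on their shared (Year, FIPS) index
toy_agg = toy_total.join(toy_purpose)
print(toy_agg)
```

Because every component is indexed by `(Year, FIPS)`, the final join needs no explicit keys.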

In [433]:
debt_fips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 396143 entries, 0 to 396148
Data columns (total 11 columns):
Purpose          396140 non-null object
Amount           363376 non-null float64
County           396143 non-null object
Issuer           396143 non-null object
Issuer_Type      396143 non-null object
NIC              92091 non-null float64
TIC              64698 non-null float64
State            396143 non-null object
Security_Type    396143 non-null object
Year             396143 non-null int64
FIPS             396143 non-null object
dtypes: float64(3), int64(1), object(7)
memory usage: 36.3+ MB


In [434]:
#Capture total debt
tot_debt=debt_fips.groupby(['Year','FIPS','Security_Type']).sum()['Amount'].unstack('Security_Type').fillna(0)

#Define function to capture aggregations by Issuer_Type and Purpose
def debt_by_concept(var):
    #Capture debt by issuer type
    tmp_debt=debt_fips.groupby(['Year','FIPS','Security_Type',var]).sum()['Amount'].sortlevel(2)
    #Unstack types
    tmp_debt=tmp_debt.unstack(['Security_Type',var])
    #Generate new column names
    new_cols=[item[0]+'_'+item[1] for item in tmp_debt.columns.values]
    #Assign new column names
    tmp_debt.columns=new_cols
    #Reorder columns, sort index, and fill in NaN values
    tmp_debt=tmp_debt[sorted(new_cols)].sortlevel(0).fillna(0)
    return tmp_debt

#Capture debt issues for Issuer Type and Purpose tabs
issuer_debt=debt_by_concept('Issuer_Type')
purpose_debt=debt_by_concept('Purpose')

#Capture average NIC and TIC by year and county
interest_cost=debt_fips.groupby(['Year','FIPS']).mean()[['NIC','TIC']]

#Join sets together
debt_agg=tot_debt.join([issuer_debt,purpose_debt,interest_cost])

debt_agg.head().T

Year,1984,1984,1984,1984,1984
FIPS,01001,01003,01007,01021,01025
Security_Type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
GO,0.0,6.35,0.0,1.0,1.378
RV,1.625,23.33,0.4,1.425,0.0
"GO_City, Town Vlg",0.0,0.0,0.0,0.0,1.378
GO_Co-op Utility,0.0,0.0,0.0,0.0,0.0
GO_County/Parish,0.0,6.35,0.0,1.0,0.0
GO_Direct Issuer,0.0,0.0,0.0,0.0,0.0
GO_District,0.0,0.0,0.0,0.0,0.0
GO_Indian Tribe,0.0,0.0,0.0,0.0,0.0
GO_Local Authority,0.0,0.0,0.0,0.0,0.0
"RV_City, Town Vlg",0.0,0.0,0.0,0.0,0.0


Perhaps we can validate these a bit by ensuring the components add up to the total debt levels for GO and RV respectively.

In [435]:
#Capture subsets
iss_vars={'GO':['GO_City, Town Vlg', 'GO_Co-op Utility', 'GO_County/Parish', 'GO_Direct Issuer',\
                'GO_District', 'GO_Indian Tribe', 'GO_Local Authority'],
          'RV':['RV_City, Town Vlg','RV_Co-op Utility', 'RV_County/Parish','RV_Direct Issuer',\
                'RV_District', 'RV_Indian Tribe','RV_Local Authority']}
pur_vars={'GO':['GO_Development', 'GO_Education', 'GO_Electric Power','GO_Environmental Facilities', 'GO_General Purpose',\
                'GO_Healthcare', 'GO_Housing', 'GO_Public Facilities','GO_Transportation', 'GO_Utilities'],
          'RV':['RV_Development','RV_Education', 'RV_Electric Power', 'RV_Environmental Facilities','RV_General Purpose',\
                'RV_Healthcare', 'RV_Housing','RV_Public Facilities', 'RV_Transportation', 'RV_Utilities']}

#Create dict to hold comparisons
component_diff={}

#For each variable group...
for i,vg in enumerate([iss_vars,pur_vars]):
    #...and for each Security Type...
    for st in ['GO','RV']:
        #...update the dict with the difference between the component and the reported sums
        component_diff.update({str(i)+'_'+st:debt_agg[st]-debt_agg[vg[st]].sum(axis=1)})

#For each comparison set...
for key in component_diff.keys():
    #...tell me the sum of the misses
    print key,'|',component_diff[key].sum()

1_RV | 205.87
1_GO | 2.0
0_GO | -1.03311248445e-11
0_RV | -9.46465128493e-14


What's going on with the Revenue Bonds by Purpose?  (They were originally off by \$205.87 M. GO bonds by purpose were off by \$2.0 M.)

In [436]:
#Capture index of problem records
idx_rv=component_diff['1_RV'][component_diff['1_RV']>1].index
idx_go=component_diff['1_GO'][component_diff['1_GO']>1].index

#Explore debt_agg at this location
print debt_agg.ix[idx_rv][['RV']+pur_vars['RV']].T
print debt_agg.ix[idx_go][['GO']+pur_vars['GO']].T

Year                            1999     2002
FIPS                           12057    17089
Security_Type                                
RV                           424.470  239.895
RV_Development                 0.000    4.000
RV_Education                  49.405    0.000
RV_Electric Power              4.175    6.930
RV_Environmental Facilities  138.725    0.000
RV_General Purpose             6.670    0.000
RV_Healthcare                112.060    0.000
RV_Housing                    10.915    0.000
RV_Public Facilities          29.035    0.000
RV_Transportation             56.560   23.470
RV_Utilities                  12.555    3.995
Year                           2002
FIPS                          31153
Security_Type                      
GO                           76.650
GO_Development                1.335
GO_Education                 29.630
GO_Electric Power             0.000
GO_Environmental Facilities   2.515
GO_General Purpose           24.245
GO_Healthcare                 0.

Disparity confirmed, and in the case of FIPS code 17089 in 2002, the disparity is enormous.  Perhaps the original data can shed some light?

In [437]:
# debt_fips[(debt_fips['Year']==2002) & (debt_fips['FIPS']=='17089')]

Ah, the original data lacked a purpose for a single, very large issue.  I'd wager a similar issue is occurring with FIPS code 12057 in 1999...

In [438]:
# debt_fips[(debt_fips['Year']==1999) & (debt_fips['FIPS']=='12057')]

...and FIPS code 31153 in 2002.

In [439]:
# debt_fips[(debt_fips['Year']==2002) & (debt_fips['FIPS']=='31153')]

Inspection of the original raw data (upstream of `debt_fips`) reveals the following info about our three problematic records:

`debt_fips` Index|Year|FIPS|Issuer|Amount|General Use of Proceeds|Imputed Purpose
-----|----|----|------|------|-----------------------|----------------
328463|1999|12057|Covington Park Comm Dev Dt|4.37|Genl Purpose/ Public Imp|`General Purpose`
426160|2002|17089|Aurora Kane-DuPage Cos-Illinois|201.5|Single Family Housing|`Housing`
437185|2002|31153|Sarpy Co Sanit & Imp Dt #215|2.0|Genl Purpose/ Public Imp|`General Purpose`

*Note: The findings of this little investigation were incorporated, retroactively, into the `debt_fips` set above.  No such issues exist any longer, which is why the prints of the problematic sections of data are commented out.*
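
The mechanism behind this kind of gap is that a pandas `groupby` silently drops rows whose grouping key is missing, so an issue with no purpose still counts toward the county total but falls out of the by-purpose breakdown.  The sketch below reproduces that on invented numbers (the frame `toy` is hypothetical and is not a slice of the real data) and shows a direct way to list the offending records.

```python
import numpy as np
import pandas as pd

# Invented example: three revenue issues in one county-year, one with no purpose
toy = pd.DataFrame({
    'Year': [2002, 2002, 2002],
    'FIPS': ['17089', '17089', '17089'],
    'Security_Type': ['RV', 'RV', 'RV'],
    'Purpose': ['Transportation', 'Utilities', np.nan],
    'Amount': [10.0, 5.0, 200.0],
})

# The total keeps all three issues...
total = toy.groupby(['Year', 'FIPS', 'Security_Type'])['Amount'].sum()

# ...but grouping on Purpose drops the row whose key is NaN
by_purpose = toy.groupby(['Year', 'FIPS', 'Security_Type', 'Purpose'])['Amount'].sum()

# The gap equals the volume of the issues with a missing purpose
gap = total.sum() - by_purpose.sum()
print('Unexplained volume:', gap)

# Listing the records directly avoids hunting through the aggregates
missing_purpose = toy[toy['Purpose'].isnull()]
print(missing_purpose)
```

Checking for missing grouping keys before aggregating would surface such records without needing to compare component sums.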

In [440]:
len(debt_agg)

59669

## Incorporating MSA Status

We have captured a county to MSA crosswalk from the [Missouri Census Data Center MABLE tool](http://mcdc.missouri.edu/websas/geocorr12.html).  We need to read this in to generate a binary indicator for MSA.
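
As a small illustration of how the indicator is derived, the sketch below builds the mapping from a hypothetical three-row crosswalk.  The column names `county` and `cbsatype` follow the ones used in this notebook, but the rows themselves are invented.  Counties that are absent from the crosswalk, or that are not classified as `Metro`, end up with a value of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical crosswalk rows; the real file is assumed to share these column names
xwalk = pd.DataFrame({
    'county': ['01001', '01003', '01005'],
    'cbsatype': ['Metro', 'Micro', np.nan],
})

# 1 for metropolitan counties, 0 for everything else
xwalk['MSA'] = np.where(xwalk['cbsatype'] == 'Metro', 1, 0)
msa_map = dict(zip(xwalk['county'], xwalk['MSA']))

# Invented debt records; '01007' does not appear in the crosswalk
debt = pd.DataFrame({'FIPS': ['01001', '01003', '01007']})

# Unmatched counties map to NaN, which we treat as non-MSA
debt['MSA'] = debt['FIPS'].map(msa_map).fillna(0).astype(int)
print(debt)
```

Note that the lookup only succeeds when the crosswalk keys and the `FIPS` values share a type, so both sides should be strings with their leading zeros intact.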

In [441]:
#Read in data
mable=pd.read_csv('../data/mable_msa.csv')[1:]

#Generate binary MSA variable
mable['MSA']=np.where(mable['cbsatype']=='Metro',1,0)

#Capture county to MSA mapping
msa_map=dict(zip(mable['county'],mable['MSA']))

#Reset debt_agg index
debt_agg=debt_agg.reset_index()

#Map in MSA variable
debt_agg['MSA']=(debt_agg['FIPS'].map(msa_map).fillna(0)).astype(int)

#Set index
debt_agg.set_index(['Year','FIPS'],inplace=True)

#Drop old index variables (if they exist)
for var in ['level_0','index']:
    try:
        debt_agg.pop(var)
    except KeyError:
        pass

debt_agg.head().T

Year,1984,1984,1984,1984,1984
FIPS,01001,01003,01007,01021,01025
Security_Type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
GO,0.0,6.35,0.0,1.0,1.378
RV,1.625,23.33,0.4,1.425,0.0
"GO_City, Town Vlg",0.0,0.0,0.0,0.0,1.378
GO_Co-op Utility,0.0,0.0,0.0,0.0,0.0
GO_County/Parish,0.0,6.35,0.0,1.0,0.0
GO_Direct Issuer,0.0,0.0,0.0,0.0,0.0
GO_District,0.0,0.0,0.0,0.0,0.0
GO_Indian Tribe,0.0,0.0,0.0,0.0,0.0
GO_Local Authority,0.0,0.0,0.0,0.0,0.0
"RV_City, Town Vlg",0.0,0.0,0.0,0.0,0.0


## Incorporating Institutional Data

At this point, we are well prepared to join TEL data compiled by Dan and COSTAT/PUMS control variables of interest to our debt data.  These data were provided as SAS files.  The TEL data and COSTAT/PUMS covariate data were joined and converted to CSV in [sas2csv.ipynb](https://github.com/choct155/TELs_debt/blob/master/code/sas2csv.ipynb).  These data share merge keys (year and FIPS codes) with our debt data, which should facilitate their integration.

In [442]:
!ls ../data/

13slsstab1a.xls			debt_ts_pre_fips.csv  geocorr12.csv
2013_GFS_debt.xcf		debt_w_fips.csv       mable_msa.csv
bonds.csv			debt_w_int.csv	      regions.csv
costat_mod_vars1940_2010.csv	Descriptives.csv      SA1_1929_2014.csv
cty_coverage.csv		fips_st_co_02_07.csv  state_coverage.csv
current_issue_geocode_list.csv	FRB_TREAS_30YR.csv    state_fips.xlsx
debt_mod_std.csv		g_api_college.csv     state_local_deflators.xls
debt_out.csv			g_api_rando.csv       tel_data.csv


In [443]:
#Read in TEL data
tel=pd.read_csv('../../debt_data/tel_data_alt.csv',dtype={'STCOU':str})

#Rename year and FIPS columns
tel=tel.rename(columns={'YEAR':'Year',
                        'STCOU':'FIPS'})
# tel.columns=['Year','FIPS']+list(tel.columns[2:])

#Set index
tel.set_index(['Year','FIPS'],inplace=True)

#Join TEL data to debt data
debt_out=debt_agg.join(tel)

#Write to disk
debt_out.to_csv('../data/debt_out.csv')

print debt_out.ix[1990,'04001'].T.to_string()

Security_Type
GO                             2.261000e+01
RV                             0.000000e+00
GO_City, Town Vlg              0.000000e+00
GO_Co-op Utility               0.000000e+00
GO_County/Parish               8.325000e+00
GO_Direct Issuer               0.000000e+00
GO_District                    1.428500e+01
GO_Indian Tribe                0.000000e+00
GO_Local Authority             0.000000e+00
RV_City, Town Vlg              0.000000e+00
RV_Co-op Utility               0.000000e+00
RV_County/Parish               0.000000e+00
RV_Direct Issuer               0.000000e+00
RV_District                    0.000000e+00
RV_Indian Tribe                0.000000e+00
RV_Local Authority             0.000000e+00
GO_Development                 0.000000e+00
GO_Education                   1.428500e+01
GO_Electric Power              0.000000e+00
GO_Environmental Facilities    0.000000e+00
GO_General Purpose             8.325000e+00
GO_Healthcare                  0.000000e+00
GO_Housing        

In [444]:
len(debt_agg)

59669

In [445]:
len(debt_out[debt_out['MSA']==1])

28035

In [446]:
debt_msa=debt_out[debt_out['MSA']==1].reset_index()
all_cty_year=set(zip(debt_msa['Year'],
                     debt_msa['FIPS']))
nic_cty_year=set(zip(debt_msa[debt_msa['NIC'].notnull()]['Year'],
                     debt_msa[debt_msa['NIC'].notnull()]['FIPS']))
tic_cty_year=set(zip(debt_msa[debt_msa['TIC'].notnull()]['Year'],
                     debt_msa[debt_msa['TIC'].notnull()]['FIPS']))
all_cty=set(debt_msa['FIPS'])
nic_cty=set(debt_msa[debt_msa['NIC'].notnull()]['FIPS'])
tic_cty=set(debt_msa[debt_msa['TIC'].notnull()]['FIPS'])

len(all_cty_year),len(nic_cty_year),len(tic_cty_year),len(all_cty),len(nic_cty),len(tic_cty)

(28035, 15277, 13647, 1129, 1054, 996)

In [447]:
print '*** TOTAL COUNTS OF YEAR-COUNTIES ***'
print 'Entire Set:',len(all_cty_year)
print 'Subset with NIC values:',len(nic_cty_year)
print 'Subset with TIC values:',len(tic_cty_year)
print '\n*** TOTAL COUNTS OF ALL COUNTIES ***'
print 'Entire Set:',len(all_cty)
print 'Subset with NIC values:',len(nic_cty)
print 'Subset with TIC values:',len(tic_cty)
print '\n*** OVERLAP ACROSS YEAR-COUNTIES ***'
print 'Overlapping year-counties between NIC and TIC:',len(nic_cty_year & tic_cty_year)
print 'Total year-counties with NIC or TIC:',len(nic_cty_year | tic_cty_year)
print 'Proportion of year-counties with NIC or TIC:',len(nic_cty_year | tic_cty_year)/float(len(all_cty_year))
print '\n*** OVERLAP ACROSS COUNTIES ***'
print 'Overlapping counties between NIC and TIC:',len(nic_cty & tic_cty)
print 'Total counties with NIC or TIC:',len(nic_cty | tic_cty)
print 'Proportion of counties with NIC or TIC:',len(nic_cty | tic_cty)/float(len(all_cty))

*** TOTAL COUNTS OF YEAR-COUNTIES ***
Entire Set: 28035
Subset with NIC values: 15277
Subset with TIC values: 13647

*** TOTAL COUNTS OF ALL COUNTIES ***
Entire Set: 1129
Subset with NIC values: 1054
Subset with TIC values: 996

*** OVERLAP ACROSS YEAR-COUNTIES ***
Overlapping year-counties between NIC and TIC: 8606
Total year-counties with NIC or TIC: 20318
Proportion of year-counties with NIC or TIC: 0.724736935973

*** OVERLAP ACROSS COUNTIES ***
Overlapping counties between NIC and TIC: 987
Total counties with NIC or TIC: 1063
Proportion of counties with NIC or TIC: 0.941541186891


In [448]:
len(debt_out[pd.isnull(debt_out).any(axis=1)])

49009

In [449]:
print debt_out[['NIC','TIC']].describe()
print debt_out.ix[2014:2015][['NIC','TIC']].describe()

                NIC           TIC
count  28246.000000  21374.000000
mean       5.276884      4.815942
std        2.094137      4.546225
min        0.014000      0.066000
25%        4.110000      3.467667
50%        5.139000      4.471875
75%        6.550000      5.510950
max       99.100000     99.890000
               NIC          TIC
count  1072.000000  1649.000000
mean      2.564348     2.653012
std       3.104809     2.710323
min       0.108000     0.091000
25%       1.867000     2.168000
50%       2.500000     2.581000
75%       3.084750     3.000000
max      98.761000    98.000000


In [450]:
debt_fips[debt_fips['Year'].isin([2014,2015])][['Year','NIC','TIC']].describe()

Unnamed: 0,Year,NIC,TIC
count,23670.0,2977.0,5353.0
mean,2014.488086,2.572695,2.556071
std,0.499869,2.671863,2.022181
min,2014.0,0.071,0.05
25%,2014.0,1.668,1.962
50%,2014.0,2.572,2.584
75%,2015.0,3.374,3.084
max,2015.0,98.761,98.057


In [451]:
debt_fips

Unnamed: 0,Purpose,Amount,County,Issuer,Issuer_Type,NIC,TIC,State,Security_Type,Year,FIPS
0,Utilities,,Callaway,Callaway Co Pub Wtr Supp Dt #2,District,,,MO,RV,1988,29027
1,Utilities,,Cass,Cleveland-Missouri,"City, Town Vlg",,,MO,RV,1988,29037
2,General Purpose,,Gunnison,Skyland Metropolitan Dt,District,,,CO,GO,1988,08051
3,Education,,Clermont/Warren,Clermont Co (Goshen) LSD,District,,,OH,GO,1988,39155
4,Transportation,,Bartholomew,Flat Rock-Hawcreek School Corp,District,,,IN,GO,1988,18005
5,Education,,Lake,Crown Point Comm School Corp,District,,,IN,GO,1988,18089
6,Healthcare,,Carver,Chaska City-Minnesota,"City, Town Vlg",,,MN,RV,1988,27019
7,General Purpose,,Platte,Columbus City-Nebraska,"City, Town Vlg",,,NE,GO,1988,31141
8,General Purpose,,Grundy,Grundy Co-Illinois,County/Parish,,,IL,GO,1988,17063
9,General Purpose,,St. Croix,Hudson City-Wisconsin,"City, Town Vlg",6.620,,WI,GO,1988,55109


## Imputation of NIC and TIC Values

The true interest cost (TIC) for an issuer reflects the time value of money.  Specifically, it is the interest rate, compounded semi-annually, that equates the outlay for the bond in year 0 with the present value of the stream of payments between year 0 and the year of maturity.  The net interest cost (NIC) is the average annual interest rate that an issuer will pay, without any discounting.

These are different concepts, but the latter can be a proxy for the former.  We just need to adjust the TIC so that it reflects an annual rate.  Once this is done, we can create a new interest cost variable that takes the value of the annualized TIC (multiplied by 100 to return it to percent) when available.  If that value is missing, it takes the NIC value as reported.

This will still leave us with over a quarter of year-counties missing in the MSA subset, and over 35% in the full data set.  We can impute the missing data via interpolation and padding, just like we did with the COSTAT data.  
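
To make the conversion concrete: a rate $i$ compounded $n$ times per year has an effective annual rate of $(1+i/n)^n-1$, so a TIC of 6% compounded semi-annually becomes $1.03^2-1=6.09\%$.  The sketch below walks through the three steps (annualize, fall back to NIC, interpolate within county) on an invented four-year series for a single county.  The frame `toy` is hypothetical, and the `groupby(...).transform(...)` idiom is just one possible way to interpolate county by county.

```python
import numpy as np
import pandas as pd

def ann_rate(i, n):
    """Effective annual rate for a nominal rate i compounded n times per year."""
    return (1 + i / n) ** n - 1

# Invented series for one county: TIC observed once, NIC observed twice
toy = pd.DataFrame({
    'FIPS': ['01001'] * 4,
    'Year': [1989, 1990, 1991, 1992],
    'TIC': [6.0, np.nan, np.nan, np.nan],
    'NIC': [6.2, np.nan, np.nan, 5.0],
})

# Step 1: annualize the semi-annually compounded TIC (percent -> decimal first)
toy['TIC_ANN'] = ann_rate(toy['TIC'] / 100.0, 2)

# Step 2: prefer the annualized TIC (back in percent), else use NIC as reported
toy['CTY_INTEREST'] = (toy['TIC_ANN'] * 100).fillna(toy['NIC'])

# Step 3: fill the remaining gaps by linear interpolation within each county
toy['CTY_INTEREST'] = (toy.groupby('FIPS')['CTY_INTEREST']
                          .transform(lambda s: s.interpolate()))
print(toy)
```

One caveat worth keeping in mind is that `interpolate()` with its default linear method treats consecutive rows as equally spaced, so gaps between non-adjacent years are not weighted by the number of years they span.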

In [452]:
#Define function to convert semi-annual compounded rate to annual effective rate
def ann_rate(i,n):
    return (1+(i/n))**n-1

#Calculate annualized TIC
debt_out['TIC_ANN']=debt_out['TIC'].apply(lambda x: ann_rate((x/100.),2))

#Define consolidated interest cost
debt_out['CTY_INTEREST']=np.where(debt_out['TIC_ANN'].notnull(),
                                  debt_out['TIC_ANN']*100,
                                  debt_out['NIC'])

#Reorder index
debt_out=debt_out.reorder_levels(['FIPS','Year'])

#Sort index
debt_out.sortlevel(0,inplace=True)

#Create container to hold subsets with interpolated interest
debt_int_list=[]

#For each county...
for cty in sorted(set(debt_out.index.get_level_values(level='FIPS'))):
    #...capture the subset...
    debt_sub=debt_out.ix[cty].reset_index()
    #...redefine the county...
    debt_sub['FIPS']=cty
    #...use interpolation to capture missing values for CTY_INTEREST
    debt_sub['CTY_INTEREST']=debt_sub['CTY_INTEREST'].interpolate()
    #...and throw the subset in the list
    debt_int_list.append(debt_sub)
    
#Concatenate together
debt_w_int=pd.concat(debt_int_list)

#Set and sort index
debt_w_int.set_index(['FIPS','Year'],inplace=True)
debt_w_int.sortlevel(0,inplace=True)


debt_w_int[['TIC','NIC','TIC_ANN','CTY_INTEREST']]

Unnamed: 0_level_0,Security_Type,TIC,NIC,TIC_ANN,CTY_INTEREST
FIPS,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
01000,1994,,,,
01000,2000,,,,
01000,2004,,,,
01000,2005,,,,
01000,2015,,,,
01001,1984,,,,
01001,1988,,,,
01001,1989,6.777,7.1125,0.068918,6.891819
01001,1990,,6.8385,,6.838500
01001,1992,,4.4830,,4.483000


In [453]:
#Write to disk
debt_w_int_out_vars=[var for var in debt_w_int.columns if var not in ['TIC','NIC','TIC_ANN']]
debt_w_int[debt_w_int_out_vars].to_csv('../data/debt_w_int.csv')

In [454]:
debt_w_int

Unnamed: 0_level_0,Security_Type,GO,RV,"GO_City, Town Vlg",GO_Co-op Utility,GO_County/Parish,GO_Direct Issuer,GO_District,GO_Indian Tribe,GO_Local Authority,"RV_City, Town Vlg",...,GEXP_L,GP_RATE,GP_LEVY,GP_REVU,GP_GEXP,GP_LMT,SC_LMT,TREND,TIC_ANN,CTY_INTEREST
FIPS,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
01000,1994,0.000,5.585,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,25,,
01000,2000,0.000,12.205,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,31,,
01000,2004,0.000,0.000,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,35,,
01000,2005,0.000,4.610,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,36,,
01000,2015,0.000,9.855,0.000,0,0.000,0,0.000,0,0,0.000,...,,,,,,,,,,
01001,1984,0.000,1.625,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,15,,
01001,1988,0.000,0.000,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,19,,
01001,1989,0.000,0.000,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,20,0.068918,6.891819
01001,1990,0.000,2.200,0.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,21,,6.838500
01001,1992,3.000,4.450,3.000,0,0.000,0,0.000,0,0,0.000,...,0,0,0,0,0,0,0,23,,4.483000


In [457]:
print sorted(debt_w_int.columns)

['ASMT_L', 'ASMT_L2', 'ASMT_L3', 'BOTH', 'CB_E', 'CB_E2', 'CB_E3', 'CB_E4', 'CB_G', 'CB_G2', 'CFDISC_L', 'CGEXP_L', 'CH_HS_UNT', 'CLEVY_L', 'CLEVY_L2', 'CLEVY_L3', 'CLEVY_L4', 'CRATE_L', 'CRATE_L2', 'CREVU_L', 'CTY_INTEREST', 'DENSITY', 'DIVERSITY', 'D_GEN_EXP', 'EDUC_SERV_EMP_PNFARM', 'EMP_RES', 'FFDISC_L', 'FIPSST', 'FIPST_N', 'FOOD_SERV_EMP_PNFARM', 'GEN_REV', 'GEXP_L', 'GO', 'GO_City, Town Vlg', 'GO_Co-op Utility', 'GO_County/Parish', 'GO_Development', 'GO_Direct Issuer', 'GO_District', 'GO_Education', 'GO_Electric Power', 'GO_Environmental Facilities', 'GO_General Purpose', 'GO_Healthcare', 'GO_Housing', 'GO_Indian Tribe', 'GO_Local Authority', 'GO_Public Facilities', 'GO_Transportation', 'GO_Utilities', 'GP_GEXP', 'GP_LEVY', 'GP_LMT', 'GP_RATE', 'GP_REVU', 'HOME_STEAD', 'HOME_STEAD2', 'HOME_STEAD3', 'HSG_UNITS', 'HSG_UNITS_ACS', 'HSLD_PERS', 'IGR_ST', 'LANDAREA', 'LEVY_L', 'LIMITS', 'MANU_EMP_PNFARM', 'MANU_RES', 'MDHOMEVAL', 'MED_INC', 'MFDISC_L', 'MFG_EMP', 'MGEXP_L', 'MGEXP_L2

In [455]:
debt_tic_nic=debt_out[(debt_out['NIC'].notnull()) & (debt_out['TIC'].notnull())]

all_recs=[]
nic_recs=[]
tic_recs=[]
nic_tic_recs=[]
yr_list=[]

for yr in range(1984,2016):
    all_recs.append(len(debt_out.ix[yr]))
    nic_recs.append(len(debt_out[(debt_out['NIC'].notnull())].ix[yr]))
    tic_recs.append(len(debt_out[(debt_out['TIC'].notnull())].ix[yr]))
    nic_tic_recs.append(len(debt_out[(debt_out['NIC'].notnull()) & (debt_out['TIC'].notnull())].ix[yr]))
    yr_list.append(yr)

nic_tic_cnts=DataFrame({'Total':all_recs,
                        'NIC':nic_recs,
                        'TIC':tic_recs,
                        'NIC&TIC':nic_tic_recs},
                        index=yr_list)
nic_tic_cnts['NIC_prop']=nic_tic_cnts['NIC']/nic_tic_cnts['Total']
nic_tic_cnts['TIC_prop']=nic_tic_cnts['TIC']/nic_tic_cnts['Total']

print nic_tic_cnts.sum()

print nic_tic_cnts
    

debt_tic_nic[['NIC','TIC','TIC_ANN','CTY_INTEREST']]

NIC         5664
NIC&TIC     5664
TIC         5664
Total       5664
NIC_prop      32
TIC_prop      32
dtype: float64
      NIC  NIC&TIC  TIC  Total  NIC_prop  TIC_prop
1984  177      177  177    177         1         1
1985  177      177  177    177         1         1
1986  177      177  177    177         1         1
1987  177      177  177    177         1         1
1988  177      177  177    177         1         1
1989  177      177  177    177         1         1
1990  177      177  177    177         1         1
1991  177      177  177    177         1         1
1992  177      177  177    177         1         1
1993  177      177  177    177         1         1
1994  177      177  177    177         1         1
1995  177      177  177    177         1         1
1996  177      177  177    177         1         1
1997  177      177  177    177         1         1
1998  177      177  177    177         1         1
1999  177      177  177    177         1         1
2000  177      1

Unnamed: 0_level_0,Security_Type,NIC,TIC,TIC_ANN,CTY_INTEREST
FIPS,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
01001,1989,7.112500,6.777000,0.068918,6.891819
01001,1994,6.098000,6.174000,0.062693,6.269296
01001,2005,4.380000,4.400000,0.044484,4.448400
01001,2008,5.516000,5.555000,0.056321,5.632145
01001,2013,3.330000,3.330000,0.033577,3.357722
01003,1988,7.760000,7.857000,0.080113,8.011331
01003,1994,5.998000,6.019333,0.061099,6.109914
01003,1996,5.479750,5.649000,0.057288,5.728778
01003,2009,4.310000,4.370000,0.044177,4.417742
01003,2010,3.550000,3.510000,0.035408,3.540800


It should be noted that the `tel` data are really the integration of two input sets, so `debt_out` is a three-part join.  This has implications for our data universe.  Each set covers a different number of counties, and these sets do not necessarily overlap completely.  In the **Data Checks** section of *sas2csv*, we captured the disparity in state coverage in a data set, and wrote it to disk as `cty_coverage.csv` (which we are ultimately concerned with).  Note that in the case of the join in *sas2csv*, we were joining a state level data set (TEL data) to county level data (COSTAT/PUMS).  Therefore, for each state covered in the TEL data, all counties are included by construction.

We can now take the opportunity to check the year-county overlap between the `tel` and `debt_agg` sets.
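
The idea of the check is simple: take the union of the `(Year, FIPS)` keys from both sets and flag, for each key, which sets contain it.  Below is a minimal sketch on invented keys (the names `tel_keys` and `debt_keys` are hypothetical stand-ins for the two indices).  It uses plain Python sets for the membership tests, which is one possible way to keep the lookups fast on large indices.

```python
import pandas as pd

# Invented (Year, FIPS) keys standing in for the two indices
tel_keys = {(1990, '01001'), (1990, '01003'), (1990, '01007')}
debt_keys = {(1990, '01001'), (1990, '01003'), (1991, '01001')}

# Every year-county that appears in at least one set
union_keys = sorted(tel_keys | debt_keys)

# One row per year-county, one boolean column per source set
coverage = pd.DataFrame(
    {'tel': [key in tel_keys for key in union_keys],
     'debt_agg': [key in debt_keys for key in union_keys]},
    index=pd.MultiIndex.from_tuples(union_keys, names=['Year', 'FIPS']))

# Complete cases are the year-counties present in both sets
n_complete = int(coverage.all(axis=1).sum())
print(coverage)
print('Number of complete cases:', n_complete)
```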

In [456]:
#Capture union of indices
u_idx=list(set(tel.index.values).union(set(debt_agg.index.values)))

#Generate county coverage dict
cty_cov=DataFrame({'tel':[idx in tel.index for idx in u_idx],
                   'debt_agg':[idx in debt_agg.index for idx in u_idx]},
                    index=pd.MultiIndex.from_tuples(u_idx,names=['Year','FIPS'])).sortlevel(0)

print 'Number of year-counties represented in the TEL set:',cty_cov['tel'].sum()
print 'Number of year-counties represented in the aggregate debt set:',cty_cov['debt_agg'].sum()
print 'Number of complete cases:',len(cty_cov[(cty_cov==True).all(axis=1)])

#Write to disk
cty_cov.to_csv('../data/cty_coverage.csv')

cty_cov.ix[1990:]

Number of year-counties represented in the TEL set: 127405
Number of year-counties represented in the aggregate debt set: 59669
Number of complete cases: 50212


Unnamed: 0_level_0,Unnamed: 1_level_0,debt_agg,tel
Year,FIPS,Unnamed: 2_level_1,Unnamed: 3_level_1
1990,00000,False,True
1990,01000,False,True
1990,01001,True,True
1990,01003,True,True
1990,01005,True,True
1990,01007,False,True
1990,01009,False,True
1990,01011,True,True
1990,01013,False,True
1990,01015,True,True
