# [Can You Dig It??](https://www.youtube.com/watch?v=V-OYKd8SVrI)

This is just a quick notebook to demo opening fixed-width data.  The data we are using are six months' worth of debt issues from Reuters.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

%pylab inline

Populating the interactive namespace from numpy and matplotlib




So that these proprietary data do not end up being public, they are housed in my parent directory.  Consequently, if you are trying this at home, be sure to change the path.  Note that we will be using the [`read_fwf()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html) function from the [pandas](http://pandas.pydata.org/) library.

In [2]:
!ls ..

6months_text_as_columns.txt  fmatrix	      spatial_analysis_lit
algorithms		     google_api_keys  TELs_debt
CensusPovThresh.ipynb	     mortgage	      work_scratch
Conference Notes	     NTA


Let's inspect the relevant options...

In [3]:
help(pd.read_fwf)

Help on function read_fwf in module pandas.io.parsers:

read_fwf(filepath_or_buffer, colspecs='infer', widths=None, **kwds)
    Read a table of fixed-width formatted lines into DataFrame
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Parameters
    ----------
    filepath_or_buffer : string or file handle / StringIO
        The string could be a URL. Valid URL schemes include
        http, ftp, s3, and file. For file URLs, a
        host is expected. For instance, a local file could be
        file ://localhost/path/to/table.csv
    colspecs : list of pairs (int, int) or 'infer'. optional
        A list of pairs (tuples) giving the extents of the fixed-width
        fields of each line as half-open intervals (i.e.,  [from, to[ ).
        String value 'infer' can be used to instruct the parser to try
        detecting the column specifications from the first 100 rows of
        the data (default='infer').
    widths : list of ints. optional

Looks like `colspecs='infer'` is already the default, but inspecting the raw file in a text editor reveals that a foolish layout has been used for the headers.  For some reason they start on line 4 and they are wrapped; that is, a variable name can span multiple lines within its column.  The consequence is that the headers look like data to the parser, which means we would have to explicitly write all this crap out.

Or ... because I am super lazy and explicit writing is tedious, we can come up with a programmatic solution.  We know that we have a fixed width file, so what we really need is to understand where each field starts.  If we can get the starting location of these lines, we can insert commas and get them into lists.  Once this has occurred, we can construct the variable names by position, and capture them all in one list (which will then serve as header info in the read in statement).

For some ungodly reason, the good folks at Reuters have used multiple spaces to separate variables instead of tabs (so no keying on them will work).  Moreover, while most variables have one word per line, there are several with multiple words on a given line.  The only saving grace is that only one space appears between words that end up on the same line.  Consequently, we can define the starting position of a field to be two positions in front of the first character in that field.  Since the first line of the variable name always holds at least one word, we will use that line (which is the 4th in the file) to establish field position.
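
The rule above (a field begins at a non-space character that is preceded by at least two spaces) can be sketched with a regular expression.  This is only an illustration on a made-up line: the `line` string is an assumption, built so that its offsets mirror the first three names in the real header, and the regex is an alternative to the character-by-character tests used in the rest of this notebook.

```python
import re

# Toy header line (an assumption): "Sale", "First" and "Sink" placed at offsets 8, 25 and 38.
line = " " * 8 + "Sale" + " " * 13 + "First" + " " * 8 + "Sink"

# A field begins at any non-space character preceded by two whitespace characters.
# Words separated by a single space (same field) do not match the lookbehind.
starts = [m.start() for m in re.finditer(r"(?<=\s\s)\S", line)]
print(starts)  # [8, 25, 38]
```

Note that a name starting in the very first or second column would be missed by this pattern, since nothing (or only one space) precedes it; that is not a problem here because the first name starts at position 8.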

Observe the sequence of tests applied to the characters in the first header line (which we capture as a string).  The elements are as follows:

1. Character position
2. Character
3. Test to see if the character is a space
4. Test to see if the two preceding characters are both spaces

In [4]:
#Create container for header lines
header=[]
#Capture the 4th-8th lines
with open('../6months_text_as_columns.txt','r') as f:
    for i in range(4):
        tmp_line=f.readline()
    #Capture line 4
    header.append(tmp_line)
    #Capture 5-8
    for i in range(4):
        header.append(f.readline())
    
#For each character in the first header line...
for i,c in enumerate(header[0]):
    #...give me the position, the character, the first test (see #3 above), and the second test (see #4)
    #(for i<2 the negative indices wrap around to the end of the string)
    print('%d | %s | %s | %s' % (i,c,c.isspace(),header[0][i-2].isspace() and header[0][i-1].isspace()))

0 |   | True | True
1 |   | True | True
2 |   | True | True
3 |   | True | True
4 |   | True | True
5 |   | True | True
6 |   | True | True
7 |   | True | True
8 | S | False | True
9 | a | False | False
10 | l | False | False
11 | e | False | False
12 |   | True | False
13 |   | True | False
14 |   | True | True
15 |   | True | True
16 |   | True | True
17 |   | True | True
18 |   | True | True
19 |   | True | True
20 |   | True | True
21 |   | True | True
22 |   | True | True
23 |   | True | True
24 |   | True | True
25 | F | False | True
26 | i | False | False
27 | r | False | False
28 | s | False | False
29 | t | False | False
30 |   | True | False
31 |   | True | False
32 |   | True | True
33 |   | True | True
34 |   | True | True
35 |   | True | True
36 |   | True | True
37 |   | True | True
38 | S | False | True
39 | i | False | False
40 | n | False | False
41 | k | False | False
42 |   | True | False
43 |   | True | False
44 |   | True | True
45 |   | True | True
46 |   | True |

What we have done here is identify the first character of each field.  The first character is the last `True` before a string of `False` in our last test.  If we encounter such a transition, we will capture not the position of that character but one a few characters earlier (in the code below this is `i-4`, which sits three positions before the first character, because the test fires when `i-1` is the first character).  We leave this margin because, again for some strange reason, some of the variable names start with a space (even if the second line does not).
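
As a minimal sketch of this idea on a toy line, the two helpers below find the cut position for each field and then slice the line at those positions.  Everything here is hypothetical (`delimiter_positions`, `split_fields`, the `margin` parameter, and the toy string are not part of the notebook), and slicing is swapped in for the comma-insertion approach used in the next cell; it assumes every field is preceded by at least `margin` spaces.

```python
def delimiter_positions(line, margin=3):
    """Return, for each field, the position `margin` characters before its first character."""
    cuts = []
    for i in range(2, len(line)):
        two_spaces_before = line[i - 2].isspace() and line[i - 1].isspace()
        # A non-space character right after two spaces starts a new field.
        if two_spaces_before and not line[i].isspace():
            cuts.append(i - margin)
    return cuts


def split_fields(line, cuts):
    """Slice `line` at each cut position and strip the padding from every piece."""
    bounds = [0] + list(cuts) + [None]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Toy header line (an assumption) with names at offsets 8, 25 and 38.
header_line = " " * 8 + "Sale" + " " * 13 + "First" + " " * 8 + "Sink"
cuts = delimiter_positions(header_line)
print(cuts)                             # [5, 22, 35]
print(split_fields(header_line, cuts))  # ['', 'Sale', 'First', 'Sink']
```

Slicing leaves the original characters untouched, so a comma inside a data value would not create a spurious field; lines shorter than the header simply yield empty strings for the missing fields.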

In [5]:
#Create container to hold field positions
field_pos=[]

#For each character in the first line...
for i,c in enumerate(header[0]):
    #...define the test (plain `and`/`not`; `&` and `~` are bitwise operators)...
    old_spaces=header[0][i-3].isspace() and header[0][i-2].isspace()
    new_letters=not (header[0][i-2].isspace() and header[0][i-1].isspace())
    new_field=old_spaces and new_letters
    #...if a new field has begun (its first character is at i-1)...
    if new_field:
        #...capture the position three characters before it
        field_pos.append(i-4)

def delim(line,field_pos=field_pos):
    '''Function takes field positions and turns an ugly string into a nicely delimited list'''
    #Capture padding needed to equal length of first header line
    pad=len(header[0])-len(line)
    #Capture string as a list
    line_list=list(line)+[' ']*pad
    #For each new field...
    for pos in field_pos:
        #...convert the start position from space to comma
        try:
            line_list[pos]=','
        except IndexError:
            print(''.join(line_list))
            break
    #Convert list back to string
    line=''.join(line_list)
    #Split on the inserted commas and strip space (assumes the data contain no commas of their own)
    line=[s.strip() for s in line.split(',')]
    return line

#Generate container to hold all processed header lines
pheader=[]

#For each header line...
for hl in header:
    #...process that line
    pheader.append(delim(hl))

Now that we have our nice comma-delimited lists, let's throw them together for our final variables.
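
To make the joining step concrete, here is a small sketch on a made-up three-line header.  The `toy_pheader` lists are assumptions for illustration, and `zip(*rows)` is used as a compact stand-in for the nested loops in the cell that follows.

```python
# Hypothetical processed header lines: one list per physical line, one entry per column.
toy_pheader = [
    ["", "Sale", "First", "Sink"],
    ["", "Date", "Sinking Fund", "Date"],
    ["", "", "Date", ""],
]

# zip(*rows) walks the columns; joining and stripping rebuilds each full name.
toy_names = [" ".join(parts).strip() for parts in zip(*toy_pheader)]
print(toy_names)  # ['', 'Sale Date', 'First Sinking Fund Date', 'Sink Date']
```

As with the loop version, words that were hyphenated or split across header lines are joined with a space, so they are not glued back together.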

In [6]:
#Generate container to hold variables
varlist=[]

#For each variable...
for i in range(len(pheader[0])):
    #...create a temporary container to hold the variable components from each line...
    var_tmp=[]
    #...and for each line...
    for j in range(len(pheader)):
        #...put the variable components in var_tmp...
        var_tmp.append(pheader[j][i])
    #...convert to string and throw the variable in varlist
    varlist.append(' '.join(var_tmp).strip())

print(len(varlist))
varlist

260


['',
 'Sale Date',
 'First Sinking Fund Date',
 'Sink Date',
 'Pre-Ref Date',
 'Maturity Date',
 'Maturity',
 'Maturity Date',
 'Letter of Credit Expiration Date (Maty)',
 'Issue Dated Date',
 'Initial Put Date',
 'Maty of Highest Cpn Maty',
 'First Interest Payment Date',
 'First Call Date',
 'Final Maturity',
 'Delivery Date',
 'Maturity',
 'Dated Date',
 'Dated Date',
 'Date Issue Added',
 'Conversion Date (Maty)',
 'Callable at Par',
 'Call Date',
 'Beginning Serial Maturity',
 '501c3',
 '8-Digit CUSIP',
 '8-Digit CUSIP',
 'Cusip',
 'Managers',
 'Bond Buyer ALL UOP',
 'Bond Buyer UOP',
 'All Use of Proceeds (Code)',
 'All Use of Proceeds (Desc)',
 'All Use of Proceeds (Number)',
 'Maturity Amount',
 'Amount of Final Maturity ($mils)',
 '$ Amount of Highest Cpn Maturity',
 'Principal Amount',
 'Amount of Maturity ($ mils)',
 'Ant- ici- pa- tion Type',
 'Use of Proceeds Amount ($ mils)',
 'Asset Backed Indicator Flag (Y/N)',
 'Auction Rate',
 'Aver- age Life',
 'Bank Qual',
 'Bk Elig

In [7]:
#Capture start and stop positions in DF
fp_df=DataFrame({'start':Series(field_pos).shift()+1,
                 'stop':field_pos})

#Make sure we start at position 0
fp_df.loc[0,'start']=0

#Add last field position pair
last_pair=DataFrame({'start':5154,
                     'stop':5170},index=[259])
fp_df=pd.concat([fp_df,last_pair])

#Match up field positions and labels
fp_df['var']=varlist

#Assign arbitrary label to first field 
fp_df.loc[0,'var']='Number'

#Convert field position parameters to int
for var in ['start','stop']:
    fp_df[var]=fp_df[var].astype(int)

print('%d %d' % (len(field_pos),len(varlist)))
fp_df

259 260


Unnamed: 0,start,stop,var
0,0,5,Number
1,6,22,Sale Date
2,23,35,First Sinking Fund Date
3,36,45,Sink Date
4,46,58,Pre-Ref Date
5,59,78,Maturity Date
6,79,91,Maturity
7,92,104,Maturity Date
8,105,120,Letter of Credit Expiration Date (Maty)
9,121,131,Issue Dated Date


So, it looks like we have repeats in our variable list...

In [8]:
dups=fp_df['var'].value_counts()[fp_df['var'].value_counts()>1]

dups

Fitch                       4
Trustee                     3
Financial Advisor           3
Coupon Type                 3
Tender Agent                3
Bond Counsel                3
Issuer's Counsel            3
Credit Enhancer             3
Paying Agent                3
Bond Buyer UOP              3
S&P Rating                  2
Nation                      2
Credit Enhance ment Type    2
Lead Manager                2
Maturity                    2
Dated Date                  2
Maturity Date               2
Moody Rating                2
8-Digit CUSIP               2
Managers                    2
Bond Buyer Region           2
Project                     2
Remarketing Agent           2
SDC Region                  2
dtype: int64

We can deal with this by appending the original position of the variable to the variable name, thereby making each instance unique.
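
A minimal sketch of that idea on a toy list, using `collections.Counter` to spot the repeats.  The helper name `make_unique` and the toy names are assumptions for illustration; the notebook's own function follows in the next cell.

```python
from collections import Counter


def make_unique(names):
    """Append each duplicated name's list position to it; leave unique names untouched."""
    counts = Counter(names)
    return [name + str(i) if counts[name] > 1 else name
            for i, name in enumerate(names)]


toy = ["Number", "Sale Date", "Maturity Date", "Maturity", "Maturity Date"]
print(make_unique(toy))
# ['Number', 'Sale Date', 'Maturity Date2', 'Maturity', 'Maturity Date4']
```

Counting once up front avoids rescanning the whole list for every name, which matters little at 260 columns but keeps the logic easy to read.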

In [9]:
def pos_append(varlist):
    '''Function appends position of variable to variable name to uniquely identify variables 
    that appear more than once'''
    #Create an output varlist
    varlist_out=['']*len(varlist)
    #For each variable...
    for idx,v in enumerate(varlist):
        #...identify the instances of the variable and their positions (using the argument, not the global DF)
        instances=[(i,var) for i,var in enumerate(varlist) if var==v]
        #...if the variable appears more than once...
        if len(instances)>1:
            #...for each item in instances...
            for item in instances:
                #...append the variable position to the duplicate instance...
                varlist_out[item[0]]=varlist[item[0]]+str(item[0])
        #...otherwise leave the variable alone
        else:
            varlist_out[idx]=varlist[idx]
    return varlist_out
    
#Make the variables unique    
fp_df['u_var']=Series(pos_append(fp_df['var'].values))

fp_df

Unnamed: 0,start,stop,var,u_var
0,0,5,Number,Number
1,6,22,Sale Date,Sale Date
2,23,35,First Sinking Fund Date,First Sinking Fund Date
3,36,45,Sink Date,Sink Date
4,46,58,Pre-Ref Date,Pre-Ref Date
5,59,78,Maturity Date,Maturity Date5
6,79,91,Maturity,Maturity6
7,92,104,Maturity Date,Maturity Date7
8,105,120,Letter of Credit Expiration Date (Maty),Letter of Credit Expiration Date (Maty)
9,121,131,Issue Dated Date,Issue Dated Date


In [10]:
fp_df['u_var'].value_counts()[fp_df['u_var'].value_counts()>1]

Series([], dtype: int64)

No dups!

Now we have the field positions we need to parse the whole file.  (The inferential tool created too many columns for some reason.)
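
With start and stop positions in hand, one option worth trying is to pass them to `read_fwf()` as explicit `colspecs` rather than relying on inference.  The following is only a sketch on made-up data: the `raw` string, the three-column layout, and the names are assumptions, and it has not been run against the Reuters file.  Keep in mind that `colspecs` intervals are half-open, so each `to` value is one past the last character of the field.

```python
import io

import pandas as pd

# Toy fixed-width data (an assumption): number in columns 0-5, date in 6-16, type in 17-18.
raw = (
    "1     04/01/15   GO\n"
    "2     04/08/16   RV\n"
)

colspecs = [(0, 6), (6, 17), (17, 19)]   # half-open [from, to) intervals
names = ["Number", "Sale Date", "Type"]

# header=None because the toy data have no header row; dtype=str keeps values as text.
toy = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names,
                  header=None, dtype=str)
print(toy)
```

The cell that follows takes the manual route instead, reusing the `delim()` function defined earlier.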

In [11]:
print('Capturing data')
#Create container for data lines
data=[]
#Capture the 9th line forward
with open('../6months_text_as_columns.txt','r') as f:
    #Skip lines 1-8 and stop on line 9
    for i in range(9):
        tmp_line=f.readline()
    #Capture line 9
    data.append(tmp_line)
    #Capture the next 89359 lines (10 through 89368)
    for i in range(89359):
        data.append(f.readline())
        
print('Processing data')
#Generate container to hold all processed data lines
data_lines=[]

#For each data line...
for i,dl in enumerate(data):
    #...process that line
    data_lines.append(delim(dl))
    if i%1000==0:
        print('>>Processing data line # %d' % i)
    
print('Collecting data in dictionary')
#Create dictionary to hold data
data_dict={}

#For each variable...
for i,var in enumerate(fp_df['u_var']):
    #...collect that variable's value from every line and update the dictionary
    data_dict.update({var:[data_lines[row][i] for row in range(len(data_lines))]})
    if i%50==0:
        print('>>Capturing variable # %d' % i)
    
    
#Convert data dictionary into DF
debt=DataFrame(data_dict)

Capturing data
Processing data
>>Processing data line # 0
>>Processing data line # 100
>>Processing data line # 200
>>Processing data line # 300
>>Processing data line # 400
>>Processing data line # 500
>>Processing data line # 600
>>Processing data line # 700
>>Processing data line # 800
>>Processing data line # 900
>>Processing data line # 1000
>>Processing data line # 1100
>>Processing data line # 1200
>>Processing data line # 1300
>>Processing data line # 1400
>>Processing data line # 1500
>>Processing data line # 1600
>>Processing data line # 1700
>>Processing data line # 1800
>>Processing data line # 1900
>>Processing data line # 2000
>>Processing data line # 2100
>>Processing data line # 2200
>>Processing data line # 2300
>>Processing data line # 2400
>>Processing data line # 2500
>>Processing data line # 2600
>>Processing data line # 2700
>>Processing data line # 2800
>>Processing data line # 2900
>>Processing data line # 3000
>>Processing data line # 3100
>>Processing data lin

In [12]:
debt[fp_df['u_var']]

Unnamed: 0,Number,Sale Date,First Sinking Fund Date,Sink Date,Pre-Ref Date,Maturity Date5,Maturity6,Maturity Date7,Letter of Credit Expiration Date (Maty),Issue Dated Date,...,S/ T,Security Type,Rank Eligible Flag (Y/N),SDC Est. Gross Spread,Master Deal Type,Deal Number,Underlying S&P Short Term Rating,Underlying S&P Long Term Rating,SPSHORT,S&P Short Rating
0,1,04/01/15,,,,12/01/24,,,,,...,Series 2015,S,GO,Yes,,TE,15040128039,NR,NR,NR
1,,,,,,12/01/25,,,,,...,,,,,,,,,,
2,,,,,,12/01/26,,,,,...,,,,,,,,,,
3,,,,,,12/01/27,,,,,...,,,,,,,,,,
4,,,,,,12/01/15,,,,,...,,,,,,,,,,
5,2,04/01/15,,,,04/08/16,,,,,...,Series 2015,T,GO,Yes,,TE,15040102039,NR,NR,NR
6,,,,,,,,,,,...,,,,,,,,,,
7,3,04/01/15,,,,06/01/15,,,,,...,Series 2015 B,S,RV,Yes,,TE,15040302039,NR,A+,NR
8,,,,,,06/01/16,,,,,...,,,,,,,,,,
9,,,,,,06/01/17,,,,,...,,,,,,,,,,


Success!  We can now write to disk in a place of our choosing.

In [13]:
# debt.to_csv('/some_location/debt.csv')