# [Can You Dig It??](https://www.youtube.com/watch?v=V-OYKd8SVrI)

This is just a quick Notebook to demo opening of fixed width data.  The data we are using are 6 months worth of debt issues from Reuters.

In [300]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import glob

%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


So that these proprietary data do not end up being public, they are housed in my parent directory.  Consequently, if you are trying this at home, be sure to change the path.  Note that we will be using the [`read_fwf()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html) method from the [pandas](http://pandas.pydata.org/) library.

In [301]:
!ls ../../debt_data/

6months_text_as_columns.txt  fmatrix	   notes       TEL_Defense
algorithms		     gdrive_data   qcew        TELs_debt
debt_data		     MiscData	   quant_econ  tmp
DefenseDeck		     MiscData.zip  TEL	       work_scratch


Let's inspect the relevant options...

In [302]:
help(pd.read_fwf)

Help on function read_fwf in module pandas.io.parsers:

read_fwf(filepath_or_buffer, colspecs='infer', widths=None, **kwds)
    Read a table of fixed-width formatted lines into DataFrame
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Parameters
    ----------
    filepath_or_buffer : string or file handle / StringIO
        The string could be a URL. Valid URL schemes include
        http, ftp, s3, and file. For file URLs, a
        host is expected. For instance, a local file could be
        file ://localhost/path/to/table.csv
    colspecs : list of pairs (int, int) or 'infer'. optional
        A list of pairs (tuples) giving the extents of the fixed-width
        fields of each line as half-open intervals (i.e.,  [from, to[ ).
        String value 'infer' can be used to instruct the parser to try
        detecting the column specifications from the first 100 rows of
        the data (default='infer').
    widths : list of ints. optional

Looks like infer is already on, but inspecting the front end (via text editor) reveals that a foolish method has been used for the headers.  For some reason they start on line 4 and they are wrapped.  That is, variable name can span multiple lines within column.  The consequence is the headers appear like data to the parser, which means we have to actually explicitly write all this crap out.

Or ... because I am super lazy and explicit writing is tedious, we can come up with a programmatic solution.  We know that we have a fixed width file, so what we really need is to understand where each field starts.  If we can get the starting location of these lines, we can insert commas and get them into lists.  Once this has occurred, we can construct the variable names by position, and capture them all in one list (which will then serve as header info in the read in statement).

For some ungodly reason, the good folks at Reuters have used multiple spaces to separate variables instead of tabs (so no keying on them will work).  Moreover, while most variables have one word per line, there are several with multiple words on a given line.  The only saving grace is that only one space appears between words that end up on the same line.  Consequently, we can define the starting position of a field to be two positions in front of the first character in that field.  Since the first line of the variable name always holds at least one word, we will use that line (which is the 4th in the file) to establish field position.

Observe the sequence of tests applied to the characters in the first line (which we capture as a string).  The elements are as follows:

1. Line number
2. Character
3. Test to see if the character is a space
4. Test to see if both the preceding character, and the one before that, are both spaces

In [303]:
#Define file for testing
file_test='../../debt_data/2014to2015.txt'

#Create container for header lines
header=[]
#Capture the 4th-8th lines
with open(file_test,'r') as f:
    for i in range(4):
        tmp_line=f.readline()
    #Capture line 4
    header.append(tmp_line)
    #Capture 5-8
    for i in range(4):
        header.append(f.readline())
    
#For each character in the first line...
for i,c in enumerate(header[0]):
    #...give me the line number, the character, the first test (see #3 above), and the second test (see #4)
    print i,'|',c,'|',c.isspace(),'|',(header[0][i-2].isspace()) & (header[0][i-1].isspace())

0 |   | True | True
1 |   | True | True
2 |   | True | True
3 |   | True | True
4 |   | True | True
5 |   | True | True
6 |   | True | True
7 |   | True | True
8 |   | True | True
9 | S | False | True
10 | a | False | False
11 | l | False | False
12 | e | False | False
13 |   | True | False
14 |   | True | False
15 |   | True | True
16 |   | True | True
17 |   | True | True
18 |   | True | True
19 |   | True | True
20 |   | True | True
21 |   | True | True
22 |   | True | True
23 |   | True | True
24 |   | True | True
25 |   | True | True
26 | F | False | True
27 | i | False | False
28 | r | False | False
29 | s | False | False
30 | t | False | False
31 |   | True | False
32 |   | True | False
33 |   | True | True
34 |   | True | True
35 |   | True | True
36 |   | True | True
37 |   | True | True
38 |   | True | True
39 | S | False | True
40 | i | False | False
41 | n | False | False
42 | k | False | False
43 |   | True | False
44 |   | True | False
45 |   | True | True
46 |   | True |

In [304]:
len(header[0])

5048

What we have done here is identify the first character of each field.  The first character is the last `TRUE` before a string of `FALSE` in our last test.  If we encounter such a transition, we will capture the line number that is (not one, but) two positions ahead of said transition.  We do this because, again for some strange reason, some of the variable names start with a space (even if the second line does not).

In [305]:
#Create container to hold field positions
field_pos=[]

#For each character in the first line...
for i,c in enumerate(header[0]):
    #...define the test...
    old_spaces=((header[0][i-3].isspace()) & (header[0][i-2].isspace()))
    new_letters=~((header[0][i-2].isspace()) & (header[0][i-1].isspace()))
    new_field=old_spaces & new_letters
    #...if a new field has begun...
    if new_field:
        #...capture the position at which it started
        field_pos.append(i-4)

def delim(line,line_len,field_pos=field_pos):
    '''Function takes fields positions and turns an ugly string into a nicely delimited list'''
    #Capture padding needed to equal length of first string as list
    pad=(line_len-len(list(line)))
    #Capture string as a list
    line_list=list(line)+[' ']*pad
    #For each new field...
    for pos in field_pos:
        #...convert the start position from space to comma
        line_list[pos]='|'
#         try:
#             line_list[pos]=','
#         except:
#             print ''.join(line_list)
#             break
    #Convert list back to string
    line=''.join(line_list)
    #Strip space
    line=[s.strip() for s in line.split('|')]
    return line

#Generate container to hold all processed header lines
pheader=[]

#For each header line...
for i,hl in enumerate(header):
    #...process that line
    tmp_line=delim(hl,len(header[0]))
    pheader.append(tmp_line)
    print i,len(tmp_line)
    print len(hl)
#     print hl

0 254
5048
1 254
5041
2 254
5042
3 254
5008
4 254
3250


In [306]:
pheader

[['',
  'Sale',
  'First',
  'Sink',
  'Pre-Ref',
  'Maturity Date',
  'Maturity',
  'Maturity',
  'Letter of',
  'Issue',
  'Initial',
  'Maty of',
  'First',
  'First',
  'Final',
  'Delivery',
  'Maturity',
  'Dated',
  'Dated Date',
  'Date',
  'Conversion',
  'Callable',
  'Call',
  'Beginning',
  '501c3',
  '8-Digit',
  '8-Digit',
  'Cusip',
  'Managers',
  'Bond',
  'Bond',
  'All Use',
  'All Use',
  'All Use',
  'Maturity Amount',
  'Amount',
  '$ Amount of',
  'Principal',
  'Amount',
  'Ant-',
  'Use of',
  'Asset',
  'Auction',
  'Aver-',
  'Bank',
  'Bk',
  'Beginning',
  'Corporate or',
  'Beginning',
  'Bond',
  'Bid',
  'Bond',
  'Bk',
  'Call',
  'Call',
  'Initial',
  'Co-Managers',
  'Spec',
  'Bnk',
  'Comm-',
  'Comp',
  'Corp',
  'Coupon',
  'Coupon Maturity',
  'County',
  'Coupon',
  'Coupon',
  'Coupon Type',
  'Cpn',
  'Coupon Type',
  'Coupon',
  'Credit',
  'Credit',
  'Credit Enhancer',
  'Credit Enhancer',
  'Credit Enhancer',
  'CUSIP of',
  'DALCOMP',
  

In [307]:
# for i in range(len(pheader[0])):
#     print ' '.join([line[i] for line in pheader]).strip()

Now that we have our nice comma delimted lists, let's throw them together for our final variables.

In [308]:
#Generate container to hold variables
varlist=[]

#For each variable...
for i in range(len(pheader[0])):
    #...create a temporary container to hold the variable components from each line...
    var_tmp=[]
    #...and for each line...
    for j in range(len(pheader)):
        #...put the variable components in var_tmp...
        try:
            var_tmp.append(pheader[j][i])
        except:
            print '***',j,i
            print len(pheader),len(pheader[0])
    #...convert to string and throw the variable in varlist
    varlist.append(' '.join(var_tmp).strip())

print len(varlist)
varlist

254


['',
 'Sale Date',
 'First Sinking Fund Date',
 'Sink Date',
 'Pre-Ref Date',
 'Maturity Date',
 'Maturity',
 'Maturity Date',
 'Letter of Credit Expiration Date (Maty)',
 'Issue Dated Date',
 'Initial Put Date',
 'Maty of Highest Cpn Maty',
 'First Interest Payment Date',
 'First Call Date',
 'Final Maturity',
 'Delivery Date',
 'Maturity',
 'Dated Date',
 'Dated Date',
 'Date Issue Added',
 'Conversion Date (Maty)',
 'Callable at Par',
 'Call Date',
 'Beginning Serial Maturity',
 '501c3',
 '8-Digit CUSIP',
 '8-Digit CUSIP',
 'Cusip',
 'Managers',
 'Bond Buyer ALL UOP',
 'Bond Buyer UOP',
 'All Use of Proceeds (Code)',
 'All Use of Proceeds (Desc)',
 'All Use of Proceeds (Number)',
 'Maturity Amount',
 'Amount of Final Maturity ($mils)',
 '$ Amount of Highest Cpn Maturity',
 'Principal Amount',
 'Amount of Maturity ($ mils)',
 'Ant- ici- pa- tion Type',
 'Use of Proceeds Amount ($ mils)',
 'Asset Backed Indicator Flag (Y/N)',
 'Auction Rate',
 'Aver- age Life',
 'Bank Qual',
 'Bk Elig

In [309]:
len(pheader[0])

254

In [310]:
#Capture start and stop positions in DF
fp_df=DataFrame({'stop':field_pos,
                 'start':Series(field_pos).shift()+1})

#Make sure we start at position 0
fp_df.ix[0,'start']=0

#Add last field position par
last_pair=DataFrame({'start':fp_df.iloc[-1]['stop']+1,
                     'stop':len(header[0])},index=[fp_df.index[-1]+1])
fp_df=pd.concat([fp_df,last_pair])

#Match up fields positions and labels
fp_df['var']=varlist

#Assign arbitrary label to first field 
fp_df.ix[0,'var']='Number'

#Convert field positions parameters to int
for var in ['start','stop']:
    fp_df[var]=fp_df[var].astype(int)

print len(field_pos),len(fp_df),len(varlist)
fp_df

253 254 254


Unnamed: 0,start,stop,var
0,0,6,Number
1,7,23,Sale Date
2,24,36,First Sinking Fund Date
3,37,46,Sink Date
4,47,59,Pre-Ref Date
5,60,79,Maturity Date
6,80,92,Maturity
7,93,105,Maturity Date
8,106,121,Letter of Credit Expiration Date (Maty)
9,122,132,Issue Dated Date


So, it looks like we have repeats in our variable list...

In [311]:
dups=fp_df['var'].value_counts()[fp_df['var'].value_counts()>1]

dups

Fitch                       4
Paying Agent                3
Trustee                     3
Financial Advisor           3
Coupon Type                 3
Issuer's Counsel            3
Credit Enhancer             3
Tender Agent                3
Bond Buyer UOP              3
Nation                      2
Dated Date                  2
Credit Enhance ment Type    2
Moody Rating                2
Maturity                    2
Managers                    2
Maturity Date               2
8-Digit CUSIP               2
Remarketing Agent           2
Bond Buyer Region           2
S&P Rating                  2
Project                     2
SDC Region                  2
dtype: int64

We can deal with this by appending the original position of the variable to the variable name, thereby making each instance unique.

In [312]:
def pos_append(varlist):
    '''Function appends position of variable to variable name to uniquely identify variables 
    that appear more than once'''
    #Create an output varlist
    varlist_out=['']*len(varlist)
    #For each variable...
    for idx,v in enumerate(varlist):
        #...identify the instances of the variable and their positions
        instances=[(i,var) for i,var in enumerate(fp_df['var'].values) if var==v]
        #...if the variable appears more than once...
        if len(instances)>1:
            #...for each item in instances...
            for item in instances:
                #...append the variable position to the duplicate instance...
                varlist_out[item[0]]=varlist[item[0]]+str(item[0])
        #...otherwise leave the variable alone
        else:
            varlist_out[idx]=varlist[idx]
    return varlist_out
    
#Make the variables unique    
fp_df['u_var']=Series(pos_append(fp_df['var'].values))

print fp_df.to_string()

     start  stop                                       var                                     u_var
0        0     6                                    Number                                    Number
1        7    23                                 Sale Date                                 Sale Date
2       24    36                   First Sinking Fund Date                   First Sinking Fund Date
3       37    46                                 Sink Date                                 Sink Date
4       47    59                              Pre-Ref Date                              Pre-Ref Date
5       60    79                             Maturity Date                            Maturity Date5
6       80    92                                  Maturity                                 Maturity6
7       93   105                             Maturity Date                            Maturity Date7
8      106   121   Letter of Credit Expiration Date (Maty)   Letter of Credit Expiration Da

In [313]:
fp_df[fp_df['u_var'] == 'Issuer']

Unnamed: 0,start,stop,var,u_var
130,2265,2330,Issuer,Issuer


No dups!

Now we have the field positions we need to parse the whole file.  (The inferential tool created too many columns for some reason.)

In [314]:
s=['a','b','c','d','e','f','g']
print s[2:]
print s[2:-3]

['c', 'd', 'e', 'f', 'g']
['c', 'd']


In [315]:
print 'Capturing data'
#Create container for data lines
data=[]
#Capture the 9th line forward
with open(file_test,'r') as f:
    data=f.readlines()[8:-15] #(there are session details at the end of the file)
    f.close()
        
print 'Processing data'
#Generate container to hold all processed data lines
data_lines=[]

#For each data line...
for i,dl in enumerate(data):
    #...process that line
    data_lines.append(delim(dl,len(header[0])))
    if i%10000==0:
        print '>>Processing data line #',i
        
print 'Consolidating lines (vertical concatenation)'

#Capture start position of each issue (vertical)
issue_start=[(line[0],i) for i,line in enumerate(data_lines) if line[0]!='']

#Capture in DF and include stop position
issue_pos=DataFrame({'issue':[iss[0] for iss in issue_start],
                     'start':[iss[1] for iss in issue_start],
                     'stop':Series([iss[1] for iss in issue_start]).shift(-1)-1})

#Fill in last stop position
issue_pos.ix[issue_pos.index[-1],'stop']=len(data_lines)

#Convert positions to integer
for var in ['issue','start','stop']:
    issue_pos[var]=issue_pos[var].astype(int)
    
#Set index
issue_pos.set_index('issue',inplace=True)

#Create a container for consolidated data lines
data_lines_con=[]

#For each issue...
for issue in issue_pos.index:
    #...create a container for a consolidated, issue-specific line...
    new_data_line=[]
    #...if there is more than one line allocated to that issue...
    if issue_pos.ix[issue]['start']<issue_pos.ix[issue]['stop']:
        #...capture the data lines in that issue...
        iss_lns=data_lines[issue_pos.ix[issue]['start']:issue_pos.ix[issue]['stop']]
        #...and for each variable in those data lines...
        for idx in range(len(data_lines[0])):
            #...vertically concatenate to form a new consolidated data line...
            new_data_line.append(' '.join([line[idx] for line in iss_lns]).strip())
    #...otherwise, just rename the single line...
    new_data_line=data_lines[issue_pos.ix[issue]['start']]
    #...and then throw the new line in data_lines_con
    data_lines_con.append(new_data_line)
    
print 'Collecting data in dictionary'
#Create dictionary to hold data
data_dict={}

#For each variable...
for i,var in enumerate(fp_df['u_var']):
    #...once all lines are collected, update the dictionary
    data_dict.update({var:[data_lines_con[row][i] for row in range(len(data_lines_con))]})
    if i%50==0:
        print '>>Capturing variable #',i
    
    
#Convert data dictionary into DF
debt=DataFrame(data_dict)

Capturing data
Processing data
>>Processing data line # 0
>>Processing data line # 10000
>>Processing data line # 20000
>>Processing data line # 30000
>>Processing data line # 40000
>>Processing data line # 50000
>>Processing data line # 60000
>>Processing data line # 70000
>>Processing data line # 80000
>>Processing data line # 90000
>>Processing data line # 100000
>>Processing data line # 110000
>>Processing data line # 120000
>>Processing data line # 130000
>>Processing data line # 140000
>>Processing data line # 150000
>>Processing data line # 160000
>>Processing data line # 170000
>>Processing data line # 180000
>>Processing data line # 190000
>>Processing data line # 200000
>>Processing data line # 210000
>>Processing data line # 220000
>>Processing data line # 230000
>>Processing data line # 240000
>>Processing data line # 250000
>>Processing data line # 260000
>>Processing data line # 270000
>>Processing data line # 280000
Consolidating lines (vertical concatenation)
Collecting

In [316]:
issue_pos.tail()

Unnamed: 0_level_0,start,stop
issue,Unnamed: 1_level_1,Unnamed: 2_level_1
26893,280852,280853
26894,280854,280855
26895,280856,280857
26896,280858,280859
26897,280860,280862


In [317]:
debt[fp_df['u_var']]

Unnamed: 0,Number,Sale Date,First Sinking Fund Date,Sink Date,Pre-Ref Date,Maturity Date5,Maturity6,Maturity Date7,Letter of Credit Expiration Date (Maty),Issue Dated Date,...,S/ T,Security Type,Rank Eligible Flag (Y/N),SDC Est. Gross Spread,Master Deal Type,Deal Number,Underlying S&P Short Term Rating,Underlying S&P Long Term Rating,SPSHORT,S&P Short Rating
0,1,01/02/14,,,,01/09/15,,,,,...,T,GO,Yes,,TE,14010201039,NR,NR,NR,NR
1,2,01/02/14,,,,01/09/15,,,,,...,T,GO,Yes,,TE,14010205039,NR,NR,NR,NR
2,3,01/02/14,,,,01/08/15,,,,,...,T,GO,Yes,,TE,14010204039,NR,NR,NR,NR
3,4,01/02/14,,,,04/30/14,,,,,...,T,GO,Yes,,TE,14010206039,NR,NR,NR,NR
4,5,01/03/14,07/01/14,,,01/01/21,,,,,...,,GO,Yes,,TE,14010302039,NR,NR,NR,NR
5,6,01/03/14,,,,12/15/14,,,,,...,T,GO,Yes,,TE,14010301039,NR,NR,NR,NR
6,7,01/06/14,,,,01/01/17,,,,,...,T,GO,Yes,,TE,14010605039,NR,AA+,NR,NR
7,8,01/06/14,,,,02/01/15,,,,,...,S,GO,Yes,,TE,14010602039,NR,NR,NR,NR
8,9,01/06/14,,,,09/01/14,,,,,...,S,GO,Yes,,TE,14020602039,NR,A-,NR,NR
9,10,01/06/14,,,,01/15/15,,,,,...,T,GO,Yes,,TE,14010604039,NR,NR,NR,NR


In [319]:
debt[fp_df['u_var']].to_csv(file_test[:-3]+'csv')

For some reason the CSV is still jacked up in LibreOffice, but pandas seems to read it in just fine.

In [325]:
reread=pd.read_csv(file_test[:-3]+'csv')
print (debt['S&P Short Rating']==reread['S&P Short Rating']).all()
debt['S&P Short Rating'].head(100)

True


0        NR
1        NR
2        NR
3        NR
4        NR
5        NR
6        NR
7        NR
8        NR
9        NR
10       NR
11       NR
12       NR
13       NR
14       NR
15       NR
16       NR
17       NR
18       NR
19       NR
20       NR
21       NR
22       NR
23       NR
24       NR
25       NR
26    SP-1+
27       NR
28       NR
29       NR
      ...  
70       NR
71       NR
72       NR
73       NR
74       NR
75       NR
76       NR
77       NR
78       NR
79       NR
80       NR
81       NR
82       NR
83       NR
84       NR
85       NR
86       NR
87       NR
88       NR
89       NR
90       NR
91       NR
92       NR
93       NR
94       NR
95       NR
96       NR
97       NR
98       NR
99       NR
Name: S&P Short Rating, dtype: object

In [340]:
print debt['Security Type'].value_counts()
# print len(debt['Issue Description'].value_counts()),debt['Issue Description'].value_counts()
print debt['Issuer Type Description'].value_counts()
print sorted(set(debt['All Use of Proceeds (Desc)']))
print sorted(set(debt['Bond Buyer UOP30']))

GO    18713
RV     8184
dtype: int64
District           10739
City, Town Vlg      8605
Local Authority     2788
State Authority     2247
County/Parish       1596
College or Univ      413
State/Province       405
Direct Issuer         97
Indian Tribe           5
Co-op Utility          2
dtype: int64
['Agriculture', 'Airports', 'Assisted Living', 'Bridges', "Children's Hospital", 'Civic & Convention Centers', 'Combined Utilities', 'Cont Care Retirement Community', 'Correctional Facilities', 'Economic Development', 'Fire Stations & Equipment', 'Flood Control', 'Gas', 'General Acute Care Hospital', 'General Medical', 'General Purpose/ Public Imp', 'Government Buildings', 'Higher Education', 'Hospital Equipment Loans', 'Industrial Development', 'Libraries & Museums', 'Mass Transportation', 'Multi Family Housing', 'Nursing Homes', 'Office Buildings', 'Other Education', 'Other Recreation', 'Parking Facilities', 'Parks, Zoos & Beaches', 'Police Stations & Equipment', 'Pollution Control', 'Prim

In [335]:
debt[debt['Issuer Type Description']=='Direct Issuer']['Issuer']

4                   Navajo Tribal Utility Authority
67                                 Energy Northwest
317                    American Municipal Power Inc
485                           Nebraska Utility Corp
619                       Nebraska Tech Coop Fin #4
629                    American Municipal Power Inc
677                              TexAmericas Center
678                              TexAmericas Center
1388                       Greene Co Medical Center
1491                    Agua Mansa Ind Growth Assoc
1887                    Florida PACE Funding Agency
2904                               Energy Northwest
2905                               Energy Northwest
3980                  Synergy Education Project Inc
4690                       Jenks Aquarium Authority
4921                   River Springs Charter School
5045                        Build NYC Resource Corp
5046                        Build NYC Resource Corp
5351                        Build NYC Resource Corp
5352        

There are way too many uses of the debt issues to be reasonably included in the specification, so let's sort these into categories.

Category|Use Descriptions
--------|----------------
Education|
Health|Children's Hospital<br>General Acute Care Hospital<br>General Medical
Infrastructure|Combined Utilities<br>Flood Control<br>Gas<br>
Natural Resources|Agriculture<br>
Public Safety|Fire Stations & Equipment<br>
Recreation|Civic & Convention Centers<br>
Social|Assisted Living<br>Cont Care Retirement Community<br>Correctional Facilities<br>Economic Development<br>
Transportation|Airports<br>Bridges<br>

Success!  We can now write to disk in a place of our choosing.  Let's create a function to execute this task given an input file.

In [275]:
def txt2df(file_in):
    '''File converts debt files from txt to csv'''
    print '\n\n*** Processing '+file_in+' ***'
    
    ### CAPTURE HEADER ###
    print '--Capturing header--'
    #Create container for header lines
    hdr=[]
    #Capture the 4th-8th lines
    with open(file_in,'r') as f_i:
        for i in range(4):
            tmp_line=f_i.readline()
        #Capture line 4
        hdr.append(tmp_line)
        #Capture 5-8
        for i in range(4):
            hdr.append(f_i.readline())
    f_i.close()
    
    ### CAPTURE FIELD POSITIONS ###
    print '--Capturing field positions--'
    #Create container to hold field positions
    fld_pos=[]
    #For each character in the first line...
    for i,ch in enumerate(hdr[0]):
        #...define the test...
        old=((hdr[0][i-3].isspace()) & (hdr[0][i-2].isspace()))
        new=~((hdr[0][i-2].isspace()) & (hdr[0][i-1].isspace()))
        new_fld=old & new
        #...if a new field has begun...
        if new_fld:
            #...capture the position at which it started
            fld_pos.append(i-4)
    
    ### PROCESS HEADER ###
    print '--Processing header--'
    #Generate container to hold all processed header lines
    phdr=[]
    #For each header line...
    for i,hline in enumerate(hdr):
        #...process that line
        phdr.append(delim(hline,len(hdr[0]),field_pos=fld_pos))
    
    ### CAPTURE VARIABLE LIST ###
    print '--Capturing clean variable list--'
    #Generate container to hold variables
    varl=[]
    #For each variable...
    for i in range(len(phdr[0])):
        #...create a temporary container to hold the variable components from each line...
        vtmp=[]
        #...and for each line...
        for j in range(len(phdr)):
            #...put the variable components in var_tmp...
            try:
                vtmp.append(phdr[j][i])
            except:
                print '***',j,i
                print len(phdr),len(phdr[0])
        #...convert to string and throw the variable in varlist
        varl.append(' '.join(vtmp).strip())
        
    ### CAPTURE DATAFRAME WITH ALL START/STOP INFO ###
    print '--Housing field position info in DataFrame--'
    #Capture start and stop positions in DF
    fpos_df=DataFrame({'stop':fld_pos,
                       'start':Series(fld_pos).shift()+1})
    #Make sure we start at position 0
    fpos_df.ix[0,'start']=0
    #Add last field position par
    lastp=DataFrame({'start':fpos_df.iloc[-1]['stop']+1,
                     'stop':len(hdr[0])},index=[fpos_df.index[-1]+1])
    fpos_df=pd.concat([fpos_df,lastp])
    #Match up fields positions and labels
    print len(fpos_df),len(varl)
    print fpos_df.tail()
    fpos_df['var']=varl
    #Assign arbitrary label to first field 
    fpos_df.ix[0,'var']='Number'
    #Convert field positions parameters to int
    for var in ['start','stop']:
        fpos_df[var]=fpos_df[var].astype(int)
    #Make the variables unique    
    fpos_df['u_var']=Series(pos_append(fpos_df['var'].values))
        
    ### PROCESS DATA AND CONVERT TO CSV ###
    print '--Capturing data--'
    #Create container for data lines
    data_lines=[]
    #Capture the 9th line forward
    with open(file_in,'r') as f_i:
        data_lines=f_i.readlines()[8:-15]
    f_i.close()

    print '--Processing data--'
    #Generate container to hold all processed header lines
    pdata_lines=[]
    #For each data line...
    for i,dline in enumerate(data_lines):
        #...process that line
        pdata_lines.append(delim(dline,len(hdr[0])))
        if i%10000==0:
            print '>>>>Processing data line #',i
            
    print 'Consolidating lines (vertical concatenation)'
    #Capture start position of each issue (vertical)
    issue_start=[(line[0],i) for i,line in enumerate(pdata_lines) if line[0]!='']
    #Capture in DF and include stop position
    issue_pos=DataFrame({'issue':[iss[0] for iss in issue_start],
                         'start':[iss[1] for iss in issue_start],
                         'stop':Series([iss[1] for iss in issue_start]).shift(-1)-1})
    #Fill in last stop position
    issue_pos.ix[issue_pos.index[-1],'stop']=len(pdata_lines)
    #Convert positions to integer
    for var in ['issue','start','stop']:
        issue_pos[var]=issue_pos[var].astype(int)
    #Set index
    issue_pos.set_index('issue',inplace=True)
    #Create a container for consolidated data lines
    pdata_lines_con=[]
    #For each issue...
    for issue in issue_pos.index:
        #...create a container for a consolidated, issue-specific line...
        new_data_line=[]
        #...if there is more than one line allocated to that issue...
        if issue_pos.ix[issue]['start']<issue_pos.ix[issue]['stop']:
            #...capture the data lines in that issue...
            iss_lns=pdata_lines[issue_pos.ix[issue]['start']:issue_pos.ix[issue]['stop']]
            #...and for each variable in those data lines...
            for idx in range(len(pdata_lines[0])):
                #...vertically concatenate to form a new consolidated data line...
                new_data_line.append(' '.join([line[idx] for line in iss_lns]).strip())
        #...otherwise, just rename the single line...
        new_data_line=data_lines[issue_pos.ix[issue]['start']]
        #...and then throw the new line in data_lines_con
        pdata_lines_con.append(new_data_line)

    print '--Collecting data in dictionary--'
    #Create dictionary to hold data
    data_dict_out={}
    #For each variable...
    for i,var in enumerate(fpos_df['u_var']):
        #...once all lines are collected, update the dictionary
        data_dict_out.update({var:[pdata_lines[row][i] for row in range(len(pdata_lines_con))]})
        if i%50==0:
            print '>>>>Capturing variable #',i
            
    return (DataFrame(data_dict_out)[fpos_df['u_var']],fpos_df)

Ok, let's test this guy.

In [276]:
txt2df('../debt_data/2002to2003.txt')



*** Processing ../debt_data/2002to2003.txt ***
--Capturing header--
--Capturing field positions--
--Processing header--
--Capturing clean variable list--
--Housing field position info in DataFrame--
254 254
     start  stop
249   4993  5010
250   5011  5027
251   5028  5044
252   5045  5061
253   5062  5078
--Capturing data--
--Processing data--
>>>>Processing data line # 0
>>>>Processing data line # 10000
>>>>Processing data line # 20000
>>>>Processing data line # 30000
>>>>Processing data line # 40000
>>>>Processing data line # 50000
>>>>Processing data line # 60000
>>>>Processing data line # 70000
>>>>Processing data line # 80000
>>>>Processing data line # 90000
>>>>Processing data line # 100000
>>>>Processing data line # 110000
>>>>Processing data line # 120000
>>>>Processing data line # 130000
>>>>Processing data line # 140000
>>>>Processing data line # 150000
>>>>Processing data line # 160000
>>>>Processing data line # 170000
>>>>Processing data line # 180000
>>>>Processing dat

(      Number Sale Date First Sinking Fund Date Sink Date Pre-Ref Date  \
 0          1  01/02/02                                                  
 1                                                                       
 2                                                                       
 3                                                                       
 4                                                                       
 5                                                                       
 6                                                                       
 7                                                                       
 8                                                                       
 9          2  01/02/02                                                  
 10                                                                      
 11                                                                      
 12                                   

It appears to work, so let's go ahead and generate CSV files for all the text files in the `debt_data/` folder.

In [278]:
#Create container for bad runs
bad_runs=[]

#Create dict to DFs
df_out_dict={}

#Create dict to hold field positions
fp_dict={}

#Capture list of files
f_list=glob.glob('../../debt_data/*.txt')
print 'Entering loop'
#For each file...
for f_in in f_list:
    print f_in
    #...capture the CSV form of the data...
    print 'Starting processing'
    tmp_csv_out=txt2df(f_in)
    print 'Capturing new CSV'
    tmp_csv=tmp_csv_out[0]
    print 'Capturing field positions'
    fp_dict.update({f_in:tmp_csv_out[1]})
    #...write it to disk...
    print 'Writing processed data to disk'
    tmp_csv.to_csv(f_in[:-3]+'csv')
    #...and delete the DF held in memory
    del tmp_csv
#     try:
#         #...capture the CSV form of the data...
#         tmp_csv_out=txt2df(f_in)
#         tmp_csv=tmp_csv_out[0]
#         df_out_dict.update({f_in:tmp_csv_out[0]})
#         fp_dict.update({f_in:tmp_csv_out[1]})
#         #...write it to disk...
#         tmp_csv.to_csv(f_in[:-3]+'csv')
#         #...and delete the DF held in memory
#         del tmp_csv
#     except:
#         print '*** BAD RUN - '+f_in+' ***'
#         bad_runs.append(f_in)

Entering loop
../debt_data/2002to2003.txt
Starting processing


*** Processing ../debt_data/2002to2003.txt ***
--Capturing header--
--Capturing field positions--
--Processing header--
--Capturing clean variable list--
--Housing field position info in DataFrame--
254 254
     start  stop
249   4993  5010
250   5011  5027
251   5028  5044
252   5045  5061
253   5062  5078
--Capturing data--
--Processing data--
>>>>Processing data line # 0
>>>>Processing data line # 10000
>>>>Processing data line # 20000
>>>>Processing data line # 30000
>>>>Processing data line # 40000
>>>>Processing data line # 50000
>>>>Processing data line # 60000
>>>>Processing data line # 70000
>>>>Processing data line # 80000
>>>>Processing data line # 90000
>>>>Processing data line # 100000
>>>>Processing data line # 110000
>>>>Processing data line # 120000
>>>>Processing data line # 130000
>>>>Processing data line # 140000
>>>>Processing data line # 150000
>>>>Processing data line # 160000
>>>>Processing data line

IndexError: list assignment index out of range

In [20]:
bad_runs

['../debt_data/1990to1991.txt', '../debt_data/1988to1989.txt']

In [21]:
fp_dict['../debt_data/1984to1985.txt'][fp_dict['../debt_data/1984to1985.txt']['u_var']=='Issuer']

Unnamed: 0,start,stop,var,u_var
130,2249,2310,Issuer,Issuer
