# [Can You Dig It??](https://www.youtube.com/watch?v=V-OYKd8SVrI)

This is just a quick Notebook to demo opening of fixed width data.  The data we are using are 6 months worth of debt issues from Reuters.

In [19]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import glob

%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


So that these proprietary data do not end up being public, they are housed in my parent directory.  Consequently, if you are trying this at home, be sure to change the path.  Note that we will be using the [`read_fwf()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html) method from the [pandas](http://pandas.pydata.org/) library.

In [2]:
!ls ..

algorithms	       debt_data	mortgage	      TELs_debt
CensusPovThresh.ipynb  fmatrix		NTA		      work_scratch
Conference Notes       google_api_keys	spatial_analysis_lit


Let's inspect the relevant options...

In [3]:
help(pd.read_fwf)

Help on function read_fwf in module pandas.io.parsers:

read_fwf(filepath_or_buffer, colspecs='infer', widths=None, **kwds)
    Read a table of fixed-width formatted lines into DataFrame
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Parameters
    ----------
    filepath_or_buffer : string or file handle / StringIO
        The string could be a URL. Valid URL schemes include
        http, ftp, s3, and file. For file URLs, a
        host is expected. For instance, a local file could be
        file ://localhost/path/to/table.csv
    colspecs : list of pairs (int, int) or 'infer'. optional
        A list of pairs (tuples) giving the extents of the fixed-width
        fields of each line as half-open intervals (i.e.,  [from, to[ ).
        String value 'infer' can be used to instruct the parser to try
        detecting the column specifications from the first 100 rows of
        the data (default='infer').
    widths : list of ints. optional

Looks like infer is already on, but inspecting the front end (via text editor) reveals that a foolish method has been used for the headers.  For some reason they start on line 4 and they are wrapped.  That is, variable name can span multiple lines within column.  The consequence is the headers appear like data to the parser, which means we have to actually explicitly write all this crap out.

Or ... because I am super lazy and explicit writing is tedious, we can come up with a programmatic solution.  We know that we have a fixed width file, so what we really need is to understand where each field starts.  If we can get the starting location of these lines, we can insert commas and get them into lists.  Once this has occurred, we can construct the variable names by position, and capture them all in one list (which will then serve as header info in the read in statement).

For some ungodly reason, the good folks at Reuters have used multiple spaces to separate variables instead of tabs (so no keying on them will work).  Moreover, while most variables have one word per line, there are several with multiple words on a given line.  The only saving grace is that only one space appears between words that end up on the same line.  Consequently, we can define the starting position of a field to be two positions in front of the first character in that field.  Since the first line of the variable name always holds at least one word, we will use that line (which is the 4th in the file) to establish field position.

Observe the sequence of tests applied to the characters in the first line (which we capture as a string).  The elements are as follows:

1. Line number
2. Character
3. Test to see if the character is a space
4. Test to see if both the preceding character, and the one before that, are both spaces

In [86]:
#Define file for testing
file_test='../debt_data/2006to2007.csv'

#Create container for header lines
header=[]
#Capture the 4th-8th lines
with open(file_test,'r') as f:
    for i in range(4):
        tmp_line=f.readline()
    #Capture line 4
    header.append(tmp_line)
    #Capture 5-8
    for i in range(4):
        header.append(f.readline())
    
#For each character in the first line...
for i,c in enumerate(header[0]):
    #...give me the line number, the character, the first test (see #3 above), and the second test (see #4)
    print i,'|',c,'|',c.isspace(),'|',(header[0][i-2].isspace()) & (header[0][i-1].isspace())

0 | 2 | False | False
1 | , | False | False
2 | N | False | False
3 | R | False | False
4 | , | False | False
5 | , | False | False
6 | N | False | False
7 | o | False | False
8 | , | False | False
9 | , | False | False
10 | , | False | False
11 | , | False | False
12 | - | False | False
13 | T | False | False
14 | R | False | False
15 | U | False | False
16 | S | False | False
17 | T | False | False
18 | - | False | False
19 | S | False | False
20 | V | False | False
21 | S | False | False
22 |   | True | False
23 |   | True | False
24 |   | True | True
25 |   | True | True
26 |   | True | True
27 |   | True | True
28 |   | True | True
29 |   | True | True
30 |   | True | True
31 |   | True | True
32 |   | True | True
33 |   | True | True
34 |   | True | True
35 |   | True | True
36 |   | True | True
37 |   | True | True
38 |   | True | True
39 |   | True | True
40 | B | False | True
41 | O | False | False
42 | N | False | False
43 | D | False | False
44 | - | False | False
45 | T | F

In [87]:
len(header[0])

1099

What we have done here is identify the first character of each field.  The first character is the last `TRUE` before a string of `FALSE` in our last test.  If we encounter such a transition, we will capture the line number that is (not one, but) two positions ahead of said transition.  We do this because, again for some strange reason, some of the variable names start with a space (even if the second line does not).

In [100]:
#Create container to hold field positions
field_pos=[]

#For each character in the first line...
for i,c in enumerate(header[0]):
    #...define the test...
    old_spaces=((header[0][i-3].isspace()) & (header[0][i-2].isspace()))
    new_letters=~((header[0][i-2].isspace()) & (header[0][i-1].isspace()))
    new_field=old_spaces & new_letters
    #...if a new field has begun...
    if new_field:
        #...capture the position at which it started
        field_pos.append(i-4)

def delim(line,field_pos=field_pos):
    '''Function takes fields positions and turns an ugly string into a nicely delimited list'''
    #Capture padding needed to equal length of first string as list
    pad=(len(list(header[0]))-len(list(line)))
    #Capture string as a list
    line_list=list(line)+[' ']*pad
    #For each new field...
    for pos in field_pos:
        #...convert the start position from space to comma
        try:
            line_list[pos]=','
        except:
            print ''.join(line_list)
            break
    #Convert list back to string
    line=''.join(line_list)
    #Strip space
    line=[s.strip() for s in line.split(',')]
    return line

#Generate container to hold all processed header lines
pheader=[]

#For each header line...
for i,hl in enumerate(header):
    #...process that line
    tmp_line=delim(hl)
    pheader.append(tmp_line)
    print i,len(tmp_line)
    print len(hl)
    print hl

0 249
1099
2,NR,,No,,,,-TRUST-SVS                  BOND-T,BON,Z43,General Purpose/ Public Imp,43,,,0.095,,N,N,,Yes,,3.5,02/01/2013,C,Yes,Yes,Yes,GP,4.380,Imp,MW,5.110,General Purpose,Yes,,,,Yes,,02/01/2014,,Mideast,N,1.450,,N,,Iowa,3.5,,F,Fixed Rate,,4.50,,F,,,,,,06809FAC0,,01/03/06,,02/01/2006,29,0,No,02/01/06,5,000,,No,,No,Yes,,,2/01/2026,,,,,No,02/01/26,Yes,02/01/2014,02/01/2007,02/01/07,,,NR,NR,,,,,100.000#,,,,No,,Barneveld-Wisconsin,,12,Town Vlg,068084/06809F,BKRS,,,Z43                        General Purpose/,GP         Gene,BKRS-BK-WI,0.095,02/01/09,,02/01/09,02/01/09,,02/01/26,US,,NR,,NR,US,,No,States,2,04        1,,6010,CPT,,Yes,UST-SVS                  Bond T,,,No,,,,,,,,N,,,,,,N,,SOLE,NR,GO            Yes,Midwest,N,R,01/03/06,,228b        Y,NR,,N,N,,E,N,,,,No,,BKRS-BK-WI,NR,20.014,39,No,1.45,,2026,,,,,NR          NR,1.450,,al Purpose,Z          Genl Purpose/ Publi,.225        1.450           No,No,,No,,,,N,Public Imp,No,,6010304,4.5,N         2,1.4,5,City,,,United,,1.45      

In [89]:
pheader

[['2',
  'NR',
  '',
  'No',
  '',
  '',
  '',
  '-TRUST-SVS',
  'BOND-T',
  'BON',
  'Z43',
  'General Purpose/ Public Imp',
  '43',
  '',
  '',
  '0.095',
  '',
  'N',
  'N',
  '',
  'Yes',
  '',
  '3.5',
  '02/01/2013',
  'C',
  'Yes',
  'Yes',
  'Yes',
  'GP',
  '4.380',
  'Imp',
  'MW',
  '5.110',
  'General Purpose',
  'Yes',
  '',
  '',
  '',
  'Yes',
  '',
  '02/01/2014',
  '',
  'Mideast',
  'N',
  '1.450',
  '',
  'N',
  '',
  'Iowa',
  '3.5',
  '',
  'F',
  'Fixed Rate',
  '',
  '4.50',
  '',
  'F',
  '',
  '',
  '',
  '',
  '',
  '06809FAC0',
  '',
  '01/03/06',
  '',
  '02/01/2006',
  '29',
  '0',
  'No',
  '02/01/06',
  '5',
  '000',
  '',
  'No',
  '',
  'No',
  'Yes',
  '',
  '',
  '2/01/2026',
  '',
  '',
  '',
  '',
  'No',
  '02/01/26',
  'Yes',
  '02/01/2014',
  '02/01/2007',
  '02/01/07',
  '',
  '',
  'NR',
  'NR',
  '',
  '',
  '',
  '',
  '100.000#',
  '',
  '',
  '',
  'No',
  '',
  'Barneveld-Wisconsin',
  '',
  '12',
  'Town Vlg',
  '068084/06809F',
  'BKRS',

Now that we have our nice comma delimted lists, let's throw them together for our final variables.

In [93]:
#Generate container to hold variables
varlist=[]

#For each variable...
for i in range(len(pheader[0])):
    #...create a temporary container to hold the variable components from each line...
    var_tmp=[]
    #...and for each line...
    for j in range(len(pheader)):
        #...put the variable components in var_tmp...
        try:
            var_tmp.append(pheader[j][i])
        except:
            print '***',j,i
            print len(pheader),len(pheader[0])
    #...convert to string and throw the variable in varlist
    varlist.append(' '.join(var_tmp).strip())

print len(varlist)
varlist

*** 1 248
5 249
*** 2 248
5 249
*** 3 248
5 249
*** 4 248
5 249
249


['2 3 4 5 6',
 'NR',
 '',
 'No',
 '',
 '',
 '',
 '-TRUST-SVS',
 'BOND-T',
 'BON',
 'Z43',
 'General Purpose/ Public Imp',
 '43',
 '',
 '0.105 1.250',
 '0.095',
 '',
 'N N N',
 'N',
 '',
 'Yes',
 '',
 '3.5',
 '02/01/2013',
 'C',
 'Yes',
 'Yes',
 'Yes',
 'GP',
 '4.380',
 'Imp',
 'MW',
 '5.110',
 'General Purpose',
 'Yes',
 '',
 '',
 '',
 'Yes',
 '',
 '02/01/2014',
 '',
 'Mideast',
 'N',
 '1.450',
 '',
 'N',
 '',
 'Iowa 3.7 3.8 3.9 3.95',
 '3.5',
 '',
 'F Fixed Rate Fixed Rate',
 'Fixed Rate',
 '',
 '4.50',
 'F F',
 'F',
 '',
 '',
 '',
 '',
 '06809FAE6 06809FAG1 06809FAH9 06809FAJ5',
 '06809FAC0',
 '',
 '01/03/06',
 '',
 '02/01/2006',
 '29',
 '0',
 'No',
 '02/01/06',
 '5',
 '000',
 '',
 'No',
 '',
 'No',
 'Yes',
 '',
 '',
 '2/01/2026',
 '',
 '',
 '',
 '',
 'No',
 '02/01/26',
 'Yes',
 '02/01/2014',
 '02/01/2007',
 '02/01/07',
 'NR NR',
 '',
 'NR',
 'NR',
 'N N',
 '',
 '',
 '',
 '100.000#',
 '',
 '',
 '',
 'No',
 '',
 'Barneveld-Wisconsin',
 '',
 '12',
 'Town Vlg',
 '068084/06809F',
 'BKRS'

In [97]:
len(pheader[0])

249

In [95]:
pheader[1][248]

IndexError: list index out of range

In [91]:
#Capture start and stop positions in DF
fp_df=DataFrame({'stop':field_pos,
                 'start':Series(field_pos).shift()+1})

#Make sure we start at position 0
fp_df.ix[0,'start']=0

#Add last field position par
last_pair=DataFrame({'start':fp_df.iloc[-1]['stop']+1,
                         'stop':len(header[0])},index=[fp_df.index[-1]+1])
fp_df=pd.concat([fp_df,last_pair])

#Match up fields positions and labels
fp_df['var']=varlist

#Assign arbitrary label to first field 
fp_df.ix[0,'var']='Number'

#Convert field positions parameters to int
for var in ['start','stop']:
    fp_df[var]=fp_df[var].astype(int)

print len(field_pos),len(varlist)
fp_df

ValueError: Length of values does not match length of index

In [61]:
print fp_df.iloc[-1]['stop']+1
print fp_df.index[-1]

5136
254


So, it looks like we have repeats in our variable list...

In [62]:
dups=fp_df['var'].value_counts()[fp_df['var'].value_counts()>1]

dups

Fitch                       4
Financial Advisor           3
Paying Agent                3
Coupon Type                 3
Trustee                     3
Tender Agent                3
Issuer's Counsel            3
Bond Buyer UOP              3
Credit Enhancer             3
S&P Rating                  2
Deal Number                 2
Credit Enhance ment Type    2
Nation                      2
Maturity                    2
Dated Date                  2
Maturity Date               2
Moody Rating                2
8-Digit CUSIP               2
Managers                    2
SDC Region                  2
Project                     2
Bond Buyer Region           2
Remarketing Agent           2
dtype: int64

We can deal with this by appending the original position of the variable to the variable name, thereby making each instance unique.

In [63]:
def pos_append(varlist):
    '''Function appends position of variable to variable name to uniquely identify variables 
    that appear more than once'''
    #Create an output varlist
    varlist_out=['']*len(varlist)
    #For each variable...
    for idx,v in enumerate(varlist):
        #...identify the instances of the variable and their positions
        instances=[(i,var) for i,var in enumerate(fp_df['var'].values) if var==v]
        #...if the variable appears more than once...
        if len(instances)>1:
            #...for each item in instances...
            for item in instances:
                #...append the variable position to the duplicate instance...
                varlist_out[item[0]]=varlist[item[0]]+str(item[0])
        #...otherwise leave the variable alone
        else:
            varlist_out[idx]=varlist[idx]
    return varlist_out
    
#Make the variables unique    
fp_df['u_var']=Series(pos_append(fp_df['var'].values))

fp_df

Unnamed: 0,start,stop,var,u_var
0,0,6,Number,Number
1,7,23,Sale Date,Sale Date
2,24,36,First Sinking Fund Date,First Sinking Fund Date
3,37,46,Sink Date,Sink Date
4,47,59,Pre-Ref Date,Pre-Ref Date
5,60,79,Maturity Date,Maturity Date5
6,80,92,Maturity,Maturity6
7,93,105,Maturity Date,Maturity Date7
8,106,121,Letter of Credit Expiration Date (Maty),Letter of Credit Expiration Date (Maty)
9,122,132,Issue Dated Date,Issue Dated Date


No dups!

Now we have the field positions we need to parse the whole file.  (The inferential tool created too many columns for some reason.)

In [64]:
print 'Capturing data'
#Create container for data lines
data=[]
#Capture the 9th line forward
with open(file_test,'r') as f:
    for i in range(9):
        tmp_line=f.readline()
    #Capture line 9
    data.append(tmp_line)
    #Capture 10 through 89369
    for i in range(89359):
        data.append(f.readline())
        
print 'Processing data'
#Generate container to hold all processed header lines
data_lines=[]

#For each data line...
for i,dl in enumerate(data):
    #...process that line
    data_lines.append(delim(dl))
    if i%10000==0:
        print '>>Processing data line #',i
    
print 'Collecting data in dictionary'
#Create dictionary to hold data
data_dict={}

#For each variable...
for i,var in enumerate(fp_df['u_var']):
    #...once all lines are collected, update the dictionary
    data_dict.update({var:[data_lines[row][i] for row in range(len(data_lines))]})
    if i%50==0:
        print '>>Capturing variable #',i
    
    
#Convert data dictionary into DF
debt=DataFrame(data_dict)

Capturing data
Processing data
>>Processing data line # 0
>>Processing data line # 1000
>>Processing data line # 2000
>>Processing data line # 3000
>>Processing data line # 4000
>>Processing data line # 5000
>>Processing data line # 6000
>>Processing data line # 7000
>>Processing data line # 8000
>>Processing data line # 9000
>>Processing data line # 10000
>>Processing data line # 11000
>>Processing data line # 12000
>>Processing data line # 13000
>>Processing data line # 14000
>>Processing data line # 15000
>>Processing data line # 16000
>>Processing data line # 17000
>>Processing data line # 18000
>>Processing data line # 19000
>>Processing data line # 20000
>>Processing data line # 21000
>>Processing data line # 22000
>>Processing data line # 23000
>>Processing data line # 24000
>>Processing data line # 25000
>>Processing data line # 26000
>>Processing data line # 27000
>>Processing data line # 28000
>>Processing data line # 29000
>>Processing data line # 30000
>>Processing data lin

In [65]:
debt[fp_df['u_var']]

Unnamed: 0,Number,Sale Date,First Sinking Fund Date,Sink Date,Pre-Ref Date,Maturity Date5,Maturity6,Maturity Date7,Letter of Credit Expiration Date (Maty),Issue Dated Date,...,Security Type,Rank Eligible Flag (Y/N),SDC Est. Gross Spread,Master Deal Type,Deal Number249,Underlying S&P Short Term Rating,Underlying S&P Long Term Rating,SPSHORT,S&P Short Rating,Deal Number254
0,1,01/01/88,,,,,,,,,...,S,RV,Yes,,TE,88010188039,NR,NR,NR,NR
1,2,01/01/88,,,,,,,,,...,,T,RV,Yes,,TE,88010199039,NR,,NR
2,3,01/01/88,,,,,,,,,...,GO,Yes,,TE,88010177039,NR,NR,NR,NR,88010177039
3,,,,,,,,,,,...,,,,,,,,,,
4,4,01/02/88,,,,,,,,,...,GO,Yes,,TE,88010267039,NR,NR,NR,NR,88010267039
5,5,01/02/88,,,,,,,,,...,,T,GO,Yes,,TE,88010260039,NR,NR,NR
6,6,01/03/88,,,,,,,,,...,GO,Yes,,TE,88010361039,NR,NR,NR,NR,88010361039
7,7,01/04/88,,,,,,,,,...,S,RV,Yes,,TE,88010499039,NR,,NR,NR
8,,,,,,,,,,,...,,,,,,NR,,NR,NR,
9,8,01/04/88,,,,,,,,,...,S,GO,Yes,,TE,88010450039,NR,NR,NR,NR


In [15]:
# debt.to_csv('/some_location/debt.csv')

Success!  We can now write to disk in a place of our choosing.  Let's create a function to execute this task given an input file.

In [83]:
def txt2df(file_in):
    '''File converts debt files from txt to csv'''
    print '\n\n*** Processing '+file_in+' ***'
    
    ### CAPTURE HEADER ###
    print '--Capturing header--'
    #Create container for header lines
    header=[]
    #Capture the 4th-8th lines
    with open(file_in,'r') as f:
        for i in range(4):
            tmp_line=f.readline()
        #Capture line 4
        header.append(tmp_line)
        #Capture 5-8
        for i in range(4):
            header.append(f.readline())
    
    ### CAPTURE FIELD POSITIONS ###
    print '--Capturing field positions--'
    #Create container to hold field positions
    field_pos=[]
    #For each character in the first line...
    for i,c in enumerate(header[0]):
        #...define the test...
        old_spaces=((header[0][i-3].isspace()) & (header[0][i-2].isspace()))
        new_letters=~((header[0][i-2].isspace()) & (header[0][i-1].isspace()))
        new_field=old_spaces & new_letters
        #...if a new field has begun...
        if new_field:
            #...capture the position at which it started
            field_pos.append(i-4)
    
    ### PROCESS HEADER ###
    print '--Processing header--'
    #Generate container to hold all processed header lines
    pheader=[]
    #For each header line...
    for i,hl in enumerate(header):
        #...process that line
        pheader.append(delim(hl))
    
    ### CAPTURE VARIABLE LIST ###
    print '--Capturing clean variable list--'
    #Generate container to hold variables
    varlist=[]
    #For each variable...
    for i in range(len(pheader[0])):
        #...create a temporary container to hold the variable components from each line...
        var_tmp=[]
        #...and for each line...
        for j in range(len(pheader)):
            #...put the variable components in var_tmp...
            try:
                var_tmp.append(pheader[j][i])
            except:
                print '***',j,i
                print len(pheader),len(pheader[0])
        #...convert to string and throw the variable in varlist
        varlist.append(' '.join(var_tmp).strip())
        
    ### CAPTURE DATAFRAME WITH ALL START/STOP INFO ###
    print '--Housing field position info in DataFrame--'
    #Capture start and stop positions in DF
    fp_df=DataFrame({'stop':field_pos,
                     'start':Series(field_pos).shift()+1})
    #Make sure we start at position 0
    fp_df.ix[0,'start']=0
    #Add last field position par
    last_pair=DataFrame({'start':fp_df.iloc[-1]['stop']+1,
                         'stop':len(header[0])},index=[fp_df.index[-1]+1])
    fp_df=pd.concat([fp_df,last_pair])
    #Match up fields positions and labels
    print len(fp_df),len(varlist)
    fp_df['var']=varlist
    #Assign arbitrary label to first field 
    fp_df.ix[0,'var']='Number'
    #Convert field positions parameters to int
    for var in ['start','stop']:
        fp_df[var]=fp_df[var].astype(int)
    #Make the variables unique    
    fp_df['u_var']=Series(pos_append(fp_df['var'].values))
        
    ### PROCESS DATA AND CONVERT TO CSV ###
    print '--Capturing data--'
    #Create container for data lines
    data=[]
    #Capture the 9th line forward
    with open(file_in,'r') as f:
        for i in range(9):
            tmp_line=f.readline()
        #Capture line 9
        data.append(tmp_line)
        #Capture 10 through 89369
        for i in range(89359):
            data.append(f.readline())

    print '--Processing data--'
    #Generate container to hold all processed header lines
    data_lines=[]
    #For each data line...
    for i,dl in enumerate(data):
        #...process that line
        data_lines.append(delim(dl))
        if i%10000==0:
            print '>>>>Processing data line #',i

    print '--Collecting data in dictionary--'
    #Create dictionary to hold data
    data_dict={}
    #For each variable...
    for i,var in enumerate(fp_df['u_var']):
        #...once all lines are collected, update the dictionary
        data_dict.update({var:[data_lines[row][i] for row in range(len(data_lines))]})
        if i%50==0:
            print '>>>>Capturing variable #',i
            
    return DataFrame(data_dict)

Ok, let's test this guy.

In [84]:
# txt2df('../debt_data/6months_text_as_columns.txt')

It appears to work, so let's go ahead and generate CSV files for all the text files in the `debt_data/` folder.

In [85]:
#Capture list of files
f_list=glob.glob('../debt_data/*')

#For each file...
for f_in in f_list:
    print f_in
    #...capture the CSV form of the data...
    tmp_csv=txt2df(f_in)
    #...write it to disk...
    tmp_csv.to_csv(f_in[:-3]+'csv')
    #...and delete the DF held in memory
    del tmp_csv

../debt_data/2006to2007.csv


*** Processing ../debt_data/2006to2007.csv ***
--Capturing header--
--Capturing field positions--
--Processing header--
--Capturing clean variable list--
*** 2 470
5 474
*** 1 471
5 474
*** 2 471
5 474
*** 1 472
5 474
*** 2 472
5 474
*** 4 472
5 474
*** 1 473
5 474
*** 2 473
5 474
*** 3 473
5 474
*** 4 473
5 474
--Housing field position info in DataFrame--
15 474


ValueError: Length of values does not match length of index

In [101]:
f_list

['../debt_data/2006to2007.csv',
 '../debt_data/1990to1991.txt',
 '../debt_data/2006to2007.txt',
 '../debt_data/1996to1997.txt',
 '../debt_data/2014to2015.txt',
 '../debt_data/1986to1987.txt',
 '../debt_data/1988to1989.txt',
 '../debt_data/1986to1987.csv',
 '../debt_data/1992to1993.txt',
 '../debt_data/2010to2011.txt',
 '../debt_data/1990to1991.csv',
 '../debt_data/1998to1999.txt',
 '../debt_data/2000to2001.txt',
 '../debt_data/1994to1995.txt',
 '../debt_data/6months_text_as_columns.txt',
 '../debt_data/1996to1997.csv',
 '../debt_data/2012to2013.txt',
 '../debt_data/2014to2015.csv',
 '../debt_data/2004.txt',
 '../debt_data/2005.txt',
 '../debt_data/2002to2003.txt',
 '../debt_data/1984to1985.txt',
 '../debt_data/2008to2009.txt']

In [102]:
txt2df('../debt_data/6months_text_as_columns.txt')



*** Processing ../debt_data/6months_text_as_columns.txt ***
--Capturing header--
--Capturing field positions--
--Processing header--
--Capturing clean variable list--
--Housing field position info in DataFrame--
260 15


ValueError: Length of values does not match length of index