In [18]:
import numpy as np
import pandas as pd
import tabula

## working with tabula-py

### <font color=green>summary</font>
1. the [documentation](https://pypi.org/project/tabula-py/)
2. Python wrapper for tabula-java, the command line interface to the table extraction engine of the [Tabula](https://tabula.technology/) project. Tabula is used for investigative reporting at news organizations such as [ProPublica](https://www.propublica.org/), [Foreign Policy](https://foreignpolicy.com/), [New York Times](https://www.nytimes.com/) and others.
3. It reads tables from a PDF and attempts to return a pandas DataFrame by default, or JSON optionally
4. the docs for the Python wrapper are not very dense, but they give a helpful explanation of the optional parameters

### <font color=red>limitations</font>

1. struggles with tightly spaced text
2. splits columns and appends the pieces to the wrong (next) column
3. chops column header data
4. handles rotated sheets poorly

In [19]:
# returns None, with no error, if the document is scanned
tabula_scanned_pdf = tabula.read_pdf_table('data/pdf_data/Final Exhibit A to Equipment Lease #29.pdf', pages=4)
type(tabula_scanned_pdf)


NoneType
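
Since a scanned document gives back `None` instead of raising, later cells can fail in confusing ways. A minimal sketch of a guard, assuming the result is either `None`, a DataFrame, or a list of DataFrames (`has_table` is a hypothetical helper, not part of tabula-py):

```python
import pandas as pd


def has_table(result):
    """Return True only for a non-empty DataFrame or a list holding one.

    Hypothetical helper for these notes; not a tabula-py function.
    """
    if result is None:
        # What the scanned PDF above produced.
        return False
    if isinstance(result, pd.DataFrame):
        return not result.empty
    if isinstance(result, (list, tuple)):
        return any(isinstance(t, pd.DataFrame) and not t.empty for t in result)
    return False
```

Calling `has_table(tabula_scanned_pdf)` before any clean-up makes the "nothing was extracted" case explicit.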

In [20]:
tabula.read_pdf_table('data/pdf_data/camelot-twisted.pdf')


Unnamed: 0,Nov 01,2018 8:54:09 PM org.apache.pdfbox.util.operator.pagedrawer.FillNonZeroRule process
0,WARNING: java.lang.ArrayIndexOutOfBoundsExcept...,
1,java.lang.ArrayIndexOutOfBoundsException: 160,
2,\tat java.awt.geom.Path2D$Float$TxIterator.cur...,
3,\tat technology.tabula.ObjectExtractor.strokeO...,
4,\tat technology.tabula.ObjectExtractor.fillPat...,
5,\tat org.apache.pdfbox.util.operator.pagedrawe...,
6,\tat org.apache.pdfbox.util.PDFStreamEngine.pr...,
7,\tat org.apache.pdfbox.util.PDFStreamEngine.pr...,
8,\tat org.apache.pdfbox.util.PDFStreamEngine.pr...,
9,\tat org.apache.pdfbox.util.PDFStreamEngine.pr...,


In [21]:
# converts automatically to a dataframe by default, but tightly spaced text doesn't work well
# splits columns, appending the pieces incorrectly to the next column (first row)
# chops column header data
tabula.read_pdf_table('data/pdf_data/active_licenses.pdf')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 13, saw 9


In [16]:
tabula.read_pdf_table('data/pdf_data/active_licenses.pdf', guess=False)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 7
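
The `ParserError` text is the message pandas' C CSV parser gives when a row has more fields than the header. It looks like the extracted table is being passed through a CSV parse and the rows come out ragged, though that is an assumption about this tabula-py version. The sketch below reproduces the same kind of error with plain pandas on made-up data, then shows one possible way to read ragged rows by supplying enough column names up front:

```python
import io

import pandas as pd

# Made-up ragged CSV: the header has 2 fields, the last row has 4.
ragged = "a,b\n1,2\n3,4,5,6\n"

message = None
try:
    pd.read_csv(io.StringIO(ragged))
except pd.errors.ParserError as exc:
    # Same family of message as in the cells above.
    message = str(exc)

# Possible workaround: no header row, and as many names as the widest row.
# Short rows are padded with NaN instead of raising.
wide = pd.read_csv(io.StringIO(ragged), header=None, names=range(4))
```

This does not fix the column splitting itself; it only shows where the error message comes from and how a ragged result could still be loaded for inspection.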


### <font color=orange>useful</font>
1. auto-detects where the table structure breaks, so only table elements are returned: no title, no paragraphs
2. conversion to dataframe in <font color=green><b>8</b></font> lines
3. conversion to excel or csv in <font color=green><b>9</b></font> lines
4. has a useful ```pages``` parameter for setting the target page number(s) passed to the function, which attempts to parse them and return a df or json. takes an int, a list or 'all'
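
A minimal sketch of how the `pages` forms from point 4 could be checked before calling tabula. `check_pages` and `read_pages` are hypothetical helpers written for these notes; only `tabula.read_pdf` and its `pages` argument come from the library. The check is deliberately limited to the forms used in this notebook (an int, a list of ints, or `'all'`).

```python
def check_pages(pages):
    """Validate a `pages` value: a positive int, a list of them, or 'all'."""

    def is_page(p):
        # bool is a subclass of int, so rule it out explicitly.
        return isinstance(p, int) and not isinstance(p, bool) and p > 0

    if pages == "all":
        return pages
    if is_page(pages):
        return pages
    if isinstance(pages, (list, tuple)) and pages and all(is_page(p) for p in pages):
        return list(pages)
    raise ValueError("pages must be a positive int, a list of them, or 'all'")


def read_pages(path, pages=1):
    """Read the given pages of a PDF with tabula (needs Java installed)."""
    import tabula  # imported lazily so the check above works without Java

    return tabula.read_pdf(path, pages=check_pages(pages))
```

For example, `read_pages('data/pdf_data/active_licenses.pdf', pages=[1, 2])` mirrors the last cell of this notebook. Depending on the tabula-py version, the return value may be a single DataFrame or a list of DataFrames.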

## performs well if the document is searchable & you know, in advance, the target page with table 

In [29]:
df = tabula.read_pdf_table('data/pdf_data/exhibita-10-20-2018.pdf', pages=4)
# build full column names from the header label (minus any '.N' suffix) and the first row
cols = ['{} {}'.format(k.split('.')[0], v) for k, v in zip(df.iloc[0].index, df.iloc[0])]
df.drop([0,1,2,23,24],inplace=True)
df.columns = cols
for col in df.columns[2:]:
    df[col] = df[col].apply(lambda x: float(x.replace(',','')))
df.set_index('Payment No.', inplace=True)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 29, saw 6


In [147]:
df

Unnamed: 0_level_0,Payment Date,Payment Amount,Interest Component,Principal Component,Purchase Price
Payment No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,5/1/2018,1100264.54,273262.18,827002.36,18606566.47
2,11/1/2018,1100264.54,271872.65,828391.89,17753322.82
3,5/1/2019,1100264.54,259405.35,840859.19,16887237.86
4,11/1/2019,1100264.54,246750.42,853514.12,16008118.31
5,5/1/2020,1100264.54,233905.03,866359.51,15115768.02
6,11/1/2020,1100264.54,220866.32,879398.22,14209987.85
7,5/1/2021,1100264.54,207631.38,892633.16,13290575.7
8,11/1/2021,1100264.54,194197.25,906067.29,12357326.39
9,5/1/2022,1100264.54,180560.93,919703.61,11410031.67
10,11/1/2022,1100264.54,166719.39,933545.15,10448480.17
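
The header promotion and number clean-up used to build `df` can be tried without a PDF. This is a sketch on a small stand-in frame: the column labels and the comma-formatted strings are assumptions about what the raw extraction looked like, with values borrowed from the first two rows of the output above.

```python
import pandas as pd

# Hypothetical stand-in for the raw tabula result: the header is split between
# the column labels and row 0, and amounts are strings with thousands separators.
raw = pd.DataFrame(
    {
        "Payment": ["No.", "1", "2"],
        "Payment.1": ["Date", "5/1/2018", "11/1/2018"],
        "Payment.2": ["Amount", "1,100,264.54", "1,100,264.54"],
    }
)

# Join each label (minus the '.N' suffix pandas adds to duplicates) with row 0.
cols = [
    "{} {}".format(label.split(".")[0], first)
    for label, first in zip(raw.columns, raw.iloc[0])
]

# Drop the row that held the second half of the header, then rename.
clean = raw.drop(index=0).copy()
clean.columns = cols

# Strip thousands separators and convert to float.
clean["Payment Amount"] = (
    clean["Payment Amount"].str.replace(",", "", regex=False).astype(float)
)
clean = clean.set_index("Payment No.")
```

The row labels passed to `drop` in the original cell (`[0,1,2,23,24]`) are specific to that page, so they would need checking against each new document.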


In [24]:
df.to_excel('tabula-to-excel.xlsx')

NameError: name 'df' is not defined

In [25]:
# handles multiple pages, merging into one object to render dataframe
# unlike camelot-py which parses a document to return tables and pages individually
import camelot
camelot_page_limitation = camelot.read_pdf('data/pdf_data/active_licenses.pdf', flavor='stream')
camelot_page_limitation



<TableList n=1>

In [26]:
print('pages returned with camelot: {}'.format(camelot_page_limitation.n))
camelot_page_limitation[0].df.tail(5)

pages returned with camelot: 1


Unnamed: 0,0,1,2,3,4,5,6,7
42,,,,1522 WEST LINDSEY,,,,
43,,632575 BAW BASHU LEGENDS HYH H...,,STREET NORMAN,OK,73069.0,-,2014/07/21
44,,,DEEP FORK HOLDINGS,,,,,
45,,543149 BAW BEDLAM BAR-B-Q,LLC,610 NORTHEAST 50TH OKLAHOMA CITY,OK,73105.0,(405) 528-7427,2015/02/23
46,,,,Page 1 of 151,,,,


In [28]:
tabula_license_page1 = tabula.read_pdf_table('data/pdf_data/active_licenses.pdf',pages=1)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 13, saw 9


In [None]:
tabula_license_page2 = tabula.read_pdf('data/pdf_data/active_licenses.pdf',pages=2)

In [None]:
tabula.read_pdf('data/pdf_data/active_licenses.pdf',pages=[1,2])