In [5]:
%%html
<link href="https://aegis4048.github.io/theme/libs/prism.css" rel="stylesheet" />
<link href="https://aegis4048.github.io/jupyter_custom.css" rel="stylesheet" />
<script src="https://aegis4048.github.io/theme/libs/prism.js"></script>
<script src="https://aegis4048.github.io/jupyter_custom.js"></script>

If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is: it's hard to copy and paste rows of data out of PDF files. It's especially hard if you want to retain the formatting of the data while extracting text. Most open-source PDF parsers are good at extracting text, but when it comes to retaining the file's structure, not really. Try [tabula-py](https://github.com/chezou/tabula-py) to extract data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. One look is worth a thousand words, so take a look at the demo screenshot.
 
<div class="row">
    <div class="col"><img src="jupyter_images/pdf_parse_demo.png"></div>
</div>

## Installations

This installation tutorial assumes that you are using the Windows operating system. However, according to the official [tabula-py documentation](https://github.com/chezou/tabula-py#os), tabula-py has also been confirmed to work on macOS and Ubuntu.


**1. Download Java**

Tabula-py is a Python wrapper for tabula-java: it translates Python commands into Java commands, so that we Python programmers don't have to bother learning Java. As the name "tabula-java" suggests, it requires Java. You can download Java [here](https://www.java.com/en/).

**2. Set environment PATH variable (Windows)**

One thing that I don't like about Windows is that it's difficult to use a newly downloaded program from a console environment such as Python or a CMD window. But oh well, if you are a Windows user, you have to go through this extra step to allow Python to use Java. If you are a macOS or Ubuntu user, you probably don't need this step.

Find where Java is installed, and go to <code>Control Panel > System and Security > System > Advanced system settings > Advanced > Environment Variables...</code> to set the PATH environment variable for Java.
<div class="row give-margin">
    <div class="col-6"><img src="jupyter_images/tabula_instruction_1.png" style="border: 1px solid;"></div>
    <div class="col-6"><img src="jupyter_images/tabula_instruction_2.png" style="border: 1px solid;"></div>
</div>

Make sure you have <code>Java\jdk1.8.0_201\bin</code> and <code>Java\jre1.8.0_201\bin</code> in the PATH environment variable. Then, type <code>java -version</code> in a CMD window. If you successfully installed Java and configured the environment variable, you should see something like this:

<pre class="command-line language-powershell" data-prompt="PS C:\Users\Eric>" data-output="2-5">
<code class="language-powershell">java -version
    
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
</code>
</pre>

If you don't see something like this, it means that the PATH environment variable for Java is not configured properly.
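
The same check can be run from Python, which is useful because it shows what the Python process itself can see. This is a minimal sketch using only the standard library; the helper name `java_version_banner` is made up for this example.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")  # full path to the executable, or None
    if java is None:
        return None
    # `java -version` usually writes its banner to stderr, so fall back to stdout only if stderr is empty.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


print(java_version_banner())
```

If this prints `None`, Python cannot find `java` on its PATH, even if a CMD window can.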

**3. Re-start Your Command Prompt**

Any program invoked from the command prompt is given the environment variables that existed at the time the command prompt was opened. If you launched your Python console or Jupyter Notebook before you updated the PATH environment variable, you need to restart it. Otherwise the change in the environment variable will not be reflected.

If you are experiencing <code>FileNotFoundError</code> or <code>'java' is not recognized as an internal or external command, operable program or batch file</code> inside Jupyter or a Python console, the environment variable is the problem. Either you set it incorrectly, or your command prompt is not reflecting the change you made to it.

To check if the change in the environment variable was reflected, run the following code in Jupyter or Python console:
<pre>
    <code class="language-python">
        import os

        # PATH entries are separated by ';' on Windows; os.pathsep also covers ':' on macOS and Linux
        for item in os.environ["PATH"].split(os.pathsep):
            print(item)
    </code>
</pre>

Entries like these should appear in the output if everything is working fine:

<pre>
    <code class="language-markup">
        C:\Program Files\Java\jdk1.8.0_201\bin
        C:\Program Files\Java\jre1.8.0_201\bin
    </code>
</pre>
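
On a long PATH it can be easier to print only the Java-related entries. This is a small sketch, assuming the install directory contains the word "java" (as in the paths above); `java_path_entries` is a hypothetical helper name.

```python
import os


def java_path_entries(path_value, sep=os.pathsep):
    """Return the PATH entries whose text mentions 'java' (case-insensitive)."""
    return [entry for entry in path_value.split(sep) if "java" in entry.lower()]


# An empty list here means no Java directory is visible to this Python process.
print(java_path_entries(os.environ.get("PATH", "")))
```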

**4. Install Tabula-py**

This is the last step:
<pre class="command-line language-powershell" data-prompt="C:\Users\Eric>">
<code class="language-powershell">pip install tabula-py
</code>
</pre>

More detailed instructions are provided in the [GitHub repo](https://github.com/chezou/tabula-py) of tabula-py.
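
To confirm that the package landed in the environment your notebook is actually using, you can ask Python for the installed version. This is a sketch built on the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a made-up helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="tabula-py"):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version())
```

A result of `None` usually means `pip` installed into a different Python environment than the one running this code.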

## Tabula Web Application

Tabula also comes as a web application for parsing PDF files. You do not need it to use tabula-py, but from personal experience I strongly recommend it because it really helps you debug issues with tabula-py. For example, I was trying to parse hundreds of PDF files at once, and for some reason tabula-py would return a <code>NoneType</code> object instead of a <code>pd.DataFrame</code> object (by default, tabula-py extracts tables into a DataFrame) for one PDF file. There was nothing wrong with my code, and yet it would just not parse the file. So I tried opening it in the tabula web-app, and realized that it was actually a scanned PDF file, which tabula is unable to parse.

Long story short, if a file can be parsed with the tabula web-app, you can replicate the result with tabula-py. If the tabula web-app can't parse it, you should look for a different tool to meet your needs.
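
When parsing many files at once, it helps to record which files came back empty so you know which ones to open in the web-app. The sketch below is one possible way to do that, not part of tabula-py; `parse_many` and `fake_reader` are hypothetical names, and the reader is passed in as a function so the example runs without Java or real PDFs.

```python
def parse_many(paths, reader):
    """Run reader(path) on each path; return (parsed results, list of failures)."""
    parsed = {}
    failed = []
    for path in paths:
        try:
            result = reader(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failed.append((path, f"error: {exc}"))
            continue
        # None or an empty result means no table was extracted (e.g. a scanned PDF)
        if result is None or len(result) == 0:
            failed.append((path, "no table extracted"))
        else:
            parsed[path] = result
    return parsed, failed


def fake_reader(path):
    # Stand-in for tabula.read_pdf, for demonstration only
    return None if "scanned" in path else [["row 1"], ["row 2"]]


parsed, failed = parse_many(["report.pdf", "scanned.pdf"], fake_reader)
print(failed)
```

With tabula-py installed, the reader could be something like `lambda p: tabula.read_pdf(p, lattice=True)`.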

**Installations**

If you already configured the PATH environment variable for Java, all you need to do is download the .zip file [here](https://tabula.technology/) and run <code>tabula.exe</code>. That's it. Tabula has a really nice web interface in which you can parse tables from PDFs by just clicking buttons.

<div class="alert alert-info">
    <h4>Note</h4>
    <p>The web-app will automatically open in your browser at <strong>127.0.0.1:8080</strong> on localhost. If port 8080 is already being used by another process, you will need to shut that process down. But normally you don't have to worry about this.</p>
</div>

**Screenshots**

This is what you will see when you launch <code>tabula.exe</code>. <code>Browse</code> to the PDF file you want to parse, then <code>import</code> it.

<div class="row give-margin">
    <div class="col"><img src="jupyter_images/tabula-webapp.png"></div>
</div>

You can either use <code>Autodetect Tables</code> or drag your mouse to choose the area of interest. If the PDF file has a complicated structure, it is usually better to choose the area manually. Also, note the option <code>Repeat to All Pages</code>. Selecting this option applies the area you chose to all pages.

<div class="row give-margin">
    <div class="col"><img src="jupyter_images/tabula-webapp_2.png"></div>
</div>

Here's the output. The <code>Lattice</code> and <code>Stream</code> options are discussed in detail later.

<div class="row give-margin">
    <div class="col"><img src="jupyter_images/tabula-webapp_3.png"></div>
</div>

## Running Tabula-py

Tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them as a CSV, TSV, or JSON file. Some basic code examples are as follows:

<pre>
    <code class="language-python">
        import tabula

        # Read pdf into DataFrame (options such as pages, area or lattice are passed as keyword arguments)
        df = tabula.read_pdf("test.pdf", pages="all")

        # Read remote pdf into DataFrame
        df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

        # convert PDF into CSV
        tabula.convert_into("test.pdf", "output.csv", output_format="csv")

        # convert all PDFs in a directory
        tabula.convert_into_by_batch("input_directory", output_format='csv')
    </code>
</pre>
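
One caveat, stated as an assumption about your installed version rather than something the examples in this post show: the outputs here come from a version in which <code>read_pdf()</code> returns a single DataFrame (or nothing at all when no table is found), whereas newer tabula-py releases return a list of DataFrames by default. A small, hypothetical helper like `as_table_list` lets the rest of your code handle both cases the same way.

```python
def as_table_list(result):
    """Normalize a read_pdf-style result to a list of tables."""
    if result is None:  # nothing was extracted
        return []
    if isinstance(result, list):  # already a list of tables
        return result
    return [result]  # a single table


# Plain strings stand in for DataFrames here
print(as_table_list("single table"))
print(as_table_list(["table 1", "table 2"]))
print(as_table_list(None))
```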

**Area Selection**

You choose the portion of the PDF you want to analyze by setting the <code>area</code> option (top, left, bottom, right) in <code>tabula.read_pdf()</code>. This is equivalent to dragging your mouse to select the area of interest in the tabula web-app, as mentioned above. Also note that you can choose the page or pages you want to parse with the <code>pages</code> option.

According to the official documentation:

<pre>
    <code class="language-markup">
        pages (str, int, list of int, optional)

            An optional values specifying pages to extract from. It allows str, int, list of int.
            Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
    </code>
</pre>    
    
The sample PDF file can be downloaded from [here](https://github.com/aegis4048/aegis4048.github.io-source/blob/master/content/downloads/notebooks/pdf_parsing/lattice-timelog-multiple-pages.pdf).

In [30]:
file = 'pdf_parsing/lattice-timelog-multiple-pages.pdf'
df = tabula.read_pdf(file, lattice=True, pages=2, area=(406, 24, 695, 589))

In [31]:
df

Unnamed: 0,Start Date,End Date,(hr),Activity,Activity Detail,Operation,Com
0,12/13/2014\r06:00,12/13/2014\r09:00,3.0,SURF-DRILL,DRILL SURFACE,DRL,Rotate from 1600' to 1859' (259' @ 8 fph). WOB...
1,12/13/2014\r09:00,12/13/2014\r11:00,2.0,SURF-CIRC,CIRCULATE,CIRC,Pump 2- 50 bbl hi vis sweep; Circulate to surface
2,12/13/2014\r11:00,12/13/2014\r14:00,3.0,SURF-TRIP,TOOH,TRIP,TOOH (Slick off bottom) f/1859' to 108' (SLM)...
3,12/13/2014\r14:00,12/13/2014\r16:00,2.0,PLAN,EQUIP,BHA,"PJSA - Break bit & L/D directional BHA, clean ..."
4,12/13/2014\r16:00,12/13/2014\r17:30,1.5,PLAN,DRLG,CSG,PTJSA / R/U Pipe Pros.Csg. tools / PTJSA on ru...
5,12/13/2014\r17:30,12/13/2014\r18:00,0.5,PLAN,DRLG,CSG,Make up 13 3/8 Gemco PDC drillable float shoe;...
6,12/13/2014\r18:00,12/13/2014\r18:30,0.5,PLAN,PERS,SFTY,"HJSM with Morning tour crew, Pipe Pro casing c..."
7,12/13/2014\r18:30,12/13/2014\r23:30,5.0,PLAN,DRLG,CSG,"Make up 13 /8"" PDC drillable float collar onto..."
8,12/13/2014\r23:30,12/14/2014\r01:30,2.0,SURF-CIRC,CIRCULATE,CIRC,HJSM on Hoisting personal; Make up Swedge in ...
9,12/14/2014\r01:30,12/14/2014\r03:30,2.0,PLAN,DRLG,CSG,"Run 13 3/8""J-55 54.5 BTC f/ 1,639' to 1,819';..."
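
Note that the date columns contain a carriage return (<code>\r</code>) between the date and the time, because the two sit on separate lines inside the PDF cell. One possible way to clean this up is sketched below using only the standard library. The function name `parse_timestamp` is made up, and the format string assumes the month/day/year layout seen in the output above.

```python
from datetime import datetime


def parse_timestamp(cell):
    """Turn a cell like '12/13/2014\\r06:00' into a datetime object."""
    cleaned = cell.replace("\r", " ")  # put the date and time on one line
    return datetime.strptime(cleaned, "%m/%d/%Y %H:%M")


print(parse_timestamp("12/13/2014\r06:00"))
```

On the DataFrame itself, something like `df["Start Date"].map(parse_timestamp)` would apply the same cleaning to a whole column.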


Alternatively, you can specify the area as percentages of the page by setting <code>relative_area=True</code>. For this specific PDF file, the option <code>area=(50, 5, 92, 100), relative_area=True</code> below selects roughly the same region as <code>area=(406, 24, 695, 589)</code> above.

In [35]:
file = 'pdf_parsing/lattice-timelog-multiple-pages.pdf'
df = tabula.read_pdf(file, lattice=True, pages=2, area=(50, 5, 92, 100), relative_area=True)

In [36]:
df

Unnamed: 0,Start Date,End Date,Dur (hr),Activity,Activity Detail,Operation,Com
0,2/13/2014\r6:00,12/13/2014\r09:00,3.0,SURF-DRILL,DRILL SURFACE,DRL,Rotate from 1600' to 1859' (259' @ 8 fph). WOB...
1,2/13/2014\r9:00,12/13/2014\r11:00,2.0,SURF-CIRC,CIRCULATE,CIRC,Pump 2- 50 bbl hi vis sweep; Circulate to surface
2,2/13/2014\r1:00,12/13/2014\r14:00,3.0,SURF-TRIP,TOOH,TRIP,TOOH (Slick off bottom) f/1859' to 108' (SLM)...
3,2/13/2014\r4:00,12/13/2014\r16:00,2.0,PLAN,EQUIP,BHA,"PJSA - Break bit & L/D directional BHA, clean ..."
4,2/13/2014\r6:00,12/13/2014\r17:30,1.5,PLAN,DRLG,CSG,PTJSA / R/U Pipe Pros.Csg. tools / PTJSA on ru...
5,2/13/2014\r7:30,12/13/2014\r18:00,0.5,PLAN,DRLG,CSG,Make up 13 3/8 Gemco PDC drillable float shoe;...
6,2/13/2014\r8:00,12/13/2014\r18:30,0.5,PLAN,PERS,SFTY,"HJSM with Morning tour crew, Pipe Pro casing c..."
7,2/13/2014\r8:30,12/13/2014\r23:30,5.0,PLAN,DRLG,CSG,"Make up 13 /8"" PDC drillable float collar onto..."
8,2/13/2014\r3:30,12/14/2014\r01:30,2.0,SURF-CIRC,CIRCULATE,CIRC,HJSM on Hoisting personal; Make up Swedge in ...
9,2/14/2014\r1:30,12/14/2014\r03:30,2.0,PLAN,DRLG,CSG,"Run 13 3/8""J-55 54.5 BTC f/ 1,639' to 1,819';..."



If <code>area</code> is not set, the default is the entire page.
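
To move between the two ways of writing an area, divide each coordinate by the page height (for top and bottom) or the page width (for left and right). The sketch below assumes that the absolute coordinates are PDF points and that the page is US Letter (612 x 792 points). Both are assumptions for illustration, so check the real page size of your own file. `to_relative_area` is a hypothetical helper, not a tabula-py function.

```python
def to_relative_area(area, page_width, page_height):
    """Convert (top, left, bottom, right) in points to percentages of the page."""
    top, left, bottom, right = area
    return (
        100 * top / page_height,
        100 * left / page_width,
        100 * bottom / page_height,
        100 * right / page_width,
    )


# Assumed page size: US Letter, 612 x 792 points
relative = to_relative_area((406, 24, 695, 589), page_width=612, page_height=792)
print(tuple(round(value, 1) for value in relative))
```

Under that page-size assumption the result lands close to, but not exactly on, the hand-picked <code>(50, 5, 92, 100)</code>, which would explain small differences between the two outputs.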


**Lattice Mode vs Stream Mode**

Tabula supports two primary modes of table extraction: lattice mode and stream mode.

<p><u>Lattice Mode</u></p>

Lattice mode forces the PDF to be extracted using lattice-mode extraction. It recognizes each cell based on ruling lines, that is, the borders of each cell.

<p><u>Stream Mode</u></p>

Stream mode forces the PDF to be extracted using stream-mode extraction. This mode is used when there are no ruling lines to separate one cell from another. Instead, it uses the spacing between cells to recognize each cell.

<div class="row give-margin">
    <div class="col-md-6 col-sm-12">
        <div class="col-12"><img src="jupyter_images/lattice_mode.png" style="border: 1px solid; height: 380px"></div>
        <div class="col-12"><p class="image-description">Lattice mode recommended</p></div>
    </div>
    <div class="col-md-6 col-sm-12">
        <div class="col-12"><img src="jupyter_images/stream_mode_2.png" style="border: 1px solid; height: 380px"></div>
        <div class="col-12"><p class="image-description">Stream mode recommended</p></div>
    </div>
</div>
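
In tabula-py the two modes are selected with the <code>lattice=True</code> and <code>stream=True</code> keyword arguments of <code>tabula.read_pdf()</code>. The sketch below only builds the keyword arguments, so it runs without Java or a PDF; `mode_options` is a made-up helper name.

```python
def mode_options(has_ruling_lines):
    """Pick the extraction-mode keyword argument for tabula.read_pdf()."""
    # Tables with cell borders suit lattice mode; borderless tables suit stream mode
    return {"lattice": True} if has_ruling_lines else {"stream": True}


print(mode_options(True))
print(mode_options(False))

# With tabula-py installed, the options would be passed on like this:
# df = tabula.read_pdf(file, **mode_options(has_ruling_lines=True))
```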

**Lattice**

In [6]:
import tabula
import pandas as pd

In [17]:
file = 'pdf_parsing/lattice-survey-single-page.pdf'
df = tabula.read_pdf(file, lattice=True, area=(25, 0, 90, 100), relative_area=True)
df

Unnamed: 0,Int #,#,Date,Type,Top (ftKB),Top (TVD) (ftKB),Btm (ftKB),Btm (TVD) (ftKB),Job
0,68,5,2/9/2016 16:19,CONVENTIONAL PERF,9071.0,8913.9,9073.0,8914.6,"OCM, 1/16/2015 00:00"
1,68,4,2/9/2016 16:18,CONVENTIONAL PERF,9101.0,8923.8,9103.0,8924.4,"OCM, 1/16/2015 00:00"
2,68,3,2/9/2016 16:17,CONVENTIONAL PERF,9131.0,8931.4,9133.0,8931.8,"OCM, 1/16/2015 00:00"
3,68,2,2/9/2016 16:16,CONVENTIONAL PERF,9161.0,8936.9,9163.0,8937.2,"OCM, 1/16/2015 00:00"
4,68,1,2/9/2016 16:15,CONVENTIONAL PERF,9191.0,8940.9,9193.0,8941.2,"OCM, 1/16/2015 00:00"
5,67,5,2/9/2016 13:31,CONVENTIONAL PERF,9221.0,8944.2,9223.0,8944.4,"OCM, 1/16/2015 00:00"
6,67,4,2/9/2016 13:30,CONVENTIONAL PERF,9251.0,8947.0,9253.0,8947.2,"OCM, 1/16/2015 00:00"
7,67,3,2/9/2016 13:29,CONVENTIONAL PERF,9281.0,8949.6,9283.0,8949.7,"OCM, 1/16/2015 00:00"
8,67,2,2/9/2016 13:28,CONVENTIONAL PERF,9311.0,8952.0,9313.0,8952.1,"OCM, 1/16/2015 00:00"
9,67,1,2/9/2016 13:27,CONVENTIONAL PERF,9341.0,8954.2,9343.0,8954.3,"OCM, 1/16/2015 00:00"


In [3]:
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")


In [4]:
df2

Unnamed: 0,مرحباً,اسمي سلطان
0,انا من ولاية كارولينا الشمال,من اين انت؟
1,1234,عندي 47 قطط
2,هل انت شباك؟,اسمي Jeremy في الانجليزية
3,Jeremy is جرمي in Arabic,



First, you need sample PDF files. You can use your own PDF files, or download the sample PDF files from my [GitHub repo](https://github.com/aegis4048/aegis4048.github.io-source/tree/master/content/downloads/notebooks/pdf_parsing).

<div class="alert alert-info">
    <h4>Notes on Sample Data</h4>
    <p>The sample PDF files used in this tutorial are publicly available data from <a href="http://www.utlands.utsystem.edu/API/4200346352">University Lands</a>.</p>
</div>