Parsing PDF files is hard. It's especially hard if you want to retain the formatting of the original PDF file while extracting text. Most of the open source PDF parsers available online are good at extracting text. But when it comes to retaining the original file's structure, eh, not really.

Try tabula-py, a Python wrapper for tabula-java that extracts tables from PDF files. One look is worth a thousand words, so take a look at the demo below.

<div class="row">
    <div class="col"><img src="jupyter_images/pdf_parse_demo.png"></div>
</div>

## Installation

This installation tutorial assumes the Windows operating system. However, according to the official [tabula-py repo](https://github.com/chezou/tabula-py), tabula-py is also confirmed to work on macOS and Ubuntu.


**1. Download Java**

You can download it [here](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html). Get "Java SE Development Kit 8u201".

**2. Set environment PATH variable (Windows)**

<code>Control Panel > System and Security > System > Advanced system settings > Advanced > Environment Variables...</code>

<br>
<div class="row">
    <div class="col-6"><img src="jupyter_images/tabula_instruction_1.png" style="border: 1px solid;"></div>
    <div class="col-6"><img src="jupyter_images/tabula_instruction_1.png" style="border: 1px solid;"></div>
</div>
<br>

Make sure you have <code>Java\jdk1.8.0_201\bin</code> and <code>Java\jre1.8.0_201\bin</code> in the PATH environment variable.

Type <code>java -version</code> in your console. If you successfully installed Java and configured the environment variable, this is what you should see:

<pre class="command-line language-powershell" data-prompt="PS C:\Users\Eric>" data-output="2-5">
<code class="language-powershell">java -version
    
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
</code>
</pre>

**3. Restart Jupyter**

Any program launched from the command prompt inherits the environment variables as they were when that command prompt was opened. If you hadn't already added Java to your PATH environment variable by the time you launched this Jupyter Notebook, you need to restart your CMD console and launch Jupyter again.
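If restarting is inconvenient, one possible stopgap is to extend PATH inside the running kernel, since processes started from Python inherit `os.environ`. The following is a minimal sketch, assuming the JDK sits at the install location used in this tutorial; `prepend_to_path` is a hypothetical helper, not part of tabula-py:

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH unless it is already listed."""
    env = os.environ if env is None else env
    # Split PATH on the platform separator (';' on Windows) and drop empty entries.
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.insert(0, directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Assumed install location, based on the JDK version used in this tutorial.
JAVA_BIN = r"C:\Program Files\Java\jdk1.8.0_201\bin"

if __name__ == "__main__":
    prepend_to_path(JAVA_BIN)
    # The first PATH entry should now be the Java bin directory.
    print(os.environ["PATH"].split(os.pathsep)[0])
```

The change only lasts for the current kernel session, so setting the environment variable in Windows is still the permanent fix.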

If you are experiencing <code>FileNotFoundError</code> or <code>'java' is not recognized as an internal or external command, operable program or batch file</code> inside Jupyter or on the console, the PATH environment variable is the problem.

To check if the environment variable was actually added to your configuration, run the following code in Jupyter:
<pre>
    <code class="language-python">
        import os
        # Print each PATH entry on its own line (';' separates entries on Windows)
        s = os.environ["PATH"].split(';')
        for item in s:
            print(item)
    </code>
</pre>

Entries like these should appear in the output if everything is working:

<pre>
    <code class="language-markup">
        C:\Program Files\Java\jdk1.8.0_201\bin
        C:\Program Files\Java\jre1.8.0_201\bin
    </code>
</pre>
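A quicker check than scanning the whole list is to ask Python whether it can resolve `java` at all. The sketch below uses only the standard library, and `java_version_banner` is a hypothetical helper name. Note that `java -version` writes its banner to stderr rather than stdout.

```python
import shutil
import subprocess


def java_version_banner(executable="java"):
    """Return the text printed by `<executable> -version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    # The JVM prints its version banner to stderr; fall back to stdout if stderr is empty.
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    banner = java_version_banner()
    print(banner if banner is not None else "java was not found on PATH")
```

If this prints the "not found" message inside Jupyter while `java -version` works in a freshly opened console, the kernel was most likely started before PATH was updated.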

**4. Install Tabula-py**

This is the last step:

<pre class="command-line language-powershell" data-prompt="PS C:\Users\Eric>">
<code class="language-powershell">pip install tabula-py
</code>
</pre>

More detailed instructions are provided in the [GitHub repo](https://github.com/chezou/tabula-py) of tabula-py.
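To confirm that pip installed the package into the same environment Jupyter is using, the installed version can be read back with the standard library. This is a small sketch; `installed_version` is a hypothetical helper, and it needs Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("tabula-py")
    print(version if version else "tabula-py is not installed in this environment")
```

Knowing the version also helps when following older tutorials, since library behaviour can differ between major releases.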

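<pre>
    <code class="language-python">
        import tabula
        import pandas as pd
    </code>
</pre>

<pre>
    <code class="language-python">
        file = 'pdf_parsing/lattice-survey-single-page.pdf'
        # lattice=True detects cells from the table's ruling lines;
        # area is (top, left, bottom, right), read as percentages because relative_area=True
        df = tabula.read_pdf(file, lattice=True, area=(25, 0, 90, 100), relative_area=True)
        df
    </code>
</pre>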

|   | Int # | # | Date | Type | Top (ftKB) | Top (TVD) (ftKB) | Btm (ftKB) | Btm (TVD) (ftKB) | Job |
|---|-------|---|------|------|------------|------------------|------------|------------------|-----|
| 0 | 68 | 5 | 2/9/2016 16:19 | CONVENTIONAL PERF | 9071.0 | 8913.9 | 9073.0 | 8914.6 | OCM, 1/16/2015 00:00 |
| 1 | 68 | 4 | 2/9/2016 16:18 | CONVENTIONAL PERF | 9101.0 | 8923.8 | 9103.0 | 8924.4 | OCM, 1/16/2015 00:00 |
| 2 | 68 | 3 | 2/9/2016 16:17 | CONVENTIONAL PERF | 9131.0 | 8931.4 | 9133.0 | 8931.8 | OCM, 1/16/2015 00:00 |
| 3 | 68 | 2 | 2/9/2016 16:16 | CONVENTIONAL PERF | 9161.0 | 8936.9 | 9163.0 | 8937.2 | OCM, 1/16/2015 00:00 |
| 4 | 68 | 1 | 2/9/2016 16:15 | CONVENTIONAL PERF | 9191.0 | 8940.9 | 9193.0 | 8941.2 | OCM, 1/16/2015 00:00 |
| 5 | 67 | 5 | 2/9/2016 13:31 | CONVENTIONAL PERF | 9221.0 | 8944.2 | 9223.0 | 8944.4 | OCM, 1/16/2015 00:00 |
| 6 | 67 | 4 | 2/9/2016 13:30 | CONVENTIONAL PERF | 9251.0 | 8947.0 | 9253.0 | 8947.2 | OCM, 1/16/2015 00:00 |
| 7 | 67 | 3 | 2/9/2016 13:29 | CONVENTIONAL PERF | 9281.0 | 8949.6 | 9283.0 | 8949.7 | OCM, 1/16/2015 00:00 |
| 8 | 67 | 2 | 2/9/2016 13:28 | CONVENTIONAL PERF | 9311.0 | 8952.0 | 9313.0 | 8952.1 | OCM, 1/16/2015 00:00 |
| 9 | 67 | 1 | 2/9/2016 13:27 | CONVENTIONAL PERF | 9341.0 | 8954.2 | 9343.0 | 8954.3 | OCM, 1/16/2015 00:00 |
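The `area` argument is given as `(top, left, bottom, right)`, and with `relative_area=True` the numbers are read as percentages of the page height and width rather than as points. As a worked example, the sketch below converts the percentages used in the call above into points. The page size is an assumption (US Letter, 612 × 792 pt), since the dimensions of the sample PDF are not stated, and `relative_to_points` is a hypothetical helper, not a tabula-py function.

```python
def relative_to_points(area_pct, page_width, page_height):
    """Convert (top, left, bottom, right) percentages into PDF points."""
    top, left, bottom, right = area_pct
    # Vertical values scale with the page height, horizontal values with the width.
    return (
        page_height * top / 100,
        page_width * left / 100,
        page_height * bottom / 100,
        page_width * right / 100,
    )


if __name__ == "__main__":
    # Assumed page size: US Letter in points (1 pt = 1/72 inch).
    print(relative_to_points((25, 0, 90, 100), page_width=612, page_height=792))
```

Whatever the real page size is, `area=(25, 0, 90, 100)` skips the top quarter and the bottom tenth of the page and keeps its full width, which is a handy way to exclude headers and footers around a table.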
