![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ocr/PDF_TO_TEXT.ipynb)

# Extract Tables from PDF

To run this yourself, you will need to upload your **Spark OCR** license keys to the notebook. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload license keys, open the file explorer on the left side of the screen and upload `workshop_license_keys.json` to the folder that opens.

For more in-depth tutorials: https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter

## 1. Colab Setup

Install the correct version of Pillow and restart the runtime.

In [None]:
# Install correct Pillow version
import PIL
if PIL.__version__  != '6.2.1':
  print ('Installing correct version of Pillow. Kernel will restart automatically')
  !pip install --upgrade pillow==6.2.1
  # hard restart runtime
  import os
  os.kill(os.getpid(), 9)
else:
  print ('Correct Pillow detected')

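Read the license keys. The cell below opens `./spark_ocr.json`, so upload the key file under that name or adjust the path to match your file.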

In [None]:
import os
import json

with open('./spark_ocr.json', 'r') as f:
    license_keys = json.load(f)

secret = license_keys['JSL_OCR_SECRET']
os.environ['JSL_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
version = secret.split('-')[0]
print ('Spark OCR Version:', version)
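
The version string is taken from the secret as everything before the first hyphen. A minimal sketch of that parsing step with a basic check for missing keys, assuming the key file is a flat JSON object. The helper name `parse_ocr_keys` and the sample values are hypothetical and not part of Spark OCR:

```python
def parse_ocr_keys(license_keys):
    """Return (secret, license, version) from a dict of license keys.

    Hypothetical helper for illustration; it only assumes the two key
    names used in the cell above.
    """
    required = ('JSL_OCR_SECRET', 'SPARK_OCR_LICENSE')
    missing = [k for k in required if k not in license_keys]
    if missing:
        raise KeyError('Missing license keys: ' + ', '.join(missing))

    secret = license_keys['JSL_OCR_SECRET']
    # The version is the part of the secret before the first hyphen.
    version = secret.split('-')[0]
    return secret, license_keys['SPARK_OCR_LICENSE'], version


# Made-up values, only to show the shape of the data.
sample = {'JSL_OCR_SECRET': '1.2.3-abcdef', 'SPARK_OCR_LICENSE': 'dummy-license'}
secret_demo, license_demo, version_demo = parse_ocr_keys(sample)
print('Spark OCR Version:', version_demo)
```

Failing early on a missing key gives a clearer error than a bare `KeyError` raised in the middle of the install step.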

Install Dependencies

In [None]:
# Install Java
!apt-get update
!apt-get install -y openjdk-8-jdk
!java -version

# Install pyspark and Spark OCR
!pip install --ignore-installed -q pyspark==2.4.4
# Install Spark OCR from PyPI using the secret
!python -m pip install --upgrade spark-ocr==$version  --extra-index-url https://pypi.johnsnowlabs.com/$secret
# or install from local path
# %pip install --user ../../python/dist/spark-ocr-[version].tar.gz

Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import os

#Pyspark Imports
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

# Necessary imports from Spark OCR library
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_image, to_pil_image
from sparkocr.metrics import score
import pkg_resources

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]


Start Spark Session

In [None]:
spark = start(secret=secret)
spark

## 2. Read a sample PDF file

In [None]:

pdf_example = pkg_resources.resource_filename('sparkocr', 'resources/ocr/pdfs/tabular-pdf/data.pdf')
pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()

In [None]:
image_df = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .transform(pdf_example_df.select("content", "path"))
for r in image_df.limit(1).collect():
    display_image(r.image)

## 3. Extract tables from PDF using a single transformer

In [None]:
pdf_to_text_table = PdfToTextTable()
pdf_to_text_table.setInputCol("content")
pdf_to_text_table.setOutputCol("table")
pdf_to_text_table.setMethod("basic")
pdf_to_text_table.setGuess(True)


table = pdf_to_text_table.transform(pdf_example_df)


## 4. Post-Processing

### Raw result

In [None]:

table.select(table["table.chunks"].getItem(0)["chunkText"]).show(1, False)

### Convert to table and dataframe

In [None]:
res = table.toPandas()

In [None]:
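# Extract ALL tables and create one dataframe per document
dfs = []
for docu in res['table'].values:
    rows = []
    # docu[1] is assumed to be the `chunks` field shown in the raw result above;
    # each element iterated here is then one table row (a list of cells)
    for page in docu[1]:
        cols = []
        for row in page:
            # the first field of each cell is assumed to be its text (chunkText)
            cols.append(str(row[0]))
        rows.append(cols)

    rows = np.asarray(rows)

    # use the first extracted row as the column header
    df = pd.DataFrame(rows[1:], columns=rows[0])
    dfs.append(df)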
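
The loop above treats the first extracted row as the header and relies on every row having the same number of cells. A minimal sketch of the same header-plus-rows conversion in plain Python that also pads short rows, using made-up data. `rows_to_records` is a hypothetical helper, not a Spark OCR function:

```python
def rows_to_records(rows, fill=''):
    """Turn a list of rows (first row = header) into a list of dicts.

    Rows shorter than the header are padded with `fill`; cells beyond
    the header width are dropped.
    """
    if not rows:
        return []
    header = [str(h) for h in rows[0]]
    width = len(header)
    records = []
    for row in rows[1:]:
        cells = [str(c) for c in row[:width]]
        cells += [fill] * (width - len(cells))
        records.append(dict(zip(header, cells)))
    return records


# Made-up table with one ragged row.
example_rows = [
    ['name', 'qty', 'price'],
    ['apple', '3', '1.20'],
    ['pear', '5'],
]
records = rows_to_records(example_rows)
for record in records:
    print(record)
```

A list of dicts like this can be passed to `pd.DataFrame(records)` to build a dataframe. Note that the sketch assumes the header names are unique, since duplicate names would overwrite each other in the dicts.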

In [None]:
#first dataframe in list of dataframes
dfs[0]

In [None]:
#print all
for df in dfs:
  print (df)