![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ocr/PDF_TEXT_NER.ipynb)

# Recognize entities in scanned PDFs

To run this yourself, you will need to upload your **Spark OCR** license keys to the notebook. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload license keys, open the file explorer on the left side of the screen and upload `workshop_license_keys.json` to the folder that opens.

For more in-depth tutorials, see https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter

## 1. Colab Setup

Install the correct version of Pillow and restart the runtime

In [None]:
# Install correct Pillow version
import PIL
if PIL.__version__  != '6.2.1':
  print ('Installing correct version of Pillow. Kernel will restart automatically')
  !pip install --upgrade pillow==6.2.1
  # hard restart runtime
  import os
  os.kill(os.getpid(), 9)
else:
  print ('Correct Pillow detected')

Read the license keys

In [None]:
import os
import json

with open('workshop_license_keys.json') as f:
    license_keys = json.load(f)

print (license_keys.keys())

secret = license_keys['JSL_OCR_SECRET']
os.environ['SPARK_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
os.environ['JSL_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
version = secret.split("-")[0]
nlp_secret = license_keys['JSL_SECRET']
jsl_version = nlp_secret.split('-')[0]
print ('Spark OCR Version:', version)
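
The cell above assumes that all three keys are present in the JSON file and that each secret starts with the library version followed by a hyphen. A minimal sketch of a check that fails early with a readable message is shown here; `parse_license_keys` is a hypothetical helper, not part of Spark OCR, and the sample values are made-up placeholders.

```python
REQUIRED_KEYS = ("JSL_OCR_SECRET", "SPARK_OCR_LICENSE", "JSL_SECRET")


def parse_license_keys(license_keys):
    """Return (ocr_version, nlp_version), or raise if a key is missing.

    Hypothetical helper: assumes each secret has the form "<version>-<token>",
    which is what the split("-")[0] calls in the notebook rely on.
    """
    missing = [key for key in REQUIRED_KEYS if not license_keys.get(key)]
    if missing:
        raise KeyError("Missing license keys: " + ", ".join(missing))
    ocr_version = license_keys["JSL_OCR_SECRET"].split("-")[0]
    nlp_version = license_keys["JSL_SECRET"].split("-")[0]
    return ocr_version, nlp_version


# Made-up placeholder values, not real secrets or real version numbers
sample = {
    "JSL_OCR_SECRET": "1.0.0-abc",
    "SPARK_OCR_LICENSE": "xyz",
    "JSL_SECRET": "2.0.0-def",
}
print(parse_license_keys(sample))
```

Calling this on the loaded `license_keys` dictionary before the install step would surface a missing key before any `pip` command runs with an empty version string.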

Install Dependencies

In [None]:
# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark, SparkOCR, and SparkNLP
!pip install --ignore-installed -q pyspark==2.4.4
# Install Spark OCR from PyPI using the secret
!python -m pip install --upgrade spark-ocr==$version  --extra-index-url https://pypi.johnsnowlabs.com/$secret
# or install from local path
# %pip install --user ../../python/dist/spark-ocr-[version].tar.gz

! python -m pip install --upgrade spark-nlp-jsl==$jsl_version --extra-index-url https://pypi.johnsnowlabs.com/$nlp_secret


Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import os

#Pyspark Imports
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import functions as F

# Necessary imports from Spark OCR library
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_image, to_pil_image
from sparkocr.metrics import score
import pkg_resources

# import sparknlp packages
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp_jsl
from sparknlp_jsl.annotator import *

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]


Start Spark Session

In [None]:
spark = start(secret=secret,
              nlp_secret=nlp_secret,
              nlp_version=jsl_version,
              nlp_internal=True)

spark

## 2. Download and read the scanned image
**To process a PDF instead, download it and use the `pdf_to_image` annotator in place of `binary_to_image` in the pipeline.**

In [None]:
!wget https://www.reneelab.com/wp-content/uploads/sites/2/2015/11/target-500x600.png -O 1.jpg

In [None]:
image_df = spark.read.format("binaryFile").load('./1.jpg').cache()
image_df.show()
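
The choice between `pdf_to_image` and `binary_to_image` depends on what the file actually contains, and the extension can mislead: the download above fetches a `.png` URL and saves it as `1.jpg`. A minimal sketch of checking the leading bytes of a file follows; `sniff_format` is a hypothetical helper, and it only knows the three signatures listed in the code.

```python
def sniff_format(content):
    """Guess the file type from its leading bytes (hypothetical helper)."""
    if content.startswith(b"%PDF"):
        return "pdf"   # use the PDF branch of the pipeline
    if content.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"   # standard 8-byte PNG signature
    if content.startswith(b"\xff\xd8\xff"):
        return "jpeg"  # JPEG start-of-image marker
    return "unknown"


# Made-up byte strings that only carry the file signatures
print(sniff_format(b"%PDF-1.4 ..."))
print(sniff_format(b"\x89PNG\r\n\x1a\n...."))
```

In the notebook, the bytes could be taken from the `content` column, for example with `image_df.select("content").first()[0]`, assuming the file is small enough to bring to the driver.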

## 3. Construct OCR & NLP pipelines

OCR Pipeline

In [None]:
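# To load a PDF instead of an image, use PdfToImage in place of BinaryToImage
# and pass its output column ("image_raw") to the next stage: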
#pdf_to_image = PdfToImage() \
#            .setInputCol("content") \
#            .setOutputCol("image_raw") \
#            .setKeepInput(True)

# Read binary as image
binary_to_image = BinaryToImage()
binary_to_image.setInputCol('content')
binary_to_image.setOutputCol('image')

# Scale image
scaler = ImageScaler()
scaler.setInputCol('image')
scaler.setOutputCol('scaled_image')
scaler.setScaleFactor(2.0)

# Binarize using adaptive thresholding
binarizer = ImageAdaptiveThresholding()
binarizer.setInputCol('scaled_image')
binarizer.setOutputCol('binarized_image')
binarizer.setBlockSize(91)
binarizer.setOffset(70)

# Remove extraneous objects from image
remove_objects = ImageRemoveObjects()
remove_objects.setInputCol('binarized_image')
remove_objects.setOutputCol('cleared_image')
remove_objects.setMinSizeObject(30)
remove_objects.setMaxSizeObject(4000)

# Apply morphological closing
morpholy_operation = ImageMorphologyOperation()
morpholy_operation.setKernelShape(KernelShape.DISK)
morpholy_operation.setKernelSize(1)
morpholy_operation.setOperation('closing')
morpholy_operation.setInputCol('cleared_image')
morpholy_operation.setOutputCol('corrected_image')

# Extract text from corrected image with OCR
ocr = ImageToText()
ocr.setInputCol('corrected_image')
ocr.setOutputCol('text')
ocr.setConfidenceThreshold(50)
ocr.setIgnoreResolution(False)

# Create pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    scaler,
    binarizer,
    remove_objects,
    morpholy_operation,
    ocr])
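
The binarization stage is configured with a block size of 91 and an offset of 70. To give a feel for what those two parameters mean, here is a minimal pure-Python sketch of the generic local-mean adaptive thresholding technique. It is an illustration only: `adaptive_threshold` is a hypothetical function, and whether `ImageAdaptiveThresholding` applies exactly this rule is an assumption.

```python
def adaptive_threshold(image, block_size, offset):
    """Binarize a grayscale image (list of rows) with a local-mean threshold.

    A pixel becomes 255 when it is brighter than the mean of the
    block_size x block_size window around it minus offset, otherwise 0.
    """
    height, width = len(image), len(image[0])
    half = block_size // 2
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders
            window = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(255 if image[y][x] > local_mean - offset else 0)
        output.append(row)
    return output


# Made-up 3x3 image: one dark pixel on a bright background
tiny = [
    [200, 200, 200],
    [200, 20, 200],
    [200, 200, 200],
]
binary = adaptive_threshold(tiny, block_size=3, offset=10)
for row in binary:
    print(row)
```

Under this rule, a larger block size averages over more of the page, and a larger offset lowers the threshold so that fewer pixels are classified as dark.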



NLP Pipeline containing **Spell Correction** and **NER**

In [None]:
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")
    
embeddings = WordEmbeddingsModel.pretrained('glove_100d').\
                    setInputCols(["document", 'checked']).\
                    setOutputCol("embeddings")

public_ner = NerDLModel.pretrained('onto_100', 'en') \
          .setInputCols(["document", "token", "embeddings"]) \
          .setOutputCol("ner")

ner_converter = NerConverter() \
                .setInputCols(["document", "token", "ner"]) \
                  .setOutputCol("ner_chunk")

nlp_pipeline =  Pipeline(stages=[documentAssembler, 
    tokenizer,
    spellModel,
    embeddings,
    public_ner,
    ner_converter])

## 4. Run OCR pipeline

In [None]:
result = pipeline.transform(image_df).cache()

## 5. Visualize Results

Display result dataframe

In [None]:
result.select("text", "confidence").show()

Display the recognized text

In [None]:
result_arr = []
for r in result.distinct().collect():
  print (r.text)
  result_arr.append(r.text)

## 6. Run NLP pipeline

In [None]:
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlp_pipeline.fit(empty_df)
df = spark.createDataFrame(pd.DataFrame({"text":result_arr}))
nlp_result = pipelineModel.transform(df)

## 7. Visualize NLP results

Contextual Spell Correction

In [None]:
nlp_result.select(F.explode(F.arrays_zip('token.result', 'checked.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("original"),
        F.expr("cols['1']").alias("corrected")).show(truncate=False)
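
The table above lists every token next to its checked form, so most rows are identical pairs. A minimal sketch of keeping only the tokens the spell checker changed is shown here. `changed_tokens` is a hypothetical helper, the sample tokens are made up rather than taken from the notebook output, and the two lists are assumed to align one-to-one, as the `arrays_zip` call above also assumes.

```python
def changed_tokens(original, corrected):
    """Return (original, corrected) pairs where the two tokens differ."""
    return [(o, c) for o, c in zip(original, corrected) if o != c]


# Made-up tokens for illustration, not output from the notebook
tokens = ["Teh", "patient", "was", "admited"]
checked = ["The", "patient", "was", "admitted"]
print(changed_tokens(tokens, checked))
```

In the notebook, the two lists could come from collecting `token.result` and `checked.result` for a single row of `nlp_result`.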

NER 

In [None]:

nlp_result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
        F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
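
Once the chunks are on the driver, it is often handy to group them by label. The sketch below assumes the same shape the query above reads: a list of chunk strings and a parallel list of metadata dictionaries with an `entity` key. `group_chunks_by_label` is a hypothetical helper, and the sample chunks and labels are made up.

```python
from collections import defaultdict


def group_chunks_by_label(chunks, metadata):
    """Map each entity label to the list of chunk texts carrying it."""
    grouped = defaultdict(list)
    for chunk, meta in zip(chunks, metadata):
        grouped[meta["entity"]].append(chunk)
    return dict(grouped)


# Made-up chunks and labels for illustration, not output from the notebook
chunks = ["John Smith", "London", "Mary Jones"]
metadata = [{"entity": "PERSON"}, {"entity": "GPE"}, {"entity": "PERSON"}]
print(group_chunks_by_label(chunks, metadata))
```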
