# Spark OCR in Healthcare

Spark OCR is a commercial extension of Spark NLP for optical character recognition from images, scanned PDF documents, Microsoft DOCX and DICOM files. 

In this notebook we will:
  - Import clinical notes in PDF format and store them in Delta
  - Convert the PDFs to images and improve image quality
  - Extract text from the PDFs and store the resulting text in Delta

In [0]:
%pip install transformers==4.22.1 johnsnowlabs spark-nlp spark-ocr

In [0]:
# Suppress noisy py4j log messages in the cell outputs
import logging
logger = spark._jvm.org.apache.log4j
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)

In [0]:
import os
import json
import string
#import sys
#import base64
import numpy as np
import pandas as pd
 
import sparknlp
# import sparknlp_jsl
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
# from sparknlp_jsl.base import *
# from sparknlp_jsl.annotator import *
from sparknlp.pretrained import ResourceDownloader
 
# import sparkocr
# PdfToImage and ImageToText (used later in this notebook) are defined in sparkocr.transformers
from sparkocr.transformers import *
# from sparkocr.utils import *
# from sparkocr.enums import *
 
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.training import CoNLL
 
import matplotlib.pyplot as plt
 
pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)
 
spark.sql("set spark.sql.legacy.allowUntypedScalaUDF=true")
 
print('sparknlp.version : ',sparknlp.version())
# print('sparknlp_jsl.version : ',sparknlp_jsl.version())
# print('sparkocr : ',sparkocr.version())

In [0]:
%run ./00_config

In [0]:
solacc_settings=SolAccUtil('phi_ocr')
solacc_settings.print_paths()

In [0]:
remote_url='https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/data/ocr'
for i in range(0,3):
  solacc_settings.load_remote_data(f'{remote_url}/MT_OCR_0{i}.pdf')
dbutils.fs.ls(solacc_settings.data_path)

In [0]:
pdfs_df = spark.read.format('binaryFile').load(f'{solacc_settings.data_path}/*.pdf').sort('path')
print("Number of files in the folder: ", pdfs_df.count())

## Write PDF files to the Delta bronze layer

In [0]:
pdfs_bronze_df = pdfs_df.selectExpr('sha1(path) as id','*')
display(pdfs_bronze_df)
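
The `id` column above is computed from the file path, so it is deterministic: the same path always yields the same id. The following is a minimal plain-Python sketch of that idea, assuming Spark's `sha1` hashes the UTF-8 bytes of the string and returns a 40-character hex digest. The helper `path_id` and the example paths are hypothetical and used only for illustration.

```python
import hashlib

def path_id(path: str) -> str:
    """Return the hex SHA-1 digest of a file path (UTF-8 encoded)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()

# Hypothetical paths, for illustration only.
a = path_id("dbfs:/tmp/example/MT_OCR_00.pdf")
b = path_id("dbfs:/tmp/example/MT_OCR_01.pdf")

print(len(a))                                            # length of the hex digest
print(a == path_id("dbfs:/tmp/example/MT_OCR_00.pdf"))   # same path, same id
print(a != b)                                            # these two paths get different ids
```

Because the id depends on the path and not on the file content, moving or renaming a file changes its id, and two identical files stored at different paths get different ids.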

In [0]:
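# Write the raw PDFs as a Delta table, setting the format explicitly instead of relying on the workspace default
pdfs_bronze_df.write.format('delta').mode('overwrite').save(f'{solacc_settings.delta_path}/bronze/notes_pdfs')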

## Parsing the Files through OCR (create bronze layer)

* A PDF file can have more than one page, so we first convert each document into one image per page. Then we run OCR on those images to get the text.
* We use PdfToImage() to render the PDFs as images and ImageToText() to run OCR on each image.
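
The two steps above change the shape of the data: one row per document becomes one row per page, and each page row then gets its own text. The sketch below only illustrates that shape in plain Python. `render_pages` and `recognize_text` are hypothetical stand-ins and not Spark OCR APIs, and the column names and sample documents are assumptions made up for the example.

```python
from typing import Dict, List

def render_pages(doc: Dict) -> List[Dict]:
    """Stand-in for the PDF-to-image step: emit one row per page."""
    return [
        {"path": doc["path"], "pagenum": n, "image": img}
        for n, img in enumerate(doc["pages"])
    ]

def recognize_text(page: Dict) -> Dict:
    """Stand-in for the OCR step: attach text to a single page row."""
    return {
        "path": page["path"],
        "pagenum": page["pagenum"],
        "text": f"<text of {page['image']}>",
    }

# Hypothetical input: two documents with two pages and one page.
docs = [
    {"path": "a.pdf", "pages": ["a-page-0", "a-page-1"]},
    {"path": "b.pdf", "pages": ["b-page-0"]},
]

page_rows = [row for doc in docs for row in render_pages(doc)]
text_rows = [recognize_text(row) for row in page_rows]

print(len(docs), len(page_rows), len(text_rows))  # prints: 2 3 3
```

Keeping the source path and page number on every row makes it possible to reassemble the text of a document in page order later.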

In [0]:
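# Read the bronze PDFs back from Delta, setting the format explicitly instead of relying on the workspace default
pdf_df = spark.read.format('delta').load(f'{solacc_settings.delta_path}/bronze/notes_pdfs')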

In [0]:
# Transform each PDF document into one image per page
pdf_to_image = PdfToImage()\
      .setInputCol("content")\
      .setOutputCol("image")
 