# Document Preparation for NLP

## Recommended Open Source Applications

1. Artifex's Ghostscript - extracting text from text PDFs
2. ImageMagick's Magick - extracting images from scanned PDFs
3. Google's Tesseract - OCR of scanned or image-captured text

### GitHub Account

https://github.com/andrewferlitsch/epipog-nlp

### Ghostscript

1. Download link : https://www.ghostscript.com/download/gsdnld.html
        
    Use the Free Version<br/>

    I have a 64-bit Windows laptop, so I am using this version: Ghostscript 9.23 for Windows (64 bit).<br/><br/>

2. Check if path to the program is in your PATH variable. 

    A. Open a command shell.<br/>
    B. Type gswin64c in the command line.<br/>
    C. If not found, add it to your path variable. For me, it is: C:\Program Files\gs\gs9.23\bin<br/>




### Magick

1. Download Link: https://www.imagemagick.org/script/download.php

    Use the static version (dynamic is for DLL inclusion).<br/>
    
    Use the 8 bits per pixel version.<br/>
    
    I have a 64-bit Windows laptop, so I am using this version: ImageMagick-7.0.8-1-Q8-x64-static.exe<br/><br/>
    
2. Check if path to the program is in your PATH variable.
 
    A. Open a command shell.<br/>
    B. Type magick in the command line.<br/>
    C. If not found, add it to your path variable. For me, it is: C:\Program Files\ImageMagick-7.0.8-Q8<br/>

### Tesseract

1. Download Link: https://github.com/tesseract-ocr/tesseract/wiki/Downloads

    A. Make sure to add the English Language training data to the tessdata subdirectory where tesseract is installed.<br/><br/>

2. Check if path to program is in your PATH variable:

    A. Open a command shell.<br/>
    B. Type tesseract in the command line.<br/>
    C. If not found, add it to your path variable. For me, it is: C:\Program Files\tesseract-Win64\<br/>
    
3. Install the English training data files in: C:\Program Files\tesseract-Win64\tessdata

    You can get a copy from my GitHub account.
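
The three PATH checks above can also be done in one go from Python. This is a minimal sketch that uses only the standard library; the executable names are the ones typed in the steps above, and `find_tools` is a helper name made up for this illustration.

```python
import shutil

# Executable names as typed in the PATH checks above (Windows builds).
TOOLS = ["gswin64c", "magick", "tesseract"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name}: {path or 'NOT FOUND - add its folder to PATH'}")
```

Any tool reported as not found needs its install folder added to the PATH variable, as described in step C of each section.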

### Ghostscript Example: Extracting Text from Text PDF

Let's try some examples using Ghostscript and PDF documents.

First, let's get the number of pages in the PDF (yes, Ghostscript's options are somewhat cryptic). We will do it on the afspa.pdf file, which turns out to have 140 pages.

In [2]:
!cd

C:\Users\'\Desktop\epipog-nlp\train


In [3]:

!gswin64c  -dBATCH -q -dNODISPLAY -c "("../plan/afspa.pdf") (r) file runpdfbegin pdfpagecount = quit"

140


Let's now try to split a PDF into individual pages. In the command below, we tell Ghostscript to split out page 1. We could
do this in a for loop and extract each page one at a time.

Note that we set the output DEVICE to pdfwrite. This tells Ghostscript to output a PDF file.

In [4]:
!gswin64c -dBATCH -dNOPAUSE -sOutputFile="../plan/afspa1.pdf" -sPageList=1 -sDEVICE=pdfwrite "../plan/afspa.pdf"

GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1.
Page 1


GPL Ghostscript 9.23: ERROR: A pdfmark destination page 140 points beyond the last page 1.
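
The for loop mentioned above could look something like the sketch below. It reuses the exact flags from cells In [3] and In [4] (as run with Ghostscript 9.23; other versions may behave differently). The function names are made up for this illustration and are not part of Epipog.

```python
import subprocess

GS = "gswin64c"  # Ghostscript console executable used in this notebook


def page_count_command(pdf_path):
    """Build the page-count command from cell In [3]."""
    postscript = f"({pdf_path}) (r) file runpdfbegin pdfpagecount = quit"
    return [GS, "-dBATCH", "-q", "-dNODISPLAY", "-c", postscript]


def split_command(pdf_path, page, out_path):
    """Build the single-page pdfwrite command from cell In [4]."""
    return [GS, "-dBATCH", "-dNOPAUSE", f"-sOutputFile={out_path}",
            f"-sPageList={page}", "-sDEVICE=pdfwrite", pdf_path]


def split_all(pdf_path, out_stem):
    """Run Ghostscript once per page and return the list of files written."""
    done = subprocess.run(page_count_command(pdf_path),
                          capture_output=True, text=True, check=True)
    count = int(done.stdout.strip())
    outputs = []
    for page in range(1, count + 1):
        out_path = f"{out_stem}{page}.pdf"
        subprocess.run(split_command(pdf_path, page, out_path), check=True)
        outputs.append(out_path)
    return outputs
```

Calling `split_all("../plan/afspa.pdf", "../plan/afspa")` would write `afspa1.pdf`, `afspa2.pdf`, and so on. Use forward slashes in the path, since a backslash is an escape character inside a PostScript string. `check=True` makes Python raise an exception if Ghostscript exits with a non-zero status.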


Let's look at what is inside the first PDF page. Is it text, a scanned image, or a mix of text and images? We can guesstimate this by looking at the PDF Resource directive inside the PDF file.

Text -> Text<br/>
ImageB -> B&W Image<br/>
ImageC -> Color Image<br/>
ImageI -> Indexed Image<br/>

We will do this using my PDFResource object. 

In [33]:
# Import Epipog PDFResource class from the pdf_res module
from pdf_res import PDFResource

In [34]:
res = PDFResource("plan/afspa1.pdf", debug=True)

PDF Version 1.5
resources  /ImageC /ImageI /Text]



In [35]:
# Let's now check whether the page is a text PDF, a scanned PDF, or a mix of text and images.
print(res.text)
print(res.image)

True
True
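
For readers without the pdf_res module, the same kind of hint can be approximated by scanning the raw PDF bytes for the four resource names listed above. This is only a rough sketch and not how PDFResource is implemented. It assumes the resource list is stored uncompressed in the file, which is not guaranteed, and `resource_hints` is a name made up for this illustration.

```python
import re

# Resource names from the list above, mapped to what they hint at.
HINTS = {b"Text": "text", b"ImageB": "image", b"ImageC": "image", b"ImageI": "image"}
PATTERN = re.compile(rb"/(Text|ImageB|ImageC|ImageI)\b")


def resource_hints(pdf_bytes):
    """Return (has_text, has_image) based on resource names found in raw PDF bytes."""
    kinds = {HINTS[match.group(1)] for match in PATTERN.finditer(pdf_bytes)}
    return ("text" in kinds, "image" in kinds)


# A byte string shaped like the debug output shown above.
print(resource_hints(b"resources [/PDF /ImageC /ImageI /Text]"))
```

To try it on a real file, read the bytes with `open("plan/afspa1.pdf", "rb").read()` and pass them in. Like the Resource directive itself, the result is only a hint.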


Let's now extract the text from this single page PDF file using Ghostscript.

Note that we set the output DEVICE to txtwrite. This tells Ghostscript to output a text file.

In [36]:
!gswin64c -dBATCH -dNOPAUSE -sOutputFile="plan/afspa1.txt" -sPageList=1 -sDEVICE=txtwrite "plan/afspa1.pdf"

GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1.
Page 1


Let's try another PDF document, which will give an unexpected result that I will explain.

In [37]:
# Extract the 1st PDF page
!gswin64c -dBATCH -dNOPAUSE -sOutputFile="plan/il1.pdf" -sPageList=1 -sDEVICE=pdfwrite "plan/il.pdf"

GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1.
Page 1


In [38]:
# Extract the text from the PDF page
!gswin64c -dBATCH -dNOPAUSE -sOutputFile="plan/il1.txt" -sPageList=1 -sDEVICE=txtwrite "plan/il1.pdf"

GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1.
Page 1


Let's view the contents of the extracted text file.

OMG. It's just a lot of unprintable ASCII control characters. What happened?

The PDF Resource directive is just a hint. It doesn't mean that it is correct. So, in this case, this is really a scanned PDF.
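
Since the hint can be wrong, one way to catch this case automatically is to check how much of the extracted text is actually printable, and fall back to OCR when it is mostly control characters. The sketch below is one such heuristic. The 0.9 threshold and the function names are assumptions made up for this illustration, not values taken from Epipog.

```python
def printable_ratio(text):
    """Fraction of characters that are printable or ordinary whitespace."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text if ch.isprintable() or ch in "\n\r\t")
    return ok / len(text)


def needs_ocr(text, threshold=0.9):
    """Guess that text extraction failed when too few characters are printable."""
    return printable_ratio(text) < threshold


print(needs_ocr("Legal Notices\nThe State Health Plan"))
print(needs_ocr("\x01\x02\x03\x04\x05 a"))
```

An empty extraction counts as a failure here, since a page with no text at all is also a candidate for OCR.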

### Ghostscript/Tesseract Example: Extracting Text from Scanned PDF

The il.pdf file appears to be a scanned PDF, so let's extract the scanned page as a PNG image using Ghostscript.

This time, we will set the output device to a grayscale PNG image. Ghostscript actually renders an image (rather than merely extracting it). This gives us an opportunity to tell Ghostscript the resolution of the generated image, which will affect the OCR quality. A good rule of thumb is 300 dpi. I've found that 72 and 150 dpi give poor OCR and 200 dpi is okay for many documents, but 300 dpi is generally good in all cases.


In [40]:
# Extract the scanned image from the PDF page
!gswin64c -dBATCH -dNOPAUSE -sOutputFile="plan/il1.png" -sPageList=1 -sDEVICE=pnggray  -r300 "plan/il1.pdf"

GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1.
Page 1


Now we will use Tesseract to extract the text from the PNG image.

In [43]:
!tesseract "plan/il1.png" "plan/il1"

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
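
The two steps above (render the page to PNG, then OCR it) can be chained from Python. This is a minimal sketch that reuses the flags from cells In [40] and In [43]. The function names are made up for this illustration, and it assumes both executables are on your PATH.

```python
import subprocess


def render_command(pdf_path, png_path, page=1, dpi=300):
    """Build the Ghostscript command that renders one page as a grayscale PNG."""
    return ["gswin64c", "-dBATCH", "-dNOPAUSE", f"-sOutputFile={png_path}",
            f"-sPageList={page}", "-sDEVICE=pnggray", f"-r{dpi}", pdf_path]


def ocr_command(png_path, out_base):
    """Build the Tesseract command; Tesseract appends .txt to out_base itself."""
    return ["tesseract", png_path, out_base]


def ocr_page(pdf_path, out_base, page=1, dpi=300):
    """Render a page, OCR it, and return the recognized text."""
    png_path = out_base + ".png"
    subprocess.run(render_command(pdf_path, png_path, page, dpi), check=True)
    subprocess.run(ocr_command(png_path, out_base), check=True)
    with open(out_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For the example above, `ocr_page("plan/il1.pdf", "plan/il1")` would write `plan/il1.png` and `plan/il1.txt` and return the text. The `dpi` default of 300 follows the rule of thumb given earlier.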


# Automated PDF, Fax, Image Capture Text Extraction with Epipog

Let's now automate all of the above and MORE. We will be using the Splitter component in my Epipog module.

Steps:
1. Import the document module
2. Create a Document object
3. Pass a PDF (text or scanned), Facsimile (TIFF) or image captured document to the Document object.
4. Wait for the results :)

In [82]:
# import Document and Page from the document module
from document import Document, Page

## Document Object

The initializer (constructor) takes the following arguments:<br/>

        document - path to the document<br/>
        dir - directory where to store extracted pages and text<br/>
        ehandler - function to invoke when processing is completed in asynchronous mode<br/>


In [83]:
doc = Document("plan/nc.pdf", "plan/nc")

resources  /ImageC /ImageI /Text]



Ok, we are done! Let's look at a page, like page 105.

Wow, that's the foreign language translation page - see how it handles other (non-Latin) character sets.

In [84]:
# Use the len() function to find out how many pages are in the document
len(doc)

105

## Page Object

Let's now dive deeper. When the document was processed, each page was put into a page object. Here are some things we can do:

1. Walk through the pages sequentially, or access one by array index.<br/>
2. See the original text from the page.<br/>
3. See the "default" NLP preprocessing of the text on the page.<br/>
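
The three items above can be combined into a short loop. The sketch below relies only on the `text` and `words` attributes that the cells in this section use. The `summarize` helper is made up for this illustration, and the demo page is a stand-in object so the code runs without the document module.

```python
from types import SimpleNamespace


def summarize(pages, preview=40):
    """Return (page_number, word_count, text_preview) for each page, counting from 1."""
    rows = []
    for number, page in enumerate(pages, start=1):
        rows.append((number, len(page.words), page.text[:preview]))
    return rows


# Stand-in page so the sketch runs on its own; with Epipog, pass doc.pages instead.
demo = [SimpleNamespace(text="Legal Notices", words=[{"tag": 0, "word": "legal"}])]
print(summarize(demo))
```

With a processed document, `summarize(doc.pages)` would give one row per page. Page numbers count from 1 while the array index counts from 0, which is why page 105 is `pages[104]`.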


In [85]:
# Let's take a look at one of the pages
pages = doc.pages
print(len(pages))
pages[104]

105


<document.Page at 0x5563f98>

In [86]:
# Let's look at the text for that page (page 105)
page = pages[104]
page.text

'Legal Notices \n         Laotian    ໂປດຊາບ:  ຖ້າວ່າ  ທ່ານເວົ້າພາສາ  ລາວ,  ການບໍລິການຊ່ວຍ\n                    ເຫຼືອດ້ານພາສາ, ໂດຍບໍ່ເສັຽຄ່າ, ແມ່ນມີພ້ອມໃຫ້ທ່ານ. \n                    ໂທຣ 919-814-4400. \n         Japanese   注意事項：日本語を話される場合、無料の言語支援をご利用いただけます。919-814-\n                    4400. \n                           Notice of Grandfather Status \n     The State Health Plan believes the 70/30 Plan is a “grandfathered health plan” under the Patient Protection \n     and Affordable Care Act (the Affordable Care Act). As permitted by the Affordable Care Act, a grandfathered \n     health plan can preserve certain basic health coverage that was already in effect when that law was enacted. \n     Being a grandfathered health plan means that your plan may not include certain consumer protections of the \n     Affordable Care Act that apply to other plans, for example, the requirement for the provision of preventive \n     health services without any cost sharing. However, grandfathered hea

In [79]:
# Let's look at the default NLP preprocessing of the text (stemming, stopword removal, punct removal)
page.words


[{'tag': 0, 'word': 'legal'},
 {'tag': 0, 'word': 'ໂປດຊາບ'},
 {'tag': 0, 'word': 'ຖ'},
 {'tag': 0, 'word': 'າວ'},
 {'tag': 0, 'word': 'າ'},
 {'tag': 0, 'word': 'ທ'},
 {'tag': 0, 'word': 'ານເວ'},
 {'tag': 0, 'word': 'າພາສາ'},
 {'tag': 0, 'word': 'ລາວ'},
 {'tag': 0, 'word': 'ການບ'},
 {'tag': 0, 'word': 'ລ'},
 {'tag': 0, 'word': 'ການຊ'},
 {'tag': 0, 'word': 'ວຍ'},
 {'tag': 0, 'word': 'ເຫ'},
 {'tag': 0, 'word': 'ອດ'},
 {'tag': 0, 'word': 'ານພາສາ'},
 {'tag': 0, 'word': 'ໂດຍບ'},
 {'tag': 0, 'word': 'ເສ'},
 {'tag': 0, 'word': 'ຽຄ'},
 {'tag': 0, 'word': 'າ'},
 {'tag': 0, 'word': 'ແມ'},
 {'tag': 0, 'word': 'ນມ'},
 {'tag': 0, 'word': 'ພ'},
 {'tag': 0, 'word': 'ອມໃຫ'},
 {'tag': 0, 'word': 'ທ'},
 {'tag': 0, 'word': 'ານ'},
 {'tag': 0, 'word': 'ໂທຣ'},
 {'tag': 0, 'word': 'japanese'},
 {'tag': 0, 'word': '注意事項'},
 {'tag': 0, 'word': '日本語を話される場合'},
 {'tag': 0, 'word': '無料の言語支援をご利用いただけます'},
 {'tag': 0, 'word': 'notice'},
 {'tag': 0, 'word': 'believe'},
 {'tag': 0, 'word': 'grandfather'},
 {'tag': 0, 'w

## Words Object

Let's directly use the Words object to control how the text is preprocessed for NLP.

In [80]:
# import the Words class
from document import Words

In [81]:

w = Words("grandfather, consumer", stopwords=True)
w.words

[{'tag': 0, 'word': 'grandfather'}, {'tag': 0, 'word': 'consum'}]

### NER (Named Entity Recognition)

In [91]:
# Let's look at a string with a name and social security number.
w = Words(" word1 word2 Jim Jones, SSN: 123-12-1234 word3", stopwords=True)

In [92]:
# Let's print the word list. Note that jim and jones are tagged 11 (Proper Name) and 123121234 is tagged 9 (SSN)
w.words

[{'tag': 0, 'word': 'word1'},
 {'tag': 0, 'word': 'word2'},
 {'tag': 11, 'word': 'jim'},
 {'tag': 11, 'word': 'jones'},
 {'tag': 9, 'word': '123121234'},
 {'tag': 0, 'word': 'word3'}]

### De-Identification

In [94]:
# Let's remove any names and SSN from our text
w = Words("  word1 word2 Jim Jones, SSN: 123-12-1234 word3", name=False, ssn=False)
w.words

[{'tag': 0, 'word': 'word1'},
 {'tag': 0, 'word': 'word2'},
 {'tag': 0, 'word': 'word3'}]
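
If you already have a tagged word list, the same de-identification can be done afterwards by filtering on the tag values. This is a small sketch that works on plain lists of dicts and does not call Epipog. The tag numbers 11 (Proper Name) and 9 (SSN) are the ones shown in the NER example, and `strip_tags` is a name made up for this illustration.

```python
PROPER_NAME = 11  # tag values as shown in the NER example above
SSN = 9


def strip_tags(words, drop=(PROPER_NAME, SSN)):
    """Return the word list without any entry whose tag is in drop."""
    return [entry for entry in words if entry["tag"] not in drop]


tagged = [
    {"tag": 0, "word": "word1"},
    {"tag": 0, "word": "word2"},
    {"tag": 11, "word": "jim"},
    {"tag": 11, "word": "jones"},
    {"tag": 9, "word": "123121234"},
    {"tag": 0, "word": "word3"},
]
print(strip_tags(tagged))
```

Passing a different `drop` tuple, for example `(SSN,)`, removes only the SSN and keeps the names.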

## THAT'S ALL FOR SESSION 1

I look forward to seeing everyone again in session 2, where we will do some serious deep diving.