# Gap Framework - Natural Language Processing
## Splitter Module

<b>[GitHub](https://github.com/andrewferlitsch/gap)</b>

# Automated PDF, Fax, Image Capture Text Extraction with Gap (Session 1)

Let's start with the basics. We will be using the <b style='color: saddlebrown'>SPLITTER</b> module in the **Gap** Framework.

Steps:
1. Import the <b style='color: saddlebrown'>Document</b> and <b style='color: saddlebrown'>Page</b> classes from the <b style='color: saddlebrown'>splitter</b> module.
2. Create a <b style='color: saddlebrown'>Document</b> object.
3. Pass a PDF (text or scanned), facsimile (TIFF) or image-captured document to the <b style='color: saddlebrown'>Document</b> object.
4. Wait for the results :)

In [None]:
# let's go to the directory where Gap Framework is installed
import os
os.chdir("../")
%ls

In [None]:
# import Document and Page from the splitter module
from gapml.splitter import Document, Page

## <span style='color: saddlebrown'>Document</span> Object

The initializer (constructor) takes the following arguments:<br/>

        document - path to the document
        dir      - directory where to store extracted pages and text
        ehandler - function to invoke when processing is completed in asynchronous mode
        config   - configuration settings for SYNTAX module
        
Let's start by preprocessing a 105-page PDF of a medical benefits plan. We should see:

- Split into individual PDF pages
- Text extracted from each page
- Individual page PDF and text stored in specified directory.

*Note: for brevity, we reduced the PDF document to 10 pages for this code-along.*

In [None]:
doc = Document("train/10nc.pdf", "train/nc")

OK, we are done! Let's look at the last page (page 105, which is page 10 in the shortened version).

Wow, that's the foreign-language translation page. See how it handles other (non-Latin) character sets.

In [None]:
# Let's use the name property to see the name of the document
print( doc.name )

# Use the len() operator to find out how many pages are in the document
print( len(doc) )


## <span style='color: saddlebrown'>Page</span> Object

Let's now dive deeper. When the document was processed, each page was put into a <b style='color: saddlebrown'>Page</b> object. Here are some things we can do:

1. Walk through each page sequentially by array index (list).<br/>
2. See the original text from the page.<br/>
3. See the "default" NLP preprocessing of the text on the page (which can be modified with *config* settings).<br/>
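To give a feel for what "default" preprocessing means in item 3, here is a minimal sketch that lowercases text, strips punctuation and drops stopwords using only the standard library. The `preprocess` function and the tiny `STOPWORDS` set are hypothetical names made up for illustration. This is not Gap's implementation, it skips stemming, and the real `page.words` output may be structured differently.

```python
import string

# Hypothetical, tiny stopword list for illustration only (an assumption,
# not the list Gap uses).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords (no stemming here)."""
    # Map every punctuation character to a space so words stay separated
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [tok for tok in tokens if tok not in STOPWORDS]

sample = "Preventive care is covered, and the plan pays for health screenings."
words = preprocess(sample)
print(words)
```

What survives is mostly the content-bearing vocabulary, which is why the word counts later in this session describe the document so well.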


In [None]:
# Let's take a look at one of the pages
pages = doc.pages

# total number of pages
print(len(pages))

# Last page in the document
pages[9]

In [None]:
# Let's look at the text for that page (page 10)
page = pages[9]
page.text

In [None]:
# Let's look at the default NLP preprocessing of the text (stemming, stopword removal, punctuation removal)
page.words

We can see that some words appear a lot, like preventive, health and protection. Let's get information on the distribution of words in the page. There are two properties we can use for this purpose:

    freqDist - The count of the number of occurrences of each word.
    termFreq - The share of all words on the page that each word accounts for (TF -> Term Frequency).
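The two measures can be sketched with `collections.Counter`. The helpers `freq_dist` and `term_freq` and the sample word list are hypothetical. The sketch returns term frequency as a fraction between 0 and 1, which is an assumption, and the properties in Gap may format their values differently.

```python
from collections import Counter

def freq_dist(words):
    """Return (word, count) pairs, most frequent first."""
    return Counter(words).most_common()

def term_freq(words):
    """Return (word, fraction of all words on the page) pairs."""
    total = len(words)
    return [(word, count / total) for word, count in freq_dist(words)]

# Hypothetical preprocessed words from one page
page_words = ["health", "plan", "health", "care", "health", "plan", "benefit", "care"]

counts = freq_dist(page_words)
tf = term_freq(page_words)
print(counts)
print(tf)
```

Here "health" appears 3 times out of 8 words, so its term frequency is 0.375.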

In [None]:
# Let's see the frequency distribution (word counts) for the page
page.freqDist

In [None]:
# Let's see the term frequency (TF)
page.termFreq

## <span style='color: saddlebrown'>Document</span> Object (Advanced)

Let's look at more advanced features of the <b style='color:saddlebrown'>Document</b> object.

1. Word Count and Term Frequency
2. Save and Restore
3. Asynchronous Processing of Documents
4. Scanned PDF / OCR

### Frequency Distribution

Let's look at a frequency distribution (word count) for the whole document. Note that if we look at just the top 10 word counts (after removing stopwords), it is very clear what the document is about: service, benefit, cover, health, medical, care, coverage, ...

If we look at the top 25 word counts, we can see secondary classification indicators, like: plan, medication, treatment, deductible, eligible, dependent, hospital, claim, authorization, prescription and limit.

HINT: It's a Healthcare Benefit Plan.
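One way to picture a document-wide word count is as the sum of the per-page counts, with the top N entries read off the result. The sketch below assumes exactly that. The page word lists are made up, and whether Gap aggregates pages this way is not confirmed here.

```python
from collections import Counter

# Hypothetical per-page word lists (already preprocessed)
pages_words = [
    ["health", "benefit", "plan", "health"],
    ["benefit", "claim", "health", "deductible"],
    ["plan", "benefit", "hospital", "health"],
]

# Sum the per-page counts into one document-wide counter
doc_counts = Counter()
for words in pages_words:
    doc_counts.update(words)

# The few most frequent words act as a rough topic signal
top3 = doc_counts.most_common(3)
print(top3)
```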

In [None]:
doc.freqDist

### (Re) Load

When a <b style='color:saddlebrown'>Document</b> object is created, the individual PDF pages, text extraction and NLP analysis are stored. 

The document can then be subsequently reloaded from storage without reprocessing.
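The idea of storing results once and reloading them later can be sketched with plain JSON files, one per page. The functions `store_page` and `load_pages` and the `pageN.json` naming are hypothetical. Gap's actual on-disk layout is not shown in this session and may differ.

```python
import json
import os
import tempfile

def store_page(directory, page_number, text, words):
    """Write one page's text and word list to a JSON file."""
    path = os.path.join(directory, f"page{page_number}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"text": text, "words": words}, f)

def load_pages(directory):
    """Read back every stored page, ordered by page number."""
    pages = []
    # "page12.json"[4:-5] -> "12", so sort numerically rather than by string
    for name in sorted(os.listdir(directory), key=lambda n: int(n[4:-5])):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            pages.append(json.load(f))
    return pages

with tempfile.TemporaryDirectory() as workdir:
    store_page(workdir, 1, "Covered services", ["cover", "service"])
    store_page(workdir, 2, "Claims and limits", ["claim", "limit"])
    # Reload without redoing any extraction or NLP work
    restored = load_pages(workdir)

print(len(restored), restored[1]["words"])
```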

In [None]:
# Let's first delete the Document object from memory
doc = None

In [None]:
# Let's reload the document from storage.
doc = Document()
doc.load("train/10nc.pdf", "train/nc")

Let's show some examples of how the document was reconstructed from storage.

In [None]:
# Document Name, Number of Pages
print(doc.document)
print(len(doc))

In [None]:
# Let's print text from the last page
page = doc[9]
page.text

In [None]:
# Let's print the word (count) frequency distribution
doc.freqDist

### Async Execution

Let's say you have PDF files arriving for processing in real time from various sources. The *ehandler* option provides asynchronous processing of documents. When this option is specified, the document is processed on a separate thread, and the specified event handler is called when processing completes.
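The callback pattern described here can be sketched with the standard `threading` module. The `process_async` function is a hypothetical stand-in for the real processing step and is not Gap's code. The `threading.Event` shows one way a caller might wait until the handler has fired before reading results.

```python
import threading

def process_async(path, ehandler):
    """Run a stand-in processing step on a background thread, then call ehandler."""
    def worker():
        result = {"path": path, "pages": 1}  # placeholder for real processing
        ehandler(result)
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

finished = threading.Event()
results = []

def done(result):
    # Event handler: record the result and signal completion
    results.append(result)
    finished.set()

thread = process_async("train/crash_2015.pdf", done)
completed = finished.wait(timeout=5)  # block until the handler has run
thread.join()
print(completed, results)
```

Waiting on the event matters because anything read before the handler fires may reflect a document that is only partly processed.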

In [None]:
def done(document):
    print("EVENT HANDLER: done")
    
doc = Document("train/crash_2015.pdf", "train/crash", ehandler=done)

Let's get a frequency distribution for this document. BTW, it is a 2015 State of Oregon table of crash statistics (single page) from a multi-page report. Note how the top ten words (after stopword removal) indicate what the document is about: serious, injury, fatal, crash, highway, death.

In [None]:
doc.freqDist

### Scanned PDF / OCR

Let's now process a scanned PDF, that is, a PDF that is effectively a scanned image of a text document wrapped inside a PDF.

- Split into pages
- Extract page image
- OCR the image into text
- Extract the text

In [None]:
Document.SCANCHECK = 0

# OCR the scanned PDF and extract text
doc = Document("train/4scan.pdf", "train/4scan")

Let's now look at a few properties of the preprocessed document.

In [None]:
# The scanned property indicates whether the document was a scanned PDF (True here)
print( doc.scanned )

# Let's print the number of pages
print( len(doc) )

Let's now look at a page.

In [None]:
# Get the first page
page = doc[0]

page.text

In [None]:
# Clean up the output directories created during this session
import shutil
shutil.rmtree('train/nc')
shutil.rmtree('train/4scan')
shutil.rmtree('train/crash')

## END OF SESSION 1