# Gap (prelaunch) 0.9 - July 2018
## NLP and CV Data Engineering Framework

<b>[GitHub](https://github.com/andrewferlitsch/gap)</b>

# Automated PDF, Fax, Image Capture Text Extraction with Gap (Session 1)

Let's start with the basics. We will be using the <b style='color: saddlebrown'>SPLITTER</b> component in my Gap module.

Steps:
1. Import the <b style='color: saddlebrown'>Document</b> and <b style='color: saddlebrown'>Page</b> classes from the <b style='color: saddlebrown'>splitter</b> module.
2. Create a Document object.
3. Pass a PDF (text or scanned), facsimile (TIFF), or image-capture document to the Document object.
4. Wait for the results :)

In [1]:
# Let's change directory to where the Gap Framework is located
import os
os.chdir('../')

In [2]:
# Import Document and Page from the splitter module
from scripts.splitter import Document, Page

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\'\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\'\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## <span style='color: saddlebrown'>Document</span> Object

The initializer (constructor) takes the following arguments:<br/>

        document - path to the document
        dir      - directory where to store extracted pages and text
        ehandler - function to invoke when processing is completed in asynchronous mode
        config   - configuration settings for SYNTAX module
        
Let's start by preprocessing a 105-page PDF, which is a medical benefits plan. We should see:

- Split into individual PDF pages
- Text extracted from each page
- Individual page PDF and text stored in specified directory.


In [4]:
doc = Document("plan/nc.pdf", "plan/nc")

Ok, we are done! Let's look at a page, like page 105.

Wow, that's the foreign-language translation page. See how it handles other (non-Latin) character sets.

In [8]:
# Let's use the name property to see the name of the document
print( doc.name )

# Use the len() operator to find out how many pages are in the document
print( len(doc) )


nc
105
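The name printed above matches the file name of `plan/nc.pdf` without its directory or extension, and `len()` returns the page count. Here is a minimal sketch of how an object can offer both. `DocumentStub` is a hypothetical stand-in written for this illustration, not Gap's `Document` class, and how Gap derives the name internally is an assumption.

```python
import os


class DocumentStub:
    """Hypothetical stand-in for illustration only (not Gap's Document)."""

    def __init__(self, path, page_texts):
        self.document = path            # path as given by the caller
        self.pages = list(page_texts)   # one entry per page

    @property
    def name(self):
        # Strip the directory and the extension: 'plan/nc.pdf' -> 'nc'
        return os.path.splitext(os.path.basename(self.document))[0]

    def __len__(self):
        # len(doc) reports the number of pages
        return len(self.pages)


doc_stub = DocumentStub("plan/nc.pdf", ["page 1 text", "page 2 text", "page 3 text"])
print(doc_stub.name)
print(len(doc_stub))
```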


## <span style='color: saddlebrown'>Page</span> Object

Let's now dive deeper. When the document was processed, each page was put into a <b style='color: saddlebrown'>Page</b> object. Here are some things we can do:

1. Walk through the pages sequentially by index, like a list.<br/>
2. See the original text from the page.<br/>
3. See the "default" NLP preprocessing of the text on the page (which can be modified with config settings).<br/>
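To give a feel for what NLP preprocessing of a page involves, here is a rough sketch using only the standard library. It lowercases the text, drops punctuation and digits, and removes stopwords. The tiny stopword set and the `preprocess` function are assumptions made for this illustration; Gap's own pipeline is more thorough and is not reproduced here.

```python
import re

# Tiny illustrative stopword set (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "the", "of", "is", "that", "and", "to", "in", "can"}


def preprocess(text):
    """Lowercase, keep only runs of letters, and drop stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


sample = "A grandfathered health plan can preserve certain basic health coverage."
print(preprocess(sample))
```

The sample sentence is adapted from the text of page 105. Note that this sketch does no stemming or lemmatization.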


In [10]:
# Let's take a look at one of the pages
pages = doc.pages

# total number of pages
print(len(pages))

# Last page in the document
pages[104]

105


<document.Page at 0xf0eb9e8>

In [11]:
# Let's look at the text for that page (page 105)
page = pages[104]
page.text

'Legal Notices \n         Laotian    ໂປດຊາບ:  ຖ້າວ່າ  ທ່ານເວົ້າພາສາ  ລາວ,  ການບໍລິການຊ່ວຍ\n                    ເຫຼືອດ້ານພາສາ, ໂດຍບໍ່ເສັຽຄ່າ, ແມ່ນມີພ້ອມໃຫ້ທ່ານ. \n                    ໂທຣ 919-814-4400. \n         Japanese   注意事項：日本語を話される場合、無料の言語支援をご利用いただけます。919-814-\n                    4400. \n                           Notice of Grandfather Status \n     The State Health Plan believes the 70/30 Plan is a “grandfathered health plan” under the Patient Protection \n     and Affordable Care Act (the Affordable Care Act). As permitted by the Affordable Care Act, a grandfathered \n     health plan can preserve certain basic health coverage that was already in effect when that law was enacted. \n     Being a grandfathered health plan means that your plan may not include certain consumer protections of the \n     Affordable Care Act that apply to other plans, for example, the requirement for the provision of preventive \n     health services without any cost sharing. However, grandfathered hea

In [12]:
# Let's look at the default NLP preprocessing of the text (stemming, stopword removal, punct removal)
page.words


[{'tag': 0, 'word': 'legal'},
 {'tag': 0, 'word': 'ໂປດຊາບ'},
 {'tag': 0, 'word': 'ຖ'},
 {'tag': 0, 'word': 'າວ'},
 {'tag': 0, 'word': 'າ'},
 {'tag': 0, 'word': 'ທ'},
 {'tag': 0, 'word': 'ານເວ'},
 {'tag': 0, 'word': 'າພາສາ'},
 {'tag': 0, 'word': 'ລາວ'},
 {'tag': 0, 'word': 'ການບ'},
 {'tag': 0, 'word': 'ລ'},
 {'tag': 0, 'word': 'ການຊ'},
 {'tag': 0, 'word': 'ວຍ'},
 {'tag': 0, 'word': 'ເຫ'},
 {'tag': 0, 'word': 'ອດ'},
 {'tag': 0, 'word': 'ານພາສາ'},
 {'tag': 0, 'word': 'ໂດຍບ'},
 {'tag': 0, 'word': 'ເສ'},
 {'tag': 0, 'word': 'ຽຄ'},
 {'tag': 0, 'word': 'າ'},
 {'tag': 0, 'word': 'ແມ'},
 {'tag': 0, 'word': 'ນມ'},
 {'tag': 0, 'word': 'ພ'},
 {'tag': 0, 'word': 'ອມໃຫ'},
 {'tag': 0, 'word': 'ທ'},
 {'tag': 0, 'word': 'ານ'},
 {'tag': 0, 'word': 'ໂທຣ'},
 {'tag': 0, 'word': 'japanese'},
 {'tag': 0, 'word': '注意事項'},
 {'tag': 0, 'word': '日本語を話される場合'},
 {'tag': 0, 'word': '無料の言語支援をご利用いただけます'},
 {'tag': 0, 'word': 'notice'},
 {'tag': 0, 'word': 'believe'},
 {'tag': 0, 'word': 'health'},
 {'tag': 0, 'word':

We can see that some words appear a lot, like preventive, health and protection. Let's get information on the distribution of words in the page. There are two properties we can use for this purpose:

    freqDist - number of occurrences of each word on the page
    termFreq - fraction of the page's words accounted for by each word (TF -> Term Frequency)
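The relationship between the two properties can be sketched with `collections.Counter`: term frequency is the word count divided by the total number of words. The helper names `freq_dist` and `term_freq` and the toy word list are assumptions for this illustration, not Gap's implementation.

```python
from collections import Counter


def freq_dist(words):
    """Return (word, count) pairs, most frequent first."""
    return Counter(words).most_common()


def term_freq(words):
    """Return (word, count / total) pairs, most frequent first."""
    total = len(words)
    return [(word, count / total) for word, count in freq_dist(words)]


# Toy word list, made up for the example.
words = ["plan", "health", "plan", "cost", "plan", "health"]
print(freq_dist(words))
print(term_freq(words))
```

The values shown for page 105 in this session are consistent with this definition: a count of 10 for 'plan' and a term frequency of about 0.0971 correspond to 10 / 103.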

In [13]:
# Let's see the frequency distribution (word counts) for the page
page.freqDist

[('plan', 10),
 ('health', 8),
 ('protection', 4),
 ('apply', 3),
 ('example', 2),
 ('cost', 2),
 ('preventive', 2),
 ('service', 2),
 ('consumer', 2),
 ('benefit', 2),
 ('າ', 2),
 ('ທ', 2),
 ('share', 2),
 ('japanese', 1),
 ('ານເວ', 1),
 ('ການຊ', 1),
 ('ນມ', 1),
 ('notice', 1),
 ('cause', 1),
 ('າວ', 1),
 ('legal', 1),
 ('ແມ', 1),
 ('gov', 1),
 ('ພ', 1),
 ('elimination', 1),
 ('enact', 1),
 ('law', 1),
 ('question', 1),
 ('ານ', 1),
 ('ອມໃຫ', 1),
 ('direct', 1),
 ('provision', 1),
 ('effect', 1),
 ('status', 1),
 ('basic', 1),
 ('ລ', 1),
 ('preserve', 1),
 ('ອດ', 1),
 ('contact', 1),
 ('healthcare', 1),
 ('u', 1),
 ('current', 1),
 ('ລາວ', 1),
 ('coverage', 1),
 ('ຽຄ', 1),
 ('ເສ', 1),
 ('ານພາສາ', 1),
 ('mean', 1),
 ('ການບ', 1),
 ('無料の言語支援をご利用いただけます', 1),
 ('າພາສາ', 1),
 ('continue', 1),
 ('ໂທຣ', 1),
 ('location', 1),
 ('requirement', 1),
 ('ໂປດຊາບ', 1),
 ('ເຫ', 1),
 ('lifetime', 1),
 ('limit', 1),
 ('comp', 1),
 ('www', 1),
 ('must', 1),
 ('s', 1),
 ('注意事項', 1),
 ('ຖ', 1),
 ('permit', 

In [14]:
# Let's see the term frequency (TF)
page.termFreq

[('plan', 0.0970873786407767),
 ('health', 0.07766990291262135),
 ('protection', 0.038834951456310676),
 ('apply', 0.02912621359223301),
 ('example', 0.019417475728155338),
 ('cost', 0.019417475728155338),
 ('preventive', 0.019417475728155338),
 ('service', 0.019417475728155338),
 ('consumer', 0.019417475728155338),
 ('benefit', 0.019417475728155338),
 ('າ', 0.019417475728155338),
 ('ທ', 0.019417475728155338),
 ('share', 0.019417475728155338),
 ('japanese', 0.009708737864077669),
 ('ານເວ', 0.009708737864077669),
 ('ການຊ', 0.009708737864077669),
 ('ນມ', 0.009708737864077669),
 ('notice', 0.009708737864077669),
 ('cause', 0.009708737864077669),
 ('າວ', 0.009708737864077669),
 ('legal', 0.009708737864077669),
 ('ແມ', 0.009708737864077669),
 ('gov', 0.009708737864077669),
 ('ພ', 0.009708737864077669),
 ('elimination', 0.009708737864077669),
 ('enact', 0.009708737864077669),
 ('law', 0.009708737864077669),
 ('question', 0.009708737864077669),
 ('ານ', 0.009708737864077669),
 ('ອມໃຫ', 0.00970

## <span style='color: saddlebrown'>Document</span> Object (Advanced)

Let's look at more advanced features of the Document object.

1. Word Count and Term Frequency
2. Save and Restore
3. Asynchronous Processing of Documents
4. Scanned PDF / OCR

### Frequency Distribution

Let's look at a frequency distribution (word count) for the whole document. Note that if we look at just the top 10 word counts (after removing stopwords), it is very clear what the document is about: service, benefit, cover, health, medical, care, coverage, ...

If we look at the top 25 word counts, we can see secondary classification indicators, like: plan, medication, treatment, deductible, eligible, dependent, hospital, claim, authorization, prescription and limit.

HINT: It's a Healthcare Benefit Plan.

In [15]:
doc.freqDist

[('service', 627),
 ('provide', 579),
 ('benefit', 460),
 ('cover', 331),
 ('health', 321),
 ('care', 302),
 ('medical', 214),
 ('network', 212),
 ('coverage', 207),
 ('receive', 201),
 ('information', 192),
 ('medication', 177),
 ('s', 158),
 ('plan', 157),
 ('treat', 156),
 ('requ', 145),
 ('certification', 135),
 ('review', 130),
 ('deductible', 129),
 ('amount', 112),
 ('representative', 112),
 ('must', 110),
 ('eligible', 108),
 ('dependent', 107),
 ('prior', 106),
 ('condition', 102),
 ('time', 101),
 ('hospital', 99),
 ('claim', 96),
 ('authorization', 93),
 ('supp', 92),
 ('prescription', 92),
 ('limit', 88),
 ('available', 87),
 ('facility', 86),
 ('therapy', 83),
 ('decision', 81),
 ('program', 80),
 ('charge', 78),
 ('employee', 77),
 ('list', 76),
 ('emergency', 75),
 ('call', 74),
 ('bcbsnc', 72),
 ('period', 72),
 ('requir', 71),
 ('visit', 70),
 ('level', 70),
 ('pay', 70),
 ('license', 67),
 ('apply', 66),
 ('require', 65),
 ('date', 62),
 ('www', 62),
 ('responsible', 
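Since the list above is ordered from most to least frequent, picking out the top N words is just a slice. Here is a small sketch using the first twelve pairs copied from that output; the `top_words` helper is a name made up for this example.

```python
# First twelve (word, count) pairs copied from the doc.freqDist output above.
freq = [
    ('service', 627), ('provide', 579), ('benefit', 460), ('cover', 331),
    ('health', 321), ('care', 302), ('medical', 214), ('network', 212),
    ('coverage', 207), ('receive', 201), ('information', 192),
    ('medication', 177),
]


def top_words(freq_dist, n):
    """Return the first n words of a list already sorted by descending count."""
    return [word for word, _ in freq_dist[:n]]


print(top_words(freq, 10))
```

With a live Document object, the same helper could presumably be called as `top_words(doc.freqDist, 10)`.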

### (Re) Load

When a Document object is created, the individual PDF pages, text extraction and NLP analysis are stored. 

The document can then be subsequently reloaded from storage without reprocessing.
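The general idea of storing results once and reloading them later can be sketched as follows. This is only an illustration of the pattern: the JSON format, the file naming, and the `save_pages` / `load_pages` helpers are all assumptions, and Gap's actual on-disk layout is not described in this session.

```python
import json
import os
import tempfile


def save_pages(store_dir, name, page_texts):
    """Write one JSON file per page (hypothetical layout: <name><n>.json)."""
    os.makedirs(store_dir, exist_ok=True)
    for number, text in enumerate(page_texts, start=1):
        path = os.path.join(store_dir, f"{name}{number}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"page": number, "text": text}, handle, ensure_ascii=False)


def load_pages(store_dir, name):
    """Read pages back in order until the next numbered file is missing."""
    texts = []
    number = 1
    while True:
        path = os.path.join(store_dir, f"{name}{number}.json")
        if not os.path.exists(path):
            break
        with open(path, encoding="utf-8") as handle:
            texts.append(json.load(handle)["text"])
        number += 1
    return texts


with tempfile.TemporaryDirectory() as store:
    save_pages(store, "nc", ["page one text", "注意事項"])
    restored = load_pages(store, "nc")  # no reprocessing, just reading files

print(restored)
```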

In [17]:
# Let's first delete the Document object from memory
doc = None

In [27]:
# Let's reload the document from storage.
doc = Document()
doc.load("plan/nc.pdf", "plan/nc")

Let's show some examples of how the document was reconstructed from storage.

In [28]:
# Document Name, Number of Pages
print(doc.document)
print(len(doc))

plan/nc.pdf
105


In [31]:
# Let's print text from the last page
page = doc[104]
page.text

'                                     Legal Notices \n     LEGAL NOTICES \n     According to the applicable provisions and limitations of North Carolina General Statutes Chapter 135, the State of North \n     Carolina provides health care benefits to North Carolina teachers, state employees, retirees, members of boards and \n     commissions, and their eligible dependents, as well as others eligible such as employees of certain counties and \n     municipalities, firemen, rescue squad or emergency medical workers, members of the North Carolina Army and Air \n     National Guard, and their eligible dependents. These provisions authorize the offering of an optional health plan, which is \n     being offered in the form of a Preferred Provider Organization (PPO) plan and which is outlined in this booklet.  \n     The information contained in this booklet is supported by medical policies which are used as guides to make coverage \n     determinations. \n     For specific detailed informati

In [21]:
# Let's print the word (count) frequency distribution
doc.freqDist

[('service', 627),
 ('provide', 579),
 ('benefit', 460),
 ('cover', 331),
 ('health', 321),
 ('care', 302),
 ('medical', 214),
 ('network', 212),
 ('coverage', 207),
 ('receive', 201),
 ('information', 192),
 ('medication', 177),
 ('s', 158),
 ('plan', 157),
 ('treat', 156),
 ('requ', 145),
 ('certification', 135),
 ('review', 130),
 ('deductible', 129),
 ('representative', 112),
 ('amount', 112),
 ('must', 110),
 ('eligible', 108),
 ('dependent', 107),
 ('prior', 106),
 ('condition', 102),
 ('time', 101),
 ('hospital', 99),
 ('claim', 96),
 ('authorization', 93),
 ('supp', 92),
 ('prescription', 92),
 ('limit', 88),
 ('available', 87),
 ('facility', 86),
 ('therapy', 83),
 ('decision', 81),
 ('program', 80),
 ('charge', 78),
 ('employee', 77),
 ('list', 76),
 ('emergency', 75),
 ('call', 74),
 ('bcbsnc', 72),
 ('period', 72),
 ('requir', 71),
 ('visit', 70),
 ('level', 70),
 ('pay', 70),
 ('license', 67),
 ('apply', 66),
 ('require', 65),
 ('date', 62),
 ('www', 62),
 ('responsible', 

### Async Execution

Let's say you have PDF files arriving for processing in real time from various sources. The ehandler option provides asynchronous processing of documents. When this option is specified, the document is processed in the background on its own thread, and when processing completes the specified event handler is called.

In [22]:
def done(document):
    print("EVENT HANDLER: done")
    
doc = Document("crash_2015.pdf", ehandler=done)

EVENT HANDLER: done
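The callback pattern itself can be sketched with the standard `threading` module. This is a stand-in for illustration, not Gap's implementation: `process_async` and the placeholder result dictionary are assumptions. The point to notice is that results should only be read once the handler has fired.

```python
import threading


def process_async(path, ehandler):
    """Run a placeholder processing step on a background thread, then call ehandler."""
    def worker():
        result = {"document": path, "pages": 1}  # placeholder for real work
        ehandler(result)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


finished = threading.Event()
received = []


def done(result):
    # Event handler: record the result and signal completion
    received.append(result)
    finished.set()


thread = process_async("crash_2015.pdf", done)

# The caller is free to do other work here, then wait before reading results
finished.wait(timeout=5)
thread.join()
print(received)
```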


Let's get a frequency distribution for this document. BTW, it is a 2015 State of Oregon table of crash statistics (single page) from a multi-page report. Note how the top ten words (after stopword removal) indicate what the document is about: serious, injury, fatal, crash, highway, death.

In [37]:
doc.freqDist

[('combine', 7),
 ('serious', 5),
 ('injury', 4),
 ('fatal', 4),
 ('inj', 4),
 ('system', 3),
 ('crash', 3),
 ('miles', 2),
 ('rate', 2),
 ('death', 2),
 ('highway', 2),
 ('vmt', 2),
 ('es', 1),
 ('ys', 1),
 ('wjury', 1),
 ('deate', 1),
 ('drivinevent', 1),
 ('continue', 1),
 ('sustain', 1),
 ('jurisdiction', 1),
 ('injuriula', 1),
 ('nog', 1),
 ('pralk', 1),
 ('functional', 1),
 ('result', 1),
 ('inon', 1),
 ('vehicle', 1),
 ('crashes', 1),
 ('data', 1),
 ('total', 1),
 ('v', 1),
 ('s', 1),
 ('casualty', 1),
 ('table', 1),
 ('classification', 1),
 ('activity', 1),
 ('and', 1),
 ('prior', 1),
 ('list', 1),
 ('injuries', 1),
 ('rmal', 1),
 ('capable', 1),
 ('annual', 1),
 ('person', 1)]

### Scanned PDF / OCR

Let's now process a scanned PDF. That's a PDF which is effectively a scanned image of a text document, wrapped inside a PDF.

- Split into pages
- Extract page image
- OCR the image into text
- Extract the text
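One common way to decide when the OCR path is needed is to check whether a page carries any embedded text, and to fall back to OCR when it does not. The sketch below shows only that control flow. Whether Gap uses this particular check is an assumption, and `text_layer` and `ocr` are caller-supplied placeholder functions rather than real extraction or OCR calls.

```python
def extract_pages(pages, text_layer, ocr):
    """Return (texts, scanned) for a list of pages.

    text_layer(page) and ocr(page) are caller-supplied callables
    (placeholders here, not a real PDF or OCR library).
    """
    texts = []
    scanned = False
    for page in pages:
        text = text_layer(page)
        if not text.strip():
            # No embedded text: treat the page as an image and OCR it
            scanned = True
            text = ocr(page)
        texts.append(text)
    return texts, scanned


# Fake pages made up for the example: the first has no text layer.
fake_pages = [
    {"text": "", "image_text": "A Comprehensive Review"},
    {"text": "Typed page", "image_text": ""},
]

texts, scanned = extract_pages(
    fake_pages,
    text_layer=lambda page: page["text"],
    ocr=lambda page: page["image_text"],
)
print(scanned)
print(texts)
```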

In [23]:
# OCR the scanned PDF and extract text
doc = Document("tests/4scan.pdf", "tests")

Let's now look at a few properties of the preprocessed document.

In [24]:
# The scanned property indicates the document was a scanned PDF (true)
print( doc.scanned )

# Let's print the number of pages
print( len(doc) )

True
4


Let's now look at a page.

In [25]:
# Get the first page
page = doc[0]

page.text

'E T8 2 G R E Listening.Learningieading?\n\nF \\\n\n \n\nA Comprehensive Review\nof Published GRE®\nValidity Data\n\n \n\nA Summary from ETS'

## END OF SESSION 1