# Gap Framework - Natural Language Processing
## Splitter Module

<b>[GitHub](https://github.com/andrewferlitsch/gap)</b>

# Automated PDF, Fax, Image Capture Text Extraction with Gap (Session 1)

Let's start with the basics. We will be using the <b style='color: saddlebrown'>SPLITTER</b> module in the **Gap** Framework.

Steps:
1. Import the <b style='color: saddlebrown'>Document</b> and <b style='color: saddlebrown'>Page</b> classes from the <b style='color: saddlebrown'>splitter</b> module.
2. Create a <b style='color: saddlebrown'>Document</b> object.
3. Pass a PDF (text or scanned), Facsimile (TIFF) or image captured document to the <b style='color: saddlebrown'>Document</b> object.
4. Wait for the results :)

In [1]:
# let's go to the directory where Gap Framework is installed
import os
os.chdir("../")
!cd

C:\Users\'\Desktop\Gap-ml


In [2]:
# import Document and Page from the splitter module
from splitter import Document, Page

[nltk_data] Error loading wordnet: <urlopen error [Errno 11004]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno 11004] getaddrinfo failed>


## <span style='color: saddlebrown'>Document</span> Object

The initializer (constructor) takes the following arguments:<br/>

        document - path to the document
        dir      - directory where to store extracted pages and text
        ehandler - function to invoke when processing is completed in asynchronous mode
        config   - configuration settings for the SYNTAX module
        
Let's start by preprocessing a 105-page PDF, which is a medical benefits plan. We should see:

- Split into individual PDF pages
- Text extracted from each page
- Individual page PDF and text stored in specified directory.

*Note: for brevity, we reduced the PDF document to 10 pages for this code-along.*

In [3]:
doc = Document("train/10nc.pdf", "train/nc")

Ok, we are done! Let's look at the last page (page 105 in the original, page 10 in the shortened version).

Wow, that's the foreign language translation page - see how it handles other (non-Latin) character sets.

In [4]:
# Let's use the name property to see the name of the document
print( doc.name )

# Use the len() operator to find out how many pages are in the document
print( len(doc) )


10nc
10
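
Let's also check that something was written to the output directory. The sketch below just counts the files in a directory by extension. It makes no assumption about how Gap names the per-page files; `summarize_output_dir` is a hypothetical helper, not part of the framework.

```python
from collections import Counter
from pathlib import Path


def summarize_output_dir(directory):
    """Count the files in `directory` by extension (e.g. '.pdf', '.txt')."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing to report if the directory does not exist (yet)
        return Counter()
    return Counter(p.suffix.lower() for p in root.iterdir() if p.is_file())


# "train/nc" is the output directory passed to Document earlier
for ext, count in sorted(summarize_output_dir("train/nc").items()):
    print(ext or "(no extension)", count)
```

If the split worked as described, you would expect to see entries for both the page PDFs and the extracted text.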


## <span style='color: saddlebrown'>Page</span> Object

Let's now dive deeper. When the document was processed, each page was put into a <b style='color: saddlebrown'>Page</b> object. Here are some things we can do:

1. Walk through the pages sequentially, or access a page by index (like a list).<br/>
2. See the original text from the page.<br/>
3. See the "default" NLP preprocessing of the text on the page (which can be modified with *config* settings).<br/>
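
If you are wondering how an object can support both `len(doc)` and `doc[9]`, here is a minimal sketch of the Python container protocol. `MiniDocument` and `MiniPage` are hypothetical stand-ins written for this example; they are not Gap's implementation.

```python
class MiniPage:
    """Hypothetical stand-in for a page; holds only the raw text."""

    def __init__(self, text):
        self.text = text


class MiniDocument:
    """Hypothetical stand-in showing the container protocol, not Gap's code."""

    def __init__(self, texts):
        self.pages = [MiniPage(t) for t in texts]

    def __len__(self):
        # Enables len(doc)
        return len(self.pages)

    def __getitem__(self, index):
        # Enables doc[0], doc[-1], and for-loops over doc
        return self.pages[index]


doc_demo = MiniDocument(["first page", "second page", "third page"])

# Walk through the pages sequentially
for number, page in enumerate(doc_demo, start=1):
    print(number, page.text)
```

Defining `__len__` and `__getitem__` is enough for Python to treat the object like a read-only list, which is the behavior this session relies on.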


In [5]:
# Let's take a look at one of the pages
pages = doc.pages

# total number of pages
print(len(pages))

# Last page in the document
pages[9]

10


<splitter.Page at 0xeebc9b0>

In [6]:
# Let's look at the text for that page (page 10)
page = pages[9]
page.text

'Legal Notices \n         Laotian    ໂປດຊາບ:  ຖ້າວ່າ  ທ່ານເວົ້າພາສາ  ລາວ,  ການບໍລິການຊ່ວຍ\n                    ເຫຼືອດ້ານພາສາ, ໂດຍບໍ່ເສັຽຄ່າ, ແມ່ນມີພ້ອມໃຫ້ທ່ານ. \n                    ໂທຣ 919-814-4400. \n         Japanese   注意事項：日本語を話される場合、無料の言語支援をご利用いただけます。919-814-\n                    4400. \n                           Notice of Grandfather Status \n     The State Health Plan believes the 70/30 Plan is a “grandfathered health plan” under the Patient Protection \n     and Affordable Care Act (the Affordable Care Act). As permitted by the Affordable Care Act, a grandfathered \n     health plan can preserve certain basic health coverage that was already in effect when that law was enacted. \n     Being a grandfathered health plan means that your plan may not include certain consumer protections of the \n     Affordable Care Act that apply to other plans, for example, the requirement for the provision of preventive \n     health services without any cost sharing. However, grandfathered hea

In [7]:
# Let's look at the default NLP preprocessing of the text (stemming, stopword removal, punct removal)
page.words


[{'tag': 0, 'word': 'legal'},
 {'tag': 0, 'word': 'ໂປດຊາບ'},
 {'tag': 0, 'word': 'ຖ'},
 {'tag': 0, 'word': 'າວ'},
 {'tag': 0, 'word': 'າ'},
 {'tag': 0, 'word': 'ທ'},
 {'tag': 0, 'word': 'ານເວ'},
 {'tag': 0, 'word': 'າພາສາ'},
 {'tag': 0, 'word': 'ລາວ'},
 {'tag': 0, 'word': 'ການບ'},
 {'tag': 0, 'word': 'ລ'},
 {'tag': 0, 'word': 'ການຊ'},
 {'tag': 0, 'word': 'ວຍ'},
 {'tag': 0, 'word': 'ເຫ'},
 {'tag': 0, 'word': 'ອດ'},
 {'tag': 0, 'word': 'ານພາສາ'},
 {'tag': 0, 'word': 'ໂດຍບ'},
 {'tag': 0, 'word': 'ເສ'},
 {'tag': 0, 'word': 'ຽຄ'},
 {'tag': 0, 'word': 'າ'},
 {'tag': 0, 'word': 'ແມ'},
 {'tag': 0, 'word': 'ນມ'},
 {'tag': 0, 'word': 'ພ'},
 {'tag': 0, 'word': 'ອມໃຫ'},
 {'tag': 0, 'word': 'ທ'},
 {'tag': 0, 'word': 'ານ'},
 {'tag': 0, 'word': 'ໂທຣ'},
 {'tag': 0, 'word': 'japanese'},
 {'tag': 0, 'word': '注意事項'},
 {'tag': 0, 'word': '日本語を話される場合'},
 {'tag': 0, 'word': '無料の言語支援をご利用いただけます'},
 {'tag': 0, 'word': 'notice'},
 {'tag': 0, 'word': 'believe'},
 {'tag': 0, 'word': 'health'},
 {'tag': 0, 'word':

We can see that some words appear a lot, like preventive, health and protection. Let's get information on the distribution of words in the page. There are two properties we can use for this purpose:

    freqDist - The count of the number of occurrences of each word.
    termFreq - The fraction of the words on the page that are this word, i.e., its count divided by the total word count (TF -> Term Frequency).
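
To make the two definitions concrete, here is a small sketch that computes both from a list of word entries shaped like the `page.words` output above. The functions `freq_dist` and `term_freq` are hypothetical helpers for illustration; how Gap computes these properties internally is not shown in this session.

```python
from collections import Counter


def freq_dist(words):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(entry["word"] for entry in words)
    return counts.most_common()


def term_freq(words):
    """Return (word, count / total words) pairs, most frequent first."""
    total = len(words)
    if total == 0:
        return []
    return [(word, count / total) for word, count in freq_dist(words)]


# Sample entries in the same shape as page.words
sample = [
    {"tag": 0, "word": "plan"},
    {"tag": 0, "word": "health"},
    {"tag": 0, "word": "plan"},
    {"tag": 0, "word": "notice"},
]
print(freq_dist(sample))
print(term_freq(sample))
```

Because every term frequency is a count divided by the same total, the values for a page sum to 1.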

In [8]:
# Let's see the frequency distribution (word counts) for the page
page.freqDist

[('plan', 10),
 ('health', 8),
 ('protection', 4),
 ('apply', 3),
 ('ທ', 2),
 ('example', 2),
 ('consumer', 2),
 ('service', 2),
 ('າ', 2),
 ('share', 2),
 ('benefit', 2),
 ('preventive', 2),
 ('cost', 2),
 ('ການບ', 1),
 ('無料の言語支援をご利用いただけます', 1),
 ('ລາວ', 1),
 ('permit', 1),
 ('change', 1),
 ('ເຫ', 1),
 ('ເສ', 1),
 ('elimination', 1),
 ('status', 1),
 ('gov', 1),
 ('japanese', 1),
 ('າພາສາ', 1),
 ('ໂດຍບ', 1),
 ('coverage', 1),
 ('www', 1),
 ('ນມ', 1),
 ('ແມ', 1),
 ('ານພາສາ', 1),
 ('າວ', 1),
 ('current', 1),
 ('ຖ', 1),
 ('continue', 1),
 ('provision', 1),
 ('ໂປດຊາບ', 1),
 ('question', 1),
 ('ຽຄ', 1),
 ('basic', 1),
 ('believe', 1),
 ('must', 1),
 ('notice', 1),
 ('requirement', 1),
 ('limit', 1),
 ('location', 1),
 ('u', 1),
 ('ການຊ', 1),
 ('ານເວ', 1),
 ('lifetime', 1),
 ('ລ', 1),
 ('direct', 1),
 ('effect', 1),
 ('ອດ', 1),
 ('enact', 1),
 ('ໂທຣ', 1),
 ('legal', 1),
 ('cause', 1),
 ('ວຍ', 1),
 ('law', 1),
 ('base', 1),
 ('注意事項', 1),
 ('s', 1),
 ('healthcare', 1),
 ('provide', 1),
 ('ພ',

In [10]:
# Let's see the term frequency (TF)
page.termFreq

[('plan', 0.0970873786407767),
 ('health', 0.07766990291262135),
 ('protection', 0.038834951456310676),
 ('apply', 0.02912621359223301),
 ('ທ', 0.019417475728155338),
 ('example', 0.019417475728155338),
 ('consumer', 0.019417475728155338),
 ('service', 0.019417475728155338),
 ('າ', 0.019417475728155338),
 ('share', 0.019417475728155338),
 ('benefit', 0.019417475728155338),
 ('preventive', 0.019417475728155338),
 ('cost', 0.019417475728155338),
 ('ການບ', 0.009708737864077669),
 ('無料の言語支援をご利用いただけます', 0.009708737864077669),
 ('ລາວ', 0.009708737864077669),
 ('permit', 0.009708737864077669),
 ('change', 0.009708737864077669),
 ('ເຫ', 0.009708737864077669),
 ('ເສ', 0.009708737864077669),
 ('elimination', 0.009708737864077669),
 ('status', 0.009708737864077669),
 ('gov', 0.009708737864077669),
 ('japanese', 0.009708737864077669),
 ('າພາສາ', 0.009708737864077669),
 ('ໂດຍບ', 0.009708737864077669),
 ('coverage', 0.009708737864077669),
 ('www', 0.009708737864077669),
 ('ນມ', 0.009708737864077669)

## <span style='color: saddlebrown'>Document</span> Object (Advanced)

Let's look at more advanced features of the <b style='color:saddlebrown'>Document</b> object.

1. Word Count and Term Frequency
2. Save and Restore
3. Asynchronous Processing of Documents
4. Scanned PDF / OCR

### Frequency Distribution

Let's look at a frequency distribution (word count) for the whole document. Note that if we look at just the top 10 word counts (after removing stopwords), it is very clear what the document is about: service, benefit, cover, health, medical, care, coverage, ...

If we look at the top 25 word counts, we can see secondary classification indicators, like: plan, medication, treatment, deductible, eligible, dependent, hospital, claim, authorization, prescription and limit.

HINT: It's a Healthcare Benefit Plan.
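
Picking the top words out of a frequency distribution can be sketched as follows. The helper `top_words` and its `min_length` filter are assumptions made for this example (short tokens such as 'm' or 'de' tend to be extraction noise); the sample pairs are a few of the document-level counts from this session.

```python
def top_words(freq_dist, k=10, min_length=3):
    """Return the k most frequent words, skipping very short tokens."""
    kept = [(w, c) for w, c in freq_dist if len(w) >= min_length]
    # Sort by count, highest first; ties keep their original order
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# A few (word, count) pairs in the shape that freqDist returns
sample_counts = [
    ("benefit", 29), ("health", 20), ("plan", 19),
    ("m", 7), ("provide", 7), ("de", 6), ("claim", 6),
]
print(top_words(sample_counts, k=5))
```

With the real property, the call would look like `top_words(doc.freqDist, k=10)`.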

In [11]:
doc.freqDist

[('benefit', 29),
 ('health', 20),
 ('plan', 19),
 ('booklet', 12),
 ('medical', 11),
 ('information', 10),
 ('service', 10),
 ('policy', 8),
 ('obtain', 8),
 ('m', 7),
 ('provide', 7),
 ('www', 7),
 ('claim', 6),
 ('de', 6),
 ('question', 6),
 ('apply', 5),
 ('medication', 5),
 ('network', 5),
 ('care', 5),
 ('requ', 5),
 ('a', 4),
 ('p', 4),
 ('shpnc', 4),
 ('org', 4),
 ('prior', 4),
 ('authorization', 4),
 ('prescription', 4),
 ('protection', 4),
 ('summary', 4),
 ('holiday', 4),
 ('search', 3),
 ('avail', 3),
 ('coverage', 3),
 ('id', 3),
 ('bcbsnc', 3),
 ('website', 3),
 ('law', 3),
 ('cost', 3),
 ('nc', 3),
 ('conflict', 3),
 ('beneficio', 3),
 ('form', 3),
 ('card', 3),
 ('call', 3),
 ('contact', 3),
 ('week', 3),
 ('access', 2),
 ('please', 2),
 ('change', 2),
 ('າ', 2),
 ('consumer', 2),
 ('read', 2),
 ('cvs', 2),
 ('participate', 2),
 ('example', 2),
 ('former', 2),
 ('receive', 2),
 ('status', 2),
 ('visit', 2),
 ('ທ', 2),
 ('eligibility', 2),
 ('offer', 2),
 ('describ', 2),

### (Re) Load

When a <b style='color:saddlebrown'>Document</b> object is created, the individual PDF pages, text extraction and NLP analysis are stored. 

The document can subsequently be reloaded from storage without reprocessing.
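
The general idea (process once, then reload the stored results) can be sketched with a small JSON cache. This is only an illustration of the pattern: `load_or_process` and `fake_process` are hypothetical, and the file format Gap actually uses for storage is not described here.

```python
import json
import tempfile
from pathlib import Path


def load_or_process(source, cache_dir, process):
    """Return cached results for `source` if present; otherwise compute and store them."""
    cache_file = Path(cache_dir) / (Path(source).stem + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = process(source)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result


calls = []


def fake_process(source):
    # Stand-in for the expensive split / extract / NLP step
    calls.append(source)
    return {"name": Path(source).stem, "pages": ["page one text", "page two text"]}


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_process("train/10nc.pdf", cache_dir, fake_process)
    second = load_or_process("train/10nc.pdf", cache_dir, fake_process)

# Same results both times, but the expensive step ran only once
print(first == second, len(calls))
```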

In [12]:
# Let's first delete the Document object from memory
doc = None

In [13]:
# Let's reload the document from storage.
doc = Document()
doc.load("train/10nc.pdf", "train/nc")

Let's show some examples of how the document was reconstructed from storage.

In [14]:
# Document path, number of pages
print(doc.document)
print(len(doc))

train/10nc.pdf
10


In [15]:
# Let's print text from the last page
page = doc[9]
page.text

'                              Legal Notices \n         Laotian    ໂປດຊາບ:  ຖ້າວ່າ  ທ່ານເວົ້າພາສາ  ລາວ,  ການບໍລິການຊ່ວຍ\n                    ເຫຼືອດ້ານພາສາ, ໂດຍບໍ່ເສັຽຄ່າ, ແມ່ນມີພ້ອມໃຫ້ທ່ານ. \n                    ໂທຣ 919-814-4400. \n         Japanese   注意事項：日本語を話される場合、無料の言語支援をご利用いただけます。919-814-\n                    4400. \n                           Notice of Grandfather Status \n     The State Health Plan believes the 70/30 Plan is a “grandfathered health plan” under the Patient Protection \n     and Affordable Care Act (the Affordable Care Act). As permitted by the Affordable Care Act, a grandfathered \n     health plan can preserve certain basic health coverage that was already in effect when that law was enacted. \n     Being a grandfathered health plan means that your plan may not include certain consumer protections of the \n     Affordable Care Act that apply to other plans, for example, the requirement for the provision of preventive \n     health services without any cost shari

In [16]:
# Let's print the word (count) frequency distribution
doc.freqDist

[('benefit', 29),
 ('health', 20),
 ('plan', 19),
 ('booklet', 12),
 ('medical', 11),
 ('information', 10),
 ('service', 10),
 ('policy', 8),
 ('obtain', 8),
 ('m', 7),
 ('provide', 7),
 ('www', 7),
 ('claim', 6),
 ('de', 6),
 ('question', 6),
 ('apply', 5),
 ('medication', 5),
 ('network', 5),
 ('care', 5),
 ('requ', 5),
 ('a', 4),
 ('p', 4),
 ('shpnc', 4),
 ('org', 4),
 ('prior', 4),
 ('authorization', 4),
 ('prescription', 4),
 ('protection', 4),
 ('summary', 4),
 ('holiday', 4),
 ('search', 3),
 ('avail', 3),
 ('coverage', 3),
 ('id', 3),
 ('bcbsnc', 3),
 ('website', 3),
 ('law', 3),
 ('cost', 3),
 ('nc', 3),
 ('conflict', 3),
 ('beneficio', 3),
 ('form', 3),
 ('card', 3),
 ('call', 3),
 ('contact', 3),
 ('week', 3),
 ('access', 2),
 ('please', 2),
 ('change', 2),
 ('າ', 2),
 ('consumer', 2),
 ('read', 2),
 ('cvs', 2),
 ('participate', 2),
 ('example', 2),
 ('former', 2),
 ('receive', 2),
 ('status', 2),
 ('visit', 2),
 ('ທ', 2),
 ('eligibility', 2),
 ('offer', 2),
 ('describ', 2),

### Async Execution

Let's say you have PDF files arriving for processing in real time from various sources. The *ehandler* option provides asynchronous processing of documents. When this option is specified, the document is processed in the background, independently of the caller, and the specified event handler is called when processing completes.
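
The callback pattern itself can be sketched with Python's standard `threading` module. This is a generic illustration, not Gap's implementation: `process_async` is a hypothetical helper, and the lambda stands in for the real document processing.

```python
import threading


def process_async(path, process, ehandler):
    """Run process(path) on a background thread, then call ehandler(result)."""

    def worker():
        result = process(path)
        ehandler(result)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


finished = []


def done(result):
    # Called on the background thread once processing has finished
    finished.append(result)
    print("EVENT HANDLER: done")


thread = process_async("train/crash_2015.pdf", lambda p: {"document": p}, done)
thread.join()  # only needed here so the demo waits for the callback
```

The caller gets control back immediately after `process_async` returns, which is what lets documents from several sources be queued without blocking.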

In [20]:
def done(document):
    print("EVENT HANDLER: done")
    
doc = Document("train/crash_2015.pdf", "train/crash", ehandler=done)

EVENT HANDLER: done


Let's get a frequency distribution for this document. BTW, it is a 2015 State of Oregon table of crash statistics (single page) from a multi-page report. Note how the top ten words (after stopword removal) indicate what the document is about: serious, injury, fatal, crash, highway, death.

In [21]:
doc.freqDist

[('combine', 7),
 ('serious', 5),
 ('injury', 4),
 ('inj', 4),
 ('fatal', 4),
 ('system', 3),
 ('crash', 3),
 ('death', 2),
 ('miles', 2),
 ('highway', 2),
 ('vmt', 2),
 ('rate', 2),
 ('injuriula', 1),
 ('continue', 1),
 ('functional', 1),
 ('classification', 1),
 ('pralk', 1),
 ('sustain', 1),
 ('rmal', 1),
 ('injuries', 1),
 ('cap', 1),
 ('result', 1),
 ('prior', 1),
 ('inon', 1),
 ('s', 1),
 ('crashes', 1),
 ('nog', 1),
 ('vehicle', 1),
 ('ys', 1),
 ('total', 1),
 ('es', 1),
 ('drivinevent', 1),
 ('v', 1),
 ('deate', 1),
 ('activity', 1),
 ('casualty', 1),
 ('wjury', 1),
 ('jurisdiction', 1),
 ('table', 1),
 ('data', 1),
 ('list', 1),
 ('annual', 1),
 ('person', 1),
 ('and', 1)]

### Scanned PDF / OCR

Let's now process a scanned PDF. That's a PDF which is effectively a scanned image of a text document, wrapped inside a PDF.

- Split into pages
- Extract page image
- OCR the image into text
- Extract the text

In [22]:
Document.SCANCHECK = 0

# OCR the scanned PDF and extract text
doc = Document("train/4scan.pdf", "train/4scan")

Let's now look at a few properties of the preprocessed document.

In [23]:
# The scanned property indicates the document was a scanned PDF (true)
print( doc.scanned )

# Let's print the number of pages
print( len(doc) )

(True, 1)
4


Let's now look at a page.

In [24]:
# Get the first page
page = doc[0]

page.text

'E T8 2 G R E Listening.Learningieading?\n\nF \\\n\n \n\nA Comprehensive Review\nof Published GRE®\nValidity Data\n\n \n\nA Summary from ETS'

## END OF SESSION 1