
# Gap (prelaunch) 0.9 - July 2018
## NLP and CV Data Engineering Framework

<b>[GitHub](https://github.com/andrewferlitsch/gap)</b>

# Document Preparation for NLP with Gap (Session 2)

Let's dig deeper into the basics. We will be using the <b style='color: saddlebrown'>SYNTAX</b> component in my Gap module. 

## <span style='color: saddlebrown'>Words</span> Object

Let's directly use the <b style='color: saddlebrown'>Words</b> object to control how the text is NLP preprocessed. I will cover the following:

    - Syntax Preprocessing
    - Text Reduction (Stopwords)
    - Parts of Speech Tagging
    - De-Identification
    - Measurement Extraction


In [2]:
import os
os.chdir("../")
!cd

C:\Users\'\Desktop\epipog-nlp


In [3]:
# import the Words class
from document import Words

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\'\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\'\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Syntax Preprocessing

The <b style='color: saddlebrown'>SYNTAX</b> module supports various keyword parameters to configure how the text is NLP preprocessed. We will cover just a few in this code-along. Let's start with processing text for gender recognition. When the text is preprocessed, an ordered list of Word objects is generated, each consisting of a set of key/value pairs.

In bare mode, all the text and punctuation is preserved, and no tagging, parts of speech (POS) tagging, stemming, lemmatization, named entity recognition (NER) or stopword removal is performed.

#### Bare

Let's look at the preprocessing of a simple sentence in bare mode.

In [4]:
# Process a variant of the well-known typing phrase (the classic 'jumps' version contains all 26 letters of the alphabet)
w = Words('The quick brown fox jumped over the lazy dog.', bare=True)
print(w.words)

[{'word': 'The', 'tag': 0}, {'word': 'quick', 'tag': 0}, {'word': 'brown', 'tag': 0}, {'word': 'fox', 'tag': 0}, {'word': 'jumped', 'tag': 0}, {'word': 'over', 'tag': 0}, {'word': 'the', 'tag': 0}, {'word': 'lazy', 'tag': 0}, {'word': 'dog', 'tag': 0}, {'word': '.', 'tag': 23}]


As you can see, the words property returns a list, where each entry is an object consisting of a word and a tag key/value pair. The integer values of the tags are defined in Vocabulary.py. In bare mode, all words are tagged as UNTAGGED (0) and punctuation as PUNCT (23).

Note how bare mode keeps every word, along with its capitalization, order and punctuation.
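
To make the shape of that output concrete, here is a minimal sketch of bare-style tokenization in plain Python. It is not Gap's implementation: the function name `bare_tokenize`, the constant names and the regular expression are assumptions made for illustration. Only the tag values 0 (UNTAGGED) and 23 (PUNCT) come from the output above.

```python
import re

# Tag values as they appear in the output above; the constant names are illustrative.
UNTAGGED = 0
PUNCT = 23

def bare_tokenize(text):
    """Split text into words and punctuation marks, keeping case and order."""
    # A run of word characters is a word; any other non-space character is punctuation.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [
        {'word': tok, 'tag': UNTAGGED if re.fullmatch(r"\w+", tok) else PUNCT}
        for tok in tokens
    ]

words = bare_tokenize('The quick brown fox jumped over the lazy dog.')
print(words[0], words[-1])
# {'word': 'The', 'tag': 0} {'word': '.', 'tag': 23}
```

Nothing is lowercased, dropped or reordered here, which is the point of bare mode.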

#### Stopwords and Stemming

Let's do some text reduction. In NLP, many tokens add very little to the understanding of the text, such as punctuation and common words like 'the', 'and' and 'a'. Removing these common words is called stopword removal. There are several lists for doing this, the most common being the Porter list.

Additionally, we can make it easier to match words if we lowercase all the words and remove word endings, such as plurals and 'ing', which is called stemming. Let's give it a try with the same sentence.

In the output, note how words like 'the' and 'over' have been removed, the punctuation has been removed, the words have been lowercased and 'jumped' has been stemmed to its root word 'jump'.


In [7]:
w = Words('The quick brown fox jumped over the lazy dog.', stem='porter')
print(w.words)

[{'word': 'quick', 'tag': 0}, {'word': 'brown', 'tag': 0}, {'word': 'fox', 'tag': 0}, {'word': 'jump', 'tag': 0}, {'word': 'lazi', 'tag': 0}, {'word': 'dog', 'tag': 0}]
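
The same reduction steps can be sketched in plain Python. This is a toy, not what Gap does internally: the stopword set is a tiny hand-picked sample, and `toy_stem` is a hypothetical suffix stripper, not the Porter algorithm. That is why 'lazy' stays 'lazy' here, while the Porter stemmer above produces 'lazi'.

```python
import re

# A tiny illustrative stopword set; real lists are much longer.
STOPWORDS = {'the', 'and', 'a', 'over'}

def toy_stem(word):
    """Strip a few common endings. This is NOT the Porter algorithm."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def reduce_text(text):
    """Lowercase, drop punctuation and stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(tok) for tok in tokens if tok not in STOPWORDS]

print(reduce_text('The quick brown fox jumped over the lazy dog.'))
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```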


In [5]:

# Turn on gender recognition for family-relation words
w = Words("grandfather, son, dad, sister", gender=True)
w.words

[{'tag': 15, 'word': 'grandfather'},
 {'tag': 0, 'word': 'son'},
 {'tag': 15, 'word': 'dad'},
 {'tag': 16, 'word': 'sister'}]

### NER (Named Entity Recognition)

In [5]:
# Let's look at a string with a name and social security number.
w = Words(" word1 word2 Jim Jones, SSN: 123-12-1234 word3", stopwords=True)

In [6]:
# Let's print the word list. Note that jim and jones are tagged 11 (Proper Name) and 123121234 is tagged 9 (SSN)
w.words

[{'tag': 0, 'word': 'word1'},
 {'tag': 0, 'word': 'word2'},
 {'tag': 11, 'word': 'jim'},
 {'tag': 11, 'word': 'jones'},
 {'tag': 9, 'word': '123121234'},
 {'tag': 0, 'word': 'word3'}]
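
To give a feel for how a number like this can be recognized, here is a minimal regex-based sketch for the SSN part only. It is an assumption for illustration, not Gap's recognizer: the names `SSN_PATTERN` and `find_ssns` are hypothetical, and only the tag value 9 and the digits-only form of the word come from the output above.

```python
import re

SSN_TAG = 9  # tag value shown for the SSN in the output above

# Three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def find_ssns(text):
    """Return a Word-style dict for each SSN-shaped number, digits only."""
    return [{'tag': SSN_TAG, 'word': ''.join(m.groups())}
            for m in SSN_PATTERN.finditer(text)]

print(find_ssns(" word1 word2 Jim Jones, SSN: 123-12-1234 word3"))
# [{'tag': 9, 'word': '123121234'}]
```

Recognizing proper names is harder than matching a fixed digit pattern, so it is left out of this sketch.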

### De-Identification

In [7]:
# Let's remove any names and SSN from our text
w = Words("  word1 word2 Jim Jones, SSN: 123-12-1234 word3", name=False, ssn=False)
w.words

[{'tag': 0, 'word': 'word1'},
 {'tag': 0, 'word': 'word2'},
 {'tag': 0, 'word': 'word3'}]
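
One way to picture de-identification is as a filter over an already tagged word list. The sketch below is an assumption about the idea, not a description of how Gap applies `name=False` and `ssn=False`: the helper `drop_tags` is hypothetical, and the tag values 11 (Proper Name) and 9 (SSN) are taken from the NER output earlier in this session.

```python
# Tag values taken from the NER output earlier in this session.
PROPER_NAME = 11
SSN = 9

def drop_tags(words, tags):
    """Keep only the words whose tag is not in the given set."""
    return [w for w in words if w['tag'] not in tags]

# The tagged word list printed in the NER section.
tagged = [
    {'tag': 0, 'word': 'word1'},
    {'tag': 0, 'word': 'word2'},
    {'tag': 11, 'word': 'jim'},
    {'tag': 11, 'word': 'jones'},
    {'tag': 9, 'word': '123121234'},
    {'tag': 0, 'word': 'word3'},
]

print(drop_tags(tagged, {PROPER_NAME, SSN}))
# [{'tag': 0, 'word': 'word1'}, {'tag': 0, 'word': 'word2'}, {'tag': 0, 'word': 'word3'}]
```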

## THAT'S ALL FOR SESSION 2

I look forward to seeing everyone again in the next session, where we will do some serious deep diving.