
# Gap Framework - Natural Language Processing
## Syntax Module

<b>[GitHub](https://github.com/andrewferlitsch/gap)</b>

# Document Preparation for NLP with Gap (Session 2)

Let's dig deeper into the basics. We will be using the <b style='color: saddlebrown'>SYNTAX</b> module in the **Gap** framework. 

## <span style='color: saddlebrown'>Words</span> Object

Let's directly use the <b style='color: saddlebrown'>Words</b> object to control how the text is NLP preprocessed. We will cover the following:

    - Syntax Preprocessing
    - Text Reduction (Stopwords)
    - Named Entity Recognition
    - Parts of Speech Tagging
    - De-Identification
    - Measurement Extraction


In [None]:
import os
os.chdir("../")
%ls

In [None]:
# import the Words class
from gapml.syntax import Words

### Syntax Preprocessing

The <b style='color: saddlebrown'>SYNTAX</b> module supports various keyword parameters to configure how the text is NLP preprocessed. We will cover just a few in this code-along. When the text is preprocessed, an ordered sequential list of <b style='color: saddlebrown'>Word</b> objects is generated, each consisting of a set of key/value pairs.

In *bare* mode, all the text and punctuation is preserved, and no tagging, parts of speech (POS) tagging, stemming, lemmatization, named entity recognition (NER) or stopword removal is performed.

#### Bare

Let's look at the preprocessing of a simple sentence in *bare* mode.

In [None]:
# Process a past-tense variant of the well-known pangram typing phrase
w = Words('The quick brown fox jumped over the lazy dog.', bare=True)
print(w.words)

As you can see, the *words* property displays a list, where each entry is an object consisting of a word and tag key/value pair. The integer values of the tags are defined in Vocabulary.py. In bare mode, all words are tagged as UNTAGGED (0) and punctuation as PUNCT (23).

Note how in bare mode all words are kept, along with their capitalization, order and punctuation.
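
To make the shape of that list concrete, here is a minimal sketch of bare-mode tokenization in plain Python. It is not the **Gap** implementation: the `bare_tokenize` helper and the `word`/`tag` key names are assumptions for illustration, and only the two tag values come from the text above.

```python
import re

# Tag values quoted in the text above; the dict layout is an assumption.
UNTAGGED = 0
PUNCT = 23

def bare_tokenize(text):
    """Split text into word and punctuation tokens, keeping case and order."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        token = match.group()
        # Anything that starts with a word character is treated as a word.
        tag = UNTAGGED if re.match(r"\w", token) else PUNCT
        tokens.append({"word": token, "tag": tag})
    return tokens

tokens = bare_tokenize("The quick brown fox jumped over the lazy dog.")
print(tokens)
```

Nothing is lowercased, dropped or reordered here, which is the whole point of bare mode.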

#### Stopwords and Stemming

Let's do some text reduction. In NLP, many tokens add very little to the understanding of the text, such as punctuation and common words like 'the', 'and' and 'a'. Removing these common words is called stopword removal. There are several stopword lists for doing this, the most common being the *Porter* list.

Additionally, we can make it easier to match words if we lowercase all the words and remove word endings, such as plurals and 'ing'; this is called stemming. Let's give it a try with the same sentence.

Note how, in the output of the cell below, words like 'the' and 'over' have been removed, the punctuation has been removed, words have been lowercased and 'jumped' has been stemmed to its root word 'jump'.


In [None]:
# Stem words using the NLTK Porter stemmer
w = Words('The quick brown fox jumped over the lazy dog.', stem='porter')
print(w.words)

Stemmers sometimes reduce a word to something that isn't its root. For example, 'riding' could end up as 'rid' after cutting off 'ing'. Note above how the *NLTK Porter* stemmer changed 'lazy' into 'lazi'.
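
The reduction steps and the over-stemming problem can be sketched with a toy pipeline. This is not the Porter algorithm and not what **Gap** runs internally: the stopword set and the suffix rules in `naive_stem` are hypothetical and deliberately simplistic, just enough to reproduce the kind of errors described above.

```python
import re

# A tiny illustrative stopword list; real lists hold a few hundred entries.
STOPWORDS = {"the", "a", "an", "and", "over", "of", "to", "in"}

def naive_stem(word):
    """Strip a few common suffixes with no dictionary check (toy rules)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("y") and len(word) > 3:
        # Porter-style rewrite of a trailing 'y' to 'i'
        return word[:-1] + "i"
    return word

def reduce_text(text):
    """Lowercase, drop punctuation and stopwords, then stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

reduced = reduce_text("The quick brown fox jumped over the lazy dog.")
print(reduced)               # 'jumped' becomes 'jump', 'lazy' becomes 'lazi'
print(naive_stem("riding"))  # over-stemmed to 'rid'
```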

Different stemmers make different errors. These can be corrected using lemmatization. Let's repeat the above, but use the **Gap** stemmer, which has a lemmatizer correction.

In [None]:
# Stem words using the Gap stemmer
w = Words('The quick brown fox jumped over the lazy dog.', stem='gap')
print(w.words)
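
One simple way to picture a lemmatizer correction is a lookup that maps known bad stems back to dictionary forms. The sketch below assumes a small hand-written exception table (`LEMMA_FIXES` is hypothetical); how the **Gap** stemmer actually performs its correction is not shown in this session.

```python
# Hypothetical exception table mapping bad stems back to dictionary forms.
LEMMA_FIXES = {"lazi": "lazy", "happi": "happy"}

def correct_stem(stem):
    """Return the dictionary form for a known bad stem, else the stem itself."""
    return LEMMA_FIXES.get(stem, stem)

stems = ["quick", "brown", "fox", "jump", "lazi", "dog"]
corrected = [correct_stem(s) for s in stems]
print(corrected)
```

A table like this only fixes the errors someone has listed, so a real lemmatizer usually combines it with a dictionary and part-of-speech information.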

#### Gender Recognition

The <b style='color: saddlebrown'>Words</b> object will also recognize gender-specific words. We will preprocess four different ways of saying 'father'. In each case, the tag will be set to MALE (15) and each word will be replaced (reduced) with its common equivalent 'father'.

In [None]:
# Let's recognize various forms of father
w = Words("dad daddy father papa", gender=True)
w.words

Let's now try a variety of words indicating the gender FEMALE (16). Note how 'mom' and 'mother' got reduced to the common equivalent 'mother', and the slang 'auntie' and 'sis' got reduced to 'aunt' and 'sister', respectively.

In [None]:
w = Words("girl lady mother mom auntie sis", gender=True)
w.words
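
The reduce-and-tag behavior described above amounts to a lookup table. Here is a minimal sketch, assuming a hand-written table (`GENDER_TERMS` and `tag_gender` are hypothetical names) that covers only the reductions stated in this section; the tag values 15 and 16 are the ones quoted in the text.

```python
MALE, FEMALE, UNTAGGED = 15, 16, 0

# Maps each variant to (common equivalent, tag). Only reductions stated in
# the text are listed; this is not Gap's vocabulary.
GENDER_TERMS = {
    "dad": ("father", MALE),
    "daddy": ("father", MALE),
    "father": ("father", MALE),
    "papa": ("father", MALE),
    "mom": ("mother", FEMALE),
    "mother": ("mother", FEMALE),
    "auntie": ("aunt", FEMALE),
    "sis": ("sister", FEMALE),
}

def tag_gender(text):
    """Replace each known variant with its common equivalent and tag it."""
    result = []
    for token in text.lower().split():
        word, tag = GENDER_TERMS.get(token, (token, UNTAGGED))
        result.append({"word": word, "tag": tag})
    return result

fathers = tag_gender("dad daddy father papa")
females = tag_gender("mother mom auntie sis")
print(fathers)
print(females)
```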

#### NER (Named Entity Recognition)

The <b style='color: saddlebrown'>SYNTAX</b> module will recognize a wide variety of proper names, places and identifiers, such as a person's name (11), a social security number (9), a title (33) and a geographic location.

In [None]:
# Let's look at a string with a name, social security number, and title.
w = Words("Patient: Jim Jones, SSN: 123-12-1234. Dr. Nancy Lou", stopwords=True)

In [None]:
# Let's print the word list. Note that jim and jones are tagged 11 (Proper Name), 123121234 is tagged 9 (SSN), and 
# Dr is tagged 33 (Title)
w.words
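
Pattern-shaped entities such as SSNs and titles can be found with regular expressions. The sketch below is an assumption about one way to do it, not **Gap**'s recognizer: `find_entities` and both patterns are hypothetical, and person names are left out because they need a name list rather than a pattern. The tag values 9 and 33 come from the text above.

```python
import re

SSN_TAG, TITLE_TAG = 9, 33

# US SSN written as 3-2-4 digits separated by hyphens
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
# A few common titles followed by a period (illustrative, not exhaustive)
TITLE_PATTERN = re.compile(r"\b(Dr|Mr|Mrs|Ms)\.", re.IGNORECASE)

def find_entities(text):
    """Return SSN and title entities found in text as word/tag dicts."""
    entities = []
    for m in SSN_PATTERN.finditer(text):
        # Join the digit groups so the hyphens are dropped
        entities.append({"word": "".join(m.groups()), "tag": SSN_TAG})
    for m in TITLE_PATTERN.finditer(text):
        entities.append({"word": m.group(1), "tag": TITLE_TAG})
    return entities

entities = find_entities("Patient: Jim Jones, SSN: 123-12-1234. Dr. Nancy Lou")
print(entities)
```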

Let's now try an address. In our example, we recognized (tagged) a street number (27), street direction (28), street name (29), street type (30), a secondary address unit (36), a city (31), a state (32) and a postal code (34).

Both US and Canadian street and postal addresses are recognized. Note how the state abbreviation "OR" (Oregon) got replaced with its ISO international standard code.

In [None]:
w = Words("124 NE Main Ave, Apt #6, Portland, OR 97221", address=True)
w.words
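
To show how an address breaks down into those tagged parts, here is a rough regex sketch. It is an illustration under strong assumptions: `ADDRESS_PATTERN` and `tag_address` are hypothetical, the pattern only handles a US address laid out like the example, and it does not do the ISO state-code replacement. The tag numbers are the ones listed above.

```python
import re

# Tag values as listed in the text above
TAGS = {"number": 27, "direction": 28, "name": 29, "type": 30,
        "unit": 36, "city": 31, "state": 32, "postal": 34}
FIELD_ORDER = ["number", "direction", "name", "type",
               "unit", "city", "state", "postal"]

ADDRESS_PATTERN = re.compile(
    r"(?P<number>\d+)\s+"
    r"(?:(?P<direction>NE|NW|SE|SW|N|S|E|W)\s+)?"
    r"(?P<name>[A-Za-z]+)\s+"
    r"(?P<type>Ave|St|Rd|Blvd|Dr|Ln)\.?,\s*"
    r"(?:(?:Apt|Suite|Unit)\s*#?(?P<unit>\w+),\s*)?"
    r"(?P<city>[A-Za-z ]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<postal>\d{5})"
)

def tag_address(text):
    """Split a simple US address into tagged components."""
    match = ADDRESS_PATTERN.search(text)
    if match is None:
        return []
    parts = []
    for field in FIELD_ORDER:
        value = match.group(field)
        if value is not None:  # optional parts may be absent
            parts.append({"word": value, "tag": TAGS[field]})
    return parts

address = tag_address("124 NE Main Ave, Apt #6, Portland, OR 97221")
print(address)
```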

#### De-Identification

The <b style='color: saddlebrown'>SYNTAX</b> module supports de-identification of the text. One can remove names, dates of birth, gender, social security numbers, telephone numbers and addresses.

In [None]:
# Let's remove any names and SSN from our text
w = Words("Patient: Jim Jones, SSN: 123-12-1234", name=False, ssn=False)
w.words
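
Once tokens carry tags, de-identification is a filter over the tagged list. The sketch below assumes that model: `deidentify` is a hypothetical helper whose `name` and `ssn` flags mirror the keyword arguments used above, and the input list is hand-written for illustration rather than real **Gap** output.

```python
NAME_TAG, SSN_TAG = 11, 9  # tag values quoted earlier in this session

def deidentify(tokens, name=True, ssn=True):
    """Drop tokens of each category whose flag is set to False."""
    removed = set()
    if not name:
        removed.add(NAME_TAG)
    if not ssn:
        removed.add(SSN_TAG)
    return [t for t in tokens if t["tag"] not in removed]

# Hand-written tagged input for illustration only
tagged = [
    {"word": "patient", "tag": 0},
    {"word": "jim", "tag": 11},
    {"word": "jones", "tag": 11},
    {"word": "ssn", "tag": 0},
    {"word": "123121234", "tag": 9},
]

clean = deidentify(tagged, name=False, ssn=False)
print(clean)
```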

#### Measurements

The <b style='color: saddlebrown'>SYNTAX</b> module supports extracting measurements, such as height, weight, speed, volume and quantity (38). You can also configure it to convert measurements (25) to the Standard or Metric system. A wide variety of abbreviations and formats are recognized. Note that numbers are tagged as 1.

In [None]:
# Let's do height using ' for foot and " for inches
w = Words("Height: 5'7\"", stopwords=True)
w.words

In [None]:
# Let's do height using the abbreviations ft and in.
w = Words("Height: 5 ft 7 in", stopwords=True)
w.words

In [None]:
# Let's do height using the abbreviations ft and in, with no space between the value and unit
w = Words("Height: 5ft 7in", stopwords=True)
w.words
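
All three height formats above can be matched by a single pattern. This is a sketch of the idea only: `HEIGHT_PATTERN` and `parse_height` are hypothetical, and returning the total in inches is a choice made for this example, not a statement about what **Gap** emits.

```python
import re

# Feet marker: ' or ft or feet; inches marker: " or in or inches
HEIGHT_PATTERN = re.compile(
    r"(?P<feet>\d+)\s*(?:'|ft\.?|feet)\s*"
    r'(?P<inches>\d+)\s*(?:"|in\.?|inches)',
    re.IGNORECASE,
)

def parse_height(text):
    """Return the height in total inches, or None if no height is found."""
    m = HEIGHT_PATTERN.search(text)
    if m is None:
        return None
    return int(m.group("feet")) * 12 + int(m.group("inches"))

for sample in ("Height: 5'7\"", "Height: 5 ft 7 in", "Height: 5ft 7in"):
    print(sample, "->", parse_height(sample))
```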

In [None]:
# Let's do an example in Standard and convert to Metric system.
w = Words("Weight is 120lbs", stopwords=True, metric=True)
w.words
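
The Standard-to-Metric conversion itself is simple arithmetic once the value and unit are extracted. Here is a minimal sketch for pounds to kilograms; `WEIGHT_PATTERN` and `pounds_to_kilograms` are hypothetical, and rounding to two decimals is an assumption made for readability.

```python
import re

KG_PER_LB = 0.45359237  # exact, by definition of the international pound

# A number followed by lb, lbs, pound or pounds, with optional spacing
WEIGHT_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?:lbs?|pounds?)\b", re.IGNORECASE
)

def pounds_to_kilograms(text):
    """Find a weight in pounds and return it in kilograms, or None."""
    m = WEIGHT_PATTERN.search(text)
    if m is None:
        return None
    return round(float(m.group("value")) * KG_PER_LB, 2)

kilograms = pounds_to_kilograms("Weight is 120lbs")
print(kilograms)
```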

## THAT'S ALL FOR SESSION 2

We look forward to seeing everyone again in session 3, where we will do some data preparation for computer vision.