# Text Preprocessing

Text preprocessing is an important step when using unstructured text documents for any type of data mining, information retrieval, or text analytics.
This lab uses the Python Natural Language Toolkit (NLTK) to demonstrate the tools available for text preprocessing.
Specifically, we look at the following concepts:
  1. Stop Words
  1. Stemming
  1. Lemmatization
  
In later labs, these steps will be handled for us automatically as we build information retrieval applications.
However, they are still key concepts to see in action.
You will see them again as we continue with text analytics in future modules.

#### <span style="background:yellow">As you continue to work with text processing, contemplate the discussion board question!</span>

## Stop Words

Text documents often contain many occurrences of the same word. 
For example, in a document written in _English_, words such as _a_, _the_, _of_, and _it_ likely occur very frequently. 
When classifying a document based on the number of times specific words occur in the text document, 
these words can lead to biases, especially since they are generally common in **all** text documents you might want to classify. 
As a result, the concept of [_stop words_](https://en.wikipedia.org/wiki/Stop_words) was invented. 
Basically, these words are the most commonly occurring words that should be removed during the tokenization process in order to improve subsequent text analytics efforts. 

We can easily specify that the __English__ stop words should be excluded during tokenization by using the `stop_words` parameter of the vectorizer. 
Note, _stop word_ dictionaries for other languages, or even specific domains, exist and can be used instead. 
We demonstrate the removal of stop words by using a `CountVectorizer` in the following simple example.

-----

In [1]:
# Define our vectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='word', lowercase=True)

# Sample sentence to tokenize
my_text = 'This module introduced many concepts in text analysis.'

cv1 = CountVectorizer(lowercase=True)
cv2 = CountVectorizer(stop_words = 'english', lowercase=True)

tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

import pprint
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
pp.pprint(tk_func1(my_text))

print()

print('Tokenization (with Stop words):')
pp.pprint(tk_func2(my_text))

Tokenization:
['this', 'module', 'introduced', 'many', 'concepts', 'in', 'text', 'analysis']

Tokenization (with Stop words):
['module', 'introduced', 'concepts', 'text', 'analysis']
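
The note above about domain-specific stop words can be sketched in the same way. This is a minimal illustration, not part of the original lab: the list `domain_stop_words` is a hypothetical example, and the sketch relies on `CountVectorizer` accepting a plain Python list for `stop_words`.

```python
from sklearn.feature_extraction.text import CountVectorizer

my_text = 'This module introduced many concepts in text analysis.'

# Hypothetical domain-specific stop words (an assumption for illustration only)
domain_stop_words = ['module', 'concepts']

# Pass the custom list instead of the built-in 'english' list
cv_custom = CountVectorizer(stop_words=domain_stop_words, lowercase=True)
custom_analyzer = cv_custom.build_analyzer()

custom_tokens = custom_analyzer(my_text)
print(custom_tokens)
# ['this', 'introduced', 'many', 'in', 'text', 'analysis']
```

Only the two listed words are dropped here; common English words such as _this_ and _in_ remain because the built-in English list is no longer applied.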


 

--- 
## Stemming


We have looked at the removal of redundant or unimportant words, i.e., _stop words_. 
However, an issue still exists because of the different word forms of the same base term; for example, compute, computer, computed, and computing. 
[Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing words to their root term, or basic form (by removing prefixes and suffixes), so that token frequencies reflect the use of the root token rather than being spread across multiple similar tokens. 

The most widely used stemmer, or program/method that performs stemming, is the _Porter Stemmer_, which was originally published in 1980 by Martin Porter. 
An improved version was released in 2000, which fixed a number of errors. 
NLTK includes the Porter Stemmer.
To use it with a vectorizer, we create a function that tokenizes a text document and pass this function as an argument to the `CountVectorizer` via the `tokenizer` parameter. 
By performing stemming inside this tokenize function, we can return a list of tokens for a document that have been stemmed. 
The following code cells first apply the Porter Stemmer to a short list of example words, and then to a sentence that has been tokenized with NLTK.

-----


In [2]:
import string
import nltk
from nltk.stem.porter import PorterStemmer

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
stemmer = PorterStemmer()

for w in example_words:
    print(stemmer.stem(w))

python
python
python
python
pythonli


### Pull down some data to aid in text analytics

Uncomment the download command below and run it once to download the NLTK datasets. 
Once the data is downloaded, you can comment it out again. 


### If you are using a local install, you will need to install the NLTK data (~12GB)
```
nltk.download('all')
```

In [3]:
new_text = "It is important to be very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."

tokens = nltk.word_tokenize(new_text)
tokens = [token for token in tokens if token not in string.punctuation]

for w in tokens:
    print(stemmer.stem(w))

It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
all
python
have
python
poorli
at
least
onc
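
To connect the stemmer back to the `CountVectorizer`, the custom tokenizer approach described at the start of this section can be sketched as follows. This is a minimal sketch under stated assumptions: `stem_tokenize` is a hypothetical helper name, it splits text with a simple regular expression rather than `nltk.word_tokenize` (so no NLTK data download is needed), and `token_pattern=None` assumes a reasonably recent scikit-learn.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokenize(text):
    """Hypothetical helper: split text into alphabetic tokens, then stem each one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]

# token_pattern=None signals that the default pattern is unused
# because a custom tokenizer is supplied
cv_stem = CountVectorizer(tokenizer=stem_tokenize, token_pattern=None, lowercase=True)

docs = ["She computes while computing.", "He computed it on a computer."]
counts = cv_stem.fit_transform(docs)

# All four word forms are counted under the single stem 'comput'
comput_column = cv_stem.vocabulary_['comput']
print(counts[0, comput_column], counts[1, comput_column])
```

Because stemming happens inside the tokenizer, the vocabulary holds one entry for the stem instead of separate entries for _computes_, _computing_, _computed_, and _computer_.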


-----

## Lemmatization


Lemmatization, in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. 
Inflection means changing the form of a word to express a particular grammatical function or attribute, typically tense, mood, person, number, case, and gender.

In computational linguistics, lemmatization is the algorithmic process of determining the lemma for a given word. 
The process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence, which requires, for example, knowledge of the grammar of a language.

In many languages, words appear in several inflected forms. 
For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. 
The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. 

Lemmatization is closely related to stemming. 
The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. 
However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

-----

In [4]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('dogs')

'dog'

Let's try to understand the difference between **Stemming** and **Lemmatization**.

In [5]:
from nltk import pos_tag

print("Stem %s: %s" % ("going", stemmer.stem("going")))
print("Stem %s: %s" % ("gone", stemmer.stem("gone")))
print("Stem %s: %s" % ("goes", stemmer.stem("goes")))
print("Stem %s: %s" % ("went", stemmer.stem("went")))
print("\n")

print("Without context")
print("Lemmatise %s: %s" % ("going", lemmatizer.lemmatize("going")))
print("Lemmatise %s: %s" % ("gone", lemmatizer.lemmatize("gone")))
print("Lemmatise %s: %s" % ("goes", lemmatizer.lemmatize("goes")))
print("Lemmatise %s: %s" % ("went", lemmatizer.lemmatize("went")))
print("\n")

print("With context")
print("Lemmatise %s: %s" % ("going", lemmatizer.lemmatize("going", pos="v")))
print("Lemmatise %s: %s" % ("gone", lemmatizer.lemmatize("gone", pos="v")))
print("Lemmatise %s: %s" % ("goes", lemmatizer.lemmatize("goes", pos="v")))
print("Lemmatise %s: %s" % ("went", lemmatizer.lemmatize("went", pos="v")))

Stem going: go
Stem gone: gone
Stem goes: goe
Stem went: went


Without context
Lemmatise going: going
Lemmatise gone: gone
Lemmatise goes: go
Lemmatise went: went


With context
Lemmatise going: go
Lemmatise gone: go
Lemmatise goes: go
Lemmatise went: go


We can observe that the stemming process does not always generate a real word, but rather a root form. 
The lemmatizer, on the other hand, generates real words, 
but without contextual information it is not able to distinguish between nouns and verbs. 
NLTK's `WordNetLemmatizer` treats every word as a noun by default, so most of the verb forms above are returned unchanged. 

The context is provided by the POS (part-of-speech) tag ("v" for verb in this example). 
Specifying the POS tag by hand for every word in a text is not practical. 
Instead, NLTK can generate POS tags automatically with the `pos_tag()` function.

In [6]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
 
s = "This is a simple sentence"
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens) 
 
print(tokens_pos)

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN')]


The `pos_tag()` function assigns a part-of-speech tag to each token in the text. 
The outputs 'DT', 'VBZ', etc. represent parts of speech. 
These tags come from the tag set of the [Penn Treebank project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
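
One way to feed these tags to the lemmatizer is to translate each Penn Treebank tag into one of the single-letter POS codes that `lemmatize()` accepts ('n', 'v', 'a', 'r'). The sketch below is an assumption-laden illustration, not part of the original lab: `penn_to_wordnet` is a hypothetical helper, and the mapping is deliberately coarse (it looks only at the first letter of the tag).

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjectives (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'  # verbs (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith('R'):
        return 'r'  # adverbs (RB, RBR, RBS); coarse, also catches RP
    return 'n'      # everything else falls back to noun

# Tagged tokens copied from the pos_tag() output above
tokens_pos = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
              ('simple', 'JJ'), ('sentence', 'NN')]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tokens_pos]
print(wordnet_pos)
# [('This', 'n'), ('is', 'v'), ('a', 'n'), ('simple', 'a'), ('sentence', 'n')]
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=pos)`, so that verbs are lemmatized as verbs without tagging every word by hand.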

## Check Point
Stop words, stemming, and lemmatization are important preprocessing steps in text analytics applications. 
You can incorporate the off-the-shelf solutions offered by NLTK into your text analysis applications.
Additionally, many code libraries and applications that perform more advanced text analytics incorporate these techniques by default.

Below are some practice coding exercises for you to experiment with the NLTK functionality above.


In [7]:
StringAction = "We are meeting"
StringNoun =  "We had a meeting"

In [8]:
# 1) Compare the result of parsing the two 
# variables, StringAction and StringNoun 
# using the tokenizer with the english stop words
# ----------------------------------------

# Import libraries
import pprint
from sklearn.feature_extraction.text import CountVectorizer

# Building the vectorizing tokenizer
cv = CountVectorizer(stop_words = 'english', lowercase=True)
tk_function = cv.build_analyzer()

# Print out the comparison of stop-word enabled tokenizing the two variables
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
print("'{}':".format(StringAction))
# -------- EDIT NEXT LINE --------
pp.pprint("'{}':".format(tk_function(StringAction)))

print("  ... vs ...  ")

print('Tokenization:')
print("'{}':".format(StringNoun))
# -------- EDIT NEXT LINE --------
pp.pprint("'{}':".format(tk_function(StringNoun)))

Tokenization:
'We are meeting':
"'['meeting']':"
  ... vs ...  
Tokenization:
'We had a meeting':
"'['meeting']':"


In [9]:
# 2) Compare the result of parsing the two 
# variables, StringAction and StringNoun 
# using the stemmer with the english stop words
# ----------------------------------------

# Import libraries
import string
import nltk
from nltk.stem.porter import PorterStemmer

### Add your code below to parse and stem the two variables

Action_tokens = nltk.word_tokenize(StringAction)
Action_tokens = [token for token in Action_tokens if token not in string.punctuation]

Noun_tokens = nltk.word_tokenize(StringNoun)
Noun_tokens = [token for token in Noun_tokens if token not in string.punctuation]

print("'{}':".format(StringAction))
for w in Action_tokens:
    print(stemmer.stem(w))

print("  ... vs ...  ")


print("'{}':".format(StringNoun))
for w in Noun_tokens:
    print(stemmer.stem(w))

'We are meeting':
We
are
meet
  ... vs ...  
'We had a meeting':
We
had
a
meet


In [10]:
# 3) Compare the result of parsing the two 
# variables, StringAction and StringNoun 
# using the lemmatization with the english stop words
# ----------------------------------------

# Import libraries
from nltk.stem import WordNetLemmatizer


### Add your code below to parse and lemmatize the two variables

from nltk.corpus import stopwords
from nltk import word_tokenize

stop = stopwords.words('english')

Action_tokens = [token for token in word_tokenize(StringAction.lower()) if token not in stop]
Noun_tokens = [token for token in word_tokenize(StringNoun.lower()) if token not in stop]

print("Lemmatization")
print(40*'-')
print("'{}':".format(StringAction))
for w in Action_tokens:
    print(lemmatizer.lemmatize(w))

print("  ... vs ...  ")


print("'{}':".format(StringNoun))
for w in Noun_tokens:
    print(lemmatizer.lemmatize(w))

Lemmatization
----------------------------------------
'We are meeting':
meeting
  ... vs ...  
'We had a meeting':
meeting


## Text Preprocessing in Full Text Search

Now that we have seen these concepts in isolation, let's revisit the PostgreSQL Full Text Search.

Specifically, I have loaded a new table that breaks each document from the book in the lab into individual lines.  
You can review the load process for this data in [this notebook](./Load_BookLines.ipynb).
The notebook includes a few queries showing how the loaded lines may be a little more useful than just document matching.


```SQL
dsa_ro=# select count(*),sum(length(line)) from ir.booklines;
 count |   sum   
-------+---------
 31259 | 4315223
(1 row)
```

#### 31K lines

#### Looking at a random line that was added:

```SQL
dsa_ro=# \x 
Expanded display is on.
dsa_ro=# select * from ir.booklines where id = 34;
-[ RECORD 1 ]-+-------------------------------------------
id            | 34
name          | ./book/zeph.txt
line_no       | 34
line          | 2:14: And flocks shall lie down in the midst of her, 
                all the beasts of the nations: both the cormorant and 
                the bittern shall lodge in the upper lintels of it; 
                their voice shall sing in the windows; desolation shall 
                be in the thresholds: for he shall uncover the cedar work.
line_tsv_gin  | '14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 
                'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 
                'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 
                'uncov':49 'upper':29 'voic':34 'window':39 'work':52
line_tsv_gist | '14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 
                'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 
                'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 
                'uncov':49 'upper':29 'voic':34 'window':39 'work':52

```


We see that the line 

    2:14: And flocks shall lie down in the midst of her, 
    all the beasts of the nations: both the cormorant and 
    the bittern shall lodge in the upper lintels of it; 
    their voice shall sing in the windows; desolation shall 
    be in the thresholds: for he shall uncover the cedar work.

is tokenized into the following **_text search vector_ (tsv)**: 
```
'14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 
'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 
'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 
'uncov':49 'upper':29 'voic':34 'window':39 'work':52
```

Let's compare this line to the Python tokenizing:


In [11]:
line = "2:14: And flocks shall lie down in the midst of her, " + \
    "all the beasts of the nations: both the cormorant and  " + \
    "the bittern shall lodge in the upper lintels of it;  " + \
    "their voice shall sing in the windows; desolation shall  " + \
    "be in the thresholds: for he shall uncover the cedar work."
print(line)

2:14: And flocks shall lie down in the midst of her, all the beasts of the nations: both the cormorant and  the bittern shall lodge in the upper lintels of it;  their voice shall sing in the windows; desolation shall  be in the thresholds: for he shall uncover the cedar work.


### Compare processing of the line:

In the cell below, use each of the Python techniques above to process the `line` variable.
Then, answer the questions in the cells below the code to compare and contrast the Python methods with the techniques apparently applied by PostgreSQL.

In [12]:
line.lower()

'2:14: and flocks shall lie down in the midst of her, all the beasts of the nations: both the cormorant and  the bittern shall lodge in the upper lintels of it;  their voice shall sing in the windows; desolation shall  be in the thresholds: for he shall uncover the cedar work.'

In [13]:
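# Import libraries
import pprint
import string
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Create the stemmer and lemmatizer used below
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()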

# 1) Add code to process the line variable with 
# Stop Word tokenization, Stemming, and Lemmatization
# ---------------------------------------------


stop = stopwords.words('english')

tokens = [token for token in word_tokenize(line.lower()) if token not in stop]
print("After removing stop words")
print(40*'-')
print(tokens)

stems=[]
for token in tokens:
    stems.append(stemmer.stem(token))
print("\n After Stemming")
print(40*'-')
print(stems)

lemmas=[]
print("\n After Lemmatization")
print(40*'-')
for w in tokens:
    lemmas.append(lemmatizer.lemmatize(w))
print(lemmas)

After removing stop words
----------------------------------------
['2:14', ':', 'flocks', 'shall', 'lie', 'midst', ',', 'beasts', 'nations', ':', 'cormorant', 'bittern', 'shall', 'lodge', 'upper', 'lintels', ';', 'voice', 'shall', 'sing', 'windows', ';', 'desolation', 'shall', 'thresholds', ':', 'shall', 'uncover', 'cedar', 'work', '.']

 After Stemming
----------------------------------------
['2:14', ':', 'flock', 'shall', 'lie', 'midst', ',', 'beast', 'nation', ':', 'cormor', 'bittern', 'shall', 'lodg', 'upper', 'lintel', ';', 'voic', 'shall', 'sing', 'window', ';', 'desol', 'shall', 'threshold', ':', 'shall', 'uncov', 'cedar', 'work', '.']

 After Lemmatization
----------------------------------------
['2:14', ':', 'flock', 'shall', 'lie', 'midst', ',', 'beast', 'nation', ':', 'cormorant', 'bittern', 'shall', 'lodge', 'upper', 'lintel', ';', 'voice', 'shall', 'sing', 'window', ';', 'desolation', 'shall', 'threshold', ':', 'shall', 'uncover', 'cedar', 'work', '.']
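
To make the comparison with PostgreSQL concrete, the text search vector can be parsed and set against the stemmed Python tokens. The following is a rough sketch rather than a definitive method: `parse_tsvector` is a hypothetical helper, and both the tsv text and the stem list are copied by hand from the outputs shown earlier instead of being queried or recomputed.

```python
import re
import string

# Text search vector copied from the ir.booklines record shown earlier
tsv_text = (
    "'14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 "
    "'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 "
    "'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 "
    "'uncov':49 'upper':29 'voic':34 'window':39 'work':52"
)

# Stemmed tokens copied from the "After Stemming" output above
stems = ['2:14', ':', 'flock', 'shall', 'lie', 'midst', ',', 'beast', 'nation',
         ':', 'cormor', 'bittern', 'shall', 'lodg', 'upper', 'lintel', ';',
         'voic', 'shall', 'sing', 'window', ';', 'desol', 'shall', 'threshold',
         ':', 'shall', 'uncov', 'cedar', 'work', '.']

def parse_tsvector(text):
    """Hypothetical helper: map each lexeme to its list of word positions."""
    return {
        lexeme: [int(p) for p in positions.split(",")]
        for lexeme, positions in re.findall(r"'([^']+)':([\d,]+)", text)
    }

pg_lexemes = parse_tsvector(tsv_text)

# Drop the single-character punctuation tokens that NLTK keeps
python_stems = {s for s in stems if s not in string.punctuation}

only_python = python_stems - set(pg_lexemes)
only_postgres = set(pg_lexemes) - python_stems

print(sorted(only_python))    # ['2:14']
print(sorted(only_postgres))  # ['14', '2']
print(pg_lexemes['shall'])    # [5, 25, 35, 41, 48]
```

In this sample, the Porter stems match the PostgreSQL lexemes for every word, and the only difference is how the verse number `2:14` is split. The lemmatized output, by contrast, keeps forms such as `cormorant` and `desolation`, which suggests that PostgreSQL applies stop-word removal and stemming here rather than lemmatization. Note also that the stored positions (for example `'shall':5`) still count the stop words that were removed.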
