<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Natural Language Processing: Intro and Preprocessing
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 4: Topic 35</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [1]:
# Use this to install nltk if needed
# !pip install nltk
# !conda install -c anaconda nltk

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import string
import re
import numpy as np

In [3]:
# Use this to download the stopwords if you haven't already - only ever needs to be run once

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\prave\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

#### Natural Language Processing (NLP)

- Machine learning tasks with unstructured free language text.

#### Supervised learning: training on labeled free text documents
- Build document classifiers

<center><img src = "Images/spamvsham.png" />
Spam filtration </center>

<center>
<img src = "Images/doc_classification.jpg" />
Document management systems for your business
</center>

- Using free text as input in regression.
    - e.g., free text reviews to predict restaurant quality 0-10
    - sentiment analysis (extremely displeased to ecstatic)
    
<img src = "Images/anton_ego.jpg" width = 450/>


<img src = "Images/ego_quote.jpg" width = 450/>
<center> Our algorithm predicts a 9.8 for Gusteau's. </center>

Based on text:
- Algorithm predicts Anton was extremely pleased.

<center><img src = "Images/sentiment_analysis.jpg" > Sentiment Analysis</center>

#### Unsupervised Learning 

- Topic modeling
    - learn topics from a collection of documents



<center><img src = "Images/topicmodels.png" width = 900 ></center>

Many, many more types of NLP tasks.
- Just named a few.


Need to represent information in free text in a form useable by an ML model:
- i.e. vectorize/structure information inside body of documents
- create numeric representations of words, sentences, documents
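One of the simplest numeric representations is a count vector per document. The sketch below uses only the standard library and a hypothetical two-document toy corpus (not the notebook's data); it illustrates the idea and is not how any particular library implements it.

```python
from collections import Counter

# Toy documents, already split into lowercase tokens (hypothetical data).
docs = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "slept"],
]

# Vocabulary: every unique token, sorted so the column order is reproducible.
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document, one column per vocabulary token.
count_matrix = []
for doc in docs:
    counts = Counter(doc)
    count_matrix.append([counts[tok] for tok in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

Each row has the same length as the vocabulary, which is why the size of the vocabulary directly sets the number of features.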

Simple example: count vectorizer
<img src = "Images/vectorchart.png" >

Processing text is a multistep process:
- Text pre-processing
- Feature extraction (vectorization)

A simple NLP workflow:

<img src = "Images/text_feature_pipe.png" >

Many types of vectorization schemes exist, some of which are learned from data. But first:

- Text data must be preprocessed.
- This is the first phase in the NLP pipeline.
- **Essential**: helps in learning effective vector representations.

#### Text Preprocessing
1. **Tokenization**
2. Normalization

Tokenizing: cutting text into small semantic subunits (tokens).
<img src = "Images/tokenization.webp" >


Tokenization: language-specific splitting/contraction rules
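To see why splitting rules matter, here is a minimal sketch comparing a plain whitespace split with a small regular expression. The pattern `TOKEN_PATTERN` and the sample sentence are assumptions made for illustration; they are not the rules nltk applies.

```python
import re

text = "Jesus, we can't just take back 53-year-old staffers."

# Naive approach: split on whitespace only; punctuation stays glued to words.
whitespace_tokens = text.split()

# Regex approach: keep runs of word characters together (allowing internal
# apostrophes or hyphens) and emit every other non-space character on its own.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split leaves `'Jesus,'` and `'staffers.'` as tokens, while the regex separates the comma and the period. How contractions and hyphenated words should be treated is exactly the kind of language-specific decision a real tokenizer encodes.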

Many NLP packages with excellent tokenizers (among other things):
- nltk
- spaCy
- gensim

Will use nltk: The Natural Language Toolkit
    
<center><img src = "Images/nltk_logo.png" width = 250></center>    

In [4]:
import nltk # the natural language toolkit

In [5]:
# need to download punkt to access better tokenization rules
# word_tokenize won't work without it
nltk.download('punkt') 


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prave\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [6]:
from nltk.tokenize import word_tokenize # nltk's gold standard word tokenizer
from nltk.tokenize import sent_tokenize # nltk's sentence tokenizer

In [7]:
import pandas as pd
satire_df = pd.read_csv('data/satire_nosatire.csv')

Predict whether an article is satire or real.

In [8]:
satire_df.head()

Unnamed: 0,body,target
0,Noting that the resignation of James Mattis as...,1
1,Desperate to unwind after months of nonstop wo...,1
2,"Nearly halfway through his presidential term, ...",1
3,Attempting to make amends for gross abuses of ...,1
4,Decrying the Senate’s resolution blaming the c...,1


In [9]:
satire_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    1000 non-null   object
 1   target  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [61]:
first_doc = satire_df['body'].iloc[0]
first_doc

'Noting that the resignation of James Mattis as Secretary of Defense marked the ouster of the third top administration official in less than three weeks, a worried populace told reporters Friday that it was unsure how many former Trump staffers it could safely reabsorb. “Jesus, we can’t just take back these assholes all at once—we need time to process one before we get the next,” said 53-year-old Gregory Birch of Naperville, IL echoing the concerns of 323 million Americans in also noting that the country was only now truly beginning to reintegrate former national security advisor Michael Flynn. “This is just not sustainable. I’d say we can handle maybe one or two more former members of Trump’s inner circle over the remainder of the year, but that’s it. This country has its limits.” The U.S. populace confirmed that they could not handle all of these pieces of shit trying to rejoin society at once.'

Let's see what the word tokenizer does:

In [62]:
print(word_tokenize(first_doc, language='english'))

['Noting', 'that', 'the', 'resignation', 'of', 'James', 'Mattis', 'as', 'Secretary', 'of', 'Defense', 'marked', 'the', 'ouster', 'of', 'the', 'third', 'top', 'administration', 'official', 'in', 'less', 'than', 'three', 'weeks', ',', 'a', 'worried', 'populace', 'told', 'reporters', 'Friday', 'that', 'it', 'was', 'unsure', 'how', 'many', 'former', 'Trump', 'staffers', 'it', 'could', 'safely', 'reabsorb', '.', '“', 'Jesus', ',', 'we', 'can', '’', 't', 'just', 'take', 'back', 'these', 'assholes', 'all', 'at', 'once—we', 'need', 'time', 'to', 'process', 'one', 'before', 'we', 'get', 'the', 'next', ',', '”', 'said', '53-year-old', 'Gregory', 'Birch', 'of', 'Naperville', ',', 'IL', 'echoing', 'the', 'concerns', 'of', '323', 'million', 'Americans', 'in', 'also', 'noting', 'that', 'the', 'country', 'was', 'only', 'now', 'truly', 'beginning', 'to', 'reintegrate', 'former', 'national', 'security', 'advisor', 'Michael', 'Flynn', '.', '“', 'This', 'is', 'just', 'not', 'sustainable', '.', 'I', '’', 

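Handles splitting on whitespace, punctuation, and contractions. Note that the typographic apostrophe (’) in this text is emitted as its own token, so *can’t* comes out as 'can', '’', 't'.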

In [12]:
first_doc

'Noting that the resignation of James Mattis as Secretary of Defense marked the ouster of the third top administration official in less than three weeks, a worried populace told reporters Friday that it was unsure how many former Trump staffers it could safely reabsorb. “Jesus, we can’t just take back these assholes all at once—we need time to process one before we get the next,” said 53-year-old Gregory Birch of Naperville, IL echoing the concerns of 323 million Americans in also noting that the country was only now truly beginning to reintegrate former national security advisor Michael Flynn. “This is just not sustainable. I’d say we can handle maybe one or two more former members of Trump’s inner circle over the remainder of the year, but that’s it. This country has its limits.” The U.S. populace confirmed that they could not handle all of these pieces of shit trying to rejoin society at once.'

There are other more powerful tokenizers that can be dialect specific.

Can explore this later.

The sentence tokenizer
- sometimes want to chunk sentences before doing word tokenization.

In [13]:
sent_tokenize(first_doc)

['Noting that the resignation of James Mattis as Secretary of Defense marked the ouster of the third top administration official in less than three weeks, a worried populace told reporters Friday that it was unsure how many former Trump staffers it could safely reabsorb.',
 '“Jesus, we can’t just take back these assholes all at once—we need time to process one before we get the next,” said 53-year-old Gregory Birch of Naperville, IL echoing the concerns of 323 million Americans in also noting that the country was only now truly beginning to reintegrate former national security advisor Michael Flynn.',
 '“This is just not sustainable.',
 'I’d say we can handle maybe one or two more former members of Trump’s inner circle over the remainder of the year, but that’s it.',
 'This country has its limits.” The U.S. populace confirmed that they could not handle all of these pieces of shit trying to rejoin society at once.']
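For contrast, a naive sentence splitter that breaks after every period, question mark or exclamation mark trips over abbreviations. The sketch below is a deliberately simplistic stand-in written with `re.split`, on a made-up sentence; it is not what `sent_tokenize` does internally.

```python
import re

text = "This country has its limits. The U.S. populace disagreed. It said so."

# Split wherever sentence-ending punctuation is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The abbreviation "U.S." is cut off from the rest of its sentence, so three sentences come out as four pieces. In the output above, `sent_tokenize` keeps "The U.S. populace ..." together.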

Word tokenize each chunked sentence:

In [14]:
print([word_tokenize(sent) for sent in sent_tokenize(first_doc)])

[['Noting', 'that', 'the', 'resignation', 'of', 'James', 'Mattis', 'as', 'Secretary', 'of', 'Defense', 'marked', 'the', 'ouster', 'of', 'the', 'third', 'top', 'administration', 'official', 'in', 'less', 'than', 'three', 'weeks', ',', 'a', 'worried', 'populace', 'told', 'reporters', 'Friday', 'that', 'it', 'was', 'unsure', 'how', 'many', 'former', 'Trump', 'staffers', 'it', 'could', 'safely', 'reabsorb', '.'], ['“', 'Jesus', ',', 'we', 'can', '’', 't', 'just', 'take', 'back', 'these', 'assholes', 'all', 'at', 'once—we', 'need', 'time', 'to', 'process', 'one', 'before', 'we', 'get', 'the', 'next', ',', '”', 'said', '53-year-old', 'Gregory', 'Birch', 'of', 'Naperville', ',', 'IL', 'echoing', 'the', 'concerns', 'of', '323', 'million', 'Americans', 'in', 'also', 'noting', 'that', 'the', 'country', 'was', 'only', 'now', 'truly', 'beginning', 'to', 'reintegrate', 'former', 'national', 'security', 'advisor', 'Michael', 'Flynn', '.'], ['“', 'This', 'is', 'just', 'not', 'sustainable', '.'], ['I'

List of lists: each sentence, word tokenized.

For our use case: 
- vectorizing documents into word-count vectors
- word tokenization suffices

- Word tokenize each document in collection of documents
- List of token lists for each document in collection: **corpus**
- Unique tokens in entire corpus: **dictionary**
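The two terms can be made concrete on a tiny example. This is a sketch on two hypothetical sentences, with `str.split` standing in for `word_tokenize` so that it runs without any downloads.

```python
from itertools import chain

raw_docs = ["The cat sat.", "The dog sat down."]  # hypothetical documents

# Corpus: one token list per document (str.split is a stand-in tokenizer).
corpus = [doc.split() for doc in raw_docs]

# Flatten the list of lists, then keep the unique tokens as the dictionary.
all_tokens = list(chain.from_iterable(corpus))
dictionary = sorted(set(all_tokens))

print(corpus)
print(dictionary)
```

Because the stand-in tokenizer does not separate punctuation, `'sat'` and `'sat.'` end up as two different dictionary entries, which is one reason a proper word tokenizer is used in the notebook.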

In [15]:
corpus = [word_tokenize(doc) for doc in satire_df['body']]
print(corpus[0:4])

[['Noting', 'that', 'the', 'resignation', 'of', 'James', 'Mattis', 'as', 'Secretary', 'of', 'Defense', 'marked', 'the', 'ouster', 'of', 'the', 'third', 'top', 'administration', 'official', 'in', 'less', 'than', 'three', 'weeks', ',', 'a', 'worried', 'populace', 'told', 'reporters', 'Friday', 'that', 'it', 'was', 'unsure', 'how', 'many', 'former', 'Trump', 'staffers', 'it', 'could', 'safely', 'reabsorb', '.', '“', 'Jesus', ',', 'we', 'can', '’', 't', 'just', 'take', 'back', 'these', 'assholes', 'all', 'at', 'once—we', 'need', 'time', 'to', 'process', 'one', 'before', 'we', 'get', 'the', 'next', ',', '”', 'said', '53-year-old', 'Gregory', 'Birch', 'of', 'Naperville', ',', 'IL', 'echoing', 'the', 'concerns', 'of', '323', 'million', 'Americans', 'in', 'also', 'noting', 'that', 'the', 'country', 'was', 'only', 'now', 'truly', 'beginning', 'to', 'reintegrate', 'former', 'national', 'security', 'advisor', 'Michael', 'Flynn', '.', '“', 'This', 'is', 'just', 'not', 'sustainable', '.', 'I', '’',

For purposes of understanding the dictionary/vocabulary:
- flattening corpus

In [16]:
import itertools
flattenedcorpus_tokens = pd.Series(list(itertools.chain.from_iterable(corpus)))
print(flattenedcorpus_tokens.shape)

(464861,)


Dictionary, then, is unique values of tokens in corpus:

In [17]:
dictionary = pd.Series(
    flattenedcorpus_tokens.unique())
print(len(dictionary))

30182


In [18]:
flattenedcorpus_tokens.value_counts()

,               21510
the             21378
.               16432
to              11244
of              10582
                ...  
insert              1
kidney              1
ovaries             1
inhabit             1
inter-island        1
Length: 30182, dtype: int64

Tokens in the dictionary become features for a token-frequency matrix.

<center><img src = "Images/vectorchart.png" ></center>

In this light, think about the dictionary:

- any problems?
- look at various types of tokens. anything that you notice?

#### Problem 1

- About 30,000 features for only 1,000 documents: far too many. Curse of dimensionality.

#### Problem 2
- Want features to help us in classification task
- But many useless features: tokens too common in the English language to be informative.
    - punctuation
    - prepositions, articles, etc.: **stop words**

In [19]:
flattenedcorpus_tokens.value_counts()[0:20]

,       21510
the     21378
.       16432
to      11244
of      10582
and      9997
a        9361
in       8066
’        4828
is       4762
that     4153
on       3904
for      3884
s        3354
“        3034
”        2889
The      2818
said     2661
with     2559
as       2431
dtype: int64

#### Problem 3

In [20]:
flattenedcorpus_tokens.isin(["warning"]).sum()

35

In [21]:
flattenedcorpus_tokens.isin(["Warning"]).sum()

4

Same exact word: just capitalized
- Shouldn't be independent feature.
- lowercase all of these.
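A minimal sketch of the effect of lower casing on token counts, using a hypothetical handful of tokens rather than the notebook's corpus:

```python
from collections import Counter

tokens = ["Warning", "warning", "warning", "WARNING", "storm"]  # toy data

# Counting raw tokens keeps every casing variant as a separate feature.
raw_counts = Counter(tokens)

# Lower casing first merges the variants into one feature.
folded_counts = Counter(tok.lower() for tok in tokens)

print(raw_counts)
print(folded_counts)
```

Four distinct features collapse to two, and the counts for the casing variants are pooled.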

#### Problem 4

In [22]:
flattenedcorpus_tokens.isin(["warns"]).sum()

2

In [23]:
flattenedcorpus_tokens.isin(["warned"]).sum()

58

In [24]:
flattenedcorpus_tokens.isin(["warn"]).sum()

6

All of these are treated as unique features:
- but are just variants of the same word
- need to normalize these in some way

#### Problem 5

Let's get the number of tokens that occur fewer than five times in the entire corpus:

In [64]:
# count dictionary tokens whose corpus-wide frequency is below 5
num_rare_tokens = (flattenedcorpus_tokens.
                   value_counts() < 5).sum()
num_rare_tokens

22301

Roughly 3/4 of the unique tokens (22,301 of 30,182) appear **fewer than five times**!

- Rare tokens are not useful to keep around.
- Not useful in building relationships between features and target.
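The same kind of count can be reproduced with the standard library. The sketch below uses a hypothetical token list and an assumed cutoff `threshold`; it reports what share of the unique tokens falls below that cutoff.

```python
from collections import Counter

# Toy token stream: a few frequent tokens and several rare ones.
tokens = ["the"] * 6 + ["said"] * 5 + ["warn"] * 2 + ["kidney", "ovaries", "inhabit"]

counts = Counter(tokens)
threshold = 5  # assumed cutoff: tokens seen fewer than 5 times count as rare

rare = [tok for tok, n in counts.items() if n < threshold]
share_rare = len(rare) / len(counts)

print(rare)
print(round(share_rare, 2))
```

Changing `threshold` to 2 restricts the list to tokens that occur exactly once.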

#### Problem 6

A few hundred of these tokens are numbers: 
- they rarely carry semantic meaning that will aid in classification

In [26]:
dictionary[dictionary.str.isnumeric()]

71        323
129      2016
260      2018
369      2019
505      2020
         ... 
29810    1209
29820     177
29893      77
29957     152
29991     167
Length: 398, dtype: object

#### Addressing these problems step-by-step

- Lower casing, removing punctuation, and stop words.
- Keep only alphabetic tokens (drop numbers)

In [27]:
# imports package with many stopword lists
from nltk.corpus import stopwords

# get common stop words in english that we'll remove during tokenization/text normalization
stop_words = stopwords.words('english')
print(stop_words[0:5])

['i', 'me', 'my', 'myself', 'we']


Create a simple helper function:

In [65]:
def first_step_normalizer(doc):
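    # keeps alphabetic tokens only (drops punctuation and numbers) and filters out stop words,
    # then lower cases the surviving tokens.
    # note: the stop-word check runs before lower casing, so capitalized stop words
    # such as 'This' or 'The' are not filtered out
    norm_text = [x.lower() for x in word_tokenize(doc) if (x.isalpha() and x not in stop_words)]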
    return norm_text
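In the helper above, the stop-word test is applied to the original casing, and nltk's English stop words are lower case, so sentence-initial words such as 'This' or 'The' slip through. A variant that lower cases first could look like the sketch below. The function name `normalize_lower_first`, the tiny `STOP_WORDS` set and the regex tokenizer are stand-ins chosen so the example runs without nltk data; they are not part of the notebook.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses stopwords.words('english').
STOP_WORDS = {"the", "this", "i", "is", "of", "a"}

def normalize_lower_first(doc, stop_words=STOP_WORDS):
    # Simple regex stand-in for word_tokenize: words or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", doc)
    # Lower case before filtering so 'The' and 'the' are treated alike.
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok.isalpha() and tok not in stop_words]

result = normalize_lower_first("This is the 3rd resignation. The populace worried.")
print(result)
```

Switching the order would change the token lists and dictionary sizes reported in the rest of this notebook, so the numbers shown here correspond to the original helper.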

In [29]:
satire_df['tok_norm'] = satire_df['body'].apply(first_step_normalizer)
satire_df.head()

Unnamed: 0,body,target,tok_norm
0,Noting that the resignation of James Mattis as...,1,"[noting, resignation, james, mattis, secretary..."
1,Desperate to unwind after months of nonstop wo...,1,"[desperate, unwind, months, nonstop, work, inv..."
2,"Nearly halfway through his presidential term, ...",1,"[nearly, halfway, presidential, term, donald, ..."
3,Attempting to make amends for gross abuses of ...,1,"[attempting, make, amends, gross, abuses, powe..."
4,Decrying the Senate’s resolution blaming the c...,1,"[decrying, senate, resolution, blaming, crown,..."


In [30]:
print(satire_df['tok_norm'].iloc[0])

['noting', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'marked', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'three', 'weeks', 'worried', 'populace', 'told', 'reporters', 'friday', 'unsure', 'many', 'former', 'trump', 'staffers', 'could', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'assholes', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregory', 'birch', 'naperville', 'il', 'echoing', 'concerns', 'million', 'americans', 'also', 'noting', 'country', 'truly', 'beginning', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'this', 'sustainable', 'i', 'say', 'handle', 'maybe', 'one', 'two', 'former', 'members', 'trump', 'inner', 'circle', 'remainder', 'year', 'this', 'country', 'the', 'populace', 'confirmed', 'could', 'handle', 'pieces', 'shit', 'trying', 'rejoin', 'society']


In [31]:
norm_toks_flattened = pd.Series(list(
    itertools.chain(*satire_df['tok_norm'])))
new_dictionary = norm_toks_flattened.unique()
print(len(new_dictionary))

23067


This process removed about 7,100 features from the dictionary (from 30,182 down to 23,067).

In [32]:
print(len(dictionary))

30182


#### Next step: stemming/lemmatizing
- Converting variants of the same word to a base form or root

Stemmers consolidate similar words by chopping off the ends of the words.
<center><img src = "Images/stemmer.png" width = 200> Stem isn't always a word.</center>


Different stemming algorithms (in order of increasing aggression):
- Porter stemmer
- Snowball stemmer (faster, more aggressive, smarter)
- Lancaster stemmer (most aggressive, **ultrafast**)



<img src = "Images/stemmers.jpg" >

In [66]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [67]:
p_stemmer = PorterStemmer()
s_stemmer = SnowballStemmer(language="english")
l_stemmer = LancasterStemmer()

Running a Porter stemmer on a document

In [35]:
sample_doc = satire_df['tok_norm'].iloc[0]
print(sample_doc)

['noting', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'marked', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'three', 'weeks', 'worried', 'populace', 'told', 'reporters', 'friday', 'unsure', 'many', 'former', 'trump', 'staffers', 'could', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'assholes', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregory', 'birch', 'naperville', 'il', 'echoing', 'concerns', 'million', 'americans', 'also', 'noting', 'country', 'truly', 'beginning', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'this', 'sustainable', 'i', 'say', 'handle', 'maybe', 'one', 'two', 'former', 'members', 'trump', 'inner', 'circle', 'remainder', 'year', 'this', 'country', 'the', 'populace', 'confirmed', 'could', 'handle', 'pieces', 'shit', 'trying', 'rejoin', 'society']


.stem(token) method

In [36]:
port_stemmed_doc  = [p_stemmer.stem(token) 
                     for token in sample_doc]
print(port_stemmed_doc)

['note', 'resign', 'jame', 'matti', 'secretari', 'defens', 'mark', 'ouster', 'third', 'top', 'administr', 'offici', 'less', 'three', 'week', 'worri', 'populac', 'told', 'report', 'friday', 'unsur', 'mani', 'former', 'trump', 'staffer', 'could', 'safe', 'reabsorb', 'jesu', 'take', 'back', 'asshol', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregori', 'birch', 'napervil', 'il', 'echo', 'concern', 'million', 'american', 'also', 'note', 'countri', 'truli', 'begin', 'reintegr', 'former', 'nation', 'secur', 'advisor', 'michael', 'flynn', 'thi', 'sustain', 'i', 'say', 'handl', 'mayb', 'one', 'two', 'former', 'member', 'trump', 'inner', 'circl', 'remaind', 'year', 'thi', 'countri', 'the', 'populac', 'confirm', 'could', 'handl', 'piec', 'shit', 'tri', 'rejoin', 'societi']


Compare Porter and Snowball stemmer on a document

In [37]:
print(port_stemmed_doc)

['note', 'resign', 'jame', 'matti', 'secretari', 'defens', 'mark', 'ouster', 'third', 'top', 'administr', 'offici', 'less', 'three', 'week', 'worri', 'populac', 'told', 'report', 'friday', 'unsur', 'mani', 'former', 'trump', 'staffer', 'could', 'safe', 'reabsorb', 'jesu', 'take', 'back', 'asshol', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregori', 'birch', 'napervil', 'il', 'echo', 'concern', 'million', 'american', 'also', 'note', 'countri', 'truli', 'begin', 'reintegr', 'former', 'nation', 'secur', 'advisor', 'michael', 'flynn', 'thi', 'sustain', 'i', 'say', 'handl', 'mayb', 'one', 'two', 'former', 'member', 'trump', 'inner', 'circl', 'remaind', 'year', 'thi', 'countri', 'the', 'populac', 'confirm', 'could', 'handl', 'piec', 'shit', 'tri', 'rejoin', 'societi']


In [38]:
snowball_stemmed_doc  = [s_stemmer.stem(token) 
                     for token in sample_doc]
print(snowball_stemmed_doc)

['note', 'resign', 'jame', 'matti', 'secretari', 'defens', 'mark', 'ouster', 'third', 'top', 'administr', 'offici', 'less', 'three', 'week', 'worri', 'populac', 'told', 'report', 'friday', 'unsur', 'mani', 'former', 'trump', 'staffer', 'could', 'safe', 'reabsorb', 'jesus', 'take', 'back', 'asshol', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregori', 'birch', 'napervill', 'il', 'echo', 'concern', 'million', 'american', 'also', 'note', 'countri', 'truli', 'begin', 'reintegr', 'former', 'nation', 'secur', 'advisor', 'michael', 'flynn', 'this', 'sustain', 'i', 'say', 'handl', 'mayb', 'one', 'two', 'former', 'member', 'trump', 'inner', 'circl', 'remaind', 'year', 'this', 'countri', 'the', 'populac', 'confirm', 'could', 'handl', 'piec', 'shit', 'tri', 'rejoin', 'societi']


Nearly identical results. Snowball is generally faster. Often also better.

Marked difference in results between Porter/Snowball vs. Lancaster

In [39]:
print(snowball_stemmed_doc)

['note', 'resign', 'jame', 'matti', 'secretari', 'defens', 'mark', 'ouster', 'third', 'top', 'administr', 'offici', 'less', 'three', 'week', 'worri', 'populac', 'told', 'report', 'friday', 'unsur', 'mani', 'former', 'trump', 'staffer', 'could', 'safe', 'reabsorb', 'jesus', 'take', 'back', 'asshol', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregori', 'birch', 'napervill', 'il', 'echo', 'concern', 'million', 'american', 'also', 'note', 'countri', 'truli', 'begin', 'reintegr', 'former', 'nation', 'secur', 'advisor', 'michael', 'flynn', 'this', 'sustain', 'i', 'say', 'handl', 'mayb', 'one', 'two', 'former', 'member', 'trump', 'inner', 'circl', 'remaind', 'year', 'this', 'countri', 'the', 'populac', 'confirm', 'could', 'handl', 'piec', 'shit', 'tri', 'rejoin', 'societi']


In [40]:
lancaster_stemmed_doc  = [l_stemmer.stem(token) 
                     for token in sample_doc]

print(lancaster_stemmed_doc)

['not', 'resign', 'jam', 'mat', 'secret', 'defens', 'mark', 'oust', 'third', 'top', 'admin', 'off', 'less', 'three', 'week', 'worry', 'populac', 'told', 'report', 'friday', 'uns', 'many', 'form', 'trump', 'staff', 'could', 'saf', 'reabsorb', 'jes', 'tak', 'back', 'asshol', 'nee', 'tim', 'process', 'on', 'get', 'next', 'said', 'greg', 'birch', 'napervil', 'il', 'echo', 'concern', 'mil', 'am', 'also', 'not', 'country', 'tru', 'begin', 'reintegr', 'form', 'nat', 'sec', 'adv', 'michael', 'flyn', 'thi', 'sustain', 'i', 'say', 'handl', 'mayb', 'on', 'two', 'form', 'memb', 'trump', 'in', 'circ', 'remaind', 'year', 'thi', 'country', 'the', 'populac', 'confirm', 'could', 'handl', 'piec', 'shit', 'try', 'rejoin', 'socy']


#### Advantages/Disadvantages of stemming:
- Uses simple, **fast** rule-based suffix-stripping algorithms to normalize word variants
- Stems not always words
- Can produce odd base forms or merge unrelated words (e.g. Lancaster maps 'noting' to 'not')

#### Lemmatization

- Another way to convert inflections of word to a base form 
- Not simply cutting to word root

Changes to word *lemma*:
- is, was, were $\rightarrow$ be
- has, having, had $\rightarrow$ have
- leafs, leaves $\rightarrow$ leaf

This enhanced ability comes at a small cost:

- Requires part of speech (POS) information
- due to possible ambiguities in form

Example:
- *leaves* (verb or noun)
- *leaves* (noun) $\rightarrow$ leaf
- *leaves* (verb) $\rightarrow$ leave
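The role of the part of speech can be sketched with a lookup table keyed on (token, POS). The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and cover only a handful of words; a real lemmatizer relies on a full lexical database instead.

```python
# Hypothetical lookup: the same surface form maps to different lemmas
# depending on its part of speech ('n' = noun, 'v' = verb).
LEMMA_TABLE = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("had", "v"): "have",
}

def toy_lemmatize(token, pos):
    # Fall back to the token itself when the pair is not in the table.
    return LEMMA_TABLE.get((token, pos), token)

print(toy_lemmatize("leaves", "n"))
print(toy_lemmatize("leaves", "v"))
print(toy_lemmatize("trump", "n"))
```

Without the POS argument there would be no way to choose between "leaf" and "leave" for the same input string.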

nltk has an implementation of the WordNet Lemmatizer:
- links into WordNet
- the mother of all semantic/lexical databases
- stores library of contextual word relationships, POS tagging, etc.
- *excellent* for rule-based document parsing

<img src = "Images/wordnet.webp" >
<center><a href = "https://wordnet.princeton.edu/" >Princeton's WordNet</a> </center>

In [41]:
from nltk import WordNetLemmatizer # lemmatizer using WordNet
from nltk.corpus import wordnet # imports WordNet
from nltk import pos_tag # nltk's native part of speech tagging

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\prave\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prave\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Part of Speech (POS) Tagging

- identify parts of speech of each token from ordered list of tokens.

In [42]:
sent_string = "The dog licked the babies in the face."
sent_tok_list = word_tokenize(sent_string)

In [43]:
sent_tok_list

['The', 'dog', 'licked', 'the', 'babies', 'in', 'the', 'face', '.']

In [44]:
pos_tag(sent_tok_list)

[('The', 'DT'),
 ('dog', 'NN'),
 ('licked', 'VBD'),
 ('the', 'DT'),
 ('babies', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('face', 'NN'),
 ('.', '.')]

<a href = "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">List of Penn Treebank POS tags (the tag set returned by nltk's pos_tag)</a>

Use POS tagging in lemmatizer, but:
- WordNet has different POS tagging system.
- Helper function to convert (reuse this code)

In [45]:
# helper function to change nltk's part of speech tagging to a wordnet format.
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None
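The same conversion can be written as a dictionary lookup on the first letter of the tag. In this sketch the WordNet codes are written as the plain strings 'a', 'v', 'n' and 'r', which is what the printed output in this notebook shows for `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`; the names `TAG_PREFIX_TO_WORDNET` and `to_wordnet_pos` and the `default` parameter are additions for illustration.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag, default=None):
    # Tags outside the four groups (e.g. 'DT', 'IN', 'CD') get the default.
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

tagged = [("The", "DT"), ("dog", "NN"), ("licked", "VBD"), ("babies", "NNS")]
converted = [(tok, to_wordnet_pos(tag)) for tok, tag in tagged]
print(converted)
```

Passing `default="n"` would map unrecognised tags to nouns, so those tokens are kept rather than discarded by a later `pos is not None` filter. Whether that is desirable depends on the task.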

Let's see this tagging in action

In [69]:
# document to list of tuples with tokens and POS tags in nltk format
# converts to wordnet format

wordnet_tagged = [(token, pos_tagger(tag)) for token, tag in pos_tag(sample_doc)]
print(wordnet_tagged)
# [wnl.lemmatize(x[0], x[1]) for x in wordnet_tagged if x[1] is not None]

[('noting', 'v'), ('resignation', 'n'), ('james', 'n'), ('mattis', 'v'), ('secretary', 'n'), ('defense', 'n'), ('marked', 'v'), ('ouster', 'a'), ('third', 'a'), ('top', 'a'), ('administration', 'n'), ('official', 'n'), ('less', 'a'), ('three', None), ('weeks', 'n'), ('worried', 'v'), ('populace', 'n'), ('told', 'v'), ('reporters', 'n'), ('friday', 'a'), ('unsure', 'a'), ('many', 'a'), ('former', 'a'), ('trump', 'n'), ('staffers', 'n'), ('could', None), ('safely', 'r'), ('reabsorb', 'v'), ('jesus', 'n'), ('take', 'v'), ('back', 'r'), ('assholes', 'n'), ('need', 'v'), ('time', 'n'), ('process', 'n'), ('one', None), ('get', 'n'), ('next', 'a'), ('said', 'v'), ('gregory', 'a'), ('birch', 'n'), ('naperville', 'n'), ('il', 'n'), ('echoing', 'v'), ('concerns', 'n'), ('million', None), ('americans', 'n'), ('also', 'r'), ('noting', 'v'), ('country', 'n'), ('truly', 'r'), ('beginning', 'v'), ('reintegrate', 'v'), ('former', 'a'), ('national', 'a'), ('security', 'n'), ('advisor', 'n'), ('michael'

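These (token, POS) pairs can be passed directly to the WordNet lemmatizer. Tokens whose tag maps to `None` (e.g. 'three', 'could') are skipped in the next cell.

- Instantiate the lemmatizer object:
- WordNetLemmatizer()
- has method .lemmatize()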

In [47]:
wnl = WordNetLemmatizer()
doc_lemmatized = [wnl.lemmatize(token, pos) for token, pos in wordnet_tagged if pos is not None]
print(doc_lemmatized)

['note', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'mark', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'week', 'worry', 'populace', 'tell', 'reporter', 'friday', 'unsure', 'many', 'former', 'trump', 'staffer', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'asshole', 'need', 'time', 'process', 'get', 'next', 'say', 'gregory', 'birch', 'naperville', 'il', 'echo', 'concern', 'american', 'also', 'note', 'country', 'truly', 'begin', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'sustainable', 'i', 'say', 'handle', 'maybe', 'former', 'member', 'trump', 'inner', 'circle', 'remainder', 'year', 'country', 'populace', 'confirm', 'handle', 'piece', 'shit', 'try', 'rejoin', 'society']


Compare original tokens and lemmatized tokens

In [48]:
print(sample_doc)

['noting', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'marked', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'three', 'weeks', 'worried', 'populace', 'told', 'reporters', 'friday', 'unsure', 'many', 'former', 'trump', 'staffers', 'could', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'assholes', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregory', 'birch', 'naperville', 'il', 'echoing', 'concerns', 'million', 'americans', 'also', 'noting', 'country', 'truly', 'beginning', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'this', 'sustainable', 'i', 'say', 'handle', 'maybe', 'one', 'two', 'former', 'members', 'trump', 'inner', 'circle', 'remainder', 'year', 'this', 'country', 'the', 'populace', 'confirmed', 'could', 'handle', 'pieces', 'shit', 'trying', 'rejoin', 'society']


In [49]:
print(doc_lemmatized)

['note', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'mark', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'week', 'worry', 'populace', 'tell', 'reporter', 'friday', 'unsure', 'many', 'former', 'trump', 'staffer', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'asshole', 'need', 'time', 'process', 'get', 'next', 'say', 'gregory', 'birch', 'naperville', 'il', 'echo', 'concern', 'american', 'also', 'note', 'country', 'truly', 'begin', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'sustainable', 'i', 'say', 'handle', 'maybe', 'former', 'member', 'trump', 'inner', 'circle', 'remainder', 'year', 'country', 'populace', 'confirm', 'handle', 'piece', 'shit', 'try', 'rejoin', 'society']


Compare snowball stemmer and lemmatization

In [50]:
print(snowball_stemmed_doc)

['note', 'resign', 'jame', 'matti', 'secretari', 'defens', 'mark', 'ouster', 'third', 'top', 'administr', 'offici', 'less', 'three', 'week', 'worri', 'populac', 'told', 'report', 'friday', 'unsur', 'mani', 'former', 'trump', 'staffer', 'could', 'safe', 'reabsorb', 'jesus', 'take', 'back', 'asshol', 'need', 'time', 'process', 'one', 'get', 'next', 'said', 'gregori', 'birch', 'napervill', 'il', 'echo', 'concern', 'million', 'american', 'also', 'note', 'countri', 'truli', 'begin', 'reintegr', 'former', 'nation', 'secur', 'advisor', 'michael', 'flynn', 'this', 'sustain', 'i', 'say', 'handl', 'mayb', 'one', 'two', 'former', 'member', 'trump', 'inner', 'circl', 'remaind', 'year', 'this', 'countri', 'the', 'populac', 'confirm', 'could', 'handl', 'piec', 'shit', 'tri', 'rejoin', 'societi']


In [51]:
print(doc_lemmatized)

['note', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'mark', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'week', 'worry', 'populace', 'tell', 'reporter', 'friday', 'unsure', 'many', 'former', 'trump', 'staffer', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'asshole', 'need', 'time', 'process', 'get', 'next', 'say', 'gregory', 'birch', 'naperville', 'il', 'echo', 'concern', 'american', 'also', 'note', 'country', 'truly', 'begin', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'sustainable', 'i', 'say', 'handle', 'maybe', 'former', 'member', 'trump', 'inner', 'circle', 'remainder', 'year', 'country', 'populace', 'confirm', 'handle', 'piece', 'shit', 'try', 'rejoin', 'society']


Lemmatization: 
- generally better than stemming at semantic text normalization: base forms are real words
- but needs good POS tagging.
- slower than stemming: an issue when processing large amounts of text

Applying lemmatizer to corpus
- useful to combine all preprocessing steps/necessary subroutines into one function


In [52]:
# takes in untokenized document and returns fully normalized token list
def process_doc(doc):

    #initialize lemmatizer
    wnl = WordNetLemmatizer()

    # helper function to change nltk's part of speech tagging to a wordnet format.
    def pos_tagger(nltk_tag):
        if nltk_tag.startswith('J'):
            return wordnet.ADJ
        elif nltk_tag.startswith('V'):
            return wordnet.VERB
        elif nltk_tag.startswith('N'):
            return wordnet.NOUN
        elif nltk_tag.startswith('R'):
            return wordnet.ADV
        else:         
            return None
        
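    # keep alphabetic tokens that are not stop words, then lower case
    # (stop words are matched before lower casing, so capitalized ones such as 'The' pass this filter)
    doc_norm = [tok.lower() for tok in word_tokenize(doc) if (tok.isalpha() and tok not in stop_words)]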

    #  POS detection on the result will be important in telling Wordnet's lemmatizer how to lemmatize
    
    # creates list of tuples with tokens and POS tags in wordnet format
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tag(doc_norm))) 
    doc_norm = [wnl.lemmatize(token, pos) for token, pos in wordnet_tagged if pos is not None]
    
    return doc_norm

In [53]:
print(process_doc(satire_df['body'].iloc[0]))

['note', 'resignation', 'james', 'mattis', 'secretary', 'defense', 'mark', 'ouster', 'third', 'top', 'administration', 'official', 'less', 'week', 'worry', 'populace', 'tell', 'reporter', 'friday', 'unsure', 'many', 'former', 'trump', 'staffer', 'safely', 'reabsorb', 'jesus', 'take', 'back', 'asshole', 'need', 'time', 'process', 'get', 'next', 'say', 'gregory', 'birch', 'naperville', 'il', 'echo', 'concern', 'american', 'also', 'note', 'country', 'truly', 'begin', 'reintegrate', 'former', 'national', 'security', 'advisor', 'michael', 'flynn', 'sustainable', 'i', 'say', 'handle', 'maybe', 'former', 'member', 'trump', 'inner', 'circle', 'remainder', 'year', 'country', 'populace', 'confirm', 'handle', 'piece', 'shit', 'try', 'rejoin', 'society']


Apply text tokenization/normalization to whole body of documents

In [54]:
fully_normalized_corpus = satire_df['body'].apply(process_doc)

In [55]:
fully_normalized_corpus.head()

0    [note, resignation, james, mattis, secretary, ...
1    [desperate, unwind, month, nonstop, work, inve...
2    [nearly, halfway, presidential, term, donald, ...
3    [attempt, make, amends, gross, abuse, power, t...
4    [decry, senate, resolution, blame, crown, prin...
Name: body, dtype: object

In [56]:
flattened_fully_norm = pd.Series(list(itertools.chain(*fully_normalized_corpus)))
len(flattened_fully_norm.unique())

18484

Original dictionary length

In [57]:
print(len(dictionary))

30182


Reduced the dictionary to roughly 60% of its original size (from 30,182 to 18,484 tokens):
- Normalized text appropriately
- Still not dealt with infrequent tokens
- Nor with tokens that are too common but not in the stop-word list.
- Will do when vectorizing.
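Filtering by document frequency is one common way to handle both leftovers. The sketch below is a standard-library illustration on hypothetical documents; the function `filter_by_doc_freq` and its parameters `min_docs` and `max_share` are made-up names, not the API the notebook goes on to use.

```python
from collections import Counter

docs = [
    ["trump", "say", "country"],
    ["senate", "say", "resolution"],
    ["court", "say", "country"],
    ["flood", "say", "landslide"],
]  # toy normalized documents

def filter_by_doc_freq(docs, min_docs=2, max_share=0.9):
    """Keep tokens found in at least min_docs documents and in at most
    max_share of all documents."""
    n_docs = len(docs)
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    keep = {
        tok for tok, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_share
    }
    return [[tok for tok in doc if tok in keep] for doc in docs]

filtered = filter_by_doc_freq(docs)
print(filtered)
```

Here "say" is dropped because it appears in every document, and the tokens that appear in a single document are dropped as too rare; only "country" survives.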

Let's join each token list into a single space-separated string and save to CSV:

In [58]:
fnc_output = fully_normalized_corpus.apply(
    " ".join)

fnc_output.to_csv("data/satire_norm.csv")
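Because each document is stored as one space-separated string, the token lists can be recovered later with `str.split`. The round trip is sketched below with the standard `csv` module and an in-memory buffer instead of a file; the two short token lists are hypothetical and the column name `body` is assumed to match the Series name.

```python
import csv
import io

token_lists = [
    ["note", "resignation", "james"],
    ["desperate", "unwind", "month"],
]  # toy normalized documents

# Join tokens into one string per document, as done before saving.
joined = [" ".join(tokens) for tokens in token_lists]

# Write an index column and a 'body' column to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["", "body"])
for i, text in enumerate(joined):
    writer.writerow([i, text])

# Read it back and split each string into tokens again.
buffer.seek(0)
recovered = [row["body"].split() for row in csv.DictReader(buffer)]
print(recovered)
```

This works because the normalized tokens are purely alphabetic and contain no spaces, so joining and splitting on whitespace is lossless.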

In [59]:
fnc_output

0      note resignation james mattis secretary defens...
1      desperate unwind month nonstop work investigat...
2      nearly halfway presidential term donald trump ...
3      attempt make amends gross abuse power time int...
4      decry senate resolution blame crown prince bru...
                             ...                        
995    britain opposition leader jeremy corbyn push a...
996    turkey take fight islamic state militant syria...
997    malaysia seek reparation goldman sachs group i...
998    israeli court sentence palestinian year impris...
999    least people die due landslide flood trigger t...
Name: body, Length: 1000, dtype: object