# Text Pre-processing

In [None]:
import nltk
print(nltk.__version__)

3.2.5


In [None]:
nltk.download()   # Interactive downloader: open the Models tab and download the Punkt model.

# nltk.download('punkt')   # Direct method: download the Punkt model without the menu.

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.....

True

# Tokenization

Tokens are the basic units of text, typically words (especially in English), and tokenization is the process of converting textual data into tokens.
This is generally the first step in an NLP pipeline. Later steps, such as parsing, tagging, and computing embeddings, operate on the tokens.

If we have multiple sentences, we first do sentence tokenization to get the individual sentences and then do word tokenization to get the words in each sentence.
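
As a rough illustration of this two-stage idea, here is a minimal regex-based sketch using only the standard library. The names `simple_sent_split` and `simple_word_tokenize` are hypothetical helpers for this note, not NLTK functions, and the rules are far cruder than a real tokenizer.

```python
import re

def simple_sent_split(text):
    # Split wherever '.', '?' or '!' is followed by whitespace (naive rule).
    return re.split(r"(?<=[.?!])\s+", text.strip())

def simple_word_tokenize(sentence):
    # A token is a run of word characters or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Tokenize this. Then tag it!"
print([simple_word_tokenize(s) for s in simple_sent_split(text)])
# [['Tokenize', 'this', '.'], ['Then', 'tag', 'it', '!']]
```

The result has the same shape as the list of lists produced in the NLTK cells that follow: one inner list of tokens per sentence.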

Let us see with an example:

In [None]:
quote="The richest man is not he who has the most, but he who needs the least. A nice quote."

### NLTK - Sentence Tokenization

The most useful cues for segmenting a text into sentences are punctuation, like 
* periods, 
* question marks, and 
* exclamation points. 

Question marks and exclamation points are relatively unambiguous markers of sentence boundaries. Periods, on the other hand, are more ambiguous: a period can mark a sentence boundary or be part of a token such as Mr., Inc., or M.Tech.

To address this, sentence tokenizers generally first decide, using rules, an abbreviation dictionary, or machine learning, whether a period is part of a word or a sentence-boundary marker.
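
The abbreviation-dictionary approach can be sketched in a few lines. This is a toy version under stated assumptions: `ABBREVIATIONS` is a tiny hand-made list and `naive_sent_tokenize` is a hypothetical helper, not how NLTK's Punkt model works internally.

```python
# Hypothetical abbreviation dictionary, for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "m.tech."}

def naive_sent_tokenize(text):
    """End a sentence at '.', '?' or '!' unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_with_marker = word[-1] in ".?!"
        if ends_with_marker and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(naive_sent_tokenize("Mr. Smith works at Acme Inc. in Pune. Is he happy? Yes!"))
# ['Mr. Smith works at Acme Inc. in Pune.', 'Is he happy?', 'Yes!']
```

A rule this simple fails when an abbreviation ends a sentence, which is one reason trained models are often preferred.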

Let us see sentence segmentation using the NLTK package.

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
sentences=sent_tokenize(quote)
print(sentences)

['The richest man is not he who has the most, but he who needs the least.', 'A nice quote.']


### NLTK - Word Tokenization

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
words=[word_tokenize(sentence) for sentence in sentences]
print(words)

[['The', 'richest', 'man', 'is', 'not', 'he', 'who', 'has', 'the', 'most', ',', 'but', 'he', 'who', 'needs', 'the', 'least', '.'], ['A', 'nice', 'quote', '.']]


### spaCy - Tokenization

The nlp object:

- contains the processing pipeline
- includes language-specific rules for tokenization

In [None]:
from spacy.lang.en import English
nlp = English()

**Doc Object**

In [None]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


**Token Object**

In [None]:
# Index into the Doc to get a single Token
token = doc[1]
# Get the token text via the .text attribute
print(token.text)

world


### Exercise

Create a list of lists of tokens using NLTK for the following sentences.

In [None]:
brilliant_quote="It always seems impossible until it’s done. A quote by Nelson Mandela."

In [None]:
print([word_tokenize(sentence) for sentence in sent_tokenize(brilliant_quote)])

[['It', 'always', 'seems', 'impossible', 'until', 'it', '’', 's', 'done', '.'], ['A', 'quote', 'by', 'Nelson', 'Mandela', '.']]
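
The separate '’' and 's' tokens above come from the typographic apostrophe (U+2019) in "it’s", which is not handled like a straight apostrophe. One possible workaround, offered here as an assumption rather than part of the original exercise, is to map typographic quotes to ASCII before tokenizing. `QUOTE_MAP` and `normalize_quotes` are hypothetical names.

```python
# Map typographic (curly) quotes to their ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def normalize_quotes(text):
    return text.translate(QUOTE_MAP)

print(normalize_quotes("It always seems impossible until it\u2019s done."))
# It always seems impossible until it's done.
```

The normalized string can then be passed to `sent_tokenize` and `word_tokenize` as before.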


#### Create a single list of tokens

In [None]:
print([word.lower() for sent in sent_tokenize(brilliant_quote) for word in word_tokenize(sent)])

['it', 'always', 'seems', 'impossible', 'until', 'it', '’', 's', 'done', '.', 'a', 'quote', 'by', 'nelson', 'mandela', '.']


#### Create a set of unique tokens

In [None]:
print({word.lower() for sent in sent_tokenize(brilliant_quote) for word in word_tokenize(sent)})

{'.', 'impossible', 'seems', 'it', 'until', 'mandela', 's', '’', 'done', 'always', 'by', 'a', 'quote', 'nelson'}


# Stopwords

### NLTK

From [Intro to Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html):

*Some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.*

*The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.*

In [None]:
from nltk.corpus import stopwords
from string import punctuation

In [None]:
print(list(punctuation))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
set_of_stop_words = set(stopwords.words('english') + list(punctuation))
set_of_stop_words

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'need

In [None]:
non_stop_words = [word for word in word_tokenize(quote) if word not in set_of_stop_words]
print(non_stop_words)

['The', 'richest', 'man', 'needs', 'least', 'A', 'nice', 'quote']
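
'The' and 'A' survive in the output above because the membership test is case-sensitive while the stop list is lowercase. A minimal sketch of the difference, using a tiny hand-made stop list (`toy_stop_words` is hypothetical and much smaller than NLTK's list):

```python
# Tiny illustrative stop list (hypothetical; not NLTK's full list).
toy_stop_words = {"the", "a", "is", "not", "he", "who", "has", "most", "but", ",", "."}

tokens = ["The", "richest", "man", "is", "not", "he", "who", "has", "the", "most", ",",
          "but", "he", "who", "needs", "the", "least", ".", "A", "nice", "quote", "."]

# Case-sensitive test: capitalized stop words slip through.
case_sensitive = [t for t in tokens if t not in toy_stop_words]
# Lowercase each token only for the comparison; the kept tokens retain their case.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)    # ['The', 'richest', 'man', 'needs', 'least', 'A', 'nice', 'quote']
print(case_insensitive)  # ['richest', 'man', 'needs', 'least', 'nice', 'quote']
```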


### spaCy

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
print(sorted(list(nlp.Defaults.stop_words))[:20])

["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also']


### Exercise

Print the stop words that are in spaCy's list but not in NLTK's.

In [None]:
print(nlp.Defaults.stop_words - set_of_stop_words)

{'becoming', 'elsewhere', 'might', 'nevertheless', 'indeed', 'still', 'n‘t', 'six', 'noone', 'last', 'latterly', 'except', 'nobody', 'whether', 'top', 'thus', 'others', 'sometime', 'cannot', 'therefore', 'whatever', 'rather', 'serious', 'whither', "'ll", 'otherwise', 'always', 'upon', 'per', 'five', 'seems', 'either', 'using', 'thru', 'however', "n't", 'thereby', 'unless', 'nine', 'hence', 'would', 'front', 'several', '‘ve', 'throughout', 'sixty', 'could', 'beside', 'within', 'move', 'around', "'re", 'also', 'whoever', 'since', 'thereupon', 'back', 'ten', 'amount', '’s', 'amongst', 'enough', 'forty', 'every', 'namely', 'see', 'keep', 'made', 'seem', "'s", 'quite', 'former', 'somewhere', 'put', 'none', 'various', "'ve", 'hundred', 'towards', 'used', 'never', 'hereafter', 'make', 'whose', 'perhaps', 'herein', 'already', "'d", 'almost', 'thereafter', 'anything', 'everyone', 'must', '’m', 'neither', 'without', 'sometimes', 'among', 'via', 'hereupon', 'mostly', 'side', 'eleven', 'mine', 'le

### Observation

Each library ships its own list of stop words; there is no single universal list.

### Guidelines

**When do we use stopwords?**

Recent practice uses stop-word removal less often. It is generally applied when stop words add little value. For instance, in topic modeling we want the most frequent non-stop words to represent each topic. For sentiment analysis, however, every word may carry meaning, so it is better not to remove any. Even an exclamation mark can change the sentiment.
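
To make the sentiment-analysis caveat concrete, here is a small sketch with a hypothetical stop list that, like many real lists, contains "not" and punctuation. The review text and the names used are illustrative assumptions.

```python
# Hypothetical stop list containing a negation word and punctuation.
toy_stop_words = {"this", "is", "not", "!"}

review = ["This", "movie", "is", "not", "good", "!"]
kept = [t for t in review if t.lower() not in toy_stop_words]

print(kept)  # ['movie', 'good']
```

After filtering, the negative review reads as if it were positive, which is why stop-word removal is usually skipped or customized for sentiment tasks.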


# Stemming and Lemmatization

From the [Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) textbook:

For grammatical reasons, documents are going to use different forms of a word, such as *organize, organizes, and organizing*.
Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.

**The goal of both stemming and lemmatization** is to reduce grammatical forms and sometimes derivationally related forms of a word to a **common base form**. 

For instance:

- am, are, is $\Rightarrow$ be 

- car, cars, car's, cars' $\Rightarrow$ car



**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove grammatical endings only and to return the base or dictionary form of a word, which is known as the **lemma**.

If confronted with the token **saw, stemming might return just s**, whereas **lemmatization would attempt to return either see or saw** depending on whether the use of the token was as a verb or a noun.
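
The contrast can be sketched with two toy functions. Both are deliberately simplistic and hypothetical: `toy_stem` chops a fixed list of suffixes, and `toy_lemmatize` looks up a hand-made table keyed by word and part of speech. Neither reflects how the NLTK or spaCy implementations work internally.

```python
def toy_stem(word):
    """Crude heuristic: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in ("izing", "izes", "ize", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical vocabulary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("are", "verb"): "be",
    ("cars", "noun"): "car",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when it is not in the vocabulary.
    return TOY_LEMMAS.get((word, pos), word)

print([toy_stem(w) for w in ["organize", "organizes", "organizing", "cars"]])
# ['organ', 'organ', 'organ', 'car']
print(toy_lemmatize("saw", "verb"), toy_lemmatize("saw", "noun"))
# see saw
```

The stemmer needs no vocabulary but can produce non-words or collapse unrelated words; the lemmatizer needs both a vocabulary and the part of speech.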

### NLTK -  Stemming

In [None]:
another_sentence = "organize, organizes, and organizing are variations of the word organize."

#### Lancaster Stemmer

In [None]:
from nltk.stem.lancaster import LancasterStemmer
ls=LancasterStemmer()

In [None]:
stem_words=[ls.stem(word) for word in word_tokenize(another_sentence)]
print(stem_words)

['org', ',', 'org', ',', 'and', 'org', 'ar', 'vary', 'of', 'the', 'word', 'org', '.']


This stemmer truncates aggressively, so the stemmed words lose a lot of information.

#### Porter Stemmer

In [None]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
stem_porter_words=[ps.stem(word) for word in word_tokenize(another_sentence)]
print(stem_porter_words)

['organ', ',', 'organ', ',', 'and', 'organ', 'are', 'variat', 'of', 'the', 'word', 'organ', '.']


The Porter stemmer strips suffixes such as 'ize', 'izes', 'izing', and 'ions' by rule, without consulting a dictionary, so a stem like 'variat' is not an actual word.

#### Snowball Stemmer

In [None]:
from nltk.stem.snowball import SnowballStemmer    
ss = SnowballStemmer('english')

In [None]:
stem_snowball_words=[ss.stem(word) for word in word_tokenize(another_sentence)]
print(stem_snowball_words)

['organ', ',', 'organ', ',', 'and', 'organ', 'are', 'variat', 'of', 'the', 'word', 'organ', '.']


For this sentence, the result is the same as the Porter stemmer's.




### NLTK - Lemmatization

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()

In [None]:
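# lemmatize() defaults to pos='n' (noun), so verb forms such as 'organizes' are returned unchanged.
lemma_wordnet_words=[wl.lemmatize(word) for word in word_tokenize(another_sentence)]
print(lemma_wordnet_words)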

['organize', ',', 'organizes', ',', 'and', 'organizing', 'are', 'variation', 'of', 'the', 'word', 'organize', '.']


### spaCy - Lemmatization

In [None]:
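# Note: spacy.lemmatizer and create_lemmatizer() are the spaCy v2 API;
# in spaCy v3 the lemmatizer is a pipeline component instead.
from spacy.lemmatizer import Lemmatizer
lemmatizer = nlp.Defaults.create_lemmatizer()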

In [None]:
print([lemmatizer.lookup(token) for token in nlp(another_sentence)])

[organize, ,, organizes, ,, and, organizing, are, variations, of, the, word, organize, .]


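Note that the output above is unchanged and is a list of `Token` objects rather than strings: `lookup` was given `Token` objects instead of their text, so nothing matched the lookup table. Passing `token.text`, or reading `token.lemma_`, would be the way to get lemmas here.

spaCy does not offer a stemmer, as lemmatization is considered better.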

### Exercise

Use NLTK's Porter stemmer and WordNet lemmatizer.

In [None]:
sentence_cry ='cry, cries, cried and crying are variations of the word cry'

Print the stemmed words

In [None]:
print([ps.stem(word) for word in word_tokenize(sentence_cry)])

['cri', ',', 'cri', ',', 'cri', 'and', 'cri', 'are', 'variat', 'of', 'the', 'word', 'cri']


Print the lemmatized words

In [None]:
print([wl.lemmatize(word) for word in word_tokenize(sentence_cry)])

['cry', ',', 'cry', ',', 'cried', 'and', 'cry', 'are', 'variation', 'of', 'the', 'word', 'cry']


### Observation

Stemming and lemmatization both reduce words to a base form.

Lemmatization uses a vocabulary and the morphological rules of the language, so the resulting tokens are actual words. Stems, such as 'cri' above, need not be.


### Guidelines

As mentioned above, stemming is a crude heuristic that chops off the ends of words, so the resulting tokens may not be actual words. However, stemming is faster and is generally used when computing power is limited.

Lemmatization gives better results, but it is slower and computationally costlier.

**When do we do stemming/lemmatization?**

Again, it depends on the context. For topic modeling, it can be useful. For sentiment analysis, however, it is better avoided.
