# Bigrams, Stemming and Lemmatizing

* **Text mining** is the process of deriving high-quality information from text.
* Text mining incorporates ideas from **Natural Language Processing** (NLP).
* One common task in text mining is **tokenization**.
  * "**Tokens**" are usually **individual words**.
  * "**Tokenization**" is breaking a text up into its individual words.
* A token is the **atomic unit** of text comparison.
  * If we want to compare two documents, we can count how many tokens they share.
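
The idea of comparing documents by their shared tokens can be sketched with the standard library alone. This is only a minimal illustration: the regular-expression tokenizer and the helper names `tokenize` and `shared_tokens` are assumptions made for this example, not part of NLTK, and the two sample sentences are invented.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def shared_tokens(doc_a, doc_b):
    """Return the set of distinct tokens that appear in both documents."""
    return set(tokenize(doc_a)) & set(tokenize(doc_b))

# Two invented sample documents
doc_a = "Japan raised tariffs on imported rice."
doc_b = "The tariffs on rice angered exporters in Japan."

common = shared_tokens(doc_a, doc_b)
print(sorted(common), len(common))
```

Note that stop words such as "on" count as shared tokens here, which is one reason they are often filtered out before comparing documents.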

## 1. Exploring the `reuters` corpus

* A **text corpus** is a large collection of text.
* The Reuters Corpus contains 10,788 **news** documents totaling 1.3 million words.
  * The documents have been classified into **90 topics**
  * and grouped into **two sets**, called "training" and "test".

In [1]:
import nltk
from nltk.corpus import reuters

nltk.download('reuters')  # fetch the corpus data (skipped if it is already up to date)
print(reuters.readme())   # description of the corpus

In [None]:
reuters.fileids() # the ids of the train set and test set

In [None]:
reuters.fileids()[-1]

In [None]:
len(reuters.fileids())

In [None]:
reuters.categories()

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.sents('test/14826')

In [None]:
# we can specify the words we want in terms of files 
reuters.words('training/9865')[:6]

In [None]:
# we can specify the words we want in terms of categories 
reuters.words(categories='barley')

## 2. n-grams

* An **n-gram** is a contiguous sequence of **n items** from a given sample of **text** or speech.

* an n-gram of size 1 is referred to as a "**unigram**"
* size 2 is a "**bigram**"
* size 3 is a "**trigram**"
* etc.
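
The definitions above can be sketched in a few lines of plain Python. The helper name `simple_ngrams` and the sample sentence are assumptions made for this illustration; NLTK provides its own `bigrams` function, which is used later in this notebook.

```python
def simple_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # Zipping the token list with copies of itself shifted by 1, 2, ..., n-1
    # positions lines up each token with the n-1 tokens that follow it.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox".split()

unigram_list = simple_ngrams(tokens, 1)
bigram_list = simple_ngrams(tokens, 2)
trigram_list = simple_ngrams(tokens, 3)

print(unigram_list)
print(bigram_list)
print(trigram_list)
```

A sequence of `k` tokens yields `k - n + 1` n-grams, so the four tokens above give four unigrams, three bigrams, and two trigrams.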

**bigram example**

<img src="figures/Ngram-language-model.png" width="50%">

### Tokenization

In [None]:
trade_words = reuters.words(categories='trade')
len(trade_words)

In [None]:
trade_words_condensed = trade_words[:100]
trade_words_condensed

### Stopwords

* Text may contain stop words such as 'the', 'is', and 'are'.
* Stop words can be **filtered** out of the text before it is processed.

In [None]:
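from nltk.corpus import stopwords

nltk.download('stopwords')  # the stop word lists are a separate download
# Build the stop word set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))

# Lowercase the tokens and drop the stop words
trade_words_condensed = [w.lower() for w in trade_words_condensed if w.lower() not in english_stopwords]
trade_words_condensed[:10]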

### Punctuation

In [None]:
import string  # String constants (e.g. ascii_lowercase, punctuation) plus helpers such as string.capwords()

# Remove punctuation tokens. `w not in string.punctuation` is a substring test, so it
# drops single punctuation characters (and any run that happens to appear in that string).
# punct_combo covers other multi-character tokens: a punctuation mark next to a
# double quote, plus a few common combinations.
punct_combo = [c + "\"" for c in string.punctuation] + ["\"" + c for c in string.punctuation] + [".-", ":-", "..", "..."]
trade_words_condensed = [w for w in trade_words_condensed if w not in string.punctuation and w not in punct_combo]
trade_words_condensed

### Bigrams

In [None]:
from nltk import bigrams

bi_trade_words_condensed = list(bigrams(trade_words_condensed))
bi_trade_words_condensed[:5]

### Frequency distributions

A **frequency distribution** counts the number of times that each outcome of an experiment occurs. Here, each outcome is a bigram.
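
As a rough sketch of what such a count looks like, the standard library's `collections.Counter` can tally bigrams directly (NLTK's `FreqDist` is a subclass of `Counter`). The token list below is invented for illustration.

```python
from collections import Counter

# Invented sample tokens
tokens = ["trade", "deficit", "trade", "surplus", "trade", "deficit"]

# Pair each token with its successor to form bigrams
pairs = list(zip(tokens, tokens[1:]))

# Count how often each bigram occurs
counts = Counter(pairs)

top_bigram, top_count = counts.most_common(1)[0]
print(top_bigram, top_count)
```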

In [None]:
from nltk import FreqDist

bi_fdist = FreqDist(bi_trade_words_condensed)

for word, frequency in bi_fdist.most_common(3):
    print(word, frequency)

In [None]:
bi_fdist.plot(3, cumulative=False);

## 3. Stemming

* **Stemming** is the process of **reducing inflected or derived words** to their word stem, base, or root form.
* **Example**: strings such as `cats`, `catlike`, and `catty` have the same stem, **cat**.
* There are several types of stemming algorithms, which differ in **performance** and **accuracy**.
  * The simplest stemmer uses a **lookup table**.
  * **Suffix-stripping** applies a list of "rules", e.g. if the word ends in '`ed`', remove the 'ed'.
  * etc.
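
The two approaches listed above can be combined in a toy stemmer. This is a deliberately crude sketch, not the Porter algorithm: the lookup table, the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Hypothetical lookup table for irregular forms
LOOKUP = {"ran": "run", "geese": "goose"}

# Hypothetical suffix rules, tried in order
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Stem a word with a lookup table first, then naive suffix stripping."""
    word = word.lower()
    if word in LOOKUP:
        return LOOKUP[word]
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left to be a plausible stem
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumped", "jumping", "cats", "geese", "red", "loving"]:
    print(w, "->", toy_stem(w))
```

The result for `loving` is `lov`, which is not a word. Real stemmers add many more rules, but they too can return stems that are not dictionary words.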

In [None]:
from nltk.stem import (PorterStemmer, LancasterStemmer)
from nltk.stem.snowball import SnowballStemmer  # Also known as "Porter2"; generally regarded as an improvement over the original Porter stemmer.

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

print(porter.stem('Re-testing'), lancaster.stem('Re-testing'), snowball.stem('Re-testing'))

In [None]:
# Fun fact: SnowballStemmer can stem several languages besides English.
# For instance, to make a French stemmer: french_stemmer = SnowballStemmer('french')
SnowballStemmer.languages

In [None]:
from nltk import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)

sentence = "So, we'll go no more a-roving. So late into the night, Though the heart be still as loving, And the moon be still as bright."

# The 3-argument form str.maketrans(x, y, z) maps each character in x to the character
# at the same position in y (so x and y must have equal length) and maps every character
# in z (here, string.punctuation) to None, i.e. deletes it.
translator = str.maketrans('', '', string.punctuation)
translator

# An equivalent alternative passes a dictionary that maps every punctuation character to None:
#translator = str.maketrans(dict.fromkeys(string.punctuation))

In [None]:
tokens = word_tokenize(sentence.translate(translator))
tokens[:3]

In [None]:
for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in tokens])

## 4. Lemmatizing

* Lemmatizing is a more complex approach than stemming:
  * first determine the **part of speech** (noun, verb, adjective, etc.) of a word,
  * then apply different **normalization rules** for each part of speech.

* Identifying the **wrong category**, or being unable to identify the right one, limits the added benefit of this approach.
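
The dependence on part of speech can be illustrated with a toy rule table. This is only a sketch under strong assumptions: the irregular-form table, the suffix rules, and the name `toy_lemmatize` are invented for this example and are far cruder than what the WordNet lemmatizer does.

```python
# Hypothetical irregular forms, keyed by (part of speech, word)
IRREGULAR = {
    ("n", "mice"): "mouse",
    ("v", "went"): "go",
    ("a", "better"): "good",
}

# Hypothetical (suffix, replacement) rules for each part of speech
SUFFIX_RULES = {
    "n": [("ies", "y"), ("es", ""), ("s", "")],
    "v": [("ing", ""), ("ed", ""), ("s", "")],
    "a": [("est", ""), ("er", "")],
}

def toy_lemmatize(word, pos="n"):
    """Normalize a word using rules chosen by its part of speech (default: noun)."""
    word = word.lower()
    if (pos, word) in IRREGULAR:
        return IRREGULAR[(pos, word)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        # Require at least three characters to remain before the suffix
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print(toy_lemmatize("boxes"))                 # treated as a noun
print(toy_lemmatize("brightening"))           # no noun rule applies
print(toy_lemmatize("brightening", pos="v"))  # the verb rules strip "ing"
```

With the wrong part of speech, `brightening` comes back unchanged, which mirrors the limitation described above.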

In [None]:
# The default lemmatization method in NLTK
# is the WordNet lemmatizer.
nltk.download('wordnet')
from nltk import WordNetLemmatizer

wnl = WordNetLemmatizer()

print(wnl.lemmatize('brightening'), wnl.lemmatize('boxes'))

In [None]:
# As we saw above, lemmatizing a word sometimes returns it unchanged.
# This is because the default part of speech is noun ('n');
# passing pos='v' treats the word as a verb instead.
wnl.lemmatize('brightening', pos='v')