![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Practical Work : Preprocessing Options

by Fabian Märki (FHNW) and Andrei Popescu-Belis (HES-SO), modifications for Google Colab made by Daniel Perruchoud (FHNW)

## Summary
The aim of this notebook is to demonstrate three text preprocessing options, extending those seen in Lab 1 using NLTK.  There is no code to complete and no question to answer in this demo.

In [1]:
# Here are some of the packages that will be demonstrated.
import os
import string
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import gensim
from gensim.parsing.preprocessing import preprocess_documents
from timeit import default_timer as timer

!pip install contractions
import contractions



## 1. Pre-processing with NLTK

### a. Lemmatization with NLTK
The purpose of lemmatization is to reduce the different inflected forms of a word to a single normalized form called the _lemma_.  For example, a lemmatizer should be able to determine that _gone_, _going_ and _went_ all have the same lemma _go_.  The output of lemmatization is a proper word, so lemmatization by simple suffix stripping (as done by some stemming algorithms) is not sufficient.

The goal of lemmatization is somewhat similar to that of stemming (demonstrated below), as both map several words to one common root, but the stem is not necessarily an actual word, while the lemma is.
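To make the contrast concrete, here is a toy sketch of dictionary-backed lemmatization. It is not NLTK's implementation: the `EXCEPTIONS` table, the `VOCABULARY` set, the `SUFFIX_RULES` list and the `toy_lemmatize` function are hypothetical names introduced only for illustration, and a real lemmatizer relies on a full lexicon such as WordNet.

```python
# Toy illustration only: a tiny exception table plus suffix rules that are
# accepted only when the result is a known word.

# Irregular forms cannot be recovered by stripping suffixes.
EXCEPTIONS = {"went": "go", "gone": "go", "feet": "foot", "were": "be"}

# A miniature "dictionary" of valid lemmas (hypothetical, for the demo).
VOCABULARY = {"go", "foot", "be", "bat", "hang", "rest"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ing", ""), ("es", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a lemma found in VOCABULARY, or the lowercased word unchanged."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:  # only accept proper words
                return candidate
    return word


lemmas = [toy_lemmatize(w) for w in ["gone", "going", "went", "bats", "feet"]]
print(lemmas)  # ['go', 'go', 'go', 'bat', 'foot']
```

Because a candidate is accepted only if it appears in the vocabulary, the function never returns a non-word such as _studi_; unknown words are simply left unchanged.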

You will need to download the Punkt tokenizer and WordNet as well as Stopwords, by executing `nltk.download('punkt')`, `nltk.download('wordnet')` and `nltk.download('stopwords')`.

In [2]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
lemmatizer = WordNetLemmatizer() # from NLTK

In [4]:
[lemmatizer.lemmatize(w) for w in nltk.word_tokenize("The striped bats are hanging on their feet for rest.")]

['The',
 'striped',
 'bat',
 'are',
 'hanging',
 'on',
 'their',
 'foot',
 'for',
 'rest',
 '.']

The success of lemmatization depends on indicating the correct part-of-speech of the word to the lemmatizer (as the second argument, named `pos`).  Part-of-speech tagging of words will be discussed in a later course, but for now you can download (once) a POS tagger for NLTK by running `nltk.download('averaged_perceptron_tagger')`. This tagger will label every word with its POS tag.  

Then, the tags have to be converted into those known by WordNet (i.e. NOUN, ADJ, ADV, or VERB) so that the WordNetLemmatizer can operate. To convert them, we define the converter function `get_wordnet_pos`.  We then get the result of lemmatization, and can compare it with the previous one.  (See the [NLTK doc here](http://www.nltk.org/api/nltk.corpus.reader.html?highlight=nltk%20corpus%20wordnet#module-nltk.corpus.reader.wordnet), and a [SO question here](https://stackoverflow.com/questions/51634328/wordnetlemmatizer-different-handling-of-wn-adj-and-wn-adj-sat)).

In [5]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [6]:
def get_wordnet_pos(tag):    
    if tag[0] == "J":
        return wordnet.ADJ
    elif tag[0] == "N":
        return wordnet.NOUN
    elif tag[0] == "V":
        return wordnet.VERB
    elif tag[0] == "R":
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [7]:
[lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1])) 
 for w in nltk.pos_tag(nltk.word_tokenize("The striped bats were hanging on their feet for rest."))]

['The',
 'striped',
 'bat',
 'be',
 'hang',
 'on',
 'their',
 'foot',
 'for',
 'rest',
 '.']

### b. Stemming with NLTK
Stemming is the process of reducing a word to its stem, i.e. its _root form_, which is not necessarily a word by itself. For example, the words _fish_, _fishes_ and _fishing_ all stem to _fish_, which is a correct word. On the other hand, the words _study_, _studies_ and _studying_ stem to _studi_, which is not an English word.

Commonly, stemming algorithms (a.k.a. stemmers) are based on rules for suffix stripping.  The most famous example is the **Porter stemmer**, introduced in the 1980s and currently implemented in a variety of programming languages.  The **Snowball stemmer** is an improved version of the Porter stemmer.
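The idea of rule-based suffix stripping can be sketched in a few lines. The code below is a deliberately naive toy, not the Porter or Snowball algorithm; `RULES`, `MIN_STEM_LENGTH` and `toy_stem` are hypothetical names chosen for this illustration, and the rule set covers only the examples discussed above.

```python
# Toy suffix-stripping stemmer: NOT the Porter or Snowball algorithm,
# only an illustration of rule-based suffix removal.

# (suffix, replacement) rules, tried in order; the first applicable rule wins.
RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", "")]
MIN_STEM_LENGTH = 3  # avoid reducing short words such as "is" to "i"


def toy_stem(word):
    """Strip at most one suffix, then rewrite a final 'y' as 'i'."""
    word = word.lower()
    for suffix, replacement in RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            word = stem + replacement
            break
    if word.endswith("y") and len(word) > MIN_STEM_LENGTH:
        word = word[:-1] + "i"
    return word


for group in (["fish", "fishes", "fishing"], ["study", "studies", "studying"]):
    print([toy_stem(w) for w in group])
# ['fish', 'fish', 'fish']
# ['studi', 'studi', 'studi']
```

No dictionary is consulted at any point, which is why the result can be a non-word such as _studi_.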

Traditionally, search engines and other IR applications have applied stemming to improve the chance of matching different forms of a word, treating them as interchangeable, which may or may not be appropriate when searching.

Stemming can be seen as a quick and dirty method of chopping words down to their root forms, which works especially well for English.  Lemmatization is an operation that requires more linguistic knowledge, such as dictionaries.

In [8]:
ls = SnowballStemmer("english") # from NLTK
print(ls.stem("trouble"), ls.stem("troubling"), ls.stem("troubled"))
print(ls.stem("happy"), ls.stem("happier"), ls.stem("happiest"))
print(ls.stem("cat"), ls.stem("cats"))
print(ls.stem("is"), ls.stem("are"), ls.stem("be"))

troubl troubl troubl
happi happier happiest
cat cat
is are be


## 2. Pre-processing with our own TextPreprocessor.py

We defined our own TextPreprocessor class, compatible with the processing pipeline of the `scikit-learn` library.  This class is available in the file `TextPreprocessor.py` provided with the lab, which should be imported.  It is inspired by two documents available [here](https://towardsdatascience.com/text-preprocessing-steps-and-universal-pipeline-94233cb6725a) and [here](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html).  **You can use it in later AnTeDe labs.**
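For readers unfamiliar with the scikit-learn convention, a transformer is an object that exposes `fit` and `transform` methods. The sketch below is a hypothetical and much simplified stand-in: the class name `MiniPreprocessor` and its parameters are invented here and do not reflect the contents of `TextPreprocessor.py`. It uses only the standard library, whereas a real implementation would typically subclass scikit-learn's `BaseEstimator` and `TransformerMixin`.

```python
import string


class MiniPreprocessor:
    """Hypothetical transformer following the fit/transform convention."""

    def __init__(self, stopwords=frozenset(), lowercase=True):
        self.stopwords = stopwords
        self.lowercase = lowercase

    def fit(self, X, y=None):
        # Nothing to learn from the data: preprocessing is stateless here.
        return self

    def transform(self, X):
        # X is any iterable of strings; one token list is returned per text.
        return [self._process(text) for text in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def _process(self, text):
        if self.lowercase:
            text = text.lower()
        # Replace punctuation by spaces, then split on whitespace.
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        tokens = text.translate(table).split()
        return [t for t in tokens if t not in self.stopwords]


processor = MiniPreprocessor(stopwords={"the", "are", "on", "their", "for"})
result = processor.fit_transform(["The striped bats are hanging on their feet."])
print(result)  # [['striped', 'bats', 'hanging', 'feet']]
```

Returning `self` from `fit` is what allows such an object to be chained with other steps in a pipeline.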

For usage in Google Colab, place `TextPreprocessor.py` in the same folder as this notebook and mount your Google Drive to be able to access `TextPreprocessor.py` as follows:

In [9]:
from google.colab import drive
drive.mount('/content/gdrive')

# Modify path according to your configuration
# !ls "/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022"
import sys
sys.path.insert(0,'/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022')

from TextPreprocessor import *

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


We will use here a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF). The shortened version consists of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It consists of texts of headline stories from around the years 2000-2001.  It is available as test data in the `gensim` package, so you do not need to download it separately. 

In [10]:
# Load the documents into a Pandas data frame.  Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

import pandas as pd  # explicit import, in case TextPreprocessor does not re-export it

test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
with open(lee_train_file) as f:  # one document per line
    text = f.read().splitlines()
data_df = pd.DataFrame({'text': text})

In [11]:
data_df.head()


| | text |
|---|---|
| 0 | Hundreds of people have been forced to vacate ... |
| 1 | Indian security forces have shot dead eight su... |
| 2 | The national road toll for the Christmas-New Y... |
| 3 | Argentina's political and economic crisis has ... |
| 4 | Six midwives have been suspended at Wollongong... |


In [12]:
# Run our TextPreprocessor and time it.

language = 'english'
stop_words = set(stopwords.words(language)) # from NLTK: do nltk.download('stopwords') once.
for sw in ['\"', '\'', '\'\'', '`', '``', '\'s']:
    stop_words.add(sw)

processor = TextPreprocessor( # these are only a few of the options of TextPreprocessor (see code for more)
    language = language,
    pos_tags = {wordnet.ADJ, wordnet.NOUN},
    stopwords = stop_words,
)
start = timer()
data_df['processed'] = processor.transform(data_df['text'])
end = timer()
print("Took: " + str(end - start))

Took: 12.500095175000013


In [13]:
print(data_df['text'].iloc[0], '\n')
print(data_df['processed'].iloc[0])

Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year's Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at this

## 3. Pre-processing with the Gensim library
[Gensim](https://radimrehurek.com/gensim/) is a Python library widely used for topic modeling.  It provides handy utilities to preprocess text, documented [here](https://radimrehurek.com/gensim/parsing/preprocessing.html) and [here](https://github.com/thunlp/topical_word_embeddings/blob/master/TWE-2/gensim/parsing/preprocessing.py).  A simple example is as follows (don't forget to `import gensim`).

In [14]:
preprocess_documents(["""
<i>Hello</i> <b>World</b> 9!", "Th3     weather_is really g00d today, isn't it?
I'm tall, you're tall, but he isn't tall. But he's an apple in his hand isn't correct.
o ö ô e é è o ö a ä à n ñ
<h1>Title Goes Here</h1> 

<b>Bolded Text</b>
<i>Italicized Text</i>
The striped bats are hanging.
"""])

[['hello',
  'world',
  'weather',
  'todai',
  'isn',
  'tall',
  'tall',
  'isn',
  'tall',
  'appl',
  'hand',
  'isn',
  'correct',
  'titl',
  'goe',
  'bold',
  'text',
  'italic',
  'text',
  'stripe',
  'bat',
  'hang']]
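The output above is consistent with a chain of string filters applied one after the other before the text is split on whitespace: digits disappear, so `Th3` and `g00d` shrink to very short strings that are then dropped. The sketch below mimics that idea with the standard library only. The filter functions, the tiny `STOPWORDS` set and `apply_filters` are simplified, hypothetical stand-ins rather than gensim's actual code, and stemming is left out.

```python
import re
import string

STOPWORDS = {"the", "are", "is", "really", "it"}  # tiny hypothetical list


def to_lower(text):
    return text.lower()


def drop_tags(text):
    # Remove anything that looks like an HTML/XML tag.
    return re.sub(r"<[^>]*>", "", text)


def drop_punctuation(text):
    # Replace every punctuation character (including "_") by a space.
    return re.sub("[" + re.escape(string.punctuation) + "]", " ", text)


def drop_digits(text):
    return re.sub(r"[0-9]+", "", text)


def drop_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)


def drop_short(text, minsize=3):
    return " ".join(w for w in text.split() if len(w) >= minsize)


FILTERS = [to_lower, drop_tags, drop_punctuation, drop_digits,
           drop_stopwords, drop_short]


def apply_filters(text, filters=FILTERS):
    """Apply each filter in order, then split the result into tokens."""
    for f in filters:
        text = f(text)
    return text.split()


tokens = apply_filters("<i>Hello</i> <b>World</b> 9! Th3 weather_is really g00d today")
print(tokens)  # ['hello', 'world', 'weather', 'today']
```

The order of the filters matters: removing digits before removing short tokens is what eliminates `th` and `gd` in this sketch.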

It is also possible to use gensim's preprocessing utility on the corpus introduced above.  This utility performs stemming (for English) rather than lemmatization, and returns a list of tokens for each document.  We can compare the timing and then the outputs on the first text.

In [15]:
start = timer()
data_df['processed_gensim'] = preprocess_documents(data_df['text'])
end = timer()
print("Took: "+str(end - start))

Took: 0.3695762809999792
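The two timing cells follow the same start/stop pattern, which could be factored into a small helper when several preprocessing options need to be compared. This is only a convenience sketch: `timed` is a hypothetical name, and `str.split` stands in for a real preprocessing function.

```python
from timeit import default_timer as timer


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    return result, timer() - start


# str.split is only a placeholder for an actual preprocessing callable.
tokens, seconds = timed(str.split, "a small example sentence")
print(f"Took: {seconds:.6f} s")
```

A single wall-clock measurement is noisy, so differences of a few percent between two runs should not be over-interpreted.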


In [16]:
print(data_df['processed_gensim'].iloc[0])

['hundr', 'peopl', 'forc', 'vacat', 'home', 'southern', 'highland', 'new', 'south', 'wale', 'strong', 'wind', 'todai', 'push', 'huge', 'bushfir', 'town', 'hill', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'sydnei', 'forc', 'closur', 'hume', 'highwai', 'aedt', 'mark', 'deterior', 'weather', 'storm', 'cell', 'move', 'east', 'blue', 'mountain', 'forc', 'author', 'decis', 'evacu', 'peopl', 'home', 'outli', 'street', 'hill', 'new', 'south', 'wale', 'southern', 'highland', 'estim', 'resid', 'left', 'home', 'nearbi', 'mittagong', 'new', 'south', 'wale', 'rural', 'servic', 'sai', 'weather', 'condit', 'caus', 'burn', 'finger', 'format', 'eas', 'unit', 'hill', 'optimist', 'defend', 'properti', 'blaze', 'burn', 'new', 'year', 'ev', 'new', 'south', 'wale', 'crew', 'call', 'new', 'gun', 'south', 'goulburn', 'detail', 'avail', 'stage', 'author', 'sai', 'close', 'hume', 'highwai', 'direct', 'new', 'sydnei', 'west', 'longer', 'threaten', 'properti', 'cranebrook', 'area', 'rain', 'fallen', 'p

## Conclusion

You are now aware of three ways of pre-processing texts, which will be useful in later labs:
1. a set of NLTK functions;
2. the in-house class `TextPreprocessor`;
3. gensim's `preprocess_documents` function.

## Appendix
Several modules may be helpful when preprocessing, but will not be demonstrated here.  If needed, they can be installed and used according to their documentation:
* [Contractions](https://github.com/kootenpv/contractions) is a tool for expanding English contractions (e.g., `contractions.fix("I'm tall")` returns "I am tall"), but installing it with Conda may be difficult because of some dependencies.
   * _Tips for installing it:_ `conda install contractions` may not work, so try `pip install contractions` instead.  The package has a small number of dependencies, but one of them ([`pyahocorasick`](https://github.com/WojciechMula/pyahocorasick)) may trigger C compilation errors when installed with `pip` and will not work with a plain `conda install`; `conda install -c conda-forge pyahocorasick` should work (see the Anaconda [documentation](https://anaconda.org/conda-forge/pyahocorasick)).
* [Inflect](https://pypi.org/project/inflect/) is a library for manipulating English word inflections, which can generate plural or singular nouns, ordinals, and indefinite articles, and can convert numbers written in digits to words.
* [Unicodedata](https://docs.python.org/3/library/unicodedata.html) provides access to the Unicode Character Database and may help to clean up text with character conversion flaws, or to convert characters with diacritics to ASCII equivalents (which may or may not be a good idea for languages like French or German).
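As an illustration of the last item, the following minimal sketch removes diacritics with the standard `unicodedata` module. The function name `strip_diacritics` is hypothetical; the approach (NFKD decomposition followed by dropping combining marks) is one common way of doing it, not the only one.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("o ö ô e é è a ä à n ñ"))  # o o o e e e a a a n n
```

Note that this conversion loses information: words that differ only by an accent become identical, which is why it should be applied with care to French or German text.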