# Preprocessing

Depending on the task at hand, we may need to preprocess the text data in different ways. There is no universal solution as to how to preprocess text data, as the choice of action is application dependent. Here we will discuss some common preprocessing steps that are often used in text classification tasks.

A key consideration here is how to balance the trade-off between the amount of information we want to retain and the dimensionality of the data. The more information we retain, the higher the dimensionality of the data.

With high dimensionality, we need:

- More data
- More computational resources
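To make the link between retained information and dimensionality concrete, here is a minimal sketch of a bag-of-words representation. The toy corpus and the helper name `bag_of_words` are made up for illustration; the point is only that every distinct token becomes one feature dimension.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# The vocabulary is the set of distinct tokens; its size is the feature dimension.
vocabulary = sorted({token for doc in docs for token in doc.split()})


def bag_of_words(doc, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]


vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(len(vocabulary))  # number of dimensions
for vector in vectors:
    print(vector)
```

Note that `cat` and `cats` occupy separate dimensions here. The preprocessing steps discussed in this section are different ways of merging or dropping such dimensions.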

## Tokenization

There are different ways to tokenize text data.

- Words
    - Large vocabulary (up to a million words)
    - Words convey meaning
- Characters
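    - Small vocabulary (up to a few hundred characters; the English alphabet has only 26 letters)
    - Out-of-vocabulary characters are rare but still possible (e.g. emoji or other scripts)
- Subwords
    - Medium vocabulary (typically thousands to tens of thousands of subwords)
    - `sleep` and `sleeping` share the subword `sleep`, so their similarity is preserved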
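The three granularities can be sketched with plain Python. The subword part is only an illustration: real subword tokenizers (for example BPE or WordPiece) learn their vocabulary from data, whereas `SUBWORD_VOCAB` and `subword_tokenize` below are hypothetical and hand-written for this example.

```python
text = "She was sleeping"

# Word-level: split on whitespace (a library tokenizer would also handle punctuation).
word_tokens = text.split()

# Character-level: every character, including spaces, is a token.
char_tokens = list(text)

# Hypothetical subword vocabulary, hand-picked for this example.
SUBWORD_VOCAB = {"she", "was", "sleep", "ing", "s"}


def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into known pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


subword_tokens = [
    piece
    for word in word_tokens
    for piece in subword_tokenize(word.lower(), SUBWORD_VOCAB)
]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```

With this vocabulary, `sleeping` is split into `sleep` and `ing`, so it shares a token with `sleep` even though the two are different words.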

## Lowercasing

A simple string split will treat `Hello` and `hello` as different words. This is often undesirable, as it inflates the vocabulary and therefore the dimensionality, so we may want to lowercase all the words. However, in sentiment analysis, capitalization may be important, as it is often used to express emotions (e.g. `GREAT` vs `great`).
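A small sketch of both sides of this trade-off, using only the standard library. The function `shouting_ratio` is a hypothetical feature invented for this example: one possible way to keep the capitalization signal as a single number before lowercasing the tokens.

```python
tokens = "Hello hello HELLO world World".split()

# Case variants count as separate vocabulary entries until we lowercase.
raw_vocabulary = set(tokens)
lowercased_vocabulary = {token.lower() for token in tokens}

print(len(raw_vocabulary), len(lowercased_vocabulary))


def shouting_ratio(text):
    """Fraction of alphabetic tokens (longer than one letter) written in all caps."""
    words = [t for t in text.split() if t.isalpha() and len(t) > 1]
    if not words:
        return 0.0
    return sum(t.isupper() for t in words) / len(words)


print(shouting_ratio("WORST movie ever made"))
```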

## Punctuation

Punctuation can be removed or kept. In some cases, punctuation can be important for the meaning of the text. For example, `I am happy` and `I am happy?` may have different meanings.
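As a sketch, punctuation can be stripped with `str.translate`. The choice to keep `?` and `!` in the second table is an assumption made for this example, not a general recommendation, and `string.punctuation` only covers ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_ALL = str.maketrans("", "", string.punctuation)

# Alternative table that keeps characters which may carry meaning.
KEEP = "?!"
REMOVE_MOST = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)


def strip_punctuation(text, table=REMOVE_ALL):
    """Delete the characters listed in the translation table."""
    return text.translate(table)


print(strip_punctuation("I am happy?"))
print(strip_punctuation("Well, I am happy?", REMOVE_MOST))
```

With the first table, `I am happy?` and `I am happy` become identical; with the second, the question mark survives while the comma is dropped.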

## Accent Removal

Accents (diacritics) can be removed from text data, so that `café` becomes `cafe`. This is often done for English text, where accents are rare and mostly appear in loanwords, to reduce the dimensionality of the data.
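One common way to do this with the standard library is Unicode decomposition: each accented character is split into a base character plus combining marks, and the marks are then dropped. The helper name `strip_accents` is chosen for this sketch.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café naïve résumé"))
```

Letters that have no decomposition, such as `ø`, pass through unchanged. Keep in mind that this step can also merge distinct words, for example `résumé` and `resume`.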

## Stopwords

Stopwords are common words that are often removed from text data because they carry little meaning, and removing them reduces the dimensionality of the feature space. Examples of stopwords are `the`, `is`, `and`, `but`, `not`, etc. As usual, there is no universal rule: in some cases, these words are important for the meaning of the text. For example, `not` is a stopword, but it can change the meaning of a sentence.

- "I am happy" vs "I am not happy"
- "Problem" vs "No problem"

A problem with retaining the stopwords is that, as they are very common, the vectors representing documents will be mapped close to each other in the feature space regardless of topic. As many machine learning models rely on the distance (or similarity) between vectors, this can lead to poor performance.
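The effect can be illustrated with cosine similarity on raw word counts. This is a minimal sketch: the two sentences and the short `STOP_WORDS` set are made up to keep the example self-contained, and they are not the NLTK list.

```python
import math
from collections import Counter

# Small hand-picked stopword set, an assumption for this example only.
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "on"}


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


# Two documents about unrelated topics.
doc_a = "the price of the stock is in the news".split()
doc_b = "the cat is on the roof of the house".split()

with_stopwords = cosine_similarity(doc_a, doc_b)
without_stopwords = cosine_similarity(remove_stopwords(doc_a), remove_stopwords(doc_b))

print(round(with_stopwords, 3))
print(without_stopwords)
```

The two documents share no content words, yet with stopwords retained their similarity is 11/15 (about 0.733), driven entirely by `the`, `of`, and `is`. After filtering, it drops to 0.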


In [2]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words

[nltk_data] Downloading package stopwords to /home/amarov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

## Stemming

Let's return to a simple string split example. If we have the words `running`, `runs`, and `run`, these words are similar in meaning. However, a simple string split will treat them as different words, and `run` will be as different from `running` as it is from `tree`. Stemming is a rule-based process that removes suffixes from words to reduce them to a common stem.

Think about a search engine that needs to return results for the query `sleeping`. Simply querying for `sleeping` will not return results that contain `sleep`.

There are different stemming algorithms available, such as the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer, among others. Each one of these implements a set of rules to reduce words to their root form.

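For example, the Porter Stemmer includes rules such as the following (simplified; most rules apply only if the remaining stem is long enough or contains a vowel):

- `sses` -> `ss` (caresses -> caress)
- `s` -> `''` (cats -> cat)
- `ed` -> `''` (plastered -> plaster)
- `ing` -> `''` (motoring -> motor)
- `ment` -> `''` (adjustment -> adjust)
- `ement` -> `''` (replacement -> replac)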
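To show what rule-based suffix stripping looks like in code, here is a deliberately naive sketch. It is not the Porter algorithm: `SUFFIX_RULES`, `naive_stem`, and the `min_stem_length` guard are inventions for this example, and the real algorithm applies its rules in several steps with conditions on the remaining stem.

```python
# Simplified suffix rules, checked in order (longer suffixes first).
# Unlike the real Porter algorithm, only a minimum stem length is enforced.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ement", ""),
    ("ment", ""),
    ("ing", ""),
    ("ed", ""),
    ("ss", "ss"),  # keep a final double s, e.g. "glass"
    ("s", ""),
]


def naive_stem(word, min_stem_length=3):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem_length:
            return word[:stem_length] + replacement
    return word


for word in ["caresses", "replacement", "sleeping", "cats", "sing", "running"]:
    print(word, "->", naive_stem(word))
```

The sketch maps `running` to `runn`, because it has no rule for doubled consonants. Handling such cases is what the extra conditions and steps in real stemmers are for.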


Key properties of stemming:

- Fast and simple
- The result of stemming is not always a valid word.

In [3]:
porter = nltk.PorterStemmer()

porter.stem('running')

'run'

In [4]:
porter.stem('replacement')

'replac'

In [5]:
porter.stem("children")

'children'

In [6]:
print(porter.stem("happiness"))
print(porter.stem("happy"))
print(porter.stem("happily"))

happi
happi
happili


## Lemmatization

Lemmatization is a more sophisticated process that reduces words to their dictionary form, the lemma. The result of lemmatization is always a valid word. Lemmatization is slower than stemming as it requires a dictionary lookup as well as morphological analysis.

Some words have multiple lemmas. For example, the word `better` has two lemmas: `good` and `well`. The choice of lemma depends on the context in which the word is used and the part of speech of the word.


Key properties of lemmatization:

- Slower than stemming
- Depends on part of speech
- The result of lemmatization is always a valid word.
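The dependence on part of speech can be sketched as a dictionary lookup keyed by the word and its part of speech. `LEMMA_TABLE` and `lookup_lemma` are hypothetical and cover only a handful of hand-written entries; a real lemmatizer combines a full lexicon with morphological rules. The codes `n`, `v`, `a`, and `r` stand for noun, verb, adjective, and adverb.

```python
# Tiny hand-written lemma dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("better", "a"): "good",   # adjective: "a better idea"
    ("better", "r"): "well",   # adverb: "she sings better"
    ("mice", "n"): "mouse",
    ("running", "v"): "run",
    ("came", "v"): "come",
}


def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


print(lookup_lemma("better", "a"))
print(lookup_lemma("better", "r"))
print(lookup_lemma("running"))       # treated as a noun by default
print(lookup_lemma("running", "v"))
```

The same surface form maps to different lemmas depending on the part of speech, and a wrong or missing part of speech leaves the word unchanged.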

In [7]:
# Lemmatization
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/amarov/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
print(wnl.lemmatize('running'))
print(wnl.lemmatize('running', 'v'))

running
run


In [9]:
print(wnl.lemmatize('happiness'))
print(wnl.lemmatize('happy'))

happiness
happy


In [10]:
print(wnl.lemmatize('casting'))
print(wnl.lemmatize('casting', 'v'))

casting
cast


In [11]:
print(porter.stem("mice"))
print(wnl.lemmatize("mice", 'n'))

mice
mouse


Providing the part of speech to the lemmatizer lets it choose the correct lemma. For example, the word `better` can be a noun, a verb, an adjective, or an adverb:

- "He is our better" (noun)
- "He better run" (verb)
- "He bettered the situation" (verb)
- "He is better" (adjective)
- "He sings better" (adverb)


## Part of Speech Tagging

In order to provide the part of speech to the lemmatizer, we need to perform part of speech tagging. Part of speech tagging is the process of assigning a part of speech to each word in a sentence. The part of speech can be a noun, verb, adjective, adverb, etc. Part of speech tagging is a supervised machine learning task and there are many models available that can be used for this task.

NLTK's default English tagger uses the Penn Treebank tag set. You can find a list of the tags [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), and examples [here](https://medium.com/@faisal-fida/the-complete-list-of-pos-tags-in-nltk-with-examples-eb0485f04321).




In [11]:
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/amarov/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [13]:
sent = "Suddenly she came upon a little three-legged table, all made of solid glass."

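# word_tokenize needs NLTK's Punkt tokenizer data
# (e.g. nltk.download('punkt_tab') on recent NLTK versions)
tokens = nltk.word_tokenize(sent)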

print(tokens)
tagged_tokens = nltk.pos_tag(tokens)

for token in tagged_tokens:
    print(f"{token[0]:12} --> {token[1]}")


['Suddenly', 'she', 'came', 'upon', 'a', 'little', 'three-legged', 'table', ',', 'all', 'made', 'of', 'solid', 'glass', '.']
Suddenly     --> RB
she          --> PRP
came         --> VBD
upon         --> IN
a            --> DT
little       --> JJ
three-legged --> JJ
table        --> NN
,            --> ,
all          --> DT
made         --> VBN
of           --> IN
solid        --> JJ
glass        --> NN
.            --> .
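The Penn Treebank tags printed above are finer-grained than the single-letter part-of-speech codes (`n`, `v`, `a`, `r`) that the WordNet lemmatizer accepts, so a small mapping is needed between the two. The helper below is a sketch under that assumption; the name `penn_to_wordnet` is made up, and the tagged pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("RB"):
        return "r"  # adverb
    return None


# A few (token, tag) pairs taken from the tagger output above.
tagged = [("Suddenly", "RB"), ("she", "PRP"), ("came", "VBD"),
          ("little", "JJ"), ("table", "NN")]

for token, tag in tagged:
    print(f"{token:12} --> {tag:4} --> {penn_to_wordnet(tag)}")
```

When the function returns a code, it could be passed as the second argument to `wnl.lemmatize(token, pos)`; for tags without a WordNet counterpart (such as `PRP`), one option is to call the lemmatizer without a part of speech.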


In [14]:
# The default pipeline in spaCy includes a tokenizer, a tagger, and a lemmatizer
! python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(sent)


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [15]:
for token in doc:
    print(f"{token.text:8} --> {token.pos_:5} --> {token.lemma_}")

Suddenly --> ADV   --> suddenly
she      --> PRON  --> she
came     --> VERB  --> come
upon     --> SCONJ --> upon
a        --> DET   --> a
little   --> ADJ   --> little
three    --> NUM   --> three
-        --> PUNCT --> -
legged   --> ADJ   --> legged
table    --> NOUN  --> table
,        --> PUNCT --> ,
all      --> PRON  --> all
made     --> VERB  --> make
of       --> ADP   --> of
solid    --> ADJ   --> solid
glass    --> NOUN  --> glass
.        --> PUNCT --> .


## What is WordNet?

WordNet is a manually constructed lexical database that groups words into sets of synonyms (synsets). Furthermore, it describes hierarchical relationships between synsets, such as hypernymy ("is a") and meronymy ("is part of").

In [15]:
from nltk.corpus import wordnet as wn

synset = wn.synsets('pike')[1]

print("Name of the synset", synset.name())
print("Meaning of the synset : ", synset.definition())
print("Hypernyms ", synset.hypernyms())

Name of the synset pike.n.02
Meaning of the synset :  highly valued northern freshwater fish with lean flesh
Hypernyms  [Synset('freshwater_fish.n.01')]
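Following the hypernym relation repeatedly walks up the "is a" hierarchy. The sketch below shows the idea on a hand-written miniature hierarchy: the `HYPERNYMS` dictionary is an assumption for illustration and is not copied from WordNet, where a synset can also have more than one hypernym.

```python
# Hand-written miniature "is a" hierarchy (child -> parent), for illustration only.
HYPERNYMS = {
    "pike": "freshwater_fish",
    "freshwater_fish": "fish",
    "fish": "animal",
}


def hypernym_chain(concept, hypernyms):
    """Follow child -> parent links until a concept without a parent is reached."""
    chain = [concept]
    while chain[-1] in hypernyms:
        chain.append(hypernyms[chain[-1]])
    return chain


print(" -> ".join(hypernym_chain("pike", HYPERNYMS)))
```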


In [16]:
wn.synsets('pike')

[Synset('expressway.n.01'),
 Synset('pike.n.02'),
 Synset('pike.n.03'),
 Synset('pike.n.04'),
 Synset('pike.n.05')]