# Lab 1: NLP Basics
## Text Preprocessing
Text preprocessing is probably one of the least pleasant yet most important steps of a natural language processing (NLP) pipeline. This step determines how your NLP algorithms will see the data. If your preprocessing breaks, the whole model can break or, even worse, fail silently and give incorrect results.

Text preprocessing can be divided into three main parts:
* Tokenization
* Normalization
* Noise reduction

The parts are not necessarily applied in that order. Sometimes noise reduction has to be performed before tokenization. In other cases, some steps are repeated several times.
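
As an illustration of noise reduction running before tokenization, here is a minimal sketch that cleans a scraped snippet using only the standard library. The input string and the `reduce_noise` helper are made up for this example; real noise depends heavily on where the text comes from.

```python
import html
import re

# Hypothetical noisy input, e.g. text scraped from a web page.
noisy_text = "<p>Coffee is a   brewed drink &amp; it is popular.</p>"

def reduce_noise(text):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop anything that looks like a tag
    text = html.unescape(text)            # turn &amp; into &
    return re.sub(r"\s+", " ", text).strip()

clean_text = reduce_noise(noisy_text)
print(clean_text)
print(clean_text.split())
```

Tokenizing `noisy_text` directly would leave fragments such as `<p>Coffee` and `&amp;` among the tokens, which is why this step sometimes has to come first.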


In this lab we will be using [Python's Natural Language ToolKit (NLTK)](https://www.nltk.org/) and [spaCy](https://spacy.io/usage/spacy-101). Follow the links to read more about them.

In [None]:
from string import punctuation

import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

import spacy
# If you don't have the model installed, run "python -m spacy download en_core_web_sm"
# in the console and restart the Python kernel
nlp = spacy.load("en_core_web_sm")

In [None]:
# Run this cell to install all the necessary files for NLTK
nltk.download('stopwords') # Download stopwords 
nltk.download('wordnet') # Download WordNet 
nltk.download('punkt') # Download punkt tokenizer models
nltk.download('averaged_perceptron_tagger') # Download POS tagger

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Tokenization
Tokenization is the process of splitting text into smaller parts called tokens, and it is a crucial step in NLP. Two kinds are worth highlighting: word segmentation and sentence segmentation. Some tasks need only word segmentation; others need both sentences and words.

As the names suggest, word segmentation is dividing the raw text sequence into words and sentence segmentation is dividing the text into sentences.

Imagine that we need to parse some sentences from the Wikipedia article about coffee. We have the following raw text:

In [None]:
raw_text = "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried. " + \
           "By the 16th century, the drink had reached the rest of the Middle East and North Africa, later spreading to Europe."
print(raw_text)

Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried. By the 16th century, the drink had reached the rest of the Middle East and North Africa, later spreading to Europe.


A simple approach is to treat a subset of characters as whitespace and split the text on those characters.

In [None]:

tokens = raw_text.split()
print(tokens)

['Coffee', 'is', 'a', 'brewed', 'drink', 'prepared', 'from', 'roasted', 'coffee', 'beans,', 'the', 'seeds', 'of', 'berries', 'from', 'certain', 'Coffea', 'species.', 'When', 'coffee', 'berries', 'turn', 'from', 'green', 'to', 'bright', 'red', 'in', 'color', '–', 'indicating', 'ripeness', '–', 'they', 'are', 'picked,', 'processed,', 'and', 'dried.', 'By', 'the', '16th', 'century,', 'the', 'drink', 'had', 'reached', 'the', 'rest', 'of', 'the', 'Middle', 'East', 'and', 'North', 'Africa,', 'later', 'spreading', 'to', 'Europe.']


But already here we can see a problem with tokens like '*dried.*' and '*Africa,*'. The trailing punctuation is a part of the token that we definitely don't want. One solution is to strip the punctuation from each token.

In [None]:
def whitespace_tokenize(text):
    return [token.strip(punctuation) for token in text.split()]
    

print(whitespace_tokenize(raw_text))

['Coffee', 'is', 'a', 'brewed', 'drink', 'prepared', 'from', 'roasted', 'coffee', 'beans', 'the', 'seeds', 'of', 'berries', 'from', 'certain', 'Coffea', 'species', 'When', 'coffee', 'berries', 'turn', 'from', 'green', 'to', 'bright', 'red', 'in', 'color', '–', 'indicating', 'ripeness', '–', 'they', 'are', 'picked', 'processed', 'and', 'dried', 'By', 'the', '16th', 'century', 'the', 'drink', 'had', 'reached', 'the', 'rest', 'of', 'the', 'Middle', 'East', 'and', 'North', 'Africa', 'later', 'spreading', 'to', 'Europe']
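
Note that the dash '–' survives as a token of its own, because `string.punctuation` only contains ASCII punctuation. One possible alternative, sketched below, is to extract the words with a regular expression instead of splitting on whitespace. The `regex_tokenize` name and the pattern are assumptions for illustration, not part of NLTK or spaCy.

```python
import re

def regex_tokenize(text):
    """Return runs of word characters, keeping internal apostrophes and hyphens."""
    return re.findall(r"\w+(?:['-]\w+)*", text)

sample = "When coffee berries turn red – indicating ripeness – they are picked."
print(regex_tokenize(sample))
print(regex_tokenize("isn't full-time"))
```

Because the pattern only matches word characters, any punctuation mark, ASCII or not, is dropped without having to be listed.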


Let's say now that we want to split the text into sentences and then get tokens for each sentence. The simplest way is to split the text on periods first and then tokenize each sentence.

In [None]:
def segment_sents(text):
  sents = []
  for sent in text.split('.'):
    if sent: 
      sents.append(whitespace_tokenize(sent))
  return sents

print(segment_sents(raw_text))

[['Coffee', 'is', 'a', 'brewed', 'drink', 'prepared', 'from', 'roasted', 'coffee', 'beans', 'the', 'seeds', 'of', 'berries', 'from', 'certain', 'Coffea', 'species'], ['When', 'coffee', 'berries', 'turn', 'from', 'green', 'to', 'bright', 'red', 'in', 'color', '–', 'indicating', 'ripeness', '–', 'they', 'are', 'picked', 'processed', 'and', 'dried'], ['By', 'the', '16th', 'century', 'the', 'drink', 'had', 'reached', 'the', 'rest', 'of', 'the', 'Middle', 'East', 'and', 'North', 'Africa', 'later', 'spreading', 'to', 'Europe']]


For this example, it worked fine. But this task holds many surprises for the unprepared. Let's look at other examples that cause trouble for our function.

In [None]:
difficult_sents = [
    "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog.",
    '"What is all the fuss about?" asked Mr. Peters.',
     "This full-time student isn't living in on-campus housing, and she's not wanting to visit Hawai'i."
]

for sent in difficult_sents: 
  print(segment_sents(sent))

[['Dr'], ['Ford', 'did', 'not', 'ask', 'Col'], ['Mustard', 'the', 'name', 'of', 'Mr'], ["Smith's", 'dog']]
[['What', 'is', 'all', 'the', 'fuss', 'about', 'asked', 'Mr'], ['Peters']]
[['This', 'full-time', 'student', "isn't", 'living', 'in', 'on-campus', 'housing', 'and', "she's", 'not', 'wanting', 'to', 'visit', "Hawai'i"]]


Here, we can see that abbreviations like Dr., Col., and Mr. were treated as sentence ends. Also, contractions like isn't and she's are in fact two words each: is not and she is. However, Smith's can be either Smith is or, as in our case, a single word showing possession. Finally, we have to decide whether full-time and on-campus are one word or two.
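
To make the abbreviation problem concrete, here is a minimal sketch of a sentence splitter that skips a small list of known abbreviations. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical and deliberately tiny; a sentence that really ends with an abbreviation would still be handled incorrectly.

```python
# Toy abbreviation list; an assumption for illustration, far from complete.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "col.", "prof."}

def split_sentences(text):
    """Split after chunks ending in '.', '!' or '?', skipping known abbreviations."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        core = chunk.rstrip("\"')")  # ignore closing quotes/brackets after the mark
        ends_sentence = core.endswith((".", "!", "?"))
        if ends_sentence and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."))
print(split_sentences('"What is all the fuss about?" asked Mr. Peters.'))
```

Maintaining such lists by hand does not scale, which is one reason to rely on an existing library.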

Luckily, for English we can use libraries like NLTK or spaCy, which tackle most of these problems. Let's see how they handle our sentences.

In [None]:
print("NLTK tokenization:\n")
for sent in difficult_sents: 
  print([word_tokenize(s) for s in sent_tokenize(sent)])

NLTK tokenization:

[['Dr.', 'Ford', 'did', 'not', 'ask', 'Col.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']]
[['``', 'What', 'is', 'all', 'the', 'fuss', 'about', '?', "''"], ['asked', 'Mr.', 'Peters', '.']]
[['This', 'full-time', 'student', 'is', "n't", 'living', 'in', 'on-campus', 'housing', ',', 'and', 'she', "'s", 'not', 'wanting', 'to', 'visit', "Hawai'i", '.']]


In [None]:
print("Spacy tokenization:\n")
for sent in difficult_sents:
  doc = nlp(sent)
  print([token.text for s in doc.sents for token in s])


Spacy tokenization:

['Dr.', 'Ford', 'did', 'not', 'ask', 'Col', '.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']
['"', 'What', 'is', 'all', 'the', 'fuss', 'about', '?', '"', 'asked', 'Mr.', 'Peters', '.']
['This', 'full', '-', 'time', 'student', 'is', "n't", 'living', 'in', 'on', '-', 'campus', 'housing', ',', 'and', 'she', "'s", 'not', 'wanting', 'to', 'visit', "Hawai'i", '.']


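As we can see, both libraries cope with most of these cases, though they differ in the details: NLTK keeps full-time and on-campus as single tokens, while Spacy splits them at the hyphen and also separates Col from its period. However, these tools work this well only for English and, probably, most European languages. For a language where words are not separated by spaces in writing, like Chinese or Thai, or for long compound words, as in German, we need to choose another approach.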
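
One classic dictionary-based baseline for text without spaces is greedy maximum matching: at each position, take the longest word found in a vocabulary. The sketch below assumes a toy vocabulary invented for this example and falls back to single characters for unknown text; practical segmenters are considerably more sophisticated.

```python
def max_match(text, vocabulary):
    """Greedy forward maximum matching: at each position take the longest known word."""
    tokens, start = [], 0
    while start < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), start, -1):
            candidate = text[start:end]
            if candidate in vocabulary or end == start + 1:
                tokens.append(candidate)
                start = end
                break
    return tokens

# Toy vocabulary, made up for this example.
vocabulary = {"kaffee", "tasse", "maschine"}
print(max_match("kaffeetasse", vocabulary))
print(max_match("kaffeemaschine", vocabulary))
```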

## Normalization
In order to process natural language text, we usually normalize it first. Normalization mainly involves eliminating punctuation, converting the entire text to lowercase or uppercase, converting numbers into words, expanding abbreviations, canonicalizing the text, and so on.

We are going to look at two main steps: **stemming** and **lemmatization**.

Stemming usually refers to removing endings and prefixes from a word. For example, playing and played are both reduced to play by a stemmer. It works rather well for English, but it can be troublesome for languages with more complicated morphology. Also, an irregular form such as ran, the past tense of run, is not changed by stemming and ends up being treated as a different word.
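
The idea can be sketched with a naive suffix stripper. This is only an illustration with a made-up three-suffix list, not the Porter algorithm, which applies many more rules.

```python
# Toy suffix list; real stemmers such as Porter's use many more rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["playing", "played", "plays", "play", "running", "ran"]:
    print(word, "->", naive_stem(word))
```

With this sketch, running becomes runn because there is no rule for doubled consonants, and ran is left untouched.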

The NLTK library includes a stemming package as well.

In [None]:
words_to_stem = ['playing', 'played', 'plays', 'play', 'running', 'ran', 'runs', 'run']
stemmer = PorterStemmer()
print('Stemming with NLTK:\n')
print([stemmer.stem(word) for word in words_to_stem])

Stemming with NLTK:

['play', 'play', 'play', 'play', 'run', 'ran', 'run', 'run']


To handle words whose roots change across grammatical forms, we need a more sophisticated method called lemmatization. Lemmatization uses context to convert a word to its meaningful base form. It helps group together words that share a base form so that they can be identified as a single item. Nowadays, most lemmatizers are trained using neural networks.


Both NLTK and Spacy have a lemmatization module for English.


In [None]:
print('Lemmatization with NLTK:\n')
lemmatizer = WordNetLemmatizer()
for word in words_to_stem:
  print(f'{word}: {lemmatizer.lemmatize(word)}')

Lemmatization with NLTK:

playing: playing
played: played
plays: play
play: play
running: running
ran: ran
runs: run
run: run


We can see immediately that NLTK doesn't give correct lemmas for our words. This is because the NLTK lemmatizer expects a part-of-speech (POS) tag for each word, i.e. whether the word is a noun, a verb, an adjective, etc.; without one, it treats every word as a noun. We can, of course, specify the POS tag for each word, but if our corpus is big, tagging by hand will be tiresome. Instead, we can use a pretrained POS tagger. We're going to look at POS tagging later.

In [None]:
print('Lemmatization with NLTK with correct pos tags:\n')
for word in words_to_stem:
  print(f'{word}: {lemmatizer.lemmatize(word, pos=wordnet.VERB)}')

Lemmatization with NLTK with correct pos tags:

playing: play
played: play
plays: play
play: play
running: run
ran: run
runs: run
run: run
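
The cell above hard-codes the verb tag for every word. As a rough sketch of how tagger output could be connected to the lemmatizer, the hypothetical helper below maps Penn Treebank tags (the tagset `pos_tag` returns by default) to the one-letter POS values WordNet uses. The mapping is a common simplification and an assumption here, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective, same value as wordnet.ADJ
    if tag.startswith("V"):
        return "v"  # verb, same value as wordnet.VERB
    if tag.startswith("R"):
        return "r"  # adverb, same value as wordnet.ADV
    return "n"      # default to noun, as WordNetLemmatizer does

# Possible usage with the names from the cells above (needs the NLTK data downloaded):
# for token, tag in pos_tag(word_tokenize("She was running")):
#     print(token, lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag)))

print(penn_to_wordnet("VBD"), penn_to_wordnet("NNS"), penn_to_wordnet("JJ"))
```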


Conveniently for us, Spacy does POS tagging and other necessary preprocessing for lemmatization, and we can get all the lemmas with only one command.

In [None]:
print('Lemmatization with Spacy:\n')
for word in words_to_stem: 
  doc = nlp(word)
  print(f'{word}: {doc[0].lemma_}')

Lemmatization with Spacy:

playing: play
played: play
plays: play
play: play
running: run
ran: run
runs: run
run: run


We can also see how our sentences from the previous exercise look after stemming and lemmatization:

In [None]:
print("NLTK stemming:\n")
for sent in difficult_sents:
  nltk_sents = [word_tokenize(s) for s in sent_tokenize(sent)]
  print(f'Original sentence:\n{nltk_sents}')
  # Stem every token while keeping the sentence structure
  nltk_stems = [[stemmer.stem(token) for token in tokens] for tokens in nltk_sents]
  print(f'Stemmed sentence:\n{nltk_stems}')
  print('\n------\n')

NLTK stemming:

Original sentence:
[['Dr.', 'Ford', 'did', 'not', 'ask', 'Col.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']]
Stemmed sentence:
[['dr.', 'ford', 'did', 'not', 'ask', 'col.', 'mustard', 'the', 'name', 'of', 'mr.', 'smith', "'s", 'dog', '.']]

------

Original sentence:
[['``', 'What', 'is', 'all', 'the', 'fuss', 'about', '?', "''"], ['asked', 'Mr.', 'Peters', '.']]
Stemmed sentence:
[['``', 'what', 'is', 'all', 'the', 'fuss', 'about', '?', "''"], ['ask', 'mr.', 'peter', '.']]

------

Original sentence:
[['This', 'full-time', 'student', 'is', "n't", 'living', 'in', 'on-campus', 'housing', ',', 'and', 'she', "'s", 'not', 'wanting', 'to', 'visit', "Hawai'i", '.']]
Stemmed sentence:
[['thi', 'full-tim', 'student', 'is', "n't", 'live', 'in', 'on-campu', 'hous', ',', 'and', 'she', "'s", 'not', 'want', 'to', 'visit', "hawai'i", '.']]

------




We can see that the NLTK stemmer also lowercases all the words, which is another part of normalization. We can also see some stemming artifacts like thi, full-tim, and on-campu.

Let's now see the lemmatized sentence from Spacy:

In [None]:
print("Spacy lemmatization:\n")
for sent in difficult_sents:
    doc = nlp(sent) 
    print(f'Original sentence:\n{[token.text  for s in doc.sents for token in s]}')
    print(f'Lemmatized sentence:\n{[token.lemma_  for s in doc.sents for token in s]}')
    print('\n------\n')

Spacy lemmatization:

Original sentence:
['Dr.', 'Ford', 'did', 'not', 'ask', 'Col', '.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']
Lemmatized sentence:
['Dr.', 'Ford', 'do', 'not', 'ask', 'Col', '.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']

------

Original sentence:
['"', 'What', 'is', 'all', 'the', 'fuss', 'about', '?', '"', 'asked', 'Mr.', 'Peters', '.']
Lemmatized sentence:
['"', 'what', 'be', 'all', 'the', 'fuss', 'about', '?', '"', 'ask', 'Mr.', 'Peters', '.']

------

Original sentence:
['This', 'full', '-', 'time', 'student', 'is', "n't", 'living', 'in', 'on', '-', 'campus', 'housing', ',', 'and', 'she', "'s", 'not', 'wanting', 'to', 'visit', "Hawai'i", '.']
Lemmatized sentence:
['this', 'full', '-', 'time', 'student', 'be', 'not', 'live', 'in', 'on', '-', 'campus', 'housing', ',', 'and', '-PRON-', 'be', 'not', 'want', 'to', 'visit', "Hawai'i", '.']

------




With lemmatization, the results look better: *did* was transformed to *do*, and *is* and *'s* to *be*. Another good thing is that in the first sentence the *'s* in *Smith's dog* stayed as *'s*, which is important because in this case it is not a contraction of the verb *is*.

Other parts of normalization include:

* Removing the punctuation
* Removing whitespace
* Removing numbers or converting them into text
* Removing stop words
* etc
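
Several of these steps can be sketched with the standard library alone. The `normalize` helper and the `DIGIT_WORDS` table below are hypothetical and only handle ASCII punctuation and single digits; which steps to apply, and in what order, depends on the task.

```python
import re
from string import punctuation

# Tiny digit-to-word table; a fuller converter would handle multi-digit numbers.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text, spell_digits=False):
    """Lowercase, drop ASCII punctuation, handle digits, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", punctuation))  # remove punctuation
    if spell_digits:
        # Spell out each digit separately
        text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    else:
        text = re.sub(r"[0-9]+", " ", text)                    # remove numbers
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

print(normalize("  Coffee   costs 3 dollars, roughly!  "))
print(normalize("  Coffee   costs 3 dollars, roughly!  ", spell_digits=True))
```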

Finally, we can look a bit more into stop words. Stop words are words such as a, an, the, in, at, and so on that occur frequently in text corpora and do not carry much information in most contexts. In general, they are needed to complete sentences and make them grammatically sound. They are often the most common words in a language and can be filtered out in most NLP tasks, which helps reduce the vocabulary or search space. However, the stop list can be modified to fit a specific task.

Both NLTK and Spacy have built-in stop word lists; however, you are free to find one elsewhere on the internet or even compose your own.
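
As a rough sketch of composing your own list, one option is to take the most frequent tokens of a corpus as stop word candidates and exclude the words that matter for the task. The `build_stop_list` helper, its parameters, and the three toy documents are assumptions made up for this example.

```python
from collections import Counter

def build_stop_list(tokenized_docs, top_n=3, keep=()):
    """Take the top_n most frequent lowercased tokens as stop words, except those in keep."""
    counts = Counter(token.lower() for doc in tokenized_docs for token in doc)
    return {word for word, _ in counts.most_common(top_n) if word not in keep}

# Toy corpus, already tokenized.
docs = [
    ["the", "coffee", "is", "not", "hot"],
    ["the", "tea", "is", "hot"],
    ["the", "coffee", "is", "cold"],
]
stop_list = build_stop_list(docs, top_n=2)
filtered = [token for token in docs[0] if token not in stop_list]
print(stop_list)
print(filtered)
```

The `keep` argument is one way to fit the list to a task. In sentiment analysis, for example, a negation such as not carries meaning and should probably not be filtered out, even though built-in lists often include it.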

In [None]:
print('Stop words for English from NLTK:\n')

nltk_stopwords = set(stopwords.words('english')) 
print(nltk_stopwords)

Stop words for English from NLTK:

{"hasn't", 'very', 'off', 'her', 'of', 'wasn', 'against', 'so', 'who', 'up', 'i', 'these', 'hadn', "wasn't", "mightn't", 'there', 'are', 'being', 'hers', 'further', 'both', 'while', 'those', "don't", 'before', "didn't", 'aren', 'or', 'shouldn', 'whom', 'this', 'yourselves', 'into', 'such', 'any', 'out', 'what', 'to', 'theirs', 'be', 're', 'as', 'just', 'then', 'yours', 'their', 'and', 'between', 'again', "you've", 'because', 'in', "should've", 'by', "that'll", "weren't", 'too', 'during', 'ourselves', 'mightn', 'your', 'won', 'how', 'needn', 'them', 'only', 'myself', 'each', 'wouldn', 'yourself', 'which', 'him', 'where', 'from', 'been', 'most', "you're", 'its', 'we', "won't", 'more', 'doing', 'on', 's', 'am', 'himself', 'some', 'have', 'all', 'a', 'y', 'mustn', 'for', 'the', 'had', 'doesn', 'ours', 'hasn', "couldn't", 'o', 'our', 'other', 'than', 'was', 'is', 'own', 'after', 't', 'until', 'should', 'few', 'were', 'ma', "you'd", 'itself', 'can', 'don', 

In [None]:
print('Stop words for English from Spacy:\n')
nlp.Defaults.stop_words

Stop words for English from Spacy:



{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

Finally, we can see how our sentences look with the stop words removed:

In [None]:
print("NLTK stemming and stop words:\n")
for sent in difficult_sents:
    nltk_sents = [word_tokenize(s) for s in sent_tokenize(sent)]
    print(f'Original sentence:\n{nltk_sents}')
    nltk_stems = []
    nltk_no_stop = []  # flat list of stems that are not stop words
    for tokens in nltk_sents:
        stemmed_sent = []
        for token in tokens:
            stemmed_token = stemmer.stem(token)
            # The check runs on the stem, so a stop word whose stem differs
            # from the word itself (e.g. 'this' -> 'thi') is not filtered out
            if stemmed_token not in nltk_stopwords:
                nltk_no_stop.append(stemmed_token)
            stemmed_sent.append(stemmed_token)
        nltk_stems.append(stemmed_sent)
    print(f'Stemmed sentence:\n{nltk_stems}')
    print(f'Stemmed sentence without stop words:\n{nltk_no_stop}')
    print('\n------\n')

NLTK stemming and stop words:

Original sentence:
[['Dr.', 'Ford', 'did', 'not', 'ask', 'Col.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']]
Stemmed sentence:
[['dr.', 'ford', 'did', 'not', 'ask', 'col.', 'mustard', 'the', 'name', 'of', 'mr.', 'smith', "'s", 'dog', '.']]
Stemmed sentence without stop words:
['dr.', 'ford', 'ask', 'col.', 'mustard', 'name', 'mr.', 'smith', "'s", 'dog', '.']

------

Original sentence:
[['``', 'What', 'is', 'all', 'the', 'fuss', 'about', '?', "''"], ['asked', 'Mr.', 'Peters', '.']]
Stemmed sentence:
[['``', 'what', 'is', 'all', 'the', 'fuss', 'about', '?', "''"], ['ask', 'mr.', 'peter', '.']]
Stemmed sentence without stop words:
['``', 'fuss', '?', "''", 'ask', 'mr.', 'peter', '.']

------

Original sentence:
[['This', 'full-time', 'student', 'is', "n't", 'living', 'in', 'on-campus', 'housing', ',', 'and', 'she', "'s", 'not', 'wanting', 'to', 'visit', "Hawai'i", '.']]
Stemmed sentence:
[['thi', 'full-tim', 'student', 'is', "n't", 'l

In [None]:
print("Spacy lemmatization and stop words:\n")
for sent in difficult_sents:
    doc = nlp(sent) 
    print(f'Original sentence:\n{[token.text for s in doc.sents for token in s]}') 
    print(f'Lemmatized sentence:\n{[token.lemma_ for s in doc.sents for token in s]}')
    print(f'Lemmatized sentence without stop words:\n{[token.lemma_ for s in doc.sents for token in s if token.lemma_ not in nlp.Defaults.stop_words]}') 
    print('\n------\n')

Spacy lemmatization and stop words:

Original sentence:
['Dr.', 'Ford', 'did', 'not', 'ask', 'Col', '.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']
Lemmatized sentence:
['Dr.', 'Ford', 'do', 'not', 'ask', 'Col', '.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']
Lemmatized sentence without stop words:
['Dr.', 'Ford', 'ask', 'Col', '.', 'Mustard', 'Mr.', 'Smith', 'dog', '.']

------

Original sentence:
['"', 'What', 'is', 'all', 'the', 'fuss', 'about', '?', '"', 'asked', 'Mr.', 'Peters', '.']
Lemmatized sentence:
['"', 'what', 'be', 'all', 'the', 'fuss', 'about', '?', '"', 'ask', 'Mr.', 'Peters', '.']
Lemmatized sentence without stop words:
['"', 'fuss', '?', '"', 'ask', 'Mr.', 'Peters', '.']

------

Original sentence:
['This', 'full', '-', 'time', 'student', 'is', "n't", 'living', 'in', 'on', '-', 'campus', 'housing', ',', 'and', 'she', "'s", 'not', 'wanting', 'to', 'visit', "Hawai'i", '.']
Lemmatized sentence:
['this', 'full', '-', 'time', '