# Text Pre-processing and Wrangling

Text differs from the tabular datasets we typically use to build machine learning models. It must be pre-processed into a form that is usable for NLP tasks, which makes text processing and wrangling a necessary step in any NLP project.

In this notebook, we will cover:
- Sentence tokenization
- Word tokenization
- Handling non-text characters - accents, special symbols, HTML
- Preprocessing and normalization steps such as stemming, lemmatization, expanding contractions and stopword removal

## Import Libraries

In [97]:
import re
import nltk
import requests
import numpy as np
from pprint import pprint
from bs4 import BeautifulSoup

In [6]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('europarl_raw')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/laurent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/laurent/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/laurent/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package europarl_raw to
[nltk_data]     /Users/laurent/nltk_data...
[nltk_data]   Unzipping corpora/europarl_raw.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/laurent/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Get Text Dataset

Project Gutenberg is a large, free collection of literary works from across the world. We will use the text of **The Bible - Book 1: Genesis** to illustrate the text processing and wrangling steps in the following sections.

We will also use a smaller sample text with a few short sentences to demonstrate examples.

In [7]:
bible_html_url = "http://www.gutenberg.org/cache/epub/8001/pg8001.html"
bible_txt_url = "http://www.gutenberg.org/cache/epub/8001/pg8001.txt"
data = requests.get(bible_txt_url)
content = data.text

In [8]:
print(content[970:1800])

 in November 2002.





Book 01        Genesis

01:001:001 In the beginning God created the heaven and the earth.

01:001:002 And the earth was without form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.

01:001:003 And God said, Let there be light: and there was light.

01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.

01:001:005 And God called the light Day, and the darkness he called
           Night. And the evening and the morning were the first day.

01:001:006 And God said, Let there be a firmament in the midst of the
           waters, and let it divide the waters from the waters.

01:001:007 And God made the firmament, and divided the waters 


In [9]:
# Total characters in the downloaded text (Genesis plus the Project Gutenberg boilerplate)
len(content)

266996

In [10]:
sample_text = ("US unveils world's most powerful supercomputer, beats China. "
               "The US has unveiled the world's most powerful supercomputer called 'Summit', "
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, "
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [11]:
# First 100 characters in the corpus
content[0:100]

'\ufeffThe Project Gutenberg eBook of The Bible, King James version, Book 1: Genesis\r\n    \r\nThis ebook is '

## Sentence Tokenization

A sentence is both a syntactic and a logical unit of content in a corpus. Identifying sentence boundaries is an important step in understanding text or preparing it for downstream tasks.

Sentence tokenization is the process of determining sentence boundaries in a given corpus.

One might think that determining a sentence boundary is trivial: just split on ".". While this works in most cases, there are exceptions, such as the "." in abbreviations and shorthand notations (Mr. or Mrs., for example).
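
The abbreviation problem is easy to reproduce with a naive splitter. The snippet below is a minimal sketch that uses only the standard library; the example sentence and the splitting rule are illustrative assumptions, not what NLTK does internally.

```python
import re

# Two real sentences, with periods that do not end a sentence.
text = "Mr. Smith met Mrs. Jones at 5 p.m. on Friday. They discussed the merger."

# Naive rule: every "." followed by whitespace is a sentence boundary.
naive_sentences = re.split(r"(?<=\.)\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule returns five fragments ('Mr.', 'Smith met Mrs.', 'Jones at 5 p.m.', 'on Friday.', 'They discussed the merger.') for a text that contains only two sentences, which is why trained sentence tokenizers are preferred.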


Let us now go through a few standard ways of performing sentence tokenization.

### NLTK's Default Tokenizer

NLTK provides a number of sentence tokenizers. Let's first have a look at the default one.

In [12]:
bib_sentences = nltk.sent_tokenize(text=content)
sample_sentences = nltk.sent_tokenize(text=sample_text)

print('Total sentences in sample_text:{}'.format(len(sample_sentences)))
print('Sample text sentences :-')
print(np.array(sample_sentences))

print('\nTotal sentences in bible:', len(bib_sentences))
print('First 5 sentences in bible:-')
print(np.array(bib_sentences[0:5]))

Total sentences in sample_text:4
Sample text sentences :-
["US unveils world's most powerful supercomputer, beats China."
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight."
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.'
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

Total sentences in bible: 1589
First 5 sentences in bible:-
['\ufeffThe Project Gutenberg eBook of The Bible, King James version, Book 1: Genesis\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever.'
 'You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenber

In [13]:
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [14]:
nltk.sent_tokenize(sample_text)

["US unveils world's most powerful supercomputer, beats China.",
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.",
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.',
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

### Tokenize German Sentences

NLTK also provides sentence tokenizers for languages other than English. The following example tokenizes German text.

In [15]:
from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
print(german_text[0:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [16]:
# default sentence tokenizer
default_st = nltk.sent_tokenize
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

In [17]:
german_sentences[:5]

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']

In [18]:
# check if results of both tokenizers match
# should be True
print(german_sentences_def == german_sentences)

True


In [19]:
# print first 5 sentences of the corpus
print(np.array(german_sentences[:5]))

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .'
 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .'
 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .'
 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .'
 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']


## Word Tokenization

A word is a basic building block for NLP tasks, and a combination of words makes up a sentence. The previous section covered sentence tokenization; in this section, we focus on identifying word boundaries.

We will use word tokenizers from ``nltk`` as well as ``spacy`` in this section.
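
To see what a word tokenizer has to do beyond splitting on whitespace, here is a small standard-library sketch. The regular expression is an illustrative assumption, not the rule set used by ``nltk`` or ``spacy``.

```python
import re

text = "US unveils world's most powerful supercomputer, beats China."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Illustrative pattern (an assumption, far simpler than a real tokenizer):
#   \w+(?:-\w+)*   a word, optionally with internal hyphens
#   '\w+           an apostrophe followed by letters, e.g. the clitic 's
#   [^\w\s]        any single remaining punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|'\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting yields tokens such as `"world's"` and `'China.'`, while the pattern separates `'s`, the comma and the final period into their own tokens. A pattern this simple would still break numbers such as `200,000` apart, which is one reason to rely on the library tokenizers used in the rest of this section.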

In [20]:
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [21]:
# default tokenizer
words = nltk.word_tokenize(sample_text)
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

In [23]:
# utility for tokenization
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

sents = tokenize_text(sample_text)
#np.array(sents)

In [24]:
words = [word for sentence in sents
                for word in sentence]
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

### Spacy Tokenizer

``spacy`` provides easy-to-use interfaces to perform sentence and word tokenization.

In [27]:
import spacy

In [28]:
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [30]:
nlp = spacy.load('en_core_web_sm')

In [31]:
text_spacy = nlp(sample_text)
type(text_spacy)

spacy.tokens.doc.Doc

In [32]:
text_spacy.sents

<generator at 0x31c268ea0>

In [33]:
list(text_spacy.sents)

[US unveils world's most powerful supercomputer, beats China.,
 The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.,
 With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.,
 Summit has 4,608 servers, which reportedly take up the size of two tennis courts.]

In [53]:
sents = list(text_spacy.sents)

# Determine the maximum length of sentences
max_length = max(len(sentence) for sentence in sents)

# Extract sentences as lists of token texts
sents = [[token.text for token in sent] for sent in text_spacy.sents]

# Pad sentences with empty strings to make them the same length
padded_sents = [sentence + [""] * (max_length - len(sentence)) for sentence in sents]

# Convert to NumPy array
sents = np.array(padded_sents)


#sents = np.array(list(text_spacy.sents))
#sents = list(text_spacy.sents)
sents

array([['US', 'unveils', 'world', "'s", 'most', 'powerful',
        'supercomputer', ',', 'beats', 'China', '.', '', '', '', '', '',
        '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''],
       ['The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most',
        'powerful', 'supercomputer', 'called', "'", 'Summit', "'", ',',
        'beating', 'the', 'previous', 'record', '-', 'holder', 'China',
        "'s", 'Sunway', 'TaihuLight', '.', '', '', '', '', ''],
       ['With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion',
        'calculations', 'per', 'second', ',', 'it', 'is', 'over',
        'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
        'which', 'is', 'capable', 'of', '93,000', 'trillion',
        'calculations', 'per', 'second', '.'],
       ['Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly',
        'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts',
        '.', '', '', '', '', '', '', '', '', '', '', '', '', ''

In [54]:
sents[0]

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''],
      dtype='<U13')

In [55]:
type(sents[0])

numpy.ndarray

In [56]:
# `sents` is now a NumPy array of strings, so spaCy attributes such as `.text`
# are no longer available on it; go back to the original sentence spans
spacy_sents = list(text_spacy.sents)
spacy_sents[0].text

"US unveils world's most powerful supercomputer, beats China."

In [57]:
type(spacy_sents[0].text)

str

In [59]:
# word tokens for each sentence, read from the spaCy spans
sent_words = [[word.text for word in sent] for sent in spacy_sents]
print(sent_words[0])

['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.']

In [60]:
# number of tokens per sentence
[len(words) for words in sent_words]

[11, 26, 31, 16]

In [None]:
# convert from text_spacy.sents to list of string-sentences
[sent.text for sent in list(text_spacy.sents)]

["US unveils world's most powerful supercomputer, beats China.",
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.",
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.',
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

In [None]:
words = [word.text for word in text_spacy]  #word tokenization
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "'", ',', 'beating',
       'the', 'previous', 'record', '-', 'holder', 'China', "'s",
       'Sunway', 'TaihuLight', '.', 'With', 'a', 'peak', 'performance',
       'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',',
       'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway',
       'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000',
       'trillion', 'calculations', 'per', 'second', '.', 'Summit', 'has',
       '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up',
       'the', 'size', 'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

## Handling Non-Text Characters

Natural text consists of various types of characters such as letters, digits, symbols, emoticons, non-printable characters and so on. For most practical use cases, we keep only letters (and sometimes digits) and discard the rest.

In this section, we will focus on identifying non-text characters and removing them from our corpus safely.

We will focus on handling the following types of characters:
- Accented Characters
- Special Characters
- HTML Tags & Noise


### Accented Characters

The most common accents are the acute (é), grave (è), circumflex (â, î or ô), tilde (ñ), umlaut and dieresis (ü or ï; the same symbol is used for two different purposes), and cedilla (ç). Accent marks (also referred to as diacritics or diacriticals) usually appear above or below a character.

These characters are part of the extended alphabet in languages such as French and Spanish.

In [61]:
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [62]:
s = 'Sómě Áccěntěd těxt'
s

'Sómě Áccěntěd těxt'

In [63]:
remove_accented_chars(s)

'Some Accented text'
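
The function above works because NFKD normalization decomposes an accented letter into a base letter followed by combining marks, and the ASCII encoding step then drops the marks. The following sketch makes that visible; the example words are assumptions chosen for illustration.

```python
import unicodedata

word = "caf\u00e9"  # 'café' written with a precomposed é (U+00E9)

# NFKD splits the accented letter into 'e' plus a combining acute accent.
decomposed = unicodedata.normalize("NFKD", word)
print(len(word), len(decomposed))

# Dropping the combining marks keeps the base letters.
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(stripped)

# Letters without a decomposition have no ASCII base letter,
# so encoding to ASCII with 'ignore' deletes them entirely.
german = "Stra\u00dfe"  # 'Straße'
ascii_only = unicodedata.normalize("NFKD", german).encode("ascii", "ignore").decode("ascii")
print(ascii_only)
```

The last line prints `Strae`: characters such as `ß` are removed rather than transliterated, which is worth keeping in mind before applying this step to non-English text.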

### Special Characters

Symbols, emoticons and characters such as ``#`` and ``@`` are considered special characters.

In [64]:
# [^a-zA-Z0-9\s] => removes anything that is not an English letter, a digit or whitespace

In [65]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    print('Pattern used is:', pattern)
    text = re.sub(pattern, '', text)
    return text

In [66]:
s = "Well this was fun! See you at 7:30, What do you think!!? #$@@9318@ 🙂🙂🙂"
s

'Well this was fun! See you at 7:30, What do you think!!? #$@@9318@ 🙂🙂🙂'

In [67]:
remove_special_characters(s, remove_digits=True)

Pattern used is: [^a-zA-Z\s]


'Well this was fun See you at  What do you think  '

In [68]:
remove_special_characters(s)

Pattern used is: [^a-zA-Z0-9\s]


'Well this was fun See you at 730 What do you think 9318 '
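
Note that the ASCII-only pattern also deletes accented letters. If accents have not been removed beforehand, a Unicode-aware pattern may be a better fit. The sketch below compares the two; the sample string is an assumption, and `\w` also keeps digits and underscores.

```python
import re

# 'Café Zoë costs 5€, see you at 7:30! #fun' with explicit escapes
# so that é and ë are single precomposed code points.
text = "Caf\u00e9 Zo\u00eb costs 5\u20ac, see you at 7:30! #fun"

# ASCII-only pattern, as used above: accented letters are removed too.
ascii_only = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# Unicode-aware pattern: \w matches letters and digits of any script (plus '_').
unicode_aware = re.sub(r"[^\w\s]", "", text)

print(ascii_only)
print(unicode_aware)
```

The first result is `Caf Zo costs 5 see you at 730 fun`, while the second keeps `Café Zoë` intact and still drops the currency symbol, punctuation and `#`.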

### HTML Tags & Noise

Many times, NLP datasets are collected as part of web-scraping activities. Web-scraping involves scanning various websites to extract text from them. This process leads to content which is a mix of actual text as well as HTML tags.

In this section, we will fetch the HTML version of **The Bible** book and then use ``BeautifulSoup`` to strip the HTML tags and recover the actual text.

In [69]:
bible_html_url

'http://www.gutenberg.org/cache/epub/8001/pg8001.html'

In [70]:
data = requests.get(bible_html_url)
content = data.text
print(content[6030:7000])

rg.org/ebooks/8001/pg8001.html">
<meta property="og:image" content="https://www.gutenberg.org/ebooks/8001/pg8001.cover.medium.jpg">
</head><body><section class="pg-boilerplate pgheader" id="pg-header" lang="en"><h2 id="pg-header-heading" title="">The Project Gutenberg eBook of <span lang="en" id="pg-title-no-subtitle">The Bible, King James version, Book 1: Genesis</span></h2>
    
<div>This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at <a class="reference external" href="https://www.gutenberg.org">www.gutenberg.org</a>. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.</div>

<div class="container" id="pg-machine-header"><p><st


In [71]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # drop <iframe> and <script> elements together with their contents
    for element in soup(['iframe', 'script']):
        element.extract()
    stripped_text = soup.get_text()
    # collapse runs of newline characters into a single newline
    # ('|' inside a character class is a literal pipe, so it is not needed here)
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

In [72]:
clean_content = strip_html_tags(content)
print(clean_content[4235:4900])

g and the morning were the fifth day.
01:001:024 And God said, Let the earth bring forth the living creature
           after his kind, cattle, and creeping thing, and beast of the
           earth after his kind: and it was so.
01:001:025 And God made the beast of the earth after his kind, and cattle
           after their kind, and every thing that creepeth upon the earth
           after his kind: and God saw that it was good.
01:001:026 And God said, Let us make man in our image, after our
           likeness: and let them have dominion over the fish of the sea,
           and over the fowl of the air, and over the cattle, and over
           all the ea



That seemed to have worked like a charm!
---
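
For small snippets, the same idea can be sketched with the standard library alone. The class below is a hypothetical helper (`TextExtractor` and `html_to_text` are names introduced here, not part of ``BeautifulSoup``), and it is far less forgiving of malformed HTML than a dedicated parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html = ('<div>Visit <a href="https://www.gutenberg.org">www.gutenberg.org</a>.'
        '<script>var x = 1;</script></div>')
print(html_to_text(html))
```

This prints `Visit www.gutenberg.org.`: the link text is kept, while the tags and the script body are discarded.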

## Text Normalization

In this section, we will prepare utilities that address common issues with textual data:

- Expand Contractions
- Stemming
- Lemmatization
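
Contraction expansion, the first item in the list above, can be sketched with a lookup table and a regular expression. The mapping below is a tiny hypothetical sample, and the names `CONTRACTION_MAP` and `expand_contractions` are assumptions introduced for illustration; a practical table would be much larger.

```python
import re

# Tiny, hypothetical lookup table; real projects use a much larger mapping.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
    "you're": "you are",
}

# One alternation over all known contractions, matched case-insensitively.
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    def replace(match):
        original = match.group(0)
        expanded = CONTRACTION_MAP[original.lower()]
        # Preserve the capitalisation of the first letter.
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return CONTRACTION_PATTERN.sub(replace, text)


print(expand_contractions("I'm sure it's fine, but don't worry if you can't come."))
```

The call prints `I am sure it is fine, but do not worry if you cannot come.` A lookup table cannot resolve ambiguous forms ("it's" may also mean "it has"), and this sketch only handles straight apostrophes.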

## Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form, which is generally a written word form.
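
As a rough illustration of the idea, the sketch below strips a handful of common English suffixes. It is a deliberately naive toy (the suffix list, the `naive_stem` name and the minimum stem length are assumptions), not the Porter, Lancaster or Snowball algorithm.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for word in ["jumping", "jumps", "jumped", "crashes", "sing", "running"]:
    print(word, "->", naive_stem(word))
```

The first four words reduce to `jump` and `crash`, and `sing` is left alone because its stem would be too short. `running` becomes `runn`, which shows why real stemmers need additional rules (for doubled consonants, for example) and why their output is not always a dictionary word.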

#### Porter Stemmer

The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the more common morphological and inflectional endings from words in English.

In [73]:
from nltk.stem import PorterStemmer

In [74]:
ps = PorterStemmer()

ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped')

('jump', 'jump', 'jump')

In [75]:
ps.stem('lying')

'lie'

In [76]:
ps.stem('strange')

'strang'

#### Lancaster Stemmer

The Lancaster stemmer is more aggressive than the other two stemmers covered here (Porter and Snowball). It is fast, but its output can be confusing for short words, and it is generally considered less efficient than the Snowball stemmers. It applies its rules iteratively, and those rules are stored externally rather than hard-coded in the algorithm.

In [77]:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped')

('jump', 'jump', 'jump')

In [78]:
ls.stem('lying')

'lying'

In [79]:
ls.stem('strange')

'strange'

#### Snowball Stemmer

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.


In [80]:
# Snowball Stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")
print('Supported Languages:', SnowballStemmer.languages)

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [81]:
# stemming on German words
# autobahnen -> highways (plural)
# autobahn -> highway (singular)
ss.stem('autobahnen')

'autobahn'

In [82]:
ps = nltk.stem.PorterStemmer()
ls = nltk.stem.LancasterStemmer()

def simple_stemmer(text, stemmer=ps):
    # split on whitespace and stem each word
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    # alternate way: tokenize with NLTK instead of splitting on whitespace
    # words = nltk.word_tokenize(text)
    # text = ' '.join([stemmer.stem(word) for word in words])
    return text

#### Try calling the function defined above with the Porter and Lancaster stemmers separately

In [83]:
s = "My system keeps crashing his crashed yesterday ours crashes daily and presumably we are not lying"
s

'My system keeps crashing his crashed yesterday ours crashes daily and presumably we are not lying'

In [84]:
simple_stemmer(s, stemmer=ps)

'my system keep crash hi crash yesterday our crash daili and presum we are not lie'

In [85]:
simple_stemmer(s, stemmer=ls)

'my system keep crash his crash yesterday our crash dai and presum we ar not lying'

## Lemmatization

Lemmatization reduces a word to its base or dictionary form, known as the lemma. Unlike stemming, it relies on a vocabulary and morphological analysis of words, and normally aims to remove inflectional endings only.
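
The role of the vocabulary can be illustrated with a toy lookup. The table below is a hypothetical miniature dictionary keyed by surface form and part of speech (`LEMMA_TABLE` and `lookup_lemma` are names introduced here); it only hints at what a resource such as WordNet provides.

```python
# Hypothetical miniature "vocabulary": (surface form, POS) -> lemma.
LEMMA_TABLE = {
    ("ate", "v"): "eat",
    ("running", "v"): "run",
    ("saddest", "a"): "sad",
    ("cars", "n"): "car",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def lookup_lemma(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lookup_lemma("meeting", "v"))  # verb reading
print(lookup_lemma("meeting", "n"))  # noun reading
print(lookup_lemma("ate", "n"))      # wrong POS: no entry, word returned unchanged
```

The same surface form maps to `meet` as a verb and to `meeting` as a noun, and a lookup with the wrong part of speech returns the word unchanged. This is why a lemmatizer needs part-of-speech information to work well.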

In [86]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /Users/laurent/nltk_data...


True

In [87]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [88]:
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

car
men


In [89]:
# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

run
eat


In [90]:
# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

sad
fancy


In [91]:
# ineffective lemmatization: a wrong or missing POS tag (the default is noun) leaves the word unchanged
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))
print(wnl.lemmatize('fancier'))

ate
fancier
fancier


#### Building your own lemmatizer using nltk

Define a function that puts all the above steps together and does the following:
- Function name is `wordnet_lemmatize_text(...)`
- Input is a variable `text` that holds a document (a string of words)
- Tokenize the text
- Get the POS tags of the tokenized text
- Convert the POS tags into WordNet (single-letter) POS tags
- Use NLTK's WordNet lemmatizer
- Return the lemmatized text as the output (as a string)

In [92]:
import nltk

In [93]:
from nltk.corpus import wordnet


In [94]:
s = 'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [95]:
nltk.word_tokenize(s)

['The',
 'brown',
 'foxes',
 'are',
 'quick',
 'and',
 'they',
 'are',
 'jumping',
 'over',
 'the',
 'sleeping',
 'lazy',
 'dogs',
 '!']

In [96]:
tagged_tokens = nltk.pos_tag(nltk.word_tokenize(s))
tagged_tokens

[('The', 'DT'),
 ('brown', 'JJ'),
 ('foxes', 'NNS'),
 ('are', 'VBP'),
 ('quick', 'JJ'),
 ('and', 'CC'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('jumping', 'VBG'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('sleeping', 'VBG'),
 ('lazy', 'JJ'),
 ('dogs', 'NNS'),
 ('!', '.')]

In [None]:
wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV

('a', 'v', 'n', 'r')

In [None]:
tagged_tokens

[('The', 'DT'),
 ('brown', 'JJ'),
 ('foxes', 'NNS'),
 ('are', 'VBP'),
 ('quick', 'JJ'),
 ('and', 'CC'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('jumping', 'VBG'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('sleeping', 'VBG'),
 ('lazy', 'JJ'),
 ('dogs', 'NNS'),
 ('!', '.')]

In [None]:
tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
tag_map

{'j': 'a', 'n': 'n', 'r': 'r', 'v': 'v'}

In [None]:
'JJ'[0].lower()

'j'

In [None]:
tag_map.get('JJ'[0].lower(), wordnet.NOUN)

'a'

In [None]:
tag_map.get('XYZ'[0].lower(), wordnet.NOUN)


'n'

In [None]:
[(word, tag_map.get(tag[0].lower(), wordnet.NOUN))
    for word, tag in tagged_tokens]

[('The', 'n'),
 ('brown', 'a'),
 ('foxes', 'n'),
 ('are', 'v'),
 ('quick', 'a'),
 ('and', 'n'),
 ('they', 'n'),
 ('are', 'v'),
 ('jumping', 'v'),
 ('over', 'n'),
 ('the', 'n'),
 ('sleeping', 'v'),
 ('lazy', 'a'),
 ('dogs', 'n'),
 ('!', 'n')]

In [None]:
from nltk.corpus import wordnet
wnl = WordNetLemmatizer()

def wordnet_lemmatize_text(text):
  # tokenize text
  tokens = nltk.word_tokenize(text)

  # pos tag tokenized text
  tagged_tokens = nltk.pos_tag(tokens)

  # convert raw POS tags into wordnet tags
  tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}

  # treat unknown tags as nouns by default
  new_tagged_tokens = [(word, tag_map.get(tag[0].lower(),
                                          wordnet.NOUN))
                            for word, tag in tagged_tokens]

  lemmatized_text = ' '.join(wnl.lemmatize(word, tag) for word, tag in new_tagged_tokens)
  return lemmatized_text

In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
wordnet_lemmatize_text(s)

'The brown fox be quick and they be jump over the sleep lazy dog !'

#### Spacy Lemmatization

``spacy`` lemmatizes tokens out of the box: every token exposes its lemma through the `lemma_` attribute.

In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
[word.lemma_ for word in nlp(s)]

['the',
 'brown',
 'fox',
 'be',
 'quick',
 'and',
 'they',
 'be',
 'jump',
 'over',
 'the',
 'sleep',
 'lazy',
 'dog',
 '!']

In [None]:
' '.join([word.lemma_ for word in nlp(s)])

'the brown fox be quick and they be jump over the sleep lazy dog !'

In [102]:
import spacy
# the 'en' shortcut was removed in spaCy v3; load the model by its full package name
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    text = nlp(text)
    # spaCy v2 returned '-PRON-' as the lemma of pronouns; keep the original token in that case
    text = ' '.join([word.lemma_
                       if word.lemma_ != '-PRON-' else word.text
                          for word in text])
    return text

In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
lemmatize_text(s)

'the brown fox be quick and they be jump over the sleep lazy dog !'

In [None]:
lemmas = []

text = nlp(s)

for word in text:
  lemmas.append(word.lemma_)

lemmas

['the',
 'brown',
 'fox',
 'be',
 'quick',
 'and',
 'they',
 'be',
 'jump',
 'over',
 'the',
 'sleep',
 'lazy',
 'dog',
 '!']

In [None]:
' '.join(lemmas)

'the brown fox be quick and they be jump over the sleep lazy dog !'

## Stopword Removal

In computing, stop words are words that are filtered out before or after processing natural language data. A stop word is a commonly used word (such as “the”, “a” or “an”) that does not convey much useful information.

We typically remove stopwords before using text for most NLP tasks.
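
The filtering itself amounts to a membership test against a set of words. The sketch below uses a small hand-picked stopword set and a simple regex tokenizer, both of which are assumptions made to keep the example self-contained; the notebook itself relies on NLTK's stopword list.

```python
import re

# Small hand-picked stopword set, for illustration only.
STOPWORDS = frozenset({"the", "a", "an", "and", "are", "over", "they"})


def filter_stopwords(text, stopwords=STOPWORDS):
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercase each token for the comparison only; kept tokens retain their case.
    kept = [token for token in tokens if token.lower() not in stopwords]
    return " ".join(kept)


print(filter_stopwords("The brown foxes are quick and they are jumping over the sleeping lazy dogs!"))
```

The call prints `brown foxes quick jumping sleeping lazy dogs !`. Using a set rather than a list keeps each membership test cheap when the text is long.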

In [None]:
nltk.corpus.stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
# the fox jumps over the dog => 'the' is removed whether is_lower_case is True or False
# The fox jumps over the dog => 'The' is removed only if is_lower_case is False,
#                               because tokens are then lowercased before the lookup

In [None]:
def remove_stopwords(text, is_lower_case=False, stopwords=None):
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]

    if is_lower_case:
        # text is assumed to be lowercased already: compare tokens as they are
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        # lowercase each token for the lookup only, so 'The' matches 'the'
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]

    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
remove_stopwords(s, is_lower_case=False)

'brown foxes quick jumping sleeping lazy dogs !'

In [None]:
remove_stopwords(s, is_lower_case=True)

'The brown foxes quick jumping sleeping lazy dogs !'
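The effect of `is_lower_case` can be reproduced without NLTK. The sketch below is only an illustration: `MINI_STOPWORDS`, `simple_tokenize` and `remove_stopwords_sketch` are hypothetical stand-ins (a five-word stop list and a crude regex tokenizer), not part of NLTK or of the function defined above.

```python
import re

# Hypothetical mini stop list, standing in for nltk.corpus.stopwords.words('english')
MINI_STOPWORDS = {"the", "are", "and", "they", "over"}

def simple_tokenize(text):
    # Crude tokenizer (assumption): runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords_sketch(text, is_lower_case=False, stopwords=MINI_STOPWORDS):
    tokens = simple_tokenize(text)
    if is_lower_case:
        # Compare tokens as they are: 'The' does not match 'the'
        kept = [t for t in tokens if t not in stopwords]
    else:
        # Lowercase only for the lookup: 'The' matches 'the'
        kept = [t for t in tokens if t.lower() not in stopwords]
    return " ".join(kept)

sentence = "The brown foxes are quick and they are jumping over the sleeping lazy dogs!"
case_insensitive = remove_stopwords_sketch(sentence, is_lower_case=False)
case_sensitive = remove_stopwords_sketch(sentence, is_lower_case=True)
print(case_insensitive)
print(case_sensitive)
```

With `is_lower_case=True` the capitalized 'The' survives, which is why that setting should only be used on text that has already been lowercased.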

Remove the word 'the' from the stop_words list, add the word 'brown' to it, and call the function with this new list.

In [None]:
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [None]:
stop_words.remove('the')
stop_words.append('brown')

In [None]:
remove_stopwords(s, is_lower_case=False, stopwords=stop_words)

'The foxes quick jumping the sleeping lazy dogs !'
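A custom stop list can also be built with set operations, which leaves the base list untouched and makes membership checks fast. This is a minimal sketch under stated assumptions: `base_stop_words` is a hypothetical five-word list standing in for NLTK's English stop words, and the tokens are written out by hand instead of being produced by `nltk.word_tokenize`.

```python
# Hypothetical base list, standing in for NLTK's English stop words
base_stop_words = ["the", "are", "and", "they", "over"]

# Drop 'the' and add 'brown' without mutating the base list
custom_stop_words = (set(base_stop_words) - {"the"}) | {"brown"}

# Tokens written out by hand (assumption) to keep the example dependency-free
tokens = ["The", "brown", "foxes", "are", "quick", "and", "they", "are",
          "jumping", "over", "the", "sleeping", "lazy", "dogs", "!"]

filtered = [t for t in tokens if t.lower() not in custom_stop_words]
filtered_text = " ".join(filtered)
print(filtered_text)
```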

In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
filtered_words = []

stopwords = nltk.corpus.stopwords.words('english')

for word in nltk.word_tokenize(s):
  if word.lower() not in stopwords:
    filtered_words.append(word)

filtered_words

['brown', 'foxes', 'quick', 'jumping', 'sleeping', 'lazy', 'dogs', '!']

In [None]:
' '.join(filtered_words)

'brown foxes quick jumping sleeping lazy dogs !'

## Expand Contractions

Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe, such as "can't" for "cannot".

To capture context better, we standardize text by expanding such contractions. The ``contractions`` and ``textsearch`` packages let us do so in just a few lines of code.

In [None]:
!pip install contractions
!pip install textsearch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.21


In [98]:
import contractions

In [99]:
list(contractions.contractions_dict.items())[:10]

[("I'm", 'I am'),
 ("I'm'a", 'I am about to'),
 ("I'm'o", 'I am going to'),
 ("I've", 'I have'),
 ("I'll", 'I will'),
 ("I'll've", 'I will have'),
 ("I'd", 'I would'),
 ("I'd've", 'I would have'),
 ('Whatcha', 'What are you'),
 ("amn't", 'am not')]

In [100]:
sample_string = "Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?"
sample_string

"Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?"

In [101]:
contractions.fix(sample_string)

'You all cannot expand contractions I would think! You would not be able to. How did you do it?'
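To make the idea concrete, dictionary-based expansion can be sketched with a regular expression. This is a simplified illustration and not how the ``contractions`` package is implemented: `CONTRACTION_MAP` and `expand_contractions_sketch` are hypothetical names, and the mapping covers only four entries.

```python
import re

# Hypothetical, tiny mapping; the real package ships a much larger dictionary
CONTRACTION_MAP = {
    "can't": "cannot",
    "wouldn't": "would not",
    "i'd": "I would",
    "y'all": "you all",
}

# One alternation over all keys, longest first, matched case-insensitively
CONTRACTION_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_sketch(text):
    def replace(match):
        original = match.group(0)
        expanded = CONTRACTION_MAP[original.lower()]
        # Keep a leading capital letter if the contraction had one
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return CONTRACTION_PATTERN.sub(replace, text)

expanded = expand_contractions_sketch(
    "Y'all can't expand contractions I'd think! You wouldn't be able to."
)
print(expanded)
```

A plain lookup like this cannot resolve ambiguous forms (for example, "I'd" can mean "I would" or "I had"), so it always picks the single expansion stored in the mapping.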