<a href="https://colab.research.google.com/github/astrapi69/DroidBallet/blob/master/NLP_D1_2_LC2_Text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1RNy-ds7KWXFs7YheGo9OQwO3OnpvRSU1" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center>Constructor Academy, 2024</center>

# Text Pre-processing

Text differs from the datasets we typically use to build Machine Learning models. Text data needs to be pre-processed into a form that is usable for various NLP tasks, so text processing and wrangling is a necessary and important step in any NLP project.

In this notebook, we will cover:
- Sentence tokenization
- Word tokenization
- Handling non-text characters - accents, special symbols, HTML
- Preprocessing / normalization steps such as stemming, lemmatization, expanding contractions and stopword removal

### Import Libraries

In [None]:
import nltk
import spacy
import numpy as np
import re
import requests
from pprint import pprint
from bs4 import BeautifulSoup

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Get Text Dataset

Project Gutenberg is a large, free and openly available collection of literary works from across the world. In this notebook, we will use the content of the book **The Bible - Book 1: Genesis** to illustrate the different text processing and wrangling steps in the following sections.


We will also use a smaller sample text with a few short sentences to demonstrate examples.

In [None]:
sample_text = ("US unveils world's most powerful supercomputer, beats China. "
               "The US has unveiled the world's most powerful supercomputer called 'Summit', "
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, "
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

## Sentence Tokenization

A sentence is a syntactic as well as a logical division of content in a given corpus. To understand text or prepare it for various tasks, identifying sentence boundaries is an important step.

Sentence tokenization is the process of determining sentence boundaries in a given corpus.

One might think that determining a sentence boundary is trivial: we just need to split on ".". While this generally holds true, there are exceptions. Think of scenarios where we use "." in abbreviations and shorthand notations (such as Mr. or Mrs.).
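The problem with splitting on "." can be seen in a small standard-library sketch. The example sentence and the variable names are made up for illustration and are not part of the notebook's data.

```python
# Naive sentence splitting: break the text on every period.
text = "Mr. Smith met Mrs. Jones at 10 a.m. today. They talked for an hour."

# Split on ".", strip surrounding whitespace and drop empty pieces.
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]

for piece in naive_sentences:
    print(repr(piece))

# The text contains 2 sentences, but the naive split yields 6 fragments,
# because the periods in "Mr.", "Mrs." and "a.m." are treated as boundaries.
print(len(naive_sentences))
```

This is why dedicated sentence tokenizers, which know about abbreviations, are preferable to a plain split.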


Let us now go through a few standard ways of performing sentence tokenization.

### NLTK's Tokenizer

NLTK provides a number of sentence tokenizers. Let's have a look at the default one.

In [None]:
sample_sentences = nltk.sent_tokenize(text=sample_text)

print('Total sentences in sample_text:{}'.format(len(sample_sentences)))
print('Sample text sentences :-')
sample_sentences

Total sentences in sample_text:4
Sample text sentences :-


["US unveils world's most powerful supercomputer, beats China.",
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.",
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.',
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

### spaCy Tokenizer

spaCy determines sentence boundaries as part of its processing pipeline. The sentences are available through the ``sents`` attribute of the processed document.

In [None]:
nlp = spacy.load('en_core_web_sm') # load the english language model, supports other languages also -> check documentation

In [None]:
text_spacy = nlp(sample_text)
# get sentences
sentences = list(text_spacy.sents)
sentences

[US unveils world's most powerful supercomputer, beats China.,
 The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.,
 With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.,
 Summit has 4,608 servers, which reportedly take up the size of two tennis courts.]

In [None]:
sentences[0]

US unveils world's most powerful supercomputer, beats China.

In [None]:
type(sentences[0]) # remember each sentence is not a string but a special spacy data type

spacy.tokens.span.Span

In [None]:
# to access the actual string use this
sentences[0].text

"US unveils world's most powerful supercomputer, beats China."

## Word Tokenization

A word can be considered a basic building block for NLP tasks, and a combination of words makes up a sentence. Having covered sentence tokenization in the previous section, we now focus on identifying word boundaries.

We will make use of various word tokenizers available from ``nltk`` as well as ``spacy`` in this section.
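To see why word boundaries need more than whitespace, here is a minimal standard-library sketch that compares ``str.split`` with a simple regular expression. The pattern is an illustrative assumption, not the rule set used by ``nltk`` or ``spacy``.

```python
import re

sentence = "US unveils world's most powerful supercomputer, beats China."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# Illustrative pattern: a run of word characters (optionally followed by an
# apostrophe part such as "'s"), or a single punctuation character.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
print(regex_tokens)
```

With whitespace splitting, ``supercomputer,`` and ``China.`` keep their punctuation. The regular expression separates ``,`` and ``.`` into tokens of their own, which is closer to what the library tokenizers in the following cells produce.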

### NLTK's Tokenizer

NLTK provides a number of word tokenizers. Let's have a look at the default one.

In [None]:
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [None]:
# default tokenizer
words = nltk.word_tokenize(sample_text)
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

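### spaCy Tokenizer

spaCy tokenizes text as the first step of its pipeline; iterating over the processed document yields the tokens. In the output below, note that spaCy splits ``record-holder`` into three tokens, whereas NLTK kept it as one.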

In [None]:
nlp = spacy.load('en_core_web_sm') # shown just to remind you, no need to do this more than once
text_spacy = nlp(sample_text)

In [None]:
words = [word.text for word in text_spacy]  #word tokenization
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "'", ',', 'beating',
       'the', 'previous', 'record', '-', 'holder', 'China', "'s",
       'Sunway', 'TaihuLight', '.', 'With', 'a', 'peak', 'performance',
       'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',',
       'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway',
       'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000',
       'trillion', 'calculations', 'per', 'second', '.', 'Summit', 'has',
       '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up',
       'the', 'size', 'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

## Handling Non-Text Characters

Natural text consists of various types of characters such as letters, numbers, symbols, emoticons, non-printable characters and so on. For most practical use cases we limit ourselves to letters (and at most numbers) and ignore other types of characters.

In this section, we will focus on identification of non-text characters and how to remove them from our corpus safely.

We will focus on handling the following types of characters:
- Accented Characters
- Special Characters
- HTML Tags & Noise


### Accented Characters

The most common accents are the acute (é), grave (è), circumflex (â, î or ô), tilde (ñ), umlaut and dieresis (ü or ï; the same symbol is used for two different purposes), and cedilla (ç). Accent marks (also referred to as diacritics or diacriticals) usually appear above a character.

These characters are part of the extended alphabet in languages such as French and Spanish.

In [None]:
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [None]:
s = 'Sómě Áccěntěd těxt'
s

'Sómě Áccěntěd těxt'

In [None]:
remove_accented_chars(s)

'Some Accented text'
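To make the mechanism behind this result more visible, the following sketch shows what ``NFKD`` normalization does to a single accented character. The helper name ``strip_accents`` is hypothetical and only used for illustration.

```python
import unicodedata

# "\u00e9" is the character e with an acute accent.
e_acute = "\u00e9"

# NFKD splits it into a base letter plus a separate combining accent mark.
decomposed = unicodedata.normalize("NFKD", e_acute)
names = [unicodedata.name(ch) for ch in decomposed]
print(names)  # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

def strip_accents(text):
    """Hypothetical helper: drop combining marks but keep all other characters."""
    decomposed_text = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))

print(strip_accents("caf\u00e9"))  # cafe

# Caveat of the encode('ascii', 'ignore') approach: characters that have no
# decomposition, such as the German sharp s ("\u00df"), are dropped entirely.
ascii_only = (
    unicodedata.normalize("NFKD", "Stra\u00dfe")
    .encode("ascii", "ignore")
    .decode("ascii")
)
print(ascii_only)  # Strae
```

Encoding to ASCII with ``'ignore'`` therefore removes the combining marks, but it also silently removes any other non-ASCII character, which may or may not be what you want.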

### Special Characters

Symbols, emoticons and characters such as ``#``, ``@`` etc. are considered special characters

In [None]:
# [^a-zA-Z0-9\s] => this will remove anything which is not a letter (eng alphabet), number or space

In [None]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    print('Pattern used is:', pattern)
    text = re.sub(pattern, '', text)
    return text

In [None]:
s = "Well this was fun! See you at 7:30, What do you think!!? #$@@9318@ 🙂🙂🙂"
s

'Well this was fun! See you at 7:30, What do you think!!? #$@@9318@ 🙂🙂🙂'

In [None]:
remove_special_characters(s, remove_digits=True)

Pattern used is: [^a-zA-Z\s]


'Well this was fun See you at  What do you think  '

In [None]:
remove_special_characters(s)

Pattern used is: [^a-zA-Z0-9\s]


'Well this was fun See you at 730 What do you think 9318 '
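Note that deleting special characters merges ``7:30`` into ``730`` and leaves doubled spaces behind. One possible variant, sketched here under the assumption that keeping token boundaries matters more than keeping the exact spacing, replaces each special character with a space and then collapses whitespace. The function name ``replace_special_characters`` is hypothetical.

```python
import re

def replace_special_characters(text, remove_digits=False):
    """Hypothetical variant: substitute a space instead of deleting characters."""
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    # Replace every unwanted character with a space so that neighbours stay apart.
    text = re.sub(pattern, " ", text)
    # Collapse the resulting runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

example = "See you at 7:30, what do you think!!?"
print(replace_special_characters(example))
# See you at 7 30 what do you think
print(replace_special_characters(example, remove_digits=True))
# See you at what do you think
```

Which behaviour is preferable depends on the downstream task, so treat this as one option rather than a replacement for the function above.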

### HTML Tags & Noise

Many times, NLP datasets are collected as part of web-scraping activities. Web-scraping involves scanning various websites to extract text from them. This process leads to content which is a mix of actual text as well as HTML tags.

In this section we will fetch the HTML version of **The Bible** from Project Gutenberg. We will then use ``BeautifulSoup`` to strip out the HTML tags and get the actual text.
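Before working with the full page, the idea of tag stripping can be tried on a tiny inline snippet. The sketch below deliberately uses the standard library's ``html.parser.HTMLParser`` instead of ``BeautifulSoup``, so it runs without extra packages. The class ``TextExtractor``, the function ``extract_text`` and the snippet are illustrative assumptions.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <iframe>."""

    SKIPPED_TAGS = {"script", "iframe"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

snippet = '<h1>Book 01</h1><script>var x = 1;</script><p id="id00003">In the beginning</p>'
print(extract_text(snippet))
```

The tags and the script body disappear and only the visible text remains. ``BeautifulSoup`` offers the same result with less code and is more forgiving with malformed markup, which is why it is used in the rest of this section.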

In [None]:
bible_html_url = 'http://www.gutenberg.org/cache/epub/8001/pg8001.html'

In [None]:
data = requests.get(bible_html_url)
content = data.text
print(content[7700:9000])

2" style="margin-top: 5em">Book 01        Genesis</h1>

<p id="id00003">01:001:001 In the beginning God created the heaven and the earth.</p>

<p id="id00004" style="margin-left: 0%; margin-right: 0%">01:001:002 And the earth was without form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.</p>

<p id="id00005">01:001:003 And God said, Let there be light: and there was light.</p>

<p id="id00006">01:001:004 And God saw the light, that it was good: and God divided the<br>

           light from the darkness.<br>
</p>

<p id="id00007">01:001:005 And God called the light Day, and the darkness he called<br>

           Night. And the evening and the morning were the first day.<br>
</p>

<p id="id00008">01:001:006 And God said, Let there be a firmament in the midst of the<br>

           waters, and let it divide the waters from the waters.<br>
</p>


In [None]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # drop <iframe> and <script> elements together with their contents
    for tag in soup(['iframe', 'script']):
        tag.extract()
    stripped_text = soup.get_text()
    # collapse runs of line breaks into a single newline
    # (inside [...] a "|" is a literal character, so it is left out of the class)
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

In [None]:
clean_content = strip_html_tags(content)
print(clean_content[1036:2000])


Book 01        Genesis
01:001:001 In the beginning God created the heaven and the earth.
01:001:002 And the earth was without form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.
01:001:003 And God said, Let there be light: and there was light.
01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.
01:001:005 And God called the light Day, and the darkness he called
           Night. And the evening and the morning were the first day.
01:001:006 And God said, Let there be a firmament in the midst of the
           waters, and let it divide the waters from the waters.
01:001:007 And God made the firmament, and divided the waters which were
           under the firmament from the waters which were above the
           firmament: and it was so.
01:001:008 And God called the firmament Heaven. And the evening and the
 



That seems to have worked like a charm!

---

## Text Normalization

In this section, we will prepare utilities to fix different issues with textual data:

- Stemming
- Lemmatization
- Stopword Removal
- Expand Contractions

## Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form, which is generally a written word form. The stem does not need to be a valid dictionary word.
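As a rough intuition for what a stemmer does, here is a deliberately simplistic suffix-stripping sketch. It is a toy and not the Porter or Lancaster algorithm; the function ``toy_stem``, its suffix list and the minimum stem length of 3 are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer: strip the first matching suffix if at least 3 letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumping", "jumps", "jumped", "crashes", "lying"]:
    print(word, "->", toy_stem(word))
```

The three forms of "jump" collapse to the same stem, but "crashes" becomes "crashe" and "lying" is left untouched. Real stemmers use many more rules to handle such cases, and even they can produce stems that are not dictionary words.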

#### Porter Stemmer

The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the more common morphological and inflectional endings from words in English.

In [None]:
from nltk.stem import PorterStemmer

In [None]:
ps = PorterStemmer()

ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped')

('jump', 'jump', 'jump')

In [None]:
ps.stem('lying')

'lie'

In [None]:
ps.stem('strange')

'strang'

#### Lancaster Stemmer
The Lancaster stemmer is more aggressive than the Porter stemmer. It is fast, but its results can be confusing for short words, and it is generally considered less efficient than the Snowball stemmers. It stores its rules externally and applies them iteratively.

In [None]:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped')

('jump', 'jump', 'jump')

In [None]:
ls.stem('lying')

'lying'

In [None]:
ls.stem('strange')

'strange'

In [None]:
ps = nltk.stem.PorterStemmer()
ls = nltk.stem.LancasterStemmer()

def simple_stemmer(text, stemmer=ps):
    # split on whitespace and stem each word
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    # alternate way: tokenize with NLTK before stemming
    # words = nltk.word_tokenize(text)
    # text = ' '.join([stemmer.stem(word) for word in words])
    return text

#### Try calling the above defined function for both Lancaster and Porter stemmer separately

In [None]:
s = "My system keeps crashing his crashed yesterday ours crashes daily and presumably we are not lying"
s

'My system keeps crashing his crashed yesterday ours crashes daily and presumably we are not lying'

In [None]:
simple_stemmer(s, stemmer=ps)

'my system keep crash hi crash yesterday our crash daili and presum we are not lie'

In [None]:
simple_stemmer(s, stemmer=ls)

'my system keep crash his crash yesterday our crash dai and presum we ar not lying'

Stems are not always valid words (for example 'daili' and 'presum' above). When you need valid dictionary words, use lemmatization.

## Lemmatization

Lemmatization reduces a word to its base or dictionary form, which is known as the lemma. Unlike stemming, it relies on a vocabulary and morphological analysis of words, and normally aims to remove inflectional endings only.
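The vocabulary-based idea can be sketched with a tiny hand-written lookup table. Real lemmatizers rely on much larger vocabularies and on part-of-speech information; the table ``TOY_LEMMAS`` and the function ``toy_lemmatize`` are hypothetical and cover only a few words.

```python
# Hypothetical mini vocabulary mapping inflected forms to their lemma.
TOY_LEMMAS = {
    "foxes": "fox",
    "dogs": "dog",
    "are": "be",
    "jumping": "jump",
    "sleeping": "sleep",
}

def toy_lemmatize(text):
    """Lowercase each whitespace-separated word and look it up in the table."""
    words = [word.lower() for word in text.split()]
    return " ".join(TOY_LEMMAS.get(word, word) for word in words)

print(toy_lemmatize("The brown foxes are jumping"))
# the brown fox be jump
```

Unlike a stemmer, the lookup maps "are" to "be", which no amount of suffix stripping could achieve. Words missing from the table are returned unchanged.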

In [None]:
s = 'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

#### spaCy Lemmatization

spaCy lemmatizes out of the box: each token exposes its lemma through the ``lemma_`` attribute.

In [None]:
[word.lemma_ for word in nlp(s)]

['the',
 'brown',
 'fox',
 'be',
 'quick',
 'and',
 'they',
 'be',
 'jump',
 'over',
 'the',
 'sleep',
 'lazy',
 'dog',
 '!']

In [None]:
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_
                          for word in text])
    return text

In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
lemmatize_text(s)

'the brown fox be quick and they be jump over the sleep lazy dog !'

## Stopword Removal

In computing, stop words are words which are filtered out before or after processing of natural language data. A stop word is a commonly used word (such as "the", "a" or "an") which does not convey a lot of useful information.

We typically remove stopwords before using text for most NLP tasks.
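The filtering step itself is simple, as the following self-contained sketch shows. It uses a small made-up stopword set (``SMALL_STOPWORDS``) rather than the NLTK list, and whitespace splitting instead of a proper tokenizer.

```python
# Hypothetical mini stopword set; a set gives fast membership checks.
SMALL_STOPWORDS = frozenset({"the", "are", "and", "they", "over"})

def filter_stopwords(tokens, stopwords=SMALL_STOPWORDS):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]

tokens = "The brown foxes are quick and they are jumping over the sleeping lazy dogs".split()
print(filter_stopwords(tokens))
# ['brown', 'foxes', 'quick', 'jumping', 'sleeping', 'lazy', 'dogs']
```

Comparing on the lowercase form means that "The" at the start of a sentence is removed as well.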

In [None]:
nltk.corpus.stopwords.words('english')  # standard set of stopwords - words with not much meaning

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
def remove_stopwords(text, stopwords=None):
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens] # removing any whitespaces from words
    filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [None]:
s

'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'

In [None]:
remove_stopwords(s)

'brown foxes quick jumping sleeping lazy dogs !'

You can customize your own list of stopwords if needed.

Example:

Remove the word 'the' from the stop_words list, add the word 'brown' to it, and call the function with this new list.

In [None]:
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [None]:
stop_words.remove('the')
stop_words.append('brown')

In [None]:
remove_stopwords(s, stopwords=stop_words)

'The foxes quick jumping the sleeping lazy dogs !'
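One thing to keep in mind is that ``list.remove`` changes the list in place and raises a ``ValueError`` if the word is no longer present, for example when the cell is run a second time. A possible alternative, sketched here with a hypothetical helper ``customize_stopwords``, builds a new list and leaves the original untouched.

```python
def customize_stopwords(base, add=(), drop=()):
    """Return a new stopword list without modifying `base`."""
    dropped = {word.lower() for word in drop}
    custom = [word for word in base if word not in dropped]
    for word in add:
        if word.lower() not in custom:
            custom.append(word.lower())
    return custom

base = ["the", "and", "are"]
custom = customize_stopwords(base, add=["brown"], drop=["the"])
print(custom)  # ['and', 'are', 'brown']
print(base)    # ['the', 'and', 'are']
```

Because the original list is not modified, the cell can be re-run any number of times with the same result.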

## Expand Contractions

Contractions are words or combinations of words that are shortened by dropping letters and replacing them by an apostrophe.

To capture context better, we standardize text by expanding such contractions. The ``contractions`` and ``textsearch`` packages enable us to do so in just a few lines of code.
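Conceptually, expanding contractions is a dictionary lookup combined with pattern matching. The sketch below illustrates the idea with the standard library only. It is not how the ``contractions`` package is implemented, and the mapping ``TOY_CONTRACTIONS`` and the function ``expand_toy`` are hypothetical.

```python
import re

# Hypothetical mini mapping with lowercase keys.
TOY_CONTRACTIONS = {
    "can't": "cannot",
    "wouldn't": "would not",
    "i'd": "I would",
}

# Try longer keys first so that a short key never shadows a longer one.
_alternatives = "|".join(
    re.escape(key) for key in sorted(TOY_CONTRACTIONS, key=len, reverse=True)
)
_pattern = re.compile(r"\b(" + _alternatives + r")\b", flags=re.IGNORECASE)

def expand_toy(text):
    """Replace every known contraction with its expansion."""
    return _pattern.sub(lambda match: TOY_CONTRACTIONS[match.group(0).lower()], text)

print(expand_toy("You can't say I'd refuse."))
# You cannot say I would refuse.
```

This toy version ignores the capitalization of the original word and cannot resolve ambiguous cases such as "I'd" meaning "I had", which is where a maintained package becomes useful.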

In [None]:
!pip install contractions
!pip install textsearch

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, c

In [None]:
import contractions

In [None]:
list(contractions.contractions_dict.items())[:10]

[("I'm", 'I am'),
 ("I'm'a", 'I am about to'),
 ("I'm'o", 'I am going to'),
 ("I've", 'I have'),
 ("I'll", 'I will'),
 ("I'll've", 'I will have'),
 ("I'd", 'I would'),
 ("I'd've", 'I would have'),
 ('Whatcha', 'What are you'),
 ("amn't", 'am not')]

In [None]:
sample_string = "Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?"
sample_string

"Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?"

In [None]:
contractions.fix(sample_string)

'You all cannot expand contractions I would think! You would not be able to. How did you do it?'