# Lab 3.2: Processing Text

In this lab, you will look at simple techniques to clean and prepare text data for modeling with machine learning (ML).


## Lab steps

To complete this lab, you will follow these steps:

1. [Working with simple text-cleaning processes](#1.-Working-with-simple-text-cleaning-processes)
2. [Working with lexicon-based text processing](#2.-Working-with-lexicon-based-text-processing)
        


In [None]:
#Upgrade dependencies
!pip install --upgrade pip
!pip install --upgrade scikit-learn
!pip install --upgrade sagemaker

## 1. <a name="1">Working with simple text-cleaning processes</a>
([Go to top](#Lab-3.2:-Processing-Text))

In this section, you will do some general-purpose text cleaning. The following methods for cleaning can be extended, depending on the application.

In [None]:
text = "   This is a message to be cleaned. It might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     .  "
print(text)

First, change the text so that it's all lowercase:

In [None]:
text = text.lower()
print(text)

Next, remove leading and trailing whitespace with the following code:

In [None]:
text = text.strip()
print(text)

Use a regular expression to remove HTML tags or markup:

In [None]:
import re

text = re.compile('<.*?>').sub('', text)
print(text)
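
The `?` in `<.*?>` makes the match non-greedy, so each tag is removed on its own. The following sketch uses a made-up sample string to contrast it with the greedy pattern `<.*>`, which would also remove the text between the first `<` and the last `>`.

```python
import re

sample = "keep <b>this</b> text"  # hypothetical sample string

# Non-greedy: each tag is matched separately, so the text between tags survives
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r'<.*>', '', sample)

print(repr(non_greedy))
print(repr(greedy))
```

The greedy version loses the word *this*, which is why the lab uses the non-greedy form.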

Replace punctuation with a space. Be careful with this task. Depending on the application, punctuation can actually be useful. For example, punctuation might affect the positive or negative meaning of a sentence.

In [None]:
import re, string

text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
print(text)
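
If punctuation carries signal for your task, one option is to keep selected marks and replace only the rest. The following is a minimal sketch that assumes `!` and `?` are worth keeping. The names `sample`, `keep`, and `to_remove` are illustrative and not part of the lab.

```python
import re
import string

sample = "great product!!! would you buy it again? yes, definitely."  # hypothetical

keep = "!?"  # assumption: these marks carry meaning for the task
to_remove = "".join(ch for ch in string.punctuation if ch not in keep)

# Replace only the punctuation characters that are not in `keep`
pattern = re.compile('[%s]' % re.escape(to_remove))
cleaned = pattern.sub(' ', sample)
print(cleaned)
```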

Remove any extra spaces and tabs:

In [None]:
import re

# Use a raw string so that \s is passed to the regex engine unchanged
text = re.sub(r'\s+', ' ', text)
print(text)
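
The steps in this section can be collected into one reusable function. The following is a sketch, not a definitive implementation. The name `clean_text` is hypothetical, and the final `strip()` is an addition that removes any space left at the edges after punctuation is replaced.

```python
import re
import string

def clean_text(text):
    """Apply the cleaning steps from this section in order."""
    text = text.lower().strip()
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Replace punctuation with a space
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r'\s+', ' ', text)
    # Added step: drop a space left at either end by the replacements above
    return text.strip()

raw = "   This is a message to be cleaned. It might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     .  "
cleaned = clean_text(raw)
print(cleaned)
```

Wrapping the steps in a function makes it easier to apply the same cleaning to every row of a dataset.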

## 2. Working with lexicon-based text processing
([Go to top](#Lab-3.2:-Processing-Text))

In the previous section, you used some general-purpose text pre-processing methods. Lexicon-based methods are usually applied *after* the common text-processing methods. They are used to normalize sentences in the dataset. *Normalization* means converting words to a common format so that similarities (if any) between sentences become easier to detect.
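
To see how normalization can bring out similarities, the following sketch compares two made-up sentences before and after normalizing them. It measures token overlap with Jaccard similarity. The `root` mapping is a hand-written stand-in for a stemmer or lemmatizer, and all names here are illustrative assumptions.

```python
def jaccard(a, b):
    """Share of distinct tokens that two token lists have in common."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical sentences, already lowercased and stripped of punctuation
s1 = "the cats jumped over the fences".split()
s2 = "a cat jumps over a fence".split()

# Hand-written stand-in for a stemmer or lemmatizer (illustration only)
root = {"cats": "cat", "jumped": "jump", "jumps": "jump", "fences": "fence"}
stop = {"a", "the"}

def normalize(tokens):
    # Drop stopwords and map each remaining token to its root form
    return [root.get(t, t) for t in tokens if t not in stop]

before = jaccard(s1, s2)
after = jaccard(normalize(s1), normalize(s2))
print(before, after)
```

Before normalization the sentences share only the token *over*. After normalization they have identical token sets.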

For this example, you must install some packages:

- punkt - A pretrained sentence tokenizer for the English language
- averaged_perceptron_tagger - A part-of-speech (POS) tagger
- wordnet - A large database of English words that can be used to find the meanings of words, synonyms, antonyms, and more

Run the following cell:

In [None]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

#### Stopword removal
Some words occur very frequently in sentences but contribute little to the overall meaning. Typically, you keep a list of these words and remove them from each sentence. For example, stopwords include: *a*, *an*, *the*, *this*, *that*, *is*, *it*, *to*, and *and*.

In [None]:
# Use a tokenizer from the NLTK library
import nltk
from nltk.tokenize import word_tokenize

filtered_sentence = []

# Stopword lists can be adjusted for your problem
stopwords = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    if w not in stopwords:
        filtered_sentence.append(w)
text = " ".join(filtered_sentence)

In [None]:
print(text)
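
The same filter can be written more compactly. The following sketch assumes that the text is already cleaned, so a whitespace split is enough to tokenize it. It stores the stopwords in a set for fast membership checks. The function name `remove_stopwords` and the sample string are illustrative.

```python
stopwords = {"a", "an", "the", "this", "that", "is", "it", "to", "and"}

def remove_stopwords(text, stopwords):
    # str.split() stands in for word_tokenize here;
    # this assumes punctuation has already been removed
    return " ".join(w for w in text.split() if w not in stopwords)

sample = "this is a message to be cleaned it might involve some things like adjacent spaces and tabs"
filtered = remove_stopwords(sample, stopwords)
print(filtered)
```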

#### Stemming words
Stemming is a rule-based system for converting words into their root form. It removes suffixes from words. This process helps enhance similarities (if any) between sentences. 

Examples:

"jumping", "jumped" -> "jump"

"cars" -> "car"

In [None]:
# Use a tokenizer and stemmer from the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

# Initialize the stemmer
snow = SnowballStemmer('english')

stemmed_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    # Stem the word/token
    stemmed_sentence.append(snow.stem(w))
stemmed_text = " ".join(stemmed_sentence)

In [None]:
print(stemmed_text)

From the previous cell, you can see that the stemming operation is *not* perfect. It produced some stems that are not real words, such as *messag*, *involv*, and *adjac*. Stemming is a rule-based method, so it sometimes removes suffixes that it should keep. Nevertheless, it runs quickly.
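
To see why rule-based suffix removal produces such stems, consider the following toy stemmer. It is a deliberately simplified illustration and is *not* the Snowball algorithm. The suffix list, the `min_stem` threshold, and the name `toy_stem` are all assumptions made for this sketch.

```python
# Toy suffix rules, checked in order; NOT the Snowball algorithm
SUFFIXES = ["ing", "ed", "e", "s"]

def toy_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        # Strip the first matching suffix if enough of the word remains
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["jumping", "jumped", "cars", "message", "involve"]:
    print(w, "->", toy_stem(w))
```

The rules handle *jumping*, *jumped*, and *cars* as intended, but they also cut the final *e* from *message* and *involve* because they have no knowledge of which results are real words.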

#### Lemmatizing words
If you are not satisfied with the result of stemming, you can use lemmatization instead. It usually requires more work because it looks up each word in a dictionary and needs the word's part of speech, but it gives better results.

In [None]:
# Importing the necessary functions
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
wl = WordNetLemmatizer()

# This is a helper function to map NLTK part-of-speech tags to WordNet tags
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lemmatized_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
# Get part-of-speech (POS) tags
word_pos_tags = nltk.pos_tag(words)
# Map each POS tag to its WordNet equivalent and lemmatize the token
for word, tag in word_pos_tags:
    lemmatized_sentence.append(wl.lemmatize(word, get_wordnet_pos(tag)))

lemmatized_text = " ".join(lemmatized_sentence)

In [None]:
print(lemmatized_text)
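
The part-of-speech tag matters because the same word can have different lemmas depending on its role. The following sketch illustrates the idea with a hypothetical mini-lexicon. A real lemmatizer such as `WordNetLemmatizer` consults a full dictionary instead, and the names `LEXICON`, `TAG_TO_POS`, and `toy_lemmatize` are assumptions made for this example.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech); illustration only
LEXICON = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
    ("cars", "n"): "car",
}

# First letter of a Penn Treebank tag -> single-letter part of speech
TAG_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def toy_lemmatize(word, penn_tag):
    # Default to noun for unknown tags, as the helper function above does
    pos = TAG_TO_POS.get(penn_tag[:1], "n")
    # Fall back to the word itself when the lexicon has no entry
    return LEXICON.get((word, pos), word)

print(toy_lemmatize("meeting", "VBG"))
print(toy_lemmatize("meeting", "NN"))
print(toy_lemmatize("better", "JJR"))
```

Without the tag, a lookup could not tell whether *meeting* should stay as it is or be reduced to *meet*.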

You can use the tasks you completed in this notebook for many of the business problems that you will work on in this course. 

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2021 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*