
# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 2</a>

## Text Processing

In this notebook we explore techniques to clean text features and convert them into numerical features that machine learning algorithms can work with.

1. <a href="#1">Common text pre-processing</a>
2. <a href="#2">Lexicon-based text processing</a>
3. <a href="#3">Text Vectorization - Bag of Words</a>
4. <a href="#4">Putting it all together</a>



## 1. <a name="1">Common text pre-processing</a>
(<a href="#0">Go to top</a>)

In this section, we will do some general purpose text cleaning.

In [10]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2023.6.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (770 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.6.3


In [11]:
text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "

Let's first lowercase our text. 

In [12]:
text = text.lower()
print(text)

   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


We can get rid of leading/trailing whitespace with the following:

In [13]:
text = text.strip()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .


Remove HTML tags/markups:

In [14]:
import re

text = re.compile('<.*?>').sub('', text)
print(text)

this is a message to be cleaned. it may involve some things like: , ?, :, ''  adjacent spaces and tabs     .
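
The pattern `<.*?>` relies on a non-greedy quantifier (`*?`), so each tag is matched on its own. As a small sketch of why this matters, the snippet below compares it with the greedy variant `<.*>` on a made-up sample string (the string is an assumption for illustration, not part of the dataset):

```python
import re

sample = "good <b>value</b> overall<br />"  # made-up sample string

# Non-greedy: each tag is matched separately, so the text between tags survives.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first '<' to the last '>' on the line.
greedy = re.sub(r"<.*>", "", sample)

print(repr(non_greedy))
print(repr(greedy))
```

The greedy version also deletes the words that sit between the first and the last tag, which is usually not what we want when cleaning text.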


Replace punctuation with spaces:

In [15]:
import re, string

text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
print(text)

this is a message to be cleaned  it may involve some things like              adjacent spaces and tabs      
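
The regular expression above builds a character class from `string.punctuation`. As an alternative sketch, the same replacement can be done with `str.translate`; the variable names below are illustrative, and the sample is a shortened copy of our text:

```python
import re
import string

sample = "it may involve some things like: <br>, ?, :, ''"  # shortened copy of the sample text

# Approach used above: a regex character class built from string.punctuation.
via_regex = re.sub("[%s]" % re.escape(string.punctuation), " ", sample)

# Alternative: a translation table mapping each punctuation character to a space.
punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
via_translate = sample.translate(punct_to_space)

print(via_regex == via_translate)  # both approaches give the same string
```

Note that `string.punctuation` only contains ASCII punctuation, so characters such as curly quotes are left untouched by either approach.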


Remove extra spaces and tabs:

In [16]:
import re

text = re.sub(r'\s+', ' ', text)  # raw string avoids an invalid-escape warning for \s
print(text)

this is a message to be cleaned it may involve some things like adjacent spaces and tabs 


## 2. <a name="2">Lexicon-based text processing</a>
(<a href="#0">Go to top</a>)

In the previous section we saw some general-purpose text pre-processing methods. Lexicon-based methods are usually used __to normalize__ the sentences in our dataset. The normalized sentences are later used for feature extraction. By normalization we mean bringing the words in our sentences into a common form, which makes similarities between sentences (if any) easier to detect.

### Stop word removal

Some words occur very frequently in our sentences and contribute little to their overall meaning. We usually keep a list of these words and remove them from each of our sentences. For example: "a", "an", "the", "this", "that", "is"

In [17]:
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

filtered_sentence = []
words = text.split(" ")
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
text = " ".join(filtered_sentence)

In [18]:
print(text)

message be cleaned may involve some things like adjacent spaces tabs 
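
The loop above can be written more compactly. The sketch below is one possible variant, not the notebook's method: it assumes a `set` for faster membership checks and uses `split()` without an argument, which also drops the empty token caused by the trailing space. The helper name `remove_stop_words` is hypothetical.

```python
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "it", "to", "and"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # split() with no argument splits on any run of whitespace and drops empty tokens
    tokens = sentence.split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = remove_stop_words("this is a message to be cleaned ")
print(repr(cleaned))
```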


### Stemming

Stemming is a rule-based technique that __converts words into their root form__ by removing suffixes. This helps us enhance similarities (if any) between sentences.

Example:

"jumping", "jumped" -> "jump"

"cars" -> "car"

In [19]:
# We use the NLTK library
import nltk
from nltk.stem import SnowballStemmer

# Initialize the stemmer
snow = SnowballStemmer('english')

stemmed_sentence = []
words = text.split(" ")
for w in words:
    stemmed_sentence.append(snow.stem(w))
text = " ".join(stemmed_sentence)

In [20]:
print(text)

messag be clean may involv some thing like adjac space tab 
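
As the output shows, a stem is not always a dictionary word ("messag", "adjac"). A quick sketch that runs the same `SnowballStemmer` on the example words from above and on two words from our text:

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Example words from this section plus two words from our sample text
for word in ["jumping", "jumped", "cars", "message", "adjacent"]:
    print(word, "->", stemmer.stem(word))
```

What matters for feature extraction is that different forms of a word map to the same stem, not that the stem itself is readable.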


## 3. <a name="3">Text Vectorization - Bag of Words</a>
(<a href="#0">Go to top</a>)

__Machine learning models expect numerical or categorical values as input and won't work with raw text data__.

Let's convert a text feature into numerical data using the __Bag of Words (BoW)__ representation.

The __Bag of Words (BoW)__ method involves two steps:
1. Create a vocabulary of words across the entire text feature
2. Measure the presence of the vocabulary words in each sample of the text feature
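
To make the two steps concrete, here is a minimal hand-written sketch in plain Python. It is an illustration under simplifying assumptions (lowercasing, and keeping only tokens of two or more word characters); the helper names `tokenize` and `vectorize` are hypothetical.

```python
import re

docs = ["This is the first document.", "And the third one."]

def tokenize(doc):
    # Assumption: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Step 1: build an alphabetically ordered vocabulary across all documents
all_words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(all_words)}

# Step 2: mark the presence (1) or absence (0) of each vocabulary word
def vectorize(doc):
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocabulary]

vectors = [vectorize(doc) for doc in docs]
print(vocabulary)
print(vectors)
```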

Here we use scikit-learn's Bag of Words implementation, [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer(binary=True)

text_feature = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'Or is the second document?',
    'Maybe this is the fourth document?'    
]

text_feature_vectorized = countVectorizer.fit_transform(text_feature)

Let's print the vocabulary below. 

In [22]:
print(countVectorizer.vocabulary_)

{'this': 11, 'is': 4, 'the': 9, 'first': 2, 'document': 1, 'second': 8, 'and': 0, 'third': 10, 'one': 6, 'or': 7, 'maybe': 5, 'fourth': 3}


The number next to each word is its index in the vocabulary, which is ordered alphabetically: {and:0, document:1, first:2, ...}.

__Note:__ By default, `CountVectorizer` lowercases the text and drops punctuation and single-character tokens during tokenization, but it does not apply the other cleaning steps discussed here (for example, HTML tag removal). <br/>
Lexicon-based methods are not applied automatically either; we need to call them before feature extraction.

Now, let's print the vectorized version of our text features.

In [23]:
print(text_feature_vectorized.toarray())

[[0 1 1 0 1 0 0 0 0 1 0 1]
 [0 1 0 0 1 0 0 0 1 1 0 1]
 [1 0 0 0 0 0 1 0 0 1 1 0]
 [0 1 1 0 1 0 0 0 0 1 0 1]
 [0 1 0 0 1 0 0 1 1 1 0 0]
 [0 1 0 1 1 1 0 0 0 1 0 1]]
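
Because we set `binary=True`, each entry only records whether a word is present. This is why the second sentence, which contains "second" twice, still shows a 1 in that column. A minimal sketch of the difference, assuming we fit a fresh vectorizer on that single sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["This is the second second document."]

# binary=True records presence only; binary=False (the default) records counts
binary_row = CountVectorizer(binary=True).fit_transform(doc).toarray()[0]
count_row = CountVectorizer(binary=False).fit_transform(doc).toarray()[0]

# Vocabulary order here: document, is, second, the, this
print(binary_row)
print(count_row)
```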


__What happens when we encounter a new word during prediction?__ 

__New words are skipped__. The vocabulary is fixed when we call __.fit_transform()__ on the training data. For validation and test text, and at prediction time, we call __.transform()__ instead, which reuses that vocabulary and ignores any word it has not seen. This is the typical situation in real-time prediction, where the model is not re-trained to accommodate new words.

In [24]:
test_text_sample = ["this document has some new words",
                 "this one is new too"]

test_text_sample_vectorized = countVectorizer.transform(test_text_sample)
print(test_text_sample_vectorized.toarray())

[[0 1 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 1 0 1 0 0 0 0 1]]


Note that these two vectors have the same length as the earlier ones (same vocabulary); the new words are simply ignored.
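
To see which words were dropped, we can mimic the lookup by hand. The sketch below is a simplified stand-in for what `.transform()` does, assuming plain whitespace tokenization and the vocabulary printed earlier; the helper `transform_with_report` is hypothetical.

```python
# Vocabulary learned from the training sentences (as printed earlier)
vocabulary = {"and": 0, "document": 1, "first": 2, "fourth": 3, "is": 4, "maybe": 5,
              "one": 6, "or": 7, "second": 8, "the": 9, "third": 10, "this": 11}

def transform_with_report(sentence):
    vector = [0] * len(vocabulary)
    skipped = []
    for token in sentence.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] = 1  # known word: mark its column
        else:
            skipped.append(token)          # unknown word: no column exists for it
    return vector, skipped

vector, skipped = transform_with_report("this document has some new words")
print(vector)
print(skipped)
```

If a large share of the tokens ends up in `skipped`, the training vocabulary may no longer cover the text the model sees at prediction time.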

## 4. <a name="4">Putting it all together</a>
(<a href="#0">Go to top</a>)

Let's walk through a full text processing example. We apply everything discussed in this notebook, cleaning and vectorization, to a new text feature.

In [25]:
# Prepare cleaning functions
import re, string
import nltk
from nltk.stem import SnowballStemmer

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

stemmer = SnowballStemmer('english')

def preProcessText(text):
    # lowercase and strip leading/trailing white space
    text = text.lower().strip()
    
    # remove HTML tags
    text = re.compile('<.*?>').sub('', text)
    
    # remove punctuation
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    
    # remove extra white space
    text = re.sub(r'\s+', ' ', text)
    
    return text

def lexiconProcess(text, stop_words, stemmer):
    filtered_sentence = []
    words = text.split(" ")
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(stemmer.stem(w))
    text = " ".join(filtered_sentence)
    
    return text

def cleanSentence(text, stop_words, stemmer):
    return lexiconProcess(preProcessText(text), stop_words, stemmer)

In [26]:
# Prepare vectorizer 
from sklearn.feature_extraction.text import CountVectorizer

textvectorizer = CountVectorizer(binary=True, max_features=50)  # max_features caps the vocabulary size

In [27]:
# Clean and vectorize a text feature with four samples
text_feature = ["I liked the material, color and overall how it looks.<br /><br />",
             "Worked okay first two times I used it, but third time burned my face.",
             "I am not sure about this product.",
             "I never thought I would pay so much for a hair dryer.",
            ]
#print(len(text_feature))

# Clean up the text
text_feature_cleaned = [cleanSentence(item, stop_words, stemmer) for item in text_feature]

# Vectorize the cleaned text
text_feature_vectorized = textvectorizer.fit_transform(text_feature_cleaned)
print('Vocabulary: \n', textvectorizer.vocabulary_)
print('Bag of Words Binary Features: \n', text_feature_vectorized.toarray())

#print(text_feature_vectorized.shape)

Vocabulary: 
 {'like': 11, 'materi': 13, 'color': 4, 'overal': 19, 'how': 10, 'look': 12, 'work': 29, 'okay': 18, 'first': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'my': 15, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 14, 'for': 8, 'hair': 9, 'dryer': 5}
Bag of Words Binary Features: 
 [[0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]
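
As a possible variation, `CountVectorizer` accepts a `preprocessor` callable, so the cleaning step can run inside the vectorizer instead of in a separate list comprehension. The sketch below is an assumption-laden example, not the approach used above: the helper `clean` is hypothetical and leaves out stop word removal and stemming to stay short.

```python
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def clean(text):
    # Same general-purpose cleaning steps as in section 1
    text = text.lower().strip()
    text = re.sub(r"<.*?>", "", text)   # remove HTML tags
    text = PUNCT_RE.sub(" ", text)      # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()

reviews = ["I liked the material, color and overall how it looks.<br /><br />",
           "I am not sure about this product."]

# The callable replaces the vectorizer's built-in lowercasing step
vectorizer = CountVectorizer(binary=True, preprocessor=clean)
features = vectorizer.fit_transform(reviews)

print(sorted(vectorizer.vocabulary_))
print(features.shape)
```

Keeping the cleaning inside the vectorizer means the same steps are applied automatically when we later call `.transform()` on new text.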
