I will preprocess text using an approach called bag-of-words, where each text is represented by its words regardless of their order or the grammar they are embedded in. The preprocessing consists of the following steps:

 1. Tokenise 
 2. Normalise 
 3. Remove stop words
 4. Count vectorise
 5. Transform to tf-idf representation
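
To see what "regardless of the order" means in practice, here is a minimal sketch that uses only Python's standard library. The two sentences and the helper name `bag_of_words` are made up for illustration; the actual pipeline in this post uses nltk and scikit-learn instead.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated words, ignoring their order."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order
first = bag_of_words("words will always retain their power")
second = bag_of_words("their power words will always retain")

# Both sentences collapse to the same bag of words
print(first == second)
```

Once word order is thrown away, the two sentences become indistinguishable, which is exactly the simplification that bag-of-words makes.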

# Setting up the Python environment
#### Download ‘stopwords’ and ‘wordnet’ corpora from nltk
The script below can help you download these corpora.

In [1]:
import nltk
nltk.download('stopwords') 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/adarshsalapaka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/adarshsalapaka/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Data

We will use a tiny text dataset, which allows us to monitor the inputs and outputs of each step.

**For this data, I have chosen the speech that V (Hugo Weaving), a bold, anarchistic freedom fighter, gave when he took over the state television broadcast a day after destroying the Old Bailey. He offered the people of London reasons to ignite a revolution against an oppressive, totalitarian (fascist) government. He urged the people of Britain to rise up and meet him on November 5th one year later outside the gates of Parliament, which he promised would also be destroyed.**


His speech goes: Good evening, London. Allow me first to apologize for this interruption. I do, like many of you, appreciate the comforts of every day routine - the security of the familiar, the tranquility of repetition. I enjoy them as much as any bloke. But in the spirit of commemoration, thereby those important events of the past usually associated with someone's death or the end of some awful bloody struggle, a celebration of a nice holiday, I thought we could mark this November the 5th, a day that is sadly no longer remembered, by taking some time out of our daily lives to sit down and have a little chat.

There are of course those who do not want us to speak. I suspect even now, orders are being shouted into telephones, and men with guns will soon be on their way. Why? Because while the truncheon may be used in lieu of conversation, words will always retain their power. Words offer the means to meaning, and for those who will listen, the enunciation of truth. And the truth is, there is something terribly wrong with this country, isn't there?

Cruelty and injustice, intolerance and oppression. And where once you had the freedom to object, to think and speak as you saw fit, you now have censors and systems of surveillance coercing your conformity and soliciting your submission. How did this happen? Who's to blame? Well certainly there are those who are more responsible than others, and they will be held accountable, but again truth be told, if you're looking for the guilty, you need only look into a mirror.

I know why you did it. I know you were afraid. Who wouldn't be? War, terror, disease. There were a myriad of problems which conspired to corrupt your reason and rob you of your common sense. Fear got the best of you, and in your panic you turned to the now high chancellor, Adam Sutler. He promised you order, he promised you peace, and all he demanded in return was your silent, obedient consent. Last night, I sought to end that silence. Last night, I destroyed the Old Bailey, to remind this country of what it has forgotten. More than four hundred years ago, a great citizen wished to embed the 5th of November forever in our memory. His hope was to remind the world that fairness, justice, and freedom are more than words - they are perspectives. So if you've seen nothing, if the crimes of this government remain unknown to you, then I would suggest you allow the 5th of November to pass unmarked.

But if you see what I see, if you feel as I feel, and if you would seek as I seek, then I ask you to stand beside me one year from tonight, outside the gates of Parliament, and together we shall give them a 5th of November that shall never, ever be forgot.

In [2]:
part1 = """Good evening, London. Allow me first to apologize for this interruption. I do, like many of you, appreciate the comforts of every day routine - the security of the familiar, the tranquility of repetition. I enjoy them as much as any bloke. But in the spirit of commemoration, thereby those important events of the past usually associated with someone's death or the end of some awful bloody struggle, a celebration of a nice holiday, I thought we could mark this November the 5th, a day that is sadly no longer remembered, by taking some time out of our daily lives to sit down and have a little chat.

There are of course those who do not want us to speak. I suspect even now, orders are being shouted into telephones, and men with guns will soon be on their way. Why? Because while the truncheon may be used in lieu of conversation, words will always retain their power. Words offer the means to meaning, and for those who will listen, the enunciation of truth. And the truth is, there is something terribly wrong with this country, isn't there?

Cruelty and injustice, intolerance and oppression. And where once you had the freedom to object, to think and speak as you saw fit, you now have censors and systems of surveillance coercing your conformity and soliciting your submission. How did this happen? Who's to blame? Well certainly there are those who are more responsible than others, and they will be held accountable, but again truth be told, if you're looking for the guilty, you need only look into a mirror."""

part2 = """I know why you did it. I know you were afraid. Who wouldn't be? War, terror, disease. There were a myriad of problems which conspired to corrupt your reason and rob you of your common sense. Fear got the best of you, and in your panic you turned to the now high chancellor, Adam Sutler. He promised you order, he promised you peace, and all he demanded in return was your silent, obedient consent. Last night, I sought to end that silence. Last night, I destroyed the Old Bailey, to remind this country of what it has forgotten. More than four hundred years ago, a great citizen wished to embed the 5th of November forever in our memory. His hope was to remind the world that fairness, justice, and freedom are more than words - they are perspectives. So if you've seen nothing, if the crimes of this government remain unknown to you, then I would suggest you allow the 5th of November to pass unmarked.

But if you see what I see, if you feel as I feel, and if you would seek as I seek, then I ask you to stand beside me one year from tonight, outside the gates of Parliament, and together we shall give them a 5th of November that shall never, ever be forgot."""

# Prepare the packages to analyze data

In [3]:
# Import packages and modules
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
# Create a dataframe
X_train = pd.DataFrame([part1, part2], columns=['speech'])

Next, let’s define a text preprocessing function to pass on to TfidfVectorizer:

In [5]:
def preprocess_text(text):
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)
    
    # Lowercase and lemmatise 
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
    
    # Remove stopwords (load the list once into a set for fast lookups)
    stop_words = set(stopwords.words('english'))
    keywords = [lemma for lemma in lemmas if lemma not in stop_words]
    return keywords

Lastly, let’s preprocess the text data by leveraging the function defined earlier:

In [6]:
# Create an instance of TfidfVectorizer
vectoriser = TfidfVectorizer(analyzer=preprocess_text)

# Fit to the data and transform to feature matrix
X_train = vectoriser.fit_transform(X_train['speech'])

# Convert sparse matrix to dataframe
X_train = pd.DataFrame.sparse.from_spmatrix(X_train)

# Save mapping on which index refers to which words
col_map = {v:k for k, v in vectoriser.vocabulary_.items()}

# Rename each column using the mapping
for col in X_train.columns:
    X_train.rename(columns={col: col_map[col]}, inplace=True)
    
X_train

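| | 5th | accountable | adam | afraid | ago | allow | always | apologize | appreciate | ask | ... | war | way | well | wish | word | world | would | wrong | year | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.061335 | 0.086204 | 0.0 | 0.0 | 0.0 | 0.061335 | 0.086204 | 0.086204 | 0.086204 | 0.0 | ... | 0.0 | 0.086204 | 0.086204 | 0.0 | 0.12267 | 0.0 | 0.0 | 0.086204 | 0.0 | 0.0 |
| 1 | 0.184983 | 0.0 | 0.086662 | 0.086662 | 0.086662 | 0.061661 | 0.0 | 0.0 | 0.0 | 0.086662 | ... | 0.086662 | 0.0 | 0.0 | 0.086662 | 0.061661 | 0.086662 | 0.173324 | 0.0 | 0.086662 | 0.086662 |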


We have now preprocessed the text into a feature matrix. In the next section, let’s break this down and work through the 5 steps mentioned at the beginning, with examples.

# Code Explanation

#### 1. Tokenise

**Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.**
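
As a quick illustration of this definition, the sketch below tokenises a short sample with the standard library's `re` module. The sample string and the helper name `tokenise` are assumptions made for this example; the pattern `\w+` is the same one used with RegexpTokenizer in this post, so the result should look much the same.

```python
import re

def tokenise(text):
    """Return runs of word characters, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)

# Hypothetical sample stitched together from phrases in the speech
sample = "Good evening, London. Who's to blame?"
tokens = tokenise(sample)
print(tokens)
```

Note that the apostrophe is treated as punctuation, so a contraction such as "Who's" is split into two tokens.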


In this step, we will convert the string part1 into a list of tokens while discarding punctuation. There are many ways to accomplish this task. I will show you one way, using RegexpTokenizer from nltk:

In [7]:
# Import module
from nltk.tokenize import RegexpTokenizer

# Create an instance of RegexpTokenizer for alphanumeric tokens
tokeniser = RegexpTokenizer(r'\w+')

# Tokenise 'part1' string
tokens = tokeniser.tokenize(part1)
print(tokens)

['Good', 'evening', 'London', 'Allow', 'me', 'first', 'to', 'apologize', 'for', 'this', 'interruption', 'I', 'do', 'like', 'many', 'of', 'you', 'appreciate', 'the', 'comforts', 'of', 'every', 'day', 'routine', 'the', 'security', 'of', 'the', 'familiar', 'the', 'tranquility', 'of', 'repetition', 'I', 'enjoy', 'them', 'as', 'much', 'as', 'any', 'bloke', 'But', 'in', 'the', 'spirit', 'of', 'commemoration', 'thereby', 'those', 'important', 'events', 'of', 'the', 'past', 'usually', 'associated', 'with', 'someone', 's', 'death', 'or', 'the', 'end', 'of', 'some', 'awful', 'bloody', 'struggle', 'a', 'celebration', 'of', 'a', 'nice', 'holiday', 'I', 'thought', 'we', 'could', 'mark', 'this', 'November', 'the', '5th', 'a', 'day', 'that', 'is', 'sadly', 'no', 'longer', 'remembered', 'by', 'taking', 'some', 'time', 'out', 'of', 'our', 'daily', 'lives', 'to', 'sit', 'down', 'and', 'have', 'a', 'little', 'chat', 'There', 'are', 'of', 'course', 'those', 'who', 'do', 'not', 'want', 'us', 'to', 'speak',

We see that each word is now a separate string. Do you notice how there are variations of the same word? 

For instance, words can differ in their case (‘and’ and ‘And’) or in their suffix (‘share’, ‘shared’ and ‘sharing’). This is where normalisation comes in: it standardises such variations.

#### 2. Normalise

**To normalise a word is to transform it into its root form.**

Stemming and lemmatisation are popular ways to normalise text. In this step, we will use lemmatisation to transform words to their dictionary form as well as remove case distinction by converting all words to lowercase.
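
To make the difference between the two approaches concrete, here is a deliberately simplified sketch. Both helpers are toy stand-ins written for this example: `toy_stem` only strips a few suffixes and `toy_lemmatise` looks words up in a tiny hand-made table, so neither reflects how nltk's stemmers or WordNetLemmatizer are actually implemented.

```python
def toy_stem(word):
    """Crude stemming: chop off a known suffix without checking the result."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary
TOY_LEMMAS = {"thought": "think", "taking": "take", "remembered": "remember"}

def toy_lemmatise(word):
    """Crude lemmatisation: return the dictionary form when it is known."""
    return TOY_LEMMAS.get(word, word)

words = ["Thought", "taking", "remembered", "comforts"]
stems = [toy_stem(word.lower()) for word in words]
lemmas = [toy_lemmatise(word.lower()) for word in words]

print(stems)
print(lemmas)
```

The stemmer produces fragments such as ‘tak’ that are not real words, whereas the lookup returns proper dictionary forms but only for words it knows about.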

We will use WordNetLemmatizer from nltk to lemmatise our tokens:

In [8]:
# Import module
from nltk.stem import WordNetLemmatizer

# Create an instance of WordNetLemmatizer
lemmatiser = WordNetLemmatizer()

# Lowercase and lemmatise tokens
lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
print(lemmas)

['good', 'even', 'london', 'allow', 'me', 'first', 'to', 'apologize', 'for', 'this', 'interruption', 'i', 'do', 'like', 'many', 'of', 'you', 'appreciate', 'the', 'comfort', 'of', 'every', 'day', 'routine', 'the', 'security', 'of', 'the', 'familiar', 'the', 'tranquility', 'of', 'repetition', 'i', 'enjoy', 'them', 'as', 'much', 'as', 'any', 'bloke', 'but', 'in', 'the', 'spirit', 'of', 'commemoration', 'thereby', 'those', 'important', 'events', 'of', 'the', 'past', 'usually', 'associate', 'with', 'someone', 's', 'death', 'or', 'the', 'end', 'of', 'some', 'awful', 'bloody', 'struggle', 'a', 'celebration', 'of', 'a', 'nice', 'holiday', 'i', 'think', 'we', 'could', 'mark', 'this', 'november', 'the', '5th', 'a', 'day', 'that', 'be', 'sadly', 'no', 'longer', 'remember', 'by', 'take', 'some', 'time', 'out', 'of', 'our', 'daily', 'live', 'to', 'sit', 'down', 'and', 'have', 'a', 'little', 'chat', 'there', 'be', 'of', 'course', 'those', 'who', 'do', 'not', 'want', 'us', 'to', 'speak', 'i', 'suspec

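The words are now transformed to their dictionary form. For instance, ‘thought’ became ‘think’, ‘remembered’ became ‘remember’ and ‘taking’ became ‘take’. Because we passed pos='v', every token is lemmatised as a verb, which is why ‘evening’ became ‘even’ while the noun ‘events’ was left unchanged.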

In [9]:
# Check how many words we have
len(lemmas)

271

We have 271 words but not all words carry the same level of contribution to the meaning of the text. In other words, there are some words that are not particularly useful to the key message. This is where stop words come in.

#### 3. Removing stop words

**Stop words are common words which provide little to no value to the meaning of the text.**

Think about this: If you had to describe yourself in three words as elaborately as possible, would you include ‘I’, or ‘am’? If I asked you to underline keywords in V’s speech, would you underline ‘a’ or ‘the’? Probably not. The ‘I’, ‘am’, ‘a’ and ‘the’ are examples of stop words. I think you get the idea.

Different sets of stop words may be necessary depending on the domain that the text is related to. In this step, we will leverage nltk’s stopwords corpus. You could define your own set of stop words or enrich standard stop words by adding common terms that are appropriate to the domain of the text.
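
As a rough sketch of enriching a standard list with domain-specific terms, the example below combines two small sets. Both sets are invented for illustration: `base_stop_words` is a tiny stand-in for a standard list (not nltk's), and `domain_stop_words` assumes a corpus where ‘november’ and ‘5th’ appear in nearly every text and so add little meaning.

```python
# A tiny stand-in for a standard stop word list (not nltk's full list)
base_stop_words = {"i", "me", "the", "a", "of", "to"}

# Hypothetical domain-specific additions
domain_stop_words = {"november", "5th"}

# Set union gives the enriched list; sets also make membership checks fast
stop_words = base_stop_words | domain_stop_words

lemmas = ["i", "destroy", "the", "old", "bailey", "november", "5th"]
keywords = [lemma for lemma in lemmas if lemma not in stop_words]
print(keywords)
```

Whether a term belongs in the domain list is a judgment call that depends on the corpus and the task.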

Let’s first familiarise ourselves with stopwords a little bit more:

In [10]:
# Import module
from nltk.corpus import stopwords

# Check out how many stop words there are 
print(len(stopwords.words('english')))

# See first 5 stop words
stopwords.words('english')[:5]

179


['i', 'me', 'my', 'myself', 'we']

At the time of writing this post, there are 179 English stop words in nltk’s stopwords corpus.

Some examples include: ‘i’, ‘me’, ‘my’, ‘myself’, ‘we’. If you are curious to see the full list, simply remove [:5] from the last line of code.

Notice how these stop words are in lowercase? To effectively remove stop words, we have to ensure that all words are in lowercase. Here, we have already done so in step two.

Using a list comprehension, let’s remove all stop words from our list:

In [11]:
keywords = [lemma for lemma in lemmas if lemma not in stopwords.words('english')]
print(keywords)

['good', 'even', 'london', 'allow', 'first', 'apologize', 'interruption', 'like', 'many', 'appreciate', 'comfort', 'every', 'day', 'routine', 'security', 'familiar', 'tranquility', 'repetition', 'enjoy', 'much', 'bloke', 'spirit', 'commemoration', 'thereby', 'important', 'events', 'past', 'usually', 'associate', 'someone', 'death', 'end', 'awful', 'bloody', 'struggle', 'celebration', 'nice', 'holiday', 'think', 'could', 'mark', 'november', '5th', 'day', 'sadly', 'longer', 'remember', 'take', 'time', 'daily', 'live', 'sit', 'little', 'chat', 'course', 'want', 'us', 'speak', 'suspect', 'even', 'order', 'shout', 'telephone', 'men', 'gun', 'soon', 'way', 'truncheon', 'may', 'use', 'lieu', 'conversation', 'word', 'always', 'retain', 'power', 'word', 'offer', 'mean', 'mean', 'listen', 'enunciation', 'truth', 'truth', 'something', 'terribly', 'wrong', 'country', 'cruelty', 'injustice', 'intolerance', 'oppression', 'freedom', 'object', 'think', 'speak', 'saw', 'fit', 'censor', 'systems', 'surv

In [12]:
# Check how many words we have
len(keywords)

120

After removing stop words, we have only 120 words as opposed to 271, yet the gist is still preserved.

Now, if you scroll back up to the preprocess_text function defined earlier, you will see that it captures the transformation process shown in steps 1 to 3.

#### 4. Count vectorise

**To count vectorise is to convert a collection of text documents to a matrix of token counts.**

Now let’s look at counts of each word in keywords from step 3:

In [13]:
{word: keywords.count(word) for word in set(keywords)}

{'sit': 1,
 'celebration': 1,
 'listen': 1,
 'suspect': 1,
 'use': 1,
 'remember': 1,
 'telephone': 1,
 'way': 1,
 'injustice': 1,
 'sadly': 1,
 'men': 1,
 'terribly': 1,
 'hold': 1,
 'bloody': 1,
 'security': 1,
 'want': 1,
 'past': 1,
 'conversation': 1,
 'well': 1,
 'first': 1,
 'mark': 1,
 'november': 1,
 'coerce': 1,
 'interruption': 1,
 'shout': 1,
 'holiday': 1,
 'commemoration': 1,
 'responsible': 1,
 'someone': 1,
 'apologize': 1,
 'struggle': 1,
 'routine': 1,
 'take': 1,
 'intolerance': 1,
 'awful': 1,
 'even': 2,
 'object': 1,
 'good': 1,
 'every': 1,
 'conformity': 1,
 'censor': 1,
 'comfort': 1,
 'need': 1,
 'like': 1,
 'death': 1,
 'associate': 1,
 'london': 1,
 'repetition': 1,
 'could': 1,
 'end': 1,
 'many': 1,
 'mean': 2,
 'much': 1,
 'certainly': 1,
 'look': 2,
 'may': 1,
 'longer': 1,
 'happen': 1,
 'soon': 1,
 'saw': 1,
 'day': 2,
 'enunciation': 1,
 'events': 1,
 'oppression': 1,
 'something': 1,
 'wrong': 1,
 'word': 2,
 'surveillance': 1,
 'think': 2,
 'order':

The word ‘even’ occurs twice whereas ‘sit’ was mentioned once.

This is essentially what CountVectorizer does to all records. CountVectorizer transforms text into a matrix of m by n where m is the number of text records, n is the number of unique tokens across all records and the elements of the matrix refer to the tally of a token for a given record.
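
The m by n matrix described above can be sketched in plain Python. The two records below are hypothetical, already-preprocessed keyword lists, and sorting the vocabulary is simply a choice made here to get a stable column order; this is an illustration of the idea, not CountVectorizer's implementation.

```python
from collections import Counter

# Two hypothetical, already-preprocessed records (lists of keywords)
records = [
    ["word", "truth", "word", "country"],
    ["country", "freedom", "word"],
]

# n unique tokens across all records, sorted for a stable column order
vocabulary = sorted({token for record in records for token in record})

# m-by-n matrix: one row per record, one column per unique token
counts = [Counter(record) for record in records]
matrix = [[count[token] for token in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Here m is 2 and n is 4, and each element is the tally of one token in one record, with 0 where the token does not occur.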

In this step, we will convert our text dataframe to a count matrix. We will pass our custom preprocessing function to CountVectorizer:

In [14]:
# Import module
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer
vectoriser = CountVectorizer(analyzer=preprocess_text)

In [15]:
# Create a dataframe
X_train = pd.DataFrame([part1, part2], columns=['speech'])

In [16]:
# Fit to the data and transform to feature matrix
X_train = vectoriser.fit_transform(X_train['speech'])

The output feature matrix will be in sparse matrix form. Let’s convert it to a dataframe with proper column names to make it more readable:

In [17]:
# Convert sparse matrix to dataframe
X_train = pd.DataFrame.sparse.from_spmatrix(X_train)

# Save mapping on which index refers to which terms
col_map = {v:k for k, v in vectoriser.vocabulary_.items()}

# Rename each column using the mapping
for col in X_train.columns:
    X_train.rename(columns={col: col_map[col]}, inplace=True)
X_train

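| | 5th | accountable | adam | afraid | ago | allow | always | apologize | appreciate | ask | ... | war | way | well | wish | word | world | would | wrong | year | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 0 |
| 1 | 3 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 1 | 1 |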


Once we convert it to a dataframe, the columns are just indices (i.e. numbers from 0 to n-1) instead of the actual words. Therefore, we need to rename the columns to make them easier to interpret.

When the vectoriser is fit to the data, we can find out the index mapping to words from vectoriser.vocabulary_. This index mapping is formatted as {word:index}. To rename columns, we must switch the key-value pairs to {index:word}. This is done in the second line of code and saved in col_map.

Using the for loop at the end of the code, we rename each column using the mapping, and the output should look like the table above (showing only partial output due to space limitations).
From this matrix, we can see that ‘5th’ has been mentioned once in part1 (row index=0) and 3 times in part2 (row index=1).
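
The key-value swap described above can be tried in isolation. The mapping below is a hypothetical example in the same {word:index} shape as vectoriser.vocabulary_, not the real vocabulary from our data.

```python
# Hypothetical {word: index} mapping, in the same shape as vocabulary_
vocabulary = {"country": 0, "freedom": 1, "truth": 2, "word": 3}

# Swap keys and values to get {index: word}
col_map = {index: word for word, index in vocabulary.items()}

# Look up the column names in index order
column_names = [col_map[index] for index in range(len(col_map))]
print(column_names)
```

With the names collected in index order like this, assigning the whole list to the dataframe's columns in one step would be a possible alternative to renaming column by column.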

In our example, we only have 2 records each consisting of only a handful of sentences, so the count matrix is pretty small and its sparsity is not as high. Sparsity refers to the proportion of zero elements among all elements in a matrix. When you are working on real data with hundreds, thousands or even millions of records each represented by rich text, the count matrix is likely to be extremely large and contain mostly 0s. 

In those instances, using a sparse format saves memory and speeds up further processing. As a result, when preprocessing text in real life, you would often keep the sparse matrix rather than convert it to a dataframe as we did here for illustration.
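
Sparsity as defined above is simple to compute. The helper `sparsity` and the small count matrix below are assumptions made for this sketch; they work on a plain list of lists rather than on scikit-learn's sparse output.

```python
def sparsity(matrix):
    """Proportion of zero elements among all elements of a 2-D list."""
    total = sum(len(row) for row in matrix)
    zeros = sum(value == 0 for row in matrix for value in row)
    return zeros / total

# Hypothetical count matrix with 2 records and 4 unique tokens
count_matrix = [
    [1, 0, 1, 2],
    [1, 1, 0, 1],
]

# 2 zeros out of 8 elements
print(sparsity(count_matrix))
```

On a tiny matrix like this the value is low, whereas a large real-world vocabulary tends to push it close to 1.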

#### 5. Transform to TF-IDF Representation

**tf-idf stands for term frequency-inverse document frequency.**

When transforming to the tf-idf representation, we convert the counts to weighted frequencies. A weight called inverse document frequency gives more significance to words that appear in fewer records and less to words that appear in many.

In [18]:
# Import module
from sklearn.feature_extraction.text import TfidfTransformer

# Create an instance of TfidfTransformer
transformer = TfidfTransformer()

# Fit to the data and transform to tf-idf
X_train = pd.DataFrame(transformer.fit_transform(X_train).toarray(), columns=X_train.columns)
X_train

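| | 5th | accountable | adam | afraid | ago | allow | always | apologize | appreciate | ask | ... | war | way | well | wish | word | world | would | wrong | year | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.061335 | 0.086204 | 0.0 | 0.0 | 0.0 | 0.061335 | 0.086204 | 0.086204 | 0.086204 | 0.0 | ... | 0.0 | 0.086204 | 0.086204 | 0.0 | 0.12267 | 0.0 | 0.0 | 0.086204 | 0.0 | 0.0 |
| 1 | 0.184983 | 0.0 | 0.086662 | 0.086662 | 0.086662 | 0.061661 | 0.0 | 0.0 | 0.0 | 0.086662 | ... | 0.086662 | 0.0 | 0.0 | 0.086662 | 0.061661 | 0.086662 | 0.173324 | 0.0 | 0.086662 | 0.086662 |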
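
To get a feel for where these numbers come from, here is a hand-rolled sketch of the weighting. It assumes the default settings of TfidfTransformer, namely a smoothed idf of ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row. The function name `tfidf` and the small count matrix are made up for this example.

```python
import math

def tfidf(count_matrix):
    """Smoothed idf weighting followed by L2 row normalisation."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])

    # Document frequency: number of records in which each term appears
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]

    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    result = []
    for row in count_matrix:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(value ** 2 for value in weighted))
        result.append([value / norm if norm else 0.0 for value in weighted])
    return result

# Hypothetical counts for 2 records and 4 terms
counts = [
    [1, 0, 1, 2],
    [1, 1, 0, 1],
]
weights = tfidf(counts)
for row in weights:
    print([round(value, 4) for value in row])
```

In the first row, the first and third terms both occur once, but the third appears in only one record, so its weight is larger by a factor of 1 + ln(3/2), roughly 1.405. The same ratio shows up in the table above between ‘accountable’ (0.086204) and ‘allow’ (0.061335) in row 0.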
