# Preprocessing Text

Preprocessing is one of the most important steps in text analysis. It involves preparing the raw text so that it is machine-readable. For example, one can separate an article into paragraphs and split the sentences of each paragraph into lists of words. One can also cut off the endings of words to keep only the 'stem' or 'root' of each word. In this notebook, we will walk through some of the main tools for preprocessing text.

## Tokenization

Tokenization is one of the first steps in preprocessing. It involves splitting a document into tokens. Often, this is simply converting a sentence or paragraph (however we define a document) into a list of words. We can do this by applying the <b>split()</b> method to a sentence.

In [93]:
sent = "The Federal Reserve targeting of a 2% inflation led to better price stability!!!"

tokens = sent.split()
print(tokens)

['The', 'Federal', 'Reserve', 'targeting', 'of', 'a', '2%', 'inflation', 'led', 'to', 'better', 'price', 'stability!!!']


Tokenization is relatively straightforward. One should consider whether hyphenated terms should be split into multiple tokens. 
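As a minimal sketch of the hyphenation choice, the following uses Python's built-in `re` module instead of `split()`. The example sentence and both patterns are illustrative assumptions, not part of the original notebook; the patterns also drop trailing punctuation while keeping a percent sign attached to a number.

```python
import re

# Illustrative sentence (an assumption, not from the notebook's data)
text = "The long-term inflation target is 2%."

# Option 1: keep hyphenated terms together as a single token
keep_hyphens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*%?", text)

# Option 2: split hyphenated terms into separate tokens
split_hyphens = re.findall(r"[A-Za-z0-9]+%?", text)

print(keep_hyphens)
print(split_hyphens)
```

Which option is preferable depends on the analysis: keeping "long-term" intact preserves its meaning, while splitting it lets "term" be counted together with other uses of that word.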

It is also useful, when thinking of tokenization, to decide what counts as a document. Do we want to analyze individual sentences, paragraphs, or some other unit? We will discuss this further when we apply document-level analyses such as topic modelling or word embeddings. 
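To make the choice of document unit concrete, here is a rough sketch that splits a short text into paragraphs and then into sentences. The sample text and the sentence-splitting rule (break after ".", "!" or "?" followed by whitespace) are simplifying assumptions; the rule will mis-split abbreviations such as "U.S.".

```python
import re

# Illustrative two-paragraph article (an assumption for this sketch)
article = (
    "Inflation rose last year. The Fed responded quickly.\n\n"
    "Prices later stabilized. Markets calmed down!"
)

# Paragraphs are separated by a blank line
paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]

# Naive sentence split: break after ., ! or ? when followed by whitespace
sentences_by_paragraph = [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]

print(paragraphs)
print(sentences_by_paragraph)
```

Either list can then serve as the set of "documents" in later steps.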

## Normalization

Normalizing text is the process of converting a text into a more uniform sequence.  

### Stemming

The first normalization technique we will go through is stemming, which reduces a word to its root stem. For example, the words cook, cooked, cooking, cooks could all be transformed to simply <b>cook</b>. Two commonly used stemmers are the Porter stemmer and the Lancaster stemmer. Let's implement the Porter stemmer in Python.

In [84]:
from nltk.stem.porter import PorterStemmer

NLTK, the Natural Language Toolkit, is a library originally developed by researchers at UPenn. It contains a whole suite of useful preprocessing modules, one of which is the PorterStemmer. We will apply it to the sentence <b>sent</b>, "The Federal Reserve targeting of a 2% inflation led to better price stability!!!". We have already split the sentence using the <b>split</b> method, so each element of the list <b>tokens</b> is a single token.

First, we need to create a new Porter stemmer. Then, we will use a <a href=https://www.pythonforbeginners.com/basics/list-comprehensions-in-python>list comprehension</a> to apply the stemmer to every word in the sentence. 

In [85]:
stemmer = PorterStemmer()

# Example stem
print(stemmer.stem("cooking"))
print(stemmer.stem("cooked"))
print(stemmer.stem("cooks"))

cook
cook
cook


Now, let's apply it to each token using the list comprehension.

In [86]:
tokens_stemmed = [stemmer.stem(el) for el in tokens]
print("Original List")
print(tokens)
print("Stemmed List")
print(tokens_stemmed)

Original List
['The', 'Federal', 'Reserve', 'targeting', 'of', 'a', '2%', 'inflation', 'led', 'to', 'better', 'price', 'stability!!!']
Stemmed List
['the', 'feder', 'reserv', 'target', 'of', 'a', '2%', 'inflat', 'led', 'to', 'better', 'price', 'stability!!!']
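
To build intuition for what a stemmer does, here is a toy suffix stripper. It is deliberately simplistic and is not the Porter algorithm; the function name `naive_stem`, the suffix list and the minimum stem length are all assumptions made for this sketch.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer: lowercase the word and strip one common suffix.

    This is NOT the Porter algorithm, only an illustration of suffix stripping.
    """
    word = word.lower()
    for suffix in suffixes:
        # Only strip if at least three characters remain afterwards
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["cook", "cooked", "cooking", "cooks"]]
print(stems)
```

Because such rules only look at the spelling of a word, stems do not have to be real words, which is why the Porter stemmer produces tokens such as 'feder' and 'reserv'.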


### Lemmatizing

Lemmatizing is similar to stemming. The main difference is that lemmatizing uses a vocabulary and the morphology of a word to transform an inflected form back to its base (dictionary) form, the lemma. For example, the word "ran" is just a different form of the word "run", so we could group these words together as just "run".  

Similar to the Porter stemmer, we will use a lemmatizer from the nltk.stem module. After initializing the lemmatizer, we can apply it to individual words. The WordNetLemmatizer().lemmatize method accepts a second argument, the part of speech. It defaults to noun ("n"), but we can also pass adjective ("a"), adverb ("r") or verb ("v").

In [87]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("Stemming")
print(stemmer.stem("ran"))
print("Lemmatizing")
print(lemmatizer.lemmatize("ran", pos="v"))

Stemming
ran
Lemmatizing
run


In order to lemmatize properly, we should identify whether each word is a noun, adverb, verb or adjective. We can do this by running the part-of-speech tagging function (the <b>pos_tag</b> function imported with from nltk import pos_tag) on the list of tokens. This returns a list of tuples in which the second element of each tuple is the part of speech. Since the WordNet lemmatizer expects the part of speech to be "a", "v", "n" or "r", and the pos_tag function uses the <a href=https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html>Penn Treebank Part-of-Speech Classification</a>, we will write a short function which does this conversion.

In [88]:
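import nltk
from nltk import pos_tag
# wordnet provides the constants ADJ, VERB, ADV and NOUN used below
from nltk.corpus import wordnet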

def get_wordnet_pos(tag):
    # Return whether "tag" is adjective, 
    # noun, verb or adverb
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

print(pos_tag(["targeting","led"]))

[('targeting', 'VBG'), ('led', 'VBD')]


We will use a list comprehension on this list of tuples. The first element of each tuple is just the word, while the second element is the tag that we pass to the get_wordnet_pos function, which returns "a", "n", "r" or "v". 

In [89]:
# Get part-of-speech tags
tokens_pos = pos_tag(tokens)
# Use the list of (word, tag) tuples to lemmatize
tokens_lemmatized = [lemmatizer.lemmatize(el[0], get_wordnet_pos(el[1])) for el in tokens_pos]
print("Original List")
print(tokens)
print("Lemmatized List")
print(tokens_lemmatized)

Original List
['The', 'Federal', 'Reserve', 'targeting', 'of', 'a', '2%', 'inflation', 'led', 'to', 'better', 'price', 'stability!!!']
Lemmatized List
['The', 'Federal', 'Reserve', 'target', 'of', 'a', '2%', 'inflation', 'lead', 'to', 'well', 'price', 'stability!!!']


### Case-sensitivity

To help with normalization, it is also useful to lowercase sentences before we split them up into tokens. This is easily done with just the "lower()" method.

In [90]:
print("Original Sentence")
print(sent)
print("Lower-case sentence")
print(sent.lower())

Original Sentence
The Federal Reserve targeting of a 2% inflation led to better price stability!!!
Lower-case sentence
the federal reserve targeting of a 2% inflation led to better price stability!!!


### Punctuation

Most analyses are only concerned with the text itself and not so much with punctuation or numbers. We can replace anything which is not a letter using regular expressions (regex).

In [91]:
import re 
print("Original Sentence")
print(sent)
print("Remove Punctuation/Numbers Sentence")
print(re.sub("[^A-Za-z]"," ",sent))

Original Sentence
The Federal Reserve targeting of a 2% inflation led to better price stability!!!
Remove Punctuation/Numbers Sentence
The Federal Reserve targeting of a    inflation led to better price stability   


The regex pattern "[A-Za-z]" identifies anything which is an upper or lower case letter. By placing the "^" inside the brackets at the front, we match anything which is NOT a letter. Hence, the command <b>re.sub("[^A-Za-z]"," ",sent)</b> replaces every character in <b>sent</b> which is NOT a letter with a single space.
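
Because every removed character becomes a space, the result can contain runs of spaces, as in the output above. A small follow-up sketch, using only the standard `re` module, collapses them; the variable names `letters_only` and `cleaned` are chosen here for illustration.

```python
import re

sent = "The Federal Reserve targeting of a 2% inflation led to better price stability!!!"

# Step 1: replace every non-letter character with a space
letters_only = re.sub(r"[^A-Za-z]", " ", sent)

# Step 2: collapse runs of whitespace and trim both ends
cleaned = re.sub(r"\s+", " ", letters_only).strip()

print(cleaned)
```

This extra step is optional if the text is later tokenized with `split()`, since `split()` without arguments already ignores repeated whitespace.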

### Canonicalization

Canonicalization is the process of replacing any phrase which has more than one possible representation with a single standard form. For example, the Federal Reserve may be referred to as the Fed, the Board, the Fed Board, or the Fed Reserve. We can replace any instance of a predefined list of names for the Federal Reserve with a unifying representation, such as <b>Federal_Reserve</b>.

In [92]:
print(re.sub("Fed Board|Federal Reserve|the Board","Federal_Reserve",sent))

The Federal_Reserve targeting of a 2% inflation led to better price stability!!!
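
When the list of aliases grows, the order of the alternatives starts to matter: a short alias such as "Fed" can match first and leave the rest of a longer alias behind. The sketch below, with a hypothetical alias list and helper name `canonicalize`, sorts the aliases longest first and adds word boundaries so that "Fed" does not match inside other words.

```python
import re

# Hypothetical alias list for this sketch; extend it to suit the corpus
aliases = ["Federal Reserve", "Fed Reserve", "Fed Board", "Fed"]

# Longest aliases first, so "Fed Board" is tried before the shorter "Fed"
alternatives = "|".join(re.escape(a) for a in sorted(aliases, key=len, reverse=True))
pattern = re.compile(r"\b(?:" + alternatives + r")\b")

def canonicalize(text):
    """Replace every alias with the single token Federal_Reserve."""
    return pattern.sub("Federal_Reserve", text)

example = canonicalize("The Fed Board met today. Later the Fed held rates.")
print(example)
```

Ambiguous aliases such as "the Board" are left out here on purpose, since they may refer to other organizations depending on the corpus.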


## Noise Removal

The last step we will learn for preprocessing is noise removal. 

One of the most common ways to remove noise is to remove stop words from your text. These are words which are highly common across documents, such as the words "the", "a", "and", "yours", etc. They make up a large percentage of text, but provide little information.

A common approach to removing stop words is via the nltk library. We will import the English stop words, save them as a variable <b>sw</b>, and check whether each word in tokens (which is a list of words) is a stop word. We will save the words which are not stop words in a new list.

In [98]:
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))

# The NLTK stop words are all lowercase, so compare each token in lowercase
tokens_excl_sw = [word for word in tokens if word.lower() not in sw]
print(tokens)
print(tokens_excl_sw)

['The', 'Federal', 'Reserve', 'targeting', 'of', 'a', '2%', 'inflation', 'led', 'to', 'better', 'price', 'stability!!!']
['Federal', 'Reserve', 'targeting', '2%', 'inflation', 'led', 'better', 'price', 'stability!!!']
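
Stop word lists are typically lowercase, so the result depends on how the tokens were normalized beforehand. The sketch below chains the earlier steps (lowercasing, removing non-letters, tokenizing) before filtering. To stay self-contained it uses a tiny hand-picked stop word set, which is an assumption for illustration and not the NLTK list; the function name `preprocess` is also hypothetical.

```python
import re

# Tiny illustrative stop word set (an assumption; NLTK's English list is much longer)
toy_stop_words = {"the", "of", "a", "to", "and"}

def preprocess(text, stop_words):
    """Lowercase, drop non-letters, tokenize on whitespace, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)
    return [tok for tok in text.split() if tok not in stop_words]

sent = "The Federal Reserve targeting of a 2% inflation led to better price stability!!!"
clean_tokens = preprocess(sent, toy_stop_words)
print(clean_tokens)
```

Swapping `toy_stop_words` for the NLTK set <b>sw</b> defined above should work the same way.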


### Other Noise

Headers and footers, boilerplate text, or even redundant HTML/XML are all examples of noise which we should be cognizant of when preprocessing documents. This type of text should be removed. We can find any sentence, paragraph or phrase which contains such noisy text and replace it with a blank space, for example. The best approach to identify noisy text is typically regex. This step of preprocessing requires heavy inspection of the individual documents on the researcher's part. 
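
As a rough sketch of this kind of cleanup, the following removes a page footer and HTML tags with regex. The sample string and both patterns are assumptions made for illustration; real documents usually need patterns tailored to them after inspection, and a proper HTML parser is more robust for complex markup.

```python
import re

# Illustrative raw text with HTML tags and a page footer (assumed example)
raw = "<p>The Fed raised rates.</p>\n<div>Page 3 of 10</div>\n<p>Markets fell.</p>"

# Remove a boilerplate footer of the form "Page X of Y"
no_footer = re.sub(r"Page \d+ of \d+", " ", raw)

# Remove anything that looks like an HTML/XML tag
no_tags = re.sub(r"<[^>]+>", " ", no_footer)

# Collapse the leftover whitespace
cleaned_doc = re.sub(r"\s+", " ", no_tags).strip()

print(cleaned_doc)
```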

## Document-Term Matrix

In order to convert a set of sentences into a document term matrix, we will use the functions in the module <b>sklearn.feature_extraction.text</b>. 

For example, suppose we have the sentences:
    
    1. this is the first sentence
    2. here is the second sentence
    3. that would make this sentence the third sentence
    
We can summarize all of this information in a document-term matrix. Each row of this matrix is a sentence (for example, <i>this is the first sentence</i> will be its own row), and the columns are the vocabulary, that is, the unique words which compose the corpus: 'first', 'here', 'is', 'make', 'second', 'sentence', 'that', 'the', 'third', 'this', 'would'.

Let's see what this looks like in action. First we will need to import the <b>CountVectorizer</b> from the module. 

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

Now, let's save the sentences in a Python list.

In [13]:
sentences = ["this is the first sentence",
              "here is the second sentence",
              "that would make this sentence the third sentence"]

Now, we will generate a CountVectorizer instance. Think of this variable as a "machine". It starts off with no knowledge of any sentence. We then feed it the sentences we created above and store the result in a variable <b>X</b>. The resulting <b>X</b> is the object which contains the document-term matrix. 

In [14]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(X.toarray())

[[1 0 1 0 0 1 0 1 0 1 0]
 [0 1 1 0 1 1 0 1 0 0 0]
 [0 0 0 1 0 2 1 1 1 1 1]]


Each row represents a sentence. Each entry counts how many times the word belonging to that column shows up in that row's sentence, so a one means the word appears once and a zero means it does not appear. But what are the column names? We can get them from the vectorizer we created above with the following command:

In [15]:
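# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print(vectorizer.get_feature_names_out().tolist())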

['first', 'here', 'is', 'make', 'second', 'sentence', 'that', 'the', 'third', 'this', 'would']


It might be easier to save all this information in a Pandas dataframe, so that we can easily visualize this matrix. We will use Pandas since this module is great for saving datasets with column names. When we initiate a dataframe in Pandas, we should input the column names (through the <b>columns</b> argument) as well as the actual data (through the appropriately named <b>data</b> argument). 

In [16]:
import pandas as pd

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
sentences_df = pd.DataFrame(columns=vectorizer.get_feature_names_out(),
                            data=X.toarray())
sentences_df

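|   | first | here | is | make | second | sentence | that | the | third | this | would |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 1 |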


And there we have it. The first row pertains to the sentence <i>this is the first sentence</i>, which contains the words "first", "is", "sentence","the", and "this". 

Notice that CountVectorizer gives you the <i>count</i> (frequency) of each word. In the third sentence, the word "sentence" shows up twice, so it correctly has a value of 2. 
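
To see what happens under the hood, the same matrix can be rebuilt with plain Python. This is a sketch and not how scikit-learn is implemented: it tokenizes with `lower().split()`, which gives the same tokens for these three sentences, whereas CountVectorizer by default also drops single-character tokens and punctuation.

```python
from collections import Counter

sentences = ["this is the first sentence",
             "here is the second sentence",
             "that would make this sentence the third sentence"]

# Tokenize each sentence (simple whitespace split, an assumption for this sketch)
docs = [s.lower().split() for s in sentences]

# The vocabulary is the sorted set of unique words across all documents
vocab = sorted({word for doc in docs for word in doc})

# One row per document, one column per vocabulary word
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[word] for word in vocab])

print(vocab)
for row in dtm:
    print(row)
```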

## TF-IDF Weights

What if we wanted to weight each value depending on how frequent the word is within the document and across documents? For example, the word "the" shows up in all 3 documents, so that word does not provide much information to discriminate across sentences. 

This is quite easily accomplished in Python through the <b>TfidfVectorizer</b>.

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
                
vectorizer2 = TfidfVectorizer()
Y = vectorizer2.fit_transform(sentences)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
sentences_df_tfidf = pd.DataFrame(columns=vectorizer2.get_feature_names_out(),
                                  data=Y.toarray())
sentences_df_tfidf

|   | first | here | is | make | second | sentence | that | the | third | this | would |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.591887 | 0.0 | 0.450145 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.349578 | 0.0 | 0.450145 | 0.0 |
| 1 | 0.0 | 0.55249 | 0.420183 | 0.0 | 0.55249 | 0.32631 | 0.0 | 0.32631 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.397699 | 0.0 | 0.469775 | 0.397699 | 0.234887 | 0.397699 | 0.30246 | 0.397699 |


What's the best way of interpreting this? The word "the" has a relatively low TF-IDF score in the 3rd document. Why is that? Well, it's for 2 reasons: 1) The word "the" shows up in all 3 documents, so it receives a lower weight than words which appear in fewer documents. 2) Its score is lower still when the document contains many words, because TfidfVectorizer by default rescales each row to unit length. Take document 3. There are 8 words in that sentence (7 of them distinct) and "the" only shows up once. Hence it has a lower score there than in the other two documents.
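
The two effects can be checked by hand. The sketch below recomputes the weights in plain Python, assuming scikit-learn's default settings for TfidfVectorizer: raw term counts, a smoothed inverse document frequency $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and L2 normalization of each row. The variable names are chosen for illustration.

```python
import math
from collections import Counter

sentences = ["this is the first sentence",
             "here is the second sentence",
             "that would make this sentence the third sentence"]

docs = [s.split() for s in sentences]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = {w: sum(w in doc for doc in docs) for w in vocab}

# Smoothed idf (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

tfidf_rows = []
for doc in docs:
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]      # tf * idf
    norm = math.sqrt(sum(v * v for v in raw))      # L2 length of the row
    tfidf_rows.append([v / norm for v in raw])

the_col = vocab.index("the")
the_scores = [round(row[the_col], 6) for row in tfidf_rows]
print(the_scores)
```

Before normalization "the" has the same weight (1.0) in every document, since it appears once in each and in all three documents. The differences in the table therefore come entirely from dividing by the row length, which is largest for the third document.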

## Main Takeaways
This sums up the general ideas for preprocessing. You should walk away from this notebook able to answer the following questions:

1. How do you normalize a set of documents?
2. How do you tokenize each document?
3. How do you remove noise from each document?
4. How can you summarize the documents in a document-term matrix? 
    
    4a. Using counts
    
    4b. Using TF-IDF weights