# Text Preprocessing

In [None]:
import pandas as pd
import numpy as np

from nltk.stem.snowball import EnglishStemmer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Our running example - a corpus of three documents (tweets)

In [None]:
tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

## Tokenizing, discarding stop words, count vectorization - one class does it all!

In the next lecture, we'll put this into a pipeline with a classifier.

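By default, `CountVectorizer` converts text to lowercase, treats punctuation as a token separator, and treats each run of two or more consecutive alphanumeric characters as a token. (So one-character words such as "a" and "I" are dropped.)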
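
To see these defaults in action, here is a small sketch that asks a default `CountVectorizer` for its analyzer (via `build_analyzer()`) and applies it to a made-up sentence. The sentence is only an illustration and is not part of our corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that CountVectorizer applies to each document
default_analyzer = CountVectorizer().build_analyzer()

# Made-up sentence for illustration (not from our corpus)
example_tokens = default_analyzer("I can't see a U.S. flag!")
print(example_tokens)
```

The one-character pieces ("I", "a", the "t" of "can't", and the letters of "U.S.") are dropped, because the default token pattern needs at least two alphanumeric characters.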

In [None]:
# Create the vectorizer
vectorizer = CountVectorizer(stop_words="english")

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

FYI, here's the (somewhat strange) list of stop-words that scikit-learn uses when we say `stop_words="english"`.

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)

But in the `CountVectorizer`, we can instead say `stop_words=None` (the default, which keeps every token) or we can supply our own list of stop-words.
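
As a quick sketch of both options, the cell below fits one vectorizer with `stop_words=None` and one with a custom list. The list `my_stop_words` is hypothetical and chosen only for illustration. The variable names differ from the ones used elsewhere in this notebook, so nothing gets overwritten.

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

# Hypothetical custom stop-word list, for illustration only
my_stop_words = ["the", "of", "his", "or"]

# stop_words=None keeps every token; a list removes exactly the words in it
keep_all = CountVectorizer(stop_words=None).fit(tweets)
custom = CountVectorizer(stop_words=my_stop_words).fit(tweets)

print("the" in keep_all.vocabulary_)   # kept when there is no stop-word list
print("the" in custom.vocabulary_)     # removed by our custom list
print(len(keep_all.vocabulary_), len(custom.vocabulary_))
```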

FYI, let's see the tokens that our vectorizer ends up with:

In [None]:
vectorizer.get_feature_names_out()

Suppose we wanted to do stemming. There is no stemmer in scikit-learn. But there are stemmers in `nltk`, e.g. `nltk.stem.snowball.EnglishStemmer`. It's a little tricky because we have to make sure that we remove stop-words before we stem.
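
To see why the order matters, here is a small sketch. It stems a few words, and then it collects the scikit-learn stop-words whose stem is no longer in the stop-word list. If we stemmed first, those words would slip past the stop-word filter. The name `escaped` is ours, just for this illustration.

```python
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

demo_stemmer = EnglishStemmer()

# Stemming maps inflected forms to a common stem
for word in ["loves", "learning", "hating"]:
    print(word, "->", demo_stemmer.stem(word))

# Stop-words whose stem is not itself a stop-word:
# stemming before filtering would let these through
escaped = sorted(
    w for w in ENGLISH_STOP_WORDS
    if demo_stemmer.stem(w) not in ENGLISH_STOP_WORDS
)
print(len(escaped), escaped[:10])
```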

In [None]:
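class StemmedCountVectorizer(CountVectorizer):

    def build_tokenizer(self):
        # Start from the default tokenizer, then drop stop-words *before* stemming
        tokenizer = super().build_tokenizer()
        stemmer = EnglishStemmer()
        # Look up the stop-words once; get_stop_words() returns None if stop_words=None
        stop_words = self.get_stop_words() or frozenset()
        # Return a list rather than a generator, so that this also works
        # when ngram_range goes beyond unigrams
        return lambda doc: [stemmer.stem(t) for t in tokenizer(doc) if t not in stop_words]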

Now we can use StemmedCountVectorizer instead of CountVectorizer.

In [None]:
# Create the vectorizer
vectorizer = StemmedCountVectorizer(stop_words="english")

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

We can see that the tokens are now different:

In [None]:
vectorizer.get_feature_names_out()

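We can look at the sparse matrix. Each printed entry has the form `(row, column)  value`: the row identifies the tweet (0, 1 or 2), the column is the index of the token in the list above, and the value is how many times that token occurs in that tweet.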

In [None]:
print(X)

Let's vectorize a new document.

In [None]:
new_document = "Unsurprisingly, people hate to learn that their religion loves to hate."

new_document_as_vector = vectorizer.transform([new_document])

Notice how it ignores words that weren't in the original tweets, such as "unsurprisingly".

In [None]:
print(new_document_as_vector)
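
We can also list exactly which tokens were ignored. This is a sketch that uses a plain `CountVectorizer` (no stemming) to stay self-contained, so "loves" is unknown here too, whereas the stemmed version above would map it to "love". The names `plain_vectorizer` and `unknown` are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]
new_document = "Unsurprisingly, people hate to learn that their religion loves to hate."

plain_vectorizer = CountVectorizer(stop_words="english").fit(tweets)

# Tokens of the new document (after stop-word removal) that are not in the fitted vocabulary
analyzer = plain_vectorizer.build_analyzer()
unknown = [t for t in analyzer(new_document) if t not in plain_vectorizer.vocabulary_]
print(unknown)

# The new vector has one column per token in the fitted vocabulary, no more
new_vector = plain_vectorizer.transform([new_document])
print(new_vector.shape)
```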

## Tokenizing, discarding stop-words, TF-IDF vectorization - again, one class does it all!

In [None]:
# Create the vectorizer
vectorizer = TfidfVectorizer(stop_words="english")

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

We could create a version that does stemming again, if we wanted. But we won't bother here.

Here are the tokens. They're different because we didn't stem this time.

In [None]:
vectorizer.get_feature_names_out()

Here are the tf-idf scores.

In [None]:
print(X)
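
Where do these scores come from? With its default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), scikit-learn uses the raw count as the term frequency, computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing token `t`, and then scales each row to unit Euclidean length. The sketch below reproduces the scores by hand under those assumptions and compares them with what `TfidfVectorizer` gives.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

# Raw counts: one row per tweet, one column per token
counts = CountVectorizer(stop_words="english").fit_transform(tweets).toarray()

n = counts.shape[0]                    # number of documents
df = (counts > 0).sum(axis=0)          # document frequency of each token
idf = np.log((1 + n) / (1 + df)) + 1   # smoothed idf (the default)

# tf * idf, then scale each row to unit L2 norm
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

expected = TfidfVectorizer(stop_words="english").fit_transform(tweets).toarray()
print(np.allclose(manual, expected))
```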

## Unigrams, Bigrams and Both

Up to now, our tokens have been unigrams (single words).

This is what we get if we use bigrams instead:

In [None]:
# Create the vectorizer
vectorizer = CountVectorizer(ngram_range=(2,2))

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

In [None]:
vectorizer.get_feature_names_out()

Note that we are less likely to discard stop-words or to do stemming in this case.
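
One reason is that scikit-learn removes stop-words before it forms the n-grams, so words that were not adjacent in the text end up paired. Here is a small sketch on a short phrase taken from the second tweet. The names `keep` and `drop` are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrase = "learn to hate"

# Bigram analyzers, with and without stop-word removal
keep = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
drop = CountVectorizer(ngram_range=(2, 2), stop_words="english").build_analyzer()

bigrams_keep = keep(phrase)
bigrams_drop = drop(phrase)

print(bigrams_keep)  # bigrams of adjacent words
print(bigrams_drop)  # "to" is removed first, so "learn" and "hate" become neighbours
```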

If you are using bigrams, it is more common to allow unigrams as well.

In [None]:
# Create the vectorizer
vectorizer = CountVectorizer(ngram_range=(1,2))

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

In [None]:
vectorizer.get_feature_names_out()
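
To get a feel for how quickly the vocabulary grows, here is a sketch that compares the number of features for the three settings of `ngram_range`. The dictionary `sizes` is ours, just for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

# Vocabulary size for unigrams only, bigrams only, and both together
sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    v = CountVectorizer(ngram_range=ngram_range).fit(tweets)
    sizes[ngram_range] = len(v.vocabulary_)

print(sizes)
```

With `ngram_range=(1, 2)`, the vocabulary is simply the unigram vocabulary plus the bigram vocabulary, so the third number is the sum of the first two.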