# Computational Social Science: Text Preprocessing and Featurization

This lab will cover the basics of text preprocessing and featurization, and introduce text classification.

In [None]:
import spacy
import en_core_web_sm
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

## Data

<img src = "../images/cfpb logo.png"  />

For the next few labs, we will use the Consumer Financial Protection Bureau's [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). The database contains detailed records of individual consumer complaints about credit card fraud, debt collection, and other consumer issues.

In [None]:
cfpb = pd.read_csv("../data/CFPB 2020 Complaints.csv")
cfpb = cfpb.dropna(subset = ['Consumer complaint narrative']).reset_index()

In [None]:
cfpb.head()

## Tokenization

Tokenization is the **process of splitting text into words and sentences.** These chunks (words, sentences, etc.) are called **tokens**. One approach is to use string methods like [str.split](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html). The problem is that splitting on a separator such as "," or "." or " " fails in some common situations, for example abbreviations that contain periods. So instead, we'll use the [spaCy](https://spacy.io/) library.

### Why tokenize?

Electronic text is a linear sequence of symbols. Before any processing is to be done, text needs to be segmented into linguistic units, and this process is called tokenization.

We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call "tokens".

### How to tokenize

#### Using String Methods

##### Split Into Sentences

You might imagine that the easiest way to identify sentences is to split the document at every period '.', and to split the sentences using white space to get the words.

In [None]:
# using the split function to create tokens
text = cfpb['Consumer complaint narrative'][0]
paragraph = ...
sentences = paragraph.split(...)
for s in sentences[:5]:
    print(s + '\n')

##### Split Into Tokens
From here, we might split each sentence into tokens by splitting on white space in between words. Try filling in the code below to take the first sentence and split on white spaces.

In [None]:
sentence = ...
tokens = sentence.split(...)
tokens

##### Challenge 1: What was the problem with splitting on the white space? Are there any tokens that look a little strange?

**Answer**: Punctuation stays attached to the neighboring word, so we get tokens like "errors," and "inconsistent," with trailing commas that we probably do not want.
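
To make both problems concrete, here is a small sketch on a made-up sentence (it is not taken from the CFPB data). It shows how splitting on periods breaks an abbreviation apart and how splitting on white space leaves punctuation glued to words.

```python
# A made-up sentence used only for illustration (not from the CFPB data).
example = "I live in the U.K. and my report had errors, which was frustrating."

# Splitting on periods cuts the abbreviation "U.K." into separate pieces.
period_pieces = example.split(".")
print(period_pieces)

# Splitting on white space keeps punctuation attached to the neighboring word.
whitespace_tokens = example.split(" ")
print(whitespace_tokens)
```

Notice that neither split gives clean tokens: the first produces fragments like "K", and the second produces "errors," with its comma.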

#### spaCy

In contrast to string methods, spaCy combines language-specific tokenization rules with pre-trained language models. Here we'll load [en_core_web_sm](https://spacy.io/models/en), which is one of spaCy's English language models. For instance, a period at the end of a sentence should become its own token, but the string "U.K." should not be split at its periods. According to [spaCy's documentation](https://spacy.io/usage/spacy-101#annotations-token), the tokenizer achieves this by taking the following steps:

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
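
The loop described above can be imitated with a toy tokenizer. This is a simplified sketch and not spaCy's implementation: the `EXCEPTIONS` table and the `toy_tokenize` function are hypothetical names made up for this example, the exception table has only two entries, and infixes are not handled at all.

```python
import string

# Hypothetical exception table for this sketch; spaCy's real rules are far larger.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():                     # first, split on white space
        prefixes, suffixes = [], []
        while chunk:
            if chunk in EXCEPTIONS:                # check 1: exception rule
                prefixes.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in string.punctuation:   # check 2: split off a prefix
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in string.punctuation:  # check 2: split off a suffix
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                                  # nothing left to split
                prefixes.append(chunk)
                chunk = ""
        tokens.extend(prefixes + suffixes)
    return tokens

sample = "I don't live in the U.K., (honestly)."
print(toy_tokenize(sample))
# ['I', 'do', "n't", 'live', 'in', 'the', 'U.K.', ',', '(', 'honestly', ')', '.']
```

Because the exception check runs again after each piece of punctuation is peeled off, "U.K.," is first reduced to "U.K." and then kept whole.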

Now let's try applying these methods to our CFPB data. The steps to do this are:

1. Load the language model.
2. Apply it to a piece of text and save the result in a spaCy "doc" object.
3. Extract each token from the doc object to a list.
4. Display the tokens.

Check the documentation for help filling in these steps!

In [None]:
nlp = en_core_web_sm.load()

doc = nlp(text)
spacy_words = [token.text for token in doc]
display(f"Tokenized words: {spacy_words}")

#### Removing Stop Words and Punctuation

We now have some tokens with just a few lines of code! There are a few additional steps that we might want to take. For example, we may want to remove punctuation and stop words. Punctuation oftentimes does not add substantive information to a piece of text, and stop words are common words that appear very frequently across texts. Removing this kind of information can help with downstream classification tasks by allowing an algorithm to focus on words that distinguish documents, rather than ones that appear frequently across them. 

First, let's take a look at stop words. We can start by importing a collection of stop words from spaCy by running the following code:

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

Now let's look at some common stop words from this collection:

In [None]:
list(STOP_WORDS)[0:10]

We can do the same for punctuation:

In [None]:
from string import punctuation
punctuation
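
Before combining these with spaCy, it may help to see the filtering idea on its own. The sketch below uses a tiny hand-picked stop list (`toy_stop_words`, an assumption for this example only and much smaller than spaCy's collection) and plain white space splitting on a made-up sentence.

```python
from string import punctuation

# Tiny hand-picked stop list, an assumption for this sketch only.
toy_stop_words = {"the", "a", "of", "and", "was", "my", "on"}

# A made-up sentence, not from the CFPB data.
raw = "The balance on my account was wrong, and the fee was unfair!"

# Drop punctuation characters, lowercase, then split on white space.
cleaned = "".join(ch for ch in raw if ch not in punctuation).lower()

# Keep only the words that are not in the stop list.
kept = [w for w in cleaned.split() if w not in toy_stop_words]
print(kept)
# ['balance', 'account', 'wrong', 'fee', 'unfair']
```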

Removing punctuation and stop words is not a hard and fast rule; there may be situations where you want to keep them. In many applications they add noise to downstream tasks, but always be mindful of your particular application when making decisions. Now that we have some tokenization tools, let's put them all together in a function!

##### Challenge 2: Write a function that takes a piece of text as an argument, and returns a list of tokens without punctuation or stop words.

In [None]:
tokens_reduced = []

def rem_punc_stop(text):
    stop_words = ...
    punc = set(...)
    
    punc_free = "".join([... for ... in ... if ... not in ...])
    
    doc = nlp(...)
    
    spacy_words = [...]
    
    no_punc = [...]
    
    return ...

In [None]:
tokens_reduced = rem_punc_stop(text)
tokens_reduced[0:5]

## Entity Recognition

spaCy also provides tools for tasks like named entity recognition. For instance, we could run the following code to list the entities it detects in the first complaint. Notice that this process isn't perfect: spaCy still labels "XX/XX/XXXX" as an organization or product even though we know this is a redacted date.

In [None]:
for entity in nlp(text).ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

## Lemmatization

Another preprocessing step we might take is reducing words down to their lemmas. Lemmatization reduces a word to its dictionary form (its lemma), so the result is always a valid word in the language. This is in contrast to stemming, which strips a word down to a root even if that root is not a valid word. Consider the following example:

In [None]:
for word in nlp('compute computer computed computing'):
    print(word.text, word.lemma_)

Stemming would reduce all of these words to the root "comput", which is not itself a word, whereas lemmatization maps each one to a valid dictionary form. Again, you may choose to stem or lemmatize depending on your specific application.
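
To see where a stem like "comput" comes from, here is a deliberately crude suffix stripper. It is only an illustration: `SUFFIXES` and `toy_stem` are made up for this sketch, and real stemmers such as the Porter stemmer apply many more rules.

```python
# Crude suffix list for illustration only; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "er", "e")

def toy_stem(word):
    # Strip the first matching suffix, without checking that the result is a word.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

words = ["compute", "computer", "computed", "computing"]
stems = [toy_stem(w) for w in words]
print(stems)
# ['comput', 'comput', 'comput', 'comput']
```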

##### Challenge 3: Lemmatize the first consumer complaint narrative:

In [None]:
for ... in nlp(...):
    print(...,  ...)

## Chunking

We may also want to chunk more than one word together. One way to do this is to group each noun with the words that modify it into a noun phrase.

**Challenge 4: Try using the "noun_chunks" attribute of the doc object to chunk nouns in the first complaint.**

In [None]:
doc = nlp(...)
for np in ...:
    print(...)

## Bag of Words

Now that we have covered some of the basics of text preprocessing, we are ready to start getting our data into a format for feeding it into machine learning algorithms. There are many options for converting raw text to features in a supervised machine learning problem. The most basic of these is the "bag of words" approach. Bag of words essentially counts the number of times each word appears in each document, and these counts become features.

To illustrate, first let's import the CountVectorizer class from sklearn. Once we do that, let's use the tokenizer function that we wrote earlier to initialize the CountVectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
bow_vector = CountVectorizer(tokenizer = rem_punc_stop, ngram_range=(1,1))

Once we create a CountVectorizer object, we can then transform a list of texts with the "fit_transform" method. This will return a sparse matrix with the counts. We can densify the matrix with the ".todense()" method.

In [None]:
bow_matrix = bow_vector.fit_transform(cfpb['Consumer complaint narrative'][0:5])

In [None]:
bow_matrix.todense()

We can also get the feature names:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = bow_vector.get_feature_names_out()
feature_names[0:10]
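
The same kind of count matrix can be built by hand, which may make it clearer what the rows and columns mean. This sketch uses three made-up documents and plain white space splitting rather than our spaCy tokenizer; each row is a document and each column is a word in the alphabetically sorted vocabulary.

```python
from collections import Counter

# Three made-up documents (not from the CFPB data).
docs = ["late fee charged twice", "fee refund denied", "refund late late"]

# Tokenize on white space and collect the sorted vocabulary.
tokenized = [d.split() for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Count each vocabulary word in each document.
counts = [Counter(toks) for toks in tokenized]
bow_rows = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in bow_rows:
    print(row)
```

The last row has a 2 in the "late" column because that word appears twice in the third document.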

## Term Frequency - Inverse Document Frequency

An extension of bag-of-words is the term frequency-inverse document frequency (tf-idf) approach. Whereas bag-of-words counts the number of times each word appears in a document, tf-idf takes this quantity and scales it down according to how many documents in the corpus contain the word. In doing so, the tf-idf score downweights words that are common across the corpus and thus would not aid with classification.
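
As a rough illustration, the sketch below computes one common textbook variant of the score, term count times log(N / document frequency), on three made-up token lists. This is an assumption chosen for clarity: scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Three made-up tokenized documents (not from the CFPB data).
toy_docs = [["late", "fee", "fee"], ["fee", "refund"], ["refund", "denied"]]
n_docs = len(toy_docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf(doc):
    # Textbook variant: raw term count times log(N / document frequency).
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

scores = [tfidf(doc) for doc in toy_docs]
for s in scores:
    print(s)
```

In the first document, "fee" appears twice and "late" only once, yet "late" receives the higher score because it occurs in just one of the three documents.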

**Challenge 5: Using the code from the "Bag of Words" section as a template, write code to get the tf-idf matrix for the CFPB data.**

In [None]:
tfidf_vector = TfidfVectorizer(...)

In [None]:
matrix = ...
feature_names = ...

In [None]:
matrix.todense()

## Basic Classification

**Challenge 6: Transform the text of the consumer complaint narrative into a tf-idf matrix, and use it to predict the "Timely response?" column.**
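
The cells below split the data twice: once into training and test sets, and then again to carve a validation set out of the training set. The sketch here shows that two-stage idea on plain row positions using only the standard library. The 20% and 25% proportions and the seed are assumptions for illustration, not values prescribed by this lab.

```python
import random

# Stand-ins for 100 row positions; shuffled with a fixed seed for repeatability.
indices = list(range(100))
rng = random.Random(0)
rng.shuffle(indices)

# Stage 1: hold out 20% of the rows as the test set.
n_test = int(0.2 * len(indices))
test_idx, rest = indices[:n_test], indices[n_test:]

# Stage 2: hold out 25% of the remaining rows as the validation set.
n_val = int(0.25 * len(rest))
val_idx, train_idx = rest[:n_val], rest[n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
# 60 20 20
```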

In [None]:
cfpb = cfpb[:5000]

In [None]:
# Training, Validation, Test Sets

# X
X = ...
tf = ...

tfidf_matrix =  ...

#y

y = ...

# Train/Test Split
X_train, X_test, y_train, y_test = ...

# Train/Validation Split
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train,
                                           ...

In [None]:
nb = MultinomialNB()

nb_model = nb.fit(..., ...)

nb_pred = nb_model.predict(...)

In [None]:
import numpy as np
np.mean(... == ...)

nb_cf_matrix = confusion_matrix(..., ...)
nb_cf_matrix
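
To read the confusion matrix correctly, it helps to build one by hand. The sketch below uses made-up labels (not model output) and follows the same layout as sklearn's `confusion_matrix`: rows are true labels, columns are predicted labels, and labels are in sorted order.

```python
from collections import Counter

# Made-up true and predicted labels, for illustration only.
y_true_toy = ["yes", "yes", "no", "yes", "no", "yes"]
y_pred_toy = ["yes", "no", "no", "yes", "yes", "yes"]

# Accuracy is the share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

# Rows are true labels, columns are predicted labels.
labels = ["no", "yes"]
pair_counts = Counter(zip(y_true_toy, y_pred_toy))
cm = [[pair_counts[(t, p)] for p in labels] for t in labels]

print(accuracy)
print(cm)
# [[1, 1], [1, 3]]
```

The diagonal entries (1 and 3) are the correct predictions, so accuracy is 4 out of 6.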

Let's plot the confusion matrix! Use the following code from the "seaborn" package to make a heatmap out of the matrix.

In [None]:
nb_df_cm = pd.DataFrame(nb_cf_matrix, range(2),
                  range(2))

In [None]:
nb_df_cm

In [None]:
nb_df_cm = nb_df_cm.rename(index=str, columns={0: "no", 1: "yes"})
nb_df_cm.index = ["no", "yes"]
plt.figure(figsize = (10,7))
sn.set(font_scale=1.4)  # for label size
sn.heatmap(nb_df_cm, 
           annot=True,
           annot_kws={"size": 16})

plt.title("Naive Bayes Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()