# [Computational Social Science]
## 5-1 Text Preprocessing and Featurization - Student Version

This lab will cover the basics of text preprocessing and featurization, and introduce text classification.

## Virtual Environment
Always activate your virtual environment before you install packages or run a notebook! This helps prevent dependency conflicts across projects and ensures that you are using the correct package versions. You should have created an Anaconda virtual environment in the `Anaconda Installation` lab. If you have not, or if you want to create a new virtual environment, follow the instructions in the `Anaconda Installation` lab. 

<br>

If you have already created a virtual environment, you can run the following command to activate it: 

<br>

`conda activate <virtual_env_name>`

<br>

For example, if your virtual environment is named CSS, run the following command. 

<br>

`conda activate CSS`

<br>

To deactivate your virtual environment after you are done working with the lab, run the following command. 

<br>

`conda deactivate`

<br>

## Preparing your environment

You will need to uncomment the following lines to download `spaCy` if you do not already have it on your local machine. `NLTK` (Natural Language Toolkit) is an older and generally less efficient alternative. `spaCy` is a simpler, more optimized library, whereas `NLTK` is more comprehensive but a bit more verbose and has a slightly steeper learning curve. 

In [None]:
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [None]:
# load libraries
# ----------
import spacy 
import en_core_web_sm
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn

# settings
%matplotlib inline

## Data

<img src = "../../images/cfpb_logo.png"/>

For the next few labs, we will use the Consumer Financial Protection Bureau's [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). The database is rich with information about individual consumer complaints about credit card fraud, debt collections, and other consumer issues. This dataset is convenient for text analysis because the consumer complaints are real text generated by real people. This means that the data have all the idiosyncrasies that come with those data-generating processes. It also contains multiple different categories that we can predict, like type of product the complaint is about and whether the complaint was resolved quickly. 

The basic process is that if someone has a dispute related to consumer finance (mortgages, student loans, credit cards, etc.), they can file a dispute with the CFPB, which then contacts the company named in the dispute to get some resolution of the issue.

In [None]:
# load the data
# ----------

# load the dataframe
cfpb = pd.read_csv("../../data/CFPB 2020 Complaints.csv")

# drop missing on "Consumer complaint narrative" feature and reset the index bc we've dropped
cfpb = cfpb.dropna(subset = ['Consumer complaint narrative']).reset_index()

In [None]:
# check 
# ----------
cfpb.head()

In [None]:
# view examples just from "Consumer complaint" column 
# ----------
cfpb['Consumer complaint narrative'].head()

## Tokenization

Tokenization is the **process of splitting text into words and sentences.** These chunks (words, sentences, etc.) are called **tokens**. One approach might be to use string methods like [str.split](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html). The problem is that splitting on a separator like "," or "." or " " does not work for some common situations. So instead, we'll use the [spaCy](https://spacy.io/) library.

### Why tokenize?

Electronic text is a linear sequence of symbols. Before any processing can be done, text needs to be segmented into linguistic units, and this process is called tokenization.

We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call "tokens".

### How to tokenize

#### Using String Methods

##### Split Into Sentences

You might imagine that the easiest way to identify sentences is to split the document at every period `"."`, and to split the sentences using white space to get the words.

In [None]:
# using the split function to create tokens
text = cfpb['Consumer complaint narrative'][0] # what is this line doing?
sentences = text.split(...)                    # split text on the period character
for s in sentences[:5]:                        # specify just the first 5 sentences
    print(s + '\n')                            # What does the "\n" do?

This seems to be ok, but what if someone said something like "U.C. Berkeley charged me $50.11 by mistake."

In [None]:
# illustration of how this function breaks down
# ----------
bad_text = "U.C. Berkeley charged me $50.11 by mistake."
bad_sentences = bad_text.split(".")
for s in bad_sentences[:5]:
    print(s + '\n')

That doesn't look too good! The one sentence was split into 4 separate fragments (plus an empty string after the final period) because `"."`'s are used for things other than ending a sentence.
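
A regular expression can patch some of these cases, though not all. The sketch below is only illustrative: the pattern and the second sentence are assumptions added here, not part of the lab. It splits only where sentence-ending punctuation is followed by whitespace and a capital letter.

```python
import re

# Made-up two-sentence example that extends the one above.
sample = "U.C. Berkeley charged me $50.11 by mistake. I called them."

# Split only where ., ! or ? is followed by whitespace and an uppercase letter.
pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", sample)

for piece in pieces:
    print(repr(piece))
```

The decimal point in `$50.11` now survives, but "U.C." is still cut off from "Berkeley". Hand-written rules like this tend to need one special case after another, which is part of the motivation for using a dedicated tokenizer.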

##### Split Into Tokens
From here, we might split each sentence into tokens by splitting on white space in between words. Try filling in the code below to take the first sentence and split on white spaces.

In [None]:
# use a different parameter in the split function to create tokens using a " " (space)
# ----------
sentence = ...                          # take the first sentence from our "sentences" object
print("Sentence to split: ", sentence)  # view the sentence we are splitting
tokens = sentence.split(...)            # split on " " (a space) this time
tokens                                  # display the tokens

**CHALLENGE 1:** What was the problem with splitting on the white space? Are there any tokens that look a little strange?

**ANSWER**: ...

#### spaCy

In contrast to simple string methods, `spaCy` uses pre-trained language models to incorporate context. For instance, the end of a sentence (".") should mark a new token, but the string "U.K." should not be separated at the `"."`'s. In this case, we'll load [en_core_web_sm](https://spacy.io/models/en), which is one of `spaCy`'s English language models. According to [spaCy's documentation](https://spacy.io/usage/spacy-101#annotations-token), the tokenizer achieves this by taking the following steps:

First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, `spaCy` can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
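
The loop described above can be imitated in a few lines of plain Python. This is a toy sketch under stated assumptions, not spaCy's implementation: the `EXCEPTIONS` table and the `toy_tokenize` function are hypothetical, the exception list is tiny, and infixes (such as hyphens inside a word) are ignored.

```python
import string

# Hypothetical, tiny exception table; spaCy's real tables are far larger.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PUNCT = set(string.punctuation)


def toy_tokenize(text):
    """Whitespace split, then exception lookup and prefix/suffix stripping."""
    tokens = []
    for chunk in text.split():
        front, back = [], []
        while chunk:
            if chunk in EXCEPTIONS:        # check 1: tokenizer exception rule
                front.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PUNCT:        # check 2a: split off a prefix
                front.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in PUNCT:       # check 2b: split off a suffix
                back.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                          # nothing left to split
                front.append(chunk)
                chunk = ""
        tokens.extend(front + back)
    return tokens


example_tokens = toy_tokenize("U.K. banks don't refund me, sadly.")
print(example_tokens)
print(toy_tokenize("$50.11"))
```

Because the exception check runs again after each prefix or suffix is removed, a string such as `(U.K.)` would still keep "U.K." intact once the parentheses are stripped.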

Now let's try applying these methods to our CFPB data. The steps to do this are:

1. Load the language model.
2. Apply it to a piece of text and save it in a `spaCy` "doc" object.
3. Extract each token from the doc object to a list.
4. Display the tokens

Check the documentation for help filling in these steps!

Note this next step might take a few moments to run.

In [None]:
# tokenize the text with spaCy instead of the split function
# ----------
nlp = en_core_web_sm.load()                 # load the small language model en_core_web_sm

doc = nlp(...)                              # apply the nlp language model to process the "text" object from above
spacy_words = [token.text for token in doc] # create a list of words from doc object
display(f"Tokenized words: {spacy_words}")  # fancier way of displaying an object using f-string formatting
#print(spacy_words)                         # another way that you might be more familiar with

# note that "token.text" is an attribute of "token" not the object text we are cleaning
# see--https://spacy.io/api/token--for more details

# general syntax for list comprehension: [expression for item in iterable]

#### Removing Stop Words and Punctuation

We now have some tokens with just a few lines of code! There are a few additional steps that we might want to take. For example, we may want to remove punctuation and stop words. Punctuation oftentimes does not add substantive information to a piece of text, and stop words are common words that appear very frequently across texts. Removing this kind of information can help with downstream classification tasks by allowing an algorithm to focus on words that distinguish documents, rather than ones that appear frequently across them. 

Note that we might all have different opinions on the precise list of stop words, but in general this library should cover the basics. 

First, let's take a look at stop words. We can start by importing a collection of stop words from spaCy by running the following code:

In [None]:
# load STOP_WORDS module from spaCy library
# ----------
from spacy.lang.en.stop_words import STOP_WORDS

Now let's look at some common stop words from this collection:

In [None]:
# list 10 stop words from the library
# ----------
list(STOP_WORDS)[0:10]

Note that the words you see may differ because `STOP_WORDS` is a set and not a list. A [set](https://www.w3schools.com/python/python_sets.asp) is an **unordered** and **unindexed** collection of unique items in Python, so converting it to a list does not guarantee a particular order.  
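
A short sketch of what this means in practice, using a hypothetical five-word set (`mini_stop_words`) in place of spaCy's much larger `STOP_WORDS`:

```python
# Hypothetical mini stop-word set; spaCy's STOP_WORDS is much larger.
mini_stop_words = {"the", "a", "and", "of", "to"}

# Sets support fast membership tests ...
print("the" in mini_stop_words)        # True
print("complaint" in mini_stop_words)  # False

# ... but not positional indexing.
try:
    mini_stop_words[0]
except TypeError as err:
    print("Cannot index a set:", err)

# Sorting first gives a list with a reproducible order.
first_three = sorted(mini_stop_words)[:3]
print(first_three)
```

If you need the same slice every time you run the notebook, sorting before slicing is one simple option.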

We can do the same for punctuation:

In [None]:
# load the punctuation string from the string library
# ----------
from string import punctuation
punctuation

Removing punctuation and stop words is not a hard and fast rule - there may be situations where you want to keep them. In most applications, they add noise to downstream tasks, but always be mindful of your particular application when making decisions. Now that we have some tokenization tools, let's put them all together in a function!
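
As a warm-up, here is one way to strip punctuation from a string without spaCy. It is a minimal sketch on made-up text, and it uses `str.translate`, which is an alternative to the character-by-character approach in the function skeleton further down.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans("", "", punctuation)

sample = "Hello, world! It's $50.11."   # made-up example text
cleaned = sample.translate(remove_punct)
print(cleaned)
```

Notice the side effects: "It's" becomes "Its" and "$50.11" becomes "5011". This is one reason to stay mindful of your application before removing punctuation wholesale.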

**CHALLENGE 2:** Write a function that takes a piece of text as an argument, and returns a list of tokens without punctuation or stop words.

In [None]:
# write a function to remove punctuation and stop words
# ----------

# initialize an empty list
tokens_reduced = []

# create a function to remove punctuation and stop words 
def rem_punc_stop(text):
    
    # set objects
    stop_words = ...      # set STOP_WORDS to a new object variable
    punc = set(...)       # convert punctuation to a set
    
    # essentially remove the punctuation - important to remove punctuation first to correctly capture stop words
    punc_free = "".join([... for ... in ... if ... not in ...]) # join new list of characters (ch) in text w/ condition
                                                                # if ch is not in punctuation 
                                                                # "".join() creates a string from the list comprehension

    # apply nlp to punctuation-free object
    doc = nlp(...)
    
    # extract words from processed text 
    spacy_words = [...]
    
    # filter out words 
    no_punc = [...]
    
    return ...

In [None]:
# apply function
# ----------

# apply our new function to our text object from above
tokens_reduced = rem_punc_stop(...)

# view the first 5 tokens
tokens_reduced[...]

## Entity Recognition

spaCy also contains a number of methods for tasks like entity recognition. For instance, we could run the following code to check various entities. Notice that this process isn't perfect: spaCy still thinks "XX/XX/XXXX" is an organization or product even though we know this is a redacted date.

In [None]:
# apply function to text object
# ----------
# run a loop and print out each entity, its label, and an explanation of the label
for entity in nlp(text).ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

In [None]:
# How can we modify the above code to apply function to the cleaned tokens_reduced object?
# ----------

# first, convert tokens_reduced from list to a string
tokens_reduced_str = ...

# next, apply function
for entity in nlp(...).ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
    

## Lemmatization

Another preprocessing step we might take is reducing words down to their lemmas. Lemmatization reduces a word to its root word, while making sure the word still belongs to the language. This is in contrast to stemming, which reduces the word down to its root even if that root is not a valid word. Consider the following example:

In [None]:
# run a loop to print the lemma of each word
# ----------
for word in nlp(u'compute computer computed computing'):
    print(word.text,  word.lemma_)

Stemming would reduce all of these words to the root "comput", which is not a valid word, whereas lemmatization maps each word to its dictionary form (its lemma). Again, you may choose to stem or lemmatize depending on your specific application. 
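
To see where a stem like "comput" comes from, here is a deliberately crude suffix-stripping sketch. The `crude_stem` function and its suffix list are hypothetical and far simpler than a real stemmer; the point is only that stemming chops characters off without checking whether the result is a word.

```python
def crude_stem(word, suffixes=("ing", "ed", "er", "e")):
    """Strip the first matching suffix (a toy rule, not a real stemmer)."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


words = ["compute", "computer", "computed", "computing"]
stems = [crude_stem(w) for w in words]
print(stems)
```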

Now, try it again but with a new set of words:

In [None]:
# try it again with a new word to see how it changes 
# ----------
for word in nlp(u'... ... ..'): # choose a new set of words
    print(word.text,            # print the word 
          word.lemma_)          # print the lemma 

**CHALLENGE 3:** Lemmatize the first consumer complaint narrative.

In [None]:
# now run a loop to identify the lemma for each word in our "text" object we've used above 
# ----------
for ... in nlp(...): # specify the text object from above
    print(...,       # print the word  
          ...)       # print the lemma

## Chunking

We may also want to chunk more than one word together. One way to do this might be to group nouns together. 

**CHALLENGE 4:** Try using the [`noun_chunks`](https://spacy.io/api/doc#noun_chunks) property to chunk nouns in the first complaint.

In [None]:
# chunking
# ----------
# apply the nlp to our text object to convert to a spacy object
doc = nlp(...)  # apply nlp to text object from above

# run a loop over the noun chunks and print each one
for chunk in ...:   # "chunk" rather than "np", so we do not overwrite the numpy alias
    print(...)      # print the chunk object calling the text attribute

## Bag of Words

Now that we have covered some of the basics of text preprocessing, we are ready to start getting our data into a format for feeding it into machine learning algorithms. There are many options for converting raw text to features in a supervised machine learning problem. The most basic of these is the "bag of words" approach. Bag of words essentially counts the number of times each word appears in each document of a corpus, and these counts become features.
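
The idea can be sketched by hand with `collections.Counter`. The two-document corpus below is made up, and tokens are produced by a plain whitespace split rather than by the spaCy-based function from earlier.

```python
from collections import Counter

docs = ["the bank charged a fee", "the bank refunded the fee"]  # toy corpus
tokenized = [doc.split() for doc in docs]

# The vocabulary (one column per unique word), in alphabetical order.
vocab = sorted({word for doc in tokenized for word in doc})

# One row per document, one count per vocabulary word.
counts_per_doc = [Counter(doc) for doc in tokenized]
bow_rows = [[counts[word] for word in vocab] for counts in counts_per_doc]

print(vocab)
for row in bow_rows:
    print(row)
```

Each row is a document and each column is a word, so "the" gets a count of 2 in the second document. Word order within a document is discarded, which is where the name "bag" comes from.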

To illustrate, first let's import the `CountVectorizer` class from sklearn. Once we do that, let's use the tokenizer function that we wrote earlier to initialize the `CountVectorizer`.

In [None]:
# load libraries
# ----------
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

In [None]:
# initialize CountVectorizer and specify some parameters
# ----------
bow_vector = CountVectorizer(tokenizer = rem_punc_stop, # use our function for tokenizing created above
                             token_pattern = None,      # set to "None" since we have specified our own tokenizer
                             ngram_range=(1,1))         # use default for unigrams - see documentation

Once we create a CountVectorizer object, we can then transform a list of texts with the "fit_transform" method. This will return a sparse matrix with the counts. We can densify the matrix with the ".todense()" method.

In [None]:
# fit and transform just a subset of our data
# ----------
bow_matrix = bow_vector.fit_transform(cfpb['Consumer complaint narrative'][0:5]) 

In [None]:
# visualize
# ----------
bow_matrix.todense() # todense() densifies the sparse matrix (which only stores non-zero elements)

We can also get the feature names:

In [None]:
# get feature names instead and view subset
# ----------
feature_names = bow_vector.get_feature_names_out()
feature_names[0:10]

## Term Frequency - Inverse Document Frequency

An extension of bag-of-words is the term frequency-inverse document frequency (tf-idf) approach. Whereas bag-of-words counts how many times each word appears in a document, tf-idf takes this quantity and scales it down according to how many documents in the corpus contain the word. In doing so, the tf-idf score downweights words that are common across the corpus and thus would not aid with classification.
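
A hand-computed sketch may make the weighting concrete. It uses a made-up two-document corpus and the textbook formula, term count times `log(N / df)`. This is an assumption chosen for readability: scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = ["the bank charged a fee", "the bank refunded the fee"]  # toy corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Textbook inverse document frequency: log(N / df).
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}

# tf-idf = raw term count in the document * idf of the term.
counts_per_doc = [Counter(doc) for doc in tokenized]
tfidf = [[counts[word] * idf[word] for word in vocab] for counts in counts_per_doc]

for word, score in zip(vocab, tfidf[0]):
    print(f"{word:10s} {score:.3f}")
```

Words that occur in every document ("the", "bank", "fee") score zero, while words that distinguish a document ("charged") keep a positive weight.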

**CHALLENGE 5:** Using the code from the "Bag of Words" section as a template, write code to get the tf-idf matrix for the CFPB data. You may want to check the documentation for [TfidfVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to see other options.

In [None]:
# initialize TfidfVectorizer and specify some parameters
# ----------
tfidf_vector = TfidfVectorizer(...)   # use our function for tokenizing created above

In [None]:
# fit and transform just a subset of our data
# ----------
# fit and transform to obtain the tfidf matrix
matrix = ... # subset only the first 5 rows

# if you want to get the feature names and check the length
feature_names = ...

In [None]:
# visualize
# ----------
matrix.todense() # todense() densifies the sparse matrix (which only stores non-zero elements)

## Basic Classification (optional)

This section is optional because we won't cover Naive Bayes until later in the semester but it will give you a sense of where we are going and what we can do with cleaned data. 

So, now that we have featurized our text, we are ready to make a prediction! Does the text of our consumer complaints predict whether or not they get a timely response?

**CHALLENGE 6:** Transform the text of the consumer complaint narrative into a tf-idf matrix, and use it to predict the "Timely response?" column.
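
The workflow below splits the data twice: once to hold out a test set, and once more to carve a validation set out of the remaining training data. The sketch shows how the sizes work out using only the standard library. The `split_indices` helper and the 80/20 and 75/25 proportions are illustrative assumptions, not the values you must use in the challenge.

```python
import random


def split_indices(n_items, train_fraction, seed):
    """Shuffle range(n_items) and cut it into two lists (hypothetical helper)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


# 100 toy rows: hold out 20% as the test set.
train_val, test = split_indices(100, 0.80, seed=100)

# Split the remaining 80 rows into 75% training and 25% validation.
train_pos, val_pos = split_indices(len(train_val), 0.75, seed=101)
train = [train_val[i] for i in train_pos]
validate = [train_val[i] for i in val_pos]

print(len(train), len(validate), len(test))
```

With these proportions the three sets hold 60%, 20%, and 20% of the original rows, and no row appears in more than one set.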

In [None]:
#
# Naive Bayes Classification workflow
# --------------------------------------------

# subset the data for speed of analysis
# ----------
cfpb = cfpb[:5000]


# features
# ----------
X = cfpb[...]                          # select just the consumer complaint column
tf = ...(tokenizer = ....,             # initialize the TfidfVectorizer() algorithm and use our cleaning function
         token_pattern = None)         # set to "None" since we have specified our own tokenizer



# fit tokenizer on X dataframe
tfidf_matrix =  tf.fit_transform(...) # fit and transform the features


# label
# ----------
y = cfpb['Timely response?']          # subset so outcome object only includes timely response label

# create train-test split
# ----------
X_train, X_test, y_train, y_test = train_test_split(...,                # transformed feature matrix
                                                    ...,                # labels
                                                    train_size = ...,   # training split
                                                    test_size = ...,    # testing split
                                                    random_state=100)   # set random state

# create train-validation split
# ----------
X_train, X_validate, y_train, y_validate = train_test_split(...,              # feature training data
                                                            ...,              # outcome training data
                                                            train_size = ..., # training split
                                                            test_size = ...,  # validation split
                                                            random_state=101) # set random state






In [None]:
# set seed for reproducibility
# ----------
from spacy.util import fix_random_seed # import random seed utility from spacy library
fix_random_seed(1) # set seed

# initialize the Multinomial Naive Bayes algorithm
# ----------
nb = MultinomialNB()

# fit it on the training data
# ----------
nb_model = nb.fit(..., 
                  ...)

# predict on validation dataset
# ----------
nb_pred = nb_model.predict(....)

In [None]:
# import numpy library
# ----------
import numpy as np

# print average accuracy
# ----------
print(np.mean(nb_pred == ...))

# create a confusion matrix
# ----------
nb_cf_matrix = confusion_matrix(..., 
                                ...,
                                normalize = "pred") # normalize by column (predicted label)
nb_cf_matrix

Let's plot the confusion matrix! Use the following code from the "seaborn" package to make a heatmap out of the matrix.

In [None]:
# convert to dataframe for visualization
# ----------
nb_df_cm = pd.DataFrame(nb_cf_matrix, 
                        range(2),
                        range(2))

# visualize
nb_df_cm

In [None]:
# create confusion matrix using Seaborn code
# ----------
# add indices and labels
nb_df_cm = nb_df_cm.rename(index=str, columns={0: "no", 1: "yes"})
nb_df_cm.index = ["no", "yes"]

# specify figure parameters
plt.figure(figsize = (10,7))
sn.set(font_scale=1.4)  # for label size

# specify figure using seaborn library
sn.heatmap(nb_df_cm, 
           annot=True,
           annot_kws={"size": 16}, 
           fmt='.2f')  # prevent scientific notation

# figure labels
plt.title("Naive Bayes Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()