# [Computational Social Science]
## 5-3 Text Feature Engineering and Classification - Student Version

In this lab we will use the techniques we covered so far to engineer text features and train a classification algorithm.

In [None]:
import pandas as pd
import numpy as np
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data

<img src = "../../images/cfpb logo.png"  />

We'll once again use the Consumer Financial Protection Bureau's [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). Picking up where we left off last time, we'll focus on predicting whether a consumer complaint narrative is about a "checking or savings account" issue or a "student loan" issue.

In [None]:
cfpb = pd.read_csv("../../data/CFPB 2020 Complaints.csv")
cfpb = cfpb.dropna(subset = ['Consumer complaint narrative'])
cfpb = cfpb[(cfpb['Product']=='Checking or savings account') | (cfpb['Product'] == 'Student loan')]
cfpb = cfpb[:1000]

## Text Preprocessing

Before we do any feature engineering or classification, we should first preprocess our text. Let's start by defining our custom `rem_punc_stop()` function:

In [None]:
def rem_punc_stop(text):
    stop_words = STOP_WORDS
    # Add the redaction placeholders to the stop words one at a time
    # nlp.Defaults.stop_words.add("XX")
    # nlp.Defaults.stop_words.add("XXXX")
    # nlp.Defaults.stop_words.add("XXXXXXXX")
    
    # Or all at once, using the in-place set union operator (|=)
    nlp.Defaults.stop_words |= {"XX", "XXXX", "XXXXXXXX"}
    
    punc = set(punctuation)
    
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    doc = nlp(punc_free)
    
    spacy_words = [token.text for token in doc]
    
    # Drop stop words (punctuation was already removed above)
    no_stop = [word for word in spacy_words if word not in stop_words]
    
    return no_stop

Now let's go ahead and apply our function to the consumer complaint narratives. Notice how the `rem_punc_stop()` function returns a list, but we can collapse our tokens back into strings with the `join()` string method.

In [None]:
cfpb['tokens'] = cfpb['Consumer complaint narrative'].map(rem_punc_stop)
cfpb['tokens'] = cfpb['tokens'].map(lambda text: ' '.join(text))
cfpb['tokens']
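
To make the list-to-string round trip concrete, here is a minimal sketch that does not need spaCy. It assumes a tiny hypothetical stop-word set (`toy_stop_words`) and plain whitespace splitting in place of the spaCy tokenizer, so its output will differ from `rem_punc_stop()` on real narratives.

```python
from string import punctuation

# Hypothetical, tiny stop-word set for illustration only;
# the lab itself uses spaCy's STOP_WORDS.
toy_stop_words = {"i", "my", "the", "a", "to", "XXXX"}

def toy_rem_punc_stop(text):
    """Strip punctuation, split on whitespace, and drop stop words."""
    punc = set(punctuation)
    punc_free = "".join(ch for ch in text if ch not in punc)
    return [word for word in punc_free.split() if word not in toy_stop_words]

tokens = toy_rem_punc_stop("On XXXX, the bank closed my account!")
joined = " ".join(tokens)  # collapse the token list back into one string

print(tokens)  # ['On', 'bank', 'closed', 'account']
print(joined)  # On bank closed account
```

Note that the membership check is case-sensitive, just as in `rem_punc_stop()`, so a capitalized word such as "On" would survive even if "on" were in the stop-word set.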

## Wrap up EDA

We've already covered several exploratory data analysis techniques. There are many other ways to explore text data, but let's take a look at one last basic tool: visualizing n-grams.

In [None]:
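# Initialize the bag-of-words CountVectorizer
## Notice the ngram_range argument
# Recent scikit-learn versions require stop_words to be a list rather than a set
countvec = CountVectorizer(stop_words=list(STOP_WORDS), ngram_range=(2,3))
ngrams = countvec.fit_transform(cfpb['tokens'])

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dictionary_dataframe = pd.DataFrame(ngrams.todense(), columns = countvec.get_feature_names_out())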

In [None]:
df_ngram = pd.DataFrame(dictionary_dataframe.sum().reset_index()).rename(columns={'index': 'ngrams', 0:'freq'})
df_ngram = df_ngram.sort_values(by = ['freq'], ascending = False).reset_index(drop = True)
df_ngram.head()

In [None]:
sns.barplot(x="ngrams", y = 'freq', data=df_ngram[0:25])
plt.xticks(rotation=90)
plt.show()
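
To see what `ngram_range=(2,3)` asks for, it can help to build n-grams by hand. The sketch below is plain Python, and `make_ngrams` is a hypothetical helper written for this illustration, not part of scikit-learn. It shows the bigrams and trigrams for one short token list; `CountVectorizer` counts both kinds across every document.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "bank closed checking account".split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # ['bank closed', 'closed checking', 'checking account']
print(trigrams)  # ['bank closed checking', 'closed checking account']
```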

## Challenge: N-Grams

Adjust the code above to visualize the most popular unigrams and 4-grams. What is the tradeoff involved with increasing n?

In [None]:
# unigram
...

In [None]:
# 4-gram
...

**Answer**: ...

## Feature Engineering

Last time, we saw some techniques for exploring the text of our data. Specifically, we saw how to find the length of our text and word counts:

In [None]:
cfpb['complaint_len'] = cfpb['tokens'].apply(len)
cfpb['complaint_len']

In [None]:
cfpb['word_count'] = cfpb['tokens'].apply(lambda x: len(str(x).split()))
cfpb['word_count']

We also covered subjectivity and sentiment:

In [None]:
cfpb['polarity'] = cfpb['tokens'].map(lambda text: TextBlob(text).sentiment.polarity)
cfpb['subjectivity'] = cfpb['tokens'].map(lambda text: TextBlob(text).sentiment.subjectivity)

### Build a Dictionary

Before we continue, let's take the top 25 n-grams we found earlier and turn them into their own dataframe. We'll return to these later.

In [None]:
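# Recent scikit-learn versions require stop_words to be a list rather than a set
countvec = CountVectorizer(stop_words=list(STOP_WORDS), ngram_range=(2,3))
ngrams = countvec.fit_transform(cfpb['tokens'])

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dictionary_dataframe = pd.DataFrame(ngrams.todense(), columns = countvec.get_feature_names_out())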

df_ngram = pd.DataFrame(dictionary_dataframe.sum().reset_index()).rename(columns={'index': 'ngrams', 0:'freq'})
df_ngram = df_ngram.sort_values(by = ['freq'], ascending = False).reset_index(drop = True)

top_25_ngrams = dictionary_dataframe.loc[:,df_ngram[0:25]['ngrams']]
top_25_ngrams.head()
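
The column selection above can be sketched on a toy document-term matrix. The n-grams and counts below are made up for illustration, and the sketch keeps the top 2 columns rather than the top 25.

```python
import pandas as pd

# Hypothetical document-term matrix: one row per complaint, one column per n-gram.
counts = pd.DataFrame({
    "late fee": [1, 0, 2],
    "student loan": [3, 1, 0],
    "closed account": [0, 1, 0],
})

# Total frequency of each n-gram across all documents, largest first.
totals = counts.sum().sort_values(ascending=False)

# Keep only the columns for the most frequent n-grams.
top_2 = counts.loc[:, totals.index[:2]]

print(totals)
print(top_2)
```

Each row of the result still corresponds to one document, which is what lets these columns be used as features later.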

## Challenge: Topic Modeling

Using the code we went over in the last lab, make a dataframe with 5 LDA-generated topics. Then, create a topic model using [Non-Negative Matrix Factorization](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) and print out the words associated with the first 5 topics. NMF is another algorithm that is frequently used for topic modeling. Are the NMF topics similar to your LDA topics?

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
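
The slice `[:-n_top_words - 1:-1]` in `print_top_words()` is compact but easy to misread. Here is a small sketch of what it does, using a made-up five-word vocabulary and made-up topic weights rather than a fitted model.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
feature_names = ["account", "bank", "loan", "payment", "school"]
topic = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
n_top_words = 3

# argsort() returns indices ordered from the smallest to the largest weight;
# the slice walks backwards from the end and stops after n_top_words items.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

print(top_idx)    # [3 1 2]
print(top_words)  # ['payment', 'bank', 'loan']
```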

In [None]:
# LDA
...

In [None]:
# NMF
...

**Answer**: ...

## Classification

Now we're ready to move to classification! We are going to examine how different featurization techniques compare. Create a list with the following:
 * Text Engineered Features
 * Text Engineered Features + Topic Model
 * Non-Text Features only
 * Non-Text Features + Text Engineered Features
 * tf-idf
 * Non-Text Features + tf-idf
 * Non-Text Features + Top 25 n-grams

You'll need to use pandas to create and `.join()` these different dataframes together. Also be sure to use `reset_index()` as necessary. Once you've created each of these dataframes (or arrays!), you should loop through all of them, train a supervised learning algorithm (like logistic regression or a decision tree classifier), and plot confusion matrices. Once you do this, think about which featurization technique worked best, and whether combining text and non-text features was helpful. For now, don't worry about hyperparameter tuning or feature selection, though you would do these in practice.
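
Why does `reset_index()` matter here? `join()` aligns rows by index label, and a filtered dataframe keeps its original labels. The sketch below uses two toy dataframes with made-up values (they are stand-ins, not the lab's actual features) to show the misalignment and the fix.

```python
import pandas as pd

# Toy stand-in for a filtered dataframe: the row labels are no longer 0, 1, 2.
filtered = pd.DataFrame({"word_count": [12, 40, 7]}, index=[3, 8, 15])

# Toy stand-in for features built from an array, which get a fresh 0..n-1 index.
topics = pd.DataFrame({"topic_0": [0.9, 0.2, 0.5]})

# join() aligns on index labels, so mismatched labels produce NaN.
misaligned = filtered.join(topics)

# Resetting the index first lines the rows up by position.
aligned = filtered.reset_index(drop=True).join(topics)

print(misaligned)
print(aligned)
```

In the misaligned result every `topic_0` value is missing, which most classifiers will reject or silently mishandle.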

In [None]:
cfpb.columns

In [None]:
# Engineered Text Features
engineered_features = ...

# Topic Model + Engineered Features
engineered_features_with_topics = ...

# Non-text features
# Hint: Is there something we need to do to prepare categorical features for classification?
non_text_features = ...
non_text_features_dummies = ...

# Non-text features + engineered features
non_text_engineered_features = ...

# Non-text features + tfidf
tfidf_df = ...
non_text_plus_tfidf = ...

# Top 25 ngrams + non-text
non_text_with_ngrams = ...

In [None]:
dataframes = [...]

featurization_technique = [...]

# Hint: Is there something we need to do to y to prepare it for classification?
y = ...

In [None]:
for dataframe, featurization in zip(dataframes, featurization_technique):
   
    # The code to plot a confusion matrix is provided in the for loop - fill in the code you would need to create the confusion matrix before this
    ...

    df_cm = df_cm.rename(index=str, columns={0: "Checking or savings account", 1: "Student loan"})
    df_cm.index = ["Checking or savings account", "Student loan"]
    plt.figure(figsize = (10,7))
    sns.set(font_scale=1.4)  # for label size
    sns.heatmap(df_cm,
                annot=True,
                annot_kws={"size": 16},
                fmt='g')

    plt.title(featurization)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()
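
When reading these plots, it helps to know how scikit-learn lays out a confusion matrix. The sketch below uses six made-up labels, not results from the complaint data, and assumes 0 stands for "Checking or savings account" and 1 for "Student loan".

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = checking or savings account, 1 = student loan.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)
toy_cm = pd.DataFrame(cm, index=["true 0", "true 1"], columns=["pred 0", "pred 1"])

print(toy_cm)
```

The diagonal holds the correct predictions, and the off-diagonal cells hold the two kinds of mistakes.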

## Discussion

In general, combining text with non-text features will improve a classifier's performance. However, this isn't automatic: in some cases, adding more features can actually degrade a classifier's performance. In this case, our engineered features were too quick to predict "checking or savings account", and tf-idf alone outperformed tf-idf + non-text features. However, non-text features + n-grams tied with tf-idf alone! We might prefer the former approach because it is computationally cheaper and likely easier to explain.

---
Notebook developed by Aniket Kesari