# Classification of Consumer Complaints

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. 

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link 

At the end of the project, your team should prepare a short presentation that addresses the following:
* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

**Bonus:** A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [1]:
import pandas as pd
import numpy as np

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from tqdm import tqdm
import shap

In this notebook, we investigate how a count vectorizer compares to a tf-idf vectorizer.

In [2]:
complaints = pd.read_csv('../data/complaints_sentimentscore.csv', index_col=0)

In [3]:
complaints.head(2)

| Unnamed: 0 | narrative | issue | review_sentiment |
|---|---|---|---|
| 0 | My name is XXXX XXXX this complaint is not mad... | Incorrect information on your report | 0.7398 |
| 1 | I searched on XXXX for XXXXXXXX XXXX and was ... | Fraud or scam | -0.7457 |


In [4]:
from tqdm import tqdm

In [5]:
from gensim.utils import simple_preprocess

def preprocess_stuff(text):
   return simple_preprocess(text)

tqdm.pandas(desc='Preprocessing')
complaints['processed']=complaints['narrative'].progress_apply(preprocess_stuff)

Preprocessing: 100%|██████████| 353432/353432 [02:30<00:00, 2342.56it/s]


In [6]:
# First, try the count vectorizer

In [7]:
X = complaints[['processed', 'review_sentiment']]
y = complaints['issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [8]:
# simple_preprocess returns a list of tokens for each document, but the vectorizers
# expect one string per document, so join each token list back into a single string.
X_train_processed_flat = [' '.join(doc) for doc in X_train['processed']]
X_test_processed_flat = [' '.join(doc) for doc in X_test['processed']]

In [9]:
vect = CountVectorizer()

X_train_vec = vect.fit_transform(X_train_processed_flat)
X_test_vec = vect.transform(X_test_processed_flat)

In [10]:
X_train_vec

<265074x65214 sparse matrix of type '<class 'numpy.int64'>'
	with 21254891 stored elements in Compressed Sparse Row format>

In [11]:
nb = MultinomialNB().fit(X_train_vec, y_train)

y_pred = nb.predict(X_test_vec)

In [12]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7999615201792707
[[12042  2093   493  3347   316]
 [  593  4471    47   115    85]
 [   64    63  2808   111    41]
 [ 6511  1036   851 47148  1780]
 [   32    43    14    40  4214]]


Now, try the tfidf vectorizer

In [13]:
vect2 = TfidfVectorizer()

In [14]:

X_train_vec2 = vect2.fit_transform(X_train_processed_flat)
X_test_vec2 = vect2.transform(X_test_processed_flat)

In [15]:
X_train_vec2

<265074x65214 sparse matrix of type '<class 'numpy.float64'>'
	with 21254891 stored elements in Compressed Sparse Row format>
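
The tf-idf matrix above has the same shape and the same number of stored elements as the count matrix; only the dtype changed from integer to float. The sketch below illustrates why on a hypothetical three-document corpus: with default settings both vectorizers build the same vocabulary and fill the same cells, and tf-idf only reweights the values and scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "credit report error",
    "credit report dispute",
    "loan payment late",
]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Same vocabulary and same non-zero positions; only the stored values differ.
print(counts.shape, counts.nnz)
print(tfidf.shape, tfidf.nnz)

# With default settings each tf-idf row is scaled to unit L2 norm, so a long
# document does not carry larger values than a short one.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)
```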

In [16]:
# Note: this rebinds nb, so from here on nb is the model trained on tf-idf features
nb = MultinomialNB().fit(X_train_vec2, y_train)

y_pred2 = nb.predict(X_test_vec2)

In [17]:
print('tfidfvectorizer')
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))

tfidfvectorizer
Accuracy: 0.7931935987686457
[[ 8227   199     6  9840    19]
 [ 1984  2089     2  1228     8]
 [  494    12  1166  1410     5]
 [ 1316     8     2 55930    70]
 [   68     6     0  1596  2673]]


In [18]:
print('countvectorizer')
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))

countvectorizer
Accuracy: 0.7999615201792707
[[12042  2093   493  3347   316]
 [  593  4471    47   115    85]
 [   64    63  2808   111    41]
 [ 6511  1036   851 47148  1780]
 [   32    43    14    40  4214]]


In [19]:
print('tfidfvectorizer - countvectorizer')
print(f'Difference in Accuracy: {accuracy_score(y_test, y_pred2) - accuracy_score(y_test, y_pred)}')
# Each cell is a change in the number of test complaints, not a change in accuracy
print('Difference in confusion matrices (rows = true issue, columns = predicted issue):')
print(confusion_matrix(y_test, y_pred2) - confusion_matrix(y_test, y_pred))

tfidfvectorizer - countvectorizer
Difference in Accuracy: -0.0067679214106249885
Difference in confusion matrices (rows = true issue, columns = predicted issue):
[[-3815 -1894  -487  6493  -297]
 [ 1391 -2382   -45  1113   -77]
 [  430   -51 -1642  1299   -36]
 [-5195 -1028  -849  8782 -1710]
 [   36   -37   -14  1556 -1541]]
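
The raw difference above mixes classes of very different sizes, so it is hard to read as a per-category comparison. One way to put the classes on a common scale is per-class recall, the diagonal of each confusion matrix divided by its row sum. The sketch below uses the two matrices printed earlier; the helper names `per_class_recall` and `overall_accuracy` are illustrative, and classes are referred to by row index because the label order is not shown in the outputs.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class).
cm_count = np.array([
    [12042,  2093,   493,  3347,   316],
    [  593,  4471,    47,   115,    85],
    [   64,    63,  2808,   111,    41],
    [ 6511,  1036,   851, 47148,  1780],
    [   32,    43,    14,    40,  4214],
])
cm_tfidf = np.array([
    [ 8227,   199,     6,  9840,    19],
    [ 1984,  2089,     2,  1228,     8],
    [  494,    12,  1166,  1410,     5],
    [ 1316,     8,     2, 55930,    70],
    [   68,     6,     0,  1596,  2673],
])

def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return cm.diagonal() / cm.sum(axis=1)

def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return cm.trace() / cm.sum()

recall_count = per_class_recall(cm_count)
recall_tfidf = per_class_recall(cm_tfidf)

for i, (rc, rt) in enumerate(zip(recall_count, recall_tfidf)):
    print(f"class {i}: count={rc:.3f}  tfidf={rt:.3f}  diff={rt - rc:+.3f}")
```

Read this way, the tf-idf model improves recall only for the largest class (row 3) and loses recall on the other four, even though overall accuracy differs by less than one percentage point.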


In [33]:
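from sklearn.inspection import permutation_importance

# permutation_importance needs the vectorized features the model was trained on (not the
# raw X_test DataFrame), as a dense array, and a classification metric. It permutes all
# 65,214 columns one at a time, so this is slow even on a small sample of rows.
sample = np.random.default_rng(321).choice(X_test_vec2.shape[0], size=100, replace=False)
result = permutation_importance(nb,
                                X_test_vec2[sample].toarray(),
                                y_test.iloc[sample],
                                scoring='accuracy',
                                n_repeats=1,
                                random_state=321)

pd.DataFrame({
    'variable': vect2.get_feature_names_out(),
    'importance': result['importances_mean']
}).sort_values('importance', ascending=False)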

In [29]:
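# Wrap the model in a callable that maps raw strings to class probabilities.
# nb was last fit on the tf-idf features, so it must be paired with vect2.
def model_predict_proba(texts):
    return nb.predict_proba(vect2.transform(texts))

In [30]:
# A callable that takes text needs a text masker, not a list of feature names,
# and the explainer must be given raw strings rather than the vectorized matrix.
masker = shap.maskers.Text(tokenizer=r"\W+")
explainer = shap.Explainer(model_predict_proba, masker)
shap_values = explainer(X_test_processed_flat[:5])  # a few documents only; SHAP is slow
print(shap_values.values.shape)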