# Classification of Consumer Complaints

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. 

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link 

At the end of the project, your team should prepare a short presentation where you talk about the following:
* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

**Bonus:** A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [20]:
import pandas as pd
import numpy as np

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

In [21]:
complaints = pd.read_csv('../data/complaints_sentimentscore.csv')

In [22]:
complaints.head(2)

| Unnamed: 0.1 | Unnamed: 0 | narrative | issue | review_sentiment |
|---|---|---|---|---|
| 0 | 0 | My name is XXXX XXXX this complaint is not mad... | Incorrect information on your report | 0.7398 |
| 1 | 1 | I searched on XXXX for XXXXXXXX XXXX and was ... | Fraud or scam | -0.7457 |


In [27]:
# Keeping it simple for now: unigrams only. The next notebook will look at bigrams and trigrams.

In [23]:
X = complaints[['narrative', 'review_sentiment']]
y = complaints['issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [24]:
vect = CountVectorizer()

X_train_vec = vect.fit_transform(X_train['narrative'])
X_test_vec = vect.transform(X_test['narrative'])

In [25]:
X_train_vec

<265074x72222 sparse matrix of type '<class 'numpy.int64'>'
	with 21821225 stored elements in Compressed Sparse Row format>

In [26]:
vect = CountVectorizer()
clf = MultinomialNB()

pipe = Pipeline([("vect", vect), ("clf", clf)])

# Search space for tuning the pipeline; not yet passed to a search in this notebook
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__min_df': [1, 2, 5, 10, 20],
    'clf__fit_prior': [False, True]
}
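The notebook imports `RandomizedSearchCV` but never runs a search. A minimal sketch of how a grid like the one above could be used follows. The toy corpus, its two short labels, the reduced `min_df` values, and the settings `n_iter=5` and `cv=2` are all assumptions chosen so the example runs on its own; they are not taken from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus (assumption); the notebook would use X_train['narrative'] and y_train.
toy_texts = [
    "incorrect information on my credit report",
    "the report lists an account that is not mine",
    "my credit report shows a wrong balance",
    "credit bureau reported incorrect late payment",
    "i was scammed by a fake seller online",
    "fraud charge from a scam company",
    "a scam caller took money from my account",
    "fake company committed fraud with my card",
]
toy_labels = ["report"] * 4 + ["fraud"] * 4

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Same parameter names as the notebook's grid; min_df is kept small
# because the toy corpus has only eight documents.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__min_df": [1, 2],
    "clf__fit_prior": [False, True],
}

# Sample 5 of the 12 possible combinations and score each with 2-fold CV.
search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,
    cv=2,
    scoring="accuracy",
    random_state=321,
)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the full dataset, trigram vocabularies get large, so a small `n_iter` and a higher `min_df` are probably a sensible place to start.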

In [7]:
nb = MultinomialNB().fit(X_train_vec, y_train)

y_pred = nb.predict(X_test_vec)

In [8]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7988976663120487
[[12086  2039   500  3343   323]
 [  587  4476    51   112    85]
 [   67    55  2813   110    42]
 [ 6610  1046   834 46997  1839]
 [   36    40    12    38  4217]]
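A raw confusion matrix is hard to read without class names. The sketch below labels the rows (true issue) and columns (predicted issue) and prints per-class precision and recall. The label lists are made up for illustration; only the two issue names seen in `complaints.head(2)` are used, and in the notebook `y_test` and `y_pred` would take their place.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

FRAUD = "Fraud or scam"
REPORT = "Incorrect information on your report"

# Made-up true labels and predictions (assumption), standing in for y_test / y_pred.
y_true_toy = [FRAUD, FRAUD, FRAUD, REPORT, REPORT, REPORT, REPORT]
y_pred_toy = [FRAUD, FRAUD, REPORT, REPORT, REPORT, REPORT, FRAUD]

labels = sorted(set(y_true_toy))

# Rows are true classes and columns are predicted classes, as in scikit-learn.
cm = pd.DataFrame(
    confusion_matrix(y_true_toy, y_pred_toy, labels=labels),
    index=[f"true: {label}" for label in labels],
    columns=[f"pred: {label}" for label in labels],
)
print(cm)

# Per-class precision, recall and F1 show which issues get mixed up.
print(classification_report(y_true_toy, y_pred_toy, labels=labels, zero_division=0))
```

Large off-diagonal cells in the labeled matrix point to the pairs of issues a model confuses most often.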


In [9]:
# The above is interesting, but not really what's needed. Probably need to one-hot encode and then look for true/false?

In [10]:
true_predictions = y_test == y_pred
false_predictions = y_test != y_pred

print('Number of true predictions:', np.sum(true_predictions))
print('Number of false predictions:', np.sum(false_predictions))

Number of true predictions: 70589
Number of false predictions: 17769


In [11]:
X = complaints[['review_sentiment']]
y = complaints['issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

`MultinomialNB` cannot accept negative feature values, and the sentiment scores can be negative (for example, -0.7457 above), so we switch to a random forest.
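An alternative to switching models, not tried in this notebook, would be to rescale the sentiment score to a non-negative range and feed it to naive Bayes alongside the word counts. The sketch below uses `ColumnTransformer` with `MinMaxScaler` on a made-up DataFrame. The rows, sentiment values, and short issue labels are assumptions. Treating a rescaled score like a count is also a modeling stretch, so this is only one option to evaluate.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up rows (assumption) with the same column names as the notebook's data.
toy = pd.DataFrame(
    {
        "narrative": [
            "credit report shows an account that is not mine",
            "incorrect balance on my credit report",
            "the report lists a wrong late payment",
            "a scam caller took money from my account",
            "fake seller committed fraud with my card",
            "i lost money to an online scam",
        ],
        "review_sentiment": [0.74, 0.10, -0.20, -0.75, -0.60, -0.90],
        "issue": ["report", "report", "report", "fraud", "fraud", "fraud"],
    }
)
feature_cols = ["narrative", "review_sentiment"]

features = ColumnTransformer(
    [
        # A string column name passes a 1-D series, which CountVectorizer expects.
        ("text", CountVectorizer(), "narrative"),
        # A list of names passes a 2-D frame; clip=True keeps unseen values in [0, 1].
        ("sent", MinMaxScaler(clip=True), ["review_sentiment"]),
    ]
)
combined = Pipeline([("features", features), ("clf", MultinomialNB())])

combined.fit(toy[feature_cols], toy["issue"])
preds = combined.predict(toy[feature_cols])

# Confirm that no negative values reach the classifier.
feature_matrix = combined.named_steps["features"].transform(toy[feature_cols])
print(feature_matrix.min())
print(preds)
```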

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
model = RandomForestClassifier(n_estimators=10, random_state=42)

In [15]:
model.fit(X_train, y_train)


In [16]:
y_pred = model.predict(X_test)

In [17]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.639760972407705
[[ 2586   233    85 15210   177]
 [  613   117    25  4494    62]
 [  334    52    24  2625    52]
 [ 2752   321   174 53730   349]
 [  530    76    44  3622    71]]


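Overall, the sentiment score alone does not perform as well as unigrams: accuracy is about 64%, against about 80% for the unigram model. This is weaker than it looks. Most predictions land in the fourth class, which accounts for 57,326 of the 88,358 test complaints (about 65%), so the model does no better than always predicting the most common issue.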
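To put the accuracy figure in context, the numbers can be recomputed from the confusion matrix printed above. The sketch below copies that matrix and derives the accuracy, the share of the largest class, and the per-class recall. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the random forest output above.
cm = np.array(
    [
        [2586, 233, 85, 15210, 177],
        [613, 117, 25, 4494, 62],
        [334, 52, 24, 2625, 52],
        [2752, 321, 174, 53730, 349],
        [530, 76, 44, 3622, 71],
    ]
)

total = cm.sum()
true_counts = cm.sum(axis=1)  # number of test complaints in each true class

accuracy = np.trace(cm) / total              # correct predictions lie on the diagonal
majority_share = true_counts.max() / total   # accuracy of always predicting the largest class
recall_per_class = np.diag(cm) / true_counts

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_share:.4f}")
print("recall per class: ", np.round(recall_per_class, 4))
```

On the real split, `sklearn.dummy.DummyClassifier(strategy="most_frequent")` fitted on `X_train, y_train` should give the same baseline without copying numbers by hand.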

In [18]:
true_predictions = y_test == y_pred
false_predictions = y_test != y_pred

print('Number of true predictions:', np.sum(true_predictions))
print('Number of false predictions:', np.sum(false_predictions))

Number of true predictions: 56528
Number of false predictions: 31830


This notebook focused on predicting the issue type from unigram counts stored in a sparse matrix. We also tried the sentiment score as a lone predictor. We did not tune any of the models. In the next notebook, we will look at other text features, such as bigrams and trigrams, and perhaps tokenizing by sentence.
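As a starting point for that follow-up, the comparison between count and TF-IDF vectorizers and between n-gram ranges could be set up as a small loop. The sketch below is one possible layout on a made-up corpus. The texts, short labels, and `cv=2` are assumptions, and scores on eight documents say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus (assumption); the real run would use the complaint narratives.
toy_texts = [
    "incorrect information on my credit report",
    "the report lists an account that is not mine",
    "my credit report shows a wrong balance",
    "credit bureau reported incorrect late payment",
    "i was scammed by a fake seller online",
    "fraud charge from a scam company",
    "a scam caller took money from my account",
    "fake company committed fraud with my card",
]
toy_labels = ["report"] * 4 + ["fraud"] * 4

results = {}
for name, vectorizer_cls in [("count", CountVectorizer), ("tfidf", TfidfVectorizer)]:
    for ngram_range in [(1, 1), (1, 2), (1, 3)]:
        # Vectorizer and classifier share one pipeline so each CV fold
        # builds its vocabulary from that fold's training data only.
        pipe = make_pipeline(vectorizer_cls(ngram_range=ngram_range), MultinomialNB())
        scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2, scoring="accuracy")
        results[(name, ngram_range)] = scores.mean()

for (name, ngram_range), mean_accuracy in results.items():
    print(f"{name:5s} ngram_range={ngram_range}: {mean_accuracy:.3f}")
```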