# Classification of Consumer Complaints

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. 

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link 

At the end of the project, your team should prepare a short presentation that addresses the following:
* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

**Bonus:** A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [2]:
import pandas as pd
import numpy as np

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

In [3]:
complaints = pd.read_csv('../data/complaints_sentimentscore.csv')

In [4]:
complaints.head(2)

Unnamed: 0.1,Unnamed: 0,narrative,issue,review_sentiment
0,0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report,0.7398
1,1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam,-0.7457


Each `issue` label needs to be mapped to a simple category code that can serve as the classification target.

In [5]:
complaints['issue'].value_counts()

issue
Incorrect information on your report    229305
Attempts to collect debt not owed        73163
Communication tactics                    21243
Struggling to pay mortgage               17374
Fraud or scam                            12347
Name: count, dtype: int64
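The counts above are heavily imbalanced, so it helps to know what accuracy a trivial model would reach before judging a trained one. The following is a minimal sketch that only reuses the numbers printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
issue_counts = {
    "Incorrect information on your report": 229305,
    "Attempts to collect debt not owed": 73163,
    "Communication tactics": 21243,
    "Struggling to pay mortgage": 17374,
    "Fraud or scam": 12347,
}

total = sum(issue_counts.values())
shares = {issue: count / total for issue, count in issue_counts.items()}

# A model that always predicts the most frequent issue reaches this accuracy.
majority_issue = max(issue_counts, key=issue_counts.get)
majority_baseline = shares[majority_issue]

for issue, share in shares.items():
    print(f"{issue:<40}{share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

The largest class makes up roughly 65% of the rows, so any reported accuracy should be read against that floor rather than against 20%.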

In [11]:
def categorize_issues(issue):
    """Map an issue label to a category code ('1' through '5')."""
    if 'Incorrect information on your report' in issue:
        return '1'
    elif 'Attempts to collect debt not owed' in issue:
        return '2'
    elif 'Communication tactics' in issue:
        return '3'
    elif 'Struggling to pay mortgage' in issue:
        return '4'
    else:
        # Any remaining label; in this dataset that is only 'Fraud or scam'
        return '5'

complaints['category'] = complaints['issue'].apply(categorize_issues)
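The same labeling could also be written as a dictionary lookup. This is a sketch on a hypothetical toy frame (`toy` and `issue_to_category` are illustrative names), and it assumes the labels match the strings in the `value_counts()` output exactly, whereas the function above uses substring matching.

```python
import pandas as pd

# Hypothetical toy frame standing in for `complaints`; only 'issue' matters here.
toy = pd.DataFrame({
    "issue": [
        "Incorrect information on your report",
        "Fraud or scam",
        "Communication tactics",
        "Some unexpected issue",  # not one of the five known labels
    ]
})

issue_to_category = {
    "Incorrect information on your report": "1",
    "Attempts to collect debt not owed": "2",
    "Communication tactics": "3",
    "Struggling to pay mortgage": "4",
    "Fraud or scam": "5",
}

# map() leaves unknown labels as NaN instead of assigning them a category.
toy["category"] = toy["issue"].map(issue_to_category)
unmapped = int(toy["category"].isna().sum())
print(toy)
print("Unmapped rows:", unmapped)
```

One difference in behavior is worth noting: an unexpected label shows up as a missing value here, while the `else` branch of an if/elif chain would quietly assign it to category `'5'`. That matters if the larger bonus dataset is loaded with the same code.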

In [12]:
complaints.head()

Unnamed: 0.1,Unnamed: 0,narrative,issue,review_sentiment,category
0,0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report,0.7398,1
1,1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam,-0.7457,5
2,2,I have a particular account that is stating th...,Incorrect information on your report,0.6808,1
3,3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed,-0.973,2
4,4,Hello i'm writing regarding account on my cred...,Incorrect information on your report,0.5944,1


In [13]:
X = complaints[['narrative', 'review_sentiment']]
y = complaints['category']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [14]:
vect = CountVectorizer()
clf = MultinomialNB()

pipe = Pipeline([("vect", vect), ("clf", clf)])

param_grid = {
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10, 20],
    'clf__fit_prior':[True, False]  # fit_prior is a boolean flag, not an integer
}
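To compare a count vectorizer with a tf-idf vectorizer inside the same search, the `vect` step itself can be treated as a hyperparameter. The sketch below runs on a tiny hypothetical corpus (the `texts` and `labels` values are made up for illustration) and uses only two folds so that it finishes instantly; it is not the search used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only here to make the sketch runnable.
texts = [
    "incorrect information on my credit report",
    "credit report shows an account that is not mine",
    "wrong balance listed on my credit report",
    "report contains incorrect personal information",
    "debt collector keeps calling me at work",
    "collector is trying to collect a debt i do not owe",
    "i never owed this debt to the collector",
    "collection agency demands payment for old debt",
]
labels = ["1"] * 4 + ["2"] * 4

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Listing estimators under the step name swaps the whole step per candidate.
param_distributions = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2)],
}

search = RandomizedSearchCV(
    toy_pipe, param_distributions, n_iter=4, cv=2, random_state=321
)
search.fit(texts, labels)

best = search.best_params_
print(type(best["vect"]).__name__, best["vect__ngram_range"])
```

After fitting, `search.cv_results_` holds one row per candidate, so the mean test scores of the count-based and tf-idf-based candidates can be compared side by side.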

In [None]:
rs = RandomizedSearchCV(estimator = pipe, param_distributions = param_grid, verbose = 2, n_jobs = -1)
rs.fit(X_train['narrative'], y_train)

dump(rs, "../data/cv_01.joblib")

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [None]:
rs = load("../data/cv_01.joblib")

In [None]:
y_pred = rs.best_estimator_.predict(X_test['narrative'])

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))
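A bare confusion matrix is hard to read without row and column labels. As a sketch, the matrix can be wrapped in a DataFrame and paired with per-class precision and recall. The label lists below are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels standing in for y_test and y_pred.
y_true = ["1", "1", "1", "2", "2", "3", "4", "5", "5", "1"]
y_hat = ["1", "1", "2", "2", "1", "3", "4", "5", "1", "1"]

labels = ["1", "2", "3", "4", "5"]

# Rows are true categories, columns are predicted categories.
cm = pd.DataFrame(
    confusion_matrix(y_true, y_hat, labels=labels),
    index=[f"true_{label}" for label in labels],
    columns=[f"pred_{label}" for label in labels],
)
print(cm)

# Per-class precision, recall, and F1 show where the model struggles.
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```

Off-diagonal cells with large counts point to the pairs of issues that get mixed up most often, which is one of the questions the presentation asks about.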

In [10]:
true_predictions = y_test == y_pred
false_predictions = y_test != y_pred

print('Number of true predictions:', np.sum(true_predictions))
print('Number of false predictions:', np.sum(false_predictions))

Number of true predictions: 70589
Number of false predictions: 17769
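These two counts are enough to recover the accuracy and to check the size of the test set. The short calculation below reuses only numbers printed in this notebook; `train_test_split` holds out 25% of the rows when no `test_size` is given.

```python
# Counts copied from the cell output above.
n_true = 70589
n_false = 17769

n_test = n_true + n_false
accuracy = n_true / n_test

# Total rows, summed from the value_counts() output earlier in the notebook.
n_total = 229305 + 73163 + 21243 + 17374 + 12347

print(f"Test rows: {n_test} ({n_test / n_total:.0%} of {n_total})")
print(f"Accuracy: {accuracy:.4f}")
```

The result is an accuracy of roughly 0.80 on a test set that is exactly one quarter of the data.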
