***Classification of Consumer Complaints***

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first.

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link

At the end of the project, your team should prepare a short presentation that addresses the following:

* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

Bonus: A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [18]:
import pandas as pd
import numpy as np
import time

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import shap

In [2]:
complaints = pd.read_csv('data/complaints.csv')

In [3]:
complaints=complaints.rename(columns={'Consumer complaint narrative':'narrative', 'Issue':'issue'})

In [4]:
complaints.head()

Unnamed: 0,narrative,issue
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam
2,I have a particular account that is stating th...,Incorrect information on your report
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report


In [5]:
complaints['issue'].value_counts()

issue
Incorrect information on your report    229305
Attempts to collect debt not owed        73163
Communication tactics                    21243
Struggling to pay mortgage               17374
Fraud or scam                            12347
Name: count, dtype: int64
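The counts above show a heavily imbalanced target. As a rough reference point, the sketch below recomputes each class's share from those counts and the accuracy a model would reach by always predicting the most common issue. The `counts` Series is typed in by hand from the output above; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({
    'Incorrect information on your report': 229305,
    'Attempts to collect debt not owed': 73163,
    'Communication tactics': 21243,
    'Struggling to pay mortgage': 17374,
    'Fraud or scam': 12347,
})

# Share of each issue type in the full dataset
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the largest class
majority_baseline = shares.max()

print(shares.round(3))
print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
```

Any trained model should be judged against this baseline rather than against 0, and per-class metrics matter more than overall accuracy for the smaller classes.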

In [6]:
# Encode each issue type as an integer label (1-5)
issue_to_category = {
    'Incorrect information on your report': 1,
    'Attempts to collect debt not owed': 2,
    'Communication tactics': 3,
    'Struggling to pay mortgage': 4,
    'Fraud or scam': 5,
}

# Any issue not listed above falls back to 5, matching the original default
complaints['category'] = complaints['issue'].map(issue_to_category).fillna(5).astype(int)

complaints.head()

Unnamed: 0,narrative,issue,category
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report,1
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam,5
2,I have a particular account that is stating th...,Incorrect information on your report,1
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed,2
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report,1


In [7]:
X = complaints[['narrative']]
y = complaints['category']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)
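The split above passes `stratify = y` so that the rare issue types appear in the same proportion in the training and test sets. A minimal sketch on made-up labels (an 80/20 class mix, not the complaint data) illustrates the effect; `toy_X` and `toy_y` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 examples of class 0 and 20 of class 1
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

# Default test_size is 0.25, as in the split above
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=321, stratify=toy_y
)

# With stratification, both subsets keep the 20% share of class 1
train_share = toy_y_train.mean()
test_share = toy_y_test.mean()
print(len(toy_y_train), train_share)
print(len(toy_y_test), test_share)
```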

In [8]:
countvect = CountVectorizer(stop_words = 'english')
tfidf = TfidfVectorizer(stop_words = 'english')
clf = MultinomialNB()

pipe = Pipeline([("vect", countvect), ("clf", clf)])

param_grid = [{
    'vect': [countvect],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1,2,5,10]
},
              {
    'vect': [tfidf],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1,2,5,10]
}]

In [9]:
start = time.time()
gs = GridSearchCV(estimator = pipe, param_grid = param_grid, verbose = 3, n_jobs = 2, cv=3)
gs.fit(X_train['narrative'], y_train)
end = time.time()

print(f'The SearchCV fit took {end - start} seconds to run')

dump(gs, "data/cv_01.joblib")

Fitting 3 folds for each of 24 candidates, totalling 72 fits
The SearchCV fit took 3098.6729254722595 seconds to run


['data/cv_01.joblib']
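The "24 candidates, totalling 72 fits" line in the log follows from the grid: 2 vectorizers × 3 n-gram ranges × 4 `min_df` values, each fit on 3 folds. A small sketch using scikit-learn's `ParameterGrid` reproduces that count; the vectorizers are replaced with placeholder strings here, since only the number of combinations matters.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as param_grid above, with strings standing in for the vectorizer objects
toy_grid = [
    {
        'vect': ['count'],
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vect__min_df': [1, 2, 5, 10],
    },
    {
        'vect': ['tfidf'],
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vect__min_df': [1, 2, 5, 10],
    },
]

n_candidates = len(ParameterGrid(toy_grid))
n_folds = 3
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```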

In [10]:
gs = load("data/cv_01.joblib")

In [20]:
y_pred = gs.best_estimator_.predict(X_test['narrative'])

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('1 = Incorrect information on your report')
print('2 = Attempts to collect debt not owed')
print('3 = Communication tactics')
print('4 = Struggling to pay mortgage')
print('5 = Fraud or scam')

Accuracy: 0.8599334525453269
[[51972  4376    45   836    97]
 [ 3803 13813   413   177    85]
 [  208  1536  3428   121    18]
 [   77    47    17  4199     3]
 [  239   223    10    45  2570]]
              precision    recall  f1-score   support

           1       0.92      0.91      0.91     57326
           2       0.69      0.76      0.72     18291
           3       0.88      0.65      0.74      5311
           4       0.78      0.97      0.86      4343
           5       0.93      0.83      0.88      3087

    accuracy                           0.86     88358
   macro avg       0.84      0.82      0.82     88358
weighted avg       0.87      0.86      0.86     88358

1 = Incorrect information on your report
2 = Attempts to collect debt not owed
3 = Communication tactics
4 = Struggling to pay mortgage
5 = Fraud or scam
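The per-class precision and recall in the report can be recovered directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below re-enters the matrix printed above by hand and also picks out the largest off-diagonal cell, i.e. the pair of issues the model confuses most often. Variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = np.array([
    [51972,  4376,   45,  836,   97],
    [ 3803, 13813,  413,  177,   85],
    [  208,  1536, 3428,  121,   18],
    [   77,    47,   17, 4199,    3],
    [  239,   223,   10,   45, 2570],
])
labels = [1, 2, 3, 4, 5]

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.diagonal().sum() / cm.sum()

# Zero the diagonal to find the most frequent misclassification
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_idx, pred_idx = np.unravel_index(np.argmax(off_diag), off_diag.shape)

print(f'Accuracy: {accuracy:.4f}')
print('Recall:   ', recall.round(2))
print('Precision:', precision.round(2))
print(f'Most common confusion: true {labels[true_idx]} predicted as {labels[pred_idx]} '
      f'({off_diag[true_idx, pred_idx]} cases)')
```

Most errors sit between categories 1 and 2 in both directions, and category 3 is most often misread as category 2, which is consistent with its low recall in the report.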


In [12]:
true_predictions=y_test==y_pred
false_predictions=y_test!=y_pred

print('Number of true predictions:', np.sum(true_predictions))
print('Number of false predictions:', np.sum(false_predictions))

Number of true predictions: 75982
Number of false predictions: 12376


In [13]:
gs.best_estimator_

In [14]:
gs.cv_results_

{'mean_fit_time': array([ 25.73365831,  56.93866404, 109.16719619,  25.97839467,
         47.55933364,  82.09299707,  26.38311195,  50.85412439,
         78.34353614,  28.57733226,  55.32667184,  77.11768762,
         25.66332857,  59.98081748, 121.49669592,  27.14179699,
         53.36714824,  91.75572832,  27.77487214,  52.70489375,
         83.43306629,  26.17602539,  53.21849759,  77.07787704]),
 'std_fit_time': array([ 1.2988156 ,  0.05600926,  3.17649889,  1.73350263,  0.73601773,
         2.09475262,  0.40586041,  0.43967035,  1.48870247,  3.61767182,
        11.27289565,  0.93642913,  1.21720076,  0.47828642,  1.85872762,
         2.16114216,  2.72111761,  1.47505455,  1.04442195,  1.68790887,
         1.22457008,  1.07979695,  1.2892584 ,  1.9615998 ]),
 'mean_score_time': array([13.08966788, 25.44933009, 37.73933291, 13.53366661, 22.75434454,
        36.14643741, 12.88911915, 23.61010583, 33.78799884, 17.52335397,
        22.47432184, 31.42031058, 12.71233416, 27.27310475, 42

In [15]:
# One row per parameter combination, with its mean cross-validated accuracy
cv_scores_df = pd.json_normalize(gs.cv_results_['params'])
cv_scores_df['mean_test_score'] = gs.cv_results_['mean_test_score']
cv_scores_df

Unnamed: 0,vect,vect__min_df,vect__ngram_range,mean_test_score
0,"CountVectorizer(ngram_range=(1, 2), stop_words...",1,"(1, 1)",0.811592
1,"CountVectorizer(ngram_range=(1, 2), stop_words...",1,"(1, 2)",0.857361
2,"CountVectorizer(ngram_range=(1, 2), stop_words...",1,"(1, 3)",0.844213
3,"CountVectorizer(ngram_range=(1, 2), stop_words...",2,"(1, 1)",0.805156
4,"CountVectorizer(ngram_range=(1, 2), stop_words...",2,"(1, 2)",0.838283
5,"CountVectorizer(ngram_range=(1, 2), stop_words...",2,"(1, 3)",0.840931
6,"CountVectorizer(ngram_range=(1, 2), stop_words...",5,"(1, 1)",0.801833
7,"CountVectorizer(ngram_range=(1, 2), stop_words...",5,"(1, 2)",0.810928
8,"CountVectorizer(ngram_range=(1, 2), stop_words...",5,"(1, 3)",0.794986
9,"CountVectorizer(ngram_range=(1, 2), stop_words...",10,"(1, 1)",0.800456


In [16]:
# 'clf__fit_prior' is a pipeline parameter name, not a variable;
# the value lives on the fitted classifier step
type(gs.best_estimator_.named_steps['clf'].fit_prior)

In [None]:
import inspect

# getargspec was removed in Python 3.11; signature() works on the class itself
inspect.signature(MultinomialNB)

In [None]:
# Estimators expose their hyperparameters through get_params()
gs.best_estimator_.named_steps['clf'].get_params()

In [None]:
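# MultinomialNB no longer exposes coef_ in recent scikit-learn releases;
# feature_log_prob_ holds the per-class log probability of each feature
nb = gs.best_estimator_.named_steps['clf']
coef_df = pd.DataFrame({
    'word': gs.best_estimator_[:-1].get_feature_names_out(),
    'log_prob': nb.feature_log_prob_[0]  # row 0 = first class in nb.classes_ (category 1)
})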

In [None]:
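# TreeExplainer only supports tree-based models, so it cannot explain a
# Pipeline ending in MultinomialNB. One model-agnostic alternative (an
# assumption, not verified on this data) is to explain predict_proba with a
# text masker, on a small sample because this is slow.
masker = shap.maskers.Text(r"\W+")
explainer = shap.Explainer(gs.best_estimator_.predict_proba, masker)
explanation = explainer(X_test['narrative'].iloc[:20].tolist())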

In [None]:
# Feature names come from the vectorizer step; the classifier has none
gs.best_estimator_[:-1].get_feature_names_out()