***Classification of Consumer Complaints***

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first.

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link

At the end of the project, your team should prepare a short presentation that addresses the following:

* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

Bonus: A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [68]:
import pandas as pd
import numpy as np
import time

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import shap

In [2]:
complaints = pd.read_csv('data/complaints.csv')

In [3]:
complaints=complaints.rename(columns={'Consumer complaint narrative':'narrative', 'Issue':'issue'})

In [4]:
complaints.head()

Unnamed: 0,narrative,issue
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam
2,I have a particular account that is stating th...,Incorrect information on your report
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report


In [5]:
complaints['issue'].value_counts()

issue
Incorrect information on your report    229305
Attempts to collect debt not owed        73163
Communication tactics                    21243
Struggling to pay mortgage               17374
Fraud or scam                            12347
Name: count, dtype: int64
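
The classes are heavily imbalanced, so it helps to know what accuracy a trivial "always predict the largest class" model would reach. The sketch below computes that baseline from the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
issue_counts = {
    "Incorrect information on your report": 229305,
    "Attempts to collect debt not owed": 73163,
    "Communication tactics": 21243,
    "Struggling to pay mortgage": 17374,
    "Fraud or scam": 12347,
}

total = sum(issue_counts.values())
majority_issue = max(issue_counts, key=issue_counts.get)
baseline_accuracy = issue_counts[majority_issue] / total

# Share of each issue type in the dataset
for issue, count in issue_counts.items():
    print(f"{issue:<40} {count / total:.3f}")

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model accuracy should be read against this baseline of roughly 0.65 rather than against 0.2.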

In [6]:
# Map each issue to an integer category code; anything unlisted falls back to 5
issue_to_category = {
    'Incorrect information on your report': 1,
    'Attempts to collect debt not owed': 2,
    'Communication tactics': 3,
    'Struggling to pay mortgage': 4,
    'Fraud or scam': 5,
}
complaints['category'] = complaints['issue'].map(issue_to_category).fillna(5).astype(int)

complaints.head()

Unnamed: 0,narrative,issue,category
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report,1
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam,5
2,I have a particular account that is stating th...,Incorrect information on your report,1
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed,2
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report,1


In [7]:
X = complaints[['narrative']]
y = complaints['category']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [8]:
vect = TfidfVectorizer(stop_words = 'english')
clf = MultinomialNB()

pipe = Pipeline([("vect", vect), ("clf", clf)])

param_grid = {
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1,2,5,10]
}

In [9]:
start = time.time()
rs = GridSearchCV(estimator = pipe, param_grid = param_grid, verbose = 3, n_jobs = 2, cv=3)
rs.fit(X_train['narrative'], y_train)
end = time.time()

print(f'The SearchCV fit took {end - start} seconds to run')

dump(rs, "data/cv_01.joblib")

Fitting 3 folds for each of 12 candidates, totalling 36 fits
The SearchCV fit took 1566.5057690143585 seconds to run


['data/cv_01.joblib']

In [10]:
rs = load("data/cv_01.joblib")
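
Since the search above took about 26 minutes, one option is to wrap the dump/load pair in a small "load if cached, otherwise fit" helper. The sketch below is one way to do that; `load_or_fit` is a hypothetical helper, not part of joblib, and the demonstration uses a stand-in object rather than a real `GridSearchCV`.

```python
import tempfile
from pathlib import Path

from joblib import dump, load


def load_or_fit(path, fit_fn):
    """Hypothetical helper: return the object cached at `path`, or build it
    with `fit_fn()` and cache it if the file does not exist yet."""
    path = Path(path)
    if path.exists():
        return load(path)
    fitted = fit_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    dump(fitted, path)
    return fitted


# Demonstration with a stand-in object instead of a fitted search.
calls = []


def expensive_fit():
    calls.append(1)  # record that the slow path actually ran
    return {"stand_in": "fitted search object"}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "cv_demo.joblib"
    first = load_or_fit(cache_path, expensive_fit)   # fits and caches
    second = load_or_fit(cache_path, expensive_fit)  # loads from disk

print(len(calls))
```

In this notebook, `fit_fn` could be a function that builds the `GridSearchCV`, calls `fit`, and returns it, with `"data/cv_01.joblib"` as the path.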

In [69]:
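y_pred = rs.best_estimator_.predict(X_test['narrative'])
# predict_proba columns follow classes_ ([1, 2, 3, 4, 5]), so column index 1 is the probability of category 2
y_pred_proba = rs.best_estimator_.predict_proba(X_test['narrative'])[:,1]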

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('1 = Incorrect information on your report')
print('2 = Attempts to collect debt not owed')
print('3 = Communication tactics')
print('4 = Struggling to pay mortgage')
print('5 = Fraud or scam')

Accuracy: 0.8615631861291564
[[53957  2953    23   376    17]
 [ 4999 12834   343    78    37]
 [  323  1805  3143    36     4]
 [  249    90    28  3974     2]
 [  471   364    15    19  2218]]
              precision    recall  f1-score   support

           1       0.90      0.94      0.92     57326
           2       0.71      0.70      0.71     18291
           3       0.88      0.59      0.71      5311
           4       0.89      0.92      0.90      4343
           5       0.97      0.72      0.83      3087

    accuracy                           0.86     88358
   macro avg       0.87      0.77      0.81     88358
weighted avg       0.86      0.86      0.86     88358

1 = Incorrect information on your report
2 = Attempts to collect debt not owed
3 = Communication tactics
4 = Struggling to pay mortgage
5 = Fraud or scam
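
To see which issues the model mixes up most often, the confusion matrix can be inspected directly. The sketch below copies the matrix printed above (rows are true categories, columns are predicted categories) and extracts per-class recall, per-class precision, and the largest off-diagonal cells. Variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([
    [53957,  2953,   23,  376,   17],
    [ 4999, 12834,  343,   78,   37],
    [  323,  1805, 3143,   36,    4],
    [  249,    90,   28, 3974,    2],
    [  471,   364,   15,   19, 2218],
])
labels = [1, 2, 3, 4, 5]

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all rows of that true class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all predictions of that class
accuracy = cm.diagonal().sum() / cm.sum()

# Zero the diagonal so only misclassifications remain, then take the 3 largest cells.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
order = np.argsort(off_diag, axis=None)[::-1][:3]
rows, cols = np.unravel_index(order, cm.shape)
top_confusions = [(labels[i], labels[j], int(off_diag[i, j])) for i, j in zip(rows, cols)]

print(np.round(recall, 2), np.round(precision, 2), round(accuracy, 4))
for true_label, pred_label, n in top_confusions:
    print(f"true {true_label} predicted as {pred_label}: {n}")
```

The three largest cells are category 2 predicted as 1 (4999), category 1 predicted as 2 (2953), and category 3 predicted as 2 (1805), so most errors involve "Attempts to collect debt not owed".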


In [12]:
true_predictions=y_test==y_pred
false_predictions=y_test!=y_pred

print('Number of true predictions:', np.sum(true_predictions))
print('Number of false predictions:', np.sum(false_predictions))

Number of true predictions: 76126
Number of false predictions: 12232


In [13]:
rs.best_estimator_

In [15]:
rs.cv_results_

{'mean_fit_time': array([ 24.96067278,  58.78867841, 124.05309931,  32.02412168,
         50.30268002,  86.57966606,  23.79201206,  45.67601275,
         85.29503163,  23.40265354,  45.44600304,  75.67867692]),
 'std_fit_time': array([ 1.98098583,  1.9403363 , 17.24913024,  4.92937963,  0.92347622,
         3.32212081,  0.2629742 ,  0.15373732,  8.87646554,  0.16950116,
         1.06644513,  2.24892293]),
 'mean_score_time': array([12.55262891, 26.04866362, 43.93234428, 17.75483592, 23.42732048,
        35.253666  , 11.86332258, 22.04633371, 36.34690881, 11.64669236,
        21.23863975, 30.42268149]),
 'std_score_time': array([0.15054709, 1.08445916, 1.51584871, 4.99768096, 0.23795024,
        0.41405072, 0.06540274, 1.04633123, 6.74740291, 0.05464424,
        0.54343343, 2.44160571]),
 'param_vect__min_df': masked_array(data=[1, 1, 1, 2, 2, 2, 5, 5, 5, 10, 10, 10],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, Fal

In [28]:
tfidf_df = pd.DataFrame({
    'params': rs.cv_results_['params'],
    #'param_ngram_range': rs.cv_results_['params']['vect__ngram_range'],
    'mean_test_score': rs.cv_results_['mean_test_score']
})
tfidf_df = pd.json_normalize(tfidf_df['params']).join(tfidf_df['mean_test_score'])
tfidf_df

Unnamed: 0,vect__min_df,vect__ngram_range,mean_test_score
0,1,"(1, 1)",0.790655
1,1,"(1, 2)",0.675219
2,1,"(1, 3)",0.669851
3,2,"(1, 1)",0.828637
4,2,"(1, 2)",0.724843
5,2,"(1, 3)",0.706554
6,5,"(1, 1)",0.843214
7,5,"(1, 2)",0.82592
8,5,"(1, 3)",0.812505
9,10,"(1, 1)",0.846069
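
The scores are easier to compare when laid out as a grid of `min_df` against `ngram_range`. The sketch below pivots the rows visible in the table above; the display shows only the first ten of the twelve candidates, so the two missing cells are left as NaN rather than guessed. Column and variable names are illustrative.

```python
import pandas as pd

# Rows copied from the displayed table above (ten of twelve candidates).
shown = pd.DataFrame({
    "min_df": [1, 1, 1, 2, 2, 2, 5, 5, 5, 10],
    "ngram_range": ["(1, 1)", "(1, 2)", "(1, 3)"] * 3 + ["(1, 1)"],
    "mean_test_score": [
        0.790655, 0.675219, 0.669851,
        0.828637, 0.724843, 0.706554,
        0.843214, 0.825920, 0.812505,
        0.846069,
    ],
})

# One row per min_df, one column per n-gram range
score_grid = shown.pivot(index="min_df", columns="ngram_range", values="mean_test_score")
print(score_grid)
```

Among the visible rows, the unigram score rises as `min_df` increases, and at each fully shown `min_df` the score drops as longer n-grams are added.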


In [33]:
rs.best_estimator_['vect'].get_feature_names_out()

array(['00', '00 00', '00 000', ..., 'zoom', 'zwicker',
       'zwicker associates'], dtype=object)

In [56]:
def find_influential_tokens(narrative, target='negative'):
    # NOTE: 'negative'/'positive' are leftovers from a binary setup; here they select
    # predict_proba column 0 (category 1) or column 1 (category 2), respectively.
    target_idx = ['negative', 'positive'].index(target)
    
    # Vectorize the narrative and record the original predicted probability
    X_target = rs.best_estimator_['vect'].transform(pd.Series([narrative]))
    orig_prob = rs.best_estimator_['clf'].predict_proba(X_target)[:,target_idx][0]
    nonzero_idx = np.nonzero(X_target.toarray()[0])[0]
    
    # Build one copy of the vector per token present, each with that single token zeroed out
    variants = np.repeat(X_target[0].toarray(), len(nonzero_idx), axis = 0)
    for i, j in enumerate(nonzero_idx):
        variants[i,j] = 0

    # Make a DataFrame containing all tokens in the vocabulary
    explain = pd.DataFrame({
        'token': rs.best_estimator_['vect'].get_feature_names_out()
    })
    
    # Keep only those corresponding to the tokens present in the narrative of interest
    explain = explain.loc[nonzero_idx]
    
    # Find the predicted probability when removing each token and the difference between it and the original predicted probability
    explain['delta_prob'] = orig_prob - rs.best_estimator_['clf'].predict_proba(variants)[:,target_idx]
    
    # Find the most influential values
    return explain.sort_values('delta_prob', ascending=False).head(10)

In [57]:
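# Work on a copy so that adding columns does not trigger a SettingWithCopyWarning
X_test = X_test.copy()
X_test['label'] = y_test
X_test['prediction'] = y_pred
X_test['probability'] = y_pred_proba  # probability of category 2 (predict_proba column 1)

# Despite the name, this holds every misclassified row, sorted by that probability
false_negatives = (
    X_test
    .loc[~(X_test['label'] == X_test['prediction'])]
    .sort_values('probability')
    .index
)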

for idx in false_negatives[:5]:
    print(X_test.loc[idx, 'narrative'])
    print('\n ---------')

It appears that my credit with you has been compromised. I was going through my records & noticed many files which do not belong to me. Since Im a stickler for research, I found that under section 605b of the FCRA you are required by law to remove & block any account which is found to be opened due to identity theft. The dispute items do not belong to me. Im attaching the required FTC Report # XXXX for you and the bank 's records ( learned through more research both parties require ). Please block/remove these files. If you feel there is a possibility these accounts belong to me I will require all documentation that bears my signature ( another research item I found that requires you to verify with 100 % accuracy that each account is 100 % true, accurate, correct, complete & VERIFIABLE ). This is not a duplicate nor is this complaint being filed by a third party, I am filing this complaint myself. Please see this complaint is processed to the letter of the law. 
If you do not provide a

In [58]:
X_test.head()

Unnamed: 0,narrative,label,prediction,probability
168687,I was advised to contact you all regarding ite...,1,1,0.00087
24442,On XX/XX/XXXX I recieved phone call on my work...,3,2,0.545519
219970,"According to the Fair Credit Reporting Act, Se...",1,1,6.1e-05
199559,the bill was for {$16.00} to XXXX XXXX and it...,2,2,0.917842
208047,I was working on building my credit thru Lexin...,5,1,0.102685


In [59]:
idx = false_negatives[0]
print(f'Review: {X_test.loc[idx, "narrative"]}\n')
print(f'Predicted probability of category 2: {X_test.loc[idx, "probability"]}\n')

# target='negative' selects predict_proba column 0, i.e. category 1
find_influential_tokens(X_test.loc[idx, 'narrative'], target='negative')

Review: It appears that my credit with you has been compromised. I was going through my records & noticed many files which do not belong to me. Since Im a stickler for research, I found that under section 605b of the FCRA you are required by law to remove & block any account which is found to be opened due to identity theft. The dispute items do not belong to me. Im attaching the required FTC Report # XXXX for you and the bank 's records ( learned through more research both parties require ). Please block/remove these files. If you feel there is a possibility these accounts belong to me I will require all documentation that bears my signature ( another research item I found that requires you to verify with 100 % accuracy that each account is 100 % true, accurate, correct, complete & VERIFIABLE ). This is not a duplicate nor is this complaint being filed by a third party, I am filing this complaint myself. Please see this complaint is processed to the letter of the law. 
If you do not p

Unnamed: 0,token,delta_prob
27541,belong im,5.787228e-10
149228,remove block,4.041425e-10
99580,items opened,2.481499e-10
200297,xxxx xxxx,2.352891e-10
77063,filed party,2.32987e-10
128885,party filing,2.321059e-10
138842,processed letter,2.31708e-10
155830,required ftc,2.312106e-10
22286,attaching required,2.311396e-10
43714,complaint processed,2.299316e-10


In [61]:
X_target = rs.best_estimator_['vect'].transform(X_test.loc[[idx], 'narrative'])
X_target

<1x201841 sparse matrix of type '<class 'numpy.float64'>'
	with 180 stored elements in Compressed Sparse Row format>

In [62]:
orig_prob = rs.best_estimator_['clf'].predict_proba(X_target)[:,0][0]
orig_prob

0.9999999991207034

In [63]:
nonzero_idx = np.nonzero(X_target.toarray()[0])[0]
nonzero_idx

array([  1553,   1557,   1600,   5747,   5771,   5772,   7799,   7802,
         9038,   9784,   9892,  10704,  10706,  10768,  10827,  11283,
        11698,  11729,  11861,  18893,  18905,  22265,  22286,  22853,
        22885,  25372,  25879,  26854,  26855,  27438,  27541,  27613,
        28385,  28388,  28402,  28435,  37697,  37868,  43350,  43453,
        43541,  43714,  43992,  44137,  44554,  44580,  44657,  44676,
        45652,  45741,  49592,  49648,  51484,  51743,  52380,  63771,
        64047,  64337,  64463,  65316,  65344,  66113,  66139,  67370,
        67376,  73877,  73894,  74981,  75218,  75746,  75828,  76910,
        77063,  77130,  77140,  77161,  77228,  77246,  81367,  81463,
        82246,  82248,  83356,  83595,  85035,  85061,  89980,  90125,
        90556,  90566,  90691,  99183,  99327,  99361,  99394,  99432,
        99580, 101120, 101128, 101840, 101963, 103034, 103325, 103381,
       103406, 103941, 103963, 104290, 104294, 104915, 105297, 112840,
      

In [64]:
# Use a separate loop variable so the row index stored in idx is not overwritten
for j in nonzero_idx[:5]:
    print(rs.best_estimator_['vect'].get_feature_names_out()[j])

100
100 accuracy
100 true
605b
605b fair


In [65]:
variants = np.repeat(X_target[0].toarray(), len(nonzero_idx), axis = 0)

for i, j in enumerate(nonzero_idx):
    variants[i,j] = 0

In [67]:
# Make a DataFrame containing all tokens in the vocabulary
explain = pd.DataFrame({
    'token': rs.best_estimator_['vect'].get_feature_names_out()
})

# Keep only those corresponding to the tokens present in the narrative of interest
explain = explain.loc[nonzero_idx]

# Find the predicted probability when removing each token and the difference between it and the original predicted probability
explain['delta_prob'] = orig_prob - rs.best_estimator_['clf'].predict_proba(variants)[:,0]

# Find the most influential values
explain.sort_values('delta_prob', ascending=False).head(25)

Unnamed: 0,token,delta_prob
27541,belong im,5.787228e-10
149228,remove block,4.041425e-10
99580,items opened,2.481499e-10
200297,xxxx xxxx,2.352891e-10
77063,filed party,2.32987e-10
128885,party filing,2.321059e-10
138842,processed letter,2.31708e-10
155830,required ftc,2.312106e-10
22286,attaching required,2.311396e-10
43714,complaint processed,2.299316e-10


In [70]:
explain.sort_values('delta_prob', ascending=True).head(25)

Unnamed: 0,token,delta_prob
37697,claim,-1.560352e-11
167659,signature,-1.514877e-11
22853,attorney,-1.043077e-11
104915,letter,-9.336532e-12
140811,provide,-8.498091e-12
101840,knowledge,-5.31486e-12
142339,pursuant,-4.831691e-12
104290,legal,-3.012701e-12
112840,matter,-1.207923e-12
65316,documentation,-9.947598e-13


In [None]:
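# NOTE: shap.TreeExplainer targets tree-based models, so this cell is not expected to work
# for a Pipeline ending in MultinomialNB; it is left here unexecuted.
explainer = shap.TreeExplainer(rs.best_estimator_)
explanation = explainer(X_test)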

In [29]:
# The final step (clf) has no get_feature_names_out, so slice the pipeline to exclude it
rs.best_estimator_[:-1].get_feature_names_out()

In [None]:
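# clf__fit_prior is a pipeline parameter name, not a variable; look it up through get_params()
type(rs.best_estimator_.get_params()['clf__fit_prior'])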

In [None]:
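import inspect

# inspect.getargspec was removed in Python 3.11; inspect.signature works on the class itself
inspect.signature(MultinomialNB)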