***Classification of Consumer Complaints***

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first.

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link

At the end of the project, your team should prepare a short presentation where you talk about the following:

* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

Bonus: A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

In [30]:
import pandas as pd
import numpy as np
import time

from joblib import dump, load

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
complaints = pd.read_csv('data/complaints.csv')

In [3]:
complaints=complaints.rename(columns={'Consumer complaint narrative':'narrative', 'Issue':'issue'})

In [4]:
complaints.head()

Unnamed: 0,narrative,issue
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam
2,I have a particular account that is stating th...,Incorrect information on your report
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report


In [5]:
complaints['issue'].value_counts()

issue
Incorrect information on your report    229305
Attempts to collect debt not owed        73163
Communication tactics                    21243
Struggling to pay mortgage               17374
Fraud or scam                            12347
Name: count, dtype: int64
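
The counts above are heavily imbalanced, so raw accuracy needs a reference point. A minimal sketch, using only the counts copied from the output above, that computes each issue's share and the accuracy of always predicting the most common issue (roughly 65%). The names `issue_counts`, `issue_shares`, and `majority_baseline` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Issue counts copied from the value_counts() output above.
issue_counts = pd.Series({
    'Incorrect information on your report': 229305,
    'Attempts to collect debt not owed': 73163,
    'Communication tactics': 21243,
    'Struggling to pay mortgage': 17374,
    'Fraud or scam': 12347,
})

# Share of each issue type in the dataset.
issue_shares = issue_counts / issue_counts.sum()
print(issue_shares.round(3))

# Accuracy of a classifier that always predicts the most common issue.
majority_baseline = issue_shares.max()
print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
```

Any model worth keeping should beat this baseline, and per-class recall is a more telling measure than accuracy for the smaller categories.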

In [6]:
# Map each issue type to an integer category; anything not listed
# ('Fraud or scam' in this dataset) falls into category 5.
issue_to_category = {
    'Incorrect information on your report': 1,
    'Attempts to collect debt not owed': 2,
    'Communication tactics': 3,
    'Struggling to pay mortgage': 4,
}
complaints['category'] = complaints['issue'].map(issue_to_category).fillna(5).astype(int)

complaints.head()

Unnamed: 0,narrative,issue,category
0,My name is XXXX XXXX this complaint is not mad...,Incorrect information on your report,1
1,I searched on XXXX for XXXXXXXX XXXX and was ...,Fraud or scam,5
2,I have a particular account that is stating th...,Incorrect information on your report,1
3,I have not supplied proof under the doctrine o...,Attempts to collect debt not owed,2
4,Hello i'm writing regarding account on my cred...,Incorrect information on your report,1


In [7]:
X = complaints[['narrative']]
y = complaints['category']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [8]:
vect = CountVectorizer(stop_words = 'english')
clf = MultinomialNB()

pipe = Pipeline([("vect", vect), ("clf", clf)])

param_grid = {
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10],
}
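
The grid above varies only the n-gram range and `min_df` for a count vectorizer. One way to compare count and tf-idf features in the same search is to treat the `vect` step itself as a hyperparameter. The sketch below does this on a tiny made-up corpus; the `toy_*` names and the example narratives are hypothetical, and the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical, for illustration only).
toy_narratives = [
    'incorrect account on my credit report',
    'credit report shows a wrong balance',
    'the credit bureau reported wrong information',
    'error on my credit report not corrected',
    'collector demands payment for a debt i do not owe',
    'this debt is not mine stop collecting',
    'debt collector says i owe money i never borrowed',
    'collection agency is chasing a debt i already paid',
]
toy_labels = [1, 1, 1, 1, 2, 2, 2, 2]

toy_pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('clf', MultinomialNB()),
])

# Replacing the whole 'vect' step lets one search cover both vectorizers.
toy_param_grid = {
    'vect': [CountVectorizer(stop_words='english'),
             TfidfVectorizer(stop_words='english')],
    'vect__ngram_range': [(1, 1), (1, 2)],
}

toy_search = GridSearchCV(toy_pipe, toy_param_grid, cv=2)
toy_search.fit(toy_narratives, toy_labels)

# One line per candidate: vectorizer type, n-gram range, mean CV accuracy.
for params, score in zip(toy_search.cv_results_['params'],
                         toy_search.cv_results_['mean_test_score']):
    print(type(params['vect']).__name__, params['vect__ngram_range'], round(score, 3))
```

On the full dataset this doubles the number of fits, so it may be worth narrowing `min_df` first.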

In [9]:
start = time.time()
rs = GridSearchCV(estimator = pipe, param_grid = param_grid, verbose = 3, n_jobs = 2, cv=3)
rs.fit(X_train['narrative'], y_train)
end = time.time()

print(f'The SearchCV fit took {end - start} seconds to run')

dump(rs, "data/cv_01.joblib")

Fitting 3 folds for each of 12 candidates, totalling 36 fits
The SearchCV fit took 1423.161800146103 seconds to run


['data/cv_01.joblib']

In [10]:
rs = load("data/cv_01.joblib")

In [31]:
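y_pred = rs.best_estimator_.predict(X_test['narrative'])
# predict_proba returns one column per class in classes_ order (1 to 5),
# so column index 1 is the predicted probability of category 2.
y_pred_proba = rs.best_estimator_.predict_proba(X_test['narrative'])[:,1]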

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('1 = Incorrect information on your report')
print('2 = Attempts to collect debt not owed')
print('3 = Communication tactics')
print('4 = Struggling to pay mortgage')
print('5 = Fraud or scam')

Accuracy: 0.8599334525453269
[[51972  4376    45   836    97]
 [ 3803 13813   413   177    85]
 [  208  1536  3428   121    18]
 [   77    47    17  4199     3]
 [  239   223    10    45  2570]]
              precision    recall  f1-score   support

           1       0.92      0.91      0.91     57326
           2       0.69      0.76      0.72     18291
           3       0.88      0.65      0.74      5311
           4       0.78      0.97      0.86      4343
           5       0.93      0.83      0.88      3087

    accuracy                           0.86     88358
   macro avg       0.84      0.82      0.82     88358
weighted avg       0.87      0.86      0.86     88358

1 = Incorrect information on your report
2 = Attempts to collect debt not owed
3 = Communication tactics
4 = Struggling to pay mortgage
5 = Fraud or scam
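
To see which issues the model mixes up most often, the confusion matrix can be normalized by the number of test complaints in each true category. A minimal sketch using the matrix copied from the output above; the variable names are illustrative. The largest error rate is category 3 predicted as category 2, followed by category 2 predicted as category 1.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = np.array([
    [51972,  4376,    45,   836,    97],
    [ 3803, 13813,   413,   177,    85],
    [  208,  1536,  3428,   121,    18],
    [   77,    47,    17,  4199,     3],
    [  239,   223,    10,    45,  2570],
])
categories = [1, 2, 3, 4, 5]

# Keep only the misclassifications by zeroing the diagonal.
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Share of each true category's test complaints that landed in each wrong category.
error_rates = errors / cm.sum(axis=1, keepdims=True)

# Flattened positions of the three largest error rates, largest first.
top_flat = np.argsort(error_rates, axis=None)[::-1][:3]
top_pairs = [divmod(int(pos), cm.shape[1]) for pos in top_flat]

for true_idx, pred_idx in top_pairs:
    print(f'true {categories[true_idx]} -> predicted {categories[pred_idx]}: '
          f'{error_rates[true_idx, pred_idx]:.1%} of true-class complaints')
```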


In [12]:
true_predictions=y_test==y_pred
false_predictions=y_test!=y_pred

print('Number of true predictions:', np.sum(true_predictions))
print('Number of false predictions:', np.sum(false_predictions))

Number of true predictions: 75982
Number of false predictions: 12376


In [13]:
rs.best_estimator_

In [14]:
rs.cv_results_

{'mean_fit_time': array([38.43261051, 50.15034382, 97.58667914, 23.55330785, 45.87467909,
        77.17945862, 23.53935742, 44.83133094, 70.30200982, 23.38567789,
        44.15266562, 67.8486801 ]),
 'std_fit_time': array([10.3715717 ,  0.21059118,  1.59372353,  0.14142104,  0.25482121,
         2.06290855,  0.18168268,  0.15233544,  0.51302049,  0.09844531,
         0.2116623 ,  1.27872422]),
 'mean_score_time': array([13.17802024, 21.94732245, 32.78665336, 11.72033159, 21.52165333,
        32.93847505, 11.94165794, 20.98501325, 30.05932124, 11.87298846,
        20.48968212, 29.70797952]),
 'std_score_time': array([1.10387599, 0.6800757 , 1.05379759, 0.1335645 , 0.45621665,
        1.73517955, 0.08497285, 0.49138592, 0.08108572, 0.11787183,
        0.24228338, 1.46810534]),
 'param_vect__min_df': masked_array(data=[1, 1, 1, 2, 2, 2, 5, 5, 5, 10, 10, 10],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False],
       

In [15]:
cv_scores_df = pd.DataFrame({
    'params': rs.cv_results_['params'],
    'mean_test_score': rs.cv_results_['mean_test_score']
})
cv_scores_df = pd.json_normalize(cv_scores_df['params']).join(cv_scores_df['mean_test_score'])
cv_scores_df

Unnamed: 0,vect__min_df,vect__ngram_range,mean_test_score
0,1,"(1, 1)",0.811592
1,1,"(1, 2)",0.857361
2,1,"(1, 3)",0.844213
3,2,"(1, 1)",0.805156
4,2,"(1, 2)",0.838283
5,2,"(1, 3)",0.840931
6,5,"(1, 1)",0.801833
7,5,"(1, 2)",0.810928
8,5,"(1, 3)",0.794986
9,10,"(1, 1)",0.800456
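
The unigram versus n-gram comparison is easier to read as a grid. A small sketch that pivots the ten scores displayed above into one row per `min_df` and one column per n-gram range. The two candidates not shown in the output (`min_df` of 10 with bigrams and with trigrams) are left missing rather than guessed, and the names `scores`, `score_grid`, and `best_row` are illustrative.

```python
import pandas as pd

# Mean CV accuracy for the ten candidates displayed in the table above.
scores = pd.DataFrame({
    'min_df': [1, 1, 1, 2, 2, 2, 5, 5, 5, 10],
    'ngram_range': ['(1, 1)', '(1, 2)', '(1, 3)'] * 3 + ['(1, 1)'],
    'mean_test_score': [0.811592, 0.857361, 0.844213, 0.805156, 0.838283,
                        0.840931, 0.801833, 0.810928, 0.794986, 0.800456],
})

# Rows: min_df; columns: n-gram range; cells: mean CV accuracy (NaN if not shown).
score_grid = scores.pivot(index='min_df', columns='ngram_range', values='mean_test_score')
print(score_grid)

# Best candidate among the rows that are visible.
best_row = scores.loc[scores['mean_test_score'].idxmax()]
print(best_row)
```

Among the visible rows, the best score comes from `min_df` of 1 with unigrams plus bigrams, and raising `min_df` lowers the score for every n-gram range shown.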


In [16]:
rs.best_estimator_['vect'].get_feature_names_out()

array(['00', '00 00', '00 000', ..., 'zxxxx xxxx', 'zzzz', 'zzzz changed'],
      dtype=object)

In [17]:
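def find_influential_tokens(narrative, target='negative'):
    # 'negative' and 'positive' select columns 0 and 1 of predict_proba,
    # which for this model are categories 1 and 2 (the first two entries of classes_).
    target_idx = ['negative', 'positive'].index(target)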
    
    X_target = rs.best_estimator_['vect'].transform(pd.Series([narrative]))
    orig_prob = rs.best_estimator_['clf'].predict_proba(X_target)[:,target_idx][0]
    nonzero_idx = np.nonzero(X_target.toarray()[0])[0]
    
    variants = np.repeat(X_target[0].toarray(), len(nonzero_idx), axis = 0)
    for i, j in enumerate(nonzero_idx):
        variants[i,j] = 0

    # Make a DataFrame containing all tokens in the vocabulary
    explain = pd.DataFrame({
        'token': rs.best_estimator_['vect'].get_feature_names_out()
    })
    
    # Keep only those corresponding to the tokens present in the narrative of interest
    explain = explain.loc[nonzero_idx]
    
    # Find the predicted probability when removing each token and find the difference between it and the original predicted probability
    explain['delta_prob'] = orig_prob - rs.best_estimator_['clf'].predict_proba(variants)[:,target_idx]
    
    # Find the most influential values
    return explain.sort_values('delta_prob', ascending=False).head(10)

In [18]:
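X_test = X_test.copy()  # work on a copy so adding columns does not trigger a SettingWithCopyWarning
X_test['label'] = y_test
X_test['prediction'] = y_pred
X_test['probability'] = y_pred_proba  # column 1 of predict_proba, i.e. P(category 2)

# Despite the name, this holds the index of every misclassified test complaint,
# sorted by ascending predicted probability of category 2.
false_negatives = (
    X_test
    .loc[~(X_test['label'] == X_test['prediction'])]
    .sort_values('probability')
    .index
)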

for idx in false_negatives[:5]:
    print(X_test.loc[idx, 'narrative'])
    print('\n ---------')

COVID-19 Loan Modification is reported as Paying partial payment agreement Credit Reports are showing Unknown or Data Unavailable status rather than Current during the duration of the COVID-19 Forbearance The month the Loan Mod was executed shows no data, all later months have incorrect Scheduled Payment amounts which do not match the amount billed ( which is the amount that was paid ) XXXX is reporting XXXX and XX/XX/XXXX as in Forbearance The COVID-19 related protected status does not seem to be reflected from XX/XX/XXXX ( month entered into COVID-19 forbearance ) and XX/XX/XXXX ( first month of COVID -19 Loan Modification Trial Period ) XX/XX/XXXX Phone call with RoundPoint First Agent Agent states we do not report forbearance as current transfers to escalations Escalations Agent believes all information was input correctly on their side reported as current will verify Incorrect scheduled payment amount will dispute XXXX shows forbearance in XXXX & XXXX XXXX will verify Loan Modific

In [19]:
X_test.head()

Unnamed: 0,narrative,label,prediction,probability
168687,I was advised to contact you all regarding ite...,1,1,4.553205000000001e-22
24442,On XX/XX/XXXX I recieved phone call on my work...,3,2,0.9998364
219970,"According to the Fair Credit Reporting Act, Se...",1,1,8.66586e-34
199559,the bill was for {$16.00} to XXXX XXXX and it...,2,2,1.0
208047,I was working on building my credit thru Lexin...,5,1,0.002136279


In [20]:
idx = false_negatives[0]
print(f'Review: {X_test.loc[idx, "narrative"]}\n')
print(f'Predicted probability of positive: {X_test.loc[idx, "probability"]}\n')

find_influential_tokens(X_test.loc[idx, 'narrative'], target='negative')

Review: COVID-19 Loan Modification is reported as Paying partial payment agreement Credit Reports are showing Unknown or Data Unavailable status rather than Current during the duration of the COVID-19 Forbearance The month the Loan Mod was executed shows no data, all later months have incorrect Scheduled Payment amounts which do not match the amount billed ( which is the amount that was paid ) XXXX is reporting XXXX and XX/XX/XXXX as in Forbearance The COVID-19 related protected status does not seem to be reflected from XX/XX/XXXX ( month entered into COVID-19 forbearance ) and XX/XX/XXXX ( first month of COVID -19 Loan Modification Trial Period ) XX/XX/XXXX Phone call with RoundPoint First Agent Agent states we do not report forbearance as current transfers to escalations Escalations Agent believes all information was input correctly on their side reported as current will verify Incorrect scheduled payment amount will dispute XXXX shows forbearance in XXXX & XXXX XXXX will verify Loan

Unnamed: 0,token,delta_prob
2509726,xxxx,3.299423e-139
1920286,report,3.299423e-139
2508407,xx,3.299423e-139
1929971,reporting,3.299423e-139
621551,credit,3.299423e-139
2509818,xxxx 1300,3.299422e-139
625633,credit report,3.299422e-139
2509088,xx xx,3.299421e-139
1926962,reported,3.29942e-139
2509090,xx xxxx,3.299401e-139


In [21]:
X_target = rs.best_estimator_['vect'].transform(X_test.loc[[idx], 'narrative'])
X_target

<1x2550213 sparse matrix of type '<class 'numpy.int64'>'
	with 646 stored elements in Compressed Sparse Row format>

In [22]:
orig_prob = rs.best_estimator_['clf'].predict_proba(X_target)[:,0][0]
orig_prob

3.299422531775445e-139

In [23]:
nonzero_idx = np.nonzero(X_target.toarray()[0])[0]
nonzero_idx

array([      0,    3737,    4122,    6179,    7419,    7893,   11996,
         12026,   12261,   12272,   12945,   12946,   23436,   23540,
         23553,   23683,   23902,   23907,   24020,   24104,   24344,
         63817,   63862,   64184,   64244,   67155,   67489,   67791,
         67830,   68750,   77154,   77155,   79671,   82440,   82469,
         82856,   87390,   88405,   88592,  104618,  106042,  106093,
        111484,  111674,  111779,  118628,  119864,  120501,  120836,
        137408,  138523,  138548,  139369,  141387,  141393,  141696,
        141875,  145344,  145345,  155339,  155413,  155526,  156641,
        161646,  163330,  163596,  163763,  164063,  164198,  165349,
        165497,  165928,  170464,  170602,  185512,  186231,  191567,
        191612,  205799,  206886,  214322,  214613,  228079,  229048,
        229864,  232399,  233054,  233893,  240047,  240156,  248008,
        248011,  267009,  267830,  268364,  268933,  311545,  312054,
        314357,  314

In [24]:
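# Look up the vocabulary once, and use a separate loop variable so the
# complaint index stored in idx is not overwritten.
feature_names = rs.best_estimator_['vect'].get_feature_names_out()
for token_idx in nonzero_idx[:5]:
    print(feature_names[token_idx])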

00
00 month
00 payments
00 xx
10


In [25]:
variants = np.repeat(X_target[0].toarray(), len(nonzero_idx), axis = 0)

for i, j in enumerate(nonzero_idx):
    variants[i,j] = 0

In [28]:
# Make a DataFrame containing all tokens in the vocabulary
explain = pd.DataFrame({
    'token': rs.best_estimator_['vect'].get_feature_names_out()
})

# Keep only those corresponding to the tokens present in the narrative of interest
explain = explain.loc[nonzero_idx]

# Find the predicted probability when removing each token and find the difference between it and the original predicted probability
explain['delta_prob'] = orig_prob - rs.best_estimator_['clf'].predict_proba(variants)[:,0]

# Find the most influential values
explain.sort_values('delta_prob', ascending=False).head(25)

Unnamed: 0,token,delta_prob
2509726,xxxx,3.299423e-139
1920286,report,3.299423e-139
2508407,xx,3.299423e-139
1929971,reporting,3.299423e-139
621551,credit,3.299423e-139
2509818,xxxx 1300,3.299422e-139
625633,credit report,3.299422e-139
2509088,xx xx,3.299421e-139
1926962,reported,3.29942e-139
2509090,xx xxxx,3.299401e-139


In [29]:
explain.sort_values('delta_prob', ascending=True).head(25)

Unnamed: 0,token,delta_prob
1469256,modification,-1.598784e-114
615316,covid,-3.0008090000000003e-120
615321,covid 19,-1.859277e-121
979907,forbearance,-8.541199999999999e-122
1372548,loan modification,-1.3239539999999999e-124
23907,19 forbearance,-1.858728e-125
23436,19,-5.229296e-126
1369555,loan,-1.1629850000000001e-129
2081671,servicer,-1.0262369999999999e-130
981002,forbearance plan,-5.739517e-131
