## Content

This dataset consists of real-world consumer complaints about financial products and services. Each complaint is labeled with a specific product, which makes this a supervised text classification problem. The goal is to assign each new complaint to the appropriate product category based on its text, so several machine learning algorithms are compared to find the one with the best prediction accuracy.

In [None]:
# Importing Libraries

import os
import pandas as pd
import numpy as np
from scipy.stats import randint
from io import StringIO
import gc

# Preprocessing
from sklearn.model_selection import train_test_split

# Feature selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfTransformer

# Actual models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Evaluation metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Visualization
from IPython.display import display
import seaborn as sns  # statistical plots (boxplot, heatmap)
import matplotlib.pyplot as plt

## Data Preparation

In [None]:
df = pd.read_csv("https://storage.googleapis.com/suptech-lab-practical-data-science-public/complaints.csv.zip")
df.shape

In [None]:
df.head(2).T

In [None]:
# Create a new dataframe with two columns
df1 = df[['Product', 'Consumer complaint narrative']].copy()

# Remove missing values (NaN)
df1 = df1[pd.notnull(df1['Consumer complaint narrative'])]

# Renaming second column for a simpler name
df1.columns = ['Product', 'Complaint']

df1.shape

In [None]:
total = df1['Complaint'].notnull().sum()
round((total/len(df)*100),1)

In [None]:
# About 1.5 million complaints (36.3%) have an actual complaint body populated; the original dataframe is no longer needed, so free its memory
del df
gc.collect()

In [None]:
df1.head(15)

In [None]:
pd.DataFrame(df1.Product.unique()).values

There are 21 distinct classes or categories (i.e., the target our model aims to predict). However, some classes overlap. For example, `Credit card` and `Prepaid card` are both covered by the broader `Credit card or prepaid card` category. A new credit card complaint could therefore be labeled either `Credit card` or `Credit card or prepaid card`. Both would be correct, but the ambiguity would hurt the model's measured performance. To prevent this, overlapping categories are merged under a single name.
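As a small illustration of the relabeling described above, the sketch below applies a mapping dictionary to a toy `Series` and shows the number of distinct classes shrinking. The four sample labels and the names `products`, `merge_map` and `merged` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy labels (illustrative only, not taken from the real dataset)
products = pd.Series(['Credit card', 'Prepaid card',
                      'Credit card or prepaid card', 'Mortgage'])

# Map the narrow categories onto the broader one that already covers them
merge_map = {
    'Credit card': 'Credit card or prepaid card',
    'Prepaid card': 'Credit card or prepaid card',
}
merged = products.replace(merge_map)

# Number of distinct classes before and after the merge
print(products.nunique(), '->', merged.nunique())
```

Labels that do not appear in the mapping (here `Mortgage`) pass through unchanged.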

In [None]:
# Renaming categories
df2 = df1.replace({
    'Product':
      {
        'Credit reporting': 'Credit reporting, repair, or other',
        'Credit reporting or other personal consumer reports': 'Credit reporting, repair, or other',
        'Credit reporting, credit repair services, or other personal consumer reports': 'Credit reporting, repair, or other',
        'Credit card': 'Credit card or prepaid card',
        'Prepaid card': 'Credit card or prepaid card',
        'Payday loan': 'Payday loan, title loan, personal loan, or advance loan',
        'Payday loan, title loan, or personal loan': 'Payday loan, title loan, personal loan, or advance loan',
        'Money transfers': 'Money transfer, virtual currency, or money service',
        'Virtual currency': 'Money transfer, virtual currency, or money service'
      }
    },
    inplace = False) # on a very large dataset, you may prefer to skip the copy and modify df1 directly with inplace = True

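In [None]:
# Note: because we have limited RAM on the free tier, we clean up variables we're no longer using
# In a real-world data science setup, we'd likely keep these around in case we need them for further experimentation
del df1
gc.collect()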

In [None]:
df2.groupby('Product').count()

In [None]:
# Note there are only 10 complaints of 1.5 MILLION that are classified as `Debt or credit management`, so we will combine them with `Other financial service`.
df2.replace({
    'Product':
      {
        'Debt or credit management': 'Other financial service'
      }
    },
    inplace = True)

In [None]:
len(df2.Product.unique())

In [None]:
# Create a new column 'category_id' with encoded categories
df2['category_id'] = df2['Product'].factorize()[0]
category_id_df = df2[['Product', 'category_id']].drop_duplicates()

# Dictionaries for future use
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)

# New dataframe
df2.head(10)
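To make the encoding step concrete, here is a minimal sketch on a toy dataframe. It assumes nothing beyond pandas; the `toy_` names and the sample labels are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import pandas as pd

# Toy product labels (illustrative only)
toy = pd.DataFrame({'Product': ['Mortgage', 'Student loan',
                                'Mortgage', 'Debt collection']})

# factorize() returns (codes, uniques); codes follow order of first appearance
codes, uniques = toy['Product'].factorize()
toy['category_id'] = codes

# One row per (Product, category_id) pair, then lookups in both directions
pairs = toy[['Product', 'category_id']].drop_duplicates()
toy_category_to_id = dict(pairs.values)
toy_id_to_category = dict(pairs[['category_id', 'Product']].values)

print(toy_category_to_id)
```

Because the ids follow order of first appearance, they depend on row order. The mapping should be built once and reused rather than recomputed on a reshuffled or sampled dataframe.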

In [None]:
fig = plt.figure(figsize=(8,6))
colors = ['grey','grey','grey','grey','grey','grey','grey','grey','grey',
    'grey','grey','darkblue','darkblue','darkblue']
df2.groupby('Product').Complaint.count().sort_values().plot.barh(
    ylim=0, color=colors, title= 'NUMBER OF COMPLAINTS IN EACH PRODUCT CATEGORY\n')
plt.xlabel('Number of occurrences', fontsize = 10)

In [None]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')

# WARNING!!! If we use the full dataset of ~1.5 million observations, the following line of code will crash the runtime (RAM resources) on Colab's free tier.
# One option would be to pay for a premium account, and go grab a coffee while this crunches for ~30min
# Because the computation is also time consuming (in terms of CPU), the data was sampled to 15,000 which is sufficient to create a model

df2 = df2.sample(15000, random_state=1337).copy()

features = tfidf.fit_transform(df2.Complaint).toarray() # Note: this will take ~5sec

labels = df2.category_id

print("Each of the %d complaints is represented by %d features (TF-IDF score of unigrams and bigrams)" %(features.shape))
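For intuition about what each TF-IDF feature value means, the following sketch recomputes the weighting by hand with NumPy on a three-document toy corpus. It follows the formulas scikit-learn documents for `sublinear_tf=True` with the default `smooth_idf=True` and `norm='l2'`. Tokenization is simplified to whitespace splitting, and stop words, bigrams and `min_df` are left out, so this is an approximation of the cell above rather than a replacement for it.

```python
import numpy as np

# Toy corpus (illustrative only); tokens are split on whitespace
docs = ["late fee charged", "late payment", "fee fee refund"]
vocab = sorted({tok for d in docs for tok in d.split()})
counts = np.array([[d.split().count(tok) for tok in vocab] for d in docs],
                  dtype=float)

# Sublinear term frequency: 1 + ln(tf) for tf > 0, else 0
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1.0 + np.log(counts[nonzero])

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Weight each term, then L2-normalize each document vector
weights = tf * idf
weights /= np.linalg.norm(weights, axis=1, keepdims=True)

print(dict(zip(vocab, weights[0].round(3))))
```

In the first document, `charged` (which appears in only one document) ends up with a larger weight than `late` (which appears in two), which is the effect the IDF term is meant to produce.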

In [None]:
N = 3
for Product, category_id in sorted(category_to_id.items()):
  features_chi2 = chi2(features, labels == category_id)
  indices = np.argsort(features_chi2[0])
  feature_names = np.array(tfidf.get_feature_names_out())[indices]
  unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
  bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
  print("\n==> %s:" %(Product))
  print("  * Most Correlated Unigrams are: %s" %(', '.join(unigrams[-N:])))
  print("  * Most Correlated Bigrams are: %s" %(', '.join(bigrams[-N:])))
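The ranking logic in the cell above can be seen on a tiny example. This sketch uses a made-up count matrix with two terms, one that appears only in the positive class and one that is spread evenly. All `toy_` names and values are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy term counts (illustrative only): rows are documents, columns are terms
toy_terms = np.array(['mortgage', 'account'])
toy_counts = np.array([[3, 1],
                       [2, 1],
                       [0, 1],
                       [0, 1]])
is_mortgage = np.array([True, True, False, False])

# chi2 returns (statistics, p-values); a larger statistic means the term's
# counts depart more from what independence of term and class would predict
toy_chi2, toy_p = chi2(toy_counts, is_mortgage)
toy_ranked = toy_terms[np.argsort(toy_chi2)[::-1]]

print(toy_ranked)
```

Note that a high chi-squared statistic only says the term and the class are associated. It does not say whether the term is over- or under-represented in the class.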

### Splitting the data (train vs test)

In [None]:
X = df2['Complaint'] # Collection of "documents" (observations of complaints)
y = df2['Product'] # Target labels we want to predict (i.e., the 14 product categories)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state = 1337)
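The split above is purely random. Since the product categories differ widely in size, one option (not used in this notebook) is a stratified split, which keeps class proportions the same in the train and test sets. A minimal sketch on made-up data, with hypothetical `toy_` names:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): an 80/20 class imbalance
toy_texts = [f"complaint {i}" for i in range(100)]
toy_labels = ['Mortgage'] * 80 + ['Student loan'] * 20

# stratify= keeps the 80/20 ratio in both the train and the test portion
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels,
    test_size=0.25, random_state=1337, stratify=toy_labels)

print(Counter(toy_y_test))
```

Stratification requires at least two samples in every class, so very rare categories would need to be merged or dropped first.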

In [None]:
## TRAINING MODELS

models = [LinearSVC(), MultinomialNB(),]

# 5-fold cross-validation
CV = 5

In [None]:
entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV) # Note: this line takes around 60 seconds
  for fold_idx, accuracy in enumerate(accuracies):
    entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
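One caveat about the loop above: `features` was produced by fitting TF-IDF on all sampled complaints, so the vocabulary and IDF weights have already seen each validation fold. A possible alternative, sketched here on a made-up corpus with hypothetical `toy_` names, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are refit on every training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus (illustrative only): 10 documents per class
toy_docs = ([f"my mortgage escrow payment number {i}" for i in range(10)] +
            [f"student loan servicer deferment request {i}" for i in range(10)])
toy_targets = ['Mortgage'] * 10 + ['Student loan'] * 10

# The vectorizer and the classifier are refit on each training fold
toy_pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    LinearSVC())
toy_scores = cross_val_score(toy_pipeline, toy_docs, toy_targets,
                             scoring='accuracy', cv=5)

print(toy_scores.mean())
```

The pipeline works on raw text and sparse matrices, so it also avoids the dense `.toarray()` conversion, at the cost of re-vectorizing in every fold.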

In [None]:
## COMPARING MODELS (mean and standard deviation of cross-validation accuracy)

mean_accuracy = cv_df.groupby('model_name').accuracy.mean()
std_accuracy = cv_df.groupby('model_name').accuracy.std()
acc = pd.concat([mean_accuracy, std_accuracy], axis= 1, ignore_index=True)
acc.columns = ['Mean Accuracy', 'Standard deviation']
acc
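The same summary table can be built in a single `groupby` call with `agg`. The sketch below uses made-up accuracy values and hypothetical names (`toy_cv`, `toy_summary`) purely to show the shape of the result.

```python
import pandas as pd

# Made-up fold accuracies (illustrative only, not real results)
toy_cv = pd.DataFrame({
    'model_name': ['LinearSVC', 'LinearSVC', 'MultinomialNB', 'MultinomialNB'],
    'accuracy': [0.80, 0.90, 0.70, 0.74],
})

# One row per model, one column per aggregate
toy_summary = toy_cv.groupby('model_name')['accuracy'].agg(['mean', 'std'])
toy_summary.columns = ['Mean Accuracy', 'Standard deviation']

print(toy_summary)
```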

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='model_name', y='accuracy',
            data=cv_df,
            color='lightblue',
            showmeans=True)
plt.title("MEAN ACCURACY (cv = 5)\n", size=14);

In [None]:
## EVALUATING MODELS
X_train, X_test, y_train, y_test,indices_train,indices_test = train_test_split(features,
                                                               labels,
                                                               df2.index, test_size=0.25,
                                                               random_state=1337)
model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
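## Classification report
# y_test holds category ids, so pass the ids explicitly and look up their names in the same order
report_labels = sorted(set(y_test) | set(y_pred))
print('\t\t\t\tCLASSIFICATION METRICS\n')
print(metrics.classification_report(y_test, y_pred,
                                    labels=report_labels,
                                    target_names=[id_to_category[i] for i in report_labels]))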

In [None]:
## CONFUSION MATRIX
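# Fix the label order to the category ids so rows and columns line up with the tick labels
conf_mat = confusion_matrix(y_test, y_pred, labels=category_id_df.category_id.values)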
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt='d',
            xticklabels=category_id_df.Product.values,
            yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title("CONFUSION MATRIX - LinearSVC\n", size=16)
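When reading the heatmap, rows are actual classes and columns are predicted classes, so entry `[i, j]` counts samples of class `i` predicted as class `j`. A small sketch with made-up labels (all `toy_` names are hypothetical) shows the layout and how per-class recall falls out of the diagonal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels (illustrative only)
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 0]

# labels= pins the row/column order, even for classes that are never predicted
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])

# Recall per class: correct predictions divided by the row total
toy_recall = np.diag(toy_cm) / toy_cm.sum(axis=1)

print(toy_cm)
print(toy_recall)
```

Here `toy_cm[0, 1]` is 1 because one sample of class 0 was predicted as class 1.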

In [None]:
## Inspection of misclassifications
for predicted in category_id_df.category_id:
  for actual in category_id_df.category_id:
    if predicted != actual and conf_mat[actual, predicted] >= 20:
      print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual],
                                                           id_to_category[predicted],
                                                           conf_mat[actual, predicted]))

      display(df2.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['Product',
                                                                'Complaint']])
      print('')

In [None]:
## Most correlated terms with each category

N = 4
for Product, category_id in sorted(category_to_id.items()):
  indices = np.argsort(model.coef_[category_id])
  feature_names = np.array(tfidf.get_feature_names_out())[indices]
  unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
  bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
  print("\n==> '{}':".format(Product))
  print("  * Top unigrams: %s" %(', '.join(unigrams)))
  print("  * Top bigrams: %s" %(', '.join(bigrams)))
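The selection logic in the cell above reduces to sorting one row of coefficients and splitting the ranked terms by word count. The sketch below isolates that step with made-up weights and hypothetical `toy_` names. It assumes, as the notebook does, that `coef_` holds one row per class.

```python
import numpy as np

# Made-up feature names and weights for a single class (illustrative only)
toy_feature_names = np.array(['loan', 'late fee', 'navient',
                              'credit report', 'escrow'])
toy_coef_row = np.array([0.2, -0.5, 1.3, 0.9, -0.1])

toy_n = 2
# argsort is ascending, so reverse it to rank from largest weight to smallest
toy_order = np.argsort(toy_coef_row)[::-1]
toy_ranked_terms = toy_feature_names[toy_order]

# Split the ranked terms into unigrams and bigrams and keep the top toy_n of each
toy_top_unigrams = [t for t in toy_ranked_terms if len(t.split(' ')) == 1][:toy_n]
toy_top_bigrams = [t for t in toy_ranked_terms if len(t.split(' ')) == 2][:toy_n]

print(toy_top_unigrams, toy_top_bigrams)
```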

In [None]:
## Predictions

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state = 0)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')

fitted_vectorizer = tfidf.fit(X_train)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(X_train)

model = LinearSVC().fit(tfidf_vectorizer_vectors, y_train)

In [None]:
new_complaint = """I have been enrolled back at XXXX XXXX University in the XX/XX/XXXX. Recently, i have been harassed by \
Navient for the last month. I have faxed in paperwork providing them with everything they needed. And yet I am still getting \
phone calls for payments. Furthermore, Navient is now reporting to the credit bureaus that I am late. At this point, \
Navient needs to get their act together to avoid me taking further action. I have been enrolled the entire time and my \
deferment should be valid with my planned graduation date being the XX/XX/XXXX."""
print(model.predict(fitted_vectorizer.transform([new_complaint])))

In [None]:
new_complaint_2 = """Equifax exposed my personal information without my consent, as part of their recent data breach. \
In addition, they dragged their feet in the announcement of the report, and even allowed their upper management to sell \
off stock before the announcement."""
print(model.predict(fitted_vectorizer.transform([new_complaint_2])))
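`predict` returns only the single best category. To see how close the alternatives were, one option is to rank all categories by `LinearSVC.decision_function`. The sketch below trains on a made-up six-document corpus. The helper `rank_categories` and all `toy_` names are hypothetical, and the scores are signed margins, not probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data (illustrative only)
toy_docs = [
    "student loan deferment servicer", "student loan payment servicer",
    "credit report error bureau", "credit report dispute bureau",
    "mortgage escrow payment", "mortgage foreclosure escrow",
]
toy_targets = ['Student loan', 'Student loan', 'Credit reporting',
               'Credit reporting', 'Mortgage', 'Mortgage']

toy_vectorizer = TfidfVectorizer().fit(toy_docs)
toy_model = LinearSVC().fit(toy_vectorizer.transform(toy_docs), toy_targets)

def rank_categories(text):
    """Return (category, score) pairs sorted from highest to lowest decision score."""
    scores = toy_model.decision_function(toy_vectorizer.transform([text]))[0]
    order = np.argsort(scores)[::-1]
    return [(toy_model.classes_[i], float(scores[i])) for i in order]

toy_text = "the bureau will not fix my credit report"
toy_ranking = rank_categories(toy_text)
print(toy_ranking)
```

The first entry of the ranking matches what `predict` would return. The gap to the second entry gives a rough sense of how confident the classifier is.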