## Content

This dataset consists of real-world consumer complaints about financial products and services. Each complaint is labeled with a product, which makes this a supervised text classification problem. To classify future complaints by their content, several machine learning algorithms are tried, with the aim of assigning each complaint to the correct product category as accurately as possible.
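
As a rough illustration of this framing (not the notebook's actual pipeline), the sketch below fits a TF-IDF plus `LinearSVC` classifier on four made-up complaints and predicts the product for a new one. The toy texts, the two labels, and the `make_pipeline` wiring are assumptions chosen for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up complaints and labels (assumption: not taken from the real dataset)
toy_complaints = [
    "My credit card was charged twice for the same purchase",
    "The credit card company raised my interest rate without notice",
    "My mortgage payment was applied to the wrong account",
    "The mortgage servicer lost my escrow paperwork",
]
toy_products = ["Credit card", "Credit card", "Mortgage", "Mortgage"]

# Text -> TF-IDF features -> linear classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(toy_complaints, toy_products)

# Classify an unseen complaint by its content
prediction = model.predict(["I was charged a late fee on my credit card"])[0]
print(prediction)
```

With so little data this only shows the shape of the problem: free text goes in, one product label comes out.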

In [1]:
# Importing Libraries
import os
import pandas as pd
import numpy as np
from scipy.stats import randint
from io import StringIO
import gc

# Preprocessing
from sklearn.model_selection import train_test_split

# Feature extraction and selection
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_selection import chi2

# Actual models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Evaluation metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Visualization
from IPython.display import display
import seaborn as sns  # statistical plots built on top of matplotlib
import matplotlib.pyplot as plt

## Data Preparation

In [None]:
df = pd.read_csv("https://storage.googleapis.com/suptech-lab-practical-data-science-public/complaints.csv.zip")
df.shape

In [None]:
df.head(2).T

In [None]:
# Create a new dataframe with two columns
df1 = df[['Product', 'Consumer complaint narrative']].copy()

# Remove missing values (NaN)
df1 = df1[pd.notnull(df1['Consumer complaint narrative'])]

# Renaming second column for a simpler name
df1.columns = ['Product', 'Complaint']

df1.shape

In [None]:
# Percentage of all rows in the original dataframe that include a complaint narrative
total = df1['Complaint'].notnull().sum()
round(total / len(df) * 100, 1)
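
The same filter-and-measure step can be sketched on a toy frame. The four rows below are invented stand-ins for the raw table, and `dropna(subset=...)` is used here as an equivalent alternative to the `pd.notnull` mask in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw complaints table
toy = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Mortgage", "Student loan"],
    "Consumer complaint narrative": ["Payment misapplied", np.nan, np.nan, "Wrong balance"],
})

# Keep only rows that have a narrative, then shorten the column name
toy_text = (
    toy[["Product", "Consumer complaint narrative"]]
    .dropna(subset=["Consumer complaint narrative"])
    .rename(columns={"Consumer complaint narrative": "Complaint"})
)

# Share of all rows that carry a narrative, as a percentage
share = round(len(toy_text) / len(toy) * 100, 1)
print(share)
```

Two of the four toy rows have a narrative, so the share printed here is 50.0.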

In [None]:
# Keep only the roughly 1.5 million rows (36.3%) that have a complaint narrative; the full dataframe is no longer needed
del df
gc.collect()

In [None]:
df1.head(15)

In [None]:
pd.DataFrame(df1.Product.unique()).values

There are 21 distinct classes (the target the model aims to predict). However, some classes overlap. For example, `Credit card` and `Prepaid card` are both covered by the broader `Credit card or prepaid card` category. A new credit card complaint could therefore be labeled either `Credit card` or `Credit card or prepaid card`. Both labels would be reasonable, but the ambiguity would hurt the model's measured performance. To avoid this, the overlapping categories are consolidated under a single name.

In [None]:
# Renaming categories
df2 = df1.replace({
    'Product':
      {
        'Credit reporting': 'Credit reporting, repair, or other',
        'Credit reporting or other personal consumer reports': 'Credit reporting, repair, or other',
        'Credit reporting, credit repair services, or other personal consumer reports': 'Credit reporting, repair, or other',
        'Credit card': 'Credit card or prepaid card',
        'Prepaid card': 'Credit card or prepaid card',
        'Payday loan': 'Payday loan, title loan, personal loan, or advance loan',
        'Payday loan, title loan, or personal loan': 'Payday loan, title loan, personal loan, or advance loan',
        'Money transfers': 'Money transfer, virtual currency, or money service',
        'Virtual currency': 'Money transfer, virtual currency, or money service'
      }
    },
    inplace = False) # if this were a MASSIVE dataset, you might prefer to skip the copy and modify df1 directly with inplace = True
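
One way to sanity-check a mapping like this is to confirm that none of the old labels survive the merge. The sketch below does so on a toy frame, using only the credit card part of the mapping; the frame and the `merge_map` and `leftover` names are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two overlapping labels, the merged label, and an unrelated one
toy = pd.DataFrame({
    "Product": ["Credit card", "Prepaid card", "Credit card or prepaid card", "Mortgage"],
    "Complaint": ["a", "b", "c", "d"],
})

# Subset of the notebook's mapping: old label -> merged label
merge_map = {
    "Credit card": "Credit card or prepaid card",
    "Prepaid card": "Credit card or prepaid card",
}

# replace matches whole cell values, so the merged label itself is left alone
merged = toy.replace({"Product": merge_map})

# Old labels still present after the merge (should be empty)
leftover = set(merged["Product"]) & set(merge_map)
print(leftover)
print(merged["Product"].nunique())
```

Because `replace` returns a copy by default, `toy` keeps its original four labels while `merged` ends up with two.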

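In [None]:
# Note: because we have limited RAM on free tier, we'll want to clean up variables we're no longer using
# In a real-world data science setup, we'd likely want to keep these around in case we need to go back to them for further experimentation
del df1
gc.collect()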

In [None]:
df2.groupby('Product').count()

In [None]:
# Note there are only 10 complaints of 1.5 MILLION that are classified as `Debt or credit management`, so we will combine them with `Other financial service`.
df2.replace({
    'Product':
      {
        'Debt or credit management': 'Other financial service'
      }
    },
    inplace = True)
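
The notebook folds this single rare class in by hand. A more general variant, sketched here on invented data, finds every class below a count threshold and maps it to a catch-all label. The `min_count` threshold and the `catch_all` name are hypothetical parameters, not values used by the notebook.

```python
import pandas as pd

# Invented class distribution with one very rare class
toy = pd.DataFrame({
    "Product": ["Mortgage"] * 5 + ["Student loan"] * 4 + ["Debt or credit management"],
})

min_count = 2                          # hypothetical threshold for "too rare"
catch_all = "Other financial service"  # label that absorbs the rare classes

# Classes whose frequency falls below the threshold
counts = toy["Product"].value_counts()
rare = counts[counts < min_count].index

# Keep the label where it is not rare, otherwise substitute the catch-all
toy["Product"] = toy["Product"].where(~toy["Product"].isin(rare), catch_all)
print(toy["Product"].value_counts().to_dict())
```

A threshold keeps the rule explicit and repeatable if the data is refreshed, at the cost of hiding which specific classes were merged unless `rare` is inspected.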

In [None]:
len(df2.Product.unique())
