<div class="alert alert-block alert-success">
<b>Kernel Author:</b>  <br>
<a href="https://bhishanpdl.github.io/" target="_blank">Bhishan Poudel, Data Scientist, Ph.D. Astrophysics</a>.
</div>

# Description
This project uses the [consumer complaint database](https://catalog.data.gov/dataset/consumer-complaint-database).

## Data Description
The Consumer Complaint Database is a collection of complaints about consumer financial products and services that the Consumer Financial Protection Bureau (CFPB) sends to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database. The database generally updates daily.

## Purpose
Classify consumer complaints into predefined categories.

Classification algorithms considered:
- Linear Support Vector Machine (LinearSVM)
- Random Forest
- Multinomial Naive Bayes
- Logistic Regression

# Imports

In [1]:
import time
time_start_notebook = time.time()

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns


import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot') 

# random state
SEED=100

In [4]:
import re
import string
import nltk
from nltk.corpus import stopwords

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

In [6]:
import sklearn
import tqdm

[(x.__name__,x.__version__) for x in 
 [np,pd,sns,sklearn,tqdm,nltk]]

[('numpy', '1.18.4'),
 ('pandas', '1.0.3'),
 ('seaborn', '0.9.0'),
 ('sklearn', '0.23.0'),
 ('tqdm', '4.46.0'),
 ('nltk', '3.2.5')]

# Load the data

In [12]:
df = pd.read_csv('../data/data_clean.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
pd.concat([df.head(2), df.tail(2)])

Unnamed: 0,product,complaint,category_id,complaint_lst,complaint_clean
0,Mortgage,Hello : ditech.com is my mortgage company. The...,0,"['helo', 'ditechcom', 'mortgage', 'company', '...",helo ditechcom mortgage company placed automat...
1,"Credit reporting, credit repair services, or o...",This a formal complaint against TransUnion reg...,1,"['formal', 'complaint', 'trans', 'union', 'reg...",formal complaint trans union regarding inacura...
645,Debt collection,The company is reporting to the credit bureau ...,6,"['company', 'reporting', 'credit', 'bureau', '...",company reporting credit bureau debt owe ever ...
646,Credit card or prepaid card,I returned merchandise to a merchant in the am...,3,"['returned', 'merchandise', 'merchant', 'amoun...",returned merchandise merchant amount merchant ...


In [17]:
df_id_to_product = pd.read_csv('../data/id_to_product.csv')
ser_id_to_product = df_id_to_product.iloc[:,0]
ser_id_to_product

0                                              Mortgage
1     Credit reporting, credit repair services, or o...
2             Payday loan, title loan, or personal loan
3                           Credit card or prepaid card
4                           Checking or savings account
5                                 Vehicle loan or lease
6                                       Debt collection
7     Money transfer, virtual currency, or money ser...
8                                          Student loan
9                               Bank account or service
10                                        Consumer Loan
Name: 0, dtype: object

In [19]:
dic_id_to_product = ser_id_to_product.to_dict()
dic_product_to_id = {v:k for k,v in dic_id_to_product.items()}

dic_product_to_id

{'Mortgage': 0,
 'Credit reporting, credit repair services, or other personal consumer reports': 1,
 'Payday loan, title loan, or personal loan': 2,
 'Credit card or prepaid card': 3,
 'Checking or savings account': 4,
 'Vehicle loan or lease': 5,
 'Debt collection': 6,
 'Money transfer, virtual currency, or money service': 7,
 'Student loan': 8,
 'Bank account or service': 9,
 'Consumer Loan': 10}

# EDA for Text Data

## Find top N correlated terms for each category
- https://www.kaggle.com/selener/multi-class-text-classification-tfidf

**Term Frequency** : This summarizes how often a given word appears within a document.

$\mathrm{TF}=\frac{\text { Number of times the term appears in the doc }}{\text { Total number of words in the doc }}$

**Inverse Document Frequency**: This downscales words that appear in many documents.
A term has a high IDF score if it appears in only a few documents.
Conversely, if the term is very common across documents (e.g., “the”, “a”, “is”),
the term has a low IDF score.

$\mathrm{IDF}=\ln \left(\frac{\text { Number of docs }}{\text { Number of docs the term appears in }}\right)$

**Term Frequency – Inverse Document Frequency TF-IDF**: 
TF-IDF is the product of the TF and IDF scores of the term.

$\mathrm{TF\text{-}IDF}=\mathrm{TF} \times \mathrm{IDF}$

TF-IDF scores highlight words that are distinctive for a document: frequent within it but rare across the rest of the corpus. A high TF-IDF score therefore means a term is both common in that document and uncommon elsewhere, not merely rare. For instance, in a Mortgage complaint the word "mortgage" is mentioned fairly often, whereas it probably does not show up in many complaints about other products. We can infer that "mortgage" is an important word for Mortgage complaints relative to the other products, so it would have a high TF-IDF score in those complaints.
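The definitions above can be checked by hand on a tiny corpus. The sketch below uses only the standard library and the plain formulas from this section, with TF-IDF taken as the product of TF and IDF. It does not reproduce the smoothing and normalization that scikit-learn's `TfidfVectorizer` applies by default. The toy documents and the function names `tf`, `idf`, and `tf_idf` are made up for illustration.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    "mortgage payment late mortgage".split(),
    "credit card fee".split(),
    "credit report error".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """ln(number of docs / number of docs containing `term`).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

# "mortgage": TF = 2/4, IDF = ln(3/1), so TF-IDF is about 0.549
print(round(tf_idf("mortgage", docs[0], docs), 3))
# "credit": TF = 1/3, IDF = ln(3/2), so TF-IDF is about 0.135
print(round(tf_idf("credit", docs[1], docs), 3))
```

"mortgage" scores higher because it is repeated within its document and appears in no other document, while "credit" is shared by two of the three documents.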

`TfidfVectorizer` can be initialized with the following parameters, among others:

- `min_df`: ignore terms that appear in fewer than `min_df` documents.
- `max_df`: ignore terms that appear in more than `max_df` documents. Both `min_df` and `max_df` accept an integer count or a float proportion of the corpus.
- `sublinear_tf`: set to `True` to scale term frequency logarithmically, replacing tf with 1 + ln(tf).
- `stop_words`: set to `'english'` to remove the built-in English stop words.
- `use_idf`: set to `True` (the default) to apply inverse-document-frequency reweighting.
- `ngram_range`: set to `(1, 2)` to consider both unigrams and bigrams.
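To make these options concrete, here is a rough pure-Python sketch of what `ngram_range`, `min_df`, and `sublinear_tf` do conceptually. It is not scikit-learn's implementation. The helper names `extract_ngrams`, `build_vocabulary`, and `sublinear`, and the toy corpus, are assumptions for illustration only.

```python
import math
from collections import Counter

def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams with n between the two bounds (inclusive)."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, ngram_range=(1, 2), min_df=1):
    """Keep only n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): count each n-gram at most once per document
        doc_freq.update(set(extract_ngrams(doc.split(), ngram_range)))
    return sorted(t for t, count in doc_freq.items() if count >= min_df)

def sublinear(tf):
    """Logarithmic term-frequency scaling: 1 + ln(tf) for tf > 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# Hypothetical toy corpus
docs = ["credit card fee", "credit card fraud", "student loan"]

print(build_vocabulary(docs, min_df=2))  # ['card', 'credit', 'credit card']
print(round(sublinear(10), 3))           # 3.303
```

With `min_df=2`, terms seen in a single document (such as "fee" or "student loan") are dropped, which is why a higher `min_df` shrinks the feature matrix.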

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf

TfidfVectorizer()

In [21]:
tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        ngram_range=(1, 2), 
                        stop_words='english')
tfidf

TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words='english',
                sublinear_tf=True)

In [22]:
# transform each complaint into a vector
features = tfidf.fit_transform(df['complaint_clean']).toarray()

labels = df['category_id']

print("Each of the %d complaints is represented by %d features (TF-IDF score of unigrams and bigrams)" %(features.shape))

Each of the 647 complaints is represented by 2023 features (TF-IDF score of unigrams and bigrams)


In [25]:
# Finding the three most correlated terms with each of the product categories
from sklearn.feature_selection import chi2
from tqdm import tqdm

products = []
top3uni = []
top3bi = []

N = 3
for product, category_id in tqdm(sorted(dic_product_to_id.items())):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
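    # get_feature_names() was removed in scikit-learn 1.2;
    # on newer versions use tfidf.get_feature_names_out() instead
    feature_names = np.array(tfidf.get_feature_names())[indices]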
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    products.append(product)
    top3uni.append(', '.join(unigrams[-N:]))
    top3bi.append(', '.join(bigrams[-N:]))

100%|██████████| 11/11 [00:00<00:00, 98.74it/s]
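The loop above ranks terms by the chi-squared statistic between each TF-IDF feature and a one-vs-rest label (`labels == category_id`). As a rough sketch of what that statistic measures for a single feature column, the pure-Python function below compares the feature mass observed inside and outside the target class with the mass expected if the feature were independent of the class. The function name `chi2_one_feature` and the toy values are hypothetical, and scikit-learn's vectorized `chi2` may differ in implementation detail.

```python
def chi2_one_feature(values, is_target):
    """Chi-squared statistic between one non-negative feature and a binary label.

    Assumes the feature is not all zeros and both classes are present,
    otherwise an expected count would be zero.
    """
    n_docs = len(values)
    total = sum(values)                      # total feature mass
    p_target = sum(is_target) / n_docs       # share of docs in the target class
    observed_in = sum(v for v, t in zip(values, is_target) if t)
    observed_out = total - observed_in
    expected_in = p_target * total           # mass expected under independence
    expected_out = (1 - p_target) * total
    return ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)

# Hypothetical toy data: four complaints, the first two are "Mortgage"
is_mortgage = [True, True, False, False]
escrow = [1.0, 1.0, 0.0, 0.0]   # appears only in Mortgage complaints
payment = [1.0, 0.0, 1.0, 0.0]  # spread evenly across classes

print(chi2_one_feature(escrow, is_mortgage))   # 2.0
print(chi2_one_feature(payment, is_mortgage))  # 0.0
```

A term concentrated in one category gets a large score, while a term spread evenly across categories scores near zero, which is why sorting by this statistic surfaces category-specific vocabulary.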


In [26]:
df_top_corr = pd.DataFrame({'product': products, 'unigram': top3uni, 'bigram': top3bi})
df_top_corr

Unnamed: 0,product,unigram,bigram
0,Bank account or service,"bank, deposited, overdraft","money acount, closing acount, fund acount"
1,Checking or savings account,"checking, bonus, debit","checking acount, fraud claim, debit card"
2,Consumer Loan,"finance, husband, instalment","xx financial, xx payment, ben paid"
3,Credit card or prepaid card,"reward, purchase, card","received statement, american expres, credit card"
4,"Credit reporting, credit repair services, or o...","experian, acounts, equifax","credit reporting, trans union, credit report"
5,Debt collection,"colector, colection, debt","debt owe, portfolio recovery, debt colector"
6,"Money transfer, virtual currency, or money ser...","scamed, transfer, fund","check xx, xx bank, acount day"
7,Mortgage,"escrow, modification, mortgage","mortgage payment, loan modification, mortgage ..."
8,"Payday loan, title loan, or personal loan","los, internet, source","pa xx, xx pa, report report"
9,Student loan,"deferment, student, navient","payment plan, pay loan, student loan"


# Total Time Taken

In [27]:
time_taken = time.time() - time_start_notebook
h, rem = divmod(time_taken, 60*60)  # whole hours and leftover seconds
m, s = divmod(rem, 60)              # whole minutes and leftover seconds
print('Time taken to run whole notebook: {:.0f} hr '
      '{:.0f} min {:.0f} secs'.format(h, m, s))

Time taken to run whole notebook: 0 hr 3 min 56 secs
