## Complaint Categorization Baseline Model

Fast and efficient handling of complaints on consumer forums is vital to the commerce industry today. This notebook presents a baseline approach to categorizing such complaints automatically. Consumer complaints about financial products are used as the dataset to establish baseline results.

The tf-idf (term frequency times inverse document frequency) scheme for weighting individual tokens is often used in information retrieval. One advantage of tf-idf is that it reduces the impact of tokens that occur very frequently across documents and hence carry little to no information.
The tf-idf of term 't' in document 'd' is tf-idf(d, t) = tf(t) * idf(d, t), where tf(t) is the number of times t occurs in d, and idf is given by idf(d, t) = log [(1 + n) / (1 + df(d, t))] + 1, where n is the total number of documents and df(d, t) is the number of documents that contain t. This is the smoothed form, which corresponds to the smooth_idf=True setting of TfidfVectorizer.
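
As a rough illustration of the weighting above, the following sketch computes smoothed tf-idf by hand on a made-up three-document corpus. The corpus and the helper names `smoothed_idf` and `tfidf` are assumptions for this example only. It uses the natural logarithm and skips the L2 normalization that TfidfVectorizer applies by default, so the numbers will not match the vectorizer's output exactly.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the complaints dataset)
docs = [
    "credit report error",
    "credit card fee",
    "mortgage payment late",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)  # total number of documents

# df_counts[t]: number of documents that contain term t
df_counts = Counter(t for doc in tokenized for t in set(doc))

def smoothed_idf(term):
    """idf = log[(1 + n) / (1 + df)] + 1, using the natural log."""
    return math.log((1 + n) / (1 + df_counts[term])) + 1

def tfidf(doc_tokens):
    """Unnormalized tf-idf weights for one tokenized document."""
    tf = Counter(doc_tokens)
    return {t: count * smoothed_idf(t) for t, count in tf.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first document, "credit" appears in two of the three documents, so it receives a lower weight than "report" or "error", which each appear in only one.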

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Importing pandas for operating on dataset
import pandas as pd

df = pd.read_csv('../Dataset/complaints.csv')

### Typical Complaint

In [2]:
df['Consumer complaint narrative'][0]

'I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements'

### Categories

In [3]:
print(df.Product.unique())

['Credit reporting' 'Consumer Loan' 'Debt collection' 'Mortgage'
 'Credit card' 'Other financial service' 'Bank account or service'
 'Student loan' 'Money transfers' 'Payday loan' 'Prepaid card'
 'Virtual currency'
 'Credit reporting, credit repair services, or other personal consumer reports'
 'Credit card or prepaid card' 'Checking or savings account'
 'Payday loan, title loan, or personal loan'
 'Money transfer, virtual currency, or money service'
 'Vehicle loan or lease']


### Train-test split
15% of the data is held out for validation and the remaining 85% is used for training. This yields 152809 training instances and 26967 validation instances.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    df['Consumer complaint narrative'].values, df['Product'].values, 
    test_size=0.15, random_state=0)
print('Training utterances: {}'.format(X_train.shape[0]))
print('Validation utterances: {}'.format(X_test.shape[0]))

Training utterances: 152809
Validation utterances: 26967
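
The two counts can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional test size up to the next whole sample and assigns the rest to training; the variable names are illustrative only.

```python
import math

# Total number of utterances, taken from the two counts printed above
n_total = 152809 + 26967
test_size = 0.15

# Assumption: the test set size is rounded up, the remainder is training data
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_train, n_test)
```

With 179776 utterances in total, 15% is 26966.4, which rounds up to the 26967 validation instances reported above.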


### Calculating tf-idf scores
The vectorizer learns the vocabulary and idf weights from the training utterances. Each utterance in the training and validation sets is then converted into a sparse vector of tf-idf scores.

In [5]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [6]:
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
X_train, X_test

(<152809x76350 sparse matrix of type '<class 'numpy.float64'>'
 	with 13864799 stored elements in Compressed Sparse Row format>,
 <26967x76350 sparse matrix of type '<class 'numpy.float64'>'
 	with 2447784 stored elements in Compressed Sparse Row format>)

### Feature Selection
The chi-square test measures dependence between stochastic variables, so scoring features with it “weeds out” those that are most likely to be independent of the class and therefore irrelevant for classification. Here the 5000 highest-scoring features are kept.
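
To make the scoring concrete, here is a minimal pure-Python sketch of a chi-square statistic computed per feature from class-wise feature totals, which is how the `chi2` scorer is understood to work. The function name `chi2_scores` and the toy matrix are assumptions for illustration, not part of the notebook.

```python
def chi2_scores(X, y):
    """Chi-square score per feature.

    X: list of rows with non-negative feature values.
    y: list of class labels, one per row.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))

    # observed[c][j]: total of feature j over all samples of class c
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            observed[label][j] += value

    class_prob = {c: y.count(c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        feature_total = sum(observed[c][j] for c in classes)
        score = 0.0
        for c in classes:
            # Expected total if feature j were independent of the class
            expected = class_prob[c] * feature_total
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores

# Hypothetical features: column 0 = "mortgage", column 1 = "the"
X_toy = [[2, 1], [3, 1], [0, 1], [0, 1]]
y_toy = ["Mortgage", "Mortgage", "Credit card", "Credit card"]
scores = chi2_scores(X_toy, y_toy)
print(scores)
```

The "mortgage" column only fires for one class and gets a high score, while the "the" column is spread evenly across classes and scores zero, so a top-k selection would drop it first.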

In [7]:
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=5000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

X_train, X_test

(<152809x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 10780400 stored elements in Compressed Sparse Row format>,
 <26967x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 1907878 stored elements in Compressed Sparse Row format>)

### Naive Bayes
In multinomial naive Bayes, the probability of a document $d$ being in class $c$ is computed as $$P(c|d) \propto P(c) \prod_{1\le k \le n_d}{P(t_k|c)} $$ where $P(c)$ is the prior probability of a document occurring in class $c$, $P(t_k|c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$, and $n_d$ is the number of tokens in $d$. The predicted class is the one that maximizes this product.
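
The scoring rule above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's implementation: it works on raw token counts rather than tf-idf weights, sums log probabilities to avoid underflow, and uses add-one (Laplace) smoothing, which matches the default `alpha=1.0` of MultinomialNB. The names `train_nb` and `predict_nb` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(t|c) with additive smoothing."""
    vocab = {t for d in docs for t in d.split()}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())

    log_prior = {c: math.log(k / len(docs)) for c, k in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log P(c) + sum of log P(t_k|c)."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_lik[c][t] for t in doc.split() if t in log_lik[c]
        )
    return max(scores, key=scores.get)

# Toy training data (hypothetical)
docs = [
    "mortgage payment late",
    "mortgage escrow error",
    "card fee charged",
    "card interest fee",
]
labels = ["Mortgage", "Mortgage", "Credit card", "Credit card"]

log_prior, log_lik = train_nb(docs, labels)
print(predict_nb("late mortgage fee", log_prior, log_lik))
```

Even though "fee" points toward the credit card class, "late" and "mortgage" together outweigh it, so the example document is assigned to the mortgage class.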

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))

0.7656024029369229
