# Automatic Ticket Classification
Eeshan Gupta  
eeshangpt@gmail.com

## Introduction to Problem Statement

For a financial company, customer complaints matter because they often point to shortcomings in its products and services. Resolving complaints quickly and efficiently keeps customer dissatisfaction to a minimum and strengthens customer loyalty. Complaints also show the company where to improve its services in order to attract more customers.

### Business goal

You need to build a model that is able to classify customer complaints based on the products/services. By doing so, you can segregate these tickets into their relevant categories and, therefore, help in the quick resolution of the issue.

## Table of contents

1. [Introduction to problem statement](#Introduction-to-Problem-Statement)
2. [Reading in the data](#Reading-the-data)
3. [Cleaning the data](#Cleaning-the-data)
4. [Pre-processing the data](#Pre-Proccessing-the-data)
5. [Data Visualization](#Data-Visualization)
6. [Feature Engineering](#Feature-Engineering)
7. [Model Building](#)
8. [Inferences from the model](#)

## Reading the data

### Installations and Imports

In [1]:
# Standard library
import json
import os
import pickle
import re
import string
from collections import Counter
from operator import itemgetter
from pprint import pprint

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import plot

# NLP
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy
import en_core_web_sm
import swifter
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS

# Topic modelling (gensim)
from gensim.corpora.dictionary import Dictionary
from gensim.models.nmf import Nmf
from gensim.models.coherencemodel import CoherenceModel

# Feature extraction, topic modelling and classification (scikit-learn)
from sklearn import naive_bayes
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [2]:
%matplotlib inline

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

[nltk_data] Downloading package punkt to /home/eeshan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/eeshan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/eeshan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/eeshan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
model = spacy.load("en_core_web_sm")
nlp = en_core_web_sm.load()

In [4]:
PRJ_DIR = os.getcwd()
DATA_DIR = os.path.join(PRJ_DIR, 'data')

In [5]:
file_name = 'complaints-2021-05-14_08_16.json'
pkl_file_name = file_name + ".pkl"

In [6]:
pkl_path = os.path.join(DATA_DIR, pkl_file_name)
raw_path = os.path.join(DATA_DIR, file_name)

# Load the cached pickle when it exists; otherwise parse the raw JSON and cache it.
if os.path.isfile(pkl_path):
    print("Pickle found. Now loading...")
    with open(pkl_path, 'rb') as f:
        data = pickle.load(f)
else:
    print("Serialized file not found. Now reading the raw file....")
    with open(raw_path) as f:
        data = json.load(f)
    print("Raw file is read. Now pickling.....")
    with open(pkl_path, 'wb') as f:
        pickle.dump(data, f)

Pickle found. Now loading...


In [7]:
df = pd.json_normalize(data)

In [8]:
df.sample(10)

Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,_source.consumer_disputed,_source.product,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
70747,complaint-public-v2,complaint,1893447,0.0,,917XX,1893447,"Loan servicing, payments, escrow account",2016-04-25T12:00:00-05:00,CA,Yes,Mortgage,Closed with explanation,JPMORGAN CHASE & CO.,Web,2016-04-25T12:00:00-05:00,,Home equity loan or line of credit,Yes,I entered into a written repayment plan With C...,,Consent provided
49879,complaint-public-v2,complaint,1008554,0.0,,90016,1008554,Deposits and withdrawals,2014-08-29T12:00:00-05:00,CA,No,Bank account or service,Closed with explanation,JPMORGAN CHASE & CO.,Web,2014-08-29T12:00:00-05:00,,Checking account,Yes,,,
56810,complaint-public-v2,complaint,2148428,0.0,,53208,2148428,Problems caused by my funds being low,2016-10-05T12:00:00-05:00,WI,No,Bank account or service,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2016-10-07T12:00:00-05:00,,Checking account,Yes,,,
72788,complaint-public-v2,complaint,1531697,0.0,,48180,1531697,Sale of account,2015-08-22T12:00:00-05:00,MI,No,Credit card,Closed with explanation,JPMORGAN CHASE & CO.,Web,2015-08-22T12:00:00-05:00,,,Yes,,,Consent not provided
74604,complaint-public-v2,complaint,1594391,0.0,,91387,1594391,Deposits and withdrawals,2015-10-06T12:00:00-05:00,CA,No,Bank account or service,Closed with monetary relief,JPMORGAN CHASE & CO.,Web,2015-10-06T12:00:00-05:00,,Other bank product/service,Yes,,,Consent not provided
6547,complaint-public-v2,complaint,3815009,0.0,,79936,3815009,"Other features, terms, or problems",2020-08-26T12:00:00-05:00,TX,,Credit card or prepaid card,Closed with explanation,JPMORGAN CHASE & CO.,Web,2020-08-26T12:00:00-05:00,,General-purpose credit card or charge card,Yes,,Other problem,Consent withdrawn
60920,complaint-public-v2,complaint,355256,0.0,,97224,355256,"Loan modification,collection,foreclosure",2013-03-14T12:00:00-05:00,OR,No,Mortgage,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2013-03-18T12:00:00-05:00,,Other mortgage,Yes,,,
77066,complaint-public-v2,complaint,3009507,0.0,,76109,3009507,Managing an account,2018-09-04T12:00:00-05:00,TX,,Checking or savings account,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2018-09-10T12:00:00-05:00,,Checking account,Yes,,Deposits and withdrawals,
56310,complaint-public-v2,complaint,224175,0.0,,28630,224175,"Loan modification,collection,foreclosure",2013-01-08T12:00:00-05:00,NC,No,Mortgage,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2013-01-09T12:00:00-05:00,,Other mortgage,Yes,,,
49573,complaint-public-v2,complaint,1636876,0.0,,20746,1636876,Identity theft / Fraud / Embezzlement,2015-11-03T12:00:00-05:00,MD,No,Credit card,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2015-11-13T12:00:00-05:00,,,Yes,,,


In [9]:
df.columns 

Index(['_index', '_type', '_id', '_score', '_source.tags', '_source.zip_code',
       '_source.complaint_id', '_source.issue', '_source.date_received',
       '_source.state', '_source.consumer_disputed', '_source.product',
       '_source.company_response', '_source.company', '_source.submitted_via',
       '_source.date_sent_to_company', '_source.company_public_response',
       '_source.sub_product', '_source.timely',
       '_source.complaint_what_happened', '_source.sub_issue',
       '_source.consumer_consent_provided'],
      dtype='object')
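
The `_source.` prefix on most columns comes from `pd.json_normalize`, which flattens nested dictionaries into dot-separated column names. The idea can be sketched with the standard library alone; `flatten` is a hypothetical helper written for illustration, and the record below is a toy value shaped like one entry of the raw file, not real data.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict whose keys are joined with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the parent key as a prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Toy record (illustrative values only).
record = {"_id": "1", "_score": 0.0,
          "_source": {"product": "Mortgage", "state": "CA"}}
flat = flatten(record)
print(flat)
```

Each key of the flattened dictionary corresponds to one column of the DataFrame.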

## Cleaning the data

#### Making column labels human readable

In [10]:
clean_col_names = {i: str(i).replace("_","").replace("source.","") for i in df.columns}
df.rename(columns= clean_col_names, inplace=True)
df.rename(columns={"complaintwhathappened":"complaints"}, inplace=True)
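
The renaming above removes every underscore, so word boundaries inside a name are lost (`zip_code` becomes `zipcode`). The sketch below reproduces that mapping on three column names and contrasts it with a hypothetical alternative that strips only the leading `_` or `_source.` prefix. The alternative is not what the rest of this notebook uses; it is shown only to make the trade-off visible.

```python
import re

raw_names = ["_index", "_source.zip_code", "_source.complaint_what_happened"]

# Same chain as the notebook: drop every underscore, then the "source." prefix.
compact = {name: name.replace("_", "").replace("source.", "") for name in raw_names}

# Hypothetical alternative: strip only the leading "_" or "_source." prefix.
readable = {name: re.sub(r"^_(source\.)?", "", name) for name in raw_names}

for name in raw_names:
    print(f"{name:35} -> {compact[name]:25} | {readable[name]}")
```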

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78313 entries, 0 to 78312
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   index                    78313 non-null  object 
 1   type                     78313 non-null  object 
 2   id                       78313 non-null  object 
 3   score                    78313 non-null  float64
 4   tags                     10900 non-null  object 
 5   zipcode                  71556 non-null  object 
 6   complaintid              78313 non-null  object 
 7   issue                    78313 non-null  object 
 8   datereceived             78313 non-null  object 
 9   state                    76322 non-null  object 
 10  consumerdisputed         78313 non-null  object 
 11  product                  78313 non-null  object 
 12  companyresponse          78313 non-null  object 
 13  company                  78313 non-null  object 
 14  submittedvia          

#### Finding NAs and blank values

In [12]:
(df.isna() | (df[:] == '')).sum()

index                          0
type                           0
id                             0
score                          0
tags                       67413
zipcode                     6757
complaintid                    0
issue                          0
datereceived                   0
state                       1991
consumerdisputed               0
product                        0
companyresponse                0
company                        0
submittedvia                   0
datesenttocompany              0
companypublicresponse      78309
subproduct                 10571
timely                         0
complaints                 57241
subissue                   46297
consumerconsentprovided     1008
dtype: int64

#### Replacing blanks

In [13]:
df = df.replace("", np.nan)

In [14]:
df.isna().sum() * 100 / df.shape[0]

index                       0.000000
type                        0.000000
id                          0.000000
score                       0.000000
tags                       86.081493
zipcode                     8.628197
complaintid                 0.000000
issue                       0.000000
datereceived                0.000000
state                       2.542362
consumerdisputed            0.000000
product                     0.000000
companyresponse             0.000000
company                     0.000000
submittedvia                0.000000
datesenttocompany           0.000000
companypublicresponse      99.994892
subproduct                 13.498397
timely                      0.000000
complaints                 73.092590
subissue                   59.117899
consumerconsentprovided     1.287143
dtype: float64
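
The two steps above (turn blank strings into `NaN`, then report the share of missing values per column) can be checked on a toy frame. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a blank string and a None in the text column.
toy = pd.DataFrame({
    "complaints": ["late fee charged", "", None, "card declined"],
    "state": ["CA", "TX", "", "NY"],
})

toy = toy.replace("", np.nan)          # blanks become missing values
missing_pct = toy.isna().mean() * 100  # same as isna().sum() * 100 / len(toy)
usable = toy.dropna(subset=["complaints"])

print(missing_pct)
print(len(usable), "rows have complaint text")
```

Taking the mean of the boolean mask is equivalent to dividing the count of missing values by the number of rows.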

#### Removing blanks

In [15]:
df_cleaned = df.dropna(subset=['complaints']).copy()  # copy so later column assignments do not trigger SettingWithCopyWarning

In [16]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21072 entries, 1 to 78312
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   index                    21072 non-null  object 
 1   type                     21072 non-null  object 
 2   id                       21072 non-null  object 
 3   score                    21072 non-null  float64
 4   tags                     3816 non-null   object 
 5   zipcode                  16427 non-null  object 
 6   complaintid              21072 non-null  object 
 7   issue                    21072 non-null  object 
 8   datereceived             21072 non-null  object 
 9   state                    20929 non-null  object 
 10  consumerdisputed         21072 non-null  object 
 11  product                  21072 non-null  object 
 12  companyresponse          21072 non-null  object 
 13  company                  21072 non-null  object 
 14  submittedvia          

In [17]:
print(f"{df_cleaned.shape[0] * 100 / df.shape[0]:.2f}% of original complaints can be used for processing")

26.91% of original complaints can be used for processing


**Only 26.91% of the original records (21,072 of 78,313) contain complaint text and can be used for processing.**

In [18]:
del df

## Pre-Proccessing the data

In [19]:
for complaint in df_cleaned['complaints'].sample(1):
    print(complaint)

Hello. I am writing to get help on unfair pratices used with credit and debit card checking accounts via chase bank. I recently authorized a payment in the amount of XXXX to a company called XXXX for credit repair services on XXXX XXXX XXXX, XXXX. I orginally authorized only XXXX} and when I saw the XXXX  charge, I called the representative at XXXX and they explained that they erroneously forgot to mention that an activation fee of XXXX would also be assessed. I saw the charge pending on my checking account via mobile banking on XXXX XXXX XXXX, XXXX. On XXXX XXXX, XXXX the payment was shown as posted and paid. I also used my debit card for other items as well during this time frame. I check my account regularly and noticed that all was well and I had no issues with NSF fees at all. On XXXX XXXX, XXXX my account had a positive balance of XXXX}. My husband sent me a XXXX payment of XXXX. I noticed that the funds later in the day was not shown as available after using my debit card for a 

#### Cleaning the text

In [20]:
def cleanText(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', "", text)   # Remove text in square brackets
    text = re.sub(r'[^\w\s]', "", text)   # Remove punctuation
    text = re.sub(r'\w*\d\w*', "", text)  # Remove words containing numbers
    return " ".join(text.split())         # Collapse repeated whitespace
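
To see what each cleaning step does, here is a self-contained sketch applied to a made-up sentence. `clean_text_demo` is a hypothetical name, and the bracket pattern in it drops the bracketed text itself, which is an assumption about the intent of the "Remove text in square brackets" comment.

```python
import re


def clean_text_demo(text):
    text = text.lower()                    # normalise case
    text = re.sub(r"\[.*?\]", "", text)    # drop bracketed asides (assumed intent)
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation
    text = re.sub(r"\w*\d\w*", "", text)   # drop tokens containing digits
    return " ".join(text.split())          # collapse repeated whitespace


sample = "I paid $500 on 12/03 [see attached] to XXXX Bank!!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Amounts and dates disappear entirely because punctuation is removed first, which turns `12/03` into the all-digit token `1203` before the digit filter runs. The masked placeholder `XXXX` survives in lower case as `xxxx`.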

In [21]:
df_cleaned["complaints"] = df_cleaned["complaints"].swifter.apply(cleanText) 


In [22]:
for complaint in df_cleaned['complaints'].sample(1):
    print(complaint)

ive held a couple of credit cards with chase for a total of about years after making a big payment made to my cards card chase responded by reducing my credit rating on two cards down to within of the balance on the card my credit rating suffered a xxxx point drop as a result and chase is using my lowered credit score as the justification for lowering my limit even though theyre the ones who did it throughout my history with chase ive never missed a payment and ive always paid more than the minimum monthly balance ive been the perfect customer since this particular card is a rewards card so i used the card instead of letting it sit after getting sick over the summer with xxxx my credit card use increased for a while but the extra use was paid off when my xxxx check came in chase says my increase in utilization was the reason for their cut in my credit limit


#### Finding the length of complaints

In [23]:
df_cleaned['word_freq_complaints'] = df_cleaned['complaints'].apply(lambda x: len(str(x).split(' ')))
df_cleaned['word_freq_complaints'].describe()

count    21072.000000
mean       243.966211
std        259.961767
min          1.000000
25%         93.000000
50%        175.000000
75%        308.000000
max       5276.000000
Name: word_freq_complaints, dtype: float64
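
One detail of the word count is worth keeping in mind: `split(' ')` returns a one-element list even for an empty string, so the reported minimum of 1 does not by itself prove that every complaint contains a word. Whether any complaint is actually empty after cleaning is not shown above. A small sketch of the difference, using hypothetical helper names:

```python
def count_words_explicit_sep(text):
    # Splitting on a literal space: "" -> [""] (length 1)
    return len(text.split(" "))


def count_words_whitespace(text):
    # Splitting on any whitespace run: "" -> [] (length 0)
    return len(text.split())


for text in ["", "late fee", "late  fee"]:
    print(repr(text), count_words_explicit_sep(text), count_words_whitespace(text))
```

Because the cleaned text has already been whitespace-normalised, the two counts agree for every non-empty complaint.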

#### Lemmatizing the text

In [24]:
stop_words = stopwords.words('english')
lem = WordNetLemmatizer()

In [25]:
def lemmatize_text(text):     
    lemmatized = []
    doc = nlp(text)
    for word in doc:
        lemmatized.append(word.lemma_)
    return " ".join(lemmatized)

In [26]:
def get_nouns(text):
    blob = TextBlob(text)
    return ' '.join([ word for (word,tag) in blob.tags if tag == "NN"])

In [None]:
df_cleaned['lemmatized_complaints'] = df_cleaned['complaints'].swifter.apply(lemmatize_text).swifter.apply(get_nouns)


#### Only keeping the Complaints and Lemmatized Complaints for further processing

In [None]:
df_cleaned = df_cleaned[['complaints','lemmatized_complaints']]

In [None]:
df_cleaned.sample(1)

#### Extracting POS Tags

In [None]:
def extract_POS_tags(text):
  nouns =  [token for token, pos in pos_tag(word_tokenize(text)) if pos.startswith('N')]
  return ' '.join(nouns)
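
The two noun filters in this notebook are not equivalent. `get_nouns` keeps only the Penn Treebank tag `NN` (singular common nouns), while `extract_POS_tags` keeps every tag starting with `N` (`NN`, `NNS`, `NNP`, `NNPS`). The sketch below uses hand-assigned tags rather than a real tagger, so the tagged list is purely illustrative.

```python
# Hand-tagged tokens (hypothetical; a real tagger may assign different tags).
tagged = [
    ("account", "NN"),    # singular common noun
    ("accounts", "NNS"),  # plural common noun
    ("chase", "NNP"),     # proper noun
    ("call", "VB"),       # verb
    ("late", "JJ"),       # adjective
]

singular_only = [word for word, tag in tagged if tag == "NN"]
all_nouns = [word for word, tag in tagged if tag.startswith("N")]

print(singular_only)
print(all_nouns)
```

Since the second filter runs on text that has already passed the stricter first one, its practical effect depends on how the tagger re-tags the shortened text.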

In [None]:
df_cleaned['complaint_POS_removed'] =  df_cleaned['lemmatized_complaints'].swifter.apply(extract_POS_tags)

In [None]:
df_cleaned.sample(5)

## Data Visualization

#### Length of complaints

In [None]:
df_cleaned["complaint_length"] = df_cleaned["complaints"].apply(len)

In [None]:
fig = plt.figure(figsize=(10,6))
plt.hist(df_cleaned['complaint_length'], bins=50)
plt.title('Distribution of Complaint length', fontsize=20)
plt.ylabel('Number of complaints', fontsize=14)
plt.xlabel('Complaint character length', fontsize=14)
plt.show()

#### Word Cloud

In [None]:
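wc_stopwords = set(STOPWORDS)  # separate name so the nltk `stopwords` import is not shadowed

In [None]:
# Join the full text of every complaint; str(Series) would only yield the truncated printed repr.
wordcloud = WordCloud(background_color = 'black', width = 800, height = 400, stopwords = wc_stopwords,
                      colormap = 'viridis', max_words = 180, contour_width = 3,
                      max_font_size = 80, contour_color = 'steelblue',
                      random_state = 0).generate(' '.join(df_cleaned['complaint_POS_removed'].astype(str)))

fig = plt.figure(figsize=(20,15))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()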
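
The text handed to `WordCloud.generate` should be the full content of the column. Calling `str()` on a pandas Series returns its printed representation instead, which includes index numbers and a `dtype` footer and is truncated for long Series. The toy Series below illustrates the difference; its contents are made up.

```python
import pandas as pd

docs = pd.Series([f"word{i}" for i in range(200)])

as_repr = str(docs)       # printed representation: truncated, with index and dtype
as_text = " ".join(docs)  # every document, joined into one string

print(as_repr)
print(len(as_text.split()), "tokens in the joined text")
```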

#### Cleaning POS Tags

In [None]:
df_cleaned['complaint_clean'] = df_cleaned['complaint_POS_removed'].str.replace('-PRON-', '')
df_cleaned.sample(5)

#### Unigram, bigram and trigram analysis

In [None]:
def get_top_unigrams(text, n=None):
    vec = CountVectorizer(stop_words='english').fit(text)
    bag_of_words = vec.transform(text)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
common_words = get_top_unigrams(df_cleaned['complaint_clean'].values.astype('U'), 30)
unigram = pd.DataFrame(common_words, columns = ['unigram' , 'count'])
unigram.head(10)

In [None]:
unigram_top_10 = unigram.sort_values('count', ascending=False).head(10)
plt.figure(figsize=(10, 6))
bars = plt.bar(unigram_top_10['unigram'], unigram_top_10['count'], color='skyblue')

# Label each bar with its unigram (the x tick labels are hidden below)
for bar, label in zip(bars, unigram_top_10['unigram']):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(label),
             ha='center', va='bottom')

plt.ylabel('Frequency')
plt.title('Frequency of Unigrams')
plt.xticks([])
plt.show()

In [None]:
def get_top_bigrams(text, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(text)
    bag_of_words = vec.transform(text)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
common_words = get_top_bigrams(df_cleaned['complaint_clean'].values.astype('U'), 30)
bigram = pd.DataFrame(common_words, columns = ['bigram' , 'count'])
bigram.head(10)

In [None]:
bigram_top_10 = bigram.sort_values('count', ascending=False).head(10)
plt.figure(figsize=(15, 6))
bars = plt.bar(bigram_top_10['bigram'], bigram_top_10['count'], color='skyblue')

# Label each bar with its bigram (the x tick labels are hidden below)
for bar, label in zip(bars, bigram_top_10['bigram']):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(label),
             ha='center', va='bottom')

plt.ylabel('Frequency')
plt.title('Frequency of Bigrams')
plt.xticks([])
plt.show()

In [None]:
def get_top_trigrams(text, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(text)
    bag_of_words = vec.transform(text)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
common_words = get_top_trigrams(df_cleaned['complaint_clean'].values.astype('U'), 30)
trigram = pd.DataFrame(common_words, columns = ['trigram' , 'count'])
trigram.head(10)

In [None]:
trigram_top_10 = trigram.sort_values('count', ascending=False).head(10)
plt.figure(figsize=(15, 6))
bars = plt.bar(trigram_top_10['trigram'], trigram_top_10['count'], color='skyblue')

# Label each bar with its trigram (the x tick labels are hidden below)
for bar, label in zip(bars, trigram_top_10['trigram']):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(label),
             ha='center', va='bottom')

plt.ylabel('Frequency')
plt.title('Frequency of Trigrams')
plt.xticks([])
plt.show()
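
The three helper functions above differ only in their `ngram_range`. What they count can be sketched with `collections.Counter`; `top_ngrams` is a hypothetical helper, and unlike `CountVectorizer` it does no lower-casing, token-pattern filtering or stop-word removal, so the counts on real text may differ.

```python
from collections import Counter


def top_ngrams(docs, n, k):
    """Return the k most frequent n-grams over whitespace-tokenised docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of n tokens across the document.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


# Toy corpus (illustrative only).
docs = ["credit card payment", "credit card fraud", "card payment late"]
print(top_ngrams(docs, 1, 1))
print(top_ngrams(docs, 2, 2))
```

A single function parameterised by `n` would also remove the duplication between the unigram, bigram and trigram helpers.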

#### Removing personal data marker from the text

In [None]:
def remove_allXX(text):
  return re.sub('[x]{2,}',"",text)

In [None]:
df_cleaned['complaint_clean'] = df_cleaned['complaint_clean'].swifter.apply(remove_allXX)

In [None]:
df_cleaned['complaint_clean'] = df_cleaned['complaint_clean'].str.replace('xxxx','')
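
The pattern `[x]{2,}` removes any run of two or more `x` characters, including runs inside ordinary words. The sketch below contrasts it with a word-bounded variant; the variant and the sample sentence are assumptions made for illustration, not part of the original pipeline.

```python
import re

sample = "exxon charged xxxx on xxxxxxxx"

anywhere = re.sub(r"x{2,}", "", sample)          # also alters "exxon"
whole_tokens = re.sub(r"\bx{2,}\b", "", sample)  # only stand-alone masks
tidy = " ".join(whole_tokens.split())            # collapse the gaps left behind

print(repr(anywhere))
print(repr(tidy))
```

Either way, removing the masks leaves extra spaces behind, so a final whitespace normalisation keeps the text tidy.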

In [None]:
df_cleaned.sample(10)

## Feature Engineering

#### TF-IDF Vectorization

In [None]:
tfidf_model = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [None]:
doc_term_mat = tfidf_model.fit_transform(df_cleaned['complaint_clean'])
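
For reference, with scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`) the weight of term $t$ in document $d$ is computed as follows, where $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

Each document row is then scaled to unit Euclidean length. The settings `max_df=0.95` and `min_df=2` act before weighting: terms that occur in more than 95% of the documents, or in fewer than 2 documents, are dropped from the vocabulary.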

## Topic Modelling

In [None]:
texts = df_cleaned['complaint_clean']
dataset = [d.split() for d in texts]

In [None]:
dictionary = Dictionary(dataset)

In [None]:
dictionary.filter_extremes(
    no_below=3,
    no_above=0.85,
    keep_n=5000
)
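
As I understand the gensim documentation, `filter_extremes` keeps tokens that appear in at least `no_below` documents and in no more than a fraction `no_above` of all documents, and then keeps the `keep_n` most frequent of those. A standard-library sketch of that rule follows; `filter_vocab` is a hypothetical helper, the alphabetical tie-breaking is my own choice, and the corpus is a toy example.

```python
from collections import Counter


def filter_vocab(tokenized_docs, no_below, no_above, keep_n):
    """Keep tokens by document frequency, then cap the vocabulary size."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = int(no_above * len(tokenized_docs))
    kept = [(tok, df) for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    kept.sort(key=lambda item: (-item[1], item[0]))  # most frequent first
    return [tok for tok, _ in kept[:keep_n]]


docs = [["card", "fee"], ["card", "loan"], ["card", "fee", "fraud"], ["card", "loan"]]
vocab = filter_vocab(docs, no_below=2, no_above=0.85, keep_n=5000)
print(vocab)
```

Here `card` is dropped for being in every document and `fraud` for appearing only once.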

In [None]:
corpus = [dictionary.doc2bow(text) for text in dataset]

In [None]:
topic_nums = list(range(5, 10))

In [None]:
coherence_scores = []

for num in topic_nums:
    nmf = Nmf(
        corpus=corpus,
        num_topics=num,
        id2word=dictionary,
        chunksize=2000,
        passes=5,
        kappa=.1,
        minimum_probability=0.01,
        w_max_iter=300,
        w_stop_condition=0.0001,
        h_max_iter=100,
        h_stop_condition=0.001,
        eval_every=10,
        normalize=True,
        random_state=42
    )
    
    # Run the coherence model to get the score
    cm = CoherenceModel(
        model=nmf,
        texts=dataset,  # tokenized documents; c_v coherence needs lists of tokens, not raw strings
        dictionary=dictionary,
        coherence='c_v'
    )
    
    coherence_scores.append(round(cm.get_coherence(), 5))

In [None]:
scores = list(zip(topic_nums, coherence_scores))
best_num_topics = sorted(scores, key=itemgetter(1), reverse=True)[0][0]

print(best_num_topics)
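
The selection step can be tried in isolation. The coherence values below are placeholders invented for the example, not results from this notebook. They are chosen to include a tie, which shows that both `sorted(..., reverse=True)[0]` and `max` return the first of the tied entries, that is, the smaller topic count when `topic_nums` is ascending.

```python
from operator import itemgetter

topic_nums = [5, 6, 7, 8, 9]
coherence_scores = [0.41, 0.44, 0.39, 0.44, 0.40]  # placeholder values

scores = list(zip(topic_nums, coherence_scores))

best_sorted = sorted(scores, key=itemgetter(1), reverse=True)[0][0]
best_max = max(scores, key=itemgetter(1))[0]  # same result without a full sort

print(best_sorted, best_max)
```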

#### Manual Topic Modelling

In [None]:
num_topics = 5

In [None]:
nmf_model = NMF(n_components=num_topics, random_state=40)

In [None]:
nmf_model.fit(doc_term_mat)
len(tfidf_model.get_feature_names_out())
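
As a reminder of what is being fitted: NMF approximates the non-negative document-term matrix $V$ (documents by terms) as the product of two non-negative matrices. With scikit-learn's default Frobenius loss and no regularisation, the objective is

$$
V \approx W H, \qquad \min_{W \ge 0,\; H \ge 0} \tfrac{1}{2}\,\lVert V - W H \rVert_F^2
$$

Here $H$ (topics by terms) is exposed as `nmf_model.components_`, and $W$ (documents by topics) is what `nmf_model.transform` returns.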

In [None]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf_model.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

In [None]:
topic_results = nmf_model.transform(doc_term_mat)
topic_results[0].round(2)
topic_results[0].argmax()
topic_results.argmax(axis=1)

In [None]:
df_cleaned['topic'] = topic_results.argmax(axis=1) 
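
Assigning each complaint to its highest-weighted topic is a row-wise `argmax`. The sketch below uses a made-up document-topic matrix and also shows an edge case: a row of zeros (for example, a document with no in-vocabulary terms) is still assigned topic 0, so keeping the maximum weight alongside the label helps to spot such rows.

```python
import numpy as np

# Toy document-topic weights (3 documents, 5 topics); values are illustrative.
doc_topic = np.array([
    [0.02, 0.00, 0.31, 0.05, 0.01],
    [0.10, 0.12, 0.00, 0.00, 0.03],
    [0.00, 0.00, 0.00, 0.00, 0.00],  # no signal at all
])

dominant = doc_topic.argmax(axis=1)  # index of the largest weight per row
strength = doc_topic.max(axis=1)     # the weight itself

print(dominant)
print(strength)
```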

In [None]:
df_cleaned.head(10)

In [None]:
# df_cleaned=df_cleaned.groupby('topic').head(5)
# df_cleaned.sort_values('Topic')