__Automatic Ticket Assignment__

Whenever a user raises a complaint, the model predicts the department related to the complaint so that the ticket can be assigned automatically to the respective team. The data is collected from the __CFPB__ (Consumer Financial Protection Bureau) website, an official US government website for raising finance-related complaints.

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#loading the raw dataset
raw_data = pd.read_csv('complaints-2024-08-16_19_43.csv')

In [3]:
raw_data.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,05/31/24,Debt collection,Credit card debt,Attempts to collect debt not owed,Debt was result of identity theft,I am writing to formally bring to your attenti...,,Credit Card Receivables Fund Incorporated,GA,303XX,,Consent provided,Web,05/31/24,Closed with explanation,Yes,,9146890
1,06/18/24,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,"Dear Sir/Ma'am, Be advised that the descriptio...",Company has responded to the consumer and the ...,Experian Information Solutions Inc.,FL,33319,,Consent provided,Web,06/18/24,Closed with explanation,Yes,,9290515
2,05/30/24,Debt collection,Telecommunications debt,False statements or representation,Attempted to collect wrong amount,JEFFERSON CAPITAL SYST Date opened XX/XX/XXXX ...,,CL Holdings LLC,TX,787XX,Servicemember,Consent provided,Web,05/30/24,Closed with explanation,Yes,,9133121
3,05/30/24,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Credit inquiries on your report that you don't...,UNAUTHORIZED INQUIRIES,,"Bread Financial Holdings, Inc.",TX,76179,,Consent provided,Web,06/03/24,Closed with non-monetary relief,Yes,,9134660
4,06/05/24,Credit reporting or other personal consumer re...,Credit reporting,Problem with a company's investigation into an...,Their investigation did not fix an error on yo...,I am now requesting proof of the previous inve...,,Experian Information Solutions Inc.,NY,10457,,Consent provided,Web,06/05/24,In progress,Yes,,9178854


__Feature Engineering__

We need to select the features that contribute to the prediction. Our aim is to predict the team to which a complaint should be routed. So let us take Product as the dependent variable (team) and use Issue, Sub-issue and Consumer complaint narrative, which together describe the complaint raised by the user, as the inputs.

In [4]:
data = raw_data[['Product','Issue','Sub-issue','Consumer complaint narrative']].copy()  #copy so later edits do not touch a slice of raw_data
data.head()

Unnamed: 0,Product,Issue,Sub-issue,Consumer complaint narrative
0,Debt collection,Attempts to collect debt not owed,Debt was result of identity theft,I am writing to formally bring to your attenti...
1,Credit reporting or other personal consumer re...,Incorrect information on your report,Information belongs to someone else,"Dear Sir/Ma'am, Be advised that the descriptio..."
2,Debt collection,False statements or representation,Attempted to collect wrong amount,JEFFERSON CAPITAL SYST Date opened XX/XX/XXXX ...
3,Credit reporting or other personal consumer re...,Improper use of your report,Credit inquiries on your report that you don't...,UNAUTHORIZED INQUIRIES
4,Credit reporting or other personal consumer re...,Problem with a company's investigation into an...,Their investigation did not fix an error on yo...,I am now requesting proof of the previous inve...


In [5]:
#Let us check the number of teams
data['Product'].unique()

array(['Debt collection',
       'Credit reporting or other personal consumer reports',
       'Checking or savings account', 'Credit card',
       'Payday loan, title loan, personal loan, or advance loan',
       'Mortgage', 'Vehicle loan or lease', 'Prepaid card',
       'Student loan',
       'Money transfer, virtual currency, or money service',
       'Debt or credit management'], dtype=object)

In [6]:
#Number of data in each category
data['Product'].value_counts()

Product
Credit reporting or other personal consumer reports        75498
Debt collection                                             8403
Credit card                                                 4458
Checking or savings account                                 3787
Mortgage                                                    1450
Money transfer, virtual currency, or money service          1000
Vehicle loan or lease                                        862
Student loan                                                 819
Payday loan, title loan, personal loan, or advance loan      533
Prepaid card                                                 402
Debt or credit management                                    128
Name: count, dtype: int64

Vehicle loan or lease, Student loan, and Payday loan, title loan, personal loan, or advance loan each have few complaints, so let us merge them into a single team called __Loan__

In [7]:
#Reassign the column instead of calling replace(..., inplace=True) on a column selection, which pandas warns about
data = data.assign(Product=data['Product'].replace({'Vehicle loan or lease':'Loan','Student loan':'Loan','Payday loan, title loan, personal loan, or advance loan':'Loan'}))

In [8]:
data['Product'].value_counts()

Product
Credit reporting or other personal consumer reports    75498
Debt collection                                         8403
Credit card                                             4458
Checking or savings account                             3787
Loan                                                    2214
Mortgage                                                1450
Money transfer, virtual currency, or money service      1000
Prepaid card                                             402
Debt or credit management                                128
Name: count, dtype: int64

The names Credit reporting or other personal consumer reports, Checking or savings account, and Money transfer, virtual currency, or money service are lengthy, so let us replace them with short and meaningful ones

In [9]:
data = data.assign(Product=data['Product'].replace({'Credit reporting or other personal consumer reports':'Consumer Reports','Checking or savings account':'Accounts','Money transfer, virtual currency, or money service':'Money Trasfer'}))

In [10]:
data['Product'].value_counts()

Product
Consumer Reports             75498
Debt collection               8403
Credit card                   4458
Accounts                      3787
Loan                          2214
Mortgage                      1450
Money Trasfer                 1000
Prepaid card                   402
Debt or credit management      128
Name: count, dtype: int64

Prepaid card and Debt or credit management have very few complaints. So let us drop Prepaid card and merge Debt or credit management into Debt collection

In [11]:
data = data.assign(Product=data['Product'].replace({'Debt or credit management':'Debt collection'}))

In [12]:
data = data[data['Product']!='Prepaid card']

In [13]:
data['Product'].value_counts()

Product
Consumer Reports    75498
Debt collection      8531
Credit card          4458
Accounts             3787
Loan                 2214
Mortgage             1450
Money Trasfer        1000
Name: count, dtype: int64

Team names look better now!
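The merges, renames and the drop above can also be expressed as one mapping applied in a single step. The following is a minimal sketch on a hypothetical toy frame (`toy` and `team_names` are illustrative names, not part of the notebook); the label strings are copied from the cells above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `data`.
toy = pd.DataFrame({
    "Product": [
        "Vehicle loan or lease",
        "Student loan",
        "Debt or credit management",
        "Prepaid card",
        "Mortgage",
    ]
})

# One mapping covering every merge and rename performed above.
team_names = {
    "Vehicle loan or lease": "Loan",
    "Student loan": "Loan",
    "Payday loan, title loan, personal loan, or advance loan": "Loan",
    "Credit reporting or other personal consumer reports": "Consumer Reports",
    "Checking or savings account": "Accounts",
    "Money transfer, virtual currency, or money service": "Money Trasfer",
    "Debt or credit management": "Debt collection",
}

# Series.replace returns a new Series; assigning it back avoids inplace calls.
toy["Product"] = toy["Product"].replace(team_names)

# Drop the category that is excluded from the model.
toy = toy[toy["Product"] != "Prepaid card"].reset_index(drop=True)

print(toy["Product"].tolist())
# ['Loan', 'Loan', 'Debt collection', 'Mortgage']
```

Keeping the mapping in one dictionary makes it easier to review which original products end up in which team.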

Consumer Reports has far more complaints than the other teams, which may bias the model toward that class. So let us downsample it to reduce the imbalance and the training cost

In [14]:
data_CR = data[data['Product']=='Consumer Reports'][:6000]

In [15]:
data1 = data[data['Product']!='Consumer Reports']

In [16]:
data_teams = pd.concat([data1,data_CR],ignore_index=True)

In [17]:
data_teams['Product'].value_counts()

Product
Debt collection     8531
Consumer Reports    6000
Credit card         4458
Accounts            3787
Loan                2214
Mortgage            1450
Money Trasfer       1000
Name: count, dtype: int64

Product column looks much better now
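The slice `[:6000]` keeps the first 6000 Consumer Reports rows in file order. If the file happens to be ordered (for example by date), a random sample may be more representative. Below is a minimal sketch of that alternative on hypothetical toy data; `toy`, `row_id` and the sample size of 4 are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical toy frame: 10 majority-class rows and 3 minority-class rows.
toy = pd.DataFrame({
    "Product": ["Consumer Reports"] * 10 + ["Mortgage"] * 3,
    "row_id": range(13),
})

is_cr = toy["Product"] == "Consumer Reports"

# Draw a reproducible random sample of the majority class.
cr_sample = toy[is_cr].sample(n=4, random_state=42)

# Recombine with the untouched minority rows.
balanced = pd.concat([toy[~is_cr], cr_sample], ignore_index=True)

print(balanced["Product"].value_counts())
```

Fixing `random_state` keeps the sample reproducible across runs.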

In [18]:
data_teams.head()

Unnamed: 0,Product,Issue,Sub-issue,Consumer complaint narrative
0,Debt collection,Attempts to collect debt not owed,Debt was result of identity theft,I am writing to formally bring to your attenti...
1,Debt collection,False statements or representation,Attempted to collect wrong amount,JEFFERSON CAPITAL SYST Date opened XX/XX/XXXX ...
2,Debt collection,Attempts to collect debt not owed,Debt is not yours,I want to stress that I did not give written p...
3,Debt collection,Written notification about debt,Notification didn't disclose it was an attempt...,I recognize the importance of removing any inc...
4,Accounts,Closing an account,Company closed your account,I wrote a letter to Chex System inquiring abou...


Issue and Sub-issue are not very meaningful on their own. So let us combine them into a single column called Subject

In [19]:
data_teams['Subject'] = data_teams['Issue']+' '+data_teams['Sub-issue']

In [20]:
data_teams.head()

Unnamed: 0,Product,Issue,Sub-issue,Consumer complaint narrative,Subject
0,Debt collection,Attempts to collect debt not owed,Debt was result of identity theft,I am writing to formally bring to your attenti...,Attempts to collect debt not owed Debt was res...
1,Debt collection,False statements or representation,Attempted to collect wrong amount,JEFFERSON CAPITAL SYST Date opened XX/XX/XXXX ...,False statements or representation Attempted t...
2,Debt collection,Attempts to collect debt not owed,Debt is not yours,I want to stress that I did not give written p...,Attempts to collect debt not owed Debt is not ...
3,Debt collection,Written notification about debt,Notification didn't disclose it was an attempt...,I recognize the importance of removing any inc...,Written notification about debt Notification d...
4,Accounts,Closing an account,Company closed your account,I wrote a letter to Chex System inquiring abou...,Closing an account Company closed your account
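One thing to watch with string concatenation in pandas: if either operand is missing, the result is missing too. Whether this dataset has rows without a Sub-issue is not shown here, so the following is only a sketch on hypothetical toy data of how such rows could be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has no Sub-issue.
toy = pd.DataFrame({
    "Issue": ["Closing an account", "Managing an account"],
    "Sub-issue": ["Company closed your account", np.nan],
})

# Plain concatenation propagates the missing value.
plain = toy["Issue"] + " " + toy["Sub-issue"]

# Filling missing parts with an empty string keeps the Issue text.
safe = (toy["Issue"].fillna("") + " " + toy["Sub-issue"].fillna("")).str.strip()

print(plain.isna().tolist())  # [False, True]
print(safe.tolist())
# ['Closing an account Company closed your account', 'Managing an account']
```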


In [21]:
#let us drop the Issue and Sub-issue columns
data_t1 = data_teams[['Product','Subject','Consumer complaint narrative']]
data_t1.head()

Unnamed: 0,Product,Subject,Consumer complaint narrative
0,Debt collection,Attempts to collect debt not owed Debt was res...,I am writing to formally bring to your attenti...
1,Debt collection,False statements or representation Attempted t...,JEFFERSON CAPITAL SYST Date opened XX/XX/XXXX ...
2,Debt collection,Attempts to collect debt not owed Debt is not ...,I want to stress that I did not give written p...
3,Debt collection,Written notification about debt Notification d...,I recognize the importance of removing any inc...
4,Accounts,Closing an account Company closed your account,I wrote a letter to Chex System inquiring abou...


Let us create a Complaint column and convert the product into numeric team codes

In [22]:
#assign returns a new DataFrame, so nothing is written to a slice of data_teams
data_t1 = data_t1.assign(
    Complaint=data_t1['Subject']+' '+data_t1['Consumer complaint narrative'],
    Team=data_t1['Product'],
)
final_data = data_t1[['Team','Complaint']].copy()
final_data.head()

Unnamed: 0,Team,Complaint
0,Debt collection,Attempts to collect debt not owed Debt was res...
1,Debt collection,False statements or representation Attempted t...
2,Debt collection,Attempts to collect debt not owed Debt is not ...
3,Debt collection,Written notification about debt Notification d...
4,Accounts,Closing an account Company closed your account...


In [23]:
final_data['Team'].value_counts()

Team
Debt collection     8531
Consumer Reports    6000
Credit card         4458
Accounts            3787
Loan                2214
Mortgage            1450
Money Trasfer       1000
Name: count, dtype: int64

In [24]:
team_code = {
    'Debt collection':0,
    'Consumer Reports':1,
    'Credit card':2,
    'Accounts':3,
    'Loan':4,
    'Mortgage':5,
    'Money Trasfer':6
}

In [25]:
#Map each team name to its numeric code
final_data = final_data.assign(Team=final_data['Team'].map(team_code))
final_data.head()

Unnamed: 0,Team,Complaint
0,0,Attempts to collect debt not owed Debt was res...
1,0,False statements or representation Attempted t...
2,0,Attempts to collect debt not owed Debt is not ...
3,0,Written notification about debt Notification d...
4,3,Closing an account Company closed your account...


Let us apply some transformations to the text column

In [26]:
#convert to lower case
final_data = final_data.assign(Complaint=final_data['Complaint'].str.lower())
final_data.head()


Unnamed: 0,Team,Complaint
0,0,attempts to collect debt not owed debt was res...
1,0,false statements or representation attempted t...
2,0,attempts to collect debt not owed debt is not ...
3,0,written notification about debt notification d...
4,3,closing an account company closed your account...


In [27]:
#Remove special characters (anything other than letters, digits, spaces and hyphens)
import re
final_data = final_data.assign(Complaint=final_data['Complaint'].apply(lambda x: re.sub(r'[^a-z A-Z 0-9-]+', '', str(x))))
final_data.head()


Unnamed: 0,Team,Complaint
0,0,attempts to collect debt not owed debt was res...
1,0,false statements or representation attempted t...
2,0,attempts to collect debt not owed debt is not ...
3,0,written notification about debt notification d...
4,3,closing an account company closed your account...


In [28]:
#Remove additional spaces
final_data = final_data.assign(Complaint=final_data['Complaint'].apply(lambda x: " ".join(x.split())))


In [29]:
final_data.iloc[0]['Complaint']

'attempts to collect debt not owed debt was result of identity theft i am writing to formally bring to your attention a concern regarding my inability to dispute an account in violation of the provisions set forth under 12 cfr 100634 c 4 i c despite my attempts to address and dispute the accuracy of this debt i have not been provided with the necessary information or opportunity to dispute this account effectively as mandated by the aforementioned regulation according to 12 cfr 100634 c 4 i c debt collectors are required to inform consumers of their right to dispute the validity of a debt within 30 days of receiving the validation notice however in my case i was not provided with a clear and unequivocal notice of my right to dispute the debt as a result i have been deprived of my statutory right to challenge the accuracy of the information being reported against me the failure to adhere to this regulation has caused me significant distress and confusion therefore i kindly request the c
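The three cleaning steps (lower-casing, removing special characters, collapsing whitespace) can be bundled into one function so that exactly the same cleaning can be reused at prediction time. This is a minimal sketch; `clean_text` is a hypothetical helper name, and the regular expression is the one used above, simplified because the text is already lower-cased.

```python
import re

def clean_text(text: str) -> str:
    """Lower-case, drop special characters, and collapse repeated whitespace."""
    text = str(text).lower()
    # Delete everything except lowercase letters, digits, spaces and hyphens.
    text = re.sub(r"[^a-z0-9 -]+", "", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(clean_text("Dear Sir/Ma'am,  see 12 CFR 1006.34!"))
# dear sirmaam see 12 cfr 100634
```

Note that deleting punctuation outright joins the text on either side of it, which is why `1006.34` appears as `100634` in the output above. Substituting a space instead of an empty string would keep the parts separate, at the cost of changing the tokens the vectorizer sees.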

In [30]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Moham\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [31]:
#Lemmatize the words
from nltk.stem import WordNetLemmatizer
lemmetizer = WordNetLemmatizer()

In [32]:
def lemmatize_words(text):
    return " ".join([lemmetizer.lemmatize(word) for word in text.split()])

In [33]:
final_data = final_data.assign(Complaint=final_data['Complaint'].apply(lemmatize_words))


In [None]:
final_data.iloc[0]['Complaint']

'attempt to collect debt not owed debt wa result of identity theft i am writing to formally bring to your attention a concern regarding my inability to dispute an account in violation of the provision set forth under 12 cfr 100634 c 4 i c despite my attempt to address and dispute the accuracy of this debt i have not been provided with the necessary information or opportunity to dispute this account effectively a mandated by the aforementioned regulation according to 12 cfr 100634 c 4 i c debt collector are required to inform consumer of their right to dispute the validity of a debt within 30 day of receiving the validation notice however in my case i wa not provided with a clear and unequivocal notice of my right to dispute the debt a a result i have been deprived of my statutory right to challenge the accuracy of the information being reported against me the failure to adhere to this regulation ha caused me significant distress and confusion therefore i kindly request the consumer fin

In [34]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(final_data['Complaint'],final_data['Team'],test_size=0.25,random_state=5)
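Because the teams are still unevenly sized after downsampling, it may help to keep the class proportions the same in the train and test sets. The sketch below uses the `stratify` argument of `train_test_split` on hypothetical toy data (`texts` and `labels` are illustrative only).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 samples of class 0 and 4 samples of class 1.
texts = [f"complaint {i}" for i in range(20)]
labels = [0] * 16 + [1] * 4

# stratify=labels keeps the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=5, stratify=labels
)

print(len(y_te), sum(y_te))  # 5 test samples, 1 of them from class 1
```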

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train).toarray()
X_test_tfidf = vectorizer.transform(X_test).toarray()

In [47]:
import pickle
pickle.dump(vectorizer,open('vectorizer.pkl','wb'))
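An alternative to pickling the vectorizer and the model separately is to wrap both in a scikit-learn `Pipeline`, so that one object carries the whole text-to-prediction path. The sketch below is an assumption about how this could look, not what the notebook does: it uses hypothetical toy sentences, keeps the TF-IDF matrix sparse (no `.toarray()`), and round-trips through `pickle.dumps` in memory instead of writing a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 0 = Debt collection, 1 = Consumer Reports.
train_texts = [
    "attempt to collect debt not owed",
    "debt collector called about debt",
    "incorrect information on credit report",
    "credit report inquiry not authorized",
]
train_labels = [0, 0, 1, 1]

# The pipeline vectorizes and classifies in one call; the matrix stays sparse.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(train_texts, train_labels)

# Serialize and restore the whole pipeline as a single object.
restored = pickle.loads(pickle.dumps(pipe))
pred = restored.predict(["wrong information on my credit report"])
print(pred[0])
```

Keeping the matrix sparse matters mostly for memory: a dense TF-IDF array has one column per vocabulary term for every complaint.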

In [36]:
from sklearn.metrics import accuracy_score
def evaluate_model(model,X_train,y_train,X_test,y_test):
    report={}
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    report['Train_Accuracy']=accuracy_score(y_train,y_train_pred)
    report['Test_Accuracy']=accuracy_score(y_test,y_test_pred)
    return report

In [37]:
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(X_train_tfidf,y_train)

In [38]:
evaluate_model(NB,X_train_tfidf,y_train,X_test_tfidf,y_test)

{'Train_Accuracy': 0.8450437317784256, 'Test_Accuracy': 0.8365889212827988}

Let's train some ensemble models

In [39]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier()
RFC.fit(X_train_tfidf,y_train)

In [40]:
evaluate_model(RFC,X_train_tfidf,y_train,X_test_tfidf,y_test)

{'Train_Accuracy': 0.9754130223517978, 'Test_Accuracy': 0.9435860058309038}

In [98]:
import xgboost
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train_tfidf,y_train)

In [99]:
evaluate_model(xgb,X_train_tfidf,y_train,X_test_tfidf,y_test)

{'Train_Accuracy': 0.9753158406219631, 'Test_Accuracy': 0.9572886297376093}
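Accuracy alone can hide weak performance on the smaller teams, so a per-class view may be a useful complement to `evaluate_model`. The sketch below uses hypothetical labels (`y_true`, `y_pred`) in which the classifier never predicts the minority class.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical labels: 8 samples of class 0, 2 of class 1,
# and a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores, so every class counts equally.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.2f} macro_f1={macro_f1:.2f}")
# accuracy=0.80 macro_f1=0.44
print(classification_report(y_true, y_pred, zero_division=0))
```

Here accuracy looks reasonable even though class 1 is never predicted; the macro-averaged F1 and the per-class report make that visible.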

In [48]:
#save the model
pickle.dump(RFC,open('model.pkl','wb'))

In [52]:
test = ['The infomation You have provided is  incorrect!']

In [53]:
test = re.sub(r'[^a-z A-Z 0-9-]+', '', test[0])  #clean the string itself rather than the list's printed form
test

'The infomation You have provided is  incorrect'

In [59]:
test = " ".join(test.split())

In [71]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Moham\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [76]:
from nltk.tokenize import word_tokenize
word_tokens = word_tokenize(test)
word_tokens

['The', 'infomation', 'You', 'have', 'provided', 'is', 'incorrect']

In [78]:
test1 = []
for i in word_tokens:
    test1.append(lemmatizer.lemmatize(i,pos="v"))
test1 

['The', 'infomation', 'You', 'have', 'provide', 'be', 'incorrect']

In [92]:
test2 = " ".join(test1)

In [94]:
test3 = [test2]

In [95]:
vectors = vectorizer.transform(test3)

In [97]:
RFC.predict(vectors)[0]

1
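The prediction is a numeric code, so it has to be translated back to a team name before a ticket can be assigned. A minimal sketch, reusing the `team_code` dictionary defined earlier (`code_to_team` is a hypothetical name for its inverse):

```python
# Same mapping as defined earlier in the notebook.
team_code = {
    'Debt collection': 0,
    'Consumer Reports': 1,
    'Credit card': 2,
    'Accounts': 3,
    'Loan': 4,
    'Mortgage': 5,
    'Money Trasfer': 6,
}

# Invert the mapping: numeric code -> team name.
code_to_team = {code: team for team, code in team_code.items()}

predicted_code = 1  # the value returned by the prediction above
print(code_to_team[predicted_code])
# Consumer Reports
```

So the sample complaint would be routed to the Consumer Reports team.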