In [6]:
import azureml.core
azureml.core.VERSION

Import the dataset into a pandas DataFrame.

In [9]:
import pandas as pd

df1 = pd.read_csv('./data/complaints.csv')
df1.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. \nis trying to collect...,,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392
1,2019-09-19,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,,3379500
2,2019-10-25,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,I would like to request the suppression of the...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",CA,937XX,,Consent provided,Web,2019-10-25,Closed with explanation,Yes,,3417821
3,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198
4,2019-02-08,Vehicle loan or lease,Lease,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,,,HYUNDAI CAPITAL AMERICA,FL,33161,,Consent not provided,Web,2019-02-08,Closed with non-monetary relief,Yes,,3146310


For our implementation, we use only two columns. The Consumer complaint narrative column, which we rename to Complaint, contains the text of each consumer complaint. The Product column identifies the financial product or service associated with the complaint.

In [10]:
# Copy the two columns so that later in-place edits do not act on a slice of df1
df2 = df1[['Product', 'Consumer complaint narrative']].copy()
df2.columns = ['Product', 'Complaint']
df2.head()

Unnamed: 0,Product,Complaint
0,Debt collection,transworld systems inc. \nis trying to collect...
1,"Credit reporting, credit repair services, or o...",
2,"Credit reporting, credit repair services, or o...",I would like to request the suppression of the...
3,Debt collection,"Over the past 2 weeks, I have been receiving e..."
4,Vehicle loan or lease,


The dataset has approximately 1.59M rows, but a large share of them have no value in the Complaint column. Here we simply drop all rows with missing data, which leaves about 526K rows.

In [11]:
df2.shape

(1586980, 2)

In [13]:
df2.dropna(inplace=True)

In [14]:
df2.shape

(525657, 2)
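
Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a toy frame (the rows are invented for illustration and do not come from the complaints file). It also uses `dropna(subset=...)`, which restricts the check to the Complaint column; the notebook itself drops rows with a missing value in any column.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Product': ['Mortgage', 'Debt collection', 'Mortgage', 'Credit card'],
    'Complaint': ['late fee dispute', None, None, 'card was charged twice'],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have a complaint narrative.
toy_clean = toy.dropna(subset=['Complaint'])
print(toy_clean.shape)
```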

There are 18 distinct values in the Product column, but some of them are heavily underrepresented, and many of them overlap.

We therefore consolidate the values of the Product column into 6 categories: Credit Reporting, Debt Collection, Mortgage, Card Services, Loans, and Banking Services. The 'Other financial service' value is mapped to Other and then dropped.

In [15]:
df2['Product'].value_counts()

Credit reporting, credit repair services, or other personal consumer reports    161981
Debt collection                                                                 112676
Mortgage                                                                         64275
Credit card or prepaid card                                                      35664
Credit reporting                                                                 31588
Student loan                                                                     25917
Checking or savings account                                                      20957
Credit card                                                                      18838
Bank account or service                                                          14885
Consumer Loan                                                                     9473
Vehicle loan or lease                                                             8882
Money transfer, virtual currency, or money 

In [16]:
df2.replace({'Product':
             {'Credit reporting, credit repair services, or other personal consumer reports': 'Credit Reporting',
              'Debt collection': 'Debt Collection',
              'Credit reporting': 'Credit Reporting',
              'Credit card': 'Card Services',
              'Bank account or service': 'Banking Services',
              'Credit card or prepaid card': 'Card Services',
              'Student loan': 'Loans',
              'Checking or savings account': 'Banking Services',
              'Consumer Loan': 'Loans',
              'Vehicle loan or lease': 'Loans',
              'Money transfer, virtual currency, or money service': 'Banking Services',
              'Payday loan, title loan, or personal loan': 'Loans',
              'Payday loan': 'Loans',
              'Money transfers': 'Banking Services',
              'Prepaid card': 'Card Services',
              'Other financial service': 'Other',
              'Virtual currency': 'Banking Services'}
            }, inplace= True)

In [17]:
df2 = df2[df2['Product'] != 'Other']
pd.DataFrame(df2['Product'].value_counts())

Unnamed: 0,Product
Credit Reporting,193569
Debt Collection,112676
Mortgage,64275
Card Services,55952
Loans,53029
Banking Services,45864
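
A mapping like this silently leaves any value without an entry unchanged, so it is worth checking that nothing was missed. The sketch below shows one way to do that on toy data; the `mapping`, the `expected` set, and the 'Unknown product' value are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy labels; 'Unknown product' is a made-up value to demonstrate the check.
products = pd.Series(['Credit card', 'Prepaid card', 'Mortgage', 'Unknown product'])

# Hypothetical partial mapping in the spirit of the one used in the notebook.
mapping = {'Credit card': 'Card Services', 'Prepaid card': 'Card Services'}
expected = {'Card Services', 'Mortgage'}

# replace() leaves values without a mapping entry unchanged.
consolidated = products.replace(mapping)

# Anything outside the expected categories was missed by the mapping.
unmapped = sorted(set(consolidated) - expected)
print(unmapped)
```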


The model needs numeric inputs. Here we create a new column, Product_Label, that encodes the values of the Product column as integers.

The text in the Complaint column needs a similar conversion, but because that step depends on the model architecture, it is done in the subsequent notebook.

In [18]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
df2['Product_Label'] = enc.fit_transform(df2['Product'])

In [19]:
df2.head()

Unnamed: 0,Product,Complaint,Product_Label
0,Debt Collection,transworld systems inc. \nis trying to collect...,3
2,Credit Reporting,I would like to request the suppression of the...,2
3,Debt Collection,"Over the past 2 weeks, I have been receiving e...",3
11,Banking Services,"I was sold access to an event digitally, of wh...",0
12,Debt Collection,While checking my credit report I noticed thre...,3
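
`LabelEncoder` assigns integers in the sorted order of the class names, and the fitted encoder keeps that order in `classes_`. The sketch below fits an encoder on the six category names alone (not on the full DataFrame) to show how a label can be mapped back to its category, which is useful when interpreting model predictions later.

```python
from sklearn.preprocessing import LabelEncoder

categories = ['Debt Collection', 'Credit Reporting', 'Mortgage',
              'Card Services', 'Loans', 'Banking Services']

enc = LabelEncoder()
labels = enc.fit_transform(categories)

# classes_ is sorted; the position of each class is its integer label.
label_to_name = dict(enumerate(enc.classes_))
print(label_to_name)

# Map integer labels back to category names.
decoded = list(enc.inverse_transform([3, 2, 0]))
print(decoded)
```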


In [20]:
df2.iloc[4]['Complaint']

'While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. I then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed 

We can preprocess the data further to reduce the vocabulary size of the text. Here we apply light text preprocessing: replacing punctuation with spaces, removing the masked information (XXX… patterns), collapsing extra spaces, and finally normalizing everything to lowercase.

In [21]:
import string

# Map every punctuation character to a space
table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
df2['Complaint'] = df2['Complaint'].str.translate(table)
# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
df2['Complaint'] = df2['Complaint'].str.replace('X+', '', regex=True)
df2['Complaint'] = df2['Complaint'].str.replace(' +', ' ', regex=True)
df2['Complaint'] = df2['Complaint'].str.lower()
df2['Complaint'] = df2['Complaint'].str.strip()

In [24]:
df2.iloc[4]['Complaint']

'while checking my credit report i noticed three collections by a company called ars that i was unfamiliar with i disputed these collections with and and they both replied that they contacted the creditor and the creditor verified the debt so i asked for proof which both bureaus replied that they are not required to prove anything i then mailed a certified letter to ars requesting proof of the debts n the form of an original aggrement or a proof of a right to the debt or even so much as the process as to how the bill was calculated to which i was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after i sent a clear notice to provide me evidence afterwards i recontacted both and to redispute on the premise that it is not my debt if evidence can not be drawn up i feel as if i am being personally victimized by ars on my credit report for debts that are not owed to them or any party for that 
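
The same steps can be sketched as a plain function on a single string, which makes them easy to test in isolation. The name `clean_text` is hypothetical and the sample sentence is invented. Note that the `X+` pattern also strips a capital X from ordinary words.

```python
import re
import string

# Translation table that turns every punctuation character into a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_text(text: str) -> str:
    text = text.translate(_PUNCT_TABLE)   # punctuation -> spaces
    text = re.sub(r'X+', '', text)        # drop masked tokens such as XXXX
    text = re.sub(r' +', ' ', text)       # collapse runs of spaces
    return text.lower().strip()

print(clean_text('I disputed this with XXXX, and XXXX on XX/XX/2019!'))
```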

Some entries in the Complaint column contain zero or very few words after preprocessing; they account for about 1,000 rows in the dataset. We require a minimum of 5 words for a text to carry useful information.

In [25]:
# Vectorized word count per complaint (whitespace-separated tokens)
lengths = df2['Complaint'].str.split().str.len()
print(max(lengths))
print(min(lengths))

5958
0


In [26]:
df2 = df2[[l >= 5 for l in lengths]]
df2.shape

(524427, 3)
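
To see how many rows a given threshold would remove before applying it, the word counts can be compared against the threshold directly. This is a small sketch on invented toy text; `min_words` is a hypothetical name for the threshold of 5 used above.

```python
import pandas as pd

# Toy complaints, invented for illustration only.
toy = pd.DataFrame({'Complaint': ['', 'too short', 'this one has five words', 'a b c d e f']})

min_words = 5

# Number of whitespace-separated tokens in each row.
word_counts = toy['Complaint'].str.split().str.len()

# Rows that fall below the threshold and would be dropped.
n_too_short = int((word_counts < min_words).sum())
print(word_counts.tolist(), n_too_short)
```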

In [27]:
pd.DataFrame(df2['Product'].value_counts())

Unnamed: 0,Product
Credit Reporting,192899
Debt Collection,112495
Mortgage,64256
Card Services,55925
Loans,53003
Banking Services,45849


Finally, we save the preprocessed dataset, along with a second file containing a 10% random sample of it.

In [28]:
df2.to_csv('./data/consumer_complaint_data_prepared.csv', index=False)

In [29]:
df2.sample(n=int(len(df2)*0.1), random_state=111).to_csv('./data/consumer_complaint_data_sample_prepared.csv', index=False)
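
The sample above is a simple random sample, so the class proportions in the sample can differ slightly from those in the full dataset. If exact proportions matter, one alternative (not what this notebook does) is to sample within each Product group. A minimal sketch on toy data, assuming pandas 1.1 or later for `groupby(...).sample`:

```python
import pandas as pd

# Toy data with an 80/20 class split, invented for illustration only.
toy = pd.DataFrame({
    'Product': ['A'] * 80 + ['B'] * 20,
    'Complaint': [f'complaint {i}' for i in range(100)],
})

# Draw 10% within each Product group so class proportions are preserved.
stratified = toy.groupby('Product').sample(frac=0.1, random_state=111)
counts = stratified['Product'].value_counts().to_dict()
print(counts)
```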