<h1 style='color:#00868b'>Read, balance and clean dataset<span class="tocSkip"></span></h1>

# Start

In [59]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Read dataset

In [60]:
df = pd.read_csv("complaints-2020-01-22_08_24.csv", encoding="utf-8")

In [61]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,07/23/19,"Credit reporting, credit repair services, or o...",Credit reporting,Credit monitoring or identity theft protection...,Problem canceling credit monitoring or identif...,I have complained many times that the credit r...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,CA,926XX,,Consent provided,Web,07/23/19,Closed with explanation,Yes,,3315279
1,07/26/19,Debt collection,I do not know,False statements or representation,Attempted to collect wrong amount,please review the current fraud account and al...,Company believes it acted appropriately as aut...,"Ideal Collection Services, Inc.",FL,333XX,,Consent provided,Web,07/26/19,Closed with explanation,Yes,,3319487
2,06/03/19,Debt collection,I do not know,Attempts to collect debt not owed,Debt was paid,Called multiple times over the years for a deb...,,"ONEMAIN FINANCIAL HOLDINGS, LLC.",FL,327XX,,Consent provided,Web,06/07/19,Closed with explanation,Yes,,3262794
3,07/03/19,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,I sent in a letter to the company to have them...,,"Diversified Consultants, Inc.",VA,232XX,,Consent provided,Web,07/03/19,Closed with explanation,Yes,,3295208
4,07/14/19,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Received unsolicited financial product or insu...,On XX/XX/19 I applied for a Debt Relief Produc...,,"ClearOne Advantage, LLC",PA,191XX,"Older American, Servicemember",Consent provided,Web,07/18/19,Closed with explanation,Yes,,3306130


Example:

In [62]:
df["Consumer complaint narrative"][45144]

'I noticed a collection debt appear on my credit report earlier this week ( XX/XX/18 ). I XXXX XXXX the name of the company and looked up their website - I have never received anything in writing nor any phone calls ( that resulted in a voicemail - I do not answer calls from unknown numbers ) from this company about this debt. I contacted their settlement department via e-mail using the e-mail address on their website requesting more information about the debt owed, but have not received a response. I have no way to pay the debt if they do not send me the account information that is required for their website, which means it will remain on my credit report until it is sorted out.'

In [91]:
df.shape

(485701, 18)

## Data preprocessing

### Clean

In [64]:
import re

def clean_document(complaint):
    """Normalise one complaint narrative for topic modelling."""
    # turn text to lowercase
    complaint = complaint.lower()
    # remove URLs, including ones with stray spaces such as "http : // ..."
    complaint = re.sub(r'https? ?: ?// ?(?:[-\w.]|(?:%[\da-fA-F]{2}))+', ' ', complaint)
    # remove normal and censored dates first, while the "/" separators are still present
    complaint = re.sub(r'\b[\dx]{1,2}/[\dx]{1,2}/[\dx]{2,4}\b', ' ', complaint)
    # remove censored words (runs of two or more x; the text is already lowercase)
    complaint = re.sub(r'x{2,}', ' ', complaint)
    # remove punctuation and other special characters
    complaint = re.sub(r"[.,#'\"():$;+\-!?%}/{*]", ' ', complaint)
    # replace newlines and tabs with spaces
    complaint = re.sub(r'[\n\t]', ' ', complaint)
    # remove digits
    complaint = re.sub(r'[0-9]', '', complaint)
    # collapse runs of spaces into a single space
    complaint = re.sub(r' +', ' ', complaint)
    
    return complaint
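
The order of the substitutions matters: a date pattern can only match while the `/` separators are still in the text. The sketch below illustrates this on a made-up sentence. `DATE` and `PUNCT` are simplified, hypothetical patterns, not the exact ones used in `clean_document`.

```python
import re

# Simplified, hypothetical patterns for illustration only
DATE = re.compile(r"\b[\dx]{1,2}/[\dx]{1,2}/[\dx]{2,4}\b")
PUNCT = re.compile(r"[()/.,]")

sample = "payment due ( xx/xx/18 ) by fax"

# Variant 1: strip dates first, then punctuation
dates_first = PUNCT.sub(" ", DATE.sub(" ", sample))
# Variant 2: strip punctuation first, so the date pattern finds no "/" left
punct_first = DATE.sub(" ", PUNCT.sub(" ", sample))

# Collapse whitespace so the two results are easy to compare
dates_first = " ".join(dates_first.split())
punct_first = " ".join(punct_first.split())

print(dates_first)
print(punct_first)
```

In the second variant the fragments of the date survive the date step and have to be caught by the separate censored-word and digit substitutions.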


In [65]:
df["Consumer complaint narrative"] = df["Consumer complaint narrative"].apply(clean_document)

Example:

In [66]:
df["Consumer complaint narrative"][45144]

'i noticed a collection debt appear on my credit report earlier this week i the name of the company and looked up their website i have never received anything in writing nor any phone calls that resulted in a voicemail i do not answer calls from unknown numbers from this company about this debt i contacted their settlement department via e mail using the e mail address on their website requesting more information about the debt owed but have not received a response i have no way to pay the debt if they do not send me the account information that is required for their website which means it will remain on my credit report until it is sorted out '

Export to CSV

In [92]:
df.to_csv("corpus_cleaned_for_LDA.csv", index=False)

### Stemming with some initial removal of stop words

Stemming reduces the variation in the text by converting each word to its stem. This lets LDA treat the inflected forms of a word as a single term instead of spreading its counts across several variants. The code below is analogous to sprint 1's code in [<code>lemmatization.py</code>](lemmatization.py).
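
As a small illustration of the effect, the sketch below stems a few inflected forms of one verb with NLTK's `PorterStemmer`. The word list is made up for the example and is not taken from the dataset.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Made-up example words: several inflections of the same verb
variants = ["collect", "collects", "collected", "collecting"]

# Map each variant to its stem
stems = {word: ps.stem(word) for word in variants}
print(stems)
```

All four forms map to the same stem, so a bag-of-words model counts them as one term.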

In [96]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

col = "Consumer complaint narrative"

# Tokenize and stem each complaint, printing progress every 1000 documents
for i, (ind, complaint) in enumerate(df[col].items(), start=1):
    words = word_tokenize(complaint)
    # Stem the words in the complaint
    new_words = [ps.stem(word) for word in words]
    # df.at writes to the DataFrame itself; the chained form df[col][ind] = ...
    # triggers pandas' SettingWithCopyWarning
    df.at[ind, col] = ' '.join(new_words)
    if (i % 1000) == 0:
        print(i)

# to csv for later use
df.to_csv("corpus_cleaned_and_stemmed_for_LDA.csv", index=False)

df.shape


1000
2000
3000
...
158000
...

(485701, 18)

### Tokenization, Document Term Matrix & removing stop words

Stop words include: 
* common English words;
* company names, which we can obtain from the company column, although this may not cover all companies mentioned in the consumer complaint narrative;
* runs of two or more `x` characters that redact personal information (these were already removed in the data cleaning step);
* state names.
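
One caveat with multi-word entries such as `new york` or full company names: once the text is split into single-word tokens, a phrase can never match a token as a whole. A minimal sketch of one possible workaround, assuming unigram tokens, is to split every phrase into its words first. The helper `flatten_stop_phrases` is hypothetical and not part of this notebook.

```python
import re

def flatten_stop_phrases(phrases):
    """Split multi-word stop phrases into lowercase single-word tokens."""
    tokens = set()
    for phrase in phrases:
        # Keep alphabetic runs of two or more letters, similar to a unigram tokenizer
        tokens.update(re.findall(r"[a-z]{2,}", phrase.lower()))
    return tokens

# Example phrases: one state name and one company name from the Company column
stop_tokens = flatten_stop_phrases(["New York", "Ideal Collection Services, Inc."])
print(sorted(stop_tokens))
```

The trade-off is that generic words inside company names (here `collection` and `services`) also become stop words, which may remove terms that carry topic information.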

In [162]:
df = pd.read_csv("corpus_cleaned_and_stemmed_for_LDA.csv", encoding="utf-8")
df.shape

(485701, 18)

In [163]:
# 13 complaints are null after reloading, presumably because they were empty
# after cleaning and stemming (read_csv parses empty fields as NaN)
# 6198, 15086, 23035, 34053, 42599, 46636, 57917, 96134, 115824, 133140, 223668, 248796, 451012
print("Nulls: ", df["Consumer complaint narrative"].isnull().sum())
print(df[df["Consumer complaint narrative"].isnull()])
df_complaints = df["Consumer complaint narrative"].dropna()
# check the filtered series; df itself still contains the null rows
print("Nulls after dropna: ", df_complaints.isnull().sum())

Nulls:  13
       Date received                                            Product  \
6198        09/05/19  Credit reporting, credit repair services, or o...   
15086       04/21/19  Credit reporting, credit repair services, or o...   
23035       05/16/19  Credit reporting, credit repair services, or o...   
34053       10/27/19  Credit reporting, credit repair services, or o...   
42599       05/03/19  Credit reporting, credit repair services, or o...   
46636       09/05/19  Credit reporting, credit repair services, or o...   
57917       06/15/19  Credit reporting, credit repair services, or o...   
96134       07/19/19  Credit reporting, credit repair services, or o...   
115824      08/01/18  Credit reporting, credit repair services, or o...   
133140      09/08/17  Credit reporting, credit repair services, or o...   
223668      09/09/17  Credit reporting, credit repair services, or o...   
248796      02/12/19  Credit reporting, credit repair services, or o...   
451012      08

In [164]:
df_complaints.shape

(485688,)
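
A likely explanation for these nulls is the CSV round trip: a narrative that is an empty string when written comes back as `NaN` with the default `read_csv` settings. The sketch below reproduces this on a made-up three-row frame; the column names `id` and `text` are placeholders.

```python
import io

import pandas as pd

# Made-up frame with one empty narrative
before = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["debt collect", "", "credit report"],
})

# Write to an in-memory CSV and read it back with default settings
buffer = io.StringIO()
before.to_csv(buffer, index=False)
buffer.seek(0)
after = pd.read_csv(buffer)

nulls_before = int(before["text"].isnull().sum())
nulls_after = int(after["text"].isnull().sum())
print(nulls_before, nulls_after)
```

Dropping these rows, as done above, is one option; reading with `keep_default_na=False` would instead keep them as empty strings.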

In [165]:
my_additional_stop_words = []

# US states, capitalised and lower
states_abbr = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

states_abbr = [item.lower() for item in states_abbr]
states = [item.lower() for item in states]
states.extend(states_abbr)
print(states)

# add to list of additional stop words
my_additional_stop_words.extend(states)

# -------------------------------------------

# Companies mentioned in the column Company
companies = df["Company"].unique()
print(companies)

# add to list of additional stop words
my_additional_stop_words.extend(companies)

print("Length of list: ", len(my_additional_stop_words))

# other stop words
my_additional_stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come'])


['alabama', 'alaska', 'arizona', 'arkansas', 'california', 'colorado', 'connecticut', 'delaware', 'florida', 'georgia', 'hawaii', 'idaho', 'illinois', 'indiana', 'iowa', 'kansas', 'kentucky', 'louisiana', 'maine', 'maryland', 'massachusetts', 'michigan', 'minnesota', 'mississippi', 'missouri', 'montana', 'nebraska', 'nevada', 'new hampshire', 'new jersey', 'new mexico', 'new york', 'north carolina', 'north dakota', 'ohio', 'oklahoma', 'oregon', 'pennsylvania', 'rhode island', 'south carolina', 'south dakota', 'tennessee', 'texas', 'utah', 'vermont', 'virginia', 'washington', 'west virginia', 'wisconsin', 'wyoming', 'al', 'ak', 'az', 'ar', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga', 'hi', 'id', 'il', 'in', 'ia', 'ks', 'ky', 'la', 'me', 'md', 'ma', 'mi', 'mn', 'ms', 'mo', 'mt', 'ne', 'nv', 'nh', 'nj', 'nm', 'ny', 'nc', 'nd', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx', 'ut', 'vt', 'va', 'wa', 'wv', 'wi', 'wy']
['Experian Information Solutions Inc.' 'Ideal Collection Services, Inc.

Stem stop words

In [166]:
from sklearn.feature_extraction import text

def stem_phrase(phrase):
    # Tokenize a stop-word entry and stem each token, as was done for the complaints
    return ' '.join(ps.stem(word) for word in word_tokenize(phrase))

# Stem the custom stop words (states, company names, other words)
my_additional_stop_words = [stem_phrase(entry) for entry in my_additional_stop_words]

# Stem scikit-learn's English stop words and append them,
# rather than overwriting the entries stemmed above
english_stop_words = text.ENGLISH_STOP_WORDS
my_additional_stop_words.extend(stem_phrase(entry) for entry in english_stop_words)

In [168]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text 

extended_stop_words = english_stop_words.union(my_additional_stop_words)
print(extended_stop_words)

frozenset({'tag intermedi hold compani , llc', 'qualiti accept llc', 'ashton & weinberg , inc .', 'gener mortgag compani', 'colorado capit invest , inc .', 'medallion mortgag compani', 'maxitransf corpor', 'might', 'frost bank', 'mountain west legal solut , pllc', 'berlin-wheel , inc. ( kansa )', 'debt manag hold llc', 'omni financi group , inc', 'commerci recoveri system', 'bf capit inc .', 'lauren financi servic inc', 'cheney financi servic , inc .', 'home american mortgag corp', 'slovin & associ co. , lpa', 'cityworth mortgag llc', 'credit servic intern corp', 'emcc hold , llc', 'veripro solut inc .', 'innov strateg solut llc', 'abov', 'nation process group llc', 'three B financi , llc', 'fairfield servic M , llc', 'diaz & associ , inc .', 'delev & associ , llc', 'luna financi llc', 'possibl financi inc', 'aci worldwid , corp .', 'speedycash , inc .', 'acceler creditor servic , inc', 'law offic of donald R conrad p.c .', 'US mortgag corpor', 'traco invest corp', 'marlett fund , llc'

In [170]:
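# vect (bag of words)
# NOTE: stop_words="english" applies only scikit-learn's built-in list;
# extended_stop_words from the previous cell is not passed to the vectorizer here
count_vect = CountVectorizer(
    stop_words="english",
    # ngram_range=(1,2),
    min_df=2, # only keep terms that appear in at least 2 documents
    max_df=0.5 # drop terms that appear in more than 50% of documents
)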


X = count_vect.fit_transform(df_complaints)

In [178]:
X.shape

(485688, 37792)
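
To show how `stop_words`, `min_df` and `max_df` interact, here is a minimal sketch on four made-up documents. It passes a custom stop-word list instead of `"english"`; the documents and the list are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up, already stemmed documents
docs = [
    "debt collect call",
    "debt credit report",
    "credit report error",
    "debt collect letter",
]

vect = CountVectorizer(
    stop_words=["report"],  # custom list instead of the built-in "english"
    min_df=2,               # drops call, error, letter (one document each)
    max_df=0.5,             # drops debt (3 of 4 documents)
)
X_toy = vect.fit_transform(docs)

vocab = sorted(vect.vocabulary_)
print(vocab, X_toy.shape)
```

In this notebook the analogous change would be `stop_words=list(extended_stop_words)`. scikit-learn may then warn that some entries are inconsistent with its tokenization, since multi-word entries cannot match single tokens.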

In [179]:
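# NOTE: X is a scipy sparse matrix, so pd.DataFrame(X) does not expand it into
# one column per term; each cell holds a whole sparse row (see the head() output)
X_df = pd.DataFrame(X)
X_df.to_csv("corpus_sprint2_LDA_ready_bag_of_words.csv", index=False)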

In [180]:
X_df.head()

Unnamed: 0,0
0,"(0, 520)\t1\n (0, 657)\t1\n (0, 1175)\t1\n..."
1,"(0, 8389)\t1\n (0, 13684)\t1\n (0, 13722)\..."
2,"(0, 1638)\t1\n (0, 4512)\t1\n (0, 5033)\t1..."
3,"(0, 1241)\t2\n (0, 1421)\t2\n (0, 2607)\t1..."
4,"(0, 921)\t3\n (0, 1226)\t1\n (0, 1340)\t1\..."
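
The output above shows that each cell contains the printed form of a sparse row, so the CSV is hard to load back as a matrix. A minimal sketch of an alternative, assuming SciPy's `.npz` format is acceptable for later steps, is to store the sparse matrix directly. The toy matrix and the file name are placeholders.

```python
import os
import tempfile

from scipy import sparse

# Toy stand-in for the document-term matrix
X_toy = sparse.csr_matrix([[0, 2, 0], [1, 0, 0], [0, 0, 3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bag_of_words_toy.npz")
    # Save the sparse matrix in SciPy's compressed .npz format and load it back
    sparse.save_npz(path, X_toy)
    X_loaded = sparse.load_npz(path)

# The round trip preserves shape and values
same = (X_toy != X_loaded).nnz == 0
print(X_loaded.shape, same)
```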


### Dimensionality reduction

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently ([source](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)).

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
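
As a small illustration, the sketch below applies `TruncatedSVD` to a made-up sparse count matrix. The counts are invented, and `random_state=0` is set only to make the example repeatable.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Made-up term counts: 6 documents x 5 terms, stored as a sparse matrix
counts = sparse.csr_matrix(np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 1],
]))

# The sparse input is accepted directly; the result is a dense array
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)

# Cumulative share of variance explained by the retained components
cumulative = svd.explained_variance_ratio_.cumsum()
print(reduced.shape)
print(cumulative)
```

Inspecting the cumulative explained variance in this way is one option for comparing candidate values of `n_components`.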

In [173]:
from sklearn.decomposition import TruncatedSVD

n_components=[15,30,50,100]

for components in n_components:
    print("Components: ", components)
    # apply tSVD
    tSVD = TruncatedSVD(components)
    principal_components = tSVD.fit_transform(X)
    print("Done transforming.")
    principal_components_df = pd.DataFrame(principal_components)
    # Export to csv for later use
    print("Exporting to csv...")
    principal_components_df.to_csv("corpus_sprint2_pc_" + str(components) + "_LDA.csv", index=False)

Components:  15
Done transforming.
Exporting to csv...
Components:  30
Done transforming.
Exporting to csv...
Components:  50
Done transforming.
Exporting to csv...
Components:  100
Done transforming.
Exporting to csv...
