## Background
This notebook presents an LSTM model for supervised text classification of consumer complaints. The model takes written complaints about financial products from US consumers and classifies them into various financial categories (e.g., mortgage fraud, undue fees, identity theft).

The model is based on a similar analysis I performed during an Insight Data fellowship, using proprietary data for classification and sentiment analysis.

In this demo, we will design a <b>Long Short-Term Memory</b> (LSTM) deep-learning model using the Keras API. It is meant as a benchmark for another model in the sibling file XXX, which uses another technique. The following analysis can stand on its own, although I suggest users also take inspiration from the other file.

## Analysis
This notebook covers the following steps:

+ basic NLP preprocessing
+ designing and running an LSTM
+ validation
+ hyperparameter tuning

## Data
Data is included in the 'data' directory as a large compressed .csv file from the US Consumer Financial Protection Bureau [available here](https://www.consumerfinance.gov/data-research/consumer-complaints/search/?from=0&searchField=all&searchText=&size=25&sort=created_date_desc). To save space, I keep the data as a .tar.xz archive, and the Python code extracts it for one-time use in a tmp directory. This works on Linux/Mac, but Windows users may have to extract the contents manually and modify the script to import from wherever they extracted them.

In [1]:
import os
import pandas as pd
import numpy as np
import re 
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from keras import backend as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM, RepeatVector, concatenate, Dense, Reshape, Flatten
from keras.models import Model 
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import roc_curve, auc, confusion_matrix
from array import array

# set the working directory
# os.chdir(".") # generally a good practise


Using TensorFlow backend.


0

### Data (extraction and import)
The following decompresses the .tar.xz file by calling the Unix utility tar through os.system. If this doesn't work for you, just manually navigate to the data directory and decompress the .tar.xz file using whatever program you have.

In [15]:
os.system("tar xf data/complaints-2018-09-30_17_53.csv.tar.xz -C data/") # extract into the directory read from below

0
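
For readers who cannot call tar through the shell (e.g., on Windows), the extraction can probably be done with Python's standard-library tarfile module instead. The following is a minimal sketch, not the code I ran for this notebook; the helper name extract_archive is hypothetical.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.xz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    # mode "r:xz" opens an xz-compressed tar archive for reading
    with tarfile.open(archive_path, mode="r:xz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names

# Example call (paths are assumptions; adjust them to where your archive lives):
# extract_archive("data/complaints-2018-09-30_17_53.csv.tar.xz", "data/")
```

The returned list of member names is a quick way to confirm which file name to pass to pd.read_csv afterwards.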

In [38]:
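# data file
data_dir = "data/" # directory the archive was extracted into (the -C target above); I use /tmp/
fname = "complaints-2018-09-30_17_53.csv" # extracted CSV (assumes the archive holds a CSV of the same name)
f = data_dir + fname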

# read the complaint data 
d_raw = pd.read_csv(f, usecols = ['State','Complaint ID','Consumer complaint narrative','Product', 'Sub-product', 'Issue', 'Sub-issue'])
d_raw.shape # notice 191829 rows and 7 columns

# fill NaN with blanks
for col_ in ['Product','Sub-product','Issue']:
   d_raw[col_] = d_raw[col_].fillna(" ") # fill NaN with a character

print(d_raw)

The key parts of the data include:
+ Consumer complaint narrative: the text of the consumer's complaint
+ Product: highest level of complaint categorization
+ Sub-product: second level of complaint categorization
+ Issue: third level of complaint categorization
+ Sub-Issue: ...

We will work at the level of <b>Issue</b> (which I will refer to as 'Label3'). There are over 400 sub-issues.

The following cell combines Products, Sub-products, and Issues, and collects all the unique categories. We will also <b>truncate</b> the data to exclude Issues with fewer than <b>10</b> representatives.

In [41]:
# factorize the two levels (Product and Product+Sub-product+Issue) to get unique integer codes
d_raw['Label1'] = pd.factorize(d_raw['Product'])[0]
d_raw['Label3'] = pd.factorize(d_raw['Product'] + d_raw['Sub-product']+d_raw['Issue'])[0] # 570 Categories
 
# Dictionary: category integers vs. category names
cats = [pd.factorize(d_raw['Product'])[1], 
        pd.factorize(d_raw['Product'] + d_raw['Sub-product'])[1], 
        pd.factorize(d_raw['Product'] + d_raw['Sub-product']+d_raw['Issue'])[1]]

# truncate the data: only use categories with at least 10 observations
col_label = 'Label3' # columns to use for filtering
cutoff = 10 # truncation cutoff
labels_counts = d_raw.groupby([col_label]).size() # counts of Level3 categories
which_labels = labels_counts.index[labels_counts >= cutoff] # label values with at least 'cutoff' observations

# make new (truncated) dataset
ixSubset = d_raw[col_label].isin(which_labels) # boolean mask of rows to keep
# new dataset 'd', as subset of d_raw
d = (d_raw[ixSubset]).copy()

# new data set
print(d.shape) # vs d_raw.shape
print(cats)

# del d_raw


(191193, 9)
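
To make the factorize-and-truncate logic easier to follow, here is a small sketch on toy data. The product and issue strings and the cutoff of 2 are made up for illustration; they are not taken from the complaints file.

```python
import pandas as pd

# toy data: hypothetical product/issue strings, not the CFPB data
toy = pd.DataFrame({
    "Product": ["Loan", "Loan", "Loan", "Card", "Card", "Card"],
    "Issue":   ["Fees", "Fees", "Fraud", "Fees", "Fees", "Fees"],
})

# pd.factorize returns (integer codes, unique values);
# codes are assigned in order of first appearance
codes, uniques = pd.factorize(toy["Product"] + toy["Issue"])
toy["Label"] = codes

cutoff = 2  # keep categories with at least 2 rows
counts = toy.groupby("Label").size()
keep = counts.index[counts >= cutoff]  # label values (not positions) passing the cutoff
toy_trunc = toy[toy["Label"].isin(keep)].copy()

print(list(uniques))    # ['LoanFees', 'LoanFraud', 'CardFees']
print(toy_trunc.shape)  # (5, 3)
```

The single 'LoanFraud' row is dropped because its category falls below the cutoff, which is the same thing that happens to rare Issues in the real data.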


## Natural Language Processing
We use some basic NLP techniques to prepare the data for input into the LSTM. The data is already pretty clean; a messier dataset would involve a lot more, such as removing non-English responses and applying auto-correct. The steps are:
+ replace contractions (e.g., can't becomes can not)
+ remove non-alphanumeric characters
+ remove double whitespace
+ remove stop words
+ stemming (e.g., {improvement, improved} = {improv, improv})
+ cap the number of words per complaint

<b>WARNING: the stemming takes a long time >2 minutes </b>


In [43]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk import download as nltk_downloader 

# quick function to expand contractions
def decontracted(phrase):
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

# function to do some basic NLP pre-processing steps: replacing contractions, stemming words, removing stop words
def nlp_preprocess(text_column, # column in Panda table with text
                   stop_words, # list of English stopwords
                   word_clip = 300): # truncate the number of words
   # remove contractions
   cTextl = [decontracted(x) for x in text_column.values.tolist()]
   # remove double spacing and non-alphanumeric characters
   cTextl=[re.sub(' +',' ',re.sub(r'\W+', ' ', x)) for x in cTextl]
   # lower case the words
   cTextl = [x.lower() for x in cTextl]
   # stop words and stemming
   for i in range(0,len(cTextl)):
      rawtext = cTextl[i].split(" ") # splits sentence by spaces
      rawtext = rawtext[:word_clip] # keep at most word_clip words
      # stem and remove stopwords in one line (expensive operation)
      newtext = " ".join(ps.stem(word) for word in rawtext if word not in stop_words)  # loop through words, stem, join
      cTextl[i] = newtext
   return pd.DataFrame(cTextl)

# get the default English stopwords from the nltk package
nltk_downloader('stopwords') # fetch the stopword corpus if it is not installed yet
stop_words = set(stopwords.words('english')) # set of stopwords to remove
# get the stemming object from nltk
ps = PorterStemmer()  # stemmer

# text column to feed into the LSTM-deep learning model:
col_text = 'Consumer complaint narrative' # name of the column with the text 

# NLP processing function
cText = nlp_preprocess(d[col_text],stop_words, word_clip = 300)

# show example of the original versus processed data
print("Original text: " + d[col_text].iloc[0] + "\n") # original (positional lookup: row label 0 may have been truncated away)
print("Processed text: " + cText.iloc[0,0]) # stemmed
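
To see what the cleaning steps do to a single sentence, here is a stripped-down sketch. It is an illustration under assumptions rather than the function used above: the stop-word set TOY_STOP_WORDS is a tiny hypothetical one (the notebook uses NLTK's English list), only two contraction rules are included, and stemming is left out so that the example runs without NLTK.

```python
import re

# hypothetical, tiny stop-word set for illustration only
TOY_STOP_WORDS = {"i", "the", "is", "not", "a", "to", "my", "will"}

def clean_text(text, stop_words=TOY_STOP_WORDS, word_clip=300):
    """Expand contractions, strip punctuation, lower-case, clip, drop stop words."""
    text = re.sub(r"won't", "will not", text)  # special case before the general rule
    text = re.sub(r"n't", " not", text)        # general "n't" contraction
    text = re.sub(r"\W+", " ", text)           # runs of non-alphanumerics become one space
    text = text.lower()
    words = text.split()[:word_clip]           # cap the number of words
    return " ".join(w for w in words if w not in stop_words)

print(clean_text("I won't pay the $35 fee; it isn't fair!"))  # pay 35 fee it fair
```

Notice that dropping "not" as a stop word removes the negation from the sentence. Whether that hurts classification is worth checking on your own data.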


### Vectorize Text Data for LSTM input
Having cleaned the text data, we now use the Keras 'Tokenizer' to vectorize the text into integers representing the 3000 most common words. NOTE: 'max_tokens' should be considered a type of <b>hyper-parameter</b>. More words can potentially capture more meaning, or more noise.

The matrix of token sequences (X, below) will be the input for the word-embedding layer.

In [53]:
# maximum number of words to consider in corpus for embedding (2000-10000 seems the general range); treat this like a (coarse) hyperparameter
max_tokens = 3000 
tokenizer = Tokenizer(num_words=max_tokens, split=' ')
tokenizer.fit_on_texts("STARTCODON " + cText[0])
# notice the addition of a start codon to signal to the LSTM where the sentence begins (due to the subsequent zero-padding to standardize the length of every tokenized-sentence/sequence)

# Model Input: tokenize the text data for input into word-embedding layer
X = pad_sequences(tokenizer.texts_to_sequences(("STARTCODON " + cText[0]).values)) # tokenize and pad with zeros

# number of observations/rows
n_obs = X.shape[0] 

# notice the shape: 301 tokens per complaint (up to 300 words plus the start token, standardized with zero-padding)
print(X.shape)

(191193, 301)
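
To show why the start token is useful, here is a pure-Python stand-in for what the tokenizer and the padding step do. It is a sketch, not the Keras implementation: fit_vocab and to_padded are hypothetical helpers that mimic two behaviours I rely on above, namely that words are ranked by frequency starting from 1 (keeping only ranks below num_words) and that shorter sequences are padded with zeros at the front.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Map each word to an integer rank (1 = most frequent); keep ranks below num_words."""
    counts = Counter(word for t in texts for word in t.split())
    ranked = [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}

def to_padded(texts, vocab):
    """Convert texts to integer sequences and left-pad with zeros to a common length."""
    seqs = [[vocab[w] for w in t.split() if w in vocab] for t in texts]
    width = max(len(s) for s in seqs)
    return [[0] * (width - len(s)) + s for s in seqs]

# toy, already-stemmed complaints with the start token prepended
texts = ["startcodon fee charg twice", "startcodon fee"]
vocab = fit_vocab(texts, num_words=10)
X_toy = to_padded(texts, vocab)
print(X_toy)  # [[1, 2, 3, 4], [0, 0, 1, 2]]
```

In the second row the zeros come first, and the start token (integer 1 here) marks where the real sentence begins.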


### Response Variable (Y): N-hot-coding
The response variable will be a matrix of one-hot-codings for all the different types of 'Issues'/label3 categories. In this set, there are

In [None]:
# Model Output: multinomial & multiclass labels (level 3)
Y = pd.get_dummies(d['Label3'].values) # one-hot coding

# get integers representing each (label3) class (these are the column names)
Ynames_int = Y.columns # notice the confusing mapping of different integers to different integers

# get english issue labels corresponding to each integer value in Ynames_int
Ynames_char = [cats[2][i] for i in Ynames_int] # actual names

# Finally, convert Y into a numpy matrix (not a pandas DataFrame)
Y = Y.values
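
The integer-to-integer mapping mentioned in the comments can be confusing, so here is a small sketch on toy labels. The label values 5, 2 and 9 are made up; the point is that the column position in the one-hot matrix is not the same as the original label integer, and the column names are what map one back to the other.

```python
import pandas as pd

labels = pd.Series([5, 2, 5, 9])      # hypothetical label integers left after truncation
Y_df = pd.get_dummies(labels)         # one column per distinct label, sorted by label value
col_labels = Y_df.columns.tolist()    # column position -> original label integer
Y_toy = Y_df.values.astype(int)       # plain 0/1 numpy matrix

# recover the original labels: argmax gives the column position,
# col_labels maps that position back to the label integer
recovered = [col_labels[j] for j in Y_toy.argmax(axis=1)]
print(col_labels)  # [2, 5, 9]
print(recovered)   # [5, 2, 5, 9]
```

The same lookup applies to model predictions: take the argmax of each predicted row, then index into the column names (and from there into the category names) to get a readable label.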
