## PART 0: Background

This notebook presents an LSTM model for supervised text classification of consumer complaints. The model takes US consumers' written complaints about financial products and classifies the text into various financial categories (e.g., mortgage fraud, undue fees, identity theft). The model also uses a 'topic embedding' technique inspired by the [instacart model](https://tech.instacart.com/deep-learning-with-emojis-not-math-660ba1ad6cdc) in order to learn about the relationships among categories in a <b>low-dimensional embedding space</b>. Each category is transformed from a discrete, independent entity into a vector of continuous values: categories closer in this space are more related/redundant.

The model is based on a similar analysis I performed during an [Insight Data Fellowship](https://www.insightdatascience.com/) using proprietary data for text classification and sentiment analysis (from an AI start-up in Toronto, Canada).

The crux of the demo is a <b>Long Short Term Memory</b> (`LSTM`) deep-learning model using the `Keras` API. 

### Analysis
This notebook covers the following steps:

+ basic NLP preprocessing
+ LSTM model with category embeddings
+ model validation
+ visualizations of the category embedding layer (using t-SNE)
+ see how the model classifies new complaints about Cryptocurrencies (like Coinbase troubles!)
+ TODO: hyperparameter tuning (see separate scripts/functions in [FinComplain_LSTM_default_hyperparam-tuning.ipynb](https://github.com/faraway1nspace/NLP_topic_embeddings/blob/master/FinComplain_LSTM_default_hyperparam-tuning.ipynb))

### Data
The data are included in the 'data' directory as a large compressed .csv file from the US Consumer Financial Protection Bureau [available here](https://www.consumerfinance.gov/data-research/consumer-complaints/search/?from=0&searchField=all&searchText=&size=25&sort=created_date_desc). To save space, I keep the csv file compressed as a tar.xz file; in the Python code, I extract the data for one-time use and save the temporary csv in my "tmp" directory. This works on Linux/Mac, but Windows users will have to extract the .tar.xz file manually and modify the script to import the extracted data from whichever folder the contents are decompressed into.
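For readers without a Unix `tar` command, the extraction step could also be done from Python itself. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_complaints` is hypothetical and not part of the original scripts.

```python
import os
import tarfile

def extract_complaints(archive_path, dest_dir):
    """Extract a .tar.xz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    # "r:xz" opens an xz-compressed tar archive for reading
    with tarfile.open(archive_path, mode="r:xz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names

# Example call (paths as used later in this notebook):
# extract_complaints("data/complaints-2018-09-30_17_53.csv.tar.xz", "data/")
```

Because this avoids a shell call, the same line should behave the same way on Windows, Linux and Mac.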

### Neural Architecture
The category embedding layer enters the model via another input head (bottom layers), parallel to the text inputs, which go through the word-embedding and the LSTM (upper layers). The 'default model' analysis comprises only the word-embedding + LSTM (see [FinComplain_LSTM_default_hyperparam-tuning.ipynb](https://github.com/faraway1nspace/NLP_topic_embeddings/blob/master/FinComplain_LSTM_default_hyperparam-tuning.ipynb)).

![foo](img/architecture_embedding.png)

## PART 1: Setup and NLP
### Libraries
The main workhorse functions are in the `keras` API (with Tensorflow as the backend) and the `nltk` package for NLP pre-processing of the text data.

In [1]:
%matplotlib notebook
import os
import pandas as pd
import numpy as np
import re 
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer
from keras import backend as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM, RepeatVector, concatenate, Dense, Reshape, Flatten
from keras.models import Model 
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import roc_curve, auc, confusion_matrix, f1_score
from array import array

# set the working directory
# os.chdir(".") # generally a good practise


Using TensorFlow backend.


### Data (extraction and import)
The following decompresses the tar.xz file by calling the Unix `tar` command through `os.system()`. If this doesn't work for you, just manually navigate to the data directory and decompress the .tar.xz file using whatever program you have.

In [2]:
os.system("tar xf data/complaints-2018-09-30_17_53.csv.tar.xz -C data/")
#os.system("tar xf data/complaints-2018-09-30_17_53.csv.tar.xz -C /tmp/") # for Mac or Linux users, I like to save it to the temporary directory

0

In [3]:
# data file
data_dir = "data/" # I like to extract to /tmp/ (on linux/mac)
fname = "complaints-2018-09-30_17_53.csv"
f = data_dir + fname

# read the complaint data 
d_raw = pd.read_csv(f, usecols = ['State','Complaint ID','Consumer complaint narrative','Product', 'Sub-product', 'Issue', 'Sub-issue'])
d_raw.shape # notice 191829 rows and 7 columns
os.system("rm data/complaints-2018-09-30_17_53.csv")

# fill NaN with blanks
for col_ in ['Product','Sub-product','Issue']:
   d_raw[col_] = d_raw[col_].fillna(" ") # fill NaN with a blank character


In [4]:
print(d_raw[['Product','Issue','Consumer complaint narrative']].iloc[0:5,:])

                                             Product  \
0  Credit reporting, credit repair services, or o...   
1                                           Mortgage   
2                        Credit card or prepaid card   
3                                           Mortgage   
4  Money transfer, virtual currency, or money ser...   

                            Issue  \
0     Improper use of your report   
1  Trouble during payment process   
2    Problem when making payments   
3  Credit decision / Underwriting   
4           Other service problem   

                        Consumer complaint narrative  
0  I spoke with the creditors and I also called X...  
1  Reference closed Consumer complaint file # XXX...  
2  I have had a credit card with Discover for sev...  
3  CitiMortgage gave me a modification, then almo...  
4  XXXX XXXX told me to call Capital One so they ...  


The <b>key fields</b> of the data include:
+ `Consumer complaint narrative`: the text of the consumer's complaint
+ `Product`: highest-level of complaint categories 
+ `Sub-product`: 2nd-level of complaint categorization
+ `Issue`: 3rd-level of complaint categorization
+ Sub-Issue: ...

We will work at the level of <b>Issue</b> (which I will refer to as `labels3`). There are over 400 sub-issues. 

The following cell combines Products and Issues, and collects all the unique categories. We will also <b>truncate</b> the data to exclude Issues with fewer than <b>10</b> representatives.

In [5]:
# factorize the two levels (Product and Product+Issue) to get unique values
d_raw['Label1'] = pd.factorize(d_raw['Product'])[0]
# combine Product + Issues
d_raw['Label3'] = pd.factorize(d_raw['Product'] + d_raw['Sub-product']+d_raw['Issue'])[0] # 570 Categories
 
# Dictionary: category integers vs. category names
cats = [pd.factorize(d_raw['Product'])[1], 
        pd.factorize(d_raw['Product'] + d_raw['Sub-product'])[1], 
        pd.factorize(d_raw['Product'] + d_raw['Sub-product']+d_raw['Issue'])[1]]

# truncate the data: only use categories with at least 10 observations
col_label = 'Label3' # columns to use for filtering
cutoff = 10 # truncation cutoff
labels_counts = d_raw.groupby([col_label]).size() # counts of Level3 categories 
which_labels = np.where(labels_counts>=cutoff)[0] # which categories have at least 'cutoff'

# make new (truncated) dataset
ixSubset = d_raw.Label3.isin(which_labels) # subset integers
# new dataset 'd', as subset of d_raw
d = (d_raw[ixSubset]).copy()

# new data set
print(d.shape) # vs d_raw.shape
print(cats)

# del d_raw


(191193, 9)
[Index(['Credit reporting, credit repair services, or other personal consumer reports',
       'Mortgage', 'Credit card or prepaid card',
       'Money transfer, virtual currency, or money service',
       'Checking or savings account', 'Payday loan', 'Debt collection',
       'Vehicle loan or lease', 'Student loan', 'Credit card', 'Consumer Loan',
       'Bank account or service', 'Payday loan, title loan, or personal loan',
       'Credit reporting', 'Prepaid card', 'Money transfers',
       'Other financial service', 'Virtual currency'],
      dtype='object'), Index(['Credit reporting, credit repair services, or other personal consumer reportsCredit reporting',
       'MortgageConventional home mortgage',
       'Credit card or prepaid cardGeneral-purpose credit card or charge card',
       'MortgageConventional fixed mortgage',
       'Money transfer, virtual currency, or money serviceDomestic (US) money transfer',
       'Checking or savings accountSavings account', 'P

Let's visualize the distribution of categories in the data. This is illustrative of the <b>'lumping & splitting' dilemma</b>: no taxonomy (ontology?) can be defined purely objectively; rather, human arbiters are constantly confronted with potentially new categories and ask: "do I lump together two categories into one, or do I create a new category altogether?" The arbitrariness of the taxonomy leads to an exponential increase in the number of categories over time (more categories beget more categories = exp. growth).

Later, this will motivate the need for sample-weighting.
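The skew can also be summarized as a single number: how many of the most frequent categories are needed to cover a given share of the complaints. The following is a minimal sketch on toy labels; the function name `categories_to_cover` is hypothetical, and in the notebook the input would be something like `d['Label3']`.

```python
from collections import Counter

def categories_to_cover(labels, share=0.5):
    """Number of most-frequent categories needed to cover `share` of the rows."""
    counts = sorted(Counter(labels).values(), reverse=True)
    target = share * len(labels)
    running = 0
    for n_categories, count in enumerate(counts, start=1):
        running += count
        if running >= target:
            return n_categories
    return len(counts)

# toy data: category "a" alone accounts for 6 of the 10 observations
toy_labels = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(categories_to_cover(toy_labels))       # categories needed for 50%
print(categories_to_cover(toy_labels, 0.8))  # categories needed for 80%
```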

In [7]:
sorted_counts = np.sort(labels_counts.values) # sorted copy of the counts of each category ('issues'); labels_counts itself is left intact

import matplotlib.pyplot as plt
# staircase plot for frequency data
plt.title('Frequency of Categories in the Consumer Complaint Database')
plt.step(np.arange(len(sorted_counts)), sorted_counts)
plt.ylabel('Counts')
plt.xlabel('Categories')
plt.text(50,25000, "The top 14 categories represent 50% of the data\n The remaining data is made up of 597 other categories ")
plt.show()


### Prepare data:  natural language processing for cleaning
We use some basic NLP techniques to prepare the data for input into the LSTM. This dataset is already fairly clean; a messier dataset would need more work, such as removing non-English respondents or auto-correcting spelling. The steps are:
+ remove/replace contractions (e.g., can't vs cannot)
+ remove non-alphanumeric characters
+ remove double whitespace
+ remove stop words
+ stemming (e.g., {improvement, improved} = {improv,improv})
+ cap the number of words for model

<b>WARNING: the stemming takes a long time >2 minutes </b>


In [6]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk import download as nltk_downloader 

# a personal stemming function to speed up the stemming: uses a fast-dictionary for already stemmed words, which reduces the computation time by half
class myStemmer:
   def __init__(self):
      self.d = {}
      self.stemmer = SnowballStemmer("english")
   def stem(self,word):
      if word in self.d:
         ret = self.d[word]
      else:
         ret = self.stemmer.stem(word)
         self.d[word] = ret
      return ret
   def deleteme(self):
      del self.d
      self.d = {}
      print(self.d)

# quick function to replace substitutions
def decontracted(phrase):
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

 # function to do some basic NLP pre-processing steps: replacing contractions, stemming words, removing stop words
def nlp_preprocess(text_column, # column in Panda table with text
                   stop_words, # list of English stopwords
                   word_clip = 300): # truncate the number of words
   print("starting NLP process...")
   mystem = myStemmer()
   # remove contractions
   cTextl = [decontracted(x) for x in text_column.values.tolist()]
   # remove double spacing and non-alphanumeric characters
   cTextl=[re.sub(' +',' ',re.sub(r'\W+', ' ', x)) for x in cTextl]
   # lower case the words
   cTextl = [x.lower() for x in cTextl]
   # stop words and stemming
   for i in range(0,len(cTextl)):
      rawtext = cTextl[i].split(" ") # splits sentence by spaces
      rawtext = rawtext[0:min(word_clip,len(rawtext))] # take only 300 words maximum
      # stem and remove stopwords in one line (expensive operation)
      newtext = " ".join(mystem.stem(word) for word in rawtext if not word in stop_words)  # loop through words, stem,join
      cTextl[i] = newtext
   print("done NLP processing.")
   return pd.DataFrame(cTextl)

# get the default English stopwords from the nltk package
stop_words = set(stopwords.words('english')) # list of stopwords to remove

# name of column with the consumer complaint text (to feed into the LSTM)
col_text = 'Consumer complaint narrative' # name of the column with the text 

# NLP: pre-process the text/complaints
cText = nlp_preprocess(d[col_text],stop_words, word_clip = 300)

# demo: compare original text versus processed text
print("Original text: " + d[col_text].iloc[0] + "\n") # original (positional index, to match cText.iloc below)
print("Processed text: " + cText.iloc[0,0]) # stemmed


starting NLP process...
done NLP processing.
Original text: I spoke with the creditors and I also called XXXX XXXX and I was given false information from both parties concerning ways of getting this problem resolved. I was told in the past XXXX XXXX were experiencing problem with this situation.

Processed text: spoke creditor also call xxxx xxxx given fals inform parti concern way get problem resolv told past xxxx xxxx experienc problem situat 
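One caveat of the pre-processing above: contractions are expanded before the text is lower-cased, and the patterns in `decontracted` are case-sensitive, so a capitalized "Can't" falls through to the generic `n't` rule and becomes "Ca not". A minimal sketch of a case-insensitive variant follows; the name `decontracted_ci` is hypothetical and this is only one possible fix (lower-casing first would be another).

```python
import re

# (pattern, replacement) pairs mirroring `decontracted`; order matters:
# the specific forms must run before the generic "n't" rule
_CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def decontracted_ci(phrase):
    """Expand English contractions, ignoring letter case."""
    for pattern, replacement in _CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase, flags=re.IGNORECASE)
    return phrase

print(decontracted_ci("Can't pay, won't pay"))
```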


### Prepare Data: Features (X) and Labels (Y)
Having cleaned the text data, we now use the Keras `Tokenizer` to vectorize the text into integers. The integers represent the most common 2000 words, which is specified by the object `max_tokens`. NOTE: `max_tokens` should be considered a type of <b>hyper-parameter</b>. More words can potentially capture more meaning, or more noise.
+ The matrix of token-sequence (`X`, below) will be the features for the word-embedding.
+ The response variable (`Y`) will be a matrix of one-hot-codings for all the different types of 'Issues'/`Label3` categories. 
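To make the tokenization step concrete, here is a pure-Python stand-in for what the tokenizer and the zero-padding do. It is a sketch, not the Keras implementation: the helper names `fit_vocab` and `to_padded` are hypothetical, and it assumes the Keras conventions as I understand them (index 0 reserved for padding, only word indices below the cap are kept, zeros are padded on the left).

```python
from collections import Counter

def fit_vocab(texts, max_tokens):
    """Map the most frequent words to integers 1, 2, ... (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # keep only word indices strictly below max_tokens
    return {word: i for i, word in enumerate(ranked, start=1) if i < max_tokens}

def to_padded(texts, vocab):
    """Convert texts to integer sequences and left-pad with zeros to equal length."""
    seqs = [[vocab[w] for w in text.split() if w in vocab] for text in texts]
    width = max(len(s) for s in seqs)
    return [[0] * (width - len(s)) + s for s in seqs]

toy_texts = ["startcodon late fee charg", "startcodon fee"]
vocab = fit_vocab(toy_texts, max_tokens=4)
padded = to_padded(toy_texts, vocab)
print(vocab)
print(padded)
```

The rarest word ("charg") falls outside the cap and is silently dropped, which is the trade-off that makes `max_tokens` worth tuning.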

In [9]:
# maximum number of words to consider in corpus for embedding (2000-10000 seems the general range). You should treat this like a (coarse) hyperparameter
max_tokens = 2000 # maximum number of word/tokens 
tokenizer = Tokenizer(num_words=max_tokens, split=' ')
tokenizer.fit_on_texts("STARTCODON " + cText[0])
# notice the addition of a start codon to signal to the LSTM where the sentence begins (due to the subsequent zero-padding to standardize the length of every tokenized-sentence/sequence)

# Model Input: tokenized the text data for input into word-embedding layer
X = pad_sequences(tokenizer.texts_to_sequences(("STARTCODON " + cText[0]).values)) # tokenize and pad with zeros
n_obs = X.shape[0] 

# Model Output: multinomial & multiclass labels (level 3)
Y = pd.get_dummies(d['Label3'].values) # one-hot coding
Y = Y.values # Convert Y into a numpy integer matrix

### Prepare Data: Category Embedding
The category embedding is another input into the model. We are not TELLING the model which Label/category is assigned to each sentence. Instead, we merely represent each sentence with <b>all possible labels</b>. This input is `X_labels3`. For each consumer complaint, it is merely a vector of integers `[0,1,2,3,...,nLabels3-1]`.

The point of this exercise is for the Embedding layer to learn the relationship among the different categories/labels3.

In [10]:
# Model Data 2: Inputs: vector of Categories/Labels (level 3) (literally, 0,1,2,3,...,N_topics-1)
uLabels3 = d['Label3'].unique() # unique level 3 labels
nLabels3 = len(uLabels3) # number of unique level 3 labels
X_labels3  = np.repeat(np.array([i for i in range(0,nLabels3)],dtype=int).reshape(1,nLabels3),n_obs,axis=0) # input for embedding layer

print(X_labels3)

[[  0   1   2 ... 422 423 424]
 [  0   1   2 ... 422 423 424]
 [  0   1   2 ... 422 423 424]
 ...
 [  0   1   2 ... 422 423 424]
 [  0   1   2 ... 422 423 424]
 [  0   1   2 ... 422 423 424]]


### Sample Weights
Due to the extreme skew in frequency of different categories/labels3 in the data, we should use the `sample_weight` argument to up-weight the contributions of rare categories, and down-weight the contributions of the most common categories. The vector `vWeights` holds one such weight per observation.

In [11]:
# sample weights: balancing the weights among different labels (this function only works on NON-multiclass labels, i.e., it works when there is ONE label per observation/row)
def get_class_weights(Y, # N-hot-coding response matrix
                      clip_ = 100000): # maximum weight for rarer cases
   # total counts of different categories
   weights_total_class_counts = (Y).sum(axis=0)
   # make weights inversely proportional to the categories' frequency
   weights_by_class = (min(weights_total_class_counts)/weights_total_class_counts)
   # convert one-hot coding (in Y) to integers (to identify the category per observation/row)
   Y_int = np.argmax(Y,axis=1)
   # raw sample weights
   vWeights_raw = np.array([weights_by_class[i] for i in Y_int], dtype=float)
   # cap the weights at some high value (to prevent overflow)
   vWeights = np.clip(vWeights_raw * (Y.shape[0]/sum(vWeights_raw)),0,clip_)
   return vWeights

vWeights = get_class_weights(Y,5) # get weights
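The weighting scheme in `get_class_weights` can be checked by hand on a toy example. The sketch below re-implements the same three steps (inverse-frequency weight, rescaling so the weights average to one, clipping) in plain Python; the name `toy_sample_weights` is hypothetical.

```python
from collections import Counter

def toy_sample_weights(labels, clip=5.0):
    """Inverse-frequency sample weights, rescaled to mean 1 and capped at `clip`."""
    counts = Counter(labels)
    rarest = min(counts.values())
    # weight inversely proportional to the category frequency
    raw = [rarest / counts[label] for label in labels]
    # rescale so that the weights sum to the number of observations
    scale = len(labels) / sum(raw)
    return [min(w * scale, clip) for w in raw]

# 8 observations of a common category, 2 of a rare one
weights = toy_sample_weights(["a"] * 8 + ["b"] * 2)
print(weights[0], weights[-1])  # weight of a common row vs. a rare row
```

Here each rare row ends up weighted four times as heavily as a common row, so both categories contribute equally to the loss in total.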


### Training vs Test sets
As a crude check of the out-of-sample statistics, we will split the data into two sub-samples: i) a training set (in-sample) and ii) a test set (out-of-sample).

In [12]:
# split the data: set proportion for training/in-sample
fInsampleProportion = 0.4 # proportion of data for training / insample

# training and testing data
X_train, X_test,Y_train, Y_test, W_train, W_test, X_labels3_train, X_labels3_test = train_test_split(X,Y,vWeights, X_labels3, test_size = 1-fInsampleProportion, random_state = 99)

# one-off validation data (this is bad practise, see the cross-validation part)
ix_Validation = np.random.choice(X_test.shape[0],2000) # row indices into the test set
X_val = X_test[ix_Validation]
Y_val = Y_test[ix_Validation]
X_labels3_val = X_labels3_test[ix_Validation]

# Ready for the LSTM model!!!
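The one-off validation subset above is drawn from the test rows, so the validation and test data overlap. A cleaner alternative is a three-way split into disjoint index sets. The following is a minimal sketch under assumed proportions; the helper `three_way_split` and the 10% validation share are illustrative choices, not part of the original analysis.

```python
import random

def three_way_split(n_obs, train_frac=0.4, val_frac=0.1, seed=99):
    """Shuffle row indices and cut them into disjoint train/validation/test sets."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    n_train = round(n_obs * train_frac)
    n_val = round(n_obs * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index lists could then be used to slice `X`, `Y`, `vWeights` and `X_labels3` consistently.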


## PART 2: LSTM + Embedding model
Next, we run a Long Short-Term Memory model for reading the (tokenized) text data, and classifying the categories ('Issues'/label3). The core layers are:
+ word `Embedding()` (like word2vec)
+ category `Embedding()` (like in the famous instacart model)
+ `LSTM()` 
+ `RepeatVector()` and `concatenate()`: to merge the outputs from the two embeddings (the LSTM output and the category embedding). 
+ `Dense()`: layer before the final activation layer
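The role of `RepeatVector()` and `concatenate()` is easiest to see through shapes: the single LSTM output vector is copied once per category, and each copy is joined with that category's embedding. The following plain-Python sketch mimics this for one observation, with toy dimensions; `repeat_and_concat` is a hypothetical helper, not a Keras function.

```python
def repeat_and_concat(lstm_out, categ_embed):
    """Tile one LSTM output vector per category and append each category's embedding."""
    return [list(lstm_out) + list(embedding) for embedding in categ_embed]

lstm_out = [0.1, 0.2, 0.3, 0.4]                       # stand-in for dim_out_lstm = 4
categ_embed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # 3 categories, dim_embed_categ = 2
merged = repeat_and_concat(lstm_out, categ_embed)
print(len(merged), len(merged[0]))  # rows = categories, columns = 4 + 2 features
```

Each row of the merged matrix therefore pairs the same text representation with a different category, which is what lets the later dense layers score every category against the complaint.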

Notice the hyperparameters:
+ `dim_embed_lstm`: dimension of the word embedding
+ `dim_out_lstm` : dimension of the LSTM output
+ `dim_embed_categ` : dimension of the category-embedding (should be less than sqrt(N_categories))
+ `dim_hidden_nodes_final` : number of nodes in the final hidden layer (not so important to tune)
+ `fDropout`: regularization, dropout rate for inputs into the LSTM
+ `fDropout_RNN`: regularization, dropout for the connections between the recurrent units

+ `batch_size` : batch size for back-propagation (should be 32 - 128)
+ `n_epoch` : number of epochs

<b>Warning:</b> Due to funny memory issues with Jupyter Notebooks, it is best to run TensorFlow through the terminal, not within a notebook. You have been warned...

In [30]:
# Hyperparameters
dim_embed_lstm = 180 # word embedding dimension
dim_out_lstm = 90 # LSTM output dimension
dim_embed_categ = 3 # category embedding dimension
dim_hidden_nodes_final = (np.linspace(dim_out_lstm,Y.shape[1]).round().astype(int))[1] # number of hidden nodes
fDropout = 0.5 # LSTM input dropout
fDropout_RNN = 0.1 #LSTM recurrent-cross-dropout

# number of epochs
n_epoch = 25
batch_size = 200

# use the Keras API to build the model
# Main LSTM model. Two inputs: X=word features (tokenized); X_labels=matrix of categories [0,1,2,...,nLabels-1]
# ... the output is Y: one-hot-coding of labels for each row of X,X_labels.
# Notice the `RepeatVector`, which tiles the LSTM output once per category so that `concatenate` can merge it with the output from the Category embedding
def run_model(X, # tokenized text
              X_labels, # categories vector [1,2,3,...]
              Y, # labels (one-hot-coding)
              W, # sample weights
              X_val, # out-of-sample validation: X
              X_labels_val, # # out-of-sample validation: X_labels
              Y_val, # # out-of-sample validation: Y
              max_tokens, # number of words/tokens used to tokenize sentences
              n_epoch = 30, # hyperparameter
              batch_size = 64, # hyperparameter
              dim_embed_lstm= 200, # word embedding dimension
              dim_out_lstm =100, # dimension of LSTM output
              fDropout_RNN=0.1, # regularization: recurrent_dropout for LSTM
              fDropout=0.33, # regularization: input dropout for LSTM
              dim_embed_categ=5, # embedding dimension of the categories
              dim_hidden_nodes_final = 100, # number of nodes in final hidden layer
              verbose=2):
   nLabels = X_labels.shape[1]
   # main input: the word features (tokenized) 
   lstm_input_layer = Input(shape=(X.shape[1],), dtype='int32', name='lstm_input',) # lstm input
   # word embedding (like word2vec)
   lstm_embed_layer = Embedding(input_dim=max_tokens, output_dim=dim_embed_lstm, input_length=X.shape[1])(lstm_input_layer)
   lstm_output = LSTM(dim_out_lstm, dropout = fDropout, recurrent_dropout = fDropout_RNN)(lstm_embed_layer)
   # reshape the LSTM output to concatenate with the category embedding
   lstm_reshape = RepeatVector(nLabels)(lstm_output)  
   label3_input_layer = Input(shape=(X_labels.shape[1],), dtype='int32', name='label3_input') # input the topics
   label3_embed_layer = Embedding(input_dim=nLabels, output_dim = dim_embed_categ, input_length=X_labels.shape[1])(label3_input_layer) # topic embedding: should have dim: None,nLabels,embed_dim
   # merge the LSTM output with the category embedding
   x = concatenate([lstm_reshape,label3_embed_layer],axis=2) # axis 2 is the feature axis: both inputs have shape (None, nLabels, features)
   # final hidden layer
   hidden_layer = Dense(dim_hidden_nodes_final, activation='relu')(x)
   final_layer = Dense(1, activation='sigmoid')(hidden_layer) # main output for categories
   # reshape the output so that it is multinomial in [n_obs, n_categories]
   final_layer2 = Flatten()(final_layer) # dimension: [n_obs, n_categories]
   main_output = Dense(Y.shape[1], activation='softmax',name = 'main_output')(final_layer2) # main output for categories 
   model = Model(inputs=[lstm_input_layer, label3_input_layer], outputs=[main_output])
   model.compile(loss = "categorical_crossentropy", optimizer='adam')
   model.summary() # summary() prints the table itself and returns None
   history = model.fit({'lstm_input': X, 'label3_input': X_labels }, {'main_output': Y}, epochs = n_epoch, batch_size=batch_size, verbose = verbose, sample_weight = W, validation_data=([X_val, X_labels_val], Y_val))
   return model, history

model,history = run_model(X_train,X_labels3_train, Y_train, W_train, X_val,X_labels3_val,Y_val, max_tokens, n_epoch, batch_size, dim_embed_lstm, dim_out_lstm, fDropout_RNN, fDropout, dim_embed_categ, dim_hidden_nodes_final,verbose=2)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
lstm_input (InputLayer)         (None, 301)          0                                            
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 301, 180)     360000      lstm_input[0][0]                 
__________________________________________________________________________________________________
lstm_4 (LSTM)                   (None, 90)           97560       embedding_7[0][0]                
__________________________________________________________________________________________________
label3_input (InputLayer)       (None, 425)          0                                            
__________________________________________________________________________________________________
repeat_vec

Notice that the validation loss seems to reach a minimum at around 21 to 25 epochs.

## PART 3: Validation (ROC/AUC scores)
To validate the model, we will predict the categories of the hold-out data (`X_test`,`Y_test`), and calculate the out-of-sample <b>ROC/AUC score</b>. We will do this both _globally_ and _per-category_. It is important to do this _per category_ (labels3) because the global score is dominated by the roughly 3% of categories that are most common.
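As a reminder of what is being measured: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by brute-force pair counting on toy data, both per category (column by column) and globally (on the flattened matrices). The function names are hypothetical; the notebook itself uses `roc_curve` and `auc` from scikit-learn.

```python
def auc_score(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_category_auc(Y, P):
    """One AUC per column of the one-hot matrix Y and probability matrix P."""
    n_categories = len(Y[0])
    return [auc_score([row[j] for row in Y], [row[j] for row in P])
            for j in range(n_categories)]

def global_auc(Y, P):
    """AUC on the flattened matrices (all categories pooled)."""
    return auc_score([y for row in Y for y in row], [p for row in P for p in row])

Y_toy = [[1, 0], [0, 1], [1, 0], [0, 1]]
P_toy = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
print(per_category_auc(Y_toy, P_toy), global_auc(Y_toy, P_toy))
```

Even in this tiny example the pooled score differs from the per-category scores, which is why both are reported.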

In [31]:
# out-of-sample data: predict the categories
pwide =  model.predict([X_test,X_labels3_test]) # probability vector
pvec = pwide.flatten() # flatten probability matrix (multinomial) into a vector
yvec = Y_test.flatten() # flatten multinomial matrix into a vector

# Global AUC score (all categories)
global_fpr, global_tpr, threshold = roc_curve(yvec, pvec)
global_roc_auc = auc(global_fpr, global_tpr) # 
print("Global AUC (holdout) score is %f" % global_roc_auc)  # 0.94

# Per category AUC scores 
roc_topic = [] # container for per-category AUC scores
Y_labels_test = Y_test.argmax(axis=1)
for i in range(0,Y.shape[1]):
    tmptopic_fpr, tmptopic_tpr, threshold = roc_curve(Y_test[:,i].flatten(), pwide[:,i].flatten())
    roc_topic.append(auc(tmptopic_fpr, tmptopic_tpr))

# average ROC/AUC score over all categories
print("Average AUC (holdout) score is %f" % (sum(roc_topic)/len(roc_topic)))

# make a histogram of the distribution of AUC-scores (per category)
%matplotlib notebook
import matplotlib.pyplot as plt
n, bins, patches = plt.hist(roc_topic, 20, facecolor='blue', alpha=0.5)
plt.xlabel('AUC scores (per category)')
plt.ylabel('Frequency')
plt.title(r'cv-AUC scores per category')
plt.show()

Global AUC (holdout) score is 0.934826
Average AUC (holdout) score is 0.886575



It is reassuring that the overwhelming majority of categories have decent predictive performance, as shown in the above histogram of per-category AUC scores. Only a few are between 0.5 and 0.8. The global AUC score is nearly the same as that of the [basic LSTM model](https://github.com/faraway1nspace/NLP_topic_embeddings/blob/master/FinComplain_LSTM_default_model.ipynb). So, in this case, there was little _predictive_ improvement from the embeddings (this differed from my experience with a proprietary dataset with more redundant labels/categories).

Nonetheless, we can inspect the embeddings to see what was learned.


## PART 4: Qualitative Validation: Category Embeddings
This script differs from the ['default'/'basic' LSTM model](https://github.com/faraway1nspace/NLP_topic_embeddings/blob/master/FinComplain_LSTM_default_model.ipynb) in its category embeddings. Each category is transformed from a discrete, independent entity into a vector of continuous values: categories closer in this space are more related/redundant.

We can inspect these vectors by accessing the embedding weights (`get_weights()`) for the model layer #5.

In [32]:
# get the weights of the embedding
embtopic = model.layers[5].get_weights()[0] #
print(embtopic.shape)

# reduce the dimensions to 2 (for visualization)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=0, perplexity=30, n_iter=300)
tsne_res = tsne.fit_transform(embtopic)

# cluster the embeddings  with KMeans
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=7) # cluster the category embeddings into 7 groups
kmeans.fit(embtopic) # fit the kmeans
grp_kmeans = kmeans.predict(embtopic)

# plot the kmeans groups 
%matplotlib notebook
import matplotlib.pyplot as plt
clmap = plt.cm.get_cmap('nipy_spectral') # color palette
vColour = clmap(grp_kmeans/max(grp_kmeans)) # assign colors to different kmeans-clusters
plt.scatter(tsne_res[:,0],tsne_res[:,1], label='Category Embedding (t-SNE)', c=vColour, s=25, marker="o")
plt.title('Clustering of Financial Complaint Categories\n (Low-Dimensional Embedding)')
plt.ylabel('t-SNE Embedding Dimension 2')
plt.xlabel('t-SNE Embedding Dimension 1')
plt.show()


(425, 3)



How do the _inferred_ clusters compare to the Bureau's 18 financial categories (what they call "Products")? We can use the <b>Rand Index</b> to measure the similarity between the two groupings. The index is sensitive to the a-priori number of clusters we set for the K-means; therefore, we will calculate the Rand Index over a sequence of K-means cluster numbers.
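The Rand Index is the share of item pairs on which two groupings agree: a pair counts as an agreement if both groupings put the two items together, or both keep them apart. The brute-force sketch below makes that definition explicit; the name `rand_index_pairs` is hypothetical, and it is meant as a slow cross-check of the definition rather than a replacement for a vectorized implementation.

```python
from itertools import combinations

def rand_index_pairs(a, b):
    """Share of item pairs that groupings `a` and `b` treat the same way."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(a)), 2):
        same_in_a = a[i] == a[j]
        same_in_b = b[i] == b[j]
        agree += (same_in_a == same_in_b)
        total += 1
    return agree / total

# the groupings disagree only on the last pair (items 2 and 3)
print(rand_index_pairs([0, 0, 1, 1], [0, 0, 1, 2]))
```

Note that the index depends only on which items are grouped together, not on the integer names of the groups.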

In [34]:
from scipy.special import comb
# Rand Index
def rand_index_score(clusters, classes):
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)

# calculate the rand index per Kmean-cluster-number
cat_combos = (d.drop_duplicates(subset=['Label1','Label3'])).copy()
nLabels1 = len(cats[0]) # number of labels ("Products")
rand_index_per_k = [0,0] # 
# loop through different number of clusters
for k in range(3,nLabels1): 
   kmeans = KMeans(n_clusters=k) # cluster the category embeddings into k groups
   kmeans.fit(embtopic) # fit the kmeans
   grp_kmeans = kmeans.predict(embtopic)
   rand_index_per_k.append(rand_index_score(cat_combos['Label1'].values, grp_kmeans))

print(rand_index_per_k)

# plot the Rand Indices
import matplotlib.pyplot as plt
plt.plot([i+1 for i in range(0,len(rand_index_per_k))], rand_index_per_k,'b') # entry i holds the score for k=i+1 (the two leading zeros pad k=1,2)
plt.plot([0,nLabels1], [max(rand_index_per_k),max(rand_index_per_k)], 'r--')
plt.title("Similarity Between Category Groupings:\n Inferred Clusters (K-Means) vs. Human Groupings")
plt.ylabel("Rand Index")
plt.xlabel("Number of K-means Clusters")

[0, 0, 0.5630188679245283, 0.6139733629300776, 0.6884794672586015, 0.6914539400665927, 0.7369700332963374, 0.7547502774694783, 0.7736403995560488, 0.7854051054384018, 0.7879467258601554, 0.7896115427302997, 0.7890011098779134, 0.7985238623751387, 0.8052830188679245, 0.8150832408435073, 0.8186903440621531]



Text(0.5, 0, 'Number of K-means Clusters')

### Qualitative assessment using cryptocurrencies complaints

Let's check a few Cryptocurrency-related complaints which were DISCARDED because their category had fewer than 10 observations. For example, consider the following complaint: <i>"In XXXX of this year I opened an account at Coinbase the most well known exchange for cryptocurrency... On XXXX XXXX, 2017 upon trying to do a trade I was asked for re-verification of my identity. as requested I submitted copy of my drivers license, got a notice that verification failed. Tried it a couple of more times. Same result. I called support they said everything looked fine on their end and that they will send it to their next support level. Got an email right away that..."</i>

Obviously, the trained model can't classify this complaint into its specific crypto-related Issue (because we truncated the data to exclude this Issue). Nonetheless, it is interesting to see how the model classifies it in the absence of that category, and whether the result matches our intuition.

We'll make a new subset of the data called `d_crypto`, and then we will have to re-tokenize the text before sending it through the `model.predict` function.
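Once the model returns one probability per category for a complaint, the most likely categories can be read off by sorting. The following is a minimal sketch with toy probabilities and made-up category names; `top_k_categories` is a hypothetical helper, and mapping output columns back to real category names in the notebook is left as an assumption.

```python
def top_k_categories(probs, names, k=3):
    """Return the k (name, probability) pairs with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(names[i], probs[i]) for i in ranked[:k]]

# toy output row for a model with four categories (names are illustrative only)
toy_probs = [0.05, 0.60, 0.25, 0.10]
toy_names = ["category A", "category B", "category C", "category D"]
print(top_k_categories(toy_probs, toy_names, k=2))
```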

In [35]:
# get the discarded categories: those with fewer than 10 observations
which_labels_discarded = [i for i in range(0, labels_counts.shape[0]) if i not in which_labels]  # excluded labels3
# boolean mask of the rows that belong to a discarded category
ixDiscarded = d_raw.Label3.isin(which_labels_discarded)
# data of the discarded complaints & categories
d_discard = (d_raw[ixDiscarded]).copy()  # copy so the edit below does not touch d_raw
d_discard['Consumer complaint narrative'] = d_discard['Consumer complaint narrative'].fillna(" ")  # fill NaN

# find cryptocurrency-related complaints
d_crypto = d_discard[d_discard['Consumer complaint narrative'].str.contains("cryptocurrency")]  # keep narratives containing the word 'cryptocurrency'
print(d_crypto.shape)

# Print the first 1000 characters of the second crypto-related narrative
print(d_crypto['Consumer complaint narrative'].values[1][0:1000])

# use NLP to pre-process the new Cryptocurrency complaints (match the pre-processing used to train the model)
cText_crypto = nlp_preprocess(d_crypto['Consumer complaint narrative'], stop_words = stop_words, word_clip=300)

# tokenize the new cryptocurrency text (remember the tokenizer was trained above)
crypto_tokens = tokenizer.texts_to_sequences(("STARTCODON " + cText_crypto[0]).values)  # tokenize crypto text
X_crypto = pad_sequences(sequences=crypto_tokens, maxlen=X.shape[1])
assert X_crypto.shape[1] == X.shape[1]  # check that the token-sequence length is the same as the model input 'X'

# we also need to provide the model with a vector of categories
X_crypto_labels = X_labels3[range(0,X_crypto.shape[0])]


(2, 9)
In XXXX of this year I opened an account at Coinbase the most well known exchange for cryptocurrency ( https : //www.coinbase.com ). On XXXX XXXX, 2017 upon trying to do a trade I was asked for re-verification of my identity. as requested I submitted copy of my drivers license, got a notice that verification failed. Tried it a couple of more times. Same result. I called support they said everything looked fine on their end and that they will send it to their next support level. Got an email right away that it will take 4 to 5 business days. In the mean time my account is complete frozen. That was on XXXX XXXX, over a month ago. I have over {$42000.00} in my USD account. Called support three more times. They indicated I need to wait my turn in upper support. I tried to withdraw the funds. I could not. I called to close the account they said I could not until issue is resolved by upper support that there is nothing anyone can do.
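
As a reminder of what `pad_sequences` does to the token lists with its default settings (`padding='pre'`, `truncating='pre'`): short sequences are left-padded with zeros, and long ones keep only their last `maxlen` tokens. The pure-Python sketch below mimics that behaviour on toy token ids; `pad_pre` and `toy_tokens` are hypothetical names for illustration, not part of Keras or of this notebook.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # too short: zeros in front
    return padded

# toy token ids (assumed data): one short and one long sequence
toy_tokens = [[5, 8], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy_tokens, maxlen=4))  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

This is why `X_crypto` ends up with the same number of columns as `X`, whatever the length of each complaint.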


After processing the crypto complaints so that they match the tokenized input data (`X`), we use the `model.predict` function to get the multinomial class probabilities (`p_crypto`), then rank the classes from highest to lowest probability (`Y_predicted_crypto`). The resulting integers can be mapped to English labels ('Issues') via our object `cats`.

Intuitively, we'd expect the predicted classes to pertain to something like foreign currency exchanges or withdrawal issues. <b>Do the predicted classes match our intuition?</b>

In [47]:
# predict the Issue/Labels3 category of the crypto complaints
p_crypto = model.predict([X_crypto, X_crypto_labels])  # probability matrix

# get the highest ranking labels3
Y_predicted_crypto = (-1*p_crypto).argsort(axis=1)  # sort descending: highest probability classes to lowest

# get the English-labels of the predicted categories/label3
Issues_predicted_crypto = cats[2][Y_predicted_crypto]
#print(Issues_predicted_crypto)

print("Coinbase complaint: predicted category #1: %s" % Issues_predicted_crypto[1,0]) # Customer consumer complaint
print("Coinbase complaint: predicted category #2: %s" % Issues_predicted_crypto[1,1]) # Virtual currency
print("Coinbase complaint: predicted category #3: %s" % Issues_predicted_crypto[1,2]) # Credit

Coinbase complaint: predicted category #1: Credit reporting, credit repair services, or other personal consumer reportsCredit repair servicesProblem with customer service
Coinbase complaint: predicted category #2: Money transfer, virtual currency, or money serviceVirtual currencyOther transaction problem
Coinbase complaint: predicted category #3: Credit card or prepaid cardGeneral-purpose prepaid cardProblem getting a card or closing an account


#### Crypto-Currency Results
+ The second-highest category correctly identifies the complaint as a <b>virtual currency</b> transaction problem (Warren Buffett would be proud!).
+ The third category is right about <b>closing an account</b>, but instead of treating it as a general virtual currency problem, the model incorrectly assigns it to a _credit or prepaid card account problem_ (although cryptos are perhaps closest to credit cards in the world of finance).
+ Unfortunately, the highest-probability issue is a <b>generic customer service complaint</b>, which is hardly informative.
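
One way to judge how decisive such a ranking is would be to print each top-ranked label alongside its predicted probability: a top category that barely beats the runner-up is weaker evidence than one that dominates. The sketch below uses made-up probabilities and placeholder label names; `top_k_labels`, `toy_probs` and `toy_labels` are hypothetical, and in the notebook the inputs would instead be a row of `p_crypto` and the labels in `cats[2]`.

```python
def top_k_labels(probabilities, label_names, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    order = sorted(range(len(probabilities)),
                   key=lambda j: probabilities[j], reverse=True)
    return [(label_names[j], probabilities[j]) for j in order[:k]]

# toy inputs (assumed data): one probability row and its class labels
toy_probs = [0.05, 0.40, 0.35, 0.20]
toy_labels = ["Issue A", "Issue B", "Issue C", "Issue D"]

for rank, (label, p) in enumerate(top_k_labels(toy_probs, toy_labels), start=1):
    print("predicted category #%d: %s (p = %.2f)" % (rank, label, p))
```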


## PART 4: Hyperparameter Tuning

For the previous LSTM model, the hyperparameters (embedding dimension, maximum number of tokens, number of epochs) were chosen arbitrarily. The next step is to find slightly better values through k-fold cross-validation.

<i>See the next Jupyter notebook, [FinComplain_LSTM_default_hyperparam-tuning.ipynb](https://github.com/faraway1nspace/NLP_topic_embeddings/blob/master/FinComplain_LSTM_default_hyperparam-tuning.ipynb), for the hyperparameter tuning.</i>

## Conclusions
foobar 