## Hyper-Parameter Tuning (default model)
This document performs cross-validation tuning of the hyperparameters used in the "Default" model, which performs text classification of consumer complaints (using an NLP-LSTM model; see the script [here](https://github.com/faraway1nspace/NLP_topic_embeddings/blob/master/FinComplain_LSTM_default_model.ipynb)).

<b>Hyperparameters</b> should be tuned by minimizing a cross-validation loss. In this script, I use a <b>Thompson-sampler</b> (inspired by a crude _reinforcement-learning_ / multi-arm bandit algorithm) to stochastically explore the high-dimensional hyperparameter space and find good hyperparameter values. This is an original algorithm and is untested, but it seems to work well.

The hyperparameters include:
+ the embedding dimension of the word-embedding (like word2vec) 
+ the size of the corpus used in the word-tokenization (which feeds into the word-embedding). 
+ the dimensionality of the Long Short-Term Memory (LSTM) outputs
+ the dropout rate in the LSTM

... as well as choosing the total number of epochs for which to train the LSTM.

### Function overview
+ `run_model` : the main function, which performs an individual cross-validation run
+ `proposal_hyperparam` : the main Thompson-sampler function, which proposes new samples from the hyperparameter space

The procedure is straightforward:
+ get <b>samples</b> from the hyperparameter space
+ do <b>3-fold cross-validation</b> to estimate an Expected Loss (aka hold-out loss)
+ use the Expected Loss as the 'reward' in a <b>multi-arm bandit learner</b>
+ calculate <b>probabilities</b> for each combination of hyperparameters
+ use the <b>Thompson-sampler</b>/multi-arm bandit algorithm to draw a new sample from the hyperparameter space
+ <b>repeat</b> for about 30 iterations.

The multi-arm bandit learner should progressively concentrate its samples on the regions of the hyperparameter space that have a higher probability of minimizing the Expected Loss.
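
As a rough illustration of the loop described above, here is a minimal sketch. It is not the notebook's implementation: the grid values, the `toy_cv_loss` function (a stand-in for 3-fold cross-validation of the LSTM) and the uniform proposal are all assumptions made so that the example runs on its own.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid: names and values are placeholders, not the notebook's settings.
grid = {"embed_dim": [32, 64, 128], "dropout": [0.1, 0.3, 0.5]}
combos = list(itertools.product(*grid.values()))

def toy_cv_loss(combo, n_folds=3):
    """Stand-in for 3-fold cross-validation: mean of three noisy fold losses."""
    embed_dim, dropout = combo
    true_loss = (np.log2(embed_dim) - 6) ** 2 + (dropout - 0.3) ** 2
    fold_losses = true_loss + rng.normal(0, 0.05, size=n_folds)
    return float(fold_losses.mean())

results = {}  # combo index -> estimated Expected Loss
for iteration in range(6):
    untried = [i for i in range(len(combos)) if i not in results]
    # Placeholder proposal: uniform over the untried combos. The notebook
    # instead weights the untried combos by multi-arm bandit probabilities.
    probs = np.full(len(untried), 1.0 / len(untried))
    pick = int(rng.choice(untried, p=probs))
    results[pick] = toy_cv_loss(combos[pick])

best = min(results, key=results.get)
print(combos[best], results[best])
```

Swapping the uniform `probs` for probabilities learned from the results so far is what turns this random search into the bandit-style search described here.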


## Functions: Data Import & Natural Language Pre-Processing

The following are some idiosyncratic functions to import and clean the financial complaint data. For the background of the data source and the models' purpose, please see the [Readme file](https://github.com/faraway1nspace/NLP_topic_embeddings) as well as the [US Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/) at the US Consumer Financial Protection Bureau website.
+ `import_and_clean_data` : reads the complaint data and organizes it for the `keras` LSTM model
+ `nlp_preprocess` : does some basic NLP pre-processing (stemming, removing stop words, etc.)
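
To give a feel for what the pre-processing does to a single complaint, here is a small standalone sketch. It is a simplification under stated assumptions: the stop-word set is a tiny hypothetical one (the notebook uses NLTK's English list), stemming is omitted so that the example needs no NLTK install, and `toy_preprocess` is an illustrative name rather than a function from the notebook.

```python
import re

# Hypothetical, tiny stop-word set (the notebook uses NLTK's English stopwords).
STOP_WORDS = {"i", "the", "my", "a", "to", "is"}

def expand_contractions(text):
    # A few of the contraction rules; specific cases come before the generic "n't".
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    return text

def toy_preprocess(text, word_clip=300):
    text = expand_contractions(text)
    # Replace runs of non-alphanumeric characters with one space, then lower-case.
    text = re.sub(r"\W+", " ", text).strip().lower()
    words = text.split(" ")[:word_clip]  # keep at most `word_clip` words
    return " ".join(w for w in words if w not in STOP_WORDS)

cleaned = toy_preprocess("I can't access my account; the bank won't respond!")
print(cleaned)  # can not access account bank will not respond
```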

In [None]:
# %matplotlib notebook
import os
import time
import pandas as pd
import numpy as np
import re
from math import log,exp
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk import download as nltk_downloader 
from keras import backend as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM, RepeatVector, concatenate, Dense, Reshape, Flatten
from keras.models import Model
from scipy.stats import rankdata as rd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.model_selection import KFold # import KFold
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# set the working directory
os.chdir("/media/AURA/Documents/JobsApplications/insightdata/nlp/demo/consumer_financial_protection/")

# NLP function to replace english contractions
def decontracted(phrase):
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

# function to do some basic NLP pre-processing steps: replacing contractions, stemming words, removing stop words
def nlp_preprocess(text_column, # column in Panda table with text
                   stop_words, # list of English stopwords
                   word_clip = 300): # truncate the number of words
   # remove contractions
   ps = PorterStemmer()  # stemmer    
   cTextl = [decontracted(x) for x in text_column.values.tolist()]
   # remove double spacing and non-alphanumeric characters
   cTextl=[re.sub(' +',' ',re.sub(r'\W+', ' ', x)) for x in cTextl]
   # lower case the words
   cTextl = [x.lower() for x in cTextl]
   # stop words and stemming
   for i in range(0,len(cTextl)):
      rawtext = cTextl[i].split(" ") # splits sentence by spaces
      rawtext = rawtext[0:min(word_clip,len(rawtext))] # take only 300 words maximum
      # stem and remove stopwords in one line (expensive operation)
      newtext = " ".join(ps.stem(word) for word in rawtext if not word in stop_words)  # loop through words, stem,join
      cTextl[i] = newtext
   return pd.DataFrame(cTextl)

# function: import and pre-process the data (prepare for Keras)
def import_and_clean_data(filename, # file name of data to import (either a .csv or a tar.xz file of a .csv)
                          col_label = "Label3",
                          data_dir = "data/", # directory
                          tmp_dir = "/tmp/", # if file is a .tar.xz, where to temporarily extract the data (Windows users need to specify something other than /tmp/)
                          rare_categories_cutoff = 10, # threshold for categories to be included in the training set
                          word_clip = 300): # max number of words in text to accept (only the first 300 words are retained)
   # check
   if "tar.xz" in filename:
      print("decompressing " + filename + " into "+tmp_dir)
      # command for shell
      os_system_command = "tar xf "+data_dir+filename+" -C "+tmp_dir
      print(os_system_command)
      # run decompression command (for Linux/Mac)
      os.system(os_system_command)
      newfilename = tmp_dir + filename.split(".tar.xz")[0]
   else:
      print("importing csv called " + filename)
      newfilename = data_dir + filename
   # read the complaint data 
   d_raw = pd.read_csv(newfilename, usecols = ['State','Complaint ID','Consumer complaint narrative','Product', 'Sub-product', 'Issue', 'Sub-issue'])
   print("imported " + str(d_raw.shape[0]) + " rows of data") # notice 191829 rows and 7 columns
   # fill NaN with blanks
   for col_ in ['Product','Sub-product','Issue']:
      d_raw[col_] = d_raw[col_].fillna(" ") # fill NaN with a character
   # factorize the two levels (Product and Product+Issue) to get unique values
   d_raw['Label1'] = pd.factorize(d_raw['Product'])[0]
   # combine Product + Issues
   d_raw['Label3'] = pd.factorize(d_raw['Product'] + d_raw['Sub-product']+d_raw['Issue'])[0] # 570 Categories
   # Dictionary: category integers vs. category names
   cats = [pd.factorize(d_raw['Product'])[1],  pd.factorize(d_raw['Product'] + d_raw['Sub-product'])[1], pd.factorize(d_raw['Product'] + d_raw['Sub-product']+d_raw['Issue'])[1]]
   # truncate the data: only use categories with at least 10 observations
   labels_counts = d_raw.groupby([col_label]).size() # counts of Level3 categories 
   which_labels = np.where(labels_counts>=rare_categories_cutoff)[0] # which categories have at least 'cutoff'
   # make new (truncated) dataset
   ixSubset = d_raw.Label3.isin(which_labels) # subset integers
   # new dataset 'd', as subset of d_raw
   d = (d_raw[ixSubset]).copy()
   # NLP pre-processing: stopwords removal, stemming, etc.
   # get the default English stopwords from the nltk package
   from nltk.corpus import stopwords 
   stop_words = set(stopwords.words('english')) # list of stopwords to remove
   # get the stemming object from nltk
   ps = PorterStemmer()  # stemmer
   # column in data with the Text data (to feed into the LSTM)
   col_text = 'Consumer complaint narrative' # name of the column with the text 
   # NLP: pre-process the text/complaints
   print("performing NLP pre-processing on column " + col_text + " (remove stop words, stemming,...). This may take a while...")
   cText = nlp_preprocess(d[col_text],stop_words, word_clip = word_clip)
   print("Done NLP pre-processing")
   # process the labels, make into a N-hot-coding matrix 
   Y = pd.get_dummies(d['Label3'].values) # one-hot coding
   # get integers representing each (label3) class (these are the column names)
   Ynames_int = Y.columns # notice the confusing mapping of different integers to different integers
   # get english issue labels corresponding to each integer value in Ynames_int
   Ynames_char = [cats[2][i] for i in Ynames_int] # actual names
   # Finally, convert Y into a numpy matrix (not a panda df)
   Y = Y.values
   print("returning cleaned text data 'cText' for use in tokenization; 'Y' as an one-hot-coding matrix of categories; and 'cats' a dictionary matches columns in Y to english category names")
   if "tar.xz" in filename:
      print("deleting temporary file " + filename)
      os_system_command = "rm "+ newfilename
      print(os_system_command)
      os.system(os_system_command)
      print("done pre-processing")
   return cText, Y, cats

# function to calculate sample_weights for keras argument sample_weight
def get_class_weights(Y, # N-hot-coding response matrix
                      clip_ = 100000): # maximum weight for rarer cases
   weights_total_class_counts = (Y).sum(axis=0)
   weights_by_class = (min(weights_total_class_counts)/weights_total_class_counts) # weight by the rarest case
   Y_int = np.argmax(Y,axis=1)
   vWeights_raw = np.array([weights_by_class[i] for i in Y_int], dtype=float)
   vWeights = np.clip(vWeights_raw * (Y.shape[0]/sum(vWeights_raw)),0,clip_)
   return vWeights


## Functions: Hyperparameter Sampling

With 4 hyperparameters and 3 discrete values each, there are 81 possible _combinations_ of hyperparameters. In reality, for a production/research project you'd likely want more fine-grained hyperparameter values, making it inefficient to do cross-validation on EVERY combination. Instead, we'll do a <b>stochastic search</b> across the space of hyperparameter combinations.
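
The count of 81 is simply 3 × 3 × 3 × 3. A quick way to check this, and to see what a "combination" looks like, is to enumerate the grid. The hyperparameter names and values below are hypothetical placeholders, not the ones used later in the notebook.

```python
import itertools

# Hypothetical grid: 4 hyperparameters with 3 candidate values each.
hyper_parameters = {
    "embedding_dim": [50, 100, 200],
    "vocab_size": [5000, 10000, 20000],
    "lstm_dim": [32, 64, 128],
    "dropout": [0.1, 0.25, 0.5],
}

# Cartesian product of the value lists: one tuple per combination.
combos = list(itertools.product(*hyper_parameters.values()))
print(len(combos))  # 81
print(combos[0])    # (50, 5000, 32, 0.1)
```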

The goal is to find the best combination of hyperparameter values well before having tried all 81 combinations. But how?

Stochastic exploration of the hyperparameter space requires some way to calculate _probabilities_ for each combination of hyperparameters. I use a simple principle from Thompson sampling: the more _uncertain_ a potential action/reward is, the more _likely_ we should try it, and thus reduce our uncertainty and quickly find the highest rewarding action. Here, an _action_ means picking a combination of hyperparameters and estimating its Expected Loss, and the _reward_ is a low Expected Loss (the goal being to find the lowest).

The key is how to estimate the probabilities of sampling each combination of hyperparameters: this is achieved through estimation by an ensemble of regressors (ridge regression and a decision tree). The regressors estimate the Expected Loss using the hyperparameter variables as predictor variables. Using a technique of 'leave-one-out' subsampling, we can turn the <b>stability</b> of these estimates into probabilities over the space of hyperparameter combinations: each time a regressor is refit with one observation left out, the untested combination it predicts to have the lowest loss receives a vote, and the votes are normalized into probabilities. Then, we simply use these probabilities to sample new combinations: combinations that are more consistently predicted to be the best are more likely to be picked.
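
The leave-one-out idea can be sketched on a toy problem. Everything below is illustrative: the 3 × 3 grid, the made-up losses and the learner settings are assumptions, and the code is a simplified stand-in rather than the notebook's `proposal_hyperparam`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Toy grid: 9 combos of two hyperparameters (coded 0, 1, 2); 5 already evaluated.
X = np.array([[a, b] for a in range(3) for b in range(3)], dtype=float)
done = np.array([0, 2, 4, 6, 8])
not_done = np.array([1, 3, 5, 7])
loss_done = np.array([1.0, 0.8, 0.5, 0.9, 0.3])  # made-up CV losses

counts = np.ones(len(not_done))  # prior pseudo-count of 1 per untried combo
for j in range(len(done)):
    keep = np.arange(len(done)) != j  # leave observation j out
    learners = (Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=2, random_state=0))
    for learner in learners:
        learner.fit(X[done][keep], loss_done[keep])
        pred = learner.predict(X[not_done])
        counts += 0.5 * (pred == pred.min())  # half a vote per learner

# Normalize the votes into sampling probabilities, then draw the next combo.
probs = counts / counts.sum()
next_combo = int(rng.choice(not_done, p=probs))
print(probs, next_combo)
```

The pseudo-count of 1 keeps every untried combination reachable, so the sampler never fully stops exploring.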

There are several functions; the main ones pertaining to the hyperparameters are:
 + `make_hyperparameters_combos` : takes a dictionary of hyperparameters and their values and makes a grid of all possible combinations
 + `optimal_order_of_hyperparameter_runs` : finds the optimal order in which to test the different combinations of hyperparameter values. This will find high-contrasting parameter-combinations, so that the space of hyperparameter values is quickly explored (prior to invoking the multi-arm bandit algorithm)
 + `proposal_hyperparam`: main Thompson sampler; proposes new combinations of hyperparameters for running in cross-validation. Switches between a deterministic algorithm and a stochastic algorithm after a preset number of iterations (`toggle_learner`).
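
The "high-contrast" ordering can be illustrated with a simplified greedy rule: start from the first combination, then repeatedly pick the untried combination that is farthest (in total squared distance) from everything picked so far. This is a sketch of the general idea only; the helper name `greedy_contrast_order` is hypothetical, and the distance score is simpler than the one used in `optimal_order_of_hyperparameter_runs`.

```python
import numpy as np

def greedy_contrast_order(grid):
    """Order rows so that each new row contrasts strongly with the rows chosen so far."""
    grid = np.asarray(grid, dtype=float)
    order = [0]  # start with the first combination
    while len(order) < len(grid):
        # Squared Euclidean distance from every row to every already-chosen row.
        diffs = grid[:, None, :] - grid[order][None, :, :]
        total_dist = (diffs ** 2).sum(axis=2).sum(axis=1)
        total_dist[order] = -np.inf  # never re-pick a chosen row
        order.append(int(np.argmax(total_dist)))
    return order

# Toy grid of index codes for two hyperparameters with three levels each.
grid = np.array([[a, b] for a in range(3) for b in range(3)])
order = greedy_contrast_order(grid)
print(order)
```

On this toy grid the second pick is the opposite corner `[2, 2]`, which is the kind of early contrast the warm-up phase is after.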

In [None]:
# internal function for cross-validation: creates all combinations of hyperparameters
def make_hyperparameters_combos(hyper_parameters):
   # hyperparameters' values
   hyp_args = [x[1] for x in hyper_parameters.items()]
   # names of hyperparameters
   hyp_names = [x[0] for x in hyper_parameters.items()]
   # all combos of hyperparameters, as a tensor
   hyp_args_grid = np.meshgrid(*hyp_args)
   # dimensions of the hyperparameters
   hyp_args_dimensions = list(hyp_args_grid[0].shape)
   # total number of hyperparameter combos
   total_combos = int(np.prod(hyp_args_dimensions))
   # reshape into a [n_combos,parameters]
   hyp_grid = (np.array(hyp_args_grid).reshape(len(hyp_args_grid),total_combos)).T
   # convert grid to data.frame
   hyp_pd = pd.DataFrame(hyp_grid, columns = hyp_names)
   # also make a companion sequence
   hyp_seq = [[j for j in range(0,len(x[1]))] for x in hyper_parameters.items()]
   hyp_seq_grid = (np.array(np.meshgrid(*hyp_seq)).reshape(len(hyp_args_grid),total_combos)).T
   # make an empty panda data.frame to fill with results
   empty_res = pd.DataFrame({"cv_loss": [0 for i in range(0,hyp_pd.shape[0])], "best_epoch": [0 for i in range(0,hyp_pd.shape[0])]})
   return hyp_pd, hyp_seq_grid, empty_res

# internal function for cross-validation: optimal order of running different hyperparameter scenarios: calculates a 'scenario distance' by finding scenarios that are maximally contrasting with each other
def optimal_order_of_hyperparameter_runs(hyp_seq_grid):
   scenario_dist = -2 * np.dot(hyp_seq_grid, hyp_seq_grid.T) + np.sum(hyp_seq_grid**2, axis=1) + np.sum(hyp_seq_grid**2, axis=1)[:, np.newaxis]
   scenario_rows = [0] # start with the first row (index 0)
   for i in range(0,hyp_seq_grid.shape[0]):
      cur_dist = -2 * np.dot(hyp_seq_grid, hyp_seq_grid[scenario_rows].T) + np.sum(hyp_seq_grid[scenario_rows]**2, axis=1) + np.sum(hyp_seq_grid**2, axis=1)[:, np.newaxis]
      elem_rank_by_distance = rd(-(((cur_dist**2)).sum(axis=1))**(0.5)).argsort()
      pos_elem = [x for x in elem_rank_by_distance if x not in scenario_rows]
      if len(pos_elem)>0:
         scenario_rows.append(pos_elem[0])
   return scenario_rows

# internal function for cross-validation: make validation sets and test sets
def make_cv_weights(n_obs, fHoldoutProportion = 0.5, kfold = 3, seed = 1000):
   # training set and test set
   ix_train, ix_test = train_test_split([i for i in range(0,n_obs)], test_size = fHoldoutProportion, random_state = seed)
   # divide the training data into cross-validation sets
   cv_splitter = KFold(n_splits = kfold)
   cv_sets = []
   for ix_insample, ix_validation in cv_splitter.split(ix_train):
      cv_sets.append([ix_insample, ix_validation])
   return ix_train, ix_test, cv_sets

# internal function for cross-validation: propose hyperparameters, by two methods:
# ... i) sequential (just loops through combinations)
# ... ii) thompson sampling / multi-arm bandit reinforcement learner (ridge regression & decision trees to try to predict what might be best hyperparameters)
def proposal_hyperparam(run_number, # iteration number for the algorithm
                        cv_results, # current results of the cv-loss (panda frame)
                        hyp_pd, # output from "make_hyperparameters_combos" function
                        hyp_optimal, # output from "optimal_order_of_hyperparameter_runs" function (ordered row indices)
                        toggle_learner = 10, # when to switch to multi-arm bandit learning
                        ridge_alpha = 7, # ridge regression learner: shrinkage parameter
                        max_depth = 2, # tree-learner maximum tree depth
                        multinomial_prior = 1): # diffuse prior on the model space for the multi-arm bandit 
   n_models = len(hyp_optimal)
   if (run_number < toggle_learner) | (run_number > (hyp_pd.shape[0]-1)):
      # just sequentially loop through scenarios
      print("next hyperparameter: estimated by maximum parameter contrast")
      return_hypIx = hyp_optimal[run_number]
      return_hyperparameters = hyp_pd.iloc[return_hypIx]
   else:
      print("next hyperparameter: Thompson sampling of hyperparameters based on a multi-arm bandit learner")      
      which_done = np.where(cv_results['best_epoch'].values >0)
      which_notdone = np.where(cv_results['best_epoch'].values == 0)
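      # set up the ridge regression and decision tree learners
      # (note: the `normalize` argument was removed from Ridge in scikit-learn 1.2; on newer versions, drop it and standardize the predictors beforehand)
      learner1 = Ridge(alpha=ridge_alpha, copy_X = True,normalize=True)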
      learner2 = DecisionTreeRegressor(max_depth = max_depth)
      # multi-arm bandit: use leave-one-out estimation to get probabilities of each scenario being the best
      counts_best_loss = np.zeros(len(which_notdone[0])) + multinomial_prior # notice shrinkage parameter multinomial_prior=1
      for j in range(0,len(which_done[0])):
         # drop one observation
         loo = which_done[0][np.arange(len(which_done[0]))!=j]
         # fit learners
         lr1 = learner1.fit(X = hyp_pd.values[loo], y = cv_results['cv_loss'].values[loo]) 
         lr2 = learner2.fit(X = hyp_pd.values[loo], y = cv_results['cv_loss'].values[loo]) 
         # predict which will be the lowest loss         
         pred_loss1 = lr1.predict(hyp_pd.values[which_notdone]) # expected loss from the learner
         pred_loss2 = lr2.predict(hyp_pd.values[which_notdone]) # expected loss from the learner         
         # increment the number of times each scenario is predicted to be the best
         counts_best_loss += 0.5*(pred_loss1 == min(pred_loss1))
         counts_best_loss += 0.5*(pred_loss2 == min(pred_loss2)) 
      # convert the counts of each scenario being predicted best into sampling probabilities
      thompson_sampling_prob = counts_best_loss/sum(counts_best_loss)
      # Thompson sampler: random sample from the probabilities
      return_hypIx = np.random.choice(which_notdone[0],p=thompson_sampling_prob)
      # look up the hyperparameter values of the sampled combination
      return_hyperparameters = hyp_pd.iloc[return_hypIx]
   return return_hypIx, return_hyperparameters 

