### Complaint Generator

### Quick background
Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.

RNNs can be used to make predictions, or to learn from sequential data and generate similar data.
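
To make the roles of the cell and the three gates concrete, here is a minimal sketch of one time step of a single LSTM unit with a scalar input, written in plain Python. The function `lstm_step`, the `weights` dictionary and its values are hypothetical and chosen only for illustration; Keras implements the same idea with weight matrices over many units.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with a scalar input.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # share of the old cell state to keep
    i = sigmoid(pre("input"))         # share of the candidate value to write
    o = sigmoid(pre("output"))        # share of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state (the unit's output)
    return h, c

# Hypothetical weights, for illustration only.
weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```

The cell state `c` is what carries information across time steps: the forget gate scales the old value and the input gate decides how much of the new candidate is added.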

In [1]:
# import dependencies
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense , Activation
from keras.optimizers import RMSprop
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils
from keras.callbacks import LambdaCallback, ModelCheckpoint
import random
import sys
import io
import ast

Using TensorFlow backend.


In [2]:
import tensorflow as tf
print(tf.test.gpu_device_name())
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

/device:GPU:0


In [3]:
import pandas as pd
df = pd.read_csv('Consumer_Complaints.csv',  encoding='utf-8-sig')
df = df[['Consumer complaint narrative','Product']]
df = df[pd.notnull(df['Consumer complaint narrative'])]
df.rename(columns = {'Consumer complaint narrative':'narrative'}, inplace = True)
df.head(10)

Unnamed: 0,narrative,Product
1,When my loan was switched over to Navient i wa...,Student loan
2,I tried to sign up for a spending monitoring p...,Credit card or prepaid card
7,"My mortgage is with BB & T Bank, recently I ha...",Mortgage
13,The entire lending experience with Citizens Ba...,Mortgage
14,My credit score has gone down XXXX points in t...,Credit reporting
16,I few months back I contacted XXXX in regards...,"Credit reporting, credit repair services, or o..."
25,I '' m a victim of fraud and I have a file wit...,"Credit reporting, credit repair services, or o..."
28,I have been disputing a Bankruptcy on my credi...,"Credit reporting, credit repair services, or o..."
42,Today I received a phone call from a number li...,Debt collection
53,I have gotten 7 calls from AR Resources lookin...,Debt collection


In [4]:
# using bs4 to clean the data

from bs4 import BeautifulSoup
import re
def cleanText(text):
    text = BeautifulSoup(text, "html5lib").text
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'[{}()//]', r' ', text) 
    text = re.sub(r'http\S+', r'<URL>', text)
    text = text.lower()
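    # Meant to strip the XXXX redaction masks, but note that this removes every 'x',
    # so ordinary words lose theirs as well (e.g. "experience" becomes "eperience").
    text = text.replace('x', '')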
    text = text.replace('\n',' ')
    return text
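
Because `text.replace('x', '')` in `cleanText` drops every letter x, words such as "experience" come out as "eperience". A narrower alternative, sketched below under the assumption that redactions always appear as runs of two or more x characters, is to remove only those runs. The helper name `strip_redactions` and the sample sentence are hypothetical, and a word with a genuine double x would still be affected.

```python
import re

def strip_redactions(text):
    """Lowercase the text and remove runs of two or more 'x' characters."""
    return re.sub(r'x{2,}', '', text.lower())

sample = "My experience with XXXX was bad"   # hypothetical complaint text

print(sample.lower().replace('x', ''))   # behaviour of the cell above
print(strip_redactions(sample))          # keeps the x in "experience"
```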

In [5]:
# Apply the cleaning
df['narrative'] = df['narrative'].apply(cleanText)

In [6]:
# See what the cleaned text looks like
df.head(30)

Unnamed: 0,narrative,Product
1,when my loan was switched over to navient i wa...,Student loan
2,i tried to sign up for a spending monitoring p...,Credit card or prepaid card
7,"my mortgage is with bb & t bank, recently i ha...",Mortgage
13,the entire lending eperience with citizens ban...,Mortgage
14,my credit score has gone down points in the l...,Credit reporting
16,i few months back i contacted in regards to ...,"Credit reporting, credit repair services, or o..."
25,i '' m a victim of fraud and i have a file wit...,"Credit reporting, credit repair services, or o..."
28,i have been disputing a bankruptcy on my credi...,"Credit reporting, credit repair services, or o..."
42,today i received a phone call from a number li...,Debt collection
53,i have gotten 7 calls from ar resources lookin...,Debt collection


###  Sequence data for input to LSTM
Right now we have a dataframe with one column of complaint narratives and one of product labels, where each row corresponds to a single complaint. This is pretty far from the 3D shape the input layer of our LSTM network requires: (batch_size, time_steps, features). In model.add(LSTM(units, input_shape=(time_steps, features))), units is the size of the LSTM's hidden state, time_steps is the number of observations in each sequence, and features is the number of possible observable features (i.e., characters in our case); batch_size, the number of sequences processed together, is set later in model.fit. To get our data ready, we need to:

1. Subset the data to form a corpus
2. Format the corpus from #1 into arrays of semi-overlapping sequences of uniform length and next characters
3. Represent the sequence data from #2 as sparse boolean tensors
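
The three steps above can be sketched end to end on a toy corpus. The corpus string and the small `maxlen` are hypothetical values picked so the result is easy to inspect, and nested lists stand in for the NumPy arrays used in the rest of the notebook.

```python
corpus = "my loan was sold"   # step 1: a (hypothetical) toy corpus
maxlen, step = 5, 3

# Map each unique character to an index.
chars = sorted(set(corpus))
char_indices = {c: i for i, c in enumerate(chars)}

# Step 2: semi-overlapping sequences and the character that follows each one.
sentences, next_chars = [], []
for i in range(0, len(corpus) - maxlen, step):
    sentences.append(corpus[i:i + maxlen])
    next_chars.append(corpus[i + maxlen])

# Step 3: one-hot (boolean) encoding.
x = [[[ch == c for c in chars] for ch in s] for s in sentences]
y = [[nc == c for c in chars] for nc in next_chars]

shape = (len(x), len(x[0]), len(x[0][0]))   # (sequences, time_steps, features)
print(sentences)    # ['my lo', 'loan ', 'n was', 'as so']
print(next_chars)   # ['a', 'w', ' ', 'l']
print(shape)        # (4, 5, 10)
```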

In [13]:
# Let's model student loan complaints
complaint_product = "Student loan"
df1 = df[df['Product'] == complaint_product].reset_index(drop=True)

In [14]:
df1.drop(columns = ['Product'], inplace=True)

In [15]:
text = df1['narrative']
n_messages = len(text)
n_chars = len(' '.join(map(str, text)))

print("Student loans account for %d complaints" % n_messages)
print("Their complaints add up to %d characters" % n_chars)

Student loans account for 19476 complaints
Their complaints add up to 23768172 characters


In [16]:
sample_size = int(len(text) * 0.2)

text = text[:sample_size]
text = ' '.join(map(str, text))

### Format the corpus into arrays of semi-overlapping sequences of uniform length and next characters
The rest of the code used here is adapted from an example script, originally written by François Chollet (author of Keras and Kaggler), to prepare the data in the correct format for training an LSTM. Since we're training a character-level model, we relate unique characters (e.g., "a", "b", "c", ...) to numeric indices in the cell below. If you rerun this code yourself by clicking "Fork Notebook", you can print chars to see all of the characters used.

In [17]:
chars = sorted(list(set(text)))
print('Count of unique characters (i.e., features):', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Count of unique characters (i.e., features): 60


The next cell gives us an array, sentences, made up of sequences of maxlen (40) characters taken in steps of 3 characters from our corpus text, and next_chars, an array of the single characters found in text at i + maxlen for each i. I've printed out the first 10 strings in the array so you can see we're chunking the corpus into partially overlapping, equal-length "sentences".

In [18]:
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences), "\n")

print(sentences[:10], "\n")
print(next_chars[:10])

Number of sequences: 1575245 

['when my loan was switched over to navien', 'n my loan was switched over to navient i', 'y loan was switched over to navient i wa', 'oan was switched over to navient i was n', ' was switched over to navient i was neve', 's switched over to navient i was never t', 'witched over to navient i was never told', 'ched over to navient i was never told th', 'd over to navient i was never told that ', 'ver to navient i was never told that i h'] 

['t', ' ', 's', 'e', 'r', 'o', ' ', 'a', 'i', 'a']


You can see how the next character following the first sequence 'when my loan was switched over to navien' is the character t, which finishes the word "navient". And the next character following the second sequence 'n my loan was switched over to navient i' is a space, which separates "i" from the word "was". In this way, it should be clear how next_chars holds the "data labels" or ground truth for our sequences in sentences, and how a model trained on this labeled data will be able to generate new next characters as predictions given sequence input.

### Represent the sequence data as sparse boolean tensors
The next cell will take a little while to run. We're creating sparse boolean tensors x and y that encode character-level features from sentences and next_chars, to use as the inputs and targets for the model we train. The shape of each input sequence will be input_shape=(maxlen, len(chars)), where maxlen is 40 and len(chars) is the number of features (i.e., the count of unique characters in our corpus).

In [19]:
# np.bool was deprecated in NumPy 1.20 and later removed; the builtin bool is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
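
As a rough size check on the tensors built above, the arithmetic below uses the figures printed earlier (1575245 sequences, maxlen of 40, 60 unique characters) and assumes one byte per element, which is how NumPy stores booleans.

```python
n_sequences = 1575245   # "Number of sequences" printed earlier
maxlen = 40
n_chars = 60            # count of unique characters printed earlier

x_bytes = n_sequences * maxlen * n_chars   # one byte per boolean element
y_bytes = n_sequences * n_chars

print(f"x: {x_bytes / 1e9:.2f} GB")   # x: 3.78 GB
print(f"y: {y_bytes / 1e6:.1f} MB")   # y: 94.5 MB
```

The same data stored as 64-bit floats would need eight times as much memory, which is one reason to keep the boolean dtype and, if memory is tight, a smaller sample_size or a larger step.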

In [20]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
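
As a back-of-the-envelope check on the model defined above, its trainable parameters can be counted by hand. This sketch assumes the standard LSTM layout used by Keras (four gate blocks, each with input weights, recurrent weights and a bias) and the 60 unique characters reported earlier; model.summary() should report the same totals.

```python
units = 128      # LSTM(128, ...)
features = 60    # len(chars) reported earlier

# Four blocks (input, forget, output gates and the cell candidate), each with
# input weights, recurrent weights and a bias vector.
lstm_params = 4 * (units * (features + units) + units)

# Dense layer mapping the 128 LSTM outputs to one score per character.
dense_params = units * features + features

print(lstm_params)                  # 96768
print(dense_params)                 # 7740
print(lstm_params + dense_params)   # 104508
```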

In [21]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def on_epoch_end(epoch, logs):
    # Function invoked for specified epochs. Prints generated text.
    # Using epoch+1 to be consistent with the training epochs printed by Keras
    if epoch+1 == 1 or epoch+1 == 15:
        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()
    else:
        print()
        print('----- Not generating text after Epoch: %d' % epoch)

generate_text = LambdaCallback(on_epoch_end=on_epoch_end)
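
The effect of the temperature (called diversity in the callback) can be seen by applying the same log, divide and renormalise steps as `sample` to a small distribution. The helper name `reweight` and the probabilities in `preds` are hypothetical; the sketch leaves out the random draw so that the reweighted probabilities can be inspected directly.

```python
import math

def reweight(preds, temperature):
    """Rescale a probability distribution the way `sample` does before drawing."""
    logits = [math.log(p) / temperature for p in preds]
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

preds = [0.6, 0.3, 0.1]   # hypothetical next-character probabilities

for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, [round(p, 3) for p in reweight(preds, temperature)])
```

Temperatures below 1 sharpen the distribution so the most likely character is picked almost every time, which matches the repetitive text generated at diversity 0.2. A temperature of 1 leaves the predictions unchanged, and values above 1 flatten them, giving more varied but more error-prone text.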

In [22]:
# define the checkpoint
filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, 
                             monitor='loss', 
                             verbose=1, 
                             save_best_only=True, 
                             mode='min')

In [23]:
# fit model using our gpu
with tf.device('/gpu:0'):
    model.fit(x, y,
              batch_size=128,
              epochs=20,
              verbose=2,
              callbacks=[generate_text, checkpoint])

Epoch 1/20
 - 1380s - loss: 1.3568

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "time. while my balances are decreasing w"
time. while my balances are decreasing who i was told that i was told that i have been trying to the payments. i am not accruing and i was applied to continue to send a complaint and the payments and then the payments and i have been told that i had to send me to the account is the payments of the interest rate i was a credit reports to pay the payment was to make a compand in the payment of  $100.00  of the payment of the account and i
----- diversity: 0.5
----- Generating with seed: "time. while my balances are decreasing w"
time. while my balances are decreasing with my current in                         i have called and i had sending to pay  $  $100.00  a private loans with the account and i could not have and i was not added out to the consolidated me to receive a payment was not so that i have allow that i have no added to pay that what i was sent a loan and i was statement with the interest rate i am a life in . i have been additional student loans an
----- diversity: 1.0
----- Generating with seed: "time. while my balances are decreasing w"
time. while my balances are decreasing with about my soce interest.  .    those qualifying to failed a cospigat of   over that i called, plowing   awated to schoolsed. happening by because we paided rine so i logh and to the debt, ititiution, o   where it since that had 

### Inspiration for next steps
Here are just a few ideas for how to take what you learned here and expand it:

- Experiment with different (hyper-)parameters like the amount of training data, number of epochs, batch size, temperature, etc.
- Try out the same code with different data: fork this notebook, go to the “Data” tab, remove the current data source, then add a different dataset.
- Try out more complicated network architectures, for example by adding dropout layers.
- Learn more about deep learning on Kaggle Learn, a series of videos and hands-on notebook tutorials in Kernels.
- Use weights.hdf5 from the "Output" in a new kernel to predict on different data and see what it would be like if the model trained in this tutorial completed someone else's sentences.
- Measure the speed-up of a GPU over a CPU on a minimal example.
 