# LSTM Tuning for Kaggle Deep NLP Datasets

In this notebook, an LSTM is tuned for the Deep NLP datasets on Kaggle. A more detailed notebook covers the LSTM tuning for the Spooky Author Identification dataset. Here, that code is reused on two new datasets, with less explanation.

### Deep NLP Dataset

The Deep NLP Dataset contains two datasets from two different cases. One is responses to a chatbot, and one is resumes. Each example in both carries a binary label, `flagged` or `not_flagged`.


### Chatbot Dataset

For the Chatbot dataset, the scenario is a therapy chatbot where the user is asked 'Describe a time when you have acted as a resource for someone else'. The dataset consists of user responses to this question, as well as a 'flagged' or 'not flagged' label. If it is 'flagged', the user is referred to help.

The dataset is contained in Sheet_1.csv, and has 80 user responses in the response_text column.


### Resume Dataset

For the Resume dataset, resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. The dataset consists of the text from these resumes, as well as a 'flagged' or 'not flagged' label. In this case, 'flagged' means the applicant is invited to an interview, and 'not flagged' means the applicant can submit a modified resume at a later date.

The dataset is contained in Sheet_2.csv, and has 125 text resumes in the resume_text column.


### Small Datasets

Even though this dataset is called Deep NLP, both datasets have so few examples that deep learning may not work well. The validation sets we can build will also have few examples, so it will be difficult to assess the model's capability to generalise and its expected accuracy. In this case, you might prefer a less data-hungry method with very specific engineered features for each task to get a higher accuracy. However, such models would not generalise well to new cases.

One goal of these datasets is to build one model that works for both cases, so the model choice/engineering should not be too specific to one case. Deep learning models can transfer between tasks more readily, as they learn the relevant features from the dataset. So let's try the LSTM model tuned with the Spooky Author Identification dataset.

## Data Adapters for Deep NLP Dataset

Firstly, make a simple CSV reader and take a look at the format of the data. The Chatbot dataset reads fine, but the Resume dataset has some encoding problems. Most of the characters can be decoded as UTF-8, while a few cannot. These tend to be bullet points. So for the Resume case, open the file with `errors='replace'` to substitute these characters, and do a little pre-processing afterwards.

In [2]:
import os.path
import csv

datadir = 'data/deepnlp/'
csv_chatbot = os.path.join(datadir, 'sheet_1.csv')
csv_resume = os.path.join(datadir, 'sheet_2.csv')
filesdir = 'files/'

# Read in CSV data
def read_csv(filepath, file_encoding='utf-8', for_errors='strict', discard_header=False):
    data_raw = []
    with open(filepath, 'r', encoding=file_encoding, errors=for_errors) as f:
        reader = csv.reader(f)
        for i, row in enumerate(reader):
            if not row:
                continue
            if discard_header and i == 0:
                continue
            data_raw.append(row)
    return data_raw

data_chatbot_raw = read_csv(csv_chatbot)
data_resume_raw  = read_csv(csv_resume, for_errors='replace')
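
As a small illustration of what `errors='replace'` does in `read_csv` above, the sketch below decodes a byte string that contains bytes which are not valid UTF-8. The sample bytes are made up for illustration and are not taken from `sheet_2.csv`. Each undecodable byte becomes the Unicode replacement character U+FFFD, which is what shows up as `�` in the resume text.

```python
# Hypothetical bytes: ASCII text preceded by three bytes that are not valid UTF-8.
raw = b"\x95\x95\x95 Supervise customer service team"

# Strict decoding (the default) raises an error on the first bad byte.
try:
    raw.decode("utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'replace' substitutes U+FFFD for each undecodable byte instead of failing.
decoded = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(decoded.count("\ufffd"))
```

The original characters are lost once they are replaced, so this only makes the file readable; it does not recover what the bullets were.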


### Quick Look at Chatbot Data

Taking a look at the first 10 entries of the therapy chatbot user response dataset (sheet_1.csv):

In [3]:
for row in data_chatbot_raw[:10]:
    print('{}\n'.format(row))

['response_id', 'class', 'response_text', '', '', '', '', '']

['response_1', 'not_flagged', 'I try and avoid this sort of conflict', '', '', '', '', '']

['response_2', 'flagged', 'Had a friend open up to me about his mental addiction to weed and how it was taking over his life and making him depressed', '', '', '', '', '']

['response_3', 'flagged', 'I saved a girl from suicide once. She was going to swallow a bunch of pills and I talked her out of it in a very calm, loving way.', '', '', '', '', '']

['response_4', 'not_flagged', 'i cant think of one really...i think i may have indirectly', '', '', '', '', '']

['response_5', 'not_flagged', 'Only really one friend who doesn\'t fit into the any of the above categories. Her therapist calls it spiraling." Anyway she pretty much calls me any time she is frustrated by something with  her boyfriend to ask me if it\'s logical or not. Before they would just fight and he would call her crazy. Now she asks me if it\'s ok he didn\'t say "pleas

### Remarks

From the above, it looks like 'flagged' responses are ones in which the user helped someone in a more severe situation. The first row is a header row, and each row contains a response ID, the flagged/not_flagged class, and the response text. There appear to be 5 empty columns after that. Also, apostrophes appear to have been escaped, but this is just how Python displays strings that contain both double and single quotes. The escaping doesn't show when printing, and even if you write the literal with `"""` or `'''`, its `repr` will still show the single quotes escaped.

In [4]:
print(repr(data_chatbot_raw[5][2]))

'Only really one friend who doesn\'t fit into the any of the above categories. Her therapist calls it spiraling." Anyway she pretty much calls me any time she is frustrated by something with  her boyfriend to ask me if it\'s logical or not. Before they would just fight and he would call her crazy. Now she asks me if it\'s ok he didn\'t say "please" when he said  "hand me the remote."'


In [5]:
print(data_chatbot_raw[5][2])

Only really one friend who doesn't fit into the any of the above categories. Her therapist calls it spiraling." Anyway she pretty much calls me any time she is frustrated by something with  her boyfriend to ask me if it's logical or not. Before they would just fight and he would call her crazy. Now she asks me if it's ok he didn't say "please" when he said  "hand me the remote."


### Quick Look at Resume Data

In [6]:
for row in data_resume_raw[:10]:
    print('{}\n'.format(row))

['resume_id', 'class', 'resume_text']

['resume_1', 'not_flagged', '\nCustomer Service Supervisor/Tier - Isabella Catalog Company\nSouth Burlington VT - Email me on Indeed: indeed.com/r//49f8c9aecf490d26\nWORK EXPERIENCE\nCustomer Service Supervisor/Tier\nIsabella Catalog Company - Shelburne VT - August 2015 to Present\n2 Customer Service/Visual Set Up & Display/Website Maintenance\n��� Supervise customer service team of a popular catalog company\n��� Manage day to day issues and resolution of customer upset to ensure customer satisfaction\n��� Troubleshoot order and shipping issues: lost in transit order errors damages\n��� Manage and resolve escalated customer calls to ensure customer satisfaction\n��� Assist customers with order placing cross-selling/upselling of catalog merchandise\n��� Set up and display of sample merchandise in catalog library as well as customer pick-up area of the facility ��� Website clean-up: adding images type up product information proofreading\nAdministrat

### Text Corrections

With the Resume data, I had some trouble with the encoding. It did not seem to be a common encoding, and it looks like essentially one character was causing problems, probably some special bullet character. I don't know whether there is a correct encoding that renders it, or whether it wasn't properly encoded during data collection.

Two unrecognised sequences are common: "���", which appears to be a normal bullet, and "��", which appears to be another kind of bullet. Python's replacement characters carry no meaning on their own, but we can replace them again with `*` and `**` to get slightly more semantic text. First, an example is shown below:

In [7]:
print(data_resume_raw[1][2])


Customer Service Supervisor/Tier - Isabella Catalog Company
South Burlington VT - Email me on Indeed: indeed.com/r//49f8c9aecf490d26
WORK EXPERIENCE
Customer Service Supervisor/Tier
Isabella Catalog Company - Shelburne VT - August 2015 to Present
2 Customer Service/Visual Set Up & Display/Website Maintenance
��� Supervise customer service team of a popular catalog company
��� Manage day to day issues and resolution of customer upset to ensure customer satisfaction
��� Troubleshoot order and shipping issues: lost in transit order errors damages
��� Manage and resolve escalated customer calls to ensure customer satisfaction
��� Assist customers with order placing cross-selling/upselling of catalog merchandise
��� Set up and display of sample merchandise in catalog library as well as customer pick-up area of the facility ��� Website clean-up: adding images type up product information proofreading
Administrative Assistant /Events Coordinator/Office Services Assistant
Eileen Fisher Inc - I

### Make Replacements

Make a fixed dataset by replacing these common unrecognised character sequences.

In [8]:
data_resume_fix = []
data_resume_fix.append(data_resume_raw[0])  # header

for i, (r_id, r_class, resume_text) in enumerate(data_resume_raw):
    if i == 0:
        continue  # skip header
    resume_text = resume_text.replace("���", "*")
    resume_text = resume_text.replace("��", "**")
    data_resume_fix.append([r_id, r_class, resume_text])


Below is the same output as above after the fix:

In [9]:
print(data_resume_fix[1][2])


Customer Service Supervisor/Tier - Isabella Catalog Company
South Burlington VT - Email me on Indeed: indeed.com/r//49f8c9aecf490d26
WORK EXPERIENCE
Customer Service Supervisor/Tier
Isabella Catalog Company - Shelburne VT - August 2015 to Present
2 Customer Service/Visual Set Up & Display/Website Maintenance
* Supervise customer service team of a popular catalog company
* Manage day to day issues and resolution of customer upset to ensure customer satisfaction
* Troubleshoot order and shipping issues: lost in transit order errors damages
* Manage and resolve escalated customer calls to ensure customer satisfaction
* Assist customers with order placing cross-selling/upselling of catalog merchandise
* Set up and display of sample merchandise in catalog library as well as customer pick-up area of the facility * Website clean-up: adding images type up product information proofreading
Administrative Assistant /Events Coordinator/Office Services Assistant
Eileen Fisher Inc - Irvington NY - 
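
The order of the two `replace` calls in the fix above matters: the three-character sequence has to be handled before the two-character one, otherwise the shorter pattern consumes part of the longer one. The sketch below shows this on a made-up string. The function names are hypothetical, and it assumes the unrecognised characters are U+FFFD, which is what Python inserts when decoding with `errors='replace'`.

```python
REPLACEMENT = "\ufffd"  # the Unicode replacement character, displayed as a question-mark box

def fix_bullets(text):
    """Replace runs of three, then runs of two, replacement characters."""
    text = text.replace(REPLACEMENT * 3, "*")
    text = text.replace(REPLACEMENT * 2, "**")
    return text

def fix_bullets_wrong_order(text):
    """Same replacements in the opposite order, for comparison only."""
    text = text.replace(REPLACEMENT * 2, "**")
    text = text.replace(REPLACEMENT * 3, "*")
    return text

# Made-up sample with one three-character bullet and one two-character bullet.
sample = REPLACEMENT * 3 + " Supervise team\n" + REPLACEMENT * 2 + " Sub-item"

print(fix_bullets(sample))
print(repr(fix_bullets_wrong_order(sample)))
```

In the wrong order, the first two characters of the triple are turned into `**` and one stray replacement character is left behind.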

## Dataset Adapters

Rows in both datasets appear to consist of three strings, in the format:

ID, class label, example text

In the Chatbot case, there are 5 trailing empty columns, and the example text is a user chat response. In the Resume case, the example text is a long text body representing a resume. The resumes don't appear to share a common format. The class labels are the same for both cases, `flagged` or `not_flagged`. Only the meaning of these labels changes between datasets. The ID is also a text string, not a number.
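
Since both files share the same first three columns, one row parser can serve both. The following is only a sketch of that idea: `Example` and `to_example` are hypothetical names, not part of `data_tools`, and the sample rows are shortened from the output shown earlier.

```python
from collections import namedtuple

# Hypothetical container for one example; not part of the notebook's code.
Example = namedtuple("Example", ["text_id", "label", "text"])

def to_example(row):
    """Keep the first three columns and ignore any trailing empty ones."""
    text_id, label, text = row[:3]
    return Example(text_id, label, text)

chatbot_row = ['response_1', 'not_flagged', 'I try and avoid this sort of conflict', '', '', '', '', '']
resume_row = ['resume_1', 'not_flagged', '\nCustomer Service Supervisor/Tier']

print(to_example(chatbot_row))
print(to_example(resume_row).label)
```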

### DataManager

DataManager is a class I wrote before for the very common task of dividing data into train/valid/test sets in a ratio specified by the `split` tuple. This is not the cleanest implementation, but by sub-classing `DataManager` an adapter for a custom dataset can be built. In the NLP case, we don't want to one-hot encode at this stage, but retain the text labels. The text will go through vocabulary ID tokenisation later on.

To customise `DataManager`, a few modifications are needed. In `__init__`, set `dataset_path` to where the output files should be written, and set `discard_header` to say whether the first row should be skipped. Then two methods need to be overridden.

`_process_row_raw(self, row)` processes one row of the raw CSV data, so it must match the format of the CSV file we have. Its output is written to a file for each of the train/valid/test sets, so it defines the output format, including any pre-processing, like making an ID from the labels.

`_process_row_split(self, row)` reads rows back from the files written above and prepares the data in the format that will actually be used for training/validating/testing. This division allows us to store more information about the data than we finally use in our model.
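
The implementation of `DataManager` is not shown in this notebook, so the following is only a rough sketch of what splitting by a `split` tuple could look like. `split_dataset` and its `seed` parameter are hypothetical. It shuffles all examples together, whereas the output printed by `DataManager` reports per-class counts, so the real class may well split each class separately and produce slightly different sizes.

```python
import random

def split_dataset(examples, split, seed=0):
    """Shuffle examples and cut them into train/valid/test by the given fractions."""
    assert abs(sum(split) - 1.0) < 1e-9, "split fractions should sum to 1"
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * split[0])
    n_valid = int(len(shuffled) * split[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test

# Made-up examples, 80 in total like the Chatbot dataset.
examples = [("text {}".format(i), "flagged" if i % 3 == 0 else "not_flagged")
            for i in range(80)]
train, valid, test = split_dataset(examples, (0.8, 0.19, 0.01))
print(len(train), len(valid), len(test))
```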

### Chatbot

First, read in and split the Chatbot data.


In [10]:
import data_tools as dt
from data_tools import DataManager

class DeepNLPChatbotData(DataManager):
    def __init__(self, filepath, split, one_hot_encode=True, output_numpy=True):
        super().__init__(filepath, split, one_hot_encode, output_numpy)
        self.filepath = filepath
        self.split = split       # train/valid/test fractions, should sum to 1
        self.dataset_path = 'data/deepnlp/chatbot/'
        self.discard_header = True

    def _process_row_raw(self, row):
        """
        Import lines from raw data file.
        Imports line of "Text ID, label, text"
        Returns list of [Text ID, text, label, label_index]
        """
        read_line = []
        text_id   = row[0]
        label     = row[1]
        text      = row[2]
        label_idx = self._get_idx(label)
        read_line.extend([text_id, text, label, label_idx])
        return read_line

    def _process_row_split(self, row):
        """
        Import lines from train/valid/test split files.
        Imports line of "Text ID, text, label, label_index"
        Returns list of [text, label]
        """
        read_line = []

        text      = row[1]
        label     = row[2]
        label_idx = int(float(row[-1]))
        fetch_idx = self._get_idx(label)  # rebuilds num_classes
        read_line.extend([text, label])
        return read_line

In [84]:
filepath = 'data/deepnlp/sheet_1.csv'
data_manager = DeepNLPChatbotData(filepath, (0.8, 0.19, 0.01), one_hot_encode=False, output_numpy=False)
data_manager.init_dataset()
chatbot_train_x, chatbot_train_y = data_manager.prepare_train()
chatbot_valid_x, chatbot_valid_y = data_manager.prepare_valid()

Preparing Train/Valid/Test data from data/deepnlp/sheet_1.csv
Split train has ( 44, 20, ) examples of each class
Split test has ( 1, 1, ) examples of each class
Split valid has ( 10, 4, ) examples of each class
Dataset prepared


### Remarks

Since this dataset is very small, I've decided to move most of the `test` split into the `validation` split. So we get a train/valid/test split of 80%/19%/1%. With a dataset this small, I think all that can be done is model development, and we're not really at a stage of formally 'testing' the model.

With these divisions, we still only get 64 training examples and 14 validation examples. There are about twice as many `not_flagged` examples as `flagged` examples, so the model may learn a bias towards not flagging. 14 examples is not much to validate against, so the reported validation accuracy may not be very reliable. Rather, it just gives us some indication of how the model will generalise.
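
To put the class imbalance in numbers, the sketch below computes the accuracy a model would get by always predicting the most frequent class, using the per-class counts printed by the split above. `majority_baseline` is a hypothetical helper, not part of `data_tools`.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts) / sum(class_counts)

# Per-class counts as printed by the Chatbot split above.
chatbot_train_counts = (44, 20)
chatbot_valid_counts = (10, 4)

print('train baseline: {:.3f}'.format(majority_baseline(chatbot_train_counts)))
print('valid baseline: {:.3f}'.format(majority_baseline(chatbot_valid_counts)))
```

On the validation split this baseline is 10/14, roughly 0.71, so a validation accuracy near that value does not show that the model has learned anything beyond the class prior. Each validation example is also worth about 7 percentage points of accuracy.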

In [100]:
chatbot_train_x[:4]

("Sometimes I'll calm my friends down after bad stuff happens.",
 'Ex girlfriend had depression and anxiety. I used to hold her and listen as she told me what was going on',
 'GF and I help her through a lot of shit because I myself have been through a lot of shit.',
 "I used to tutor homeless men at a shelter to help them obtain their GED's. They were all age 50+ and some of them were even reading at a first grade level.")

In [101]:
chatbot_train_y[:4]

('not_flagged', 'flagged', 'not_flagged', 'not_flagged')

### Resume

Now build the same for the resume dataset. This will require more edits due to the encoding problems. Here we do the odd character replacement as we process each row in the original CSV file. Also, since in this case our original file has some encoding issues, the `_data_import()` method was also overridden to `replace` the unrecognised characters, instead of stopping with an error.

In [87]:
class DeepNLPResumeData(DataManager):
    def __init__(self, filepath, split, one_hot_encode=True, output_numpy=True):
        super().__init__(filepath, split, one_hot_encode, output_numpy)
        self.filepath = filepath
        self.split = split       # train/valid/test fractions, should sum to 1
        self.dataset_path = 'data/deepnlp/resume/'
        self.discard_header = True

    def _process_row_raw(self, row):
        """
        Import lines from raw data file.
        Imports line of "Text ID, label, text"
        Returns list of [Text ID, text, label, label_index]
        """
        read_line = []
        text_id   = row[0]
        label     = row[1]
        text      = row[2]
        label_idx = self._get_idx(label)
        
        # make unknown character replacements (tends to be bullets)
        text = text.replace("���", "*")
        text = text.replace("��", "**")
        
        read_line.extend([text_id, text, label, label_idx])
        return read_line

    def _process_row_split(self, row):
        """
        Import lines from train/valid/test split files.
        Imports line of "Text ID, text, label, label_index"
        Returns list of [text, label]
        """
        read_line = []

        text      = row[1]
        label     = row[2]
        label_idx = int(float(row[-1]))
        fetch_idx = self._get_idx(label)  # rebuilds num_classes
        read_line.extend([text, label])
        return read_line
    
    # Override _data_import() method to account for encoding issues
    def _data_import(self, filepath):
        """
        Import data from data file.
        Override this _process_row_raw() method in child class for dataset structure.
        Return a list of [property_1, property_2, ..., target_index]
        """
        data_raw = []
        with open(filepath, errors='replace') as f:
            reader = csv.reader(f)
            for i, row in enumerate(reader):
                if not row:
                    continue
                if self.discard_header and i == 0:
                    continue
                read_line = self._process_row_raw(row)
                data_raw.append(read_line)
        return data_raw

In [88]:
filepath = 'data/deepnlp/sheet_2.csv'
data_manager = DeepNLPResumeData(filepath, (0.8, 0.19, 0.01), one_hot_encode=False, output_numpy=False)
data_manager.init_dataset()
resume_train_x, resume_train_y = data_manager.prepare_train()
resume_valid_x, resume_valid_y = data_manager.prepare_valid()

Preparing Train/Valid/Test data from data/deepnlp/sheet_2.csv
Split train has ( 73, 26, ) examples of each class
Split test has ( 2, 1, ) examples of each class
Split valid has ( 17, 6, ) examples of each class
Dataset prepared


### Remarks

As with the Chatbot case, this dataset is small, so we again take a train/valid/test split of 80%/19%/1% to increase the reliability of validation while keeping a reasonable number of examples for training.
We end up with 99 training examples and 23 validation examples. This time there are about three times as many `not_flagged` examples as `flagged` examples, so again a bias towards `not_flagged` is expected. We also do not have the luxury of discarding examples to balance the classes. In any case, this bias towards not flagging may reflect the real distribution of the data.
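
One common alternative to discarding examples, which is not used in this notebook, is to weight each class inversely to its frequency when computing the loss. The sketch below only computes such weights from the per-class counts printed above. `inverse_frequency_weights` is a hypothetical helper, and how the weights would be fed into the loss is left open.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * count) for count in class_counts]

resume_train_counts = (73, 26)  # per-class counts printed by the Resume split above
weights = inverse_frequency_weights(resume_train_counts)
print(['{:.2f}'.format(w) for w in weights])
```

With these weights the weighted example count still sums to the original total, so the overall scale of the loss is unchanged while the rarer class contributes more per example.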

In [89]:
resume_train_x[:2]

('Elizabeth Conway\nSeeking a Part time Admin/receptionist postition\nEast Hardwick VT - Email me on Indeed: indeed.com/r/Elizabeth-Conway/f0620353200ab872\nWORK EXPERIENCE\nAssistant Calf Manager\nFairvue Farm - January 2014 to Present\nDaily calf care on a 2000 cow dairy farm including but not limited to feeding vaccines record keeping to maintain and raise healthy replacement calves assisting dairy manager in creating needed reports.\nOwner/Operator of a fiber mill\nFibers - 2007 to 2013\nResponsibilities included bookkeeping record keeping\nshipping/receiving customer service maintenance sales daily operations and advertising\nBookkeeper/Administration\nE.F. Jones LLC - 2007 to 2013\nResponsibilities included AP/AR reconciliation and daily administration duties\nSenior Associate Scientist\nPfizer Inc - 2004 to 2007\nResponsibilities included cell culture conducting In Vivo studies\ntissue collection western blots data analysis and reporting and presenting data to the team leaders.\

In [90]:
resume_train_y[:2]

('not_flagged', 'flagged')

## Prepare Vocabulary

The `Vocabulary` class from `data_tools` is used to build the vocabulary for these two datasets. This class is described in more detail in the Spooky Author Identification notebook (`notebook_spooky`). In short, it builds one vocabulary from the corpus for the sentences, and another for the labels. The vocabulary is a list of words ordered by frequency and truncated to `vocabulary_size` (20000 here). The word-to-ID and ID-to-word mappings are obtained with the `get_sentence_vocabulary()` and `get_label_vocabulary()` methods. The sentences and labels of the dataset can then be converted to their tokenised ID form with `data_to_token_ids()` and `labels_to_token_ids()`.
The sentences and labels are then zipped into one list of (sentence, label) pairs: in the Chatbot case, `chat_train_set` for train and `chat_valid_set` for validation.
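
For readers without the earlier notebook at hand, here is a minimal sketch of the idea behind a frequency-ordered vocabulary with ID tokenisation. It is not the `Vocabulary` class itself: the function names, the tokenising regular expression and the `_PAD` token are assumptions, and only `_UNK` is taken from the output shown in this notebook. The real class also appears to normalise digits to `0`, which this sketch omits.

```python
from collections import Counter
import re

# `_UNK` appears in the notebook's output; `_PAD` is an assumption.
PAD, UNK = "_PAD", "_UNK"

def tokenise(sentence):
    """Split into words and single punctuation marks (a simplified tokeniser)."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def build_vocab(sentences, vocabulary_size):
    """Map the most frequent tokens to integer IDs, reserving IDs for special tokens."""
    counts = Counter(tok for s in sentences for tok in tokenise(s))
    words = [PAD, UNK] + [w for w, _ in counts.most_common()]
    words = words[:vocabulary_size]
    word_to_id = {w: i for i, w in enumerate(words)}
    return word_to_id, words  # `words` doubles as the ID-to-word lookup

def to_token_ids(sentence, word_to_id):
    """Convert a sentence to IDs, using the `_UNK` ID for unseen tokens."""
    return [word_to_id.get(tok, word_to_id[UNK]) for tok in tokenise(sentence)]

train = ["I used to hold her and listen .", "I help her a lot ."]
word_to_id, id_to_word = build_vocab(train, 20000)
ids = to_token_ids("I help them a lot .", word_to_id)
print(ids)
print(" ".join(id_to_word[i] for i in ids))
```

The word `them` never occurs in the two training sentences, so it maps to the `_UNK` ID and is translated back as `_UNK`.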

### Vocabulary and ID Tokenisation of the Chatbot Dataset

Firstly processing the Chatbot dataset. The output files will go to `data/deepnlp/chatbot`.

In [91]:
# Vocabulary
vocab = dt.Vocabulary('data/deepnlp/chatbot', 20000)
vocab.build_vocabulary(chatbot_train_x, chatbot_train_y)

chat_sents_vocab, chat_rev_sents_vocab = vocab.get_sentence_vocabulary()
chat_label_vocab, chat_rev_label_vocab = vocab.get_label_vocabulary()

chat_train_x_tok = vocab.data_to_token_ids(chatbot_train_x, 'train')
chat_train_y_tok = vocab.labels_to_token_ids(chatbot_train_y, 'train')

chat_valid_x_tok = vocab.data_to_token_ids(chatbot_valid_x, 'valid')
chat_valid_y_tok = vocab.labels_to_token_ids(chatbot_valid_y, 'valid')

chat_train_set = list(zip(chat_train_x_tok, chat_train_y_tok))
chat_valid_set = list(zip(chat_valid_x_tok, chat_valid_y_tok))

Building vocabulary
Writing data/deepnlp/chatbot/vocab_sentences.txt ...
Writing data/deepnlp/chatbot/vocab_labels.txt ...
Writing data/deepnlp/chatbot/train/sentences.txt ...
Writing data/deepnlp/chatbot/train/ids_sentences.txt ...
Writing data/deepnlp/chatbot/train/labels.txt ...
Writing data/deepnlp/chatbot/train/ids_labels.txt ...
Writing data/deepnlp/chatbot/valid/sentences.txt ...
Writing data/deepnlp/chatbot/valid/ids_sentences.txt ...
Writing data/deepnlp/chatbot/valid/labels.txt ...
Writing data/deepnlp/chatbot/valid/ids_labels.txt ...


Below are some examples from the tokenised ID set converted back into their word forms, firstly from the training examples:

In [92]:
print('Translating token IDs back into words:\n')
vocab.translate_examples(chat_train_set[:5])
print('\nActual form of training data:\n', chat_train_set[0])

Translating token IDs back into words:

Sometimes I ' ll calm my friends down after bad stuff happens .
not_flagged


Ex girlfriend had depression and anxiety . I used to hold her and listen as she told me what was going on
flagged


GF and I help her through a lot of shit because I myself have been through a lot of shit .
not_flagged


I used to tutor homeless men at a shelter to help them obtain their GED ' s . They were all age 00+ and some of them were even reading at a first grade level .
not_flagged


Don ' t have a specific example but just letting people know you ' re there if they want to talk .
not_flagged



Actual form of training data:
 ([190, 4, 7, 82, 221, 14, 20, 83, 101, 253, 125, 166, 2], 1)


Quick look at the validation examples:

In [93]:
vocab.translate_examples(chat_valid_set[:3])

_UNK camp , _UNK kids have the same _UNK . I _UNK them i how it is when you cant listen or are _UNK .
not_flagged


when my best friends _UNK _UNK away from _UNK ' _UNK when he was in grade 0
flagged


I once _UNK as a resource for someone who was struggling in school , and I helped them with their _UNK .
not_flagged




### Remarks

It seems that the vocabulary is a bit small and a lot of words are being missed in the validation set. Calculating the vocabulary over all the data might help, but then the validation set would be a weaker check of generalisation. In general, the dataset is too small. One other possibility for improvement is to use pre-trained word vectors from word2vec or GloVe, since we're dealing with English here.
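
The number of `_UNK` tokens can be quantified as an out-of-vocabulary rate: the fraction of validation tokens that never occur in the training data. The sketch below uses made-up sentences and plain whitespace splitting, which is simpler than the tokenisation used by `Vocabulary`; `oov_rate` is a hypothetical helper.

```python
def oov_rate(train_sentences, valid_sentences):
    """Fraction of validation tokens that never occur in the training sentences."""
    train_vocab = {tok for s in train_sentences for tok in s.split()}
    valid_tokens = [tok for s in valid_sentences for tok in s.split()]
    unknown = sum(1 for tok in valid_tokens if tok not in train_vocab)
    return unknown / len(valid_tokens)

# Made-up sentences for illustration only.
train_sentences = ["I helped a friend with school", "my friend was struggling"]
valid_sentences = ["I helped my brother with homework"]
print('{:.2f}'.format(oov_rate(train_sentences, valid_sentences)))
```

Here two of the six validation tokens are unseen. Running the same calculation on the tokenised Chatbot data would give a single figure to compare before and after changes such as enlarging the vocabulary.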

### Vocabulary and ID Tokenisation of the Resume Dataset

Carry out the same process for the Resume dataset.

In [96]:
# Vocabulary
vocab = dt.Vocabulary('data/deepnlp/resume', 20000)
vocab.build_vocabulary(resume_train_x, resume_train_y)

resume_sents_vocab, resume_rev_sents_vocab = vocab.get_sentence_vocabulary()
resume_label_vocab, resume_rev_label_vocab = vocab.get_label_vocabulary()

resume_train_x_tok = vocab.data_to_token_ids(resume_train_x, 'train')
resume_train_y_tok = vocab.labels_to_token_ids(resume_train_y, 'train')

resume_valid_x_tok = vocab.data_to_token_ids(resume_valid_x, 'valid')
resume_valid_y_tok = vocab.labels_to_token_ids(resume_valid_y, 'valid')

resume_train_set = list(zip(resume_train_x_tok, resume_train_y_tok))
resume_valid_set = list(zip(resume_valid_x_tok, resume_valid_y_tok))

Creating sentence corpus in data/deepnlp/resume/sentences_raw.txt
Writing data/deepnlp/resume/sentences_raw.txt ...
Building vocabulary
Writing data/deepnlp/resume/vocab_sentences.txt ...
Writing data/deepnlp/resume/vocab_labels.txt ...
Writing data/deepnlp/resume/train/sentences.txt ...
Writing data/deepnlp/resume/train/ids_sentences.txt ...
Writing data/deepnlp/resume/train/labels.txt ...
Writing data/deepnlp/resume/train/ids_labels.txt ...
Writing data/deepnlp/resume/valid/sentences.txt ...
Writing data/deepnlp/resume/valid/ids_sentences.txt ...
Writing data/deepnlp/resume/valid/labels.txt ...
Writing data/deepnlp/resume/valid/ids_labels.txt ...


Taking a look at the Resume training set, converted back into words.
For the resumes, the example text is quite long.

In [97]:
print('Translating token IDs back into words:\n')
vocab.translate_examples(resume_train_set[:5])
print('\nActual form of training data:\n', resume_train_set[0])

Translating token IDs back into words:

Elizabeth Conway Seeking a Part time Admin/receptionist postition East Hardwick VT - Email me on Indeed : indeed . com/r/Elizabeth-Conway/f0000000000ab000 WORK EXPERIENCE Assistant Calf Manager Fairvue Farm - January 0000 to Present Daily calf care on a 0000 cow dairy farm including but not limited to feeding vaccines record keeping to maintain and raise healthy replacement calves assisting dairy manager in creating needed reports . Owner/Operator of a fiber mill Fibers - 0000 to 0000 Responsibilities included bookkeeping record keeping shipping/receiving customer service maintenance sales daily operations and advertising Bookkeeper/Administration E . F . Jones LLC - 0000 to 0000 Responsibilities included AP/AR reconciliation and daily administration duties Senior Associate Scientist Pfizer Inc - 0000 to 0000 Responsibilities included cell culture conducting In Vivo studies tissue collection western blots data analysis and reporting and presentin

And from the validation set:

In [98]:
vocab.translate_examples(resume_valid_set[:3])

Job _UNK Staff _UNK Field Geologist - _UNK Environmental / _UNK Technical Services - Email me on Indeed : indeed . _UNK WORK EXPERIENCE Staff _UNK Field Geologist _UNK Environmental / _UNK Technical Services - Montpelier VT - 0000 to Present Conducted site investigations using the _UNK _UNK and _UNK _UNK Interface Probe . Trained new _UNK employees on field services . Seasonal employee _UNK Environmental - Montpelier VT - 0000 to 0000 Worked for the _UNK and _UNK departments as a lead field sampler . Also worked with many ArcGIS projects for the _UNK _UNK and Water Resources departments . Undergraduate Teaching Assistant University of Vermont - Burlington VT - 0000 to 0000 Helped teach Geology 0 labs . EDUCATION B . S . in Geology University of Vermont 0000 CERTIFICATIONS/LICENSES CPR & First Aid ADDITIONAL INFORMATION Technical Skills _UNK core logging . _UNK core logging . _UNK _UNK logging . _UNK sampling . _UNK and rock sampling . _UNK Mapping . _UNK . -Data management and _UNK . _

## Building the LSTM Model

The class below builds a custom LSTM model in TensorFlow. The model is explained in more detail in the `notebook_spooky` notebook.

In [58]:
import tensorflow as tf
import numpy as np

In [76]:
class RNNModel(object):
    def __init__(self, h_size, num_layers, vocab_size, n_classes, batch_size, rnn_type='lstm'):

        # Input Placeholders
        self.x = x = tf.placeholder(tf.int32, [batch_size, None], name="inputs") # [batch_size, num_steps]
        self.seqlen = seqlen = tf.placeholder(tf.int32, [batch_size], name="sequence_lengths")
        self.y = y = tf.placeholder(tf.int32, [batch_size], name="classes_gt")
        self.keep_prob = keep_prob = tf.placeholder("float")
        self.global_step = tf.Variable(0, trainable=False)
        
        def cell_gen():
            return tf.contrib.rnn.BasicLSTMCell(h_size, state_is_tuple=True)
        if rnn_type == 'gru':
            def cell_gen():
                return tf.contrib.rnn.GRUCell(h_size)
        
        if num_layers > 1:
            cells = []
            for _ in range(num_layers):
                cell = tf.contrib.rnn.DropoutWrapper(cell_gen(), output_keep_prob=keep_prob)
                cells.append(cell)
        
            cell = tf.contrib.rnn.MultiRNNCell(cells)
        else:
            cell = cell_gen()
        
        self.cell = cell
        
        # TODO: Prepare init state
#         # Initialise one hidden state
#         init_state = tf.get_variable('init_state', [1, h_size],
#                                  initializer=tf.constant_initializer(0.0))
#         # Tile to match batch_size
#         init_state = tf.tile(init_state, [batch_size, 1])
#         print(init_state)
        
        # Embedding layer
        embeddings = tf.get_variable('embedding_matrix', [vocab_size, h_size])
        rnn_inputs = tf.nn.embedding_lookup(embeddings, x)
        
#         rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length=seqlen,
#                                                      initial_state=init_state)
        rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length=seqlen, dtype=tf.float32)

        #idx = tf.range(batch_size)*tf.shape(rnn_outputs)[1] + (seqlen - 1)
        #last_rnn_output = tf.gather(tf.reshape(rnn_outputs, [-1, state_size]), idx)        
        last_rnn_output = tf.gather_nd(rnn_outputs, tf.stack([tf.range(batch_size), seqlen-1], axis=1))

        # Softmax layer
        with tf.variable_scope('softmax'):
            W = tf.get_variable('W', [h_size, n_classes])
            b = tf.get_variable('b', [n_classes], initializer=tf.constant_initializer(0.0))
        logits = tf.matmul(last_rnn_output, W) + b
        preds = tf.nn.softmax(logits)
        correct = tf.equal(tf.cast(tf.argmax(preds,1),tf.int32), y)

        self.accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

        self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))
        self.train_step = tf.train.AdamOptimizer(1e-4).minimize(self.loss, global_step=self.global_step)
        
        self.saver = tf.train.Saver(tf.global_variables())
        
        self._prepare_logs()
        
    def _prepare_logs(self):
        tf.summary.scalar('Loss', self.loss)
        tf.summary.scalar('Accuracy', self.accuracy)
        
        self.logs = tf.summary.merge_all()

def create_model(session, logdir, **parameters):
    with tf.variable_scope("model", reuse=None):
        print('\nCreating model with parameters:')
        for k,v in parameters.items():
            print('{:16s}: {}'.format(k, v))
        model_train = RNNModel(parameters['h_size'], parameters['rnn_layers'], FLAGS_in_vocab_size,
                               FLAGS_n_classes, parameters['batch_size'])
        
    ckpt = tf.train.get_checkpoint_state(logdir)
    #print(ckpt.model_checkpoint_path)
    if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path + '.index'):
        print("Loading model from parameters in {}.".format(ckpt.model_checkpoint_path))
        model_train.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        print("Creating model with fresh parameters.")
        session.run(tf.global_variables_initializer())
    return model_train
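
The `tf.gather_nd` call in the model picks, for each sequence in the batch, the RNN output at its last non-padded time step (index `seqlen - 1`). As a rough illustration of that indexing only, here is a pure-Python sketch on a made-up toy batch; the helper name `last_valid_outputs` and all values are hypothetical and not part of the notebook.

```python
def last_valid_outputs(rnn_outputs, seqlen):
    """Pick the output at index seqlen[i] - 1 for each sequence i.

    rnn_outputs: nested list of shape [batch, time, features]
    seqlen: list of true (unpadded) lengths, one per sequence
    """
    return [rnn_outputs[i][length - 1] for i, length in enumerate(seqlen)]


# Toy batch: 2 sequences padded to 3 time steps, 2 features each.
toy_outputs = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # true length 2, last step is padding
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # true length 3
]
toy_lengths = [2, 3]

selected = last_valid_outputs(toy_outputs, toy_lengths)
print(selected)  # [[0.3, 0.4], [0.9, 1.0]]
```

Taking the output at the final padded position instead would return the all-zero padding row for the first sequence, which is why the per-sequence length is needed.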

### Training Code

Below is the training and validation code used before in the `notebook_spooky` notebook, where it is explained in more detail.

In [82]:
import os  # used by train_net for log and checkpoint paths
from timeit import default_timer as timer

FLAGS_in_vocab_size = 20000
FLAGS_n_classes = 2

tf.reset_default_graph()

def valid_eval(sess, model, valid_set, batches):
    total_steps = 0
    val_accuracy = 0
    val_loss = 0
    for epoch in batches.gen_padded_batch_epochs(valid_set, 1):
        for step, (batch_x, batch_y, lengths) in enumerate(epoch):
            total_steps += 1
            feed = {model.x: batch_x, model.y: batch_y, model.seqlen: lengths, model.keep_prob: 1.0}
            fetch = [model.accuracy, model.loss]
            
            val_accuracy_, val_loss_ = sess.run(fetch, feed_dict=feed)
            val_accuracy += val_accuracy_
            val_loss += val_loss_
    avg_val_accuracy = val_accuracy / total_steps
    avg_val_loss = val_loss / total_steps
    
    return avg_val_accuracy, avg_val_loss



def train_net(train_set, valid_set, n_epochs, run_name, **params):
    tf.reset_default_graph()
    with tf.Session() as sess:
        log_dir = os.path.join(FLAGS_log_dir, run_name)
        log_txt_dir = os.path.join(FLAGS_log_dir, 'txtlogs/')
        if not os.path.exists(log_txt_dir):
            os.makedirs(log_txt_dir)
        
        model = create_model(sess, log_dir, **params)
        
        # Dropout keep neuron output probability
        keep_prob = params['dropout_keep']
        
        batches = dt.Batches(params['batch_size'])
        
        train_writer = tf.summary.FileWriter(log_dir + '/train', sess.graph)
        valid_writer = tf.summary.FileWriter(log_dir + '/valid', sess.graph)
        
        quantities = ['Gstep', 'Accuracy', 'Loss', 'Time']
        train_logs = dt.Logger(*quantities)
        valid_logs = dt.Logger(*quantities)

        start_time = timer()

        for i, epoch in enumerate(batches.gen_padded_batch_epochs(train_set, n_epochs)):
            print('\nEpoch', i+1)
            accuracy = 0
            loss = 0
            for step, (batch_x, batch_y, lengths) in enumerate(epoch):

                feed = {model.x: batch_x, model.y: batch_y, model.seqlen: lengths, model.keep_prob: keep_prob}
                fetch = [model.accuracy, model.loss, model.logs, model.train_step]

                accuracy_, loss_, logs, _ = sess.run(fetch, feed_dict=feed)
                accuracy += accuracy_
                loss += loss_

                gstep = model.global_step.eval()
                elapsed = timer() - start_time

                train_writer.add_summary(logs, gstep)
                train_logs.log(Gstep=gstep, Accuracy=accuracy_, Loss=loss_, Time=elapsed)

                if step % n_steps_avg == 0 and step > 0:
                    avg_accuracy = accuracy/n_steps_avg
                    avg_loss = loss/n_steps_avg
                    print('Step {:6d}, accuracy: {:7.3f}, loss: {:7.3f}, {:6.1f}s elapsed ({} steps avg.)'.format(
                        gstep, avg_accuracy, avg_loss, elapsed, n_steps_avg))
                    accuracy = 0
                    loss = 0                          

            valid_accuracy, valid_loss = valid_eval(sess, model, valid_set, batches)
            print('Global Step {}, valid accuracy: {:7.3}'.format(gstep, valid_accuracy))
            
            elapsed = timer() - start_time
            valid_logs.log(Gstep=gstep, Accuracy=valid_accuracy, Loss=valid_loss, Time=elapsed)

            summary = tf.Summary()
            summary.value.add(tag="model/Accuracy", simple_value=valid_accuracy)
            summary.value.add(tag="model/Loss", simple_value=valid_loss)
            valid_writer.add_summary(summary, gstep)
            valid_writer.flush()

            tf.logging.info('Step {} validation accuracy: {:7.3}'.format(gstep, valid_accuracy))

            checkpoint_path = os.path.join(log_dir, 'crm_lstm.ckpt')
            model.saver.save(sess, checkpoint_path, global_step=model.global_step)

        print('Done Training')

        train_logs.write_csv(os.path.join(log_txt_dir, run_name + '_train.csv'))
        valid_logs.write_csv(os.path.join(log_txt_dir, run_name + '_valid.csv'))
                        


### Hyper-Parameter Tuning

The code below supports tuning several hyper-parameters at once. Since our datasets are so small, we will try just one very simple model, with:

- 256 hidden units
- 1 LSTM layer
- batches of 2 examples
- a dropout keep probability of 0.7 (30% of neuron outputs dropped)

In [79]:
# Parameter sets
import itertools

class ParameterTuner(object):
    def __init__(self):
        self.h_sizes = None
        self.rnn_layers = None
        self.batch_sizes = None
        self.dropout_keep = None
    
    def n_sets(self):
        return len(self.h_sizes)*len(self.rnn_layers)*len(self.batch_sizes)*len(self.dropout_keep)
    
    def sets(self):
        parameters = [self.h_sizes, self.rnn_layers, self.batch_sizes, self.dropout_keep]
        for h, layers, batches, dropouts in itertools.product(*parameters):
            par_set = {}
            par_set['h_size'] = h
            par_set['rnn_layers'] = layers
            par_set['batch_size'] = batches
            par_set['dropout_keep'] = dropouts
            par_string = 'h{}_l{}_b{}_d{}'.format(h, layers, batches, dropouts)
            yield par_set, par_string

# h_sizes = [128, 256, 512, 1024]
# rnn_layers = [1, 2, 3, 4]
# batch_sizes = [16, 32, 64, 128]

# h_sizes = [128, 256]
# rnn_layers = [1, 2]
# batch_sizes = [16, 32]

h_sizes = [256]
rnn_layers = [1]
batch_sizes = [2]
dropout_keep = [0.7]


tuner = ParameterTuner()
tuner.h_sizes = h_sizes
tuner.rnn_layers = rnn_layers
tuner.batch_sizes = batch_sizes
tuner.dropout_keep = dropout_keep
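
The tuner enumerates the Cartesian product of the per-parameter value lists, so the number of runs grows multiplicatively. A minimal standalone sketch of the same idea, using a hypothetical larger grid (the `grid` dictionary below is for illustration only and is not a setting used in this notebook):

```python
import itertools

# Hypothetical grid, for illustration only.
grid = {
    'h_size': [128, 256],
    'rnn_layers': [1, 2],
    'batch_size': [2],
    'dropout_keep': [0.7],
}

names = list(grid)
runs = []
for values in itertools.product(*(grid[name] for name in names)):
    par_set = dict(zip(names, values))
    # Same naming scheme as the run directories, e.g. h256_l1_b2_d0.7
    run_name = 'h{h_size}_l{rnn_layers}_b{batch_size}_d{dropout_keep}'.format(**par_set)
    runs.append((par_set, run_name))

print(len(runs))   # 4
print(runs[0][1])  # h128_l1_b2_d0.7
```

With single-element lists, as used here, the product contains exactly one parameter set, which is why the logs below report `Run 1/1`.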

### Chatbot Training

First, let's try to train this model on the Chatbot dataset:

In [95]:
import os

epochs = 10
n_steps_avg = 10

FLAGS_log_dir = 'logs_chatbot/'

# chat_train_set
# chat_valid_set

tune_start = timer()
n_psets = tuner.n_sets()
for iset, (pset, pstring) in enumerate(tuner.sets()):
    tune_elapsed = timer() - tune_start
    print('\n\nRun {}/{}, {}s elapsed'.format(iset+1, n_psets, tune_elapsed))
    train_net(chat_train_set, chat_valid_set, epochs, pstring, **pset)



Run 1/1, 0.0001356780412606895s elapsed

Creating model with parameters:
h_size          : 256
batch_size      : 2
rnn_layers      : 1
dropout_keep    : 0.7
Loading model from parameters in logs_chatbot/h256_l1_b2_d0.7/crm_lstm.ckpt-320.
INFO:tensorflow:Restoring parameters from logs_chatbot/h256_l1_b2_d0.7/crm_lstm.ckpt-320

Epoch 1
Step    331, accuracy:   1.050, loss:   0.203,    1.8s elapsed (10 steps avg.)
Step    341, accuracy:   1.000, loss:   0.115,    3.3s elapsed (10 steps avg.)
Step    351, accuracy:   1.000, loss:   0.071,    4.4s elapsed (10 steps avg.)
Global Step 352, valid accuracy:     0.5
INFO:tensorflow:Step 352 validation accuracy:     0.5

Epoch 2
Step    363, accuracy:   1.100, loss:   0.099,    7.2s elapsed (10 steps avg.)
Step    373, accuracy:   1.000, loss:   0.069,    8.3s elapsed (10 steps avg.)
Step    383, accuracy:   0.950, loss:   0.146,    9.7s elapsed (10 steps avg.)
Global Step 384, valid accuracy:   0.429
INFO:tensorflow:Step 384 validation accurac

In [102]:
print(len(chat_train_set), len(chat_valid_set))

64 14


### Remarks

The dataset is traversed very quickly, and we trained for 10 epochs. On the training data we reach maximum accuracy, but the validation accuracy is only around 57%. Although this is based on only 14 validation examples, it is not encouraging. In any case, there is a lot of overfitting here, even with a dropout keep probability of 0.7.
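
To get a feel for how little 14 validation examples can tell us, one option is to put a confidence interval around the measured accuracy. The sketch below uses the Wilson score interval for a binomial proportion; it assumes that "around 57%" corresponds to 8 correct out of 14, which is inferred from the numbers above rather than logged directly.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p_hat = n_correct / n_total
    denom = 1 + z**2 / n_total
    centre = (p_hat + z**2 / (2 * n_total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2))
    return centre - half_width, centre + half_width


# Assumption: 8 of the 14 validation examples were classified correctly.
low, high = wilson_interval(8, 14)
print('{:.2f} to {:.2f}'.format(low, high))
```

Under that assumption the interval runs from roughly 0.33 to 0.79, so this validation set cannot distinguish the model from a coin flip.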

To improve this result, the most obvious thing to do is to prepare a larger dataset. With a dataset this size, you cannot expect much predictive capability from any model. We also noticed that many of the words in the validation set were mapped to the `_UNK` token, so either increasing the vocabulary size or using pre-trained word embeddings may help. Another possibility is to work with a character-level model, although our expectations and options are limited without first increasing the dataset size.
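
Before changing the vocabulary, it may help to measure how large the `_UNK` problem actually is. The following is a minimal sketch, assuming whitespace-tokenised text and a vocabulary built from the most frequent training tokens; the helper names `build_vocab` and `unk_rate` and the toy sentences are hypothetical and do not come from this notebook or from Sheet_1.csv.

```python
from collections import Counter


def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens seen in the training texts."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(max_size)}


def unk_rate(token_lists, vocab):
    """Fraction of tokens that fall outside the vocabulary."""
    tokens = [tok for toks in token_lists for tok in toks]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Hypothetical toy data, already tokenised.
train_tokens = [['i', 'helped', 'a', 'friend'],
                ['i', 'listened', 'to', 'a', 'friend']]
valid_tokens = [['i', 'supported', 'my', 'friend']]

vocab = build_vocab(train_tokens, max_size=20000)
print(unk_rate(valid_tokens, vocab))  # 0.5
```

If the rate stays high even when `max_size` exceeds the number of distinct training tokens, a bigger vocabulary cannot help, because the unknown words never occur in the training set; pre-trained embeddings or a character-level model would then be the more promising options.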

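NB: I also noticed a bug where the reported training accuracy goes over 1.0. The running sums are printed when `step % n_steps_avg == 0 and step > 0`, which first fires at step 10 of each epoch, after 11 batches have been accumulated, but the sums are divided by `n_steps_avg` (10). I will fix this later.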
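
One possible fix is to count the batches that actually went into the running sum and divide by that count. The sketch below isolates the averaging logic from the training loop; `windowed_means` is a hypothetical helper, not code from this notebook.

```python
def windowed_means(values, window):
    """Yield the mean of each consecutive, full window of `window` values."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
        if count == window:
            yield total / count
            total = 0.0
            count = 0


# Eleven perfect batches. The check `step % 10 == 0 and step > 0` first
# fires at step 10, i.e. after 11 values, and 11 / 10 = 1.1.
batch_accuracies = [1.0] * 11
buggy_first_window = sum(batch_accuracies[:11]) / 10
fixed = list(windowed_means(batch_accuracies, window=10))

print(buggy_first_window)  # 1.1
print(fixed)               # [1.0]
```

This matches the logged values of 1.050 and 1.100 for the first window of each epoch, which correspond to sums of 10.5 and 11 divided by 10.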

### Resume Training

Now we retrain the same model architecture on the Resume dataset.

In [99]:
epochs = 10
n_steps_avg = 10

FLAGS_log_dir = 'logs_resume/'

# resume_train_set
# resume_valid_set

tune_start = timer()
n_psets = tuner.n_sets()
for iset, (pset, pstring) in enumerate(tuner.sets()):
    tune_elapsed = timer() - tune_start
    print('\n\nRun {}/{}, {}s elapsed'.format(iset+1, n_psets, tune_elapsed))
    train_net(resume_train_set, resume_valid_set, epochs, pstring, **pset)



Run 1/1, 0.0001233249786309898s elapsed

Creating model with parameters:
h_size          : 256
batch_size      : 2
rnn_layers      : 1
dropout_keep    : 0.7
Creating model with fresh parameters.

Epoch 1
Step     11, accuracy:   0.800, loss:   0.757,   29.9s elapsed (10 steps avg.)
Step     21, accuracy:   0.700, loss:   0.684,   52.6s elapsed (10 steps avg.)
Step     31, accuracy:   0.750, loss:   0.674,   77.6s elapsed (10 steps avg.)
Step     41, accuracy:   0.700, loss:   0.670,  111.7s elapsed (10 steps avg.)
Global Step 49, valid accuracy:   0.727
INFO:tensorflow:Step 49 validation accuracy:   0.727

Epoch 2
Step     60, accuracy:   0.900, loss:   0.701,  190.8s elapsed (10 steps avg.)
Step     70, accuracy:   0.600, loss:   0.694,  225.8s elapsed (10 steps avg.)
Step     80, accuracy:   0.650, loss:   0.644,  257.6s elapsed (10 steps avg.)
Step     90, accuracy:   0.750, loss:   0.617,  299.6s elapsed (10 steps avg.)
Global Step 98, valid accuracy:   0.727
INFO:tensorflow:Step

In [103]:
print(len(resume_train_set), len(resume_valid_set))

99 23


### Remarks

For the Resume dataset, we get a better validation accuracy of around 82%, but again this is based on only 23 validation examples. Again we see overfitting after about 5 epochs.
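
Since the validation metrics are logged once per epoch, one simple response to overfitting after a few epochs is early stopping: keep the epoch with the lowest validation loss and stop once it has not improved for a while. A minimal sketch, assuming a list of per-epoch validation losses; the `best_epoch` helper, the `patience` parameter and the loss values are hypothetical and are not the values logged in this notebook.

```python
def best_epoch(valid_losses, patience=2):
    """Return (epoch_index, loss) of the best epoch, stopping once the
    validation loss has not improved for `patience` consecutive epochs."""
    best_idx, best_loss = None, float('inf')
    for idx, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
        elif idx - best_idx >= patience:
            break
    return best_idx, best_loss


# Hypothetical per-epoch validation losses.
losses = [0.69, 0.62, 0.55, 0.50, 0.48, 0.53, 0.60, 0.66]
print(best_epoch(losses))  # (4, 0.48)
```

Because `train_net` saves a checkpoint after every epoch, the checkpoint from the selected epoch could in principle be restored afterwards, provided the saver has not already discarded it.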

To improve the accuracy of this model, the same remarks as for the Chatbot case apply. Very generally, the main thing that needs to be done is to increase the dataset size. If this is not an option, some features that we know to be informative could be engineered. This would give the model more predictive power, at the cost of making it very specific to our current case(s).