####                                         Testing BERT on a sentence classification task(Conduct risk sentences) . 


The object of this test was to see how few hard coded sentences could be offered up to the model during training and thus check how powerful BERT might be in predicting and classifying sentences that it had not seen before. There were semantic similarities offered to the model during training but the model needed to make semantic inference to produce a prediction. 

Much of the code below was taken from one of the Google research workbooks and adapted for the purpose of predicting several sentences at the bottom of this notebook. I also switched out the cross-entropy softmax with logits to just a softmax in the model below.

I am only showing the classifier layer sitting above BERT 70 sentences. 27 of those sentences are contextually of interest and relate to conduct risk. The rest of the sentences are not and are innocent communications. 

The idea was to train the classifier quickly and in this case on just a tiny amount of information to see how much it is possible to leverage of BERT's pre trained memory. 

In practice accuracy and stability in the models' predictions would grow the more information is shown to the classifier.

The results are at the bottom of this notebook. Having previously tested multiple different pre-trained vectors on conduct risk related content the results below are encouraging. Having watched the available pre-trained word embedding vectors grow in size since 2013 I was somewhat sceptical around working with another large pre-trained model that was not trained on finance domain specific data. Clearly this is a new era for NLP and the models are now larger, more complex but also consist of features that look to be richer than just a matrix with word embeddings. Given the task below I also feel the full capacity of the model is not being tested. Depending on where the model might be deployed I could see the benefit in an organisation training their own bi-direction transformer model on domain specific text.

Its early days in terms of where NLP goes from BERT and if the last twelve months are anything to go by it will be a fascinating journey.

April 2019

In [23]:
from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from datetime import datetime
from sklearn.utils import resample

import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

In [3]:
test = pd.read_csv('/Users/brianfarrell/bert/bert_test.csv',encoding='utf8')
train = pd.read_csv('/Users/brianfarrell/bert/bert_train.csv',encoding='utf8') 

#Tidy up DF 
test.drop(columns='Misconduct_type',inplace=True)
values = {'Label_bin': 0}
train.fillna(value=values,inplace=True) 


#training Classifier on just seventy sentences, 27 of which are of interest and the rest are not.
train_1 = train.iloc[0:70,:]
#Print label distribution in training set
train_1["Label_bin"].value_counts()

0.0    43
1.0    27
Name: Label_bin, dtype: int64

In [4]:
DATA_COLUMN = 'Sentences'
LABEL_COLUMN = 'Label_bin'
label_list = [0, 1] 

In [5]:
# Use the InputExample class from BERT's run_classifier code to create examples from the data
train_InputExamples = train_1.apply(lambda x: bert.run_classifier.InputExample(guid=None,  
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)
     

test_InputExamples = test.apply(lambda x: bert.run_classifier.InputExample(guid=None, 
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

In [6]:
#Using large uncased model from TF Hub
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1"

def create_tokenizer_from_hub_module():
  """Get the vocab file and casing info from the Hub module."""
  with tf.Graph().as_default():
    bert_module = hub.Module(BERT_MODEL_HUB)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                            tokenization_info["do_lower_case"]])
      
  return bert.tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0409 11:14:25.363950 4611933632 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


In [7]:
#Run example sentence through tokenizer
tokenizer.tokenize("I am not sure you catch my drift this will definietly go higher  ")

['i',
 'am',
 'not',
 'sure',
 'you',
 'catch',
 'my',
 'drift',
 'this',
 'will',
 'def',
 '##ini',
 '##et',
 '##ly',
 'go',
 'higher']

In [8]:
#Set sequences to be at most 15 tokens long.
MAX_SEQ_LENGTH = 15
# Convert train and test features to InputFeatures that BERT understands.
train_features = bert.run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
test_features = bert.run_classifier.convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

INFO:tensorflow:Writing example 0 of 70


I0409 11:14:26.480076 4611933632 tf_logging.py:115] Writing example 0 of 70


INFO:tensorflow:*** Example ***


I0409 11:14:26.484364 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.486499 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] this is a no - brain ##er [SEP]


I0409 11:14:26.488110 4611933632 tf_logging.py:115] tokens: [CLS] this is a no - brain ##er [SEP]


INFO:tensorflow:input_ids: 101 2023 2003 1037 2053 1011 4167 2121 102 0 0 0 0 0 0


I0409 11:14:26.489505 4611933632 tf_logging.py:115] input_ids: 101 2023 2003 1037 2053 1011 4167 2121 102 0 0 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0


I0409 11:14:26.490568 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.492081 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 1.0 (id = 1)


I0409 11:14:26.494300 4611933632 tf_logging.py:115] label: 1.0 (id = 1)


INFO:tensorflow:*** Example ***


I0409 11:14:26.495421 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.496556 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] i was at the fish markets on the weekend [SEP]


I0409 11:14:26.497925 4611933632 tf_logging.py:115] tokens: [CLS] i was at the fish markets on the weekend [SEP]


INFO:tensorflow:input_ids: 101 1045 2001 2012 1996 3869 6089 2006 1996 5353 102 0 0 0 0


I0409 11:14:26.502228 4611933632 tf_logging.py:115] input_ids: 101 1045 2001 2012 1996 3869 6089 2006 1996 5353 102 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0


I0409 11:14:26.503911 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.505593 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0.0 (id = 0)


I0409 11:14:26.507071 4611933632 tf_logging.py:115] label: 0.0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:14:26.508860 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.510393 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] wool ##ies has a great deal on fruit at the moment [SEP]


I0409 11:14:26.511663 4611933632 tf_logging.py:115] tokens: [CLS] wool ##ies has a great deal on fruit at the moment [SEP]


INFO:tensorflow:input_ids: 101 12121 3111 2038 1037 2307 3066 2006 5909 2012 1996 2617 102 0 0


I0409 11:14:26.512835 4611933632 tf_logging.py:115] input_ids: 101 12121 3111 2038 1037 2307 3066 2006 5909 2012 1996 2617 102 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0


I0409 11:14:26.514279 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.515906 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0.0 (id = 0)


I0409 11:14:26.517482 4611933632 tf_logging.py:115] label: 0.0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:14:26.520468 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.522042 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] there is a car dealers ##hip in west sydney offering no - brain [SEP]


I0409 11:14:26.523937 4611933632 tf_logging.py:115] tokens: [CLS] there is a car dealers ##hip in west sydney offering no - brain [SEP]


INFO:tensorflow:input_ids: 101 2045 2003 1037 2482 16743 5605 1999 2225 3994 5378 2053 1011 4167 102


I0409 11:14:26.530337 4611933632 tf_logging.py:115] input_ids: 101 2045 2003 1037 2482 16743 5605 1999 2225 3994 5378 2053 1011 4167 102


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


I0409 11:14:26.539482 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.542020 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0.0 (id = 0)


I0409 11:14:26.543987 4611933632 tf_logging.py:115] label: 0.0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:14:26.547743 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.562651 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] he does not care about our investment objectives [SEP]


I0409 11:14:26.568362 4611933632 tf_logging.py:115] tokens: [CLS] he does not care about our investment objectives [SEP]


INFO:tensorflow:input_ids: 101 2002 2515 2025 2729 2055 2256 5211 11100 102 0 0 0 0 0


I0409 11:14:26.573902 4611933632 tf_logging.py:115] input_ids: 101 2002 2515 2025 2729 2055 2256 5211 11100 102 0 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0


I0409 11:14:26.578629 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.581360 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 1.0 (id = 1)


I0409 11:14:26.584310 4611933632 tf_logging.py:115] label: 1.0 (id = 1)


INFO:tensorflow:Writing example 0 of 68


I0409 11:14:26.605654 4611933632 tf_logging.py:115] Writing example 0 of 68


INFO:tensorflow:*** Example ***


I0409 11:14:26.608643 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.612019 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] client is not happy can ' t talk right now [SEP]


I0409 11:14:26.614871 4611933632 tf_logging.py:115] tokens: [CLS] client is not happy can ' t talk right now [SEP]


INFO:tensorflow:input_ids: 101 7396 2003 2025 3407 2064 1005 1056 2831 2157 2085 102 0 0 0


I0409 11:14:26.618086 4611933632 tf_logging.py:115] input_ids: 101 7396 2003 2025 3407 2064 1005 1056 2831 2157 2085 102 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0


I0409 11:14:26.620911 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.623702 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 1 (id = 1)


I0409 11:14:26.627012 4611933632 tf_logging.py:115] label: 1 (id = 1)


INFO:tensorflow:*** Example ***


I0409 11:14:26.630335 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.633456 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] i just bought some apples at the market on the weekend . [SEP]


I0409 11:14:26.637636 4611933632 tf_logging.py:115] tokens: [CLS] i just bought some apples at the market on the weekend . [SEP]


INFO:tensorflow:input_ids: 101 1045 2074 4149 2070 18108 2012 1996 3006 2006 1996 5353 1012 102 0


I0409 11:14:26.641303 4611933632 tf_logging.py:115] input_ids: 101 1045 2074 4149 2070 18108 2012 1996 3006 2006 1996 5353 1012 102 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0


I0409 11:14:26.645221 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.649471 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:14:26.651906 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:14:26.654603 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.657557 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] but that would nt work . [SEP]


I0409 11:14:26.663815 4611933632 tf_logging.py:115] tokens: [CLS] but that would nt work . [SEP]


INFO:tensorflow:input_ids: 101 2021 2008 2052 23961 2147 1012 102 0 0 0 0 0 0 0


I0409 11:14:26.667436 4611933632 tf_logging.py:115] input_ids: 101 2021 2008 2052 23961 2147 1012 102 0 0 0 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0


I0409 11:14:26.668984 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.671878 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:14:26.676074 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:14:26.679919 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.683771 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] well there are times when you are on and times that you are [SEP]


I0409 11:14:26.686047 4611933632 tf_logging.py:115] tokens: [CLS] well there are times when you are on and times that you are [SEP]


INFO:tensorflow:input_ids: 101 2092 2045 2024 2335 2043 2017 2024 2006 1998 2335 2008 2017 2024 102


I0409 11:14:26.690593 4611933632 tf_logging.py:115] input_ids: 101 2092 2045 2024 2335 2043 2017 2024 2006 1998 2335 2008 2017 2024 102


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


I0409 11:14:26.693265 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.695792 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:14:26.698701 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:14:26.701837 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: None


I0409 11:14:26.704094 4611933632 tf_logging.py:115] guid: None


INFO:tensorflow:tokens: [CLS] i got a great deal on my mobile plan . [SEP]


I0409 11:14:26.705961 4611933632 tf_logging.py:115] tokens: [CLS] i got a great deal on my mobile plan . [SEP]


INFO:tensorflow:input_ids: 101 1045 2288 1037 2307 3066 2006 2026 4684 2933 1012 102 0 0 0


I0409 11:14:26.708046 4611933632 tf_logging.py:115] input_ids: 101 1045 2288 1037 2307 3066 2006 2026 4684 2933 1012 102 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0


I0409 11:14:26.709973 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:14:26.711422 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:14:26.713359 4611933632 tf_logging.py:115] label: 0 (id = 0)


In [9]:
def create_model(is_predicting, input_ids, input_mask, segment_ids, labels,
                 num_labels):
  """Creates a classification model."""

  bert_module = hub.Module(
      BERT_MODEL_HUB,
      trainable=True)
  bert_inputs = dict(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids)
  bert_outputs = bert_module(
      inputs=bert_inputs,
      signature="tokens",
      as_dict=True)

  output_layer = bert_outputs["pooled_output"]

  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):

    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.softmax(logits)  
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 
    predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))     
    if is_predicting:
      return (predicted_labels, log_probs)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, predicted_labels, log_probs)

In [10]:
#Model_fn_builder  
def model_fn_builder(num_labels, learning_rate, num_train_steps,
                     num_warmup_steps):
  """Returns `model_fn` closure for TPUEstimator."""
  def model_fn(features, labels, mode, params):    
    """The `model_fn` for TPUEstimator."""

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    label_ids = features["label_ids"]

    is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)
    
    # TRAIN and EVAL
    if not is_predicting:

      (loss, predicted_labels, log_probs) = create_model(
        is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

      train_op = bert.optimization.create_optimizer(
          loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)

      # Calculate evaluation metrics. 
      def metric_fn(label_ids, predicted_labels):
        accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
        f1_score = tf.contrib.metrics.f1_score(
            label_ids,
            predicted_labels)
        auc = tf.metrics.auc(
            label_ids,
            predicted_labels)
        recall = tf.metrics.recall(
            label_ids,
            predicted_labels)
        precision = tf.metrics.precision(
            label_ids,
            predicted_labels) 
        true_pos = tf.metrics.true_positives(
            label_ids,
            predicted_labels)
        true_neg = tf.metrics.true_negatives(
            label_ids,
            predicted_labels)   
        false_pos = tf.metrics.false_positives(
            label_ids,
            predicted_labels)  
        false_neg = tf.metrics.false_negatives(
            label_ids,
            predicted_labels) 

      eval_metrics = metric_fn(label_ids, predicted_labels)

      if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
          loss=loss,
          train_op=train_op)
      else:
          return tf.estimator.EstimatorSpec(mode=mode,
            loss=loss,
            eval_metric_ops=eval_metrics)
    else:
      (predicted_labels, log_probs) = create_model(
        is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

      predictions = {
          'probabilities': log_probs,
          'labels': predicted_labels
      }
      return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  # Return the actual model function in the closure
  return model_fn

In [11]:
# Compute train and warmup steps from batch size
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1

In [12]:
# Compute # train and warmup steps from batch size
num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

In [13]:
run_config = tf.estimator.RunConfig(
    model_dir=None)

In [14]:
model_fn = model_fn_builder(
  num_labels=len(label_list),
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps)

estimator = tf.estimator.Estimator(
  model_fn=model_fn,
  config=run_config,
  params={"batch_size": BATCH_SIZE})



W0409 11:14:26.816224 4611933632 tf_logging.py:125] Using temporary folder as model directory: /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w


INFO:tensorflow:Using config: {'_model_dir': '/var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a3190ee48>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


I0409 11:14:26.819579 4611933632 tf_logging.py:115] Using config: {'_model_dir': '/var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a3190ee48>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [15]:
train_input_fn = bert.run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=False)

In [16]:
print(f'Beginning Training!')
current_time = datetime.now()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print("Training took time ", datetime.now() - current_time)

Beginning Training!
INFO:tensorflow:Calling model_fn.


I0409 11:14:26.902292 4611933632 tf_logging.py:115] Calling model_fn.


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0409 11:14:35.360075 4611933632 tf_logging.py:115] Saver not created because there are no variables in the graph to restore
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


INFO:tensorflow:Done calling model_fn.


I0409 11:15:04.106714 4611933632 tf_logging.py:115] Done calling model_fn.


INFO:tensorflow:Create CheckpointSaverHook.


I0409 11:15:04.111134 4611933632 tf_logging.py:115] Create CheckpointSaverHook.


INFO:tensorflow:Graph was finalized.


I0409 11:15:20.971144 4611933632 tf_logging.py:115] Graph was finalized.


INFO:tensorflow:Running local_init_op.


I0409 11:15:35.772709 4611933632 tf_logging.py:115] Running local_init_op.


INFO:tensorflow:Done running local_init_op.


I0409 11:15:36.189880 4611933632 tf_logging.py:115] Done running local_init_op.


INFO:tensorflow:Saving checkpoints for 0 into /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt.


I0409 11:16:06.863450 4611933632 tf_logging.py:115] Saving checkpoints for 0 into /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt.


INFO:tensorflow:loss = -0.4381824, step = 1


I0409 11:16:51.920901 4611933632 tf_logging.py:115] loss = -0.4381824, step = 1


INFO:tensorflow:Saving checkpoints for 6 into /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt.


I0409 11:18:06.191381 4611933632 tf_logging.py:115] Saving checkpoints for 6 into /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt.


INFO:tensorflow:Loss for final step: -0.7097838.


I0409 11:18:26.237435 4611933632 tf_logging.py:115] Loss for final step: -0.7097838.


Training took time  0:03:59.401701


In [17]:
test_input_fn = run_classifier.input_fn_builder(
    features=test_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

In [18]:
estimator.evaluate(input_fn=test_input_fn, steps=None)

INFO:tensorflow:Calling model_fn.


I0409 11:18:26.296015 4611933632 tf_logging.py:115] Calling model_fn.


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0409 11:18:34.960335 4611933632 tf_logging.py:115] Saver not created because there are no variables in the graph to restore
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


INFO:tensorflow:Done calling model_fn.


I0409 11:18:56.612127 4611933632 tf_logging.py:115] Done calling model_fn.


INFO:tensorflow:Starting evaluation at 2019-04-09-01:18:56


I0409 11:18:56.639560 4611933632 tf_logging.py:115] Starting evaluation at 2019-04-09-01:18:56


INFO:tensorflow:Graph was finalized.


I0409 11:19:00.290427 4611933632 tf_logging.py:115] Graph was finalized.


INFO:tensorflow:Restoring parameters from /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt-6


I0409 11:19:00.293190 4611933632 tf_logging.py:115] Restoring parameters from /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt-6


INFO:tensorflow:Running local_init_op.


I0409 11:19:07.677698 4611933632 tf_logging.py:115] Running local_init_op.


INFO:tensorflow:Done running local_init_op.


I0409 11:19:08.032736 4611933632 tf_logging.py:115] Done running local_init_op.


INFO:tensorflow:Finished evaluation at 2019-04-09-01:19:18


I0409 11:19:18.821744 4611933632 tf_logging.py:115] Finished evaluation at 2019-04-09-01:19:18


INFO:tensorflow:Saving dict for global step 6: global_step = 6, loss = -0.593893


I0409 11:19:18.822780 4611933632 tf_logging.py:115] Saving dict for global step 6: global_step = 6, loss = -0.593893


INFO:tensorflow:Saving 'checkpoint_path' summary for global step 6: /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt-6


I0409 11:19:33.125077 4611933632 tf_logging.py:115] Saving 'checkpoint_path' summary for global step 6: /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt-6


{'loss': -0.593893, 'global_step': 6}

In [19]:
def getPrediction(in_sentences):
  labels = ["Looks OK", "Please Review"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities'], labels[prediction['labels']]) for sentence, prediction in zip(in_sentences, predictions)]

In [20]:
#Get model's prediction on the folowing sentences
pred_sentences = [   
  "She just didnt care about our investment needs", #Client complaining about service 
  "You need to call me once the market shuts there is an issue", #Possible concealment of trade issue
  "Absolutely fantastic level!",  #Innocent
  "I need to speak to your head of compliance now", #Likely issue 
  "My friend just started working in your compliance function" #Innocent 
]

In [21]:
predictions = getPrediction(pred_sentences)

INFO:tensorflow:Writing example 0 of 5


I0409 11:19:33.148476 4611933632 tf_logging.py:115] Writing example 0 of 5


INFO:tensorflow:*** Example ***


I0409 11:19:33.150757 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: 


I0409 11:19:33.152515 4611933632 tf_logging.py:115] guid: 


INFO:tensorflow:tokens: [CLS] she just didn ##t care about our investment needs [SEP]


I0409 11:19:33.154062 4611933632 tf_logging.py:115] tokens: [CLS] she just didn ##t care about our investment needs [SEP]


INFO:tensorflow:input_ids: 101 2016 2074 2134 2102 2729 2055 2256 5211 3791 102 0 0 0 0


I0409 11:19:33.158153 4611933632 tf_logging.py:115] input_ids: 101 2016 2074 2134 2102 2729 2055 2256 5211 3791 102 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0


I0409 11:19:33.163858 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:19:33.167519 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:19:33.169585 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:19:33.171766 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: 


I0409 11:19:33.174360 4611933632 tf_logging.py:115] guid: 


INFO:tensorflow:tokens: [CLS] you need to call me once the market shut ##s there is an [SEP]


I0409 11:19:33.176594 4611933632 tf_logging.py:115] tokens: [CLS] you need to call me once the market shut ##s there is an [SEP]


INFO:tensorflow:input_ids: 101 2017 2342 2000 2655 2033 2320 1996 3006 3844 2015 2045 2003 2019 102


I0409 11:19:33.178308 4611933632 tf_logging.py:115] input_ids: 101 2017 2342 2000 2655 2033 2320 1996 3006 3844 2015 2045 2003 2019 102


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


I0409 11:19:33.180732 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:19:33.182197 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:19:33.184462 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:19:33.186446 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: 


I0409 11:19:33.187897 4611933632 tf_logging.py:115] guid: 


INFO:tensorflow:tokens: [CLS] absolutely fantastic level ! [SEP]


I0409 11:19:33.190086 4611933632 tf_logging.py:115] tokens: [CLS] absolutely fantastic level ! [SEP]


INFO:tensorflow:input_ids: 101 7078 10392 2504 999 102 0 0 0 0 0 0 0 0 0


I0409 11:19:33.191891 4611933632 tf_logging.py:115] input_ids: 101 7078 10392 2504 999 102 0 0 0 0 0 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0


I0409 11:19:33.193605 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:19:33.195080 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:19:33.196581 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:19:33.198417 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: 


I0409 11:19:33.205040 4611933632 tf_logging.py:115] guid: 


INFO:tensorflow:tokens: [CLS] i need to speak to your head of compliance now [SEP]


I0409 11:19:33.207061 4611933632 tf_logging.py:115] tokens: [CLS] i need to speak to your head of compliance now [SEP]


INFO:tensorflow:input_ids: 101 1045 2342 2000 3713 2000 2115 2132 1997 12646 2085 102 0 0 0


I0409 11:19:33.209407 4611933632 tf_logging.py:115] input_ids: 101 1045 2342 2000 3713 2000 2115 2132 1997 12646 2085 102 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0


I0409 11:19:33.212705 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:19:33.215974 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:19:33.218469 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:*** Example ***


I0409 11:19:33.221433 4611933632 tf_logging.py:115] *** Example ***


INFO:tensorflow:guid: 


I0409 11:19:33.227838 4611933632 tf_logging.py:115] guid: 


INFO:tensorflow:tokens: [CLS] my friend just started working in your compliance function [SEP]


I0409 11:19:33.236067 4611933632 tf_logging.py:115] tokens: [CLS] my friend just started working in your compliance function [SEP]


INFO:tensorflow:input_ids: 101 2026 2767 2074 2318 2551 1999 2115 12646 3853 102 0 0 0 0


I0409 11:19:33.238691 4611933632 tf_logging.py:115] input_ids: 101 2026 2767 2074 2318 2551 1999 2115 12646 3853 102 0 0 0 0


INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0


I0409 11:19:33.242287 4611933632 tf_logging.py:115] input_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0


INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


I0409 11:19:33.244468 4611933632 tf_logging.py:115] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


INFO:tensorflow:label: 0 (id = 0)


I0409 11:19:33.251318 4611933632 tf_logging.py:115] label: 0 (id = 0)


INFO:tensorflow:Calling model_fn.


I0409 11:19:33.326897 4611933632 tf_logging.py:115] Calling model_fn.


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0409 11:19:41.964319 4611933632 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Done calling model_fn.


I0409 11:19:42.493695 4611933632 tf_logging.py:115] Done calling model_fn.


INFO:tensorflow:Graph was finalized.


I0409 11:19:43.462757 4611933632 tf_logging.py:115] Graph was finalized.


INFO:tensorflow:Restoring parameters from /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt-6


I0409 11:19:43.465118 4611933632 tf_logging.py:115] Restoring parameters from /var/folders/mk/_jsfr0655xx4hymnm7d5td900000gn/T/tmp7jadvo0w/model.ckpt-6


INFO:tensorflow:Running local_init_op.


I0409 11:19:45.816371 4611933632 tf_logging.py:115] Running local_init_op.


INFO:tensorflow:Done running local_init_op.


I0409 11:19:45.974565 4611933632 tf_logging.py:115] Done running local_init_op.


In [22]:
#Final cell and predictions for the sentences listed above in cell 20.
#The output below consists of the following:
#Prints the sentence first
#Prints two class probabilities via Softmax 
#The first number is the probability of the sentence being OK,and second is the 
#probability of the sentence needing somebody to review it.
#Given the probability treshold is 50% the model will produce the suggested action based on the two probabilities versus the treshhold.
#Would be interesting to override trashold given typical class imbalance but it is beyond scope of this test.

#Definitely not a prefect outcome on the first sentence but given the amount of traininng data that was shown to the model
#the results are encouraging. I also tested typos in sentences of interest outside 
#this notebook and again the results were encouraging.  
predictions

[('She just didnt care about our investment needs',
  array([0.5122035 , 0.48779655], dtype=float32),
  'Looks OK'),
 ('You need to call me once the market shuts there is an issue',
  array([0.35026708, 0.6497329 ], dtype=float32),
  'Please Review'),
 ('Absolutely fantastic level!',
  array([0.5733275 , 0.42667255], dtype=float32),
  'Looks OK'),
 ('I need to speak to your head of compliance now',
  array([0.30353436, 0.6964656 ], dtype=float32),
  'Please Review'),
 ('My friend just started working in your compliance function',
  array([0.64840436, 0.3515957 ], dtype=float32),
  'Looks OK')]