#### Workshop Description
Understanding the questions posed by instructors and students alike plays an important role in the development of educational technology applications. In this intermediate-level workshop, you will learn to apply NLP to one piece of this real-world problem by building a model to predict the type of answer (e.g. entity, description, or number) a question elicits. Specifically, you will learn to:
1. Perform preprocessing, normalization, and exploratory analysis on a question dataset,
2. Identify salient linguistic features of natural language questions,
3. Experiment with different feature sets and models to predict the answer type, and
4. Use powerful pretrained language models to create dense sentence representations and apply deep learning models to text classification.

The concepts will be taught using popular NLP and ML packages like spaCy, scikit-learn, and TensorFlow.

This workshop assumes familiarity with Jupyter notebooks and the basics of scientific packages like NumPy and SciPy. We also assume some basic knowledge of machine learning and deep learning techniques like CNNs and LSTMs. Reference materials will be provided for attendees who want a better understanding of these techniques.
***


# Deep Representation Learning to classify TREC question-classification text

**Overview:**
In this session we'll try to solve the TREC question-classification problem using a few popular deep learning techniques.
Concretely, we will use pre-trained language models to generate **representations (embeddings)** for our input data and then classify these representations using a shallow neural network.
We will examine the network architectures of the **Universal Sentence Encoder (USE)** and **Bidirectional Encoder Representations from Transformers (BERT)** and touch upon the pros and cons of each architecture for classifying TREC questions.

### What you'll learn:
- How to use **Keras** for text classification
- How to generate representations using the pre-trained **Universal Sentence Encoder (USE)**
- How to tune and evaluate deep learning models
- How to use **TensorFlow** for text classification
- How to use the pre-trained language model **Bidirectional Encoder Representations from Transformers (BERT)** for text classification

**Note:** We will be using the same dataset as the previous 2 sessions. Notebook links to the previous session are available **INSERT LINK**

### Word Representations
![Word2Vec](images/Word2Vec.png)

### Sentence Representation
![Sentence embedding](images/SentenceEmbedding.png)
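
To make the step from word vectors to a sentence vector concrete, here is a minimal sketch that averages word vectors (mean pooling) and compares two questions with cosine similarity. The three-dimensional vectors are invented for illustration, and mean pooling is only an assumed baseline: USE and BERT, covered later, learn their sentence representations with trained encoders instead.

```python
import math

# Hypothetical 3-dimensional word vectors, invented purely for illustration.
WORD_VECTORS = {
    "who":     [0.9, 0.1, 0.0],
    "wrote":   [0.2, 0.8, 0.1],
    "hamlet":  [0.1, 0.7, 0.6],
    "when":    [0.8, 0.0, 0.3],
    "was":     [0.3, 0.3, 0.3],
    "written": [0.2, 0.9, 0.1],
}


def mean_pool(tokens):
    """Average the word vectors of the known tokens into one sentence vector."""
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vectors:
        raise ValueError("none of the tokens has a word vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


q1 = mean_pool("who wrote hamlet".split())
q2 = mean_pool("when was hamlet written".split())
print(len(q1), round(cosine_similarity(q1, q2), 3))
```

Whatever the encoder, the outcome is the same kind of object: one fixed-length vector per question that a downstream classifier can consume.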

### Utility Functions
The following two utility functions are shared across the models in this notebook to inspect training metrics and test performance. They are used later, once the models have been trained.

In [None]:
#Matplotlib Plotting Import
import matplotlib.pyplot as plt

def plot_training_history(history):
    """
    Function to plot training accuracy/loss, validation accuracy/loss.
    
    Parameters
    ----------
    history: 
        Keras training history object. See: https://keras.io/callbacks/#history
    
    Returns
    -------
    None
        The two figures are displayed with plt.show().
    """
    
    # Plot training & validation accuracy values
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()

    # Plot training & validation loss values
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()

In [None]:
import numpy as np

#Keras Imports: USE Embedding Classification
from keras.layers import Dense, Input, Dropout
from keras.models import Model, load_model
from keras.utils.np_utils import to_categorical
from keras.callbacks import ModelCheckpoint, EarlyStopping

#Sklearn Utility Imports
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

def generate_classification_report(model_path, label_encoder, test_features, test_labels, class_names=None):
    """
    Function to generate SKLearn based multi class classification report
    
    Parameters
    ----------
    model_path: Path to trained model.
    label_encoder: Encoder used during label transformation
    test_features: Features for test.
    test_labels: Ground truth labels
    class_names: Optional display names for the classes, in label-encoder order.
        Defaults to label_encoder.classes_.
    
    Returns
    -------
    str: sklearn.metrics.classification_report
        MultiClass Classification Report.
    """
    
    # Load pre-trained model
    model = load_model(model_path)
    
    # Predict labels for test features
    preds = model.predict(test_features)
    
    # Since the model is trained to return a set of probabilities across the label set, 
    # we'll have to find the index of label set with the highest probability score.
    preds_index = np.argmax(preds, axis=1)
    
    # Converting the predicted index into the original TREC based label
    preds_labels = label_encoder.inverse_transform(preds_index)
    
    # Fall back to the encoder's class names when none are supplied
    target_names = class_names if class_names is not None else label_encoder.classes_
    return classification_report(test_labels, preds_labels, target_names=target_names)

### Download Data

First, let's download the train and test data from Xin Li and Dan Roth, "Learning Question Classifiers", COLING'02, Aug. 2002:
<https://cogcomp.seas.upenn.edu/Data/QA/QC/>
    
We will store these data in Pandas DataFrames (and write them as .csv files) containing the following columns:
- *question*: The question text
- *processed_question*: The question as a spaCy Doc object
- *coarse_label*: The coarse-grained label (6 classes)
- *label*: The fine-grained label

Recall that in Module 1, we found that some questions were duplicated. Let's remove those now.

In [1]:
import os
import pandas as pd
from download_data import main as download_trec_data

if not os.path.exists("data"):
    os.makedirs("data")

download_trec_data()

path_to_train = os.path.join("data", "train.csv")
path_to_test = os.path.join("data", "test.csv")

train_df = pd.read_csv(path_to_train)
test_df = pd.read_csv(path_to_test)

# Remove the duplicated questions found in Module 1.
train_df = train_df.drop_duplicates("question")
test_df = test_df.drop_duplicates("question")

Directory 'data' already exists


## Universal Sentence Encoder
<u>Reference Paper</u>: https://arxiv.org/abs/1803.11175<br>
<u>Announcement</u>: https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html<br><br>
**Universal Sentence Encoder (USE)** is a versatile sentence embedding model that converts sentences into vector representations. These vectors capture rich semantic information that can be used to train classifiers for a broad range of downstream tasks.

![USE](images/USE.png)

Note: USE also works on short multi-sentence paragraphs.


### High-level steps for classifying text using the pre-trained USE model:
- Download the pre-trained USE model from TensorFlow Hub
- Extract USE representations for both the train and test sets
- Define the classification network architecture
- Start training


### Data Prep for USE Q&A classification

In [None]:
import re
def pre_process_text(input_text):
    """
    Function to normalize text by applying NLP transformations.
    
    Parameters
    ----------
    input_text: String 
        Question text from the input sample
        
    Returns
    -------
    String
        pre-processed version of input string
    """
    #Exercise: build multiple models based on different pre-processing techniques.
    #Un-comment the line below to strip characters outside the allowed set and see if the model performance improves.
    #input_text = re.sub(r'[^A-Za-z0-9 ,\?\'\"-._\+\!/\`@=;:]+', '', input_text)
    return input_text.lower()

##### Extract the raw question text and labels from the training and test dataframes:

In [None]:
features_train = train_df['question'].to_list()
features_test  = test_df['question'].to_list()
labels_train   = train_df['coarse_label'].to_list()
labels_test    = test_df['coarse_label'].to_list()

##### Pre-Process the text used for training and test

In [None]:
# Pre-Process the text used for training and test
features_train_processed = [pre_process_text(x) for x in features_train]
features_test_processed = [pre_process_text(x) for x in features_test]

##### The labels for the training and test set are in a string format (eg: ABBR, DESC etc). These labels need to be converted into a numerical set using [Scikit Learn's Label Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
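Before using the library versions, it can help to see what the encoding round trip does. The following is a dependency-free sketch, not scikit-learn's or Keras's actual implementation: the helper names are hypothetical, and the only behavior borrowed from `LabelEncoder` is that class ids follow the sorted order of the unique labels.

```python
def fit_label_encoder(labels):
    """Return the sorted unique labels; a label's id is its position in this list."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map string labels to integer ids."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def one_hot(ids, num_classes):
    """Turn integer ids into one-hot rows, the format a softmax classifier is trained on."""
    return [[1.0 if i == label_id else 0.0 for i in range(num_classes)] for label_id in ids]


def decode(classes, probabilities):
    """Pick the most probable class per row and map it back to its string label."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in probabilities]


train_labels = ["DESC", "ABBR", "HUM", "DESC"]  # a few coarse TREC labels as sample data
classes = fit_label_encoder(train_labels)
ids = transform(classes, train_labels)
encoded = one_hot(ids, len(classes))
decoded = decode(classes, [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
print(classes, ids, decoded)
```

The `decode` step mirrors what the classification-report helper does with `np.argmax` followed by `inverse_transform`.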

In [None]:
# Pre-Process labels for training
label_encoder = preprocessing.LabelEncoder()
labels_train_transformed = label_encoder.fit_transform(labels_train)
num_classes = len(label_encoder.classes_)
labels_train_categorical = to_categorical(np.asarray(labels_train_transformed), num_classes=num_classes)
# Note: The label encoder is only "fit" on the training set; the test labels are just transformed with it.
labels_test_transformed = label_encoder.transform(labels_test)
# Passing num_classes keeps the one-hot width identical even if a class is absent from the test set.
labels_test_categorical = to_categorical(np.asarray(labels_test_transformed), num_classes=num_classes)

##### Download and load the pre-trained Universal Sentence Encoder from Tensorflow Hub

In [None]:
#Tensorflow Imports
import tensorflow as tf
import tensorflow_hub as hub

pre_trained_use_embed_model = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

##### Generate sentence/phrase representations of the training and test text data using the above downloaded USE model

In [None]:
embeddings_features_train = []
embeddings_features_test = []
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embeddings_features_train.append(session.run(pre_trained_use_embed_model(features_train_processed)))
    embeddings_features_test.append(session.run(pre_trained_use_embed_model(features_test_processed)))

In [None]:
# Unpack the embedding matrices (one row per question)
question_embeddings_train = embeddings_features_train[0]
question_embeddings_test = embeddings_features_test[0]

In [None]:
# Inspect the embedding dimensionality (number of columns in the embedding matrix)
question_embeddings_train.shape[1]

### Model Definition for USE Q&A classification

We will use the Keras functional API (see the [Keras Functional API Guide](https://keras.io/getting-started/functional-api-guide/#first-example-a-densely-connected-network)) to build and train the USE Q&A classifier network.
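
The networks in this section all follow the pattern embedding input -> dense layer -> softmax layer. As a rough illustration of what that pattern computes for a single question, here is a forward pass in plain Python. The sizes and weights are hypothetical (a 4-dimensional embedding, 3 hidden units, 2 classes), and no training happens; Keras learns the weights and does the same arithmetic in batches.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embedding and weights, chosen by hand for illustration.
embedding = [0.5, -1.0, 0.25, 2.0]
hidden_w = [[0.1, 0.2, 0.3, 0.4], [-0.5, 0.5, -0.5, 0.5], [0.0, 0.0, 1.0, -1.0]]
hidden_b = [0.0, 0.1, 0.0]
output_w = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5]]
output_b = [0.0, 0.0]

hidden = relu(dense(embedding, hidden_w, hidden_b))
probabilities = softmax(dense(hidden, output_w, output_b))
print(hidden, probabilities)
```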

In [None]:
class QNAClassifier():
    """
    Q&A classifier class using Keras framework
    """
    
    def __init__(self, experiment_name):
        """
        Init function
        
        Parameters
        ----------
        experiment_name: String
            Name of the experiment. This will be used to name the model checkpoints.
            
        """
        
        #Exercise: Modify the below hyper parameters to create variations of the USE Q&A classifier model.
        self.patience = 10
        self.epochs = 100
        self.batch_size = 64
        
        self.experiment_name = experiment_name
        self.output_dir = 'models'
        self.class_count = 6
        self.model = None
        
        # Creating an output directory for the generated models.
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)

    
    def train_vanilla_nn(self, embeddings_train, labels_train, embeddings_test, labels_test):
        """
        Simple Feed forward neural network with 1 Dense layer to classify Q&A embeddings.
        
        Parameters
        ----------
        embeddings_train: Numpy Array
            USE embedding representation of the training set.
            
        labels_train: Numpy Array
            Categorical encoded labels for the training set.
            
        embeddings_test: Numpy Array
            USE embedding representation of the test set.
            
        labels_test: Numpy Array
            Categorical encoded labels for the test set.
            
        Returns
        -------
        Keras history object
            See: https://keras.io/callbacks/#history
        
        """
        
        # Network Architecture: Input Layer(Embeddings)-> Dense Layer -> Softmax layer
        # Exercise: Change the size of the hidden layer and the activation unit.
        embedding_inputs = Input(shape=(embeddings_train.shape[1],))
        x = Dense(256, activation='relu')(embedding_inputs)
        predictions = Dense(self.class_count, activation='softmax')(x)
        
        self.model = Model(inputs=embedding_inputs, outputs=predictions)
        self.model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
        
        # Keras Callbacks
        early_stopping = EarlyStopping(monitor='val_acc', patience=self.patience)
        model_filename = self.output_dir + "/" + self.experiment_name
        checkpoint = ModelCheckpoint(model_filename + '.{epoch:03d}-{val_acc:.4f}.hdf5',
                                     monitor='val_acc', verbose=1,
                                     save_best_only=True, mode='auto')
        
        # Start Training
        training_history = self.model.fit(embeddings_train, labels_train, 
                                          validation_data = (embeddings_test, labels_test),
                                          epochs= self.epochs,
                                          batch_size=self.batch_size,
                                          callbacks=[checkpoint, early_stopping])
        
        return training_history
    
    def train_vanilla_nn_cross_validated(self, embeddings_train, labels_train, embeddings_test, labels_test):
        """
        K-Fold Cross validated simple Feed forward neural network with 1 Dense layer to classify Q&A embeddings.
        
        Parameters
        ----------
        embeddings_train: Numpy Array
            USE embedding representation of the training set.
            
        labels_train: Numpy Array
            Categorical encoded labels for the training set.
            
        embeddings_test: Numpy Array
            USE embedding representation of the test set (not used by the cross-validation folds).
            
        labels_test: Numpy Array
            Categorical encoded labels for the test set.
            
        Returns
        -------
        list of training history objects
            Keras training history object. See: https://keras.io/callbacks/#history
            
        """

        early_stopping = EarlyStopping(monitor='val_acc', patience=self.patience)
        
        training_histories = []
        counter = 0
        
        #Exercise: Experiment with different number of splits.
        # shuffle=True is needed for random_state to have an effect (recent scikit-learn
        # versions raise an error when random_state is set while shuffle=False).
        kf = KFold(n_splits=3, random_state=42, shuffle=True)
        for train_index, test_index in kf.split(embeddings_train):
            
            # Network Architecture: Input Layer(Embeddings)-> Dense Layer -> Softmax layer
            # The model is rebuilt for every fold so that weights learned on one fold
            # do not leak into the next one.
            # Exercise: Change the size of the hidden layer and the activation unit.
            embedding_inputs = Input(shape=(embeddings_train.shape[1],))
            x = Dense(64, activation='relu')(embedding_inputs)
            predictions = Dense(self.class_count, activation='softmax')(x)
            
            self.model = Model(inputs=embedding_inputs, outputs=predictions)
            self.model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
            
            X_train, X_test = embeddings_train[train_index], embeddings_train[test_index]
            y_train, y_test = labels_train[train_index], labels_train[test_index]
            
            model_filename = self.output_dir + "/" + self.experiment_name + "_fold{}".format(counter)
            checkpoint = ModelCheckpoint(model_filename + '.{epoch:03d}-{val_acc:.4f}.hdf5',
                                         monitor='val_acc', verbose=1,
                                         save_best_only=True, mode='auto')
        
            # Start Training
            training_history = self.model.fit(X_train, y_train, 
                                              validation_data = (X_test, y_test),
                                              epochs= self.epochs,
                                              batch_size=self.batch_size,
                                              #Exercise: Add Tensorboard here
                                              callbacks=[checkpoint, early_stopping])
            
            print("-----------------------------\n")
            print("KSplit {} training complete\n".format(counter))
            print("-----------------------------\n")
            
            counter += 1
            
            training_histories.append(training_history)
        
        return training_histories
    
    def train_tuned_nn(self, embeddings_train, labels_train, embeddings_test, labels_test):
        
        """
        Tuned Feed forward neural network with 2 Dense layers and Dropout to classify Q&A embeddings.
        
        Parameters
        ----------
        embeddings_train: Numpy Array
            USE embedding representation of the training set.
            
        labels_train: Numpy Array
            Categorical encoded labels for the training set.
            
        embeddings_test: Numpy Array
            USE embedding representation of the test set.
            
        labels_test: Numpy Array
            Categorical encoded labels for the test set.
            
        Returns
        -------
        Keras history object
            See: https://keras.io/callbacks/#history
        
        """
        embedding_inputs = Input(shape=(embeddings_train.shape[1],))
        x = Dense(128, activation='relu')(embedding_inputs)
        # Added dropouts for regularization
        # Exercise: Change the value of dropouts.
        x = Dropout(0.5)(x)
        x = Dense(128, activation='relu')(x)
        predictions = Dense(self.class_count, activation='softmax')(x)
        
        # Keep the model on the instance, consistent with the other training methods
        self.model = Model(inputs=embedding_inputs, outputs=predictions)
        self.model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
        
        # Keras Callbacks
        early_stopping = EarlyStopping(monitor='val_acc', patience=self.patience)
        model_filename = self.output_dir + "/" + self.experiment_name
        checkpoint = ModelCheckpoint(model_filename + '.{epoch:03d}-{val_acc:.4f}.hdf5',
                                     monitor='val_acc', verbose=1,
                                     save_best_only=True, mode='auto')
        
        
        # Start Training
        training_history = self.model.fit(embeddings_train, labels_train, 
                                          validation_data = (embeddings_test, labels_test),
                                          epochs= self.epochs,
                                          batch_size=self.batch_size,
                                          #Exercise: Add Tensorboard here
                                          callbacks=[checkpoint, early_stopping])
        
        return training_history

##### Train Vanilla Neural Network with Pre-trained USE Embeddings

In [None]:
# Train Vanilla Neural Network with Pre-trained USE Embeddings
use_embedding_classifier = QNAClassifier("USE_Embedding_Model")
use_embedding_training_history = use_embedding_classifier.train_vanilla_nn(question_embeddings_train, labels_train_categorical,
                                                                          question_embeddings_test, labels_test_categorical)

##### Plot training history of above Vanilla Neural Network with Pre-trained USE Embeddings

In [None]:
plot_training_history(use_embedding_training_history)

##### Train Cross Validated Vanilla Neural Network with Pre-trained USE Embeddings
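K-fold cross-validation splits the training set into k parts and trains k times, each time holding out a different part for validation. The sketch below shows only the index bookkeeping, without shuffling so the folds are easy to read. The function name is hypothetical; in the notebook, scikit-learn's `KFold` does this job.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample when the split is uneven.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample lands in exactly one validation fold, so averaging the per-fold validation scores gives an estimate that does not depend on a single train/validation split.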

In [None]:
# Train Cross Validated Vanilla Neural Network with Pre-trained USE Embeddings
use_embedding_classifier = QNAClassifier("USE_Embedding_CV_Model")
use_embedding_training_history = use_embedding_classifier.train_vanilla_nn_cross_validated(question_embeddings_train, labels_train_categorical,
                                                                                           question_embeddings_test, labels_test_categorical)

##### Train Tuned Neural Network with Pre-trained USE Embeddings

In [None]:
# Train Tuned Neural Network with Pre-trained USE Embeddings
use_embedding_tuned_classifier = QNAClassifier("USE_Embedding_Tuned_Model")
use_embedding_tuned_training_history = use_embedding_tuned_classifier.train_tuned_nn(question_embeddings_train, labels_train_categorical,
                                                                          question_embeddings_test, labels_test_categorical)

##### Plot training history of above Tuned Neural Network with Pre-trained USE Embeddings

In [None]:
plot_training_history(use_embedding_tuned_training_history)

##### Print the test classification metrics for the above Tuned Neural Network with Pre-trained USE Embeddings

In [None]:
# Note: Please use the appropriate model path corresponding to your training step.
print(generate_classification_report(model_path = 'models/USE_Embedding_Tuned_Model.011-0.9140.hdf5', 
                                     label_encoder = label_encoder,
                                     test_features = question_embeddings_test,
                                     test_labels = labels_test))

***

## BERT: Bidirectional Encoder Representations from Transformers
<u>Reference Paper</u>: https://arxiv.org/abs/1810.04805<br>
<u>Announcement</u>: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

**BERT** was a state-of-the-art language model at the time of its release. It pre-trains deep bidirectional representations from unlabeled text (BooksCorpus and English Wikipedia) by jointly conditioning on both left and right context in all layers.
BERT's model architecture is a multi-layer stack of bidirectional Transformer encoders, released in the following 2 variants:
**BERT<sub>BASE</sub>** (L=12, H=768, A=12, Total Parameters=110M) and **BERT<sub>LARGE</sub>** (L=24, H=1024, A=16, Total Parameters=340M), where L is the number of layers, H the hidden size, and A the number of self-attention heads.

![BERT](images/BERT.png)


<h3>Data Prep for BERT Q&A classification</h3>

Since BERT is a pre-trained language model, fine-tuning expects input data in the same format that was used during BERT's pre-training. In a nutshell, we have to apply the following transformations to our input text:

- Lowercase our text (if we're using an uncased BERT model)
- Tokenize it (e.g. "sally says hi" -> ["sally", "says", "hi"])
- Break words into WordPieces (e.g. "calling" -> ["call", "##ing"])
- Map the WordPieces to indexes using the vocab file that BERT provides
- Add the special "[CLS]" (classification) and "[SEP]" (separator) tokens (see Section 3 of https://arxiv.org/pdf/1810.04805.pdf)
- Attach the "input mask" and "segment" ids to each input (see Section 3 of https://arxiv.org/pdf/1810.04805.pdf)
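
The transformations listed above can be sketched end to end on a toy example. This is an illustration under stated assumptions, not the BERT library's code: the nine-entry vocabulary and the helper names `wordpiece` and `to_features` are hypothetical, and punctuation handling is reduced to splitting off the question mark. The real tokenizer and vocabulary come with the pre-trained model.

```python
# Hypothetical toy vocabulary; the real one ships with the pre-trained BERT model.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "who", "is", "call", "##ing", "?"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def wordpiece(word):
    """Greedy longest-match-first split of one word into WordPieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            # Pieces that continue a word carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def to_features(text, max_seq_length):
    """Lowercase, tokenize, add special tokens, map to ids, then pad."""
    words = text.lower().replace("?", " ?").split()
    tokens = [piece for word in words for piece in wordpiece(word)]
    # Reserve two positions for [CLS] and [SEP]; longer inputs are truncated.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [TOKEN_TO_ID[t] for t in tokens]
    input_mask = [1] * len(input_ids)        # 1 marks a real token
    padding = max_seq_length - len(input_ids)
    input_ids += [TOKEN_TO_ID["[PAD]"]] * padding
    input_mask += [0] * padding              # 0 marks padding
    segment_ids = [0] * max_seq_length       # single-sentence input: all segment 0
    return tokens, input_ids, input_mask, segment_ids


tokens, input_ids, input_mask, segment_ids = to_features("Who is calling?", max_seq_length=8)
print(tokens)
print(input_ids, input_mask, segment_ids)
```

Because each question is a single sentence, every segment id is 0; sentence-pair tasks would mark the second sentence with 1.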


In [None]:
#BERT Imports: BERT Classification
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

##### Fortunately, there are libraries that transform our raw question text into a format that BERT understands
**bert.run_classifier.InputExample** is a data structure that stores the question text and its label in BERT's input format. The lambdas below only initialize these data structures; tokenization happens in a later step.

In [None]:
train_InputSamples = list(map(lambda x,y: bert.run_classifier.InputExample(guid=None, text_a=x, text_b=None, label=y),
                              features_train, labels_train))
test_InputSamples = list(map(lambda x,y: bert.run_classifier.InputExample(guid=None, text_a=x, text_b=None, label=y),
                              features_test, labels_test))

##### Download the pre-trained BERT base model and load up the BERT tokenizers to operate on our transformed Question text

In [None]:
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
tf.logging.set_verbosity(tf.logging.INFO)

def create_tokenizer_from_hub_module():
    """
    Load the pre-trained BERT model and extract the vocab file and tokenizer from TF HUB

    Returns
    -------
    BERT tokenizer object: bert.tokenization.FullTokenizer
        See: https://github.com/google-research/bert/blob/master/tokenization.py
        
    """

    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                                  tokenization_info["do_lower_case"]])

    return bert.tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

##### Time to run the pre-trained BERT tokenizer on our input Question text

In [None]:
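# Maximum number of WordPiece tokens per question, including the special tokens; longer questions are truncated
# Exercise: Modify this MAX_SEQ_LENGTH value to see how it affects the training process
MAX_SEQ_LENGTH = 20
# Sort the labels so that the label-to-id mapping is identical on every run
label_list = sorted(set(labels_train))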

train_features = bert.run_classifier.convert_examples_to_features(train_InputSamples, label_list, MAX_SEQ_LENGTH, tokenizer)
test_features = bert.run_classifier.convert_examples_to_features(test_InputSamples, label_list, MAX_SEQ_LENGTH, tokenizer)


### Model Definition for BERT Q&A classification

We'll use Tensorflow's Estimator API/Framework to train our fine-tuned BERT Q&A classification network. See https://www.tensorflow.org/guide/estimator
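
The classification head defined in this section turns the model's raw scores (logits) into log-probabilities and scores them against the true label with cross-entropy. As a minimal numeric sketch of that computation for one example, assuming three classes and hand-picked logits:

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shift = max(logits)
    log_total = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_total for v in logits]


def cross_entropy(logits, label, num_labels):
    """Negative log-probability assigned to the true label."""
    one_hot = [1.0 if i == label else 0.0 for i in range(num_labels)]
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(one_hot, log_probs))


logits = [2.0, 0.5, -1.0]  # hypothetical scores for 3 classes
log_probs = log_softmax(logits)
predicted = max(range(len(log_probs)), key=log_probs.__getitem__)
loss_if_label_0 = cross_entropy(logits, label=0, num_labels=3)
loss_if_label_2 = cross_entropy(logits, label=2, num_labels=3)
print(predicted, loss_if_label_0, loss_if_label_2)
```

The loss is small when the highest-scoring class is the true one and grows as the true class receives less probability, which is the signal fine-tuning minimizes.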

In [None]:
def bert_model(is_predicting, input_ids, input_mask, segment_ids, labels, num_labels):
    """
    Our Custom fine-tuning Q&A classifier definition using BERT output layers.

    Parameters
    ----------
    is_predicting: boolean
        Boolean variable to indicate Training or Prediction mode.

    input_ids: Tensor
        BERT vocab token index for each token of the input sample.

    input_mask: Tensor
        Flag to distinguish real tokens from padding (1: real token, 0: padding).

    segment_ids: Tensor
        Flag to indicate which sentence the token belongs to. (0: 1st sentence, 1:2nd sentence).
        
    labels: Numpy Array
        Classification label for the input.
        
    num_labels: integer
        Total number of labels

    Returns
    -------
    In Training/Evaluation Mode return (Loss, Predicted Labels, Log-probabilities per sample) tuple
    In Prediction Mode return (Predicted Labels, Log-probabilities per sample) tuple

    """
    bert_module = hub.Module( BERT_MODEL_HUB,trainable=True)
    bert_inputs = dict( input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids)
    bert_outputs = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)

    # Use "pooled_output" for classification tasks on an entire sentence.
    # Use "sequence_output" for token-level output.
    output_layer = bert_outputs["pooled_output"]

    hidden_size = output_layer.shape[-1].value

    # Tunable layer.
    output_weights = tf.get_variable("output_weights", [num_labels, hidden_size],
                                     initializer=tf.truncated_normal_initializer(stddev=0.02))

    output_bias = tf.get_variable("output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):

        # Dropout helps prevent overfitting
        output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        # Convert labels into one-hot encoding
        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

        predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
        # If we're predicting, we want predicted labels and the log-probabilities.
        if is_predicting:
            return (predicted_labels, log_probs)

        # If we're train/eval, compute loss between predicted and actual label
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
        
        return (loss, predicted_labels, log_probs)

##### Estimator driver logic for Training, Evaluation and Predict modes

In [None]:
def model_fn_builder(num_labels, learning_rate, num_train_steps, num_warmup_steps):
    """
    Estimator driver logic for Training, Evaluation and Predict modes
    
    Parameters
    ----------
    num_labels : integer
        Total number of labels
        
    learning_rate : float
        Learning rate for underlying neural network
        
    num_train_steps: integer
        Number of steps to train ((Sample Size / Batch Size) * Number of Epochs)
        
    num_warmup_steps: integer
        Number of initial training steps during which the learning rate is warmed up
        
    Returns
    -------
    model_fn closure: Python Object
        Returns a closure of the driver logic
    
    """

    def model_fn(features, labels, mode, params):
        """
        Definition for Training, Evaluation and Predict modes
        
        Parameters
        ----------
        features: Dictionary
            Training/Test features
            
        labels: Tensor
            Train/Test labels
            
        mode: tf.estimator.ModeKeys
            TRAIN, EVAL or PREDICT
            
        params: Dictionary
            Dict with training hyperparams
            
        Returns
        -------
        EstimatorSpec: tf.estimator.EstimatorSpec
            https://www.tensorflow.org/api_docs/python/tf/estimator/EstimatorSpec
        
        """

        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]

        is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)

        # TRAIN and EVAL
        if not is_predicting:

            # Get BERT model definition
            (loss, predicted_labels, log_probs) = bert_model(is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

            train_op = bert.optimization.create_optimizer(loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)

            # Calculate evaluation metrics.
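            def metric_fn(label_ids, predicted_labels):
                """
                Function to calculate training/evaluation metrics
                """
                
                # Share of questions whose predicted class id equals the true class id
                multiclass_accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
                # Note: the tf.metrics below cast their inputs to booleans, so with more than
                # two classes they compare "class id 0" against "any other class id".
                recall = tf.metrics.recall(label_ids, predicted_labels)
                precision = tf.metrics.precision(label_ids, predicted_labels)
                true_pos = tf.metrics.true_positives(label_ids, predicted_labels)
                true_neg = tf.metrics.true_negatives(label_ids, predicted_labels)
                false_pos = tf.metrics.false_positives(label_ids, predicted_labels)
                false_neg = tf.metrics.false_negatives(label_ids, predicted_labels)
                
                return {
                    "multiclass_accuracy": multiclass_accuracy,
                    "precision": precision,
                    "recall": recall,
                    "true_positives": true_pos,
                    "true_negatives": true_neg,
                    "false_positives": false_pos,
                    "false_negatives": false_neg
                }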

            eval_metrics = metric_fn(label_ids, predicted_labels)

            if mode == tf.estimator.ModeKeys.TRAIN:
                return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
            else:
                return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metrics)
        else:
            (predicted_labels, log_probs) = bert_model(is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

            predictions = {
                'probabilities': log_probs,
                'labels': predicted_labels
            }
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    # Return the actual model function in the closure
    return model_fn


##### Define hyperparameters for training
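The hyperparameters in this section determine how many optimizer steps are run and how many of them are warmup steps. The sketch below reproduces that arithmetic and pairs it with a linear-warmup, linear-decay learning-rate curve. The curve is an assumption about the shape of the schedule used by the BERT reference optimizer rather than a copy of its code, and the dataset size of 5000 is hypothetical.

```python
def training_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Total optimizer steps and the number of warmup steps among them."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Assumed schedule: linear warmup to base_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr * max(0.0, 1.0 - step / num_train_steps)


# 5000 examples is a hypothetical dataset size; the other values match this notebook.
num_train_steps, num_warmup_steps = training_steps(
    num_examples=5000, batch_size=64, num_epochs=5.0, warmup_proportion=0.1)

for step in (0, 20, num_warmup_steps, 200, num_train_steps):
    print(step, learning_rate_at(step, 1e-5, num_train_steps, num_warmup_steps))
```

Warmup keeps the first updates small while the newly added classification layer is still randomly initialized.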

In [None]:
# Exercise: Modify the below values and observe the change in the training process
# Training configs
BATCH_SIZE = 64
LEARNING_RATE = 1e-5
NUM_TRAIN_EPOCHS = 5.0
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 10
SAVE_SUMMARY_STEPS = 10

In [None]:
# Compute # train and warmup steps from batch size
num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

In [None]:
# Specify the output directory and how often (in steps) to save summaries and checkpoints
run_config = tf.estimator.RunConfig(model_dir='models',
                                    save_summary_steps=SAVE_SUMMARY_STEPS,
                                    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS)

In [None]:
model_fn = model_fn_builder(num_labels=len(label_list), learning_rate=LEARNING_RATE,
                            num_train_steps=num_train_steps, num_warmup_steps=num_warmup_steps)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config, params={"batch_size": BATCH_SIZE})

In [None]:
# Create an input function for training. drop_remainder = True for using TPUs.
train_input_fn = bert.run_classifier.input_fn_builder( features=train_features, seq_length=MAX_SEQ_LENGTH,
                                                      is_training=True, drop_remainder=False)

##### Start Training

In [None]:
print('Start Training')
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print("End Training")

##### Visualize the training metrics on the Tensorboard

In [None]:
import tensorboard
!tensorboard --logdir .

##### Evaluate the trained model
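The evaluation cells in this section derive accuracy and F1 from true/false positive and negative counts. TensorFlow 1.x's `tf.metrics.precision`, `recall`, and the confusion-count metrics cast labels and predictions to booleans, so with six classes it is worth cross-checking against metrics computed per class. A plain-Python sketch of multi-class accuracy and macro-averaged F1, on made-up labels, looks like this:

```python
def accuracy(y_true, y_pred):
    """Share of examples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Made-up class ids for six examples across three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro averaging weights every class equally, which matters for TREC because some coarse classes are much rarer than others.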

In [None]:
test_input_fn = run_classifier.input_fn_builder(features=test_features, seq_length=MAX_SEQ_LENGTH,
                                                is_training=False, drop_remainder=False)

In [None]:
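metrics = estimator.evaluate(input_fn=test_input_fn, steps=None)
# Note: the confusion counts returned by tf.metrics treat the class ids as booleans
# (class id 0 vs. every other class id), so the two scores derived below are binary, not per-class.
metrics["accuracy"] = (metrics["true_positives"] + metrics["true_negatives"])/(metrics["true_positives"] + metrics["true_negatives"]+metrics["false_positives"] + metrics["false_negatives"])
metrics["f1_score"] = (2*metrics["precision"]*metrics["recall"])/(metrics["precision"]+metrics["recall"])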

In [None]:
metrics

### Conclusions:

As seen from the chart below, although BERT outperforms most previous state-of-the-art techniques, that alone does not justify using such a model as a standalone solution to business problems. An ensemble or hybrid approach is often better suited to business needs. The set of metrics that best measures the success of the chosen modeling techniques should ultimately drive the modeling decisions.

Because BERT was pre-trained on a general-purpose corpus that includes Wikipedia, it performs extremely well on the TREC question-classification dataset. Real-world business problems are rarely generic in nature: their datasets are highly domain specific, which may require retraining a language model like BERT (for a couple of days with lots of GPU/TPU power), and even then the results can be unfavorable. Hence deep learning should never be the first approach to a business problem; rather, a systematic investigation of the data and a step-by-step exploration of algorithms should guide your problem-solving process.

![Conclusion](images/Conclusion.png)

 

### References:
- Universal Sentence Encoder: https://arxiv.org/abs/1803.11175
- Tensorflow Estimator: https://www.tensorflow.org/guide/estimator
- BERT: https://arxiv.org/abs/1810.04805