# Milestone Project 2 - Skim Lit

The purpose of this project is to take abstracts from medical research papers and classify each sentence by the section it belongs to (its `target`), so that an abstract can be skimmed more easily. This is based on the PubMed 200k RCT paper, which performs the same experiment (see link below).

* https://arxiv.org/abs/1710.06071

## Environment Setup

I need to figure out whether I'm on Google Colab or running locally. This determines which commands need to be run and how to set up the CPU/GPU being used.

In [None]:
# Determining if on google colab
try:
  from google import colab
  on_colab = True
except Exception:
  on_colab = False

on_colab

In [None]:
if on_colab:
  # Setting up the notebook with a GPU
  !nvidia-smi -L
  !pip install py-learning-toolbox@git+https://github.com/bkubick/py-learning-toolbox.git
  !pltb_setup_project .
  !rm -rf ./notebooks

## Imports

In [None]:
from dataclasses import dataclass
import random
import requests
import string
import typing

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from py_learning_toolbox import dl_toolbox
from py_learning_toolbox import data_toolbox
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf
import tensorflow_hub as hub

## Helpers

In [None]:
def preprocess_pubmed_txt_data(url: str) -> typing.List[typing.Dict[str, typing.Any]]:
    """ Preprocessing function that grabs the data from the corresponding url, then
        preprocesses it to clean it up into a list of dictionaries with the following keys:

        - abstract_id
        - target
        - text
        - line_number
        - total_lines

        Args:
            url (str): the corresponding url the data is grabbed from.

        Raises:
            AssertionError: if a chunk does not start with `###`, which means the
                abstracts are not separated by a blank line as expected.

        Returns:
            (List[Dict]): the properly structured data.
    """
    raw_abstract_data = data_toolbox.read_txt_file_from_url(url, delimiter='\n\n')

    processed_data = []
    for abstract in raw_abstract_data:
        if len(abstract) == 0:
            continue

        # Verify the raw abstract item represents the start of a new abstract
        assert abstract.startswith('###')

        abstract_lines = abstract.split('\n')
        abstract_id = abstract_lines[0][3:]  # Abstract id is the first item in the split list, and do not include `###`
        total_lines = len(abstract_lines) - 2  # Doesn't include the abstract id line and starts from 0

        for line_number, line in enumerate(abstract_lines[1:]):
            [target, text] = line.split('\t')
            processed_data.append({
                'abstract_id': abstract_id,
                'target': target,
                'text': text.lower(),
                'line_number': line_number,
                'total_lines': total_lines,
            })

    return processed_data

In [None]:
def split_chars(text: str) -> str:
    return ' '.join(list(text))

In [None]:
# Combining token and character dataset for this specific model
def concatenate_datasets(datasets, labels) -> tf.data.Dataset:
    concatenated_data = tf.data.Dataset.from_tensor_slices(tuple(datasets))
    labels_data = tf.data.Dataset.from_tensor_slices(labels)
    concatenated_dataset = tf.data.Dataset.zip((concatenated_data, labels_data))

    return concatenated_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
def generate_results_df(results: typing.List[dl_toolbox.analysis.classification.PredictionMetrics]) -> pd.DataFrame:
    all_results = {}
    for model_number, prediction_results in enumerate(results):
        all_results[f'model_{model_number}'] = dict(prediction_results)

    return pd.DataFrame(all_results).transpose()


## Download & Analyze Data

The data used in the paper is publicly available at the GitHub link listed below.

* https://github.com/Franck-Dernoncourt/pubmed-rct

In [None]:

pubmed_data_urls = {
    'test_url_20k': 'https://raw.githubusercontent.com/Franck-Dernoncourt/pubmed-rct/master/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',
    'dev_url_20k': 'https://raw.githubusercontent.com/Franck-Dernoncourt/pubmed-rct/master/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
    'train_url_20k': 'https://raw.githubusercontent.com/Franck-Dernoncourt/pubmed-rct/master/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt',
}

In [None]:
raw_train_abstracts = data_toolbox.read_txt_file_from_url(pubmed_data_urls['train_url_20k'], '\n\n')
len(raw_train_abstracts)

In [None]:
raw_train_abstracts[:2]

### Preprocessing Notes

After looking at the data, each abstract includes an abstract id, the target section it talks about, and the actual text. To make this usable, I am going to structure it as a list of dictionaries that contain the following keys:

* abstract_id
* line_number
* target
* text
* total_lines

This will be done using the function, `preprocess_pubmed_txt_data`, created in the **Helpers** section of this notebook.
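
As a rough illustration of that structure, the sketch below parses a tiny in-memory sample instead of downloading the real file. The sample ids and sentences are made up, and `parse_abstracts` is a hypothetical stand-in that only mirrors the logic of `preprocess_pubmed_txt_data`; it is not the helper itself.

```python
import typing

# Hypothetical two-abstract sample in the same layout as the data files:
# a `###<id>` header, one `LABEL<TAB>sentence` line per sentence, and a
# blank line between abstracts. Ids and sentences are made up.
SAMPLE = (
    "###0001\n"
    "OBJECTIVE\tTo test a parser .\n"
    "METHODS\tWe split each line on a tab .\n"
    "RESULTS\tEvery line produced one record .\n"
    "\n"
    "###0002\n"
    "BACKGROUND\tShort abstracts exist .\n"
    "CONCLUSIONS\tThey parse too .\n"
)


def parse_abstracts(raw: str) -> typing.List[typing.Dict[str, typing.Any]]:
    records = []
    for abstract in raw.split("\n\n"):
        lines = [line for line in abstract.split("\n") if line]
        if not lines:
            continue
        header, sentence_lines = lines[0], lines[1:]
        abstract_id = header[3:]  # drop the leading `###`
        for line_number, line in enumerate(sentence_lines):
            target, text = line.split("\t")
            records.append({
                "abstract_id": abstract_id,
                "target": target,
                "text": text.lower(),
                "line_number": line_number,
                # zero-based index of the last sentence in the abstract
                "total_lines": len(sentence_lines) - 1,
            })
    return records


records = parse_abstracts(SAMPLE)
print(records[0])
```

Each sentence becomes one record, so the two toy abstracts yield five records in total.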

In [None]:
preprocessed_train_data = preprocess_pubmed_txt_data(pubmed_data_urls['train_url_20k'])
preprocessed_val_data = preprocess_pubmed_txt_data(pubmed_data_urls['dev_url_20k'])
preprocessed_test_data = preprocess_pubmed_txt_data(pubmed_data_urls['test_url_20k'])

len(preprocessed_train_data), len(preprocessed_val_data), len(preprocessed_test_data)

In [None]:
preprocessed_train_data[:12]

In [None]:
train_df = pd.DataFrame(preprocessed_train_data)
test_df = pd.DataFrame(preprocessed_test_data)
val_df = pd.DataFrame(preprocessed_val_data)

train_df.head(12)

In [None]:
train_df.target.value_counts()

In [None]:
train_df.total_lines.plot.hist()

In [None]:
train_sentences = train_df.text.tolist()
test_sentences = test_df.text.tolist()
val_sentences = val_df.text.tolist()

len(train_sentences), len(test_sentences), len(val_sentences)

### Text to Numeric Preprocessing

In [None]:
# One hot encode labels
one_hot_encoder = OneHotEncoder(sparse=False)

train_labels_one_hot = one_hot_encoder.fit_transform(train_df.target.to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df.target.to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df.target.to_numpy().reshape(-1, 1))

# Check what training labels look like
train_labels_one_hot

In [None]:
# Label Encode Labels
label_encoder = LabelEncoder()

train_labels_encoded = label_encoder.fit_transform(train_df.target.to_numpy())
# Only transform here so the validation and test labels reuse the mapping learned from the training labels
val_labels_encoded = label_encoder.transform(val_df.target.to_numpy())
test_labels_encoded = label_encoder.transform(test_df.target.to_numpy())

train_labels_encoded

In [None]:
num_classes = len(label_encoder.classes_)
class_names = label_encoder.classes_

num_classes, class_names

#### Creating Datasets

I am going to set up the data to run as fast as possible using the TensorFlow `tf.data` API. TensorFlow datasets are designed to optimize performance when training, validating, and testing by using both the CPU and GPU as efficiently as possible.

To utilize this functionality, we must create datasets that can be used when experimenting with models.
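
To make the batching step concrete, here is a plain-Python analogue of what `.batch(32)` does to a dataset. It is only a sketch: `batched` is a hypothetical helper, the 100 integers stand in for (sentence, label) pairs, and none of the prefetching that `tf.data` performs is modelled.

```python
import math
import typing


def batched(items: typing.Sequence, batch_size: int) -> typing.Iterator[typing.Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


samples = list(range(100))           # stand-in for 100 (sentence, label) pairs
batches = list(batched(samples, 32))

num_batches = len(batches)           # same as ceil(100 / 32)
last_batch_size = len(batches[-1])   # the final batch holds the remainder
print(num_batches, last_batch_size)
```

The length of a batched dataset is therefore the number of batches, not the number of samples, which is why the training cells below compute their step counts from `len(train_dataset)`.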

#### Sentence Datasets

In [None]:
# Creating the final dataset to be used
train_slice_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
val_slice_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))
test_slice_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot))

train_slice_dataset

In [None]:
# Take the TensorSliceDataset and turn it into a batched, prefetched dataset
train_dataset = train_slice_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
val_dataset = val_slice_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_slice_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_dataset

#### Character Datasets

In [None]:
train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]

train_chars[:5]

In [None]:
# Creating the final dataset to be used
train_chars_slice_dataset = tf.data.Dataset.from_tensor_slices((train_chars, train_labels_one_hot))
val_chars_slice_dataset = tf.data.Dataset.from_tensor_slices((val_chars, val_labels_one_hot))
test_chars_slice_dataset = tf.data.Dataset.from_tensor_slices((test_chars, test_labels_one_hot))

train_chars_slice_dataset

In [None]:
# Take the TensorSliceDataset and turn it into a batched, prefetched dataset
train_char_dataset = train_chars_slice_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
val_char_dataset = val_chars_slice_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_char_dataset = test_chars_slice_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_char_dataset

## Experiments

The models I am building will all be compared against Model-0 (baseline), which uses a Naive Bayes ML model rather than deep learning.

0. Model-0 (Baseline): Naive Bayes w/ TF-IDF Encoder
1. Model-1: Conv1D w/ Token Embeddings
2. Model-2: TensorFlow Hub Pretrained Feature Extractor
3. Model-3: Conv1D w/ Character Embeddings
4. Model-4: Pretrained Token Embeddings (same as Model-2) + Character Embeddings (same as Model-3)
5. Model-5: Pretrained Token Embeddings + Character Embeddings + Positional Embeddings

### Preprocessing Layer Setup

Many embedding layers will be reused for more than one of the experiments mentioned above. These steps set up the layers so that they can be shared across multiple models. The layers (and inputs) to be made are:

* `text_vectorizer` (TextVectorization)
* `token_embedding` (Embedding)
* `character_vectorizer` (TextVectorization)
* `character_embedding` (Embedding)
* positional features (one-hot encoded `line_number` and `total_lines`)

#### Token Embeddings Layers

These layers consist of the `text_vectorizer` and the `token_embedding` layers that will be reused.
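
Conceptually, the vectorizer maps each word to an integer id and pads or truncates to a fixed length. The sketch below is a simplified pure-Python approximation, not the Keras layer: `build_vocab` and `vectorize` are hypothetical names, the toy sentences are made up, and only lowercasing and whitespace splitting are modelled. It does follow the layer's convention of reserving index 0 for padding and index 1 for `[UNK]`.

```python
from collections import Counter
import typing


def build_vocab(sentences: typing.List[str], max_tokens: int) -> typing.List[str]:
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(sentence: str, vocab: typing.List[str], output_sequence_length: int) -> typing.List[int]:
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in sentence.lower().split()]
    ids = ids[:output_sequence_length]                       # truncate long sentences
    return ids + [0] * (output_sequence_length - len(ids))   # pad short ones with 0


toy_sentences = ["the drug reduced the pain", "the placebo did not reduce pain"]
vocab = build_vocab(toy_sentences, max_tokens=10)
vector = vectorize("the drug increased pain", vocab, output_sequence_length=6)
print(vocab[:4])
print(vector)
```

The unseen word "increased" maps to the `[UNK]` id, and the trailing zeros are the padding that `mask_zero=True` lets the embedding layer ignore.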

In [None]:
# Find average number of tokens
sent_lens = [len(i.split()) for i in train_sentences]
round(sum(sent_lens) / len(train_sentences))

In [None]:
# How long of a sentence covers 95% of the examples?
int(np.percentile(sent_lens, 95))

In [None]:
# Setup text vectorization params
max_vocab_length = 68000  # Max words to have in our vocab
max_length = 55  # Max length our sequence will be (95% of examples are within length of 55)
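
The idea behind choosing a 95th-percentile length can be checked by hand on a small list. The sketch below uses a nearest-rank rule rather than the interpolation that `np.percentile` applies, so values may differ slightly from NumPy's; `length_covering` and the toy lengths are illustrative assumptions.

```python
import math
import typing


def length_covering(lengths: typing.List[int], percent: int) -> int:
    """Smallest length L such that at least `percent`% of items have length <= L."""
    ordered = sorted(lengths)
    rank = math.ceil(percent * len(ordered) / 100)  # how many items must be covered
    return ordered[max(rank, 1) - 1]


toy_lengths = [5, 8, 9, 12, 14, 15, 18, 21, 30, 120]  # made-up token counts
cutoff = length_covering(toy_lengths, 90)
covered = sum(length <= cutoff for length in toy_lengths) / len(toy_lengths)
print(cutoff, covered)
```

A single very long outlier (120 here) would otherwise force every sequence to be padded to its length, which is the cost the percentile cutoff avoids.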

In [None]:
# Setting up a text vectorization layer (tokenization)
text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=max_vocab_length,  # How many words in the vocabulary
    output_sequence_length=max_length)  # Pads (adds 0s to the end of each sequence) so all outputs have the same length

# Adapt the vectorizer to the training data
text_vectorizer.adapt(train_sentences)

In [None]:
# Getting the words in the vocab from the training data
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
least_common_5_words = words_in_vocab[-5:]
len(words_in_vocab), top_5_words, least_common_5_words

In [None]:
# Config of text vectorizer layer
text_vectorizer.get_config()

In [None]:
# Setting up the Embedding layer
token_embedding = tf.keras.layers.Embedding(input_dim=len(words_in_vocab),
                                            output_dim=128,  # GPUs work well when the number is divisible by 8
                                            mask_zero=True,
                                            name='token_embedding')
token_embedding

In [None]:
# Testing out an example sentence
target_sentence = random.choice(train_sentences)

# Looking at the steps of tokenization
print(f'Sentence before vectorization: \n {target_sentence}')
vectorized_sentence = text_vectorizer([target_sentence])
print(f'Sentence after vectorization: \n {vectorized_sentence}')
embedded_sentence = token_embedding(vectorized_sentence)
print(f'Sentence after embedding: \n {embedded_sentence}')
print(f'Embedded sentence shape: {embedded_sentence.shape}')

#### Character Embedding Layer

A character level embedding layer will require tokenizing characters before creating the embedding layer.
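
The sketch below shows what the `split_chars` helper produces and how whitespace splitting then turns it into character tokens. Here `str.split()` is only a stand-in for the vectorizer's whitespace split, and the layer's default standardization (lowercasing and punctuation stripping) is not modelled.

```python
from collections import Counter


def split_chars(text: str) -> str:
    return ' '.join(list(text))


spaced = split_chars('a b!')
tokens = spaced.split()          # whitespace splitting drops the original space
char_counts = Counter(tokens)

print(repr(spaced))
print(tokens)
```

One consequence worth noting: because the original space sits between two separator spaces, it never survives as a token of its own.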

In [None]:
# What's the average character length?
chars_lens = [len(sentence) for sentence in train_sentences]
mean_chars_lens = np.mean(chars_lens)

mean_chars_lens

In [None]:
plt.hist(chars_lens, bins=7);

In [None]:
# Find character length for 95% of sentences
output_sequence_len = int(np.percentile(chars_lens, 95))
output_sequence_len

In [None]:
# Figuring out the total alpha-numeric characters
alphabet = string.ascii_lowercase + string.digits + string.punctuation

alphabet

In [None]:
NUM_CHAR_TOKENS = len(alphabet) + 2  # Add 2 for space and OOV token ([UNK])
NUM_CHAR_TOKENS

In [None]:
character_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=NUM_CHAR_TOKENS,
    output_sequence_length=output_sequence_len,
    name='char_vectorizer')

In [None]:
character_vectorizer.adapt(train_chars)

In [None]:
char_vocab = character_vectorizer.get_vocabulary()
print(f'Number of Different Characters: {len(char_vocab)}')
print(f'5 Most Common Characters: {char_vocab[:5]}')
print(f'5 Least Common Characters: {char_vocab[-5:]}')

In [None]:
random_train_chars = random.choice(train_chars)

print(f'Text:\n{random_train_chars}')
print(f'Length: {len(random_train_chars.split())}')

vectorized_chars = character_vectorizer([random_train_chars])
print(f'Vectorized Chars:\n {vectorized_chars}')
print(f'Length of Vectorized Chars: {len(vectorized_chars[0])}')

In [None]:
character_embedding = tf.keras.layers.Embedding(input_dim=len(char_vocab),
                                                output_dim=25,
                                                mask_zero=True,
                                                name='character_embedding')
character_embedding

In [None]:
# Testing out an example sentence
target_chars = random.choice(train_chars)

# Looking at the steps of tokenization
print(f'Chars before vectorization: \n {target_chars}')

vectorized_chars = character_vectorizer([target_chars])
print(f'Chars after vectorization: \n {vectorized_chars}')
embedded_chars = character_embedding(vectorized_chars)
print(f'Chars after embedding: \n {embedded_chars}')
print(f'Embedded chars shape: {embedded_chars.shape}')

#### Positional Embedding Layer

A positional embedding layer will look at the position of each sentence within the abstract.
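
Here the position is encoded as a one-hot vector with a fixed depth. The pure-Python sketch below mirrors the behaviour of `tf.one_hot`, including the detail that an index outside the depth yields an all-zero vector rather than an error; the `one_hot` function and the toy depth of 5 are illustrative only.

```python
import typing


def one_hot(index: int, depth: int) -> typing.List[float]:
    """One-hot encode `index`; indices outside [0, depth) map to all zeros."""
    return [1.0 if position == index else 0.0 for position in range(depth)]


depth = 5  # toy depth; the notebook derives its depth from a percentile
in_range = one_hot(2, depth)
out_of_range = one_hot(7, depth)
print(in_range)
print(out_of_range)
```

This is why capping the depth at a percentile is safe: the few lines beyond the cutoff simply lose their positional signal.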

In [None]:
train_df['line_number'].value_counts()

In [None]:
train_df['line_number'].plot.hist()

In [None]:
# See what number of lines covers 98% of the samples
max_lines = int(np.percentile(train_df['line_number'].to_numpy(), 98))
max_lines

In [None]:
# Encoding line_number tensors
train_line_numbers_one_hot = tf.one_hot(train_df['line_number'].to_numpy(), depth=max_lines)
val_line_numbers_one_hot = tf.one_hot(val_df['line_number'].to_numpy(), depth=max_lines)
test_line_numbers_one_hot = tf.one_hot(test_df['line_number'].to_numpy(), depth=max_lines)

train_line_numbers_one_hot[:15], train_line_numbers_one_hot.shape

In [None]:
train_line_number_sliced_dataset = tf.data.Dataset.from_tensor_slices((train_line_numbers_one_hot, train_labels_one_hot))
train_line_number_dataset = train_line_number_sliced_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

val_line_number_sliced_dataset = tf.data.Dataset.from_tensor_slices((val_line_numbers_one_hot, val_labels_one_hot))
val_line_number_dataset = val_line_number_sliced_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

test_line_number_sliced_dataset = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot, test_labels_one_hot))
test_line_number_dataset = test_line_number_sliced_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_line_number_dataset, val_line_number_dataset, test_line_number_dataset

In [None]:
train_df['total_lines'].value_counts()

In [None]:
train_df['total_lines'].plot.hist()

In [None]:
max_total_lines = int(np.percentile(train_df['total_lines'], 98))
max_total_lines

In [None]:
# Encoding total_lines tensors
train_total_lines_one_hot = tf.one_hot(train_df['total_lines'].to_numpy(), depth=max_total_lines)
val_total_lines_one_hot = tf.one_hot(val_df['total_lines'].to_numpy(), depth=max_total_lines)
test_total_lines_one_hot = tf.one_hot(test_df['total_lines'].to_numpy(), depth=max_total_lines)

train_total_lines_one_hot[:15], train_total_lines_one_hot.shape

In [None]:
# Creating total lines datasets
train_total_lines_sliced_dataset = tf.data.Dataset.from_tensor_slices((train_total_lines_one_hot, train_labels_one_hot))
train_total_lines_dataset = train_total_lines_sliced_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

val_total_lines_sliced_dataset = tf.data.Dataset.from_tensor_slices((val_total_lines_one_hot, val_labels_one_hot))
val_total_lines_dataset = val_total_lines_sliced_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

test_total_lines_sliced_dataset = tf.data.Dataset.from_tensor_slices((test_total_lines_one_hot, test_labels_one_hot))
test_total_lines_dataset = test_total_lines_sliced_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_total_lines_dataset, val_total_lines_dataset, test_total_lines_dataset

### Model-0 (Baseline): Naive Bayes w/ TF-IDF Encoder
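
For intuition about the features the baseline sees, the sketch below computes TF-IDF weights by hand for two made-up, pre-tokenized documents. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's default settings; the `tfidf` function itself is a hypothetical illustration and skips tokenization entirely.

```python
import math
from collections import Counter
import typing


def tfidf(docs: typing.List[typing.List[str]]) -> typing.List[typing.Dict[str, float]]:
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


toy_docs = [["pain", "was", "reduced"], ["pain", "was", "measured"]]
vectors = tfidf(toy_docs)
print(vectors[0])
```

Words that appear in every document ("pain", "was") end up with lower weights than the word unique to one document ("reduced").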

In [None]:
model_0 = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

In [None]:
model_0.fit(X=train_sentences, y=train_labels_encoded)

In [None]:
model_0.score(X=val_sentences, y=val_labels_encoded)

In [None]:
model_0_preds = model_0.predict(val_sentences)
model_0_preds

In [None]:
model_0_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, model_0_preds)
model_0_results

### Model-1: Conv1D w/ Token Embeddings

In [None]:
# Build Model
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)

text_vectors = text_vectorizer(inputs)
token_embeddings = token_embedding(text_vectors)
x = tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation='relu', padding='same')(token_embeddings)
x = tf.keras.layers.GlobalMaxPooling1D()(x)

outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

model_1 = tf.keras.models.Model(inputs, outputs)
model_1.summary()

In [None]:
# Compile Model
model_1.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [None]:
# Fit Model
model_1_history = model_1.fit(train_dataset,
                              steps_per_epoch=int(0.1 * len(train_dataset)),  # Only going to look at 10% of data to speed up experimentation
                              epochs=3,
                              validation_data=val_dataset,
                              validation_steps=int(0.1 * len(val_dataset)),  # Only going to look at 10% of data to speed up experimentation
                              callbacks=[])

In [None]:
dl_toolbox.analysis.history.plot_history(model_1_history, 'accuracy')

In [None]:
model_1.evaluate(val_dataset)

In [None]:
model_1_pred_probs = model_1.predict(val_dataset)
model_1_pred_probs[:10], model_1_pred_probs.shape

In [None]:
# Get the max index
model_1_pred = tf.argmax(model_1_pred_probs, axis=1)
model_1_pred

In [None]:
model_1_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, model_1_pred)
model_1_results

### Model-2: TensorFlow Hub Pretrained Feature Extractor

This model will use Transfer Learning with the `Universal Sentence Encoder` pretrained model on TensorFlow Hub (see link below). This model will not allow fine-tuning of the pretrained model.

* https://tfhub.dev/google/collections/universal-sentence-encoder/1

In [None]:
use_url = 'https://tfhub.dev/google/universal-sentence-encoder/4'

In [None]:
# Build Model
inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
pretrained_embedding = hub.KerasLayer(use_url, trainable=False, name='USE')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(pretrained_embedding)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

model_2 = tf.keras.models.Model(inputs, outputs, name='Model2USE')
model_2.summary()

In [None]:
# Compile Model
model_2.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [None]:
# Fit Model
model_2_history = model_2.fit(train_dataset,
                              steps_per_epoch=int(0.1 * len(train_dataset)),  # Only going to look at 10% of data to speed up experimentation
                              epochs=3,
                              validation_data=val_dataset,
                              validation_steps=int(0.1 * len(val_dataset)),  # Only going to look at 10% of data to speed up experimentation
                              callbacks=[])

In [None]:
dl_toolbox.analysis.history.plot_history(model_2_history, 'accuracy')

In [None]:
model_2_pred_probs = model_2.predict(val_dataset)
model_2_pred_probs

In [None]:
model_2_preds = tf.argmax(model_2_pred_probs, axis=1)
model_2_preds

In [None]:
model_2_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, model_2_preds)
model_2_results

#### Findings

It looks like Model-2's predictions are significantly worse than those of both previous models (Model-0 and Model-1).

### Model-3: Conv1D w/ Character Embeddings

A character embedding layer creates an embedding for each character rather than for each word.

In [None]:
# Build Model
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)

vectorized_chars = character_vectorizer(inputs)
embedded_chars = character_embedding(vectorized_chars)
x = tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation='relu', padding='same')(embedded_chars)
x = tf.keras.layers.GlobalMaxPooling1D()(x)

outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

model_3 = tf.keras.models.Model(inputs, outputs, name='Model3CharEmbeddingConv1D')
model_3.summary()

In [None]:
model_3.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [None]:
model_3_history = model_3.fit(train_char_dataset,
                              steps_per_epoch=int(0.1 * len(train_char_dataset)),  # Only going to look at 10% of data to speed up experimentation
                              epochs=3,
                              validation_data=val_char_dataset,
                              validation_steps=int(0.1 * len(val_char_dataset)),  # Only going to look at 10% of data to speed up experimentation
                              callbacks=[])

In [None]:
dl_toolbox.analysis.history.plot_history(model_3_history, 'accuracy')

In [None]:
model_3_pred_probs = model_3.predict(val_char_dataset)
model_3_pred_probs

In [None]:
model_3_preds = tf.argmax(model_3_pred_probs, axis=1)
model_3_preds

In [None]:
model_3_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, model_3_preds)
model_3_results

### Model-4: Pretrained Token Embeddings (same as Model-2) + Character Embeddings (same as Model-3)

This model combines a pretrained token embedding model with the character embedding model by concatenating the outputs of the two.

1. Create a token level embedding model (similar to Model-2)
2. Create a character level embedding model (similar to Model-3 with a slight modification)
3. Combine the two models using a concatenate layer
4. Build output layers on top of step 3 similar to the model built in the paper.
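
As a quick sanity check on step 3, the width of the concatenated vector can be worked out by hand. The sketch assumes the layer sizes used in this notebook's Model-4 cell (`Dense(128)` for the token branch and `Bidirectional(LSTM(24))` for the character branch) and the default behaviour of the bidirectional wrapper, which concatenates the forward and backward outputs; `concat_width` is a hypothetical helper.

```python
def concat_width(*branch_widths: int) -> int:
    """Width of the feature axis after concatenating branches along it."""
    return sum(branch_widths)


token_branch = 128        # Dense(128) on the sentence-encoder output
char_branch = 2 * 24      # Bidirectional(LSTM(24)): forward + backward halves
hybrid_width = concat_width(token_branch, char_branch)
print(hybrid_width)
```

The result should match the output shape reported for the concatenate layer in `model_4.summary()`.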

In [None]:
use_url = 'https://tfhub.dev/google/universal-sentence-encoder/4'

In [None]:
# Creating the specific datasets for the concatenated data
train_token_character_dataset = concatenate_datasets([train_sentences, train_chars], train_labels_one_hot)
val_token_character_dataset = concatenate_datasets([val_sentences, val_chars], val_labels_one_hot)
test_token_character_dataset = concatenate_datasets([test_sentences, test_chars], test_labels_one_hot)

train_token_character_dataset, val_token_character_dataset, test_token_character_dataset

In [None]:
# Build Model

# 1. Build Pretrained Token Embeddings Model
token_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string, name='token_input')

pretrained_token_embedding = hub.KerasLayer(use_url, trainable=False, name='universal_sentence_encoder')(token_inputs)
token_outputs = tf.keras.layers.Dense(128, activation='relu')(pretrained_token_embedding)

token_embedding_model = tf.keras.models.Model(token_inputs, token_outputs)

# 2. Build Character Embeddings Model (This is similar to the model in the paper, it uses Bi-LSTM as the Output Layer)
character_inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='char_input')

char_vectors = character_vectorizer(character_inputs)
char_embeddings = character_embedding(char_vectors)
char_bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(24))(char_embeddings)

character_embedding_model = tf.keras.models.Model(character_inputs, char_bi_lstm)

# 3. Concatenating the two models
token_char_concat = tf.keras.layers.Concatenate(name='token_char_hybrid')([token_embedding_model.output, character_embedding_model.output])

# 4. Creating output layers (with dropout as done in the paper)
combined_dropout = tf.keras.layers.Dropout(0.5)(token_char_concat)
combined_dense = tf.keras.layers.Dense(128, activation='relu')(combined_dropout)
final_dropout = tf.keras.layers.Dropout(0.5)(combined_dense)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(final_dropout)

# 5. Creating the Model
model_4 = tf.keras.models.Model(inputs=[token_embedding_model.input, character_embedding_model.input],
                                outputs=outputs,
                                name='model_4_token_character_hybrid')
model_4.summary()

In [None]:
dl_toolbox.analysis.model.plot_model(model_4)

In [None]:
# Compile model
model_4.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.legacy.Adam(),
                metrics=['accuracy'])

In [None]:
# Fit Model
model_4_history = model_4.fit(train_token_character_dataset,
                              epochs=3,
                              steps_per_epoch=int(0.1 * len(train_token_character_dataset)),
                              validation_data=val_token_character_dataset,
                              validation_steps=int(0.1 * len(val_token_character_dataset)))

In [None]:
dl_toolbox.analysis.history.plot_history(model_4_history, 'accuracy')

In [None]:
model_4_pred_probs = model_4.predict(val_token_character_dataset)
model_4_pred_probs

In [None]:
model_4_preds = tf.argmax(model_4_pred_probs, axis=1)

In [None]:
model_4_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, model_4_preds)
model_4_results

### Model-5: Pretrained Token Embeddings + Character Embeddings + Positional Embeddings

This model combines the embedding types used thus far and adds positional features that describe where each line sits within its abstract (`line_number` and `total_lines`).

To build this model, we will be performing the following steps:

1. Create a token level model
2. Create a character level model
3. Create a model for the `line_number` feature
4. Create a model for the `total_lines` feature
5. Concatenate the outputs of token level model and character level model.
6. Concatenate the outputs of `line_number` feature model, `total_lines` feature model, and the concatenated model from step 5.
7. Create an output layer that accepts the output of the concatenated tribrid embedding model in step 6, and outputs the label probabilities.
8. Create the fully combined model from all the steps.

**NOTE**: Any engineered features used to train the model need to be available at test time. In our case, line numbers and total line numbers in an abstract are available.
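
Because this model takes four inputs, each dataset element has to carry its features as a tuple in a fixed order. The plain-Python sketch below shows that element structure with made-up toy values; `zip_features` is a hypothetical helper, and the names in `INPUT_ORDER` are the `Input` layer names used in the model cell further down.

```python
import typing

# Toy stand-ins for the four feature lists and the labels (values are made up).
line_numbers = [[1.0, 0.0], [0.0, 1.0]]
total_lines = [[0.0, 1.0], [0.0, 1.0]]
sentences = ["first sentence", "second sentence"]
chars = ["f i r s t", "s e c o n d"]
labels = [[1.0, 0.0], [0.0, 1.0]]

INPUT_ORDER = ("line_number_input", "total_lines_input", "token_input", "char_input")


def zip_features(features: typing.Sequence[typing.Sequence],
                 labels: typing.Sequence) -> typing.List[tuple]:
    """Pair each sample's feature tuple with its label, preserving feature order."""
    return [(sample, label) for sample, label in zip(zip(*features), labels)]


elements = zip_features([line_numbers, total_lines, sentences, chars], labels)
first_features, first_label = elements[0]
named = dict(zip(INPUT_ORDER, first_features))
print(named["token_input"], first_label)
```

If the feature lists were passed in a different order than the model's inputs, a sentence string could end up being fed to a numeric input, so the two orders have to agree.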



In [None]:
# Creating the dataset for working in model 5
# NOTE: Order of the datasets matters! It must match the order of the inputs passed to the model
train_token_character_position_dataset = concatenate_datasets(datasets=[train_line_numbers_one_hot,
                                                                        train_total_lines_one_hot,
                                                                        train_sentences,
                                                                        train_chars],
                                                              labels=train_labels_one_hot)
val_token_character_position_dataset = concatenate_datasets(datasets=[val_line_numbers_one_hot,
                                                                      val_total_lines_one_hot,
                                                                      val_sentences,
                                                                      val_chars],
                                                            labels=val_labels_one_hot)
test_token_character_position_dataset = concatenate_datasets(datasets=[test_line_numbers_one_hot,
                                                                       test_total_lines_one_hot,
                                                                       test_sentences,
                                                                       test_chars],
                                                             labels=test_labels_one_hot)

train_token_character_position_dataset, val_token_character_position_dataset, test_token_character_position_dataset

In [None]:
use_url = 'https://tfhub.dev/google/universal-sentence-encoder/4'

In [None]:
# 1. Token Embedding Model
token_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string, name='token_input')
pretrained_token_embedding = hub.KerasLayer(use_url, trainable=False, name='universal_sentence_encoder')(token_inputs)
token_outputs = tf.keras.layers.Dense(128, activation='relu')(pretrained_token_embedding)

token_embedding_model = tf.keras.models.Model(token_inputs, token_outputs, name='token_embedding')

# 2. Character Embedding Model
char_inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='char_input')
char_vectors = character_vectorizer(char_inputs)
char_embeddings = character_embedding(char_vectors)
char_bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(24))(char_embeddings)

character_embedding_model = tf.keras.models.Model(char_inputs, char_bi_lstm, name='character_embedding')

# 3. Line Number Feature Model
line_number_inputs = tf.keras.layers.Input(shape=(max_lines,), dtype=tf.float32, name='line_number_input')  # Must match the one-hot depth
line_number_outputs = tf.keras.layers.Dense(32, activation='relu')(line_number_inputs)

line_number_feature_model = tf.keras.models.Model(line_number_inputs, line_number_outputs, name='line_number_feature')

# 4. Total Lines Feature Model
total_lines_inputs = tf.keras.layers.Input(shape=(max_total_lines,), dtype=tf.float32, name='total_lines_input')  # Must match the one-hot depth
total_lines_outputs = tf.keras.layers.Dense(32, activation='relu')(total_lines_inputs)

total_lines_feature_model = tf.keras.models.Model(total_lines_inputs, total_lines_outputs, name='total_lines_feature')

# 5. Concatenate Token and Character Level Models (including dropout layer)
token_char_concat = tf.keras.layers.Concatenate(name='token_char_hybrid_embedding')([token_embedding_model.output, character_embedding_model.output])
x = tf.keras.layers.Dense(256, activation='relu')(token_char_concat)
combined_token_char_embeddings_dropout = tf.keras.layers.Dropout(0.5)(x)

# 6. Concatenate Line Number Feature Model, Total Lines Feature Model, and Token Char Concatenated Hybrid
tribrid_concat = tf.keras.layers.Concatenate(name='token_char_positional_embedding')([line_number_feature_model.output,
                                                                                      total_lines_feature_model.output,
                                                                                      combined_token_char_embeddings_dropout])

# 7. Output Layers
output_layer = tf.keras.layers.Dense(num_classes, activation='softmax', name='output_layer')(tribrid_concat)

# 8. Create Model
model_5 = tf.keras.models.Model(inputs=[line_number_feature_model.input,
                                        total_lines_feature_model.input,
                                        token_embedding_model.input,
                                        character_embedding_model.input],
                                outputs=output_layer,
                                name='model_5_token_char_positional')
model_5.summary()

In [None]:
dl_toolbox.analysis.model.plot_model(model_5)

In [None]:
# Compile Model
model_5.compile(loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2),  # label_smoothing helps to prevent overfitting
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
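
Label smoothing replaces each hard one-hot target with a softened one using `y * (1 - a) + a / K`, where `a` is the smoothing factor and `K` the number of classes. The sketch below applies that formula to a toy 5-class target with `a = 0.2`; `smooth_labels` is a hypothetical helper written only to show the arithmetic.

```python
import typing


def smooth_labels(one_hot: typing.List[float], label_smoothing: float) -> typing.List[float]:
    """Apply y * (1 - a) + a / K to a one-hot target with K classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes for y in one_hot]


smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0, 0.0], label_smoothing=0.2)
print(smoothed)
```

The true class keeps most of the probability mass while every other class receives a small share, which discourages the model from becoming overconfident.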

In [None]:
model_5_history = model_5.fit(train_token_character_position_dataset,
                              epochs=3,
                              steps_per_epoch=int(0.1 * len(train_token_character_position_dataset)),
                              validation_data=val_token_character_position_dataset,
                              validation_steps=int(0.1 * len(val_token_character_position_dataset)))

In [None]:
dl_toolbox.analysis.history.plot_history(model_5_history, 'accuracy')

In [None]:
model_5_pred_probs = model_5.predict(val_token_character_position_dataset, verbose=1)
model_5_pred_probs

In [None]:
model_5_preds = tf.argmax(model_5_pred_probs, axis=1)
model_5_preds

In [None]:
model_5_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, model_5_preds)
model_5_results

## Analysis

Now that the 5 different experiments have been run, it is time to compare and analyze them.

#### Prediction Metric Analysis
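
The exact metric definitions live in `py-learning-toolbox`, and this notebook does not show them. As an assumption-laden sketch of how accuracy and a support-weighted F1 score can be computed from integer labels, with made-up labels and hypothetical helper names:

```python
import typing


def accuracy(y_true: typing.List[int], y_pred: typing.List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: typing.List[int], y_pred: typing.List[int], cls: int) -> float:
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true: typing.List[int], y_pred: typing.List[int]) -> float:
    # Average the per-class F1 scores, weighted by how often each class occurs
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) * y_true.count(c) for c in classes) / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(acc, wf1)
```

Whether the toolbox uses weighted, macro, or another averaging scheme is not something this sketch can confirm.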

In [None]:
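# Include the baseline so the `model_{n}` labels from generate_results_df line up with the model numbers
all_results = [model_0_results, model_1_results, model_2_results, model_3_results, model_4_results, model_5_results]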
all_results

In [None]:
all_results_df = generate_results_df(all_results)
all_results_df

In [None]:
all_results_df.plot(kind='bar', figsize=(10, 7)).legend(bbox_to_anchor=(1.0, 1.0))

In [None]:
all_results_df.sort_values('f1', ascending=True)['f1'].plot(kind='bar', figsize=(10, 7))

#### Findings

After evaluating the model performance metrics, it looks like Model-5 outperformed all other models in terms of accuracy; however, the baseline model ended up being a very close second.

## Exporting Model

Now that I have identified the top performing model, I want to export it from Google Colab for reuse.

In [None]:
model_5_filepath = './models/skimlit_tribrid_model'

In [None]:
model_5.save(model_5_filepath)

In [None]:
# Verifying export works as expected
loaded_model = tf.keras.models.load_model(model_5_filepath)

In [None]:
# Verifying model_5 and loaded_model are equivalent
loaded_pred_probs = loaded_model.predict(val_token_character_position_dataset)
loaded_preds = tf.argmax(loaded_pred_probs, axis=1)
loaded_model_results = dl_toolbox.analysis.classification.generate_prediction_metrics(val_labels_encoded, loaded_preds)

In [None]:
loaded_model_results, model_5_results

#### Findings

After saving the model, then loading it back in and comparing against the original model, the performance is exactly the same. This confirms the export worked as expected.
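
When repeating this check, comparing the metrics with a small tolerance is a little more robust than eyeballing the printed values. The sketch below assumes the results can be converted to plain dictionaries of floats (as `generate_results_df` does with `dict(...)`); the numbers and the `metrics_match` helper are made up for illustration.

```python
import math
import typing


def metrics_match(a: typing.Mapping[str, float],
                  b: typing.Mapping[str, float],
                  rel_tol: float = 1e-6) -> bool:
    """True when both mappings have the same keys and numerically close values."""
    return a.keys() == b.keys() and all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)


original = {"accuracy": 0.8312, "f1": 0.8297}          # made-up numbers
reloaded = {"accuracy": 0.8312000001, "f1": 0.8297}    # tiny floating-point drift
different = {"accuracy": 0.7900, "f1": 0.8297}

same = metrics_match(original, reloaded)
changed = metrics_match(original, different)
print(same, changed)
```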