# Purpose
The purpose of this notebook is to run a grid-search-style evaluation of different Entity Extraction model parameters. This is the second iteration, and it fixes the parameters selected in the first iteration.

**Parameters Selected Based on Iteration 1**:

* **Embedding**: GloVe
* **Stopwords**: False
* **Lemmatization**: False
* **LSTM Stack**: True
* **Hidden Dimension - Layer 1**: 32
* **Hidden Dimension - Layer 2**: 128
* **Dropout**: True
* **Dropout Rate**: 0.5
* **Sample Weights**: False
* **Trainable**: True 

**Parameters to Evaluate:**
* Optimizer
* Time Distribution

## Import

### Packages

In [2]:
# General
import codecs, io, os, re, sys, time
from collections import OrderedDict 
from scipy.stats import uniform
from tqdm import tqdm

# Analysis
import numpy as np
import pandas as pd
from sklearn.metrics import \
    accuracy_score, classification_report, confusion_matrix, \
    precision_recall_fscore_support
from sklearn.model_selection import \
    ParameterGrid, RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

# Visual
# import matplotlib.pyplot as plt
# import seaborn as sn

# Deep Learning
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
from keras.layers.experimental.preprocessing import TextVectorization

### Custom Functions

In [3]:
sys.path.append('*')
from source_entity_extraction import *

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\canfi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\canfi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data
The training data is imported and the necessary columns are converted to lists.

In [4]:
#import data
path_dir_data ="./../data"
subfolder_a = "input"
file_training_data = "training_data_dir_multiclass.xlsx"
path_training_data = os.path.join(path_dir_data, subfolder_a, file_training_data)
dataset = pd.read_excel(path_training_data,  engine='openpyxl')

#convert into lists
df = pd.DataFrame({
    'text': dataset.sentence, 
    'node1': dataset.node_1, 
    'node2': dataset.node_2
    })

df.dropna(inplace = True)

## Randomness
The Entity Extraction model is trained in one environment (Python) and implemented in another (R/Shiny). To keep results consistent and comparable between the two, we attempt to control every random action in the process by fixing the seeds.

In [5]:
random_state = 5590
np.random.seed(random_state)
tf.random.set_seed(random_state)
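
The two seeds above cover NumPy and TensorFlow. Python's built-in `random` module and hash randomization are separate sources of variation. A minimal sketch of a helper that pins those as well is shown below; the name `seed_python` is hypothetical and not part of this notebook, and fully deterministic training (for example on a GPU) may need more than this.

```python
import os
import random


def seed_python(seed: int) -> None:
    """Seed the pure-Python sources of randomness (hypothetical helper)."""
    # Only affects Python processes started afterwards; the hash seed of the
    # running interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)


seed_python(5590)
first = [random.random() for _ in range(3)]
seed_python(5590)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Re-seeding before each draw reproduces the same sequence, which is the behavior the NumPy and TensorFlow seeds are meant to provide for their own generators.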

# Global Actions
The following section defines global settings or performs actions that are consistent across the entirety of this notebook.

## Variables
Variables that are used across multiple calls should be defined here.

In [6]:
MAX_LENGTH = 50
MAX_FEATURES = 1000
BATCH_SIZE = 32
LEMMATIZE = False
STOP_WORDS = False
SAMPLE_WEIGHTS = False
TRAINABLE = True
EMBEDDING_LABELS = [
    "text_vectorization",
    "glove",
    "fasttext"
]

## Pre-Processing


### Text Processing

In [7]:
df = process_text(
    df, 
    stopwords = STOP_WORDS,
    lemmatize = LEMMATIZE
    )

### Target Generation
With the text processing complete we will now create two versions of the target set. The first will have the feature tokens converted into numerical representations for each class: 0, 1, 2. Then we will also create a target set that is a one-hot-encoded representation of the numerical classes.

In [8]:
df = target_gen_wrapper(
    df, 
    max_length=MAX_LENGTH
    )
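
The internals of `target_gen_wrapper` live in `source_entity_extraction` and are not shown here. As a rough illustration of the two target representations described above, assuming three classes and a padded sequence of integer labels, a one-hot encoding can be produced by indexing an identity matrix. The label sequence below is made up.

```python
import numpy as np

N_CLASSES = 3  # class ids 0, 1 and 2, as described above

# Made-up token labels for one sentence, padded with class 0
labels = np.array([0, 1, 1, 0, 0, 2, 0, 0])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(N_CLASSES, dtype="float32")[labels]

print(one_hot.shape)  # (8, 3)
print(one_hot[1])     # [0. 1. 0.]
```

Taking `argmax` over the last axis of the one-hot array recovers the original integer labels.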

## Split Data into Training / Test Sets

In [9]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(
    df,
    test_size=0.25, 
    random_state = random_state
    )

Because random actions behave differently in R and Python, it is easier to export the test set than to duplicate the train/test split in R.

In [10]:
subfolder_b = "output"
subfolder_entity = "entity_extraction"
file_name = 'entity_extraction_test_set.csv'
path = os.path.join(path_dir_data, subfolder_b, 
                    subfolder_entity, file_name)
df_test.to_csv(path, index_label=False)

# Feature/Target Definition
We need to define a target variable and perform preprocessing steps on the features before passing them to the model.

In [11]:
# Features
X_train = df_train['text'].tolist()
X_test = df_test['text'].tolist()

# Targets
y_train_val = df_train['target_labels'].tolist()
y_test_val = df_test['target_labels'].tolist()

### Export Target Classes
To simplify eventual work in R, we generate the target classes for the test dataset and export them.

In [12]:
# Export Test Target Values
# Set File Location
file_name = 'entity_extraction_test_target_classes_cause_effect.csv'
path = os.path.join(path_dir_data, subfolder_b, subfolder_entity, file_name)

# Convert Lists to Dataframe
df_targets_test = pd.DataFrame(y_test_val)

# Export Target Values
df_targets_test.to_csv(path, index=False, header=False)

# Build Model

## Vectorization Layer

In [13]:
vectorization_layer = TextVectorization(
    max_tokens=MAX_FEATURES,
    output_mode='int',
    output_sequence_length=MAX_LENGTH
    )

vectorization_layer.adapt(X_train)

vocab = vectorization_layer.get_vocabulary()
vocab_len = len(vocab)
print(f"Vocabulary Size: {vocab_len}")

Vocabulary Size: 1000
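
To make the layer's behavior concrete, the following is a simplified pure-Python approximation of integer vectorization with a fixed output length. It is a sketch only: the real `TextVectorization` layer also strips punctuation by default, and the helper names `build_vocab` and `vectorize` are hypothetical. Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, which is why both count toward `max_tokens`.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Most frequent tokens first, after the reserved padding and OOV slots."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = ["", "[UNK]"]  # index 0: padding, index 1: out-of-vocabulary
    vocab += [tok for tok, _ in counts.most_common(max_tokens - len(vocab))]
    return vocab


def vectorize(text, word_index, seq_len):
    """Map tokens to indices, then truncate or zero-pad to seq_len."""
    ids = [word_index.get(tok, 1) for tok in text.lower().split()]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))


# Made-up sentences for illustration
texts = ["stress causes illness", "illness causes absence"]
vocab = build_vocab(texts, max_tokens=10)
word_index = {tok: i for i, tok in enumerate(vocab)}

vec = vectorize("stress causes fatigue", word_index, seq_len=5)
print(vocab)
print(vec)
```

The unseen token "fatigue" maps to index 1, and the sequence is right-padded with zeros up to the requested length.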


In [14]:
# Inspect vocabulary
word_index = dict(zip(vocab, range(len(vocab))))
# word_index

In [15]:
# # Inspect vectorization output
# m = 5
# test_string = X_train[m]
# print(f"Test String - Raw:\n{test_string}")
# print()
# test_string_vec = vectorization_layer([test_string])
# print(f"Test String - Vectorized:\n{test_string_vec[0]}")

## Embedding Layer

### Initialize

In [16]:
dct_embedding_index = {}

# Initialize None for text vectorization
dct_embedding_index["text_vectorization"] = {
    "index": None,
    "dimension": None
}

dct_embedding_matrix = {
    "text_vectorization": None
}

### Embedding Matrix

#### GloVe

##### Import/Load Embeddings

In [17]:
embed_label = EMBEDDING_LABELS[1]
embedding_dim = 100

# Define file path
subfolder_embed = "pre_trained"
subfolder_embed_glove = "glove.6B"
file_name = "glove.6B.100d.txt"
path = os.path.join(path_dir_data, subfolder_embed, subfolder_embed_glove, file_name)

print("Preparing embedding index...")
embeddings_index = {}
with open(path, encoding="utf8") as f:
    for line in tqdm(f):
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

dct_embedding_index[embed_label] = {
    "index": embeddings_index,
    "dimension": embedding_dim
}
print(f"Found {len(embeddings_index)} word vectors.")

628it [00:00, 6217.73it/s]

Preparing embedding index...


400000it [00:34, 11575.42it/s]

Found 400000 word vectors.





##### Create Matrix

In [18]:
embedding_matrix = gen_embedding_matrix(
    dct_embedding_index = dct_embedding_index,
    embed_label = embed_label, 
    vocabulary_length = vocab_len, 
    word_index = word_index
)

dct_embedding_matrix[embed_label] = embedding_matrix
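
`gen_embedding_matrix` is defined in `source_entity_extraction` and is not shown in this notebook. A common way to build such a matrix, sketched below under that caveat, is to allocate one row per vocabulary entry and copy in the pre-trained vector where one exists, leaving all-zero rows for words the embedding file does not contain. The function name `build_embedding_matrix` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def build_embedding_matrix(word_index, embeddings_index, dim):
    """Hypothetical stand-in for gen_embedding_matrix (not the actual helper)."""
    matrix = np.zeros((len(word_index), dim), dtype="float32")
    hits, misses = 0, 0
    for word, row in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[row] = vector
            hits += 1
        else:
            misses += 1  # row stays all zeros
    return matrix, hits, misses


# Toy 4-dimensional vectors in place of the 100-dimensional GloVe file
toy_index = {
    "stress": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "illness": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}
toy_word_index = {"": 0, "[UNK]": 1, "stress": 2, "illness": 3, "absence": 4}

matrix, hits, misses = build_embedding_matrix(toy_word_index, toy_index, dim=4)
print(matrix.shape, hits, misses)  # (5, 4) 2 3
```

Tracking hits and misses gives a quick check of how much of the vocabulary the pre-trained vectors actually cover.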

#### FastText

##### Import/Load Embeddings

In [19]:
# embed_label = EMBEDDING_LABELS[2]
# embedding_dim = 300

# # Define file path
# subfolder_embed = "pre_trained"
# subfolder_embed_fasttext = "wiki-news-300d-1M.vec"
# file_name = "wiki-news-300d-1M.vec"
# path = os.path.join(path_dir_data, subfolder_embed, subfolder_embed_fasttext, file_name)
# f = codecs.open(path, encoding='utf-8')

# print("Preparing embedding index...")
# embeddings_index = {}
# for line in tqdm(f):
#     values = line.rstrip().rsplit(' ')
#     # print(values)
#     word = values[0]
#     # print(word)
#     coefs = np.asarray(values[1:], dtype='float32')
#     # print(coefs)
#     embeddings_index[word] = coefs
# f.close()

# dct_embedding_index[embed_label] = {
#     "index": embeddings_index,
#     "dimension": embedding_dim,
# }
# print(f'Found {len(embeddings_index)} word vectors.')

##### Create Matrix

In [20]:
# embedding_matrix = gen_embedding_matrix(
#     dct_embedding_index = dct_embedding_index,
#     embed_label = embed_label, 
#     vocabulary_length = vocab_len, 
#     word_index = word_index
# )

# dct_embedding_matrix[embed_label] = embedding_matrix

## Imbalance Class Management
There is an imbalance in the classes predicted by the model. Roughly 90% of tokens are class 0 (filler words that are neither a Cause nor an Entity node). The remaining classes (1: Cause, 2: Entity) each account for about 5% of tokens.

To predict all classes in the dataset well, rather than ending up with a model that simply predicts the default class, we need to weight each of these classes differently. Unfortunately, we cannot simply use the **class weights** input when training a Keras model. That is because we are predicting a 3D array as the output, and Keras will not allow the use of **class weights** in such a case.

There is a workaround, as discussed in Keras Github issue 3653. We can use **sample weights**, with the sample weights mode set to *temporal*. 

https://github.com/keras-team/keras/issues/3653

To apply sample weights to our model, we need a matrix of sample weights that covers every target value. This matrix has the same shape as the y_train label array (n samples x sequence length).

### Sample Weights

In [21]:
# # Initialize Sample weight matrix
# sample_weight_matrix = np.array(y_train_val).copy()

# # Flatten matrix
# sample_weight_matrix_fl = flatten_list(sample_weight_matrix)

# # Determine number of classes
# n_classes = np.unique(sample_weight_matrix_fl).shape[0]

# # Determine class weights
# class_weights = compute_class_weight(
#     class_weight="balanced", 
#     classes=np.unique(sample_weight_matrix_fl), 
#     y=np.array(sample_weight_matrix_fl)
#     )

# # Replace class label with class weight
# for i in range(0, len(class_weights)):
#     sample_weight_matrix = np.where(
#             sample_weight_matrix==i, 
#             class_weights[i], 
#             sample_weight_matrix
#             ) 
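
As a self-contained illustration of the idea in the commented-out cell, the sketch below computes "balanced" class weights with the formula scikit-learn documents for that mode, `n_samples / (n_classes * count_per_class)`, and maps them onto a label matrix by indexing. The label matrix is made up, and the approach assumes labels are the integers 0 to n_classes - 1.

```python
import numpy as np

# Made-up padded label matrix: 2 sentences x 6 tokens
y = np.array([
    [0, 1, 0, 0, 2, 0],
    [0, 0, 1, 0, 0, 0],
])

# "Balanced" weighting: n_samples / (n_classes * count_per_class)
classes, counts = np.unique(y, return_counts=True)
class_weights = y.size / (len(classes) * counts)

# Look up each token's weight by its class label
sample_weight_matrix = class_weights[y]

print(dict(zip(classes.tolist(), class_weights.round(3).tolist())))
print(sample_weight_matrix.shape)  # (2, 6)
```

Indexing with the label matrix produces all weights in one step. It also sidesteps a possible collision in a replace-in-a-loop approach, where a weight that happens to equal a later class label (here class 1 gets weight 2.0) would be overwritten on the next pass.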


# Grid

## Define Parameter Grid

In [23]:
params = {
    'embedding': ['glove'],
    'stop_words': [False],
    'lemmatization': [False],
    'hidden_dim_1': [32],
    'hidden_dim_2': [128],
    'lstm_stack': [True],
    'dropout': [True],
    'dropout_rate': [0.5],
    'sample_weights': [False],
    'trainable': [True],
    'optimizer': ['adam', 'rmsprop'],
    'time_distributed': [True, False],
    }
param_grid = list(ParameterGrid(params))
print(f"Number of Parameter Configurations: {len(param_grid)}")

Number of Parameter Configurations: 4
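
The count of 4 follows from taking the Cartesian product of the value lists: only `optimizer` and `time_distributed` have two values, so there are 2 x 2 combinations. A small sketch of the same expansion with the standard library, using a reduced parameter dictionary for brevity:

```python
from itertools import product

params = {
    "optimizer": ["adam", "rmsprop"],
    "time_distributed": [True, False],
    "dropout_rate": [0.5],  # single-value entries do not add configurations
}

# One dictionary per combination of values
keys = sorted(params)
grid = [dict(zip(keys, values)) for values in product(*(params[k] for k in keys))]

print(len(grid))  # 4
for config in grid:
    print(config)
```

Adding a third optimizer would raise the count to 6, so the grid grows multiplicatively with each multi-valued parameter.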


## Initialize Output Dictionary

In [24]:
# Initialize
dct_summary = OrderedDict()

dct_summary["embedding"] = []
dct_summary["stop_words"] = []
dct_summary["lemmatization"] = []
dct_summary["hidden_dim_1"] = []
dct_summary["hidden_dim_2"] = []
dct_summary["lstm_stack"] = []
dct_summary["dropout"] = []
dct_summary["dropout_rate"] = []
dct_summary["sample_weights"] = []
dct_summary["trainable"] = []
dct_summary["optimizer"] = []
dct_summary["time_distributed"] = []
dct_summary["accuracy"] = []
dct_summary["precision_0"] = []
dct_summary["precision_1"] = []
dct_summary["precision_2"] = []
dct_summary["precision_macro"] = []
dct_summary["recall_0"] = []
dct_summary["recall_1"] = []
dct_summary["recall_2"] = []
dct_summary["recall_macro"] = []
dct_summary["f1_0"] = []
dct_summary["f1_1"] = []
dct_summary["f1_2"] = []
dct_summary["f1_macro"] = []

## Execute Model Training/Evaluation

In [26]:
for i, param in enumerate(param_grid):

    # Define Parameters
    EMBEDDING = param['embedding']
    STOP_WORDS = param['stop_words']
    LEMMATIZATION = param['lemmatization']
    HIDDEN_DIM_1 = param['hidden_dim_1']
    HIDDEN_DIM_2 = param['hidden_dim_2']
    LSTM_STACK = param['lstm_stack']
    DROP_OUT = param['dropout']
    DROP_OUT_RATE = param['dropout_rate']
    SAMPLE_WEIGHTS = param['sample_weights']
    TRAINABLE = param['trainable']
    OPTIMIZER = param['optimizer']
    TIME_DISTRIBUTED = param['time_distributed']
    

    print(f"Iteration:\t\t{i + 1}")
    print(f"EMBEDDING:\t\t{EMBEDDING}")
    print(f"STOP_WORDS:\t\t{STOP_WORDS}")
    print(f"LEMMATIZATION:\t\t{LEMMATIZATION}")
    print(f"HIDDEN_DIM:\t\t{HIDDEN_DIM_1}")
    print(f"HIDDEN_DIM:\t\t{HIDDEN_DIM_2}")
    print(f"LSTM_STACK:\t\t{LSTM_STACK}")
    print(f"DROP_OUT:\t\t{DROP_OUT}")
    print(f"DROP_OUT:\t\t{DROP_OUT_RATE}")
    print(f"SAMPLE_WEIGHTS:\t\t{SAMPLE_WEIGHTS}")
    print(f"TRAINABLE:\t\t{TRAINABLE}")
    print(f"OPTIMIZER:\t\t{OPTIMIZER}")
    print(f"TIME_DISTRIBUTED:\t{TIME_DISTRIBUTED}")
    print()

    # Initialize input dataframe
    df_iter = df.copy()

    # Process ------------------------------------------------------------------
    # Process Text
    df_iter = process_text(
        df_iter, 
        stopwords = STOP_WORDS,
        lemmatize = LEMMATIZATION
    )

    # Generate Target Labels
    df_iter = target_gen_wrapper(
        df_iter, 
        max_length=MAX_LENGTH
        )
    
    # Split Training/Test Set
    df_train, df_test = train_test_split(
        df_iter,
        test_size=0.25, 
        random_state = random_state
        )
    
    # Define Target/Features
    # Features
    X_train = df_train['text'].tolist()
    X_test = df_test['text'].tolist()

    # Target
    y_train_val = df_train['target_labels'].tolist()
    y_test_val = df_test['target_labels'].tolist()

    # Build Model --------------------------------------------------------------
    # Vectorization Layer
    vectorization_layer = TextVectorization(
        max_tokens=MAX_FEATURES,
        output_mode='int',
        output_sequence_length=MAX_LENGTH
        )

    # Adapt to training data
    vectorization_layer.adapt(X_train)

    # Define vocabulary
    vocab = vectorization_layer.get_vocabulary()
    vocab_len = len(vocab)
    word_index = dict(zip(vocab, range(vocab_len)))

    # Create embedding matrix
    embedding_matrix = gen_embedding_matrix(
        dct_embedding_index = dct_embedding_index,
        embed_label = EMBEDDING, 
        vocabulary_length = vocab_len, 
        word_index = word_index
        )

    # Embedding Layer
    embedding_layer = gen_embedding_layer(
        label = EMBEDDING,
        input_dimension = vocab_len,
        output_dimension_wo_init = 64,
        max_length = MAX_LENGTH,
        embedding_matrix = embedding_matrix,
        trainable = TRAINABLE
        )

    # Compile Model
    model = compile_model(
        vectorization_layer = vectorization_layer,
        embedding_layer = embedding_layer,
        dropout = DROP_OUT,
        dropout_rate = DROP_OUT_RATE,
        lstm_stack = LSTM_STACK,
        hidden_dimension_1 = HIDDEN_DIM_1,
        hidden_dimension_2 = HIDDEN_DIM_2,
        sample_weights = SAMPLE_WEIGHTS, 
        optimizer = OPTIMIZER,
        time_distributed = TIME_DISTRIBUTED
    )

    # Convert and encode target/features
    # Convert features to numpy array for model input
    X_train = np.array(X_train)

    # Encode targets
    y_train = encode_target(y_train_val)

    # Convert target to numpy array for model input
    y_train = np.array(y_train)

    # Train Model --------------------------------------------------------------
    if SAMPLE_WEIGHTS: 
        history = model.fit(
                            X_train, y_train, 
                            batch_size=BATCH_SIZE, 
                            epochs=75, 
                            validation_split=0.2, 
                            sample_weight=sample_weight_matrix,
                            verbose=2
                            )
    else: 
        history = model.fit(
                            X_train, y_train, 
                            batch_size=BATCH_SIZE, 
                            epochs=75, 
                            validation_split=0.2, 
                            verbose=2
                            )
        
    # Evaluation ---------------------------------------------------------------
    # Generate Predictions
    y_pred = []
    # Use j so the outer grid-search index i is not overwritten
    for j in range(len(X_test)):
        y_pred_prob = model.predict(np.array([X_test[j]]))
        y_pred_class = np.argmax(y_pred_prob, axis=-1)[0].tolist()
        y_pred.append(y_pred_class)

    # Flatten lists
    y_pred = flatten_list(y_pred)
    y_test = flatten_list(y_test_val)

    # Score Model
    dct_summary = gen_eval_metrics(
        dct_summary,
        y_test, y_pred,
        embedding = EMBEDDING,
        stop_words = STOP_WORDS,
        lemmatization = LEMMATIZATION,
        hidden_dim_1 = HIDDEN_DIM_1,
        hidden_dim_2 = HIDDEN_DIM_2,
        lstm_stack = LSTM_STACK,
        dropout = DROP_OUT,
        dropout_rate = DROP_OUT_RATE,
        sample_weights = SAMPLE_WEIGHTS,
        trainable = TRAINABLE,
        optimizer = OPTIMIZER,
        time_distributed = TIME_DISTRIBUTED
        )
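
The evaluation step in the loop above calls `model.predict` once per test sentence. Assuming the model accepts a batch of raw strings at prediction time, as it does during `fit`, a single call on the whole test array followed by an argmax over the last axis should yield the same class ids with far fewer calls. The sketch below uses random numbers in place of real model output, so only the shapes and the decoding step are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model.predict(np.array(X_test)):
# 4 sentences x 50 tokens x 3 class scores
n_sentences, seq_len, n_classes = 4, 50, 3
y_pred_prob = rng.random((n_sentences, seq_len, n_classes))

# Most likely class per token, for every sentence at once
y_pred_class = np.argmax(y_pred_prob, axis=-1)

# One flat list of token-level predictions, as used for scoring
y_pred_flat = y_pred_class.reshape(-1).tolist()

print(y_pred_class.shape)  # (4, 50)
print(len(y_pred_flat))    # 200
```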

Iteration:		1
EMBEDDING:		glove
STOP_WORDS:		False
LEMMATIZATION:		False
HIDDEN_DIM:		32
HIDDEN_DIM:		128
LSTM_STACK:		True
DROP_OUT:		True
DROP_OUT:		0.5
SAMPLE_WEIGHTS:		False
TRAINABLE:		True
OPTIMIZER:		adam
TIME_DISTRIBUTED:	True

Epoch 1/75
12/12 - 14s - loss: 0.3317 - accuracy: 0.5240 - val_loss: 0.2731 - val_accuracy: 0.5873
Epoch 2/75
12/12 - 2s - loss: 0.2949 - accuracy: 0.5900 - val_loss: 0.2611 - val_accuracy: 0.6139
Epoch 3/75
12/12 - 2s - loss: 0.2810 - accuracy: 0.6068 - val_loss: 0.2400 - val_accuracy: 0.6489
Epoch 4/75
12/12 - 2s - loss: 0.2594 - accuracy: 0.6629 - val_loss: 0.2136 - val_accuracy: 0.7236
Epoch 5/75
12/12 - 2s - loss: 0.2284 - accuracy: 0.7241 - val_loss: 0.1848 - val_accuracy: 0.7612
Epoch 6/75
12/12 - 2s - loss: 0.2091 - accuracy: 0.7536 - val_loss: 0.1748 - val_accuracy: 0.7865
Epoch 7/75
12/12 - 2s - loss: 0.1940 - accuracy: 0.7730 - val_loss: 0.1590 - val_accuracy: 0.8170
Epoch 8/75
12/12 - 2s - loss: 0.1785 - accuracy: 0.7943 - val_loss: 0.1529 - 
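
`gen_eval_metrics` is defined in `source_entity_extraction` and is not shown here. The per-class and macro columns it fills in (`precision_0` through `f1_macro`) can be understood from the following pure-Python sketch, which is an illustration rather than the helper's actual code. The labels are made up, and the function name `per_class_scores` is hypothetical.

```python
def per_class_scores(y_true, y_pred, labels):
    """Precision, recall and F1 per class, plus unweighted (macro) means."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro average: every class counts equally, regardless of frequency
    macro = {
        m: sum(scores[c][m] for c in labels) / len(labels)
        for m in ("precision", "recall", "f1")
    }
    return scores, macro


# Made-up flattened token labels
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

scores, macro = per_class_scores(y_true, y_pred, labels=[0, 1, 2])
print(scores[1])
print(macro)
```

Because class 0 dominates the tokens, overall accuracy can look high even when classes 1 and 2 are predicted poorly; the macro averages weight all three classes equally and are therefore the more informative comparison between configurations.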

## Output Model Summary

In [27]:
# Output Results
# Convert dictionary to dataframe
df_summary = pd.DataFrame(dct_summary)

# Create file path
file_name_text = 'entity_extraction_evaluation_iter_3_'
time_stamp =  time.strftime("%Y%m%d-%H%M%S")
file_extension = ".csv"
file_name = file_name_text + time_stamp + file_extension
path = os.path.join(path_dir_data, subfolder_b, 
                    subfolder_entity, file_name)

# Save table
df_summary.to_csv(path, index=False)

In [28]:
df_summary

| | embedding | stop_words | lemmatization | hidden_dim_1 | hidden_dim_2 | lstm_stack | dropout | dropout_rate | sample_weights | trainable | ... | precision_2 | precision_macro | recall_0 | recall_1 | recall_2 | recall_macro | f1_0 | f1_1 | f1_2 | f1_macro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | glove | False | False | 32 | 128 | True | True | 0.5 | False | True | ... | 0.875666 | 0.92665 | 0.986767 | 0.859425 | 0.90128 | 0.915824 | 0.984906 | 0.889256 | 0.888288 | 0.920817 |
| 1 | glove | False | False | 32 | 128 | True | True | 0.5 | False | True | ... | 0.905482 | 0.933514 | 0.990112 | 0.840256 | 0.875686 | 0.902018 | 0.98517 | 0.875937 | 0.890335 | 0.917147 |
| 2 | glove | False | False | 32 | 128 | True | True | 0.5 | False | True | ... | 0.898305 | 0.924028 | 0.986913 | 0.899361 | 0.872029 | 0.919434 | 0.986339 | 0.893651 | 0.884972 | 0.921654 |
| 3 | glove | False | False | 32 | 128 | True | True | 0.5 | False | True | ... | 0.908382 | 0.927143 | 0.989094 | 0.884984 | 0.85192 | 0.908666 | 0.98644 | 0.88711 | 0.879245 | 0.917598 |


## Visuals

### Accuracy Charts

In [None]:
# hist = pd.DataFrame(history.history)

# #plot training and validation accuracy
# f = plt.figure(figsize=(5,5))
# plt.plot(hist["accuracy"], label =' Training Accuracy')
# plt.plot(hist["val_accuracy"], label = 'Validation Accuracy')
# plt.legend(loc="lower right")
# plt.show()

# # Save and output
# file_name = 'enitity_extraction_training_validation_accuracy.png'
# path = os.path.join(path_dir_data, file_name)
# f.savefig(path)
# # f.download(path)

### Confusion Matrix

In [None]:
# # Create confusion matrix
# labels = [0, 1, 2]
# cm = confusion_matrix(y_test, y_pred, labels=labels)
# # Rows are actual classes, columns are predicted classes
# df_cm = pd.DataFrame(cm, index=labels, columns=labels)
# df_cm.index.name = 'Actual'
# df_cm.columns.name = 'Predicted'

# # Visualize
# plt.figure(figsize = (10,7))
# sn.set(font_scale=1.4)#for label size
# sn.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16})# font size

# Save and Restore Model
The following section tests saving and restoring the model.

## Save

In [None]:
# # Export Model
# folder_keras_model = 'entity_extraction_w_processing_keras'
# path_keras_model = os.path.join(path_dir_data, folder_keras_model)

# # SavedModel
# model.save(path_keras_model)


## Restore

In [None]:
# from tensorflow import keras
# model = keras.models.load_model(path_keras_model)