## Convolutional NN to classify govuk content to level2 taxons

Based on:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

## To do:
- ~~Consider grouping very small classes (especially if too small for evaluation)~~
- ~~Split data into training, validation and test to avoid overfitting validation data during hyperparameter searches & model architecture changes~~
- ~~Try learning embeddings~~
- ~~Try changing pos_ratio~~
- Try implementing class_weights during model fit (does this do the same as the weighted binary cross entropy?)
- Work on tensorboard callbacks
- ~~Create dictionary of class indices to taxon names for viewing results~~
- ~~Check model architecture~~
- ~~consider relationship of training error to validation error - overfitting/bias?~~
- ~~train longer~~
- Try different max_sequence_length
- Check batch size is appropriate
- Also think about:
  - ~~regularization (e.g. dropout)~~ 
  - fine-tuning the Embedding layer

### Load requirements and data

TODO: edit requirement.txt to include only these packages and do not include tensorflow because this conflicts with tf on AWS when using on GPU.

In [None]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.utils import to_categorical, layer_utils, plot_model

from keras.layers import (Embedding, Input, Dense, Dropout, 
                          Activation, Conv1D, MaxPooling1D, Flatten, concatenate, Reshape)
from keras.models import Model, Sequential
from keras.optimizers import rmsprop
from keras.callbacks import TensorBoard, Callback, ModelCheckpoint
import keras.backend as K
from keras.losses import binary_crossentropy

from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score 
from sklearn.metrics import precision_recall_fscore_support, classification_report
from sklearn.utils import class_weight

import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline

import functools

import h5py


### Environmental vars

In [None]:
DATADIR=os.getenv('DATADIR')
#DATADIR='/data' #this was put in for AWS run but doesn't work locally...

## Hyperparameters

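The intuition for POS_RATIO is that it discourages the model from predicting zero for everything, which is attractive to the model because the multilabel y matrix is super sparse.

In the weighted loss defined later in this notebook, terms for true labels are scaled by (1 - POS_RATIO) and terms for absent labels by POS_RATIO. Lowering POS_RATIO therefore penalises predicting zeros for true labels more; at POS_RATIO = 0.5 the loss is simply half the plain binary cross entropy.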
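
To check the direction of this effect, here is a minimal pure-Python sketch of the same algebra as the loss class further down (`pos_weight = (1 - pos_ratio) / pos_ratio`, cost scaled by `pos_ratio`). The function name `weighted_bce` and the toy row are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def weighted_bce(y_true, y_prob, pos_ratio, eps=1e-7):
    """Mean weighted binary cross-entropy over one row of labels (illustrative)."""
    pos_weight = (1.0 - pos_ratio) / pos_ratio
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        cost = pos_weight * t * -math.log(p) + (1 - t) * -math.log(1.0 - p)
        total += cost * pos_ratio
    return total / len(y_true)

# A sparse row: one true label out of five, and a model predicting ~0 everywhere.
y_true = [1, 0, 0, 0, 0]
all_zero = [0.01] * 5

loss_low = weighted_bce(y_true, all_zero, pos_ratio=0.1)
loss_mid = weighted_bce(y_true, all_zero, pos_ratio=0.5)
loss_high = weighted_bce(y_true, all_zero, pos_ratio=0.9)
print(loss_low, loss_mid, loss_high)
```

On this toy row the all-zero prediction is punished hardest at the lowest `pos_ratio`, because the missed label carries weight `1 - pos_ratio`.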

In [None]:
#MAX_NB_WORDS
MAX_SEQUENCE_LENGTH =1000
EMBEDDING_DIM = 100 # keras embedding layer output_dim = Dimension of the dense embedding
P_THRESHOLD = 0.5 #Threshold for probability of being assigned to class
POS_RATIO = 0.5 #ratio of positive to negative for each class in weighted binary cross entropy loss function
NUM_WORDS=20000 #keras tokenizer num_words: None or int. Maximum number of words to work with 
#(if set, tokenization will be restricted to the top num_words most common words in the dataset).

### Read in data
Content items tagged to level 2 taxons or lower in the topic taxonomy

In [None]:
labelled_level2 = pd.read_csv(os.path.join(DATADIR, 'labelled_level2.csv.gz'), dtype=object, compression='gzip')

In [None]:
labelled_level2.shape

In [None]:
labelled_level2['content_id'].nunique()

### DON'T Collapse taxons with insufficient support for predictions (will need to be manually tagged)

In [None]:
#count the number of taxons per content item into new column
#labelled_level2['num_content_per_taxon'] = labelled_level2.groupby(["level2taxon"])['level2taxon'].transform("count")

In [None]:
#COLLAPSE level2taxons with too few content items into "toosmall" category
#labelled_level2.loc[labelled_level2['num_content_per_taxon'] < 10, 'level2taxon'] = 'TOO_SMALL'

#### clean up any World taxons leftover despite dropping relevant doctypes

In [None]:
#COLLAPSE World level2taxons
labelled_level2.loc[labelled_level2['level1taxon'] == 'World', 'level2taxon'] = 'world_level1'

#creating categorical variable for level2taxons from values
labelled_level2['level2taxon'] = labelled_level2['level2taxon'].astype('category')

In [None]:
#count the number of content items per taxon into new column
labelled_level2['num_content_per_taxon'] = labelled_level2.groupby(["level2taxon"])['level2taxon'].transform("count")

In [None]:
labelled_level2['num_content_per_taxon'].describe()

In [None]:
#number of rows in biggest level2 taxon -this is the target size for all other level2 taxons in resampling
max_content_freq = max(labelled_level2['num_content_per_taxon'])
max_content_freq

### drop news

In [None]:
labelled_level2.shape

In [None]:
labelled_level2[(labelled_level2['document_type'] == 'world_news_story')].shape

In [None]:
labelled_level2[(labelled_level2['document_type'] == 'news_story')].shape

In [None]:
nonews = labelled_level2[(labelled_level2['document_type'] != 'news_story') &
                         (labelled_level2['document_type'] != 'world_news_story')]

In [None]:
nonews.shape

### Create dictionary mapping taxon codes to string labels

In [None]:
#Get the category numeric values (codes) and avoid zero-indexing
labels = nonews['level2taxon'].cat.codes + 1

#create dictionary of taxon category code to string label for use in model evaluation
labels_index = dict(zip((labels), nonews['level2taxon']))

In [None]:
labels_index

In [None]:
print(len(labels_index))

### Create target/Y 

Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).

In multilabel learning, the joint set of binary classification tasks is expressed with a label binary indicator array: each sample is one row of a 2d array of shape (n_samples, n_classes) with binary values:  
the ones, i.e. the non-zero elements, correspond to the subset of labels.  
An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.  
Producing multilabel data as a list of sets of labels may be more intuitive.
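
As a small illustration of the two representations, the sketch below turns a list of label sets into the indicator rows described above. The helper `binarize` is a hypothetical stand-in; in practice sklearn's `MultiLabelBinarizer` (imported at the top of this notebook) does the same job.

```python
def binarize(label_sets, classes):
    """Turn a list of label sets into rows of 0/1 indicators, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]

# Same example as in the text: label 0; labels 1 and 2; no labels.
label_sets = [{0}, {1, 2}, set()]
classes = [0, 1, 2]

indicator = binarize(label_sets, classes)
print(indicator)
```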

####  First reshape wide to get columns for each level2taxon and row number = number unique urls

In [None]:
#get a smaller copy of data for pivoting ease (you can probably work from the full data, as other columns get dropped automatically)

level2_reduced = nonews[['content_id', 
                         'level2taxon', 
                         'combined_text', 
                         'title', 
                         'description',
                         'document_type', 
                            'first_published_at', 
                            'publishing_app', 
                            'primary_publishing_organisation']].copy()

#how many level2taxons are there?
print('Number of unique level2taxons: {}'.format(level2_reduced.level2taxon.nunique()))

#count the number of taxons per content item into new column
level2_reduced['num_taxon_per_content'] = level2_reduced.groupby(["content_id"])['content_id'].transform("count")

#Add 1 because of zero-indexing to get 1-number of level2taxons as numerical targets
level2_reduced['level2taxon_code'] = level2_reduced.level2taxon.astype('category').cat.codes + 1

In [None]:
#how many level2taxons are there?
print('Number of unique level2taxons: {}'.format(labelled_level2.level2taxon.nunique()))

#count the number of taxons per content item into new column
labelled_level2['num_taxon_per_content'] = labelled_level2.groupby(["content_id"])['content_id'].transform("count")

#Add 1 because of zero-indexing to get 1-number of level2taxons as numerical targets
labelled_level2['level2taxon_code'] = labelled_level2.level2taxon.astype('category').cat.codes + 1

In [None]:
#reshape to wide per taxon and keep the combined text so indexing is consistent when splitting X from Y

multilabel = (level2_reduced.pivot_table(index=['content_id', 
                                                'combined_text', 
                                                'title', 
                                                'description' 
                                                ] , columns='level2taxon_code', values='num_taxon_per_content'))
print('level2reduced shape: {}'.format(level2_reduced.shape))
print('pivot table shape (no duplicates): {} '.format(multilabel.shape))


In [None]:
multilabel.columns

In [None]:
multilabel.head()

In [None]:
multilabel.columns.astype('str')

In [None]:
#THIS IS WHY INDEXING IS NOT ZERO-BASED
#convert the number_of_taxons_per_content values to 1, meaning there was an entry for this taxon and this content_id, 0 otherwise
binary_multilabel = multilabel.notnull().astype('int')

### Upsample minority classes to address imbalance leading to ~2, 465, 570 rows of data!

Access taxon columns with indexing. 

In [None]:
print("[ENCODING] Taxon min indx:",binary_multilabel.columns[0],"Taxon max indx:",
      binary_multilabel.columns[len(binary_multilabel.columns)-1])

In [None]:
binary_multilabel[1].shape

In [None]:
type(binary_multilabel.columns[0])

In [None]:
binary_multilabel.columns

In [None]:
binary_multilabel.columns.name 

In [None]:
# Why are we deleting this?
binary_multilabel.columns.name = None  # clear the columns' name ('level2taxon_code'); `del` on this attribute fails in recent pandas

In [None]:
binary_multilabel.index.names

In [None]:
#TAKES FOREVER TO RUN!
from sklearn.utils import resample

In [None]:
balanced_df = binary_multilabel
upper = len(binary_multilabel.columns)+1

for taxon in range(1, upper):
    num_samples = binary_multilabel[binary_multilabel[taxon]==1].shape[0] 
    if num_samples<500:
        print("Taxon code:",taxon,"Taxon name:",labels_index[taxon])
        print("SMALL SUPPORT:",num_samples)
        df_minority = binary_multilabel[binary_multilabel[taxon]==1]

        # Upsample minority class
        df_minority_upsampled = resample(df_minority, 
                                             replace=True,     # sample with replacement
                                             n_samples=(500),    # to match majority class, switch to max_content_freq if works
                                             random_state=123) # reproducible results

        # Combine majority class with upsampled minority class
        balanced_df = pd.concat([balanced_df, df_minority_upsampled])

        # Display new shape
        print(balanced_df.shape)


Do not remove index because the text data lives there.
**TODO** Consider reworking how datasets are set up at some point
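
For reference, the resampling step in the loop above can be sketched without pandas. This is a minimal illustration assuming a plain list of rows and the hypothetical helper `upsample`. Note that, as in the loop, the resampled rows are appended to the original rows, so a taxon with n rows ends up with n + n_samples rows rather than exactly n_samples.

```python
import random

def upsample(rows, n_samples, seed=123):
    """Draw n_samples rows with replacement, reproducibly (illustrative helper)."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n_samples)]

minority = ["doc_a", "doc_b", "doc_c"]    # hypothetical rows tagged to a small taxon
extra = upsample(minority, n_samples=5)   # the notebook uses n_samples=500
balanced = minority + extra               # originals are kept and duplicates appended

print(len(minority), len(extra), len(balanced))
```

Because a content item can carry several taxons, duplicating its row also adds support to every other taxon it is tagged to.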

In [None]:
balanced_df.to_csv(os.path.join(DATADIR, 'balanced_level2_500.csv.gz'), compression='gzip')

### LOAD OVERSAMPLED DATASET

In [None]:
balanced_df = pd.read_csv(os.path.join(DATADIR, 'balanced_level2_500.csv.gz'), dtype=object, compression='gzip')

In [None]:
balanced_df.shape

In [None]:
#will convert columns to an array of shape
print('Shape of Y multilabel array before train/val/test split:{}'.format(balanced_df[list(balanced_df.columns)].values.shape))

In [None]:
#after saving out and reading in the indexes have been converted to columns
balanced_df.columns

In [None]:
balanced_df.head()

In [None]:
#don't overwrite balanced_df as it takes ages to read in
balanced_df_taxons = balanced_df.iloc[:,4:215]

In [None]:
balanced_df_taxons.columns = balanced_df_taxons.columns.astype(int)

In [None]:
balanced_df_taxons = balanced_df_taxons.astype(int)

In [None]:
#convert columns to an array. Each row represents a content item, each column an individual taxon
binary_multilabel = balanced_df_taxons[list(balanced_df_taxons.columns)].values
print('Example row of multilabel array {}'.format(binary_multilabel[2]))

In [None]:
type(binary_multilabel)

# Format metadata/X

In [None]:
#extract content_id index to df
meta1 = pd.DataFrame(balanced_df['content_id'])

In [None]:
print(meta1.shape)
meta1.head()

In [None]:
metas = ['document_type','first_published_at','publishing_app','primary_publishing_organisation']

In [None]:
def build_index(x):
    index_dict = {}
    index_dict['index'] = 0
    for i,elem in enumerate(x):
        index_dict[elem] = i+1
    return index_dict

In [None]:
import time

In [None]:
#IF THIS FUNCTION TURNS OUT FASTER KEEP
#apply meta data to content
print("STARTED:",time.strftime("%H:%M:%S"))
for meta in metas:
    print("WORKON:",meta)
    meta1[meta] = meta1['content_id'].map(dict(zip(labelled_level2['content_id'], labelled_level2[meta])))
print("FINISHED:",time.strftime("%H:%M:%S"))

In [None]:
meta1 = meta1.replace(np.nan, '', regex=True) #convert nans to empty strings for labelencoder types
meta1.head()

In [None]:
def to_cat_to_hot(column):
    doctype_encoder = LabelEncoder()
    new_col = column+"_cat"
    meta1[new_col] = doctype_encoder.fit_transform(meta1[column])
    return to_categorical(meta1[new_col])

dict_of_encodings = {}
for meta in metas:
    if meta != "first_published_at":
        print(meta)
        dict_of_encodings[meta] = to_cat_to_hot(meta)   

In [None]:
meta1.head()

In [None]:
meta1['first_published_at'] = pd.to_datetime(meta1['first_published_at'])
print(meta1['first_published_at'].shape)

In [None]:
first_published = np.array(meta1['first_published_at']).reshape(meta1['first_published_at'].shape[0], 1)

In [194]:
print(first_published.dtype,first_published.shape,type(first_published))

datetime64[ns] (169271, 1) <class 'numpy.ndarray'>


In [None]:
dict_of_encodings.keys()

In [None]:
meta = np.concatenate((dict_of_encodings['document_type'], 
                               dict_of_encodings['primary_publishing_organisation'], 
                               dict_of_encodings['publishing_app']), 
                              axis=1)

In [None]:
nb_metavars = meta.shape[1]
print(nb_metavars)
print(meta.shape)

### Tokenize text fields

Tokenizer = Class for vectorizing texts, or/and turning texts into sequences (=list of word indexes, where the word of rank i in the dataset (starting at 1) has index i)
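
To make the "word of rank i has index i" idea concrete, here is a pure-Python sketch of frequency-ranked indexing, the `num_words` cut-off, and front padding. The helper names are hypothetical, and the behaviour shown (indices below `num_words` are kept; padding and truncation happen at the front by default) is my understanding of the Keras defaults rather than a copy of its implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to its frequency rank, starting at 1 (most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping indices >= num_words if it is set."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences

def pad(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["tax on income", "income tax return", "tax"]  # toy corpus
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = [pad(s, maxlen=4) for s in sequences]
restricted = texts_to_sequences(texts, word_index, num_words=3)

print(word_index)
print(padded)
print(restricted)
```

Index 0 is never assigned to a word, which is why it can be used for padding.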

In [222]:
def tokenize(local_tokenizer,input_data,option):
    # apply tokenizer to our text data
    data = []
    local_tokenizer.fit_on_texts(input_data)
    # list of word indexes, where the word of rank i in the dataset (starting at 1) has index i
    sequences = local_tokenizer.texts_to_sequences(input_data) #yield one sequence per input text
    word_index = local_tokenizer.word_index  # Only set after fit_on_texts was called.
    print('Found %s unique tokens.' % len(word_index))
    if option:
        data = pad_sequences(sequences, maxlen= MAX_SEQUENCE_LENGTH) #MAX_SEQUENCE_LENGTH
    else:
        # use the tokenizer passed in, not the global `tokenizer`
        data = local_tokenizer.sequences_to_matrix(sequences)
    return data

In [223]:
# True pads the sequences to MAX_SEQUENCE_LENGTH; False returns a sequences_to_matrix array of shape (n_samples, num_words).
texts = balanced_df['combined_text']
tokenizer = Tokenizer(num_words=NUM_WORDS)
data = tokenize(tokenizer,texts,False)

titles = balanced_df['title']
tokenizer_tit = Tokenizer(num_words=10000)
onehot_tit = tokenize(tokenizer_tit,titles,True)

descs = balanced_df['description']
tokenizer_desc = Tokenizer(num_words=10000)
onehot_desc = tokenize(tokenizer_desc,descs,True)

Found 193851 unique tokens.
Found 31442 unique tokens.
Found 36795 unique tokens.


In [224]:
h5f = h5py.File("tokenized_combined_text.hdf5", "w")
h5f.create_dataset('tokenized_combined_text', data=data)
h5f.close()
# data = pd.read_pickle(os.path.join(DATADIR, 'tokenized_combined_text'))

In [225]:
print('Shape of label tensor:', binary_multilabel.shape)
print('Shape of data tensor:', data.shape)

Shape of label tensor: (169271, 210)
Shape of data tensor: (169271, 20000)


### Data split
- Training data = 80%
- Development data = 10%
- Test data = 10%
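
The split boundaries can be sketched on a toy list before applying them to the real arrays. This is a minimal illustration with a hypothetical helper `split_indices`; it uses positive offsets, whereas the cells below use negative offsets counted from the end of the shuffled data.

```python
def split_indices(n, dev_frac=0.1, test_frac=0.1):
    """Return (start, end) pairs for train, dev and test over n shuffled rows."""
    n_test = int(test_frac * n)
    n_dev = int(dev_frac * n)
    train_end = n - n_dev - n_test
    return (0, train_end), (train_end, n - n_test), (n - n_test, n)

rows = list(range(20))  # stand-in for already shuffled data
train_s, dev_s, test_s = split_indices(len(rows))
train, dev, test = (rows[a:b] for a, b in (train_s, dev_s, test_s))

print(len(train), len(dev), len(test))
```

The three slices are contiguous and non-overlapping, so every row lands in exactly one set.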

In [None]:
# shuffle data and standardise indices
indices = np.arange(data.shape[0])
print(indices)
np.random.seed(0)
np.random.shuffle(indices)
print(indices)

In [None]:
data = data[indices]
metadata = meta[indices]
title_data = onehot_tit[indices]
desc_data = onehot_desc[indices]
timedata = first_published[indices]
labels = binary_multilabel[indices]

In [None]:
nb_test_samples = int(0.1 * data.shape[0]) #test split: last 10% of rows
print('nb_test samples:', nb_test_samples)

nb_dev_samples = int(0.2 * data.shape[0]) #dev + test combined (last 20% of rows); marks where the training rows end
print('nb_dev + nb_test samples:', nb_dev_samples)

nb_training_samples = int(0.8 * data.shape[0]) #training split: first 80% of rows
print('nb_training samples:', nb_training_samples)

In [218]:
def split(data,splits):
    for (start,end) in splits:
        yield data[start:end]

In [None]:
splits = [(0,-nb_dev_samples),(-nb_dev_samples,-nb_test_samples),(-nb_test_samples,len(data))]

x_train, x_dev, x_test = split(data,splits)
meta_train, meta_dev, meta_test = split(metadata,splits)
title_train, title_dev, title_test = split(title_data,splits)
desc_train, desc_dev, desc_test = split(desc_data,splits)
timedata_train, timedata_dev, timedata_test = split(timedata,splits)
y_train, y_dev, y_test = split(labels,splits)

In [None]:
print('Shape of x_train:', x_train.shape)
print('Shape of meta_train:', meta_train.shape)
print('Shape of title_train:', title_train.shape)
print('Shape of desc_train:', desc_train.shape)
print('Shape of timedata_train:', timedata_train.shape)
print('Shape of y_train:', y_train.shape)

In [None]:
print('Shape of x_dev:', x_dev.shape)
print('Shape of meta_dev:', meta_dev.shape)
print('Shape of title_dev:', title_dev.shape)
print('Shape of desc_dev:', desc_dev.shape)
print('Shape of timedata_dev:', timedata_dev.shape)
print('Shape of y_dev:', y_dev.shape)

In [None]:
print('Shape of x_test:', x_test.shape)
print('Shape of meta_test:', meta_test.shape)
print('Shape of title_test:', title_test.shape)
print('Shape of desc_test:', desc_test.shape)
print('Shape of timedata_test:', timedata_test.shape)
print('Shape of y_test:', y_test.shape)

### preparing the Embedding layer

NB stopwords haven't been removed yet...

In [None]:
embedding_layer = Embedding(len(tokenizer.word_index) + 1,  # word_index is local to tokenize(), so read it from the fitted tokenizer
                            EMBEDDING_DIM, 
                            input_length=MAX_SEQUENCE_LENGTH)

An Embedding layer should be fed sequences of integers, i.e. a 2D input of shape (samples, indices). These input sequences should be padded so that they all have the same length in a batch of input data (although an Embedding layer is capable of processing sequence of heterogenous length, if you don't pass an explicit input_length argument to the layer).

All that the Embedding layer does is to map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).
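
The lookup described above can be sketched in a few lines of plain Python. The embedding values here are made up for illustration, and `embed` is a hypothetical helper; a real Embedding layer learns the matrix during training.

```python
# Toy embedding matrix: one row per word index, embedding_dim = 2.
embeddings = [
    [0.0, 0.0],  # index 0, used for padding
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
]

def embed(batch, matrix):
    """Replace every integer index by the matrix row at that position."""
    return [[matrix[i] for i in sequence] for sequence in batch]

batch = [[1, 2], [0, 1]]  # 2 samples, sequence_length = 2
out = embed(batch, embeddings)
shape = (len(out), len(out[0]), len(out[0][0]))

print(out[0])  # [embeddings[1], embeddings[2]]
print(shape)   # (samples, sequence_length, embedding_dim)
```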

### Estimate class weights for unbalanced datasets.
parameter to model.fit = __class_weight__: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.

Implement class_weight from sklearn:

- Import the module 

`from sklearn.utils import class_weight`
- calculate the class weights. If ‘balanced’, class weights will be given by n_samples / (n_classes * np.bincount(y)). Store the result under a different name so that the imported module is not overwritten:

`class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)`

- change it to a dict in order to work with Keras.

`class_weight_dict = dict(enumerate(class_weights))`

- Add to model fitting

`model.fit(X_train, y_train, class_weight=class_weight_dict)`
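
The 'balanced' formula quoted above is easy to verify by hand. The sketch below is a pure-Python version for a single-label integer target, with a hypothetical helper name; the multilabel `y_train` matrix in this notebook would need some per-column variant, which is not shown here.

```python
from collections import Counter

def balanced_class_weights(y):
    """n_samples / (n_classes * count_of_class) for each class in y."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

y = [0] * 6 + [1] * 3 + [2] * 1  # toy labels: class 2 is rare
weights = balanced_class_weights(y)
print(weights)
```

Rare classes get larger weights, and each class contributes the same total weight (n_samples / n_classes) to the loss.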

In [None]:
# class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# class_weight_dict = dict(enumerate(class_weights))

### Custom loss function

In [None]:
class WeightedBinaryCrossEntropy(object):

    def __init__(self, pos_ratio):
        neg_ratio = 1. - pos_ratio
        #self.pos_ratio = tf.constant(pos_ratio, tf.float32)
        self.pos_ratio = pos_ratio
        #self.weights = tf.constant(neg_ratio / pos_ratio, tf.float32)
        self.weights = neg_ratio / pos_ratio
        self.__name__ = "weighted_binary_crossentropy({0})".format(pos_ratio)

    def __call__(self, y_true, y_pred):
        return self.weighted_binary_crossentropy(y_true, y_pred)

    def weighted_binary_crossentropy(self, y_true, y_pred):
        # Transform the predicted probabilities back to logits
        epsilon = tf.convert_to_tensor(K.epsilon(), y_pred.dtype.base_dtype)
        y_pred = tf.clip_by_value(y_pred, epsilon, 1 - epsilon)
        y_pred = tf.log(y_pred / (1 - y_pred))

        # self.weights is the pos_weight applied to the positive-label term
        cost = tf.nn.weighted_cross_entropy_with_logits(y_true, y_pred, self.weights)
        return K.mean(cost * self.pos_ratio, axis=-1)
    
y_true_arr = np.array([0,1,0,1], dtype="float32")
y_pred_arr = np.array([0,0,1,1], dtype="float32")
y_true = tf.constant(y_true_arr)
y_pred = tf.constant(y_pred_arr)

with tf.Session().as_default(): 
    print(WeightedBinaryCrossEntropy(0.5)(y_true, y_pred).eval())
    print(binary_crossentropy(y_true, y_pred).eval())


### difficulty getting global precision/recall metrics . CAUTION interpreting monitoring metrics
fchollet: "Basically these are all global metrics that were approximated
batch-wise, which is more misleading than helpful. This was mentioned in
the docs but it's much cleaner to remove them altogether. It was a mistake
to merge them in the first place."
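
A toy calculation shows why averaging batch-wise scores can mislead. The two batches below are invented for illustration: one is predicted perfectly and the other badly, and the mean of the batch F1 scores differs from the F1 computed over all samples at once.

```python
def counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

batches = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # every label found: batch F1 = 1
    ([1, 0, 0, 0], [0, 1, 1, 1]),  # nothing right: batch F1 = 0
]

batch_f1s = [f1_from_counts(*counts(t, p)) for t, p in batches]
mean_batch_f1 = sum(batch_f1s) / len(batch_f1s)

all_true = [v for t, _ in batches for v in t]
all_pred = [v for _, p in batches for v in p]
global_f1 = f1_from_counts(*counts(all_true, all_pred))

print(mean_batch_f1, global_f1)
```

For this reason the sklearn metrics computed on the full prediction arrays later in the notebook are the safer numbers to report.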

In [None]:
def f1(y_true, y_pred):
    """Batch-wise F1 score: the harmonic mean of precision and recall.

        Precision and recall are computed over the current batch only,
        after rounding the predictions at 0.5, so the value reported per
        epoch is an average of batch-level scores, not a global F1.
        """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    # epsilon avoids a NaN when precision and recall are both zero
    f1 = 2*((precision*recall)/(precision+recall+K.epsilon()))
    
    return f1

## Training a 1D convnet

### 1. Create model

In [None]:
NB_CLASSES = y_train.shape[1]
NB_METAVARS = meta_train.shape[1]



sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='wordindex') #MAX_SEQUENCE_LENGTH
embedded_sequences = embedding_layer(sequence_input)
x = Dropout(0.2, name = 'dropout_embedded')(embedded_sequences)

x = Conv1D(128, 5, activation='relu', name = 'conv0')(x)

x = MaxPooling1D(5, name = 'max_pool0')(x)

x = Dropout(0.5, name = 'dropout0')(x)

x = Conv1D(128, 5, activation='relu', name = 'conv1')(x)

x = MaxPooling1D(5 , name = 'max_pool1')(x)

x = Conv1D(128, 5, activation='relu', name = 'conv2')(x)

x = MaxPooling1D(35, name = 'global_max_pool')(x)  # global max pooling

x = Flatten()(x) #reduce dimensions from 3 to 2; convert to vector + FULLYCONNECTED

meta_input = Input(shape=(NB_METAVARS,), name='meta')
meta_hidden = Dense(128, activation='relu', name = 'hidden_meta')(meta_input)
meta_hidden = Dropout(0.2, name = 'dropout_meta')(meta_hidden)


title_input = Input(shape=(title_train.shape[1],), name='titles')
title_hidden = Dense(128, activation='relu', name = 'hidden_title')(title_input)
title_hidden = Dropout(0.2, name = 'dropout_title')(title_hidden)

desc_input = Input(shape=(desc_train.shape[1],), name='descs')
desc_hidden = Dense(128, activation='relu', name = 'hidden_desc')(desc_input)
desc_hidden = Dropout(0.2, name = 'dropout_desc')(desc_hidden)

concatenated = concatenate([meta_hidden, title_hidden, desc_hidden, x])

x = Dense(400, activation='relu', name = 'fully_connected0')(concatenated)

x = Dropout(0.2, name = 'dropout1')(x)

x = Dense(NB_CLASSES, activation='sigmoid', name = 'fully_connected1')(x)

# # The Model class turns an input tensor and output tensor into a model
# This creates Keras model instance, will use this instance to train/test the model.
model = Model(inputs=[meta_input, title_input, desc_input, sequence_input], outputs=x)
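
The pool size of 35 in the last pooling layer only acts as a global max pool when the sequence length is 1000. The arithmetic below is a sketch assuming the Keras defaults used above ('valid' padding, convolution stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_len(length, kernel):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return length - kernel + 1

def pool1d_len(length, pool):
    """Output length of 'valid' max pooling with stride equal to the pool size."""
    return (length - pool) // pool + 1

layers = [
    ("conv0", conv1d_len, 5),
    ("max_pool0", pool1d_len, 5),
    ("conv1", conv1d_len, 5),
    ("max_pool1", pool1d_len, 5),
    ("conv2", conv1d_len, 5),
    ("global_max_pool", pool1d_len, 35),
]

length = 1000  # MAX_SEQUENCE_LENGTH
lengths = []
for name, fn, size in layers:
    length = fn(length, size)
    lengths.append(length)
    print(name, length)
```

If MAX_SEQUENCE_LENGTH changes, the final pool size has to be recomputed the same way; Keras also provides `GlobalMaxPooling1D`, which avoids the hard-coded value.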

### 2. Compile model

In [None]:
# model.compile(loss=WeightedBinaryCrossEntropy(POS_RATIO),
#               optimizer='rmsprop',
#               metrics=['binary_accuracy', f1])

Metric values are recorded at the end of each epoch on the training dataset. If a validation dataset is also provided, then the metric recorded is also calculated for the validation dataset.

All metrics are reported in verbose output and in the history object returned from calling the fit() function. In both cases, the name of the metric function is used as the key for the metric values. In the case of metrics for the validation dataset, the “val_” prefix is added to the key.
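
As a sketch of how those keys pair up, the dictionary below imitates the structure of `history.history` with hypothetical values (three epochs, as could happen when early stopping ends training before the requested number of epochs).

```python
# Hypothetical values, for illustration only; real ones come from history.history.
history_dict = {
    "loss": [0.30, 0.21, 0.18],
    "val_loss": [0.28, 0.24, 0.25],
    "f1": [0.40, 0.55, 0.61],
    "val_f1": [0.38, 0.50, 0.49],
}

# Pair each training metric with its validation counterpart.
pairs = {k: "val_" + k for k in history_dict if not k.startswith("val_")}

# Derive the x-axis from the number of epochs actually run.
epochs_run = range(1, len(history_dict["loss"]) + 1)

# Epoch with the lowest validation loss.
best_epoch = min(epochs_run, key=lambda e: history_dict["val_loss"][e - 1])

print(pairs, list(epochs_run), best_epoch)
```

Deriving the epoch axis from the length of the recorded lists keeps plots valid whether or not training stopped early.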

You have now built a function to describe your model. To train and test this model, there are four steps in Keras:
1. Create the model by calling the function above
2. Compile the model by calling `model.compile(optimizer = "...", loss = "...", metrics = ["accuracy"])`
3. Train the model on train data by calling `model.fit(x = ..., y = ..., epochs = ..., batch_size = ...)`
4. Test the model on test data by calling `model.evaluate(x = ..., y = ...)`

If you want to know more about `model.compile()`, `model.fit()`, `model.evaluate()` and their arguments, refer to the official [Keras documentation](https://keras.io/models/model/).


In [None]:
model.summary()

### Tensorboard callbacks /metrics /monitor training

<span style="color:red"> **Size of these files is killing storage during training. Is it histograms?**</span>

In [None]:
tb = TensorBoard(log_dir='./learn_embedding_logs', histogram_freq=1, write_graph=True, write_images=False)

In [None]:
CHECKPOINT_PATH = os.path.join(DATADIR, 'model_checkpoint.hdf5')

cp = ModelCheckpoint(
                     filepath = CHECKPOINT_PATH, 
                     monitor='val_loss', 
                     verbose=0, 
                     save_best_only=False, 
                     save_weights_only=False, 
                     mode='auto', 
                     period=1
                    )

In [None]:
# class Metrics(Callback):
#     def on_train_begin(self, logs={}):
#         self.val_f1s = []
#         self.val_recalls = []
#         self.val_precisions = []
 
#     def on_epoch_end(self, epoch, logs={}):
#         val_predict = (np.asarray(self.model.predict(self.model.validation_data[0]))).round()
#         val_targ = self.model.validation_data[1]
        
#         self.val_f1s.append(f1_score(val_targ, val_predict, average='micro'))
#         self.val_recalls.append(recall_score(val_targ, val_predict))
#         self.val_precisions.append(precision_score(val_targ, val_predict))
#         print("- val_f1: %f — val_precision: %f — val_recall %f" 
#                 %(f1_score(val_targ, val_predict, average='micro'), 
#                   precision_score(val_targ, val_predict),
#                    recall_score(val_targ, val_predict)))
#         return
 
# metrics = Metrics()

In [None]:
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
#model.fit(x, y, validation_split=0.2, callbacks=[early_stopping])

### 3. Train model

In [None]:
# metrics callback causes problems during training
# So disable for now
from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss=WeightedBinaryCrossEntropy(POS_RATIO),
              optimizer='rmsprop',
              metrics=['binary_accuracy', f1])

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 128, each GPU will process 16 samples.
history = parallel_model.fit(
    {'meta': meta_train, 'titles': title_train, 'descs': desc_train, 'wordindex': x_train},
    y_train, 
    validation_data=([meta_dev, title_dev, desc_dev, x_dev], y_dev), 
    epochs=10, batch_size=128, callbacks=[early_stopping]
)


# history = model.fit(
#     {'meta': meta_train, 'titles': title_train, 'descs': desc_train, 'wordindex': x_train},
#     y_train, 
#     validation_data=([meta_dev, title_dev, desc_dev, x_dev], y_dev), 
#     epochs=10, batch_size=128, callbacks=[early_stopping]
# )

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)  # early stopping may end training before 10 epochs

plt.plot(epochs, loss_values, 'bo', label='Training loss')           
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')      
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()    

f1_values = history_dict['f1']
val_f1_values = history_dict['val_f1']

plt.plot(epochs, f1_values, 'bo', label='Training f1')
plt.plot(epochs, val_f1_values, 'b', label='Validation f1')
plt.title('Training and validation batch-level f1-micro')
plt.xlabel('Epochs')
plt.ylabel('F1-micro')
plt.legend()

plt.show()

### Evaluate model

#### Training metrics

In [None]:
y_prob = parallel_model.predict([meta_train, title_train, desc_train, x_train])

In [None]:
y_prob.shape

In [None]:
y_pred = y_prob.copy()
y_pred[y_pred>=P_THRESHOLD] = 1
y_pred[y_pred<P_THRESHOLD] = 0

In [None]:
f1_score(y_train, y_pred, average='micro')

In [None]:
#average= None, the scores for each class are returned.
#precision_recall_fscore_support(y_train, y_pred, average=None, sample_weight=None)

In [None]:
#a = precision_recall_fscore_support(y_train, y_pred, average=None, sample_weight=None)
# pd.DataFrame(list(a))
# f1_byclass = pd.DataFrame((a)[2], columns=['f1'])

# support_byclass = pd.DataFrame((a)[3], columns=['support'])

# f1_byclass = pd.merge(
#     left=f1_byclass, 
#     right=support_byclass, 
#     left_index=True,
#     right_index=True,
#     how='outer', 
#     validate='one_to_one'
# )

# f1_byclass['index_col'] = f1_byclass.index

# f1_byclass['level2taxon'] = f1_byclass['index_col'].map(labels_index).copy()

# print("At p_threshold of {}, there were {} out of {} ({})% taxons with auto-tagged content in the training data"
#       .format(P_THRESHOLD, 
#               f1_byclass.loc[f1_byclass['f1'] > 0].shape[0], 
#               y_pred.shape[1], 
#               (f1_byclass.loc[f1_byclass['f1'] > 0].shape[0]/y_pred.shape[1])*100 ))

In [None]:
# no_auto_content = f1_byclass.loc[f1_byclass['f1'] == 0]
# no_auto_content = no_auto_content.set_index('level2taxon')

In [None]:
# no_auto_content['support'].sort_values().plot( kind = 'barh', figsize=(20, 20))

In [None]:
# classes_predictedto = f1_byclass.loc[f1_byclass['f1'] > 0]
# classes_predictedto = classes_predictedto.set_index('level2taxon') 

In [None]:
# classes_predictedto.plot.scatter(x='support', y='f1', figsize=(20, 10), xticks=np.arange(0, 9700, 100))

In [None]:
# classes_predictedto['f1'].sort_values().plot( kind = 'barh', figsize=(20, 20))

In [None]:
#Calculate globally by counting the total true positives, false negatives and false positives.
precision_recall_fscore_support(y_train, y_pred, average='micro', sample_weight=None) 

In [None]:
#Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account
precision_recall_fscore_support(y_train, y_pred, average='macro', sample_weight=None)

In [None]:
#Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This takes label imbalance into account
precision_recall_fscore_support(y_train, y_pred, average='weighted', sample_weight=None)
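
The three averaging modes used above can give quite different numbers on imbalanced labels. The sketch below recomputes them by hand on an invented two-label example; the helper names are hypothetical and the code is only meant to mirror the definitions in the comments, not sklearn's implementation.

```python
def per_label_counts(y_true, y_pred):
    """Return (tp, fp, fn) for each label column of two 0/1 matrices."""
    out = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        out.append((tp, fp, fn))
    return out

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Label 0 is common and always predicted; label 1 is rare and never predicted.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

label_counts = per_label_counts(y_true, y_pred)
label_f1 = [f1_from_counts(*c) for c in label_counts]
support = [tp + fn for tp, _, fn in label_counts]

micro = f1_from_counts(*[sum(col) for col in zip(*label_counts)])  # pool all counts
macro = sum(label_f1) / len(label_f1)                              # unweighted mean
weighted = sum(f * s for f, s in zip(label_f1, support)) / sum(support)

print(micro, macro, weighted)
```

Here the rare label scores zero, which drags the macro average down much further than the micro or weighted averages.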

#### Development set metrics

In [None]:
y_pred_dev = parallel_model.predict([meta_dev, title_dev, desc_dev, x_dev])

In [None]:
y_pred_dev[y_pred_dev>=P_THRESHOLD] = 1
y_pred_dev[y_pred_dev<P_THRESHOLD] = 0

In [None]:
#average= None, the scores for each class are returned.
precision_recall_fscore_support(y_dev, y_pred_dev, average=None, sample_weight=None)

In [None]:
#Calculate globally by counting the total true positives, false negatives and false positives.
precision_recall_fscore_support(y_dev, y_pred_dev, average='micro', sample_weight=None) 

In [None]:
#Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account
precision_recall_fscore_support(y_dev, y_pred_dev, average='macro', sample_weight=None)

In [None]:
#Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This takes label imbalance into account
precision_recall_fscore_support(y_dev, y_pred_dev, average='weighted', sample_weight=None)