## 04 Neural Networks with keras - 01 Base approach

In this notebook we implement a neural network approach, using the **keras** API from **tensorflow**, to predict whether each tweet refers to a real disaster or not. We fix an architecture with two convolutional layers followed by a dense layer, and choose the hyperparameters by hand rather than tuning them.

#### Loading data

We start by importing the packages we are going to use and loading the datasets:

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow.keras as keras

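# Import the layers from tensorflow.keras so that they come from the same package as `keras` above
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Dense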

from sklearn.model_selection import KFold

train_data = pd.read_csv("../../data/train.csv")
test_data = pd.read_csv("../../data/test.csv")

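# Replace every http:// link with a placeholder token (raw string avoids invalid escape sequences)
train_data['text'] = train_data['text'].replace(r'http://\S*', 'urltoken', regex=True)
test_data['text'] = test_data['text'].replace(r'http://\S*', 'urltoken', regex=True)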

train_text, train_label = np.array(train_data['text']), np.array(train_data['target'])
test_text = test_data['text']

print(train_text.shape)
print(train_label.shape)
print(test_text.shape)

(7613,)
(7613,)
(3263,)
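
As a quick sanity check of the URL replacement, the following sketch applies the same pattern with Python's `re` module to a few made-up strings (the sample tweets are hypothetical, not taken from the dataset). It also illustrates that the pattern only covers `http://` links, so an `https://` link would be left untouched:

```python
import re

# Same pattern as in the loading cell, written as a raw string
URL_PATTERN = re.compile(r'http://\S*')

# Hypothetical sample tweets, used only for illustration
samples = [
    "Forest fire near the lake http://t.co/abc123",
    "Flood warning issued https://t.co/xyz789 stay safe",
    "No link in this one",
]

# Substitute every match with the placeholder token
cleaned = [URL_PATTERN.sub('urltoken', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f'{before!r} -> {after!r}')
```

Only the first sample changes. Whether `https://` links should also be replaced is a design choice that depends on how the links appear in the data.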


We explore the training data. The average tweet has about 15 words, and the longest one has 31:

In [2]:
# Word counts
pd.Series(np.array([len(text.split()) for text in train_text])).describe()

count    7613.000000
mean       14.903586
std         5.732604
min         1.000000
25%        11.000000
50%        15.000000
75%        19.000000
max        31.000000
dtype: float64

There are 27736 unique whitespace-separated tokens across all the tweets (the count is case-sensitive and keeps punctuation attached to words):

In [3]:
# Number of unique tokens among all tweets
len(np.unique(np.array(' '.join(train_text).split())))

27736
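
The count above treats `Fire`, `fire` and `fire!` as different tokens, so it is at least as large as the vocabulary the model sees once the text is lowercased and stripped of punctuation. A minimal sketch of that effect on toy data, assuming that `string.punctuation` is a reasonable approximation of the characters removed by the standardization step (the helper `unique_tokens` and the toy tweets are hypothetical):

```python
import string

def unique_tokens(texts, standardize=False):
    """Count distinct whitespace-separated tokens, optionally after
    lowercasing and removing ASCII punctuation."""
    strip_table = str.maketrans('', '', string.punctuation)
    vocabulary = set()
    for text in texts:
        if standardize:
            text = text.lower().translate(strip_table)
        vocabulary.update(text.split())
    return len(vocabulary)

# Hypothetical toy tweets, used only for illustration
toy_tweets = ["Fire! Fire in the hills", "the FIRE is out.", "Hills are quiet"]

raw_count = unique_tokens(toy_tweets)
standardized_count = unique_tokens(toy_tweets, standardize=True)
print(raw_count, standardized_count)
```

On the toy data the raw count is 11 and the standardized count is 8. Calling `unique_tokens(train_text, standardize=True)` would give a comparable estimate for the real tweets.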

#### Model building

The following function creates and returns a model with a fixed layer architecture. Its hyperparameters are defined at the top of the same cell.

We start with a **TextVectorization** layer with the default standardization (lowercasing and stripping punctuation), followed by an **Embedding** layer. We then apply two **Conv1D** layers and a **GlobalMaxPooling1D** layer, and finish with a **Dense** layer. We include **Dropout** layers after the embedding and after the dense layer to reduce overfitting.

In [6]:
# Base model
max_features = 20000
sequence_length = 500

embedding_dim = 128

dropout_rate = 0.5

conv_filters = 128

conv_kernel_size = 7
conv_strides = 3

dense_layer_size = 128

def build_model():
    # Inputs are text strings, then we vectorize them
    inputs = keras.Input(shape=(1,), dtype=tf.string, name='text')

    vectorizer = TextVectorization(
        standardize='lower_and_strip_punctuation',
        max_tokens=max_features,
        output_mode="int",
        output_sequence_length=sequence_length,
    )
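    # Note: the vocabulary is learned from the full training set,
    # so it also covers tweets that later serve as a validation fold
    vectorizer.adapt(train_text)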
    x = vectorizer(inputs)

    # We use Embedding to map the vectorized text onto a space of dimension embedding_dim
    x = Embedding(max_features + 1, embedding_dim)(x)
    x = Dropout(dropout_rate)(x)

    # Conv1D + GlobalMaxPooling
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)

    # Dense hidden layer
    x = Dense(dense_layer_size, activation="relu")(x)
    x = Dropout(dropout_rate)(x)

    # Output layer
    outputs = Dense(1, activation="sigmoid", name="predictions")(x)

    model = keras.Model(inputs, outputs)

    # Compile the model with binary crossentropy loss and an adam optimizer.
    model.compile(optimizer="adam", loss='binary_crossentropy', metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

    return model

#### Model training

We are now ready to train the model. We start by creating an instance and printing a summary:

In [8]:
model = build_model()
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text (InputLayer)            [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 128)          2560128   
_________________________________________________________________
dropout_2 (Dropout)          (None, 500, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 165, 128)          114816    
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 53, 128)           114816    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0   
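
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the `Conv1D` layers use the default `'valid'` padding (no padding), which is consistent with the shapes shown above. The helper names are hypothetical:

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Output length of a 1D convolution without padding ('valid')."""
    return (input_length - kernel_size) // strides + 1

def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of size kernel_size x in_channels per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

# Hyperparameters copied from the model-building cell
max_features, sequence_length, embedding_dim = 20000, 500, 128
conv_filters, conv_kernel_size, conv_strides = 128, 7, 3

# Sequence length after each convolution
len1 = conv1d_output_length(sequence_length, conv_kernel_size, conv_strides)
len2 = conv1d_output_length(len1, conv_kernel_size, conv_strides)

# Trainable parameters of the embedding and of each convolution
embedding_params = (max_features + 1) * embedding_dim
conv_params = conv1d_params(embedding_dim, conv_filters, conv_kernel_size)

print(len1, len2)         # sequence lengths after conv1d_2 and conv1d_3
print(embedding_params)   # parameters of the embedding layer
print(conv_params)        # parameters of each convolutional layer
```

These values agree with the summary: lengths 165 and 53, 2560128 embedding parameters, and 114816 parameters per convolution. Both convolutions have the same count because each receives 128 input channels.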

We use 10-fold cross-validation and train for 3 epochs:

In [9]:
epochs = 3

kfold = KFold(n_splits=10, shuffle=True)

scores = []
models = []

i = 1
for fold_train_indices, fold_val_indices in kfold.split(train_text, train_label):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i}')
    print('------------------------------------------------------------------------')

    fold_train_text = train_text[fold_train_indices]
    fold_train_label = train_label[fold_train_indices]
    fold_val_text = train_text[fold_val_indices]
    fold_val_label = train_label[fold_val_indices]

    model = build_model()
    model.fit(fold_train_text, fold_train_label, epochs=epochs, verbose=2)
    models.append(model)

    fold_train_score = model.evaluate(fold_train_text, fold_train_label, verbose=2)
    fold_val_score = model.evaluate(fold_val_text, fold_val_label, verbose=2)
    scores.append({'train': fold_train_score, 'val': fold_val_score})

    i += 1

------------------------------------------------------------------------
> Fold 1
------------------------------------------------------------------------
Epoch 1/3
215/215 - 39s - loss: 0.6361 - accuracy: 0.6260 - precision_2: 0.7154 - recall_2: 0.2130
Epoch 2/3
215/215 - 38s - loss: 0.3834 - accuracy: 0.8396 - precision_2: 0.8716 - recall_2: 0.7343
Epoch 3/3
215/215 - 34s - loss: 0.1776 - accuracy: 0.9365 - precision_2: 0.9494 - recall_2: 0.9000
215/215 - 7s - loss: 0.0712 - accuracy: 0.9768 - precision_2: 0.9764 - recall_2: 0.9694
24/24 - 1s - loss: 0.5756 - accuracy: 0.7717 - precision_2: 0.7257 - recall_2: 0.7651
------------------------------------------------------------------------
> Fold 2
------------------------------------------------------------------------
Epoch 1/3
215/215 - 42s - loss: 0.6279 - accuracy: 0.6417 - precision_3: 0.6974 - recall_3: 0.2974
Epoch 2/3
215/215 - 37s - loss: 0.3630 - accuracy: 0.8542 - precision_3: 0.8766 - recall_3: 0.7700
Epoch 3/3
215/215 - 3
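
The step counts in the training log (215 batches per training fold, 24 per validation fold) follow from the fold sizes. The sketch below assumes the Keras default batch size of 32, since the cell above does not set `batch_size`, and mirrors how scikit-learn's `KFold` distributes the remainder over the first folds. The helper `kfold_sizes` is hypothetical:

```python
import math

def kfold_sizes(n_samples, n_splits):
    """Validation-fold sizes: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if k < extra else base for k in range(n_splits)]

# batch_size=32 is assumed here (the Keras default)
n_samples, n_splits, batch_size = 7613, 10, 32

val_sizes = kfold_sizes(n_samples, n_splits)
for val_size in sorted(set(val_sizes)):
    train_size = n_samples - val_size
    train_steps = math.ceil(train_size / batch_size)
    val_steps = math.ceil(val_size / batch_size)
    print(f'train: {train_size} samples, {train_steps} steps - val: {val_size} samples, {val_steps} steps')

# Steps per epoch when training on the whole training set
full_steps = math.ceil(n_samples / batch_size)
print(full_steps)
```

Under these assumptions every fold trains on 6851 or 6852 tweets (215 steps) and validates on 762 or 761 tweets (24 steps), while training on all 7613 tweets takes 238 steps per epoch.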

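We compute the F1-score of each fold as the harmonic mean of precision and recall, and append it to that fold's list of scores: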

In [10]:
for fold_scores in scores:
    for subset in ['train', 'val']:
        precision = fold_scores[subset][2]
        recall = fold_scores[subset][3]
        f1_score = 2/(1/precision + 1/recall)
        fold_scores[subset].append(f1_score)
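
The expression `2/(1/precision + 1/recall)` raises a `ZeroDivisionError` if either precision or recall is exactly zero, which can happen when a model predicts a single class. A minimal sketch of an equivalent form that guards against that case (the helper name `f1_from` is hypothetical, and returning 0.0 for the degenerate case is a convention chosen here):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall of the first validation fold, taken from the log
fold1_val_f1 = f1_from(0.7257, 0.7651)
print(round(fold1_val_f1, 4))
```

With the rounded values from the log this gives a result close to the F1-score reported for the first validation fold.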

And print a detailed summary of the scores:

In [11]:
def mean_score(subset, index):
    # Average of one metric over all folds
    # (index 0: loss, 1: accuracy, 2: precision, 3: recall, 4: F1-score)
    return round(np.mean([fold_score[subset][index] for fold_score in scores]), 4)

print('------------------------------------------------------------------------')
print('Average scores for all folds - Train')
print(f'> Loss: {mean_score("train", 0)} -  Accuracy: {mean_score("train", 1)} - Precision: {mean_score("train", 2)} - Recall: {mean_score("train", 3)} - F1-score: {mean_score("train", 4)}')
print('------------------------------------------------------------------------')
print('Average scores for all folds - Validation')
print(f'> Loss: {mean_score("val", 0)} -  Accuracy: {mean_score("val", 1)} - Precision: {mean_score("val", 2)} - Recall: {mean_score("val", 3)} - F1-score: {mean_score("val", 4)}')
print('------------------------------------------------------------------------')


i = 1
print('------------------------------------------------------------------------')
print('Score per fold')
for fold_scores in scores:
    print('------------------------------------------------------------------------')
    print(f'> Fold {i} - Train')
    print(f'>>> Loss: {round(fold_scores["train"][0], 4)} - Accuracy: {round(fold_scores["train"][1], 4)} - Precision: {round(fold_scores["train"][2], 4)} - Recall: {round(fold_scores["train"][3], 4)} - F1-score: {round(fold_scores["train"][4], 4)}')
    print(f'> Fold {i} - Validation')
    print(f'>>> Loss: {round(fold_scores["val"][0], 4)} - Accuracy: {round(fold_scores["val"][1], 4)} - Precision: {round(fold_scores["val"][2], 4)} - Recall: {round(fold_scores["val"][3], 4)} - F1-score: {round(fold_scores["val"][4], 4)}')
    i += 1
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Average scores for all folds - Train
> Loss: 0.0715 -  Accuracy: 0.978 - Precision: 0.9804 - Recall: 0.9682 - F1-score: 0.9742
------------------------------------------------------------------------
Average scores for all folds - Validation
> Loss: 0.5732 -  Accuracy: 0.773 - Precision: 0.7467 - Recall: 0.7269 - F1-score: 0.7331
------------------------------------------------------------------------
------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Train
>>> Loss: 0.0712 - Accuracy: 0.9768 - Precision: 0.9764 - Recall: 0.9694 - F1-score: 0.9729
> Fold 1 - Validation
>>> Loss: 0.5756 - Accuracy: 0.7717 - Precision: 0.7257 - Recall: 0.7651 - F1-score: 0.7449
------------------------------------------------------------------------
> Fold 2 - Train
>>> Loss: 0.0682 - Accuracy: 0.979 - Precision: 

#### Submission

We build a fresh model and train it on all the available training data, using the same number of epochs:

In [12]:
model = build_model()
model.fit(train_text, train_label, epochs=epochs, verbose=2)

model.evaluate(train_text, train_label, verbose=2)

Epoch 1/3
238/238 - 31s - loss: 0.6142 - accuracy: 0.6559 - precision_12: 0.7479 - recall_12: 0.3002
Epoch 2/3
238/238 - 28s - loss: 0.3579 - accuracy: 0.8528 - precision_12: 0.8788 - recall_12: 0.7625
Epoch 3/3
238/238 - 30s - loss: 0.1727 - accuracy: 0.9384 - precision_12: 0.9482 - recall_12: 0.9061
238/238 - 5s - loss: 0.0657 - accuracy: 0.9786 - precision_12: 0.9796 - recall_12: 0.9703


[0.06569287925958633,
 0.9785892367362976,
 0.979629635810852,
 0.9703454375267029]

We generate the predictions for the test set:

In [13]:
test_pred = model.predict(test_text)
test_pred = np.round(test_pred).flatten().astype('int')

test_pred

array([1, 1, 1, ..., 1, 1, 1])
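
Rounding the sigmoid outputs is equivalent to thresholding at 0.5 except at exactly 0.5, because `np.round` rounds halves to the nearest even integer. The following sketch compares both on made-up probabilities (the values are hypothetical, not model outputs):

```python
import numpy as np

# Hypothetical predicted probabilities with the same (n, 1) shape as model.predict output
probabilities = np.array([[0.2], [0.5], [0.5000001], [0.97]])

# Approach used in the notebook: round, flatten and cast to int
rounded = np.round(probabilities).flatten().astype('int')

# Alternative: an explicit threshold, which makes the decision rule visible
thresholded = (probabilities >= 0.5).flatten().astype('int')

print(rounded)
print(thresholded)
```

The two results differ only for the entry that is exactly 0.5. An output of exactly 0.5 is unlikely in practice, so the choice mostly affects readability and makes it easy to try a different threshold.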

And we save the predictions into a csv file ready for submission:

In [14]:
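import os

# Make sure the output directory exists; to_csv does not create it
os.makedirs('predictions', exist_ok=True)

output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")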

Submission successfully saved!
