## 04 - Neural Networks with Keras

In this notebook we take a neural-network approach, using **Keras** (as bundled with **TensorFlow**) to predict whether each tweet refers to a real disaster or not.

#### Loading data

We start by importing the packages we are going to use and loading the datasets:

In [270]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras

# Import the layers from tensorflow.keras so they match the `keras` alias above
from tensorflow.keras.layers import (
    TextVectorization,
    Embedding,
    Dropout,
    Conv1D,
    GlobalMaxPooling1D,
    Dense,
)

from sklearn.model_selection import KFold

import string
import re

train_data = pd.read_csv("../data/train.csv")
test_data = pd.read_csv("../data/test.csv")

train_text, train_label = np.array(train_data['text']), np.array(train_data['target'])
test_text = test_data['text']

print(train_text.shape)
print(train_label.shape)
print(test_text.shape)

(7613,)
(7613,)
(3263,)


In [271]:
pd.Series(np.array([len(text.split()) for text in train_text])).describe()

count    7613.000000
mean       14.903586
std         5.732604
min         1.000000
25%        11.000000
50%        15.000000
75%        19.000000
max        31.000000
dtype: float64

In [272]:
len(np.unique(np.array(' '.join(train_text).split())))

31924

In [273]:
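# Vocabulary size kept by the vectorizer
# max_features = 20000
max_features = 10000

# Length (in tokens) every tweet is padded or truncated to
sequence_length = 500
# sequence_length = 32

# Dimension of the learned word embeddings
embedding_dim = 128
# embedding_dim = 64

# Fraction of units dropped during training
dropout_rate = 0.5

# Number of filters, kernel width and stride of each Conv1D layer
conv_filters = 128
# conv_filters = 64
conv_kernel_size = 7
conv_strides = 3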

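We clean the text by lowercasing it and removing punctuation characters (stopwords are kept), and then adapt a **TextVectorization** layer on the training tweets to map each token to an integer index: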

In [274]:
def custom_standardization(raw):
    lowercase = tf.strings.lower(raw)

    no_punct = tf.strings.regex_replace(
        lowercase, "[%s]" % re.escape(string.punctuation), ""
    )

    return no_punct

vectorizer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

vectorizer.adapt(train_text)

len(vectorizer.get_vocabulary())

10000
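
To make the preprocessing concrete, here is a minimal pure-Python sketch of what the standardization and the integer vectorization do. It is not the Keras implementation: the helper names (`standardize`, `build_vocabulary`, `vectorize`) and the toy texts are made up for illustration, and the ordering of equally frequent tokens may differ from the real layer. It does rely on the documented behaviour that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and that both count towards `max_tokens`.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and delete every character in string.punctuation."""
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())

def build_vocabulary(texts, max_tokens):
    """Two reserved entries followed by the most frequent tokens."""
    counts = Counter(
        token for text in texts for token in standardize(text).split()
    )
    # Index 0 is the padding token, index 1 the out-of-vocabulary token.
    return ["", "[UNK]"] + [t for t, _ in counts.most_common(max_tokens - 2)]

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to sequence_length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy corpus, not taken from the dataset
toy_texts = ["Forest fire near the town!", "The fire is out.", "Nice weather today"]
toy_vocabulary = build_vocabulary(toy_texts, max_tokens=6)
toy_vector = vectorize("Fire in the FOREST?!", toy_vocabulary, sequence_length=6)

print(standardize("Forest fire near #LaRonge, Sask. Canada!"))
print(toy_vocabulary)
print(toy_vector)
```

Because the two reserved entries are part of the vocabulary, `len(vectorizer.get_vocabulary())` returning 10000 above means that 9998 distinct tokens from the tweets are kept.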

In [275]:
def build_model():
    # Inputs are text strings, then we vectorize them
    inputs = keras.Input(shape=(1,), dtype=tf.string, name='text')
    x = vectorizer(inputs)

    # We use Embedding to map the vectorized text onto a space of dimension embedding_dim
    x = Embedding(max_features + 1, embedding_dim)(x)
    x = Dropout(dropout_rate)(x)

    # Conv1D + GlobalMaxPooling
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)

    # Dense hidden layer
    x = Dense(128, activation="relu")(x)
    x = Dropout(dropout_rate)(x)

    # Output layer
    outputs = Dense(1, activation="sigmoid", name="predictions")(x)

    model = keras.Model(inputs, outputs)

    # Compile the model with binary crossentropy loss and an adam optimizer.
    model.compile(optimizer="adam", loss='binary_crossentropy', metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

    return model
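
As a quick sanity check on the architecture, the sketch below computes how long the sequence is after each of the two `Conv1D` layers, assuming the Keras default `padding='valid'` (the model above does not set `padding`). The helper names are illustrative only. It compares the active setting `sequence_length = 500` with the commented-out alternative `sequence_length = 32`.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Output length of a 1D convolution without padding ('valid')."""
    return (input_length - kernel_size) // strides + 1

def feature_map_lengths(sequence_length, kernel_size=7, strides=3, n_layers=2):
    """Sequence length after each of n_layers stacked convolutions."""
    lengths = []
    length = sequence_length
    for _ in range(n_layers):
        length = conv1d_output_length(length, kernel_size, strides)
        lengths.append(length)
    return lengths

print(feature_map_lengths(500))  # [165, 53]
print(feature_map_lengths(32))   # [9, 1]
```

With `sequence_length = 32` the second convolution leaves a single position, so `GlobalMaxPooling1D` would have nothing left to pool over. With 500, most positions are padding, since the longest training tweet has 31 whitespace-separated tokens.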

In [None]:
epochs = 3

model = build_model()
model.fit(train_text, train_label, epochs=epochs, verbose=2)

model.evaluate(train_text, train_label, verbose=2)

In [276]:
epochs = 3

kfold = KFold(n_splits=10, shuffle=True)

scores = []
models = []

i = 1
for fold_train_indices, fold_val_indices in kfold.split(train_text, train_label):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i}')
    print('------------------------------------------------------------------------')

    fold_train_text = train_text[fold_train_indices]
    fold_train_label = train_label[fold_train_indices]
    fold_val_text = train_text[fold_val_indices]
    fold_val_label = train_label[fold_val_indices]

    model = build_model()
    model.fit(fold_train_text, fold_train_label, epochs=epochs, verbose=2)
    models.append(model)

    fold_train_score = model.evaluate(fold_train_text, fold_train_label, verbose=2)
    fold_val_score = model.evaluate(fold_val_text, fold_val_label, verbose=2)
    scores.append({'train': fold_train_score, 'val': fold_val_score})

    i += 1

------------------------------------------------------------------------
> Fold 1
------------------------------------------------------------------------
Epoch 1/3
215/215 - 50s - loss: 0.6403 - accuracy: 0.6257 - precision_84: 0.6928 - recall_84: 0.2414
Epoch 2/3
215/215 - 33s - loss: 0.3908 - accuracy: 0.8358 - precision_84: 0.8470 - recall_84: 0.7569
Epoch 3/3
215/215 - 43s - loss: 0.2141 - accuracy: 0.9223 - precision_84: 0.9336 - recall_84: 0.8832
215/215 - 7s - loss: 0.0934 - accuracy: 0.9680 - precision_84: 0.9814 - recall_84: 0.9440
24/24 - 1s - loss: 0.5739 - accuracy: 0.7861 - precision_84: 0.7466 - recall_84: 0.7152
------------------------------------------------------------------------
> Fold 2
------------------------------------------------------------------------
Epoch 1/3
215/215 - 44s - loss: 0.6291 - accuracy: 0.6341 - precision_85: 0.7172 - recall_85: 0.2484
Epoch 2/3
215/215 - 40s - loss: 0.3924 - accuracy: 0.8351 - precision_85: 0.8697 - recall_85: 0.7259
Epoch 3
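
The step counts in the log (`215/215` for training, `24/24` for validation) follow from the fold sizes and the default Keras batch size of 32. The sketch below reproduces them. It assumes the documented scikit-learn rule that the first `n_samples % n_splits` folds receive one extra sample; the helper names are illustrative.

```python
import math

def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_samples = 7613  # size of the training set loaded above
sizes = fold_sizes(n_samples, 10)
first_val = sizes[0]
first_train = n_samples - first_val

print(sizes)
print(steps_per_epoch(first_train), steps_per_epoch(first_val))  # 215 24
```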

In [277]:
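# model.evaluate returns [loss, accuracy, precision, recall]; append the F1-score as a fifth entry
for fold_scores in scores:
    for subset in ['train', 'val']:
        precision = fold_scores[subset][2]
        recall = fold_scores[subset][3]
        # Harmonic mean of precision and recall, taken as 0 when both are 0
        f1_score = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        fold_scores[subset].append(f1_score)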
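
The F1-score is the harmonic mean of precision and recall, so `2/(1/precision + 1/recall)` is the same quantity as `2 * precision * recall / (precision + recall)`. The second form avoids dividing by zero when one of the two metrics is 0. A small stand-alone check, using the rounded fold 1 validation values from the training log (the function name `f1_from` is illustrative):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fold 1 validation metrics as printed in the log above (rounded to 4 decimals)
fold1_val_f1 = f1_from(0.7466, 0.7152)
print(round(fold1_val_f1, 4))  # 0.7306

# Both forms agree whenever precision and recall are non-zero
assert abs(fold1_val_f1 - 2 / (1 / 0.7466 + 1 / 0.7152)) < 1e-12
```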

In [278]:
metric_names = ['Loss', 'Accuracy', 'Precision', 'Recall', 'F1-score']
separator = '------------------------------------------------------------------------'

def format_scores(values):
    # Render e.g. "Loss: 0.0934 - Accuracy: 0.968 - ..."
    return ' - '.join(f'{name}: {round(value, 4)}' for name, value in zip(metric_names, values))

print(separator)
print('Score per fold')
for i, fold_scores in enumerate(scores, start=1):
    print(separator)
    print(f'> Fold {i} - Train')
    print(f'>>> {format_scores(fold_scores["train"])}')
    print(f'> Fold {i} - Validation')
    print(f'>>> {format_scores(fold_scores["val"])}')

# Average each metric over all folds, separately for train and validation
for subset, title in [('train', 'Train'), ('val', 'Validation')]:
    print(separator)
    print(f'Average scores for all folds - {title}')
    print(f'> {format_scores(np.mean([fold_score[subset] for fold_score in scores], axis=0))}')
print(separator)

------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Train
>>> Loss: 0.0934 - Accuracy: 0.968 - Precision: 0.9814 - Recall: 0.944 - F1-score: 0.9623
> Fold 1 - Validation
>>> Loss: 0.5739 - Accuracy: 0.7861 - Precision: 0.7466 - Recall: 0.7152 - F1-score: 0.7306
------------------------------------------------------------------------
> Fold 2 - Train
>>> Loss: 0.0993 - Accuracy: 0.9658 - Precision: 0.9888 - Recall: 0.9312 - F1-score: 0.9592
> Fold 2 - Validation
>>> Loss: 0.4985 - Accuracy: 0.7992 - Precision: 0.8059 - Recall: 0.6875 - F1-score: 0.742
------------------------------------------------------------------------
> Fold 3 - Train
>>> Loss: 0.0883 - Accuracy: 0.968 - Precision: 0.9789 - Recall: 0.9459 - F1-score: 0.9621
> Fold 3 - Validation
>>> Loss: 0.5747 - Accuracy: 0.7861 - Precision: 0.8 - Recall: 0.6767 - F1-score: 0.7332
-------------------------------

238/238 - 7s - loss: 0.0064 - accuracy: 0.9959


[0.006378879304975271, 0.9959279894828796]

In [80]:
test_pred = model.predict(test_text)
test_pred = np.round(test_pred).flatten().astype('int')

test_pred

array([1, 1, 1, ..., 0, 1, 1])
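
Rounding the sigmoid output amounts to a 0.5 decision threshold, with one edge case: `np.round`, like Python's built-in `round`, rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0. The sketch below uses made-up probabilities and a hypothetical helper `to_labels` to show an explicit threshold, which also makes it easy to trade precision against recall.

```python
def to_labels(probabilities, threshold=0.5):
    """Label 1 when the predicted probability reaches the threshold, else 0."""
    return [int(p >= threshold) for p in probabilities]

# Made-up probabilities for illustration, not model output
toy_probabilities = [0.91, 0.5, 0.49, 0.08]

print([round(p) for p in toy_probabilities])         # [1, 0, 0, 0]
print(to_labels(toy_probabilities))                  # [1, 1, 0, 0]
print(to_labels(toy_probabilities, threshold=0.4))   # [1, 1, 1, 0]
```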

Finally, we save the test predictions to a CSV file with the columns `id` and `target`:

In [81]:
output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
