## 04 - Neural Networks with keras

In this notebook we implement a neural-network approach, using the **keras** API from **tensorflow**, to predict whether each tweet refers to a real disaster or not.

#### Loading data

We start by importing the packages we are going to use and loading the datasets:

In [128]:
import numpy as np
import pandas as pd

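import tensorflow as tf
from tensorflow import keras

# Import the layers from tensorflow.keras so they match the keras module used below
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Dense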

from sklearn.model_selection import KFold

import string
import re

train_data = pd.read_csv("../data/train.csv")
test_data = pd.read_csv("../data/test.csv")

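# Replace every http:// link with a placeholder token (raw string avoids invalid escape sequences)
train_data['text'] = train_data['text'].replace(r'http://\S*', 'urltoken', regex=True)
test_data['text'] = test_data['text'].replace(r'http://\S*', 'urltoken', regex=True)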

train_text, train_label = np.array(train_data['text']), np.array(train_data['target'])
test_text = test_data['text']

print(train_text.shape)
print(train_label.shape)
print(test_text.shape)

(7613,)
(7613,)
(3263,)
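To see what the URL replacement does to a single tweet, here is a minimal sketch using the standard `re` module with the same pattern. The helper name `replace_urls` and the sample tweet are made up for illustration. Note that the pattern only matches links starting with `http://`, so `https://` links are left untouched.

```python
import re

# Same pattern as in the loading cell: "http://" followed by any non-whitespace characters
URL_PATTERN = re.compile(r'http://\S*')

def replace_urls(text, token='urltoken'):
    """Replace every http:// link in text with a placeholder token (hypothetical helper)."""
    return URL_PATTERN.sub(token, text)

sample = 'Forest fire near La Ronge http://t.co/abc123 stay safe'
print(replace_urls(sample))
# Forest fire near La Ronge urltoken stay safe
```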


In [129]:
# Word counts
pd.Series(np.array([len(text.split()) for text in train_text])).describe()

count    7613.000000
mean       14.903586
std         5.732604
min         1.000000
25%        11.000000
50%        15.000000
75%        19.000000
max        31.000000
dtype: float64

In [130]:
# Number of unique tokens among all tweets
len(np.unique(np.array(' '.join(train_text).split())))

27736

We set the hyperparameters and build a **TextVectorization** layer, which lowercases the text, strips punctuation characters and maps each token to an integer index:

In [131]:
max_features = 30000
sequence_length = 32

embedding_dim = 4

In [133]:
vectorizer = TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

vectorizer.adapt(train_text)

len(vectorizer.get_vocabulary())

18510
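The vocabulary (18510 entries) is smaller than the raw token count above (27736), presumably because lowercasing and punctuation stripping merge many tokens. The following is a rough pure-Python approximation of what the vectorizer does, not the Keras implementation: the helper names are hypothetical, and we assume index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, as in the `int` output mode.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (approximates 'lower_and_strip_punctuation')."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def build_vocabulary(texts, max_tokens):
    """Map the most frequent tokens to indices, keeping 0 for padding and 1 for unknown tokens."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Convert text to a fixed-length list of indices, truncating or zero-padding as needed."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ['Fire! Fire in the forest.', 'The flood is here']
vocab = build_vocabulary(texts, max_tokens=30000)
print(vectorize('Fire, unknown', vocab, sequence_length=5))
# [2, 1, 0, 0, 0]
```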

In [134]:
# Sanity check: pass two sample sentences through the same layers used in the model
c = tf.constant([
    "i loved the movie yesterday, especially the part when the hero beat the monster after going through the maze filled of traps",
    "very concerned about the recent wildfires in California",
])
c = vectorizer(c)
c = Embedding(max_features + 1, embedding_dim)(c)
c = Dense(embedding_dim, activation='relu')(c)
c = GlobalMaxPooling1D()(c)
c = Dense(1, activation='sigmoid', name='predictions')(c)

In [135]:
def build_model():
    # Inputs are text strings, then we vectorize them
    inputs = keras.Input(shape=(1,), dtype=tf.string, name='text')
    x = vectorizer(inputs)

    # We use Embedding to map the vectorized text onto a space of dimension embedding_dim
    x = Embedding(max_features + 1, embedding_dim)(x)

    # Dense layer
    x = Dense(embedding_dim, activation='relu')(x)

    # GlobalMaxPooling
    x = GlobalMaxPooling1D()(x)

    # Output layer
    outputs = Dense(1, activation='sigmoid', name='predictions')(x)

    model = keras.Model(inputs, outputs)

    # Compile the model with binary crossentropy loss and an adam optimizer.
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

    return model
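The Dense layer is applied to each of the 32 token positions separately, and the pooling layer then keeps, for each of the `embedding_dim` features, the maximum value over all positions. A minimal pure-Python sketch of that pooling step for a single tweet (the function name and the toy values are made up for illustration):

```python
def global_max_pool_1d(sequence):
    """Take the maximum of each feature across all positions of one sequence.

    sequence: list of positions, each a list of feature values (same length).
    """
    return [max(feature_values) for feature_values in zip(*sequence)]

# Three token positions with two features each
toy_sequence = [
    [0.1, 0.9],
    [0.5, 0.2],
    [0.3, 0.4],
]
print(global_max_pool_1d(toy_sequence))
# [0.5, 0.9]
```

This is why the output shape goes from `(None, 32, 4)` to `(None, 4)`: the sequence axis is collapsed and only the feature axis remains.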

In [136]:
model = build_model()
model.summary()

Model: "model_43"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text (InputLayer)            [(None, 1)]               0         
_________________________________________________________________
text_vectorization_26 (TextV (None, 32)                0         
_________________________________________________________________
embedding_65 (Embedding)     (None, 32, 4)             120004    
_________________________________________________________________
dense_57 (Dense)             (None, 32, 4)             20        
_________________________________________________________________
global_max_pooling1d_52 (Glo (None, 4)                 0         
_________________________________________________________________
predictions (Dense)          (None, 1)                 5         
Total params: 120,029
Trainable params: 120,029
Non-trainable params: 0
____________________________________________________
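The parameter counts in the summary can be reproduced by hand from the hyperparameters: an embedding has one vector per index, and a dense layer has one weight per input-output pair plus one bias per output. A short check, using only the values defined above:

```python
max_features = 30000
embedding_dim = 4

# Embedding: (max_features + 1) indices, each mapped to a vector of size embedding_dim
embedding_params = (max_features + 1) * embedding_dim

# Hidden Dense layer: embedding_dim inputs x embedding_dim outputs, plus one bias per output
dense_params = embedding_dim * embedding_dim + embedding_dim

# Output layer: embedding_dim inputs x 1 output, plus one bias
output_params = embedding_dim * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
# 120004 20 5 120029
```

Almost all of the parameters sit in the embedding layer, which may help explain the gap between train and validation scores reported further down.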

In [None]:
epochs = 3

model = build_model()
model.fit(train_text, train_label, epochs=epochs, verbose=2)

model.evaluate(train_text, train_label, verbose=2)

In [137]:
epochs = 10

kfold = KFold(n_splits=10, shuffle=True)

scores = []
models = []

i = 1
for fold_train_indices, fold_val_indices in kfold.split(train_text, train_label):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i}')
    print('------------------------------------------------------------------------')

    fold_train_text = train_text[fold_train_indices]
    fold_train_label = train_label[fold_train_indices]
    fold_val_text = train_text[fold_val_indices]
    fold_val_label = train_label[fold_val_indices]

    model = build_model()
    model.fit(fold_train_text, fold_train_label, epochs=epochs, verbose=2)
    models.append(model)

    fold_train_score = model.evaluate(fold_train_text, fold_train_label, verbose=2)
    fold_val_score = model.evaluate(fold_val_text, fold_val_label, verbose=2)
    scores.append({'train': fold_train_score, 'val': fold_val_score})

    i += 1

------------------------------------------------------------------------
> Fold 1
------------------------------------------------------------------------
Epoch 1/10
215/215 - 3s - loss: 0.6783 - accuracy: 0.6085 - precision_44: 0.5817 - recall_44: 0.2979
Epoch 2/10
215/215 - 1s - loss: 0.6093 - accuracy: 0.7135 - precision_44: 0.8964 - recall_44: 0.3724
Epoch 3/10
215/215 - 1s - loss: 0.5006 - accuracy: 0.8021 - precision_44: 0.8611 - recall_44: 0.6399
Epoch 4/10
215/215 - 1s - loss: 0.4104 - accuracy: 0.8424 - precision_44: 0.8649 - recall_44: 0.7479
Epoch 5/10
215/215 - 1s - loss: 0.3458 - accuracy: 0.8692 - precision_44: 0.8813 - recall_44: 0.8018
Epoch 6/10
215/215 - 1s - loss: 0.2942 - accuracy: 0.8937 - precision_44: 0.8991 - recall_44: 0.8463
Epoch 7/10
215/215 - 1s - loss: 0.2516 - accuracy: 0.9130 - precision_44: 0.9152 - recall_44: 0.8777
Epoch 8/10
215/215 - 1s - loss: 0.2173 - accuracy: 0.9269 - precision_44: 0.9304 - recall_44: 0.8958
Epoch 9/10
215/215 - 1s - loss: 0.188
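The `215/215` in the log is the number of batches per epoch. The sketch below shows, in plain Python, how a shuffled 10-fold split of 7613 samples can be produced and why each fold trains on 215 batches, assuming the default Keras batch size of 32. The function `kfold_indices` is a hypothetical stand-in, not scikit-learn's implementation.

```python
import math
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_indices, val_indices) pairs for a shuffled K-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra validation sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(kfold_indices(7613, 10))
for train, val in folds[:2]:
    # Each fold trains on about 90% of the data
    print(len(train), len(val), math.ceil(len(train) / 32))
# 6851 762 215
# 6851 762 215
```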

In [138]:
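# model.evaluate returns [loss, accuracy, precision, recall];
# we append the F1-score (harmonic mean of precision and recall) as a fifth entry
for fold_scores in scores:
    for subset in ['train', 'val']:
        precision = fold_scores[subset][2]
        recall = fold_scores[subset][3]
        f1_score = 2/(1/precision + 1/recall)
        fold_scores[subset].append(f1_score)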
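The F1-score is the harmonic mean of precision and recall. The expression `2/(1/precision + 1/recall)` fails with a division by zero if either value is 0, so a slightly more defensive variant is sketched below. The function name `f1_from` is made up for illustration, and the inputs are the rounded Fold 1 validation values reported in the output further down.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, returning 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fold 1 validation: precision 0.7827, recall 0.7122
print(round(f1_from(0.7827, 0.7122), 4))
# 0.7458
```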

In [141]:
print('------------------------------------------------------------------------')
print('Average scores for all folds - Train')
print(f'> Loss: {round(np.mean([fold_score["train"][0] for fold_score in scores]), 4)} -  Accuracy: {round(np.mean([fold_score["train"][1] for fold_score in scores]), 4)} - Precision: {round(np.mean([fold_score["train"][2] for fold_score in scores]), 4)} - Recall: {round(np.mean([fold_score["train"][3] for fold_score in scores]), 4)} - F1-score: {round(np.mean([fold_score["train"][4] for fold_score in scores]), 4)}')
print('------------------------------------------------------------------------')
print('Average scores for all folds - Validation')
print(f'> Loss: {round(np.mean([fold_score["val"][0] for fold_score in scores]), 4)} -  Accuracy: {round(np.mean([fold_score["val"][1] for fold_score in scores]), 4)} - Precision: {round(np.mean([fold_score["val"][2] for fold_score in scores]), 4)} - Recall: {round(np.mean([fold_score["val"][3] for fold_score in scores]), 4)} - F1-score: {round(np.mean([fold_score["val"][4] for fold_score in scores]), 4)}')
print('------------------------------------------------------------------------')


i = 1
print('------------------------------------------------------------------------')
print('Score per fold')
for fold_scores in scores:
    print('------------------------------------------------------------------------')
    print(f'> Fold {i} - Train')
    print(f'>>> Loss: {round(fold_scores["train"][0], 4)} - Accuracy: {round(fold_scores["train"][1], 4)} - Precision: {round(fold_scores["train"][2], 4)} - Recall: {round(fold_scores["train"][3], 4)} - F1-score: {round(fold_scores["train"][4], 4)}')
    print(f'> Fold {i} - Validation')
    print(f'>>> Loss: {round(fold_scores["val"][0], 4)} - Accuracy: {round(fold_scores["val"][1], 4)} - Precision: {round(fold_scores["val"][2], 4)} - Recall: {round(fold_scores["val"][3], 4)} - F1-score: {round(fold_scores["val"][4], 4)}')
    i += 1
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Average scores for all folds - Train
> Loss: 0.2039 -  Accuracy: 0.9295 - Precision: 0.9384 - Recall: 0.8952 - F1-score: 0.9157
------------------------------------------------------------------------
Average scores for all folds - Validation
> Loss: 0.552 -  Accuracy: 0.7592 - Precision: 0.7381 - Recall: 0.6835 - F1-score: 0.7089
------------------------------------------------------------------------
------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Train
>>> Loss: 0.1411 - Accuracy: 0.9584 - Precision: 0.9577 - Recall: 0.9443 - F1-score: 0.951
> Fold 1 - Validation
>>> Loss: 0.5594 - Accuracy: 0.7808 - Precision: 0.7827 - Recall: 0.7122 - F1-score: 0.7458
------------------------------------------------------------------------
> Fold 2 - Train
>>> Loss: 0.1826 - Accuracy: 0.9432 - Precision:

In [136]:
# Use the model trained on the last fold to predict the test set
model = models[-1]

In [137]:
test_pred = model.predict(test_text)
test_pred = np.round(test_pred).flatten().astype('int')

test_pred

array([1, 1, 1, ..., 1, 1, 1])
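Rounding the sigmoid outputs amounts to thresholding the predicted probabilities at 0.5 (NumPy rounds an exact 0.5 down to 0, since it rounds halves to the nearest even value). A plain-Python sketch with an explicit threshold makes this visible; the helper `to_labels` and the sample probabilities are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels using a strict threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

sample_probabilities = [0.03, 0.49, 0.5, 0.51, 0.97]
print(to_labels(sample_probabilities))
# [0, 0, 0, 1, 1]
```

Writing the threshold explicitly also makes it easy to try other cut-offs if precision or recall should be favoured.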

Finally, we build the submission file by pairing the `id` of each test tweet with its predicted `target`, and save it as a CSV:

In [138]:
output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
