## 04 - Neural Networks with Keras

In this notebook we implement a neural-network approach, using the **Keras** API bundled with **TensorFlow**, to predict whether each tweet refers to a real disaster or not.

#### Loading data

We start by importing the packages we are going to use and loading the datasets:

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
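from tensorflow import keras

# Import the layers from tensorflow.keras so they come from the same package as `keras` above
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Dense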

from sklearn.model_selection import KFold

train_data = pd.read_csv("../data/train.csv")
test_data = pd.read_csv("../data/test.csv")

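# Replace http:// links with a placeholder token.
# A raw string avoids invalid-escape warnings, and assigning the result back
# avoids relying on an in-place update of a column selection.
train_data['text'] = train_data['text'].replace(r'http://\S*', 'urltoken', regex=True)
test_data['text'] = test_data['text'].replace(r'http://\S*', 'urltoken', regex=True)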

train_text, train_label = np.array(train_data['text']), np.array(train_data['target'])
test_text = test_data['text']

print(train_text.shape)
print(train_label.shape)
print(test_text.shape)

(7613,)
(7613,)
(3263,)
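To make the URL replacement concrete, here is a minimal standalone sketch using Python's `re` module with the same pattern as the cell above. The helper name `replace_urls` and the sample tweet are made up for illustration and are not part of the notebook.

```python
import re

# Same pattern as the notebook: "http://" followed by any run of non-space characters.
URL_PATTERN = re.compile(r"http://\S*")

def replace_urls(text, token="urltoken"):
    """Replace every http:// URL in `text` with a placeholder token (hypothetical helper)."""
    return URL_PATTERN.sub(token, text)

sample = "Forest fire near La Ronge http://t.co/abc123 stay safe"  # made-up tweet
cleaned = replace_urls(sample)
print(cleaned)

# Links starting with https:// are not matched by this pattern and pass through unchanged.
unchanged = replace_urls("see https://example.com")
print(unchanged)
```

If https links should also be collapsed, a pattern such as `r"https?://\S*"` would cover both schemes. That is a possible variation, not what the notebook does.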


In [2]:
# Word counts
pd.Series(np.array([len(text.split()) for text in train_text])).describe()

count    7613.000000
mean       14.903586
std         5.732604
min         1.000000
25%        11.000000
50%        15.000000
75%        19.000000
max        31.000000
dtype: float64

In [3]:
# Number of unique tokens among all tweets
len(np.unique(np.array(' '.join(train_text).split())))

27736

In [173]:
# Base model (overfitted)
max_features = 20000
sequence_length = 500

embedding_dim = 128

dropout_rate = 0.5

conv_filters = 128

conv_kernel_size = 7
conv_strides = 3

dense_layer_size = 128

In [174]:
# New base model after some tests (still overfitted)
max_features = 10000
sequence_length = 32

embedding_dim = 64

conv_filters = 64

conv_kernel_size = 5
conv_strides = 2

dense_layer_size = 64

In [175]:
# New new base model after more tests
max_features = 5000

embedding_dim = 32

conv_filters = 32

conv_kernel_size = 3
conv_strides = 1

dense_layer_size = 32

In [181]:
# Modifications
# max_features = 2000

embedding_dim = 16

# conv_filters = 16

# dense_layer_size = 16

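We turn each tweet into a fixed-length sequence of integer token ids with **TextVectorization**. The layer lowercases the text and strips punctuation (it does not remove stopwords), keeps at most `max_features` vocabulary entries, and pads or truncates every tweet to `sequence_length` tokens. The vocabulary is learned from the training text only: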

In [182]:
vectorizer = TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

vectorizer.adapt(train_text)

len(vectorizer.get_vocabulary())

5000
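The following pure-Python sketch approximates what the vectorizer does, so the integer output is easier to picture. It is not the Keras implementation: the punctuation set, the tie-breaking between equally frequent tokens, and the helper names (`standardize`, `build_vocab`, `vectorize`) are assumptions made for illustration. What it does share with the layer in `int` mode is that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
import re
from collections import Counter

# Punctuation removed by this sketch; assumed to approximate the Keras default set.
PUNCT = re.compile(r"[!\"#$%&()*+,\-./:;<=>?@\[\\\]^_`{|}~']")

def standardize(text):
    """Lowercase the text and strip punctuation."""
    return PUNCT.sub("", text.lower())

def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens, after reserving two slots for padding and OOV."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, then right-pad with zeros."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_texts = ["Fire in the forest!", "The fire is out.", "Flood warning"]  # made-up data
vocab = build_vocab(toy_texts, max_tokens=6)
encoded = vectorize("The FIRE, again?", vocab, sequence_length=5)
print(vocab)
print(encoded)  # "again" is unseen, so it maps to id 1; the tail is zero padding
```

Because two vocabulary slots are reserved, a vectorizer with `max_tokens=5000` learns at most 4998 actual words, which is consistent with `get_vocabulary()` returning 5000 entries above.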

In [183]:
def build_model():
    # Inputs are text strings, then we vectorize them
    inputs = keras.Input(shape=(1,), dtype=tf.string, name='text')
    x = vectorizer(inputs)

    # We use Embedding to map the vectorized text onto a space of dimension embedding_dim
    x = Embedding(max_features + 1, embedding_dim)(x)
    x = Dropout(dropout_rate)(x)

    # Conv1D + GlobalMaxPooling
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)

    # Dense hidden layer
    x = Dense(dense_layer_size, activation="relu")(x)
    x = Dropout(dropout_rate)(x)

    # Output layer
    outputs = Dense(1, activation="sigmoid", name="predictions")(x)

    model = keras.Model(inputs, outputs)

    # Compile the model with binary crossentropy loss and an adam optimizer.
    model.compile(optimizer="adam", loss='binary_crossentropy', metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

    return model

In [184]:
model = build_model()
model.summary()

Model: "model_165"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text (InputLayer)            [(None, 1)]               0         
_________________________________________________________________
text_vectorization_23 (TextV (None, 32)                0         
_________________________________________________________________
embedding_165 (Embedding)    (None, 32, 16)            80016     
_________________________________________________________________
dropout_363 (Dropout)        (None, 32, 16)            0         
_________________________________________________________________
conv1d_330 (Conv1D)          (None, 30, 32)            1568      
_________________________________________________________________
conv1d_331 (Conv1D)          (None, 28, 32)            3104      
_________________________________________________________________
global_max_pooling1d_165 (Gl (None, 32)                0 
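The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the hyperparameters active at this point (`max_features = 5000`, `sequence_length = 32`, `embedding_dim = 16`, `conv_filters = 32`, `conv_kernel_size = 3`, `conv_strides = 1`) and the default `'valid'` padding of `Conv1D`. The helper functions are illustrative only.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding (no padding)."""
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of size kernel_size x in_channels per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per vocabulary index."""
    return input_dim * output_dim

max_features, sequence_length = 5000, 32
embedding_dim, conv_filters = 16, 32
conv_kernel_size, conv_strides = 3, 1

emb = embedding_params(max_features + 1, embedding_dim)
len1 = conv1d_output_length(sequence_length, conv_kernel_size, conv_strides)
len2 = conv1d_output_length(len1, conv_kernel_size, conv_strides)
p1 = conv1d_params(embedding_dim, conv_filters, conv_kernel_size)
p2 = conv1d_params(conv_filters, conv_filters, conv_kernel_size)

print(emb)       # 80016
print(len1, p1)  # 30 1568
print(len2, p2)  # 28 3104
```

These values match the `Output Shape` and `Param #` columns of the embedding and the two convolutional layers in the summary.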

In [185]:
epochs = 3

kfold = KFold(n_splits=10, shuffle=True)

scores = []
models = []

i = 1
for fold_train_indices, fold_val_indices in kfold.split(train_text, train_label):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i}')
    print('------------------------------------------------------------------------')

    fold_train_text = train_text[fold_train_indices]
    fold_train_label = train_label[fold_train_indices]
    fold_val_text = train_text[fold_val_indices]
    fold_val_label = train_label[fold_val_indices]

    model = build_model()
    model.fit(fold_train_text, fold_train_label, epochs=epochs, verbose=2)
    models.append(model)

    fold_train_score = model.evaluate(fold_train_text, fold_train_label, verbose=2)
    fold_val_score = model.evaluate(fold_val_text, fold_val_label, verbose=2)
    scores.append({'train': fold_train_score, 'val': fold_val_score})

    i += 1

------------------------------------------------------------------------
> Fold 1
------------------------------------------------------------------------
Epoch 1/3
215/215 - 3s - loss: 0.6770 - accuracy: 0.5712 - precision_166: 0.4775 - recall_166: 0.0181
Epoch 2/3
215/215 - 1s - loss: 0.6083 - accuracy: 0.6878 - precision_166: 0.7734 - recall_166: 0.3829
Epoch 3/3
215/215 - 1s - loss: 0.5107 - accuracy: 0.7764 - precision_166: 0.7706 - recall_166: 0.6802
215/215 - 1s - loss: 0.3735 - accuracy: 0.8489 - precision_166: 0.8193 - recall_166: 0.8302
24/24 - 0s - loss: 0.4835 - accuracy: 0.7822 - precision_166: 0.7402 - recall_166: 0.7840
------------------------------------------------------------------------
> Fold 2
------------------------------------------------------------------------
Epoch 1/3
215/215 - 3s - loss: 0.6814 - accuracy: 0.5671 - precision_167: 0.4674 - recall_167: 0.0145
Epoch 2/3
215/215 - 1s - loss: 0.5759 - accuracy: 0.7106 - precision_167: 0.7215 - recall_167: 0.537
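The `215/215` and `24/24` counters in the log are numbers of batches, not samples. A short sketch of the arithmetic, assuming the Keras default batch size of 32 (no `batch_size` is passed in the notebook) and scikit-learn's rule that the first `n_samples % n_splits` folds receive one extra sample; the helper names are illustrative:

```python
import math

BATCH_SIZE = 32  # Keras default when batch_size is not given

def kfold_val_sizes(n_samples, n_splits):
    """Validation fold sizes: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

def steps(n_samples, batch_size=BATCH_SIZE):
    """Number of batches needed to cover n_samples (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_samples, n_splits = 7613, 10
val_sizes = kfold_val_sizes(n_samples, n_splits)
train_steps = {steps(n_samples - v) for v in val_sizes}
val_steps = {steps(v) for v in val_sizes}
full_steps = steps(n_samples)

print(val_sizes[:4])           # [762, 762, 762, 761]
print(train_steps, val_steps)  # {215} {24}
print(full_steps)              # 238
```

Every fold therefore trains on 215 batches and validates on 24, and fitting on all 7613 tweets takes 238 batches per epoch.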

In [186]:
for fold_scores in scores:
    for subset in ['train', 'val']:
        precision = fold_scores[subset][2]
        recall = fold_scores[subset][3]
        f1_score = 2/(1/precision + 1/recall)
        fold_scores[subset].append(f1_score)
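The expression `2/(1/precision + 1/recall)` is the harmonic mean of precision and recall, i.e. the F1-score. It raises `ZeroDivisionError` if either metric is exactly 0, which can happen when a model predicts a single class. A minimal sketch of an equivalent, guarded form (the function name `f1_from` is hypothetical), checked against the precision and recall values reported for fold 1:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 if either is 0."""
    if precision <= 0 or recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fold 1 values as printed in the notebook output (already rounded to 4 decimals).
train_f1 = f1_from(0.8193, 0.8302)
val_f1 = f1_from(0.7402, 0.784)

print(round(train_f1, 4))  # 0.8247
print(round(val_f1, 4))    # 0.7615
print(f1_from(0.0, 0.5))   # 0.0 instead of ZeroDivisionError
```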

In [187]:
i = 1
print('------------------------------------------------------------------------')
print('Score per fold')
for fold_scores in scores:
    print('------------------------------------------------------------------------')
    print(f'> Fold {i} - Train')
    print(f'>>> Loss: {round(fold_scores["train"][0], 4)} - Accuracy: {round(fold_scores["train"][1], 4)} - Precision: {round(fold_scores["train"][2], 4)} - Recall: {round(fold_scores["train"][3], 4)} - F1-score: {round(fold_scores["train"][4], 4)}')
    print(f'> Fold {i} - Validation')
    print(f'>>> Loss: {round(fold_scores["val"][0], 4)} - Accuracy: {round(fold_scores["val"][1], 4)} - Precision: {round(fold_scores["val"][2], 4)} - Recall: {round(fold_scores["val"][3], 4)} - F1-score: {round(fold_scores["val"][4], 4)}')
    i += 1
print('------------------------------------------------------------------------')
print('Average scores for all folds - Train')
print(f'> Loss: {round(np.mean([fold_score["train"][0] for fold_score in scores]), 4)} -  Accuracy: {round(np.mean([fold_score["train"][1] for fold_score in scores]), 4)} - Precision: {round(np.mean([fold_score["train"][2] for fold_score in scores]), 4)} - Recall: {round(np.mean([fold_score["train"][3] for fold_score in scores]), 4)} - F1-score: {round(np.mean([fold_score["train"][4] for fold_score in scores]), 4)}')
print('------------------------------------------------------------------------')
print('Average scores for all folds - Validation')
print(f'> Loss: {round(np.mean([fold_score["val"][0] for fold_score in scores]), 4)} -  Accuracy: {round(np.mean([fold_score["val"][1] for fold_score in scores]), 4)} - Precision: {round(np.mean([fold_score["val"][2] for fold_score in scores]), 4)} - Recall: {round(np.mean([fold_score["val"][3] for fold_score in scores]), 4)} - F1-score: {round(np.mean([fold_score["val"][4] for fold_score in scores]), 4)}')
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Train
>>> Loss: 0.3735 - Accuracy: 0.8489 - Precision: 0.8193 - Recall: 0.8302 - F1-score: 0.8247
> Fold 1 - Validation
>>> Loss: 0.4835 - Accuracy: 0.7822 - Precision: 0.7402 - Recall: 0.784 - F1-score: 0.7615
------------------------------------------------------------------------
> Fold 2 - Train
>>> Loss: 0.3416 - Accuracy: 0.8676 - Precision: 0.8773 - Recall: 0.8064 - F1-score: 0.8403
> Fold 2 - Validation
>>> Loss: 0.4598 - Accuracy: 0.7966 - Precision: 0.7635 - Recall: 0.7267 - F1-score: 0.7446
------------------------------------------------------------------------
> Fold 3 - Train
>>> Loss: 0.3213 - Accuracy: 0.8759 - Precision: 0.9159 - Recall: 0.7834 - F1-score: 0.8445
> Fold 3 - Validation
>>> Loss: 0.4329 - Accuracy: 0.7992 - Precision: 0.8071 - Recall: 0.6954 - F1-score: 0.7471
-------------------------

In [188]:
model = build_model()
model.fit(train_text, train_label, epochs=epochs, verbose=2)

model.evaluate(train_text, train_label, verbose=2)

Epoch 1/3
238/238 - 3s - loss: 0.6748 - accuracy: 0.5698 - precision_176: 0.4861 - recall_176: 0.0214
Epoch 2/3
238/238 - 1s - loss: 0.5487 - accuracy: 0.7490 - precision_176: 0.7507 - recall_176: 0.6224
Epoch 3/3
238/238 - 1s - loss: 0.4255 - accuracy: 0.8236 - precision_176: 0.8423 - recall_176: 0.7252
238/238 - 1s - loss: 0.3380 - accuracy: 0.8732 - precision_176: 0.8800 - recall_176: 0.8163


[0.33802932500839233,
 0.8732431530952454,
 0.8800263404846191,
 0.8162641525268555]

In [189]:
test_pred = model.predict(test_text)
test_pred = np.round(test_pred).flatten().astype('int')

test_pred

array([1, 1, 1, ..., 1, 1, 1])
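The model outputs one sigmoid probability per tweet, with shape `(n, 1)`, and `np.round` turns it into a 0/1 label. One detail worth knowing is that `np.round` rounds halves to the nearest even number, so a probability of exactly 0.5 becomes 0. An explicit threshold makes the cut-off visible. The probabilities below are made up for illustration:

```python
import numpy as np

# Made-up sigmoid outputs with the same (n, 1) shape that model.predict returns.
probs = np.array([[0.12], [0.50], [0.51], [0.97]])

# What the notebook does: round, flatten to 1-D, cast to int.
rounded = np.round(probs).flatten().astype("int")

# Explicit threshold: the same result except at exactly 0.5.
thresholded = (probs >= 0.5).astype("int").flatten()

print(rounded)      # [0 0 1 1]
print(thresholded)  # [0 1 1 1]
```

In practice a float output of exactly 0.5 is rare, so the two approaches almost always agree. The explicit form is mainly useful if a cut-off other than 0.5 is ever tried.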

Finally, we store the test predictions in a CSV file with the columns `id` and `target`:

In [190]:
output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
