## 04 Neural Networks with keras - 02 Simple approach

In this notebook we use the **keras** API from **tensorflow** to build a neural network that predicts whether a tweet refers to a real disaster or not. The architecture is simple and fixed, with a single hidden dense layer.

#### Loading data

We start by importing the packages we are going to use and loading the datasets:

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow.keras as keras

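# Import the layers from tensorflow.keras so they match the keras module imported above
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Dense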

from sklearn.model_selection import KFold

train_data = pd.read_csv("../../data/train.csv")
test_data = pd.read_csv("../../data/test.csv")

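# Replace http:// links with a single placeholder token (raw string avoids invalid escape sequences)
train_data['text'] = train_data['text'].replace(r'http://\S*', 'urltoken', regex=True)
test_data['text'] = test_data['text'].replace(r'http://\S*', 'urltoken', regex=True)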

train_text, train_label = np.array(train_data['text']), np.array(train_data['target'])
test_text = test_data['text']

print(train_text.shape)
print(train_label.shape)
print(test_text.shape)

(7613,)
(7613,)
(3263,)
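
As a small illustration of the URL replacement step, the sketch below applies the same pattern to a few made-up tweets (the sample strings are hypothetical, not taken from the dataset). It also shows that the pattern only matches `http://` links, so `https://` links are left untouched:

```python
import pandas as pd

# Hypothetical sample tweets, not taken from the dataset
sample = pd.Series([
    "Forest fire near La Ronge http://t.co/abc123",
    "No link in this tweet",
    "Secure link stays as is: https://t.co/xyz789",
])

# Same pattern as in the loading cell, written as a raw string
cleaned = sample.replace(r'http://\S*', 'urltoken', regex=True)
print(cleaned.tolist())
```

If `https://` links should be collapsed as well, a pattern such as `r'https?://\S*'` would be one option, at the cost of changing the token counts reported in this notebook.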


We explore the training data. The average tweet has about 15 words, and the longest one has 31:

In [2]:
# Word counts
pd.Series(np.array([len(text.split()) for text in train_text])).describe()

count    7613.000000
mean       14.903586
std         5.732604
min         1.000000
25%        11.000000
50%        15.000000
75%        19.000000
max        31.000000
dtype: float64

There are 27736 unique whitespace-separated tokens across all the tweets (the count is case- and punctuation-sensitive):

In [3]:
# Number of unique tokens among all tweets
len(np.unique(np.array(' '.join(train_text).split())))

27736
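
The count above treats `Fire!`, `Fire` and `fire` as different tokens, whereas the vectorization layer used in the model lowercases the text and strips punctuation first. The sketch below approximates that standardization with plain Python on a few hypothetical tweets, to show how the vocabulary shrinks. The helper `standardize` is an illustrative stand-in, not the Keras implementation:

```python
import string

def standardize(text):
    """Approximate 'lower_and_strip_punctuation': lowercase, then drop ASCII punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

# Hypothetical sample tweets, not taken from the dataset
sample = ["Fire! Fire in the hills.", "fire crews on the way", "The HILLS are burning"]

raw_vocab = set(' '.join(sample).split())
std_vocab = set(' '.join(standardize(t) for t in sample).split())

print(len(raw_vocab), len(std_vocab))
```

On the real data, the standardized vocabulary should therefore be no larger than the raw count of 27736, which is why a cap of 30000 tokens is unlikely to discard any words.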

#### Model building

The following function creates and returns a model with a fixed layer architecture. Its hyperparameters (**max_features**, **sequence_length** and **embedding_dim**) are defined at the top of the cell.

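We start with a **TextVectorization** layer with the usual standardization (lowercasing and stripping punctuation), followed by an **Embedding** layer. We then apply a **Dense** layer to each token, reduce over the sequence with **GlobalMaxPooling1D**, and finish with a single sigmoid output unit.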

In [4]:
max_features = 30000
sequence_length = 32

embedding_dim = 4

def build_model():
    # Inputs are text strings, then we vectorize them
    inputs = keras.Input(shape=(1,), dtype=tf.string, name='text')

    vectorizer = TextVectorization(
        standardize='lower_and_strip_punctuation',
        max_tokens=max_features,
        output_mode="int",
        output_sequence_length=sequence_length,
    )
    vectorizer.adapt(train_text)
    x = vectorizer(inputs)

    # We use Embedding to map the vectorized text onto a space of dimension embedding_dim
    x = Embedding(max_features + 1, embedding_dim)(x)

    # Dense layer
    x = Dense(embedding_dim, activation='relu')(x)

    # GlobalMaxPooling
    x = GlobalMaxPooling1D()(x)

    # Output layer
    outputs = Dense(1, activation='sigmoid', name='predictions')(x)

    model = keras.Model(inputs, outputs)

    # Compile the model with binary crossentropy loss and an adam optimizer.
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()],
    )

    return model
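
To make the data flow through these layers concrete, here is a minimal NumPy sketch of the same forward pass with random weights. It is only an illustration of the tensor shapes under the hyperparameters above, not the Keras implementation; the random token ids stand in for the vectorizer output, and all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, sequence_length, embedding_dim = 30001, 32, 4
batch_size = 2

# Stand-in for the vectorizer output: one row of integer token ids per tweet
token_ids = rng.integers(0, vocab_size, size=(batch_size, sequence_length))

# Embedding: a lookup table with one row per token id
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
x = embedding_table[token_ids]                      # (batch, 32, 4)

# Dense + ReLU acts on the last axis, i.e. on each token independently
w, b = rng.normal(size=(embedding_dim, embedding_dim)), np.zeros(embedding_dim)
x = np.maximum(x @ w + b, 0.0)                      # (batch, 32, 4)

# Global max pooling keeps the maximum of each feature over the 32 positions
x = x.max(axis=1)                                   # (batch, 4)

# Output layer: one sigmoid unit per tweet
w_out, b_out = rng.normal(size=(embedding_dim, 1)), np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(x @ w_out + b_out)))  # (batch, 1)

print(probs.shape)
```

Because the dense layer sits before the pooling step, it transforms every token embedding separately, and the pooling then selects the strongest activation per feature across the tweet.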

#### Model training

We are now ready to train the model. We start by creating an instance and printing a summary:

In [5]:
model = build_model()
model.summary()


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
text (InputLayer)            [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, 32)                0         
_________________________________________________________________
embedding (Embedding)        (None, 32, 4)             120004    
_________________________________________________________________
dense (Dense)                (None, 32, 4)             20        
_________________________________________________________________
global_max_pooling1d (Global (None, 4)                 0         
_________________________________________________________________
predictions (Dense)          (None, 1)                 5         
=================================================================
Total params: 120,029
Trainable params: 120,029
Non-trainable params: 0
_________________________________________________________________
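
The parameter counts in the summary can be checked by hand from the hyperparameters. The short calculation below reproduces them; the variable names are only for illustration:

```python
max_features, embedding_dim = 30000, 4

# Embedding: one vector of size embedding_dim per token id (max_features + 1 rows)
embedding_params = (max_features + 1) * embedding_dim

# Hidden dense layer: a weight matrix plus one bias per unit
dense_params = embedding_dim * embedding_dim + embedding_dim

# Output layer: one weight per pooled feature plus a single bias
output_params = embedding_dim * 1 + 1

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
```

Almost all of the 120,029 parameters sit in the embedding table, so the capacity of this model is dominated by the vocabulary size rather than by the dense layers.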

We use 10-fold cross-validation and train for 10 epochs:

In [6]:
epochs = 10

kfold = KFold(n_splits=10, shuffle=True)

scores = []
models = []

for i, (fold_train_indices, fold_val_indices) in enumerate(kfold.split(train_text, train_label), start=1):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i}')
    print('------------------------------------------------------------------------')

    fold_train_text = train_text[fold_train_indices]
    fold_train_label = train_label[fold_train_indices]
    fold_val_text = train_text[fold_val_indices]
    fold_val_label = train_label[fold_val_indices]

    # Train a fresh model on this fold's training split
    model = build_model()
    model.fit(fold_train_text, fold_train_label, epochs=epochs, verbose=2)
    models.append(model)

    # Each score is a list: [loss, accuracy, precision, recall]
    fold_train_score = model.evaluate(fold_train_text, fold_train_label, verbose=2)
    fold_val_score = model.evaluate(fold_val_text, fold_val_label, verbose=2)
    scores.append({'train': fold_train_score, 'val': fold_val_score})

------------------------------------------------------------------------
> Fold 1
------------------------------------------------------------------------
Epoch 1/10
215/215 - 2s - loss: 0.6723 - accuracy: 0.5736 - precision_1: 0.0000e+00 - recall_1: 0.0000e+00
Epoch 2/10
215/215 - 1s - loss: 0.6198 - accuracy: 0.6872 - precision_1: 0.7763 - recall_1: 0.3742
Epoch 3/10
215/215 - 1s - loss: 0.5575 - accuracy: 0.7425 - precision_1: 0.7436 - recall_1: 0.6046
Epoch 4/10
215/215 - 1s - loss: 0.5010 - accuracy: 0.7786 - precision_1: 0.7835 - recall_1: 0.6642
Epoch 5/10
215/215 - 1s - loss: 0.4490 - accuracy: 0.8161 - precision_1: 0.8377 - recall_1: 0.7052
Epoch 6/10
215/215 - 1s - loss: 0.3992 - accuracy: 0.8483 - precision_1: 0.8810 - recall_1: 0.7450
Epoch 7/10
215/215 - 1s - loss: 0.3541 - accuracy: 0.8683 - precision_1: 0.9011 - recall_1: 0.7764
Epoch 8/10
215/215 - 1s - loss: 0.3152 - accuracy: 0.8857 - precision_1: 0.9186 - recall_1: 0.8031
Epoch 9/10
215/215 - 1s - loss: 0.2817 - accu
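
The `215/215` in the log is the number of batches per epoch. As a rough check, the sketch below derives it from the dataset size, assuming the default Keras batch size of 32 (the training call above does not set `batch_size`) and the way `KFold` distributes the remainder samples across folds:

```python
import math

n_samples, n_splits, batch_size = 7613, 10, 32  # batch_size=32 is the assumed Keras default

# KFold gives the first n_samples % n_splits folds one extra validation sample
val_sizes = [
    n_samples // n_splits + (1 if k < n_samples % n_splits else 0)
    for k in range(n_splits)
]

# Batches per epoch when training on the remaining samples of each fold
steps = [math.ceil((n_samples - v) / batch_size) for v in val_sizes]

print(val_sizes[0], val_sizes[-1], set(steps))

# Batches per epoch when training on the full dataset
full_steps = math.ceil(n_samples / batch_size)
print(full_steps)
```

Each fold therefore trains on roughly 6850 tweets and validates on the remaining 761 or 762.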

We compute the F1-score:

In [7]:
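# Each score list is [loss, accuracy, precision, recall]; we append the F1-score as a fifth entry
for fold_scores in scores:
    for subset in ['train', 'val']:
        precision = fold_scores[subset][2]
        recall = fold_scores[subset][3]
        # Harmonic mean of precision and recall
        f1_score = 2/(1/precision + 1/recall)
        fold_scores[subset].append(f1_score)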
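
The harmonic-mean form used above divides by precision and recall, so it fails if either of them is exactly zero (for example, when a model predicts no positives at all). A slightly more defensive variant, shown here as a sketch with a hypothetical helper name `f1_from`, uses the equivalent product form and handles the degenerate case explicitly:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall as input, so the result is only approximate
print(round(f1_from(0.9325, 0.875), 4))

# Degenerate case that would raise ZeroDivisionError in the reciprocal form
print(f1_from(0.0, 0.0))
```

Both forms give the same value whenever precision and recall are positive.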

And print a detailed summary of the scores:

In [8]:
print('------------------------------------------------------------------------')
print('Average scores for all folds - Train')
print(f'> Loss: {round(np.mean([fold_score["train"][0] for fold_score in scores]), 4)} -  Accuracy: {round(np.mean([fold_score["train"][1] for fold_score in scores]), 4)} - Precision: {round(np.mean([fold_score["train"][2] for fold_score in scores]), 4)} - Recall: {round(np.mean([fold_score["train"][3] for fold_score in scores]), 4)} - F1-score: {round(np.mean([fold_score["train"][4] for fold_score in scores]), 4)}')
print('------------------------------------------------------------------------')
print('Average scores for all folds - Validation')
print(f'> Loss: {round(np.mean([fold_score["val"][0] for fold_score in scores]), 4)} -  Accuracy: {round(np.mean([fold_score["val"][1] for fold_score in scores]), 4)} - Precision: {round(np.mean([fold_score["val"][2] for fold_score in scores]), 4)} - Recall: {round(np.mean([fold_score["val"][3] for fold_score in scores]), 4)} - F1-score: {round(np.mean([fold_score["val"][4] for fold_score in scores]), 4)}')
print('------------------------------------------------------------------------')


i = 1
print('------------------------------------------------------------------------')
print('Score per fold')
for fold_scores in scores:
    print('------------------------------------------------------------------------')
    print(f'> Fold {i} - Train')
    print(f'>>> Loss: {round(fold_scores["train"][0], 4)} - Accuracy: {round(fold_scores["train"][1], 4)} - Precision: {round(fold_scores["train"][2], 4)} - Recall: {round(fold_scores["train"][3], 4)} - F1-score: {round(fold_scores["train"][4], 4)}')
    print(f'> Fold {i} - Validation')
    print(f'>>> Loss: {round(fold_scores["val"][0], 4)} - Accuracy: {round(fold_scores["val"][1], 4)} - Precision: {round(fold_scores["val"][2], 4)} - Recall: {round(fold_scores["val"][3], 4)} - F1-score: {round(fold_scores["val"][4], 4)}')
    i += 1
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Average scores for all folds - Train
> Loss: 0.1934 -  Accuracy: 0.9319 - Precision: 0.9336 - Recall: 0.9065 - F1-score: 0.9197
------------------------------------------------------------------------
Average scores for all folds - Validation
> Loss: 0.561 -  Accuracy: 0.7566 - Precision: 0.7318 - Recall: 0.6856 - F1-score: 0.7075
------------------------------------------------------------------------
------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Train
>>> Loss: 0.228 - Accuracy: 0.9197 - Precision: 0.9325 - Recall: 0.875 - F1-score: 0.9029
> Fold 1 - Validation
>>> Loss: 0.5972 - Accuracy: 0.7178 - Precision: 0.6991 - Recall: 0.6771 - F1-score: 0.688
------------------------------------------------------------------------
> Fold 2 - Train
>>> Loss: 0.2663 - Accuracy: 0.9146 - Precision: 0

#### Submission

We build a fresh model and train it on all the available training data:

In [9]:
model = build_model()
model.fit(train_text, train_label, epochs=epochs, verbose=2)

model.evaluate(train_text, train_label, verbose=2)

Epoch 1/10
238/238 - 2s - loss: 0.6718 - accuracy: 0.5801 - precision_11: 0.7937 - recall_11: 0.0306
Epoch 2/10
238/238 - 1s - loss: 0.5869 - accuracy: 0.7315 - precision_11: 0.8476 - recall_11: 0.4574
Epoch 3/10
238/238 - 1s - loss: 0.4838 - accuracy: 0.8065 - precision_11: 0.8426 - recall_11: 0.6759
Epoch 4/10
238/238 - 1s - loss: 0.4141 - accuracy: 0.8332 - precision_11: 0.8529 - recall_11: 0.7392
Epoch 5/10
238/238 - 1s - loss: 0.3611 - accuracy: 0.8508 - precision_11: 0.8545 - recall_11: 0.7866
Epoch 6/10
238/238 - 1s - loss: 0.3149 - accuracy: 0.8684 - precision_11: 0.8728 - recall_11: 0.8120
Epoch 7/10
238/238 - 1s - loss: 0.2764 - accuracy: 0.8865 - precision_11: 0.8886 - recall_11: 0.8413
Epoch 8/10
238/238 - 1s - loss: 0.2448 - accuracy: 0.9049 - precision_11: 0.9036 - recall_11: 0.8716
Epoch 9/10
238/238 - 1s - loss: 0.2186 - accuracy: 0.9141 - precision_11: 0.9134 - recall_11: 0.8838
Epoch 10/10
238/238 - 1s - loss: 0.1975 - accuracy: 0.9205 - precision_11: 0.9210 - recall_

[0.17462129890918732,
 0.9323525428771973,
 0.9459546804428101,
 0.8936105370521545]

We generate the predictions for the test set:

In [10]:
test_pred = model.predict(test_text)
test_pred = np.round(test_pred).flatten().astype('int')

test_pred

array([0, 0, 0, ..., 0, 1, 1])
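
Rounding the sigmoid output is equivalent to applying a decision threshold of 0.5. The sketch below compares the rounding used above with an explicit threshold on a few hypothetical probabilities. Note that `np.round` rounds exactly 0.5 down to 0 (round half to even), which matches a strict `> 0.5` comparison:

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped like model.predict results: (n_samples, 1)
probs = np.array([[0.12], [0.5], [0.51], [0.97]])

rounded = np.round(probs).flatten().astype('int')    # approach used above
thresholded = (probs.flatten() > 0.5).astype('int')  # explicit threshold

print(rounded, thresholded)
```

Writing the threshold explicitly makes it easier to experiment with other cut-offs if precision or recall should be favoured.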

Finally, we save the predictions to a CSV file ready for submission:

In [11]:
import os

os.makedirs('predictions', exist_ok=True)  # create the output directory if needed
output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
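
Before uploading, it can be worth checking that the file has the expected layout. The sketch below performs such a check on a tiny hypothetical frame written to an in-memory buffer; the two-column `id`/`target` layout is taken from the cell above, while the sample values are made up:

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for test_data['id'] and test_pred
output = pd.DataFrame({'id': [0, 2, 3], 'target': [1, 0, 1]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# The file should contain exactly the two columns, with binary targets
assert list(check.columns) == ['id', 'target']
assert check['target'].isin([0, 1]).all()
print(check.shape)
```

The same assertions could be run on the real file by reading `predictions/nnets.csv` instead of the buffer.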
