## 04 - Neural Networks with Keras

In this notebook we implement a neural-network approach, using the **Keras** API from **TensorFlow**, to predict whether each tweet refers to a real disaster or not.

#### Loading data

We start by importing the packages we are going to use and loading the datasets:

In [107]:
import re
import string

import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras

# Import the layers from tensorflow.keras so they match the `keras` module used below
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Dense

from sklearn.model_selection import KFold

train_data = pd.read_csv("../data/train.csv")
test_data = pd.read_csv("../data/test.csv")

train_text, train_label = np.array(train_data['text']), np.array(train_data['target'])
test_text = np.array(test_data['text'])

print(train_text.shape)
print(train_label.shape)
print(test_text.shape)

(7613,)
(7613,)
(3263,)


In [108]:
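max_features = 20000    # maximum vocabulary size kept by the vectorizer
sequence_length = 500   # every tweet is padded or truncated to this many tokens

embedding_dim = 128     # dimension of the learned token embeddings

dropout_rate = 0.5      # fraction of units dropped during training

conv_filters = 128      # number of filters in each Conv1D layer
conv_kernel_size = 7    # width of the convolution window, in tokens
conv_strides = 3        # step between consecutive windows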

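We clean the text by lowercasing it and removing punctuation characters, then build a vectorizer that maps each tweet to a fixed-length sequence of integer token indices: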

In [109]:
def custom_standardization(raw):
    lowercase = tf.strings.lower(raw)

    no_punct = tf.strings.regex_replace(
        lowercase, "[%s]" % re.escape(string.punctuation), ""
    )

    return no_punct

vectorizer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

vectorizer.adapt(train_text)
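
As a quick sanity check of the cleaning step, the same lowercasing and punctuation stripping can be reproduced on plain Python strings. This is only an illustrative sketch: `standardize_py` and the sample strings are invented for the example, and the actual pipeline runs `custom_standardization` on tensors inside the `TextVectorization` layer.

```python
import re
import string

# Same character class as in custom_standardization: every ASCII punctuation mark.
PUNCT_PATTERN = "[%s]" % re.escape(string.punctuation)

def standardize_py(text):
    """Hypothetical plain-Python counterpart of custom_standardization."""
    return re.sub(PUNCT_PATTERN, "", text.lower())

sample = "Flood WARNING: stay off the roads!!! #storm"
cleaned = standardize_py(sample)
url_cleaned = standardize_py("http://t.co/abc")

print(cleaned)      # flood warning stay off the roads storm
print(url_cleaned)  # httptcoabc
```

Hashtags keep their word and lose only the `#`, while URLs collapse into a single token because the separators between their parts are removed.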

In [110]:
def build_model():
    # Inputs are text strings, then we vectorize them
    inputs = keras.Input(shape=(1,), dtype=tf.string, name='text')
    x = vectorizer(inputs)

    # We use Embedding to map the vectorized text onto a space of dimension embedding_dim
    x = Embedding(max_features + 1, embedding_dim)(x)
    x = Dropout(dropout_rate)(x)

    # Conv1D + GlobalMaxPooling
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = Conv1D(conv_filters, conv_kernel_size, strides=conv_strides, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)

    # Dense hidden layer
    x = Dense(128, activation="relu")(x)
    x = Dropout(dropout_rate)(x)

    # Output layer
    outputs = Dense(1, activation="sigmoid", name="predictions")(x)

    model = keras.Model(inputs, outputs)

    # Compile the model with binary crossentropy loss and an adam optimizer.
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model
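
To get a feel for how much the two strided convolutions shrink the sequence before pooling, the output lengths can be worked out by hand. The sketch below assumes the Keras default `padding="valid"` for `Conv1D` (the model does not override it); `conv1d_output_length` is a helper introduced only for this illustration.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Output length of a 1D convolution with 'valid' padding (no padding)."""
    return (input_length - kernel_size) // strides + 1

# Values taken from the hyperparameter cell above.
sequence_length = 500
conv_kernel_size = 7
conv_strides = 3

after_first = conv1d_output_length(sequence_length, conv_kernel_size, conv_strides)
after_second = conv1d_output_length(after_first, conv_kernel_size, conv_strides)

print(after_first, after_second)  # 165 53
```

Under that assumption, `GlobalMaxPooling1D` takes the maximum over 53 positions for each of the 128 filters.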

In [112]:
epochs = 3

kfold = KFold(n_splits=10, shuffle=True)

kfold_losses = []
kfold_accuracies = []

for fold_train_indices, fold_val_indices in kfold.split(train_text, train_label):
    print('*'*10)
    fold_train_text = train_text[fold_train_indices]
    fold_train_label = train_label[fold_train_indices]
    fold_val_text = train_text[fold_val_indices]
    fold_val_label = train_label[fold_val_indices]

    model = build_model()

    model.fit(fold_train_text, fold_train_label, epochs=epochs, verbose=2)

    scores = model.evaluate(fold_val_text, fold_val_label, verbose=2)

    kfold_losses.append(scores[0])
    kfold_accuracies.append(scores[1])

**********
Epoch 1/3
215/215 - 31s - loss: 0.6124 - accuracy: 0.6564
Epoch 2/3
215/215 - 28s - loss: 0.3559 - accuracy: 0.8568
Epoch 3/3
215/215 - 28s - loss: 0.1605 - accuracy: 0.9435
24/24 - 1s - loss: 0.5471 - accuracy: 0.7690
**********
Epoch 1/3
215/215 - 29s - loss: 0.6397 - accuracy: 0.6259
Epoch 2/3
215/215 - 28s - loss: 0.3721 - accuracy: 0.8448
Epoch 3/3
215/215 - 28s - loss: 0.1640 - accuracy: 0.9431
24/24 - 1s - loss: 0.6261 - accuracy: 0.7769
**********
Epoch 1/3
215/215 - 29s - loss: 0.6221 - accuracy: 0.6455
Epoch 2/3
215/215 - 28s - loss: 0.3574 - accuracy: 0.8501
Epoch 3/3
215/215 - 28s - loss: 0.1610 - accuracy: 0.9423
24/24 - 1s - loss: 0.6448 - accuracy: 0.7638
**********
Epoch 1/3
215/215 - 31s - loss: 0.6329 - accuracy: 0.6293
Epoch 2/3
215/215 - 28s - loss: 0.3775 - accuracy: 0.8473
Epoch 3/3
215/215 - 28s - loss: 0.1628 - accuracy: 0.9431
24/24 - 1s - loss: 0.5941 - accuracy: 0.7595
**********
Epoch 1/3
215/215 - 29s - loss: 0.6380 - accuracy: 0.6267
Epoch 2/3
2
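
The step counts in the log (`215/215` for training, `24/24` for evaluation) follow from the fold sizes and the batch size. A minimal sketch of that arithmetic, assuming the Keras default `batch_size=32` (the notebook does not set it) and that `KFold` gives one extra sample to the first `n_samples % n_splits` folds; `fold_sizes` is a helper written only for this example.

```python
import math

def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_samples, n_splits, batch_size = 7613, 10, 32

val_sizes = fold_sizes(n_samples, n_splits)
first_val = val_sizes[0]
first_train = n_samples - first_val

train_steps = math.ceil(first_train / batch_size)
val_steps = math.ceil(first_val / batch_size)

print(val_sizes)               # [762, 762, 762, 761, 761, 761, 761, 761, 761, 761]
print(train_steps, val_steps)  # 215 24
```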

In [113]:
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(kfold_losses)):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i+1} - Loss: {kfold_losses[i]} - Accuracy: {kfold_accuracies[i]}')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(kfold_accuracies)} (+- {np.std(kfold_accuracies)})')
print(f'> Loss: {np.mean(kfold_losses)}')
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Loss: 0.547137975692749 - Accuracy: 0.7690288424491882
------------------------------------------------------------------------
> Fold 2 - Loss: 0.6261386275291443 - Accuracy: 0.7769029140472412
------------------------------------------------------------------------
> Fold 3 - Loss: 0.6447669267654419 - Accuracy: 0.7637795209884644
------------------------------------------------------------------------
> Fold 4 - Loss: 0.5940704941749573 - Accuracy: 0.7595269680023193
------------------------------------------------------------------------
> Fold 5 - Loss: 0.5130724906921387 - Accuracy: 0.8147174715995789
------------------------------------------------------------------------
> Fold 6 - Loss: 0.6829480528831482 - Accuracy: 0.7279894948005676
-------------------------------------------------------------------

In [79]:
model.evaluate(train_text, train_label, verbose=2)

238/238 - 7s - loss: 0.0064 - accuracy: 0.9959


[0.006378879304975271, 0.9959279894828796]

In [80]:
test_pred = model.predict(test_text)
test_pred = np.round(test_pred).flatten().astype('int')

test_pred

array([1, 1, 1, ..., 0, 1, 1])
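
`np.round` rounds halves to the nearest even value, so a predicted probability of exactly 0.5 becomes 0. If the decision threshold should be explicit (or tuned later), a comparison makes the intent clearer. A small sketch with made-up probabilities; `probs` and `threshold` are illustrative names, not part of the notebook.

```python
import numpy as np

# Made-up sigmoid outputs with the (n_samples, 1) shape that model.predict returns.
probs = np.array([[0.08], [0.5], [0.51], [0.97]])

threshold = 0.5  # assumption: the same cut-off that np.round implies

# The notebook's approach: round, flatten, cast to int.
rounded = np.round(probs).flatten().astype("int")

# Explicit threshold: probabilities at or above the cut-off are labelled 1.
thresholded = (probs.flatten() >= threshold).astype("int")

print(rounded)      # [0 0 1 1]
print(thresholded)  # [0 1 1 1]
```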

Finally, we store the test predictions in a CSV file with the columns `id` and `target`, ready for submission:

In [81]:
output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
