# Kaggle competition: Natural Language Processing with Disaster Tweets.




### Brief description of the problem and data

This project is an entry in the Kaggle learning competition to build an NLP model that performs binary classification
on a data set of about 10,000 English-language tweets, which may or may not describe disasters. This is a supervised learning problem.

The data consist of a training file, train.csv (7613 rows), and a test file, test.csv (3263 rows).

The train.csv has the following columns:

* id -- an integer id for each sample tweet

* keyword -- a category for the tweet with 222 unique values, 61 missing

* location -- location name with 3341 unique values, 2533 missing

* text -- the text of the tweet

* target -- the 0/1 label for binary classification where 1 = disaster

The test.csv has the same first 4 columns but not the target value which we need to predict. The id is not significant for training, but is needed for the test data because we need to include it with the predictions submitted to Kaggle.

In the training set, the two classes are not quite evenly balanced, but they are close, with about 43.0% in the positive (disaster) class.
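The column statistics quoted above (missing counts, unique counts, and the positive-class rate) can be reproduced with a few pandas calls. The following is a minimal sketch on an invented five-row frame named `toy`; the rows are illustrative only and are not taken from train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv; the rows below are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "keyword": [None, "ablaze", "ablaze", "accident", None],
    "location": [None, "London", None, None, "Canada"],
    "text": ["a", "b", "c", "d", "e"],
    "target": [1, 0, 1, 0, 0],
})

# Fraction of samples in the positive (disaster) class.
positive_rate = toy["target"].mean()

# Missing and unique counts per column (nunique ignores missing values).
missing = toy[["keyword", "location"]].isna().sum()
unique = toy[["keyword", "location"]].nunique()

print(f"positive rate = {positive_rate:.3f}")
print(missing.to_dict(), unique.to_dict())
```

Running the same three expressions on df_train should give the figures listed for the real data.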




In [48]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense, LSTM, GRU, Embedding, Input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
import os


df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
df_train

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [7]:
df_train.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


In [8]:
df_test.describe()

Unnamed: 0,id
count,3263.0
mean,5427.152927
std,3146.427221
min,0.0
25%,2683.0
50%,5500.0
75%,8176.0
max,10875.0


### Exploratory Data Analysis (EDA), Data Cleaning, and Preparation

The general plan is to build some type of binary classification recurrent neural network model on the sequence of words in the tweets. The assignment suggests using word embeddings. I chose to use GloVe pretrained embeddings from https://nlp.stanford.edu/projects/glove/ . In particular I picked the glove.6B.zip data set which has a 400K vocabulary extracted and trained from Wikipedia 2014 and Gigaword 5. This data set has a choice of 4 sets of word vectors of dimension 50, 100, 200, or 300.

This data set provides a dictionary to look up word vectors from lower-cased word tokens. So my strategy for cleaning the data is to lower-case all the letters in the tweets, discard all punctuation, numbers, and anything else that is not an alphabetic character, then split the resulting string on blanks to get word tokens. I will then look up each token in the GloVe dictionary and skip any tokens that don't match a known word embedding.

In this process I found that some of the tweets had no usable words. I could have skipped these samples in the training set, but they need to be preserved in the test set because we must provide a prediction for every test sample when submitting to Kaggle. To treat the training and test data sets the same way, I chose to substitute the single word "neutral" for any tweet that had no usable tokens.
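The lookup-and-fallback rule described above can be sketched in a few lines. The dictionary `toy_vectors` and the helper `lookup_tokens` are hypothetical names used only for this illustration; the notebook's own implementation works on the full GloVe dictionary and appears further down.

```python
# Hypothetical miniature embedding table standing in for the GloVe dictionary.
toy_vectors = {
    "forest": [0.1, 0.2],
    "fire": [0.3, 0.4],
    "neutral": [0.0, 0.5],
}

def lookup_tokens(tokens, vectors, fallback="neutral"):
    """Keep only tokens with a known vector; fall back to one placeholder word."""
    known = [t for t in tokens if t in vectors]
    return known if known else [fallback]

print(lookup_tokens(["forest", "fire", "xyzzy"], toy_vectors))  # ['forest', 'fire']
print(lookup_tokens(["", "httptco"], toy_vectors))              # ['neutral']
```

The second call shows the case that motivates the placeholder: every token is unknown, so the tweet would otherwise become an empty sequence.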

In [10]:
## Function to Load the GloVe dictionary of word embeddings.
## I experimented with word vectors of 50, 100, and 200 dimension from GloVe, but 100 seems to work best for this problem.

def get_glove_vectors(filename="data/glove.6B.100d.txt"):
    ## function from https://campus.datacamp.com/courses/recurrent-neural-networks-for-language-modeling-in-python/rnn-architecture?ex=7
    # Get all word vectors from pre-trained model
    glove_vector_dict = {}
    with open(filename, encoding="UTF-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = values[1:]
            glove_vector_dict[word] = np.asarray(coefs, dtype='float32')
    return glove_vector_dict

import time
start = time.time()

glove_vector_dict = get_glove_vectors()

end = time.time()
print(f'elapsed seconds = {end - start}')
type(glove_vector_dict)


elapsed seconds = 9.88432240486145


dict
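For readers who have not opened a GloVe text file: each line holds a token followed by its space-separated coefficients. The sketch below parses two made-up lines from an in-memory buffer with the same split-and-convert logic as get_glove_vectors; the words and numbers are invented, and only 3 dimensions are used instead of 100.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its
# space-separated coefficients (3 dimensions here instead of 100).
fake_file = io.StringIO("fire 0.1 -0.2 0.3\nflood 0.4 0.5 -0.6\n")

vectors = {}
for line in fake_file:
    word, *coefs = line.split()
    vectors[word] = np.asarray(coefs, dtype="float32")

print(sorted(vectors))         # ['fire', 'flood']
print(vectors["fire"].shape)   # (3,)
```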

I chose to ignore the location column. Initially I also discarded the keyword column, but I changed this to the simple approach of appending the keyword (when present) as additional words at the end of each tweet.

At this point the train and test data sets are in the pandas data frames df_train and df_test, exactly as read from the CSV files. The following code appends the keywords:

In [33]:
## Previous models ignored the keyword column in the training and test data.

## To incorporate the keyword, we will try just appending the keywords (when present) as
## an additional one or two tokens at the end of each tweet text.

## This function modifies the pandas dataframe df by
## appending the keyword (if present) to the end of the tweet.
## Keywords like 'airplane%20accident' are split into two words.
## It also writes the modified dataframe to a csv file (for debugging).
def add_keyword(df, filename=None):
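    ## Only rows that actually have a keyword are modified.
    has_keyword = df['keyword'].notna()
    keywords = df.loc[has_keyword, 'keyword'].str.replace('%20', ' ', regex=False)
    df.loc[has_keyword, 'text'] = df.loc[has_keyword, 'text'] + ' ' + keywords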
    if filename:
        df.to_csv(f'data/{filename}',index=False)

add_keyword(df_train, filename='df_train.csv')
add_keyword(df_test, filename='df_test.csv')

Here is the text of a few example tweets, with the keyword (when present) appended:

In [34]:
list(df_train.loc[29:34,'text'])

['Do you like pasta?',
 'The end!',
 '@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C ablaze',
 'We always try to bring the heavy. #metal #RT http://t.co/YAo1e0xngw ablaze',
 '#AFRICANBAZE: Breaking news:Nigeria flag set ablaze in Aba. http://t.co/2nndBGwyEi ablaze',
 'Crying out for more! Set me ablaze ablaze']

In [35]:
s1 = df_train[df_train['id']==232]['text']
s1 = list(s1)[0]
s1

"+ Nicole Fletcher one of a victim of crashed airplane few times ago. \n\nThe accident left a little bit trauma for her. Although she's \n\n+ airplane accident"

Here we define a function that cleans up a tweet by removing all non-alphabetic characters, lower-casing the text, and splitting the string into word tokens on spaces.

In [36]:
import re

def clean_up_tweet(tweet):
    """
    Clean up the content of one tweet, removing punctuation and numbers. 
    
    Parameters:
    tweet(str):The text of the tweet
    
    Returns:
    word_list: A list of pure alphabetic words in lower case
    
    """
    ## Remove all characters except alphabetic chars and space,
    ## convert to lower case and split on space.
    word_list = re.sub('[^A-Za-z ]+','',tweet).lower().split(' ')
    return word_list

In [37]:
## The example tweet above (id 232) after cleaning
clean_up_tweet(s1)

['',
 'nicole',
 'fletcher',
 'one',
 'of',
 'a',
 'victim',
 'of',
 'crashed',
 'airplane',
 'few',
 'times',
 'ago',
 'the',
 'accident',
 'left',
 'a',
 'little',
 'bit',
 'trauma',
 'for',
 'her',
 'although',
 'shes',
 '',
 'airplane',
 'accident']
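The output above contains empty strings, because splitting on a single space yields one for each leading or repeated blank. They are harmless here, since they are simply skipped at the embedding lookup, but they do inflate the word counts. A minimal sketch of a variant that avoids them follows; the name `clean_up_tweet_no_blanks` is hypothetical and not part of the notebook. Note the trade-off: because non-letters are replaced by a space rather than deleted, "she's" becomes the two tokens 'she' and 's' instead of 'shes', so the results reported below would not be reproduced exactly with this variant.

```python
import re

def clean_up_tweet_no_blanks(tweet):
    """Hypothetical variant of clean_up_tweet that never returns empty tokens."""
    # Replace every run of non-letters with a single space so that words
    # separated only by a newline or punctuation do not get glued together.
    letters_only = re.sub(r"[^A-Za-z]+", " ", tweet).lower()
    # str.split() with no argument drops leading, trailing, and repeated blanks.
    return letters_only.split()

print(clean_up_tweet_no_blanks("+ Set me\n\nablaze! she's"))
# ['set', 'me', 'ablaze', 'she', 's']
```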

In preparing the data for training, I want to reserve 20% for validation during training.

In [38]:
train, valid = train_test_split(df_train, train_size=0.8, shuffle=True, random_state=42)
print(train.shape)
print(valid.shape)

(6090, 5)
(1523, 5)


Here we are applying the clean_up_tweet function to produce clean lower case word tokens for each of 3 data sets for training, validation, and testing.

In [39]:
train_x = train['text'].map(clean_up_tweet)
valid_x = valid['text'].map(clean_up_tweet)
test_x = df_test['text'].map(clean_up_tweet)

print(train_x.shape)
print(valid_x.shape)
print(test_x.shape)


(6090,)
(1523,)
(3263,)


Here we extract the target values for training and validation.

In [40]:
train_y = np.array(train['target'], dtype=np.float32)
valid_y = np.array(valid['target'], dtype=np.float32)
print(train_y[:5])
print(train_y[-5:])
print(np.sum(train_y)/len(train_y))
print(np.sum(valid_y)/len(valid_y))

[1. 0. 1. 1. 0.]
[0. 0. 0. 1. 1.]
0.43054187192118226
0.4261326329612607


Above we confirmed that the train and validation sets have about the same class proportions for the target.
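The split in this notebook is a plain shuffled split, and the class proportions happened to come out close. If one wanted to enforce matching proportions rather than check them afterwards, scikit-learn's train_test_split accepts a `stratify` argument. The sketch below uses synthetic labels (not the competition data) and is an optional alternative, not what was run here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 43% positive rate as the training data.
labels = np.array([1] * 43 + [0] * 57)
features = np.arange(len(labels)).reshape(-1, 1)

# stratify=labels asks scikit-learn to preserve the class ratio in both parts.
x_tr, x_va, y_tr, y_va = train_test_split(
    features, labels, train_size=0.8, shuffle=True, random_state=42, stratify=labels
)

print(len(y_tr), len(y_va))
print(y_tr.mean(), y_va.mean())
```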



Once we have a sequence of word vectors for each tweet, we also need to pad each sequence to a uniform length before training a recurrent NN model. The following code counts the tokens in each tweet and finds the maximum lengths. The counts include empty strings and tokens that GloVe does not know, so they are upper bounds on the number of embedded words. Based on these counts, I chose to pad all the sequences to length 56, which covers all the tweets.

In [72]:
train['tweet_word_counts'] = [len(x) for x in train_x]
valid['tweet_word_counts'] = [len(x) for x in valid_x]
df_test['tweet_word_counts'] = [len(x) for x in test_x]
print(np.max(train['tweet_word_counts']) )
print(np.max(valid['tweet_word_counts']) )
print(np.max(df_test['tweet_word_counts']) )

55
31
35


Now we are ready to replace the cleaned word tokens in the tweets with word embeddings from the GloVe data set. I am using the 100d data file that has vectors of 100 floats for each word.

In [43]:
def glove_word_embeddings(word_lists, pad_to=56):
    ## Replace the words in each tweet with embeddings from the
    ## GloVe dictionary, skipping any words not found, and pad
    ## each sequence of embeddings to a fixed length.

    ## If none of the words in a tweet match, substitute a
    ## placeholder sequence of one word, "neutral".
    d = glove_vector_dict
    placeholder = np.array([d["neutral"]])
    ## pad_sequences pads each row, so we transpose to
    ## (embedding_dim, n_words), pad, and transpose back
    ## to (pad_to, embedding_dim).
    pad_neutral = pad_sequences(placeholder.T, maxlen=pad_to, dtype='float32').T
    outer = []
    for word_list in word_lists:
        ## Keep only the tokens that have a GloVe vector.
        enc_list = [d[word] for word in word_list if word in d]
        if enc_list:
            enc_array = np.array(enc_list)
            padded = pad_sequences(enc_array.T, maxlen=pad_to, dtype='float32').T
            outer.append(padded)
        else:
            outer.append(pad_neutral)
    return np.array(outer)

In [44]:

start = time.time()
X_train = glove_word_embeddings(train_x)
X_valid = glove_word_embeddings(valid_x)
X_test = glove_word_embeddings(test_x)
end = time.time()
print(f'elapsed seconds = {end - start}')
print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)

elapsed seconds = 2.3428685665130615
(6090, 56, 100)
(1523, 56, 100)
(3263, 56, 100)
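Each tweet is now a 56 x 100 matrix whose leading rows are zeros, because pad_sequences pads (and truncates) at the front by default. The transpose trick used above can be hard to read, so here is a minimal NumPy sketch of the same front-padding applied directly to a (n_words, dim) array. The helper name `pre_pad` is hypothetical and is not used elsewhere in the notebook.

```python
import numpy as np

def pre_pad(embedded, pad_to):
    """Left-pad a (n_words, dim) array with zero rows up to pad_to rows."""
    n_words, dim = embedded.shape
    out = np.zeros((pad_to, dim), dtype="float32")
    # Keep the last pad_to rows if the sequence is too long, mirroring the
    # default 'pre' truncation of pad_sequences.
    kept = embedded[-pad_to:]
    out[pad_to - len(kept):] = kept
    return out

toy = np.ones((3, 4), dtype="float32")   # 3 "words", 4-dimensional vectors
padded = pre_pad(toy, pad_to=5)
print(padded.shape)     # (5, 4)
print(padded[:, 0])     # [0. 0. 1. 1. 1.]
```

With front-padding, the real words sit at the end of the sequence, so they are the last inputs the LSTM sees before it emits its final state.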


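### Model Architecture

All five models take the padded sequences of 100-dimensional GloVe vectors as input, end in a single sigmoid unit, and are trained with binary cross-entropy, the Adam optimizer, a dropout of 0.2, and early stopping on validation loss with a patience of 6 epochs.

Model 1 -- Single Layer LSTM, 128 units per layer

Model 2 -- Two Layer LSTM, 128 units per layer

Model 3 -- Three Layer LSTM, 128 units per layer

Model 4 -- Bi-Directional LSTM with 2 layers, 64 units per layer

Model 5 -- Bi-Directional LSTM with 3 layers, 64 units per layer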



## Model 1 -- Single Layer LSTM, 128 units per layer

In [49]:
DROPOUT = 0.2
UNITS_PER_LAYER = 128

# Build model
model = Sequential()
model.add(LSTM(units=UNITS_PER_LAYER, input_shape=(None, 100), return_sequences=False, dropout=DROPOUT))
model.add(Dense(1, activation='sigmoid'))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

file_name = 'weights_{epoch:03d}_{val_accuracy:.4f}.hdf5'

checkpoint_filepath = os.path.join('.', 'SAVE_MODELS', file_name)

modelCheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=6, restore_best_weights=True)

model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_5 (LSTM)               (None, 128)               117248    
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
Total params: 117,377
Trainable params: 117,377
Non-trainable params: 0
_________________________________________________________________
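The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias. The sketch below is plain arithmetic under that assumption; the helper names `lstm_params` and `dense_params` are hypothetical and do not call Keras.

```python
def lstm_params(input_dim, units):
    """Trainable weights in one standard LSTM layer.

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units), and a bias (units).
    """
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, outputs):
    """Weights plus biases of a fully connected layer."""
    return input_dim * outputs + outputs

first = lstm_params(100, 128)     # fed by 100-d GloVe vectors
stacked = lstm_params(128, 128)   # fed by a previous 128-unit LSTM
head = dense_params(128, 1)

print(first, stacked, head)       # 117248 131584 129
print(first + head)               # 117377
```

The first and last figures match the summary above; the 131,584 value is the count for each additional stacked 128-unit layer.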


In [50]:
history = model.fit(X_train, train_y, 
                    batch_size=20, 
                    epochs=100, 
                    validation_data=(X_valid,valid_y),
                    callbacks=[earlyStopping,modelCheckpoint]
                   )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 9: early stopping
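The per-epoch losses are not shown in the log above, so the following is only a simplified model of how a patience of 6 leads to a stop. It tracks the best validation loss and stops once 6 consecutive epochs fail to improve on it. The loss values are invented, and `early_stop_epoch` is a hypothetical helper, not the Keras callback itself.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, or (None, best_epoch)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Invented validation losses: the minimum is at epoch 3.
losses = [0.50, 0.45, 0.42, 0.43, 0.44, 0.46, 0.47, 0.49, 0.52, 0.55]
print(early_stop_epoch(losses, patience=6))   # (9, 3)
```

Under this logic, a stop at epoch 9 with patience 6 corresponds to a best validation loss at epoch 3, which is the state that restore_best_weights=True would bring back.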


In [53]:
sub_num = 18

predict_proba = model.predict(X_test,batch_size=20)
predict = (predict_proba > 0.5).astype(int)

submission = pd.DataFrame(df_test['id'])
submission['target']=predict

submission.to_csv(f'data/submission{sub_num}.csv',index=False)
submission

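## Note: the validation metrics below use a 0.402 decision threshold,
## while the submission above uses 0.5.
yp_proba = model.predict(X_valid, batch_size=20)
yp = (yp_proba > 0.402).astype(int)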
print('acc', accuracy_score(valid_y, yp))
print('recall', recall_score(valid_y, yp))
print('precision', precision_score(valid_y, yp))
print('F1', f1_score(valid_y, yp))
pd.DataFrame(confusion_matrix(valid_y,yp))

acc 0.8240315167432699
recall 0.7580893682588598
precision 0.8159203980099502
F1 0.7859424920127794


Unnamed: 0,0,1
0,763,111
1,157,492
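The four printed metrics all follow from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. As a worked check, using only the counts shown above:

```python
# Counts read from the Model 1 confusion matrix above
# (rows = true class, columns = predicted class).
tn, fp = 763, 111
fn, tp = 157, 492

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)           # share of real disasters that were found
precision = tp / (tp + fp)        # share of predicted disasters that were real
f1 = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.4f} recall={recall:.4f} precision={precision:.4f} F1={f1:.4f}")
# acc=0.8240 recall=0.7581 precision=0.8159 F1=0.7859
```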


## Model 2 -- Two Layer LSTM, 128 units per layer

In [57]:
DROPOUT = 0.2
UNITS_PER_LAYER = 128

# Build model
model = Sequential()
model.add(LSTM(units=UNITS_PER_LAYER, input_shape=(None, 100), return_sequences=True, dropout=DROPOUT))
model.add(LSTM(units=UNITS_PER_LAYER, return_sequences=False, dropout=DROPOUT))
model.add(Dense(1, activation='sigmoid'))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

file_name = 'weights_{epoch:03d}_{val_accuracy:.4f}.hdf5'

checkpoint_filepath = os.path.join('.', 'SAVE_MODELS', file_name)

modelCheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=6, restore_best_weights=True)

model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_9 (LSTM)               (None, None, 128)         117248    
                                                                 
 lstm_10 (LSTM)              (None, 128)               131584    
                                                                 
 dense_5 (Dense)             (None, 1)                 129       
                                                                 
Total params: 248,961
Trainable params: 248,961
Non-trainable params: 0
_________________________________________________________________


In [58]:
history = model.fit(X_train, train_y, 
                    batch_size=20, 
                    epochs=100, 
                    validation_data=(X_valid,valid_y),
                    callbacks=[earlyStopping,modelCheckpoint]
                   )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 11: early stopping


In [59]:
sub_num = 18

predict_proba = model.predict(X_test,batch_size=20)
predict = (predict_proba > 0.5).astype(int)

submission = pd.DataFrame(df_test['id'])
submission['target']=predict

submission.to_csv(f'data/submission{sub_num}.csv',index=False)
submission

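## Note: the validation metrics below use a 0.402 decision threshold,
## while the submission above uses 0.5.
yp_proba = model.predict(X_valid, batch_size=20)
yp = (yp_proba > 0.402).astype(int)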
print('acc', accuracy_score(valid_y, yp))
print('recall', recall_score(valid_y, yp))
print('precision', precision_score(valid_y, yp))
print('F1', f1_score(valid_y, yp))
pd.DataFrame(confusion_matrix(valid_y,yp))

acc 0.8345370978332239
recall 0.7303543913713405
precision 0.8602540834845736
F1 0.79


Unnamed: 0,0,1
0,797,77
1,175,474


## Model 3 -- Three Layer LSTM, 128 units per layer

In [54]:
DROPOUT = 0.2
UNITS_PER_LAYER = 128

# Build model
model = Sequential()
model.add(LSTM(units=UNITS_PER_LAYER, input_shape=(None, 100), return_sequences=True, dropout=DROPOUT))
model.add(LSTM(units=UNITS_PER_LAYER, return_sequences=True, dropout=DROPOUT))
model.add(LSTM(units=UNITS_PER_LAYER, return_sequences=False, dropout=DROPOUT))
model.add(Dense(1, activation='sigmoid'))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

file_name = 'weights_{epoch:03d}_{val_accuracy:.4f}.hdf5'

checkpoint_filepath = os.path.join('.', 'SAVE_MODELS', file_name)

modelCheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=6, restore_best_weights=True)

model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_6 (LSTM)               (None, None, 128)         117248    
                                                                 
 lstm_7 (LSTM)               (None, None, 128)         131584    
                                                                 
 lstm_8 (LSTM)               (None, 128)               131584    
                                                                 
 dense_4 (Dense)             (None, 1)                 129       
                                                                 
Total params: 380,545
Trainable params: 380,545
Non-trainable params: 0
_________________________________________________________________


In [55]:
history = model.fit(X_train, train_y, 
                    batch_size=20, 
                    epochs=100, 
                    validation_data=(X_valid,valid_y),
                    callbacks=[earlyStopping,modelCheckpoint]
                   )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 12: early stopping


In [56]:
sub_num = 18

predict_proba = model.predict(X_test,batch_size=20)
predict = (predict_proba > 0.5).astype(int)

submission = pd.DataFrame(df_test['id'])
submission['target']=predict

submission.to_csv(f'data/submission{sub_num}.csv',index=False)
submission

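## Note: the validation metrics below use a 0.402 decision threshold,
## while the submission above uses 0.5.
yp_proba = model.predict(X_valid, batch_size=20)
yp = (yp_proba > 0.402).astype(int)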
print('acc', accuracy_score(valid_y, yp))
print('recall', recall_score(valid_y, yp))
print('precision', precision_score(valid_y, yp))
print('F1', f1_score(valid_y, yp))
pd.DataFrame(confusion_matrix(valid_y,yp))

acc 0.8325673013788575
recall 0.7411402157164869
precision 0.846830985915493
F1 0.7904683648315529


Unnamed: 0,0,1
0,787,87
1,168,481


## Model 4 -- Bi-Directional LSTM with 2 layers, 64 units per layer

In [66]:
DROPOUT = 0.2
UNITS_PER_LAYER = 64

## Try switching to a Bidirectional LSTM model, as in this example
## https://keras.io/examples/nlp/bidirectional_lstm_imdb/

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 100) )
x = layers.Bidirectional(LSTM(units=UNITS_PER_LAYER, return_sequences=True, dropout=DROPOUT))(inputs)
x = layers.Bidirectional(LSTM(units=UNITS_PER_LAYER, return_sequences=False, dropout=DROPOUT))(x)
# Add a classifier
outputs = layers.Dense(1,  activation='sigmoid')(x)
model = keras.Model(inputs, outputs)

opt = tf.keras.optimizers.Adam(learning_rate=0.0005)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

file_name = 'weights_{epoch:03d}_{val_accuracy:.4f}.hdf5'

checkpoint_filepath = os.path.join('.', 'SAVE_MODELS', file_name)

modelCheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=6, restore_best_weights=True)

model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, None, 100)]       0         
                                                                 
 bidirectional_5 (Bidirectio  (None, None, 128)        84480     
 nal)                                                            
                                                                 
 bidirectional_6 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense_8 (Dense)             (None, 1)                 129       
                                                                 
Total params: 183,425
Trainable params: 183,425
Non-trainable params: 0
_________________________________________________________________
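The bidirectional counts can be checked the same way as for a plain LSTM. A bidirectional wrapper runs one forward and one backward LSTM, each with its own weights, and by default concatenates their outputs, which is why 64 units per direction give a 128-wide output. The helpers below are hypothetical and do plain arithmetic only.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input kernel, recurrent kernel, and bias."""
    return 4 * ((input_dim + units) * units + units)

def bidirectional_lstm_params(input_dim, units):
    """Forward and backward LSTMs each carry a full set of weights."""
    return 2 * lstm_params(input_dim, units)

first = bidirectional_lstm_params(100, 64)    # fed by 100-d GloVe vectors
second = bidirectional_lstm_params(128, 64)   # fed by 2 * 64 concatenated outputs
head = 128 * 1 + 1                            # Dense(1) on the 128-d output

print(first, second, head, first + second + head)  # 84480 98816 129 183425
```

Despite having two bidirectional layers, this model has fewer parameters than the two-layer 128-unit model (248,961), because each direction uses only 64 units.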


In [67]:
history = model.fit(X_train, train_y, 
                    batch_size=20, 
                    epochs=100, 
                    validation_data=(X_valid,valid_y),
                    callbacks=[earlyStopping,modelCheckpoint]
                   )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 8: early stopping


In [73]:
sub_num = 18

predict_proba = model.predict(X_test,batch_size=20)
predict = (predict_proba > 0.5).astype(int)

submission = pd.DataFrame(df_test['id'])
submission['target']=predict

submission.to_csv(f'data/submission{sub_num}.csv',index=False)
submission

yp_proba = model.predict(X_valid, batch_size=20)
yp = (yp_proba > 0.5).astype(int)
print('acc', accuracy_score(valid_y, yp))
print('recall', recall_score(valid_y, yp))
print('precision', precision_score(valid_y, yp))
print('F1', f1_score(valid_y, yp))
pd.DataFrame(confusion_matrix(valid_y,yp))

acc 0.8325673013788575
recall 0.7303543913713405
precision 0.855595667870036
F1 0.7880299251870323


Unnamed: 0,0,1
0,794,80
1,175,474


## Model 5 -- Bi-Directional LSTM with 3 layers, 64 units per layer

In [69]:
DROPOUT = 0.2
UNITS_PER_LAYER = 64

## Try switching to a Bidirectional LSTM model, as in this example
## https://keras.io/examples/nlp/bidirectional_lstm_imdb/

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 100) )
x = layers.Bidirectional(LSTM(units=UNITS_PER_LAYER, return_sequences=True, dropout=DROPOUT))(inputs)
x = layers.Bidirectional(LSTM(units=UNITS_PER_LAYER, return_sequences=True, dropout=DROPOUT))(x)
x = layers.Bidirectional(LSTM(units=UNITS_PER_LAYER, return_sequences=False, dropout=DROPOUT))(x)
# Add a classifier
outputs = layers.Dense(1,  activation='sigmoid')(x)
model = keras.Model(inputs, outputs)

opt = tf.keras.optimizers.Adam(learning_rate=0.0005)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

file_name = 'weights_{epoch:03d}_{val_accuracy:.4f}.hdf5'

checkpoint_filepath = os.path.join('.', 'SAVE_MODELS', file_name)

modelCheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=6, restore_best_weights=True)

model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None, 100)]       0         
                                                                 
 bidirectional_7 (Bidirectio  (None, None, 128)        84480     
 nal)                                                            
                                                                 
 bidirectional_8 (Bidirectio  (None, None, 128)        98816     
 nal)                                                            
                                                                 
 bidirectional_9 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense_9 (Dense)             (None, 1)                 129       
                                                           

In [70]:
history = model.fit(X_train, train_y, 
                    batch_size=20, 
                    epochs=100, 
                    validation_data=(X_valid,valid_y),
                    callbacks=[earlyStopping,modelCheckpoint]
                   )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 10: early stopping


In [76]:
sub_num = 19

predict_proba = model.predict(X_test,batch_size=20)
predict = (predict_proba > 0.5).astype(int)

submission = pd.DataFrame(df_test['id'])
submission['target']=predict

submission.to_csv(f'data/submission{sub_num}.csv',index=False)
submission

yp_proba = model.predict(X_valid, batch_size=20)
yp = (yp_proba > 0.5).astype(int)
print('acc', accuracy_score(valid_y, yp))
print('recall', recall_score(valid_y, yp))
print('precision', precision_score(valid_y, yp))
print('F1', f1_score(valid_y, yp))
pd.DataFrame(confusion_matrix(valid_y,yp))

acc 0.8325673013788575
recall 0.7303543913713405
precision 0.855595667870036
F1 0.7880299251870323


Unnamed: 0,0,1
0,794,80
1,175,474


### References

1) Kaggle Natural Language Processing with Disaster Tweets https://www.kaggle.com/competitions/nlp-getting-started/overview
    
2) GloVe: Global Vectors for Word Representation https://nlp.stanford.edu/projects/glove/

3) DataCamp Course: Recurrent Neural Networks for Language Modeling in Python: https://campus.datacamp.com/courses/recurrent-neural-networks-for-language-modeling-in-python

4) Keras.io example code for Bi-Directional LSTM model: https://keras.io/examples/nlp/bidirectional_lstm_imdb/