# Text Classification Model for Disaster Tweets Classification Using TensorFlow Take 2
### David Lowe
### February 19, 2021

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Disaster Tweets Classification dataset presents a binary classification problem, where we attempt to predict one of two possible outcomes.

INTRODUCTION: Twitter has become an important communication channel in times of emergency. The ubiquitous nature of smartphones enables people to announce an emergency they are observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter. In this practice Kaggle competition, we want to build a machine learning model that predicts which Tweets are about real disasters and which ones are not. This dataset was created by Figure-Eight and shared initially on their 'Data for Everyone' website.

In iteration Take1, we deployed a bag-of-words model to classify the Tweets. We also made predictions on Kaggle's test dataset and submitted the results for evaluation.

In this Take2 iteration, we will deploy a word-embedding model to classify the Tweets. We will also submit the test predictions to Kaggle and obtain the performance score for the model.

ANALYSIS: In iteration Take1, the bag-of-words model achieved an average accuracy of 76.25% after 20 epochs with five iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 75.20%.

In this Take2 iteration, the word-embedding model achieved an average accuracy of 72.55% after 20 epochs with five-fold cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 73.79%.

CONCLUSION: In this modeling iteration, the word-embedding TensorFlow model did not do as well as the bag-of-words model. However, we should continue to experiment with both natural language processing techniques for further modeling.

Dataset Used: Disaster Tweets Classification

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://www.kaggle.com/c/nlp-getting-started/

A deep-learning text classification project generally can be broken down into five major tasks:

1. Prepare Environment
2. Load and Prepare Text Data
3. Define and Train Models
4. Evaluate and Optimize Models
5. Finalize Model and Make Predictions

# Task 1 - Prepare Environment

In [1]:
# # Install the packages to support accessing environment variable and SQL databases
# !pip install python-dotenv PyMySQL boto3

In [2]:
# # Retrieve GPU configuration information from Colab
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#     print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
#     print('and then re-execute this cell.')
# else:
#     print(gpu_info)

In [3]:
# # Retrieve memory configuration information from Colab
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#     print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
#     print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
#     print('re-execute this cell.')
# else:
#     print('You are using a high-RAM runtime!')

In [4]:
# Retrieve CPU information from the system
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])

The number of available CPUs is: 4


## 1.a) Load libraries and modules

In [5]:
# Set the random seed number for reproducible results
seedNum = 1

In [6]:
# Load libraries and packages
import random
random.seed(seedNum)
import numpy as np
np.random.seed(seedNum)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
import sys
import boto3
import shutil
import string
import nltk
from nltk.corpus import stopwords
from collections import Counter
from datetime import datetime
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import tensorflow as tf
tf.random.set_seed(seedNum)
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ReduceLROnPlateau

In [7]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/pythonml/nltk_data...
[nltk_data]    |   Package movie_reviews is a

True

## 1.b) Set up the controlling parameters and functions

In [8]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the number of CPU cores available for multi-thread processing
n_jobs = 1

# Set up the flag to stop sending progress emails (setting to True will send status emails!)
notifyStatus = False

# Set the verbose level for program execution output
verbose = False

# Set Pandas options
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 140)

# Set the percentage sizes for splitting the dataset
TEST_SET_SIZE = 0.2
VAL_SET_SIZE = 0.25

# Set the number of folds for cross validation
N_FOLDS = 5
N_ITERATIONS = 1

# Set various default modeling parameters
DEFAULT_LOSS = 'binary_crossentropy'
DEFAULT_METRICS = ['accuracy']
DEFAULT_OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=0.0001)
DEFAULT_INITIALIZER = tf.keras.initializers.GlorotUniform(seed=seedNum)
EPOCH_NUMBER = 20
BATCH_SIZE = 1
DEFAULT_VECTOR_SPACE = 100
DEFAULT_FILTERS = 32
DEFAULT_KERNEL_SIZE = 8
DEFAULT_POOL_SIZE = 2

# Define the labels to use for graphing the data
train_metric = "accuracy"
validation_metric = "val_accuracy"
train_loss = "loss"
validation_loss = "val_loss"

# Check the number of GPUs accessible through TensorFlow
print('Num GPUs Available:', len(tf.config.list_physical_devices('GPU')))

# Print out the TensorFlow version for confirmation
print('TensorFlow version:', tf.__version__)

Num GPUs Available: 0
TensorFlow version: 2.3.1


In [9]:
# Set up the parent directory location for loading the dotenv files
# from google.colab import drive
# drive.mount('/content/gdrive')
# gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
# env_path = '/content/gdrive/My Drive/Colab Notebooks/'
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)

# Set up the dotenv file for retrieving environment variables
# env_path = "/Users/david/PycharmProjects/"
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)

In [10]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
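    # All four settings are required; a missing topic ARN would otherwise only fail inside sns.publish()
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")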
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])

In [11]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 1 - Prepare Environment has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [12]:
# Reset the random number generators
def reset_random(x):
    random.seed(x)
    np.random.seed(x)
    tf.random.set_seed(x)

In [13]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 1 - Prepare Environment completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 2 - Load and Prepare Text Data

In [14]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 2 - Load and Prepare Text Data has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

## 2.a) Download Text Data Archive

In [15]:
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-nlp-disaster-tweets/train.csv'
Xy_train = pd.read_csv(dataset_path)

# Take a peek at the dataframe after import
print(Xy_train.head(10))

   id keyword location                                               text  target
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...       1
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada       1
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...       1
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...       1
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...       1
5   8     NaN      NaN  #RockyFire Update => California Hwy. 20 closed...       1
6  10     NaN      NaN  #flood #disaster Heavy rain causes flash flood...       1
7  13     NaN      NaN  I'm on top of the hill and I can see a fire in...       1
8  14     NaN      NaN  There's an emergency evacuation happening now ...       1
9  15     NaN      NaN  I'm afraid that the tornado is coming to our a...       1


In [16]:
# Dropping redundant features
Xy_train.drop(columns=['id','keyword','location'], inplace=True)

# Take a peek at the dataframe after cleaning
print(Xy_train.head(10))

                                                text  target
0  Our Deeds are the Reason of this #earthquake M...       1
1             Forest fire near La Ronge Sask. Canada       1
2  All residents asked to 'shelter in place' are ...       1
3  13,000 people receive #wildfires evacuation or...       1
4  Just got sent this photo from Ruby #Alaska as ...       1
5  #RockyFire Update => California Hwy. 20 closed...       1
6  #flood #disaster Heavy rain causes flash flood...       1
7  I'm on top of the hill and I can see a fire in...       1
8  There's an emergency evacuation happening now ...       1
9  I'm afraid that the tornado is coming to our a...       1


## 2.b) Splitting Data for Training and Validation

In [17]:
X_train_df = Xy_train.iloc[:,0]
y_train_df = Xy_train.iloc[:,1]
print("Xy_train_df.shape: {} X_train_df.shape: {} y_train_df.shape: {}".format(Xy_train.shape, X_train_df.shape, y_train_df.shape))

Xy_train_df.shape: (7613, 2) X_train_df.shape: (7613,) y_train_df.shape: (7613,)


## 2.c) Load Document and Build Vocabulary

In [18]:
# turn a doc into clean tokens
def clean_sentence(sentence):
    # split into tokens by white space
    tokens = sentence.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
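
To make the cleaning steps concrete, the sketch below re-implements the same pipeline and runs it on two Tweets from the test set. The small `STOP_WORDS` set is a hypothetical stand-in for NLTK's English stop-word list (so the example runs without downloading the corpus), and `clean_sentence_demo` is an illustrative name that is not part of this notebook.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {'an', 'and', 'in', 'were', 'the'}

def clean_sentence_demo(sentence, stop_words=STOP_WORDS):
    # split into tokens by white space
    tokens = sentence.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # keep purely alphabetic tokens (drops numbers and empty strings)
    tokens = [w for w in tokens if w.isalpha()]
    # filter out stop words (case-sensitive comparison)
    tokens = [w for w in tokens if w not in stop_words]
    # filter out one-character tokens
    tokens = [w for w in tokens if len(w) > 1]
    return tokens

print(clean_sentence_demo("Typhoon Soudelor kills 28 in China and Taiwan"))
# ['Typhoon', 'Soudelor', 'kills', 'China', 'Taiwan']
print(clean_sentence_demo("We're shaking...It's an earthquake"))
# ['Were', 'shakingIts', 'earthquake']
```

The second Tweet shows two side effects worth knowing about. Stripping punctuation fuses "shaking...It's" into a single token, and because the stop-word comparison is case-sensitive against a lowercase list, the capitalized "Were" survives even though "were" is a stop word. Lowercasing tokens before filtering would change both the vocabulary and the results reported here, so it is an option for future experiments rather than something this notebook does.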

In [19]:
# clean a sentence and add its tokens to the vocab
def add_sentence_to_vocab(sentence, vocab):
    # clean doc
    tokens = clean_sentence(sentence)
    # update counts
    vocab.update(tokens)

In [20]:
# build the vocabulary from all comments in the dataframe
def build_vocabulary(comments, vocab):
    # walk through all comments in the dataframe
    for i in range(len(comments)):
        # add comments to vocab
        sentence = comments.iloc[i]
        add_sentence_to_vocab(sentence, vocab)
        if verbose : print('Processing comment:', sentence)
        if ((i+1) % 100) == 0 : print(i+1, 'comments processed so far.')
    print('Total number of comments loaded into the vocabulary:', len(comments), '\n')

In [21]:
# define vocab
vocab = Counter()
# add all docs to vocab
build_vocabulary(X_train_df, vocab)
# print the size of the vocab
print('The total number of words in the vocabulary:', len(vocab))
# print the top words in the vocab
top_words = 50
print('The top', top_words, 'words in the vocabulary:\n', vocab.most_common(top_words))

100 comments processed so far.
200 comments processed so far.
300 comments processed so far.
400 comments processed so far.
500 comments processed so far.
600 comments processed so far.
700 comments processed so far.
800 comments processed so far.
900 comments processed so far.
1000 comments processed so far.
1100 comments processed so far.
1200 comments processed so far.
1300 comments processed so far.
1400 comments processed so far.
1500 comments processed so far.
1600 comments processed so far.
1700 comments processed so far.
1800 comments processed so far.
1900 comments processed so far.
2000 comments processed so far.
2100 comments processed so far.
2200 comments processed so far.
2300 comments processed so far.
2400 comments processed so far.
2500 comments processed so far.
2600 comments processed so far.
2700 comments processed so far.
2800 comments processed so far.
2900 comments processed so far.
3000 comments processed so far.
3100 comments processed so far.
3200 comments pro

In [22]:
# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print('The number of words with the minimum appearance:', len(tokens))

The number of words with the minimum appearance: 7535
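
The minimum-occurrence filter can be illustrated on a toy example. The token lists and the names `toy_vocab`, `min_count`, and `kept` below are made up for illustration; only the `Counter` pattern mirrors the notebook.

```python
from collections import Counter

# Count tokens across three already-cleaned toy "tweets"
toy_vocab = Counter()
for tokens in (['fire', 'forest'], ['fire', 'flood'], ['flood', 'fire', 'hat']):
    toy_vocab.update(tokens)

# Keep only the words that appear at least min_count times
min_count = 2
kept = [word for word, count in toy_vocab.items() if count >= min_count]
print(kept)  # ['fire', 'flood']
```

Words seen only once ("forest" and "hat" here) are dropped, which shrinks the vocabulary at the cost of discarding rare but possibly informative terms.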


In [23]:
# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # write the text; the with-block closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)

# save tokens to a vocabulary file
save_list(tokens, 'vocabulary.txt')

In [24]:
# load doc into memory
def load_doc(filename):
    # read all text; the with-block closes the file automatically
    with open(filename, 'r') as file:
        text = file.read()
    return text

In [25]:
# load the vocabulary
vocab_filename = 'vocabulary.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
print('Number of tokens in the vocabulary:', len(vocab))

Number of tokens in the vocabulary: 7535


## 2.d) Create Tokenizer and Encode the Input Text

In [26]:
# clean a sentence and return its in-vocabulary tokens as one line
def comment_to_line(sentence, vocab):
    # clean sentence
    tokens = clean_sentence(sentence)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    line = ' '.join(tokens)
    return line

In [27]:
# clean all comments in the dataframe and return them as a list of lines
def process_comments_to_lines(comments, vocab):
    lines = list()
    # walk through all comments in the dataframe
    for i in range(len(comments)):
        # load and clean the comments
        sentence = comments.iloc[i]
        line = comment_to_line(sentence, vocab)
        # add to list
        lines.append(line)
    return lines

In [28]:
# Load all training cases
training_docs = process_comments_to_lines(X_train_df, vocab)

In [29]:
# encode the training docs as padded integer sequences for the embedding layer
def encode_training_data(train_docs, maxlen):
    # create the tokenizer
    tokenizer = Tokenizer()
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    # encode training data set
    encoded_docs = tokenizer.texts_to_sequences(train_docs)
    # pad sequences
    train_encoded = pad_sequences(encoded_docs, maxlen=maxlen, padding='post')
    # define vocabulary size (largest integer value)
    vocab_size = len(tokenizer.word_index) + 1
    return train_encoded, vocab_size

In [30]:
# encode new docs as padded integer sequences using a tokenizer fitted on the training docs
def encode_test_data(train_docs, val_docs, maxlen):
    # create the tokenizer
    tokenizer = Tokenizer()
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    # encode validation data set
    encoded_docs = tokenizer.texts_to_sequences(val_docs)
    # pad sequences
    val_encoded = pad_sequences(encoded_docs, maxlen=maxlen, padding='post')
    return val_encoded
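
The two encoding functions above rely on Keras' `Tokenizer` and `pad_sequences`. The sketch below is a simplified, pure-Python imitation of that behavior (frequency-ranked word indexes, unknown words dropped, zero-padding at the end), written only to show what the encoded arrays look like. `fit_word_index` and `encode` are hypothetical helpers, not Keras functions, and the toy documents are made up.

```python
from collections import Counter

def fit_word_index(docs):
    """Map each word to an integer rank, 1 = most frequent (0 is reserved for padding)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(docs, word_index, maxlen):
    """Convert docs to fixed-length integer rows, dropping words not in word_index."""
    rows = []
    for doc in docs:
        seq = [word_index[w] for w in doc.lower().split() if w in word_index]
        seq = seq[-maxlen:]                            # too long: keep the last maxlen ids
        rows.append(seq + [0] * (maxlen - len(seq)))   # too short: pad with zeros at the end
    return rows

train_docs = ['forest fire near town', 'fire crews fight forest fire']
word_index = fit_word_index(train_docs)                # fitted on training docs only
maxlen = max(len(d.split()) for d in train_docs)

print(encode(train_docs, word_index, maxlen))          # [[2, 1, 3, 4, 0], [1, 5, 6, 2, 1]]
print(encode(['flood near forest'], word_index, maxlen))  # [[3, 2, 0, 0, 0]]
print(len(word_index) + 1)                             # vocabulary size: 7
```

Because the index is built from the training docs alone, a test word such as "flood" that never appeared in training simply disappears from the encoded row. The vocabulary size is one larger than the number of known words because index 0 is used for padding.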

In [31]:
# Get maximum doc length for padding sequences
max_length = max([len(s.split()) for s in training_docs])
print('The maximum document length:', max_length)

The maximum document length: 22


In [32]:
# encode the training dataset
X_train, vocabulary_size = encode_training_data(training_docs, max_length)
print('The shape of the encoded training dataset:', X_train.shape)

The shape of the encoded training dataset: (7613, 22)


In [33]:
y_train = y_train_df.values.ravel()
print('The shape of the encoded training classes:', y_train.shape)

The shape of the encoded training classes: (7613,)


In [34]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 2 - Load and Prepare Text Data completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 3 - Define and Train Models

In [35]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 3 - Define and Train Models has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [36]:
# Define the default numbers of input/output for modeling
num_outputs = 1

In [37]:
# Define the baseline model for benchmarking
def create_nn_model(n_inputs=max_length, n_outputs=num_outputs, n_vocab=vocabulary_size, n_vectors=DEFAULT_VECTOR_SPACE, n_pools=DEFAULT_POOL_SIZE, 
                    conv1_filters=DEFAULT_FILTERS, conv1_kernels=DEFAULT_KERNEL_SIZE, dense_nodes=50, opt_param=DEFAULT_OPTIMIZER, init_param=DEFAULT_INITIALIZER,
                    loss_param=DEFAULT_LOSS, metrics_param=DEFAULT_METRICS):
    nn_model = keras.Sequential([
        layers.Embedding(n_vocab, n_vectors, input_length=n_inputs),
        layers.Conv1D(filters=conv1_filters, kernel_size=conv1_kernels, activation='relu'),
        layers.MaxPooling1D(pool_size=n_pools),
        layers.Flatten(),
        layers.Dense(dense_nodes, activation='relu', kernel_initializer=init_param),
        layers.Dense(n_outputs, activation='sigmoid', kernel_initializer=init_param)
    ])
    nn_model.compile(loss=loss_param, optimizer=opt_param, metrics=metrics_param)
    return nn_model

In [38]:
# Initialize the default model and get a baseline result
startTimeModule = datetime.now()
results = list()
iteration = 0
cv = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_ITERATIONS, random_state=seedNum)
for train_ix, val_ix in cv.split(X_train):
    feature_train, feature_validation = X_train[train_ix], X_train[val_ix]
    target_train, target_validation = y_train[train_ix], y_train[val_ix]
    reset_random(seedNum)
    baseline_model = create_nn_model()
    baseline_model.fit(feature_train, target_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=0)
    model_metric = baseline_model.evaluate(feature_validation, target_validation, verbose=0)[1]
    iteration = iteration + 1
    print('Accuracy measurement from iteration %d >>> %.2f%%' % (iteration, model_metric*100))
    results.append(model_metric)
validation_score = np.mean(results)
# np.std returns the standard deviation (not the variance) of the fold scores
validation_std = np.std(results)
print('Average model accuracy from all validation iterations: %.2f%% (%.2f%%)' % (validation_score*100, validation_std*100))
print('Total time for model fitting and cross validating:', (datetime.now() - startTimeModule))

Accuracy measurement from iteration 1 >>> 71.70%
Accuracy measurement from iteration 2 >>> 72.36%
Accuracy measurement from iteration 3 >>> 72.03%
Accuracy measurement from iteration 4 >>> 72.86%
Accuracy measurement from iteration 5 >>> 73.78%
Average model accuracy from all validation iterations: 72.55% (0.73%)
Total time for model fitting and cross validating: 0:50:50.081950
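
As a quick check, the summary line can be reproduced from the five fold scores printed above. This is a small sketch using the standard library; the inputs are the rounded percentages from the output, so agreement to two decimals is expected but not guaranteed in general.

```python
from statistics import fmean, pstdev

# Fold accuracies (in percent) as printed by the cross-validation loop
fold_accuracy = [71.70, 72.36, 72.03, 72.86, 73.78]

mean_acc = fmean(fold_accuracy)
# pstdev is the population standard deviation, which matches np.std's default (ddof=0)
std_acc = pstdev(fold_accuracy)

print('%.2f%% (%.2f%%)' % (mean_acc, std_acc))  # 72.55% (0.73%)
```

The value in parentheses is therefore a standard deviation across folds. Using the sample standard deviation instead (`statistics.stdev`) would give about 0.81% for the same five scores.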


In [39]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 3 - Define and Train Models completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 4 - Evaluate and Optimize Models

In [40]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 4 - Evaluate and Optimize Models has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [41]:
# Not applicable for this iteration of modeling

In [42]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 4 - Evaluate and Optimize Models completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

# Task 5 - Finalize Model and Make Predictions

In [43]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 5 - Finalize Model and Make Predictions has begun on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

## 5.a) Set up and Train the Final Model

In [44]:
# Create the final model for evaluating the test dataset
reset_random(seedNum)
final_model = create_nn_model()
final_model.fit(X_train, y_train, epochs=EPOCH_NUMBER, batch_size=BATCH_SIZE, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fc7d8433eb0>

In [45]:
# Summarize the final model
final_model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, 22, 100)           576000    
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 15, 32)            25632     
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 7, 32)             0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 224)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 50)                11250     
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 51        
=================================================================
Total params: 612,933
Trainable params: 612,933
Non-trainable params: 0
_________________________________________________________________
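
The output shapes and parameter counts in the summary follow directly from the default settings in section 1.b. The sketch below redoes the arithmetic by hand. The vocabulary size of 5,760 is not printed anywhere in the notebook; it is inferred here from the 576,000 embedding parameters divided by the 100 embedding dimensions.

```python
vocab_size = 5760            # assumption: inferred from 576,000 / 100
seq_len, n_vectors = 22, 100
filters, kernel_size, pool_size, dense_nodes = 32, 8, 2, 50

embedding_params = vocab_size * n_vectors                   # one vector per word index
conv_len = seq_len - kernel_size + 1                        # no padding, stride 1
conv_params = kernel_size * n_vectors * filters + filters   # weights + biases
pooled_len = conv_len // pool_size                          # max pooling halves the length
flat_size = pooled_len * filters
dense_params = flat_size * dense_nodes + dense_nodes
output_params = dense_nodes * 1 + 1

total = embedding_params + conv_params + dense_params + output_params
print(conv_len, pooled_len, flat_size)   # 15 7 224
print(total)                             # 612933
```

Roughly 94% of the trainable parameters sit in the embedding layer. The inferred vocabulary size is also smaller than the 7,535 tokens saved to `vocabulary.txt`, presumably because the Keras `Tokenizer` lowercases text by default and so merges tokens that differ only in capitalization.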

## 5.b) Create Submission Files for Kaggle Evaluation

In [46]:
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-nlp-disaster-tweets/test.csv'
X_kaggle_data = pd.read_csv(dataset_path)

# Take a peek at the dataframe after import
print(X_kaggle_data.head(10))

   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan
5  12     NaN      NaN                 We're shaking...It's an earthquake
6  21     NaN      NaN  They'd probably still show more life than Arse...
7  22     NaN      NaN                                  Hey! How are you?
8  27     NaN      NaN                                   What a nice hat?
9  29     NaN      NaN                                          Fuck off!


In [47]:
# Set up the dataframe to capture predictions for Kaggle submission
y_submission_kaggle = pd.DataFrame(columns=['id', 'target'])
y_submission_kaggle['id'] = X_kaggle_data['id']
# X_kaggle_data.drop(columns=['ID'], inplace=True)
print(y_submission_kaggle.head())

   id target
0   0    NaN
1   2    NaN
2   3    NaN
3   9    NaN
4  11    NaN


In [48]:
# Dropping redundant features
X_kaggle_data.drop(columns=['id','keyword','location'], inplace=True)

# Take a peek at the dataframe after cleaning
X_test_df = X_kaggle_data.iloc[:,0]
print("X_test_df.shape: {}".format(X_test_df.shape))
print(X_test_df.head(10))

X_test_df.shape: (3263,)
0                   Just happened a terrible car crash
1    Heard about #earthquake is different cities, s...
2    there is a forest fire at spot pond, geese are...
3             Apocalypse lighting. #Spokane #wildfires
4        Typhoon Soudelor kills 28 in China and Taiwan
5                   We're shaking...It's an earthquake
6    They'd probably still show more life than Arse...
7                                    Hey! How are you?
8                                     What a nice hat?
9                                            Fuck off!
Name: text, dtype: object


In [49]:
# load all test/validation cases
test_docs = process_comments_to_lines(X_test_df, vocab)

In [50]:
# encode the test dataset using a tokenizer fitted on the training docs
X_test = encode_test_data(training_docs, test_docs, max_length)
print('The shape of the encoded test dataset:', X_test.shape)

The shape of the encoded test dataset: (3263, 22)


In [51]:
test_predictions = final_model.predict(X_test, batch_size=BATCH_SIZE, verbose=1)
# Convert the sigmoid outputs into 0/1 class labels with a 0.5 cut-off
predictions_kaggle = (test_predictions > 0.5).astype("int32").ravel()
y_submission_kaggle['target'] = predictions_kaggle
print("y_submission_kaggle.shape: {}".format(y_submission_kaggle.shape))
print(y_submission_kaggle.head())

y_submission_kaggle.shape: (3263, 2)
   id  target
0   0       1
1   2       1
2   3       1
3   9       0
4  11       1
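
The thresholding step that turns sigmoid outputs into class labels can be shown in isolation. This is a minimal pure-Python sketch; `to_labels` is a hypothetical helper and the probabilities are made-up values, not model output.

```python
def to_labels(probabilities, threshold=0.5):
    """Return 1 where the probability is strictly above the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]

print(to_labels([0.91, 0.50, 0.08, 0.73]))  # [1, 0, 0, 1]
```

Because the comparison is strict, a probability of exactly 0.5 maps to class 0, the same as in the NumPy expression used in the cell above.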


In [52]:
submission_file = y_submission_kaggle.to_csv(header=True, index=False)
filename = 'submission_' + datetime.now().strftime('%Y%m%d-%H%M') + '.csv'
with open(filename, 'w') as f:
    f.write(submission_file)
    print('Completed writing output file: ' + filename)

Completed writing output file: submission_20210209-2056.csv


In [53]:
if notifyStatus: status_notify('(TensorFlow Text Classification) Task 5 - Finalize Model and Make Predictions completed on ' + datetime.now().strftime('%A %B %d, %Y %I:%M:%S %p'))

In [54]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 1:06:19.767394
