<a class="anchor" id="0"></a>
# **Predicting Disaster Tweets** 


<img src=https://digitalfireflymarketing.com/wp-content/uploads/2013/10/pablo-17-2-768x384.png> 






<a class="anchor" id="0.1"></a>
# **Table of Contents** 

1. [Background](#1)
2. [The Data](#2)
3. [Model 1 - Traditional NLP (Bag of Words + Linear Model)](#3)
    - [Data Cleaning](#3a)
    - [Bag of Words Vectorizer](#3b)
    - [TF-IDF Vectorizer](#3c)
    - [Models](#3d)

4. [Model 2 - BERT Model](#4)
    - [Helper Functions](#4a)
    - [Data Preprocessing](#4b)
    - [Modelling](#4c)
5. [Conclusion](#5)

<a class="anchor" id="1"></a>
# 1. **Background**
[Table of Contents](#0.1)

Things tend to happen quickly at disaster scenes. Sometimes there is barely enough time to react, and the situation can quickly get out of hand. Getting the word out is therefore important, and Twitter is one of the ways word can spread quickly.

Thanks to the ubiquity of smartphones all over the world, it is easy to reach a large audience in the event of a disaster through platforms like Twitter. This helps draw attention to the situation on the ground and makes getting help much easier. However, a tweet may also report a disaster that is not real, and that can be rather unpleasant.

In this notebook, we will analyse tweets (about 10,000 of them) to predict the ones that are about real disasters and those that aren't.

Let's dive right in!

Oh, I'd appreciate feedback and comments. Thank you.

<a class="anchor" id="2"></a>
# 2. **The Data**
[Table of Contents](#0.1)

### Importing relevant libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

from bert import tokenization

# text processing libraries
import re
import string
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords

# sklearn libraries
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold, RandomizedSearchCV

# matplotlib and seaborn for plots
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

import warnings
# warnings.filterwarnings(action='ignore', category=FutureWarning)



### Loading and Preparing Data

In [2]:
# Training Data
train = pd.read_csv('train.csv')
print('Training data shape: ', train.shape)
train.head()

Training data shape:  (7613, 5)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [3]:
# Testing Data
test = pd.read_csv('test.csv')
print('Testing data shape: ', test.shape)
test.head()

Testing data shape:  (3263, 4)


| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


### Exploring the Data

In [4]:
# checking for missing values in the training set
train.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [5]:
# checking for missing values in the test set
test.isnull().sum()

id             0
keyword       26
location    1105
text           0
dtype: int64

In [6]:
# checking our target column
train['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

The dataset is reasonably balanced: 4,342 tweets (about 57%) are labelled 0 and 3,271 (about 43%) are labelled 1. About a third of the Location values are missing (2,533 of 7,613), compared with fewer than 1% of the Keyword values (61).
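
The proportions behind these observations can be checked with a few lines of arithmetic. This is a minimal sketch: the counts are copied from the outputs above, and the variable names are illustrative rather than part of the notebook.

```python
# Counts copied from the cell outputs above.
n_rows = 7613
class_counts = {0: 4342, 1: 3271}
missing_counts = {"keyword": 61, "location": 2533}

# Share of each class in the training set.
class_share = {label: count / n_rows for label, count in class_counts.items()}

# Share of rows with a missing value, per column.
missing_share = {col: count / n_rows for col, count in missing_counts.items()}

for label, share in class_share.items():
    print(f"target={label}: {share:.1%}")
for col, share in missing_share.items():
    print(f"{col} missing: {share:.1%}")
```

With pandas, `train['target'].value_counts(normalize=True)` and `train.isnull().mean()` should give the same proportions directly.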

<a class="anchor" id="3"></a>
# 3. **Model 1 - Traditional NLP (Bag of Words + Linear Model)**
[Table of Contents](#0.1)

In [7]:
# taking copies of the train and test data so that we won't  have to re-read the data for the
# implementation of the BERT model
train_1 = train.copy()
test_1 = test.copy()

<a class=anchor id=3a></a>
#### 3a. Data Cleaning

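Here, we tokenize our text and clean it up by converting all characters to lower case and removing text in square brackets, URLs, HTML tags, punctuation, words containing numbers, etc. We also remove emojis and common stop words. This matters for the Bag of Words + Linear model because every distinct string becomes its own feature, so uncleaned variants of the same word would be counted separately.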

In [8]:
# Applying a first round of text cleaning techniques

def clean_text(text):
    '''Make text lowercase and remove text in square brackets, URLs, HTML tags, punctuation,
    newlines, words containing numbers, and curly quotes/ellipses'''
    text = text.lower() #make text lower case
    text = re.sub(r'\[.*?\]', '', text) #remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text) #remove URLs
    text = re.sub(r'<.*?>+', '', text) #remove html tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #remove punctuation
    text = re.sub(r'\n', '', text) #remove newlines
    text = re.sub(r'\w*\d\w*', '', text) #remove words containing numbers
    text = re.sub('[‘’“”…]', '', text) #remove curly quotes and ellipses
    
    return text
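
To see what each substitution contributes, the same patterns can be applied one at a time to a made-up tweet. This is only an illustrative trace: the sample text and the `steps` list are assumptions, not part of the notebook. Note that because punctuation is stripped before digit-bearing words, `13,000` first becomes `13000` and is then removed entirely.

```python
import re
import string

# Made-up tweet for illustration only.
sample = "Forest fire near La Ronge [photo] https://t.co/abc123 <b>NOW</b>!! 13,000 evacuated"

# (description, pattern) pairs, applied in the same order as in clean_text.
steps = [
    ("text in square brackets", r"\[.*?\]"),
    ("URLs", r"https?://\S+|www\.\S+"),
    ("HTML tags", r"<.*?>+"),
    ("punctuation", "[%s]" % re.escape(string.punctuation)),
    ("newlines", r"\n"),
    ("words containing digits", r"\w*\d\w*"),
    ("curly quotes and ellipses", "[‘’“”…]"),
]

text = sample.lower()
for description, pattern in steps:
    text = re.sub(pattern, "", text)
    print(f"after removing {description}: {text!r}")

# Splitting on whitespace shows the words that survive the cleaning.
cleaned_tokens = text.split()
print(cleaned_tokens)
```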
    

In [9]:
# Defining emoji removal function
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters (wide range; also spans CJK and other scripts)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
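
One thing worth knowing about this pattern is that its last range, `\U000024C2-\U0001F251`, is very wide and also covers CJK ideographs, so non-emoji characters can be removed too. The sketch below compares the ranges used above with a hypothetical narrower variant that simply drops that last range; the narrower pattern and the sample string are assumptions for illustration, not a recommendation from the notebook.

```python
import re

# Same ranges as in remove_emoji above.
broad = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)

# Hypothetical narrower variant: identical, minus the last wide range.
narrow = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "]+"
)

# A fire emoji (U+1F525) followed by two kanji characters.
sample = "fire \U0001F525 火事"

broad_result = broad.sub("", sample)
narrow_result = narrow.sub("", sample)
print(repr(broad_result))   # the kanji are removed along with the emoji
print(repr(narrow_result))  # only the emoji is removed
```

For an English-only dataset the difference rarely matters, but it is worth checking if the tweets contain other scripts.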

In [10]:
# Applying emoji removal function to the training and test datasets
train_1['text'] = train_1['text'].apply(remove_emoji)
test_1['text'] = test_1['text'].apply(remove_emoji)

In [11]:
# defining a text preprocessing function
def text_preprocessing(text):
    """Clean the text, tokenize it, and remove English stop words"""
    
    tokenizer_reg = nltk.tokenize.RegexpTokenizer(r'\w+')
    # load the stop words once per call, as a set for fast membership tests
    stop_words = set(stopwords.words('english'))
    
    nopunc = clean_text(text)
    tokenized_text = tokenizer_reg.tokenize(nopunc)
    
    # remove stopwords
    remove_stopwords = [w for w in tokenized_text if w not in stop_words]
    
    combined_text = ' '.join(remove_stopwords)
    return combined_text
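
The tokenize-then-filter step can be tried without downloading the NLTK corpus. The sketch below is an approximation under stated assumptions: `STOP_WORDS` is a small hand-picked stand-in for NLTK's English list, `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, and `remove_stop_words` is a hypothetical helper name.

```python
import re

# Small hand-picked stand-in for NLTK's English stop-word list (assumption).
STOP_WORDS = {"our", "are", "the", "of", "this", "all"}

def remove_stop_words(text):
    """Tokenize on runs of word characters and drop stop words."""
    tokens = re.findall(r"\w+", text)
    return " ".join(token for token in tokens if token not in STOP_WORDS)

result = remove_stop_words(
    "our deeds are the reason of this earthquake may allah forgive us all"
)
print(result)
```

Keeping the stop words in a set means each membership test is a constant-time lookup rather than a scan through a list.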

In [12]:
# Applying the cleaning function to both test and training datasets
train_1['text'] = train_1['text'].apply(text_preprocessing)
test_1['text'] = test_1['text'].apply(text_preprocessing)

In [13]:
# Examining the updated train data
train_1['text'].head()

0         deeds reason earthquake may allah forgive us
1                forest fire near la ronge sask canada
2    residents asked shelter place notified officer...
3    people receive wildfires evacuation orders cal...
4    got sent photo ruby alaska smoke wildfires pou...
Name: text, dtype: object

It looks good so far. We have managed to rid our text of punctuation, emojis, stop words, etc.

<a class=anchor id=3b></a>
#### 3b. Bag of Words Vectorizer - BoW

Here we use unigrams only and, with `min_df = 1`, add every word that appears in at least one tweet to the vocabulary.

In [14]:
count_vectorizer = CountVectorizer(ngram_range = (1,1), min_df = 1)
train_vectors = count_vectorizer.fit_transform(train_1['text'])
test_vectors = count_vectorizer.transform(test_1['text'])

In [15]:
train_vectors.shape

(7613, 16412)
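
The shape above means one row per tweet and one column per vocabulary word. As a rough illustration of what a bag-of-words matrix contains, here is a pure-Python sketch on three made-up documents. It is a simplification of what `CountVectorizer` does (no token pattern, no sparse storage), and all names in it are illustrative.

```python
from collections import Counter

# Three made-up, already-cleaned documents.
docs = [
    "forest fire near la ronge",
    "fire fire evacuation",
    "residents asked shelter",
]

# Vocabulary: every distinct word, sorted to give a stable column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def to_counts(doc):
    """Return one row of the matrix: how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
print("shape:", (len(matrix), len(vocabulary)))
```

Word order is discarded: the second document is represented only by the facts that "fire" occurs twice and "evacuation" once.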

<a class=anchor id=3c></a>
#### 3c. TF-IDF Vectorizer

Here we use unigrams and bigrams. We ignore terms that appear in fewer than two tweets (`min_df = 2`) or in more than 50% of tweets (`max_df = 0.5`).

In [16]:
tfidf = TfidfVectorizer(ngram_range = (1,2), min_df = 2, max_df = 0.5)
train_tfidf = tfidf.fit_transform(train_1['text'])
test_tfidf = tfidf.transform(test_1['text'])

In [17]:
train_tfidf.shape

(7613, 11077)
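
The vocabulary is smaller than in the Bag of Words case, even with bigrams added, because of the document-frequency filters. The sketch below reproduces only that filtering step in pure Python on four made-up documents. It does not compute TF-IDF weights, and the helper names are illustrative.

```python
from collections import Counter

# Four made-up, already-cleaned documents.
docs = [
    "fire near forest",
    "fire evacuation ordered",
    "fire smoke near school",
    "fire evacuation",
]

def unigrams_and_bigrams(doc):
    """Return the words of doc plus every pair of adjacent words."""
    words = doc.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

# Document frequency: the number of documents each term appears in.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(unigrams_and_bigrams(doc)))

min_df = 2    # absolute count: drop terms seen in fewer than 2 documents
max_df = 0.5  # proportion: drop terms seen in more than 50% of documents
max_count = max_df * len(docs)

kept_terms = sorted(
    term for term, df in doc_freq.items() if min_df <= df <= max_count
)
print(kept_terms)
```

Here "fire" is dropped because it occurs in every document, the one-off terms are dropped by `min_df`, and the bigram "fire evacuation" survives because it occurs in exactly two documents.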

<a class=anchor id=3d></a>
#### 3d. Models

Next, we fit Logistic Regression and Multinomial Naive Bayes models with BoW and TF-IDF, so we have a total of four models

In [18]:
# Fitting a simple Logistic Regression on BoW
logreg_bow = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(logreg_bow, train_vectors, train['target'], cv=5, scoring='f1')
scores.mean()

0.5834476966398702

In [19]:
# Fitting a simple Logistic Regression on TF-IDF
logreg_tfidf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(logreg_tfidf, train_tfidf, train['target'], cv=5, scoring='f1')
scores.mean()

0.5451346913743611

In [20]:
# Fitting a Naive Bayes on BoW
NB_bow = MultinomialNB()
scores = model_selection.cross_val_score(NB_bow, train_vectors, train['target'], cv=5, scoring='f1')
scores.mean()

0.6584930948850116

In [21]:
# Fitting a Naive Bayes on TFIDF
NB_tfidf = MultinomialNB()
scores = model_selection.cross_val_score(NB_tfidf, train_tfidf, train['target'], cv=5, scoring='f1')
scores.mean()

0.6187711183101462

The best performance comes from Multinomial Naive Bayes on the Bag of Words vectors, with a mean 5-fold cross-validated F1 score of 0.6585.
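
Since all four scores above use `scoring='f1'`, it helps to recall what that metric measures: the harmonic mean of precision and recall for the positive class (`target = 1`). The sketch below computes it by hand on made-up labels; the `f1` helper and the example labels are illustrative, and in practice `sklearn.metrics.f1_score` does this for you.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: the classifier finds only one of the three positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

score = f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("F1:", score)
print("accuracy:", accuracy)
```

In this example accuracy looks better than F1 because the missed positives are outweighed by the many correctly predicted negatives, which is why F1 is the more demanding score here.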

In [22]:
NB_bow.fit(train_vectors, train['target'])

MultinomialNB()

<a class="anchor" id="4"></a>
# 4. **Model 2 - BERT Model**
[Table of Contents](#0.1)

<a class=anchor id=4a></a>
#### 4a. Helper Functions

First, we define some functions we'll need down the road

In [23]:
# First, we define an encoding function which takes the text column, the tokenizer, and the
# maximum length of text string as input

# The outputs are Tokens, pad masks, and segment ids

def bert_encode(texts, tokenizer, max_len = 512):
    # note: BERT models support a maximum sequence length of 512 tokens
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
        
        # max_len is reduced by 2 to accommodate the [CLS] and [SEP] tokens added at the
        # start and end of the input sequence
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        # pad_len covers cases where the input sequence is shorter than max_len
        pad_len = max_len - len(input_sequence)
        
        # convert the tokens to ids
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        # pad with 0s up to max_len
        tokens += [0] * pad_len
        # create pad masks with 1s for real tokens and 0s for padding
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # create segment ids of length max_len (all 0s, since each input is a single sentence)
        segment_ids = [0] * max_len
        
        # append the tokens, pad_masks, and segment_ids to their respective lists
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
        
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
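
To make the padding and truncation logic concrete, here is a small sketch that runs the same steps on a single text. `ToyTokenizer` is a hypothetical whitespace tokenizer with a made-up seven-entry vocabulary, standing in for the real WordPiece tokenizer, so the ids are not real BERT ids. `encode_one` is an illustrative single-text version of the function above.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for the BERT tokenizer."""

    def __init__(self):
        # Made-up vocabulary; real BERT vocabularies use different ids.
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
                      "forest": 4, "fire": 5, "near": 6}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(token, self.vocab["[UNK]"]) for token in tokens]


def encode_one(text, tokenizer, max_len):
    """Encode one text into token ids, a pad mask, and segment ids."""
    tokens = tokenizer.tokenize(text)[: max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len
    segments = [0] * max_len
    return ids, mask, segments


tokenizer = ToyTokenizer()

# Short input: padded with 0s up to max_len.
ids, mask, segments = encode_one("Forest fire near town", tokenizer, max_len=8)
print(ids, mask, segments)

# Long input: truncated so that [CLS] + tokens + [SEP] fits in max_len.
long_ids, long_mask, _ = encode_one(
    "forest fire near forest fire near", tokenizer, max_len=5
)
print(long_ids, long_mask)
```

Every encoded text ends up with exactly `max_len` entries, which is what lets the results be stacked into rectangular arrays.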
        

In [24]:
# Next, we define a function to build and compile the model

def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name='input_word_ids')
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name='input_mask')
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name='segment_ids')
   
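    # the layer returns (pooled_output, sequence_output); we only use the latter
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # take the hidden state of the first token ([CLS]) as the representation of the whole tweet
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation = 'sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    
    # `learning_rate` replaces the deprecated `lr` argument
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])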
    
    return model
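
The classification head is small: it slices out the first token's vector from each sequence and passes it through a single sigmoid unit. The sketch below mimics that with plain Python lists. The numbers, `weights`, and `bias` are made up for illustration and do not come from any trained model.

```python
import math

# Toy "sequence_output": 2 sequences, 3 tokens each, hidden size 2.
sequence_output = [
    [[0.5, -1.0], [0.1, 0.2], [0.0, 0.0]],
    [[-2.0, 0.5], [0.3, 0.3], [0.9, 0.1]],
]

# Hypothetical parameters of the Dense(1) layer.
weights = [1.0, 2.0]
bias = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Equivalent of sequence_output[:, 0, :]: the first token's vector of each sequence.
cls_vectors = [sequence[0] for sequence in sequence_output]

# Dense(1, activation='sigmoid'): weighted sum plus bias, squashed into (0, 1).
probabilities = [
    sigmoid(sum(w * h for w, h in zip(weights, vector)) + bias)
    for vector in cls_vectors
]
print(cls_vectors)
print(probabilities)
```

Each output is a single number between 0 and 1, which pairs naturally with the `binary_crossentropy` loss used when compiling the model.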

<a class=anchor id=4b></a>
#### 4b. Data Preprocessing and Modelling


In [25]:
# First we download the BERT architecture
# We use BERT-Large uncased: 24-layer, 1024-hidden_nodes, 16-attention-heads, 340M parameters

module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4"

bert_layer = hub.KerasLayer(module_url, trainable=True)

In [26]:
# path to the vocabulary file bundled with the BERT layer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
# Specify lower case
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
# instantiate tokenizer
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [27]:
train_2 = train.copy()
test_2 = test.copy()

In [35]:
train_input = bert_encode(train_2.text.values, tokenizer, max_len=160)
test_input = bert_encode(test_2.text.values, tokenizer, max_len=160)
train_labels = train_2.target.values

In [None]:
model = build_model(bert_layer, max_len=160)
model.summary()