The following notebook shows a simple Natural Language Processing (NLP) classification task: deciding whether Tweets that appear to be about natural disasters describe real events or are unrelated.  For details about the classification as a machine learning example, the source datasets, or other similar examples, see:

https://www.kaggle.com/c/nlp-getting-started

Exploratory data analysis (EDA) was largely done before starting this notebook, which focuses on tokenization and data preparation, followed by model setup, training and predictions.  The Keras Sequential model used here is set up with the most basic parameters; no attempt has been made to tune them or to compare alternative classification approaches.  Even so, validation accuracy for this first attempt, measured on the last 100 rows of the training data, was over 92%.

Note that, besides the source data, a Bing Maps API key is required (it is read here from a local .env file).

In [1]:
import pandas as pd
import numpy as np
from glob import glob
from collections import Counter
import os
import re
import string
import nltk
import base64
from dotenv import load_dotenv
import geocoder

from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import losses, metrics


In [2]:
load_dotenv()
BING_MAP_KEY = base64.b64decode(os.getenv('BING_MAP_KEY')).decode('ascii')
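
The cell above assumes the value stored in the .env file is base64-encoded. As a minimal sketch of that round trip, the helpers `encode_key` and `read_key` and the variable name `DEMO_MAP_KEY` below are hypothetical and not part of the notebook; the key is a placeholder. The explicit check is there because `base64.b64decode(None)` would otherwise fail with a less helpful `TypeError` when the variable is missing.

```python
import base64
import os

def encode_key(raw_key: str) -> str:
    # Produce the base64 text that would be pasted into the .env file
    return base64.b64encode(raw_key.encode('ascii')).decode('ascii')

def read_key(name: str) -> str:
    # Read and decode the key, failing early if the variable is absent
    encoded = os.getenv(name)
    if encoded is None:
        raise RuntimeError(f'{name} is not set; add it to the .env file')
    return base64.b64decode(encoded).decode('ascii')

# Placeholder value for illustration only
os.environ['DEMO_MAP_KEY'] = encode_key('not-a-real-key')
print(read_key('DEMO_MAP_KEY'))
```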

In [43]:
#Show local tables to load into memory
print(glob('*.csv'))

['sample_submission.csv', 'test.csv', 'test_data_with_locations.csv', 'train.csv', 'train_data_with_locations.csv']


In [4]:
#Load and check training data
df_train = pd.read_csv('train.csv')
df_train.iloc[0]

id                                                          1
keyword                                                   NaN
location                                                  NaN
text        Our Deeds are the Reason of this #earthquake M...
target                                                      1
Name: 0, dtype: object

In [5]:
#Show most common locations for Tweets
df_train.location.value_counts()[:20]

USA                104
New York            71
United States       50
London              45
Canada              29
Nigeria             28
UK                  27
Los Angeles, CA     26
India               24
Mumbai              22
Washington, DC      21
Kenya               20
Worldwide           19
Chicago, IL         18
Australia           18
California          17
California, USA     15
New York, NY        15
Everywhere          15
Florida             14
Name: location, dtype: int64

In [8]:
#Clean locations by geocoding and replacing each one with just the country name
#The Bing Maps API is a good geocoding tool; Google also works well
count = 1
def relocate(x):
    global count
    print(f'Processed line {count}...', end='\r')
    count += 1
    if pd.isnull(x['location']):
        return 'EMPTY'
    g = geocoder.bing(x['location'], key=BING_MAP_KEY)
    #g.json is None when the request fails; the 'country' field can also be absent
    if g.json is None:
        return 'EMPTY'
    try:
        return g.json['country']
    except KeyError:
        return 'EMPTY'

df_train['country'] = df_train.apply(relocate, axis=1)

Processed line 6772...

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=++&o=json&inclnb=1&key=[REDACTED]&maxResults=1


Processed line 7613...
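
Many rows share the same location string ("USA", "New York", and so on), so the number of API calls can be cut by geocoding each distinct value once. The 400 response above also came from a query containing only whitespace (`q=++`), which can be skipped before calling the service. The sketch below illustrates both ideas with a hypothetical helper `geocode_unique` and a stand-in `fake_lookup` function; neither is part of the notebook, and a real run would pass a function that wraps `geocoder.bing`. It assumes missing values are `None`; with pandas `NaN` values, `pd.isnull` would be the check to use.

```python
from typing import Callable, Dict, Iterable, List, Optional

def geocode_unique(locations: Iterable[Optional[str]],
                   lookup: Callable[[str], Optional[str]],
                   missing: str = 'EMPTY') -> List[str]:
    # Cache results so each distinct location string is looked up only once
    cache: Dict[str, str] = {}
    results = []
    for loc in locations:
        # Skip missing and whitespace-only values without calling the service
        if loc is None or not str(loc).strip():
            results.append(missing)
            continue
        if loc not in cache:
            cache[loc] = lookup(loc) or missing
        results.append(cache[loc])
    return results

# Stand-in for the real geocoder, recording how often it is called
calls = []
def fake_lookup(loc):
    calls.append(loc)
    return {'New York': 'United States', 'London': 'United Kingdom'}.get(loc)

countries = geocode_unique(
    ['New York', None, 'London', 'New York', '  ', 'Everywhere'], fake_lookup)
print(countries, len(calls))
```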

In [6]:
#Clean text that is very messy - return base forms of words using Treebank tokens and the WordNet lemmatizer
def clean_text(text):
    '''Make text lowercase, remove links, remove punctuation and other
    non-alphanumeric characters, and remove words containing numbers.'''
    text = text.lower()
    #Remove links first: stripping ':' and '/' beforehand stops the URL pattern from matching
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[^a-zA-Z0-9 \n.]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    #Replace newlines with a space so that words on adjacent lines are not joined
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(text)
    lemmatizer = nltk.stem.WordNetLemmatizer()
    text = " ".join(lemmatizer.lemmatize(token) for token in tokens)
    return text
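
In a cleaner like `clean_text`, the order of the substitutions matters: a URL pattern such as `https?://\S+` can only match while the `:` and `/` characters are still present. The sketch below uses only the standard library and a made-up tweet to compare the two orders; `strip_punctuation` and `strip_urls` are illustrative helpers, not functions from the notebook.

```python
import re
import string

tweet = 'Forest fire near La Ronge http://t.co/abc123 #wildfire'

def strip_punctuation(text):
    # Remove every ASCII punctuation character
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

def strip_urls(text):
    # Remove http(s) links and bare www. addresses
    return re.sub(r'https?://\S+|www\.\S+', '', text)

# Punctuation first: the link is mangled into a word and survives
punct_first = strip_urls(strip_punctuation(tweet.lower()))
# Links first: the link is removed before punctuation is stripped
urls_first = strip_punctuation(strip_urls(tweet.lower()))

print(punct_first)
print(urls_first)
```

In the notebook's pipeline the later "words containing numbers" step would drop a mangled link like this one because it contains digits, but a shortened link made only of letters would remain as a spurious token.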

In [11]:
df_train.country.unique()

array(['EMPTY', 'United Kingdom', 'United States', 'South Africa',
       'Hong Kong-China', 'Philippines', 'Canada', 'India', 'Barbados',
       'Nigeria', 'Brazil', 'Australia', 'Germany', 'Kenya', 'Russia',
       'Maldives', 'Switzerland', 'New Caledonia', 'Belgium', 'Indonesia',
       'Belarus', 'Sri Lanka', 'France', 'Israel', 'Slovenia', 'Italy',
       'Netherlands', 'Pakistan', 'Malaysia', 'Turkey', 'Spain',
       'Argentina', 'Japan', 'Poland', 'Finland', 'Tuvalu', 'Cyprus',
       'Mexico', 'Singapore', 'South Sudan', 'Burundi', 'Ireland',
       'United Arab Emirates', 'West Bank', 'Cameroon', 'Mauritius',
       'Norway', 'Latvia', 'Hungary', 'Peru', 'Belize', 'Austria',
       'Trinidad and Tobago', 'Egypt', 'Ukraine', 'New Zealand', 'Greece',
       'Sierra Leone', 'North Macedonia', 'Denmark', 'South Korea',
       'Sweden', 'Iraq', 'Puerto Rico', 'Afghanistan', 'Saudi Arabia',
       'Isle of Man', 'Golan Heights', 'Venezuela', 'Georgia', 'Colombia',
       'Jamaica'

In [None]:
#OPTIONAL - write cleaned data to local file. Geocoding takes some time
df_train.to_csv('train_data_with_locations.csv', index=False, header=True)

In [9]:
df_train['location'] = df_train['country']
df_train.drop(labels='country', axis=1, inplace=True)
df_train['text'] = df_train['text'].apply(clean_text)
#Assign the filled column back rather than relying on fillna(inplace=True) through attribute access
df_train['keyword'] = df_train['keyword'].fillna('')
df_train['keyword'] = df_train['keyword'].apply(clean_text)

In [10]:
#Check cleaned text data
for i in df_train.iloc[:10]['text'].values:
    print(i)

our deed are the reason of this earthquake may allah forgive u all
forest fire near la ronge sask canada
all resident asked to shelter in place are being notified by officer no other evacuation or shelter in place order are expected
people receive wildfire evacuation order in california
just got sent this photo from ruby alaska a smoke from wildfire pours into a school
rockyfire update california hwy closed in both direction due to lake county fire cafire wildfire
flood disaster heavy rain cause flash flooding of street in manitou colorado spring area
im on top of the hill and i can see a fire in the wood
there an emergency evacuation happening now in the building across the street
im afraid that the tornado is coming to our area


In [11]:
#Get word index for training data - ALL WORDS USED as returned by tokenizer
vals = list(Counter([i for i in ' '.join(df_train.text.values.tolist()).split(' ')]).keys())

In [12]:
#As an option, take the top n most used words
counter = Counter(' '.join(df_train.text.values.tolist()).split(' '))
top_words = [word for word, _ in counter.most_common(20000)]

In [13]:
#vals = top_words
#word_index maps index -> word; values maps word -> index
word_index = dict(enumerate(vals))
values = {word: i for i, word in enumerate(vals)}
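
As a small illustration of the vocabulary step, the sketch below builds both lookup directions from three made-up sentences and encodes each sentence as a list of word indices. The names `docs`, `word_to_id` and `id_to_word` are chosen for this example only; they play the roles of `values` and `word_index` in the notebook. Words outside the (deliberately tiny) vocabulary are skipped, which is also how the notebook treats unseen words.

```python
from collections import Counter

docs = ['forest fire near la ronge', 'fire in the wood', 'flood in the street']

# Count every word across all documents
counter = Counter(word for doc in docs for word in doc.split())

# Keep only the 5 most frequent words as the vocabulary
vocab = [word for word, _ in counter.most_common(5)]
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

# Encode each document, dropping words that are not in the vocabulary
encoded = [[word_to_id[w] for w in doc.split() if w in word_to_id] for doc in docs]
print(vocab)
print(encoded)
```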

In [34]:
def vectorize_sequences(sequences, dimension=50000):
    '''Multi-hot encode lists of word indices; dimension must exceed the largest index.'''
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
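
To make the encoding concrete, here is the same idea on a toy input, written as a separate function `multi_hot` (a name chosen for this sketch) with a small `dimension`. Each row gets a 1 in every column whose index appears in the sequence, so repeated words and word order are not preserved.

```python
import numpy as np

def multi_hot(sequences, dimension):
    # One row per sequence, one column per vocabulary index
    results = np.zeros((len(sequences), dimension), dtype='float32')
    for row, sequence in enumerate(sequences):
        # Fancy indexing sets all listed columns at once; duplicates collapse to a single 1
        results[row, sequence] = 1.0
    return results

demo = multi_hot([[0, 2, 2], [1, 3]], dimension=4)
print(demo)
```

With the default float64 dtype, a 7613 x 50000 matrix takes roughly 3 GB of memory (7613 * 50000 * 8 bytes); using float32 as in this sketch, or a dimension closer to the actual vocabulary size, reduces that.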

In [15]:
df_train.columns

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

In [16]:
#Get lines of text and convert to list of word keys
texts = df_train.text.values
sequences = []
for line in texts:
    #Look words up in the values dict (constant time) rather than scanning the vals list
    test_line = [values[w] for w in line.split(' ') if w in values]
    sequences.append(test_line)
print(len(sequences), sequences[0])

7613 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [35]:
#Convert word keys for main text of tweet to categorical for training purposes
first_part_train = vectorize_sequences(sequences)

In [36]:
#Perform tokenization for locations, and convert to binary/categorical arrays
locvals = list(Counter(df_train.location.values.tolist()).keys())
#Index 0 is reserved for missing values, so known locations start at 1
loc_dict = {loc: i + 1 for i, loc in enumerate(locvals)}
locations = []
for val in df_train.location.values:
    if pd.isnull(val):
        locations.append(0)
    else:
        locations.append(loc_dict[val])
locations = to_categorical(np.array(locations), dtype='int32')
print(locations.shape)

(7613, 143)


In [37]:
#Perform same categorization for keywords as for location values
keyvals = list(Counter(df_train.keyword.values.tolist()).keys())
#Index 0 is reserved for missing values, so known keywords start at 1
key_dict = {key: i + 1 for i, key in enumerate(keyvals)}
keywords = []
for val in df_train.keyword.values:
    if pd.isnull(val):
        keywords.append(0)
    else:
        keywords.append(key_dict[val])
keywords = to_categorical(np.array(keywords), dtype='int32')
print(keywords.shape)

(7613, 179)
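
When `to_categorical` is called without `num_classes`, the number of columns is inferred from the largest index present, so two datasets can end up with different widths. The plain NumPy sketch below shows the effect of fixing the width explicitly; `one_hot` is a hypothetical stand-in written for this example, and the integer codes are made up.

```python
import numpy as np

def one_hot(indices, num_classes):
    # Build a (n_samples, num_classes) matrix with a single 1 per row
    indices = np.asarray(indices, dtype=int)
    out = np.zeros((indices.size, num_classes), dtype='int32')
    out[np.arange(indices.size), indices] = 1
    return out

train_codes = [1, 2, 3, 1]   # made-up category indices, 0 reserved for missing
test_codes = [1, 2]          # the highest training index does not occur here

# Derive the width once from the training data and reuse it everywhere
num_classes = max(train_codes) + 1
train_matrix = one_hot(train_codes, num_classes)
test_matrix = one_hot(test_codes, num_classes)
print(train_matrix.shape, test_matrix.shape)
```

Passing the same `num_classes` to `to_categorical` for both training and test data gives the same guarantee.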


In [38]:
x_train = np.hstack((first_part_train, locations, keywords))

In [39]:
y_train = df_train.target.values

In [40]:
#Develop model to train based on prepared input data

model = models.Sequential()
model.add(layers.Dense(16, activation='tanh', input_shape=(x_train.shape[1],)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
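
For a sense of scale, the number of trainable weights in this stack of Dense layers can be worked out by hand: each layer has `inputs * units` weights plus `units` biases. The sketch below assumes the input width is 50000 + 143 + 179, taken from the `vectorize_sequences` default and the two shapes printed above; `dense_params` is a helper written for this example.

```python
def dense_params(n_inputs, n_units):
    # Weight matrix plus one bias per unit
    return n_inputs * n_units + n_units

# Assumed input width: text columns + location columns + keyword columns
n_features = 50000 + 143 + 179
layer_sizes = [16, 16, 1]

total = 0
fan_in = n_features
for units in layer_sizes:
    total += dense_params(fan_in, units)
    fan_in = units  # the next layer takes this layer's outputs as inputs

print(n_features, total)
```

Nearly all of the parameters sit in the first layer, which is why the input width dominates the model size. The same total should appear in `model.summary()`.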

In [41]:

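#Alternative optimizer (note that the lr argument is deprecated in favour of learning_rate):
#model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])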


model.compile(optimizer=optimizers.Nadam(),
            loss=losses.binary_crossentropy, metrics=[metrics.binary_accuracy])

In [42]:
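#Hold out the last 100 rows for validation
x_val = x_train[-100:]
partial_x_train = x_train[:-100]
y_val = y_train[-100:]
partial_y_train = y_train[:-100]
#Recompiling here replaces the settings from the previous cell
model.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['acc'])

#validation_data is passed as an (inputs, targets) tuple
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=50, validation_data=(x_val, y_val))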

Train on 7513 samples, validate on 100 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
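
The validation set above is the last 100 rows in file order. If the file is ordered in some systematic way (an assumption worth checking for this dataset), those rows may not be representative. A minimal sketch of a shuffled hold-out split in NumPy follows; `shuffled_split` and the toy arrays are hypothetical and only illustrate the indexing.

```python
import numpy as np

def shuffled_split(x, y, n_val, seed=0):
    # Shuffle row indices once, then carve off the first n_val for validation
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Toy data: row i is [2i, 2i + 1] with label i % 2
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

x_tr, y_tr, x_va, y_va = shuffled_split(x_demo, y_demo, n_val=3)
print(x_tr.shape, x_va.shape)
```

Fixing the seed keeps the split reproducible between runs, so validation scores from different experiments stay comparable.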


In [174]:
#As an optional step, save trained model for reuse in the future
model.save('model3.h5')

In [175]:
test_df = pd.read_csv('test.csv')

In [176]:
"""
Apply the same text tokens, as well as the keyword and location tokens, from the training dataframe.
Tokens are not rebuilt, so that the input matches the shape the model was trained on.
"""

count = 1  #reset the progress counter used by relocate
test_df['country'] = test_df.apply(relocate, axis=1)
test_df['location'] = test_df['country']
test_df.drop(labels='country', axis=1, inplace=True)
test_df['text'] = test_df['text'].apply(clean_text)
test_df['keyword'] = test_df['keyword'].fillna('')
test_df['keyword'] = test_df['keyword'].apply(clean_text)

texts = test_df.text.values
sequences = []
for line in texts:
    #Words that were not seen in training are skipped
    test_line = [values[w] for w in str(line).split(' ') if w in values]
    sequences.append(test_line)
#Convert word keys for main text of tweet to categorical (this time for test purposes)
first_part_test = vectorize_sequences(sequences)
#Now to get location values (if matching ones exist in training dataset)
locations = []
for val in test_df.location.values:
    if pd.isnull(val):
        locations.append(0)
    else:
        #Unseen locations fall back to the last training index, which they share with a known location
        locations.append(loc_dict.get(val, len(loc_dict)))
#Fix the width explicitly so it matches training even when the highest index is absent
locations = to_categorical(np.array(locations), num_classes=len(loc_dict) + 1, dtype='int32')
print(locations.shape)
#And finally, get keywords (again, only if matching keywords were present in training dataset)
keywords = []
for val in test_df.keyword.values:
    if pd.isnull(val):
        keywords.append(0)
    else:
        #Unseen keywords use the same fallback as unseen locations
        keywords.append(key_dict.get(val, len(key_dict)))
keywords = to_categorical(np.array(keywords), num_classes=len(key_dict) + 1, dtype='int32')
print(keywords.shape)

#As before, merge text tokens, location and keyword category values 
x_test = np.hstack((first_part_test, locations, keywords))

(3263, 143)
(3263, 179)


In [177]:
#With the tensors carefully prepared, run predictions. Perform additional check to make sure shape matches.
if x_test.shape[1] != x_train.shape[1]:
    print('Error in processing - input matrix needs same number of columns as training data')
else:
    output_targets = model.predict(x_test, verbose=1)



In [178]:
samples = pd.read_csv('sample_submission.csv')
samples.head()

   id  target
0   0       0
1   2       0
2   3       0
3   9       0
4  11       0


In [179]:
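#Sigmoid outputs are probabilities, so threshold at 0.5; int() alone truncates almost every value to 0
output_aslist = (output_targets.ravel() >= 0.5).astype(int).tolist()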
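
Turning sigmoid outputs into class labels needs a threshold rather than an integer cast, because `int()` truncates toward zero and maps every probability below 1.0 to 0. The sketch below compares the two on made-up probabilities shaped like the `(n, 1)` array that `model.predict` returns for a single-unit output layer; the 0.5 cut-off is a conventional choice, not something tuned for this model.

```python
import numpy as np

# Made-up predictions with the (n_samples, 1) shape of a single sigmoid unit
probs = np.array([[0.03], [0.49], [0.51], [0.97]])

# Integer cast: truncates toward zero, so all of these become 0
truncated = [int(p) for p in probs.ravel()]

# Threshold: values of 0.5 and above are labelled 1
thresholded = (probs.ravel() >= 0.5).astype(int).tolist()

print(truncated)
print(thresholded)
```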

In [180]:
output_table = pd.DataFrame({'id': test_df.id.values, 'target': output_aslist})
output_table.to_csv('submissions.csv', index=False, header=True)