# Natural Language Processing of Disaster Tweets

Twitter has become an important source of real-time information because it is easy to use and quick to access. Users can post short 'tweets' about disasters they are witnessing, which has led many companies and monitoring agencies to treat the platform as a live feed. However, the vocabulary used to describe disasters also appears in unrelated contexts and as figures of speech, so these agencies need a way to check whether a tweet containing such words describes an actual disaster.


# Exploring the Data

We have a training and testing dataset consisting of tweets, some of which are related to an actual disaster. The training set has 7503 datapoints, and the testing set has 3243. The features present in the training set are:

1. keyword
 + a particular keyword from the tweet (may be blank)
2. location
 + the location the tweet was sent from (may be blank)
3. text
 + the text of the tweet
4. target
 + this denotes whether a tweet is about a real disaster (1) or not (0)
 
We will begin by exploring the data and inspecting the contents of the tweets by eye. After some preprocessing, we will also perform n-gram analysis on the tweets.

In [17]:
# import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import GaussianNB, CategoricalNB, ComplementNB, MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, TensorDataset

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.probability import FreqDist

import re
import string

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100000)
pd.set_option('display.max_colwidth', None)

In [18]:
# nltk data used for preprocessing

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables that newer NLTK releases look up for word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [19]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')

In [20]:
train.info()

In [21]:
NO_TRAIN_NON_DISASTER = train['target'].value_counts()[0]
NO_TRAIN_DISASTER = train['target'].value_counts()[1]


About 25% of the rows are missing a location, while only about 0.6% are missing a keyword.

Since a missing location cannot be recovered without external datasets, we set missing values to the placeholder 'no_location'.

For the keyword column, one possible way of filling in the missing values is to parse the grammatical structure of the tweet and use its grammatical object as the keyword. However, there is no guarantee that words in a tweet are spelt correctly, which would add spurious outliers to our keyword data (e.g. 'goal' is spelt 'goooooooaaaaaalll' in tweet 28). Moreover, the most informative word in a tweet is not necessarily its object. For example, tweet 37 ('No way...I can't eat that shit') has 'shit' as its object, though 'eat' would arguably be the more descriptive keyword here.

We will hence fill in the missing keywords with the placeholder 'no_keyword'.

In [22]:
# fill in nas with no_location, no_keyword

train_processed = train.fillna(value={'keyword': 'no_keyword', 'location': 'no_location'})

In [23]:
train_processed.info()

We want to look through the top keywords and locations from our dataset.

In [24]:
train_processed[train_processed['target']==1]['keyword'].value_counts()[:20].plot.barh()
plt.title('Top 20 keyword values (Disasters)')

In [25]:
train_processed[train_processed['target']==0]['keyword'].value_counts()[:20].plot.barh()
plt.title('Top 20 keyword values (Not Disasters)')

In [26]:
train_processed[(train_processed['target'] == 0) & (train_processed['keyword'] == 'blizzard')].head()

Surprisingly, words like 'blizzard' and 'body bags' are among the top 20 keyword values for non-disaster tweets. To get some context, we looked at several examples of such tweets. In tweets with the keyword 'blizzard', for example, the word was used to refer to the company Blizzard. This tells us that a single word cannot serve as an indicator of whether a tweet is about a disaster.
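
One way to quantify how ambiguous a keyword is would be to compute, for each keyword, the share of its tweets that are labelled as disasters. The sketch below does this in plain Python on a few made-up (keyword, target) pairs; the helper name `disaster_rate_by_keyword` and the toy rows are assumptions for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Toy (keyword, target) pairs; illustrative only, not taken from the dataset.
rows = [
    ("blizzard", 0), ("blizzard", 0), ("blizzard", 1),
    ("wildfire", 1), ("wildfire", 1),
    ("body bags", 0),
]

def disaster_rate_by_keyword(pairs):
    """Return {keyword: fraction of its tweets with target == 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for keyword, target in pairs:
        totals[keyword] += 1
        positives[keyword] += target
    return {k: positives[k] / totals[k] for k in totals}

rates = disaster_rate_by_keyword(rows)
for keyword, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{keyword:10s} {rate:.2f}")
```

A keyword whose rate sits near 0 or 1 is a strong signal on its own, while one near 0.5 needs the surrounding context to be useful.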

In [27]:
train_processed[train_processed['target']==1]['location'].value_counts()[1:11].plot.barh()
plt.title('Top 10 location values (Disasters)\nExcludes no_location')

In [28]:
train_processed[train_processed['target']==0]['location'].value_counts()[1:11].plot.barh()
plt.title('Top 10 location values (Not Disasters)\nExcludes no_location')

In [29]:
temp1 = train_processed[train_processed['target']==1]['location'].value_counts()[1:11]
temp2 = train_processed[train_processed['target']==0]['location'].value_counts().filter(items=list(temp1.index), axis=0)

temp = pd.DataFrame({'disaster': temp1, 'non_disaster': temp2})
temp = temp.div(temp.sum(axis=1), axis=0)

temp.plot.barh()

plt.title('Location vs. whether a tweet is related to disaster or non-disaster')
plt.xlabel('Share of tweets from each location')

We see that some locations have a higher proportion of disaster tweets than non-disaster tweets. In particular, Mumbai, India, and Nigeria exhibit this behaviour. Locations like New York, however, are more likely to have non-disaster tweets than disaster tweets.

# Data Preprocessing

Next, we will preprocess and clean our data in preparation for feature learning. We will use the basic steps shared in this [intro notebook by Parul Pandey](https://www.kaggle.com/parulpandey) to guide our preprocessing.

The techniques include:
- Lower casing
- Tokenisation
- Stop words removal
- Stemming
- Lemmatization

[API reference](https://www.nltk.org/api/nltk.html) and [guide to using with Pandas](https://www.kirenz.com/post/2021-12-11-text-mining-and-sentiment-analysis-with-nltk-and-pandas-in-python/text-mining-and-sentiment-analysis-with-nltk-and-pandas-in-python/)

## Lower-casing

In [30]:
# lower casing all texts

train_processed['cleaned_text'] = train_processed['text'].str.lower()

In [31]:
train_processed.head()

## Text cleaning

**URL Removal**

From analysis done after the fact, we note that many tweets contain URLs linking to external sites. We will remove them so that common but uninformative strings like 'http' do not show up in our data.

**Hashtags removal**

Similarly, hashtags are common in tweets, but the '#' symbol carries no special meaning for our purposes. Hence, we will remove the symbol and keep the word that follows it.
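
The two removals above can be sketched with the standard `re` module. The helper name `strip_urls_and_hashes` and the sample tweet are made up for illustration. Note that a single `http\S+` pattern is enough for both `http://` and `https://` links, since both start with `http`.

```python
import re

def strip_urls_and_hashes(text):
    """Remove URLs and the '#' symbol (the hashtag word itself is kept)."""
    # 'http\S+' consumes everything from 'http' up to the next whitespace.
    text = re.sub(r"http\S+", "", text)
    return text.replace("#", "")

sample = "forest #fire near la ronge https://t.co/abc123 stay safe"
print(strip_urls_and_hashes(sample))
```

Removing a URL from the middle of a sentence leaves a double space behind, which is harmless here because tokenisation splits on whitespace anyway.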

**Remove special character strings**

We will also remove special characters that appear in tweets. Some of these are due to encoding errors, leading to the appearance of non-alphanumeric strings. We remove them based on a list referenced from [here](https://deepnote.com/@pk/DeFi-DL-57c78dd7-2199-4952-9c0d-748f13156a1a).

**Replace special characters encoding**

Since some special characters like ampersands (&) are reserved in HTML, they are represented by certain strings in order to display them. We will replace them with the actual characters in our preprocessing.

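**Remove punctuation at start and end of tokens**

Some words have a punctuation mark attached to them, so we will strip punctuation from the start and end of each token once the text has been tokenised.

Finally, we will also remove characters outside the ASCII range, i.e. the first 128 Unicode code points (e.g. Û).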
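
As an aside, the entity replacement and the final character filter can both be expressed with the standard library alone. The sketch below is an alternative formulation, not the function used in this notebook: `html.unescape` decodes every named entity rather than a hand-picked list, and encoding to ASCII with `errors="ignore"` drops every code point above 127. The helper name `unescape_and_keep_ascii` is hypothetical.

```python
import html

def unescape_and_keep_ascii(text):
    """Decode HTML entities, then drop every non-ASCII character."""
    decoded = html.unescape(text)  # '&amp;' -> '&', '&gt;' -> '>'
    return decoded.encode("ascii", errors="ignore").decode("ascii")

print(unescape_and_keep_ascii("fire &amp; smoke &gt; 5km away \x89Û÷"))
```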

In [32]:
def clean_text(text):
    # remove URLs ('http\S+' also matches 'https://...') and the '#' symbol
    result = re.sub(r"http\S+", "", text)
    result = re.sub(r"#", "", result)

    # remove special characters left behind by encoding errors
    # (the patterns are case-sensitive; on lower-cased input the ASCII filter
    # at the end removes whatever non-ASCII characters they miss)
    result = re.sub(r"\x89Û_", "", result)
    result = re.sub(r"\x89ÛÒ", "", result)
    result = re.sub(r"\x89ÛÓ", "", result)
    result = re.sub(r"\x89ÛÏWhen", "When", result)
    result = re.sub(r"\x89ÛÏ", "", result)
    result = re.sub(r"let\x89Ûªs", "let's", result)
    result = re.sub(r"\x89Û÷", "", result)
    result = re.sub(r"\x89Ûª", "", result)
    result = re.sub(r"\x89Û\x9d", "", result)
    result = re.sub(r"å_", "", result)
    result = re.sub(r"\x89Û¢åÊ", "", result)  # longer pattern first, or it could never match
    result = re.sub(r"\x89Û¢", "", result)
    result = re.sub(r"åÊ", "", result)
    result = re.sub(r"åÈ", "", result)
    result = re.sub(r"Ì©", "e", result)
    result = re.sub(r"å¨", "", result)
    result = re.sub(r"åÇ", "", result)
    result = re.sub(r"åÀ", "", result)

    # character entity references
    result = re.sub(r"&gt;", ">", result)
    result = re.sub(r"&lt;", "<", result)
    result = re.sub(r"&amp;", "&", result)

    # keep only ASCII, i.e. the first 128 Unicode code points (0-127)
    # https://www.utf8-chartable.de/unicode-utf8-table.pl?number=128
    result = ''.join([char for char in result if ord(char) < 128])

    return result

train_processed['cleaned_text'] = train_processed['cleaned_text'].apply(clean_text)

## Tokenisation

In [33]:
# Tokenisation via nltk

train_processed['tokens'] = train_processed['cleaned_text'].apply(word_tokenize)


In [34]:
train_processed.head()

## Stop-words removal

In [35]:
# remove stop words using nltk

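# a set gives fast membership tests; the new name avoids shadowing the imported `stopwords` module
stop_words = set(nltk.corpus.stopwords.words('english'))

train_processed['tokens'] = train_processed['tokens'].apply(lambda x: [item for item in x if item not in stop_words])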

In [36]:
train_processed.head()

## Remove punctuation tokens + remove punctuation from start/end of tokens.

In [37]:
# strip punctuation from the start/end of each token, then drop tokens that end up empty
# (this also removes tokens made up entirely of punctuation, e.g. '...')

puncs = string.punctuation

train_processed['tokens'] = train_processed['tokens'].apply(lambda x: [item.strip(puncs) for item in x])
train_processed['tokens'] = train_processed['tokens'].apply(lambda x: [item for item in x if item])

In [38]:
train_processed.head()
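
To see what punctuation stripping does to individual tokens, here is a minimal sketch in plain Python on made-up tokens. The helper name `strip_punctuation` is hypothetical. `str.strip` only touches the two ends of a token, so internal punctuation such as the apostrophe in "n't" or the hyphen in "re-open" survives, while a token made only of punctuation becomes empty and is dropped.

```python
import string

def strip_punctuation(tokens):
    """Strip leading/trailing punctuation and drop tokens left empty."""
    stripped = (token.strip(string.punctuation) for token in tokens)
    return [token for token in stripped if token]

print(strip_punctuation(["fire", "...", "'evacuate", "now!", "n't", "re-open"]))
```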

## Lemmatize tokens

In [39]:
# lemmatise using nltk's WordNetLemmatizer (treats every token as a noun by default)

wnl = WordNetLemmatizer()

train_processed['tokens_lem'] = train_processed['tokens'].apply(lambda x: [wnl.lemmatize(item) for item in x])


In [40]:
train_processed.head()

## Encode categorical variables

In [41]:
# encode using sklearn's OrdinalEncoder

# these encoders will be used for the training, validation, and test sets.
oe_location = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=6969)
oe_keyword = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=6969)

oe_location.fit(train_processed['location'].to_frame())
oe_keyword.fit(train_processed['keyword'].to_frame())

train_processed['location'] = oe_location.transform(train_processed['location'].to_frame())
train_processed['keyword'] = oe_keyword.transform(train_processed['keyword'].to_frame())
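
The idea behind an ordinal encoder with an "unknown" fallback can be sketched in a few lines of plain Python. This is an illustration of the concept, not scikit-learn's implementation: the helper names `fit_ordinal` and `transform_ordinal` are hypothetical, and the sentinel `-1` is an arbitrary choice for the sketch (the cell above uses 6969).

```python
def fit_ordinal(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

def transform_ordinal(values, mapping, unknown_value=-1):
    """Encode values, sending anything unseen during fitting to unknown_value."""
    return [mapping.get(value, unknown_value) for value in values]

train_locations = ["no_location", "Mumbai", "New York", "Mumbai"]
mapping = fit_ordinal(train_locations)
print(mapping)
print(transform_ordinal(["New York", "Lagos"], mapping))
```

The codes carry an ordering that has no real meaning for locations or keywords, which is worth keeping in mind when feeding them to models that treat inputs as numeric magnitudes.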

## Count Vectorise our dataset

In [42]:
# make the tokens into a string for our sklearn countvectoriser function

def combine_tokens(tokens):
    # join the tokens into one space-separated string
    return ' '.join(tokens)

train_processed['tokens_string'] = train_processed['tokens_lem'].apply(combine_tokens)

In [43]:
train_processed.head()

## Unigrams (1-grams)

In [28]:
# Get counts of each unique token for each corpus.

v1 = CountVectorizer()

X1 = v1.fit_transform(train_processed['tokens_string'])
X1_matrix = X1.toarray()
X1_features = v1.get_feature_names_out()

In [29]:
unigrams_df = pd.DataFrame(X1_matrix, columns=list(X1_features))

In [30]:
# plot the top 50 most frequent unigrams
temp = unigrams_df.sum().sort_values(ascending=False)

plt.figure(figsize=(12,10))
temp[:50].plot.barh()
plt.title('Top 50 most frequent unigrams')

In [31]:
# Combine with training data

train_unigram = train_processed.merge(unigrams_df, left_index=True, right_index=True, suffixes=['x', None])

train_unigram.head(2)

In [31]:
# check most common unigrams of disaster and non-disaster related tweets

temp = train_unigram.groupby('targetx').sum().drop('idx', axis=1)
indexes = list(unigrams_df.sum().sort_values(ascending=False).index)[:10]

t1 = temp.iloc[0, :].loc[indexes]
t2 = temp.iloc[1, :].loc[indexes]

t3 = pd.DataFrame({'disaster': t2, 'non_disaster': t1})
t3['disaster'] = t3['disaster'].div(NO_TRAIN_DISASTER)
t3['non_disaster'] = t3['non_disaster'].div(NO_TRAIN_NON_DISASTER)

t3.plot.barh()
plt.title('Unigrams vs normalised frequency in tweets (split by disaster vs non_disaster tweets)')

## Bigrams (2-grams)

In [None]:
# Get counts of each unique token for each corpus.

v2 = CountVectorizer(ngram_range=(2, 2))

X2 = v2.fit_transform(train_processed['tokens_string'])
X2_matrix = X2.toarray()
X2_features = v2.get_feature_names_out()

In [None]:
bigrams_df = pd.DataFrame(X2_matrix, columns=list(X2_features))

In [None]:
# plot the top 10 most frequent bigrams
temp = bigrams_df.sum().sort_values(ascending=False)

temp[:10].plot.barh()
plt.title('Top 10 most frequent bigrams')

In [None]:
# Combine with training data

train_bigram = train_processed.merge(bigrams_df, left_index=True, right_index=True, suffixes=['x', None])

train_bigram.head(2)

In [None]:
# check most common bigrams of disaster and non-disaster related tweets

temp = train_bigram.groupby('target').sum().drop('id', axis=1)
indexes = list(bigrams_df.sum().sort_values(ascending=False).index)[:10]
t1 = temp.iloc[0, :].loc[indexes]
t2 = temp.iloc[1, :].loc[indexes]

t3 = pd.DataFrame({'disaster': t2, 'non_disaster': t1})
t3['disaster'] = t3['disaster'].div(NO_TRAIN_DISASTER)
t3['non_disaster'] = t3['non_disaster'].div(NO_TRAIN_NON_DISASTER)

t3.plot.barh()
plt.title('Bigrams vs normalised frequency in tweets (split by disaster vs non_disaster tweets)')

## Trigrams (3-grams)

In [None]:
# Get counts of each unique token for each corpus.

v3 = CountVectorizer(ngram_range=(3, 3))

X3 = v3.fit(train_processed['tokens_string']).transform(train_processed['tokens_string'])
X3_matrix = X3.toarray()
X3_features = v3.get_feature_names_out()

In [None]:
trigrams_df = pd.DataFrame(X3_matrix, columns=list(X3_features))

In [None]:
# plot the top 10 most frequent trigrams
temp = trigrams_df.sum().sort_values(ascending=False)

temp[:10].plot.barh()
plt.title('Top 10 most frequent trigrams')

In [None]:
# Combine with training data

train_trigram = train_processed.merge(pd.DataFrame(X3_matrix, columns=list(X3_features)), left_index=True, right_index=True, suffixes=['x', None])


In [None]:
# check most common trigrams of disaster and non-disaster related tweets

temp = train_trigram.groupby('target').sum().drop('id', axis=1)
indexes = list(trigrams_df.sum().sort_values(ascending=False).index)[:10]
t1 = temp.iloc[0, :].loc[indexes]
t2 = temp.iloc[1, :].loc[indexes]

t3 = pd.DataFrame({'disaster': t2, 'non_disaster': t1})
t3['disaster'] = t3['disaster'].div(NO_TRAIN_DISASTER)
t3['non_disaster'] = t3['non_disaster'].div(NO_TRAIN_NON_DISASTER)

t3.plot.barh()
plt.title('Trigrams vs normalised frequency in tweets (split by disaster vs non_disaster tweets)')

# Building a Trigram model

We choose a trigram model because, of the n-grams examined above, the ten most frequent trigrams are the most strongly associated with disaster tweets, more so than the top unigrams and bigrams. We do not go beyond trigrams, since longer n-grams recur less often across tweets and the model would generalise less well.
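
The trade-off between specificity and generality can be illustrated with a small sketch in plain Python. The helper name `word_ngrams` and the two toy token lists are assumptions for illustration. As n grows, each n-gram becomes more specific, but fewer of them are shared between documents.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    "forest fire near lake".split(),
    "forest fire spreads fast".split(),
]
for n in (1, 2, 3):
    counts = Counter(gram for doc in docs for gram in word_ngrams(doc, n))
    shared = sum(1 for c in counts.values() if c > 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {shared} seen more than once")
```

In this toy example the two documents share two unigrams and one bigram but no trigram, which is the sparsity effect that makes very long n-grams hard to generalise from.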


We will attempt several methods.

## Should we TFIDF weight our tokens?

Common words that occur throughout the whole corpus offer little information for differentiating between types of tweets (similar to stop words), so we would like to weight tokens such that words which appear in fewer tweets carry more weight than words that appear everywhere. Term Frequency-Inverse Document Frequency (TF-IDF) weighting does exactly this. However, in our experiments it led to worse results, so we will not be using it.
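
For reference, the weighting can be sketched in plain Python. The sketch assumes the smoothed formula idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is the default variant in scikit-learn's `TfidfVectorizer`, and it leaves out the row normalisation that the library applies afterwards. The helper names `smoothed_idf` and `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    """idf(t) = ln((1 + N) / (1 + df(t))) + 1 over tokenised documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    """Raw term count times idf for one document (no length normalisation)."""
    return {term: count * idf[term] for term, count in Counter(doc).items()}

docs = [["fire", "forest"], ["fire", "truck"], ["fire", "drill"]]
idf = smoothed_idf(docs)
print(round(idf["fire"], 3), round(idf["forest"], 3))
print(tfidf(docs[0], idf))
```

A token that appears in every document, like "fire" here, gets the minimum weight, while rarer tokens are weighted up.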

In [44]:
# Binary trigram features: 1 if a trigram occurs in a tweet, 0 otherwise.

v3 = CountVectorizer(ngram_range=(3, 3), binary=True)

X3 = v3.fit_transform(train_processed['tokens_string'])
X3_matrix = X3.toarray()
X3_features = v3.get_feature_names_out()

trigrams_df = pd.DataFrame(X3_matrix, columns=list(X3_features))

# Combine with training data

train_trigram = train_processed.merge(trigrams_df, left_index=True, right_index=True, suffixes=['x', None])


In [45]:
X = train_trigram.drop(['text', 'cleaned_text', 'tokens', 'tokens_lem', 'tokens_string', 'id', 'target'], axis=1)
y = train_trigram.loc[:, 'target']

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# free the memory held by the full feature matrix and labels
del X, y

In [47]:
X_train.head()

## Should we do Topic Modelling via LDA?

Since the number of unique tokens is modest (fewer than 100,000), dimensionality reduction is not strictly necessary. Our experiments also showed that features from Latent Dirichlet Allocation (LDA) led to worse results, so we do not use it.

## Logistic Regression

We will use a logistic regression classifier with 3-fold cross-validation. We chose this model because the data is exceptionally high-dimensional, which would make a tree-based model extremely slow to train.
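
As a reminder of what k-fold cross-validation does with the training rows, the sketch below splits sample indices into k contiguous folds and uses each fold once for validation. It is a simplified illustration in plain Python: the helper name `k_fold_indices` is hypothetical, and it makes no attempt to shuffle or to balance classes across folds, which a library splitter may do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(7, 3):
    print("train:", train_idx, "validate:", val_idx)
```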

In [None]:
clf = LogisticRegressionCV(cv=3, random_state=42, max_iter=3000, verbose=1).fit(X_train, y_train)

In [64]:
# validation accuracy on the held-out split

clf.score(X_test, y_test)

## Naive Bayes

We will also try a Naive Bayes method, as it is known to work well for document classification.
https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
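
To make the method concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written in plain Python on made-up token lists. It is meant to show the idea only and is not how the scikit-learn estimators are implemented; the function names `train_multinomial_nb` and `predict` and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed token likelihoods (in log space)."""
    vocab = {token for doc in docs for token in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                t: math.log((token_counts[label][t] + alpha) / total) for t in vocab
            },
        }
    return model

def predict(model, doc):
    """Return the label with the highest log-score; unseen tokens are skipped."""
    def score(params):
        return params["log_prior"] + sum(
            params["log_likelihood"][t] for t in doc if t in params["log_likelihood"]
        )
    return max(model, key=lambda label: score(model[label]))

docs = [["forest", "fire", "evacuate"], ["flood", "evacuate"],
        ["new", "album", "fire"], ["love", "new", "song"]]
labels = [1, 1, 0, 0]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["evacuate", "flood"]), predict(model, ["new", "song"]))
```

The smoothing term `alpha` keeps a token that never occurred with a class from forcing that class's probability to zero.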

In [61]:
nb_gaussian = GaussianNB().fit(X_train, y_train)
nb_categorical = CategoricalNB().fit(X_train, y_train)
nb_complement = ComplementNB().fit(X_train, y_train)
nb_multinomial = MultinomialNB().fit(X_train, y_train)


In [66]:
print('Gaussian NB:', nb_gaussian.score(X_test, y_test))
print('Complement NB:', nb_complement.score(X_test, y_test))
print('Multinomial NB:', nb_multinomial.score(X_test, y_test))

## Support Vector Machine (SVM)

Lastly, we will try SVMs, as they are ['widely regarded as one of the best text classification algorithms'](https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568).

In [59]:
svc = SVC(verbose=1, max_iter=-1, random_state=42,).fit(X_train, y_train)

In [60]:
svc.score(X_test, y_test)

## Neural Network (MLP)

In [53]:
torch.Tensor(X_train.to_numpy()).type(torch.LongTensor).dtype

In [54]:
# load dataset

dataset = TensorDataset(torch.Tensor(X_train.to_numpy()), torch.Tensor(y_train.to_numpy()).type(torch.LongTensor))
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

In [60]:
# build architecture

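class TextClassifier(nn.Module):
    def __init__(self, in_features, hidden=100, n_classes=2):
        super().__init__()

        self.classifier = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes)
        )

    def forward(self, x):
        return self.classifier(x)

# model init: the input size is taken from the feature matrix instead of being hard-coded
model = TextClassifier(in_features=X_train.shape[1])

# loss function: cross-entropy over the two classes

loss_function = nn.CrossEntropyLoss()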
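
For a single example, the cross-entropy loss is the negative log-probability that softmax assigns to the true class. The sketch below computes it in plain Python for one pair of logits; the helper name `cross_entropy` is hypothetical, and the PyTorch loss additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

print(round(cross_entropy([0.0, 0.0], 1), 4))   # undecided model
print(round(cross_entropy([2.0, -1.0], 0), 4))  # confident and correct
print(round(cross_entropy([2.0, -1.0], 1), 4))  # confident and wrong
```

An undecided two-class model scores ln 2, a confident correct prediction scores close to 0, and a confident wrong one is penalised heavily.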

In [61]:
losses_epoch = []

In [62]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
epochs = 10

for epoch in range(epochs):
    losses = []
    for inputs, labels in train_loader:

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        # record the batch loss as a plain float
        losses.append(loss.item())

    losses_epoch.append(losses)
    l = sum(losses)/len(losses)
    print('epoch [{}/{}]'.format(epoch + 1, epochs), f'Loss: {l}')

print('Finished Training')

# Creating Test Submission

Now that we have a basic model, though admittedly the validation score of 0.66 is not great, we can prepare our test file and use the model to predict the target values.
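
When judging a validation score like this, it helps to compare it with the accuracy of always predicting the most common class. The sketch below shows both computations in plain Python on made-up labels; the helper names `accuracy` and `majority_baseline` and the toy label lists are assumptions and do not reflect the notebook's actual predictions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # toy labels, illustrative only
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred), majority_baseline(y_true))
```

A model is only adding value to the extent that its accuracy exceeds this baseline.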

In [None]:
test = pd.read_csv('../input/nlp-getting-started/test.csv')

## Preparing Test Data

To make things more concise, we will create a simple preprocessing pipeline. After that, we will count vectorise the tokens and merge the count matrix with our data.

In [None]:
def preprocessingPipeline(data):
    
    processed = data.fillna(value={'keyword': 'no_keyword', 'location': 'no_location'}, )
    
    # lower casing all texts
    
    processed['cleaned_text'] = processed['text'].str.lower()
    
    # remove urls and hashtags

    processed['cleaned_text'] = processed['cleaned_text'].apply(clean_text)
    # Tokenisation via nltk
    
    processed['tokens'] = processed['cleaned_text'].apply(word_tokenize)
    
    # remove stop words using nltk

    stop_words = set(nltk.corpus.stopwords.words('english'))
    processed['tokens'] = processed['tokens'].apply(lambda x: [item for item in x if item not in stop_words])
    
    # strip punctuation from the start/end of each token, then drop empty tokens

    puncs = string.punctuation
    processed['tokens'] = processed['tokens'].apply(lambda x: [item.strip(puncs) for item in x])
    processed['tokens'] = processed['tokens'].apply(lambda x: [item for item in x if item])
    
    
    # lemmatise using wordnetlemmatise

    wnl = WordNetLemmatizer()
    processed['tokens_lem'] = processed['tokens'].apply(lambda x: [wnl.lemmatize(item) for item in x])

    # make the tokens into a string for our sklearn countvectoriser function

    processed['tokens_string'] = processed['tokens_lem'].apply(combine_tokens)
    
    # encode using training set trained OrdinalEncoders
    
    processed['location'] = oe_location.transform(processed['location'].to_frame())
    processed['keyword'] = oe_keyword.transform(processed['keyword'].to_frame())
    
    return processed

In [None]:
test_processed = preprocessingPipeline(test)

In [63]:
# countvectorising

X3 = v3.transform(test_processed['tokens_string'])
X3_matrix = X3.toarray()
X3_features = v3.get_feature_names_out()

trigrams_df = pd.DataFrame(X3_matrix, columns=list(X3_features))

# Combine with test data

test_trigram = test_processed.merge(trigrams_df, left_index=True, right_index=True, suffixes=['x', None])

X = test_trigram.drop(['text', 'cleaned_text', 'tokens', 'tokens_lem', 'tokens_string', 'id',], axis=1)

## Create Test Submission

In [None]:
# predict on the test set with the Gaussian Naive Bayes model

results = nb_gaussian.predict(X)

pd.DataFrame({'id': test.loc[:, 'id'], 'target': results}).to_csv("submission.csv", index=False)