# Real or Not? NLP with Disaster Tweets (Kaggle Competition)

In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo("", width=600)

[Link to the GitHub repository](https://github.com/XaviJunior/SBB)

[Link to the YouTube video]()

## Contributions

* **Xavier AEBY**: experiments with Doc2Vec, video
* **Tarik BACHA**: experiments with neural networks, model explanation, EDA
* **Tanguy BERGUERAND**: cleaning, experiments with various models
* **Frederic SPYCHER**: cleaning, notebook writing/proofreading

## Introduction

For our second group project, we were asked to join the [Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) Kaggle competition. This challenge consists of training a machine learning model that can predict, using **natural language processing** (NLP), whether tweets about **disasters** are genuine.

From the Kaggle website:

_Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies). But, it’s not always clear whether a person’s words are actually announcing a disaster._

Therefore, the **incentive** for building such a model is to bolster the monitoring efforts of the aforementioned organizations and help them to identify actual threats more quickly and with more accuracy.

## Setting things up

For NLP operations like tokenization, we turned to the powerful Python library **spaCy**. It comes with a pretrained model for English (`en_core_web_sm`).

As for machine learning models, we mostly used the toolkit offered by the **scikit-learn** library. Experiments with neural networks were done with **Keras/TensorFlow**.

Most of the text cleaning was done using **regular expressions**.

If the environment does not already contain spaCy, Keras and TensorFlow, they can be installed by uncommenting and running the following commands.

In [3]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

# !pip install keras
# !pip install --upgrade tensorflow==1.14.0

In [4]:
import re
import string

from nltk.corpus import wordnet
import numpy as np
import pandas as pd
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

RSEED = 42

In order to get reproducible and comparable results, we arbitrarily chose a **random seed** to be used as a parameter for the train/test split and models.
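As a small illustration of why the seed matters, the sketch below splits a toy list (not the tweets) twice with the same `random_state`; the list `data` and the variable names are made up for this example.

```python
from sklearn.model_selection import train_test_split

RSEED = 42
data = list(range(10))  # toy stand-in for the tweets

# Two independent calls with the same seed
split_a = train_test_split(data, test_size=0.2, random_state=RSEED)
split_b = train_test_split(data, test_size=0.2, random_state=RSEED)

# Same seed, same shuffling: both calls return identical train/test lists
print(split_a == split_b)
```

With a different seed the partition would generally change, which would make scores harder to compare between runs.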

## The data

The data provided by Kaggle contains more than **10,000 tweets**, for 70% of which we are given the `target` class, i.e. whether they are true (1) or false (0). Each observation from the training data is composed of an `id` and the `text` of the tweet.

In most cases, a `keyword` is provided as well. There are a total of 221 keywords, all of which pertain somehow to accidents and disasters. However, they are assigned to tweets of both classes (e.g. the keyword "accident" is found in 24 tweets of class 1 and 11 tweets of class 0).

Sometimes, the tweet's `location` is also given. However, it is missing from a large share of tweets, and the values that are present are very messy, probably because they are user-generated (e.g. Birmingham; Est. September 2012 - Bristol; AFRICA; Philadelphia, PA; TN; #NewcastleuponTyne #UK).
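The share of missing values and the distribution of keywords across classes can be checked with a few pandas calls. The sketch below uses a hypothetical four-row DataFrame named `sample` that only mimics the structure of the training file; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows mimicking the structure of train.csv (values are illustrative only)
sample = pd.DataFrame({
    "id": [1, 48, 49, 52],
    "keyword": [None, "ablaze", "ablaze", "ablaze"],
    "location": [None, "Birmingham", None, "Philadelphia, PA"],
    "target": [1, 1, 0, 0],
})

# Fraction of missing values in each column
missing_share = sample[["keyword", "location"]].isna().mean()
print(missing_share)

# Number of tweets per keyword and class (rows without a keyword are ignored)
keyword_by_class = sample.groupby(["keyword", "target"]).size().unstack(fill_value=0)
print(keyword_by_class)
```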

In [12]:
df = pd.read_csv("https://raw.githubusercontent.com/XaviJunior/SBB/master/project_2/Data/train.csv", encoding="utf-8")
df.head(30)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [39]:
df[~df["location"].isna()]

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
32,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
33,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
34,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
35,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0
...,...,...,...,...,...
7575,10826,wrecked,TN,On the bright side I wrecked http://t.co/uEa0t...,0
7577,10829,wrecked,#NewcastleuponTyne #UK,@widda16 ... He's gone. You can relax. I thoug...,0
7579,10831,wrecked,"Vancouver, Canada",Three days off from work and they've pretty mu...,0
7580,10832,wrecked,London,#FX #forex #trading Cramer: Iger's 3 words tha...,0


In [38]:
df[(df["keyword"] == "accident") & (df["target"]==1)].shape[0]

24

In [5]:
#df = df.drop_duplicates(subset="text", keep="first")
print("Classified observations:", df.shape[0])

Classified observations: 7503


In [6]:
df_test = pd.read_csv("https://raw.githubusercontent.com/XaviJunior/SBB/master/project_2/Data/test.csv", encoding="utf-8")
df_test.head(3)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."


In [7]:
print("Unclassified observations:", df_test.shape[0])

Unclassified observations: 3263


## Cleaning

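We kept duplicate tweets, on the assumption that repetitions reinforce the association between certain words and the presence (or absence) of a disaster.

We also tried normalizing elongated words with a function found at https://github.com/ugis22/analysing_twitter/. It seemed like a good idea, but it consistently gave slightly lower scores, so the call is commented out in the cleaning function.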
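To illustrate what elongated-word normalization means, here is a deliberately simplified sketch. It is not the function from the repository mentioned above (which relies on WordNet lookups); the helper name `squeeze_repeats` and the rule "three or more identical letters become two" are assumptions made for this example.

```python
import re

def squeeze_repeats(text):
    # Hypothetical helper: collapse any letter repeated 3+ times into 2 occurrences.
    # Digits are left alone so that numbers such as "1000" are not altered.
    return re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)

example = squeeze_repeats("Heeeelp the fire is sooooo close")
print(example)
```

A purely regex-based rule like this cannot tell "sooo" from a legitimate double letter, which is why dictionary-based approaches check candidate words against a lexicon.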

In [8]:
def clean(tweet):
    tweet = re.sub(r"\b[0-9]+\b", "", tweet)  # remove tokens with numbers only
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)  # remove Twitter usernames
    tweet = re.sub(r"\bRT\b", "", tweet)  # remove "RT"
    tweet = re.sub(r" \w{1,3}\.{3,3} http\S{0,}", " ", tweet)  # remove truncated endings
    tweet = re.sub(r"http\S{0,}", " ", tweet)  # remove other URLs
    tweet = re.sub(r".Û.", "'", tweet)  # replace strange representation of apostrophe
    
    # replace HTML codes
    tweet = re.sub(r"&amp;", "&", tweet)
    tweet = re.sub(r"&lt;", "<", tweet)
    tweet = re.sub(r"&gt;", ">", tweet)
    
    tweet = re.sub(r"[^a-zA-Z0-9']", " ", tweet)  # keep alphanumerical characters only
    tweet = re.sub(r"\bx....\b", " ", tweet)  # remove hexadecimal characters
    tweet = re.sub(r"\s+", " ", tweet)  # collapse runs of whitespace (spaces, tabs, newlines) into one space
    
    #tweet = detect_elongated_words(tweet)
    
    return tweet.strip()
    
df["text"] = df["text"].apply(clean)
df_test["text"] = df_test["text"].apply(clean)

df["text"].to_csv("tweets.csv", index=False)

## Tokenization

Parameters we experimented with: TF-IDF vs. raw counts (`TfidfVectorizer` vs. `CountVectorizer`), the n-gram range, and the length of tokens.
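The difference between the two vectorizers and the effect of the n-gram range can be seen on a toy corpus. This is only a sketch: it uses scikit-learn's default tokenizer instead of the spaCy-based one so that it stays self-contained, and the two sentences in `corpus` are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["forest fire near town", "fire truck near school"]  # toy tweets

# Unigrams and bigrams, raw counts
count_vec = CountVectorizer(ngram_range=(1, 2))
counts = count_vec.fit_transform(corpus)

# Same features, but weighted by TF-IDF (rows are L2-normalized by default)
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = tfidf_vec.fit_transform(corpus)

print(sorted(count_vec.vocabulary_))  # unigrams and bigrams found in the corpus
print(counts.toarray())
print(tfidf.toarray().round(2))
```

Terms that occur in both documents (such as "fire") receive a lower TF-IDF weight than terms specific to one document (such as "forest"), whereas raw counts treat them alike.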

In [17]:
nlp = spacy.load('en_core_web_sm')
punctuation = string.punctuation
stop_words = spacy.lang.en.stop_words.STOP_WORDS

def tokenizer(tweet):
    tokens = nlp(tweet)
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens]
    tokens = [word for word in tokens if word not in stop_words and word not in punctuation]
    
    return tokens

vectorizer = TfidfVectorizer(tokenizer=tokenizer, ngram_range=(1,3))
# vectorizer = CountVectorizer(tokenizer=tokenizer, ngram_range=(1,3))

## Training models

For our first attempts at building a model, we went back to models previously seen in the Data Mining & Machine Learning and Big-Scale Analytics courses, without tweaking the hyperparameters too much, in order to compare how each of them performs on this particular dataset.
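The models in this section are compared with the F1 score (`f1_score` from scikit-learn), i.e. the harmonic mean of precision and recall for the positive class. As a minimal sketch, the computation can be reproduced by hand; the labels in `y_true` and `y_pred` are made up for illustration.

```python
def f1_from_labels(y_true, y_pred):
    """F1 score for the positive class (1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # without true positives the score is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_from_labels(y_true, y_pred)
print(score)  # 3 TP, 1 FP, 1 FN, so precision = recall = 0.75
```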

Preparing the data: the tweets and their labels are passed to the vectorizer and the models as lists.

In [18]:
X = vectorizer.fit_transform(df["text"].values.tolist())
y = df["target"].values.tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)

X_train

<6002x86952 sparse matrix of type '<class 'numpy.float64'>'
	with 122986 stored elements in Compressed Sparse Row format>
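Note that in the cell above the vectorizer is fitted on all labelled tweets before the split, so the vocabulary and IDF weights also reflect the held-out tweets. One possible alternative, sketched here on made-up sentences with a plain `LogisticRegression` (both are assumptions for the example, not the setup used in this notebook), is to split the raw texts first and wrap the vectorizer and the model in a pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data: 1 = real disaster, 0 = not a disaster (illustrative only)
texts = [
    "forest fire spreading near the village",
    "earthquake destroyed several buildings",
    "flood waters rising evacuate now",
    "tornado warning issued for the county",
    "this new album is fire",
    "my exams were a total disaster lol",
    "flooded with emails this morning",
    "the concert blew my mind",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, keeping the class balance in both parts
X_train_txt, X_test_txt, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline only ever sees the training texts during fit
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(X_train_txt, y_train_toy)
preds = pipe.predict(X_test_txt)
print(preds)
```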

In [19]:
print(vectorizer.get_feature_names())



### Support vector machine

scikit-learn offers several SVM implementations (`SVC`, `SGDClassifier`, `NuSVC`). We kept `NuSVC` because it consistently gave better scores.

Scores obtained so far:

* TF-IDF, unigrams, `nu=0.5`, `gamma=0.3`: 0.7397737162750216 (with `CountVectorizer` and `gamma=0.1`: 0.7376760563380281)
* after undoing the elongated-word cleaning: 0.7406113537117904
* TF-IDF with 3-grams: 0.7252946509519492
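A comparison of `nu` and `gamma` values like the one above could also be automated with a cross-validated grid search. The sketch below runs on synthetic data from `make_classification` rather than on the tweets, and the grid values are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

# Balanced synthetic data standing in for the vectorized tweets
X_toy, y_toy = make_classification(
    n_samples=60, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0, random_state=42,
)

# Candidate hyperparameters (illustrative values)
param_grid = {"nu": [0.3, 0.5], "gamma": [0.1, 0.3, 0.5]}

# 3-fold cross-validation, each combination scored with F1
search = GridSearchCV(NuSVC(kernel="rbf"), param_grid, scoring="f1", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Cross-validated scores are averaged over several splits, so they tend to be less sensitive to one particular train/test partition than a single hold-out score.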

In [25]:
clf = NuSVC(nu=0.5, kernel="rbf", gamma=0.5)
clf.fit(X_train, y_train)
f1_score(y_test, clf.predict(X_test))

0.7194112235510578

### Naive Bayes

Naive Bayes is very fast to train but gives slightly worse results than the SVM.

In [26]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB

clf = MultinomialNB()
clf.fit(X_train, y_train)
f1_score(y_test, clf.predict(X_test))

0.7012487992315083

### Logistic Regression

Scores obtained:

* with TF-IDF: 0.7277441659464132
* with `CountVectorizer`: 0.7241379310344829
* with `CountVectorizer` and 3-grams: 0.7223719676549865

These scores are consistently lower than those of the SVM.

In [27]:
clf = LogisticRegressionCV(solver="lbfgs", cv=5, max_iter=2000, random_state=RSEED)
clf.fit(X_train, y_train)
f1_score(y_test, clf.predict(X_test))

0.7062043795620438

### Random forest, XGBoost

### Neural network

## Exporting predictions

To submit to the Kaggle competition, we compute predictions for the set of tweets provided by the website and upload them as a CSV file containing the tweet `id` and the predicted `target`.
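As a sketch of the expected file layout, the snippet below writes a tiny submission to an in-memory buffer instead of a file. The predictions in `submission` are hypothetical and do not come from any of the models.

```python
import io

import pandas as pd

# Hypothetical predictions for three test tweets
submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 1, 0]})

# Write only the two required columns, without the DataFrame index
buffer = io.StringIO()
submission[["id", "target"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: otherwise pandas would add an extra unnamed column that is not part of the two-column layout described above.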

In [267]:
clf.fit(X, y)
to_predict = vectorizer.transform(df_test["text"].values.tolist())
df_test["target"] = clf.predict(to_predict)  # choose appropriate model

df_test[["id", "target"]].to_csv("UNIL_SBB_FSP.csv", index=False)

## Unused code

In [None]:
def replace_elongated_word(word):
    # From https://github.com/ugis22/analysing_twitter/
    regex = r'(\w*)(\w+)\2(\w*)'
    repl = r'\1\2\3'    
    if wordnet.synsets(word):
        return word
    new_word = re.sub(regex, repl, word)
    if new_word != word:
        return replace_elongated_word(new_word)
    else:
        return new_word
    
def detect_elongated_words(row):
    # From https://github.com/ugis22/analysing_twitter/
    regexrep = r'(\w*)(\w+)(\2)(\w*)'
    words = [''.join(i) for i in re.findall(regexrep, row)]
    for word in words:
        if not wordnet.synsets(word):
            row = re.sub(word, replace_elongated_word(word), row)
    return row