# Real or Not? NLP with Disaster Tweets (Kaggle Competition)

In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo("", width=600)

[Link to the GitHub repository](https://github.com/XaviJunior/SBB)

[Link to the YouTube video]()

## Contributions

* **Xavier AEBY**: 
* **Tarik BACHA**: 
* **Tanguy BERGUERAND**: 
* **Frederic SPYCHER**: 

## Introduction

For our second group project, we were tasked with joining the [Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) Kaggle competition, which consists of devising a machine learning model that can predict, using natural language processing, whether tweets announcing a disaster are genuine or not.

The incentive for such a model is to help disaster relief organizations and news agencies identify actual disasters as they monitor Twitter (which is often used as a communication tool during such events).

## Setting things up

In [None]:
# !pip install spacy
# !pip install xgboost

In [205]:
import re
import string

import numpy as np
import pandas as pd
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import xgboost as xgb

RSEED = 42

## The data

The data provided by Kaggle contains more than **10,000 tweets**, about 70% of which come with a `target` class (1 = the tweet reports a real disaster, 0 = it does not). Each observation from the training data is composed of an `id` and the `text` of the tweet. In some cases, a `keyword` as well as the tweet's `location` are also given.

In [189]:
df = pd.read_csv("https://raw.githubusercontent.com/XaviJunior/SBB/master/project_2/Data/train.csv", encoding="utf-8")
df.head(3)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |


In [190]:
print("Classified observations:", df.shape[0])

Classified observations: 7613


In [191]:
df_test = pd.read_csv("https://raw.githubusercontent.com/XaviJunior/SBB/master/project_2/Data/test.csv", encoding="utf-8")
df_test.head(3)

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |


In [192]:
print("Unclassified observations:", df_test.shape[0])

Unclassified observations: 3263
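Before modelling, it can help to check the share of labeled tweets, the class balance, and how often the optional columns are missing. The snippet below is only a sketch: `toy` is a hypothetical four-row frame that mirrors the column layout of `train.csv`, and the two counts are the ones printed above.

```python
import pandas as pd

# Share of tweets that come with a label, using the two counts printed above
labeled_share = 7613 / (7613 + 3263)
print(f"Labeled share: {labeled_share:.2f}")

# Hypothetical miniature frame with the same columns as train.csv
toy = pd.DataFrame({
    "id": [1, 4, 5, 6],
    "keyword": [None, "fire", None, "flood"],
    "location": [None, None, "Canada", None],
    "text": ["quake downtown", "forest fire", "nice weather", "river flooding"],
    "target": [1, 1, 0, 1],
})

# Proportion of each class
class_balance = toy["target"].value_counts(normalize=True)
print(class_balance)

# Fraction of missing values in the optional columns
missing_share = toy[["keyword", "location"]].isna().mean()
print(missing_share)
```

The same two calls can be run on `df` directly once it is loaded.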


## Cleaning

In [193]:
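def clean(tweet):
    # Drop a truncated word fragment ("wor...") together with the link that follows it
    tweet = re.sub(r" \w{1,3}\.{3} http\S*", " ", tweet)
    # Drop any remaining URLs
    tweet = re.sub(r"http\S*", " ", tweet)
    # Drop mis-encoded sequences: "Û" plus the character on either side of it
    tweet = re.sub(r".Û.", "", tweet)
    # Drop leftover ellipses
    tweet = re.sub(r"\.{3}", "", tweet)
    # Collapse all whitespace (spaces, tabs, newlines) last, so earlier removals leave no gaps
    tweet = re.sub(r"\s+", " ", tweet)

    return tweet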
    
df["text"] = df["text"].apply(clean)
df_test["text"] = df_test["text"].apply(clean)

# df["text"].to_csv("tweets.csv", index=False)
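To make the effect of the URL and whitespace rules concrete, here is a simplified sketch applied to a made-up tweet. `clean_sketch` and `sample` are hypothetical names; the function leaves out the mis-encoding rule and adds a final `strip()`, so it is not identical to `clean` above.

```python
import re

def clean_sketch(tweet):
    # Truncated word fragment followed by a link, e.g. " bur... http://t.co/abc123"
    tweet = re.sub(r" \w{1,3}\.{3} http\S*", " ", tweet)
    # Any remaining URL
    tweet = re.sub(r"http\S*", " ", tweet)
    # Leftover ellipses
    tweet = re.sub(r"\.{3}", "", tweet)
    # Collapse runs of spaces, tabs and newlines into a single space
    tweet = re.sub(r"\s+", " ", tweet)
    return tweet.strip()

sample = "Forest fire near the lake bur... http://t.co/abc123\nStay safe"
cleaned = clean_sketch(sample)
print(cleaned)  # Forest fire near the lake Stay safe
```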

## Tokenization

In [194]:
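nlp = English()
punctuation = set(string.punctuation)
stop_words = STOP_WORDS

def tokenizer(tweet):
    tokens = nlp(tweet)
    # Lowercased lemmas; "-PRON-" is the placeholder spaCy 2.x uses as the lemma of
    # pronouns, so the lowercased word itself is kept in that case
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens]
    # Drop empty strings, stop words and single punctuation marks
    tokens = [word for word in tokens if word and word not in stop_words and word not in punctuation]

    return tokens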

# vectorizer = TfidfVectorizer(tokenizer=tokenizer, ngram_range=(1,1))
vectorizer = CountVectorizer(tokenizer=tokenizer, ngram_range=(1,1))
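The two vectorizers in the cell above differ only in how they weight each token. As a rough illustration on a hypothetical two-tweet corpus (using scikit-learn's default tokenizer rather than the spaCy-based one), `CountVectorizer` stores raw counts, while `TfidfVectorizer` down-weights tokens that appear in every document:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical miniature corpus
toy_tweets = ["fire fire downtown", "traffic downtown"]

# Raw term counts per tweet
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_tweets)
vocab = count_vec.vocabulary_  # maps each token to its column index
print(sorted(vocab))
print(counts.toarray())

# TF-IDF weights: "downtown" occurs in both tweets, so it gets a lower
# weight than "traffic" within the second tweet
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_tweets).toarray()
print(weights.round(3))
```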

## Training models

For our first attempts at building a model, we went back to models previously seen in the Data Mining & Machine Learning and Big-Scale Analytics courses, without tweaking the hyperparameters too much, in order to compare how each performs on this particular dataset.

Preparing the data: the tweets are passed to the vectorizer as a list of strings, and the labels are converted to a list as well.

In [196]:
X = vectorizer.fit_transform(df["text"].values.tolist())
y = df["target"].values.tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)

X_train

<6090x17410 sparse matrix of type '<class 'numpy.int64'>'
	with 51831 stored elements in Compressed Sparse Row format>
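In the cell above, the vectorizer is fitted on all labeled tweets before the split, so the vocabulary also includes words that occur only in the validation tweets. One possible alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that the vocabulary is learned from the training portion only. The names `toy_texts`, `toy_labels` and `pipe` are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical miniature dataset (1 = real disaster, 0 = not)
toy_texts = [
    "forest fire spreading near the town",
    "earthquake shakes the city centre",
    "flood waters rising evacuate now",
    "wildfire forces residents to leave",
    "this new album is fire",
    "my phone battery died again",
    "traffic was a disaster this morning",
    "great movie about an earthquake",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping both classes in each part
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer is fitted inside the pipeline, on the training texts only
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=30, random_state=42)),
])
pipe.fit(texts_train, labels_train)

preds = pipe.predict(texts_test)
print("F1 on the toy validation split:", f1_score(labels_test, preds))
```

With so few examples the score itself means nothing; the point is only the order of operations.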

In [173]:
# vectorizer.get_feature_names()

### Random forest

In [200]:
RF = RandomForestClassifier(criterion="entropy", n_estimators=30, max_depth=200, random_state=RSEED)
RF.fit(X_train, y_train)
f1_score(y_test, RF.predict(X_test))

0.6904109589041096
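The number above is an F1 score, the harmonic mean of precision and recall for the positive class. As a small worked example on hypothetical labels, the value returned by `f1_score` can be checked against the formula $F_1 = 2TP / (2TP + FP + FN)$:

```python
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0]

# Count true positives, false positives and false negatives by hand
tp = sum(1 for t, p in zip(y_true_toy, y_pred_toy) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true_toy, y_pred_toy) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true_toy, y_pred_toy) if t == 1 and p == 0)

f1_manual = 2 * tp / (2 * tp + fp + fn)  # 4 / 7
f1_sklearn = f1_score(y_true_toy, y_pred_toy)
print(f1_manual, f1_sklearn)
```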

### XGBoost

In [None]:
xgc = xgb.XGBClassifier(n_estimators=101, max_depth=71, base_score=0.5, objective='binary:logistic', random_state=RSEED)
xgc.fit(X_train, y_train)
# Score the XGBoost model itself (not the random forest fitted earlier)
f1_score(y_test, xgc.predict(X_test))

### Neural network

## Predictions

To submit to the Kaggle competition, we compute predictions for the set of unlabeled tweets provided by the website and upload them as a CSV file containing each tweet's `id` and its predicted `target`.

In [204]:
X = vectorizer.transform(df_test["text"].values.tolist())
df_test["target"] = RF.predict(X)

df_test[["id", "target"]].to_csv("UNIL_SBB_FSP.csv", index=False)
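As a quick sanity check of the file layout, the sketch below writes a hypothetical three-row frame to an in-memory buffer instead of a file. The `target` values are placeholders, not model output.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_test, with placeholder predictions
toy_test = pd.DataFrame({
    "id": [0, 2, 3],
    "text": ["car crash", "earthquake reported", "sunny afternoon"],
})
toy_test["target"] = [1, 1, 0]

# Write only the two columns the submission needs, without the index
buffer = io.StringIO()
toy_test[["id", "target"]].to_csv(buffer, index=False)
submission_lines = buffer.getvalue().splitlines()
print(submission_lines)
```

The first line should be the header `id,target`, followed by one row per tweet.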

## Unused code

In [0]:
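from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Text_train / Text_test are not defined elsewhere in this notebook; here they are
# assumed to be a split of the raw tweet text that mirrors the split used above
Text_train, Text_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=RSEED)

# Bag-of-words counts -> TF-IDF weights -> multinomial naive Bayes
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

text_clf.fit(Text_train, y_train)
print(text_clf.score(Text_test, y_test))  # mean accuracy on the held-out tweets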

In [0]:
# First neural network, adapted from a Keras tutorial
from keras.models import Sequential
from keras import layers

# bow_train / bow_test are not defined in this notebook; they are assumed to be
# dense bag-of-words arrays (e.g. X_train.toarray() and X_test.toarray())
input_dim = bow_train.shape[1]
model = Sequential()
model.add(layers.Dense(600, input_dim=input_dim, activation='relu'))
# The hidden layers below have no activation, so they are purely linear
model.add(layers.Dense(200))
model.add(layers.Dense(150))
model.add(layers.Dense(10))
# Sigmoid output: probability that the tweet reports a real disaster
model.add(layers.Dense(1, activation='sigmoid'))

In [0]:
model.compile(loss='binary_crossentropy', 
               optimizer='adam', 
               metrics=['accuracy'])

In [0]:
model.summary()

Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_48 (Dense)             (None, 512)               9479680   
_________________________________________________________________
dense_49 (Dense)             (None, 50)                25650     
_________________________________________________________________
dense_50 (Dense)             (None, 30)                1530      
_________________________________________________________________
dense_51 (Dense)             (None, 20)                620       
_________________________________________________________________
dense_52 (Dense)             (None, 1)                 21        
Total params: 9,507,501
Trainable params: 9,507,501
Non-trainable params: 0
_________________________________________________________________


In [0]:
model.fit(bow_train, y_train, epochs=10, batch_size=50)
loss, accuracy = model.evaluate(bow_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Testing Accuracy:  0.7525


In [0]:
loss, accuracy = model.evaluate(bow_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Testing Accuracy:  0.7663
