# Real or Not? NLP with Disaster Tweets

In this tutorial we look at the dataset provided by the Kaggle competition, process it with a very simple technique, build a machine learning model, and evaluate it.

In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction
from sklearn.metrics import classification_report, precision_score, f1_score
from sklearn.naive_bayes import BernoulliNB

## Data Preprocessing

We first load the training data and inspect its first rows. The columns keyword, location and text can serve as inputs to a model that maps them to the target.

In [2]:
train = pd.read_csv('train.csv', sep=",")
print('Shape of the training data:', train.shape)
train.head(5)


In [3]:
test = pd.read_csv("test.csv")
print('Shape of testing data:', test.shape)
test.head(5)


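For now we encode each tweet in the text column as a vector of word counts over a shared vocabulary (the Bernoulli model used later only looks at whether a word is present or absent), then we extract the targets and the test features. (We can use keyword and location as additional features ...)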

In [4]:
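# Fit on the train and test text together so both share one vocabulary.
vectorizer = feature_extraction.text.TfidfVectorizer()
vectorizer.fit(pd.concat([train['text'], test['text']]))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
vocabulary = vectorizer.get_feature_names_out().tolist()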

In [5]:
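count_vectorizer = feature_extraction.text.CountVectorizer(vocabulary=vocabulary)
# The vocabulary is fixed, so transform() is enough. The result stays sparse:
# BernoulliNB accepts sparse input and a dense 7613 x 27922 array is large.
X = count_vectorizer.transform(train["text"])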
print('Shape of X: ', X.shape)

Shape of X:  (7613, 27922)
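To make the encoding step concrete, here is a minimal sketch on a toy corpus. The two sentences and the three-word vocabulary are made up for illustration and are not taken from the competition data. It shows how a CountVectorizer with a fixed vocabulary turns each text into a row of word counts, and how those counts reduce to presence or absence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, not from the competition data.
toy_docs = ["fire fire in the forest", "the forest is calm"]

# A fixed vocabulary pins the columns and their order.
toy_vocabulary = ["calm", "fire", "forest"]
toy_vectorizer = CountVectorizer(vocabulary=toy_vocabulary)

# With a fixed vocabulary, transform() works without a prior fit().
toy_counts = toy_vectorizer.transform(toy_docs).toarray()
print(toy_counts)      # one row per document, one column per vocabulary word

# Presence/absence view of the same matrix.
toy_presence = (toy_counts > 0).astype(int)
print(toy_presence)
```

Words outside the vocabulary ("in", "the", "is") are simply ignored, which is why the train and test matrices in this notebook end up with the same number of columns.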


In [6]:
Y = np.asarray(train['target'])
print('Number of the targets: ', Y.shape[0])

Number of the targets:  7613


Similarly, we encode the text column of the test data using the same vocabulary.

In [7]:
# Only transform the test text; the vocabulary is already fixed.
X_test = count_vectorizer.transform(test["text"])
print('Shape of X_test:', X_test.shape)

Shape of X_test: (3263, 27922)


## Model/Training/Evaluation

### Naive Bayes Classifier

As a model we choose a Bernoulli Naive Bayes classifier, which treats each word as a binary feature that is either present in or absent from a tweet.

In [8]:
clf_nbc = BernoulliNB(alpha=5)
clf_nbc.fit(X, Y)
y_pred = clf_nbc.predict(X)
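For intuition about what the classifier does, the following is a small from-scratch sketch of Bernoulli Naive Bayes in NumPy. It is an illustration under simplifying assumptions, not scikit-learn's actual implementation; the function names and the toy data are hypothetical. Each class keeps a smoothed probability that a word is present, and a prediction picks the class with the highest log posterior.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed per-class word-presence probabilities."""
    X = (np.asarray(X) > 0).astype(float)   # counts -> presence/absence
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # P(word present | class), with additive smoothing alpha
    p = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return classes, log_prior, p

def predict_bernoulli_nb(model, X):
    """Return the class with the highest log posterior for each row."""
    classes, log_prior, p = model
    X = (np.asarray(X) > 0).astype(float)
    # Present words contribute log p, absent words contribute log(1 - p).
    log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
    return classes[np.argmax(log_lik + log_prior, axis=1)]

# Toy data: the first feature is present only in class 1.
X_toy = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_toy = [1, 1, 0, 0]
toy_model = fit_bernoulli_nb(X_toy, y_toy, alpha=1.0)
toy_preds = predict_bernoulli_nb(toy_model, [[3, 0], [0, 1]])
print(toy_preds)
```

A larger alpha, such as the alpha=5 used above, pulls the word probabilities toward 0.5, which dampens the influence of rare words.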

Now we compute the precision and the F1-score. Note that y_pred was computed on the training data itself, so these scores are optimistic.

In [9]:
precision = precision_score(Y, y_pred)
print('Precision: ', precision)
fscore = f1_score(Y, y_pred)
print('F1-score: ', fscore)

Precision:  0.995598868280415
F1-score:  0.9817110973341601
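As a reminder of what these two metrics measure, here is a short pure-Python sketch that computes precision, recall and F1 from true and predicted labels. The helper name and the toy labels are hypothetical and only serve to illustrate the formulas.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 2 false negatives.
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

Precision alone can look very high when the model predicts few positives, which is why the competition metric, F1, also accounts for recall.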


Now let's predict the targets for the test data.

In [10]:
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission["target"] = clf_nbc.predict(X_test)

In [11]:
sample_submission.head(5)

|   | id | target |
|---|----|--------|
| 0 | 0  | 0      |
| 1 | 2  | 0      |
| 2 | 3  | 1      |
| 3 | 9  | 1      |
| 4 | 11 | 1      |


In [12]:
sample_submission.to_csv("submission.csv", index=False)

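The submission achieves an F1-score of 0.76073, well below the score measured on the training data above, so the training metrics overstate how well the model generalizes. Still, a Naive Bayes classifier is a good start.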
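The gap between the training F1-score (about 0.98) and the reported 0.76073 suggests estimating performance on data the model has not seen before submitting. One possible way to do this, sketched below, is a simple hold-out split. The helper name holdout_f1, the 80/20 split and the synthetic stand-in data are assumptions for illustration; in the notebook you would pass X and Y instead.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

def holdout_f1(X, Y, alpha=5, test_size=0.2, seed=0):
    """Fit on one part of the data and report F1 on both parts."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, Y, test_size=test_size, random_state=seed, stratify=Y
    )
    clf = BernoulliNB(alpha=alpha).fit(X_tr, y_tr)
    return f1_score(y_tr, clf.predict(X_tr)), f1_score(y_val, clf.predict(X_val))

# Synthetic stand-in data (hypothetical): 200 samples, 50 binary features.
# The label follows the first feature, with roughly 10% of labels flipped.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 50))
flip = rng.random(200) < 0.1
Y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

train_f1, val_f1 = holdout_f1(X_demo, Y_demo)
print('Train F1:', train_f1)
print('Validation F1:', val_f1)
```

The validation score is usually a more realistic estimate of leaderboard performance than the training score, and it can also guide the choice of alpha.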