In [15]:
import pandas as pd
import numpy as np
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [6]:
sample_data = pd.read_csv("sample_submission.csv")
test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

## Data inspection

In [8]:
train_df

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


When target = 1, the tweet is about a real disaster.

When target = 0, the tweet is NOT about a real disaster.

In [9]:
train_df[train_df["target"] == 0]["text"].values[1] 

'I love fruits'

In [12]:
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

### Building vectors

The words contained in each tweet are a good indicator of whether it is about a real disaster or not (not entirely correct, but a good start).

We'll use scikit-learn's "CountVectorizer" to count the words in each tweet and turn them into data our machine learning model can process.

Note: A "vector" is a set of numbers that a machine learning model can work with.
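
To make the idea of a count vector concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the toy tweets and the helper names `tokenize` and `to_vector` are made up for illustration, and the tokenizer only approximates CountVectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # This approximates CountVectorizer's default behaviour.
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical toy tweets, not taken from train.csv.
toy_tweets = ["Forest fire near La Ronge", "I love fruits", "fire fire everywhere"]

# The vocabulary is the sorted set of all tokens seen in the toy tweets.
vocab = sorted({tok for tweet in toy_tweets for tok in tokenize(tweet)})

def to_vector(text):
    # One count per vocabulary token, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]

vectors = [to_vector(tweet) for tweet in toy_tweets]
print(vocab)
print(vectors[2])
```

Each tweet becomes a list with one slot per vocabulary token, so every tweet ends up with a vector of the same length no matter how many words it contains.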

In [16]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [17]:
## we use .todense() here because these vectors are "sparse" (only non zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above tells us that there are 54 unique words (or tokens) in the first 5 tweets, and that the first tweet contains only some of them: the nonzero counts above mark the tokens that appear in the first tweet.
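
If you want to see which tokens those nonzero positions correspond to, the fitted vectorizer's `vocabulary_` attribute maps each token to its column index. A small sketch on two made-up tweets (the names `toy_tweets`, `index_to_token`, and `tokens_in_first` are just for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, used instead of train_df so the example is self-contained.
toy_tweets = ["Forest fire near La Ronge", "I love fruits"]

vec = CountVectorizer()
X = vec.fit_transform(toy_tweets)

# vocabulary_ maps token -> column index; invert it to look tokens up by column.
index_to_token = {int(idx): tok for tok, idx in vec.vocabulary_.items()}

# Columns with a nonzero count in the first row are the tokens in the first tweet.
first_row = X[0].toarray().ravel()
tokens_in_first = [index_to_token[int(i)] for i in first_row.nonzero()[0]]

print(X.shape)
print(tokens_in_first)
```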

Below are vectors for all the tweets.

In [19]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure that the tokens in the train
## vectors are the only ones mapped to the test vectors, i.e. that the train and test vectors use the same set of tokens.

test_vectors = count_vectorizer.transform(test_df["text"])
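
The difference between .fit_transform() and .transform() is easiest to see on a tiny example. This is a sketch with made-up texts (`train_texts` and `test_texts` are not from the competition data): tokens that only appear in the test text are simply ignored, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts standing in for train_df["text"] and test_df["text"].
train_texts = ["forest fire near town", "i love fruits"]
test_texts = ["fire near the volcano"]

vec = CountVectorizer()
train_X = vec.fit_transform(train_texts)  # learns the vocabulary, then counts
test_X = vec.transform(test_texts)        # counts using the learned vocabulary only

print(train_X.shape, test_X.shape)
# "the" and "volcano" were never seen during fitting, so they get no column.
print("volcano" in vec.vocabulary_)
```

Calling .fit_transform() on the test texts instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.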

### Our model
As we mentioned above, we think the words contained in each tweet are a good indicator of whether it's about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a linear connection. So let's build a linear model and see!

In [20]:
## Our vectors are really big so we want to push our model's weights toward 0 without completely discounting different words
## ridge regression is a good way to do this
clf = linear_model.RidgeClassifier()
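
To get a feel for what "pushing weights toward 0" means, here is a rough sketch on a tiny hand-made count matrix (the numbers are invented, not derived from the tweets). RidgeClassifier's `alpha` parameter controls the strength of the penalty, and a larger `alpha` should give a smaller overall weight vector.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Invented toy count matrix: 6 "tweets", 4 "tokens".
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 1],
    [3, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 1, 1, 0],
    [0, 3, 0, 0],
], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])

# Fit the same model with a weak and a strong penalty and compare weight sizes.
norms = {}
for alpha in (0.1, 10.0):
    model = RidgeClassifier(alpha=alpha).fit(X, y)
    norms[alpha] = float(np.linalg.norm(model.coef_))

print(norms)
```

The notebook keeps the default `alpha=1.0`; trying other values is one of the simpler things to experiment with.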

Let's test our model and see how well it does on the training data. For this we'll use cross-validation, where we train on a portion of the known data and then validate on the rest. If we do this several times (with different portions) we can get a good idea of how a particular model or method performs.

We'll use F1 as the metric, since that is the metric for the competition.
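
Spelled out by hand, cross-validation looks roughly like the loop below. This is only a sketch on invented toy texts: it fits the vectorizer inside each fold (a slightly stricter setup than vectorizing once up front) and uses StratifiedKFold, which is what scikit-learn picks for classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Invented toy data: 6 disaster-like and 6 everyday texts.
texts = [
    "forest fire near the town", "earthquake hits the city", "flood destroys homes",
    "wildfire evacuation ordered", "storm damage across the coast", "fire spreads through forest",
    "i love fruits", "what a lovely day", "my new phone is great",
    "going to the beach today", "this song is fire", "love this pizza place",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

fold_scores = []
splitter = StratifiedKFold(n_splits=3)
# The first argument to split() is only used for its length here.
for train_idx, val_idx in splitter.split(np.zeros(len(texts)), labels):
    vec = CountVectorizer()
    X_train = vec.fit_transform([texts[i] for i in train_idx])
    X_val = vec.transform([texts[i] for i in val_idx])

    model = RidgeClassifier().fit(X_train, [labels[i] for i in train_idx])
    preds = model.predict(X_val)
    fold_scores.append(f1_score([labels[i] for i in val_idx], preds, zero_division=0))

print(fold_scores)
```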
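
As a reminder of what F1 measures, it is the harmonic mean of precision and recall. A minimal sketch in plain Python with made-up labels (`y_true` and `y_pred` are illustrative only):

```python
# Made-up true labels and predictions for 8 tweets.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count true positives, false positives, and false negatives.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of the tweets flagged as disasters, how many were
recall = tp / (tp + fn)     # of the real disaster tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```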

In [22]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.60355649, 0.57580105, 0.64485082])
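
A quick way to summarize the three fold scores is to look at their mean and spread. The values below are copied from the output above so the snippet runs on its own; `fold_f1` is just a name chosen for this example.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
fold_f1 = np.array([0.60355649, 0.57580105, 0.64485082])

mean_f1 = fold_f1.mean()
std_f1 = fold_f1.std()

print(round(mean_f1, 3), round(std_f1, 3))
```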

The above scores aren't terrible! The three folds average about 0.61, with the best fold at about 0.64, so our assumption should score somewhere in that range on the leaderboard. There are lots of ways to potentially improve on this (TF-IDF, LSA, LSTM/RNN).
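
As one example of the TF-IDF idea, scikit-learn's TfidfVectorizer can be swapped in for CountVectorizer and wrapped in a pipeline so that cross_val_score refits it on each fold. This is a sketch on invented toy texts, so the scores it prints say nothing about how it would do on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy data: 6 disaster-like and 6 everyday texts.
texts = [
    "forest fire near the town", "earthquake hits the city", "flood destroys homes",
    "wildfire evacuation ordered", "storm damage across the coast", "fire spreads through forest",
    "i love fruits", "what a lovely day", "my new phone is great",
    "going to the beach today", "this song is fire", "love this pizza place",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline vectorizes and classifies in one step, refit on every fold.
tfidf_clf = make_pipeline(TfidfVectorizer(), RidgeClassifier())
tfidf_scores = cross_val_score(tfidf_clf, texts, labels, cv=3, scoring="f1")

print(tfidf_scores)
```

On the competition data, the same pipeline could be passed train_df["text"] and train_df["target"] directly, and the result compared against the scores above.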

## Predictions on the test set and building the submission for the competition

In [24]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

In [25]:
sample_data["target"] = clf.predict(test_vectors)

In [27]:
sample_data.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,0
4,11,1


In [30]:
sample_data.to_csv("submission.csv", index=False)
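
Before uploading, it can be worth a quick sanity check that the file has the expected shape. The sketch below uses a small stand-in DataFrame with the five rows shown above and an in-memory buffer instead of "submission.csv", so it runs without the competition files; the specific checks are suggestions, not competition requirements we have verified.

```python
import io
import pandas as pd

# Stand-in for sample_data after predictions, using the rows from sample_data.head().
submission = pd.DataFrame({"id": [0, 2, 3, 9, 11], "target": [0, 1, 1, 0, 1]})

# Write to an in-memory buffer the same way the notebook writes the CSV file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and check the columns, the label values, and the ids.
check = pd.read_csv(buffer)
columns_ok = list(check.columns) == ["id", "target"]
labels_ok = set(check["target"].unique()) <= {0, 1}
ids_ok = bool(check["id"].is_unique)

print(columns_ok, labels_ok, ids_ok)
```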