## 01 - Ridge regression

In this notebook we take a linear approach to the challenge: we build token count vectors from the text of the tweets and use a ridge classifier to predict whether each tweet refers to a real disaster.

This solution is based on [this notebook](https://www.kaggle.com/philculliton/nlp-getting-started-tutorial).

We start by importing the packages we are going to use:

In [79]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

We load the datasets:

In [80]:
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

train_data

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


We generate count vectors for both the train and test data using scikit-learn's **CountVectorizer**. Note that we fit the vectorizer only on the train text and then use it to transform both sets, so any test token that never appears in the train data is ignored. The train dataset contains 21637 unique tokens, so each tweet is represented by a vector of that length holding its token counts.

In [81]:
cvec = CountVectorizer()
train_vectors = cvec.fit_transform(train_data['text'])
test_vectors = cvec.transform(test_data['text'])

train_vectors.shape, test_vectors.shape

((7613, 21637), (3263, 21637))
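
To make the fit-on-train, transform-both pattern concrete, here is a minimal pure-Python sketch of what a count vectorizer does. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are hypothetical, the tokenizer only approximates the default behaviour (lowercasing, tokens of two or more word characters), and it returns dense lists where **CountVectorizer** returns a sparse matrix.

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's defaults (lowercase, 2+ word characters).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token in docs to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Return one dense count vector per document; tokens outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Hypothetical toy data standing in for the tweets.
toy_train = ["Forest fire near the forest", "Fire alarm test"]
toy_test = ["Flood near the river"]

vocab = fit_vocabulary(toy_train)        # learned from the train text only
train_rows = transform(toy_train, vocab)
test_rows = transform(toy_test, vocab)   # "flood" and "river" have no column

print(list(vocab))
print(train_rows)
print(test_rows)
```

Because the vocabulary comes from the train text alone, both matrices share the same columns, which is why the two shapes above agree in their second dimension.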

#### Ridge regression

It is reasonable to assume that some words (or sets of words) correlate directly with the target variable, i.e. whether the tweet is about a real disaster or not. We therefore try a linear model. Since the vectors are high-dimensional (21637 features for only 7613 tweets), an unregularized linear fit would overfit easily, so we use ridge regression: it adds a penalty on the squared 2-norm of the coefficient vector, which shrinks the coefficients toward zero. Unlike an L1 penalty, it does not generally set coefficients exactly to zero.

In [82]:
rc = RidgeClassifier()
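
The shrinkage effect of the 2-norm penalty can be illustrated with the closed-form ridge solution on a tiny made-up count matrix. This is a sketch under stated assumptions, not what `RidgeClassifier` runs internally: the function `ridge_fit`, the toy data, and the omission of an intercept are all choices made for illustration. The classes are encoded as +1 / -1 and a tweet is predicted positive when its score is above zero.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical toy count matrix: 4 "tweets", 3 "tokens".
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 2.0, 0.0]])
labels = np.array([1, 1, 0, 0])
y = 2 * labels - 1                       # encode the classes as +1 / -1

w_weak = ridge_fit(X, y, alpha=0.01)     # almost no penalty
w_strong = ridge_fit(X, y, alpha=100.0)  # heavy penalty

# A larger alpha gives a coefficient vector with a smaller 2-norm.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))

# Classify by the sign of the linear score.
pred = (X @ w_weak > 0).astype(int)
print(pred)
```

The `alpha` argument plays the role of the regularization strength; `RidgeClassifier()` with no arguments uses `alpha=1.0`.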

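We estimate the classifier's performance with repeated 10-fold cross-validation (10 splits, 3 repeats), scored with the F1 metric. Because `RepeatedKFold` is created without a `random_state`, the exact score varies slightly between runs.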

In [83]:
cv = RepeatedKFold(n_splits=10, n_repeats=3)
score = np.mean(cross_val_score(rc, train_vectors, train_data['target'], cv=cv, scoring='f1'))
score

0.7369436688893063
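
For reference, the reported number is the mean of one F1 score per fold. The sketch below shows how a single binary F1 score is computed and how many fits the scheme above performs. The function name `f1_score_binary` and the toy labels are hypothetical, chosen only for illustration.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_score_binary(y_true, y_pred)
print(toy_f1)

# 10 splits repeated 3 times means the model is fitted and scored 30 times.
n_splits, n_repeats = 10, 3
n_fits = n_splits * n_repeats
print(n_fits)
```

Averaging over 30 folds rather than 10 reduces the dependence of the estimate on one particular random partition of the data.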

We then train the classifier on the whole train dataset:

In [84]:
rc.fit(train_vectors, train_data['target'])

RidgeClassifier()

Finally, we generate the predictions for submission:

In [85]:
test_pred = rc.predict(test_vectors)

output = pd.DataFrame({'id': test_data['id'], 'target': test_pred})
output.to_csv('predictions/ridge-regression.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
