<h2>Load the Data</h2>

Load the cleaned data and split 

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm

train = pd.read_csv('data/train_cleaned.csv')
X_train, X_val, y_train, y_val = train_test_split(
    train['clean_text'], train['target'], test_size=0.2, shuffle=True, random_state=0)

<h2>LightGBM</h2>

First, we have to featurize our text. I'll use TfidfVectorizer for this

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
train_data_features = tfidf.fit_transform(X_train)
val_data_features = tfidf.transform(X_val)

Now, let's do the data transformation required for LightGBM

In [3]:
train_set = lightgbm.Dataset(train_data_features, label=y_train)
val_set = lightgbm.Dataset(val_data_features, label=y_val)

In [4]:
params = {
    'learning_rate': 0.03,
    'boosting_type': 'gbdt', # gradient boosting decision tree
    'objective': 'binary', # binary classifiction problem
    'metric': 'binary_logloss', # for binary classification
    'num_leaves': 20, # number of leaves in a full tree
    'min_data': 50, # minimum number of records a leaf may have. default = 20
    #'max_depth': 10 # maximum depth of tree; decrease if overfitting
}

lgb_model = lightgbm.train(params, train_set, 500)

In [5]:
from sklearn.metrics import accuracy_score
import numpy as np

lgb_probs = lgb_model.predict(val_data_features, 500)

for thresh in np.arange(0.3, 0.5, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, accuracy_score(y_val, (lgb_probs>thresh).astype(int))))

F1 score at threshold 0.3 is 0.5843729481286933
F1 score at threshold 0.31 is 0.5869993434011819
F1 score at threshold 0.32 is 0.5896257386736704
F1 score at threshold 0.33 is 0.5935653315824031
F1 score at threshold 0.34 is 0.706500328299409
F1 score at threshold 0.35 is 0.7091267235718975
F1 score at threshold 0.36 is 0.7097833223900197
F1 score at threshold 0.37 is 0.7124097176625082
F1 score at threshold 0.38 is 0.7130663164806303
F1 score at threshold 0.39 is 0.7137229152987524
F1 score at threshold 0.4 is 0.7124097176625082
F1 score at threshold 0.41 is 0.7156927117531189
F1 score at threshold 0.42 is 0.716349310571241
F1 score at threshold 0.43 is 0.7202889034799738
F1 score at threshold 0.44 is 0.7222586999343401
F1 score at threshold 0.45 is 0.7229152987524623
F1 score at threshold 0.46 is 0.721602101116218
F1 score at threshold 0.47 is 0.7222586999343401
F1 score at threshold 0.48 is 0.7222586999343401
F1 score at threshold 0.49 is 0.7242284963887065


Appears, that the threshold for classifying as non-disaster vs disaster tweet is 0.45. 

In [13]:
(y_val== (lgb_probs>0.45)).mean()

0.7229152987524623

Not as good of a performance as the logistic regression model.  Perhaps, a blend of the two will improve performance since the two models are pretty different and they may have learned differently and can combine to give a better performance

<h2>Blended Logistic Regression and LightGBM model</h2>

<h3>Logistic Regression</h3>

Best results from the baseline model were achieved using binarized logistic regression

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

bv = CountVectorizer(max_features=50000, ngram_range=(1, 2))
train_data_features = bv.fit_transform(X_train)
val_data_features = bv.transform(X_val)

In [15]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(C=0.5)
logistic.fit(train_data_features.sign(), y_train)

LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [23]:
logistic_probs = logistic.predict_proba(val_data_features.sign())[:, 1]

for thresh in np.arange(0.3, 0.5, 0.01):
    thresh = np.round(thresh, 2)
    print("Accuracy score at threshold {0} is {1}".format(thresh, accuracy_score(y_val, (logistic_probs>thresh).astype(int))))

Accuracy score at threshold 0.3 is 0.7747866053841103
Accuracy score at threshold 0.31 is 0.7754432042022325
Accuracy score at threshold 0.32 is 0.7820091923834537
Accuracy score at threshold 0.33 is 0.788575180564675
Accuracy score at threshold 0.34 is 0.7957977675640184
Accuracy score at threshold 0.35 is 0.7971109652002626
Accuracy score at threshold 0.36 is 0.7997373604727511
Accuracy score at threshold 0.37 is 0.8023637557452397
Accuracy score at threshold 0.38 is 0.8003939592908733
Accuracy score at threshold 0.39 is 0.8036769533814839
Accuracy score at threshold 0.4 is 0.8056467498358503
Accuracy score at threshold 0.41 is 0.8063033486539725
Accuracy score at threshold 0.42 is 0.8128693368351937
Accuracy score at threshold 0.43 is 0.8122127380170716
Accuracy score at threshold 0.44 is 0.81483913328956
Accuracy score at threshold 0.45 is 0.8181221273801708
Accuracy score at threshold 0.46 is 0.814182534471438
Accuracy score at threshold 0.47 is 0.8128693368351937
Accuracy score a

In [17]:
logistic_probs = logistic.predict_proba(val_data_features)[:, 1]
(y_val == (logistic_probs>0.45)).mean()

0.8174655285620486

In [19]:
logistic_probs, lgb_probs

(array([0.11109575, 0.40588825, 0.4438563 , ..., 0.69375798, 0.10121786,
        0.54541374]),
 array([0.19468519, 0.48497119, 0.33974634, ..., 0.85849911, 0.33974634,
        0.33974634]))

In [33]:
for i in np.arange(0, 1, 0.1):
    y_val_probs = (logistic_probs*i) + (lgb_probs*abs(1-i))
    print('Accuracy score with weight ratio of logistic model:LGB model::{0}:{1} is {2}'.format(
        round(i, 1), round(abs(1-i),1), (y_val==(y_val_probs>0.45)).mean()))

Accuracy score with weight ratio of logistic model:LGB model::0.0:1.0 is 0.7229152987524623
Accuracy score with weight ratio of logistic model:LGB model::0.1:0.9 is 0.7275114904793172
Accuracy score with weight ratio of logistic model:LGB model::0.2:0.8 is 0.7426132632961261
Accuracy score with weight ratio of logistic model:LGB model::0.3:0.7 is 0.7688772160210111
Accuracy score with weight ratio of logistic model:LGB model::0.4:0.6 is 0.7806959947472094
Accuracy score with weight ratio of logistic model:LGB model::0.5:0.5 is 0.7951411687458962
Accuracy score with weight ratio of logistic model:LGB model::0.6:0.4 is 0.8003939592908733
Accuracy score with weight ratio of logistic model:LGB model::0.7:0.3 is 0.8036769533814839
Accuracy score with weight ratio of logistic model:LGB model::0.8:0.2 is 0.8108995403808273
Accuracy score with weight ratio of logistic model:LGB model::0.9:0.1 is 0.8154957321076822


Better off just using the logistic regression model clearly