# To_Vaccine_or_not_to_Vaccine_It's_not_a_Question
The aim of this challenge is to predict whether a tweet is about vaccination or not.

## Model Pipeline
- ### Exploratory Data Analysis
  - Checked the count of missing values in the training data
  - Checked the sample count for each agreement level
  - Removed samples with agreement levels less than `60%` since they make up only about `2%` of the entire data
  - Removed duplicate tweets
  - Preprocessed training and test text

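- ### Feature Engineering
  - Converted the training and test sets into sentence embeddings by averaging pretrained word vectors loaded in word2vec format (an earlier `glove`-based vectorizer is kept commented out in the notebook)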

## Methodology
- Scaled word embeddings matrix using standard scaling
- Used grid search with 5-fold stratified cross-validation and `negative root mean squared error` as the scoring metric to explore hyperparameter combinations for each model
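
The two steps above can be combined in a single scikit-learn `Pipeline`, so that the scaler is fitted on the training folds only during cross-validation. This is a minimal sketch under assumptions, not the notebook's exact code: `X_demo` and `y_demo` are synthetic stand-ins for the embedding matrix and labels, and the small parameter grid is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (n_samples, embedding_dim) sentence-embedding matrix
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 8)).astype(np.float32)
y_demo = rng.choice([-1.0, 0.0, 1.0], size=150)

# With the scaler inside the pipeline, it is re-fitted on each training fold only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Step-prefixed names route each parameter to the right pipeline step
param_grid = {"model__n_estimators": [10, 20], "model__max_depth": [2, 4]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best RMSE:", -search.best_score_)
print("Best parameters:", search.best_params_)
```

Scaling the whole matrix once before the grid search would let statistics from the validation folds leak into training; keeping the scaler as a pipeline step avoids that.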

Below is a table showing the best cross-validation score (RMSE) for each model:
<table>
    <tr>
        <th>Model</th>
        <th>Best RMSE</th>
    </tr>
    <tr>
        <td>RandomForestRegressor</td>
        <td>0.585</td>
    </tr>
    <tr>
        <td>XGBRegressor</td>
        <td>0.586</td>
    </tr>
</table>

Best Public Leaderboard Mean Squared Error: ***0.612***
<br>
Rank: ***105***

In [153]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

from xgboost import XGBClassifier, XGBRegressor

import os
import random

from gensim.models import KeyedVectors

In [114]:
## getting the data
data_path = "x__data"
df_train = pd.read_csv(f"{data_path}/train.csv")
df_test = pd.read_csv(f"{data_path}/test.csv")
sample_submission = pd.read_csv(f"{data_path}/samplesubmission.csv")

## Taking a look at training data

In [115]:
df_train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
0,CL1KWCMY,Me &amp; The Big Homie meanboy3000 #MEANBOY #M...,0.0,1.0
1,E3303EME,I'm 100% thinking of devoting my career to pro...,1.0,1.0
2,M4IVFSMS,"#whatcausesautism VACCINES, DO NOT VACCINATE Y...",-1.0,1.0
3,1DR6ROZ4,I mean if they immunize my kid with something ...,-1.0,1.0
4,J77ENIIE,Thanks to <user> Catch me performing at La Nui...,0.0,1.0


## Checking for missing values

In [116]:
## checking for missing
df_train.isna().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

## Checking agreement count
Only about `2%` of the tweets have an agreement below `60%`.

In [117]:
df_train["agreement"].value_counts(normalize=True)

agreement
1.000000    0.586659
0.666667    0.389439
0.333333    0.023902
Name: proportion, dtype: float64

## Number of samples per class
One sample has an invalid label (`0.666667`) and should be removed before proceeding.

In [118]:
df_train["label"].value_counts()

label
 0.000000    4908
 1.000000    4053
-1.000000    1038
 0.666667       1
Name: count, dtype: int64
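
The cells that follow filter on `agreement` only, so here is one possible way to drop rows whose label is missing or invalid explicitly. It is a minimal sketch on a hypothetical miniature frame; `toy`, `valid_labels`, and `clean` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame, with one invalid and one missing label
toy = pd.DataFrame({
    "safe_text": ["a", "b", "c", "d", "e"],
    "label": [0.0, 1.0, -1.0, 0.666667, np.nan],
    "agreement": [1.0, 1.0, 0.666667, 1.0, np.nan],
})

# Keep only rows whose label is one of the three expected classes;
# NaN never matches, so missing labels are dropped by the same filter
valid_labels = [-1.0, 0.0, 1.0]
clean = toy[toy["label"].isin(valid_labels)].reset_index(drop=True)

print(clean["label"].value_counts())
```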

#### Choosing to train only on rows that have above `60%` agreement

In [119]:
df_train = df_train[df_train["agreement"] > 0.6]

In [120]:
df_train["safe_text"].value_counts().shape

(9426,)

In [121]:
df_train.shape

(9760, 4)

## Removing the duplicates
Some tweets appear more than once (9,760 rows but only 9,426 unique texts), so the duplicates are removed.

In [122]:
df_train = df_train.drop_duplicates(subset="safe_text")

In [123]:
## sanity check
df_train.shape

(9426, 4)

## Looking at a random text sample

In [124]:
df_train.sample(n=5)["safe_text"].values

array(['<user> Immunities, Baby... Immunities 😎',
       'In just over a week I will be working with colleagues in the #Dominicanrepublic to improve #immunization rates and #nutrition for children.',
       '#CoconutWater is amazing! #health #healing #healthy #immunity #digestion #longevity #wellness #natural… <url>',
       '<user> oh I saw a couple that said essentially " no one cares, vaccinate your kids" idk I found those funny haha',
       "If you are choosing not to vaccinate yourself or family you're a seditious baby killer. #fb"],
      dtype=object)

## Function to preprocess text
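This function removes mentions, URLs, encoded symbols such as `&amp;`, timestamps, placeholders such as `<user>`, and the symbols `-_.+#`. Hashtag words are kept; only the `#` character is stripped.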

In [125]:
import re

## function for preprocessing text
def process_text(text):
    ## patterns to remove (raw strings so the backslashes reach the regex engine unchanged)
    rem_pat_1 = r"([@]|https?:)\S*" ## mentions and urls
    rem_pat_2 = r"&\S+;" ## encoded symbols. eg. &amp;
    rem_pat_3 = r"\[\d+:\d+.+\]" ## removing timestamp. eg. [01:04 UTC]
    rem_pat_4 = r"[\-_.+#]" ## to remove symbols (make sure to bring last to avoid affecting first two patterns)
    rem_pat_5 = r"<[^<>]*>" ## placeholders. eg. <user>, <url>
    combined_rem_pat = f"({rem_pat_1})|({rem_pat_2})|({rem_pat_3})|({rem_pat_4})|({rem_pat_5})"

    text = re.sub(combined_rem_pat, "", text) ## removing text that match patterns
    text = text.strip() ## removing leading and trailing white spaces
    text = text.lower() ## lowercasing

    return text

## function for tokenizing of string
def tokenize(text):
    return re.split(r"\s+", text)

## function to remove stop words
## `stopwords` must be supplied by the caller; the notebook does not define one
def remove_stopwords(token_list, stopwords=frozenset()):
    return [word for word in token_list if word not in stopwords]

def full_text_process(text):
    text = process_text(text)
    # text = tokenize(text)
    # text = remove_stopwords(text)
    return text


In [127]:
process_text(df_test.iloc[11].values[1])

"study of 13 million kids reveals vaccines aren't associated with autism"

In [128]:
df_test.iloc[11].values[1]

"<user> <user> <user> Study of 1.3 Million Kids Reveals Vaccines Aren't Associated with Autism <url>"
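
The output above shows `1.3 Million` becoming `13 million`, because the symbol pattern removes every `.` including decimal points. If that is not desired, one option is to keep a `.` only when it sits between two digits. The sketch below compares the two; `keep_decimals` is a hypothetical variant, not a pattern used in the notebook.

```python
import re

# Symbol pattern as used in process_text: every "." is removed, decimal points included
strip_all = r"[\-_.+#]"

# Hypothetical variant: remove a "." unless it has a digit on both sides
keep_decimals = r"[\-_+#]|\.(?!\d)|(?<!\d)\."

sample = "study of 1.3 million kids. #vaccines"
merged = re.sub(strip_all, "", sample)
preserved = re.sub(keep_decimals, "", sample)

print(merged)     # study of 13 million kids vaccines
print(preserved)  # study of 1.3 million kids vaccines
```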

## Class to convert text into sentence embeddings using pretrained word2vec-format embeddings

In [129]:
# ## creating datatransformers
# class GloveVectorizer(BaseEstimator, TransformerMixin):
#     def __init__(self, filepath=".", encoding="utf-8"):
#         self.filepath = filepath
#         self.encoding = encoding

#     def fitX(self, X):
#         self.unique_words = set()
#         self.vectors = []
#         self.word2idx = {}

#         ## getting unique words
#         for text in X:
#             self.unique_words |= set(text.split())

#         ## building word to index dictionary and vector matrix
#         idx = 0
#         with open(self.filepath, encoding=self.encoding) as  file:
#             for line in file:
#                 line = line.split()
#                 word = line[0]
#                 if word in self.unique_words:
#                     self.word2idx[word] = idx
#                     self.vectors.append(line[1:])
#                     idx+=1

#         self.vectors = np.array(self.vectors, dtype=np.float32)

#     def fitY(self, y):
#         ## creating integer labels for each category
#         unique = {x for x in y}
#         self.label2int = dict([(v,i) for i,v in enumerate(unique)])


#     def transformX(self, X):
#         N = len(X)
#         transformed_x = np.zeros((N, self.vectors.shape[1]), dtype=np.float32)
#         for i in range(N):
#             mat = []
#             line = X[i].lower().split()
#             for word in line:
#                 if word in self.word2idx:
#                     mat.append(self.vectors[self.word2idx[word]])

#             if len(mat) > 0: transformed_x[i] = np.mean(mat, axis=0)
#             else: print(f"Sentence at index:{i} has no word in the vector dictionary")

#         return np.array(transformed_x)

#     def transformY(self, y):
#         return np.array([self.label2int[word] for word in y])
        

#     def fit(self, X, y=None):
#         self.fitX(X)

#         if y is not None: self.fitY(y)

#         return self
    

#     def transform(self, X, y=None):
#         if y is not None: return self.transformX(X), self.transformY(y)
#         else: return self.transformX(X)

#     def fit_transform(self, X, y=None):
#         self.fit(X,y)
#         return self.transform(X,y)


## creating datatransformers
class Word2vecVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, filepath=".", is_binary=True):
        self.filepath = filepath
        self.is_binary = is_binary

    def fitX(self, X):
        self.unique_words = set()
        self.keyed_vector = None
        self.word2idx = {}

        ## getting unique words
        for text in X:
            self.unique_words |= set(text.split())

        ## getting keyed vector
        self.keyed_vector = KeyedVectors.load_word2vec_format(self.filepath,
                                                              binary=self.is_binary)
        

    def fitY(self, y):
        ## creating integer labels for each category
        unique = {x for x in y}
        self.label2int = dict([(v,i) for i,v in enumerate(unique)])


    def transformX(self, X):
        N = len(X)
        V = self.keyed_vector.vector_size  ## embedding dimension
        transformed_x = np.zeros((N, V), dtype=np.float32)
        for i in range(N):
            mat = []
            line = X[i].lower().split()
            for word in line:
                try:
                    mat.append(self.keyed_vector.get_vector(word))
                except KeyError:
                    ## word is not in the embedding vocabulary
                    pass

            ## sentence vector = mean of its word vectors (left as zeros if none are found)
            if len(mat) > 0: transformed_x[i] = np.mean(mat, axis=0)
            else: print(f"Sentence at index:{i} has no word in the vector dictionary")

        return transformed_x

    def transformY(self, y):
        return np.array([self.label2int[word] for word in y])
        

    def fit(self, X, y=None):
        self.fitX(X)

        if y is not None: self.fitY(y)

        return self
    

    def transform(self, X, y=None):
        if y is not None: return self.transformX(X), self.transformY(y)
        else: return self.transformX(X)

    def fit_transform(self, X, y=None):
        self.fit(X,y)
        return self.transform(X,y)
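
The core of `transformX` is mean pooling: each sentence is represented by the average of the vectors of its known words, and by a zero vector when no word is known. The sketch below illustrates this with a hypothetical 3-dimensional toy vocabulary in place of the pretrained vectors; `toy_vectors` and `mean_pool` are illustrative names only.

```python
import numpy as np

# Hypothetical 3-dimensional toy vocabulary standing in for the pretrained vectors
toy_vectors = {
    "vaccines": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "work": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

def mean_pool(sentence, vectors, dim=3):
    """Average the vectors of known words; return zeros when none are known."""
    found = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)

known = mean_pool("Vaccines really work", toy_vectors)
unknown = mean_pool("zzz qqq", toy_vectors)
print(known)    # mean of the "vaccines" and "work" vectors
print(unknown)  # all zeros: no word found
```

Out-of-vocabulary words are skipped entirely, so a sentence made up only of unknown tokens maps to the zero vector, which is what the "has no word in the vector dictionary" messages report.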

In [130]:
## getting the training and testing processed data
## train
train_text = df_train["safe_text"].apply(lambda x: full_text_process(x)).values
train_target = df_train["label"].values

In [131]:
random.choice(train_text)

"parents needa get their kids vaccinated so they don't catch these hands"

In [132]:
random_state = 42
# X_train, X_test, y_train, y_test = train_test_split(train_text, train_target, test_size=0.3, shuffle=True, random_state=random_state)

In [136]:
word2vec_path = "<path>"
vectorizer = Word2vecVectorizer(word2vec_path)

In [137]:
vectorizer.fit(train_text)

In [138]:
X = vectorizer.transform(train_text)
y = train_target

Sentence at index:133 has no word in the vector dictionary
Sentence at index:432 has no word in the vector dictionary
Sentence at index:554 has no word in the vector dictionary
Sentence at index:756 has no word in the vector dictionary
Sentence at index:1081 has no word in the vector dictionary
Sentence at index:1256 has no word in the vector dictionary
Sentence at index:1358 has no word in the vector dictionary
Sentence at index:2064 has no word in the vector dictionary
Sentence at index:2406 has no word in the vector dictionary
Sentence at index:2444 has no word in the vector dictionary
Sentence at index:2784 has no word in the vector dictionary
Sentence at index:3045 has no word in the vector dictionary
Sentence at index:3061 has no word in the vector dictionary
Sentence at index:3572 has no word in the vector dictionary
Sentence at index:3648 has no word in the vector dictionary
Sentence at index:4182 has no word in the vector dictionary
Sentence at index:4213 has no word in the ve

### GridSearchCV

In [139]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

##### Linear Regression

In [151]:
params = {
    "fit_intercept": [True]
}
grid_lr = GridSearchCV(LinearRegression(), params, cv=kf, scoring="neg_root_mean_squared_error", verbose=10)
grid_lr.fit(X,y)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START fit_intercept=True..........................................
[CV 1/5; 1/1] END ..........fit_intercept=True;, score=-0.580 total time=   0.1s
[CV 2/5; 1/1] START fit_intercept=True..........................................
[CV 2/5; 1/1] END ..........fit_intercept=True;, score=-0.589 total time=   0.0s
[CV 3/5; 1/1] START fit_intercept=True..........................................
[CV 3/5; 1/1] END ..........fit_intercept=True;, score=-0.576 total time=   0.0s
[CV 4/5; 1/1] START fit_intercept=True..........................................
[CV 4/5; 1/1] END ..........fit_intercept=True;, score=-0.578 total time=   0.0s
[CV 5/5; 1/1] START fit_intercept=True..........................................
[CV 5/5; 1/1] END ..........fit_intercept=True;, score=-0.588 total time=   0.0s


##### SVM

In [154]:
params = {
    "C": [0.5, 1.0, 1.3]
}
grid_svr = GridSearchCV(SVR(), params, cv=kf, scoring="neg_root_mean_squared_error", verbose=10)
grid_svr.fit(X,y)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5; 1/3] START C=0.5.......................................................
[CV 1/5; 1/3] END .......................C=0.5;, score=-0.578 total time=  13.7s
[CV 2/5; 1/3] START C=0.5.......................................................
[CV 2/5; 1/3] END .......................C=0.5;, score=-0.567 total time=  16.4s
[CV 3/5; 1/3] START C=0.5.......................................................
[CV 3/5; 1/3] END .......................C=0.5;, score=-0.557 total time=  16.2s
[CV 4/5; 1/3] START C=0.5.......................................................
[CV 4/5; 1/3] END .......................C=0.5;, score=-0.558 total time=  16.3s
[CV 5/5; 1/3] START C=0.5.......................................................
[CV 5/5; 1/3] END .......................C=0.5;, score=-0.568 total time=  16.5s
[CV 1/5; 2/3] START C=1.0.......................................................
[CV 1/5; 2/3] END .......................C=1.0;, 

#### RandomForestRegressor

In [140]:
params = {
    "n_estimators": [200, 300],
    "max_depth": [ 8, 10],
}

grid = GridSearchCV(RandomForestRegressor(), params, cv=kf, scoring="neg_root_mean_squared_error", verbose=10)
grid.fit(X,y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START max_depth=8, n_estimators=200...............................


KeyboardInterrupt: 

In [105]:
print("Best Score:", grid.best_score_)
print("Best Parameters:", grid.best_params_)

Best Score: -0.5849319372968924
Best Parameters: {'max_depth': 10, 'n_estimators': 300}
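
scikit-learn scorers follow a higher-is-better convention, which is why the RMSE is reported with a negative sign; flipping the sign gives the usual error value. The following is a small sketch of the metric itself on made-up numbers; `rmse`, `y_true`, and `y_pred` are illustrative and not taken from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up labels and predictions, for illustration only
y_true = [1.0, 0.0, -1.0, 0.0]
y_pred = [0.5, 0.0, -0.5, 1.0]

score = rmse(y_true, y_pred)
neg_score = -score  # the form reported by scoring="neg_root_mean_squared_error"

print("RMSE:", score)
print("Negated RMSE:", neg_score)
```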


#### ExtraTreesRegressor

In [35]:
params = {
    "n_estimators": [600, 700],
    "max_depth": [16, 18],
}

grid_ex = GridSearchCV(ExtraTreesRegressor(), params, cv=kf, scoring="neg_root_mean_squared_error", verbose=10)
grid_ex.fit(X,y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START max_depth=16, n_estimators=600..............................
[CV 1/5; 1/4] END max_depth=16, n_estimators=600;, score=-0.574 total time= 5.5min
[CV 2/5; 1/4] START max_depth=16, n_estimators=600..............................
[CV 2/5; 1/4] END max_depth=16, n_estimators=600;, score=-0.578 total time= 5.8min
[CV 3/5; 1/4] START max_depth=16, n_estimators=600..............................
[CV 3/5; 1/4] END max_depth=16, n_estimators=600;, score=-0.566 total time= 5.8min
[CV 4/5; 1/4] START max_depth=16, n_estimators=600..............................
[CV 4/5; 1/4] END max_depth=16, n_estimators=600;, score=-0.567 total time= 5.6min
[CV 5/5; 1/4] START max_depth=16, n_estimators=600..............................
[CV 5/5; 1/4] END max_depth=16, n_estimators=600;, score=-0.574 total time= 5.2min
[CV 1/5; 2/4] START max_depth=16, n_estimators=700..............................
[CV 1/5; 2/4] END max_depth=16, n_estim

In [37]:
print("Best Score:", grid_ex.best_score_)
print("Best Parameters:", grid_ex.best_params_)

Best Score: -0.571872860695505
Best Parameters: {'max_depth': 16, 'n_estimators': 600}


#### XGBRegressor

In [141]:
params = {
    "n_estimators": [200, 300],
    "max_depth": [6, 8, 10],
    "learning_rate": [0.05, 0.1]
}

grid_xgb = GridSearchCV(XGBRegressor(), params, cv=kf, scoring="neg_root_mean_squared_error", verbose=10)
grid_xgb.fit(X,y)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV 1/5; 1/12] START learning_rate=0.05, max_depth=6, n_estimators=200..........
[CV 1/5; 1/12] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=-0.563 total time=  30.3s
[CV 2/5; 1/12] START learning_rate=0.05, max_depth=6, n_estimators=200..........
[CV 2/5; 1/12] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=-0.569 total time=  28.8s
[CV 3/5; 1/12] START learning_rate=0.05, max_depth=6, n_estimators=200..........
[CV 3/5; 1/12] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=-0.558 total time=  28.0s
[CV 4/5; 1/12] START learning_rate=0.05, max_depth=6, n_estimators=200..........
[CV 4/5; 1/12] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=-0.553 total time=  31.3s
[CV 5/5; 1/12] START learning_rate=0.05, max_depth=6, n_estimators=200..........
[CV 5/5; 1/12] END learning_rate=0.05, max_depth=6, n_estimators=200;, score=-0.563 total time=  27.5s
[CV 1/5; 2/12] STAR

In [142]:
print("Best Score:", grid_xgb.best_score_)
print("Best Parameters:", grid_xgb.best_params_)

Best Score: -0.560653777424504
Best Parameters: {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 300}


### Prediction

In [143]:
## filling nulls
df_test = df_test.fillna("")

In [144]:
ids = df_test["tweet_id"]

In [145]:
def predict_on_dataframe(model, data):
    '''Make a prediction on a dataframe'''
    pred_text = data["safe_text"].apply(lambda x: full_text_process(x)).values
    pred_text = vectorizer.transform(pred_text)
    return model.predict(pred_text)

In [146]:
# pred_text = df_test["text"].apply(lambda x: full_text_process(x)).values
# pred_text = vectorizer.transform(pred_text)
preds = predict_on_dataframe(grid_xgb, df_test)

Sentence at index:751 has no word in the vector dictionary
Sentence at index:932 has no word in the vector dictionary
Sentence at index:1317 has no word in the vector dictionary
Sentence at index:1697 has no word in the vector dictionary
Sentence at index:1833 has no word in the vector dictionary
Sentence at index:1917 has no word in the vector dictionary
Sentence at index:2024 has no word in the vector dictionary
Sentence at index:2510 has no word in the vector dictionary
Sentence at index:3304 has no word in the vector dictionary
Sentence at index:3461 has no word in the vector dictionary
Sentence at index:3802 has no word in the vector dictionary
Sentence at index:4356 has no word in the vector dictionary
Sentence at index:4517 has no word in the vector dictionary
Sentence at index:4835 has no word in the vector dictionary



### Submitting

In [147]:
# submission = pd.DataFrame({"tweet_id": ids, "target":np.clip(preds, -1, 1)})

# if not os.path.exists("x__submissions"):
#     os.mkdir("x__submissions")

# save_name = "word2vec_xgb.csv"
# submission.to_csv(f"x__submissions/{save_name}", index=False)
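
The commented-out cell above can be sketched as follows. Since the models are regressors, their raw outputs can fall outside the label range, so they are clipped to `[-1, 1]` before saving. The ids and predictions here are hypothetical, and the file is written under the system temporary directory (an assumption made so the sketch leaves the project folder untouched).

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical ids and raw regressor outputs, used only for illustration
demo_ids = pd.Series(["ID_A", "ID_B", "ID_C"], name="tweet_id")
demo_preds = np.array([-1.4, 0.25, 1.7])

# Clip predictions to the valid label range before building the submission
submission = pd.DataFrame({"tweet_id": demo_ids, "target": np.clip(demo_preds, -1, 1)})

# makedirs with exist_ok=True replaces the separate os.path.exists check
out_dir = os.path.join(tempfile.gettempdir(), "x__submissions")
os.makedirs(out_dir, exist_ok=True)

out_path = os.path.join(out_dir, "word2vec_xgb.csv")
submission.to_csv(out_path, index=False)
print(submission)
```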