## Python notebook to train an XGBoost model on both the original and augmented datasets

Here, we use the xgboost package, but through its XGBClassifier, which is designed to integrate with sklearn at the cost of some functionality.
[Source](https://www.datacamp.com/tutorial/xgboost-in-python)

In [120]:
#Import necessary modules
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import pandas as pd
import numpy as np
import nltk
import xgboost as xgb

## Configure NLTK if applicable

In [None]:
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('vader_lexicon')

In [97]:
# obtained from https://gist.github.com/susanli2016/d35def30b99f0e2f56c0e01e19ad0878
def gettop_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [98]:
def get_len(row):
    return len(row['Text'])

def get_len_aug(row):
    return len(row['text'])

In [112]:
# Perform feature engineering on the dataset
def feature_engineering(dataset, aug):
    if (aug):
        dataset['los'] = dataset.apply(get_len_aug, axis=1)
    else:
        dataset['los'] = dataset.apply(get_len, axis=1)

    # bow_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))
    # col_name = ""
    # if aug:
    #     col_name = 'text'
    # else:
    #     col_name = 'Text'
    # text_col = dataset[col_name]
    # new_col = bow_vectorizer.fit_transform(text_col)
    # dataset['bowvec'] = new_col
    return dataset

In [127]:
train_original = pd.read_csv('dataset/fulltrain.csv')
train_augment = pd.read_csv('dataset/merged_final_df_with_topics.csv')
# X = feature_engineering(train_augment, aug=True)['los'].to_frame()

bow_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))
X = bow_vectorizer.fit_transform(train_augment['text'])

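# Labels for the original dataset (column 'Label')
y = train_original['Label']
# Labels for the augmented dataset (column 'label'), matching the X built above.
# Comment this out (and build X from train_original['Text'] instead) if we are just using the original dataset
y = train_augment['label']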

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
train_augment[:10]

| Unnamed: 0 | label | text | has_swear_word | severity | processed_text | topic |
|---|---|---|---|---|---|---|
| 0 | 1 | A little less than a decade ago, hockey fans w... | False | 0.0 | ['little', 'less', 'decade', 'ago', 'hockey', ... | 0 |
| 1 | 1 | The writers of the HBO series The Sopranos too... | False | 0.0 | ['writers', 'hbo', 'series', 'sopranos', 'took... | 0 |
| 2 | 1 | Despite claims from the TV news outlet to offe... | False | 0.0 | ['despite', 'claims', 'tv', 'news', 'outlet', ... | 0 |
| 3 | 1 | After receiving 'subpar' service and experienc... | False | 0.0 | ['receiving', 'subpar', 'service', 'experienci... | 4 |
| 4 | 1 | After watching his beloved Seattle Mariners pr... | False | 0.0 | ['watching', 'beloved', 'seattle', 'mariners',... | 0 |
| 5 | 1 | At a cafeteria-table press conference Monday, ... | False | 0.0 | ['cafeteriatable', 'press', 'conference', 'mon... | 4 |
| 6 | 1 | Stunned shock and dismay were just a few of th... | False | 0.0 | ['stunned', 'shock', 'dismay', 'reactions', 'b... | 0 |
| 7 | 1 | Speaking with reporters before a game Monday, ... | True | 1.0 | ['speaking', 'reporters', 'game', 'monday', 'l... | 0 |
| 8 | 1 | Sports journalists and television crews were p... | False | 0.0 | ['sports', 'journalists', 'television', 'crews... | 0 |
| 9 | 1 | SALEM, VAF;or the eighth straight world-histor... | False | 0.0 | ['salem', 'vafor', 'eighth', 'straight', 'worl... | 0 |
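To make the bag-of-words step more concrete, here is a minimal, dependency-free sketch of the idea behind `CountVectorizer`: build a vocabulary, then count how often each vocabulary word occurs in each document. The helper names `build_vocabulary` and `to_count_vector` and the toy sentences are made up for illustration. The real `CountVectorizer` also applies its own tokenization and stop-word removal and returns a sparse matrix, so this is only an approximation of what the cell above does.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index (hypothetical helper)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(doc, vocabulary):
    """Return the token counts of one document, ordered by column index."""
    counts = Counter(doc.lower().split())
    # The vocabulary dict was built in index order, so iterating it yields columns 0, 1, 2, ...
    return [counts.get(tok, 0) for tok in vocabulary]


# Toy corpus, not taken from the dataset
toy_docs = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(toy_docs)
vectors = [to_count_vector(doc, vocabulary) for doc in toy_docs]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one word. With a real corpus most entries are zero, which is why the vectorized data is stored and printed as sparse `(row, column) count` entries.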


In [116]:
#Run this if you need to modify X_train again for some reason
# X_train = train_augment['topic']
print(X_train)

  (0, 140220)	1
  (0, 98223)	2
  (0, 59205)	1
  (0, 44663)	1
  (0, 170333)	1
  (0, 167613)	1
  (0, 102293)	1
  (0, 81264)	1
  (0, 163155)	1
  (0, 55795)	1
  (0, 224152)	1
  (0, 60277)	1
  (0, 12802)	1
  (0, 153431)	1
  (0, 130173)	2
  (0, 145188)	1
  (0, 124628)	1
  (0, 91928)	1
  (0, 191622)	2
  (0, 78997)	1
  (0, 124464)	1
  (0, 10156)	1
  (0, 50779)	1
  (0, 221905)	1
  (0, 154211)	1
  :	:
  (47835, 76236)	3
  (47835, 49310)	1
  (47835, 160789)	1
  (47835, 185804)	1
  (47835, 121871)	1
  (47835, 6562)	2
  (47835, 6934)	1
  (47835, 25588)	2
  (47835, 120446)	1
  (47835, 56900)	1
  (47835, 69594)	1
  (47835, 72703)	2
  (47835, 57801)	2
  (47835, 196753)	1
  (47835, 7593)	1
  (47835, 186782)	1
  (47835, 162230)	1
  (47835, 213581)	1
  (47835, 157287)	1
  (47835, 33923)	1
  (47835, 101862)	1
  (47835, 171282)	1
  (47835, 209486)	2
  (47835, 82674)	1
  (47835, 129886)	1


In [108]:
# model = GradientBoostingClassifier() # not xgboost, but just to test that a model works
# X_train_formatted = np.array(X_train).reshape(-1, 1)
# # X_train_formatted = np.array(X_train)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# f1_score(y_test, y_pred, average='macro')




In [128]:
# The labels have four classes, so use a multiclass objective rather than 'binary:logistic'
xgb_classifier = xgb.XGBClassifier(n_estimators=100, objective='multi:softprob', tree_method='hist', eta=0.1, max_depth=3, enable_categorical=True)

# The label encoder is necessary as XGBClassifier expects labels [0,1,2,3] but we have [1,2,3,4]
le = LabelEncoder()
y_train = le.fit_transform(y_train)
# Reuse the mapping learned from the training labels instead of refitting on the test labels
y_test = le.transform(y_test)

xgb_classifier.fit(X_train, y_train)
y_pred = xgb_classifier.predict(X_test)
print("f1 score: " + str(f1_score(y_test, y_pred, average='macro')))
print("accuracy score: " + str(accuracy_score(y_test, y_pred)))
print("precision score: " + str(precision_score(y_test, y_pred, average='macro')))

f1 score: 0.8731360239806389
accuracy score: 0.8759093569696463
precision score: 0.8742293658809233
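A note on the label encoding step: the mapping from the labels [1,2,3,4] to [0,1,2,3] should be learned once from the training labels and then reused for the test labels. The sketch below mimics what `LabelEncoder` does (sort the distinct labels and number them from 0) in plain Python, using hypothetical helpers `fit_label_mapping` and `transform_labels` and made-up labels. It shows that refitting on a test split is only safe when every class appears in that split.

```python
def fit_label_mapping(labels):
    """Learn a label -> index mapping from the sorted distinct labels (hypothetical helper)."""
    classes = sorted(set(labels))
    return {label: idx for idx, label in enumerate(classes)}


def transform_labels(labels, mapping):
    """Apply a previously learned mapping to a list of labels."""
    return [mapping[label] for label in labels]


# Made-up labels: class 1 happens to be missing from the test split
train_labels = [1, 2, 3, 4, 2, 1]
test_labels = [2, 3, 4]

mapping = fit_label_mapping(train_labels)
encoded_test = transform_labels(test_labels, mapping)
refit_test = transform_labels(test_labels, fit_label_mapping(test_labels))

print(mapping)       # {1: 0, 2: 1, 3: 2, 4: 3}
print(encoded_test)  # [1, 2, 3]  (consistent with the training mapping)
print(refit_test)    # [0, 1, 2]  (shifted, because the encoder was refit)
```

With a random 20% split of a large dataset, every class very likely appears in both splits, so the two approaches would usually agree. Reusing the fitted mapping simply removes the dependency on that assumption.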


### Notes

XGBoost classifier (sklearn version)<br>
As a baseline for the "random" F1 score on the original dataset, I trained the XGBoost model on an X_train that consisted only of the length of each text string. The resulting F1 score was 0.06729. This is expected; it simply means that any meaningful feature should produce an F1 score higher than this.<br>
Doing the same for the augmented dataset yields an F1 score of 0.1285. This increase does not necessarily mean that the augmented dataset is "better"; rather, 0.1285 is the baseline that any meaningful feature needs to beat on that dataset.
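To give a feel for why an uninformative feature yields such a low macro F1, here is a small hand-rolled version of the macro-averaged F1 score (the per-class F1 scores averaged with equal weight, which is what `average='macro'` requests). The function `macro_f1` and the toy labels are illustrative assumptions and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (illustrative helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, perfectly balanced four-class labels
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
always_zero = [0] * len(y_true)  # a model that learned nothing and always predicts class 0

print(macro_f1(y_true, always_zero))  # close to 0.1
print(macro_f1(y_true, y_true))       # 1.0
```

In this toy case, a model that always predicts a single class scores 0.4 on that class and 0 on the other three, for a macro F1 of about 0.1. The exact floor on a real dataset depends on its class balance.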

We then converted each text into a bag-of-words vector and trained the XGBoost classifier on it. For the original dataset, the F1 score was 0.8765, with an accuracy of 0.8851 and a precision of 0.8935. For the augmented dataset, the F1 score was 0.8731, with an accuracy of 0.8759 and a precision of 0.8742. Although the model does slightly worse on the augmented dataset, the discrepancy is minimal.<br>
Here, I think it is safe to conclude that, for this particular feature, adding the new rows to the dataset does not meaningfully affect the performance of the XGBoost classifier.
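For reference, the size of the gap between the two runs can be computed directly from the scores reported above. This is only arithmetic on those numbers; the dictionary names are made up for the example.

```python
# Scores reported in the notes above for the bag-of-words feature
original = {"f1": 0.8765, "accuracy": 0.8851, "precision": 0.8935}
augmented = {"f1": 0.8731, "accuracy": 0.8759, "precision": 0.8742}

# Positive values mean the original dataset scored higher
deltas = {metric: round(original[metric] - augmented[metric], 4) for metric in original}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.4f}")
# f1: +0.0034
# accuracy: +0.0092
# precision: +0.0193
```

All three gaps are under two percentage points. Since each number comes from a single train/test split, differences of this size may partly reflect the particular split rather than the dataset itself.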