## Python notebook to train an XGBoost model on both the original and augmented datasets

Here, we use the `xgboost` package through its scikit-learn-compatible classifier, `XGBClassifier`, rather than its native API. This makes it easy to combine with sklearn tooling, at the cost of some functionality.
[Source](https://www.datacamp.com/tutorial/xgboost-in-python)

In [53]:
#Import necessary modules
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
import nltk
import xgboost as xgb

## Configure NLTK if applicable

In [None]:
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('vader_lexicon')

In [21]:
# # obtained from https://gist.github.com/susanli2016/d35def30b99f0e2f56c0e01e19ad0878
# def gettop_n_bigram(corpus, n=None):
#     vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
#     bag_of_words = vec.transform(corpus)
#     sum_words = bag_of_words.sum(axis=0)
#     words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
#     words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
#     return words_freq[:n]

In [54]:
def get_len(row):
    return len(row['Text'])

def get_len_aug(row):
    return len(row['text'])

In [55]:
# Perform feature engineering on the dataset
def feature_engineering(dataset, aug):
    if (aug):
        dataset['los'] = dataset.apply(get_len_aug, axis=1)
    else:
        dataset['los'] = dataset.apply(get_len, axis=1)

    # bow_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))
    # col_name = ""
    # if aug:
    #     col_name = 'text'
    # else:
    #     col_name = 'Text'
    # text_col = dataset[col_name]
    # new_col = bow_vectorizer.fit_transform(text_col)
    # dataset['bowvec'] = new_col
    return dataset
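
The two helper functions differ only in the column name, so the length feature could also be computed with a single column-parameterised helper. The sketch below is one possible alternative, not the notebook's method; `add_length_feature` and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd

def add_length_feature(dataset, text_col):
    """Return a copy of `dataset` with a 'los' column holding each text's character length."""
    out = dataset.copy()
    out['los'] = out[text_col].str.len()    # vectorised equivalent of a row-wise len()
    return out

# Toy frames mirroring the two column-name conventions ('Text' vs 'text')
toy_original = pd.DataFrame({'Text': ['short', 'a longer sentence']})
toy_augmented = pd.DataFrame({'text': ['abc', '']})

print(add_length_feature(toy_original, 'Text'))
print(add_length_feature(toy_augmented, 'text'))
```

Unlike the in-place version above, this helper leaves the input frame unchanged. Note that `.str.len()` yields `NaN` for missing values, whereas calling `len()` on them would raise an error.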

In [56]:
train_original = pd.read_csv('dataset/fulltrain.csv')
train_augment = pd.read_csv('dataset/merged_final_df_with_topics_new.csv')
test_original = pd.read_csv('dataset/balancedtest.csv')
test_augment = pd.read_csv('dataset/test_final_with_topics_new.csv')

bow_vectorizer_ori = CountVectorizer(stop_words='english', ngram_range=(1, 1))
bow_vectorizer_aug = CountVectorizer(stop_words='english', ngram_range=(1, 1))
# X_ori = bow_vectorizer.fit_transform(train_original['Text'])
# X_aug = bow_vectorizer.fit_transform(train_augment['text'])

# y_ori = train_original['Label']
# y_aug = train_augment['label']

X_train_ori = bow_vectorizer_ori.fit_transform(train_original['Text'])
X_train_aug = bow_vectorizer_aug.fit_transform(train_augment['text'])

y_train_ori = train_original['Label']
y_train_aug = train_augment['label']

X_test_ori = bow_vectorizer_ori.transform(test_original['Text'])
X_test_aug = bow_vectorizer_aug.transform(test_augment['text'])

y_test_ori = test_original['Label']
y_test_aug = test_augment['label']

# X_train_ori, X_test_ori, y_train_ori, y_test_ori = train_test_split(X_ori, y_ori, test_size=0.20, random_state=42)
# X_train_aug, X_test_aug, y_train_aug, y_test_aug = train_test_split(X_aug, y_aug, test_size=0.20, random_state=42)
train_augment[:10]

| Unnamed: 0 | label | text | has_swear_word | severity | processed_text | topic |
|---|---|---|---|---|---|---|
| 0 | 1 | A little less than a decade ago, hockey fans w... | False | 0.0 | ['little', 'less', 'decade', 'ago', 'hockey', ... | 0 |
| 1 | 1 | The writers of the HBO series The Sopranos too... | False | 0.0 | ['writers', 'hbo', 'series', 'sopranos', 'took... | 4 |
| 2 | 1 | Despite claims from the TV news outlet to offe... | False | 0.0 | ['despite', 'claims', 'tv', 'news', 'outlet', ... | 4 |
| 3 | 1 | After receiving 'subpar' service and experienc... | False | 0.0 | ['receiving', 'subpar', 'service', 'experienci... | 0 |
| 4 | 1 | After watching his beloved Seattle Mariners pr... | False | 0.0 | ['watching', 'beloved', 'seattle', 'mariners',... | 0 |
| 5 | 1 | At a cafeteria-table press conference Monday, ... | False | 0.0 | ['cafeteriatable', 'press', 'conference', 'mon... | 0 |
| 6 | 1 | Stunned shock and dismay were just a few of th... | False | 0.0 | ['stunned', 'shock', 'dismay', 'reactions', 'b... | 0 |
| 7 | 1 | Speaking with reporters before a game Monday, ... | True | 1.0 | ['speaking', 'reporters', 'game', 'monday', 'l... | 0 |
| 8 | 1 | Sports journalists and television crews were p... | False | 0.0 | ['sports', 'journalists', 'television', 'crews... | 0 |
| 9 | 1 | SALEM, VAF;or the eighth straight world-histor... | False | 0.0 | ['salem', 'vafor', 'eighth', 'straight', 'worl... | 0 |


In [57]:
#Run this if you need to modify X_train again for some reason
# X_train = train_augment['topic']
print(X_train_ori.shape)
print(X_test_ori.shape)
# print(train_original[:5])
# print(test_original[:5])

(48854, 229285)
(3000, 229285)
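
Both matrices share the same number of columns because the test texts are transformed with the vocabulary learned from the training texts. A small sketch of that behaviour follows; the toy sentences and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ['the cat sat on the mat', 'dogs chase cats']
test_texts = ['the dog sat', 'a zebra appeared']

vec = CountVectorizer(stop_words='english', ngram_range=(1, 1))
X_train_toy = vec.fit_transform(train_texts)   # learns the vocabulary from the training texts
X_test_toy = vec.transform(test_texts)         # reuses it; words unseen in training are ignored

print(X_train_toy.shape, X_test_toy.shape)     # same number of columns in both
print(sorted(vec.vocabulary_))                 # vocabulary comes from train_texts only
```

Calling `fit_transform` on the test texts instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.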


In [108]:
# model = GradientBoostingClassifier() # not xgboost, but just to test that a model works
# X_train_formatted = np.array(X_train).reshape(-1, 1)
# # X_train_formatted = np.array(X_train)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# f1_score(y_test, y_pred, average='macro')




In [80]:
# Hyperparameter tuning for xgb model

# First, we do a train_test_split on the original dataset
bow_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))
X_ori = bow_vectorizer.fit_transform(train_original['Text'])
y_ori = train_original['Label']
X_train, X_test, y_train, y_test = train_test_split(X_ori, y_ori, test_size=0.20, random_state=42)

# The labels have four classes, so a multiclass objective is stated explicitly
xgb_classifier = xgb.XGBClassifier(objective='multi:softprob', tree_method='hist', enable_categorical=True, max_depth=3, n_estimators=500)
# params = {'n_estimators': [1, 10, 25, 50, 100, 200], 'eta': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'max_depth': [2, 3, 4, 5, 10, 20]}
params = {'n_estimators': [700, 800 ,900]}

le = LabelEncoder()
y_train = le.fit_transform(y_train)

cv = RepeatedKFold(n_splits=2, n_repeats=1)
clf = GridSearchCV(xgb_classifier, params, cv=cv, verbose=2)
xgb_opt = clf.fit(X_train, y_train)

print("optimal_param: ", xgb_opt.best_estimator_.get_params()['n_estimators'])
# print("optimal_param: ", xgb_opt.best_estimator_.get_params()['eta'])

Fitting 2 folds for each of 3 candidates, totalling 6 fits
[CV] END ...................................n_estimators=700; total time= 1.6min
[CV] END ...................................n_estimators=700; total time= 1.6min
[CV] END ...................................n_estimators=800; total time= 1.8min
[CV] END ...................................n_estimators=800; total time= 1.8min
[CV] END ...................................n_estimators=900; total time= 2.0min
[CV] END ...................................n_estimators=900; total time= 2.1min
optimal_param:  700


### Hyperparameter tuning notes
For the XGBoost model, we decided to stick to tree ensemble methods. Hence, there are 3 main hyperparameters to tune:  
1. n_estimators: Number of estimators used in the ensemble model.
1. eta: The learning rate.
1. max_depth: The maximum depth of each individual tree model.

Hyperparameter tuning was done on an 80:20 train/test split of the original training data, using sklearn's GridSearchCV to automate the search.
In the end, the hyperparameter values we arrived at were:
1. n_estimators: 700
1. eta: 0.5
1. max_depth: 5

In [51]:
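# Four classes, so a multiclass objective is stated explicitly.
# Note: n_estimators, eta and max_depth are left at their defaults here, not the tuned values listed above.
xgb_classifier = xgb.XGBClassifier(objective='multi:softprob', tree_method='hist', enable_categorical=True)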

# The label encoder is necessary as XGBClassifier expects labels [0,1,2,3] but we have [1,2,3,4].
# Fit each encoder on the training labels only, then reuse the same mapping for the test labels.
le_ori = LabelEncoder()
y_train_ori = le_ori.fit_transform(y_train_ori)
y_test_ori = le_ori.transform(y_test_ori)
le_aug = LabelEncoder()
y_train_aug = le_aug.fit_transform(y_train_aug)
y_test_aug = le_aug.transform(y_test_aug)

# Original dataset
xgb_classifier.fit(X_train_ori, y_train_ori)
y_pred_ori = xgb_classifier.predict(X_test_ori)
print('Original dataset:\n')
print(classification_report(y_test_ori, y_pred_ori))

# Augmented dataset
xgb_classifier.fit(X_train_aug, y_train_aug)
y_pred_aug = xgb_classifier.predict(X_test_aug)
print('Augmented dataset:\n')
print(classification_report(y_test_aug, y_pred_aug))

Original dataset:

              precision    recall  f1-score   support

           0       0.55      0.51      0.53       750
           1       0.65      0.30      0.41       750
           2       0.40      0.54      0.46       750
           3       0.58      0.73      0.65       750

    accuracy                           0.52      3000
   macro avg       0.55      0.52      0.51      3000
weighted avg       0.55      0.52      0.51      3000

Augmented dataset:

              precision    recall  f1-score   support

           0       0.56      0.53      0.54       750
           1       0.64      0.44      0.52       750
           2       0.50      0.46      0.48       750
           3       0.56      0.82      0.67       750

    accuracy                           0.56      3000
   macro avg       0.57      0.56      0.55      3000
weighted avg       0.57      0.56      0.55      3000
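
The class indices 0 to 3 in the reports above are the encoded labels, not the original labels 1 to 4. A minimal sketch of the mapping and how to reverse it is shown below; the toy label arrays and the `y_pred_enc` values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_train_toy = np.array([1, 2, 3, 4, 2, 1])
y_test_toy = np.array([4, 4, 1, 3])

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_toy)      # learns the mapping 1..4 -> 0..3
y_test_enc = le.transform(y_test_toy)            # applies the same mapping to the test labels

y_pred_enc = np.array([3, 2, 0, 2])              # hypothetical model predictions (encoded)
y_pred_labels = le.inverse_transform(y_pred_enc) # back to the original label values

print(le.classes_)
print(y_test_enc, y_pred_labels)
```

Using `transform` rather than `fit_transform` on the test labels guarantees that both sets share one mapping, even if a class happens to be missing from the test set.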



### Notes

XGBoost classifier (Sklearn version)
As a baseline for the "random" F1 score on the original dataset, I trained the XGBoost model on an X_train that contained only the length of each text string. The resulting F1 score was 0.06729. This is expected, and it simply means that any meaningful feature should produce an F1 score higher than this.<br>
Doing the same for the augmented dataset yields an F1 score of 0.1285. This increase does not necessarily mean that the augmented dataset is "better"; rather, it is the baseline that any meaningful feature needs to beat.
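
The mechanics of such a length-only baseline can be sketched as follows. This example uses sklearn's `GradientBoostingClassifier` as a stand-in for XGBoost and toy texts in which length is deliberately informative, so the score it prints says nothing about the real datasets; `length_feature` and all data here are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Toy texts whose lengths happen to separate the classes (illustrative only)
train_texts = ['a' * n for n in (5, 6, 50, 55, 200, 210, 5, 52, 205)]
train_labels = [0, 0, 1, 1, 2, 2, 0, 1, 2]
test_texts = ['a' * n for n in (4, 53, 198)]
test_labels = [0, 1, 2]

def length_feature(texts):
    # Single-column feature matrix: the character length of each text
    return np.array([len(t) for t in texts]).reshape(-1, 1)

model = GradientBoostingClassifier(random_state=0)   # stand-in model, not XGBoost
model.fit(length_feature(train_texts), train_labels)
y_pred = model.predict(length_feature(test_texts))

baseline_f1 = f1_score(test_labels, y_pred, average='macro')
print(baseline_f1)
```

The `reshape(-1, 1)` step matters because sklearn-style estimators expect a 2-D feature matrix even when there is only one feature.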

We then converted each text into a bag-of-words vector and trained the XGB classifier on it. The F1 score obtained was 0.8765 for the original dataset, with an accuracy of 0.8851 and a precision of 0.8935. For the augmented dataset, the F1 score was 0.8731, with an accuracy of 0.8759 and a precision of 0.8742. Although the model does slightly worse on the augmented dataset, the difference is minimal.<br>
Here, I think it is safe to conclude that, for this particular feature, adding the new rows to the dataset does not meaningfully affect the performance of the XGBoost classifier.