## RoBERTa - XGBoost - CatBoost:

RoBERTa is a transformer-based language representation model used primarily for natural language processing (NLP) tasks. XGBoost and CatBoost are gradient boosting algorithms designed for structured (tabular) data and are commonly used for regression and classification.

RoBERTa embeddings can be used as features for XGBoost or CatBoost. The text is first passed through RoBERTa to obtain a dense vector representation (embedding) for each text sample. These embeddings can then be combined with the other numerical or categorical features in the dataset and used as input to XGBoost or CatBoost.

The process can be summarized as follows:
 - Preprocess the text data and obtain RoBERTa embeddings for each text sample.
 - Combine the RoBERTa embeddings with the structured/tabular features in the dataset.
 - Use the combined feature set as input to XGBoost or CatBoost for training and prediction.
 
This hybrid approach combines RoBERTa's language representation capabilities with gradient boosting algorithms such as XGBoost or CatBoost, which suits datasets containing a mixture of text and structured features.
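The three steps listed above can be sketched with plain NumPy. This is a minimal illustration that assumes the transformer has already produced a per-token hidden-state array and an attention mask. The helper names `mean_pool` and `combine_features`, the toy shapes, and the choice of mean pooling are assumptions made for illustration, not something taken from this notebook.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per document, ignoring padded positions.

    hidden_states: (n_docs, seq_len, hidden_dim) array of token vectors.
    attention_mask: (n_docs, seq_len) array, 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

def combine_features(text_embeddings, tabular_features):
    """Place document embeddings and tabular columns side by side."""
    return np.hstack([text_embeddings, tabular_features])

# Toy stand-in for transformer output: 2 documents, 3 tokens, 4 dimensions.
hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # last token of document 0 is padding
                 [1, 1, 1]])
embeddings = mean_pool(hidden, mask)

# Hypothetical numeric features, one row per document.
tabular = np.array([[0.5, 1.0],
                    [1.5, 0.0]])
features = combine_features(embeddings, tabular)
print(features.shape)
```

The resulting `features` array has one row per document and could be passed to a gradient boosting classifier's `fit` method in place of raw text.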

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import transformers
# transformers is the Hugging Face library that provides pretrained
# transformer models and their tokenizers.
import tqdm
# tqdm provides progress bars for loops and other iterables.
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings(action='ignore', category = UserWarning)
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.utils import shuffle
from sklearn.metrics import classification_report
from catboost import CatBoostClassifier

In [2]:
true = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')
true.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [3]:
fake.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [4]:
fake['Label'] = 0
true['Label'] = 1
df = pd.concat([fake, true], ignore_index = True, sort = False)
df = shuffle(df).reset_index(drop = True)
print(df.shape)
df.head()

(44898, 5)


| | title | text | subject | date | Label |
|---|---|---|---|---|---|
| 0 | Bahrain's king issues decree reorganizing Nati... | DUBAI (Reuters) - Bahrain s King Hamad bin Isa... | worldnews | September 12, 2017 | 1 |
| 1 | GOP Cut Funds For Veterans And Mental Health ... | Republicans claim to love our military veteran... | News | February 23, 2016 | 0 |
| 2 | Maine Voters Tell Trump To Go F*ck Himself, E... | Republicans should be downright afraid to try ... | News | November 8, 2017 | 0 |
| 3 | Swastika-Covered Guy Gets Punched In The Face... | One Nazi made the mistake of leaving his safe... | News | October 20, 2017 | 0 |
| 4 | NEW VIDEO…ANTIFA Terror Group INFILTRATED…Tran... | Steven Crowder is an amazing and ALWAYS unafra... | politics | Sep 29, 2017 | 0 |


In [5]:
# Count and drop duplicate rows:
print("Number of Duplicates:", df.duplicated().sum())
df.drop_duplicates(inplace = True)
print(df.shape)

Number of Duplicates: 209
(44689, 5)


### Tokenizer Function:

In [6]:
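def func_tokenizer(tokenizer, docs):
    """Convert each document into a list of integer token IDs."""
    features = []
    for doc in tqdm.tqdm(docs, desc = 'converting documents to features'):
        # Split the text into subword tokens, then map each token to its vocabulary ID.
        tokens = tokenizer.tokenize(doc)
        ids = tokenizer.convert_tokens_to_ids(tokens)
        features.append(ids)
    return features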

X = df['text']
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

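# Initialize the RoBERTa tokenizer (only the tokenizer is loaded, not the model):
roberta_tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base-openai-detector')
roberta_train_features = func_tokenizer(roberta_tokenizer, X_train)
roberta_test_features = func_tokenizer(roberta_tokenizer, X_test)

# Pad or truncate every document to 500 token IDs. With the default settings,
# pad_sequences adds zeros at the front and truncates from the front.
# Note: these arrays hold integer token IDs, not RoBERTa embeddings.
roberta_trg = tf.keras.preprocessing.sequence.pad_sequences(roberta_train_features, maxlen = 500)
roberta_test = tf.keras.preprocessing.sequence.pad_sequences(roberta_test_features, maxlen = 500)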

converting documents to features: 100%|██████████| 35751/35751 [02:06<00:00, 281.88it/s]
converting documents to features: 100%|██████████| 8938/8938 [00:30<00:00, 292.34it/s]
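
To make the effect of `pad_sequences(..., maxlen = 500)` concrete, the sketch below re-implements its default behavior (zeros added at the front, over-long sequences truncated from the front) in NumPy. `pad_pre` is a hypothetical helper written for illustration, not the Keras function, and the toy `maxlen` of 4 is an assumption chosen to keep the output small.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` IDs of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        kept = seq[-maxlen:]           # truncation drops IDs from the front
        out[row, -len(kept):] = kept   # padding ends up at the front
    return out

# Two toy "documents": one shorter and one longer than maxlen.
toy_ids = [[5, 6], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_ids, maxlen=4)
print(padded)
```

Because front truncation discards the beginning of long articles, passing `truncating = 'post'` to `pad_sequences` would keep the opening tokens instead.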


### XGBoost Classifier:

In [7]:
xgb = XGBClassifier(n_estimators = 1000, learning_rate = 0.15, max_depth = 9,
                    eval_metric = 'auc', use_label_encoder=False, objective = 'binary:logistic')
xgb.fit(roberta_trg, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='auc', feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.15, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=9, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=1000, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [8]:
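xgb_pred = xgb.predict(roberta_test)
xgb_score = accuracy_score(y_test, xgb_pred)
# Note: roc_auc_score receives hard 0/1 predictions here, so the result equals
# balanced accuracy. Passing xgb.predict_proba(roberta_test)[:, 1] instead
# would give a threshold-independent ROC AUC.
xgb_roc = roc_auc_score(y_test, xgb_pred)
print("The accuracy of XGBoost: {} %".format(xgb_score*100))
print("The roc_auc score of XGBoost: {} %".format(xgb_roc*100))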

The accuracy of XGBoost: 98.15394942940256 %
The roc_auc score of XGBoost: 98.21985327692802 %
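
The ROC AUC above is computed from hard 0/1 predictions. The sketch below uses a small hand-made example, not this notebook's data, to show why that matters: with hard labels the AUC reduces to the average of the true-positive and true-negative rates, whereas scores keep the ranking information. `pairwise_auc` is a hypothetical helper that computes AUC as the fraction of positive/negative pairs ranked correctly, with ties counted as half.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
probs = [0.2, 0.3, 0.4, 0.9]
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholded 0/1 predictions

auc_probs = pairwise_auc(y_true, probs)
auc_hard = pairwise_auc(y_true, hard)

# Balanced accuracy: mean of true-positive rate and true-negative rate.
tpr = sum(1 for y, h in zip(y_true, hard) if y == 1 and h == 1) / sum(y_true)
tnr = sum(1 for y, h in zip(y_true, hard) if y == 0 and h == 0) / (len(y_true) - sum(y_true))
balanced_accuracy = (tpr + tnr) / 2

print(auc_probs, auc_hard, balanced_accuracy)
```

In this toy case the probabilities rank every positive above every negative, yet thresholding at 0.5 misses one positive, so the hard-label AUC is lower and matches the balanced accuracy.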


In [9]:
cr = classification_report(y_test, xgb_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      4792
           1       0.97      0.99      0.98      4146

    accuracy                           0.98      8938
   macro avg       0.98      0.98      0.98      8938
weighted avg       0.98      0.98      0.98      8938



### CatBoost Classifier:

In [10]:
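cb = CatBoostClassifier(eval_metric = 'Accuracy', iterations = 2000, learning_rate = 0.2)
cb.fit(roberta_trg, y_train, verbose = 0)
cb_pred = cb.predict(roberta_test)
cb_score = accuracy_score(y_test, cb_pred)
# Note: roc_auc_score receives hard class predictions here, so the result equals
# balanced accuracy. Passing cb.predict_proba(roberta_test)[:, 1] instead
# would give a threshold-independent ROC AUC.
cb_roc = roc_auc_score(y_test, cb_pred)
print("The accuracy of CatBoost: {} %".format(cb_score*100))
print("The roc_auc score of CatBoost: {} %".format(cb_roc*100))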

The accuracy of CatBoost: 98.07563213246812 %
The roc_auc score of CatBoost: 98.13543456009253 %


In [11]:
cr2 = classification_report(y_test, cb_pred)
print(cr2)

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      4792
           1       0.97      0.99      0.98      4146

    accuracy                           0.98      8938
   macro avg       0.98      0.98      0.98      8938
weighted avg       0.98      0.98      0.98      8938

