# RoBERTa:
**Model Size:**


RoBERTa is larger than DistilBERT: the base model has roughly twice as many parameters, and the "Large" variant is larger still. The extra capacity makes it potentially more powerful but also more expensive to train and run.
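
To make the size difference concrete, here is a rough back-of-the-envelope parameter count for a BERT-style encoder. The configuration values (vocabulary size, layer count, hidden size) are the commonly published defaults for the base RoBERTa and DistilBERT checkpoints and are assumptions here, not values read from this notebook. The helper `encoder_params` is hypothetical and ignores any task-specific head.

```python
def encoder_params(vocab, max_pos, hidden, layers, ffn, type_vocab=0):
    """Approximate parameter count of a BERT-style Transformer encoder."""
    layer_norm = 2 * hidden                      # scale + bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer

# Assumed configurations (published defaults, not taken from this notebook).
roberta_base = encoder_params(vocab=50265, max_pos=514, hidden=768,
                              layers=12, ffn=3072, type_vocab=1)
distilbert_base = encoder_params(vocab=30522, max_pos=512, hidden=768,
                                 layers=6, ffn=3072)

print(f"RoBERTa base:    ~{roberta_base / 1e6:.0f}M parameters")
print(f"DistilBERT base: ~{distilbert_base / 1e6:.0f}M parameters")
```

Under these assumptions most of the gap comes from the doubled layer count, with the larger vocabulary accounting for the rest.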


**Training Techniques:**


RoBERTa revises BERT's pre-training recipe: it removes the Next Sentence Prediction (NSP) objective, applies dynamic masking, and trains longer with larger mini-batches. These modifications often lead to improved performance.


**Training Data:**


RoBERTa is pre-trained on a much larger text corpus than BERT (about 160 GB versus 16 GB of uncompressed text). The larger corpus and the revised training recipe contribute to its robustness across NLP tasks.


**Performance:**


When it was released, RoBERTa matched or exceeded state-of-the-art results on benchmarks such as GLUE, SQuAD, and RACE. It remains a strong baseline for tasks like text classification, sentiment analysis, and named entity recognition.

# Load Libraries
**Changes:**


- Use RoBERTa instead of DistilBERT


- Fix prediction on the test data

In [1]:
!pip install keras-core --upgrade
!pip install -q keras-nlp --upgrade

# This sample uses Keras Core, the multi-backend version of Keras.
# The selected backend is TensorFlow (other supported backends are 'jax' and 'torch')
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'



In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import keras_core as keras
import keras_nlp
import seaborn as sns
import matplotlib.pyplot as plt

train_df = pd.read_csv('/kaggle/input/artificial-text-detection-homework/dev.csv')

X = train_df['Text']
y = train_df['Class'].apply(lambda x: 1 if x == 'M' else 0)



Using TensorFlow backend


# Load RoBERTa for classification

In [3]:
# Load a RoBERTa model.
preset= "roberta_base_en"

# Use a shorter sequence length.
preprocessor = keras_nlp.models.RobertaPreprocessor.from_preset(preset,
                                                                   sequence_length=160,
                                                                   name="preprocessor_4_tweets"
                                                                  )

# Pretrained classifier.
classifier = keras_nlp.models.RobertaClassifier.from_preset(preset,
                                                               preprocessor = preprocessor, 
                                                               num_classes=2)

classifier.summary()

Downloading data from https://storage.googleapis.com/keras-nlp/models/roberta_base_en/v1/vocab.json
898823/898823 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Downloading data from https://storage.googleapis.com/keras-nlp/models/roberta_base_en/v1/merges.txt
456318/456318 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Downloading data from https://storage.googleapis.com/keras-nlp/models/roberta_base_en/v1/model.h5
496436344/496436344 ━━━━━━━━━━━━━━━━━━━━ 4s 0us/step
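
The `sequence_length=160` argument means every input is truncated or padded to 160 token IDs. The sketch below illustrates that idea on plain Python lists. The helper `pad_or_truncate` is hypothetical, and the special-token IDs (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`) are the usual RoBERTa vocabulary values, assumed here rather than read from the preset.

```python
START_ID, PAD_ID, END_ID = 0, 1, 2  # assumed RoBERTa special-token IDs

def pad_or_truncate(token_ids, sequence_length):
    """Wrap token IDs in start/end markers and force a fixed length."""
    body = token_ids[: sequence_length - 2]          # leave room for <s> and </s>
    ids = [START_ID] + body + [END_ID]
    padding_mask = [1] * len(ids)                    # 1 = real token, 0 = padding
    pad = sequence_length - len(ids)
    return ids + [PAD_ID] * pad, padding_mask + [0] * pad

short_ids, short_mask = pad_or_truncate([100, 200, 300], sequence_length=8)
long_ids, long_mask = pad_or_truncate(list(range(10, 30)), sequence_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Anything beyond the limit is silently dropped, so a shorter sequence length trades some context for faster training.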


# Prepare Data

In [4]:
def remove_br_tags(text):
    return text.replace('<br />', '')

X = X.apply(remove_br_tags)
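
`remove_br_tags` only matches the exact string `<br />` and deletes it without leaving a space, so the words on either side are joined. A slightly more tolerant variant, offered as a sketch rather than a drop-in replacement (the name `replace_br_tags` is hypothetical), uses a regular expression and substitutes a single space:

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def replace_br_tags(text):
    """Replace HTML line-break tags with one space and collapse extra whitespace."""
    without_tags = _BR_TAG.sub(" ", text)
    # Note: this also collapses newlines and tabs into single spaces.
    return re.sub(r"\s+", " ", without_tags).strip()

print(replace_br_tags("First line.<br /><br />Second line.<BR>Third"))
```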

In [5]:
from sklearn.model_selection import train_test_split
# X and y are the features and labels prepared above
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
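
With only two classes, it can help to keep the class ratio identical in the training and validation sets; `train_test_split` accepts a `stratify=y` argument for this. The pure-Python sketch below only illustrates what stratification does. The function `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly the same class ratio in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_val = round(len(indices) * test_size)   # per-class validation share
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = [1] * 30 + [0] * 70                  # 30% machine-written, 70% human
train_idx, val_idx = stratified_split(toy_labels)
val_ratio = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_ratio)
```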

# Compile The Model

In [6]:
# Compile
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True), #'binary_crossentropy',
    optimizer=keras.optimizers.Adam(1e-5),
    metrics= ["accuracy"]  
)
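
`SparseCategoricalCrossentropy(from_logits=True)` expects integer class labels and raw, unnormalized scores (logits) with one column per class, which matches `num_classes=2` above. As a sketch of the arithmetic involved, not of the Keras implementation, the per-example loss is the negative log of the softmax probability assigned to the true class. The helper name and the logits below are made up for illustration.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Loss for one example: -log(softmax(logits)[label])."""
    shift = max(logits)                                   # for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

confident_right = sparse_categorical_crossentropy([-2.0, 3.0], label=1)
confident_wrong = sparse_categorical_crossentropy([-2.0, 3.0], label=0)
undecided = sparse_categorical_crossentropy([0.0, 0.0], label=1)
print(confident_right, confident_wrong, undecided)
```

A confident correct prediction gives a loss close to zero, an undecided one gives log 2, and a confident wrong one is penalized heavily.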

# Train (Fine-tune) the RoBERTa classifier

In [7]:
early_stopping = keras.callbacks.EarlyStopping(
    patience = 60, 
    min_delta = 1e-4, 
    restore_best_weights = True
)
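
The stopping rule can be sketched in a few lines. This is a simplified model of patience-based early stopping under stated assumptions (validation loss is monitored and lower is better), not the Keras source. `early_stop_epoch` is a hypothetical helper, and the loss values are the `val_loss` figures from the training log further down in this notebook.

```python
def early_stop_epoch(val_losses, patience, min_delta=1e-4):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

val_losses = [0.0453, 0.0053, 0.0018, 9.7034e-04, 0.0022, 8.3761e-04]
print(early_stop_epoch(val_losses, patience=1))
print(early_stop_epoch(val_losses, patience=60))
```

With `patience = 60`, the callback can only trigger after 60 consecutive epochs without improvement, so a much smaller patience is usually what lets early stopping save training time.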

In [8]:
EPOCHS = 100
BATCH_SIZE = 32

# Fit
history = classifier.fit(x=X_train,
                         y=y_train,
                         batch_size=BATCH_SIZE,
                         epochs=EPOCHS, 
                         validation_data=(X_val, y_val), 
                         callbacks= [early_stopping]
                        )

Epoch 1/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 121s 753ms/step - accuracy: 0.7094 - loss: 0.5835 - val_accuracy: 0.9875 - val_loss: 0.0453
Epoch 2/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 31s 579ms/step - accuracy: 0.9806 - loss: 0.0615 - val_accuracy: 1.0000 - val_loss: 0.0053
Epoch 3/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 30s 578ms/step - accuracy: 0.9890 - loss: 0.0227 - val_accuracy: 1.0000 - val_loss: 0.0018
Epoch 4/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 30s 574ms/step - accuracy: 1.0000 - loss: 0.0036 - val_accuracy: 1.0000 - val_loss: 9.7034e-04
Epoch 5/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 31s 594ms/step - accuracy: 1.0000 - loss: 0.0034 - val_accuracy: 1.0000 - val_loss: 0.0022
Epoch 6/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 30s 574ms/step - accuracy: 1.0000 - loss: 0.0010 - val_accuracy: 1.0000 - val_loss: 8.3761e-04
Epoch 7

# Predict on test

In [9]:
df_test = pd.read_csv('/kaggle/input/artificial-text-detection-homework/test.csv')
df_test['Text'] = df_test['Text'].apply(remove_br_tags)

df_test['Class'] = [('M' if np.argmax(i) == 1 else 'H') for i in classifier.predict(df_test['Text'].to_list())]

[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m87s[0m 130ms/step
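
`classifier.predict` returns one row of logits per text, so the list comprehension above can also be written as a vectorized `np.argmax` over the class axis. The array below is made-up data standing in for the model output.

```python
import numpy as np

# Hypothetical logits for three texts; columns are [human, machine].
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 0.2,  0.1]])

class_ids = np.argmax(logits, axis=1)            # index of the larger logit per row
labels = np.where(class_ids == 1, "M", "H")      # same mapping as in the notebook
print(labels.tolist())
```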


In [10]:
submission_df = pd.DataFrame({'ID': df_test['ID'], 'Class': df_test['Class']})
submission_df.to_csv('submission.csv', index = False)
