# LLM - Detect AI Generated Text
> Identify which essay was written by a large language model

<img src="https://user-images.githubusercontent.com/36858976/279902422-b365f6ef-ef01-49ac-af7f-0bc2ca3ba835.png">

# 🎯 | Motivation

* This notebook demonstrates the multi-backend capabilities of `KerasCore` and `KerasNLP` by running inference for the **Detect Fake Text** task.

# 📓 | Notebooks

* Train: [Detect Fake Text: KerasNLP [TF/Torch/JAX][Train]](https://www.kaggle.com/code/awsaf49/detect-fake-text-kerasnlp-tf-torch-jax-train)
* Infer: [Detect Fake Text: KerasNLP [TF/Torch/JAX][Infer]](https://www.kaggle.com/code/awsaf49/detect-fake-text-kerasnlp-tf-torch-jax-infer)

# 🛠 | Install Libraries 

In [1]:
# !pip install /kaggle/input/llm-science-exam-lib-ds/keras_core-0.1.7-py3-none-any.whl --no-deps
# !pip install /kaggle/input/llm-science-exam-lib-ds/keras_nlp-0.6.2-py3-none-any.whl --no-deps

# 📚 | Import Libraries 

In [2]:
import os
os.environ["KERAS_BACKEND"] = "torch"  # or "tensorflow" or "jax"

import keras_nlp
import keras_core as keras 
import keras_core.backend as K


# import jax
import torch
import tensorflow as tf
# from tensorflow import keras
# import tensorflow.keras.backend as K

import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt

from glob import glob
from tqdm.notebook import tqdm
import gc

Using PyTorch backend.


## Library Version

In [3]:
print("TensorFlow:", tf.__version__)
# print("JAX:", jax.__version__)
print("Keras:", keras.__version__)
print("KerasNLP:", keras_nlp.__version__)
torch.cuda.is_available()

TensorFlow: 2.10.1
Keras: 0.1.7
KerasNLP: 0.6.3


True

# ⚙️ | Configuration

In [4]:
class CFG:
    # TEST_PATH = './kaggle/input/daigt-proper-train-dataset/train_drcat_02.csv'
    TEST_PATH = './kaggle/input/argugpt/argugpt.csv'
    VACAB_PATH = './kaggle/Model/keras-nlp-deberta-v3-base-en-vocab-ds/vocab.spm'
    CKPT_PATH = "./kaggle/working/"  # Directory containing the trained checkpoints (*.keras)
    num_of_dataset = 1000 # Number of test samples
    verbose = 0  # Verbosity
    device = 'GPU'  # Device
    seed = 42  # Random seed
    batch_size = 10  # Batch size
    drop_remainder = True  # Drop incomplete batches
    sequence_length = 200  # Input sequence length
    class_names = ['real','fake']  # Class names: 0 = real (student-written), 1 = fake (LLM-generated)
    num_classes = len(class_names)  # Number of classes
    class_labels = list(range(num_classes))  # Class labels [0, 1]
    label2name = dict(zip(class_labels, class_names))  # Label to class name mapping
    name2label = {v: k for k, v in label2name.items()}  # Class name to label mapping

# ♻️ | Reproducibility 
Fix the random seed so that results are reproducible across runs.

In [5]:
keras.utils.set_random_seed(CFG.seed)
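
As a small illustration of why a fixed seed matters, the sketch below uses only Python's standard `random` module. This is an assumption made for illustration: `keras.utils.set_random_seed` also seeds NumPy and the active backend, which this toy example does not cover. The helper `draw` is hypothetical.

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.random() for _ in range(n)]

first = draw(42)
second = draw(42)  # same seed, so the sequence is identical
other = draw(7)    # different seed, so the sequence differs

print(first == second)
print(first == other)
```

Re-seeding with the same value reproduces the same draws, which is what makes shuffling and sampling repeatable between runs.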

# 💾 | Hardware
The following code automatically detects the available hardware (TPU, GPU, or CPU).

In [6]:
def get_device():
    """Detect and initialize a TPU or GPU automatically, falling back to CPU."""
    try:
        # Connect to TPU
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect() 
        # Set TPU strategy
        strategy = tf.distribute.TPUStrategy(tpu)
        print('> Running on TPU', tpu.master(), end=' | ')
        print('Num of TPUs: ', strategy.num_replicas_in_sync)
        device = 'TPU'
    except Exception:
        # If TPU is not available, detect GPUs
        gpus = tf.config.list_logical_devices('GPU')
        ngpu = len(gpus)
         # Check number of GPUs
        if ngpu:
            # Set GPU strategy
            strategy = tf.distribute.MirroredStrategy(gpus) # single-GPU or multi-GPU
            # Print GPU details
            print("> Running on GPU", end=' | ')
            print("Num of GPUs: ", ngpu)
            device='GPU'
        else:
            # If no GPUs are available, use CPU
            print("> Running on CPU")
            strategy = tf.distribute.get_strategy()
            device='CPU'
    return strategy, device

In [7]:
# Initialize GPU/TPU/TPU-VM
strategy, CFG.device = get_device()
CFG.replicas = strategy.num_replicas_in_sync

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
> Running on GPU | Num of GPUs:  1


# 📁 | Dataset Path 

# 📖 | Meta Data 
* `{test|train}_essays.csv`
    * `id` - A unique identifier for each essay.
    * `prompt_id` - Identifies the prompt the essay was written in response to.
    * `text` - The essay text itself.
    * `generated` - Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in test_essays.csv.
* `sample_submission.csv` - A valid sample submission file.

## Test Data

In [8]:
ext_df1 = pd.read_csv(CFG.TEST_PATH)  # Read CSV file into a DataFrame
ext_df1 = ext_df1[['id', 'text']]

# test_df = pd.concat([
#     ext_df1[ext_df1.label==0].sample(CFG.num_of_dataset//2),
#     ext_df1[ext_df1.label==1].sample(CFG.num_of_dataset//2)
# ])
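test_df = ext_df1.sample(CFG.num_of_dataset)  # Randomly sample the test subset
test_df['label'] = 1  # Every essay in this dataset is treated as LLM-generated (fake)

# Display information about the test data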
print("# Test Data: {:,}".format(len(test_df)))
print("# Sample:")
display(test_df.head(2))

# Test Data: 1,000
# Sample:


| | id | text | label |
|---|---|---|---|
| 3137 | gre_937 | The arts have always played an essential role ... | 1 |
| 149 | weccl_737 | University education has been a topic of debat... | 1 |


# 🍽️ | Preprocessing

**What it does:** The preprocessor takes input strings and transforms them into a dictionary (`token_ids`, `padding_mask`) containing preprocessed tensors. This process starts with tokenization, where input strings are converted into sequences of token IDs.

**Why it's important:** Initially, raw text data is complex and challenging for modeling due to its high dimensionality. By converting text into a compact set of tokens, such as transforming `"The quick brown fox"` into `["the", "qu", "##ick", "br", "##own", "fox"]`, we simplify the data. Many models rely on special tokens and additional tensors to understand input. These tokens help divide input and identify padding, among other tasks. Making all sequences the same length through padding boosts computational efficiency, making subsequent steps smoother.

Explore the following pages to access the available preprocessing and tokenizer layers in **KerasNLP**:
- [Preprocessing](https://keras.io/api/keras_nlp/preprocessing_layers/)
- [Tokenizers](https://keras.io/api/keras_nlp/tokenizers/)
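
To make the `token_ids` / `padding_mask` idea concrete, here is a minimal sketch in plain Python. It is not the DeBERTa preprocessor: the real one uses a SentencePiece subword vocabulary and adds special tokens, both omitted here. The function `toy_preprocess`, the vocabulary, and the `pad_id` / `unk_id` values are all hypothetical.

```python
def toy_preprocess(text, vocab, sequence_length, pad_id=0, unk_id=1):
    """Whitespace-tokenize, map tokens to IDs, then pad/truncate to a fixed length."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:sequence_length]              # truncate inputs that are too long
    n_pad = sequence_length - len(ids)       # number of padding positions needed
    return {
        "token_ids": ids + [pad_id] * n_pad,
        "padding_mask": [True] * len(ids) + [False] * n_pad,  # True marks real tokens
    }

vocab = {"the": 2, "quick": 3, "brown": 4, "fox": 5}  # hypothetical vocabulary
out = toy_preprocess("The quick brown fox jumps", vocab, sequence_length=8)
print(out["token_ids"])     # [2, 3, 4, 5, 1, 0, 0, 0]
print(out["padding_mask"])  # [True, True, True, True, True, False, False, False]
```

The unknown word `jumps` maps to `unk_id`, and the mask lets the model ignore the three padded positions.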

In [9]:
tokenizer= keras_nlp.models.DebertaV3Tokenizer(CFG.VACAB_PATH)
preprocessor= keras_nlp.models.DebertaV3Preprocessor(tokenizer, sequence_length=CFG.sequence_length)

Now, let's examine the output of the preprocessing layer. For a single input string, each output tensor has shape $(sequence\_length,)$.

In [10]:
outs = preprocessor(test_df.text.iloc[0])  # Preprocess the text of the first row

# Display the shape of each processed output
for k, v in outs.items():
    print(k, ":", v.shape)

token_ids : torch.Size([200])
padding_mask : torch.Size([200])


We'll use the `preprocess_fn` function to transform each text with the `dataset.map(preprocess_fn)` method.

In [11]:
def preprocess_fn(text, label=None):
    text = preprocessor(text)  # Preprocess text
    return (text, label) if label is not None else text  # Return processed text and label if available

# 🍚 | DataLoader

The code below sets up a data pipeline using `tf.data.Dataset`. `tf.data` lets us build the pipeline as a chain of simple steps: map, shuffle, batch, and prefetch.

To learn more about `tf.data`, refer to this [documentation](https://www.tensorflow.org/guide/data).

In [12]:
def build_dataset(texts, labels=None, batch_size=32,
                  cache=False, drop_remainder=True,
                  augment=False, repeat=False, shuffle=1024):
    AUTO = tf.data.AUTOTUNE  # AUTOTUNE option
    slices = (texts,) if labels is None else (texts, np.asarray(labels, dtype="float32"))  # Binary labels (0/1) to match the single sigmoid output
    ds = tf.data.Dataset.from_tensor_slices(slices)  # Create dataset from slices
    ds = ds.cache() if cache else ds  # Cache dataset if enabled
    ds = ds.map(preprocess_fn, num_parallel_calls=AUTO)  # Map preprocessing function
    ds = ds.repeat() if repeat else ds  # Repeat dataset if enabled
    opt = tf.data.Options()  # Create dataset options
    if shuffle: 
        ds = ds.shuffle(shuffle, seed=CFG.seed)  # Shuffle dataset if enabled
        opt.experimental_deterministic = False
    ds = ds.with_options(opt)  # Set dataset options
    ds = ds.batch(batch_size, drop_remainder=drop_remainder)  # Batch dataset
    ds = ds.prefetch(AUTO)  # Prefetch next batch
    return ds  # Return the built dataset
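
The effect of `drop_remainder` can be sketched without TensorFlow. The helper `batch` below is a hypothetical stand-in that mimics the batching behaviour of `Dataset.batch` on a plain list; it is an illustration under that assumption, not the `tf.data` implementation.

```python
def batch(items, batch_size, drop_remainder=False):
    """Yield consecutive batches; optionally discard a final incomplete batch."""
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        if drop_remainder and len(chunk) < batch_size:
            break  # discard the last, incomplete batch
        yield chunk

samples = list(range(25))
kept = list(batch(samples, batch_size=10, drop_remainder=False))
dropped = list(batch(samples, batch_size=10, drop_remainder=True))
print([len(b) for b in kept])     # [10, 10, 5]
print([len(b) for b in dropped])  # [10, 10]
```

For inference, `drop_remainder=False` is the safer choice, since dropping the last batch would leave some essays without a prediction.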

## Fetch Test Dataset

The function below builds the test dataset.

In [13]:
def get_test_dataset(test_df):
    test_texts = test_df.text.tolist()  # Extract test texts
    
    # Build test dataset
    test_ds = build_dataset(test_texts, labels=None,
                             batch_size=min(CFG.batch_size*CFG.replicas, len(test_df)), cache=False,
                             shuffle=False, drop_remainder=False, repeat=False)
    
    return test_ds  # Return the test dataset

# 🤖 | Modeling



In [14]:
def build_model():
    # Create a DebertaV3Classifier model
    # NOTE: CFG.preset is not defined in the configuration above; set it before calling this function
    classifier = keras_nlp.models.DebertaV3Classifier.from_preset(
        CFG.preset,
        load_weights=False,
        preprocessor=None,
        num_classes=1  # Single output: one logit for the binary real/fake decision
    )
    inputs = classifier.input
    logits = classifier(inputs)
        
    # Compute final output
    outputs = keras.layers.Activation("sigmoid")(logits)
    model = keras.Model(inputs, outputs)
    return model
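
The final `sigmoid` activation turns the single logit into a probability that the essay is fake. A minimal sketch of that mapping in plain Python follows; the function name `sigmoid` and the sample logits are chosen for illustration only.

```python
import math

def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1), in a numerically stable way."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # avoids overflow for large negative logits
    return z / (1.0 + z)

for logit in (-2.0, 0.0, 2.0):
    print(logit, round(sigmoid(logit), 3))
```

A logit of 0 maps to 0.5, positive logits lean towards `fake`, and negative logits lean towards `real`.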

## Ckpt processing
For some reason, `keras.models.load_model` requires write access to the checkpoint location. Because `/kaggle/input` is read-only, loading directly from it throws an error. The workaround is to copy the checkpoints to a writable directory and load the model from there.
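
The copy step can be sketched with the standard library alone. The helper `copy_checkpoints` and the directory names are hypothetical; the demo uses temporary folders in place of the read-only input folder and the writable working folder.

```python
import glob
import os
import shutil
import tempfile

def copy_checkpoints(src_dir, dst_dir):
    """Copy every *.keras file from src_dir into a writable dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.keras"))):
        copied.append(shutil.copy(path, dst_dir))  # returns the destination path
    return copied

# Demo with throwaway directories (both hypothetical)
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "input")
    dst = os.path.join(root, "working")
    os.makedirs(src)
    open(os.path.join(src, "fold1.keras"), "wb").close()  # empty placeholder file
    result = copy_checkpoints(src, dst)
    print([os.path.basename(p) for p in result])
```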

In [15]:
# Get the checkpoint directory and name
CKPT_PATH = CFG.CKPT_PATH
# ckpt_name = 'daigt-kerasnlp-ckpt'

# Copy the checkpoints to a new directory in the /kaggle directory
# !cp -r {CKPT_PATH} /kaggle/{ckpt_name}

# List all the checkpoint paths in the new directory
# new_ckpt_dir = f"/kaggle/{ckpt_name}"
new_ckpt_dir = CKPT_PATH
ckpt_paths = glob(os.path.join(new_ckpt_dir, '*.keras'))

print("Total CKPT:", len(ckpt_paths))

Total CKPT: 1


# 🧪 | Prediction

## Inference

In [16]:
# Initialize an array to store predictions for each fold
fold_preds = np.zeros(shape=(len(test_df),), dtype='float32')

# # Build model
# model = build_model()

# Iterate through each checkpoint path
for ckpt_path in tqdm(ckpt_paths):
    # Load the pre-trained model from the checkpoint
    print(ckpt_path)
    model = keras.models.load_model(
        ckpt_path,
        compile=False,
    )
#     model.load_weights(ckpt_path)
    
    # Get the test dataset
    test_ds = get_test_dataset(test_df)
    
    # Generate predictions using the model
    preds = model.predict(
        test_ds,
        batch_size=min(CFG.batch_size * CFG.replicas * 2, len(test_df)),  # test_ds is already batched, so its own batch size is what applies
        verbose=1
    )
    
    # Add predictions to fold_preds and average over checkpoints
    fold_preds += preds.squeeze() / len(ckpt_paths)
    
    # Clean up by deleting the model and collecting garbage
    del model
    gc.collect()

./kaggle/working\fold1.keras

100/100 ━━━━━━━━━━━━━━━━━━━━ 11s 93ms/step


## Check Prediction

In [17]:
# Threshold the averaged probabilities at 0.5 to get predicted labels
pred_answers = (fold_preds > 0.5).astype(int).squeeze()
print(pred_answers.shape)

(1000,)
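
The two steps used here, averaging probabilities over checkpoints and then thresholding at 0.5, can be sketched in plain Python. The function `ensemble_predict` and the probabilities are hypothetical values chosen for illustration.

```python
def ensemble_predict(per_checkpoint_probs, threshold=0.5):
    """Average probabilities across checkpoints, then threshold to 0/1 labels."""
    n_ckpt = len(per_checkpoint_probs)
    n_samples = len(per_checkpoint_probs[0])
    avg = [0.0] * n_samples
    for probs in per_checkpoint_probs:
        for i, p in enumerate(probs):
            avg[i] += p / n_ckpt  # same running-average update as fold_preds
    labels = [int(p > threshold) for p in avg]
    return avg, labels

# Hypothetical probabilities from two checkpoints for three essays
probs = [[0.9, 0.2, 0.6],
         [0.7, 0.4, 0.3]]
avg, labels = ensemble_predict(probs)
print([round(p, 2) for p in avg])  # [0.8, 0.3, 0.45]
print(labels)                      # [1, 0, 0]
```

With a single checkpoint, as in this run, the average is simply that checkpoint's prediction.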


In [18]:
# Check 5 Predictions
print("# Predictions\n")
for i in range(5):
    row = test_df.iloc[i]
    text  = row.text
    pred_answer = CFG.label2name[pred_answers[i]]
    print(f"❓ Text {i+1}:\n{text}\n")
    print(f"🤖 Predicted: {pred_answer}\n")
    print("-"*90, "\n")

# Predictions

❓ Text 1:
The arts have always played an essential role in human societies. They allow us to express ourselves in unique and creative ways and provide us with a means of exploring our emotions and ideas. Some argue that the arts reveal the otherwise hidden ideas and impulses of a society. In this essay, I will argue that I agree with this statement and provide specific reasons and examples to support this position.

To begin with, art allows individuals to express themselves in ways that they would not be able to do otherwise. For instance, a painting or sculpture can reveal an artist's innermost thoughts and emotions. This is because art is often used as a form of self-expression. When someone creates something, they are revealing a part of themselves through their work. In this sense, the arts can reveal hidden ideas and impulses.

Furthermore, the arts can also be used to reveal societal trends and ideas. For example, many writers have used their work to comment on th

In [19]:
len(fold_preds)

1000

# 📮 | Submission

In [20]:
# Create a DataFrame to store the submission
sub_df = test_df.copy()

# Add the formatted predictions to the submission DataFrame
sub_df["pred_prob"] = fold_preds.squeeze()
sub_df["pred_label"] = pred_answers
sub_df["correct"] = sub_df["pred_label"] == sub_df["label"]

# Display the first 2 rows of the submission DataFrame
sub_df.head(2)

# Display Acc
total_cor = sub_df["correct"].sum()
print(f'Total Correct: {total_cor} / {len(sub_df)}')
print(f'Acc: {(total_cor * 100) / len(sub_df):.2f} %')

Total Correct: 942 / 1000
Acc: 94.20 %
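
Because every row in `test_df` is labelled 1, the accuracy above is the same quantity as recall on the `fake` class, and it says nothing about how often human-written essays would be flagged. The sketch below shows this on hypothetical toy labels; `accuracy` and `recall` are illustrative helpers, not library functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    preds_for_pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_pos) / len(preds_for_pos)

# Hypothetical toy data: every sample is "fake" (1), as in this test set
y_true = [1, 1, 1, 1, 1]
y_pred = [1, 1, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.8
print(recall(y_true, y_pred))    # 0.8
```

Evaluating on a mix of real and fake essays would be needed to measure false positives as well.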


In [21]:
sub_df.to_csv('submission.csv',index=False)