<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">


<a class="anchor" id="toc"> 
    <h1> Table of Contents </h1>
</a>
    
<ol>    
<!--     <li> <a href="#lora">What is LoRA?</a></li> -->
    <li> <a href="#setup">Notebook Set Up</a></li>
    <li> <a href="#data">Load Data</a></li>
    <li> <a href="#model">Load Model</a></li>
    <li> <a href="#train">Model Training</a></li>
    <li> <a href="#predict">Make Prediction</a></li>
</ol>
    
</div>

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

   
<a class="anchor" id="setup"> 
    <h1> --- Notebook Set Up --- </h1>
</a>
    
<h3> Dataset & Library Installation</h3>
    
<h4> 1. peft Library </h4>
    
- GitHub URL: `https://github.com/huggingface/peft`
- Instruction: 
    1. Click the `Upload` icon in `Data` Section on the right.
    2. Click `Link`.
    3. Click `Import GitHub repository`.
    4. Paste the GitHub URL above into the `URL` box.
    5. Click `Continue`.
- Path: `/kaggle/input/peft-main`
---
  
<h4> 2. DAIGT V2 Train Dataset </h4>
    
- Dataset URL: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/data
- Instruction:
    1. Download the dataset.
    2. Upload the dataset and name it `daigt-v2` in this notebook.
- Path: `/kaggle/input/daigt-v2/train_v2_drcat_02.csv`
---    
    
<h4> 3. HuggingFace BERT Variants</h4>

- Kaggle URL: https://www.kaggle.com/datasets/sauravmaheshkar/huggingface-bert-variants/
- Instruction:
    1. Download the data from the Kaggle URL above.
    2. Unzip the downloaded file.
    3. Compress the subfolder called `distilbert-base-uncased`.
    4. Upload the zipped folder and name the dataset `distilbert-base-uncased` in this notebook.
- Path: `/kaggle/input/distilbert-base-uncased`
    
</div>

In [1]:
# Run this to enable peft library
import sys
sys.path.append("/kaggle/input/peft-main/src")

In [2]:
# Model path
MODEL_PATH = "/kaggle/input/distilbert-base-uncased/distilbert-base-uncased/distilbert-base-uncased"
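
Because the notebook depends on three manually attached inputs, a quick existence check can catch a misnamed dataset before any heavy work starts. The sketch below is only a suggestion: the helper name `missing_inputs` is hypothetical, and the paths are the ones used in this notebook's code cells.

```python
from pathlib import Path

# Paths taken from this notebook's code cells; adjust them if your datasets are named differently.
EXPECTED_INPUTS = [
    "/kaggle/input/peft-main/src",
    "/kaggle/input/distilbert-base-uncased/distilbert-base-uncased/distilbert-base-uncased",
    "/kaggle/input/daigt-v2/train_v2_drcat_02.csv",
]

def missing_inputs(paths):
    """Return the subset of `paths` that does not exist on disk (hypothetical helper)."""
    return [p for p in paths if not Path(p).exists()]

missing = missing_inputs(EXPECTED_INPUTS)
if missing:
    print("Missing inputs:", *missing, sep="\n  ")
else:
    print("All expected inputs are attached.")
```

Running this right after attaching the datasets makes a wrong dataset name visible immediately instead of at the first `read_csv` or `from_pretrained` call.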

In [3]:
# Import all required libraries
import os
import time
import math

import numpy as np
import pandas as pd

import tqdm

import warnings
warnings.filterwarnings("ignore")

import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AdamW, 
    get_linear_schedule_with_warmup,
    DataCollatorWithPadding,
    Trainer, 
    TrainingArguments,
    AutoModelForCausalLM
)

from peft import (
    LoraConfig, 
    get_peft_model, 
    TaskType,
    PeftModel
)

import datasets 

## Helper Function

In [4]:
# Define a function that returns a summary of trainable vs. total parameter counts
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

   
<a class="anchor" id="data"> 
    <h1> --- Load Dataset --- </h1>
</a>
    
</div>

## Read Data

In [5]:
full_data = pd.read_csv("/kaggle/input/daigt-v2/train_v2_drcat_02.csv")
full_data = full_data[full_data['RDizzl3_seven'] == True]
full_data = full_data[["text","label"]]
full_data.reset_index(drop=True, inplace = True)

print(f"We have {len(full_data)} samples") # Number of data we have
full_data.head(1)

# full_data = full_data.head(2000) # Subset the data for testing the code

We have 20450 samples


| | text | label |
|---|---|---|
| 0 | Cars have been around for awhile and they have... | 0 |


## Split Data

In [6]:
from sklearn.model_selection import train_test_split

# Stratified 70/30 train/validation split (re-run once augmented data is ready)
X_train, X_val, y_train, y_val = train_test_split(full_data["text"],
                                                  full_data["label"],
                                                  test_size=0.3,
                                                  stratify=full_data["label"],
                                                  random_state=42)
print(f"We have {len(X_train)} training samples")
print(f"We have {len(X_val)} validation samples")
print("----------------------------")
count = full_data["label"].value_counts()
print(f"Number of Essays written by Human: {count[0]}")
print(f"Number of Essays generated by LLM: {count[1]}")

X_train.reset_index(drop = True, inplace = True)
y_train.reset_index(drop = True, inplace = True)
X_val.reset_index(drop = True, inplace = True)
y_val.reset_index(drop = True, inplace = True)

We have 14315 training samples
We have 6135 validation samples
----------------------------
Number of Essays written by Human: 14250
Number of Essays generated by LLM: 6200
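
The printed counts are consistent with a 70/30 stratified split. As a rough check that uses only the numbers printed above, the arithmetic below recomputes the split sizes and the per-class counts one would expect in the validation set if stratification preserved the class ratio exactly (an approximation, since scikit-learn has to round to whole samples).

```python
# Counts copied from the output above
n_human, n_llm = 14250, 6200
n_total = n_human + n_llm                 # 20450

# test_size=0.3, computed with integer arithmetic to avoid float rounding
n_val = -((-n_total * 3) // 10)           # ceiling of 0.3 * n_total
n_train = n_total - n_val

llm_share = n_llm / n_total
print(f"train={n_train}, val={n_val}, LLM share={llm_share:.1%}")

# Expected per-class validation counts under a perfectly preserved class ratio
expected_val_llm = n_val * n_llm / n_total
expected_val_human = n_val * n_human / n_total
print(f"expected val: human~{expected_val_human:.0f}, LLM~{expected_val_llm:.0f}")
```

Roughly 30% of the essays are LLM-generated, so the classes are imbalanced but not severely so.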


<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

<a class="anchor" id="model"> 
    <h1> --- Load Model --- </h1>
    </a>
    
</div>

## DistilBERT + LoRA
- We use the [DistilBERT base model](https://huggingface.co/distilbert-base-uncased) (about 67M parameters) for our first predictions.
- To fine-tune the model, we apply [Low-Rank Adaptation (LoRA)](https://huggingface.co/docs/peft/conceptual_guides/lora) to reduce the number of trainable parameters.

In [7]:
# Model & Tokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, return_dict=True, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Number of trainable parameters
print(print_number_of_trainable_model_parameters(model))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at /kaggle/input/distilbert-base-uncased/distilbert-base-uncased/distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable model parameters: 66955010
all model parameters: 66955010
percentage of trainable model parameters: 100.00%


In [8]:
# model # Show Model Structure

## Get the Model with LoRA

In [9]:
# Define the LoRA Configuration
lora_config = LoraConfig(
    r=8, # Rank of the low-rank update matrices
    lora_alpha=32, # Alpha (scaling factor)
    lora_dropout=0.05, # Dropout probability for the LoRA layers
    target_modules=["q_lin", "k_lin","v_lin"], # Layers to apply LoRA to, usually only the multi-head attention projections
    bias='none',
    task_type=TaskType.SEQ_CLS # Sequence classification task
)
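
To make the roles of `r` and `lora_alpha` concrete: in the standard LoRA formulation, each targeted linear layer keeps its frozen weight `W` and adds a trainable low-rank product `B @ A`, scaled by `lora_alpha / r` (here 32 / 8 = 4). The toy sketch below uses plain Python lists and tiny matrices to show that computation. It illustrates the idea only, is not PEFT's implementation, and ignores dropout and bias.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for one input vector x."""
    base = matvec(W, x)                    # frozen pretrained path
    update = matvec(B, matvec(A, x))       # trainable low-rank path
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: d_in = d_out = 2, rank r = 1
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity, for readability)
A = [[1.0, 1.0]]               # shape (r, d_in)
x = [1.0, 2.0]

# B starts at zero in LoRA, so the adapted layer initially matches the base layer
B_init = [[0.0], [0.0]]        # shape (d_out, r)
print(lora_forward(W, A, B_init, x, lora_alpha=2, r=1))     # [1.0, 2.0]

# Once B is non-zero, the output shifts by scale * B (A x)
B_trained = [[1.0], [0.0]]
print(lora_forward(W, A, B_trained, x, lora_alpha=2, r=1))  # [7.0, 2.0]
```

Only `A` and `B` receive gradients, which is why the trainable parameter count drops so sharply once LoRA is applied.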

In [10]:
# Get our LoRA-enabled model
peft_model = get_peft_model(model, 
                            lora_config)

# Trainable parameters after applying LoRA
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 813314
all model parameters: 67768324
percentage of trainable model parameters: 1.20%


## Note that: 
1. LoRA freezes all the original parameters of the base model (in this case `distilbert-base-uncased`).
2. Only about 1.2% of all parameters remain trainable.
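
The counts printed above can be reproduced by hand. The arithmetic below assumes the standard DistilBERT base shape (6 transformer layers, hidden size 768), which is not printed in this notebook. It also assumes that, for `TaskType.SEQ_CLS`, PEFT keeps a trainable copy of the two classification-head layers (`pre_classifier` and `classifier`) next to the frozen originals. That second assumption is inferred from the numbers matching rather than shown by the notebook.

```python
hidden, n_layers, r = 768, 6, 8           # DistilBERT base shape (assumed) and LoRA rank
targets_per_layer = 3                      # q_lin, k_lin, v_lin

# Each adapted hidden x hidden linear layer gains A (r x hidden) and B (hidden x r)
lora_params = n_layers * targets_per_layer * r * (hidden + hidden)

# Classification head: pre_classifier (768 -> 768) and classifier (768 -> 2), weights + biases
head_params = (hidden * hidden + hidden) + (hidden * 2 + 2)

trainable = lora_params + head_params
base_total = 66_955_010                    # printed before applying LoRA
peft_total = base_total + trainable        # frozen originals plus the added trainable weights

print(lora_params, head_params, trainable, peft_total)
print(f"{100 * trainable / peft_total:.2f}%")
```

Under this reading, only about 221k of the trainable weights are LoRA matrices; the remainder belongs to the classification head.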

In [11]:
# peft_model # Model Structure

## Tokenize Dataset

In [12]:
# Tokenize function
def tokenize_func(data):
    return tokenizer(
            data['texts'],
            max_length=512,
            padding='max_length',
            return_attention_mask=True,
            truncation=True
        )
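
With `padding='max_length'` and `truncation=True`, every essay ends up as exactly 512 token ids plus a matching attention mask. The toy function below mimics that shaping step on plain lists. It is a simplification for illustration: the name `pad_or_truncate` is hypothetical, a pad id of 0 is assumed, and the real tokenizer additionally handles special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

print(pad_or_truncate([11, 12, 13], max_length=5))
# ([11, 12, 13, 0, 0], [1, 1, 1, 0, 0])
print(pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=5))
# ([11, 12, 13, 14, 15], [1, 1, 1, 1, 1])
```

Because every example is already padded to 512 tokens at this stage, the `DataCollatorWithPadding(padding="longest")` used during training has nothing left to shorten. Padding only at collation time may give shorter batches when essays are well under 512 tokens.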

In [13]:
# Tokenize the Training Data
train_dataset = datasets.Dataset.from_pandas(pd.DataFrame({"texts":X_train,"labels":y_train}))
train_dataset = train_dataset.map(
    tokenize_func,
    batched=True,
    remove_columns=["texts"]
)
train_dataset

  0%|          | 0/15 [00:00<?, ?ba/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 14315
})

In [14]:
# Tokenize the Validation Data
val_dataset = datasets.Dataset.from_pandas(pd.DataFrame({"texts":X_val,"labels":y_val}))
val_dataset = val_dataset.map(
    tokenize_func,
    batched=True,
    remove_columns=["texts"]
)

val_dataset

  0%|          | 0/7 [00:00<?, ?ba/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 6135
})

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

<a class="anchor" id="train"> 
    <h1> --- Model Training --- </h1>
</a>
    
</div>

**Note: Disable training when running this notebook for submission.**

In [15]:
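# Define Eval Metric
def metrics(eval_prediction):
    logits, labels = eval_prediction
    pred = np.argmax(logits, axis=1) # Hard class predictions (0 or 1)
    # Note: this AUC is computed from hard predictions, not from class probabilities
    auc_score = roc_auc_score(labels, pred)
    return {"Val-AUC": auc_score}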

train_batch_size = 32
eval_batch_size = 32

# Define training Args
peft_training_args = TrainingArguments(
    output_dir='./result-distilbert-lora',
    logging_dir='./logs-distilbert-lora',
#     auto_find_batch_size=True,
    learning_rate=1e-4,
    per_device_train_batch_size=train_batch_size, # Adjust to your available GPU memory; too large a value may cause an "out of memory" error
    per_device_eval_batch_size=eval_batch_size, # Adjust to your available GPU memory; too large a value may cause an "out of memory" error
    num_train_epochs=5,
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=10,
    weight_decay=0.01,
    seed=42,
    fp16=True, # Only use with GPU
    report_to='none'
)   

# Define Optimizer
optimizer = AdamW(peft_model.parameters(), 
                  lr=1e-4,
                  no_deprecation_warning=True)

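# Define Scheduler
# The "/ 2" assumes two GPUs share each optimizer step (effective batch size = 2 * train_batch_size)
n_epochs = peft_training_args.num_train_epochs
total_steps = n_epochs * math.ceil(len(train_dataset) / train_batch_size / 2)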
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=total_steps)

# Data Collator
collator = DataCollatorWithPadding(
    tokenizer=tokenizer, 
    padding="longest"
)


# Define Trainer
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=train_dataset, # Training Data
    eval_dataset=val_dataset, # Evaluation Data
    tokenizer=tokenizer,
    compute_metrics=metrics,
    optimizers=(optimizer,lr_scheduler),
    data_collator=collator
)

print(f"Total Steps: {total_steps}")

# Path to save the fine-tuned model
peft_model_path="/kaggle/working/peft-distilbert-lora-local"

# Train the model
peft_trainer.train()

# peft_trainer.model.save_pretrained(peft_model_path) # Save the fine-tuned model
# tokenizer.save_pretrained(peft_model_path) # Save the tokenizer

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Total Steps: 1120
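
The printed total of 1120 steps follows from the formula in the cell above, and the scheduler then decays the learning rate linearly from `1e-4` to 0 over those steps. The sketch below reproduces both in plain Python. The function `linear_lr` is a hypothetical re-implementation of the linear-with-warmup rule for illustration, not the `transformers` code itself, and reading the `/ 2` as two GPUs sharing each step is an assumption.

```python
import math

n_samples, batch_size, n_devices, n_epochs = 14315, 32, 2, 5   # n_devices=2 is an assumption

steps_per_epoch = math.ceil(n_samples / batch_size / n_devices)
total_steps = n_epochs * steps_per_epoch
print(steps_per_epoch, total_steps)

def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))

# Learning rate at the start, the midpoint, and the end of training
for step in (0, 560, 1120):
    print(step, linear_lr(step, base_lr=1e-4, total_steps=total_steps))
```

If the notebook runs on a single GPU, the `/ 2` would make the schedule reach zero halfway through training, so the divisor should match the actual number of devices.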


| Step | Training Loss | Validation Loss | Val-auc |
|---|---|---|---|
| 10 | 0.5932 | 0.54312 | 0.5 |
| 20 | 0.5018 | 0.435841 | 0.814726 |
| 30 | 0.3831 | 0.295072 | 0.878323 |
| 40 | 0.2249 | 0.206836 | 0.912167 |
| 50 | 0.1815 | 0.173285 | 0.936835 |
| 60 | 0.1093 | 0.112987 | 0.95682 |
| 70 | 0.0749 | 0.085323 | 0.966969 |
| 80 | 0.0819 | 0.255287 | 0.925078 |
| 90 | 0.0625 | 0.06976 | 0.974172 |
| 100 | 0.0723 | 0.182601 | 0.945745 |


TrainOutput(global_step=1120, training_loss=0.04902265665417807, metrics={'train_runtime': 8343.2767, 'train_samples_per_second': 8.579, 'train_steps_per_second': 0.134, 'total_flos': 9660184239820800.0, 'train_loss': 0.04902265665417807, 'epoch': 5.0})
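
One detail worth knowing about the `Val-auc` column: the `metrics` function passes hard class predictions (the argmax) to `roc_auc_score`, so the ranking information in the scores is discarded. The pure-Python sketch below uses made-up toy scores and a hypothetical `pairwise_auc` helper, which computes AUC as the fraction of (positive, negative) pairs ranked correctly with ties counted as half, to show how the two variants can differ. When every example is assigned the same class, the hard-label AUC is exactly 0.5, which may explain the 0.5 reported at step 10.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
prob_llm = [0.20, 0.30, 0.40, 0.45]              # toy P(label = 1); all below 0.5
hard = [1 if p > 0.5 else 0 for p in prob_llm]   # what argmax over two classes yields

print(pairwise_auc(labels, prob_llm))  # 1.0: both positives are ranked above both negatives
print(pairwise_auc(labels, hard))      # 0.5: all predictions are 0, so every pair ties
```

In this notebook's setting, passing the class-1 probability (or the difference between the two logits) to `roc_auc_score` instead of `pred` would give the ranking-based AUC.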

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
   
<a class="anchor" id="predict"> 
    <h1> --- Make Prediction --- </h1>
</a>
    
</div>

In [16]:
# Load the LoRA adapter and attach it to our base model
# def load_fine_tuned_model(peft_model_path):
#     peft_model_base = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, return_dict=True, num_labels=2)
#     peft_tokenizer = AutoTokenizer.from_pretrained(peft_model_path)

#     peft_model = PeftModel.from_pretrained(peft_model_base, 
#                                            peft_model_path,
#                                            is_trainable=False)

#     return peft_model, peft_tokenizer

# Function to make prediction
def predict(text, model, tokenizer):
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='max_length', 
        truncation=True, 
        max_length=512
    )
    
    model.eval()
    
    # Move the inputs to whichever device the model is on
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad(): # Inference only, so no gradients are needed
        logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1) # Apply Softmax
    
    return probabilities[0,-1].item() # Probability of the last class (label 1, LLM-generated)
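
The value returned by `predict` is the softmax probability of the last class, which for this two-label model is label 1 (LLM-generated). A minimal pure-Python illustration of that step, with made-up logits, is sketched below. The `softmax` helper here is a stand-in for `torch.nn.functional.softmax`, written out only to show the arithmetic.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0]                 # toy [human, LLM] logits
probs = softmax(logits)
p_llm = probs[-1]                   # same indexing idea as probabilities[0, -1]
print(round(p_llm, 4))              # 0.8808

# With two classes this equals a sigmoid of the logit difference
sigmoid_of_diff = 1 / (1 + math.exp(-(logits[1] - logits[0])))
print(round(sigmoid_of_diff, 4))    # 0.8808
```

Since the competition metric is based on ranking, submitting this probability rather than a hard 0/1 label preserves the model's confidence ordering.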

In [17]:
# Load the fine-tuned model and tokenizer
# peft_model_path="/kaggle/working/peft-distilbert-lora-local"
# our_model, our_tokenizer = load_fine_tuned_model(peft_model_path)

# Read the test data and score each essay
test_data = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
test_data['generated'] = test_data['text'].apply(lambda x: predict(x, peft_model, tokenizer))
test_data.drop(['prompt_id', 'text'], axis=1, inplace=True)
test_data

| | id | generated |
|---|---|---|
| 0 | 0000aaaa | 0.814472 |
| 1 | 1111bbbb | 0.48672 |
| 2 | 2222cccc | 0.659877 |


In [18]:
# Save the new submission
test_data.to_csv('submission.csv', index=False)