# Leave-One-Out Validation for Generated Essays

This is an auxiliary notebook which complements the [main notebook](llm_detection.ipynb). 

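Here we run leave-one-out validation (LOOV): for each generated essay, the model is refit on all other essays and then used to predict the probability that the held-out essay is machine-generated.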

## Importing the data

In [1]:
from tqdm.notebook import tqdm_notebook
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import os

import notebook_config


# enable progress bar functionality for pandas (e.g. `progress_apply`)
tqdm.pandas()


INTERMEDIATE_DIR = os.path.join("..", notebook_config.INTERMEDIATE_DIR)


In [2]:
df = pd.read_csv(os.path.join(INTERMEDIATE_DIR, notebook_config.LOOV_INPUT_NAME))
df = df.drop("Unnamed: 0", axis=1)
df

,id,text,prompt_id,generated,llm,source,embedding
0,-520228841973214070,cars have been a major part of our lives for a...,0,1,PaLM,Konstantina Liagkou,[ 8.0638006e-02 1.6982207e+00 -2.5302460e+00 ...
1,6608331338134701053,"limiting car usage has many advantages, such a...",0,1,PaLM,Konstantina Liagkou,[-0.8870298 -0.45862535 -3.3064713 -0.091987...
2,7138188490374180722,"""america's love affair with it's vehicles seem...",0,1,PaLM,Konstantina Liagkou,[ 9.11974728e-01 1.03492856e+00 -1.52655876e+...
3,3084568973721699379,"cars are convenient, but they can be harmful t...",0,1,PaLM,Konstantina Liagkou,[-0.90010715 -0.37471777 -3.003909 -0.017782...
4,-4989392494283758830,"cars are a convenient way to get around, but t...",0,1,PaLM,Konstantina Liagkou,[-8.3860934e-02 -2.7152100e-01 -3.4049506e+00 ...
...,...,...,...,...,...,...,...
5248,fe6ff9a5,there has been a fuss about the elector colleg...,1,0,Human,Competition,[-1.4767352 0.04409365 -2.265938 0.227724...
5249,ff669174,limiting car usage has many advantages. such a...,0,0,Human,Competition,[-2.03082055e-01 8.64372015e-01 -3.53538036e+...
5250,ffa247e0,there's a new trend that has been developing f...,0,0,Human,Competition,[-8.06897342e-01 1.52280945e-02 -2.08817863e+...
5251,ffc237e9,as we all know cars are a big part of our soci...,0,0,Human,Competition,[-6.59209251e-01 1.89325139e-01 -2.40648055e+...


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer


#https://aclanthology.org/2020.aespen-1.6.pdf
vectorizer = TfidfVectorizer(strip_accents="unicode",
                             ngram_range=(3,5), 
                             max_df=0.9, 
                             min_df=0.05)
vectorizer.fit(df.text)

In [4]:
import skops.io as sio


file = os.path.join(INTERMEDIATE_DIR, notebook_config.MODEL_FILE_NAME)
best_model = sio.load(file, trusted=True)
best_model

## Running LOOV

With an ensemble classifier, this procedure can take anywhere from 13 to 27 hours using all available processing power, depending on the number of parameters, the estimators, and the hardware. We therefore run the script on only a subset of the dataset, and flush the results to disk after every iteration so that progress survives an interruption.
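
The cost comes from refitting the model once per held-out sample. The following is a minimal sketch of that loop on synthetic data, assuming a `LogisticRegression` as a cheap stand-in for the tuned ensemble; the data and the names `base_model` and `loo_probas` are hypothetical and not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 20 samples, 3 features, alternating binary labels
rng = np.random.default_rng(0)
y = np.arange(20) % 2
X = rng.normal(size=(20, 3))
X[:, 0] += 2 * y  # make the classes roughly separable

base_model = LogisticRegression()  # stand-in for the real ensemble

loo_probas = []
for i in range(len(X)):
    # Train on every sample except the i-th one
    mask = np.arange(len(X)) != i
    model = clone(base_model).fit(X[mask], y[mask])
    # Probability of class 1 for the single held-out sample
    loo_probas.append(model.predict_proba(X[i:i + 1])[0, 1])

loo_probas = np.array(loo_probas)
print(loo_probas.round(2))
```

With N samples the model is fit N times, which is why the runtime scales so poorly for an expensive estimator.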

In [16]:
from sklearn.base import clone


def validate(essay_id: str) -> tuple[np.ndarray, pd.Series]:
    """
    Refit the model on every essay except the given one, then predict on the held-out essay.

    :param essay_id: A string specifying the ID of the essay to be validated.
    :type essay_id: str

    :return: A tuple containing the predicted class probabilities and the text of the specified essay.
             The probabilities are an array of shape (1, n_classes) produced by the refitted model.
    :rtype: tuple[np.ndarray, pd.Series]

    :raises ValueError: If the specified essay ID does not have a unique match in the dataset.
    """
    essay_train = df[~df.id.eq(essay_id)]
    essay_test = df[df.id.eq(essay_id)]
    
    if essay_test.shape[0] != 1:
        raise ValueError(f"Error id={essay_id} shape={essay_test.shape}")
        
    model = clone(best_model)
    model = model.fit(vectorizer.transform(essay_train.text), essay_train.generated)

    return model.predict_proba(vectorizer.transform(essay_test.text)), essay_test.text


def csv_output(df: pd.DataFrame, filename: str) -> None:
    """
    Save a pandas DataFrame to a CSV file.

    :param df: The DataFrame to be saved.
    :type df: pd.DataFrame

    :param filename: The name of the CSV file.
    :type filename: str

    :return: This function does not return anything.
    :rtype: None
    """
    # NOTE: OUTPUT_DIR is not defined in this notebook; it must be set before calling this helper
    file = os.path.join(OUTPUT_DIR, filename)
    df.to_csv(file, encoding = 'utf8')
    print(f"File saved successfully as {file}")


def batch_validate(essay_id: str, file: str) -> None:
    """
    Batch validate a trained model on a specific essay identified by its ID and append results to a CSV file.

    :param essay_id: A string specifying the ID of the essay to be validated.
    :type essay_id: str
    :param file: A string specifying the name of the CSV file to which the results will be appended.
    :type file: str

    :return: None

    :raises Exception: If an error occurs during the validation process, the exception is printed, 
    and the function returns.
    """
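    try:
        res, text = validate(essay_id)
    except Exception as e:
        # skip essays that fail validation (e.g. duplicated IDs) and keep going
        print(e)
        return

    # predict_proba returns a nested array of shape (1, n_classes);
    # res[0][1] unwraps it to the probability of the "generated" class (label 1)
    res_df = pd.DataFrame({"id": [essay_id], 
                           "text": text.iloc[0],
                           "proba": [res[0][1]]})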
    
    # if no error append results to disk
    # since the computational cost is absolutely enormous, a few IO operations per iteration
    # don't hurt efficiency
    old_df = pd.read_csv(file).loc[:, ["id", "text", "proba"]]
    new_df = pd.concat([old_df, res_df])
    new_df.to_csv(file, encoding="utf-8")
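
As a possible alternative to rereading and rewriting the whole results file on each iteration, a row can be appended directly using the `mode="a"` option of `DataFrame.to_csv`. The following is a minimal sketch under that assumption; `append_result` and the temporary file path are hypothetical names and are not used elsewhere in this notebook.

```python
import os
import tempfile

import pandas as pd


def append_result(path: str, essay_id: str, text: str, proba: float) -> None:
    """Append one result row to a CSV file, writing the header only once."""
    row = pd.DataFrame({"id": [essay_id], "text": [text], "proba": [proba]})
    # Write the header only if the file is missing or still empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    row.to_csv(path, mode="a", header=write_header, index=False, encoding="utf-8")


# Usage on a throwaway file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "loov_results.csv")
    append_result(path, "a1", "first essay", 0.91)
    append_result(path, "b2", "second essay", 0.12)
    results = pd.read_csv(path)

print(results)
```

Writing with `index=False` also avoids accumulating an `Unnamed: 0` column when the file is read back.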

In [19]:
import random


file = os.path.join(INTERMEDIATE_DIR, notebook_config.LOOV_RES_NAME)

# resume from previous progress if a results file exists, otherwise create an empty one
try:
    previous_progress_df = pd.read_csv(file)
    if len(previous_progress_df) == 0:
        previous_ids = set()
    else:
        previous_ids = {str(id) for id in previous_progress_df.id}
except FileNotFoundError:
    # create empty csv file
    previous_progress_df = pd.DataFrame({"id": [], "text": [] , "proba": []})
    previous_progress_df.to_csv(file)
    previous_ids = set()

ids = df[df.generated == 1].id
new_ids = [id for id in ids if str(id) not in previous_ids]
random.shuffle(new_ids)

In [20]:
print("Running Leave One Out validation for generated texts...")
for id in tqdm(new_ids):
    batch_validate(id, file=file)

Running Leave One Out validation for generated texts...


  0%|          | 0/3878 [00:00<?, ?it/s]

  new_df = pd.concat([old_df, res_df])


Error id=888804688479382258 shape=(2, 7)


KeyboardInterrupt: 