## First step: tokenize the text in which you want to detect LLM usage

In [1]:
# import packages
import concurrent.futures

import pandas as pd
from tqdm import tqdm

from Model.helper import tokenize_sent


def apply_in_parallel(series, func):
    """Apply func to every element of series using a pool of worker processes."""
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map preserves the input order; tqdm shows progress as results arrive
        results = list(tqdm(executor.map(func, series), total=len(series)))
    # return a Series (not a DataFrame); reusing the input index keeps column assignment aligned
    return pd.Series(results, index=series.index)

In [2]:
# the later steps expect the detector's input to be a parquet file, so the data is loaded in that format
df = pd.read_parquet('data/test/ICLR2023.parquet')
# in this example the review text is in the 'all_review_text' column; each row holds one review of a paper
display(df)

Unnamed: 0,review ID,summary_of_the_paper,strength_and_weaknesses,"clarity,_quality,_novelty_and_reproducibility",summary_of_the_review,all_review_text,paper_id,keywords,inference_sentence
0,EmaqzywPDa,1. This paper studies the effect of label erro...,Strength\n\n1. They try to answer two signific...,1. Their theoretical analysis seems to be more...,"1. The overall quality of the paper is good, a...",1. This paper studies the effect of label erro...,RUzSobdYy0V,[],"[[this, paper, studies, the, effect, of, label..."
1,1UlNMuZj0fp,This paper considers an important problem of l...,Strength:\n- This paper is very well organized...,"The paper is clearly written in general, and a...",Despite the interesting perspective and a well...,This paper considers an important problem of l...,RUzSobdYy0V,[],"[[this, paper, considers, an, important, probl..."
2,7AWRUUgNqYe,This paper studies the effect of label error o...,Strength:\n+ The research problems are importa...,This paper generally is well-written and easy ...,"For me, the motivation and research problems o...",This paper studies the effect of label error o...,RUzSobdYy0V,[],"[[this, paper, studies, the, effect, of, label..."
3,JhO4VvJYby9,This paper proposes a type of neural networks ...,This paper is motivated by a neurobiological e...,The paper is clear to read. But its quality is...,This is a paper inspired by the lateral inhibi...,This paper proposes a type of neural networks ...,N3kGYG3ZcTi,"[Lateral Inhibition, Convolutional Neural Netw...","[[this, paper, proposes, a, type, of, neural, ..."
4,uM31Bm53z8,Summary:\nThis article starts from the perspec...,"(Positive) Although this article is very poor,...","The quality, clarity, and the originality are ...","See ""Summary Of The Paper."" Regarding the writ...",Summary:\nThis article starts from the perspec...,N3kGYG3ZcTi,"[Lateral Inhibition, Convolutional Neural Netw...","[[summary], [this, article, starts, from, the,..."
...,...,...,...,...,...,...,...,...,...
18559,nsvAxuM8gv,The paper (as the title suggests) proposes a s...,The biggest strength of the paper is the detai...,The manuscript is very clear. The quality of t...,While the authors present their proposed appro...,The paper (as the title suggests) proposes a s...,E9_04otJ62,"[Winograd convolution, structured pruning, GPU...","[[the, paper, as, the, title, suggests, propos..."
18560,3qrCHCgufw,This paper proposes a structured pruning strat...,Strengths:\n- The performance gains for the pr...,Please see the weaknesses above.,Please see the weaknesses above.,This paper proposes a structured pruning strat...,E9_04otJ62,"[Winograd convolution, structured pruning, GPU...","[[this, paper, proposes, a, structured, prunin..."
18561,sjxQy0-OQ7,"The paper proposes a method to ""enhance explor...",## Strengths\n* The idea to evaluate the polic...,The 'Weaknesses' section includes the comments...,"The weaknesses of the paper, especially the mo...","The paper proposes a method to ""enhance explor...",KjKZaJ5Gbv,"[Reinforcement Learning, Multitask Reinforceme...","[[the, paper, proposes, a, method, to, enhance..."
18562,4bJbeJX-G6-,The key idea behind this paper is to use a per...,"This paper is well written, easy to follow and...",The paper is well written and clear. The ideas...,Overall I think the paper is interesting and w...,The key idea behind this paper is to use a per...,KjKZaJ5Gbv,"[Reinforcement Learning, Multitask Reinforceme...","[[the, key, idea, behind, this, paper, is, to,..."


In [3]:
# tokenize the review text with the function above and store the result in a new column, 'inference_sentence'
# run time grows with the size of the dataset, which is why the work is spread across processes
df['inference_sentence'] = apply_in_parallel(df['all_review_text'], tokenize_sent)

100%|██████████| 18564/18564 [04:49<00:00, 64.07it/s]
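
The `inference_sentence` values shown in the table above are lists of sentences, each a list of lowercase words. The implementation of `tokenize_sent` is not shown in this notebook, so the following is only a minimal sketch of a tokenizer that produces output of that shape. The name `tokenize_sent_sketch` and its splitting rules are assumptions made for illustration, not the library's code.

```python
import re


def tokenize_sent_sketch(text):
    """Hypothetical stand-in for tokenize_sent: returns a list of sentences,
    each a list of lowercase alphabetic tokens."""
    # split after sentence-ending punctuation followed by whitespace, or at line breaks
    raw_sentences = re.split(r"(?<=[.!?])\s+|\n+", text)
    sentences = []
    for raw in raw_sentences:
        # keep only runs of letters, lowercased; digits and punctuation are dropped
        tokens = re.findall(r"[a-z]+", raw.lower())
        if tokens:  # skip fragments that contain no words, such as "1."
            sentences.append(tokens)
    return sentences


print(tokenize_sent_sketch("Summary:\nThis article starts here. It ends!"))
# [['summary'], ['this', 'article', 'starts', 'here'], ['it', 'ends']]
```

Because the sketch is a plain top-level function, it could be passed to `apply_in_parallel` in place of `tokenize_sent` to try the pipeline on a small sample.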


In [4]:
# save the result; this overwrites the input file, which now includes the 'inference_sentence' column
df.to_parquet('data/test/ICLR2023.parquet')

## Example: estimate LLM usage in ICLR2023 and ICLR2024 reviews

In [5]:
# import packages
import pandas as pd
from Model.model import MLE

In [6]:
# initialize the detector model
model = MLE()
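
The class name `MLE` suggests that the detector computes a maximum-likelihood estimate of the fraction of LLM-written text. Its implementation is not shown in this notebook, so the following is only a sketch of one common formulation: each item has a probability under a human-written distribution and under an LLM-written distribution, and the estimate is the mixing weight that maximizes the log-likelihood of the mixture. The function name `estimate_mixture_fraction`, the grid search, and the toy probabilities are all assumptions made for illustration.

```python
import math


def estimate_mixture_fraction(p_human, p_llm, steps=1000):
    """Grid-search the alpha in [0, 1] that maximizes
    sum_i log((1 - alpha) * p_human[i] + alpha * p_llm[i])."""
    best_alpha, best_ll = 0.0, -math.inf
    for k in range(steps + 1):
        alpha = k / steps
        ll = 0.0
        for p, q in zip(p_human, p_llm):
            mix = (1 - alpha) * p + alpha * q
            if mix <= 0:  # this alpha assigns zero probability to an observed item
                ll = -math.inf
                break
            ll += math.log(mix)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


# toy data: 30 items that look LLM-written and 70 that look human-written
p_human = [0.1] * 30 + [0.9] * 70  # probability of each item under the human distribution
p_llm = [0.9] * 30 + [0.1] * 70    # probability of each item under the LLM distribution
print(estimate_mixture_fraction(p_human, p_llm))
# 0.25
```

The estimate is 0.25 rather than 0.30 because the mixture accounts for human-written items that happen to look LLM-written, and the reverse.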

In [7]:
# estimate LLM usage in the ICLR2023 and ICLR2024 reviews
for i in range(2023, 2025):
    # path to the file to analyze; it must be a parquet file
    # that contains a column named "inference_sentence"
    # that column holds the tokenized sentences produced by the tokenization step above
    # for ICLR2023 reviews the estimated usage is 2.6%; for ICLR2024 reviews it is 14.8%
    # CI is the confidence interval of the estimated usage
    model.inference(f"data/test/ICLR{i}.parquet")

Prediction,        CI
     0.026,     0.002
Prediction,        CI
     0.148,     0.002
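
The two output blocks correspond, in loop order, to ICLR2023 and ICLR2024. To read them as percentages, a small helper such as the one below may be useful. It assumes that the `CI` column is the half-width of the confidence interval, which this notebook does not confirm; the function name `format_estimate` is hypothetical.

```python
def format_estimate(prediction, ci):
    """Format an estimated fraction and an assumed CI half-width as percentages."""
    low = max(0.0, prediction - ci)   # a fraction cannot fall below 0
    high = min(1.0, prediction + ci)  # or rise above 1
    return f"{prediction:.1%} ({low:.1%} to {high:.1%})"


print("ICLR2023:", format_estimate(0.026, 0.002))
print("ICLR2024:", format_estimate(0.148, 0.002))
# ICLR2023: 2.6% (2.4% to 2.8%)
# ICLR2024: 14.8% (14.6% to 15.0%)
```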
