# [Team JelyTooth] LLM - Detect AI Generated Text
- Competition: [Detect AI Generated Text](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/overview)
- Timeline
    - October 31, 2023 - Start Date.
    - January 15, 2024 - Entry Deadline. You must accept the competition rules before this date in order to compete.
    - January 15, 2024 - Team Merger Deadline. This is the last day participants may join or merge teams.
    - January 22, 2024 - Final Submission Deadline.

### Original sources 👍
<!-- Drawing Table here, Leftside aligned -->
| Name | Type | Description | Link |
|:---|:---|:---|:---|
| 🌌🎲 Token Game: Ensemble Playbook 📖🔮 | Notebook | Example notebook that classifies text by combining BPE and SentencePiece tokenizers with an ensemble of basic ML classifiers. It reaches a high score of 0.96. | [Link](https://www.kaggle.com/code/verracodeguacas/token-game-ensemble-playbook) |
| DAIGT V2 Train Dataset | Dataset | A dataset you can actually train on for the LLM Detect AI Generated Text comp. It consolidates the datasets generated by DAIGT competition participants. | [Link](https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset) |
| Text generated with ChatGPT by MOTH | Dataset | Text generated with ChatGPT by MOTH | [Link](https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset) |
| Persuade corpus contributed by Nicholas Broad | Dataset | Persuade corpus contributed by Nicholas Broad | [Link](https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/) |
| Text generated with Llama-70b and Falcon180b by Nicholas Broad | Dataset | Text generated with Llama-70b and Falcon180b by Nicholas Broad | [Link](https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b) |
| Text generated with ChatGPT and GPT4 by Radek | Dataset | Text generated with ChatGPT and GPT4 by Radek | [Link](https://www.kaggle.com/datasets/radek1/llm-generated-essays) |
| 2000 Claude essays generated by @darraghdog | Dataset | 2000 Claude essays generated by @darraghdog | [Link](https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic) |
| LLM-generated essay using PaLM from Google Gen-AI by @kingki19 | Dataset | LLM-generated essay using PaLM from Google Gen-AI by @kingki19 | [Link](https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai) |


In [12]:
import sys, os
import gc
import pandas as pd
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    SentencePieceBPETokenizer
)
from datasets import Dataset
from tqdm.auto import tqdm
from transformers import PreTrainedTokenizerFast
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from typing import List
import logging


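def create_logger(name: str) -> logging.Logger:
    # Attach a root handler so INFO messages are actually emitted;
    # basicConfig is a no-op if logging is already configured.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    return logger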

logger = create_logger('LLM-Detect-AI-Generated-Text')

In [27]:
# global variables
LOWERCASE = False
VOCAB_SIZE = 64000

os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Competition data
The competition data can be downloaded through the Kaggle API.
```bash
kaggle competitions download -c llm-detect-ai-generated-text
unzip llm-detect-ai-generated-text.zip
ls
```
The download contains the following files.
```
sample_submission.csv  test_essays.csv train_essays.csv train_prompts.csv
```

### Training data: DAIGT V2 Train Dataset
The publicly available DAIGT dataset can be downloaded from [this link](https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset).

> If you are working in a Kaggle Notebook environment, you can attach the dataset to the notebook instead. In that case there is no need to download the data as described above.
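
Because the files live in different places locally and on Kaggle, the read paths in the loading cell may need adjusting. A minimal sketch of picking the data directory, assuming Kaggle mounts attached data under `/kaggle/input/<slug>` and that the local copy sits in `./dataset/llm-detect-ai-generated-text`; `resolve_data_dir` and `CANDIDATE_DIRS` are hypothetical names, not part of the original notebook:

```python
from pathlib import Path
from typing import Sequence


def resolve_data_dir(candidates: Sequence[str]) -> Path:
    """Return the first candidate directory that exists on disk."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    raise FileNotFoundError(f"None of the data directories exist: {list(candidates)}")


# Assumed locations: the Kaggle mount point first, then the local download directory.
CANDIDATE_DIRS = [
    "/kaggle/input/llm-detect-ai-generated-text",
    "./dataset/llm-detect-ai-generated-text",
]
```

The result could then be used as `pd.read_csv(resolve_data_dir(CANDIDATE_DIRS) / "train_essays.csv")`, so the same cell runs in both environments.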

In [14]:
sample_submission_df = pd.read_csv('./dataset/llm-detect-ai-generated-text/sample_submission.csv')
test_essays_df = pd.read_csv('./dataset/llm-detect-ai-generated-text/test_essays.csv')
train_essays_df = pd.read_csv('./dataset/llm-detect-ai-generated-text/train_essays.csv')
train_prompts_df = pd.read_csv('./dataset/llm-detect-ai-generated-text/train_prompts.csv')
train_v2_drcat_df = pd.read_csv('./dataset/llm-detect-ai-generated-text/train_v2_drcat_02.csv')

# preprocess
train_v2_drcat_df = train_v2_drcat_df.drop_duplicates(subset=['text'])
train_v2_drcat_df.reset_index(drop=True, inplace=True)

logger.info('train_v2_drcat_df shape: {}'.format(train_v2_drcat_df.shape))
logger.info('train_essays_df shape: {}'.format(train_essays_df.shape))
logger.info('train_prompts_df shape: {}'.format(train_prompts_df.shape))
logger.info('test_essays_df shape: {}'.format(test_essays_df.shape))
logger.info('sample_submission_df shape: {}'.format(sample_submission_df.shape))

INFO:LLM-Detect-AI-Generated-Text:train_v2_drcat_df shape: (44868, 5)
INFO:LLM-Detect-AI-Generated-Text:train_essays_df shape: (1378, 4)
INFO:LLM-Detect-AI-Generated-Text:train_prompts_df shape: (2, 4)
INFO:LLM-Detect-AI-Generated-Text:test_essays_df shape: (3, 3)
INFO:LLM-Detect-AI-Generated-Text:sample_submission_df shape: (3, 2)


### What is the "RDizzl3 Seven"? [Go to discussion](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/457166)
- The "RDizzl3 Seven" is a moniker given to the (probable) seven prompts used by the hosts of the competition.
- For more detail, see the topic 'A "7 Prompts" training dataset' and the comments therein.

#### [LLM] A "7 Prompts" training dataset [Go to discussion](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/453410)
- The author of the DAIGT V2 Train Dataset states that the RDizzl3 Seven filter has been applied to it (see the `RDizzl3_seven` column in the output below).

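#### Dataset collection strategy?
- Many participants are already generating their own data for this problem and pooling it publicly.
- Make active use of these shared datasets to get a scoring model working quickly.
- Then identify where the high-scoring model falls short, collect additional data to cover those gaps, and build a private dataset to push the score higher.

> Collect public datasets -> model (submit) -> identify improvements -> generate own data -> rework the dataset -> model (submit) -> identify improvements -> ... repeat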

In [5]:
train_v2_drcat_df.head(2) # If you build your own dataset, it is advisable to match this DataFrame's structure.

|  | text | label | prompt_name | source | RDizzl3_seven |
|:---|:---|:---|:---|:---|:---|
| 0 | Phones\n\nModern humans today are always on th... | 0 | Phones and driving | persuade_corpus | False |
| 1 | This essay will explain if drivers should or s... | 0 | Phones and driving | persuade_corpus | False |
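
If self-generated essays are added later, they would need the same columns as the DAIGT frame shown above. A minimal sketch of writing such rows, assuming the five columns from that output and that `label` follows the 0 = human, 1 = generated convention; `make_row`, `rows_to_csv`, and the `source` value `my_private_set` are hypothetical:

```python
import csv
import io
from typing import Dict, List

# Column layout taken from the train_v2_drcat_df output above.
DAIGT_COLUMNS = ["text", "label", "prompt_name", "source", "RDizzl3_seven"]


def make_row(text: str, label: int, prompt_name: str, source: str,
             rdizzl3_seven: bool) -> Dict[str, object]:
    """Build one record in the DAIGT column layout (assumed: 0 = human, 1 = generated)."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return {
        "text": text,
        "label": label,
        "prompt_name": prompt_name,
        "source": source,
        "RDizzl3_seven": rdizzl3_seven,
    }


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize rows to CSV text with the DAIGT header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DAIGT_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example row for a privately generated essay.
csv_text = rows_to_csv([make_row("An essay.", 1, "Car-free cities", "my_private_set", True)])
```

A file written this way can be read with `pd.read_csv` and concatenated with `train_v2_drcat_df` without renaming columns.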


In [6]:
train_prompts_df.head(2)

|  | prompt_id | prompt_name | instructions | source_text |
|:---|:---|:---|:---|:---|
| 0 | 0 | Car-free cities | Write an explanatory essay to inform fellow ci... | # In German Suburb, Life Goes On Without Cars ... |
| 1 | 1 | Does the electoral college work? | Write a letter to your state senator in which ... | # What Is the Electoral College? by the Office... |


In [7]:
train_essays_df.head(2)

|  | id | prompt_id | text | generated |
|:---|:---|:---|:---|:---|
| 0 | 0059830c | 0 | Cars. Cars have been around since they became ... | 0 |
| 1 | 005db917 | 0 | Transportation is a large necessity in most co... | 0 |


In [8]:
test_essays_df.head(2)

|  | id | prompt_id | text |
|:---|:---|:---|:---|
| 0 | 0000aaaa | 2 | Aaa bbb ccc. |
| 1 | 1111bbbb | 3 | Bbb ccc ddd. |


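### BPE tokenization / TF-IDF vectorization / ML classifier ensemble
1. Load the text of the training and test datasets.
2. Split the text into subword tokens with a tokenizer.
3. Because the task is to identify artificially generated text, the TF-IDF vocabulary is, somewhat counterintuitively, built from the test set text rather than the training text.
    - TF-IDF builds a vector from how often a term occurs within a document (term frequency) and how many documents the term occurs in (document frequency).
4. Vectorize both the training and test datasets with the TF-IDF vocabulary built from the test set.
5. The vectors fed to the ML classifiers are therefore the TF-IDF values of the training texts over the test set's terms.
6. The classifiers learn patterns from vectors that encode how strongly each text contains the test set's terms.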
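
Steps 3 to 5 can be sketched in pure Python. This is a toy illustration, not the notebook's `TfidfVectorizer` call: it assumes 3-to-5-token n-grams, sublinear term frequency, smoothed IDF, and L2 normalization (the settings and defaults used later in the notebook), and the helper names `word_ngrams`, `build_vocabulary`, and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def word_ngrams(tokens: List[str], n_min: int = 3, n_max: int = 5) -> List[NGram]:
    """All contiguous token n-grams with n_min <= n <= n_max."""
    return [
        tuple(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(tokenized_test: List[List[str]]) -> Dict[NGram, int]:
    """Step 3: the vocabulary holds only n-grams that occur in the test texts."""
    grams = sorted({gram for doc in tokenized_test for gram in word_ngrams(doc)})
    return {gram: index for index, gram in enumerate(grams)}


def tfidf_vectors(tokenized_docs: List[List[str]], vocab: Dict[NGram, int]) -> List[List[float]]:
    """Step 4: TF-IDF restricted to `vocab`, with IDF computed on `tokenized_docs`."""
    n_docs = len(tokenized_docs)
    counts = [Counter(g for g in word_ngrams(doc) if g in vocab) for doc in tokenized_docs]
    doc_freq = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        vec = [0.0] * len(vocab)
        for gram, tf in c.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0  # smoothed IDF
            vec[vocab[gram]] = (1.0 + math.log(tf)) * idf  # sublinear term frequency
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm > 0 else vec)  # L2 normalization
    return vectors


# Toy data: one test text, two training texts.
test_tokens = [["a", "b", "c", "d"]]
train_tokens = [["a", "b", "c", "x"], ["x", "y", "z", "w"]]
vocab = build_vocabulary(test_tokens)
train_vectors = tfidf_vectors(train_tokens, vocab)
```

In this toy run the second training text shares no n-gram with the test text, so its vector is all zeros. Training texts only carry signal where they overlap with the test vocabulary.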

In [25]:
def get_special_tokens():
    return {"unknown": "[UNK]", "padding": "[PAD]", "mask": "[MASK]", "cls": "[CLS]", "sep": "[SEP]"}


def get_raw_bpe_tokenizer(corpus: List[str], num_words: int = VOCAB_SIZE, lowercase: bool = LOWERCASE) -> PreTrainedTokenizerFast:
    logger = create_logger('get_raw_bpe_tokenizer')

    # initialize tokenizer
    logger.info("Initializing tokenizer...")
    special_tokens = get_special_tokens()
    bpe_model = models.BPE(unk_token=special_tokens["unknown"])
    raw_tokenizer = Tokenizer(model=bpe_model)

    # set normalizer
    logger.info("Setting normalizer...")
    _normalizers = [normalizers.NFC()]
    if lowercase:
        _normalizers.append(normalizers.Lowercase())
    raw_tokenizer.normalizer = normalizers.Sequence(_normalizers)
    raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    
    # training
    logger.info("Setting trainer...")
    _special_tokens = list(special_tokens.values())
    trainer = trainers.BpeTrainer(
        vocab_size=num_words,
        special_tokens=_special_tokens,
    )
    dataset = Dataset.from_dict({"text": corpus})

    def _training_corpus_iterator():
        # Slice the rows first so only one 1000-row batch is materialized at a time
        for i in range(0, len(dataset), 1000):
            yield dataset[i : i + 1000]["text"]

    logger.info("Training tokenizer...")
    raw_tokenizer.train_from_iterator(_training_corpus_iterator(), trainer=trainer)
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=raw_tokenizer,
        unk_token=special_tokens["unknown"],
        pad_token=special_tokens["padding"],
        mask_token=special_tokens["mask"],
        cls_token=special_tokens["cls"],
        sep_token=special_tokens["sep"]
    )
    logger.info("Done!")
    return tokenizer


def get_tf_idf_vectors(tokenized_texts_train: List[List[str]], tokenized_texts_test: List[List[str]]):
    logger = create_logger('get_tf_idf_vectors')

    logger.info("Initializing TF-IDF vectorizer for test samples ...")
    vectorizer = TfidfVectorizer(
        ngram_range=(3, 5), 
        lowercase=False, 
        sublinear_tf=True,
        analyzer='word',
        tokenizer=lambda x: x,
        preprocessor=lambda x: x,
        token_pattern=None, 
        strip_accents='unicode'
    )
    logger.info("Fitting TF-IDF vectorizer using test samples ...")
    vectorizer.fit(tokenized_texts_test)
    vocab = vectorizer.vocabulary_
    logger.info(f"Number of words in TF-IDF vocabulary: {len(vocab)}")
    
    vectorizer = TfidfVectorizer(
        ngram_range=(3, 5), 
        lowercase=False, 
        sublinear_tf=True, 
        vocabulary=vocab,
        analyzer='word',
        tokenizer=lambda x: x,
        preprocessor=lambda x: x,
        token_pattern=None, 
        strip_accents='unicode'
    )

    logger.info("Transforming train texts...")
    train_vectors = vectorizer.fit_transform(tokenized_texts_train)
    logger.info("Transforming test texts...")
    test_vectors = vectorizer.transform(tokenized_texts_test)
    
    logger.info(f"Shape of train vectors: {train_vectors.shape}")
    logger.info(f"Shape of test vectors: {test_vectors.shape}")
    return train_vectors, test_vectors


def ensemble_baseline(X, y, X_test):
    clf = MultinomialNB(alpha=0.02)
    sgd_model = SGDClassifier(max_iter=8000, tol=1e-4, loss="modified_huber") 
    weights = [0.5, 0.5]
    ensemble = VotingClassifier(
        estimators=[('mnb',clf), ('sgd', sgd_model)],
        weights=weights, 
        voting='soft', 
        n_jobs=-1
    )
    ensemble.fit(X, y)
    y_pred = ensemble.predict_proba(X_test)[:, 1]
    return y_pred
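
`VotingClassifier` with `voting='soft'` averages the class probabilities of its estimators, weighted by `weights`. A minimal pure-Python sketch of that averaging for the positive class, with made-up probabilities standing in for the `MultinomialNB` and `SGDClassifier` outputs; `soft_vote` is a hypothetical helper, not a scikit-learn function:

```python
from typing import List, Sequence


def soft_vote(probabilities: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of per-model positive-class probabilities.

    probabilities[m][i] is model m's probability for sample i.
    """
    if len(probabilities) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]


# Made-up positive-class probabilities for three samples.
mnb_proba = [0.9, 0.2, 0.6]
sgd_proba = [0.7, 0.4, 0.2]
blended = soft_vote([mnb_proba, sgd_proba], weights=[0.5, 0.5])
```

With equal weights this reduces to a plain mean of the two models, which is what the `weights = [0.5, 0.5]` setting in `ensemble_baseline` amounts to.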


In [19]:
tokenizer = get_raw_bpe_tokenizer(test_essays_df['text'].tolist())
tokenized_texts_test = []
tokenized_texts_train = []

for text in tqdm(test_essays_df['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))

for text in tqdm(train_v2_drcat_df['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))

INFO:get_raw_bpe_tokenizer:Initializing tokenizer...
INFO:get_raw_bpe_tokenizer:Setting normalizer...
INFO:get_raw_bpe_tokenizer:Setting trainer...
INFO:get_raw_bpe_tokenizer:Training tokenizer...
INFO:get_raw_bpe_tokenizer:Done!





huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
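
The tokenizer above is trained by the `tokenizers` library's `BpeTrainer`. As a rough illustration of what byte-pair encoding training does, the toy sketch below repeatedly merges the most frequent adjacent symbol pair. It is a simplification (character-level, no byte-level pre-tokenization, no special tokens) and not the library's implementation; `merge_pair` and `learn_bpe_merges` are hypothetical helpers.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = [list(word) for word in words]  # every word starts as single characters
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # nothing left to merge
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


merges = learn_bpe_merges(["low", "lower", "lowest"], num_merges=2)
```

Each learned merge becomes a vocabulary entry, so a larger `VOCAB_SIZE` allows more merges and therefore longer subword units.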



In [23]:
tf_train, tf_test = get_tf_idf_vectors(tokenized_texts_train, tokenized_texts_test)

INFO:get_tf_idf_vectors:Initializing TF-IDF vectorizer for test samples ...
INFO:get_tf_idf_vectors:Fitting TF-IDF vectorizer using test samples ...
INFO:get_tf_idf_vectors:Number of words in TF-IDF vocabulary: 9
INFO:get_tf_idf_vectors:Transforming train texts...
INFO:get_tf_idf_vectors:Transforming test texts...
INFO:get_tf_idf_vectors:Shape of train vectors: (44868, 9)
INFO:get_tf_idf_vectors:Shape of test vectors: (3, 9)


In [28]:
pred_baseline = ensemble_baseline(X=tf_train, y=train_v2_drcat_df['label'].to_list(), X_test=tf_test)
pred_baseline

array([0.39345228, 0.39345228, 0.39345228])
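
With only three placeholder rows in the public `test_essays.csv`, every prediction above is identical, so this output says nothing about model quality. The notebook imports `roc_auc_score` and `StratifiedKFold` but never calls them; scoring a held-out split of the training data with ROC AUC would be one way to check the model locally. As a sketch of what that metric measures (the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counting half), here is a pure-Python version. `pairwise_auc` is a hypothetical helper and is quadratic in the number of samples, so it is meant for illustration only.

```python
from typing import Sequence


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC computed by comparing every positive score with every negative score."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts half
    return wins / (len(positives) * len(negatives))


# Constant predictions carry no ranking information.
constant_auc = pairwise_auc([0, 1, 1], [0.39345228] * 3)
# Made-up labels and scores with one misordered pair.
ranked_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Constant predictions like the ones above score the same as random guessing under this metric, whatever the true labels are.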

In [31]:
sample_submission_df["generated"] = pred_baseline
sample_submission_df.to_csv("submission.csv", index=False)