<a href="https://colab.research.google.com/github/denistoo749/LLM-Detect-AI-Generated-Text/blob/main/LLM_Detect_AI_Generated_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM - Detect AI Generated Text
**1. Problem Definition**
- Develop a Machine Learning model that can accurately detect whether an essay was written by a student or an LLM.
- Identify which essay was written by a large language model.
- Dataset comprises a mix of student-written essays and essays generated by a variety of LLMs

**2. Data**
- Dataset comprises about 10,000 essays, some written by students and some generated by a variety of large language models (LLMs). The goal of the competition is to determine whether or not essay was generated by an LLM.
>https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data

**3. Evaluation**
- Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

In [51]:
# Unzip the zipped file
!unzip '/content/drive/MyDrive/LLM - Detect AI Generated Text/llm-detect-ai-generated-text.zip' -d '/content/drive/MyDrive/LLM - Detect AI Generated Text/data/'

Archive:  /content/drive/MyDrive/LLM - Detect AI Generated Text/llm-detect-ai-generated-text.zip
  inflating: /content/drive/MyDrive/LLM - Detect AI Generated Text/data/sample_submission.csv  
  inflating: /content/drive/MyDrive/LLM - Detect AI Generated Text/data/test_essays.csv  
  inflating: /content/drive/MyDrive/LLM - Detect AI Generated Text/data/train_essays.csv  
  inflating: /content/drive/MyDrive/LLM - Detect AI Generated Text/data/train_prompts.csv  


In [52]:
# Import necessary tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# model from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [53]:
df = pd.read_csv('/content/drive/MyDrive/LLM - Detect AI Generated Text/data/train_essays.csv')
df.head()

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0
3,00940276,0,How often do you ride in a car? Do you drive a...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0


In [54]:
df.generated.unique()

array([0, 1])

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378 entries, 0 to 1377
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1378 non-null   object
 1   prompt_id  1378 non-null   int64 
 2   text       1378 non-null   object
 3   generated  1378 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.2+ KB


In [56]:
df.isna().sum()

id           0
prompt_id    0
text         0
generated    0
dtype: int64

In [57]:
df['generated'].value_counts()

generated
0    1375
1       3
Name: count, dtype: int64

# Modelling

In [58]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

# Initialize LabelEncoders
id_encoder = LabelEncoder()
prompt_id_encoder = LabelEncoder()
text_encoder = LabelEncoder()

# Fit and transform the data
df['id_encoded'] = id_encoder.fit_transform(df['id'])
df['prompt_id_encoded'] = prompt_id_encoder.fit_transform(df['prompt_id'])
df['text_encoded'] = text_encoder.fit_transform(df['text'])

# Define features and target
X = df[['id_encoded', 'prompt_id_encoded', 'text_encoded']]
y = df['generated']

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handle imbalanced data with SMOTE, adjusting k_neighbors
smote = SMOTE(random_state=42, k_neighbors=1) # Set k_neighbors to a value less than or equal to the smallest minority class size
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Initialize models
log_reg_model = LogisticRegression(max_iter=1000, random_state=42)
rf_model = RandomForestClassifier(random_state=42)

# Fit the models
log_reg_model.fit(X_train_balanced, y_train_balanced)
rf_model.fit(X_train_balanced, y_train_balanced)

# Evaluate the models
log_reg_score = log_reg_model.score(X_val, y_val)
rf_score = rf_model.score(X_val, y_val)

print(f'Logistic Regression Score: {log_reg_score}')
print(f'Random Forest Score: {rf_score}')

Logistic Regression Score: 0.8586956521739131
Random Forest Score: 0.9927536231884058


In [59]:
y_train.value_counts()

generated
0    1100
1       2
Name: count, dtype: int64

# Hyperparameter Tuning with RandomizedSearchCV

In [60]:
from scipy.stats import randint, uniform

# Define hyperparameter search space for Logistic Regression
log_reg_param_distributions = {
    'C': uniform(0.1, 10),
    'solver': ['liblinear', 'lbfgs', 'saga'],
    'penalty': ['l2'],
    'max_iter': [100, 200, 500, 1000]
}

log_reg_rs = RandomizedSearchCV(
    LogisticRegression(random_state=42),
    param_distributions=log_reg_param_distributions,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbose=True
)

log_reg_rs.fit(X_train_balanced, y_train_balanced)

# Best hyperparameters and best score
best_log_reg_params = log_reg_rs.best_params_
best_log_reg_score = log_reg_rs.best_score_

print(f"Best Random Forest hyperparameters: {best_log_reg_params}")
print(f"Best Random Forest CV score: {best_log_reg_score}")

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Random Forest hyperparameters: {'C': 4.351558744912447, 'max_iter': 200, 'penalty': 'l2', 'solver': 'lbfgs'}
Best Random Forest CV score: 0.9218181818181819


In [61]:
log_reg_rs.best_params_

{'C': 4.351558744912447, 'max_iter': 200, 'penalty': 'l2', 'solver': 'lbfgs'}

In [62]:
log_reg_rs.score(X_val, y_val)

0.8586956521739131

In [72]:
# Define hyperparameter search space for Random Forest Classifier
rf_param_distributions = {
    'n_estimators': randint(10, 200),
    'max_depth': randint(1, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False]
}

# Initialize RandomizedSearchCV for Random Forest
rf_rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=rf_param_distributions,
    n_iter=50,
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbose=True
)

# Fit the RandomizedSearchCV to the training data
rf_rs.fit(X_train_balanced, y_train_balanced)

# Best hyperparameters and best score
best_rf_params = rf_rs.best_params_
best_rf_score = rf_rs.best_score_

print(f"Best Random Forest hyperparameters: {best_rf_params}")
print(f"Best Random Forest CV score: {best_rf_score}")

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Random Forest hyperparameters: {'bootstrap': False, 'max_depth': 24, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 12, 'n_estimators': 122}
Best Random Forest CV score: 0.9918181818181819


In [73]:
test = pd.read_csv('/content/drive/MyDrive/LLM - Detect AI Generated Text/data/test_essays.csv')
test.head()

Unnamed: 0,id,prompt_id,text
0,0000aaaa,2,Aaa bbb ccc.
1,1111bbbb,3,Bbb ccc ddd.
2,2222cccc,4,CCC ddd eee.


In [74]:
# Initialize LabelEncoders
id_encoder = LabelEncoder()
prompt_id_encoder = LabelEncoder()
text_encoder = LabelEncoder()

# Fit and transform the data
test['id_encoded'] = id_encoder.fit_transform(test['id'])
test['prompt_id_encoded'] = prompt_id_encoder.fit_transform(test['prompt_id'])
test['text_encoded'] = text_encoder.fit_transform(test['text'])

# Define features and target
X = test[['id_encoded', 'prompt_id_encoded', 'text_encoded']]

In [75]:
log_reg_preds = log_reg_rs.predict(X)

In [76]:
log_reg_preds

array([0, 0, 0])

In [77]:
rf_preds = rf_rs.predict(X)

In [78]:
rf_preds

array([0, 0, 0])

In [79]:
submissions = pd.DataFrame({'id': test.id, 'generated': rf_preds})
submissions.to_csv('/content/drive/MyDrive/LLM - Detect AI Generated Text/data/submission.csv', index=False)

In [80]:
sub = pd.read_csv('/content/drive/MyDrive/LLM - Detect AI Generated Text/data/submission.csv')
sub

Unnamed: 0,id,generated
0,0000aaaa,0
1,1111bbbb,0
2,2222cccc,0
