<a href="https://www.kaggle.com/code/scr0ll0/modeling-submission?scriptVersionId=154907981" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Imports/Downloads

In [1]:
import pandas as pd
import re

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

from scipy.stats import uniform

# Loading Data

Credit to Darek for the daigt-v2-train-dataset: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset

In [2]:
train = pd.read_csv('/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv')
test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
sample = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')

# Cleaning Data

In [3]:
#Filtering
train = train[train['RDizzl3_seven'] == True].copy()

train = train[['text', 'label']].copy()
sample['id'] = test['id']
test = test.drop(columns=['id', 'prompt_id'])

dev_x = train.drop(columns='label')
dev_y = train['label']

In [4]:
#Sampling
dev_x, dev_y = RandomOverSampler(random_state=0).fit_resample(dev_x, dev_y)
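
As a rough illustration of what the sampling step does: with its default settings, `RandomOverSampler` resamples the minority class with replacement until it matches the majority class. The sketch below imitates that idea with the standard library only. The function name `random_oversample` and the toy data are hypothetical, and this is not imbalanced-learn's implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class
    has as many rows as the largest class (simplified sketch)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All rows that belong to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority size
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data: four human-written essays (0) and one generated essay (1)
texts = ["h1", "h2", "h3", "h4", "g1"]
labels = [0, 0, 0, 0, 1]
new_texts, new_labels = random_oversample(texts, labels)
balanced_counts = Counter(new_labels)
print(balanced_counts)
```

Because the extra rows are exact duplicates, oversampling changes the class balance seen by the classifier but adds no new text.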

Credit to Vladimir Demidov for this normalization function: https://www.kaggle.com/code/yekenot/llm-detect-by-regression/notebook

In [5]:
#Cleaning
def normalize(text):
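    # Replace newlines and carriage returns with a space to separate '😃\n\nFor'.
    # The raw strings match the literal two-character sequences (backslash + n/r);
    # the plain strings match the actual control characters.
    for seq in ("\n", "\r", r"\n", r"\r"):
        text = text.replace(seq, " ")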
    # Drop punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra spaces from '😃  For' to '😃 For'
    text = re.sub(r"\s+", r" ", text)
    # Remove leading and trailing whitespace
    text = text.strip()
    return text

dev_x['text'] = dev_x['text'].apply(normalize)
test['text'] = test['text'].apply(normalize)
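
To make the effect of the cleaning steps concrete, here is a small self-contained check. `normalize_sketch` is a hypothetical stand-alone copy of the same steps (newline replacement, punctuation removal, whitespace collapsing, stripping), written only for illustration. Note that `\w` keeps underscores, while emoji are removed along with punctuation.

```python
import re

def normalize_sketch(text):
    # Illustrative copy of the cleaning steps, not the notebook's exact function
    for seq in ("\n", "\r", r"\n", r"\r"):
        text = text.replace(seq, " ")
    # Drop everything that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = normalize_sketch("Hello, world!\n\nIt's  a test_case 😃 ")
print(cleaned)  # Hello world Its a test_case
```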

Credit to both pamilove_dl (https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/455701) and Damien Mourot (https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/455969) for the suggestion to fit the TF-IDF vectorizer on the test set.

Further credit to Vladimir Demidov for the `TfidfVectorizer` arguments: https://www.kaggle.com/code/yekenot/llm-detect-by-regression

In [6]:
#Vectorizing
vector = TfidfVectorizer(stop_words='english',
                         tokenizer=lambda x: re.findall(r'[^\W]+', x),
                         token_pattern=None, 
                         strip_accents='unicode',
                         ngram_range=(3, 4))
#vector.fit(dev_x['text'])
vector.fit(test['text'])
dev_x = pd.DataFrame(vector.transform(dev_x['text']).toarray())
test = pd.DataFrame(vector.transform(test['text']).toarray())
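
Fitting the vectorizer on the test essays means the vocabulary contains only the word 3-grams and 4-grams that occur in the test set, so any training n-gram absent from the test set is dropped at transform time. The sketch below illustrates that restriction with the standard library only. It is a simplified stand-in that skips stop-word removal, accent stripping, and the TF-IDF weighting itself; `word_ngrams` and the example sentences are hypothetical.

```python
import re

def word_ngrams(text, n_min=3, n_max=4):
    """Return word n-grams for n in [n_min, n_max], using the same
    token regex as the vectorizer cell (simplified sketch)."""
    tokens = re.findall(r'[^\W]+', text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# The "vocabulary" is built from the test text only
test_vocab = set(word_ngrams("the quick brown fox jumps"))

# Training n-grams survive only if they also appear in the test vocabulary
train_grams = word_ngrams("a quick brown fox sleeps")
kept = [g for g in train_grams if g in test_vocab]
print(kept)  # ['quick brown fox']
```

In this toy case only one of the five training n-grams survives, which shows how the test-fitted vocabulary focuses the features on phrases that actually appear at prediction time.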

# Modeling

In [7]:
#Logistic Regression
logistic = LogisticRegression()
#logistic.fit(dev_x, dev_y)

In [8]:
#SGD Classifier

sgd = SGDClassifier(max_iter=5000, loss='modified_huber', random_state=0)
#sgd.fit(dev_x, dev_y)

In [9]:
#Voting Classifier

model = VotingClassifier(estimators=[('lr', logistic),('sgd', sgd)], voting='soft', weights=[0.01, 0.99])
model.fit(dev_x, dev_y)
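
With `voting='soft'`, the ensemble's probability for each class is the weighted average of the probabilities predicted by the individual estimators. A minimal sketch of that arithmetic for the weights used above follows. The function `soft_vote` and the example probabilities are made up for illustration and do not come from the fitted models.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities.
    probas: one [p_class0, p_class1] list per model."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]

# Hypothetical predictions for a single essay
lr_proba = [0.30, 0.70]   # logistic regression leans "generated"
sgd_proba = [0.90, 0.10]  # SGD classifier leans "human"

blended = soft_vote([lr_proba, sgd_proba], weights=[0.01, 0.99])
print(blended)
```

With weights of 0.01 and 0.99, the blended probability stays very close to the SGD classifier's output; the logistic regression contributes only a small correction.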

# Predictions

In [10]:
sample['generated'] = model.predict_proba(test)[:, 1]

In [11]:
sample.to_csv('submission.csv', index=False)