# Final Project

## Names: Sophia Hillard & Campbell Linker

- Insert Markdown chunks for your written responses as needed
- Do not include unnecessary code. Only code that is needed to answer the questions should be included, nothing more. 
- Be sure your work is **reproducible** by "Restart and Clear Output" then "Run All" cells
- The group leader will make a single submission on Moodle on behalf of the group. 
- **ACKNOWLEDGE ALL EXTERNAL SOURCES HERE:** Give a brief summary of any external sources you used. Ex:
    - https://pankaj8blr.medium.com/eda-and-significance-of-various-plots-in-feature-engineering-f8cededbd520
    - https://seaborn.pydata.org/generated/seaborn.boxplot.html
    - Google AI overview for XXX
    - StackOverflow for YYY

In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import re
from spellchecker import SpellChecker
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

import warnings   
warnings.simplefilter(action='ignore', category=FutureWarning)

## 1. The Data

The dataset we chose is [LLM - Detect AI Generated Text](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/overview) where:

* Outcome variable: Binary, AI generated (1) or human generated (0)
* Predictor variables: all derived from the essay text (hand-engineered count features plus TF-IDF n-grams)
* Number of observations:
    * Training set: $n_{train} = 9000$ ($1000$ from competition training set, $8000$ from [recommended external source](https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset))
    * Test set: $n_{test} = 9000$
* Score/metric used for the [leaderboard](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/leaderboard): Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
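
Because the leaderboard metric is area under the ROC curve, what matters is how well the predicted probabilities rank generated essays above student-written ones. The sketch below is a plain-Python illustration (not the competition's scoring code): `roc_auc`, the toy labels, and the toy probabilities are all made up for this example. It computes AUC as the fraction of (positive, negative) pairs ranked correctly, and shows that thresholding to hard 0/1 labels can change the score.

```python
from itertools import product

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = generated, 0 = student-authored
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.30, 0.35, 0.80, 0.20, 0.45]

# Thresholding at 0.5 throws away the ordering among essays on the same side
hard_labels = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = roc_auc(y_true, probs)        # every positive outranks every negative
auc_hard = roc_auc(y_true, hard_labels)   # many pairs become ties

print(f"AUC from probabilities: {auc_probs:.3f}")
print(f"AUC from hard labels:   {auc_hard:.3f}")
```

In this toy case the probabilities rank every generated essay above every student essay, while the hard labels turn most pairs into ties. That is one reason to submit `predict_proba` output rather than `predict` output for an AUC-scored competition.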

In [35]:
# Load all datasets here. 
train = pd.read_csv("data/all-DAIGT-training.csv", sep=',', index_col = 'id') # see write-up for data provenance
test_essays = pd.read_csv('data/test_essays.csv', index_col='id')
example = pd.read_csv('data/sample_submission.csv', index_col='id')

## 2. Exploratory Data Analysis

Perform all necessary EDA here. Before submitting, only keep those you feel are relevant

In [41]:
# Univariate visualization of outcome variable:
y_train_cat = train['generated'].map({1: 'Generated', 0: 'Student-Authored'})
category_counts = y_train_cat.value_counts()
plt.bar(x=category_counts.index, height=category_counts.values)
prop_gen = train['generated'].mean()
plt.title(f"Proportion Generated: {prop_gen:.4f}")
plt.show()

## 3. Feature Engineering

Perform your feature engineering here and then create `y_train`, `X_train` and `X_test`. 

In [43]:
# Define inputs for variables 

input_transition = ["first", "firstly", "second", "secondly", "third", "thirdly", "meanwhile", "previously", "subsequently", "eventually", 
                    "finally", "lastly", "ultimately", "conclusion", "addition", "additionally", "furthermore", "moreover", "besides", 
                    "equally", "however", "contrary", "conversely", "despite", "contrast", "nevertheless", "nonetheless", "whereas", "while",
                    "although", "though", "therefore", "thus", "hence", "consequently", "accordingly", "namely", "specifically", "indeed", 
                    "importantly", "significantly", "especially", "notably", "undoubtedly", "likewise", "similarly", "correspondingly", "sum",
                    "summary", "overall", "conclude", "conclusion", "simultaneously", "formerly", "lately", "recently", "opposite", "adjacent",
                    "provided", "admittedly", "regarding"]

input_hyperbole = ["powerful", "groundbreaking", "illuminating", "vital", "invaluable", "indelible", "essential", "poignant", "profound", 
                   "remarkable", "transformative", "revolutionary", "unparalleled", "extraordinary", "compelling", "significant", "exceptional",
                   "crucial", "monumental", "dramatic", "robust", "innovative", "pivotal", "impressive", "astonishing", "visionary", "inspiring",
                   "striking", "dynamic", "iconic", "seminal", "trailblazing", "revolutionary", "extreme", "shocking"]

input_abn_symbols = ["[", "]", "_", "*", "<", ">", "{", "}", "^", "@", "#", "|", "\\"]

input_prompt_lang = ["here you go", "as an ai", "as a language model", "i generated", "here's the essay", "here's your essay", "let me", 
                     "help you", "sure,", "i hope this helps", "your prompt", "your request", "here is", "here's", "sure!", 
                     "here is the essay", "here is your essay", "language model", "large language", "llm", "generative ai", "chatbot", 
                     "your essay"]

In [45]:
# Create functions for feature engineering 

### Spelling errors
spell = SpellChecker()

def clean_text_for_spellcheck(text):
    if pd.isna(text):
        return ""
    for symbol in input_abn_symbols:
        text = text.replace(symbol, "")
    text = re.sub(r"[^a-z\s'-]", "", text.lower())
    return text

def misspelling_ratio(text):
    text_clean = clean_text_for_spellcheck(text)
    words = text_clean.split()
    if len(words) == 0:
        return 0.0 
    misspelled = spell.unknown(words)
    return len(misspelled) / len(words)

def count_misspellings(text):
    text_clean = clean_text_for_spellcheck(text)
    words = text_clean.split()
    if len(words) == 0:
        return 0
    misspelled = spell.unknown(words)
    return len(misspelled)

### Exclamation points
def count_exclamation_points(text):
    return text.count('!')


### Em dashes
def count_em_dash(text):
    return text.count('\u2014')  # U+2014 EM DASH


### Transitional words
def count_transition(text):
    return sum(text.count(word) for word in input_transition)


### Hyperbolic phrasing
def count_hyperbolic(text):
    return sum(text.count(word) for word in input_hyperbole)


### Abnormal symbols
def count_abn_symbols(text):
    return sum(text.count(word) for word in input_abn_symbols)


### Prompt indicator
def contains_prompt_indicators(text):
    return sum(text.count(word) for word in input_prompt_lang)
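
One thing to keep in mind with the counting functions above: `str.count` matches substrings, so a short entry such as "sum" is also counted inside "summary" or "assume", and a word that appears twice in a list is counted twice. The sketch below compares that behaviour with a whole-word alternative. The helper names `count_substring` and `count_whole_words`, the small vocabulary, and the sample sentence are hypothetical, and the regex version is only suitable for single-word entries (not for phrases or symbols).

```python
import re

def count_substring(text, vocab):
    # Same approach as the counting functions above: matches inside longer words too
    return sum(text.count(word) for word in vocab)

def count_whole_words(text, vocab):
    # Deduplicate the vocabulary, then match only at word boundaries
    alternatives = "|".join(re.escape(w) for w in sorted(set(vocab), key=len, reverse=True))
    pattern = r"\b(?:" + alternatives + r")\b"
    return len(re.findall(pattern, text))

# Hypothetical mini vocabulary, with one duplicate on purpose
vocab = ["first", "firstly", "sum", "summary", "conclusion", "conclusion"]
sample = "firstly, we assume the summary is fair. in conclusion, it is."

substring_hits = count_substring(sample, vocab)
whole_word_hits = count_whole_words(sample, vocab)

print("substring count: ", substring_hits)
print("whole-word count:", whole_word_hits)
```

Either definition can work as a model feature as long as it is applied identically to the training and test essays; the whole-word version is simply closer to "number of transition words used".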

In [49]:
# Add features to both training AND competition test data

def add_features(df):
    # Lowercase the text first so the (all-lowercase) word lists defined above can match
    df["text"] = df["text"].str.lower()  
    df["misspelling_ratio"] = df["text"].apply(misspelling_ratio)
    df["n_misspellings"] = df["text"].apply(count_misspellings)
    df["n_exclamations"] = df["text"].apply(count_exclamation_points)
    df["n_em_dash"] = df["text"].apply(count_em_dash)
    df["n_transition"] = df["text"].apply(count_transition)
    df["n_hyperbolic"] = df["text"].apply(count_hyperbolic)
    df["n_abn_symbols"] = df["text"].apply(count_abn_symbols)
    df["n_prompt"] = df["text"].apply(contains_prompt_indicators)

    return df

train = add_features(train)
test_essays = add_features(test_essays)



In [61]:
train = train.drop(['prompt_id', 'kaggle_repo'], axis=1)
train.head(15)

| id | text | generated | model | misspelling_ratio | n_misspellings | n_exclamations | n_em_dash | n_transition | n_hyperbolic | n_abn_symbols | n_prompt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| d429f032 | advantages of limiting car usage \n\nlimiting ... | 1 | gpt-3.5-turbo | 0.007712 | 3 | 0 | 0 | 7 | 2 | 0 | 0 |
| 1ce279be | advantages of limiting car usage\n\nlimiting c... | 1 | gpt-3.5-turbo | 0.011442 | 5 | 0 | 0 | 6 | 4 | 0 | 0 |
| c9595213 | limiting car usage has numerous advantages tha... | 1 | gpt-3.5-turbo | 0.016327 | 8 | 0 | 0 | 9 | 2 | 0 | 0 |
| f2266d87 | the passages provided discuss the advantages o... | 1 | gpt-3.5-turbo | 0.007634 | 3 | 0 | 0 | 4 | 0 | 0 | 0 |
| eeace4bd | title: the advantages of limiting car usage\n\... | 1 | gpt-3.5-turbo | 0.014625 | 8 | 0 | 0 | 7 | 4 | 0 | 0 |
| 354fdce0 | advantages of limiting car usage\n\nlimiting c... | 1 | gpt-3.5-turbo | 0.013483 | 6 | 0 | 0 | 5 | 3 | 0 | 0 |
| 6eaa842f | the advantages of limiting car usage are becom... | 1 | gpt-3.5-turbo | 0.018672 | 9 | 0 | 0 | 5 | 2 | 0 | 0 |
| a36a04d5 | limiting car usage has numerous advantages for... | 1 | gpt-3.5-turbo | 0.019868 | 6 | 0 | 0 | 6 | 4 | 0 | 0 |
| c9d5567f | advantages of limiting car usage\n\nlimiting c... | 1 | gpt-3.5-turbo | 0.014184 | 6 | 0 | 0 | 10 | 3 | 0 | 0 |
| c2e2ead1 | the advantages of limiting car usage\n\nin rec... | 1 | gpt-3.5-turbo | 0.006316 | 3 | 0 | 0 | 10 | 3 | 0 | 0 |


In [63]:
# Important for pre-processing: list our hand-engineered predictors, since the raw text column is handled separately by TF-IDF later

our_features = [
    "misspelling_ratio", "n_misspellings", "n_exclamations", "n_em_dash", "n_transition", "n_hyperbolic","n_abn_symbols", "n_prompt"
]

## 4. Data preparation pipelines and pre-processing

Run all preparation and pre-processing pipelines here

#### Splitting with new features

In [71]:
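## Train/test split
train_essay, test_essay = train_test_split(train, test_size=0.3, random_state=38)

# train
train_y = train_essay['generated']
train_X = train_essay.drop(columns='generated')

# held-out test split (used for the ROC-AUC check after fitting)
test_y = test_essay['generated']
test_X = test_essay.drop(columns='generated')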

In [73]:
# TF-IDF 
word_tfidf = TfidfVectorizer(stop_words='english', max_features=20000, ngram_range=(1,3), min_df=2)

char_tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3,5), max_features=20000, min_df=2)

# combine the two vectorizer steps -- used Google AI overview for this and the next step, was not previously familiar 
# with Feature Union and making a more complex preprocessor
nlp_features = FeatureUnion([
    ('word', word_tfidf),
    ('char', char_tfidf)
])
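
To make the two vectorizers above more concrete, here is a simplified plain-Python sketch of the kinds of n-grams each one extracts before TF-IDF weighting is applied. The functions `word_ngrams` and `char_ngrams` are hypothetical helpers written for this illustration; they only mimic the `ngram_range` settings and skip what `TfidfVectorizer` does in addition (its token pattern, stop-word removal, `min_df`, and `max_features`).

```python
def word_ngrams(text, n_min=1, n_max=3):
    # Whitespace tokenization is a simplification of the vectorizer's tokenizer
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n_min=3, n_max=5):
    # Sliding character windows, spaces included
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

sample = "limiting car usage"
words = word_ngrams(sample)   # unigrams, bigrams and trigrams
chars = char_ngrams(sample)   # 3-, 4- and 5-character windows

print(words)
print(chars[:5], "...", len(chars), "character n-grams in total")
```

Word n-grams capture phrasing, while character n-grams also pick up punctuation, spacing, and spelling habits, which is a plausible reason to combine both in a `FeatureUnion`.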

In [23]:
# Preprocessing step -- used Google AI overview for this step, as described above
preprocess = ColumnTransformer(
    transformers=[
        ('nlp', nlp_features, 'text'),
        ('our', StandardScaler(), our_features)
    ]
)

pipe = Pipeline([
    ('prep', preprocess),
    ('model', XGBClassifier())
])

## 5. Model selection and hyperparameter tuning

Perform all model selection and hyperparameter tuning here. Create separate pipelines here.

In [None]:
# Full grid search over all candidate models
### **Important note:** this is not the exact code we used to compare models; running it in one go would have been too expensive for our computers.
### We ran this model-comparison grid search on preliminary training data (N=1000). As we increased the sample size,
### we switched to separate grid searches for each model and its parameters, using AUC as the scoring
### metric and evaluating performance on the test data in Kaggle.

### This cell is formatted to show every combination we tried; it does not reflect the multiple separate
### GridSearchCV runs we actually used.


# param_grid = [
#    # ENSEMBLE METHODS
#    {
#        "model": [RandomForestClassifier(random_state=38)],
#        "model__n_estimators": [100, 200],
#        "model__max_depth": [None, 5, 10],
#        "model__min_samples_split": [5, 10, 20],
#        "model__class_weight": [None, "balanced"]
#    },
#    {
#        "model": [XGBClassifier(eval_metric='logloss', random_state=38)],
#        "model__learning_rate": [0.05, 0.1],
#        "model__n_estimators": [300, 500],
#        "model__max_depth": [2, 3], 
#        "model__reg_alpha": [0, 0.1, 0.5],
#        "model__reg_lambda": [3, 5]
#    },
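#    # LINEAR METHODS (needs: from sklearn.linear_model import LogisticRegression)
#    {
#        "model": [LogisticRegression(solver='liblinear')],  # liblinear supports both 'l1' and 'l2'
#        "model__C": [0.01, 0.05, 0.1, 0.5, 1, 2],
#        "model__penalty": ['l1', 'l2'],
#    }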
# ]

# grid = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

# grid.fit(train[['text'] + our_features], train['generated'])

# print("Best score:", grid.best_score_)
# print("Best params:", grid.best_params_)

# pipe_final = grid.best_estimator_
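
To give a sense of why the combined search above was too expensive to run in one go, the sketch below counts the candidate settings in the commented-out `param_grid` and multiplies by the number of cross-validation folds. The value counts are copied from the grid above; the variable names are just for this illustration.

```python
from math import prod

# Number of values per hyperparameter, copied from the commented-out param_grid
grid_sizes = {
    "random_forest": [2, 3, 3, 2],   # n_estimators, max_depth, min_samples_split, class_weight
    "xgboost": [2, 2, 2, 3, 2],      # learning_rate, n_estimators, max_depth, reg_alpha, reg_lambda
    "logistic_regression": [6, 2],   # C, penalty
}
cv_folds = 5  # cv=5 in GridSearchCV

# Candidates per model = product of the list lengths in that sub-grid
candidates = {name: prod(sizes) for name, sizes in grid_sizes.items()}
total_candidates = sum(candidates.values())
total_fits = total_candidates * cv_folds

print(candidates)
print("candidate settings:", total_candidates)
print("pipeline fits during the search:", total_fits)
```

Each of those fits re-runs both TF-IDF vectorizers on the training folds, and `GridSearchCV` performs one more fit at the end to refit the best setting on all of the data, so the cost grows quickly with the training set size.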

## 6. Creation of final pipeline

Create your final pipeline here and save it in an object called `pipe_final`. 

In [79]:
# XGB
params = {
    "model__learning_rate": [0.2],
    "model__n_estimators": [200],
    "model__max_depth": [3]
}

grid_xgb = GridSearchCV(pipe, params, cv=5, scoring='roc_auc', n_jobs=-1)
grid_xgb.fit(train_X, train_y)

pipe_final = grid_xgb.best_estimator_


In [None]:
test_preds = pipe_final.predict(test_X)
test_probs = pipe_final.predict_proba(test_X)[:, 1]

from sklearn.metrics import roc_auc_score

roc = roc_auc_score(test_y, test_probs)
print("Test ROC-AUC:", roc)

## 7. Creating Submission

* Using `pipe_final`, create a data frame `final_submission` that has your predictions and write it to `final_submission.csv`, which you can submit on Kaggle. Note that the format of `final_submission.csv` has to match that of `example_submission.csv` exactly.
* Take a screenshot of your final leaderboard score and ensure it displays below
* Run the `RepeatedKFold()` as you did in PS3. 


* Keep track of at least one score a day

In [72]:
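# Apply pipe_final to the competition test essays to create the final_submission data frame
# (test_essays already has the engineered features from add_features)
probs = pipe_final.predict_proba(test_essays[['text'] + our_features])[:, 1]

final_submission = pd.DataFrame(
    {'generated': probs},        # predicted probability that the essay is AI generated
    index=test_essays.index      # 'id' index, matching sample_submission.csv
)

final_submission.to_csv('data/final_submission.csv', index=True)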

In [None]:
from IPython.display import Image
Image("images/leaderboard.png")

In [75]:
cv_final = RepeatedKFold(n_splits=5, n_repeats=10, random_state=38)
scores_final = cross_val_score(pipe_final, train_X, train_y, scoring='accuracy', cv=cv_final)
print(f'Mean accuracy: {np.mean(scores_final):.3f}')

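
As a rough picture of what the repeated k-fold evaluation above does, the sketch below generates the train/validation index splits in plain Python. `repeated_kfold_indices` is a hypothetical helper written for this illustration; it follows the same idea as `RepeatedKFold(n_splits=5, n_repeats=10)` but will not reproduce scikit-learn's exact folds.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=10, seed=38):
    """Yield (train_idx, val_idx) pairs: n_repeats reshuffles, each cut into n_splits folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_splits folds
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            val_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, val_idx

# Small toy size so the splits are easy to inspect
splits = list(repeated_kfold_indices(n_samples=20))
n_fits = len(splits)

print("model fits needed:", n_fits)
print("first validation fold:", sorted(splits[0][1]))
```

With 5 splits and 10 repeats the pipeline is fit 50 times, so the reported mean score is an average over 50 validation folds rather than a single split.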

## 8. Appendix

Please place anything extra that you don't feel is central, but would still like to keep, here.