
<div style="padding:20px; 
            color:#150d0a;
            margin:10px;
            font-size:220%;
            text-align:center;
            display:fill;
            border-radius:20px;
            border-width: 5px;
            border-style: solid;
            border-color: #150d0a;
            background-color:#4FC95F;
            overflow:hidden;
            font-weight:500">LLM - Detect AI Generated Text</div>

![4](https://github.com/benitomartin/benitomartin/assets/116911431/cab5bb0e-1473-47f5-8e9c-d51e1731b207)


  <div style="padding:20px; 
              color:blue;
              margin:10px;
              font-size:150%;
              text-align:center;
              display:fill;
              border-radius:20px;
              border-width: 5px;
              background-color:#eca912;
              overflow:hidden;
              font-weight:500">
    <b>INTRODUCTION</b>
  </div>

This notebook has been created as part of the **LLM - Detect AI Generated Text** competition from **Kaggle**.

The competition dataset comprises about 10,000 essays, some written by students and some generated by a variety of large language models (LLMs). The goal of the competition is to determine whether or not an essay was generated by an LLM.

All of the essays were written in response to one of seven essay prompts. In each prompt, the students were instructed to read one or more source texts and then write a response. This same information may or may not have been provided as input to an LLM when generating an essay.

Essays from two of the prompts compose the training set; the remaining essays compose the hidden test set. Nearly all of the training set essays were written by students, with only a few generated essays given as examples.

# **DATASET**
**test|train_essays.csv** 

- `id` - A unique identifier for each essay.

- `prompt_id` - Identifies the prompt the essay was written in response to.

- `text` - The essay text itself.

- `generated` - Whether the essay was written by a student (`0`) or generated by an LLM (`1`). This field is the target and is not present in `test_essays.csv`.

**train_prompts.csv** - Essays were written in response to information in these fields.

- `prompt_id` - A unique identifier for each prompt.

- `prompt_name` - The title of the prompt.

- `instructions` - The instructions given to students.

- `source_text` - The text of the article(s) the essays were written in response to, in Markdown format. Significant paragraphs are enumerated by a numeral preceding the paragraph on the same line, as in `0 Paragraph one.\n\n1 Paragraph two`. Essays sometimes refer to a paragraph by its numeral. Each article is preceded with its title in a heading, like `# Title`. When an author is indicated, their name will be given in the title after `by`. Not all articles have authors indicated. An article may have subheadings indicated like `## Subheading`.

**sample_submission.csv** - A submission file in the correct format. See the Evaluation page for details.

  <div style="padding:20px; 
              color:blue;
              margin:10px;
              font-size:150%;
              text-align:center;
              display:fill;
              border-radius:20px;
              border-width: 5px;
              background-color:#eca912;
              overflow:hidden;
              font-weight:500">
    <b>CONTACT INFORMATION</b>
  </div>

If you like this notebook, feel free to upvote it and connect with me!

**Benito Martin:** 

- [LinkedIn](https://www.linkedin.com/in/benitomzh/) 🔗

- [GitHub](https://github.com/benitomartin) 🔗

  <div style="padding:20px; 
              color:blue;
              margin:10px;
              font-size:150%;
              text-align:center;
              display:fill;
              border-radius:20px;
              border-width: 5px;
              background-color:#eca912;
              overflow:hidden;
              font-weight:500">
    <b>TABLE OF CONTENTS</b>
  </div>

* [1. Import Libraries](#libraries)

* [2. Helper Functions](#helper)

* [3. Import Data](#data)
    
    * [3.1. Essays Datasets](#essays)
        
        * [3.1.1. Train Set](#essaystrain)   
    
        * [3.1.2. Test Set](#essaystest) 
    
    * [3.2. Prompts Dataset](#prompts)

* [4. Exploratory Data Analysis](#eda)
    
    * [4.1. Distribution](#distribution)
    
        * [4.1.1. Essays](#distess)
            
* [5. External Essays](#distext)
    
    * [5.1. DAIGT Essay](#daigt)
    
    * [5.2. Distribution](#dist)

    * [5.3. Wordcloud](#wc)
    
    * [5.4. Preprocessing](#prepro)
        
* [6. Modeling](#modeling)

    * [6.1. Train/Test Split](#split)

    * [6.2. Vectorization](#vector)
    
    * [6.3. Embedding](#emb)
    
    * [6.4. RNN Model](#model)
  
         * [6.4.1. Fitting](#fit)
         
         * [6.4.2. Learning Curves](#lc) 
         
         * [6.4.3. Confusion Matrix](#cm) 
         
         * [6.4.4. Classification Report](#cr)
        
         * [6.4.5. AUC - ROC Curve](#roc)

         * [6.4.6. Precision-Recall Curve](#prc)

* [7. Submission](#sub)


# <font color='289C4E'>1. Import Libraries 📚<font><a class='anchor' id='libraries'></a> [↑](#top)

In [None]:
# LIBRARIES

# Global
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import regex as re

# Function to plot WordCloud
from wordcloud import WordCloud, STOPWORDS
from collections import Counter

# Tensorflow/Keras
# The backend must be selected before keras_core is imported
os.environ["KERAS_BACKEND"] = "tensorflow"  # or "jax" or "torch"

import tensorflow as tf
import keras_core as keras
from keras import layers, Sequential
from keras.layers import TextVectorization
from keras.callbacks import (ModelCheckpoint, 
                             EarlyStopping, 
                             ReduceLROnPlateau, 
                             CSVLogger, 
                             LearningRateScheduler)

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay, 
                             classification_report, precision_recall_curve, 
                             roc_curve, auc) 

# Set Seed for Reproducibility
keras.utils.set_random_seed(42)

# Use mixed precision to speed up all training.
keras.mixed_precision.set_global_policy("mixed_float16")

# Check Versions
print("TensorFlow:", tf.__version__)
print("Keras:", keras.__version__)

# <font color='289C4E'>2. Helper Functions 💾<font><a class='anchor' id='helper'></a> [↑](#top)

The competition provides an **Efficiency Prize**, which is only awarded to CPU-only notebooks. To reduce the size of the imported DataFrames, the following function **downcasts** numerical columns to more memory-efficient types, which lowers memory usage.


Additionally, a function to plot **WordClouds** is provided.

In [None]:
def compress(df, verbose=True):
    """
    Reduces the size of the DataFrame by downcasting numerical columns
    """
    input_size = df.memory_usage(index=True).sum() / (1024 ** 2)
    if verbose:
        print("Old dataframe size:", round(input_size, 2), 'MB')

    in_size = df.memory_usage(index=True).sum()
    dtype_before = df.dtypes.copy()  # Copy of original data types

    for col in df.select_dtypes(include=['float64', 'int64']):
        col_type = df[col].dtype
        col_min, col_max = df[col].min(), df[col].max()

        if col_type == 'int64':
            # Inclusive bounds, so values equal to a type's limits still fit
            if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            # Otherwise the column stays int64
        elif col_type == 'float64':
            ## float16 warns of overflow
            # if col_min > np.finfo(np.float16).min and col_max < np.finfo(np.float16).max:
            #     df[col] = df[col].astype(np.float16)
            if col_min > np.finfo(np.float32).min and col_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            elif col_min > np.finfo(np.float64).min and col_max < np.finfo(np.float64).max:
                df[col] = df[col].astype(np.float64)

    out_size = df.memory_usage(index=True).sum()
    ratio = (1 - round(out_size / in_size, 2)) * 100

    if verbose:
        print("Optimized size by {}%".format(round(ratio, 2)))
        print("New DataFrame size:", round(out_size / (1024 ** 2), 2), "MB")

    # Filter only numerical columns for comparison
    numeric_columns = df.select_dtypes(include=['float32', 'float64', 'int8', 'int16', 'int32', 'int64'])
    dtype_after = numeric_columns.dtypes.copy()  # Copy of data types after compression
    
    # Create a comparison DataFrame
    comparison_df = pd.DataFrame({'Before': dtype_before[numeric_columns.columns], 'After': dtype_after})
    comparison_df['Size Reduction'] = ratio

    return df, comparison_df
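
The downcasting idea can be tried on a toy frame with `pandas.to_numeric`, which picks the smallest fitting type on its own. This is only a sketch for comparison with `compress`; the `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the default 64-bit dtypes
demo = pd.DataFrame({
    "generated": np.array([0, 1, 0, 1], dtype="int64"),
    "score": np.array([0.1, 0.5, 0.9, 0.3], dtype="float64"),
})
before = int(demo.memory_usage(index=False).sum())

# Let pandas choose the smallest integer / float type that holds the values
demo["generated"] = pd.to_numeric(demo["generated"], downcast="integer")
demo["score"] = pd.to_numeric(demo["score"], downcast="float")
after = int(demo.memory_usage(index=False).sum())

print(demo.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Note that downcasting floats to `float32` trades precision for memory, so it is best limited to columns where small rounding differences do not matter.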

In [None]:
# Generate WordCloud

def generate_wordcloud_subplot(df, label_value, subplot_position, max_words=1000, width=800, height=400, top_n = 10):
    """
    Generate a word cloud for a specific label value and display it in a subplot.

    Args:
        df (DataFrame): The DataFrame containing text data and labels.
        label_value (int): The label value for which to generate the word cloud.
        subplot_position (int): The position of the subplot where the word cloud will be displayed.
        max_words (int, optional): Maximum number of words to include in the word cloud. Default is 1000.
        width (int, optional): Width of the word cloud image. Default is 800.
        height (int, optional): Height of the word cloud image. Default is 400.
        top_n (int, optional): Number of most and least common words to print. Default is 10.

    Returns:
        None
    """

    # Select the text subset for the specified label value
    text_subset = df[df.generated == label_value].text

    # Define stopwords to be excluded
    stopwords = set(STOPWORDS)

    # Create a WordCloud object with specified parameters
    wc = WordCloud(max_words=max_words, width=width, height=height, stopwords=stopwords)

    # Generate the word cloud from the selected text subset
    wc.generate(" ".join(text_subset))

    # Create a subplot and display the word cloud
    plt.subplot(subplot_position)
    plt.imshow(wc, interpolation='bilinear')

    # Set the title for the word cloud plot
    title = f'WordCloud for Label {label_value} ({("Student" if label_value == 0 else "AI")})'
    plt.title(title)
    
    # Count occurrences of words in the text subset
    words_count = Counter(" ".join(text_subset).split())
    top_words = words_count.most_common(top_n)
    bottom_words = words_count.most_common()[:-top_n-1:-1]  # Extract least common words

    # Print the most common words
    print(f"Top {top_n} words for Label {label_value}:")
    for idx, (word, count) in enumerate(top_words, start=1):
        print(f"{idx}. {word}: {count} times")
    print("------------------------------")

    # Print the least common words
    print(f"Least {top_n} words for Label {label_value}:")
    for idx, (word, count) in enumerate(bottom_words, start=1):
        print(f"{idx}. {word}: {count} times")
    print("------------------------------")
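
The word counting inside `generate_wordcloud_subplot` relies on `collections.Counter`. The following sketch isolates that part on three made-up sentences, to show how the most and least common words are extracted; the `texts` list is purely illustrative.

```python
from collections import Counter

# Hypothetical mini corpus
texts = ["the cat sat", "the dog sat", "the end"]

# Count whitespace-separated tokens across all texts
counts = Counter(" ".join(texts).split())

# Most common words, highest count first
top = counts.most_common(2)

# Least common words: walk the full ranking backwards
bottom = counts.most_common()[:-3:-1]

print(top)
print(bottom)
```

Since the split is on whitespace only, tokens such as `car` and `car,` are counted separately until punctuation is removed.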

# <font color='289C4E'>3. Import Data 📂<font><a class='anchor' id='data'></a> [↑](#top)

There are 3 files, which contain the following information:

**test|train_essays.csv**

- `id`: A unique identifier for each essay.

- `prompt_id`: Identifies the prompt the essay was written in response to. 

- `text`: The essay text itself. 

- `generated`: Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in `test_essays.csv`.

**train_prompts.csv**: Essays were written in response to information in these fields. 

- `prompt_id`: A unique identifier for each prompt. 

- `prompt_name`: The title of the prompt. 

- `instructions`: The instructions given to students. 

- `source_text`: The text of the article(s) the essays were written in response to, in Markdown format. Significant paragraphs are enumerated by a numeral preceding the paragraph on the same line, as in `0 Paragraph one.\n\n1 Paragraph two`. Essays sometimes refer to a paragraph by its numeral. Each article is preceded with its title in a heading, like `# Title`. When an author is indicated, their name will be given in the title after `by`. Not all articles have authors indicated. An article may have subheadings indicated like `## Subheading`.
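
To make the `source_text` layout more concrete, here is a minimal sketch of a parser for it. It assumes blocks are separated by blank lines, as in the description above; the function name `parse_source_text` and the sample string are hypothetical and not part of the competition files.

```python
import re

def parse_source_text(source_text):
    """Split a Markdown source text into articles with numbered paragraphs."""
    articles = []
    current = None
    for block in source_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):
            # A top-level heading starts a new article
            current = {"title": block[2:].strip(), "paragraphs": {}}
            articles.append(current)
            continue
        # Numbered paragraphs look like "0 Paragraph one."
        match = re.match(r"(\d+)\s+(.*)", block, flags=re.DOTALL)
        if match and current is not None:
            current["paragraphs"][int(match.group(1))] = match.group(2)
        # Anything else (e.g. "## Subheading") is skipped in this sketch
    return articles

# Made-up sample in the documented format
sample = (
    "# Example Title by Jane Doe\n\n"
    "0 Paragraph one.\n\n"
    "1 Paragraph two.\n\n"
    "## Subheading\n\n"
    "2 Paragraph three."
)
articles = parse_source_text(sample)
print(articles[0]["title"])
print(articles[0]["paragraphs"])
```

A lookup like `articles[0]["paragraphs"][1]` then resolves an essay's reference to "paragraph 1".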


In [None]:
# Import Data

data_path = '/kaggle/input/llm-detect-ai-generated-text/'

for dirname, _, filenames in os.walk(data_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## <font color='289C4E'>3.1. Essays Datasets 🗞️<font><a class='anchor' id='essays'></a> [↑](#top)

### <font color='289C4E'>3.1.1. Train Set 🚂<font><a class='anchor' id='essaystrain'></a> [↑](#top)

If we load the train dataset we see that it contains 1378 essays.

In [None]:
# Essays Train Dataset

df_train_essays = pd.read_csv(data_path + "train_essays.csv")

print(df_train_essays.info())
df_train_essays.head()

In [None]:
# Compression

# Compress dataframe
df_train_essays, comparison_df = compress(df_train_essays, verbose=True)

# Check compression
comparison_df

In [None]:
# Checking the first essay text

df_train_essays.text[0]

### <font color='289C4E'>3.1.2. Test Set 🧪<font><a class='anchor' id='essaystest'></a> [↑](#top)

If we load the test dataset, we see that it only contains 3 essays, and each text only contains 12 characters. However, the hidden test set used for evaluation contains more essays.

In [None]:
# Essays Test Dataset

df_test_essays = pd.read_csv(data_path + "test_essays.csv")

print(df_test_essays.info())
df_test_essays.head()

In [None]:
# Compression

# Compress dataframe
df_test_essays, comparison_df = compress(df_test_essays, verbose=True)

# Check compression
comparison_df

In [None]:
# Checking the first essay text

df_test_essays.text[0]

In [None]:
# Checking length of the essays

df_test_essays["text"].apply(lambda x : len(x))

## <font color='289C4E'>3.2. Prompts Dataset 🗞️<font><a class='anchor' id='prompts'></a> [↑](#top)

Let's now explore the prompts dataset. It only contains 2 entries, which means only 2 prompts (topics) were used to produce the essays.

In [None]:
# Prompts Dataset

df_train_prompts = pd.read_csv(data_path + "train_prompts.csv")
print(df_train_prompts.info())

df_train_prompts.head()

In [None]:
# Compression

# Compress dataframe
df_train_prompts, comparison_df = compress(df_train_prompts, verbose=True)

# Check compression
comparison_df

In [None]:
# Checking the first instruction given to students

df_train_prompts.instructions[0]

In [None]:
# Checking the first text

df_train_prompts.source_text[0]

# <font color='289C4E'>4. Exploratory Data Analysis 📈<font><a class='anchor' id='eda'></a> [↑](#top)

After exploring the length and content of the datasets, we see that there is not much information to train a model, so external data could be added. Before that, let's explore the distribution of the datasets and see how balanced or imbalanced they are.

## <font color='289C4E'>4.1. Distribution 📊<font><a class='anchor' id='distribution'></a> [↑](#top)

### <font color='289C4E'>4.1.1. Essays 🚂<font><a class='anchor' id='distess'></a> [↑](#top)

The distribution of the prompts in the train set seems to be well balanced. On the other hand, there are only 3 texts generated by AI. Therefore, it will be necessary to add more data to balance the training set.

In [None]:
# Distribution of Prompts

# Set Figure
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")

# Create the count plot
ax = sns.countplot(data=df_train_essays, x="prompt_id", palette="viridis")

# Obtaining and setting the count values (sorted by prompt_id to match the bar order)
abs_values = df_train_essays['prompt_id'].value_counts().sort_index().values
ax.bar_label(container=ax.containers[0], labels=abs_values, fontsize=12)

# Set title and labels with increased font sizes
ax.set_title("Distribution of Prompt ID", fontsize=16)
ax.set_xlabel("Prompt ID", fontsize=14)
ax.set_ylabel("Count", fontsize=14)

plt.show()

In [None]:
# Distribution of Generated Text

# Set Figure
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")

# Create the count plot
ax = sns.countplot(data=df_train_essays, x="generated", palette="viridis")

# Mapping x-axis labels
ax.set_xticklabels(["Student", "AI"])

# Obtaining and setting the count values (sorted by label to match the bar order)
abs_values = df_train_essays['generated'].value_counts().sort_index().values
ax.bar_label(container=ax.containers[0], labels=abs_values, fontsize=12)

# Set title and labels with increased font sizes
ax.set_title("Distribution of Generated Text", fontsize=16)
ax.set_xlabel("Generated Text", fontsize=14)
ax.set_ylabel("Count", fontsize=14)

plt.show()
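
The degree of imbalance is easy to put into numbers. The sketch below rebuilds a label column with the counts reported above (1378 essays, 3 of them AI-generated) instead of reading the CSV, and shows why plain accuracy is a weak metric here.

```python
import pandas as pd

# Labels rebuilt from the counts above: 1375 student essays, 3 AI essays
labels = pd.Series([0] * 1375 + [1] * 3, name="generated")

# Share of each class, ordered by label value
share = labels.value_counts(normalize=True).sort_index()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = float(share.max())

print(share)
print(f"Always predicting 'student' scores {baseline_accuracy:.4f} accuracy")
```

A model trained on this set alone could reach very high accuracy without learning anything about AI-generated text, which is the motivation for adding external data.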

# <font color='289C4E'>5. External Essays 📤<font><a class='anchor' id='distext'></a> [↑](#top)

Since the train set is imbalanced, we can import external data. In our case we will use the [DAIGT](https://www.kaggle.com/datasets/thedrcat/daigt-proper-train-dataset/) dataset, which contains the original `train_essays.csv` file.

## <font color='289C4E'>5.1. DAIGT Essay 1️⃣<font><a class='anchor' id='daigt'></a> [↑](#top)

In [None]:
#Import external dataset

ext1 = pd.read_csv('/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv')

In [None]:
ext1.head()

In [None]:
# Check duplicates

# Get the number of rows with duplicates
duplicates = ext1.duplicated().sum()

# Print the number of duplicated rows
print(f"Number of rows with duplicates: {duplicates}")
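
As a side note on what this check covers: `duplicated()` without arguments flags a row only when every column matches an earlier row. The toy frame below (made-up values) shows the difference to checking the `text` column alone.

```python
import pandas as pd

# Hypothetical frame: the text "a" appears three times, once with another label
demo = pd.DataFrame({"text": ["a", "b", "a", "a"], "label": [0, 1, 0, 1]})

# Rows that repeat an earlier row in all columns
full_duplicates = int(demo.duplicated().sum())

# Rows that repeat an earlier text, regardless of the label
text_duplicates = int(demo.duplicated(subset=["text"]).sum())

# Keep the first occurrence of each full duplicate
deduped = demo.drop_duplicates().reset_index(drop=True)

print(full_duplicates, text_duplicates, len(deduped))
```

Checking `subset=["text"]` can be useful when the same essay might appear with different metadata.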

In [None]:
# Compression

# Compress dataframe
ext1, comparison_df = compress(ext1, verbose=True)

# Check compression
comparison_df

The dataset contains several columns that will not be necessary for our specific problem. Therefore we will remove them and rename the `label` column to `generated`.

In [None]:
# Drop Columns

ext1.drop(["RDizzl3_seven", "prompt_name", "source"], inplace=True, axis=1)

ext1.head()

In [None]:
# Rename label column

ext1.rename(columns = {"label":"generated"}, inplace=True)

ext1.head()

## <font color='289C4E'>5.2. Distribution ⚖️<font><a class='anchor' id='dist'></a> [↑](#top)    

Let's now plot the distribution of the labels. We see that the dataset is imbalanced. However, in order to set an RNN baseline model, we will use the dataset as it is.

In [None]:
# Distribution of Generated Text in the External Dataset

# Set Figure
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")

# Create the count plot
ax = sns.countplot(data=ext1, x="generated", palette="viridis")

# Mapping x-axis labels
ax.set_xticklabels(["Student", "AI"])

# Obtaining and setting the count values (sorted by label to match the bar order)
abs_values = ext1['generated'].value_counts().sort_index().values
ax.bar_label(container=ax.containers[0], labels=abs_values, fontsize=12)

# Set title and labels with increased font sizes
ax.set_title("Distribution of Generated Text", fontsize=16)
ax.set_xlabel("Generated Text", fontsize=14)
ax.set_ylabel("Count", fontsize=14)

plt.show()

## <font color='289C4E'>5.3. Wordcloud 🤼‍<font><a class='anchor' id='wc'></a> [↑](#top)    

Now we will go deeper into the analysis. We will plot the wordcloud and the list of the most and least common words.

We see that the most common words are typically stopwords, which we could remove. Additionally, some numbers appear, which are also not relevant and can be removed.

In [None]:
# Plot WordCloud

# Create a 1x2 grid of subplots
plt.figure(figsize=(20, 10))

# Generate WordCloud for label_value = 1 (subplot 1)
generate_wordcloud_subplot(ext1, label_value=1, subplot_position=121, top_n = 10)

# Generate WordCloud for label_value = 0 (subplot 2)
generate_wordcloud_subplot(ext1, label_value=0, subplot_position=122, top_n = 10)

plt.tight_layout()  # Adjust spacing between subplots

plt.show()

## <font color='289C4E'>5.4. Preprocessing 🏋️<font><a class='anchor' id='prepro'></a> [↑](#top)    

There are several preprocessing steps that we will perform:

* Clean text: lowercasing and removal of punctuation, extra whitespace and numbers

* Stopwords removal

This will help to feed the model with more relevant information.

In [None]:
# Clean Text

# Contractions are expanded before punctuation is dropped; otherwise the
# apostrophe is already gone and the patterns never match
contractions = {
    r"\bcan't\b": 'cannot',
    r"\bdon't\b": 'do not',
    r"\bwon't\b": 'will not',
}

def clean_text(text):
    # Expand contractions
    for pattern, replacement in contractions.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

    # Lower text
    text = text.lower()

    # Drop punctuation
    text = re.sub(r"\p{P}", " ", text)

    # Remove numbers
    text = re.sub(r"\d+", "", text)

    # Collapse newlines, carriage returns and repeated spaces into one space
    text = re.sub(r"\s+", " ", text)

    # Remove leading and trailing whitespace
    return text.strip()

# Apply the clean_text function to the 'text' column in the DataFrame
ext1['text'] = ext1['text'].apply(clean_text)


In [None]:
# NLTK is not working in Kaggle, so we set the stopwords list manually

stopword_list = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
len(stopword_list)

In [None]:
# Remove Stopwords

# A set gives constant-time membership checks
stopword_set = set(stopword_list)

def remove_custom_stopwords(sentence):
    words = sentence.split()
    filtered_words = [word for word in words if word.lower() not in stopword_set]
    return ' '.join(filtered_words)

# Apply the function to the 'text' column
ext1['text'] = ext1['text'].apply(remove_custom_stopwords)
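
The effect of these preprocessing steps is easiest to see on a single sentence. The sketch below is a self-contained approximation using only the standard library: `string.punctuation` stands in for the Unicode punctuation class, and `STOPWORDS_DEMO` is a small hypothetical subset of the stopword list. It expands contractions first, since dropping punctuation first would remove the apostrophes the patterns rely on.

```python
import re
import string

CONTRACTIONS = {
    r"\bcan't\b": "cannot",
    r"\bdon't\b": "do not",
    r"\bwon't\b": "will not",
}

# Hypothetical subset of the stopword list, for illustration only
STOPWORDS_DEMO = {"i", "the", "do", "not", "to", "in"}

def preprocess(text):
    # 1. Expand contractions while the apostrophes are still present
    for pattern, replacement in CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # 2. Lowercase, then replace ASCII punctuation with spaces
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # 3. Remove digits
    text = re.sub(r"\d+", "", text)
    # 4. Split on whitespace (this also collapses repeated spaces) and drop stopwords
    return " ".join(word for word in text.split() if word not in STOPWORDS_DEMO)

result = preprocess("I can't park in the city, 2 cars won't fit!")
print(result)
```

The output also shows a side effect worth keeping in mind: because `not` is a stopword, "will not fit" is reduced to "will fit", so negation is lost.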

In [None]:
# Check duplicates

# Get the number of rows with duplicates
duplicates = ext1.duplicated().sum()

# Print the number of duplicated rows
print(f"Number of rows with duplicates: {duplicates}")

After preprocessing, if we plot the word clouds again, we see that the list of words looks more relevant, which should help the model.

In [None]:
# Plot WordCloud

# Create a 1x2 grid of subplots
plt.figure(figsize=(20, 10))

# Generate WordCloud for label_value = 1 (subplot 1)
generate_wordcloud_subplot(ext1, label_value=1, subplot_position=121, top_n = 10)

# Generate WordCloud for label_value = 0 (subplot 2)
generate_wordcloud_subplot(ext1, label_value=0, subplot_position=122, top_n = 10)


plt.tight_layout()  # Adjust spacing between subplots

plt.show()

# <font color='289C4E'>6. Modeling 💪<font><a class='anchor' id='modeling'></a> [↑](#top)

Now that we have our dataset clean let's perform our training with an RNN model. First, let's make a copy of the final dataset.

In [None]:
# Copy ext1 as df_model

df_model = ext1.copy()

df_model.generated.value_counts()

## <font color='289C4E'>6.1. Train/Test Split 🪓<font><a class='anchor' id='split'></a> [↑](#top)

First we will split the dataset into train, validation and test sets. We shuffle it beforehand so that each set gets a similar label distribution.

In [None]:
# Create a shuffled df for a good labels distribution

# Set a random seed for reproducibility
random_seed = 42

print("Before shuffling:", df_model.shape)

# Shuffle the DataFrame with the specified random seed
shuffled_df = df_model.sample(frac=1, random_state=random_seed)

print("After shuffling:", shuffled_df.shape)

In [None]:
# Create a train/val/test split 
X = shuffled_df["text"]
y = shuffled_df["generated"]


# Split the data into train, validation, and test sets (70% train, 15% validation, 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=random_seed)

X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=random_seed)

# Display the shapes of the train, validation, and test sets
print("X_train shape:", X_train.shape)
print("X_validation shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_validation shape:", y_val.shape)
print("y_test shape:", y_test.shape)
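
The two-stage split can be checked on synthetic data. The sketch below is a variation on the split above rather than the same call: it additionally passes `stratify`, which is an assumption on my part and not used in this notebook, to keep the label ratio identical in every subset. The arrays `X_demo` and `y_demo` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 30% positive labels
X_demo = np.arange(1000)
y_demo = np.array([0] * 700 + [1] * 300)

# Stage 1: hold out 30% of the data
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Stage 2: split the held-out part in half (15% validation, 15% test)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(part), round(float(part.mean()), 3))
```

Without `stratify`, the ratios are only approximately equal, which is usually fine for a dataset of this size.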

If we look at the distribution of the labels, we see a similar distribution in each set, which makes the validation and test scores comparable.

In [None]:
# Get label counts for train, validation, and test data
train_label_counts = y_train.value_counts()
val_label_counts = y_val.value_counts()
test_label_counts = y_test.value_counts()

# Define custom labels for visualization
custom_labels = {0: 'Student', 1: 'AI'}

# Replace labels for visualization purposes
train_labels_visual = train_label_counts.rename(custom_labels)
val_labels_visual = val_label_counts.rename(custom_labels)
test_labels_visual = test_label_counts.rename(custom_labels)

# Define custom colors for each label
label_colors = {'Student': '#33FF57', 'AI': '#FF5733'}

# Create subplots with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Subplot 1: Train data distribution
wedges, texts, autotexts = axes[0].pie(train_labels_visual, labels=train_labels_visual.index, autopct='%1.1f%%', colors=[label_colors[label] for label in train_labels_visual.index])
axes[0].set_title('Train Data Distribution')

# Subplot 2: Validation data distribution
wedges, texts, autotexts = axes[1].pie(val_labels_visual, labels=val_labels_visual.index, autopct='%1.1f%%', colors=[label_colors[label] for label in val_labels_visual.index])
axes[1].set_title('Validation Data Distribution')

# Subplot 3: Test data distribution
wedges, texts, autotexts = axes[2].pie(test_labels_visual, labels=test_labels_visual.index, autopct='%1.1f%%', colors=[label_colors[label] for label in test_labels_visual.index])
axes[2].set_title('Test Data Distribution')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plots
plt.show()


## <font color='289C4E'>6.2. Vectorization ↗️<font><a class='anchor' id='vector'></a> [↑](#top)

For the vectorization we will use the **TextVectorization** layer from Keras. This process prepares text data for input into a machine learning model by converting text into numerical representations that the model can work with.

In [None]:
# Check the max vocabulary size

text_vectorizer = TextVectorization(split="whitespace",
                                    output_mode="int")

# Fit the text vectorizer
text_vectorizer.adapt(X)

# Get the number of unique tokens in the vocabulary
vocab_size = len(text_vectorizer.get_vocabulary())

# Print the vocabulary size
print("Vocabulary size:", vocab_size)
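
To illustrate what integer vectorization does conceptually, here is a plain-Python sketch. It is not the Keras implementation: it only mimics the convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens, and the helper names `build_vocab` and `vectorize` are hypothetical.

```python
from collections import Counter

def build_vocab(texts, max_tokens=None):
    """Map tokens to integer ids, most frequent first."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Index 0 is reserved for padding, index 1 for unknown tokens
    vocab = ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]
    if max_tokens is not None:
        vocab = vocab[:max_tokens]
    return {tok: idx for idx, tok in enumerate(vocab)}

def vectorize(text, vocab, seq_len):
    """Convert text to a fixed-length list of ids (truncate or pad with 0)."""
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

# Made-up corpus
corpus = ["cars pollute cities", "cars are useful"]
vocab = build_vocab(corpus)

encoded = vectorize("cars help cities", vocab, seq_len=5)
print(encoded)
```

The word `help` never appeared in the corpus, so it maps to the unknown index, and the two trailing zeros are padding. How ties between equally frequent tokens are ordered is an implementation detail and may differ in Keras.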

In [None]:
# Setup text vectorization with custom variables

# Set the maximum vocabulary size
# max_vocab_size = 10000
max_vocab_size = vocab_size 

# Calculate the maximum sequence length based on the average number of tokens in training data
average_tokens_per_sequence = round(sum([len(text.split()) for text in X_train]) / len(X_train))

# Create and configure the TextVectorization layer
text_vectorizer = TextVectorization(
    max_tokens=max_vocab_size,
#     ngrams=(3,5),
    output_mode="int",
    output_sequence_length=average_tokens_per_sequence,
    pad_to_max_tokens=True
)

# Adapt the TextVectorization layer to the training text
if len(X_train) > 0:
    text_vectorizer.adapt(X_train)
else:
    print("Warning: X_train is empty, adaptation skipped.")
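
Using the average token count as the output length means that every longer essay is truncated. The sketch below, on made-up token counts, compares the average with a high percentile as an alternative; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np

# Hypothetical token counts of six essays
lengths = np.array([120, 250, 300, 310, 400, 950])

# Sequence length based on the mean, as in the cell above
mean_len = int(round(float(lengths.mean())))

# Alternative: a length that covers 95% of the essays
p95_len = int(np.percentile(lengths, 95))

# Share of essays that would be truncated with the mean-based length
truncated_share = float((lengths > mean_len).mean())

print(mean_len, p95_len, round(truncated_share, 2))
```

A longer sequence length keeps more text per essay but increases memory use and training time, so the choice is a trade-off.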

## <font color='289C4E'>6.3. Embedding 📦<font><a class='anchor' id='emb'></a> [↑](#top)

After vectorization we will create an **Embedding** layer. This Embedding layer converts input text data (represented as indices or sequences of integers) into dense vectors of fixed size (output_dim) in the embedding space. These dense vectors serve as the input for subsequent layers in the neural network model, allowing the model to learn meaningful representations of words based on their contexts within the input sequences.

In [None]:
tf.random.set_seed(42)

embedding = layers.Embedding(input_dim=max_vocab_size,          
                             output_dim=128,
                             embeddings_initializer="uniform",
                             input_length=average_tokens_per_sequence)
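
The lookup performed by an embedding layer, followed by average pooling and a sigmoid unit, can be sketched in NumPy. This is a toy forward pass with random weights under assumed sizes (`vocab_size=10`, `embed_dim=4`), not the trained model; the uniform range of ±0.05 is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim = 10, 4

# One row per token id; small uniform values (assumed range)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# A batch with one padded sequence of token ids
token_ids = np.array([[2, 1, 4, 0, 0]])      # shape (batch, seq_len)

# Embedding lookup: replace each id by its row in the matrix
embedded = embedding_matrix[token_ids]       # shape (batch, seq_len, embed_dim)

# Global average pooling over the sequence axis
pooled = embedded.mean(axis=1)               # shape (batch, embed_dim)

# Single sigmoid unit with random weights
w = rng.normal(size=(embed_dim, 1))
prob = 1.0 / (1.0 + np.exp(-(pooled @ w)))   # shape (batch, 1)

print(embedded.shape, pooled.shape, prob.shape)
```

In this sketch the padded positions (id 0) are averaged in like any other token, which is also what happens when no masking is configured on the embedding layer.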

## <font color='289C4E'>6.4. RNN Model 🎢<font><a class='anchor' id='model'></a> [↑](#top) 

Our baseline model is set up for a binary classification task where the input is text data and the objective is to predict a binary outcome (Student or AI generated). The text is tokenized, embedded, averaged over the sequence and passed to a single sigmoid unit for classification. We have added some callbacks to control the training.

### <font color='289C4E'>6.4.1. Fitting 🏃<font><a class='anchor' id='fit'></a> [↑](#top) 

In [None]:
# Build the model

inputs = layers.Input(shape=(1,), dtype="string")

x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GlobalAveragePooling1D()(x)

outputs = layers.Dense(1, activation="sigmoid")(x)

keras_model = tf.keras.Model(inputs, outputs)

# Compile model
keras_model.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                metrics=["accuracy"])

# Get a summary of the model
keras_model.summary()

In [None]:
# Fit the model

callbacks = [ModelCheckpoint(filepath='keras_model', save_best_only=True, save_format='tf'),
             EarlyStopping(patience=7, monitor='val_loss', restore_best_weights = True),
             ReduceLROnPlateau(factor=0.2, patience=5, monitor='val_loss'),
             CSVLogger('keras_training_log.csv')]


keras_model_history = keras_model.fit(X_train,                                      
                                      y_train,
                                      epochs=100,
                                      validation_data=(X_val, y_val),
                                      callbacks=callbacks,
                                     # batch_size=32
                                     )
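
The idea behind the early stopping callback can be sketched in a few lines of plain Python. This is a simplified model of the logic (no `min_delta`, no weight restoring), and the function name and validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs in a row
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses per epoch
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.38, 0.39, 0.40]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, as used above, the model is rolled back to the weights of the best epoch rather than keeping those of the last one.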

### <font color='289C4E'>6.4.2. Learning Curves ↘️<font><a class='anchor' id='lc'></a> [↑](#top) 

These **learning curves** help in assessing the model's training progress and observing the trend of both training and validation losses over epochs. Both curves show a good trend and low values. However, the model fits very quickly and shows some overfitting.

In [None]:
# Plot learning curves

plt.figure(figsize=(10, 6))

# Plot training & validation loss values
plt.plot(keras_model_history.history['loss'], label='Training Loss')
plt.plot(keras_model_history.history['val_loss'], label='Validation Loss')

plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.grid(True)

# Set y-axis upper limit to 0.5
plt.ylim(top=0.5)

plt.show()
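
Besides reading the plot, overfitting can be checked numerically from the history dictionary. The sketch below is one possible way to do it: the helper `summarize_history` is hypothetical and the numbers are illustrative, not results from this notebook.

```python
def summarize_history(history):
    """Return the epoch with the lowest val_loss and the final validation-training gap."""
    val_loss = history["val_loss"]
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    final_gap = val_loss[-1] - history["loss"][-1]
    return best_epoch, final_gap


# Illustrative numbers only
toy_history = {
    "loss":     [0.60, 0.30, 0.15, 0.08, 0.04],
    "val_loss": [0.55, 0.32, 0.22, 0.24, 0.27],
}
toy_best_epoch, toy_gap = summarize_history(toy_history)
print(toy_best_epoch, round(toy_gap, 2))
```

A validation loss that bottoms out early while the training loss keeps falling, leaving a growing positive gap, is the usual sign of overfitting. For the real run, `keras_model_history.history` could be passed in the same way.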


### <font color='289C4E'>6.4.3. Confusion Matrix 🧮<font><a class='anchor' id='cm'></a> [↑](#top) 

The **confusion matrix** shows how accurately the model predicts each class and which misclassifications it makes on the test dataset (`X_test`). In our case we get excellent results, with few false positives (FP) and false negatives (FN).

In [None]:
# Make predictions on the input data (X_test) in the form of probabilities

keras_probabilities = keras_model.predict(X_test)
keras_probabilities[:5]

In [None]:
# Round prediction probabilities to 0 or 1 and turn them into a single-dimension tensor of floats

# tf.squeeze removes dimensions of size 1
keras_prediction = tf.squeeze(tf.round(keras_probabilities))
keras_prediction[:10]

In [None]:
# Plot confusion matrix

keras_cm = confusion_matrix(y_test, keras_prediction)
keras_cm_plot = ConfusionMatrixDisplay(confusion_matrix=keras_cm)

keras_cm_plot.plot()
plt.show()
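
As a reminder of how the four cells of the matrix are counted, here is a small pure-Python sketch. The helper `confusion_counts` and the toy labels and probabilities are hypothetical; the returned layout follows the scikit-learn convention (rows are true classes, columns are predicted classes).

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Count TN, FP, FN, TP for binary labels (0 = student, 1 = LLM-generated)."""
    tn = fp = fn = tp = 0
    for label, prob in zip(y_true, probabilities):
        predicted = 1 if prob > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1  # generated essay correctly flagged
        elif label == 1:
            fn += 1  # generated essay missed
        elif predicted == 1:
            fp += 1  # student essay wrongly flagged
        else:
            tn += 1  # student essay correctly passed
    return [[tn, fp], [fn, tp]]


toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [0.1, 0.7, 0.2, 0.9, 0.4, 0.8]
toy_cm = confusion_counts(toy_labels, toy_probs)
print(toy_cm)
```

In this task a false positive means a student essay flagged as AI-generated, so the top-right cell deserves particular attention.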

### <font color='289C4E'>6.4.4. Classification Report 🏅<font><a class='anchor' id='cr'></a> [↑](#top) 

The **classification report** summarizes the model's precision, recall, and f1-score for each class, helping to assess how well the model classifies instances in the test dataset (`X_test`). In our case we get excellent results on all metrics (precision, recall, f1-score, and accuracy).

In [None]:
# Predictions from the model on the test set
y_pred = keras_model.predict(X_test)

# Converting probabilities to classes (assuming a threshold of 0.5)
y_pred_classes = (y_pred > 0.5).astype(int)

# Printing the classification report
print(classification_report(y_test, y_pred_classes))
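
For reference, the per-class metrics in the report follow directly from the confusion-matrix counts. The sketch below shows the standard formulas for the positive class; the helper name `precision_recall_f1` and the counts are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of flagged essays that are generated
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of generated essays that are flagged
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean of the two
    return precision, recall, f1


# Illustrative counts only: 8 true positives, 2 false positives, 4 false negatives
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={toy_precision:.3f} recall={toy_recall:.3f} f1={toy_f1:.3f}")
```

Accuracy alone can look high when one class dominates, which is why the report lists these metrics per class.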

### <font color='289C4E'>6.4.5. AUC - ROC Curve 📐<font><a class='anchor' id='roc'></a> [↑](#top) 

The **ROC curve** illustrates the trade-off between the true positive rate and the false positive rate across different thresholds. The AUC value quantifies the model's overall ability to distinguish between the positive and negative classes, with a higher AUC indicating better performance. In our case, the results are again excellent. Note that this is the metric used to evaluate model performance in the competition.

In [None]:
# AUC -  ROC Curve

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# Calculate AUC
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

# Set labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# Set title and legend
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)

plt.show()
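
An equivalent way to read the AUC is as the probability that a randomly chosen generated essay receives a higher score than a randomly chosen student essay. The following pure-Python sketch computes it that way on toy data; the helper `pairwise_auc` is hypothetical and its quadratic loop is meant for illustration, not for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because the AUC depends only on how the scores rank the essays, it is computed from the predicted probabilities rather than from the rounded class labels.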


### <font color='289C4E'>6.4.6. Precision-Recall Curve 📞<font><a class='anchor' id='prc'></a> [↑](#top) 

The **precision-recall curve** illustrates the trade-off between precision and recall for different probability thresholds. A larger area under this curve (PR AUC) indicates better precision and recall across classification thresholds. This curve is especially informative when the class distribution is imbalanced. In our case, the results are again excellent.

In [None]:
# Precision-recall curve

# Calculate precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, keras_probabilities)

# Calculate AUC for precision-recall curve
pr_auc = auc(recall, precision)

# Plot precision-recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='orange', lw=2, label='Precision-Recall curve (AUC = {:.2f})'.format(pr_auc))

# Set labels
plt.xlabel('Recall')
plt.ylabel('Precision')

# Set title and legend
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True)

plt.show()
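
To see where the points on this curve come from, the sketch below sweeps every distinct score as a cut-off and computes precision and recall at each one. It is a simplified illustration: the helper `pr_points` and the toy data are hypothetical, and it assumes at least one positive label is present.

```python
def pr_points(y_true, scores):
    """Return (threshold, precision, recall) for every distinct score used as a cut-off."""
    total_pos = sum(y_true)  # assumes labels are 0/1 with at least one positive
    points = []
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for y, f in zip(y_true, flagged) if f and y == 1)
        fp = sum(1 for y, f in zip(y_true, flagged) if f and y == 0)
        points.append((threshold, tp / (tp + fp), tp / total_pos))
    return points


toy_points = pr_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
for thr, prec, rec in toy_points:
    print(f"threshold={thr:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Raising the threshold generally lowers recall, while precision can move in either direction, which is why the curve is not necessarily monotonic.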

# <font color='289C4E'>7. Submission 📩<font><a class='anchor' id='sub'></a> [↑](#top)

Let's submit our baseline model predictions!!

In [None]:
test_prediction = keras_model.predict(df_test_essays["text"])
test_prediction

In [None]:
# Create a DataFrame to store the submission
submission_df = df_test_essays[["id"]].copy()

# Add the formatted predictions to the submission DataFrame
submission_df["generated"] = test_prediction.squeeze()

# Save Submission
submission_df.to_csv('submission.csv',index=False)

# Display the first rows of the submission DataFrame
submission_df.head()
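
Before uploading, it can be worth checking that the file has the expected shape. The following sketch is one possible sanity check, assuming the submission must contain exactly the columns `id` and `generated` (as written above) with probabilities between 0 and 1. The helper `check_submission` and the sample ids are hypothetical.

```python
import csv
import io


def check_submission(csv_text):
    """Return a list of problems found in a submission CSV (an empty list means none found)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "generated"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            value = float(row["generated"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: not a number")
            continue
        if not 0.0 <= value <= 1.0:  # also rejects NaN
            problems.append(f"line {line_no}: {value} is outside [0, 1]")
    return problems


# Hypothetical file contents; the ids are made up for illustration
toy_submission = "id,generated\nabc123,0.12\ndef456,0.97\n"
print(check_submission(toy_submission))
```

For the real file, the text could be read with `open('submission.csv').read()` and passed to the same function.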