This notebook has been created as part of the LLM - Detect AI Generated Text competition from Kaggle.

The competition dataset comprises about 10,000 essays, some written by students and some generated by a variety of large language models (LLMs). The goal of the competition is to determine whether or not a given essay was generated by an LLM.

All of the essays were written in response to one of seven essay prompts. In each prompt, the students were instructed to read one or more source texts and then write a response. This same information may or may not have been provided as input to an LLM when generating an essay.

Essays from two of the prompts compose the training set; the remaining essays compose the hidden test set. Nearly all of the training set essays were written by students, with only a few generated essays given as examples.

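A benefit of this project is hands-on experience with Transformer models, specifically BERT and DistilBERT. The Transformer architecture was introduced by Google researchers in 2017; BERT followed from Google in 2018, and DistilBERT, a smaller distilled version of BERT, was released by Hugging Face in 2019.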

Let's talk about the dataset:

DATASET
`test|train_essays.csv`

`id` - A unique identifier for each essay.


`prompt_id` - Identifies the prompt the essay was written in response to.

`text` - The essay text itself.

`generated` - Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in test_essays.csv.

`train_prompts.csv` - Essays were written in response to information in these fields.

`prompt_id` - A unique identifier for each prompt.

`prompt_name` - The title of the prompt.

`instructions` - The instructions given to students.

`source_text` - The text of the article(s) the essays were written in response to, in Markdown format. Significant paragraphs are enumerated by a numeral preceding the paragraph on the same line, as in `0 Paragraph one.\n\n1 Paragraph two.` Essays sometimes refer to a paragraph by its numeral. Each article is preceded with its title in a heading, like `# Title`. When an author is indicated, their name will be given in the title after `by`. Not all articles have authors indicated. An article may have subheadings indicated like `## Subheading`.

`sample_submission.csv` - A submission file in the correct format. See the Evaluation page for details.
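
The `source_text` layout described above (a `# Title` heading, then paragraphs prefixed by a numeral and separated by blank lines) can be split apart with a few lines of Python. The following is only a sketch on a made-up string: `parse_source_text` is a hypothetical helper, not part of the competition code, and it assumes every numbered paragraph starts with its numeral followed by whitespace.

```python
import re

# Toy source_text in the format described above (made-up content).
source_text = (
    "# Example Article by Jane Doe\n\n"
    "0 First paragraph.\n\n"
    "1 Second paragraph.\n\n"
    "## A Subheading\n\n"
    "2 Third paragraph."
)

def parse_source_text(text):
    """Split a source_text string into article titles and numbered paragraphs."""
    titles, paragraphs = [], {}
    for block in text.split("\n\n"):
        block = block.strip()
        if block.startswith("# "):
            # Single '#' heading: an article title (subheadings use '##').
            titles.append(block[2:])
        else:
            match = re.match(r"(\d+)\s+(.*)", block, flags=re.DOTALL)
            if match:
                # Numbered paragraph: key it by its numeral.
                paragraphs[int(match.group(1))] = match.group(2)
    return titles, paragraphs

titles, paragraphs = parse_source_text(source_text)
print(titles)
print(paragraphs)
```

Keying paragraphs by numeral makes it possible to look up the passage an essay refers to when it cites a paragraph number.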

In [1]:
import os 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import spacy
import regex as re

#WordCloud
#A word cloud is a visualization of word frequency in which words that appear
#more frequently in the text are displayed in larger fonts.
from wordcloud import WordCloud,STOPWORDS
from collections import Counter

#Set the backend for Keras (must be set before keras_core is imported;
#the valid name is "tensorflow", not "tf")
os.environ["KERAS_BACKEND"] = "tensorflow"

#Tensorflow/KERAS
import tensorflow as tf 
import keras_core as keras 
from keras_core import layers,Sequential
from keras_core.callbacks import ModelCheckpoint,EarlyStopping,ReduceLROnPlateau,CSVLogger,LearningRateScheduler

#Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay,classification_report,precision_recall_curve,roc_curve,auc 

#Set seed for reproducibility
keras.utils.set_random_seed(42)

#Use mixed precision to speed up all training
keras.mixed_precision.set_global_policy("mixed_float16")

#Check Versions 
print(f'Keras Version: {keras.__version__}')
print(f'Tensorflow Version: {tf.__version__}')



Using TensorFlow backend
Keras Version: 0.1.7
Tensorflow Version: 2.15.0


The function `compress(df, verbose=True)` reduces the memory usage of a DataFrame by downcasting its numerical columns, and it reports details about the optimization. Here is a breakdown of its parameters and steps:

`df`: This is the input DataFrame that you want to compress.

`verbose`: This is an optional parameter that determines whether the function should print information about the size reduction. By default, it's set to True, meaning it will print information; you can set it to False if you don't want the information to be printed.

The function performs the following steps:

Calculates the initial memory usage of the DataFrame (`input_size`) in megabytes (MB).

If the verbose parameter is set to True, it prints the initial size of the DataFrame.

Creates a copy of the original data types of the DataFrame's columns in `dtype_before`.

Iterates through each numerical column in the DataFrame (columns with data types '`float64`' and '`int64`').

For each numerical column, it checks the minimum and maximum values in the column (`col_min` and `col_max`).

If the column's data type is '`int64`', it attempts to downcast the column to the smallest integer data type that can hold the range of values in that column (e.g., '`int8`', '`int16`', '`int32`', or '`int64`').

If the column's data type is '`float64`', it attempts to downcast the column to a smaller floating-point data type that can represent the range of values in that column (e.g., '`float32`' or '`float64`').

The downcasting is done using NumPy data type casting (`np.int8`, `np.int16`, `np.int32`, `np.int64`, `np.float32`, `np.float64`).

After downcasting, the DataFrame's memory usage is reduced because the smaller data types occupy less memory.

If verbose is True, it prints the new memory usage of the DataFrame and the percentage reduction in size.

It filters the DataFrame to keep only numerical columns for comparison (`numeric_columns`).

It creates a copy of the data types after compression in `dtype_after`.

It constructs a comparison DataFrame `comparison_df` that shows the data types before and after compression for each numerical column. The `Size Reduction` column holds the overall percentage reduction of the whole DataFrame, so every row carries the same value.

The function returns the compressed DataFrame (`df`) and the comparison DataFrame (`comparison_df`). Note that the input DataFrame is modified in place.

This function is useful for optimizing memory usage when working with large DataFrames, especially in situations where efficient memory usage is crucial. It not only reduces memory usage but also provides detailed information about the optimization process for analysis and reporting.
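
The range check at the heart of the integer downcast can be sketched in isolation. The helper name `smallest_int_dtype` is hypothetical and exists only for illustration; it uses inclusive bounds so that boundary values such as 127 still fit `int8`.

```python
import numpy as np

def smallest_int_dtype(col_min, col_max):
    """Return the narrowest signed integer dtype whose range covers [col_min, col_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= col_min and col_max <= info.max:
            return dtype
    raise ValueError("range does not fit in int64")

print(smallest_int_dtype(0, 1).__name__)       # int8
print(smallest_int_dtype(-128, 127).__name__)  # int8
print(smallest_int_dtype(0, 40_000).__name__)  # int32
```

A binary label column such as `generated` falls in the first case, which is why it can be stored in a single byte per row.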

# Numerical Downcast Function

In [2]:
def compress(df, verbose=True):
    """
    Reduces the size of the DataFrame by downcasting numerical columns
    """
    input_size = df.memory_usage(index=True).sum() / (1024 ** 2)
    if verbose:
        print("Old dataframe size:", round(input_size, 2), 'MB')

    in_size = df.memory_usage(index=True).sum()
    dtype_before = df.dtypes.copy()  # Copy of original data types

    for col in df.select_dtypes(include=['float64', 'int64']):
        col_type = df[col].dtype
        col_min, col_max = df[col].min(), df[col].max()

        if col_type == 'int64':
            # Pick the narrowest integer type whose range contains the column's
            # values (bounds are inclusive); otherwise the column stays int64.
            if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
        elif col_type == 'float64':
            # float16 is skipped because it warns of overflow.
            # Note: float32 keeps the value range but reduces precision.
            if col_min >= np.finfo(np.float32).min and col_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)

    out_size = df.memory_usage(index=True).sum()
    ratio = (1 - round(out_size / in_size, 2)) * 100

    if verbose:
        print("Optimized size by {}%".format(round(ratio, 2)))
        print("New DataFrame size:", round(out_size / (1024 ** 2), 2), "MB")

    # Filter only numerical columns for comparison
    numeric_columns = df.select_dtypes(include=['float32', 'float64', 'int8', 'int16', 'int32', 'int64'])
    dtype_after = numeric_columns.dtypes.copy()  # Copy of data types after compression
    
    # Create a comparison DataFrame
    comparison_df = pd.DataFrame({'Before': dtype_before[numeric_columns.columns], 'After': dtype_after})
    comparison_df['Size Reduction'] = ratio

    return df, comparison_df
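
As an alternative to hand-written range checks, pandas ships a built-in downcast via `pd.to_numeric`. The sketch below is not the method used in this notebook; it applies the built-in to a toy frame with made-up values and works on a copy so the original frame stays unchanged.

```python
import pandas as pd

# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "flag": [0, 1, 1, 0],              # fits in int8
    "count": [10, 500, 40_000, 7],     # needs int32
    "score": [0.5, 1.25, 3.0, 2.75],   # float64 by default
})

compact = toy.copy()  # leave the original frame untouched
for col in compact.select_dtypes(include="int64"):
    compact[col] = pd.to_numeric(compact[col], downcast="integer")
for col in compact.select_dtypes(include="float64"):
    compact[col] = pd.to_numeric(compact[col], downcast="float")

print(compact.dtypes)
print(toy.memory_usage(index=False).sum(), "->", compact.memory_usage(index=False).sum(), "bytes")
```

`downcast="float"` never goes below `float32`, which matches the decision above to avoid `float16`.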

# WordCloud Function

In [3]:
# Generate WordCloud

def generate_wordcloud_subplot(df, label_value, subplot_position, max_words=1000, width=800, height=400, top_n = 10):
    """
    Generate a word cloud for a specific label value and display it in a subplot.

    Args:
        df (DataFrame): The DataFrame containing text data and labels.
        label_value (int): The label value for which to generate the word cloud.
        subplot_position (int): The position of the subplot where the word cloud will be displayed.
        max_words (int, optional): Maximum number of words to include in the word cloud. Default is 1000.
        width (int, optional): Width of the word cloud image. Default is 800.
        height (int, optional): Height of the word cloud image. Default is 400.
        top_n (int, optional): Number of most and least common words to print. Default is 10.

    Returns:
        None
    """

    # Select the text subset for the specified label value
    text_subset = df[df.generated == label_value].text

    # Define stopwords to be excluded
    stopwords = set(STOPWORDS)

    # Create a WordCloud object with specified parameters
    wc = WordCloud(max_words=max_words, width=width, height=height, stopwords=stopwords)

    # Generate the word cloud from the selected text subset
    wc.generate(" ".join(text_subset))

    # Create a subplot and display the word cloud
    plt.subplot(subplot_position)
    plt.imshow(wc, interpolation='bilinear')

    # Set the title for the word cloud plot
    title = f'WordCloud for Label {label_value} ({("Student" if label_value == 0 else "AI")})'
    plt.title(title)
    
    # Count occurrences of words in the text subset
    words_count = Counter(" ".join(text_subset).split())
    top_words = words_count.most_common(top_n)
    bottom_words = words_count.most_common()[:-top_n-1:-1]  # Extract least common words

    # Print the most common words
    print(f"Top {top_n} words for Label {label_value}:")
    for idx, (word, count) in enumerate(top_words, start=1):
        print(f"{idx}. {word}: {count} times")
    print("------------------------------")

    # Print the least common words
    print(f"Least {top_n} words for Label {label_value}:")
    for idx, (word, count) in enumerate(bottom_words, start=1):
        print(f"{idx}. {word}: {count} times")
    print("------------------------------")
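
The function above counts raw whitespace-separated tokens, so `Cars` and `cars.` are treated as different words and stopwords tend to dominate the top of the list. A possible refinement, sketched here under assumptions, lowercases the text, strips punctuation, and skips stopwords before counting. `top_words` and `TOY_STOPWORDS` are hypothetical names, and the two sample sentences are made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (hypothetical; wordcloud.STOPWORDS is much larger).
TOY_STOPWORDS = {"the", "a", "of", "and", "are", "is", "to"}

def top_words(texts, n=2, stopwords=TOY_STOPWORDS):
    """Count lowercase word tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; punctuation is dropped.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

essays = ["Cars are useful. Cars pollute the air.", "The air is cleaner without cars."]
print(top_words(essays))  # [('cars', 3), ('air', 2)]
```

In the notebook, the same idea could be applied with the `STOPWORDS` set already imported from `wordcloud`.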

# Import Data

There are 3 files, which contain the following information:

`test|train_essays.csv`

`id`: A unique identifier for each essay.

`prompt_id`: Identifies the prompt the essay was written in response to.

`text`: The essay text itself.

`generated`: Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in `test_essays.csv`.

`train_prompts.csv`

Essays were written in response to information in these fields.

`prompt_id`: A unique identifier for each prompt.

`prompt_name`: The title of the prompt.

`instructions`: The instructions given to students.

`source_text`: The text of the article(s) the essays were written in response to, in Markdown format. Significant paragraphs are enumerated by a numeral preceding the paragraph on the same line, as in `0 Paragraph one.\n\n1 Paragraph two.` Essays sometimes refer to a paragraph by its numeral. Each article is preceded with its title in a heading, like `# Title`. When an author is indicated, their name will be given in the title after `by`. Not all articles have authors indicated. An article may have subheadings indicated like `## Subheading`.

In [4]:
# Import Data

data_path = 'Data/'

for dirname, _, filenames in os.walk(data_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Data/sample_submission.csv
Data/test_essays.csv
Data/train_essays.csv
Data/train_prompts.csv


# Essays Dataset

## Train Dataset 

In [5]:
#Import Data

df_train_essays = pd.read_csv(data_path + "train_essays.csv")
print(df_train_essays.info())
df_train_essays.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378 entries, 0 to 1377
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1378 non-null   object
 1   prompt_id  1378 non-null   int64 
 2   text       1378 non-null   object
 3   generated  1378 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.2+ KB
None


| | id | prompt_id | text | generated |
|---|---|---|---|---|
| 0 | 0059830c | 0 | Cars. Cars have been around since they became ... | 0 |
| 1 | 005db917 | 0 | Transportation is a large necessity in most co... | 0 |
| 2 | 008f63e3 | 0 | "America's love affair with it's vehicles seem... | 0 |
| 3 | 00940276 | 0 | How often do you ride in a car? Do you drive a... | 0 |
| 4 | 00c39458 | 0 | Cars are a wonderful thing. They are perhaps o... | 0 |
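
Because nearly all training essays are student-written, it may be worth checking the label balance before modeling. The sketch below uses a toy frame with made-up labels, not the competition data; on the real data the same calls would be applied to `df_train_essays["generated"]`.

```python
import pandas as pd

# Toy labels (made up) mimicking a heavily imbalanced target column.
toy = pd.DataFrame({"generated": [0] * 97 + [1] * 3})

counts = toy["generated"].value_counts()                # absolute counts per class
shares = toy["generated"].value_counts(normalize=True)  # fraction per class

print(counts.to_dict())  # {0: 97, 1: 3}
print(shares.round(2).to_dict())
```

With an imbalance of this kind, plain accuracy is a weak signal, which is one reason to look at ROC and precision-recall curves instead.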


In [6]:
df_train_essays,compress_df = compress(df_train_essays,verbose=True)
compress_df

Old dataframe size: 0.04 MB
Optimized size by 44.0%
New DataFrame size: 0.02 MB


| | Before | After | Size Reduction |
|---|---|---|---|
| prompt_id | int64 | int8 | 44.0 |
| generated | int64 | int8 | 44.0 |


In [11]:
df_train_essays.text[0]

'Cars. Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first ModelT. Cars have played a major role in our every day lives since then. But now, people are starting to question if limiting car usage would be a good thing. To me, limiting the use of cars might be a good thing to do.\n\nIn like matter of this, article, "In German Suburb, Life Goes On Without Cars," by Elizabeth Rosenthal states, how automobiles are the linchpin of suburbs, where middle class families from either Shanghai or Chicago tend to make their homes. Experts say how this is a huge impediment to current efforts to reduce greenhouse gas emissions from tailpipe. Passenger cars are responsible for 12 percent of greenhouse gas emissions in Europe...and up to 50 percent in some carintensive areas in the United States. Cars are the main reason for the greenhouse gas emissions because of a lot of people driving them around all the time getting where they need to go. Article