# Data augmentation using an LLM

The original dataset provided by the Kaggle competition is skewed towards student-written essays. It is therefore important to augment the data and increase the number of LLM-generated essays. Quite a few well-made datasets already tackle this problem; the most successful is the [DAIGT V2 Train Dataset][1].

Simply using this dataset would suffice to train our model of choice, but then we would not have done any data preparation ourselves. To contribute to the training data and broaden our knowledge of LLMs, we use a [causal language model][2] to generate essays and append them to the dataset mentioned above. A causal language model predicts the next token in a sequence from the tokens that precede it.
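To make next-token prediction concrete, here is a toy sketch that uses only the standard library. It is not how Mistral works internally: a real LLM conditions on the whole preceding context with a neural network, whereas this bigram model looks only at the previous token. The corpus and all function names are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for every token, which tokens follow it in the training sequence."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers


def generate(followers, start, max_new_tokens, rng):
    """Extend the sequence one token at a time by sampling a follower of the last token."""
    sequence = [start]
    for _ in range(max_new_tokens):
        candidates = followers.get(sequence[-1])
        if not candidates:  # no known continuation: stop early
            break
        tokens, counts = zip(*candidates.items())
        sequence.append(rng.choices(tokens, weights=counts, k=1)[0])
    return sequence


# Hypothetical toy corpus, already split into "tokens" (words)
corpus = "cars pollute cities and cities limit cars".split()
bigram_counts = train_bigram_model(corpus)
print(generate(bigram_counts, "cars", max_new_tokens=5, rng=random.Random(0)))
```

The generation loop has the same shape as `model.generate`: each new token is appended to the sequence and then becomes part of the context for the next prediction.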

Our model of choice is [`Mistral-7b-instruct-v0.1`][3], since, according to its authors, it outperforms the Llama 2 13B model on all tested benchmarks.

This notebook draws heavily on [binga's solution to generating essays][4] and, for its structure, to a lesser extent on [Ertuğrul Demir's notebook][5].


[1]: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/versions/2
[2]: https://huggingface.co/docs/transformers/tasks/language_modeling
[3]: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
[4]: https://www.kaggle.com/code/phanisrikanth/generate-synthetic-essays-with-mistral-7b-instruct
[5]: https://www.kaggle.com/code/datafan07/use-gemini-to-create-student-essays

# Load Model

In [1]:
# Import libraries
from tqdm import tqdm
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


# Define the path to the pre-trained model
model_path = "mistralai/Mistral-7B-Instruct-v0.1"

# Load the tokenizer and the model
# The tokenizer prepares input text for the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Initialize the causal language model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # Set the model's parameter data type to bfloat16 to reduce memory usage and speed up computations
    device_map="auto",           # Automatically assign the model's layers to available devices (CPU/GPU)
    trust_remote_code=True,      # Allow the execution of custom code from the model, if any
)


The model is loaded onto both T4 GPUs in about 3 minutes. Here is how the memory is used.
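One way to inspect per-GPU memory after loading is sketched below. The helper names `format_memory_report` and `collect_cuda_memory` are hypothetical. The sketch assumes PyTorch's `torch.cuda.memory_allocated` and `torch.cuda.get_device_properties`, which report only memory held by tensors in the current process, so the numbers can differ from what `nvidia-smi` shows.

```python
def format_memory_report(usage_by_device):
    """Format {device_index: (used_bytes, total_bytes)} as human-readable lines."""
    gib = 1024 ** 3
    lines = []
    for index in sorted(usage_by_device):
        used, total = usage_by_device[index]
        lines.append(
            f"cuda:{index}: {used / gib:.2f} GiB / {total / gib:.2f} GiB "
            f"({100 * used / total:.0f}%)"
        )
    return lines


def collect_cuda_memory():
    """Query PyTorch for the memory allocated on each visible GPU."""
    import torch  # imported lazily so the formatter also works on a machine without PyTorch

    usage = {}
    for index in range(torch.cuda.device_count()):
        used = torch.cuda.memory_allocated(index)
        total = torch.cuda.get_device_properties(index).total_memory
        usage[index] = (used, total)
    return usage
```

In the notebook, calling `print("\n".join(format_memory_report(collect_cuda_memory())))` after the model has loaded would print one line per GPU.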

In [2]:
def generate_essay(prompt):
    """
    Uses the Mistral-7b-instruct-v0.1 model to generate an essay based on a prompt.
    :param prompt: The input text we give to the model, describing the details of the essay
    :return: The generated essay (the text following the [/INST] marker)
    """
    # Creating a message dictionary with the user's role and the provided prompt
    messages = [{
        "role": "user",
        "content": prompt
    }]

    # Tokenizing the input messages and moving the tensor to the CUDA device (GPU)
    model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to('cuda')
    
    # Disable gradient calculations for inference (performance optimization)
    with torch.no_grad():
        # Generating text based on the provided input
        generated_ids = model.generate(
            model_inputs,
            max_new_tokens=7500,  # Setting the maximum number of new tokens to be generated
            do_sample=True,  # Enable random sampling for diverse outputs
            pad_token_id=tokenizer.eos_token_id  # Use end-of-sequence token for padding
        )

    # Decoding the generated token ids back into text
    decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Extracting the generated text after the instruction marker
    text = decoded[0].split("[/INST]")[1]

    # Returning the generated text
    return text
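The extraction step relies on the decoded text containing the `[/INST]` marker exactly once. As a hedged alternative, the hypothetical helper below uses `str.partition`, which keeps everything after the first marker and makes a missing marker an explicit error instead of an `IndexError`. The function name and error message are assumptions, not part of the original notebook.

```python
def extract_response(decoded_text, marker="[/INST]"):
    """Return everything that follows the first instruction marker.

    Raises ValueError when the marker is absent, which would indicate that
    the chat template did not produce the expected prompt format.
    """
    _, found, response = decoded_text.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in decoded text")
    return response


example = "[INST] Write an essay about car-free cities. [/INST] Car-free cities are quieter."
print(extract_response(example))
```

Unlike `split("[/INST]")[1]`, which returns only the text between the first and second markers, this keeps the full response even if the model happens to emit the marker again.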

# Read Prompts

First, read the original prompts from the competition's `train_prompts.csv` file.
We then add the extra prompts found in the [DAIGT V2 Train Dataset][1].

[1]: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset

In [3]:
# Load in the original dataset
path = Path("/kaggle/input/llm-detect-ai-generated-text/")

train_prompts = pd.read_csv(f"{path}/train_prompts.csv")

# Add the extra prompts
extra_prompts = [
    '"A Cowboy Who Rode the Waves"',
    'Exploring Venus',
    'Facial action coding system',
    'The Face on Mars',
    'Driverless cars',
]

# The instructions for the extra prompts, generated with ChatGPT
# I provided ChatGPT with the two instructions given by the training dataset and asked it to continue the pattern
extra_instructions = [
    'Write a creative short story about a cowboy who takes up surfing. In your story, blend elements of traditional western cowboy culture with the contemporary surf lifestyle. Manage your time carefully to brainstorm ideas; outline your story; write your narrative; and revise and edit your work. Be sure to develop a compelling character; explore the challenges and transformations he faces; use descriptive language to contrast and merge the two distinct lifestyles; and draw inspiration from a variety of sources while maintaining originality. Your story should be structured as a captivating, multiparagraph narrative. Write your story in the space provided.',
    'Write a detailed proposal to a space agency advocating for a new mission to Venus. In your proposal, highlight the scientific and exploratory benefits of such a mission. Manage your time carefully to research the topic; outline your proposal; write your proposal; and revise and edit your response. Be sure to include a clear objective for the mission; address potential challenges and solutions; use evidence from existing space research and missions; and avoid overly relying on a single source of information. Your proposal should be structured as a well-organized, multiparagraph document. Write your proposal in the space provided.',
    'Prepare a research paper explaining the Facial Action Coding System (FACS). In your paper, discuss the history, development, and applications of FACS in various fields such as psychology, animation, and artificial intelligence. Manage your time to research the topic; plan your paper; write your draft; and revise and edit your work. Ensure to cover the theoretical underpinnings of FACS; detail its methodology and coding scheme; use examples from diverse studies; and avoid relying excessively on one source. Your paper should be presented as a structured, multiparagraph academic essay. Write your research paper in the space provided.',
    'Write a scientific article analyzing the phenomenon of the Face on Mars observed in Viking 1 orbiter images. In your article, explore the history of this observation, its impact on popular culture and science, and the scientific explanation behind this visual effect. Manage your time to conduct thorough research; plan your article; write the initial draft; and revise and edit your work. Be sure to include a discussion on pareidolia; reference various Mars missions and their findings; use evidence from space research and imaging technology; and avoid relying solely on one source. Your article should be structured as a detailed, multiparagraph exploration of this topic. Write your article in the space provided.',
    'Compose an informative report on the development and future of driverless cars. In your report, discuss the technological advancements, potential benefits, and challenges associated with autonomous vehicles. Allocate time to conduct comprehensive research; outline your report; write your initial draft; and revise and edit your work. Ensure to cover the evolution of driverless technology; analyze the impact on transportation, safety, and urban planning; use data from various technological and automotive studies; and avoid over-reliance on a single source. Your report should be presented as a well-structured, multiparagraph document. Write your report in the space provided.'
]

# Creating a new DataFrame from the extra prompts and instructions
new_prompts = pd.DataFrame({
    'prompt_name': extra_prompts,
    'instructions': extra_instructions,
})

# Assigning new prompt IDs
max_id = train_prompts['prompt_id'].max()
new_prompts['prompt_id'] = range(max_id + 1, max_id + 1 + len(new_prompts))

# Assigning empty strings to source_text
new_prompts['source_text'] = ''

# Concatenating with the existing DataFrame
train_prompts = pd.concat([train_prompts, new_prompts], ignore_index=True)

In [4]:
train_prompts

Unnamed: 0,prompt_id,prompt_name,instructions,source_text
0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,1,Does the electoral college work?,Write a letter to your state senator in which ...,# What Is the Electoral College? by the Office...
2,2,"""A Cowboy Who Rode the Waves""",,
3,3,Exploring Venus,,
4,4,Facial action coding system,,
5,5,The Face on Mars,,
6,6,Driverless cars,,


As we generate synthetic essays, we want their lengths to be close to those of human essays, so that a model cannot use essay length as a shortcut to distinguish AI-generated content from human writing.

To do this, we compute basic statistics of the human essays: the mean length (in characters) and its standard deviation. Assuming the lengths of synthetic essays follow a normal distribution with these parameters, we randomly sample a number from that distribution and ask for an essay of that length.
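A normal distribution has unbounded tails, so a sampled length can in principle be very small or even negative. The sketch below shows one possible safeguard using only the standard library: sample from the normal distribution, then clamp to a range. The bounds `min_length` and `max_length` and the function name are assumptions for illustration; the mean and standard deviation are the values this notebook reports for human essays.

```python
import random


def sample_essay_lengths(mean, std, count, min_length, max_length, rng):
    """Draw `count` integer lengths from N(mean, std) and clamp them to [min_length, max_length]."""
    lengths = []
    for _ in range(count):
        value = int(rng.gauss(mean, std))
        lengths.append(max(min_length, min(max_length, value)))
    return lengths


# 3 essays for each of 7 prompts; the bounds are hypothetical
lengths = sample_essay_lengths(
    mean=3172, std=918, count=21, min_length=500, max_length=6000, rng=random.Random(42)
)
print(min(lengths), max(lengths))
```

Clamping distorts the distribution slightly at the edges, which is usually acceptable when the bounds lie several standard deviations from the mean.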

In [5]:
train_essays = pd.read_csv(f"{path}/train_essays.csv")
train_essays['text_length'] = train_essays['text'].str.len()

text_len_mean = int(train_essays.query("generated == 0")['text_length'].mean())
text_len_std = int(train_essays.query("generated == 0")['text_length'].std())

print(f"Mean length of human-written train essays: {text_len_mean}")
print(f"Standard deviation of the length of human-written train essays: {text_len_std}")

Mean length of human-written train essays: 3172
Standard deviation of the length of human-written train essays: 918
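Note that `str.len()` counts characters, while the generation prompt later asks the model for a number of words. If a word limit is what is intended, the same statistics can be computed over word counts instead. The sketch below is a minimal standard-library version; `word_length_stats` and the sample texts are hypothetical.

```python
import statistics


def word_length_stats(texts):
    """Return (mean, sample standard deviation) of whitespace-separated word counts, truncated to int."""
    word_counts = [len(text.split()) for text in texts]
    return int(statistics.mean(word_counts)), int(statistics.stdev(word_counts))


# Hypothetical stand-ins for human-written essays
sample_texts = ["a b c", "a b c d e", "a"]
print(word_length_stats(sample_texts))
```

With pandas, the equivalent word count per essay would be `train_essays['text'].str.split().str.len()`. Both `statistics.stdev` and the pandas `.std()` default compute the sample standard deviation.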


# Generate Essays

In [6]:
# configuration for generating essays
config = {
    'num_essays': 3,    # Number of essays to generate for a topic
    'typo_prob': 0.2,   # probability to have typos in the essay
}

# number of distinct prompts
num_prompts = len(train_prompts['prompt_id'].unique())

# The length of the different essays (follows a normal distribution)
synthetic_essay_lengths = np.random.normal(text_len_mean, 
                                           text_len_std, 
                                           config['num_essays'] * num_prompts).astype(int)

In [8]:
llm_essays = []  # List to store generated essays

for prompt_id in train_prompts['prompt_id'].unique():
    for k in tqdm(range(config['num_essays']), desc=f"prompt_id: {prompt_id}"):
        prompt_name = train_prompts.loc[prompt_id, 'prompt_name']
        instructions = train_prompts.loc[prompt_id, 'instructions']

        # Determine if typos should be included
        include_typos = np.random.rand() < config['typo_prob']
        typo_text = "\nTry to add a minimal amount of typos and mistakes where a student of your grade would do." if include_typos else ""

        # Construct the prompt
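        student_grade = str(np.random.randint(6, 13))  # Grades 6 to 12 (the upper bound is exclusive)
        # NOTE: the sampled lengths come from character counts (str.len()),
        # but the prompt below presents the value as a word limit
        word_limit = synthetic_essay_lengths[prompt_id * config['num_essays'] + k]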
        prompt_combined = (
            f"You are a grade {student_grade} student working on the following assignment.\n\n"
            f"Create an essay based on the following topic in no more than {word_limit} words."
            f"{typo_text}\n\nTopic: {prompt_name}\n\nInstructions:\n\n{instructions}"
        )

        # Generate the essay
        essay_output = generate_essay(prompt_combined)

        # Store the generated essay data
        data_output = {
            'text': essay_output,
            'label': 1,
            'prompt_name': prompt_name,
            'source': 'Mistral-7b-instruct-v0.1',
            'RDizzl3_seven': True
        }
        llm_essays.append(data_output)
        
    
llm_essays = pd.DataFrame(llm_essays)
llm_essays

prompt_id: 0: 100%|██████████| 3/3 [02:45<00:00, 55.18s/it]
prompt_id: 1: 100%|██████████| 3/3 [02:46<00:00, 55.55s/it]
prompt_id: 2: 100%|██████████| 3/3 [02:28<00:00, 49.64s/it]
prompt_id: 3: 100%|██████████| 3/3 [03:01<00:00, 60.57s/it]
prompt_id: 4: 100%|██████████| 3/3 [02:10<00:00, 43.43s/it]
prompt_id: 5: 100%|██████████| 3/3 [01:37<00:00, 32.45s/it]
prompt_id: 6: 100%|██████████| 3/3 [02:29<00:00, 49.78s/it]


Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Car-free cities have become an increasingly p...,1,Car-free cities,Mistral-7b-instruct-v0.1,True
1,Cities around the world are growing in popula...,1,Car-free cities,Mistral-7b-instruct-v0.1,True
2,Car-free cities have been a topic for discuss...,1,Car-free cities,Mistral-7b-instruct-v0.1,True
3,"Dear State Senator,\n\nThe topic of the Elect...",1,Does the electoral college work?,Mistral-7b-instruct-v0.1,True
4,"Dear [State Senator’s Name],\n\nFirstly, I wa...",1,Does the electoral college work?,Mistral-7b-instruct-v0.1,True
5,"Dear Senator [Name],\n\nI am writing to expre...",1,Does the electoral college work?,Mistral-7b-instruct-v0.1,True
6,"As the sun rose over the wide open plain, a l...",1,"""A Cowboy Who Rode the Waves""",Mistral-7b-instruct-v0.1,True
7,A Cowboy Who Rode the Waves\n\nThe American W...,1,"""A Cowboy Who Rode the Waves""",Mistral-7b-instruct-v0.1,True
8,A Cowboy Who Rode the Waves\n\nThe cowboy rod...,1,"""A Cowboy Who Rode the Waves""",Mistral-7b-instruct-v0.1,True
9,Exploring Venus: A Journey to the Future\n\nV...,1,Exploring Venus,Mistral-7b-instruct-v0.1,True


In [9]:
print(llm_essays['text'][0])

 Car-free cities have become an increasingly popular concept in recent years as people become more environmentally conscious and aware of the damage that cars can do to our cities. This essay will explore the many advantages of limiting car usage in cities and argue for the implementation of car-free city policies.

First and foremost, car-free cities would prioritize public transportation, walking, and cycling, which would lead to a reduction in air pollution. Cars are one of the major contributors to air pollution, as they emit significant amounts of harmful chemicals and gases into the atmosphere. By reducing the number of cars on the road, cities could significantly decrease air pollution levels, leading to a healthier environment for citizens. Additionally, a decrease in air pollution would reduce the amount of money spent on healthcare due to respiratory illnesses.

Secondly, car-free cities would promote physical activity and lead to a healthier population. Cars are a sedentary 

## Combine our generated essays with the DAIGT V2 Train Dataset

In [11]:
# Load in the DAIGT V2 Train Dataset
daigt_dataset = pd.read_csv('/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv')

# Combine the two datasets
augmented_dataset = pd.concat([llm_essays, daigt_dataset], ignore_index=True)

# Save the result as a CSV file
augmented_dataset.to_csv('daigt-v2-train-dataset-augmented.csv', index=False)

augmented_dataset

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Car-free cities have become an increasingly p...,1,Car-free cities,Mistral-7b-instruct-v0.1,True
1,Cities around the world are growing in popula...,1,Car-free cities,Mistral-7b-instruct-v0.1,True
2,Car-free cities have been a topic for discuss...,1,Car-free cities,Mistral-7b-instruct-v0.1,True
3,"Dear State Senator,\n\nThe topic of the Elect...",1,Does the electoral college work?,Mistral-7b-instruct-v0.1,True
4,"Dear [State Senator’s Name],\n\nFirstly, I wa...",1,Does the electoral college work?,Mistral-7b-instruct-v0.1,True
...,...,...,...,...,...
44884,"Dear Senator,\n\nI am writing to you today to ...",1,Does the electoral college work?,kingki19_palm,True
44885,"Dear Senator,\n\nI am writing to you today to ...",1,Does the electoral college work?,kingki19_palm,True
44886,"Dear Senator,\n\nI am writing to you today to ...",1,Does the electoral college work?,kingki19_palm,True
44887,"Dear Senator,\n\nI am writing to you today to ...",1,Does the electoral college work?,kingki19_palm,True
