Dataset Proposal:

To train a model that detects AI-generated text, I will use the HC3 dataset: https://huggingface.co/datasets/Hello-SimpleAI/HC3/tree/main. The dataset consists of 24,322 questions, each paired with human answers and answers generated by ChatGPT.

In [7]:
import json

# Function to clean text by removing backslashes before apostrophes and double quotes
def clean_text(text):
    return text.replace("\\'", "'").replace('\\"', '"')

# Open the JSON Lines file in text mode to read lines directly
with open('all.jsonl', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# Initialize lists to store the separated data
questions = []
human_answers = []
chatgpt_answers = []

# Process each line as a JSON object and extract the desired fields
for line in lines:
    data = json.loads(line)
    questions.append(clean_text(data['question']))

    # Clean up the human answers
    human_cleaned = [clean_text(answer) for answer in data['human_answers']]
    human_answers.append(human_cleaned)

    # Clean up the ChatGPT answers
    chatgpt_cleaned = [clean_text(answer) for answer in data['chatgpt_answers']]
    chatgpt_answers.append(chatgpt_cleaned)

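# Count the entries in each list. Each element of human_answers and chatgpt_answers
# is the list of answers for one question, so all three counts equal the number of questions.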
total_questions = len(questions)
total_human_answers = len(human_answers)
total_chatgpt_answers = len(chatgpt_answers)

# Print the total counts
print("Total Questions:", total_questions)
print("Total Human Answers:", total_human_answers)
print("Total ChatGPT Answers:", total_chatgpt_answers)

print("----------------- Sample Q/A from Dataset ----------------------")
# Print sample from dataset
print("Questions:", questions[:1])
print(" ")
print("Human Answers:", human_answers[:1])
print(" ")
print("ChatGPT Answers:", chatgpt_answers[:1])

Total Questions: 24322
Total Human Answers: 24322
Total ChatGPT Answers: 24322
----------------- Sample Q/A from Dataset ----------------------
Questions: ['Why is every book I hear about a " NY Times # 1 Best Seller " ? ELI5 : Why is every book I hear about a " NY Times # 1 Best Seller " ? Should n\'t there only be one " # 1 " best seller ? Please explain like I\'m five.']
 
Human Answers: [['Basically there are many categories of " Best Seller " . Replace " Best Seller " by something like " Oscars " and every " best seller " book is basically an " oscar - winning " book . May not have won the " Best film " , but even if you won the best director or best script , you \'re still an " oscar - winning " film . Same thing for best sellers . Also , IIRC the rankings change every week or something like that . Some you might not be best seller one week , but you may be the next week . I guess even if you do n\'t stay there for long , you still achieved the status . Hence , # 1 best seller .'
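
The three totals printed above are per-question counts, because each entry in human_answers and chatgpt_answers is itself a list. A minimal sketch of counting individual answers instead follows. The sample lists and the helper name count_individual_answers are hypothetical and only mirror the shape of the lists built in the cell above.

```python
# Hypothetical in-memory sample with the same shape as the lists built above:
# one inner list of answers per question.
sample_human_answers = [["answer a", "answer b"], ["answer c"]]
sample_chatgpt_answers = [["answer d"], []]


def count_individual_answers(answer_lists):
    """Sum the lengths of the per-question answer lists."""
    return sum(len(answers) for answers in answer_lists)


num_questions = len(sample_human_answers)
num_human = count_individual_answers(sample_human_answers)
num_chatgpt = count_individual_answers(sample_chatgpt_answers)
print(num_questions, num_human, num_chatgpt)
```

Applied to the real lists, the same helper would show whether some questions carry several human answers or no ChatGPT answer at all.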

In [10]:
import json
import csv

# Initialize lists to store the separated data
questions = []
human_answers = []
chatgpt_answers = []

# Open the JSON Lines file as text; errors='ignore' silently drops bytes that are not valid UTF-8
with open('all.jsonl', 'r', encoding='utf-8', errors='ignore') as f:
    # Process each line as a JSON object and extract the desired fields
    for line in f:
        if line.strip():  # Skip empty lines
            data = json.loads(line)
            questions.append(data['question'])
            human_answers.append(data['human_answers'])
            chatgpt_answers.append(data['chatgpt_answers'])

# Zip the lists together to create rows for the CSV file
rows = zip(questions, human_answers, chatgpt_answers)

# Define the CSV file path and headers
csv_file = 'output.csv'
headers = ['question', 'human_answer', 'chatgpt_answer']

# Write the data to the CSV file
with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(headers)  # Write the headers
    writer.writerows(rows)  # Write the rows of data

print(f"CSV file '{csv_file}' has been created successfully.")

CSV file 'output.csv' has been created successfully.


I created a .csv file to store the original dataset for ease of use, since I am less familiar with .json files. The .csv file splits the data into three columns: questions, human answers, and ChatGPT answers.
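
One consequence of this step is that csv.writer converts each Python list with str(), so the answer columns hold strings such as "['...']" rather than lists. The sketch below shows one way to recover the lists with ast.literal_eval. The sample row is hypothetical, and an in-memory buffer stands in for output.csv.

```python
import ast
import csv
import io

# Hypothetical row with the same three columns as output.csv.
row = [
    "Why is the sky blue?",
    ["Rayleigh scattering.", "It isn't, at night."],
    ["Air scatters blue light more."],
]

# Write and read back through an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question", "human_answer", "chatgpt_answer"])
writer.writerow(row)

buffer.seek(0)
record = next(csv.DictReader(buffer))

# The list was stored as its string representation ...
stored = record["human_answer"]
# ... and ast.literal_eval turns that string back into a list
# without executing arbitrary code.
restored = ast.literal_eval(stored)
print(type(stored).__name__, restored)
```

With pandas, the same conversion could be applied column-wise, for example with df['human_answer'].apply(ast.literal_eval).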

Before I can use the dataset, I will first have to clean the questions, human answers, and ChatGPT answers. The methods I have researched so far are the BERT tokenizer from the transformers library, which normalizes text and splits it into subword tokens, and the NLTK library, which provides tokenizers and stopword lists that can be used to remove unwanted characters, punctuation, and stopwords. I will test both methods to see which one better fits my overall goal for the project.
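
As a rough illustration of the kind of cleaning described here, the following sketch lowercases text, strips punctuation, and drops stopwords using only the standard library. It is not the NLTK or BERT pipeline itself: the function name normalize and the tiny STOPWORDS set are assumptions made for the example, and NLTK's English stopword list is much longer.

```python
import string

# Tiny illustrative stopword list (an assumption); NLTK's English list is far longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    text = text.lower().translate(PUNCTUATION_TABLE)
    return [token for token in text.split() if token not in STOPWORDS]


print(normalize('Basically there are many categories of " Best Seller " .'))
```

Removing stopwords and punctuation may also discard stylistic cues that separate human from ChatGPT writing, so it seems worth comparing any cleaned variant against the raw text.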

In [36]:
import pandas as pd

# Load the original CSV file
csv_file = "output.csv"
df_original = pd.read_csv(csv_file)

# Keep rows that have a human answer; the ChatGPT answer from the same row is paired with it
df_human = df_original[df_original['human_answer'].notnull()].reset_index(drop=True)

# Build one labeled DataFrame per source: result 0 = human, 1 = ChatGPT
human_rows = pd.DataFrame({
    'question': df_human['question'],
    'answer': df_human['human_answer'],
    'result': 0,
})
chatgpt_rows = pd.DataFrame({
    'question': df_human['question'],
    'answer': df_human['chatgpt_answer'],
    'result': 1,
})

# Interleave the two DataFrames so each human answer is followed by the ChatGPT answer
# to the same question (a stable sort on the shared index keeps human before ChatGPT)
df_alternating = pd.concat([human_rows, chatgpt_rows]).sort_index(kind='mergesort')
df_alternating.reset_index(drop=True, inplace=True)

# Define the new CSV file path
new_csv_file = "cleaned_dataset.csv"

# Write the modified data to the new CSV file
df_alternating.to_csv(new_csv_file, index=False)

print(f"New CSV file '{new_csv_file}' has been created successfully.")

New CSV file 'cleaned_dataset.csv' has been created successfully.
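
Each answer cell in this file still holds a whole list of answers in string form. If one row per individual answer is preferred, a possible variant is sketched below. The sample frame and the helper name to_long_format are hypothetical; only the column names and the 0/1 labels come from the cells above.

```python
import ast

import pandas as pd

# Hypothetical two-question frame shaped like output.csv after pd.read_csv:
# each answer cell is the string form of a list.
df = pd.DataFrame({
    "question": ["q1", "q2"],
    "human_answer": ["['h1a', 'h1b']", "['h2']"],
    "chatgpt_answer": ["['c1']", "['c2']"],
})


def to_long_format(frame):
    """Return one row per individual answer with result 0 = human, 1 = ChatGPT."""
    parts = []
    for column, label in (("human_answer", 0), ("chatgpt_answer", 1)):
        part = pd.DataFrame({
            "question": frame["question"],
            # Parse the stored string back into a real list.
            "answer": frame[column].apply(ast.literal_eval),
            "result": label,
        })
        # explode() repeats the question and label once per list element.
        parts.append(part.explode("answer"))
    long_df = pd.concat(parts, ignore_index=True)
    # Empty answer lists explode to NaN, so drop those rows.
    return long_df.dropna(subset=["answer"]).reset_index(drop=True)


long_df = to_long_format(df)
print(long_df)
```

Unlike the alternating layout, this variant groups all human rows before all ChatGPT rows, so the rows would need shuffling before training.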


In [37]:
import pandas as pd
# load dataset
testdata = pd.read_csv("cleaned_dataset.csv")
testdata.head()

# result column 0 = human response, 1 = chatgpt response

index,question,answer,result
0,"Why is every book I hear about a "" NY Times # ...","['Basically there are many categories of "" Bes...",0
1,"Why is every book I hear about a "" NY Times # ...",['There are many different best seller lists t...,1
2,"If salt is so bad for cars , why do we use it ...",['salt is good for not dying in car crashes an...,0
3,"If salt is so bad for cars , why do we use it ...","[""Salt is used on roads to help melt ice and s...",1
4,Why do we still have SD TV channels when HD lo...,"[""The way it works is that old TV stations got...",0


In [26]:
import pandas as pd
# load dataset
data = pd.read_csv("output.csv")
data.head()

index,question,human_answer,chatgpt_answer
0,"Why is every book I hear about a "" NY Times # ...","['Basically there are many categories of "" Bes...",['There are many different best seller lists t...
1,"If salt is so bad for cars , why do we use it ...",['salt is good for not dying in car crashes an...,"[""Salt is used on roads to help melt ice and s..."
2,Why do we still have SD TV channels when HD lo...,"[""The way it works is that old TV stations got...","[""There are a few reasons why we still have SD..."
3,Why has nobody assassinated Kim Jong - un He i...,"[""You ca n't just go around assassinating the ...",['It is generally not acceptable or ethical to...
4,How was airplane technology able to advance so...,['Wanting to kill the shit out of Germans driv...,['After the Wright Brothers made the first pow...
