## Notebook Description

**Note:** To use this notebook, you should download the [OpenLLMText Dataset](https://zenodo.org/records/8285326). Alternatively, you can run the `utilities.download_dataset.py` script, which will automatically download a balanced dataset from Google Drive.

<br>

This notebook focuses on preparing and analyzing the **OpenLLMText** dataset. The dataset includes text entries generated by various models as well as human-written content. Specifically, it contains approximately 300,000 entries from five sources:

- **Human-written texts**: 60,000 entries sourced from the OpenWebText dataset, which is built from web pages linked in Reddit posts prior to 2019.
- **ChatGPT-generated texts**: 60,000 entries created by GPT-3.5 Turbo, where the AI rephrased human-written paragraphs.
- **LLaMA-7B-generated texts**: 60,000 entries generated by the LLaMA-7B model, using a similar rephrasing approach.
- **PaLM-generated texts**: 60,000 entries created by the Pathways Language Model (PaLM, text-bison-001).
- **GPT2-generated texts**: 60,000 entries adapted from the GPT2-output dataset (GPT2-XL).

### Dataset Adjustments

For this project, we focus on **human-written**, **ChatGPT-generated (GPT-3.5 Turbo)**, and **LLaMA-7B-generated** texts. PaLM-generated and GPT2-generated texts are excluded for the following reasons:
1. **PaLM**: This model will be deprecated by Google as of April 9, 2025, so its generated data is less relevant for current applications.
2. **GPT2**: Its outputs resemble human-written paragraphs less closely, and the model (released in 2019) is considered outdated.

### Dataset Proportions

To ensure a balanced dataset, we will organize the data so that:
- Each human-written paragraph is paired with one AI-generated rephrasing of the same paragraph, alternating between ChatGPT and LLaMA.
- The resulting dataset will consist of:
  - **50% human-written text**
  - **25% ChatGPT-generated text**
  - **25% LLaMA-generated text**

This structured setup ensures that each human-written entry has a corresponding AI-generated alternative, allowing consistent comparison and classification.
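As a quick sanity check on the target proportions, the share of each source can be read directly from the `source` column. The snippet below is a minimal sketch on a toy frame; `toy_df` is a hypothetical stand-in for the paired dataset built later in this notebook.

```python
import pandas as pd

# Toy stand-in for the paired dataset. The real frame has a 'source' column
# holding 'openweb', 'chatgpt' or 'llama'.
toy_df = pd.DataFrame({
    "source": ["openweb", "chatgpt", "openweb", "llama"] * 2,
})

# Fraction of rows per source; the target is 0.50 / 0.25 / 0.25.
source_shares = toy_df["source"].value_counts(normalize=True)
print(source_shares)
```

The same call on the real training frame should give values close to, but not necessarily exactly, 50/25/25.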


### Data Splitting and Saving

We will save **100% of the adjusted dataset**, ensuring all entries are retained for future use. Within this full dataset, we have determined the following rounded proportions for splitting the data:

- **75% for training**: Used to learn and estimate model parameters.
- **15% for validation**: Used to evaluate the model during development.
- **10% for testing**: Used to assess the model's predictive performance on unseen data.

In addition to saving the full dataset, we will create and save smaller **sample sets** to speed up the initial stages of model creation and experimentation. These subsets will have similar proportions (75% train, 15% valid, 10% test) and consist of:

- **50k entries**: 
  - **37,500 for training**
  - **7,500 for validation**
  - **5,000 for testing**
- **10k entries**:
  - **7,500 for training**
  - **1,500 for validation**
  - **1,000 for testing**
- **5k entries**:
  - **3,750 for training**
  - **750 for validation**
  - **500 for testing**
- **1k entries**:
  - **750 for training**
  - **150 for validation**
  - **100 for testing**

These smaller sets will be particularly useful for faster iterations and testing during the initial phases of the model-building process.
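The subset sizes listed above follow from applying the 75/15/10 split to each total. A minimal sketch of that arithmetic is shown here; `split_sizes` is a hypothetical helper and is not part of the notebook.

```python
def split_sizes(n_entries, train_pct=75, valid_pct=15):
    """Return (train, valid, test) sizes for a subset of n_entries rows."""
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_entries * train_pct // 100
    n_valid = n_entries * valid_pct // 100
    # The test split takes whatever remains, so the three always sum to n_entries.
    n_test = n_entries - n_train - n_valid
    return n_train, n_valid, n_test


for n in (50_000, 10_000, 5_000, 1_000):
    print(n, split_sizes(n))
```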


In [48]:
import os
import pandas as pd
import numpy as np

In [49]:
dataset_path = 'open_llm/'

human_path = 'Human'
chat_gpt_path = 'ChatGPT'
llama_path = 'LLaMA'

source_paths = [human_path, chat_gpt_path, llama_path]

train_path = 'train-dirty.jsonl'
valid_path = 'valid-dirty.jsonl'
test_path = 'test-dirty.jsonl'

In [50]:
def create_source_label_mapping(df: pd.DataFrame) -> pd.DataFrame:
    """
    Extract the source from the 'extra' column and map it to a binary label.

    Args:
        df (pd.DataFrame): DataFrame containing the 'extra' column, where each
            value is a dict with a 'source' key.

    Returns:
        pd.DataFrame: The same DataFrame (modified in place) with the additional
            columns 'source' and 'label'.
    """

    # Pull the source name out of the 'extra' dict
    df['source'] = df['extra'].apply(lambda row: row['source'])

    # 0 = human-written, 1 = AI-generated (ChatGPT or LLaMA)
    source_mapping = {'openweb': 0, 'chatgpt': 1, 'llama': 1}

    # Add a new column with the mapped values
    df['label'] = df['source'].map(source_mapping)

    return df

In [51]:
train_human_df = pd.read_json(os.path.join(dataset_path, human_path, train_path), lines=True)
valid_human_df = pd.read_json(os.path.join(dataset_path, human_path, valid_path), lines=True)
test_human_df = pd.read_json(os.path.join(dataset_path, human_path, test_path), lines=True)

train_human_df = create_source_label_mapping(train_human_df)
valid_human_df = create_source_label_mapping(valid_human_df)
test_human_df = create_source_label_mapping(test_human_df)

print(train_human_df.shape, valid_human_df.shape, test_human_df.shape)

train_human_df

(51205, 5) (10412, 5) (7367, 5)


Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset00]-[15],The dangers of Illinois as a ‘right to work’ s...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
1,[urlsf_subset00]-[83],Check current weather conditions\n\nIt’s going...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
2,[urlsf_subset00]-[89],"On Thursday, the president of the United State...","{'source': 'openweb', 'variant': 'original'}",openweb,0
3,[urlsf_subset00]-[176],"We know them from our garden, from damp cellar...","{'source': 'openweb', 'variant': 'original'}",openweb,0
4,[urlsf_subset00]-[209],Andros Townsend is confident of Tottenham’s ch...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
...,...,...,...,...,...
51200,[urlsf_subset05]-[389615],Indonesian tribesman Muhammad Yusuf believes h...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
51201,[urlsf_subset05]-[389630],An Al Nusrah Front convoy streams into Aleppo ...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
51202,[urlsf_subset05]-[389773],Water levels that have reached a historic low ...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
51203,[urlsf_subset05]-[389814],"Less than a month into his tenure, Donald Trum...","{'source': 'openweb', 'variant': 'original'}",openweb,0


In [52]:
train_chat_gpt_df = pd.read_json(os.path.join(dataset_path, chat_gpt_path, train_path), lines=True)
train_llama_df = pd.read_json(os.path.join(dataset_path, llama_path, train_path), lines=True)

train_ai_df = pd.concat([train_chat_gpt_df, train_llama_df], ignore_index=True)

train_ai_df = create_source_label_mapping(train_ai_df)

train_ai_df

Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset00]-[83],The National Weather Service’s Mike McFarland ...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
1,[urlsf_subset00]-[89],The President of the United States was seen on...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
2,[urlsf_subset00]-[390],Enner Valencia scored two goals in Ecuador's 2...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
3,[urlsf_subset00]-[457],"Beginning with the introduction, the author sh...","{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
4,[urlsf_subset00]-[458],Mexico has implemented its newest data retenti...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
...,...,...,...,...,...
101454,[urlsf_subset05]-[389353],With nearly 40 percent of all edible food in t...,"{'variant': 'original', 'source': 'llama'}",llama,1
101455,[urlsf_subset05]-[389356],Russian airstrikes were concentrated in areas ...,"{'variant': 'original', 'source': 'llama'}",llama,1
101456,[urlsf_subset05]-[389832],Georgia antagonists love to cite a well-worn q...,"{'variant': 'original', 'source': 'llama'}",llama,1
101457,[urlsf_subset05]-[389814],"Less than a month into his tenure, Donald Trum...","{'variant': 'original', 'source': 'llama'}",llama,1


In [53]:
valid_chat_gpt_df = pd.read_json(os.path.join(dataset_path, chat_gpt_path, valid_path), lines=True)
valid_llama_df = pd.read_json(os.path.join(dataset_path, llama_path, valid_path), lines=True)

valid_ai_df = pd.concat([valid_chat_gpt_df, valid_llama_df], ignore_index=True)

valid_ai_df = create_source_label_mapping(valid_ai_df)

valid_ai_df

Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset06]-[249981],Gaza's power plant will cease operations this ...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
1,[urlsf_subset06]-[164305],The interior ministers of EU member countries ...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
2,[urlsf_subset06]-[43195],"Mai, an 18-year-old female Malayan tiger rescu...","{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
3,[urlsf_subset06]-[202906],"In November, Jesus Huerta died from a gunshot ...","{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
4,[urlsf_subset06]-[17863],This paragraph serves as a warning to players ...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
...,...,...,...,...,...
19768,[urlsf_subset06]-[389944],The telecom industry rang in the new year to t...,"{'variant': 'original', 'source': 'llama'}",llama,1
19769,[urlsf_subset06]-[389984],If Trump‘s job was to punish every internet us...,"{'variant': 'original', 'source': 'llama'}",llama,1
19770,[urlsf_subset06]-[389986],This press release is available in Spanish.\n\...,"{'variant': 'original', 'source': 'llama'}",llama,1
19771,[urlsf_subset06]-[390305],Tymee Holds A Guerilla Performance\n\n[by Yanc...,"{'variant': 'original', 'source': 'llama'}",llama,1


In [54]:
test_chat_gpt_df = pd.read_json(os.path.join(dataset_path, chat_gpt_path, test_path), lines=True)
test_llama_df = pd.read_json(os.path.join(dataset_path, llama_path, test_path), lines=True)

test_ai_df = pd.concat([test_chat_gpt_df, test_llama_df], ignore_index=True)

test_ai_df = create_source_label_mapping(test_ai_df)

test_ai_df

Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset09]-[29],Jazzman acknowledges that it isn't possible to...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
1,[urlsf_subset09]-[61],"Some car makers such as Ferrari, Lamborghini, ...","{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
2,[urlsf_subset09]-[281],Evolve Politics managed to infiltrate the Cons...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
3,[urlsf_subset09]-[293],Wine and Alchemy is a world music band that ha...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
4,[urlsf_subset09]-[401],It is easy to become cynical about politics in...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
...,...,...,...,...,...
13967,[urlsf_subset09]-[389249],"July 21, 2016\n\nTodd Chretien of the Internat...","{'variant': 'original', 'source': 'llama'}",llama,1
13968,[urlsf_subset09]-[389288],When Dylan Higgins invited me on to the Field ...,"{'variant': 'original', 'source': 'llama'}",llama,1
13969,[urlsf_subset09]-[389386],John Kasich. AP Photo/John Minchillo\n\nJohn K...,"{'variant': 'original', 'source': 'llama'}",llama,1
13970,[urlsf_subset09]-[389484],A Navy-funded thermal engine bobbing off the c...,"{'variant': 'original', 'source': 'llama'}",llama,1


In [55]:
def create_balanced_pairs(human_df, ai_df):
    """
    Create a balanced dataset of human-written and AI-generated text pairs.

    For each human-written text there is a corresponding AI-generated text, either from ChatGPT or LLaMA.
    
    Args:
        human_df (pd.DataFrame): DataFrame containing human-written text data.
        ai_df (pd.DataFrame): DataFrame containing AI-generated text data.
    
    Returns:
        pd.DataFrame: A DataFrame containing equal rows of human and AI text pairs.
    """
    pairs = []

    chatgpt_df = ai_df[ai_df['source'] == 'chatgpt']
    llama_df = ai_df[ai_df['source'] == 'llama']

    chatgpt_df = chatgpt_df.drop_duplicates(subset=['uid', 'source'])
    llama_df = llama_df.drop_duplicates(subset=['uid', 'source'])

    # create index for human_df, chatgpt_df and llama_df on uid
    human_df = human_df.set_index('uid')
    chatgpt_df = chatgpt_df.set_index('uid')
    llama_df = llama_df.set_index('uid')

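    # Alternate between the AI models: True -> ChatGPT, False -> LLaMA.
    # The toggle flips only after a pair is found, so a human text with no match
    # for the current model is skipped.
    toggle_model = True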

    for _, human_row in human_df.iterrows():

        # get the index of the human row
        human_uid = human_row.name

        if toggle_model:
            if human_uid in chatgpt_df.index:
                matching_row = chatgpt_df.loc[human_uid] # get the matching AI row

                # Add both rows to the pairs list
                pairs.append(human_row)
                pairs.append(matching_row)

                toggle_model = not toggle_model
            
        else:
            if human_uid in llama_df.index:
                matching_row = llama_df.loc[human_uid] # get the matching AI row
                
                # Add both rows to the pairs list            
                pairs.append(human_row)
                pairs.append(matching_row)
                
                toggle_model = not toggle_model

    final_df = pd.DataFrame(pairs)

    final_df = final_df.reset_index()
    final_df = final_df.rename(columns={'index': 'uid'})
          
    return final_df
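To make the pairing behaviour easier to follow, here is a compact sketch of the same idea on toy data. It is not the notebook's function: `pair_by_uid` and the toy frames are hypothetical, and the sketch assumes the same `uid`, `text` and `source` columns.

```python
import pandas as pd


def pair_by_uid(human_df, ai_df, order=("chatgpt", "llama")):
    """Pair each human row with one AI row sharing its uid, alternating sources."""
    # One lookup table per AI source, indexed by uid (first duplicate wins).
    ai_by_source = {
        source: group.drop_duplicates("uid").set_index("uid")
        for source, group in ai_df.groupby("source")
    }
    rows, turn = [], 0
    for human in human_df.to_dict("records"):
        candidates = ai_by_source.get(order[turn % len(order)])
        if candidates is None or human["uid"] not in candidates.index:
            continue  # no match for the source whose turn it is: skip this text
        ai = candidates.loc[human["uid"]].to_dict()
        ai["uid"] = human["uid"]
        rows.extend([human, ai])
        turn += 1  # switch source only after a successful pair
    return pd.DataFrame(rows)


human_df = pd.DataFrame({
    "uid": ["a", "b", "c"],
    "text": ["human a", "human b", "human c"],
    "source": "openweb",
})
ai_df = pd.DataFrame({
    "uid": ["a", "b", "b", "c"],
    "text": ["gpt a", "gpt b", "llama b", "llama c"],
    "source": ["chatgpt", "chatgpt", "llama", "llama"],
})

paired = pair_by_uid(human_df, ai_df)
print(paired)
```

In this toy run, `c` is dropped because it is ChatGPT's turn and only a LLaMA rephrasing exists. The same skip rule appears to explain why the paired splits are slightly smaller than the human-only splits.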

In [56]:
def print_sources_stats(df):
    """
    Print:
        - the number of records in the DataFrame.
        - the number of pairs in the DataFrame.
        - the number of rows for each source in the DataFrame.
    
    Args:
        df (pd.DataFrame): DataFrame containing the 'source' column.
    """
    print(f"Number of records: {len(df)}")
    print(f"Number of pairs: {len(df) // 2}")
    print(df.value_counts('source'))

In [57]:
train_df = create_balanced_pairs(train_human_df, train_ai_df)

print_sources_stats(train_df)

train_df

Number of records: 101290
Number of pairs: 50645
source
openweb    50645
chatgpt    25323
llama      25322
Name: count, dtype: int64


Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset00]-[15],The dangers of Illinois as a ‘right to work’ s...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
1,[urlsf_subset00]-[15],"The governor of Illinois, Gov. Rauner, has req...","{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
2,[urlsf_subset00]-[83],Check current weather conditions\n\nIt’s going...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
3,[urlsf_subset00]-[83],Check current weather conditions It’s going to...,"{'variant': 'original', 'source': 'llama'}",llama,1
4,[urlsf_subset00]-[89],"On Thursday, the president of the United State...","{'source': 'openweb', 'variant': 'original'}",openweb,0
...,...,...,...,...,...
101285,[urlsf_subset05]-[389773],Halifax Water has imposed mandatory water cons...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
101286,[urlsf_subset05]-[389814],"Less than a month into his tenure, Donald Trum...","{'source': 'openweb', 'variant': 'original'}",openweb,0
101287,[urlsf_subset05]-[389814],"Less than a month into his tenure, Donald Trum...","{'variant': 'original', 'source': 'llama'}",llama,1
101288,[urlsf_subset05]-[389832],Georgia antagonists love to cite a well-worn q...,"{'source': 'openweb', 'variant': 'original'}",openweb,0


In [58]:
valid_df = create_balanced_pairs(valid_human_df, valid_ai_df)

print_sources_stats(valid_df)

valid_df

Number of records: 19650
Number of pairs: 9825
source
openweb    9825
chatgpt    4913
llama      4912
Name: count, dtype: int64


Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset06]-[66],"Wednesday, April 6th, 2016\n\n""It is shameful ...","{'source': 'openweb', 'variant': 'original'}",openweb,0
1,[urlsf_subset06]-[66],"A couple in Los Angeles, Willie and Phil Jones...","{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
2,[urlsf_subset06]-[79],SAN FRANCISCO (BCN)— A civil lawsuit filed Wed...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
3,[urlsf_subset06]-[79],SAN FRANCISCO (BCN)— A civil lawsuit filed Wed...,"{'variant': 'original', 'source': 'llama'}",llama,1
4,[urlsf_subset06]-[115],Automated wheel changer. Image: Rio Tinto.\n\n...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
...,...,...,...,...,...
19645,[urlsf_subset06]-[390176],Diego Maradona has paid tribute to the late Al...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
19646,[urlsf_subset06]-[390305],Tymee Holds A Guerilla Performance\n\n[by Yanc...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
19647,[urlsf_subset06]-[390305],Tymee Holds A Guerilla Performance\n\n[by Yanc...,"{'variant': 'original', 'source': 'llama'}",llama,1
19648,[urlsf_subset06]-[390316],South Korea President Moon Jae-in requested a ...,"{'source': 'openweb', 'variant': 'original'}",openweb,0


In [59]:
test_df = create_balanced_pairs(test_human_df, test_ai_df)

print_sources_stats(test_df)

test_df

Number of records: 13952
Number of pairs: 6976
source
openweb    6976
chatgpt    3488
llama      3488
Name: count, dtype: int64


Unnamed: 0,uid,text,extra,source,label
0,[urlsf_subset09]-[13],Media playback is unsupported on your device M...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
1,[urlsf_subset09]-[13],Officials have confirmed that the toxic red sl...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
2,[urlsf_subset09]-[29],Jazzman said: I understand you can't ban spraw...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
3,[urlsf_subset09]-[29],Jazzman said: I understand you can't ban spraw...,"{'variant': 'original', 'source': 'llama'}",llama,1
4,[urlsf_subset09]-[36],The Egyptian parliament witnessed a new disput...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
...,...,...,...,...,...
13947,[urlsf_subset09]-[389512],"A letter from the federal Bureau of Alcohol, T...","{'variant': 'original', 'source': 'llama'}",llama,1
13948,[urlsf_subset09]-[389538],A U.S. Army Black Hawk helicopter is seen. (Ph...,"{'source': 'openweb', 'variant': 'original'}",openweb,0
13949,[urlsf_subset09]-[389538],A Black Hawk helicopter belonging to the U.S. ...,"{'source': 'chatgpt', 'variant': 'original'}",chatgpt,1
13950,[urlsf_subset09]-[389611],Letter from an immigration official at Heathro...,"{'source': 'openweb', 'variant': 'original'}",openweb,0


In [60]:
# Find rows whose text is shorter than 200 characters
short_rows = train_df[train_df['text'].str.len() < 200]

# Collect the UIDs of those rows so that both members of each pair are dropped
uids_to_remove = short_rows['uid'].unique()

# Remove every row (human and AI) that shares one of these UIDs
rows_before = len(train_df)
train_df = train_df[~train_df['uid'].isin(uids_to_remove)]

# Report the number of rows actually removed
removed_rows_count = rows_before - len(train_df)
print(f"Removed {removed_rows_count} rows with text shorter than 200 characters.")


Removed 6 rows with text shorter than 200 characters.
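The cell above filters only `train_df`. If the same rule were ever wanted for the validation and test splits, it could be wrapped in a small helper. This is a sketch under that assumption, not something the notebook does; `drop_short_pairs` and `toy_df` are hypothetical names.

```python
import pandas as pd


def drop_short_pairs(df, min_chars=200):
    """Drop every row whose uid has at least one text shorter than min_chars."""
    short_uids = df.loc[df["text"].str.len() < min_chars, "uid"].unique()
    kept = df[~df["uid"].isin(short_uids)].reset_index(drop=True)
    # Return the filtered frame together with the number of rows removed.
    return kept, len(df) - len(kept)


toy_df = pd.DataFrame({
    "uid": ["a", "a", "b", "b"],
    "text": ["x" * 250, "x" * 50, "x" * 300, "x" * 220],
})

kept_df, n_removed = drop_short_pairs(toy_df)
print(n_removed)
```

Removing by `uid` rather than by row keeps the human/AI pairs intact, which preserves the 50/50 balance between the two labels.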


## Saving the whole dataset and its sample sets

In [61]:
n_samples = len(train_df) + len(valid_df) + len(test_df)

train_proportion = len(train_df) / n_samples
valid_proportion = len(valid_df) / n_samples
test_proportion = len(test_df) / n_samples

print(f"Number of samples: {n_samples}")
print(f"Train proportion: {train_proportion}")
print(f"Valid proportion: {valid_proportion}")
print(f"Test proportion: {test_proportion}")

Number of samples: 134886
Train proportion: 0.7508859333066441
Valid proportion: 0.14567857301721454
Test proportion: 0.10343549367614134


In [62]:
# round the proportions to 2 decimal places
train_proportion_rounded = round(train_proportion, 2)
valid_proportion_rounded = round(valid_proportion, 2)
test_proportion_rounded = round(test_proportion, 2)

print(f"Train proportion rounded: {train_proportion_rounded}")
print(f"Valid proportion rounded: {valid_proportion_rounded}")
print(f"Test proportion rounded: {test_proportion_rounded}")

Train proportion rounded: 0.75
Valid proportion rounded: 0.15
Test proportion rounded: 0.1
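One way the smaller sample sets could be drawn is to sample whole `uid` pairs, so that every human text stays with its AI rephrasing, and then serialize the result as JSON Lines. The snippet below is a sketch of that approach on toy data. `sample_pairs`, the seed, and the toy frame are assumptions and not the notebook's actual saving code.

```python
import pandas as pd


def sample_pairs(df, n_rows, seed=42):
    """Return about n_rows rows by sampling whole uid pairs."""
    uids = df["uid"].drop_duplicates()
    # Each uid contributes two rows (human + AI), so sample n_rows // 2 uids.
    n_pairs = min(n_rows // 2, len(uids))
    chosen = uids.sample(n=n_pairs, random_state=seed)
    return df[df["uid"].isin(chosen)].reset_index(drop=True)


# Toy frame: 10 pairs, one human row (label 0) and one AI row (label 1) each.
toy_uids = [f"u{i}" for i in range(10)]
toy_df = pd.DataFrame({
    "uid": [uid for uid in toy_uids for _ in range(2)],
    "text": ["some text"] * 20,
    "label": [0, 1] * 10,
})

sample = sample_pairs(toy_df, n_rows=8)

# With no path argument, to_json returns the JSON Lines text as a string;
# passing a file path as the first argument would write it to disk instead.
jsonl = sample.to_json(orient="records", lines=True)
print(len(sample))
```

For the real data, the row counts listed in the notebook description (for example 37,500 / 7,500 / 5,000) would be passed as `n_rows` for the train, validation and test frames respectively.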


### Save Data

The full dataset can be accessed here:  

[Access Dataset on Google Drive](https://drive.google.com/drive/folders/1vi1lA_t_lQMff_esGKhDAOUju-IDdOw0?usp=sharing)
