# Benchmark of existing approaches for detecting machine-generated text

## Contents 
1. [Solaiman](#Solaiman)
    1. [Install dependencies](#Install-dependencies)
    1. [Solaiman Code](#Solaiman-Code)
2. [Data Exploration](#Data-exploration)
3. [Evaluation](#Evaluation)


## Solaiman

_Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford,
Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, and Jasmine Wang. 2019. Release strategies and the social impacts of language models._

### Install dependencies
Dependencies from: [detecting-fake-text/requirements.txt](https://github.com/HendrikStrobelt/detecting-fake-text/blob/master/requirements.txt).

### Solaiman Code
Source code: [https://github.com/openai/gpt-2-output-dataset/tree/master/detector](https://github.com/openai/gpt-2-output-dataset/tree/master/detector)

Logic extracted from `server.py`.

In [11]:
import transformers
assert transformers.__version__ == '2.9.1', "Use transformers 2.9.1. It is available in the conda environment transf291"

In [12]:
from transformers import RobertaForSequenceClassification, RobertaTokenizer
import json
import torch
from urllib.parse import urlparse, unquote

#### Select the model to use: RoBERTa base or large

In [231]:
model_name = 'roberta-large'
#model_name = 'roberta-base'

#### Define model, tokenizer, and basic functions

In [232]:
model = RobertaForSequenceClassification.from_pretrained(model_name)
tokenizer = RobertaTokenizer.from_pretrained(model_name)
device='cuda' if torch.cuda.is_available() else 'cpu'

In [96]:
def evaluate(query):
    tokens = tokenizer.encode(query)
    all_tokens = len(tokens)
    tokens = tokens[:tokenizer.max_len - 2]
    used_tokens = len(tokens)
    tokens = torch.tensor([tokenizer.bos_token_id] + tokens + [tokenizer.eos_token_id]).unsqueeze(0)
    mask = torch.ones_like(tokens)

    with torch.no_grad():
        logits = model(tokens.to(device), attention_mask=mask.to(device))[0]
        probs = logits.softmax(dim=-1)

    fake, real = probs.detach().cpu().flatten().numpy().tolist()

    # Original return value:
    #     return json.dumps(dict(
    #         all_tokens=all_tokens,
    #         used_tokens=used_tokens,
    #         real_probability=real,
    #         fake_probability=fake
    #     ))

    # Changed to return a tuple (is_machine_generated, fake_probability, real_probability).
    # The first element is True if the text is likely machine-generated.
    return (fake > 0.5, fake, real)

def initialize(checkpoint):
    # Disabled download step for gs:// checkpoints (kept from the original code):
    #     if checkpoint.startswith('gs://'):
    #         print(f'Downloading {checkpoint}', file=sys.stderr)
    #         subprocess.check_output(['gsutil', 'cp', checkpoint, '.'])
    #         checkpoint = os.path.basename(checkpoint)
    #         assert os.path.isfile(checkpoint)

    print(f'Loading checkpoint from {checkpoint}')
    data = torch.load(checkpoint, map_location='cpu')
    model.load_state_dict(data['model_state_dict'])
    model.to(device)  # evaluate() sends its inputs to `device`, so the model must be there too
    model.eval()

#### Download the fine-tuned model and load it into RoBERTa

In [16]:
# !wget https://openaipublic.azureedge.net/gpt-2/detector-models/v1/detector-base.pt
# !wget https://openaipublic.azureedge.net/gpt-2/detector-models/v1/detector-large.pt

In [233]:
initialize('detector-large.pt')
# initialize('detector-base.pt')

Loading checkpoint from detector-large.pt


In [234]:
evaluate("hello world")

(True, 0.7026421427726746, 0.29735779762268066)
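
The tuple returned above is `(is_machine_generated, fake_probability, real_probability)`. As a minimal sketch of the last steps of `evaluate()`, the snippet below applies a two-way softmax to a pair of logits and thresholds the first probability at 0.5. It uses only the standard library; `classify_from_logits` and the example logits are hypothetical and not part of the original detector code.

```python
import math

def classify_from_logits(fake_logit, real_logit, threshold=0.5):
    """Hypothetical helper mirroring the end of evaluate():
    softmax over [fake, real] logits, then threshold the fake probability."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(fake_logit, real_logit)
    exp_fake = math.exp(fake_logit - m)
    exp_real = math.exp(real_logit - m)
    total = exp_fake + exp_real
    fake, real = exp_fake / total, exp_real / total
    return (fake > threshold, fake, real)

# Made-up logits, for illustration only.
is_generated, fake, real = classify_from_logits(0.43, -0.43)
print(is_generated, round(fake, 3), round(real, 3))
```

Because the two probabilities sum to 1, thresholding `fake` at 0.5 is the same as picking the class with the larger logit.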

## Data exploration

In [235]:
import json
import pandas as pd
import os
import time

In [20]:
data_path = "./data"
datasets = sorted([f for f in os.listdir(data_path) if os.path.isfile(os.path.join(data_path, f))])

In [21]:
datasets

['large-762M-k40.test.jsonl',
 'large-762M-k40.train.jsonl',
 'large-762M-k40.valid.jsonl',
 'large-762M.test.jsonl',
 'large-762M.train.jsonl',
 'large-762M.valid.jsonl',
 'medium-345M-k40.test.jsonl',
 'medium-345M-k40.train.jsonl',
 'medium-345M-k40.valid.jsonl',
 'medium-345M.test.jsonl',
 'medium-345M.train.jsonl',
 'medium-345M.valid.jsonl',
 'small-117M-k40.test.jsonl',
 'small-117M-k40.train.jsonl',
 'small-117M-k40.valid.jsonl',
 'small-117M.test.jsonl',
 'small-117M.train.jsonl',
 'small-117M.valid.jsonl',
 'webtext.test.jsonl',
 'webtext.train.jsonl',
 'webtext.valid.jsonl',
 'xl-1542M-k40.test.jsonl',
 'xl-1542M-k40.train.jsonl',
 'xl-1542M-k40.valid.jsonl',
 'xl-1542M.test.jsonl',
 'xl-1542M.train.jsonl',
 'xl-1542M.valid.jsonl']

### Human-written data

In [174]:
ds_hw_filename = 'webtext.train.jsonl'
ds_hw = pd.read_json(os.path.join(data_path, ds_hw_filename), lines = True)        

In [178]:
start = time.time()
payload = evaluate(ds_hw.iloc[20].text)
end = time.time()
print("{:.2f} seconds for a check with the RoBERTa detector".format(end - start))
print(f"machine generated? {payload}")

Token indices sequence length is longer than the specified maximum sequence length for this model (1001 > 512). Running this sequence through the model will result in indexing errors


0.48 seconds for a check with the RoBERTa detector
machine generated? (False, 0.0001773454569047317, 0.9998226761817932)
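
The tokenizer warning is expected: this sample encodes to 1001 tokens, and `evaluate()` keeps only the first `tokenizer.max_len - 2` of them before adding the start and end tokens. A minimal sketch of that truncation on plain lists of token ids follows. The function name is hypothetical, and the defaults (`max_len=512`, `bos_id=0`, `eos_id=2`) are assumptions standing in for the tokenizer's own attributes.

```python
def truncate_with_special_tokens(token_ids, max_len=512, bos_id=0, eos_id=2):
    """Keep at most max_len - 2 ids, then wrap them in start/end ids.

    Returns the wrapped ids, the original count, and the count actually used.
    """
    body = token_ids[:max_len - 2]
    return [bos_id] + body + [eos_id], len(token_ids), len(body)

ids = list(range(100, 1101))  # 1001 made-up token ids
wrapped, all_tokens, used_tokens = truncate_with_special_tokens(ids)
print(all_tokens, used_tokens, len(wrapped))  # 1001 510 512
```

Only the first 510 tokens reach the model, so long texts are judged on their opening portion.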


### Machine-generated data

In [103]:
ds_mg_filename = 'small-117M.train.jsonl'
ds_mg = pd.read_json(os.path.join(data_path, ds_mg_filename), lines = True)        

In [105]:
start = time.time()
payload = evaluate(ds_mg.iloc[20].text)
end = time.time()
print("{:.2f} seconds for a check with the RoBERTa detector".format(end - start))
print(f"machine generated? {payload}")

Token indices sequence length is longer than the specified maximum sequence length for this model (1024 > 512). Running this sequence through the model will result in indexing errors


0.46 seconds for a check with the RoBERTa detector
machine generated? (True, 0.9997071623802185, 0.0002927991736214608)
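
Each timing above comes from a single call, so it may include one-off effects such as warm-up. A minimal sketch of a steadier measurement, assuming the call can safely be repeated, takes the median over several runs. `median_seconds` is a hypothetical helper; in the notebook it could wrap `evaluate`.

```python
import statistics
import time

def median_seconds(fn, *args, repeats=5):
    """Call fn(*args) `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; replace with e.g. median_seconds(evaluate, some_text).
elapsed = median_seconds(sum, range(10_000))
print("{:.6f} seconds (median of 5 runs)".format(elapsed))
```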


## Evaluation
### Auxiliary functions

In [378]:
import numpy as np
from sklearn.utils import shuffle

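def create_dataset(raw_data, is_machine_generated=True):
    """Build a DataFrame with texts in `X` and integer labels in `y` (1 = machine-generated, 0 = human-written)."""
    X = np.array(raw_data)
    # Use explicit integer labels; ones_like/zeros_like would inherit the dtype of the text array.
    label = 1 if is_machine_generated else 0
    y = np.full(len(X), label, dtype=int)
    return pd.DataFrame(data={'X': X, 'y': y})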

def concat_and_shuffle(datasets):
    return shuffle(pd.concat(datasets)).reset_index()

In [236]:
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

def safe_macro_f1(y, y_pred):
    """
    Macro-averaged F1, forcing `sklearn` to report as a multiclass
    problem even when there are just two classes. `y` is the list of
    gold labels and `y_pred` is the list of predicted labels.

    """
    return f1_score(y, y_pred, average='macro', pos_label=None)

def safe_accuracy(y, y_pred):
    return accuracy_score(y, y_pred, normalize=True)

def gtc_evaluate(
        dataset,
        model,
        score_func=safe_accuracy):
    """Run `model.predict` on `dataset['X']` and score the predictions if `dataset` has labels in `y`."""
    preds = model.predict(dataset['X'])

    # Score only if we have labels; otherwise both metrics stay None.
    confusion, score = None, None
    if 'y' in dataset:
        y = dataset['y'].tolist()
        confusion = confusion_matrix(y, preds)
        score = score_func(y, preds)

    # Return the overall scores and other experimental info:
    return {
        'model': model,
        'predictions': preds,
        'confusion_matrix': confusion,
        'score': score}


In [187]:
class Wrapper:
    def __init__(self):
        pass

    def predict(self, X):
        y_pred = []
        for x in X:
            (is_generated, fake, real) = evaluate(x)
            y_pred.append(1 if is_generated else 0)      
        return y_pred

In [386]:
solaiman_detector = Wrapper()

### Playing around

In [393]:
mg_pd = create_dataset(ds_mg["text"], is_machine_generated=True)
hw_pd = create_dataset(ds_hw["text"], is_machine_generated=False)
shuffled = concat_and_shuffle([mg_pd, hw_pd])

In [396]:
shuffled

| | index | X | y |
|---|---|---|---|
| 0 | 122751 | Humane… also safe.\n\nIt could also be that ou... | 1 |
| 1 | 135322 | Morgeous Vikomin douga and other controversial... | 1 |
| 2 | 56086 | The world of Seward's ultimate service is only... | 1 |
| 3 | 3965 | Shooing, neck hair, and sleeves are delicate a... | 1 |
| 4 | 53634 | PC until better days.\n\n[BET]\n\nOkay, fine. ... | 1 |
| ... | ... | ... | ... |
| 499995 | 103033 | \nThe numbers, however, suggest that the New Y... | 1 |
| 499996 | 56157 | Firefly Space Systems is part of a new wave of... | 0 |
| 499997 | 220473 | The public cost of cleaning up the 2011 Fukush... | 0 |
| 499998 | 215658 | Tourists don't even take a passing glance at t... | 1 |


Test whether `gtc_evaluate()` works as expected. By default, it scores predictions with accuracy.

In [387]:
sample_df = shuffled.sample(n=10, random_state=1)

In [397]:
gtc_evaluate(sample_df, solaiman_detector, score_func=safe_accuracy)

Token indices sequence length is longer than the specified maximum sequence length for this model (1024 > 512). Running this sequence through the model will result in indexing errors

{'model': <__main__.Wrapper at 0x7f9b8fc63520>,
 'predictions': [0, 0, 1, 0, 1, 0, 1, 1, 1, 1],
 'confusion_matrix': array([[4, 0],
        [0, 6]]),
 'score': 1.0}

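The tokenizer warns that these inputs exceed the model's limit of 512 tokens, so `evaluate()` truncates them. Let's check how the detector behaves on short inputs by keeping only texts shorter than 512 characters. Note that this filters on character count rather than token count; for mostly ASCII text it is a conservative proxy, since each byte-level BPE token covers at least one byte.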

In [390]:
short_hw_ds = ds_hw[(ds_hw.text.str.len() < 512)].reset_index()
short_mg_ds = ds_mg[(ds_mg.text.str.len() < 512)].reset_index()
hw_short_pd = create_dataset(short_hw_ds["text"], is_machine_generated=False)
mg_short_pd = create_dataset(short_mg_ds["text"], is_machine_generated=True)
shuffled_short = concat_and_shuffle([mg_short_pd, hw_short_pd])
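
The filter above measures length in characters. If the goal is to keep every text that fits the model without truncation, the token count can be checked directly. The sketch below is one way to do that; `keep_within_token_limit` is a hypothetical helper, and `count_tokens` is injected so the example runs without the model. In the notebook, something like `lambda t: len(tokenizer.encode(t, add_special_tokens=False))` could be passed instead, assuming the installed `transformers` version accepts that argument.

```python
def keep_within_token_limit(texts, count_tokens, max_tokens=510):
    """Return the texts whose token count is at most max_tokens.

    The default of 510 assumes a 512-token model limit minus start and end tokens.
    """
    return [t for t in texts if count_tokens(t) <= max_tokens]

# Whitespace splitting is only a stand-in for the real tokenizer.
def whitespace_count(text):
    return len(text.split())

texts = ["short text", "word " * 600, "another short one"]
kept = keep_within_token_limit(texts, whitespace_count)
print(len(kept))  # 2
```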

In [391]:
sample_short_df = shuffled_short.sample(n=10, random_state=1)

In [392]:
gtc_evaluate(sample_short_df, solaiman_detector)

{'model': <__main__.Wrapper at 0x7f9b8fc63520>,
 'predictions': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'confusion_matrix': array([[3, 2],
        [0, 5]]),
 'score': 0.8}
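
The score of 0.8 can be recomputed from the confusion matrix, and the same counts give the macro-averaged F1 that `safe_macro_f1` would report. The sketch below does the arithmetic in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predicted labels); the function name is hypothetical.

```python
def scores_from_confusion(cm):
    """Accuracy and macro F1 from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    f1_per_class = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, true label differs
        fn = sum(cm[i]) - tp                       # true i, predicted otherwise
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_per_class) / n

accuracy, macro_f1 = scores_from_confusion([[3, 2], [0, 5]])
print(accuracy, round(macro_f1, 4))  # 0.8 0.7917
```

In this sample, two human-written texts were flagged as machine-generated, which lowers recall for class 0 and precision for class 1.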