aih-lab/ai_detection

AI Style Analysis

Data and code repository for reproducing the experiments from the AI Style project (WashU AI Humanities Lab). DOI: https://doi.org/10.5281/zenodo.15587211

Citation

This repository contains code and data that accompany the article:

Kirilloff, G., Carroll, C., Daboul, Z., Frank, A., Khan, R., Hinrichs-Morrow, M., & Weingart, R. (2025). "'Written in the Style of': ChatGPT and the Literary Canon." Harvard Data Science Review (HDSR). Published Aug 05, 2025. Available: https://doi.org/10.1162/99608f92.6d5fb5ef

If you use the repository code or data, cite this repository; if you use the article or its findings, cite the HDSR article.

What this repo contains

  • classifier_data_code/ — scripts and notebooks used to generate feature matrices and run classification experiments to distinguish authentic (human) and synthetic text. Key scripts:
    • classification_code/random_forest_classifier.py — repeated-sampling random-forest experiments (CONFIG-driven).
    • classification_code/k_fold_author_validation_classifier.py — author holdout / k-fold-style experiments.
  • feature_data_code/ — notebooks for building feature matrices.
  • data/ — CSV inputs and outputs generated by the experiments (not all files are tracked).
  • api_code/ — scripts for interacting with the OpenAI GPT API (prompt runners, helper functions, and example calls). This folder contains prompt_runner.py, which demonstrates how to send prompts to the GPT API, receive responses, handle simple batching, and save outputs.
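As a rough illustration of the prompt-runner pattern described above (the actual interface of prompt_runner.py may differ; the function names, batch size, and model name here are hypothetical), a minimal sketch might look like:

```python
def batch_prompts(prompts, batch_size):
    """Split a list of prompts into fixed-size batches."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

def run_prompts(prompts, model="gpt-4o-mini", batch_size=5):
    """Send prompts to the OpenAI chat API and collect the text responses.

    Requires `pip install openai` and the OPENAI_API_KEY environment variable.
    """
    from openai import OpenAI
    client = OpenAI()
    responses = []
    for batch in batch_prompts(prompts, batch_size):
        for prompt in batch:
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            responses.append(completion.choices[0].message.content)
    return responses
```

Batching here only groups requests for bookkeeping (e.g. checkpointing outputs per batch); the OpenAI chat endpoint itself is called once per prompt.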

Python version & environment

  • Recommended Python: 3.10 or 3.11. (The code uses modern pandas and scikit-learn APIs; Python 3.8 may work but 3.10+ is recommended.)

  • Install required Python packages from requirements.txt in this repository.

  • If you plan to use the api_code/ folder to call the GPT API, ensure the OpenAI client (or your chosen API client) is installed.


Quick setup (recommended in a virtual environment):

```bash
# create & activate a venv (macOS/Linux)
python3 -m venv .venv
source .venv/bin/activate

# install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# install OpenAI client and optional dotenv support
pip install openai python-dotenv

# Set your OpenAI API key
export OPENAI_API_KEY="sk-<your-key-here>"
```

System requirements (CPU/GPU)

  • GPU is not required. The code uses scikit-learn's RandomForest implementation, which runs on CPU; the scripts will run on a regular laptop or server without CUDA/OpenCL.

  • For faster runs, prefer a multi-core CPU (4+ cores). Note: earlier edits temporarily set RandomForest training to use all cores (n_jobs=-1) but those changes have been reverted per project owner request; the scripts retain their original behavior.

  • Network & API access: the api_code/ examples call the OpenAI GPT API and therefore require a valid OpenAI account with an API key; API usage may incur charges.
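To illustrate the CPU-only training setup described above, the following is a minimal sketch using scikit-learn's RandomForestClassifier on a synthetic dataset (the hyperparameter values are illustrative, not the repository's CONFIG):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset standing in for the feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs=1 matches the single-core default behavior noted above;
# n_jobs=-1 would parallelize across all cores, but the repo scripts do not set it.
clf = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

No GPU-specific code path exists in scikit-learn's random forests, so the same script runs unchanged on a laptop or a server.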

Run duration

Estimated run durations depend on dataset size, number of features, CONFIG['n_runs'], and CONFIG['n_estimators'].

  • Small run (single author or a few hundred rows, n_runs=1, n_estimators=100): typically completes in seconds to a couple of minutes on a modern laptop (4+ cores).
  • Full default run (repository defaults such as n_runs=100 across many authors): can take tens of minutes to multiple hours on a single machine, depending on dataset size and CPU.

Expect runtime to scale roughly linearly with n_runs.
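Because runtime scales roughly linearly with n_runs, a back-of-envelope estimate can be extrapolated from a single timed run (the numbers below are illustrative, not measured):

```python
def estimate_total_minutes(seconds_per_run, n_runs):
    """Extrapolate total runtime from one timed run, assuming linear scaling."""
    return seconds_per_run * n_runs / 60

# e.g. if one run takes 90 seconds, 100 runs is roughly 150 minutes
print(estimate_total_minutes(90, 100))
```

Time one run with n_runs=1 first, then scale up to the full CONFIG['n_runs'] value.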

Reproducibility & requirements

  • A requirements.txt file is included in the repo root listing the Python packages used by the scripts and notebooks.

How to run the main scripts

  • Edit classifier_data_code/classification_code/random_forest_classifier.py or k_fold_author_validation_classifier.py to adjust CONFIG (paths, n_runs, sample_size_per_category, etc.).

  • Ensure the input CSV (data/master_feature_matrix.csv) is present and formatted with the expected columns (the scripts expect columns such as id, author, model, category, and the features listed in feature_cols).

  • Run:

```bash
python classifier_data_code/classification_code/random_forest_classifier.py
```

Results are written to the data/ directory by default (see CONFIG['output_dir']).
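Before launching a long run, it can help to verify that the input CSV has the columns the scripts expect. The column names below follow the description above; the feature_cols argument is a placeholder for the actual feature list defined in the scripts:

```python
import pandas as pd

# Columns the classifier scripts are described as expecting
REQUIRED = {"id", "author", "model", "category"}

def check_columns(csv_path, feature_cols=()):
    """Raise ValueError if the feature matrix is missing any expected column."""
    # nrows=0 reads only the header, so this is cheap even for large files
    cols = set(pd.read_csv(csv_path, nrows=0).columns)
    missing = (REQUIRED | set(feature_cols)) - cols
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return True
```

Running this against data/master_feature_matrix.csv before editing CONFIG catches path and schema mistakes early.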

License

Notes on licensing:

  • The HDSR article is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license (see the article page). This applies to the article text and supplemental material where indicated by the publishers.
  • The code in this repository is licensed under the MIT License (see the LICENSE file). Data included in data/ may carry licensing or citation restrictions; check the file-level headers and the article's supplemental information where applicable.

Please respect the article's CC BY 4.0 terms when reusing article text or figures: attribution is required.

About

An expansion of ai-style, exploring its model's AI detection capabilities.
