aih-lab/ai_detection

AI Style Analysis

Data and code repository for reproducing the experiments from the AI Style project (WashU AI Humanities Lab). DOI: https://doi.org/10.5281/zenodo.15587211

Citation

This repository contains code and data that accompany the article:

Kirilloff, G., Carroll, C., Daboul, Z., Frank, A., Khan, R., Hinrichs-Morrow, M., & Weingart, R. (2025). "'Written in the Style of': ChatGPT and the Literary Canon." Harvard Data Science Review (HDSR). Published Aug 05, 2025. Available: https://doi.org/10.1162/99608f92.6d5fb5ef

If you use the repository code or data, cite this repository; if you use the article or its findings, cite the HDSR article.

What this repo contains

  • classifier_data_code/ — scripts and notebooks used to generate feature matrices and run classification experiments to distinguish authentic (human) and synthetic text. Key scripts:
    • classification_code/random_forest_classifier.py — repeated-sampling random-forest experiments (CONFIG-driven).
    • classification_code/k_fold_author_validation_classifier.py — author holdout / k-fold-style experiments.
  • feature_data_code/ — notebooks for building feature matrices.
  • data/ — CSV inputs and outputs generated by the experiments (not all files are tracked).
  • api_code/ — scripts for interacting with the OpenAI GPT API (prompt runners, helper functions, and example calls). This folder contains prompt_runner.py, which demonstrates how to send prompts to the GPT API, receive responses, handle simple batching, and save outputs.
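As a rough illustration of the prompt-runner pattern described above (the actual interface of prompt_runner.py may differ; the function names, batch size, and model name here are hypothetical), a minimal sketch might look like:

```python
def batch_prompts(prompts, batch_size):
    """Split a list of prompts into fixed-size batches."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

def run_prompts(prompts, model="gpt-4o-mini", batch_size=5):
    """Send prompts to the OpenAI chat API and collect the text responses.

    Requires `pip install openai` and the OPENAI_API_KEY environment variable.
    """
    from openai import OpenAI
    client = OpenAI()
    responses = []
    for batch in batch_prompts(prompts, batch_size):
        for prompt in batch:
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            responses.append(completion.choices[0].message.content)
    return responses
```

Batching here only groups requests for bookkeeping (e.g. checkpointing outputs per batch); the OpenAI chat endpoint itself is called once per prompt.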

Python version & environment

  • Recommended Python: 3.10 or 3.11. (The code uses modern pandas and scikit-learn APIs; Python 3.8 may work but 3.10+ is recommended.)

  • Install required Python packages from requirements.txt in this repository.

  • If you plan to use the api_code/ folder to call the GPT API, ensure the OpenAI client (or your chosen API client) is installed.


Quick setup (recommended in a virtual environment):

```bash
# create & activate a venv (macOS/Linux)
python3 -m venv .venv
source .venv/bin/activate

# install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# install OpenAI client and optional dotenv support
pip install openai python-dotenv

# Set your OpenAI API key
export OPENAI_API_KEY="sk-<your-key-here>"
```

System requirements (CPU/GPU)

  • GPU is not required. The code uses scikit-learn's RandomForest implementation, which runs on CPU; the scripts will run on a regular laptop or server without CUDA/OpenCL.

  • For faster runs, prefer a multi-core CPU (4+ cores). Note: earlier edits temporarily set RandomForest training to use all cores (n_jobs=-1) but those changes have been reverted per project owner request; the scripts retain their original behavior.

  • Network & API access: the api_code/ examples call the OpenAI GPT API and therefore require a valid OpenAI account with an API key; API usage may incur charges.
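To illustrate the CPU-only training setup described above, the following is a minimal sketch using scikit-learn's RandomForestClassifier on a synthetic dataset (the hyperparameter values are illustrative, not the repository's CONFIG):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset standing in for the feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs=1 matches the single-core default behavior noted above;
# n_jobs=-1 would parallelize across all cores, but the repo scripts do not set it.
clf = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

No GPU-specific code path exists in scikit-learn's random forests, so the same script runs unchanged on a laptop or a server.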

Run duration

Estimated run durations depend on dataset size, number of features, CONFIG['n_runs'], and CONFIG['n_estimators'].

  • Small run (single author or a few hundred rows, n_runs=1, n_estimators=100): typically completes in seconds to a couple of minutes on a modern laptop (4+ cores).
  • Full default run (repository defaults such as n_runs=100 across many authors): can take tens of minutes to multiple hours on a single machine, depending on dataset size and CPU.

Expect runtime to scale roughly linearly with n_runs.
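Because runtime scales roughly linearly with n_runs, a back-of-envelope estimate can be extrapolated from a single timed run (the numbers below are illustrative, not measured):

```python
def estimate_total_minutes(seconds_per_run, n_runs):
    """Extrapolate total runtime from one timed run, assuming linear scaling."""
    return seconds_per_run * n_runs / 60

# e.g. if one run takes 90 seconds, 100 runs is roughly 150 minutes
print(estimate_total_minutes(90, 100))
```

Time one run with n_runs=1 first, then scale up to the full CONFIG['n_runs'] value.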

Reproducibility & requirements

  • A requirements.txt file is included in the repo root listing the Python packages used by the scripts and notebooks.

How to run the main scripts

  • Edit classifier_data_code/classification_code/random_forest_classifier.py or k_fold_author_validation_classifier.py to adjust CONFIG (paths, n_runs, sample_size_per_category, etc.).

  • Ensure the input CSV (data/master_feature_matrix.csv) is present and formatted with the expected columns (the scripts expect columns such as id, author, model, category, and the features listed in feature_cols).

  • Run:

```bash
python classifier_data_code/classification_code/random_forest_classifier.py
```

Results are written to the data/ directory by default (see CONFIG['output_dir']).
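Before launching a long run, it can help to verify that the input CSV has the columns the scripts expect. The column names below follow the description above; the feature_cols argument is a placeholder for the actual feature list defined in the scripts:

```python
import pandas as pd

# Columns the classifier scripts are described as expecting
REQUIRED = {"id", "author", "model", "category"}

def check_columns(csv_path, feature_cols=()):
    """Raise ValueError if the feature matrix is missing any expected column."""
    # nrows=0 reads only the header, so this is cheap even for large files
    cols = set(pd.read_csv(csv_path, nrows=0).columns)
    missing = (REQUIRED | set(feature_cols)) - cols
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return True
```

Running this against data/master_feature_matrix.csv before editing CONFIG catches path and schema mistakes early.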

License

Notes on licensing:

  • The HDSR article is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license (see the article page). This applies to the article text and supplemental material where indicated by the publishers.
  • The code in this repository is licensed under the MIT License (see the LICENSE file). Data included in data/ may carry licensing or citation restrictions; check the file-level headers and the article's supplemental information where applicable.

Please respect the article's CC BY 4.0 terms when reusing article text or figures: attribution is required.

About

An expansion of ai-style, exploring its model's AI detection capabilities.
