Data and code repository for reproducing the experiments from the AI Style project (WashU AI Humanities Lab). DOI: https://doi.org/10.5281/zenodo.15587211
This repository contains code and data that accompany the article:
Kirilloff, G., Carroll, C., Daboul, Z., Frank, A., Khan, R., Hinrichs-Morrow, M., & Weingart, R. (2025). "'Written in the Style of': ChatGPT and the Literary Canon." Harvard Data Science Review (HDSR). Published Aug 05, 2025. Available: https://doi.org/10.1162/99608f92.6d5fb5ef
If you use the repository code or data, cite this repository; if you use the article or its findings, cite the HDSR article. Example citations:
- For the article: Kirilloff, G., et al. (2025). 'Written in the Style of': ChatGPT and the Literary Canon. Harvard Data Science Review. https://doi.org/10.1162/99608f92.6d5fb5ef
- For the software (this repository): Carroll, C., & Kirilloff, G. (2025). AI Style Analysis (Version 1.0.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.15587211
- `classifier_data_code/` — scripts and notebooks used to generate feature matrices and run classification experiments to distinguish authentic (human) and synthetic text. Key scripts:
  - `classification_code/random_forest_classifier.py` — repeated-sampling random-forest experiments (CONFIG-driven).
  - `classification_code/k_fold_author_validation_classifier.py` — author holdout / k-fold-style experiments.
- `feature_data_code/` — notebooks for building feature matrices.
- `data/` — CSV inputs and outputs generated by the experiments (not all files are tracked).
- `api_code/` — scripts and code for interacting with the OpenAI GPT API (prompt runners, helper functions, and example calls). This folder contains `prompt_runner.py`, which demonstrates how to send prompts to the GPT API, receive responses, handle simple batching, and save outputs; see the sketch below.
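As an illustration of the kind of call `prompt_runner.py` makes, here is a minimal sketch using the official `openai` v1.x Python client. The function name, model name, and prompt are illustrative placeholders, not the project's actual values.

```python
# Minimal sketch of a single GPT API call (openai v1.x client).
# run_prompt, the model name, and the prompt are illustrative placeholders.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def run_prompt(prompt: str, model: str = "gpt-4o") -> str:
    """Send one prompt and return the text of the first completion."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(run_prompt("Write one sentence in the style of Jane Austen."))
```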
- Recommended Python: 3.10 or 3.11. (The code uses modern pandas and scikit-learn APIs; Python 3.8 may work, but 3.10+ is recommended.)
- Install required Python packages from `requirements.txt` in this repository.
- If you plan to use the `api_code/` folder to call the GPT API, ensure the OpenAI client (or your chosen API client) is installed.
Quick setup (recommended in a virtual environment):
```bash
# create & activate a venv (macOS/Linux)
python3 -m venv .venv
source .venv/bin/activate
# install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# install OpenAI client and optional dotenv support
pip install openai python-dotenv
# Set your OpenAI API key
export OPENAI_API_KEY="sk-<your-key-here>"
```
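If you prefer not to export the key in your shell, `python-dotenv` (installed above) can load it from a local `.env` file instead. A minimal sketch, assuming a `.env` file in the working directory with a line like `OPENAI_API_KEY=sk-...`:

```python
# Sketch: load OPENAI_API_KEY from a local .env file via python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not found; set it in .env or the shell.")
```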
- GPU is not required. The code uses scikit-learn's RandomForest implementation, which runs on CPU; the scripts will run on a regular laptop or server without CUDA/OpenCL.
- For faster runs, prefer a multi-core CPU (4+ cores). Note: earlier edits temporarily set RandomForest training to use all cores (`n_jobs=-1`), but those changes have been reverted per project owner request; the scripts retain their original behavior. If you want to re-enable parallelism locally, see the sketch after this list.
- Network & API access: the `api_code/` examples call the OpenAI GPT API and therefore require a valid OpenAI account with an API key, and may incur usage charges.
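If you do want to re-enable parallel training in a local copy, the change is a single scikit-learn parameter. A minimal sketch (this is not the repository's shipped configuration):

```python
# Sketch: multi-core RandomForest training; the shipped scripts do NOT set n_jobs.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,  # matches the small-run example discussed below
    n_jobs=-1,         # parallelize tree construction across all CPU cores
    random_state=42,   # fix the seed for reproducible runs
)
```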
Estimated run durations depend on dataset size, number of features, `CONFIG['n_runs']`, and `CONFIG['n_estimators']`:

- A small run (a single author or a few hundred rows, `n_runs=1`, `n_estimators=100`) typically completes in seconds to a couple of minutes on a modern laptop (4+ cores).
- A full default run (repository defaults such as `n_runs=100` across many authors) can take tens of minutes to multiple hours on a single machine, depending on dataset size and CPU.

Expect runtime to scale roughly linearly with `n_runs`.
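Before committing to a full run, it may help to shrink the relevant `CONFIG` values for a quick smoke test. A minimal sketch, assuming `CONFIG` is the plain dict the scripts use; only the two keys named in this README are shown:

```python
# Sketch: override CONFIG for a quick smoke test before a full run.
CONFIG["n_runs"] = 1          # one repetition instead of the default 100
CONFIG["n_estimators"] = 100  # small forest; enough to verify the pipeline
```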
- A `requirements.txt` file is included in the repo root listing the Python packages used by the scripts and notebooks.
- Edit `classifier_data_code/classification_code/random_forest_classifier.py` or `k_fold_author_validation_classifier.py` to adjust `CONFIG` (paths, `n_runs`, `sample_size_per_category`, etc.).
- Ensure the input CSV (`data/master_feature_matrix.csv`) is present and formatted with the expected columns (the scripts expect columns such as `id`, `author`, `model`, `category`, and the features listed in `feature_cols`); a quick validation sketch follows this list.
- Run `python classifier_data_code/classification_code/random_forest_classifier.py`. Results are written to the `data/` directory by default (see `CONFIG['output_dir']`).
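Before a long run, it can also be worth verifying that the input CSV has the columns the scripts expect. A minimal sketch using pandas; the required-column set below covers only the names mentioned above, not the full `feature_cols` list:

```python
# Sketch: check that data/master_feature_matrix.csv has the expected columns.
import pandas as pd

df = pd.read_csv("data/master_feature_matrix.csv")
required = {"id", "author", "model", "category"}  # per this README; extend with feature_cols
missing = required - set(df.columns)
if missing:
    raise ValueError(f"master_feature_matrix.csv missing columns: {sorted(missing)}")
print(f"OK: {len(df)} rows, {len(df.columns)} columns")
```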
Notes on licensing:
- The HDSR article is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license (see the article page). This applies to the article text and supplemental material where indicated by the publishers.
- The code in this repository is licensed under the MIT License (see the `LICENSE` file). Data included in `data/` may carry licensing or citation restrictions; check the file-level headers and the article's supplemental information where applicable.
Please respect the article's CC BY 4.0 terms when reusing article text or figures: attribution is required.