ghomasHudson/ao3_style_change

AO3 Style Change Detection

A style change detection dataset built from AO3 (Archive of Our Own) fanfiction. Inspired by the PAN 21 Style Change Detection task, but with much longer documents.

Note: Due to the nature of the fanfiction source, much of the text will be NSFW.

Dataset construction methodology

We pick 4 relationships from different popular fandoms on AO3:

  • Sherlock Holmes/John Watson
  • Castiel/Dean Winchester
  • Steve Rogers/Tony Stark
  • Draco Malfoy/Harry Potter (used as the test set)

For each pairing, we collect stories that include it and are written in English. We group these stories by author, then randomly generate documents whose paragraphs come from 1 to 4 authors.
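
The sampling step described above can be sketched roughly as follows. This is an illustration only, not the repository's code: the function name make_document, its parameters, and the segment and paragraph limits are all hypothetical, and it assumes each author's paragraphs are already available as a non-empty list. It emits only the "authors", "structure", and "paragraph-authors" labels.

```python
import random


def make_document(paragraphs_by_author, max_authors=4, max_segments=6,
                  max_paragraphs_per_segment=5, rng=None):
    """Build one synthetic document from a {author: [paragraph, ...]} mapping.

    Hypothetical helper: the limits here are illustrative defaults,
    not values taken from the dataset.
    """
    rng = rng or random.Random()

    # Pick between 1 and max_authors distinct authors for this document.
    n_authors = rng.randint(1, min(max_authors, len(paragraphs_by_author)))
    chosen = rng.sample(sorted(paragraphs_by_author), n_authors)

    # Start with one segment per chosen author, then optionally add more
    # segments, never repeating the author of the previous segment.
    structure = list(chosen)
    if n_authors > 1:
        extra = rng.randint(0, max(0, max_segments - n_authors))
        for _ in range(extra):
            candidates = [a for a in chosen if a != structure[-1]]
            structure.append(rng.choice(candidates))

    # Fill each segment with random paragraphs from that author's pool.
    # Authors are numbered 1, 2, ... in order of first appearance.
    author_ids = {}
    paragraphs, paragraph_authors = [], []
    for author in structure:
        author_id = author_ids.setdefault(author, len(author_ids) + 1)
        pool = paragraphs_by_author[author]
        count = rng.randint(1, min(max_paragraphs_per_segment, len(pool)))
        for paragraph in rng.sample(pool, count):
            paragraphs.append(paragraph)
            paragraph_authors.append(author_id)

    truth = {
        "authors": n_authors,
        "structure": structure,
        "paragraph-authors": paragraph_authors,
    }
    return "\n".join(paragraphs), truth
```

Passing a seeded random.Random as rng makes the output reproducible. Note that this sketch may reuse a paragraph when the same author appears in more than one segment.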

Quickstart

To use this dataset in your code, load it with the Hugging Face Datasets library:

import datasets
ds = datasets.load_dataset("ghomasHudson/ao3_style_change")
print(ds["train"][0])
# Example output (truncated):
# {"site": "Castiel/Dean Winchester", "authors": 4, "structure": ["Author1", "Author2", ...], "multi-author": 1, "changes": [0,0,...]...}
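
Once a split is loaded, a quick sanity check is to count how many documents have each number of authors. The sketch below assumes only that every record carries the "authors" field shown in the example output; the helper name author_count_distribution is hypothetical, and the sample records are made up for illustration.

```python
from collections import Counter


def author_count_distribution(records):
    """Count documents by their number of authors (the "authors" field)."""
    return Counter(record["authors"] for record in records)


# Made-up records standing in for rows of a loaded split.
sample_records = [{"authors": 4}, {"authors": 1}, {"authors": 4}]
print(author_count_distribution(sample_records))
```

Iterating a Hugging Face split yields one dict per row, so author_count_distribution(ds["train"]) should work the same way on the real data.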

Data Format

We use the same data format as the PAN 21 task, with two files for each problem instance x:

  • problem-x.txt containing the text
  • truth-problem-x.json containing the ground truth (labels), e.g.
{
    "site": "Sherlock Holmes/John Watson",
    "authors": 3,
    "multi-author": 1,
    "structure": ["Username1", "Username2", "Username1", "Username3"],
    "changes": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...],
    "paragraph-authors": [1, 1, 1, 1, 1, 2, 2, 2, 2, ...]
}
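
A minimal sketch for reading these file pairs from disk, assuming all problem-x.txt and truth-problem-x.json files sit in one folder. The function name load_problems is hypothetical; the truth is returned as None when a label file is missing.

```python
import json
from pathlib import Path


def load_problems(folder):
    """Yield (problem_id, text, truth) for each problem-x.txt in folder."""
    folder = Path(folder)
    for text_path in sorted(folder.glob("problem-*.txt")):
        # "problem-12.txt" -> "12"
        problem_id = text_path.stem[len("problem-"):]
        truth_path = folder / f"truth-problem-{problem_id}.json"

        text = text_path.read_text(encoding="utf-8")
        truth = None
        if truth_path.exists():
            truth = json.loads(truth_path.read_text(encoding="utf-8"))
        yield problem_id, text, truth
```

For example, `for pid, text, truth in load_problems("data/train"): ...` iterates over every instance, where "data/train" is a placeholder path.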

Gathering new data

Two Python scripts used to scrape and build the data are provided:

  • main.py iterates through the list of character pairings, downloading fics in the following structure:
fanfics/
├── pairing1
│   ├── Username1
│   │    ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.json
│   │    ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.txt
│   │    ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.json
│   │    ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.txt
│   │    └── ...
│   ├── Username2
│   │    └── ...
│   ...
└── pairing2
    ├── Username3
    │    └── ...
    ├── Username4
    │    └── ...
    ...
  • to_style_change.py turns this into a style change task by randomly creating a structure (an ordered sequence of author segments) and filling each segment with random paragraphs from that author.
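
The fanfics/ layout above can be read back into per-author text lists with a short directory walk. This is a sketch under assumptions, not the code in to_style_change.py: the function name collate_by_author is hypothetical, it reads only the .txt files, and it ignores the accompanying .json files because their contents are not described here.

```python
from collections import defaultdict
from pathlib import Path


def collate_by_author(root):
    """Return {pairing: {username: [story_text, ...]}} for a fanfics/ tree."""
    collated = defaultdict(lambda: defaultdict(list))
    # Expected layout: <root>/<pairing>/<username>/<hash>.txt
    for txt_path in sorted(Path(root).glob("*/*/*.txt")):
        username = txt_path.parent.name
        pairing = txt_path.parent.parent.name
        collated[pairing][username].append(
            txt_path.read_text(encoding="utf-8")
        )
    # Convert to plain dicts so missing keys raise KeyError as usual.
    return {pairing: dict(authors) for pairing, authors in collated.items()}
```

Grouping by pairing first keeps each fandom separate, which matters here because one pairing is held out as the test set.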

Baseline model (WIP)

run_baseline.sh will train a simple baseline model based on chunking the data.