AO3 Style Change Detection

A style change detection dataset built from AO3 (Archive of Our Own) fanfiction. It is inspired by the PAN 21 Style Change Detection task, but uses much longer documents.

Note: Due to the nature of the fanfiction source, much of the text will be NSFW.

Dataset construction methodology

We pick 4 relationships from different popular fandoms on AO3:

Sherlock Holmes/John Watson
Castiel/Dean Winchester
Steve Rogers/Tony Stark
Draco Malfoy/Harry Potter (used as the test set)

For each pairing, we collect stories that include it and are written in English. We collate these by author and randomly generate documents containing paragraphs from 1-4 authors.

Quickstart

To use this dataset in your code, load it with the Hugging Face Datasets loader:

import datasets
ds = datasets.load_dataset("ghomasHudson/ao3_style_change")
print(ds["train"][0])
# >> {"site": "Castiel/Dean Winchester", "authors": 4, "structure": ["Author1", "Author2", ...], "multi-author": 1, "changes": [0,0,...]...}
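
Each loaded record is a plain dictionary, so simple statistics are easy to compute. The following is a minimal sketch, not part of the repository: the helper name `author_count_histogram` is hypothetical, and it relies only on the `authors` field shown in the output above.

```python
from collections import Counter


def author_count_histogram(records):
    """Count how many documents have 1, 2, 3, ... authors.

    `records` is any iterable of dicts with an integer "authors" field,
    for example a split returned by the loader (assumed usage).
    """
    return Counter(record["authors"] for record in records)


# Small in-memory stand-in for dataset records (made-up values).
example_records = [{"authors": 1}, {"authors": 4}, {"authors": 4}]
histogram = author_count_histogram(example_records)
print(sorted(histogram.items()))
```

Passing `ds["train"]` instead of `example_records` should give the distribution of author counts for the training split, assuming the split iterates over dictionaries as in the example above.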

Data Format

We use the same data format as the PAN 21 task, with two files for each problem instance x:

problem-x.txt containing the text
truth-problem-x.json containing the ground truth (labels), e.g.

{
    "site": "Sherlock Holmes/John Watson",
    "authors": 3,
    "multi-author": 1,
    "structure": ["Username1", "Username2", "Username1", "Username3"],
    "changes": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...],
    "paragraph-authors": [1, 1, 1, 1, 1, 2, 2, 2, 2, ...]
}
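
A problem instance can be read back with the standard library alone. This is a minimal sketch under assumptions: `load_problem` is a hypothetical helper (not from the repository), both files are assumed to sit in the same directory, and the text is returned unsplit because the paragraph delimiter is not specified here.

```python
import json
from pathlib import Path


def load_problem(directory, problem_id):
    """Return (text, truth) for one problem instance.

    Assumes the file naming described above: problem-<id>.txt and
    truth-problem-<id>.json, both stored in `directory`.
    """
    directory = Path(directory)
    text_path = directory / f"problem-{problem_id}.txt"
    truth_path = directory / f"truth-problem-{problem_id}.json"

    text = text_path.read_text(encoding="utf-8")
    with open(truth_path, encoding="utf-8") as f:
        truth = json.load(f)
    return text, truth
```

A call such as `text, truth = load_problem("some/data/dir", 1)` would then give the raw text and the label dictionary; the directory name is a placeholder.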

Gathering new data

Two Python files that were used to scrape the data are provided:

main.py iterates through the list of character pairings, downloading fics in the following structure:

fanfics/
├── pairing1
│   ├── Username1
│   │    ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.json
│   │    ├── 3b6ff2cadcaedf11d5eaaefd1e998d49c493c45f.txt
│   │    ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.json
│   │    ├── ab35ee7ceb06ee97c94cd042d8874f1eab99bd1a.txt
│   │    └── ...
│   ├── Username2
│   │    └── ...
│   ...
└── pairing2
    ├── Username3
    │    └── ...
    ├── Username4
    │    └── ...
    ...
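
Given the layout above, stories can be grouped by pairing and author with a short directory walk. This is a minimal sketch, not code from the repository: `collate_by_author` is a hypothetical helper, and it assumes only the `fanfics/<pairing>/<username>/<hash>.txt` structure shown in the tree.

```python
from collections import defaultdict
from pathlib import Path


def collate_by_author(root):
    """Map pairing -> username -> sorted list of story .txt paths.

    Assumes the <root>/<pairing>/<username>/<hash>.txt layout; the
    accompanying .json metadata files are ignored here.
    """
    stories = defaultdict(dict)
    for txt_path in sorted(Path(root).glob("*/*/*.txt")):
        author_dir = txt_path.parent
        pairing = author_dir.parent.name
        stories[pairing].setdefault(author_dir.name, []).append(txt_path)
    return dict(stories)
```

For example, `collate_by_author("fanfics")["pairing1"]["Username1"]` would list that author's story files for the first pairing.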

to_style_change.py turns this into a style change task, by randomly creating a structure and filling it with random paragraphs.
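
The idea of creating a random structure and filling it with paragraphs can be sketched as follows. This is an illustration only, not the logic of to_style_change.py: the function name `make_document`, the segment limits, and sampling with replacement are all assumptions. The `changes` labels are left out because their exact alignment with paragraphs is not specified here.

```python
import random


def make_document(paragraphs_by_author, max_authors=4, max_segments=6,
                  paragraphs_per_segment=(1, 5), rng=None):
    """Build one synthetic document from per-author paragraph pools.

    paragraphs_by_author: dict mapping username -> list of paragraphs.
    All other parameters are illustrative assumptions, not repository values.
    """
    if rng is None:
        rng = random.Random()

    # Pick between 1 and max_authors distinct authors, in random order.
    n_authors = rng.randint(1, min(max_authors, len(paragraphs_by_author)))
    authors = rng.sample(sorted(paragraphs_by_author), n_authors)

    # Structure: every chosen author appears once, then optional extra
    # segments whose author always differs from the previous segment.
    structure = list(authors)
    if n_authors > 1:
        extra = rng.randint(0, max(0, max_segments - n_authors))
        for _ in range(extra):
            candidates = [a for a in authors if a != structure[-1]]
            structure.append(rng.choice(candidates))

    # Number authors 1, 2, ... in order of first appearance.
    author_ids = {name: i + 1 for i, name in enumerate(authors)}

    paragraphs, paragraph_authors = [], []
    low, high = paragraphs_per_segment
    for name in structure:
        count = rng.randint(low, high)
        # Sample with replacement so small author pools still work.
        paragraphs.extend(rng.choices(paragraphs_by_author[name], k=count))
        paragraph_authors.extend([author_ids[name]] * count)

    truth = {
        "authors": n_authors,
        "structure": structure,
        "multi-author": int(n_authors > 1),  # key as in the Quickstart output
        "paragraph-authors": paragraph_authors,
    }
    return paragraphs, truth
```

Passing a seeded `random.Random` as `rng` makes the generated documents reproducible.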

Baseline model (WIP)

run_baseline.sh will train a simple baseline model based on chunking the data.
