Spoiler Matching

This repository contains the code for Spoiler Detection as Semantic Text Matching. The dataset, along with a detailed description, is available on Kaggle and Hugging Face.

Quickstart

Data

Download the dataset from Kaggle or Hugging Face, then create a data directory

mkdir data

and extract the dataset into data/.

Environment

Please ensure that you have Anaconda or Miniconda installed, then

conda env create -f environment.yml
conda activate spoiler

Logging

We use Comet.ml to store and read our logs. By default, train.py runs in offline mode; to log your experiments to Comet.ml, enter your API key at the top of train.py.

Train

python train.py --config config/longformer.yml

PyTorch Lightning model checkpoints are saved automatically in the checkpoints directory under the experiment name. Only the two models with the best validation MRR are kept.
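To make the retention policy concrete, here is a minimal pure-Python sketch of "keep the two checkpoints with the highest validation MRR". It is not the PyTorch Lightning implementation and not code from train.py. The function name keep_top_k, the checkpoint file names, and the MRR values are all made up for illustration.

```python
import heapq
from typing import List, Tuple

def keep_top_k(checkpoints: List[Tuple[str, float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (checkpoint_name, val_mrr) pairs with the highest validation MRR."""
    # Higher MRR is better, so take the k largest by the second tuple element.
    return heapq.nlargest(k, checkpoints, key=lambda item: item[1])

# Hypothetical training history: one checkpoint per epoch with its validation MRR.
history = [
    ("epoch0.ckpt", 0.41),
    ("epoch1.ckpt", 0.57),
    ("epoch2.ckpt", 0.52),
    ("epoch3.ckpt", 0.60),
]

kept = keep_top_k(history, k=2)
print(kept)  # [('epoch3.ckpt', 0.6), ('epoch1.ckpt', 0.57)]
```

Under this policy, every other checkpoint would be discarded once a better one is available, so disk usage stays bounded regardless of how long training runs.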

Trained Models

Alternatively, you can skip training and download the models from the paper.

Test

Point the resume_from field in your config file (e.g. models/checkpoints/longformer/longformer.yml) to your desired model checkpoint (e.g. models/checkpoints/longformer/best.ckpt), then

python test.py --config models/checkpoints/longformer/longformer.yml --mode test

The MRR for each of the four shows in the test set is printed first, followed by the overall test set MRR.
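For readers unfamiliar with the metric, the sketch below computes mean reciprocal rank (MRR) per group and overall, using the standard definition: the average of 1/rank of the correct candidate across queries. It is not the code in test.py. The show names, the scores, and the choice to pool all queries for the overall figure are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Query = Tuple[Sequence[float], int]  # (candidate scores, index of the correct candidate)

def reciprocal_rank(scores: Sequence[float], correct_index: int) -> float:
    """Return 1 / rank of the correct candidate, where rank 1 is the highest score."""
    target = scores[correct_index]
    # Rank is 1 plus the number of candidates scored strictly higher.
    rank = 1 + sum(1 for s in scores if s > target)
    return 1.0 / rank

def mean_reciprocal_rank(queries: List[Query]) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(s, i) for s, i in queries) / len(queries)

# Hypothetical model scores grouped by show (names and values are made up).
by_show: Dict[str, List[Query]] = {
    "show_a": [([0.9, 0.1, 0.3], 0), ([0.2, 0.8, 0.5], 2)],  # ranks 1 and 2
    "show_b": [([0.4, 0.6, 0.7, 0.1], 0)],                   # rank 3
}

# Per-show MRR first, then the MRR over all queries pooled together.
for show, queries in by_show.items():
    print(f"{show}: {mean_reciprocal_rank(queries):.4f}")

all_queries = [q for queries in by_show.values() for q in queries]
print(f"total: {mean_reciprocal_rank(all_queries):.4f}")
```

With these made-up scores, show_a has an MRR of 0.75, show_b has 0.3333, and the pooled total is 0.6111. The pooled figure weights every query equally, so it generally differs from the plain average of the per-show values.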

Auto-labeling

We provide a medium-sized auto-labeled training set that is ready for training a spoiler matching model. If you would like to create your own training set, we also provide the raw unlabeled comments, as well as the irrelevant/relevant dataset we used to train the auto-labeler.
