KOMODIS dataset

This is the repository for the paper: "A Corpus of Controlled Opinionated and Knowledgeable Movie Discussions for Training Neural Conversation Models". The paper can be found here: http://arxiv.org/abs/2003.13342.

We introduce KOMODIS (Knowledgeable and Opinionated MOvie DIScussions), an augmented dialogue dataset that was crowd-sourced with Amazon Mechanical Turk. Each dialogue is based on two feature structures about the same movie, one for each crowd-worker.

Dialogue examples

For detailed information please check the paper. Below are two dialogue examples.

Data

We provide the full postprocessed dialogue dataset in data/dataset.json.

For explanations on how to read and use the structured data, please check data/example.json (will be uploaded soon!).
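Until the example file is available, a first look at the data can be taken without knowing its schema. The following is a minimal sketch, not part of the repository: the helper name `summarize_json` is hypothetical, and it makes no assumption about the structure of data/dataset.json beyond it being valid JSON.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level shape.

    Hypothetical helper: it does not assume any particular schema,
    it only reports what the top-level container looks like.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["size"] = len(data)
        # Show a few keys so the reader can decide where to look next.
        summary["sample_keys"] = list(data)[:5]
    elif isinstance(data, list):
        summary["size"] = len(data)
        if data:
            summary["first_item_type"] = type(data[0]).__name__
    return data, summary
```

Calling `summarize_json("data/dataset.json")` from the repository root returns the parsed data together with a small dictionary describing its top-level type and size.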

Model

In model/, we provide a baseline PyTorch script for training a GPT-2 model on our dataset.

To train a model, run the train.py script:

python train.py --dataset komodis

Additional arguments are documented in the script itself. Before training, please download the pretrained GPT-2 weights from https://github.com/huggingface/transformers and store them in data/pretrained_models/gpt2/ and data/pretrained_weights/tokenizers.
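One possible way to fetch the weights is through the transformers library itself. This is a sketch under several assumptions, none of which is confirmed by the repository: that the `gpt2` checkpoint is the intended one, that `GPT2LMHeadModel` is a suitable model class, that the training script can read the files written by `save_pretrained`, and that the model goes in the first directory and the tokenizer in the second. The function name `download_gpt2` is hypothetical.

```python
from pathlib import Path

# Target directories as named in this README.
# Assumption: model files go in the first, tokenizer files in the second.
MODEL_DIR = Path("data/pretrained_models/gpt2")
TOKENIZER_DIR = Path("data/pretrained_weights/tokenizers")


def download_gpt2(model_name="gpt2", model_dir=MODEL_DIR, tokenizer_dir=TOKENIZER_DIR):
    """Download GPT-2 weights and tokenizer files and store them locally.

    Hypothetical helper. The checkpoint name and the model class are
    assumptions; check train.py for what it actually expects.
    """
    # Imported here so the module can be loaded without transformers installed.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model_dir = Path(model_dir)
    tokenizer_dir = Path(tokenizer_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    tokenizer_dir.mkdir(parents=True, exist_ok=True)

    # from_pretrained fetches the files; save_pretrained writes them to disk.
    GPT2LMHeadModel.from_pretrained(model_name).save_pretrained(str(model_dir))
    GPT2Tokenizer.from_pretrained(model_name).save_pretrained(str(tokenizer_dir))
    return model_dir, tokenizer_dir
```

Running `download_gpt2()` once from the repository root, with network access and transformers installed, would populate both directories.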
