Movie QA Benchmarking Dataset

For one particular application of YodaQA, we want to enhance and speed up its capability to answer "noisy" questions on a structured knowledge base in a narrow domain. To start prototyping, we have chosen the "movies" domain.

To get started, we extracted movie-related questions from WebQuestions (Berant et al., 2013, CC-BY) using the machinery in (we use the same JSON structure and scripts in this repo). This is the moviesB dataset.

The moviesC dataset also includes "mfb" questions. "mfb" stands for "movie feedback": these questions were recorded by the YodaQA feedback tool while internet users (mainly interns of the eClub Prague foundation) tested the YodaQA Movies engine. The script extracts the feedback data from a Google Docs spreadsheet.

We intend to follow up with larger and better datasets, named with successive letters.

Using with YodaQA

YodaQA typically expects datasets in TSV format rather than JSON. (The JSON collection reader in YodaQA is a work in progress.) To convert the data to TSV, run:

../dataset-factoid-webquestions/scripts/ moviesC train moviesC
../dataset-factoid-webquestions/scripts/ moviesC test moviesC
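To illustrate what such a JSON-to-TSV conversion involves, here is a minimal Python sketch. It is not the script from the repository. The field names `qId`, `qText` and `answers`, the three-column layout, and the `|` separator between answers are all assumptions made for this example; the TSV format that YodaQA actually reads may differ.

```python
import csv
import io
import json


def json_to_tsv(questions, out):
    """Write one TSV row per question: id, question text, answers joined by '|'.

    The field names and the column layout are assumptions for illustration.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for q in questions:
        writer.writerow([q["qId"], q["qText"], "|".join(q["answers"])])


# Hypothetical sample record, not taken from the dataset.
sample = json.loads(
    '[{"qId": "demo000001", "qText": "who directed Alien?",'
    ' "answers": ["Ridley Scott"]}]'
)

buf = io.StringIO()
json_to_tsv(sample, buf)
print(buf.getvalue(), end="")
```

To convert a real file, the same function can be called with the result of `json.load()` on the input file and an output file opened with `newline=""`.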

The dataset is called moviesA - the A letter represents our intention to develop it further. It is currently rather noisy, mixed with sports questions and not that large either.

moviesC is a dataset created by merging the t-movies dataset (here named moviesB for reference) from and public feedback in our 2 spreadsheets (downloaded 17.8.2015):

moviesD is an update of moviesC on 2015-10-19.

moviesE is an update of moviesD on 2015-12-10 that also adds synthetic questions (gen v0).

moviesF is an update of moviesE on 2016-01-04 with a variety of bugs related to the synthetic questions fixed.

Licence and Acknowledgements

This dataset may be distributed under the terms of the CC-BY 4.0 licence. Work on this project has been supported in part by the Medialab foundation.


A question answering research dataset of movie-related factoids


