Audio cloze test generator using open data

This repository contains code to automatically generate cloze tests with audio support based on data from the Global Storybooks project. They are designed as an aid for self-study rather than diagnostic testing.

The generated pages update to test new randomly selected words on refresh, so no test is ever the same. They are separated by level so users can select the most appropriate test for their language ability.

In order to generate the tests you will need to download a copy of one of the Global Storybooks source repos. A good example to start with, containing a variety of different languages, is the sbc-source repo. You can also (optionally) download a local copy of the audio files rather than streaming them online.


You can use the cloze tests generated by this script live online on the Storybooks Canada website.


It is important to note that this is meant as a self-study aid, rather than a classroom assessment tool.

This is a proof of concept that may have applications for other purposes. For example, it should be quite simple to use this framework to create other types of test content, including hand-selected clozes drawing on similar parts of speech.

The main appeal of the audio cloze test is that it "gamifies" listening practice. Some users may find it to be an enjoyable way of checking to see if they have heard the text correctly.

One thing to note is that although it would in theory be simple to "guess" the answer from the context (or just try each option until hitting on the right one), since this is very low stakes there is really no point in doing so. You may find that you have to listen much more closely than you would otherwise to see if you can pick out the correct word from the audio.

Design principles of this experiment:

  • Clozes or "distractors" are drawn entirely from the Global Storybooks corpus of 40 stories in each given language
    • In principle, the clozes could be restricted to stories of the same level, something which might be added in the next update.
  • The test updates with new, randomly selected clozes each time the page is loaded, meaning it is possible to test yourself on the same story multiple times for listening practice.
  • The tests are generated entirely automatically, meaning an effectively unlimited number of tests can be created at random from the same corpus of 40 stories.
  • All the tests are audio-linked at the sentence/paragraph level, drawing on our database of recorded audio in multiple languages.
  • There is instant feedback on whether the answer was right or wrong, and a running tally to keep track of your progress
  • Rather than trying to classify words by parts of speech (possible in theory for English, but much more complicated if working with a dozen other languages), the clozes for each word have been selected at random from all the words beginning with the same letter in the corpus
    • This is a rough but surprisingly adequate proxy for "similarity" for this purpose
    • If you try this with a language you are learning or are not very familiar with, you may find it requires a fair bit of attention to distinguish words beginning with the same sounds from a list!
  • Stopwords ("the", "was", "their" etc) have been left in deliberately to allow readers to practise discriminating basic vocabulary
    • Although it would be quite simple to remove these (using e.g. this project), for the purposes of listening practice it was felt that common words are sometimes just as important as "content" words in a new language. This could be changed in a future update, particularly at higher levels where they might be distracting.
  • At the moment, test generation works for all languages that use spaces to separate words. For this reason Chinese and Japanese, which would require either automated or manual word segmentation, are not yet included.
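
The same-first-letter selection described above can be sketched in a few lines of Ruby. This is a minimal illustration, not the code in parse_cloze: the helper names (tokenize, index_by_first_letter, cloze_options) and the sample sentence are hypothetical, and it assumes Ruby 2.4 or later, where String#downcase handles non-ASCII letters without the unicode_utils gem.

```ruby
# Hypothetical helpers for illustration; none of these names come from parse_cloze.

# Split text on whitespace and strip punctuation from the edges of each token.
def tokenize(text)
  text.split(/\s+/)
      .map { |t| t.gsub(/\A[^\p{L}\p{N}]+|[^\p{L}\p{N}]+\z/, "") }
      .reject(&:empty?)
end

# Map each lowercased first letter to the unique corpus words starting with it.
def index_by_first_letter(words)
  words.uniq.group_by { |w| w[0].downcase }
end

# Pick up to `count` distractors that share the target's first letter,
# then shuffle them together with the target to form the answer options.
def cloze_options(target, index, count: 3, rng: Random.new)
  pool = index.fetch(target[0].downcase, [])
              .reject { |w| w.downcase == target.downcase }
  (pool.sample(count, random: rng) + [target]).shuffle(random: rng)
end

corpus = "The cat saw a cow. The child saw the tall tree and came closer."
index = index_by_first_letter(tokenize(corpus))
puts cloze_options("cat", index).inspect
```

Because the tokenizer splits on whitespace only, a sketch like this inherits the limitation noted in the last bullet: it cannot find word boundaries in text written without spaces.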

Language-specific features

  • Proper names ignored:
    • Applies to English and most other Latin-based orthographies
  • Case support:
    • The German corpus preserves letter case distinctions
  • Right-to-left language support:
    • Special templates are automatically applied to accommodate Arabic, Persian, and Urdu
    • Includes changes to text direction and language-specific fonts
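
One way these language-specific rules could be expressed is shown below. It is a sketch under stated assumptions rather than the script's actual logic: the constant and method names are hypothetical, the language codes are assumed to be the two-letter ISO codes for Arabic, Persian, Urdu, and German, and the proper-name heuristic (a capitalized word that is not sentence-initial) is a guess at how names might be skipped. It assumes Ruby 2.4 or later.

```ruby
# Hypothetical settings; the script's real configuration may differ.
RTL_LANGS = %w[ar fa ur].freeze        # Arabic, Persian, Urdu
CASE_SENSITIVE_LANGS = %w[de].freeze   # German keeps case distinctions

# True when the language needs a right-to-left template.
def rtl?(lang)
  RTL_LANGS.include?(lang)
end

# Keep the original case for German, lowercase everything else.
def normalize(word, lang)
  CASE_SENSITIVE_LANGS.include?(lang) ? word : word.downcase
end

# Assumed heuristic: a capitalized word that does not start a sentence
# is treated as a proper name. Skipped where capitalization is not a
# reliable signal (German nouns, scripts without letter case).
def proper_name?(word, sentence_initial, lang)
  return false if CASE_SENSITIVE_LANGS.include?(lang) || rtl?(lang)
  !sentence_initial && word.match?(/\A\p{Lu}/)
end

puts rtl?("ur")                             # right-to-left template needed?
puts normalize("Haus", "de")                # case preserved for German
puts proper_name?("Toronto", false, "en")   # mid-sentence capital
```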


This script requires a working Ruby installation (ideally 1.9.3 or above), as well as the unicode_utils gem and a copy of a Global Storybooks source repo with markdown files.

To install unicode_utils:

gem install unicode_utils

To get a copy of a markdown source folder you can download the sbc-source repo and place the entire folder in the root directory of this project.

If you would like to download the audio rather than streaming online, you can obtain it in all available languages from this repo. Once you have the audio stored locally you will need to update the URLs in cloze.js to point to your local filesystem or webhost.


If you have all of the requirements listed above, you should be able to enter the project folder and run the following command:

./parse_cloze [LANG]

(Where [LANG] is an ISO language code -- for example, en for English, or es for Spanish.)

This will create a subfolder in the project root directory called json_output, in which there should be a subfolder named after the language (e.g., en or es), as well as two folders containing the JavaScript and CSS files needed to run the site. Within the language subfolder there should be 40 story folders arranged by index number, each of which contains an individual audio cloze test. These tests should work in any JavaScript-enabled browser without further configuration.
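
To check that a run produced the expected layout, a short Ruby snippet along the following lines may help. It is only a sketch: story_folders is a hypothetical helper, and it assumes nothing about the output beyond the json_output/[LANG] structure described above.

```ruby
# Hypothetical helper: list the subfolders of json_output/<lang>, sorted by name.
def story_folders(lang, root: "json_output")
  Dir.glob(File.join(root, lang, "*"))
     .select { |path| File.directory?(path) }
     .map { |path| File.basename(path) }
     .sort
end

folders = story_folders("en")
puts "Found #{folders.size} story folders (40 expected)"
```

If the count is lower than 40, a likely first thing to check is that the source repo folder sits in the project root as described above.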