dellison/CornellMovieDialogsCorpus.jl

CornellMovieDialogsCorpus.jl

CornellMovieDialogsCorpus.jl is a Julia package that provides a thin wrapper for the Cornell Movie Dialogs Corpus.

Usage

Exported functions:

  • movie_conversations
  • movie_lines
  • movie_title_metadata
  • movie_character_metadata
  • movie_script_urls

Each of these loads the corresponding corpus database file.

Example

Say you want to train a simple chatbot using "call-and-response" dialog pairs as training data, as in the PyTorch chatbot tutorial.

using CornellMovieDialogsCorpus

First, create a Dict that maps line IDs to the raw text.

id2text = Dict(l.line_id => l.text for l in movie_lines())

Now, create a dataset of (utterance, response) pairs from the movie conversations.

utterance_pairs = [(id2text[id], id2text[conv.lines[i+1]])
                   for conv in movie_conversations()
                   for (i, id) in enumerate(conv.lines[1:end-1])]

julia> utterance_pairs[1:5]
5-element Array{Tuple{Any,Any},1}:
 ("Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.", "Well, I thought we'd start with pronunciation, if that's okay with you.")
 ("Well, I thought we'd start with pronunciation, if that's okay with you.", "Not the hacking and gagging and spitting part.  Please.")
 ("Not the hacking and gagging and spitting part.  Please.", "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?")
 ("You're asking me out.  That's so cute. What's your name again?", "Forget it.")
 ("No, no, it's my fault -- we didn't have a proper introduction ---", "Cameron.")
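
The pairing logic can be tried without downloading the corpus. What follows is a minimal sketch on small hand-made stand-in records: the names `toy_lines` and `toy_conversations` and their contents are hypothetical, and only the `line_id`, `text`, and `lines` field names come from the example above. Typing the Dict as `Dict{String,String}` is an assumption that both fields are strings. The sketch also uses `zip` in place of `enumerate` to walk consecutive lines, which is an alternative way to write the same pairing, not the package's own method.

```julia
# Hypothetical stand-in data mimicking the fields used above
# (line_id, text, lines); these are not real corpus records.
toy_lines = [
    (line_id = "L1", text = "Hello."),
    (line_id = "L2", text = "Hi, how are you?"),
    (line_id = "L3", text = "Fine, thanks."),
    (line_id = "L4", text = "Who's there?"),
    (line_id = "L5", text = "Nobody."),
]
toy_conversations = [
    (lines = ["L1", "L2", "L3"],),
    (lines = ["L4", "L5"],),
]

# Map line IDs to text, with explicit key and value types (an assumption).
id2text = Dict{String,String}(l.line_id => l.text for l in toy_lines)

# Pair each line with the line that follows it in the same conversation.
# zip stops at the shorter argument, so the last line gets no response.
utterance_pairs = [(id2text[a], id2text[b])
                   for conv in toy_conversations
                   for (a, b) in zip(conv.lines, conv.lines[2:end])]

foreach(println, utterance_pairs)
```

A conversation with n lines contributes n-1 pairs, and no pair crosses a conversation boundary, so the two toy conversations above give three pairs in total.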
