Character Mining

The Character Mining project challenges machine comprehension on multiparty dialogue. The objective of this project is to infer explicit and implicit contexts about individual characters through their conversations. This is an open-source project led by the Emory NLP research group that provides resources for the following tasks:

We welcome feedback and contributions from the community. Most of our annotations are crowdsourced, so errors are to be expected. Please open a pull request if you wish to fix errors in our datasets.

Dataset

Our dataset is based on the popular TV show Friends. Transcripts for all 10 seasons are provided, along with manual and crowdsourced annotations for parts of the show. All text data are available in the JSON files; please visit the individual task pages to retrieve datasets specifically designed for those tasks.

Statistics

Each season consists of episodes, each episode is divided into scenes, each scene comprises utterances, and each utterance is a list of tokenized sentences.
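
As a rough illustration of this hierarchy, the sketch below walks a season object and counts the units at each level. The key names (`episodes`, `scenes`, `utterances`, `tokens`) and the toy `season` dictionary are assumptions made for illustration only; consult the JSON files for the actual schema.

```python
from typing import Any, Dict


def count_units(season: Dict[str, Any]) -> Dict[str, int]:
    """Count episodes, scenes, utterances, sentences, and tokens in one season.

    Assumes (hypothetically) that each level is stored under the keys
    'episodes', 'scenes', and 'utterances', and that 'tokens' holds a list
    of sentences, each of which is a list of token strings.
    """
    counts = {"episodes": 0, "scenes": 0, "utterances": 0, "sentences": 0, "tokens": 0}
    for episode in season.get("episodes", []):
        counts["episodes"] += 1
        for scene in episode.get("scenes", []):
            counts["scenes"] += 1
            for utterance in scene.get("utterances", []):
                counts["utterances"] += 1
                sentences = utterance.get("tokens", [])
                counts["sentences"] += len(sentences)
                counts["tokens"] += sum(len(sentence) for sentence in sentences)
    return counts


# Toy data in the assumed layout; not taken from the real dataset.
season = {
    "episodes": [
        {"scenes": [
            {"utterances": [
                {"tokens": [["Let", "me", "get", "you", "some", "coffee", "."]]},
                {"tokens": [["Thanks", "."], ["Really", "."]]},
            ]},
        ]},
    ],
}

print(count_units(season))
```

Counting sentences and tokens from the nested lists, rather than from the raw transcript string, keeps the numbers tied to whatever tokenization the files already contain.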

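| Season ID | Episodes | Scenes | Utterances | Sentences | Tokens | Speakers |
|:----------|---------:|-------:|-----------:|----------:|-------:|---------:|
| s01 | 24 | 326 | 5,968 | 10,790 | 81,453 | 107 |
| s02 | 24 | 293 | 5,747 | 9,337 | 81,910 | 107 |
| s03 | 25 | 348 | 6,495 | 10,858 | 90,753 | 108 |
| s04 | 24 | 338 | 6,318 | 10,889 | 87,289 | 100 |
| s05 | 24 | 311 | 6,220 | 11,133 | 83,907 | 107 |
| s06 | 25 | 350 | 6,458 | 11,496 | 90,384 | 112 |
| s07 | 24 | 332 | 6,314 | 11,340 | 84,974 | 94 |
| s08 | 24 | 288 | 6,220 | 11,714 | 86,164 | 107 |
| s09 | 24 | 302 | 6,322 | 11,831 | 93,773 | 99 |
| s10 | 18 | 219 | 5,247 | 9,345 | 69,493 | 78 |
| Total | 236 | 3,107 | 61,309 | 108,733 | 850,100 | 700 |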

Some utterances include action notes. In the following example, extracted from s01_e01_c01_u028, the speaker is talking to Ross, which is indicated by the action note:

"transcript": "Let me get you some coffee.",
"transcript_with_note": "(to Ross) Let me get you some coffee.",

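The example above can be handled with a small sketch like the following, which splits an utterance ID into its parts and pulls out a leading action note. The reading of the ID letters (`s` season, `e` episode, `c` scene, `u` utterance) is an assumption based on the hierarchy described earlier, the helper names are hypothetical, and only notes at the very start of an utterance are handled.

```python
import re
from typing import Dict, Optional

# Pattern for IDs shaped like the example: s01_e01_c01_u028.
# Assumption: s = season, e = episode, c = scene, u = utterance.
UTTERANCE_ID = re.compile(r"^s(\d+)_e(\d+)_c(\d+)_u(\d+)$")

# A leading parenthetical such as "(to Ross) " is treated as the action note.
LEADING_NOTE = re.compile(r"^\(([^)]*)\)\s*")


def parse_utterance_id(utterance_id: str) -> Dict[str, int]:
    """Split an utterance ID into integer season/episode/scene/utterance parts."""
    match = UTTERANCE_ID.match(utterance_id)
    if match is None:
        raise ValueError(f"unexpected utterance ID: {utterance_id!r}")
    season, episode, scene, utterance = (int(group) for group in match.groups())
    return {"season": season, "episode": episode, "scene": scene, "utterance": utterance}


def leading_action_note(utterance: Dict[str, str]) -> Optional[str]:
    """Return the text of a leading action note, or None if there is none."""
    match = LEADING_NOTE.match(utterance["transcript_with_note"])
    return match.group(1) if match else None


# The two fields shown in the example above.
utterance = {
    "transcript": "Let me get you some coffee.",
    "transcript_with_note": "(to Ross) Let me get you some coffee.",
}

print(parse_utterance_id("s01_e01_c01_u028"))
print(leading_action_note(utterance))
```

Notes that appear in the middle or at the end of an utterance would need a broader pattern; this sketch deliberately covers only the case shown in the example.
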
The following table shows the statistics when action notes are included:

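| Season ID | Utterances | Sentences | Tokens |
|:----------|-----------:|----------:|-------:|
| s01 | 6,626 | 12,088 | 100,773 |
| s02 | 6,048 | 10,565 | 97,763 |
| s03 | 7,267 | 12,288 | 117,912 |
| s04 | 7,119 | 12,811 | 116,703 |
| s05 | 7,082 | 13,540 | 118,509 |
| s06 | 7,235 | 13,506 | 120,471 |
| s07 | 7,019 | 13,363 | 116,341 |
| s08 | 6,845 | 13,321 | 109,984 |
| s09 | 6,653 | 13,548 | 119,090 |
| s10 | 5,479 | 11,029 | 93,390 |
| Total | 67,373 | 126,059 | 1,110,936 |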
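
To get a feel for how much text the action notes contribute, the Total rows of the two tables can be subtracted. The sketch below is plain arithmetic on the published totals; the variable names are illustrative and nothing here reads the dataset itself.

```python
# Totals copied from the two statistics tables.
without_notes = {"utterances": 61_309, "sentences": 108_733, "tokens": 850_100}
with_notes = {"utterances": 67_373, "sentences": 126_059, "tokens": 1_110_936}

# Difference per unit: what including action notes adds to each count.
added = {unit: with_notes[unit] - without_notes[unit] for unit in without_notes}

for unit, extra in added.items():
    share = extra / with_notes[unit]
    print(f"{unit}: +{extra:,} ({share:.1%} of the count with notes)")
```

The same subtraction can be run per season by swapping in the row values for that season.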

Documentation

References

Contact