The Character Mining project challenges machine comprehension on multiparty dialogue. The objective of this project is to infer explicit and implicit contexts about individual characters through their conversations. This is an open-source project led by the Emory NLP research group that provides resources for the following tasks:
- Character Identification (since May 2016).
- Emotion Detection (since May 2017).
- Reading Comprehension (since May 2018).
We welcome feedback and contributions from the community. Most of our annotations are crowdsourced, so errors are to be expected. Please open a pull request if you wish to fix errors in our datasets.
Our dataset is based on the popular TV show Friends. Transcripts for all 10 seasons of the show, as well as manual and crowdsourced annotations for parts of the show, are provided. All text data are available in JSON files; please visit the individual task pages to retrieve the datasets designed specifically for those tasks.
Each season consists of episodes, each episode is divided into scenes, and each scene comprises utterances. Each utterance is a list of tokenized sentences.
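The nesting described above can be walked with a few loops. The following is a minimal sketch, not the project's own loader: the key names `episodes`, `scenes`, `utterances`, `tokens`, and the `*_id` fields are assumptions made for illustration, as is the tokenization of the sample sentence, so check them against the actual JSON files before relying on them.

```python
import json

# Hypothetical miniature of one season file. Every key name except
# "transcript" is an assumption, and the token split is illustrative.
SAMPLE = json.loads("""
{
  "season_id": "s01",
  "episodes": [
    {
      "episode_id": "s01_e01",
      "scenes": [
        {
          "scene_id": "s01_e01_c01",
          "utterances": [
            {
              "utterance_id": "s01_e01_c01_u028",
              "transcript": "Let me get you some coffee.",
              "tokens": [["Let", "me", "get", "you", "some", "coffee", "."]]
            }
          ]
        }
      ]
    }
  ]
}
""")


def iter_utterances(season):
    """Yield (episode_id, scene_id, utterance) for every utterance in a season."""
    for episode in season["episodes"]:
        for scene in episode["scenes"]:
            for utterance in scene["utterances"]:
                yield episode["episode_id"], scene["scene_id"], utterance


def count_tokens(utterance):
    """Count tokens across all sentences of one utterance."""
    return sum(len(sentence) for sentence in utterance["tokens"])


for episode_id, scene_id, utt in iter_utterances(SAMPLE):
    print(episode_id, scene_id, utt["utterance_id"], count_tokens(utt))
```

With a real season file, `SAMPLE` would be replaced by the result of `json.load` on that file, provided its keys match the assumed names.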
Some utterances include action notes.
In the following example, extracted from s01_e01_c01_u028, the speaker is talking to Ross, which is indicated by the action note:
"transcript": "Let me get you some coffee.", "transcript_with_note": "(to Ross) Let me get you some coffee.",
The following shows the statistics, including action notes:
- How to retrieve information from the JSON files:
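As a minimal sketch of this kind of retrieval, the snippet below splits an utterance ID such as s01_e01_c01_u028 into its parts and pulls the action note out of `transcript_with_note`. Reading the ID as season, episode, scene, and utterance numbers, and treating every parenthesized span as an action note, are assumptions rather than documented behavior, and the helper names are hypothetical.

```python
import re

# Assumed ID layout: s<season>_e<episode>_c<scene>_u<utterance>.
ID_PATTERN = re.compile(r"^s(\d+)_e(\d+)_c(\d+)_u(\d+)$")

# Assumed note format: text enclosed in a single pair of parentheses.
NOTE_PATTERN = re.compile(r"\(([^()]*)\)")


def parse_utterance_id(utterance_id):
    """Split an utterance ID into integer season/episode/scene/utterance parts."""
    match = ID_PATTERN.match(utterance_id)
    if match is None:
        raise ValueError(f"unexpected utterance ID: {utterance_id!r}")
    season, episode, scene, utterance = (int(g) for g in match.groups())
    return {"season": season, "episode": episode,
            "scene": scene, "utterance": utterance}


def action_notes(utterance):
    """Return the parenthesized notes found in an utterance's annotated transcript."""
    # Fall back to the plain transcript when no annotated version is present.
    text = utterance.get("transcript_with_note") or utterance["transcript"]
    return [note.strip() for note in NOTE_PATTERN.findall(text)]


utterance = {
    "transcript": "Let me get you some coffee.",
    "transcript_with_note": "(to Ross) Let me get you some coffee.",
}
print(parse_utterance_id("s01_e01_c01_u028"))
print(action_notes(utterance))  # ['to Ross']
```

Note that a parenthesized span that is part of the spoken text would also be returned by this pattern, so the result should be treated as a heuristic.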
- Challenging Reading Comprehension on Daily Conversation: Passage Completion on Multiparty Dialog. Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL'18, 2018 (poster).
- Emotion Detection on TV Show Transcripts with Sequence-based Convolutional Neural Networks. Sayyed Zahiri and Jinho D. Choi. In The AAAI Workshop on Affective Content Analysis, AFFCON'18, 2018.
- Cross-domain Document Retrieval: Matching between Conversational and Formal Writings. Tomasz Jurczyk and Jinho D. Choi. In Proceedings of the EMNLP Workshop on Building Linguistically Generalizable NLP Systems, BLGNLP'17, 48-53, Copenhagen, Denmark, 2017 (slides).
- Robust Coreference Resolution and Entity Linking on Dialogues: Character Identification on TV Show Transcripts. Henry Y. Chen, Ethan Zhou, and Jinho D. Choi. In Proceedings of the 21st Conference on Computational Natural Language Learning, CoNLL'17, 216-225, Vancouver, Canada, 2017 (slides).
- Text-based Speaker Identification on Multiparty Dialogues Using Multi-document Convolutional Neural Networks. Kaixin Ma, Catherine Xiao, and Jinho D. Choi. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, ACL:SRW'17, 49-55, Vancouver, Canada, 2017 (poster).
- Character Identification on Multiparty Conversation: Identifying Mentions of Characters in TV Shows. Henry Y. Chen and Jinho D. Choi. In Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue, SIGDIAL'16, 90-100, Los Angeles, CA, 2016 (poster).