Entity linking of personal mentions in multiparty dialogue.


Character Identification

Character Identification is an entity linking task that links each personal mention in a multiparty dialogue to its global entity. Let a mention be a nominal referring to a person (e.g., she, mom, Judy), and an entity be a character in a dialogue. The goal is to assign each mention to its entity, who may or may not participate in the dialogue. In the following example, the mention "mom" does not refer to any of the speakers; nonetheless, it clearly refers to a specific person, Judy Geller, who could appear in some other dialogue. Identifying such mentions as real characters requires cross-document entity resolution, which makes this task challenging.

Character Identification Example

This task is a part of the Character Mining project led by the Emory NLP research group.


All personal mentions are annotated with their global entities. In the above example, the first mention "I" is annotated with its global entity, Ross Geller, the second mention "mom" is annotated with Judy Geller, and so on. Mention detection is first performed automatically and then corrected manually. The entity annotation is mostly crowdsourced, although many annotations are corrected manually by experts.


For each season, episodes 1 to 19 are used for training (TRN), episodes 20 and 21 for development (DEV), and episodes 22 onward for evaluation (TST).
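
The split rule above can be sketched as a small helper. This is a minimal sketch, not code from this repository: the name `split_of` is hypothetical, and it assumes that the episode number can be read from the `e` field of utterance IDs shaped like `s01_e01_c01_u039`, the ID used in this README's JSON example.

```python
import re

# Assumed ID layout, inferred from the example ID "s01_e01_c01_u039":
# season, episode, scene, utterance.
_ID_PATTERN = re.compile(r"^s(\d+)_e(\d+)_c(\d+)_u(\d+)$")


def split_of(utterance_id: str) -> str:
    """Return "TRN", "DEV", or "TST" for an utterance ID (hypothetical helper)."""
    match = _ID_PATTERN.match(utterance_id)
    if match is None:
        raise ValueError(f"unrecognized utterance ID: {utterance_id!r}")
    episode = int(match.group(2))
    if episode <= 19:
        return "TRN"  # episodes 1 to 19
    if episode <= 21:
        return "DEV"  # episodes 20 and 21
    return "TST"      # episodes 22 onward


print(split_of("s01_e01_c01_u039"))  # TRN
```

The same rule applies to every season, so the season number is parsed but not used.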

Dataset Episodes Scenes Utterances Tokens Speakers Mentions Entities
TRN 76 987 18,789 262,650 265 36,385 628
DEV 8 122 2,142 28,523 48 3,932 102
TST 13 192 3,597 50,232 91 7,050 165
Total 97 1,301 24,528 341,405 331 47,367 781


Each utterance is split into sentences, and the personal mentions in every sentence are annotated with their entities. In the example below, the utterance consists of one sentence that includes four mentions. The first three mentions, I, mom and dad, are singular and refer to Ross Geller, Judy Geller and Jack Geller, respectively. The last mention, they, is plural and refers to both Judy Geller and Jack Geller.

{
  "utterance_id": "s01_e01_c01_u039",
  "speakers": ["Ross Geller"],
  "transcript": "I told mom and dad last night, they seemed to take it pretty well.",
  "tokens": [
    ["I", "told", "mom", "and", "dad", "last", "night", ",", "they", "seemed", "to", "take", "it", "pretty", "well", "."]
  ],
  "character_entities": [
    [[0, 1, "Ross Geller"], [2, 3, "Judy Geller"], [4, 5, "Jack Geller"], [8, 9, "Jack Geller", "Judy Geller"]]
  ]
}

Each mention is annotated by the following scheme:

[begin_index, end_index, entity(, entity)*]
  • begin_index: int - the beginning token index of the mention (inclusive).
  • end_index: int - the ending token index of the mention (exclusive).
  • entity: str - the label of the entity.
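
The annotation scheme above can be read with a few lines of Python. This is a minimal sketch rather than the project's own loader: the name `read_mentions` is hypothetical, and the example utterance is embedded as a string so the snippet is self-contained. It assumes that `tokens` and `character_entities` are parallel lists with one item per sentence, as in the example.

```python
import json

# The example utterance from this README, embedded so the sketch is self-contained.
UTTERANCE_JSON = """
{
  "utterance_id": "s01_e01_c01_u039",
  "speakers": ["Ross Geller"],
  "transcript": "I told mom and dad last night, they seemed to take it pretty well.",
  "tokens": [
    ["I", "told", "mom", "and", "dad", "last", "night", ",", "they", "seemed", "to", "take", "it", "pretty", "well", "."]
  ],
  "character_entities": [
    [[0, 1, "Ross Geller"], [2, 3, "Judy Geller"], [4, 5, "Jack Geller"], [8, 9, "Jack Geller", "Judy Geller"]]
  ]
}
"""


def read_mentions(utterance):
    """Yield (mention_text, entities) pairs for every sentence (hypothetical helper)."""
    # Assumption: one token list and one annotation list per sentence.
    for tokens, annotations in zip(utterance["tokens"], utterance["character_entities"]):
        for begin, end, *entities in annotations:
            # begin is inclusive and end is exclusive, so a plain slice works.
            yield " ".join(tokens[begin:end]), entities


utterance = json.loads(UTTERANCE_JSON)
for text, entities in read_mentions(utterance):
    print(f"{text}: {', '.join(entities)}")
# I: Ross Geller
# mom: Judy Geller
# dad: Jack Geller
# they: Jack Geller, Judy Geller
```

Unpacking with `*entities` handles singular and plural mentions uniformly, since every label after the two indices is collected into one list.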



Shared Task