Annotation Guidelines

Guidelines for annotating the ground truth for the data.

These guidelines define how to annotate the ground truth for the Reddit data so that the quality of the extractions can be evaluated. Two kinds of annotations are used in evaluating the extractions: comment annotations and post annotations.

Comment Annotations

Comment annotations should be written in a JSON Lines file where each object has the following keys:

id
the ID attribute of the corresponding comment
label
a gold annotation for the label (one of "AUTHOR", "OTHER", "EVERYBODY", "NOBODY", or "INFO") expressed by the comment, or null if no label is expressed
implied
true if the label is implied by the viewpoint of the comment's author, false if the label is explicitly stated, and null if label is null
spam
true if the comment is spam, false otherwise
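
As a rough illustration of the file format, the sketch below writes and reads a few comment annotations as JSON Lines using Python's standard `json` module. The IDs (`c1`, `c2`, `c3`) and the helper names `to_json_lines` and `from_json_lines` are hypothetical and chosen only for this example; they are not part of the dataset or its tooling.

```python
import json

# Hypothetical comment IDs and values, for illustration only.
records = [
    {"id": "c1", "label": "AUTHOR", "implied": False, "spam": False},
    {"id": "c2", "label": "OTHER", "implied": True, "spam": False},
    {"id": "c3", "label": None, "implied": None, "spam": True},
]


def to_json_lines(rows):
    """Serialize each annotation as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"


def from_json_lines(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


serialized = to_json_lines(records)
```

Python's `None`, `True`, and `False` serialize to the JSON values `null`, `true`, and `false` used in these guidelines.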

The possible labels are:

AUTHOR
The author of the anecdote is in the wrong.
OTHER
The other person in the anecdote is in the wrong.
EVERYBODY
Everyone in the anecdote is in the wrong.
NOBODY
No one in the anecdote is in the wrong.
INFO
More information is required to make a judgment.

If the comment explicitly expresses a label, either by its initialism or by a phrase corresponding to the initialism, then use that label for the comment. In this case, mark implied as false and spam as false.

If the comment expresses multiple labels with no clear winner or is otherwise ambiguous, mark label as null, implied as null, and spam as true.

If the comment expresses no labels explicitly but still has a viewpoint that clearly expresses one of the labels, then use that label for the comment. Mark implied as true and spam as false.

Finally, if the comment expresses no label (i.e., none of AUTHOR, OTHER, NOBODY, EVERYBODY, or INFO), then mark label as null, implied as null, and spam as true.
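
The four cases above can be turned into a simple consistency check. The following is a minimal sketch, not part of any existing tooling: the function name `check_comment_annotation` is hypothetical, and it assumes the four cases are exhaustive, so a record with a label always has spam set to false and a record without a label always has spam set to true.

```python
# The five labels defined in these guidelines.
LABELS = ("AUTHOR", "OTHER", "EVERYBODY", "NOBODY", "INFO")


def check_comment_annotation(ann):
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []
    label = ann.get("label")
    implied = ann.get("implied")
    spam = ann.get("spam")

    if label is None:
        # Ambiguous comments and comments with no label:
        # implied must be null and spam must be true.
        if implied is not None:
            problems.append("implied must be null when label is null")
        if spam is not True:
            problems.append("spam must be true when label is null")
    elif label in LABELS:
        # Explicit or implied label: implied is a boolean and spam is false.
        if not isinstance(implied, bool):
            problems.append("implied must be true or false when a label is set")
        if spam is not False:
            problems.append("spam must be false when a label is set")
    else:
        problems.append(f"unknown label: {label!r}")
    return problems
```

For example, `check_comment_annotation({"id": "c1", "label": "AUTHOR", "implied": False, "spam": False})` returns an empty list, while a record with a label but a null implied value is reported as inconsistent.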

Post Annotations

Post annotations should be written in a JSON Lines file where each object has the following keys:

id
the ID attribute of the corresponding post
post_type
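a gold annotation for the post's type (one of "HISTORICAL", "HYPOTHETICAL", or "META"), or null if the post cannot be categorized into one of these types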
implied
true if the post type is not explicitly stated in the post title, false if it is, and null if post_type is null
spam
true if the post is spam, false otherwise

Possible post types are:

HISTORICAL
The author is asking "am I the a**hole?"
HYPOTHETICAL
The author is asking "would I be the a**hole?"
META
The post is about the subreddit itself.

If the post type is explicitly stated in the post title, then mark post_type as the stated post type, implied as false, and spam as false. The exception is META: if the post type is META, mark spam as true. Additionally, if the post type is explicitly stated but clearly wrong (such as using HISTORICAL for a HYPOTHETICAL post), then use the true post type rather than the stated one.

If the post type is not explicitly stated in the post title, but otherwise clear from the post, mark the appropriate post type, mark implied as true and spam as false.

If the post cannot be categorized into one of the types above, mark the post_type as null, implied as null, and spam as true.

If the post is something that should not be present in the dataset (for example a deleted post), then mark spam as true.
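
The post rules can be checked in a similar way. This is a hedged sketch under stated assumptions: the function name `check_post_annotation` is hypothetical, and because the guidelines allow spam to be true for posts that should not be in the dataset (such as deleted posts), the check accepts spam set to true for any post type. It only enforces the rules that are stated unambiguously above.

```python
# The three post types defined in these guidelines.
POST_TYPES = ("HISTORICAL", "HYPOTHETICAL", "META")


def check_post_annotation(ann):
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []
    post_type = ann.get("post_type")
    implied = ann.get("implied")
    spam = ann.get("spam")

    if post_type is None:
        # Uncategorizable post: implied must be null and spam must be true.
        if implied is not None:
            problems.append("implied must be null when post_type is null")
        if spam is not True:
            problems.append("spam must be true when post_type is null")
    elif post_type not in POST_TYPES:
        problems.append(f"unknown post_type: {post_type!r}")
    else:
        if not isinstance(implied, bool):
            problems.append("implied must be true or false when post_type is set")
        if not isinstance(spam, bool):
            problems.append("spam must be true or false")
        elif post_type == "META" and implied is False and not spam:
            # Explicitly stated META posts are always marked as spam.
            problems.append("an explicitly stated META post must have spam set to true")
    return problems
```

A record such as `{"id": "p1", "post_type": "META", "implied": False, "spam": False}` (with a hypothetical ID) would be flagged, since an explicitly stated META post should have spam set to true.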