Guidelines for annotating the ground truth for the data.
These guidelines define how to annotate the ground truth for the Reddit data so that extraction quality can be evaluated. Two kinds of annotations are used in evaluating the extractions: comment annotations and post annotations.
Comment annotations should be written in a JSON Lines file where each object has the following keys:
id
- the ID attribute of the corresponding comment
label
- a gold annotation for the label (one of "AUTHOR", "OTHER", "EVERYBODY", "NOBODY", or "INFO") expressed by the comment, or null if no label is expressed
implied
- true if the label is implied by the view of the author and false if the label is somehow explicitly stated
spam
- true if the comment is spam, false otherwise
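
As an illustration of the file format, the following minimal sketch serializes two comment annotations as JSON Lines with Python's standard json module. The IDs "c1" and "c2" and the helper name to_jsonl are hypothetical placeholders chosen for this example; they do not come from the dataset or from any existing tooling.

```python
import json

# Hypothetical annotations; the IDs are placeholders for illustration only.
annotations = [
    # A label stated explicitly in the comment.
    {"id": "c1", "label": "AUTHOR", "implied": False, "spam": False},
    # A label that is only implied by the commenter's view.
    {"id": "c2", "label": "OTHER", "implied": True, "spam": False},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


jsonl_text = to_jsonl(annotations)
```

Writing jsonl_text to a file produces one annotation object per line, which is the layout described above.
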
The possible labels are:
AUTHOR
- The author of the anecdote is in the wrong.
OTHER
- The other person in the anecdote is in the wrong.
EVERYBODY
- Everyone in the anecdote is in the wrong.
NOBODY
- No one in the anecdote is in the wrong.
INFO
- More information is required to make a judgment.
If the comment explicitly expresses a label, either by its initialism or by some phrase corresponding to the initialism, then use that label for the comment. In that case, mark implied as false and spam as false.
If the comment expresses multiple labels with no clear winner or is otherwise ambiguous, mark label as null, implied as null, and spam as true.
If the comment expresses no labels explicitly but still has a viewpoint that clearly expresses one of the labels, then use that label for the comment. Mark implied as true and spam as false.
Finally, if the comment expresses no label (i.e., none of AUTHOR, OTHER, NOBODY, EVERYBODY, or INFO), then mark label as null, implied as null, and spam as true.
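
The four cases above imply a few consistency constraints that can be checked mechanically. The sketch below is one possible checker, written under the assumption that these four cases are exhaustive (so a non-null label always goes with spam set to false). The function name check_comment_annotation is hypothetical and is not part of any existing tooling.

```python
LABELS = {"AUTHOR", "OTHER", "EVERYBODY", "NOBODY", "INFO"}


def check_comment_annotation(annotation):
    """Return a list of problems found in one comment annotation.

    An empty list means the annotation is consistent with the rules,
    assuming the four cases in the guidelines are exhaustive.
    """
    problems = []
    label = annotation.get("label")
    implied = annotation.get("implied")
    spam = annotation.get("spam")

    if label is None:
        # Ambiguous comment or no label at all: implied is null, spam is true.
        if implied is not None:
            problems.append("implied must be null when label is null")
        if spam is not True:
            problems.append("spam must be true when label is null")
    elif label in LABELS:
        # Explicit or implied label: implied is a boolean, spam is false.
        if not isinstance(implied, bool):
            problems.append("implied must be true or false when a label is given")
        if spam is not False:
            problems.append("spam must be false when a label is given")
    else:
        problems.append(f"unknown label: {label!r}")

    return problems
```

Running such a check over every line of the annotation file before evaluation can catch slips such as a null label that is not marked as spam.
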
Post annotations should be written in a JSON Lines file where each object has the following keys:
id
- the ID attribute of the corresponding post
post_type
- a gold annotation for the post's type
implied
- true if the post type is not explicitly stated in the post title
spam
- true if the post is spam, false otherwise
Possible post types are:
HISTORICAL
- The author is asking "am I the a**hole?"
HYPOTHETICAL
- The author is asking "would I be the a**hole?"
META
- The post is about the subreddit itself.
If the post type is explicitly stated in the post title, then mark post_type as the stated post type, mark implied as false, and mark spam as false, unless the post type is META, in which case mark spam as true. Additionally, if the post type is explicitly stated but clearly wrong (such as using HISTORICAL for a HYPOTHETICAL post), then use the true post type rather than the stated one.
If the post type is not explicitly stated in the post title but is otherwise clear from the post, mark the appropriate post type, mark implied as true, and mark spam as false.
If the post cannot be categorized into one of the types above, mark the post_type as null, implied as null, and spam as true.
If the post is something that should not be present in the dataset (for example, a deleted post), then mark spam as true.
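
The post rules above can be summarized as a small function that turns an annotator's judgments into an annotation object. This is only a sketch of one reading of the rules: the function name annotate_post and its parameters are hypothetical, and the sketch applies the META exception only when the type is stated in the title, because that is the only case in which the guidelines mention it.

```python
POST_TYPES = {"HISTORICAL", "HYPOTHETICAL", "META"}


def annotate_post(post_id, post_type, stated_in_title, exclude=False):
    """Build a post annotation from an annotator's judgments.

    post_type: the true post type, or None if the post cannot be categorized.
    stated_in_title: whether the post type is explicitly stated in the title.
    exclude: whether the post should not be in the dataset (e.g. deleted).
    """
    if post_type is None:
        # Uncategorizable post: everything is null and the post is spam.
        return {"id": post_id, "post_type": None, "implied": None, "spam": True}
    if post_type not in POST_TYPES:
        raise ValueError(f"unknown post type: {post_type!r}")

    # The type is implied exactly when the title does not state it.
    implied = not stated_in_title
    # Explicitly stated META posts are spam, as are posts to be excluded.
    spam = bool(exclude or (post_type == "META" and stated_in_title))
    return {"id": post_id, "post_type": post_type, "implied": implied, "spam": spam}
```

For a title that states the wrong type, pass the true type as post_type, since the guidelines ask for the true post type rather than the stated one.
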