Coreference of speaker and addressee #177
Comments
Thanks for reporting - some of these are definitely wrong; the problem seems to be mostly in the XML speaker list rather than in the coref clustering. This is especially true for However, the zip files in their entirety contain too many false positives to go over manually, especially since, as you guessed, generic 'you' (meaning "one") is not clustered with actual referential 'you'. Is there some way you could produce a narrower list of candidates for any further errors? At a minimum, I think "you know" should be assumed to be generic; this is almost always true unless it is followed by a complement. BTW, if you're curious, you can find the original speaker info for the conversation data in the Santa Barbara Corpus, for example: https://www.linguistics.ucsb.edu/sites/secure.lsit.ucsb.edu.ling.d7/files/sitefiles/research/SBC/SBC048.trn . The addressee info was filled in by annotators based on their understanding of the conversation.
I've filtered out the "you know" occurrences: Also, in speaker_inverse, I've removed the entities containing multiple people. However, I don't know of any simple way to differentiate between generic and referential 'you'. Sometimes there is also indirect speech in the text, which generates some false positives as well, and which I cannot detect either.
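For reference, the "you know" filter discussed above could look roughly like the sketch below. It is not the script actually used for the attached files; the function name and the list of complement-introducing words are assumptions made up for illustration, and the input is assumed to be a plain list of word forms for one sentence.

```python
# Hypothetical list of words that may introduce a complement after "you know".
# This list is an assumption for illustration, not something taken from GUM.
COMPLEMENT_STARTERS = {
    "that", "what", "how", "why", "where", "who", "when", "if", "whether",
}


def referential_you_candidates(tokens):
    """Return indices of 'you' tokens that are not likely generic 'you know'.

    A 'you' directly followed by 'know' is skipped as generic, unless the
    token after 'know' looks like the start of a complement.
    """
    lowered = [t.lower() for t in tokens]
    keep = []
    for i, tok in enumerate(lowered):
        if tok != "you":
            continue
        is_you_know = i + 1 < len(lowered) and lowered[i + 1] == "know"
        has_complement = (
            i + 2 < len(lowered) and lowered[i + 2] in COMPLEMENT_STARTERS
        )
        if is_you_know and not has_complement:
            continue  # treat as generic discourse-marker "you know"
        keep.append(i)
    return keep
```

A word-list check like this is only a rough stand-in for "followed by a complement"; it does nothing about generic 'you' outside of "you know" or about indirect speech.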
This took a while to get to, but the speaker inverse cases should now be resolved in the source files. The compiled files in the remaining formats, incl. conllu, will propagate on the next release. Thanks again for reporting the issue!
I've looked into files where speaker and addressee are annotated. (Files from dep/*.conllu.) It turns out that:
I am attaching files with all the occurrences of suspicious coreference across the files with annotated speaker and addressee. Not all of these are necessarily errors -- for example, in the sentence "...,you know,..." the word "you" is annotated as a different entity from the addressee, which, I think, is deliberate.
GUM_speaker.zip
GUM_speaker_inverse.zip