Coreference of speaker and addressee #177

Closed
kybersutr opened this issue Dec 19, 2023 · 3 comments

@kybersutr

kybersutr commented Dec 19, 2023

I've looked into the files where speaker and addressee are annotated (the files from dep/*.conllu). It turns out that:

  • The same person is sometimes not coreferent with itself across multiple sentences. For example, in GUM_conversation_christmas:

Dan

d1.19: GUM_conversation_christmas-36#2

d1.2: GUM_conversation_christmas-64#1

d1.10: GUM_conversation_christmas-89#1, GUM_conversation_christmas-98#4, GUM_conversation_christmas-104#3, GUM_conversation_christmas-185#1

d1.37: GUM_conversation_christmas-105#1

  • There are also occurrences of the opposite problem: two people are marked as the same entity, e.g. in GUM_conversation_grounded:

d1.4

Sabrina: GUM_conversation_grounded-3#1, GUM_conversation_grounded-3#13

Kendra: GUM_conversation_grounded-70#1, GUM_conversation_grounded-15#2, GUM_conversation_grounded-51#14, GUM_conversation_grounded-53#21, GUM_conversation_grounded-75#2, GUM_conversation_grounded-24#17, GUM_conversation_grounded-47#11, GUM_conversation_grounded-70#9, GUM_conversation_grounded-83#1, GUM_conversation_grounded-84#8, GUM_conversation_grounded-16#3, GUM_conversation_grounded-20#8, GUM_conversation_grounded-18#22, GUM_conversation_grounded-69#3, GUM_conversation_grounded-82#8, GUM_conversation_grounded-106#12, GUM_conversation_grounded-19#10, GUM_conversation_grounded-13#1, GUM_conversation_grounded-74#2, GUM_conversation_grounded-53#1, GUM_conversation_grounded-10#6, GUM_conversation_grounded-18#9, GUM_conversation_grounded-18#2, GUM_conversation_grounded-69#14, GUM_conversation_grounded-11#1, GUM_conversation_grounded-22#12, GUM_conversation_grounded-76#20, GUM_conversation_grounded-83#4, GUM_conversation_grounded-84#2, GUM_conversation_grounded-66#1, GUM_conversation_grounded-82#1, GUM_conversation_grounded-48#9, GUM_conversation_grounded-74#7, GUM_conversation_grounded-83#8, GUM_conversation_grounded-79#1, GUM_conversation_grounded-24#8, GUM_conversation_grounded-53#18, GUM_conversation_grounded-55#1, GUM_conversation_grounded-106#3, GUM_conversation_grounded-18#30, GUM_conversation_grounded-24#12, GUM_conversation_grounded-37#1, GUM_conversation_grounded-47#8, GUM_conversation_grounded-73#4, GUM_conversation_grounded-115#10, GUM_conversation_grounded-30#1, GUM_conversation_grounded-110#4, GUM_conversation_grounded-73#10, GUM_conversation_grounded-19#2, GUM_conversation_grounded-53#25, GUM_conversation_grounded-42#1, GUM_conversation_grounded-69#18, GUM_conversation_grounded-18#28, GUM_conversation_grounded-21#8, GUM_conversation_grounded-18#15, GUM_conversation_grounded-93#1, GUM_conversation_grounded-76#6

I am attaching files with all the occurrences of suspicious coreference across the files with annotated speaker and addressee. Not all of these are necessarily errors. For example, in the sentence "...,you know,..." the word "you" is annotated as a different entity from the addressee, which, I think, is deliberate.

GUM_speaker.zip
GUM_speaker_inverse.zip
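
For readers who want to reproduce this kind of check, here is a minimal sketch of the two tests described above. It is not the script used to build the attached files. It assumes the mentions have already been extracted into hypothetical `(person, entity_id, mention_id)` tuples, where `person` is the speaker for first-person mentions and the addressee for second-person mentions, and that the input comes from a single document so that entity IDs are unambiguous.

```python
from collections import defaultdict

def find_suspicious(mentions):
    """Group mentions two ways and report the inconsistent groups.

    mentions: iterable of (person, entity_id, mention_id) tuples
              (hypothetical format, one document at a time).
    Returns (split, merged):
      split  - persons whose mentions fall into more than one entity
      merged - entities whose mentions belong to more than one person
    """
    by_person = defaultdict(lambda: defaultdict(list))
    by_entity = defaultdict(lambda: defaultdict(list))
    for person, entity, mention in mentions:
        by_person[person][entity].append(mention)
        by_entity[entity][person].append(mention)
    split = {p: dict(ents) for p, ents in by_person.items() if len(ents) > 1}
    merged = {e: dict(ppl) for e, ppl in by_entity.items() if len(ppl) > 1}
    return split, merged

# Two of the Dan mentions listed above: same person, different entities.
christmas = [
    ("Dan", "d1.19", "GUM_conversation_christmas-36#2"),
    ("Dan", "d1.2", "GUM_conversation_christmas-64#1"),
]
split, _ = find_suspicious(christmas)
print(sorted(split["Dan"]))  # entity IDs that Dan is split across
```

Both result sets are only candidates: generic "you" and reported speech will show up in them even when the annotation is correct.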

amir-zeldes added a commit that referenced this issue Dec 20, 2023
@amir-zeldes
Owner

Thanks for reporting. Some of these are definitely wrong; the errors seem to be mostly in the XML speaker list rather than in the coref clustering. This is especially true for GUM_conversation_christmas, which is very complex. I was able to fix some of these in 85b042d now.

However, the entire zip files contain too many false positives to go over manually, esp. since, as you guessed, generic 'you' (meaning "one") is not clustered with actual referential 'you'. Is there some way you could produce a narrower list of candidates for any further errors? At a minimum, I think "you know" should be assumed to be generic; this is almost always true unless it is followed by a complement.
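
The "you know" heuristic could be sketched roughly as follows. This is only an illustration under stated assumptions: tokens are hypothetical dicts with CoNLL-U-like `id`, `form`, `head` and `deprel` fields, and a complement of "know" is assumed to appear as a `ccomp`, `xcomp` or `obj` dependent of it. The function name is made up for this example.

```python
COMPLEMENT_RELS = {"ccomp", "xcomp", "obj"}  # assumption: how a complement is attached

def is_generic_you_know(tokens, i):
    """Return True if tokens[i] is the 'you' of a complement-less 'you know'.

    tokens: list of dicts with 'id', 'form', 'head', 'deprel'
            (hypothetical, CoNLL-U-like; one sentence).
    """
    if tokens[i]["form"].lower() != "you" or i + 1 >= len(tokens):
        return False
    know = tokens[i + 1]
    if know["form"].lower() != "know":
        return False
    # "you know" followed by a complement is treated as referential, so keep it
    has_complement = any(
        t["head"] == know["id"] and t["deprel"] in COMPLEMENT_RELS
        for t in tokens
    )
    return not has_complement

# "It was , you know , late" : discourse-marker use, no complement
marker = [
    {"id": 1, "form": "It", "head": 7, "deprel": "nsubj"},
    {"id": 2, "form": "was", "head": 7, "deprel": "cop"},
    {"id": 3, "form": ",", "head": 5, "deprel": "punct"},
    {"id": 4, "form": "you", "head": 5, "deprel": "nsubj"},
    {"id": 5, "form": "know", "head": 7, "deprel": "parataxis"},
    {"id": 6, "form": ",", "head": 5, "deprel": "punct"},
    {"id": 7, "form": "late", "head": 0, "deprel": "root"},
]
print(is_generic_you_know(marker, 3))
```

Candidates for which this returns True could be dropped from the list; everything else would still need manual review.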

BTW if you're curious, you can find the original speaker info for the conversation data in the Santa Barbara Corpus, for example: https://www.linguistics.ucsb.edu/sites/secure.lsit.ucsb.edu.ling.d7/files/sitefiles/research/SBC/SBC048.trn . The addressee info was filled in by annotators based on their understanding of the conversation.

@kybersutr
Author

I've filtered out the "you know" occurrences:
GUM_speaker_inverse2.zip
GUM_speaker2.zip

Also, in the speaker_inverse, I've removed the entities containing multiple people.

However, I don't know of any simple way to differentiate between generic and referential "you". Also, the text sometimes contains indirect speech, which generates further false positives and which I cannot detect either.

amir-zeldes added a commit that referenced this issue Mar 26, 2024
@amir-zeldes
Owner

This took a while to get to, but the speaker inverse cases should now be resolved in the source files. The compiled files in the remaining formats incl. conllu will propagate on the next release. Thanks again for reporting the issue!
