Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add topic names for coherence scores (fix #1016) #1019

Merged
merged 2 commits into from Aug 11, 2020

Conversation

JeanPaulShapo
Copy link
Contributor

Currently there is inappropriate use of topic_name field for labeling coherence scores.
While coherence field of TopTokensScore has one number per requested topics, topic_name field has at most num_tokens numbers per requested topics.

We may true to use some parts of topic_name filed, e.g. we can try to get each num_tokens-th element, but there are at least two problems:

  • It complicates code
  • In fact it's wrong: actual number of gathered top tokens per topic can be less than num_tokens, because tokens with small probability cannot be gathered.

Instead of this approach, let's simply add another field into protobuf message TopTokensScore which contains topic_names and is aligned with coherence values.

@ofrei
Copy link
Contributor

ofrei commented Aug 10, 2020

@JeanPaulShapo I think that's OK in this particular case - can't think of cases where serializing TopTokensScore is critical. I though it's good to be careful about protobuf serialization (particularly in messages related to Batches, Dictionaries and MasterModelConfig - TopTokensScore is none of those).

@JeanPaulShapo JeanPaulShapo merged commit eb1d788 into bigartm:master Aug 11, 2020
@JeanPaulShapo JeanPaulShapo deleted the fix/misaligned-coherence branch August 11, 2020 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants