This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
@Golovneva
The code returns 20 scores (namely faithfulness, informativeness_step, informativeness_chain, faithfulness_ww, repetition_word, repetition_step, reasoning_alignment, external_hallucination, redundancy, common_sense_error, missing_step, semantic_coverage_step, semantic_coverage_chain, discourse_representation, coherence_step_vs_step, perplexity_step, perplexity_chain, perplexity_step_max, grammar_step, grammar_step_max). While some of these correspond exactly to the scores defined in the paper, others don't (such as discourse_representation, coherence_step_vs_step, etc.).
(1) Can you let me know which score in the code corresponds to which definition in the paper?
(2) For almost all the scores, it looks like the higher the score, the better the model's reasoning. However, the opposite seems to hold for scores such as repetition_step. Can you also clarify what the best score is (i.e., 0 or 1) for each of these 20 scores?
(1) perplexity_step_max and grammar_step_max showed the same behavior as their mean versions (perplexity_step and grammar_step), so we report only the latter.
(2) For all scores, the higher the better. For repetition_step specifically, a score of 1 would mean that all steps are completely different from each other (in practice, this only occurs when there is a single step in the chain and nothing to compare against), while 0 means that two steps/sentences in the chain are completely identical.
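To make the repetition_step semantics concrete, here is a minimal, hypothetical sketch of that scoring logic: one minus the maximum pairwise similarity between steps, with the degenerate single-step case scored as 1. Note that the function names are illustrative and the Jaccard word overlap below is a stand-in for the actual similarity measure used in the ROSCOE code, which this sketch does not reproduce.

```python
from itertools import combinations

def step_similarity(a: str, b: str) -> float:
    """Jaccard word overlap between two steps.

    Illustrative stand-in for the real similarity measure; returns 1.0 for
    identical steps and 0.0 for steps with no words in common.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def repetition_step(steps: list[str]) -> float:
    """1 - max pairwise similarity over all step pairs.

    1.0 = no pair is similar (including the single-step case, where there
    is nothing to compare); 0.0 = at least two steps are identical.
    """
    if len(steps) < 2:
        return 1.0
    return 1.0 - max(step_similarity(a, b) for a, b in combinations(steps, 2))

print(repetition_step(["only one step"]))                      # 1.0 (nothing to compare)
print(repetition_step(["the sum is five", "the sum is five"]))  # 0.0 (identical steps)
```

Under this framing, "higher is better" holds: a chain that repeats itself verbatim bottoms out at 0, and any partial overlap lands strictly between 0 and 1.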