
quantify consistency improvement #93

Merged: 14 commits merged into main from 76-quantify-consistency-improvement, May 28, 2024

Conversation

JohnShiuMK
Collaborator

Closes #76

@JohnShiuMK JohnShiuMK self-assigned this May 22, 2024
@tonyshumlh
Collaborator

@JohnShiuMK I'll add my two cents here. Although the result should be the same after the adjustment, here are some suggested changes:

1. In the example, I notice that var_curr is actually smaller than var_prev, which means F = var_prev / var_curr, and the p-value should therefore be `1 - scipy.stats.f.cdf(F, df_prev, df_curr)`. You might want to make the code flexible, keeping the degrees of freedom in the same order as the variance ratio:

   ```python
   if var_prev < var_curr:
       F = var_curr / var_prev
       p_value = 1 - scipy.stats.f.cdf(F, df_curr, df_prev)
   else:
       F = var_prev / var_curr
       p_value = 1 - scipy.stats.f.cdf(F, df_prev, df_curr)
   ```

2. I recommend a two-tailed test rather than a one-tailed test. In the example, H_0 is var_prev <= var_curr and H_A is var_prev > var_curr, and the p-value of 0.33 fails to reject H_0. However, our point of interest is whether var_curr < var_prev. Therefore, I recommend H_0: var_prev = var_curr and H_A: var_prev != var_curr. If the sample var_curr is less than the sample var_prev and the result is significant, that implies the population var_curr is less than the population var_prev.
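The suggested two-tailed test can be sketched as a small helper; the function name and signature here are illustrative, not from the repository:

```python
import scipy.stats


def two_tailed_f_test(var_prev, df_prev, var_curr, df_curr):
    """Two-tailed F-test for equality of two variances.

    H0: var_prev == var_curr, HA: var_prev != var_curr.
    Returns the two-tailed p-value.
    """
    F = var_prev / var_curr  # ratio of the two sample variances
    # Probability in the lower tail of F(df_prev, df_curr) at the observed ratio.
    cdf = scipy.stats.f.cdf(F, df_prev, df_curr)
    # Two-tailed p-value: twice the smaller tail probability.
    return 2 * min(cdf, 1 - cdf)
```

Because the p-value is built from both tails, the result is the same whichever variance is placed in the numerator, so no `if var_prev < var_curr` branching is needed.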

@JohnShiuMK
Collaborator Author

> @JohnShiuMK I add my 2 cents here. [...] you might want to make the code flexible [...] I recommend doing two-tailed test rather than one-tailed test. [...]

Wait, I thought F is already defined in the way you specified?
[Screenshot, 2024-05-22: the existing F definition in the notebook]

@JohnShiuMK
Collaborator Author

According to #72 and #99, I will update the consistency F-test from one-tailed to two-tailed (keeping it in the Jupyter notebook).

  • (optional) Look into a stats package if time permits
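A minimal sketch of what the updated two-tailed check could look like end to end. The score arrays below are made up for illustration; the real notebook's data and variable names may differ:

```python
import numpy as np
import scipy.stats

# Hypothetical per-run scores before and after refactoring.
scores_prev = np.array([0.71, 0.88, 0.64, 0.93, 0.70])  # week 3 (old code base)
scores_curr = np.array([0.80, 0.82, 0.79, 0.83, 0.81])  # week 4 (refactored)

# Sample variances (ddof=1) and degrees of freedom (n - 1).
var_prev = scores_prev.var(ddof=1)
var_curr = scores_curr.var(ddof=1)
df_prev, df_curr = len(scores_prev) - 1, len(scores_curr) - 1

# Two-tailed F-test: H0 var_prev == var_curr, HA var_prev != var_curr.
F = var_prev / var_curr
cdf = scipy.stats.f.cdf(F, df_prev, df_curr)
p_value = 2 * min(cdf, 1 - cdf)

# A small p-value together with var_curr < var_prev suggests the refactored
# code produces more consistent scores.
print(f"F = {F:.3f}, p = {p_value:.4f}")
```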

@JohnShiuMK JohnShiuMK marked this pull request as ready for review May 27, 2024 16:48
@JohnShiuMK
Collaborator Author

@tonyshumlh I have updated the demo to a two-tailed test; it's ready to review and merge, thanks.

@JohnShiuMK JohnShiuMK marked this pull request as draft May 27, 2024 19:28
@JohnShiuMK
Collaborator Author

> @tonyshumlh I have updated the demo to 2-tail test, ready to review and merge, thanks

Hold on, the file and code structure are actually messy. Let me tidy them up a bit first, sorry.

@JohnShiuMK JohnShiuMK marked this pull request as ready for review May 28, 2024 00:10
@JohnShiuMK
Collaborator Author

Hi @SoloSynth1 @tonyshumlh

In this demo of the F-score comparison, I'm comparing the code from week 3 (before refactoring, i.e. the old code base) with week 4 (after refactoring). Therefore, I have to keep the old code base (archive/analyze.py) and adjust the ConsistencyEvaluator (archive/llm_eval/consistency_eval.py) so that it also works with the old code.

We may delete these in the future once we compare newer versions, but for now I think it's better to keep a record of this comparison in case someone asks for it.

To avoid disturbing the latest code base, I've put everything related to the demo and the old code base under archive/.

What do you think? Do you have a better way to proceed when we encounter situations like this?

@JohnShiuMK
Collaborator Author

I have added a note here to avoid confusion about the archive/ directory.

I think we can merge for now, and if we need to clean up later we could create another PR. What do you think?

Collaborator

@tonyshumlh tonyshumlh left a comment


The two-tailed F-test p-value calculation looks good to me.

@JohnShiuMK JohnShiuMK merged commit 577363d into main May 28, 2024
@JohnShiuMK JohnShiuMK deleted the 76-quantify-consistency-improvement branch May 28, 2024 19:33
Successfully merging this pull request may close these issues.

Add metric (e.g. from regression model) to quantify the improvement of consistency