
Maybe use Bayesian average rather than dropping gihwr if games < 200 #5

Closed
maqifrnswa opened this issue Mar 18, 2022 · 3 comments

@maqifrnswa

In this line:

gihwr = float(gihwr) * 100 if int(card["ever_drawn_game_count"]) > 200 else 0.0

gihwr is zeroed out if there isn't enough data (card["ever_drawn_game_count"] <= 200).

As an alternative, maybe use a Bayesian average.

To do that, replace

gihwr = float(gihwr) * 100 if int(card["ever_drawn_game_count"]) > 200 else 0.0

with

winCount = float(gihwr) * int(card["ever_drawn_game_count"])
gihwr = 100.0 * (winCount + 10) / (int(card["ever_drawn_game_count"]) + 20)

The numbers 10 and 20 come from the Bayesian prior, assuming a beta distribution (basically assuming that an unknown random card has an expected win rate of 50% and is probably in the range of 40-60%). This "correction" also effectively disappears smoothly as ever_drawn_game_count approaches 200.
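
For illustration, here is a minimal sketch of that replacement as a standalone helper (the function name and the prior pseudo-count parameters are mine, not from the repo), along with a quick check that the correction fades as the game count grows:

import sys

# Hypothetical helper, not from the repo: applies the Bayesian average above,
# where gihwr is the raw win rate as a fraction and games is
# card["ever_drawn_game_count"]. The 10/20 pseudo-counts encode the
# "50%, roughly 40-60%" prior described above.
def bayesian_gihwr(gihwr, games, prior_wins=10.0, prior_games=20.0):
    wins = float(gihwr) * int(games)
    return 100.0 * (wins + prior_wins) / (int(games) + prior_games)

# The correction fades as the game count grows well past the prior pseudo-counts:
for games in (10, 50, 200, 1000):
    print(games, round(bayesian_gihwr(0.60, games), 1))
# 10 -> 53.3, 50 -> 57.1, 200 -> 59.1, 1000 -> 59.8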

[by the way, nice job fixing the "auto averaging" of the top two colors, I agree that picking the greatest of the two is a much better strategy!]

@bstaple1
Owner

bstaple1 commented Mar 20, 2022

Thanks for the suggestion.

I wanted to confirm that the correction disappears, so I did some calculations. What I'm seeing are some inaccurate values for game counts between 200 and 5000. Should I only apply this equation for counts less than 200, and just use gihwr * 100 if the count is equal to or greater than 200?

[attached spreadsheet screenshot: bayesian_average]

@maqifrnswa
Author

maqifrnswa commented Mar 22, 2022

Short answer with an intuitive explanation: you don't need to put any conditions on it. The equation works for all game counts. What you're seeing happens when the GIHWR is 80%, which is far outside of what is expected (I designed it for a prior distribution that basically falls between 40-60%). If you repeat that for a GIHWR between 40-60%, it will "look" better. However, I'd actually argue that the spreadsheet you showed is working as intended: it's so unlikely that a card has an 80% win rate that you really would need about 5000 games to be confident the GIHWR really is 80%.

Longer (theoretical) answer: The theory behind this is to use a Bayesian average instead of a simple average, because Bayesian averages are much better at estimating what the true GIHWR will be when the number of observations is small. To do that, we are trying to find the "true" GIHWR of a card using results from individual games as the observations. The likelihood (probability) of the true GIHWR being a certain value given a number of wins and losses follows a beta distribution. If we have prior knowledge/expectations of what the likely GIHWR is, even before playing a single game, we can use that as a "prior distribution" to help estimate the GIHWR when the number of observations (games) is small. The formula works for both small and large numbers of observations.

A little more theoretical: the probability of getting x wins out of n games given a GIHWR of p is a binomial distribution. The likelihood of the true GIHWR being p after x wins and y losses follows a beta distribution with alpha = x+1 and beta = y+1. The maximum likelihood estimate (MLE) of the true GIHWR is the mode of that beta distribution, which is (alpha-1)/(alpha+beta-2) = x/(x+y). That is, with no prior information, the most likely estimate of the GIHWR is simply the average. Here the problem is obvious: let's say someone goes 7-3 with some random splash. The GIHWR will be estimated as 70%, which is probably far from the true GIHWR. To fix that, I proposed putting in some prior information about what the GIHWR is expected to be, even before playing a single game. This is known as Bayesian inference of the true GIHWR based on prior knowledge and observed results.

To do that, I say, "OK, what do I think the GIHWR of a random card should be before I even play a single game?" I estimated that it should probably be 50%, and probably in the range between 40-60%. I then look at the beta distribution to find the alpha and beta that get me that, which is about alpha = 11 and beta = 11 (see the Wikipedia page for the mean and variance of a beta distribution). To do the Bayesian inference, you multiply the pdf of the prior with the likelihood function based on observations. Luckily for us, a beta prior combined with the game results is just another beta distribution: with a beta(11, 11) prior and x wins and y losses, the posterior is beta(x+11, y+11). Taking the mode of that posterior gives the estimate of the GIHWR as:

MLE of GIHWR = (wins+10)/(games played +20)
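
To make that concrete with the 7-3 splash example from above: the plain average gives 7/10 = 70%, while this estimate gives (7+10)/(10+20) = 17/30 ≈ 56.7%, which stays much closer to the prior until more games accumulate.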

More rigorous approach: I just completely guessed the expected range of 40-60% with a mean of 50%. If you want to be more rigorous, you can get the GIHWR values for all cards on 17lands and fit them to a beta distribution. Extract the alpha and beta from that fit and use the following equation:

MLE of GIHWR = (wins+alpha_extracted-1)/(games played + alpha_extracted+beta_extracted - 2)

but, honestly, it won't matter that much since the effective "correction" disappears as the number of games becomes much larger than the alpha + beta of the prior (thus my estimate of alpha + beta ≈ 20 means the "correction" disappears at roughly 10x that number, or around 200 games).

EDIT: actually, the quick and dirty way of estimating the alpha and beta of the prior based on real data would probably just be to find the mean and variance of the 17lands GIHWR dataset and use those two numbers to estimate alpha and beta. But I still think it's probably not worth it, as the quick and dirty estimates probably do the same job. I'm curious, though, so I'll actually use the data from your tool to estimate it, just to see what it is!
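
A minimal sketch of that quick-and-dirty moment-matching fit, assuming the 17lands GIHWR values are available as fractions in [0, 1] (function and variable names are mine, not from the repo):

import numpy as np

# Hypothetical sketch: match the sample mean and variance of the observed
# GIHWR values to those of a beta distribution to recover alpha and beta
# (method of moments).
def fit_beta_prior(gihwr_values):
    m = np.mean(gihwr_values)
    v = np.var(gihwr_values)
    k = m * (1.0 - m) / v - 1.0  # common factor in the moment equations
    return m * k, (1.0 - m) * k  # (alpha, beta)

# Example with a few made-up GIHWR values:
alpha_hat, beta_hat = fit_beta_prior(np.array([0.52, 0.57, 0.49, 0.61, 0.55]))
# then the estimate is (wins + alpha_hat - 1) / (games + alpha_hat + beta_hat - 2)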

EDIT 2: So I did a quick analysis at https://gist.github.com/maqifrnswa/2e2f42f16d58dbf5fc49b34699c86956 and found that for NEO Premier: alpha = 77, beta = 60. The mean is not 50% because the mean GIHWR on 17lands is not 50%.

So the "accurate" estimate, at least for NEO Premier
MLE of GIHWR = (wins+76)/(games played + 135)

Here the "correction" goes away after 1000 games. But since it is just a "prior," you can basically make it whatever you want that makes sense based on your expectations. Maybe alpha and beta = 11 is still fine, or you can use these larger numbers instead. The larger numbers just take in to account the bias in 17lands data and the fact that the distribution is actually kind of tight.

@maqifrnswa
Author

maqifrnswa commented Mar 22, 2022

Update! So there are different alphas and betas depending on whether you are looking at "All Decks" or two-color decks. That's because the distribution is much tighter for all decks than for two-color decks.

NEO Premier "All Decks"
MLE of GIHWR = (wins+76)/(games played + 135)

NEO Premier "BR":
MLE of GIHWR = (wins+13)/(games played + 24.3)

NEO Premier "WG":
MLE of GIHWR = (wins+14.3)/(games played + 26.3)

So the estimate of alpha = 11 and beta = 11 might be pretty accurate after all.

@bstaple1 bstaple1 closed this as completed May 3, 2022