# Chapter 4. Estimating Proportions
[Link to chapter online](https://allendowney.github.io/ThinkBayes2/chap04.html)

A reminder of Bayes’s Theorem:

$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$

or

$P(H|D) = \frac{P(H)P(D|H)}{P(D)}$

## Warning

The content of this file may be incorrect, erroneous, and/or harmful. Use it at your own risk.

## Imports

In [None]:
import CairoMakie as Cmk
import Distributions as Dsts

In [None]:
include("pmf.jl")
import .ProbabilityMassFunction as Pmf

## The Euro Problem

In *Information Theory, Inference, and Learning Algorithms*, David MacKay poses this problem:

“A statistical statement appeared in The Guardian on Friday January 4, 2002:

> When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times
> and tails 110. "It looks very suspicious to me", said Barry Blight, a statistics
> lecturer at the London School of Economics. "If the coin were unbiased, the
> chance of getting a result as extreme as that would be less than 7%."

“But [MacKay asks] do these data give evidence that the coin is biased rather than fair?”




## The Binomial Distribution

Suppose the probability of heads is $p$ and we spin the coin $n$ times. The probability that we get a total of $k$ heads is given by the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution):

$\binom{n}{k} p^{k} (1-p)^{n-k}$

for any value of $k$ from 0 to $n$, inclusive.

The term $\binom{n}{k}$ is the binomial coefficient, usually pronounced “n choose k”.

We could evaluate the expression ourselves or use a library, like so:

In [None]:
n = 2
p = 0.5
k = 1

Dsts.pdf(Dsts.Binomial(n, p), k) |> x -> round(x, digits=3)
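
As a cross-check, the expression can also be evaluated by hand. The sketch below uses only Base Julia. `binomProb` is a hypothetical helper name (it is not part of `pmf.jl`), and `big` is used because `binomial(250, 125)` would overflow a 64-bit integer.

```julia
# Hypothetical helper: evaluate the binomial formula directly (Base Julia only).
# big(n) makes the binomial coefficient a BigInt, so it cannot overflow.
function binomProb(k::Int, n::Int, p::Float64)::Float64
    return Float64(binomial(big(n), k) * p^k * (1 - p)^(n - k))
end

binomProb(1, 2, 0.5)  # 0.5, the same value the library call returns
```

The same helper should also work for the larger case used later in this chapter, for example `binomProb(125, 250, 0.5)`.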

We can also evaluate several values of $k$ at once by broadcasting `pdf` over them:

In [None]:
ks = 0:1:n |> collect
ps = Dsts.pdf.(Dsts.Binomial(n, p), ks)
ps = map(x -> round(x, digits=3), ps)
ps

We can put these probabilities in a `Pmf`

In [None]:
pmfK = Pmf.Pmf(ks, ps)
pmfK

Here's what it looks like with `n=250` and `p=0.5`:

In [None]:
pmfK = Pmf.getBinomialPmf(250, 0.5)

In [None]:
fig = Pmf.drawLinesPriors(pmfK,
    "Binomial Distribution (n=250, p=0.5)",
    "Number of heads (k)",
    "PMF"
    )
fig

In [None]:
Pmf.getNameMaxPrior(pmfK)

In [None]:
# (125 + 1) because Julia's indexing starts at 1
pmfK.priors[126]

In MacKay's example, we got 140 heads, which is even less likely than 125:

In [None]:
# (140 + 1) because Julia's indexing starts at 1
pmfK.priors[141]

In [None]:
Pmf.getTotalProbGEName(pmfK, "priors", 140)

The result is about 3.3%, which is less than the quoted 7%. The reason for the difference is that the statistician includes all outcomes “as extreme as” 140, which also covers outcomes less than or equal to 110. This is a two-tailed probability.

In [None]:
Pmf.getTotalProbGEName(pmfK, "priors", 140) * 2

In [None]:
# alternative solution (without Pmf)
Dsts.cdf(Dsts.Binomial(250, 0.5), 110) +
Dsts.ccdf(Dsts.Binomial(250, 0.5), 139)
# or just
# Dsts.cdf(Dsts.Binomial(250, 0.5), 110) * 2
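
The shortcut of doubling one tail relies on symmetry, which holds here only because $p = 0.5$. A short derivation, with $X$ denoting the number of heads in $n = 250$ spins (the symbol $X$ is introduced here for illustration):

$P(X = k) = \binom{n}{k} 0.5^{n} = \binom{n}{n-k} 0.5^{n} = P(X = n - k)$

so

$P(X \le 110) = P(X \ge 140)$ and therefore $P(X \le 110) + P(X \ge 140) = 2 P(X \ge 140)$

For a coin with $p \ne 0.5$ the two tails differ and have to be summed separately.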

## Bayesian Estimation

In [None]:
# coins with different probs of getting heads
coins = Pmf.getPmfFromSeq(range(0, 1, 101) |> collect)

In [None]:
likelihoodHeads = copy(coins.names)
likelihoodTails = 1 .- likelihoodHeads

likelihoodMapping = Dict(
   'h' => likelihoodHeads,
   't' => likelihoodTails
)

In [None]:
dataset = "h" ^ 140 * "t" ^ 110

In [None]:
"""
    Update pmf with a given sequence of h and t
"""
function updateEuro!(
    coins::Pmf.Pmf{T},
    dataset::String,
    probMapping::Dict{Char,Vector{Float64}}) where {T<:Union{Int,String,Float64}}

    coins.likelihoods .= 1
    for data in dataset
        coins.likelihoods .*= probMapping[data]
    end
    Pmf.updatePosteriors!(coins, true)

    return nothing

end

In [None]:
updateEuro!(coins, dataset, likelihoodMapping)

In [None]:
fig = Pmf.drawLinesPosteriors(coins,
    "Posterior distribution of x,\n140/250 heads",
    "Proportion of heads (x)",
    "PMF"
    )
fig

In [None]:
# index of the coin with the maximum posterior
Pmf.getIndMaxPosterior(coins)

In [None]:
# probability of heads with the maximum posterior
Pmf.getNameMaxPosterior(coins)
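
The update above relies on the `Pmf` module from `pmf.jl`. For readers without that file, here is a minimal sketch of the same grid computation with plain vectors. The names `xs`, `prior`, `likelihood`, and `posterior` are illustrative and not part of the module.

```julia
# Grid of hypotheses: probability of heads from 0 to 1 in 101 steps
xs = collect(range(0, 1, length=101))

# Uniform prior over the hypotheses
prior = fill(1 / length(xs), length(xs))

# Likelihood of 140 heads and 110 tails under each hypothesis
likelihood = xs .^ 140 .* (1 .- xs) .^ 110

# Bayes's Theorem: multiply prior by likelihood, then normalize
posterior = prior .* likelihood
posterior ./= sum(posterior)

# Hypothesis with the highest posterior probability
xs[argmax(posterior)]
```

With a uniform prior the posterior peaks at about 0.56, which is the observed proportion 140/250.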

## Triangle prior

Comparison between two priors:
- uniform
- triangle shaped

In [None]:
uniform = Pmf.getPmfFromSeq(range(0, 1, 101) |> collect)

In [None]:
shape = vcat(0:49, 50:-1:0)
shape = shape ./ sum(shape)
triangle = Pmf.Pmf(range(0, 1, 101) |> collect, shape)

In [None]:
fig = Cmk.Figure(size=(600, 400))
Cmk.lines(fig[1, 1], uniform.names, uniform.priors,
    color="blue",
    axis=(;
        title="Uniform and triangle distributions",
        xlabel="Proportion of heads (x)",
        ylabel="Probability")
    )
Cmk.lines!(fig[1, 1], triangle.names, triangle.priors, color="orange")
fig

In [None]:
updateEuro!(uniform, dataset, likelihoodMapping);
updateEuro!(triangle, dataset, likelihoodMapping);

In [None]:
fig = Cmk.Figure(size=(600, 400))
Cmk.lines(fig[1, 1], uniform.names, uniform.posteriors,
    color="blue",
    axis=(;
        title="Posteriors from uniform and triangle priors",
        xlabel="Proportion of heads (x)",
        ylabel="Probability")
    )
Cmk.lines!(fig[1, 1], triangle.names, triangle.posteriors, color="orange")
fig

This is an example of **swamping the priors**: with enough data, people who start with different priors will tend to converge on the same posterior distribution.
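
One way to put a number on how close the two posteriors are is to compare them point by point. The following is a sketch with plain vectors rather than the `Pmf` module. `gridPosterior` is a hypothetical helper introduced only for this illustration.

```julia
# Hypothetical helper: posterior on a grid for `heads` heads and `tails` tails
function gridPosterior(xs::Vector{Float64}, prior::Vector{Float64},
                       heads::Int, tails::Int)::Vector{Float64}
    unnorm = prior .* xs .^ heads .* (1 .- xs) .^ tails
    return unnorm ./ sum(unnorm)
end

xs = collect(range(0, 1, length=101))

# Same two priors as above: flat, and a triangle peaking at 0.5
uniformPrior = fill(1 / 101, 101)
trianglePrior = Float64.(vcat(0:49, 50:-1:0))
trianglePrior ./= sum(trianglePrior)

postUniform = gridPosterior(xs, uniformPrior, 140, 110)
postTriangle = gridPosterior(xs, trianglePrior, 140, 110)

# Largest pointwise gap between the two posteriors
maxGap = maximum(abs.(postUniform .- postTriangle))
```

A `maxGap` that is small relative to the height of the posterior peak indicates that the data have swamped the difference between the priors.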

## The Binomial Likelihood Function

So far we've been updating the likelihood one result of the experiment at a time. A better option is to do it in one go, using the binomial distribution.

In [None]:
"""
    Update a binomial Pmf.
    k - number of successes
    n - number of trials
"""
function updateBinomial!(pmf::Pmf.Pmf{T}, k::Int, n::Int) where T<:Union{Int, Float64}
    @assert (k <= n) "k must be <= n"
    xs::Vector{T} = pmf.names
    likelihoods::Vector{Float64} = Dsts.pdf.(Dsts.Binomial.(n, xs), k)
    Pmf.setLikelihoods!(pmf, likelihoods)
    Pmf.updatePosteriors!(pmf, true)
    return nothing
end

In [None]:
uniform2 = Pmf.getPmfFromSeq(range(0, 1, 101) |> collect)
k, n = 140, 250
updateBinomial!(uniform2, k, n)

In [None]:
Pmf.drawLinesPosteriors(
    uniform2,
    "Posterior distribution, 140/250 heads",
    "Coin probs of getting heads",
    "PMF"
)

In [None]:
Pmf.getNameMaxPosterior(uniform2)
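
The one-shot update gives the same posterior as the flip-by-flip update because the two likelihoods differ only by the constant factor $\binom{n}{k}$, which cancels during normalization. A minimal check with plain vectors (the variable names are illustrative and not part of `pmf.jl`):

```julia
xs = collect(range(0, 1, length=101))
k, n = 140, 250

# Flip-by-flip likelihood: product of k heads and n - k tails
likeSeq = xs .^ k .* (1 .- xs) .^ (n - k)

# One-shot binomial likelihood: same product times the binomial coefficient
coef = Float64(binomial(big(n), k))
likeBinom = coef .* likeSeq

# With a uniform prior, the posterior is the normalized likelihood
postSeq = likeSeq ./ sum(likeSeq)
postBinom = likeBinom ./ sum(likeBinom)

# The constant cancels, so the posteriors agree up to rounding error
isapprox(postSeq, postBinom)
```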

## Bayesian Statistics

In the Euro problem, the choice of the prior is subjective; that is, reasonable
people could disagree, maybe because they have different information about coins
or because they interpret the same information differently.

Because the priors are subjective, the posteriors are subjective, too. And some
people find that problematic.

Bayes’s Theorem is a mathematical law of probability; no reasonable person
objects to it. But Bayesian statistics is surprisingly controversial.
Historically, many people have been bothered by its subjectivity and its use of
probability for things that are not random.

## Exercises

### Exercise 1

In Major League Baseball, most players have a batting average between .200 and
.330, which means that their probability of getting a hit is between 0.2 and
0.33.

Suppose a player appearing in their first game gets 3 hits out of 3 attempts.
What is the posterior distribution for their probability of getting a hit?

Let's start with a uniform distribution.

In [None]:
ex1Hypos = range(0.1, 0.4, 101) |> collect
ex1 = Pmf.getPmfFromSeq(ex1Hypos)

In [None]:
# y - getting a hit
# n - not getting a hit
ex1LikelihoodMap = Dict(
    'y' => ex1Hypos,
    'n' => 1 .- ex1Hypos
)

In [None]:
# hypothetical data (25 hits in 100 attempts) used to build a reasonable prior
ex1Dataset = "y" ^ 25 * "n" ^ 75

In [None]:
updateEuro!(ex1, ex1Dataset, ex1LikelihoodMap)

In [None]:
Pmf.drawLinesPosteriors(
    ex1,
    "Exercise 1. Baseball",
    "Probability of getting a hit",
    "PMF"
    )

Now, the task.

Update this distribution with the data (I assume it's 3 out of 3 hits) and plot the posterior. What is the most
likely quantity in the posterior distribution?

In [None]:
ex1Pmf = Pmf.Pmf(ex1.names |> copy, ex1.posteriors |> copy)

In [None]:
updateEuro!(ex1Pmf, "yyy", ex1LikelihoodMap)

In [None]:
fig = Cmk.Figure()
ax1, l1 = Cmk.lines(fig[1, 1],
    ex1.names,
    ex1.posteriors,
    color="navy",
    axis=(;
        title="Exercise 1. Baseball",
        xlabel="Probability of getting a hit",
        ylabel="PMF"
    )
)
l2 = Cmk.lines!(fig[1, 1],
    ex1Pmf.names,
    ex1Pmf.posteriors,
    color="red"
)
Cmk.axislegend(
    ax1,
    [l1, l2],
    ["priors", "posteriors"],
    position=:rt
)
fig

In [None]:
Pmf.getNameMaxPosterior(ex1),
Pmf.getNameMaxPosterior(ex1Pmf)

In [None]:
maximum(ex1.posteriors),
maximum(ex1Pmf.posteriors)
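
Building the prior from 25 hits in 100 attempts and then updating with 3 more hits should be equivalent to a single update with 28 hits and 75 misses, because the likelihoods simply multiply. A minimal sketch of that check with plain vectors (`normalizeVec` is a hypothetical helper, and none of these names come from `pmf.jl`):

```julia
xs = collect(range(0.1, 0.4, length=101))

# Hypothetical helper: rescale a vector so it sums to 1
normalizeVec(v) = v ./ sum(v)

# Step 1: uniform prior updated with 25 hits and 75 misses
step1 = normalizeVec(xs .^ 25 .* (1 .- xs) .^ 75)

# Step 2: use that result as the prior and update with 3 hits
step2 = normalizeVec(step1 .* xs .^ 3)

# Single update with all the data at once: 28 hits and 75 misses
allAtOnce = normalizeVec(xs .^ 28 .* (1 .- xs) .^ 75)

# The two routes agree up to rounding error
isapprox(step2, allAtOnce)
```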

### Exercise 2

Whenever you survey people about sensitive issues, you have to deal with [social
desirability](https://en.wikipedia.org/wiki/Social-desirability_bias) bias,
which is the tendency of people to adjust their answers to show themselves in
the most positive light. One way to improve the accuracy of the results is
[randomized response](https://en.wikipedia.org/wiki/Randomized_response).

As an example, suppose you want to know how many people cheat on their taxes. If
you ask them directly, it is likely that some of the cheaters will lie. You can
get a more accurate estimate if you ask them indirectly, like this: Ask each
person to flip a coin and, without revealing the outcome,
- If they get heads, they report YES.
- If they get tails, they honestly answer the question “Do you cheat on your
taxes?”

[...]

Suppose you survey 100 people this way and get 80 YESes and 20 NOs. Based on
this data, what is the posterior distribution for the fraction of people who
cheat on their taxes? What is the most likely quantity in the posterior
distribution?
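
One possible approach, offered as a sketch rather than a reference solution: let $x$ be the fraction of people who cheat. Assuming a fair coin and honest answers after tails, a respondent says YES with probability $0.5 + 0.5x$ and NO with probability $0.5(1 - x)$. The code below also assumes a uniform prior and uses plain vectors instead of the `Pmf` module. All names are illustrative.

```julia
# Hypotheses for the fraction of people who cheat
xs = collect(range(0, 1, length=101))

# Probability of each answer under each hypothesis (fair coin assumed)
probYes = 0.5 .+ 0.5 .* xs
probNo = 1 .- probYes

# Likelihood of 80 YESes and 20 NOs; a uniform prior only rescales it
likelihood = probYes .^ 80 .* probNo .^ 20
posterior = likelihood ./ sum(likelihood)

# Most likely fraction of cheaters
xs[argmax(posterior)]
```

Under these assumptions the posterior peaks at about 0.6. This matches a quick sanity check: 80% YES answers equals 50% forced YESes plus half of the cheating fraction, so $0.8 = 0.5 + 0.5x$ gives $x = 0.6$.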