# Probability II

We have determined whether a coin is fair or not. Now let's take a more realistic example.

## The Case
Let's assume you are a data analyst or data scientist at a travel e-commerce company. Your task is to increase user engagement with the platform. To simplify the problem, let's also assume that the platform has only two pages:

### Home Page
![Home](img/002-01.png)

### Search Result Page
![Search result page](img/002-02.png)

**First things first:** as a quantitative analyst, you should ask yourself and your business stakeholders:

> What do we mean by user engagement with our platform?

One possible answer is **the tendency of users to use our platform's functions**. How do we measure such a tendency? Going back to the home page, we have a search function.

So one possible approach is to measure how often users use the search button to move to the search result page. Of course, to quantify this, we need some standard metrics.

## Online Metric: Click Through Probability (CTP) and Click Through Rate (CTR)

Here we define two commonly used metrics for measuring the effectiveness of a page or function: CTP $^{[1]}$ and CTR $^{[2]}$. Their definitions are:

$$
CTR = \frac{\text{number of clicks in funnel 2}}{\text{number of clicks in funnel 1}}
$$

$$
CTP = \frac{\text{number of unique users in funnel 2}}{\text{number of unique users in funnel 1}}
$$


For this case:

$$
CTR_{home\_page-search\_result\_page} = \frac{\text{number of clicks in } search\_result\_page}{\text{number of clicks in } home\_page}
$$

$$
CTP_{home\_page-search\_result\_page} = \frac{\text{number of unique users in } search\_result\_page}{\text{number of unique users in } home\_page}
$$
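As a minimal sketch of the two definitions, assume each funnel is logged as a list of user ids, one entry per event. The function names and the toy logs below are hypothetical and are not the notebook's reference solution.

```python
import numpy as np


def click_through_rate(funnel_1_log, funnel_2_log):
    """Number of events in funnel 2 divided by number of events in funnel 1."""
    return len(funnel_2_log) / len(funnel_1_log)


def click_through_probability(funnel_1_log, funnel_2_log):
    """Unique users in funnel 2 divided by unique users in funnel 1."""
    return len(np.unique(funnel_2_log)) / len(np.unique(funnel_1_log))


# Hypothetical logs: each entry is the id of the user who produced the event.
home_page_log = [1, 1, 2, 3, 3, 3, 4, 5]  # 8 visits by 5 distinct users
search_page_log = [1, 3, 3, 5]            # 4 clicks by 3 distinct users

example_ctr = click_through_rate(home_page_log, search_page_log)         # 4 / 8
example_ctp = click_through_probability(home_page_log, search_page_log)  # 3 / 5
print(example_ctr, example_ctp)
```

CTR counts every event, so one very active user can inflate it, while CTP counts each user at most once per funnel.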



In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import matplotlib.pyplot as plt  # used by the plotting cell further down
import numpy as np
import seaborn as sns
import scipy.stats as st
from func import *

## Assumptions in This Notebook

1. Each day, **100 users** visit the home page; they are uniformly distributed, and their number is unchanged throughout the week.
2. Each day, there are **1000 visits** to the home page and **200** click-throughs to the search result page.


Implement these assumptions in a simulation, then write functions to calculate CTP and CTR.
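One possible way to encode these assumptions is sketched below. It assumes that every event is produced by a user drawn uniformly at random with replacement, and that clicks are drawn independently of visits. The variable names and the seed are illustrative, so the numbers are not expected to match the asserted values later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not the notebook's seed 310

N_USERS = 100
N_HOME_VISITS = 1000
N_SRP_CLICKS = 200

# Assumption: each event belongs to a user id drawn uniformly from 0..N_USERS-1.
home_log = rng.integers(0, N_USERS, size=N_HOME_VISITS)
srp_log = rng.integers(0, N_USERS, size=N_SRP_CLICKS)

unique_home = len(np.unique(home_log))
unique_srp = len(np.unique(srp_log))

# Expected number of distinct users among k uniform draws from n users:
# n * (1 - (1 - 1/n) ** k)
expected_unique_srp = N_USERS * (1 - (1 - 1 / N_USERS) ** N_SRP_CLICKS)

print(unique_home, unique_srp, expected_unique_srp)
```

Under these assumptions roughly 87 of the 100 users are expected to appear among the 200 clicks, which is why the simulated CTP tends to sit below 1 even though almost every user visits the home page.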

In [None]:
np.random.seed(310)
# TODO generate random experiments
users_activities_in_home_page = 
users_activities_in_srp = 

In [None]:
def ctp(funnel_1_log, funnel_2_log):
    """
    calculate the Click Through Probability between funnel 1 and funnel 2.
    CTP should be 0 <= CTP <= 1
    Args: 
    ----
    funnel_1_log: list of int (user id)
    funnel_2_log: list of int (user id)
    
    Output:
    ----
    ctp: float
    """
    # TODO: code here

def ctr(funnel_1_log, funnel_2_log):
    """
    calculate the Click Through Rate between funnel 1 and funnel 2. 
    CTR may be more than 1, the only constraint that it has is
    CTR >= 0
    Args: 
    ----
    funnel_1_log: list of int (user id)
    funnel_2_log: list of int (user id)
    
    Output:
    ----
    ctr: float
    """    
    # TODO: code here

In [None]:
ctp_current = ctp(users_activities_in_home_page, users_activities_in_srp)
ctr_current = ctr(users_activities_in_home_page, users_activities_in_srp)
# test whether your functions have been implemented correctly
assert np.isclose(ctp_current, 0.92)
assert np.isclose(ctr_current, 0.2)

We are only interested in CTP right now. So the question is:

> Given this CTP, what is the probability of observing it?


With the assumptions above and this question in mind, a natural idea is to run a simulation.
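A simulation of this kind could look like the sketch below. It assumes uniform random user ids for every event; `simulate_ctp_sketch` and its `seed` parameter are hypothetical names, and this is not necessarily the implementation the exercise expects.

```python
import numpy as np


def simulate_ctp_sketch(n_home, n_srp, n_users=100, n_simulations=500, seed=0):
    """Simulate one day of traffic n_simulations times and return every CTP."""
    rng = np.random.default_rng(seed)
    ctps = np.empty(n_simulations)
    for i in range(n_simulations):
        # Assumption: each event is produced by a uniformly random user.
        home_log = rng.integers(0, n_users, size=n_home)
        srp_log = rng.integers(0, n_users, size=n_srp)
        ctps[i] = len(np.unique(srp_log)) / len(np.unique(home_log))
    return ctps


sketch_ctps = simulate_ctp_sketch(1000, 200)

# Share of simulated days whose CTP is at least as large as 0.92.
share_at_least_092 = np.mean(sketch_ctps >= 0.92)
print(sketch_ctps.mean(), sketch_ctps.std(), share_at_least_092)
```

Counting how often a simulated value lands at or near the observed one is the simulation-based answer to the question above.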

In [None]:
def simulate_ctp(
    n_activities_home_page, n_activities_search_page, 
    n_users=100, n_simulations=500
):
    ctps = []
    # TODO: code here
    return np.array(ctps)

In [None]:
# DO NOT CHANGE THIS
ctps = simulate_ctp(1000, 200)

assert len(ctps) == 500

Now we have enough simulations to plot how frequently each CTP value occurs. But wait, there is a problem! If I group the simulated CTPs by their exact values, most groups will have very small counts, and the value I care about may have a count of zero!

Why? Because CTP is a continuous number!

The solution? Bin the CTPs. Hint:
- Use `np.linspace` to generate the bins, with lower bound 0.7, upper bound 1, and nbins = 20
- Use `np.digitize` to assign each CTP to a bin
- Use `np.unique` to get the bin indexes and bin counts
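The three hints can be sketched on stand-in data as follows. The `sample_ctps` array is hypothetical (uniform noise, not the notebook's simulated CTPs), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the simulated CTPs.
sample_ctps = rng.uniform(0.75, 0.95, size=500)

# 20 evenly spaced edges between 0.7 and 1.0, i.e. 19 intervals.
bin_edges = np.linspace(0.7, 1.0, 20)

# np.digitize returns i such that bin_edges[i - 1] <= x < bin_edges[i].
bin_of_each = np.digitize(sample_ctps, bin_edges)

# Which bins are occupied, and how many values fell into each.
bin_ids, bin_counts = np.unique(bin_of_each, return_counts=True)

print(dict(zip(bin_ids.tolist(), bin_counts.tolist())))
```

Note that `np.linspace(0.7, 1.0, 20)` produces 20 edges rather than 20 intervals, and that bins with no values simply do not appear in the output of `np.unique`.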

In [None]:
# TODO: code here
bins = 
ctp_bins = 
ctp_bin_indexes, ctp_bin_counts = 

ctp_current_bin = np.digitize(ctp_current, bins)

In [None]:
draw_ctp_happening_binned(
    ctp_current, 
    bins, 
    ctp_bin_indexes, 
    ctp_bin_counts, 
    ctp_current_bin
)

### Bonus simulation

Here is an interactive version of the graph that shows how the frequencies and probabilities change as the number of simulations grows from 20 to thousands.

In [None]:
draw_interactive_bar_plot_for_simulations(
    make_df_for_simulations(simulate_ctp)
)

In [None]:
draw_interactive_prob_bar_plot_for_simulations(
    make_df_for_simulations(simulate_ctp)
)

## Probability Density Function & Cumulative Distribution Function


There is a way to avoid running simulations like this over and over again: use the probability density function and the cumulative distribution function.

### Probability Density Function (PDF)

For a **continuous variable**, the PDF is essentially a mathematical model that describes $^{[4]}$

> how much more likely it is that the random variable would equal one sample compared to the other sample


or, in simpler words, **"how likely is a number to occur?"**

Don't be surprised that the PDF can be greater than 1, as in this example: the PDF is not a probability itself but a relative likelihood (a density).

The probability of **one exact number** occurring is effectively **0**. Why? Because there are infinitely many numbers arbitrarily close to it.


What we can do is calculate
> **The probability that a number falls within a range**

For example: the probability that a number falls between $0.55$ and $0.63$. This is the area under the PDF curve between those two values.
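The "area under the curve" idea can be checked numerically. The sketch below assumes a normal distribution with a hypothetical mean of 0.60 and standard deviation of 0.05, chosen only for illustration, and integrates its PDF between 0.55 and 0.63 with `scipy.integrate.quad`.

```python
import scipy.stats as st
from scipy.integrate import quad

mu, sigma = 0.60, 0.05  # hypothetical parameters, for illustration only
lo, hi = 0.55, 0.63

# Numerically integrate the PDF over [lo, hi]; quad returns (value, error estimate).
area, _abs_err = quad(st.norm.pdf, lo, hi, args=(mu, sigma))

# The same probability expressed as a difference of two CDF values.
cdf_diff = st.norm.cdf(hi, mu, sigma) - st.norm.cdf(lo, mu, sigma)

# The density at the peak is larger than 1 for this narrow distribution.
peak_density = st.norm.pdf(mu, mu, sigma)

print(area, cdf_diff, peak_density)
```

The area is a valid probability between 0 and 1 even though the density itself exceeds 1 near the mean.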


### Gaussian / Normal Distribution

The Gaussian distribution $^{[3]}$ is one of the most commonly assumed distributions. It is shaped like a bell curve; instead of showing you a picture, we will draw it in code. Assuming that the CTPs are spread like a bell curve, we need 2 parameters to define the Gaussian distribution: the **mean and variance**. The variance can be replaced by the **standard deviation**, because:

$$
variance = \sigma(x)^2
$$

where $\sigma(x)$ is the standard deviation of random variable $x$.
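As a small illustration of these two parameters, the sketch below estimates them from samples and builds the one-sigma interval around the mean. The samples are drawn from a hypothetical normal distribution (mean 0.87, standard deviation 0.03), not from the notebook's simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bell-shaped samples standing in for simulated CTPs.
samples = rng.normal(loc=0.87, scale=0.03, size=10_000)

sample_mean = samples.mean()
sample_std = samples.std()
sample_var = samples.var()  # equals sample_std ** 2

# One standard deviation on either side of the mean.
lower_1_sigma = sample_mean - sample_std
upper_1_sigma = sample_mean + sample_std

# For a normal distribution, about 68% of the values fall inside this interval.
share_within_1_sigma = np.mean(
    (samples >= lower_1_sigma) & (samples <= upper_1_sigma)
)
print(sample_mean, sample_std, share_within_1_sigma)
```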

In [None]:
# TODO: code here
mean_ctp = 
std_ctp = 
lower_1_sigma_ctp, upper_1_sigma_ctp = 

Now we can draw the distribution of the CTPs with a single seaborn function:

```python
sns.distplot(random_variable)
```

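The rest of the code below only annotates the key points. Note that `sns.distplot` is deprecated since seaborn 0.11; `sns.histplot` and `sns.displot` are the suggested replacements in newer versions.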

In [None]:
f, ax = plt.subplots(figsize=(12, 7))


# TODO: code here using sns distplot, 1 line only

plt.axvline(mean_ctp, color='red')
plt.axvline(lower_1_sigma_ctp, color='red')
plt.axvline(upper_1_sigma_ctp, color='red')
plt.axvspan(lower_1_sigma_ctp, upper_1_sigma_ctp, color='red', alpha=0.2)
plt.axvline(ctp_current, color='green')
values = dict(
    mean=mean_ctp,
    lower_1_sigma=lower_1_sigma_ctp,
    upper_1_sigma=upper_1_sigma_ctp,
    ctp_current=ctp_current
)
for k, val in values.items():
    
    x_text, y_text = val + np.random.uniform(-std_ctp, std_ctp), st.norm.pdf(val, loc=mean_ctp, scale=std_ctp)
    plt.annotate(
        '{}: {:.2f}'.format(k, val),
        (val,8),
        xytext=(x_text, y_text),
        arrowprops=dict(
            arrowstyle='-|>',
            connectionstyle='angle3'
        )
    )

plt.title('Distribution of CTP')
plt.show()

The y-axis shows the PDF value for each point on the x-axis. As noted earlier, the PDF is allowed to exceed 1, because it is not a probability but a relative likelihood.


To get the hang of the Gaussian PDF, you may as well implement the formula $^{[3]}$ yourself and check it against the unit test below.
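For reference, the Gaussian PDF as given in $^{[3]}$, writing $\mu$ for `mean` and $\sigma$ for `std`, is:

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

At $x = \mu$ the exponent vanishes, so the peak value is $1 / (\sigma\sqrt{2\pi})$, which exceeds 1 whenever $\sigma < 1/\sqrt{2\pi} \approx 0.4$.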

In [None]:
def gaussian_dist_pdf(x, mean, std):
    # TODO: code here


In [None]:
assert np.isclose(
    gaussian_dist_pdf(0.85, mean_ctp, std_ctp),
    st.norm.pdf(0.85, loc=mean_ctp, scale=std_ctp)
)

### Cumulative Distribution Function (CDF)

The CDF gives the **probability** that a number falls at or below a certain value. Unlike the PDF, it is a true probability, so it ranges from 0 to 1. How do we calculate the probability of a CTP around $0.92$ in our case? We can take the difference of the CDF at the two ends of a range:

$$
Prob_{0.92} = CDF_{0.92} - CDF_{0.90}
$$
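The sketch below compares a CDF difference with the fraction of samples that actually fall in the same range. It uses hypothetical normal samples (mean 0.87, standard deviation 0.03) rather than the notebook's simulated CTPs, so the resulting number is illustrative only.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Hypothetical stand-in for simulated CTPs.
samples = rng.normal(0.87, 0.03, size=100_000)

# Fit the two Gaussian parameters from the samples.
mu, sigma = samples.mean(), samples.std()

# Probability of landing in (0.90, 0.92]: upper CDF minus lower CDF.
prob_from_cdf = st.norm.cdf(0.92, mu, sigma) - st.norm.cdf(0.90, mu, sigma)

# Empirical share of samples in the same range.
prob_empirical = np.mean((samples > 0.90) & (samples <= 0.92))

print(prob_from_cdf, prob_empirical)
```

The upper bound comes first in the subtraction; reversing the order would yield a negative number, which cannot be a probability.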

In [None]:
# TODO: code here
prob_092 = 

In [None]:
assert np.isclose(prob_092, 0.087802724)

print('The probability is {:.1f}%'.format(prob_092 * 100))

And this result gives the same probability as our simulation!

# References

[1] 12 Estimating Click Through Probability. (2019). Retrieved from [https://www.youtube.com/watch?v=LFLSApHc-jM](https://www.youtube.com/watch?v=LFLSApHc-jM)

[2] Wikipedia. (2019). Click-through rate. Retrieved 9 April 2019, from https://en.wikipedia.org/wiki/Click-through_rate

[3] Wikipedia. (2019). Normal distribution. Retrieved 9 April 2019, from https://en.wikipedia.org/wiki/Normal_distribution

[4] Wikipedia. (2019). Probability density function. Retrieved 9 April 2019, from https://en.wikipedia.org/wiki/Probability_density_function
