<a href="https://colab.research.google.com/github/drewwint/Learning/blob/main/ab_hands_on_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A/B Hands-On Exercises

In [None]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import scipy
import seaborn as sns
from scipy import stats

## Exercise 1: Match the functionality with the library
This is a classic matching exercise. The libraries in this course are used quite often, so a quick recall of which one does what will aid in your learning across all lessons.

From the imports above, match which library or libraries:
  1. is used to define models
  2. provides the data container for model results
  3. are used for plotting
  4. is specialized for Bayesian plotting
  5. is specialized for data manipulation and cleaning

## Exercise 2: Estimate conversion for Website A
We've launched a new website and your boss wants you to estimate the conversion rate. Your boss tells you that there wasn't a lot of money for testing the new website, so you only have 20 data points for the new website, **Website A**.

### Part 1: Estimating `p(Conversion)`

Give your boss an estimate of conversion rate in two ways:

1. Calculate the point estimate of the conversion percentage using pandas or plain Python
  * This is what most people would do using Excel
2. Use the model from the A/B testing lesson and give her an estimate with uncertainty
  * Hint: The posterior plot will be quite helpful

In [None]:
# websites = pd.read_csv("data/websites.csv")
url_control_websites = "https://raw.githubusercontent.com/drewwint/Learning/refs/heads/main/Intuative_Bayes/IntroductoryCourse-main/IntroductoryCourse-main/lesson_code/3_AB_Hands_On/data/websites.csv"
websites = pd.read_csv(url_control_websites)
websites.head(20)

In [None]:
website_a = websites["Website A"]

In [None]:
# Calculate the point estimate of the conversion percentage
ref_val_web_a = np.mean(website_a)
ref_val_web_a

In [None]:
# Fill in the PyMC Model
import pymc as pm

with pm.Model() as model_website_a:
    # Parameter = prior solution space
    conversion_rate = pm.Beta("conversion_rate", alpha=2, beta=10)

    # Plausibility evaluator: likelihood function
    n_website_a = pm.Binomial(
        "n_website_a", p=conversion_rate, n=len(website_a), observed=website_a.sum()
    )

with model_website_a:
    # Posterior solution space
    plausible_values_website_a = pm.sample()

In [None]:
# Plot the posterior plot
az.plot_posterior(plausible_values_website_a, ref_val=ref_val_web_a)
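As a sanity check on the sampled posterior, note that a Beta prior combined with a Binomial likelihood has a closed-form posterior: `Beta(alpha + successes, beta + failures)`. The sketch below uses only the standard library and is not part of the lesson's model. The counts (3 conversions in 20 visits) are hypothetical placeholders rather than the values in `websites.csv`, and the helper names are made up for illustration. It reports an equal-tailed interval estimated from random draws, which will generally differ a little from the HDI that `az.plot_posterior` displays.

```python
import random


def beta_binomial_posterior(successes, trials, alpha=2.0, beta=10.0):
    """Return (alpha_post, beta_post) for a Beta(alpha, beta) prior and a Binomial likelihood."""
    return alpha + successes, beta + (trials - successes)


def equal_tailed_interval(alpha, beta, mass=0.94, draws=20_000, seed=0):
    """Estimate an equal-tailed credible interval from sorted Beta draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - mass) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# Hypothetical counts for illustration only: 3 conversions in 20 visits
alpha_post, beta_post = beta_binomial_posterior(successes=3, trials=20)
posterior_mean = alpha_post / (alpha_post + beta_post)
lower, upper = equal_tailed_interval(alpha_post, beta_post)

print(f"Posterior: Beta({alpha_post:g}, {beta_post:g})")
print(f"Posterior mean: {posterior_mean:.3f}")
print(f"Approximate 94% equal-tailed interval: ({lower:.3f}, {upper:.3f})")
```

If you substitute `website_a.sum()` and `len(website_a)` for the placeholder counts, the posterior mean should land close to the one reported by the PyMC model, since both describe the same posterior.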

### Part 2: The A/B test

Your boss tells you there also wasn't any money for logging the control website, Website B, so you also only have 20 samples. Run the same analysis as before, but for Website B, and compare your results. Specifically:

1. Calculate the point estimate of the conversion percentage using pandas or plain Python and draw a conclusion on which website is better.
2. As a professional statistician you have to provide advice, not just code. Based on the point estimates, would you feel comfortable telling your boss Website B is better (or worse)?
3. Use your Bayesian model to also estimate the conversion rate with uncertainty. Use a forest plot to compare the results from the two models. If your boss asked which website has higher conversion, what would you advise now? How did your confidence in your recommendation change?

In [None]:
website_b = websites["Website B"]

In [None]:
# Calculate the point value of percentage
ref_val_web_b = website_b.mean()
ref_val_web_b

In [None]:
# Create a Bayesian model for the conversion of Website B
with pm.Model() as model_website_b:
    # Parameter = prior solution space
    conversion_rate = pm.Beta("conversion_rate", alpha=2, beta=10)

    # Plausibility evaluator: likelihood function
    n_website_b = pm.Binomial(
        "n_website_b", p=conversion_rate, n=len(website_b), observed=website_b.sum()
    )

with model_website_b:
    # Posterior solution space
    plausible_values_website_b = pm.sample()

In [None]:
# Compare the two. Hint: The forest plot is quite useful here
# az.plot_posterior(plausible_values_website_b, ref_val=ref_val_web_b)

# Website A is the new site (intervention); Website B is the control
az.plot_forest(
    [plausible_values_website_a, plausible_values_website_b],
    model_names=["Website A (intervention)", "Website B (control)"],
    combined=True,
    figsize=(10, 5),
)


### Part 3: Rerun your analysis with more data
You convince your boss that it's worth getting more samples. You warn her that if she doesn't invest in proper analysis she could lose out on millions of dollars in sales. She reluctantly agrees.

Now you have 5000 samples for each website. Rerun your analysis from above, and now provide a recommendation as to which website has a better conversion rate.

In [None]:
# websites = pd.read_csv("data/websites_5000_samples.csv")
url_control_websites = "https://raw.githubusercontent.com/drewwint/Learning/refs/heads/main/Intuative_Bayes/IntroductoryCourse-main/IntroductoryCourse-main/lesson_code/3_AB_Hands_On/data/websites_5000_samples.csv"
websites = pd.read_csv(url_control_websites)
websites.shape

In [None]:
website_a = websites["Website A"]

In [None]:
# Create a model for Website A
# Fill in the PyMC Model
import pymc as pm

with pm.Model() as model_website_a:
    # Parameter = prior solution space
    conversion_rate = pm.Beta("conversion_rate", alpha=2, beta=10)

    # Plausibility evaluator: likelihood function
    n_website_a = pm.Binomial(
        "n_website_a", p=conversion_rate, n=len(website_a), observed=website_a.sum()
    )

with model_website_a:
    # Posterior solution space
    plausible_values_website_a_5000 = pm.sample()

In [None]:
website_b = websites["Website B"]

In [None]:
# Create a Bayesian model for the conversion of Website B
with pm.Model() as model_website_b:
    # Parameter = prior solution space
    conversion_rate = pm.Beta("conversion_rate", alpha=2, beta=10)

    # Plausibility evaluator: likelihood function
    n_website_b = pm.Binomial(
        "n_website_b", p=conversion_rate, n=len(website_b), observed=website_b.sum()
    )

with model_website_b:
    # Posterior solution space
    plausible_values_website_b_5000 = pm.sample()

In [None]:
# Compare them again
# Website A is the new site (intervention); Website B is the control
az.plot_forest(
    [plausible_values_website_a_5000, plausible_values_website_b_5000],
    model_names=["Website A (intervention)", "Website B (control)"],
    combined=True,
    figsize=(10, 5),
)

### Part 4: See the effect of more data
You also want to check your own assumptions. Compare your estimation for conversion for Website A with 20 samples versus 5000 samples.

1. Before writing any code, pause and ponder for a minute: would you be more certain or less certain with more data?
2. Use a forest plot to compare your estimation from 20 samples with 5000. Do the results match your intuition?
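To make the question concrete, here is a minimal sketch of how the posterior standard deviation changes with sample size. It uses the closed-form Beta posterior for a `Beta(2, 10)` prior rather than the sampler. The counts are hypothetical (an observed 10% conversion rate at both sample sizes) and `posterior_sd` is an illustrative helper, not something from the lesson.

```python
from math import sqrt


def posterior_sd(successes, trials, alpha=2.0, beta=10.0):
    """Standard deviation of the Beta posterior for a Beta(alpha, beta) prior."""
    a = alpha + successes
    b = beta + (trials - successes)
    total = a + b
    return sqrt(a * b / (total**2 * (total + 1)))


# Hypothetical data: the same 10% observed conversion rate at two sample sizes
sd_20 = posterior_sd(successes=2, trials=20)
sd_5000 = posterior_sd(successes=500, trials=5000)

print(f"Posterior sd with 20 samples:   {sd_20:.4f}")
print(f"Posterior sd with 5000 samples: {sd_5000:.4f}")
print(f"Ratio: {sd_20 / sd_5000:.1f}x")
```

The forest plot shows the same effect visually: the interval for the 5000-sample model should be much narrower than the one for the 20-sample model.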

In [None]:
# Posterior of the difference in conversion rate (Website B minus Website A),
# using the 5000-sample models

diff = (
    plausible_values_website_b_5000.posterior["conversion_rate"]
    - plausible_values_website_a_5000.posterior["conversion_rate"]
)

# ref_val=0 marks "no difference" between the two websites
az.plot_posterior(diff, ref_val=0)



The difference (B minus A) is positive across the whole posterior, so Website B (the control) has a higher conversion rate than Website A (the new website).
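One way to put a number on a claim like the one above is the posterior probability that Website B's conversion rate exceeds Website A's. The sketch below approximates it by drawing from two closed-form Beta posteriors with the standard library. All counts are hypothetical placeholders and `prob_b_beats_a` is an illustrative helper. With the sampled models, the analogous quantity would be the fraction of posterior draws in which the difference is greater than zero.

```python
import random


def prob_b_beats_a(succ_a, n_a, succ_b, n_b, alpha=2.0, beta=10.0, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(alpha + succ_a, beta + n_a - succ_a)
        rate_b = rng.betavariate(alpha + succ_b, beta + n_b - succ_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: a small test and a large test with similar observed rates
p_small = prob_b_beats_a(succ_a=2, n_a=20, succ_b=4, n_b=20)
p_large = prob_b_beats_a(succ_a=500, n_a=5000, succ_b=1000, n_b=5000)

print(f"P(B > A) with 20 samples each:   {p_small:.3f}")
print(f"P(B > A) with 5000 samples each: {p_large:.3f}")
```

With only 20 samples per site the probability stays well short of certainty, while the larger test leaves little doubt. This mirrors the change in the forest plots between Part 2 and Part 3.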

## Exercise 3: Reading docs
Documentation is an important tool in becoming a proficient computational data scientist or statistician.

Using the [ArviZ documentation](https://arviz-devs.github.io/arviz/), determine how to:
1. Generate a rankplot from the output of `pm.sample()`
2. Convert the Forest Plot into a Ridge Plot
3. **Optional:** Access the posterior group of the returned `InferenceData`, and get the mean of the samples across all dimensions

Use any of the inference data objects available in these exercises.
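For intuition about what a rank plot displays, here is a simplified sketch of the underlying idea: pool the draws from all chains, rank them, and then histogram the ranks chain by chain. If the chains explore the same distribution, each chain's histogram should look roughly uniform. This is not ArviZ's implementation. The chains below are hypothetical Beta draws standing in for sampler output, and `chain_rank_histograms` is an illustrative helper.

```python
import random


def chain_rank_histograms(chains, bins=5):
    """Rank draws across the pooled chains, then count each chain's ranks in equal-width bins."""
    pooled = [(value, chain_idx) for chain_idx, chain in enumerate(chains) for value in chain]
    pooled.sort(key=lambda pair: pair[0])
    total = len(pooled)
    counts = [[0] * bins for _ in chains]
    for rank, (_, chain_idx) in enumerate(pooled):
        counts[chain_idx][rank * bins // total] += 1
    return counts


rng = random.Random(0)
# Hypothetical draws standing in for four well-mixed sampler chains
chains = [[rng.betavariate(5, 27) for _ in range(500)] for _ in range(4)]
histograms = chain_rank_histograms(chains)

for chain_idx, counts in enumerate(histograms):
    print(f"chain {chain_idx}: {counts}")
```

A chain whose counts pile up in the lowest or highest bins would be sampling systematically smaller or larger values than the others, which is the kind of problem a rank plot is meant to reveal.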

In [None]:
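# Generate a rank plot
az.plot_rank(plausible_values_website_a_5000);

# Part 4 comparison: Website A with 20 samples versus 5000 samples
az.plot_forest(
    [plausible_values_website_a, plausible_values_website_a_5000],
    model_names=["20 Samples", "5000 Samples"],
    combined=True,
    figsize=(5, 3),
);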


In [None]:
# Generate a ridge plot
fig, ax = plt.subplots(figsize=(10, 4))
az.plot_forest(
    [plausible_values_website_b_5000, plausible_values_website_a_5000],
    model_names=["Control", "Intervention"],
    combined=True,
    kind="ridgeplot",
    ax=ax,
)
ax.set_title("Conversion Ridge Plot");

In [None]:
# Get the mean of the posterior
plausible_values_website_a_5000.posterior["conversion_rate"].mean()