### The problem (Taken from Kaggle Marketing A/B Testing)

A marketing campaign was run to increase sales. In the dataset, 96% of users were shown an advertisement (the test group), while the remaining 4% were shown only a public service announcement (the control group).
The goal is to compare the two groups, determine whether the ads were successful, estimate how much the company can gain from them, and check whether the difference between the groups is statistically significant.

1. Data dictionary:

    user id: User ID (unique)
    test group: "ad" if the person saw the advertisement, "psa" if they saw only the public service announcement
    converted: True if the person bought the product, otherwise False
    total ads: Number of ads seen by the person
    most ads day: Day on which the person saw the most ads
    most ads hour: Hour of the day in which the person saw the most ads

2. Data preparation:

   We randomly sampled only 2% of the original dataset (11,762 out of 588,100 rows) for simplicity and to limit the size of the Git repository (and of a future cloud run).
   The dataset was checked for missing values and data types. There are no missing values.
   To keep the A/B test clean, we focus only on the test group (ad) vs. the control group (psa) with respect to the variable 'converted'.
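
   The sampling step itself is not shown in this notebook. A minimal sketch of how a reproducible 2% sample could be drawn with pandas is given here. The helper name `sample_fraction`, the seed value, and the stand-in DataFrame are assumptions for illustration only, not the code that produced the file used in this notebook.

```python
import pandas as pd


def sample_fraction(df: pd.DataFrame, frac: float = 0.02, seed: int = 42) -> pd.DataFrame:
    """Return a random sample of `frac` of the rows (hypothetical helper).

    A fixed seed makes the sample reproducible; the index is reset so the
    result looks like a freshly loaded table.
    """
    return df.sample(frac=frac, random_state=seed).reset_index(drop=True)


# Stand-in for the full dataset: 1,000 made-up rows with two of the real column names.
full = pd.DataFrame({
    "user id": range(1000),
    "test group": ["ad"] * 960 + ["psa"] * 40,
})

subset = sample_fraction(full)
print(len(subset), "rows sampled out of", len(full))
```

   Because the sample is purely random, the ad/psa split in the subset only approximates the split in the full data, so it is worth re-checking the group sizes after sampling.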

3. A/B testing:

   The test compares the conversion rates of the test group (ad) and the control group (psa).

4. Measure the statistical strength of the test:
   
   A z-test is performed and the p-value is computed.
   
   Z-Test for Two Proportions:

   Here, the converted variable is binary, so we're comparing two proportions (conversion rates) from two independent Bernoulli samples. Hence, the hypotheses can be defined this way:
   
   Null Hypothesis (H0): The two population proportions are equal (no difference in conversion rate).
   Alternative Hypothesis (H1): The proportions are not equal.

   Let's assume ad and psa are groups A and B as follows.

   Group A (ad):
   n1 users
   x1 conversions
   p1 = x1/n1 conversion rate

   Group B (psa):
   n2 users
   x2 conversions
   p2 = x2/n2 conversion rate

   Then, using the formula for the z-statistic, we define:
   p = (x1 + x2) / (n1 + n2) (pooled conversion rate across both groups)
   SE = (p(1−p)(1/n1 + 1/n2))^0.5 (standard error of the difference under H0)

   The z-score, which tells how many standard errors apart the two proportions are, is then:
   z = (p1 − p2) / SE

   The z-score indicates how strong the statistical evidence for a difference is and is interpreted as follows:
   If ∣z∣ is large enough (about 1.96 or more for a two-sided test at the 5% level), the p-value (area under the normal curve beyond ±z) will be small.
   A small p-value (typically < 0.05) lets us reject the Null Hypothesis and conclude the proportions differ significantly.
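
   The formulas above can be checked by hand with a few lines of standard-library Python. This is only a sketch: the function name `two_proportion_ztest` is made up for illustration, and the counts are the ones reported in the group summary further down in this notebook.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled, two-sided z-test for two proportions (illustrative helper)."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled conversion rate under H0 (both groups share one proportion)
    p_pool = (x1 + x2) / (n1 + n2)
    # Standard error of the difference p1 - p2 under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: P(|Z| >= |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Group A (ad): 293 conversions out of 11,298; Group B (psa): 5 out of 464
z, p_value = two_proportion_ztest(x1=293, n1=11298, x2=5, n2=464)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

   With these counts the result should agree with the statsmodels output in the notebook (z ≈ 2.04, p ≈ 0.0417).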

5. As noted above, the remaining variables are not needed for the A/B test itself, but they can be useful for post-hoc optimization. In other words, once the campaign has been shown to work, we might ask:
    What ad frequency is optimal?
    What time/day performs best?

    These questions can be explored with exploratory or predictive modeling, which helps optimize future campaigns, for example by finding the best ad frequency.
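
    A first exploratory pass at these questions could look like the sketch below. The rows are made up and only mimic the dataset's column names, and the frequency buckets (1-10, 11-50, 51+) are arbitrary assumptions rather than values derived from the data.

```python
import pandas as pd

# Made-up rows that only mimic the column names of the dataset.
toy = pd.DataFrame({
    "test group": ["ad"] * 8,
    "converted": [True, False, False, True, False, False, True, False],
    "total ads": [40, 3, 12, 75, 1, 8, 120, 20],
    "most ads day": ["Monday", "Monday", "Friday", "Friday",
                     "Saturday", "Saturday", "Monday", "Friday"],
})

# Restrict to users who actually saw the ad
ads_only = toy[toy["test group"] == "ad"]

# Conversion rate by the day on which most ads were seen
rate_by_day = ads_only.groupby("most ads day")["converted"].mean()

# Conversion rate by ad-frequency bucket (bucket edges are an assumption)
freq_bucket = pd.cut(
    ads_only["total ads"],
    bins=[0, 10, 50, float("inf")],
    labels=["1-10", "11-50", "51+"],
)
rate_by_frequency = ads_only.groupby(freq_bucket, observed=False)["converted"].mean()

print(rate_by_day)
print(rate_by_frequency)
```

    Unlike the randomized ad vs. psa comparison, these breakdowns are observational, so any pattern they show is a hypothesis for further testing rather than a causal result.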
   

In [1]:
# Import the required libraries
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import sys
import os

# Set the absolute path of the project root, used both for importing modules and for locating the data
project_root = os.path.abspath("..")
sys.path.append(project_root)

In [5]:
### 1. Load and explore the data

file_path = os.path.join(project_root, "data", "marketing_AB.csv")  # The file is expected in the "data" folder under the project root
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,user id,test group,converted,total ads,most ads day,most ads hour
0,1412735,ad,False,2,Saturday,18
1,1205011,ad,False,1,Saturday,19
2,1009581,ad,False,11,Thursday,16
3,1613192,ad,False,109,Friday,17
4,1524027,ad,False,31,Saturday,18


In [6]:
# The summary below shows that there is no missing data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11762 entries, 0 to 11761
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   user id        11762 non-null  int64 
 1   test group     11762 non-null  object
 2   converted      11762 non-null  bool  
 3   total ads      11762 non-null  int64 
 4   most ads day   11762 non-null  object
 5   most ads hour  11762 non-null  int64 
dtypes: bool(1), int64(3), object(2)
memory usage: 471.1+ KB


In [21]:
### 2. A/B Test Analysis: Ad vs PSA Group

# 2.1 Group summary

group_conversion = data.groupby('test group')['converted'].agg(['sum', 'count'])
group_conversion['conversion_rate'] = group_conversion['sum'] / group_conversion['count']
print("\nTest groups statistics:")
print("-" * 50)
print(group_conversion)


Test groups statistics:
--------------------------------------------------
            sum  count  conversion_rate
test group                             
ad          293  11298         0.025934
psa           5    464         0.010776


In [23]:
# 2.2 Statistical strength indicators. Z-Test and p-value for Proportions
# We could calculate the z-score for this table using the formula given in the problem description at the top of this notebook;
# however, the imported proportions_ztest function makes it easier.

z_stat, p_val = proportions_ztest(count=group_conversion['sum'], nobs=group_conversion['count'])
print("\nStatistical strength indicators:")
print("-" * 50)
print(f"Z-statistic: {z_stat:.2f}")
print(f"P-value: {p_val:.4f}")


Statistical strength indicators:
--------------------------------------------------
Z-statistic: 2.04
P-value: 0.0417
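
The p-value says whether the difference is detectable, not how large it is. As a rough complement, the sketch below computes the absolute difference, the relative lift, and an approximate 95% confidence interval from the counts in the group summary. It assumes a Wald (normal-approximation) interval with an unpooled standard error. With only 5 conversions in the control group that approximation is coarse, so the bounds should be read as indicative only.

```python
import math

# Counts taken from the group summary printed above
x_ad, n_ad = 293, 11298
x_psa, n_psa = 5, 464

p_ad = x_ad / n_ad
p_psa = x_psa / n_psa

# Absolute difference in conversion rate and lift relative to the control group
diff = p_ad - p_psa
relative_lift = diff / p_psa

# Unpooled standard error of the difference (used for the interval, not for the test)
se_unpooled = math.sqrt(p_ad * (1 - p_ad) / n_ad + p_psa * (1 - p_psa) / n_psa)

# 1.96 is the two-sided 95% critical value of the standard normal distribution
z_crit = 1.96
ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled

print(f"Absolute difference: {diff:.4f}")
print(f"Relative lift over control: {relative_lift:.1%}")
print(f"Approximate 95% CI for the difference: [{ci_low:.4f}, {ci_high:.4f}]")
```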


In [29]:
# Conclusion

print("\nConversion Rates:")
print("-" * 50)
print("Treatment group (ad): ", group_conversion['conversion_rate'].ad)
print("Control group (psa): ", group_conversion['conversion_rate'].psa)

print("\nConclusion:")
print("*" * 50)
if p_val < 0.05:
    print("RELIABLE: The difference in conversion rates for the treatment and control groups is statistically significant (p < 0.05).")
else:
    print("QUESTIONABLE:  The difference in conversion rates for the treatment and control groups is not statistically significant (p ≥ 0.05).")



Conversion Rates:
--------------------------------------------------
Treatment group (ad):  0.02593379359178616
Control group (psa):  0.010775862068965518

Conclusion:
**************************************************
RELIABLE: The difference in conversion rates for the treatment and control groups is statistically significant (p < 0.05).
