# A/B Testing Implementation in Python 📎
Use statistical tests (chi-square and t-test) to determine whether engagement (CTR and conversion) differs significantly between two email subject lines.<br/>

#### Created by: Felice Benita

#### Dataset: Online Learning Platform - Email Engagement Campaign


##### Context

You're working with an online learning platform that offers various courses and certifications. The goal of this A/B test is to increase user engagement by testing two types of email subject lines aimed at encouraging users to return to the platform and resume a course they previously enrolled in.

##### Objective of the A/B Test
Analyze which email subject line variant results in a higher:
<ul><li>Click-through rate (CTR)</li>
<li>Conversion rate (resuming the course)</li></ul>

##### Data Structure

User ID: Unique identifier for each user.<br/>
Age Group: Age bracket of the user (e.g., 18-25, 26-35), which could impact engagement levels.<br/>
Enrollment Type: Whether the user enrolled for a free trial or paid subscription.<br/>
Course Category: The category of the course enrolled in (e.g., Data Science, Marketing, Design).<br/>
Previous Engagement Level: Historical data indicating low, medium, or high engagement based on past actions.<br/>
Email Variant (A/B): The variant of the subject line used for each user (e.g., "Variant A: 'Complete Your Course in Record Time!' " or "Variant B: 'Your Learning Journey Awaits - Resume Now!'").<br/>
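Click-Through Rate (CTR): Per-user flag for whether the user clicked the email (0 for no, 1 for yes); the mean of this flag within a variant is that variant's CTR.<br/>
Conversion Rate: Per-user flag for whether the user resumed their course after opening the email (0 for no, 1 for yes); the mean of this flag within a variant is that variant's conversion rate.<br/>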

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing

In [2]:
# Reading data
df = pd.read_csv('D:/Pedoman/Portfolio/Data Science/A_B Testing/online_learning.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | User ID | Age Group | Enrollment Type | Course Category | Previous Engagement Level | Email Variant | Click Through Rate | Conversion Rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 46-55 | Paid Subscription | Design | Low | A | 1 | 0 |
| 1 | 1 | 2 | 56+ | Free Trial | Data Science | Low | B | 1 | 0 |
| 2 | 2 | 3 | 36-45 | Paid Subscription | Programming | Low | A | 1 | 1 |
| 3 | 3 | 4 | 56+ | Paid Subscription | Design | Medium | A | 0 | 0 |
| 4 | 4 | 5 | 56+ | Paid Subscription | Design | Low | B | 0 | 0 |
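
Both metrics are simply the mean of a 0/1 flag within each email variant. As a minimal sketch of that calculation, the snippet below uses only the five preview rows shown above and the standard library; the helper name `rates_by_variant` is hypothetical. On the full DataFrame, a `groupby` on `Email Variant` followed by a mean over the two flag columns would give the same quantities.

```python
from collections import defaultdict

# The five preview rows above, reduced to (variant, clicked, converted).
preview_rows = [
    ("A", 1, 0),
    ("B", 1, 0),
    ("A", 1, 1),
    ("A", 0, 0),
    ("B", 0, 0),
]


def rates_by_variant(rows):
    """Return {variant: (ctr, conversion_rate)} from 0/1 outcome flags."""
    totals = defaultdict(lambda: [0, 0, 0])  # users, clicks, conversions
    for variant, clicked, converted in rows:
        totals[variant][0] += 1
        totals[variant][1] += clicked
        totals[variant][2] += converted
    return {v: (clicks / n, convs / n) for v, (n, clicks, convs) in totals.items()}


rates = rates_by_variant(preview_rows)
for variant in sorted(rates):
    ctr, conv = rates[variant]
    print(f"Variant {variant}: CTR = {ctr:.3f}, conversion rate = {conv:.3f}")
```

Five rows are far too few to compare variants; the point is only to show what the two rates measure before any test is run.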


#### T-Test

In [3]:
from scipy.stats import ttest_ind

# Separate the data for each email variant
clicks_A = df[df['Email Variant'] == 'A']['Click Through Rate']
clicks_B = df[df['Email Variant'] == 'B']['Click Through Rate']

conversions_A = df[df['Email Variant'] == 'A']['Conversion Rate']
conversions_B = df[df['Email Variant'] == 'B']['Conversion Rate']

# Perform t-tests for click-through rate and conversion rate
t_stat_click, p_val_click = ttest_ind(clicks_A, clicks_B, equal_var=False)
t_stat_conversion, p_val_conversion = ttest_ind(conversions_A, conversions_B, equal_var=False)

# Display the results
print("A/B Testing Results Using T-Test:")
print(f"Click-Through Rate T-Test: t-statistic = {t_stat_click:.3f}, p-value = {p_val_click:.3f}")
print(f"Conversion Rate T-Test: t-statistic = {t_stat_conversion:.3f}, p-value = {p_val_conversion:.3f}")

# Interpretation
if p_val_click < 0.05:
    print("The difference in Click-Through Rates between variants A and B is statistically significant.")
else:
    print("The difference in Click-Through Rates between variants A and B is not statistically significant.")

if p_val_conversion < 0.05:
    print("The difference in Conversion Rates between variants A and B is statistically significant.")
else:
    print("The difference in Conversion Rates between variants A and B is not statistically significant.")

A/B Testing Results Using T-Test:
Click-Through Rate T-Test: t-statistic = -0.648, p-value = 0.517
Conversion Rate T-Test: t-statistic = -0.033, p-value = 0.973
The difference in Click-Through Rates between variants A and B is not statistically significant.
The difference in Conversion Rates between variants A and B is not statistically significant.
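
Because both outcomes are 0/1 flags, the Welch t-test above is effectively comparing two proportions. A pooled two-proportion z-test is a common cross-check and typically gives a very similar p-value at sample sizes in the hundreds. The sketch below uses only the standard library; the function name `two_proportion_ztest` and the counts passed to it are hypothetical and are not taken from the dataset.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only (not from the dataset)
z, p = two_proportion_ztest(successes_a=180, n_a=500, successes_b=200, n_b=500)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

To apply it to the notebook's data, the success counts would be the sums of the flag columns per variant and the sample sizes would be the row counts per variant.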


##### Slice data by user demographics (age & enrollment type) to see if certain groups respond better to one email variant over another.

In [4]:
from scipy.stats import ttest_ind

# Define demographic groups to analyze
demographic_groups = ['Age Group', 'Enrollment Type']
results = []

for group in demographic_groups:
    print(f"\nAnalyzing response rates by '{group}' and Email Variant\n")
    
    # Loop through each unique value in the demographic group
    for val in df[group].unique():
        # Filter data by the current demographic value
        subgroup = df[df[group] == val]
        
        # Separate data for each email variant within the demographic subgroup
        clicks_A = subgroup[subgroup['Email Variant'] == 'A']['Click Through Rate']
        clicks_B = subgroup[subgroup['Email Variant'] == 'B']['Click Through Rate']
        
        conversions_A = subgroup[subgroup['Email Variant'] == 'A']['Conversion Rate']
        conversions_B = subgroup[subgroup['Email Variant'] == 'B']['Conversion Rate']
        
        # Perform t-tests for click-through rate and conversion rate
        t_stat_click, p_val_click = ttest_ind(clicks_A, clicks_B, equal_var=False)
        t_stat_conversion, p_val_conversion = ttest_ind(conversions_A, conversions_B, equal_var=False)
        
        # Append results for each demographic value
        results.append({
            'Demographic Group': group,
            'Demographic Value': val,
            'Click-Through Rate T-Statistic': t_stat_click,
            'Click-Through Rate P-Value': p_val_click,
            'Conversion Rate T-Statistic': t_stat_conversion,
            'Conversion Rate P-Value': p_val_conversion
        })
        
        # Display results
        print(f"'{group}' = {val}")
        print(f"CTR T-Test: t-statistic = {t_stat_click:.3f}, p-value = {p_val_click:.3f}")
        print(f"Conversion Rate T-Test: t-statistic = {t_stat_conversion:.3f}, p-value = {p_val_conversion:.3f}")
        print("-" * 50)

# Convert results to DataFrame for easier analysis
results_df = pd.DataFrame(results)
print("\nSummary of t-test results for each demographic group:")
print(results_df)


Analyzing response rates by 'Age Group' and Email Variant

'Age Group' = 46-55
CTR T-Test: t-statistic = -0.289, p-value = 0.773
Conversion Rate T-Test: t-statistic = 1.310, p-value = 0.192
--------------------------------------------------
'Age Group' = 56+
CTR T-Test: t-statistic = 1.295, p-value = 0.197
Conversion Rate T-Test: t-statistic = 1.776, p-value = 0.077
--------------------------------------------------
'Age Group' = 36-45
CTR T-Test: t-statistic = 0.560, p-value = 0.576
Conversion Rate T-Test: t-statistic = -0.447, p-value = 0.656
--------------------------------------------------
'Age Group' = 26-35
CTR T-Test: t-statistic = -1.028, p-value = 0.305
Conversion Rate T-Test: t-statistic = -0.477, p-value = 0.634
--------------------------------------------------
'Age Group' = 18-25
CTR T-Test: t-statistic = -1.986, p-value = 0.048
Conversion Rate T-Test: t-statistic = -2.118, p-value = 0.035
--------------------------------------------------

Analyzing response rates by 'E
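
The subgroup loop runs many tests at once, so a few p-values below 0.05 (such as the two for the 18-25 group) are expected by chance alone. One hedged way to account for this is a Holm step-down adjustment. The sketch below applies it to the ten age-group p-values printed above; the helper `holm_adjust` is a hypothetical name, and the Enrollment Type results are left out because that part of the output is cut off.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Age-group p-values printed above: (CTR, conversion) for each subgroup
age_group_p = {
    "46-55": (0.773, 0.192),
    "56+": (0.197, 0.077),
    "36-45": (0.576, 0.656),
    "26-35": (0.305, 0.634),
    "18-25": (0.048, 0.035),
}
labels = [(group, metric) for group in age_group_p for metric in ("CTR", "Conversion")]
raw = [p for pair in age_group_p.values() for p in pair]
adjusted = holm_adjust(raw)

for (group, metric), p_raw, p_adj in zip(labels, raw, adjusted):
    print(f"{group:>5} {metric:<10} raw = {p_raw:.3f}  Holm = {p_adj:.3f}")
```

Under this adjustment none of the ten age-group tests stays below 0.05, so the 18-25 result is better read as a lead for a follow-up test than as a confirmed effect. Including the Enrollment Type tests in the same family would make the adjustment stricter still.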

#### Chi-square test

In [5]:
# Regenerate the dataset with a fixed seed for the chi-square analysis.
# Note: this rebinds `df`, replacing the frame loaded from the CSV above.
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Set random seed for reproducibility
np.random.seed(42)

# Parameters
num_users = 1000

# Generate user IDs
user_ids = np.arange(1, num_users + 1)

# Age groups
age_groups = np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], num_users)

# Enrollment type (free trial or paid subscription)
enrollment_types = np.random.choice(['Free Trial', 'Paid Subscription'], num_users, p=[0.6, 0.4])

# Course categories
course_categories = np.random.choice(['Data Science', 'Marketing', 'Design', 'Business', 'Programming'], num_users)

# Previous engagement level
engagement_levels = np.random.choice(['Low', 'Medium', 'High'], num_users, p=[0.5, 0.3, 0.2])

# Email variant (A or B)
email_variants = np.random.choice(['A', 'B'], num_users)

# Click flag: the click probability depends on the email variant only (0.35 for A, 0.40 for B)
click_through_rates = [
    1 if (email == 'A' and np.random.rand() < 0.35) or 
         (email == 'B' and np.random.rand() < 0.4) else 0
    for email in email_variants
]

# Conversion rate (if the user resumes the course after clicking)
conversion_rates = [
    1 if (click == 1 and engagement == 'High' and np.random.rand() < 0.5) or 
         (click == 1 and engagement != 'High' and np.random.rand() < 0.2) else 0
    for click, engagement in zip(click_through_rates, engagement_levels)
]

# Compile the data into a DataFrame
df = pd.DataFrame({
    'User ID': user_ids,
    'Age Group': age_groups,
    'Enrollment Type': enrollment_types,
    'Course Category': course_categories,
    'Previous Engagement Level': engagement_levels,
    'Email Variant': email_variants,
    'Click Through Rate': click_through_rates,
    'Conversion Rate': conversion_rates
})

# Summarize data for the A/B test
ab_summary = df.groupby('Email Variant').agg(
    total_clicks=('Click Through Rate', 'sum'),
    total_users=('User ID', 'count'),
    total_conversions=('Conversion Rate', 'sum')
).reset_index()

# Calculate click-through rate and conversion rate per email variant
ab_summary['CTR'] = ab_summary['total_clicks'] / ab_summary['total_users']
ab_summary['Conversion Rate'] = ab_summary['total_conversions'] / ab_summary['total_users']

# Create contingency table for click-through rates
click_contingency = pd.crosstab(df['Email Variant'], df['Click Through Rate'])

# Chi-square test of independence for click-through
# (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2_click, p_click, _, _ = chi2_contingency(click_contingency)

# Create contingency table for conversion rates
conversion_contingency = pd.crosstab(df['Email Variant'], df['Conversion Rate'])

# Chi-square test for independence to compare conversion rates
chi2_conversion, p_conversion, _, _ = chi2_contingency(conversion_contingency)

# Display the results
ab_summary[['Email Variant', 'CTR', 'Conversion Rate']].round(3), p_click, p_conversion

(  Email Variant    CTR  Conversion Rate
 0             A  0.379            0.094
 1             B  0.399            0.094,
 0.5593938977973092,
 1.0)
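
The conversion p-value of exactly 1.0 is most likely a side effect of the continuity correction rather than proof that the variants are identical: when every observed count is within 0.5 of its expected count, the corrected statistic collapses to zero. This is an inference from the output, not something the notebook states. The standard-library sketch below reproduces the effect on a hypothetical 2x2 table; `chi2_2x2` is an illustrative helper, not a SciPy function.

```python
import math


def chi2_2x2(table, correction=True):
    """Pearson chi-square test for a 2x2 table; returns (statistic, p_value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if correction:
                # Yates: shrink each |O - E| by 0.5, but never past zero
                diff -= min(0.5, diff)
            stat += diff ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows = variants A/B, columns = (not converted, converted)
table = [[453, 47], [454, 46]]
stat_corrected, p_corrected = chi2_2x2(table, correction=True)
stat_plain, p_plain = chi2_2x2(table, correction=False)
print(f"With correction:    chi2 = {stat_corrected:.4f}, p = {p_corrected:.4f}")
print(f"Without correction: chi2 = {stat_plain:.4f}, p = {p_plain:.4f}")
```

In the notebook itself, passing `correction=False` to `chi2_contingency` would give the uncorrected statistic for comparison.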

##### Slice data by user demographics (age & enrollment type) to see if certain groups respond better to one email variant over another.

In [6]:
import pandas as pd
from scipy.stats import chi2_contingency

# Grouping data by demographics (e.g., Age Group, Enrollment Type) and Email Variant
demographic_groups = ['Age Group', 'Enrollment Type']
results = []

for group in demographic_groups:
    print(f"\nAnalyzing response rates by '{group}' and Email Variant")
    
    # Group by demographic attribute and email variant
    demographic_summary = df.groupby([group, 'Email Variant']).agg(
        total_clicks=('Click Through Rate', 'sum'),
        total_users=('User ID', 'count'),
        total_conversions=('Conversion Rate', 'sum')
    ).reset_index()
    
    # Calculate CTR and Conversion Rate within each group and email variant
    demographic_summary['CTR'] = demographic_summary['total_clicks'] / demographic_summary['total_users']
    demographic_summary['Conversion Rate'] = demographic_summary['total_conversions'] / demographic_summary['total_users']
    print(demographic_summary[['Email Variant', group, 'CTR', 'Conversion Rate']])

    # Build click and conversion contingency tables for each subgroup
    for val in df[group].unique():
        subgroup = df[df[group] == val]  # filter once, reuse for both tables
        click_contingency = pd.crosstab(subgroup['Email Variant'],
                                        subgroup['Click Through Rate'])
        conversion_contingency = pd.crosstab(subgroup['Email Variant'],
                                             subgroup['Conversion Rate'])
        
        # Chi-square tests
        chi2_click, p_click, _, _ = chi2_contingency(click_contingency)
        chi2_conversion, p_conversion, _, _ = chi2_contingency(conversion_contingency)
        
        # Collect and display results
        results.append({
            'Demographic Group': group,
            'Demographic Value': val,
            'CTR P-Value': p_click,
            'Conversion Rate P-Value': p_conversion
        })
        
        print(f"\n'{group}' = {val}")
        print(f"CTR p-value: {p_click}, Conversion Rate p-value: {p_conversion}")

# Display the summarized test results
results_df = pd.DataFrame(results)
print("\nSummary of p-values for each demographic group and variant:")
print(results_df)


Analyzing response rates by 'Age Group' and Email Variant
  Email Variant Age Group       CTR  Conversion Rate
0             A     18-25  0.311828         0.064516
1             B     18-25  0.444444         0.153846
2             A     26-35  0.360000         0.070000
3             B     26-35  0.433333         0.088889
4             A     36-45  0.372340         0.085106
5             B     36-45  0.333333         0.104167
6             A     46-55  0.402062         0.092784
7             B     46-55  0.422018         0.045872
8             A       56+  0.439252         0.149533
9             B       56+  0.350515         0.072165

'Age Group' = 46-55
CTR p-value: 0.8818321970598999, Conversion Rate p-value: 0.2900257388380034

'Age Group' = 56+
CTR p-value: 0.25000392430640644, Conversion Rate p-value: 0.12771287863391959

'Age Group' = 36-45
CTR p-value: 0.6812261376078861, Conversion Rate p-value: 0.8408535898242047

'Age Group' = 26-35
CTR p-value: 0.37672678475838584, Conversio
