<a href="https://colab.research.google.com/github/datascience-uniandes/business-case/blob/master/business-case.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Business Case (EDA, KPIs, A/B testing)

MINE-4101: Applied Data Science  
Universidad de los Andes  

Last update: September, 2024

## 1. Business scenario

An e-commerce company aims to improve its product recommendation system to boost sales and enhance customer engagement. The company has developed a new machine learning model and wants to test if it performs better than the existing model. To evaluate the new model's impact, they plan to conduct an A/B test, measuring key performance indicators (KPIs) on a daily basis over a month.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import ttest_ind, chi2_contingency

In [None]:
import warnings
warnings.filterwarnings('ignore')

## 2. The dataset

Let's create a simulated dataset that includes user interactions, assigned to either the control group (**existing model**) or the treatment group (**new model**). The dataset covers daily data over 30 days, including variables like `user_id`, `date`, `event_type` (view, click, purchase), `group` (control or treatment), and `revenue`.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Simulation parameters
num_days = 30
dates = pd.date_range(start='2024-01-01', periods=num_days)
num_users = 3000  # Total unique users
users = np.arange(1000, 1000 + num_users)

events_per_day = lambda: np.random.randint(150, 200, size=1)[0]  # Number of events per day per group

# If True, both groups draw purchase revenue from r_revenue_global;
# if False, each group uses its own revenue range defined below
revenue_global = True
r_revenue_global = (20, 200)

# Probabilities for control group
p_click_control = 0.15           # 15% chance to click after view
p_purchase_control = 0.05        # 5% chance to purchase after click
r_revenue_control = (200, 300)   # Revenue per purchase between 200 and 300 (used only if revenue_global is False)

# Probabilities for treatment group
p_click_treatment = 0.25         # 25% chance to click after view
p_purchase_treatment = 0.10      # 10% chance to purchase after click
r_revenue_treatment = (20, 100)  # Revenue per purchase between 20 and 100 (used only if revenue_global is False)

In [None]:
data_list = []
for date in dates:
    for group in ['control', 'treatment']:
        # Set group parameters
        if group == 'control':
            p_click = p_click_control
            p_purchase = p_purchase_control
            r_revenue = r_revenue_control
        else:
            p_click = p_click_treatment
            p_purchase = p_purchase_treatment
            r_revenue = r_revenue_treatment

        if revenue_global:
            r_revenue = r_revenue_global
        
        # Generate 'view' events
        num_views = events_per_day()
        user_ids = np.random.choice(users, size=num_views, replace=True)
        event_types = np.array(['view'] * num_views, dtype='U10')

        # Determine which 'view' events turn into 'clicks'
        click_mask = np.random.rand(num_views) < p_click
        event_types[click_mask] = 'click'

        # For 'click' events, simulate 'purchase' events
        purchase_mask = click_mask & (np.random.rand(num_views) < p_purchase)
        event_types[purchase_mask] = 'purchase'

        # Assign revenue to 'purchase' events
        revenues = np.zeros(num_views)
        num_purchases = purchase_mask.sum()
        revenues[purchase_mask] = np.random.uniform(r_revenue[0], r_revenue[1], size=num_purchases)

        # Create DataFrame for current group and date
        day_data = pd.DataFrame({
            'user_id': user_ids,
            'date': date,
            'event_type': event_types,
            'group': group,
            'revenue': revenues
        })

        data_list.append(day_data)

# Combine all daily data into a single DataFrame
data = pd.concat(data_list, ignore_index=True)

## 3. Exploratory Data Analysis (EDA)

Minimal validations:
- **Data Preview:** Get an initial understanding of the data structure.
- **Missing Values:** Check for and handle missing data.
- **Event Distribution:** Visualize how events are distributed between control and treatment groups.
- **Daily Events:** Analyze the volume of events over time.

In [None]:
data.sample(5)

In [None]:
# Check for missing values
data.isnull().sum()

In [None]:
# Event type distribution
sns.countplot(x='event_type', hue='group', data=data)
plt.title('Event type distribution by group')
plt.xlabel('Event type')
plt.ylabel('Number of events')
plt.show()

In [None]:
# Daily events
daily_events = data.groupby(['date', 'group'])['event_type'].count().unstack()
ax = daily_events.plot(kind='bar', stacked=False, figsize=(12,6))
ax.set_xticks(range(len(daily_events.index)))
ax.set_xticklabels(daily_events.index.strftime('%Y-%m-%d'), rotation=45)
plt.title('Daily events by group')
plt.xlabel('Date')
plt.ylabel('Number of events')
plt.tight_layout()
plt.show()

## 4. KPIs of interest

- **Conversion Rate (CVR):** Measures the percentage of visits that result in a purchase, indicating the effectiveness of the recommendation system in driving sales. In this dataset, each recommendation view counts as one visit.

\begin{equation}
CVR = \left(\frac{\text{Number of purchases}}{\text{Number of visits}}\right) \times 100\%
\end{equation}

- **Average Order Value (AOV):** Represents the average revenue per purchase, reflecting customer spending behavior.

\begin{equation}
AOV = \frac{\text{Total revenue}}{\text{Number of purchases}}
\end{equation}

- **Click-Through Rate (CTR):** Indicates how often users click on recommendations, showing engagement levels.

\begin{equation}
CTR = \left(\frac{\text{Number of clicks}}{\text{Number of views}}\right) \times 100\%
\end{equation}

- **Daily Revenue:** Total revenue generated each day, essential for assessing the financial impact.
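
As a quick worked example of the three ratio formulas above, the sketch below computes CVR, AOV, and CTR from raw counts. The helper name `compute_kpis` and the input numbers are illustrative assumptions; they are not taken from the simulated dataset.

```python
def compute_kpis(views, clicks, purchases, revenue):
    """Return CVR (%), AOV, and CTR (%) from raw counts (hypothetical helper)."""
    # Guard against empty days: no views means no meaningful rate
    cvr = purchases / views * 100 if views else 0.0
    ctr = clicks / views * 100 if views else 0.0
    # AOV is undefined when there are no purchases
    aov = revenue / purchases if purchases else float('nan')
    return {'CVR': cvr, 'AOV': aov, 'CTR': ctr}


# Made-up numbers: 200 views, 40 clicks, 5 purchases, 600 in revenue
example = compute_kpis(views=200, clicks=40, purchases=5, revenue=600.0)
print(example)
```

With these inputs, 5 purchases out of 200 views give a CVR of about 2.5%, 40 clicks out of 200 views give a CTR of about 20%, and 600 in revenue over 5 purchases gives an AOV of 120.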

### Why these KPIs matter

- **Business alignment:** These KPIs directly relate to sales and user engagement, critical for business growth.
- **Measurable impact:** They provide quantifiable metrics to evaluate the performance of the new model.
- **Decision-making:** They help determine whether the new model should replace the existing one based on data-driven insights.

## 5. Experimentation process

### Hypothesis formulation

- **Null Hypothesis (H₀):** The new model has no effect: each KPI has the same expected value in the control and treatment groups.
- **Alternative Hypothesis (H₁):** The new model improves the KPIs: the expected value of each KPI is higher in the treatment group than in the control group.
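
Because H₁ is directional (treatment better than control), one way to encode it is a one-sided test. The sketch below is only an illustration under stated assumptions: the significance level `alpha = 0.05` is assumed (the notebook does not fix one), the KPI values are made up, and Welch's variant (`equal_var=False`) is a choice made here, not necessarily the one used in the cells that follow, which rely on SciPy's two-sided defaults.

```python
import numpy as np
from scipy.stats import ttest_ind

alpha = 0.05  # Assumed significance level; not specified in the notebook

# Illustrative daily KPI values (made up, not from the simulated dataset)
rng = np.random.default_rng(0)
control_kpi = rng.normal(loc=15.0, scale=2.0, size=30)
treatment_kpi = rng.normal(loc=25.0, scale=2.0, size=30)

# alternative='greater' tests whether the mean of the first sample (treatment)
# is greater than the mean of the second sample (control)
t_stat, p_value = ttest_ind(treatment_kpi, control_kpi,
                            equal_var=False, alternative='greater')

# Decision rule: reject H0 when the p-value falls below alpha
reject_h0 = p_value < alpha
print(f'T-statistic: {t_stat:.4f}, p-value: {p_value:.4g}, reject H0: {reject_h0}')
```

The `alternative` argument requires SciPy 1.6 or later.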

In [None]:
# Aggregate daily funnel counts and revenue for each group
grouped_data = data.groupby(['group', 'date'])

# Each row is one recommendation view, and 'event_type' stores the furthest
# funnel stage it reached. Every row therefore counts as a view, and every
# purchase also counts as a click.
kpi_data = grouped_data.apply(lambda x: pd.Series({
    'views': len(x),
    'clicks': np.sum(x['event_type'].isin(['click', 'purchase'])),
    'purchases': np.sum(x['event_type'] == 'purchase'),
    'revenue': x['revenue'].sum()
})).reset_index()

# Derive the daily KPIs from the aggregates
kpi_data['CVR'] = (kpi_data['purchases'] / kpi_data['views']) * 100
kpi_data['AOV'] = kpi_data['revenue'] / kpi_data['purchases']
kpi_data['CTR'] = (kpi_data['clicks'] / kpi_data['views']) * 100

# Handle divisions by zero: days without purchases get AOV = 0
kpi_data.replace([np.inf, -np.inf], np.nan, inplace=True)
kpi_data.fillna(0, inplace=True)

In [None]:
kpi_data.sample(5)

In [None]:
# Separate data by group
control_data = kpi_data[kpi_data['group'] == 'control']
treatment_data = kpi_data[kpi_data['group'] == 'treatment']

In [None]:
# Plot Average Order Value over time
plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='AOV', hue='group', data=kpi_data)
plt.title('Daily Average Order Value (AOV) by group')
plt.xlabel('Date')
plt.ylabel('AOV')
plt.xticks(rotation=45)
plt.show()

In [None]:
# t-test for AOV
t_stat_aov, p_value_aov = ttest_ind(control_data['AOV'], treatment_data['AOV'])
print('AOV t-test:')
print(f'T-statistic: {t_stat_aov:.4f}')
print(f'P-value: {p_value_aov:.4f}')
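
The t-test above compares daily AOV values, and days without purchases enter the comparison with an AOV of 0, which can pull both group means down. A possible alternative, sketched here under assumptions, is to test per-order revenue directly with Welch's t-test. The helper `aov_welch_test` and the demo frame are hypothetical; the demo only mimics the column names of `data`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def aov_welch_test(events):
    """Welch's t-test on per-order revenue, control vs. treatment (hypothetical helper)."""
    # Keep only rows that ended in a purchase, so zero-revenue rows are excluded
    orders = events[events['event_type'] == 'purchase']
    control = orders.loc[orders['group'] == 'control', 'revenue']
    treatment = orders.loc[orders['group'] == 'treatment', 'revenue']
    # equal_var=False selects Welch's test, which does not assume equal variances
    return ttest_ind(control, treatment, equal_var=False)


# Small made-up frame with the same column names as `data`
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'event_type': ['purchase'] * 40 + ['view'] * 10,
    'group': ['control'] * 20 + ['treatment'] * 20 + ['control'] * 10,
    'revenue': np.concatenate([rng.uniform(20, 200, 40), np.zeros(10)]),
})

t_stat, p_value = aov_welch_test(demo)
print(f'T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}')
```

In this notebook, `aov_welch_test(data)` would apply the same test to the simulated events.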

In [None]:
# Calculate contingency table
cr_table = pd.DataFrame({
    'Group': ['Control', 'Treatment'],
    'Purchases': [control_data['purchases'].sum(), treatment_data['purchases'].sum()],
    'Non-Purchases': [control_data['views'].sum() - control_data['purchases'].sum(),
                      treatment_data['views'].sum() - treatment_data['purchases'].sum()]
})

In [None]:
cr_table

In [None]:
# Chi-Square test for CVR
chi2_cr, p_value_cr, dof, expected = chi2_contingency(cr_table[['Purchases', 'Non-Purchases']])
print('CVR chi-squared:')
print(f'Chi-squared: {chi2_cr:.4f}')
print(f'P-value: {p_value_cr:.4f}')
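
The chi-squared test indicates whether the conversion rates differ, but not by how much. As a complement, the sketch below computes the difference in CVR with a Wald (normal approximation) confidence interval. The helper `cvr_difference_ci`, the 95% confidence level, and the counts are assumptions for illustration, not results from the simulation.

```python
import numpy as np
from scipy.stats import norm


def cvr_difference_ci(purchases_c, visits_c, purchases_t, visits_t, confidence=0.95):
    """Wald confidence interval for CVR_treatment - CVR_control (hypothetical helper)."""
    p_c = purchases_c / visits_c
    p_t = purchases_t / visits_t
    diff = p_t - p_c
    # Unpooled standard error of the difference between two proportions
    se = np.sqrt(p_c * (1 - p_c) / visits_c + p_t * (1 - p_t) / visits_t)
    # Two-sided critical value of the standard normal distribution
    z = norm.ppf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Made-up counts, not results from the simulated dataset
diff, low, high = cvr_difference_ci(purchases_c=40, visits_c=5000,
                                    purchases_t=125, visits_t=5000)
print(f'CVR difference: {diff * 100:.2f} pp '
      f'(95% CI: {low * 100:.2f} to {high * 100:.2f} pp)')
```

With the contingency table built above, the inputs would be the `Purchases` column and the row totals (`Purchases` plus `Non-Purchases`) for each group. An interval that excludes 0 points in the same direction as a small chi-squared p-value.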

## 6. Results analysis

In [None]:
# Plot Conversion Rate over time
plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='CVR', hue='group', data=kpi_data)
plt.title('Daily Conversion Rate (CVR) by group')
plt.xlabel('Date')
plt.ylabel('CVR (%)')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Plot Daily Revenue over time
plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='revenue', hue='group', data=kpi_data)
plt.title('Daily Revenue by group')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Ensure 'kpi_data' DataFrame is sorted by 'group' and 'date'
kpi_data_sorted = kpi_data.sort_values(by=['group', 'date'])

# Calculate cumulative revenue for each group
kpi_data_sorted['cumulative_revenue'] = kpi_data_sorted.groupby('group')['revenue'].cumsum()

# Plot cumulative daily revenue by group
plt.figure(figsize=(12, 6))

sns.lineplot(
    x='date',
    y='cumulative_revenue',
    hue='group',
    data=kpi_data_sorted,
    marker='o'
)

plt.title('Cumulative Daily Revenue by Group')
plt.xlabel('Date')
plt.ylabel('Cumulative Revenue')
plt.xticks(rotation=45)
plt.legend(title='Group')
plt.tight_layout()
plt.show()