As a Data Scientist at a leading online travel agency, you’ve been tasked with evaluating the impact of a new search ranking algorithm designed to improve conversion rates. The Product team is considering a full rollout, but only if the experiment shows a clear positive effect on the conversion rate and does not lead to a longer time to book.

They have shared A/B test datasets with session-level booking data (`"sessions_data.csv"`) and user-level control/variant split (`"users_data.csv"`). Your job is to analyze and interpret the results to determine whether the new ranking system delivers a statistically significant improvement and provide a clear, data-driven recommendation.

## `sessions_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `session_id` | `string` | Unique session identifier (unique for each row) |
| `user_id` | `string` | Unique user identifier (non logged-in users have missing user_id values; each user can have multiple sessions) |
| `session_start_timestamp` | `string` | When a session started |
| `booking_timestamp` | `string` | When a booking was made (missing if no booking was made during a session) |
| `time_to_booking` | `float` | Time from the start of the session to booking, in minutes (missing if no booking was made during a session) |
| `conversion` | `integer` | _New column to create:_ whether the session ended with a booking (0 if `booking_timestamp` or `time_to_booking` is null, otherwise 1) |

<br>

## `users_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `user_id` | `string` | Unique user identifier (only logged-in users in this table) |
| `experiment_group` | `string` | control / variant split for the experiment (expected to be equal 50/50) |

<br>

The criteria for going full on (a full rollout) are the following:
- Primary metric (`conversion`): the effect must be statistically significant and positive (an increase).
- Guardrail (`time_to_booking`): the effect must either be statistically insignificant or positive (a decrease).

In [2]:
import pandas as pd
from scipy.stats import chisquare
from pingouin import ttest
from statsmodels.stats.proportion import proportions_ztest

## 1. Loading and merging the data

- Join `"sessions_data.csv"` and `"users_data.csv"` into a new dataframe `sessions_x_users`.

In [3]:
# LOAD DATA
users = pd.read_csv('users_data.csv') # Load user and experiment group data
sessions = pd.read_csv('sessions_data.csv') # Load session/booking data

In [4]:
# JOIN DATA
# Merge on user ID to enrich sessions with user experiment group
sessions_x_users = sessions.merge(users, on = 'user_id', how = 'inner')
sessions_x_users.head()

| | session_id | user_id | session_start_timestamp | booking_timestamp | time_to_booking | experiment_group |
|---|------------|---------|-------------------------|-------------------|-----------------|------------------|
| 0 | CP0lbAGnb5UNi3Ut | TcCIMrtQ75wHGXVj | 2025-01-26 20:02:39.177358627 | NaN | NaN | variant |
| 1 | UQAjrPYair63L1p8 | TcCIMrtQ75wHGXVj | 2025-01-20 16:12:51.536912203 | NaN | NaN | variant |
| 2 | 9zQrAPxV5oi2SzSa | TcCIMrtQ75wHGXVj | 2025-01-28 03:46:40.839362144 | NaN | NaN | variant |
| 3 | kkrz1M5vxrQ8wXRZ | GUGVzto9KGqeX3dc | 2025-01-25 02:48:50.953303099 | NaN | NaN | variant |
| 4 | ABZZFrwItZAPdYGP | v2EBIHmOdQfalI6k | 2025-01-11 11:41:36.912253618 | NaN | NaN | variant |
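
The inner join above keeps only sessions whose `user_id` appears in `users_data.csv`, so sessions from non-logged-in users are dropped without warning. A minimal sketch of how one might count those rows, assuming made-up toy frames (`toy_sessions`, `toy_users`) and a hypothetical `merge_check` variable rather than the real datasets:

```python
import pandas as pd

# Toy stand-ins for the real files (values are made up for illustration)
toy_sessions = pd.DataFrame({
    'session_id': ['s1', 's2', 's3', 's4'],
    'user_id': ['u1', 'u1', None, 'u2'],  # s3 is an anonymous session
})
toy_users = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'experiment_group': ['control', 'variant'],
})

# A left join with indicator=True labels each session as 'both' or 'left_only'
merge_check = toy_sessions.merge(toy_users, on='user_id', how='left', indicator=True)

# Sessions marked 'left_only' have no experiment group and would vanish in an inner join
dropped_by_inner_join = int((merge_check['_merge'] == 'left_only').sum())
print(f'Sessions an inner join would drop: {dropped_by_inner_join}')
```

Running the same check on the real data would show how much traffic the analysis excludes, which is worth stating alongside the results.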


## 2. Creating primary metric

- Create a new column `conversion` as per the requirements specified in the workbook.
- Set it to 1 if `booking_timestamp` is not missing, and to 0 otherwise.

In [5]:
# COMPUTE PRIMARY METRIC
# Binary conversion flag: 1 if booking occurred, 0 otherwise
sessions_x_users['conversion'] = sessions_x_users['booking_timestamp'].notnull().astype(int)
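
The data dictionary defines `conversion` as 0 when either `booking_timestamp` or `time_to_booking` is missing, while the cell above checks `booking_timestamp` only. The two definitions agree only if both columns are always missing together. A minimal sketch, on a made-up toy frame, of the stricter flag and a check for rows where exactly one field is missing:

```python
import pandas as pd

# Toy data (made up): one booked session and two sessions without a booking
toy = pd.DataFrame({
    'booking_timestamp': ['2025-01-26 20:15:00', None, None],
    'time_to_booking': [12.5, None, None],
})

# Stricter flag: 1 only when both booking fields are present
booked = toy[['booking_timestamp', 'time_to_booking']].notnull().all(axis=1)
toy['conversion'] = booked.astype(int)

# Count rows where exactly one of the two fields is missing
mismatched = int((toy['booking_timestamp'].isnull() != toy['time_to_booking'].isnull()).sum())
print(toy['conversion'].tolist(), mismatched)
```

If `mismatched` is 0 on the real data, the simpler single-column flag gives the same result.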

## 3. Sample Ratio Mismatch (SRM) test 

- Perform a Chi-squared test to check for SRM in control and variant groups (the split is expected to be equal 50/50).

In [6]:
# SAMPLE RATIO MISMATCH TEST
# Count sessions per experiment group to check that the split is balanced (a basic sanity check)
groups_count = sessions_x_users['experiment_group'].value_counts()
print(groups_count)

experiment_group
variant    7653
control    7630
Name: count, dtype: int64
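
The counts above are sessions per group, because `sessions_x_users` has one row per session. Since the split is defined per user in `users_data.csv`, it may also be worth comparing the number of distinct users per group. A minimal sketch on made-up toy data, where one heavy user skews the session counts but not the user counts:

```python
import pandas as pd

# Toy data (made up): user u1 has three sessions, u2 and u3 have one each
toy = pd.DataFrame({
    'session_id': ['s1', 's2', 's3', 's4', 's5'],
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u3'],
    'experiment_group': ['variant', 'variant', 'variant', 'control', 'control'],
})

# Session-level counts (what value_counts on the joined frame returns)
session_counts = toy['experiment_group'].value_counts()

# User-level counts: distinct users per group
user_counts = toy.groupby('experiment_group')['user_id'].nunique()

print(session_counts.to_dict(), user_counts.to_dict())
```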


In [7]:
n = sessions_x_users.shape[0] # Total sample size
srm_chi2_stat, srm_chi2_pval = chisquare(f_obs = groups_count, f_exp = (n/2, n/2))
srm_chi2_pval = round(srm_chi2_pval, 4)
print(f'\nSRM\np-value: {srm_chi2_pval}') # If p < alpha, there is likely a sample ratio mismatch


SRM
p-value: 0.8524
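
As a cross-check, the same Chi-squared test can be worked by hand from the session counts printed earlier (7653 variant, 7630 control). This sketch uses only the standard library and relies on the fact that, with one degree of freedom, the Chi-squared tail probability equals `erfc(sqrt(chi2 / 2))`:

```python
import math

# Session counts per group, taken from the output above
observed = {'variant': 7653, 'control': 7630}
n = sum(observed.values())
expected = n / 2  # an equal 50/50 split is expected

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum((count - expected) ** 2 / expected for count in observed.values())

# Two groups give one degree of freedom, so the tail probability has a closed form
p_value = math.erfc(math.sqrt(chi2_stat / 2))
print(f'chi2 = {chi2_stat:.4f}, p-value = {p_value:.4f}')
```

The statistic is small (about 0.035) and the p-value is about 0.85, in line with the `chisquare` result, so there is no evidence of a mismatch at the session level.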


## 4. Effect analysis on primary metric - `conversion`

- Run the appropriate test (Z-test or T-test) for the type of the `conversion` metric to assess the significance of the effect. Because `conversion` is a binary metric, a two-proportion Z-test is used here.

In [8]:
# EFFECT ANALYSIS - PRIMARY METRIC
# Compute success counts and sample sizes for each group
success_counts = sessions_x_users.groupby('experiment_group')['conversion'].sum().loc[['control', 'variant']]

sample_sizes = sessions_x_users['experiment_group'].value_counts().loc[['control', 'variant']]

In [9]:
# Run Z-test for proportions (binary conversion metric)
zstat_primary, pval_primary = proportions_ztest(
    success_counts,
    sample_sizes,
    alternative = 'two-sided',
)

pval_primary = round(pval_primary, 4)
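
To make the test above less of a black box, here is a minimal sketch of a pooled two-proportion Z-test, which is what `proportions_ztest` computes with its default settings as far as I understand them. The function name `two_proportion_ztest` and the counts are hypothetical, not taken from the experiment:

```python
import math
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Pooled two-sided z-test for the difference p_a - p_b."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: control converts 100/1000, variant converts 130/1000
z, p = two_proportion_ztest(100, 1000, 130, 1000)
print(f'z = {z:.3f}, p-value = {p:.4f}')
```

With control passed first, the statistic is negative when the variant converts better, so the sign of the Z statistic alone does not say which group won. The direction is read from the effect size computed later.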

## 5. Effect analysis on the guardrail metric - `time_to_booking`

- Run the appropriate test for the type of the `time_to_booking` metric to assess the significance of the effect. Because `time_to_booking` is a continuous metric, a T-test is used here.

In [10]:
# EFFECT ANALYSIS - GUARDRAIL METRIC
# T-test on time to booking for control vs variant
stats_guardrail = ttest(
    sessions_x_users.loc[(sessions_x_users['experiment_group'] == 'control'), 'time_to_booking'],
    sessions_x_users.loc[(sessions_x_users['experiment_group'] == 'variant'), 'time_to_booking'],
    alternative='two-sided',
)

pval_guardrail, tstat_guardrail = stats_guardrail['p-val'].values[0], stats_guardrail['T'].values[0]
pval_guardrail = round(pval_guardrail, 4)
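
`time_to_booking` is missing for sessions without a booking, so the comparison above effectively covers booked sessions only (pingouin's `ttest` is documented to drop missing values). When group sizes differ, its default `correction='auto'` should apply Welch's correction. The sketch below computes Welch's statistic and degrees of freedom by hand on made-up samples; `welch_t` is a hypothetical helper, not part of any library:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Drop missing values, mirroring sessions that have no booking
    a = [x for x in sample_a if not math.isnan(x)]
    b = [x for x in sample_b if not math.isnan(x)]
    # Squared standard error of each group mean
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up minutes to booking; NaN marks sessions without a booking
control = [10.0, 12.0, float('nan'), 14.0, 16.0]
variant = [9.0, float('nan'), 11.0, 13.0]

t_stat, dof = welch_t(control, variant)
print(f't = {t_stat:.4f}, df = {dof:.2f}')
```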

## 6. Estimate the effect sizes on primary and guardrail metrics 

Calculate the ATE (Average Treatment Effect), expressed here as the average relative effect size on `conversion` and `time_to_booking`.

The formula is: 
`effect_size = avg(variant) / avg(control) - 1`
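
A short worked example of the formula, using hypothetical group averages (not the experiment's values) and a hypothetical helper `relative_effect`. It shows how the sign reads for each metric: a positive value is good for `conversion`, while a negative value is good for `time_to_booking`.

```python
def relative_effect(avg_variant: float, avg_control: float) -> float:
    """Relative effect size: avg(variant) / avg(control) - 1."""
    return avg_variant / avg_control - 1

# Hypothetical conversion rates: 12% in variant vs 10% in control
conversion_lift = relative_effect(0.12, 0.10)

# Hypothetical average minutes to booking: 14.7 in variant vs 15.0 in control
time_change = relative_effect(14.7, 15.0)

print(f'conversion: {conversion_lift:+.2%} | time_to_booking: {time_change:+.2%}')
```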

In [11]:
# DEFINE FUNCTIONS
def estimate_effect_size(df: pd.DataFrame, metric: str) -> float:
    """
    Calculate relative effect size

    Parameters:
    - df (pd.DataFrame): data with experiment_group ('control', 'variant') and metric columns.
    - metric (str): name of the metric column

    Returns:
    - effect_size (float): average treatment effect (effect size)
    """
    avg_metric_per_group = df.groupby('experiment_group')[metric].mean()
    effect_size = avg_metric_per_group['variant'] / avg_metric_per_group['control'] - 1
    return effect_size

In [12]:
# Estimate effect size for the conversion metric
effect_size_primary = estimate_effect_size(sessions_x_users, 'conversion')
effect_size_primary = round(effect_size_primary, 4)
print(f'\nPrimary metric\np-value: {pval_primary: .4f} | effect size: {effect_size_primary: .4f}')


Primary metric
p-value:  0.0002 | effect size:  0.1422


In [13]:
# Estimate effect size for the guardrail metric
effect_size_guardrail = estimate_effect_size(sessions_x_users, 'time_to_booking')
effect_size_guardrail = round(effect_size_guardrail, 4)
print(f'\nGuardrail\np-value: {pval_guardrail} | effect size: {effect_size_guardrail}')


Guardrail
p-value: 0.5365 | effect size: -0.0079


## 7. Making decisions

Decide whether to go full on or pull back. The criteria are the following: 
- Primary metric (`conversion`) effect must be statistically significant and show positive effect (increase).
- Guardrail (`time_to_booking`) effect must either be statistically insignificant or show positive effect (decrease).

In [14]:
confidence_level = 0.90  # Set the pre-defined confidence level (90%)
alpha = 1 - confidence_level  # Significance level for hypothesis tests

In [16]:
# DECISION
# Primary metric must be statistically significant and show positive effect (increase)
criteria_full_on_primary = (pval_primary < alpha) & (effect_size_primary > 0)
# Guardrail must either be statistically insignificant or show positive effect (decrease)
criteria_full_on_guardrail = (pval_guardrail > alpha) | (effect_size_guardrail <= 0)

In [18]:
# Final launch decision based on both metrics
if criteria_full_on_primary and criteria_full_on_guardrail:
    decision_full_on = 'Yes'
    print('\nThe experiment results are significantly positive and the guardrail metric was not harmed, we are going full on!')
else:
    decision_full_on = 'No'
    print('\nThe experiment results are inconclusive or the guardrail metric was harmed, we are pulling back!')


The experiment results are significantly positive and the guardrail metric was not harmed, we are going full on!
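
For reuse across experiments, the decision rule can be wrapped in a small pure function. This is a sketch rather than part of the original workbook: `full_on_decision` is a hypothetical name, and the first call plugs in the p-values and effect sizes printed above with the 90% confidence level.

```python
def full_on_decision(pval_primary, effect_primary, pval_guardrail, effect_guardrail, alpha=0.10):
    """Return True when both launch criteria are met."""
    # Primary metric: significant and positive (an increase)
    primary_ok = pval_primary < alpha and effect_primary > 0
    # Guardrail: insignificant, or moving in the good direction (a decrease)
    guardrail_ok = pval_guardrail > alpha or effect_guardrail <= 0
    return primary_ok and guardrail_ok

# Values reported in this analysis
decision = full_on_decision(0.0002, 0.1422, 0.5365, -0.0079)

# Hypothetical counterexample: bookings become significantly slower
blocked = full_on_decision(0.0002, 0.1422, 0.01, 0.05)

print(decision, blocked)
```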
