![ab_testing_image](ab_testing_image.jpg)

# A/B Testing Analysis of New Search Ranking Algorithm

## Project Overview

The purpose of this experiment is to evaluate the impact of a new search ranking algorithm designed to improve booking conversion rates for an online travel agency. The Product team wants to ensure the new algorithm delivers a statistically significant uplift in conversion without negatively impacting the booking speed.

Primary Objective:
Improve the booking conversion rate (percentage of sessions resulting in a booking).

This project involves analyzing experimental data from an A/B test, comparing a control group (existing ranking) to a variant group (new ranking), and making a data-driven recommendation for rollout.

---

## Data

- **sessions_data.csv**: Session-level data including session IDs, user IDs, session start time, booking timestamps, and time to booking.
- **users_data.csv**: User-level data with logged-in users and their assignment to control or variant groups.

---

## Objectives

- Prepare and merge datasets properly, ensuring only logged-in users are included.
- Create a primary metric — session-level conversion (booking or no booking).
- Conduct sanity checks (Sample Ratio Mismatch test) to validate randomization.
- Analyze the primary metric (conversion) for statistically significant uplift using Z-test.
- Analyze guardrail metric (time to booking) using t-test to ensure no negative impact.
- Calculate effect sizes to quantify improvements.
- Provide a clear recommendation to either fully roll out the new search ranking or pull back based on the experiment results.

---

## Tools & Methods

- Python (pandas, scipy)
- Statistical hypothesis testing (Z-test, t-test, Chi-square test)
- Effect size calculations

---




## `sessions_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `session_id` | `string` | Unique session identifier (unique for each row) |
| `user_id` | `string` | Unique user identifier (non logged-in users have missing user_id values; each user can have multiple sessions) |
| `session_start_timestamp` | `string` | When a session started |
| `booking_timestamp` | `string` | When a booking was made (missing if no booking was made during a session) |
| `time_to_booking` | `float` | time from start of the session to booking, in minutes (missing if no booking was made during a session) |
| `conversion` | `integer` | _New column to create:_ did session end up with a booking (0 if booking_timestamp or time_to_booking is Null, otherwise 1) |

<br>

## `users_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `user_id` | `string` | Unique user identifier (only logged-in users in this table) |
| `experiment_group` | `string` | control / variant split for the experiment (expected to be equal 50/50) |

<br>

The criteria for a full rollout ("full on") are the following:
- Primary metric (conversion): the effect must be statistically significant and positive (an increase).
- Guardrail (time_to_booking): the effect must either be statistically insignificant or positive (a decrease in time to booking).
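
The two criteria can be written down as a small decision rule. This is a minimal sketch, not part of the original analysis: the function name `full_on_decision` and its arguments are hypothetical, and the default `alpha=0.10` assumes the 90% confidence level set later in the notebook. Effects are relative effect sizes (variant / control - 1).

```python
def full_on_decision(pval_primary, effect_primary,
                     pval_guardrail, effect_guardrail, alpha=0.10):
    """Return 'Yes' if both rollout criteria hold, otherwise 'No'.

    Hypothetical helper: the names and the default alpha are assumptions.
    """
    # Primary metric: significant AND positive (conversion went up)
    primary_ok = (pval_primary < alpha) and (effect_primary > 0)
    # Guardrail: not significant, OR the change is a decrease in time to booking
    guardrail_ok = (pval_guardrail >= alpha) or (effect_guardrail < 0)
    return "Yes" if (primary_ok and guardrail_ok) else "No"


# Example with the values reported in this notebook
print(full_on_decision(0.0002, 0.1422, 0.5365, -0.0079))
```

Keeping the rule in one place makes it easy to see which criterion blocks a rollout when the answer is "No".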

In [87]:
import pandas as pd
from scipy.stats import chisquare, ttest_ind
from statsmodels.stats.proportion import proportions_ztest

**STEP 1. Data preparation**  

- Loading and merging the data
- Creating a new column for the primary metric 'conversion'

In [88]:
sessions = pd.read_csv('sessions_data.csv')
users = pd.read_csv('users_data.csv')

In [89]:
sessions.sample(5)

| | session_id | user_id | session_start_timestamp | booking_timestamp | time_to_booking |
|---|---|---|---|---|---|
| 5580 | LBabtfCxAP8QCNCu | nwrqI3UYOQJZ0gCn | 2025-01-03 07:28:21.429081202 | 2025-01-03 07:42:50.468963980 | 14.483998 |
| 9107 | eKfcIwnaxDPlWraO | DiG2bcGSu1m7SqTw | 2025-01-15 03:51:09.190968275 | NaN | NaN |
| 12832 | dM2eERmQ7cyrzAae | fNqBz0ySu2T0FPMA | 2025-01-08 07:42:12.229913950 | NaN | NaN |
| 1496 | CP0qwixUZnDf07vq | V2H5dJeCWYnk0pBi | 2025-01-02 20:29:31.959428072 | NaN | NaN |
| 14953 | E0k8SgOptqpJRIyQ | ketg2PrPtncVHFhr | 2025-01-06 08:22:27.569632530 | NaN | NaN |


In [90]:
users.sample(5)

| | user_id | experiment_group |
|---|---|---|
| 8155 | giz9OlwozKtcpA97 | control |
| 2940 | QqaaRjjeh1NeRMeH | variant |
| 6213 | Cdgz77DVwxv60WzO | control |
| 7435 | CPTS9h3gvdjs3pj6 | variant |
| 9529 | SsnUS5odFmFU8GIR | variant |


In [91]:
confidence_level = 0.90  # Set the pre-defined confidence level (90%)
alpha = 1 - confidence_level  # Significance level for hypothesis tests

In [92]:
# Merging the data (the inner join keeps only sessions whose user_id appears in users_data, i.e. logged-in users)
sessions_x_users = sessions.merge(users, on='user_id', how='inner')
print(sessions_x_users.head(10))

         session_id           user_id  ... time_to_booking experiment_group
0  CP0lbAGnb5UNi3Ut  TcCIMrtQ75wHGXVj  ...             NaN          variant
1  UQAjrPYair63L1p8  TcCIMrtQ75wHGXVj  ...             NaN          variant
2  9zQrAPxV5oi2SzSa  TcCIMrtQ75wHGXVj  ...             NaN          variant
3  kkrz1M5vxrQ8wXRZ  GUGVzto9KGqeX3dc  ...             NaN          variant
4  ABZZFrwItZAPdYGP  v2EBIHmOdQfalI6k  ...             NaN          variant
5  9dGu1TXJ9cnNPDRD  wnsKpRB9SE0gTZAq  ...             NaN          variant
6  pjU3gt6Fti6axeMj  go0Nl2hbR6L3zYu4  ...             NaN          variant
7  94iFRPtpApTCJUkc  XmsrbZS1PW4BXwdd  ...       25.663952          control
8  JO0SGia9COmpYrKi  XmsrbZS1PW4BXwdd  ...             NaN          control
9  SWjTk16Q40V1kVUX  XmsrbZS1PW4BXwdd  ...             NaN          control

[10 rows x 6 columns]
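
To illustrate why the inner join is enough to drop non-logged-in sessions, here is a small sketch on toy data. The values and the names `toy_sessions`, `toy_users` and `merged` are made up for illustration and are not taken from the experiment files. The `validate="many_to_one"` argument additionally makes pandas raise an error if a user appears more than once in the users table, which would silently duplicate sessions.

```python
import pandas as pd

# Toy data (hypothetical values): session s3 has no user_id (not logged in)
toy_sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4"],
    "user_id": ["u1", "u1", None, "u2"],
})
toy_users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "experiment_group": ["control", "variant"],
})

# Inner join: sessions without a matching user_id are dropped.
# validate="many_to_one" checks that user_id is unique in toy_users.
merged = toy_sessions.merge(
    toy_users, on="user_id", how="inner", validate="many_to_one"
)

print(len(toy_sessions), "sessions before merge,", len(merged), "after")
print(merged)
```

Comparing the row counts before and after the merge is a quick way to see how many sessions were excluded.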


In [93]:
# Creating conversion column (1 if booking_timestamp is not missing, 0 otherwise)
sessions_x_users['conversion'] = sessions_x_users['booking_timestamp'].notnull().astype(int)
print(sessions_x_users.sample(10))

             session_id           user_id  ... experiment_group conversion
11307  7seXiiRZ8Sba4N9u  60wg1GBTYqNuTI6B  ...          variant          0
13617  gbRSlIoYOdcJeAvr  mtcRpc90ROXohN9y  ...          variant          0
3691   mW4w95qh2JWrohth  T1hBK9nae7Jnb2dE  ...          variant          0
7754   9p9vmMOqS9mhhOsL  cqQw0Yr6B6T9vZo7  ...          variant          0
5753   yjZUst2u4hEogp7Q  5hdUWe5LTnFAV1cm  ...          variant          0
715    6OwMENronNA6A5Dl  Vg8o5XvpFF6LFh4K  ...          control          0
10360  rxP3z85l1PiyG6Yl  IOehvOnfKqeZi9xG  ...          variant          1
6450   yz6MZ3WfifJK2AF1  mGX6OSODxdB54Rjz  ...          variant          0
13074  QdFq8y8RgmdOawEp  bj8KlkNVKx2siO7T  ...          variant          0
8188   K7fwZyEobejKoge9  xMDJknRPJHxlE6m9  ...          variant          0

[10 rows x 7 columns]


In [94]:
# Grouping data by experiment group (control/variant) and calculating total # of sessions, unique users, and total bookings
summary = sessions_x_users.groupby('experiment_group').agg({
    'session_id': 'count',            
    'user_id': pd.Series.nunique,     
    'conversion': 'sum'               
})
summary.columns = ['total_sessions', 'unique_users', 'total_bookings']
print(summary)


                  total_sessions  unique_users  total_bookings
experiment_group                                              
control                     7630          4706            1215
variant                     7653          4748            1392


In [95]:
# Finding conversion rate by groups (in %)
conversion_rate = sessions_x_users.groupby('experiment_group')['conversion'].mean()
print(round(conversion_rate*100,2))

experiment_group
control    15.92
variant    18.19
Name: conversion, dtype: float64


**STEP 2. Sanity Check**

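- Sample Ratio Mismatch (SRM): a chi-squared test comparing the observed split between control and variant with the expected 50/50 split. Note that the test below counts sessions per group, while group assignment is defined per user.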

In [96]:
# Sample Ratio Mismatch (SRM) test
from scipy.stats import chisquare

# Get the observed frequencies (number of sessions in each experiment group)
observed_frequencies = sessions_x_users['experiment_group'].value_counts().values

# Calculate the expected frequencies
total_count = observed_frequencies.sum()
expected_frequencies = [total_count / 2, total_count / 2]

# Perform the chi-square test
srm_chi2_pval = chisquare(observed_frequencies, f_exp=expected_frequencies).pvalue
print("SRM p-value:", round(srm_chi2_pval, 4))

SRM p-value: 0.8524


**Interpretation:**

Since 0.8524 > 0.1, we fail to reject the null hypothesis of a 50/50 split, so there is no evidence of Sample Ratio Mismatch (SRM).

The observed session counts in the control and variant groups are statistically consistent with a 50/50 split, which is what we expect if group assignment was properly randomized.
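
Because users (not sessions) are assigned to groups, the same check can also be run on unique users. The sketch below is an optional cross-check, not part of the original analysis. It reuses the unique-user counts printed in the summary table in Step 1 (4706 control, 4748 variant); the variable names are hypothetical.

```python
from scipy.stats import chisquare

# Unique users per group, copied from the summary table in Step 1
observed_users = [4706, 4748]  # [control, variant]

# Under a 50/50 split, each group is expected to hold half of all users
expected_users = [sum(observed_users) / 2] * 2

srm_users = chisquare(observed_users, f_exp=expected_users)
srm_user_pval = srm_users.pvalue

print("User-level SRM chi-square:", round(srm_users.statistic, 4))
print("User-level SRM p-value:", round(srm_user_pval, 4))
```

If the user-level and session-level checks disagreed, that would suggest one group produces systematically more sessions per user, which would be worth investigating before reading the conversion results.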

**STEP 3. Analyzing the experiment effect**

- Primary metric: conversion (is there a significant increase?)
- Guardrail metric: time_to_booking (is there no significant increase?). We want to make sure that the new algorithm doesn't delay booking, so we test whether the average time to booking in the variant group is significantly worse than in the control group.

In [97]:
# Analyzing the primary metric 'conversion'. We use a two-proportion z-test because conversion is a binary metric (converted or not)

conversions = sessions_x_users.groupby('experiment_group')['conversion'].sum()
n_sessions = sessions_x_users['experiment_group'].value_counts()

successes = [conversions['control'], conversions['variant']]  # number of bookings in each group
n_obs = [n_sessions['control'], n_sessions['variant']]  # number of sessions in each group (named n_obs so the sessions DataFrame is not overwritten)

# Perform two-sided Z-test (the statistic is computed as control minus variant)
z_stat, pval_primary = proportions_ztest(successes, n_obs)

print("Z-test:", round(z_stat, 4))
print("P-value (primary metric - conversion):", round(pval_primary, 4))

Z-test: -3.722
P-value (primary metric - conversion): 0.0002


**Interpretation of primary metric 'conversion':**

Since p-value = 0.0002 < 0.1, the result is statistically significant.

The Z-statistic is negative because the test computes control minus variant, so a negative value means the variant group has the higher conversion rate (18.19% vs. 15.92%).

So, we can say that the new search ranking algorithm improves conversion, and the effect is statistically significant. This passes the primary metric criterion.
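
As a sanity check on the sign and size of the statistic, the two-proportion z-test can be reproduced by hand from the counts in the Step 1 summary table. This is a minimal sketch assuming the pooled-variance form of the test, which is the default behaviour of `proportions_ztest`. The variable names are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

# Bookings and sessions per group, copied from the summary table in Step 1
bookings_c, n_c = 1215, 7630  # control
bookings_v, n_v = 1392, 7653  # variant

p_c = bookings_c / n_c
p_v = bookings_v / n_v

# Pooled conversion rate under H0 (both groups share one rate)
p_pool = (bookings_c + bookings_v) / (n_c + n_v)

# Standard error of the difference under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))

# Control minus variant, matching the order passed to proportions_ztest
z_manual = (p_c - p_v) / se

# Two-sided p-value from the standard normal distribution
p_manual = 2 * norm.sf(abs(z_manual))

print("z:", round(z_manual, 4))
print("p-value:", round(p_manual, 4))
```

Swapping the order of the groups would flip the sign of `z_manual` but leave the two-sided p-value unchanged.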

In [98]:
# Analyzing the guardrail metric 'time to booking'. We use a t-test because it is a continuous metric (time)
from scipy.stats import ttest_ind

# Filter only rows with a booking (to have valid time_to_booking values)
booked_sessions = sessions_x_users[sessions_x_users['conversion'] == 1]

# Get time_to_booking values for each group
control_time = booked_sessions[booked_sessions['experiment_group'] == 'control']['time_to_booking']
variant_time = booked_sessions[booked_sessions['experiment_group'] == 'variant']['time_to_booking']

# Perform two-sided Welch's t-test (equal_var=False does not assume equal variances in the two groups)
t_stat, pval_guardrail = ttest_ind(control_time, variant_time, equal_var=False)

print("T-test:", round(t_stat, 4))
print("P-value (guardrail metric - time_to_booking):", round(pval_guardrail, 4))

T-test: 0.6182
P-value (guardrail metric - time_to_booking): 0.5365


**Interpretation of guardrail metric 'time to booking':**  

Since p-value = 0.5365 > 0.1, the difference in time_to_booking between control and variant is not statistically significant.

So, there is no evidence that the new algorithm makes people take longer to book, which means it passes the guardrail check.
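
The guardrail question is directional (is the variant slower?), so a one-sided test is a possible alternative to the two-sided test used above. The sketch below is an illustration on synthetic data, not a re-analysis of the experiment: the arrays `control_demo` and `variant_demo` are randomly generated, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic booking times in minutes (made-up data, same mean in both groups)
rng = np.random.default_rng(0)
control_demo = rng.normal(loc=15.0, scale=5.0, size=500)
variant_demo = rng.normal(loc=15.0, scale=5.0, size=500)

# alternative="less" tests H1: mean(control) < mean(variant),
# i.e. the variant takes longer to book than the control
one_sided = ttest_ind(control_demo, variant_demo,
                      equal_var=False, alternative="less")

print("t-statistic:", round(one_sided.statistic, 4))
print("one-sided p-value:", round(one_sided.pvalue, 4))
```

A small one-sided p-value here would indicate that the variant is significantly slower. The two-sided test in the notebook is the more conservative choice when a change in either direction matters.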

**STEP 4. Calculating the effect sizes for both metrics**  

- Calculating the ATE (Average Treatment Effect), expressed here as the relative effect size of the variant over the control, for conversion and time_to_booking

In [99]:
# The formula is: effect_size = mean(variant_group) / mean(control_group) - 1

# Primary metric: conversion
primary_mean = sessions_x_users.groupby('experiment_group')['conversion'].mean()
effect_size_primary = round(primary_mean['variant']/primary_mean['control']-1,4)

# Guardrail metric: time_to_booking (where conversion = 1)
guardrail_mean = booked_sessions.groupby('experiment_group')['time_to_booking'].mean()
effect_size_guardrail = round(guardrail_mean['variant']/guardrail_mean['control']-1,4)

print("Effect size (conversion):", effect_size_primary)
print("Effect size (time_to_booking):", effect_size_guardrail)

Effect size (conversion): 0.1422
Effect size (time_to_booking): -0.0079
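
The relative effect size for conversion can be reproduced directly from the counts in the Step 1 summary table. The sketch below also computes the absolute difference in percentage points, which is often reported alongside the relative lift. The helper name `relative_lift` is hypothetical.

```python
def relative_lift(variant_mean, control_mean):
    """Relative effect size: mean(variant) / mean(control) - 1."""
    return variant_mean / control_mean - 1


# Conversion rates from the Step 1 summary table (bookings / sessions)
control_rate = 1215 / 7630
variant_rate = 1392 / 7653

lift = relative_lift(variant_rate, control_rate)
abs_diff_pp = (variant_rate - control_rate) * 100  # percentage points

print("Relative lift:", round(lift, 4))
print("Absolute difference (pp):", round(abs_diff_pp, 2))
```

The relative lift answers "how much better than control, proportionally", while the absolute difference shows how many extra bookings to expect per 100 sessions.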


**STEP 5. Conclusion**

### Experiment Summary

| Metric              | Requirement                                                                      | Result                             | Pass? |
|---------------------|----------------------------------------------------------------------------------|------------------------------------|--------|
| **Conversion**       | Statistically significant **increase**                                           | ✔ Significant **(+14.2%)**          | ✅     |
| **Time to Booking**  | Statistically **insignificant / decrease** in time to book                      | ✔ Not significant **(−0.79%)**      | ✅     |

### Final Decision: **Full Rollout = YES**


In [100]:
decision_full_on = 'Yes'