This project is inspired by an Indonesian-language article titled "Pengertian, Cara Kerja, dan Penerapan A/B Testing" ("Definition, How It Works, and Application of A/B Testing").

Here is the link to the article: https://softscients.com/2022/01/14/pengertian-cara-kerja-dan-penerapan-a-b-testing/

## Problem Statement

A hotel in Indonesia has a website where customers can book rooms. The website also plays an important role in attracting and engaging potential customers, and the hotel hopes it will make the hotel better known in the city.

The hotel manager wants to change some features of the website to increase the purchase rate (more customers booking rooms). To evaluate the change, the manager runs an A/B test of the new features on the website. The experiment collects data for one month, after which hypothesis testing is used to check whether the new features increase the purchase rate. The month has now passed and the data is ready for hypothesis testing.

Note: The data is in Indonesian, so I will translate it into English.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm

In [10]:
# Import the data
df = pd.read_csv("Hotel Site Visit (Ind).csv", decimal = ",", thousands = ".")
df

Unnamed: 0,variasi,menginap,hari,pendapatan
0,A,TIDAK,0.0,0.0
1,A,TIDAK,0.0,0.0
2,A,TIDAK,0.0,0.0
3,A,TIDAK,0.0,0.0
4,A,TIDAK,0.0,0.0
...,...,...,...,...
1446,B,YA,4.0,22281.0
1447,B,TIDAK,0.0,0.0
1448,B,TIDAK,0.0,0.0
1449,B,TIDAK,0.0,0.0


In [11]:
# Translate the data into English
df.columns = ["variation", "book", "day", "revenue"]
df["book"] = ["NO" if i == "TIDAK" else "YES" for i in df["book"]]
df

Unnamed: 0,variation,book,day,revenue
0,A,NO,0.0,0.0
1,A,NO,0.0,0.0
2,A,NO,0.0,0.0
3,A,NO,0.0,0.0
4,A,NO,0.0,0.0
...,...,...,...,...
1446,B,YES,4.0,22281.0
1447,B,NO,0.0,0.0
1448,B,NO,0.0,0.0
1449,B,NO,0.0,0.0


In [12]:
# Save the translated data frame into csv
df.to_csv("Hotel Site Visit (Eng).csv", index = False)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1451 entries, 0 to 1450
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   variation  1451 non-null   object 
 1   book       1451 non-null   object 
 2   day        1451 non-null   float64
 3   revenue    1451 non-null   float64
dtypes: float64(2), object(2)
memory usage: 45.5+ KB


**Data Description**

There are 4 columns in the data which are:

1. variation: "A" is the control group (customers who see the default website features) and "B" is the treatment group (customers who see the new website features).
2. book: whether the customer booked room(s) through the website ("YES" or "NO").
3. day: the number of days the customer booked.
4. revenue: the revenue from the booked room(s).

The data consists of 1451 customers.

Our main goal is to check **whether the new website features increase the purchase conversion rate**. Therefore, we only need the first two columns of the data: variation and book.

In [14]:
# Update the data by using only the first two columns
df = df.iloc[:,:2]
df.head()

Unnamed: 0,variation,book
0,A,NO
1,A,NO
2,A,NO
3,A,NO
4,A,NO


We will use the updated data to run a hypothesis test on the purchase conversion rate using a two-proportion z-test.
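
For reference, the pooled two-proportion z-test can be written as below. This is a sketch of the standard formulation. The symbols $x_A, x_B$ (number of bookings) and $n_A, n_B$ (number of visitors) are notation introduced here for illustration and do not appear in the original article.

```latex
\hat{p}_A = \frac{x_A}{n_A}, \qquad
\hat{p}_B = \frac{x_B}{n_B}, \qquad
\hat{p} = \frac{x_A + x_B}{n_A + n_B}

z = \frac{\hat{p}_B - \hat{p}_A}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}

p\text{-value (two-sided)} = 2\,\bigl(1 - \Phi(|z|)\bigr)
```

Under the null hypothesis $p_A = p_B$, $z$ is approximately standard normal for large samples, and $\Phi$ denotes the standard normal CDF.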

In [19]:
# Create a function to calculate purchase conversion rate
def conversion_rate(variant):
    group = df[df["variation"] == variant]
    # Count rows directly; .count()[0] relies on deprecated positional Series indexing
    conversion = int((group["book"] == "YES").sum())
    visitor = len(group)
    rate = conversion/visitor
    return {
        "conversion": conversion,
        "visitor": visitor,
        "rate": rate
    }

In [20]:
# Calculate conversion rate for each group
variant_A = conversion_rate("A")
variant_B = conversion_rate("B")
print(variant_A)
print(variant_B)

{'conversion': 20, 'visitor': 721, 'rate': 0.027739251040221916}
{'conversion': 37, 'visitor': 730, 'rate': 0.050684931506849315}


In [21]:
# Two-proportion z-test on the purchase conversion rate at the 5% significance level
pool = (variant_A["conversion"] + variant_B["conversion"]) / (variant_A["visitor"] + variant_B["visitor"])
se_pool = np.sqrt(pool * (1 - pool) * ((1 / variant_A["visitor"]) + (1 / variant_B["visitor"])))
margin_err = se_pool * norm.ppf(0.975)
diff_proportion = variant_B["rate"] - variant_A["rate"]
increased = ((variant_B["rate"] / variant_A["rate"]) - 1) * 100
z_score = diff_proportion / se_pool
pvalue = 2 * norm.sf(abs(z_score))  # two-sided p-value; abs() keeps it valid when z_score is negative

In [22]:
# Show the result
result = pd.DataFrame({
    "metric": ["Estimated Difference", "Relative Uplift (%)", "Pooled Sample Proportion", "Standard Error of Difference", "Z-score", "P-value", "Margin of Error"],
    "value": [diff_proportion, increased, pool, se_pool, z_score, pvalue, margin_err]
})
result

Unnamed: 0,metric,value
0,Estimated Difference,0.022946
1,Relative Uplift (%),82.719178
2,Pooled Sample Proportion,0.039283
3,Standard Error of Difference,0.0102
4,Z-score,2.249546
5,P-value,0.024478
6,Margin of Error,0.019992
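
The margin of error in the table is based on the pooled standard error, which assumes both groups share one conversion rate. A confidence interval for the difference is more commonly built from the unpooled standard error. The sketch below is one way to do that, using only the Python standard library and the counts printed earlier. The function name `diff_confidence_interval` is hypothetical and not part of the original notebook.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the conversion-rate output above
conv_a, n_a = 20, 721   # control (A): bookings, visitors
conv_b, n_b = 37, 730   # treatment (B): bookings, visitors


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for p_B - p_A using the unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled SE: each group contributes its own variance estimate
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, about 1.96 for a 95% interval
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z_crit * se_unpooled, diff + z_crit * se_unpooled


lower, upper = diff_confidence_interval(conv_a, n_a, conv_b, n_b)
print(f"95% CI for the difference: ({lower:.4f}, {upper:.4f})")
```

With these counts the lower bound stays above zero, which agrees with the z-test rejecting equal conversion rates at the 5% level.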


The z-test p-value is 0.0245, which is below the 5% significance level, so we reject the null hypothesis that the new and default website features have the same conversion rate. The conversion rate with the new features (5.07%) is higher than with the default features (2.77%), a relative uplift of 82.72%. We therefore conclude that the new website features have a statistically significant positive effect on the purchase conversion rate of the hotel website.

Therefore, **we can recommend that the hotel manager update the website with the new features**.
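
The problem statement asks whether the new features increase the purchase rate, which is a directional question, while the test above is two-sided. As a supplementary check, the sketch below computes the one-sided p-value for the alternative that B converts better than A. It uses only the standard library and the counts printed earlier; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the conversion-rate output above
conv_a, n_a = 20, 721   # control (A): bookings, visitors
conv_b, n_b = 37, 730   # treatment (B): bookings, visitors

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled proportion and standard error under H0: p_A == p_B
pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))

z_score = (p_b - p_a) / se_pool

# One-sided p-value for H1: p_B > p_A (upper tail of the standard normal)
p_one_sided = 1 - NormalDist().cdf(z_score)

print(f"z = {z_score:.4f}, one-sided p-value = {p_one_sided:.4f}")
```

Because the observed z-score is positive, the one-sided p-value is half the two-sided one, so the directional test supports the same recommendation.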