# A/B Test Analysis
We're going to conduct an Independent Samples T-test to analyse our A/B test. An Independent Samples T-test compares the means of two different samples and tests whether the difference between them is statistically significant.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

Export your results to a .csv file and save it to your GitHub repository. Import your .csv file, inspect it, and clean it where necessary.

In [2]:
# Load the data
df_a = pd.read_csv("AB-Test_HCAI_Version_A.csv", sep=';')
df_b = pd.read_csv("AB-Test_HCAI_Version_B.csv", sep=';')

# EDA A
df_a.info() 
df_a.head() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 11 columns):
 #   Column                                                                                        Non-Null Count  Dtype  
---  ------                                                                                        --------------  -----  
 0   Id                                                                                            8 non-null      int64  
 1   Start time                                                                                    8 non-null      object 
 2   Completion time                                                                               8 non-null      object 
 3   Email                                                                                         8 non-null      object 
 4   Name                                                                                          0 non-null      float64
 5   Answer the following questions:.I

Unnamed: 0,Id,Start time,Completion time,Email,Name,Answer the following questions:.I understood what I could use the app for.,Answer the following questions:.I found the application intuitive to use.,Answer the following questions:.I thought the application was useful.,Answer the following questions:.I enjoyed using the application.,Answer the following questions:.I consider that this would add value to the HORECA industry.,Anything else to add?
0,1,4.9.24 13:07:47,4.9.24 13:12:34,anonymous,,Agree,Agree,Agree,Agree,Strongly agree,
1,2,4.9.24 13:21:33,4.9.24 13:25:31,anonymous,,Disagree,Agree,Strongly agree,Neutral,Strongly agree,
2,3,4.9.24 13:29:03,4.9.24 13:29:29,anonymous,,Disagree,Disagree,Neutral,Neutral,Neutral,
3,4,4.9.24 13:30:15,4.9.24 13:31:43,anonymous,,Agree,Neutral,Neutral,Disagree,Neutral,
4,5,4.9.24 13:34:14,4.9.24 13:35:20,anonymous,,Strongly agree,Neutral,Agree,Disagree,Neutral,


In [3]:
# EDA B
df_b.info() 
df_b.head() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 11 columns):
 #   Column                                                                                 Non-Null Count  Dtype  
---  ------                                                                                 --------------  -----  
 0   Id                                                                                     15 non-null     int64  
 1   Start time                                                                             15 non-null     object 
 2   Completion time                                                                        15 non-null     object 
 3   Email                                                                                  15 non-null     object 
 4   Name                                                                                   0 non-null      float64
 5   Question.I understood what I could use the app for.                             

Unnamed: 0,Id,Start time,Completion time,Email,Name,Question.I understood what I could use the app for.,Question.I found the application intuitive to use.,Question.I thought the application was useful.,Question.I enjoyed using the application.,Question.Tracking the application's performance is not in the interest of the waiter.,Anything else to add?
0,1,4.9.24 11:24:01,4.9.24 11:25:42,anonymous,,Neutral,Neutral,Agree,Neutral,Neutral,
1,2,4.9.24 11:22:49,4.9.24 11:26:23,anonymous,,Agree,Agree,Neutral,Agree,Disagree,
2,3,4.9.24 11:32:19,4.9.24 11:33:45,anonymous,,Agree,Agree,Strongly agree,Agree,Neutral,
3,4,4.9.24 11:31:25,4.9.24 11:34:10,anonymous,,Agree,Strongly agree,Strongly agree,Agree,Neutral,
4,5,4.9.24 11:24:30,4.9.24 11:36:17,anonymous,,Agree,Agree,Agree,Neutral,Neutral,


In [4]:
# Define the mapping for Likert scale responses
likert_mapping = {
    'Strongly disagree': 0,
    'Disagree': 1,
    'Neutral': 2,
    'Agree': 3,
    'Strongly agree': 4
}

# Columns to transform
columns_to_transform_a = [
    'Answer the following questions:.I understood what I could use the app for.',
    'Answer the following questions:.I found the application intuitive to use.',
    'Answer the following questions:.I thought the application was useful.',
    'Answer the following questions:.I enjoyed using the application.',
    'Answer the following questions:.I consider that this would add value to the HORECA industry.'
]

columns_to_transform_b = [
    'Question.I understood what I could use the app for.',
    'Question.I found the application intuitive to use.',
    'Question.I thought the application was useful.',
    'Question.I enjoyed using the application.',
    "Question.Tracking the application's performance is not in the interest of the waiter."
]

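# Apply the mapping column by column (Series.map turns unmapped answers into NaN;
# DataFrame.applymap is deprecated in recent pandas versions)
df_a[columns_to_transform_a] = df_a[columns_to_transform_a].apply(lambda col: col.map(likert_mapping))
df_b[columns_to_transform_b] = df_b[columns_to_transform_b].apply(lambda col: col.map(likert_mapping))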

# Drop columns that are not necessary
columns_to_drop = ['Id', 'Start time', 'Completion time', 'Email', 'Name', 'Anything else to add?']
df_a = df_a.drop(columns=columns_to_drop)
df_b = df_b.drop(columns=columns_to_drop)

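# Rename columns in df_b to match df_a for consistency
# (note: the fifth question is worded differently in version B, so it is not directly comparable)
df_b.columns = columns_to_transform_a

# Make sure the Likert scores are numeric (anything that cannot be parsed becomes NaN)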
df_a = df_a.apply(pd.to_numeric, errors='coerce')
df_b = df_b.apply(pd.to_numeric, errors='coerce')

# Drop any rows with NaN values that might have resulted from conversion
df_a.dropna(inplace=True)
df_b.dropna(inplace=True)

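The cleaning step above drops every row that contains a NaN, so a misspelled or unexpected answer removes a whole respondent without warning. Below is a minimal sketch of how one might check for unmapped answers before dropping rows. The small dataframe and the column names `q1` and `q2` are made up for illustration; they are not the survey data.

```python
import pandas as pd

likert_mapping = {
    'Strongly disagree': 0,
    'Disagree': 1,
    'Neutral': 2,
    'Agree': 3,
    'Strongly agree': 4,
}

# Hypothetical answers; 'agree ' has the wrong case and a trailing space on purpose
example = pd.DataFrame({
    'q1': ['Agree', 'Neutral', 'agree '],
    'q2': ['Strongly agree', 'Disagree', 'Agree'],
})

# Map each column; answers that are missing from the mapping become NaN
mapped = example.apply(lambda col: col.map(likert_mapping))

# Count the answers per column that had no entry in the mapping
unmapped_counts = mapped.isna().sum()
print(unmapped_counts)

# Show the original rows that dropna() would silently remove
print(example[mapped.isna().any(axis=1)])
```

If any count is above zero, it is worth fixing the raw answers (or extending the mapping) before calling `dropna()`.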


The rest we leave for tomorrow when we actually have our data. But if you are eager to play around a bit you can simply refresh the survey and fill in a couple of responses to create an A and a B version.

Now, let's start analysing our gathered data! We won't dive into inferential statistics in this block, since it can get complex quite fast; we'll do that in Year 2, block A. For now, you just need to know that we have to test whether the data is normally distributed and whether the variances of both samples are equal. If these assumptions don't hold, the standard t-test is not valid, and we cannot trust its verdict on whether the results we're seeing are due to chance. What we want to establish is whether there is a statistically significant difference in the mean of a given variable between version A and version B.

In [5]:
# Select only the columns we need (columns 0 to 4 after dropping the unnecessary columns).
# .copy() gives us independent dataframes, so re-running this cell does not modify df_a or df_b.
df_A = df_a.iloc[:, :5].copy()
df_B = df_b.iloc[:, :5].copy()

# Define new column names
new_column_names = ['q1', 'q2', 'q3', 'q4', 'q5']

# Rename columns
df_A.columns = new_column_names
df_B.columns = new_column_names

# Display the cleaned dataframes
print(df_A.head())
print(df_B.head())

   q1  q2  q3  q4  q5
0   3   3   3   3   4
1   1   3   4   2   4
2   1   1   2   2   2
3   3   2   2   1   2
4   4   2   3   1   2
   q1  q2  q3  q4  q5
0   2   2   3   2   2
1   3   3   2   3   1
2   3   3   4   3   2
3   3   4   4   3   2
4   3   3   3   2   2


In [6]:
# Run the Shapiro-Wilk test to check whether the data is normally distributed.
# Passing the whole dataframe pools all answers of a version into one sample.
normal_a = stats.shapiro(df_A)
normal_b = stats.shapiro(df_B)

# Check whether the variance of both samples is equal
homogeneity = stats.levene(df_A['q1'],
                           df_B['q1'])

print("If the p-value is above 0.05, then the data is normally distributed", normal_a, ". If the data is not normally distributed then you will have to run the bootstrapped version.")
print("If the p-value is above 0.05, then the data is normally distributed", normal_b, ". If the data is not normally distributed then you will have to run the bootstrapped version.")
print("If the p-value is above 0.05, then the groups have equal variances", homogeneity, ". If the variances aren't equal then you will have to run the bootstrapped version.")

If the p-value is above 0.05, then the data is normally distributed ShapiroResult(statistic=0.8788917547893227, pvalue=0.0004914016589802226) . If the data is not normally distributed then you will have to run the bootstrapped version.
If the p-value is above 0.05, then the data is normally distributed ShapiroResult(statistic=0.854623220843561, pvalue=4.626648517855177e-07) . If the data is not normally distributed then you will have to run the bootstrapped version.
If the p-value is above 0.05, then the groups have equal variances LeveneResult(statistic=0.001852015168886157, pvalue=0.966080168004411) . If the variances aren't equal then you will have to run the bootstrapped version.


In [7]:
# Perform the Shapiro-Wilk test for each question to check for normality
def check_normality(dataframe):
    normality_results = {}
    for column in dataframe.columns:
        stat, p_value = stats.shapiro(dataframe[column])
        normality_results[column] = p_value
    return normality_results

# Perform Levene's test for each question to check for homogeneity of variances
def check_homogeneity(df1, df2):
    homogeneity_results = {}
    for column in df1.columns:
        stat, p_value = stats.levene(df1[column], df2[column])
        homogeneity_results[column] = p_value
    return homogeneity_results

# Check normality for both dataframes
normality_a = check_normality(df_A)
normality_b = check_normality(df_B)

# Check homogeneity for both dataframes
homogeneity = check_homogeneity(df_A, df_B)

# Print the results
print("Shapiro-Wilk Test for Normality - df_A:")
for column, p_value in normality_a.items():
    print(f"  {column}: p-value = {p_value:.5f} ( {'Normally distributed' if p_value > 0.05 else 'Not normally distributed'} )")

print("\nShapiro-Wilk Test for Normality - df_B:")
for column, p_value in normality_b.items():
    print(f"  {column}: p-value = {p_value:.5f} ( {'Normally distributed' if p_value > 0.05 else 'Not normally distributed'} )")

print("\nLevene's Test for Homogeneity of Variances:")
for column, p_value in homogeneity.items():
    print(f"  {column}: p-value = {p_value:.5f} ( {'Equal variances' if p_value > 0.05 else 'Unequal variances'} )")

Shapiro-Wilk Test for Normality - df_A:
  q1: p-value = 0.03177 ( Not normally distributed )
  q2: p-value = 0.02697 ( Not normally distributed )
  q3: p-value = 0.03697 ( Not normally distributed )
  q4: p-value = 0.03697 ( Not normally distributed )
  q5: p-value = 0.00188 ( Not normally distributed )

Shapiro-Wilk Test for Normality - df_B:
  q1: p-value = 0.00296 ( Not normally distributed )
  q2: p-value = 0.00406 ( Not normally distributed )
  q3: p-value = 0.00222 ( Not normally distributed )
  q4: p-value = 0.00358 ( Not normally distributed )
  q5: p-value = 0.00363 ( Not normally distributed )

Levene's Test for Homogeneity of Variances:
  q1: p-value = 0.96608 ( Equal variances )
  q2: p-value = 0.92923 ( Equal variances )
  q3: p-value = 0.32562 ( Equal variances )
  q4: p-value = 0.91201 ( Equal variances )
  q5: p-value = 0.78480 ( Equal variances )


Because the data is not normally distributed (which is not uncommon with survey data), we will need to use the bootstrapped version.

In [8]:
# Define the bootstrapping function
def bootstrap_mean_diff(data1, data2, num_bootstrap=10000):
    observed_diff = np.mean(data1) - np.mean(data2)
    combined = np.concatenate([data1, data2])
    
    boot_diffs = []
    for _ in range(num_bootstrap):
        boot_sample1 = np.random.choice(combined, size=len(data1), replace=True)
        boot_sample2 = np.random.choice(combined, size=len(data2), replace=True)
        boot_diff = np.mean(boot_sample1) - np.mean(boot_sample2)
        boot_diffs.append(boot_diff)
        
    boot_diffs = np.array(boot_diffs)
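    # One-sided test: share of resampled differences at least as large as the observed one
    # (this only detects mean A > mean B; adjust as needed)
    p_value = np.mean(boot_diffs >= observed_diff)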
    return observed_diff, p_value, boot_diffs

# Perform bootstrapping for each question
bootstrap_results = {}
for column in df_A.columns:
    observed_diff, p_value, boot_diffs = bootstrap_mean_diff(df_A[column], df_B[column])
    bootstrap_results[column] = {
        'observed_diff': observed_diff,
        'p_value': p_value,
        'boot_diffs': boot_diffs
    }

# Print the results
for column, results in bootstrap_results.items():
    print(f"{column} - Observed Difference: {results['observed_diff']:.5f}, P-value: {results['p_value']:.5f}")

q1 - Observed Difference: -0.18333, P-value: 0.64650
q2 - Observed Difference: -0.55833, P-value: 0.92660
q3 - Observed Difference: -0.45833, P-value: 0.93650
q4 - Observed Difference: -1.12500, P-value: 0.99910
q5 - Observed Difference: 0.15833, P-value: 0.35750


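Based on this bootstrap analysis, none of the questions (q1 through q5) has a p-value below 0.05. Keep in mind that the test is one-sided: it only asks whether version A scores higher than version B. A p-value close to 1, such as 0.99910 for q4, means that almost none of the resampled differences were lower than the observed -1.125, which points to a difference in the opposite direction (version B scoring higher) rather than to no difference at all.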
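The function above counts only resampled differences that are greater than or equal to the observed one, so it can detect A scoring higher than B but not the reverse. A two-sided variant compares absolute values instead. The sketch below is one possible way to write it, under the same pooled-resampling idea. The function name `two_sided_bootstrap_p`, the `seed` parameter, and the two groups of scores are assumptions made for illustration; they are not the survey data.

```python
import numpy as np

def two_sided_bootstrap_p(data1, data2, num_bootstrap=10000, seed=0):
    """Two-sided p-value for a difference in means, resampling from the pooled data."""
    rng = np.random.default_rng(seed)  # fixed seed so the result is reproducible
    data1 = np.asarray(data1, dtype=float)
    data2 = np.asarray(data2, dtype=float)
    observed_diff = data1.mean() - data2.mean()

    # Under the null hypothesis both groups come from the same (pooled) distribution
    combined = np.concatenate([data1, data2])
    samples1 = rng.choice(combined, size=(num_bootstrap, len(data1)), replace=True)
    samples2 = rng.choice(combined, size=(num_bootstrap, len(data2)), replace=True)
    boot_diffs = samples1.mean(axis=1) - samples2.mean(axis=1)

    # Two-sided: how often is a resampled difference at least as extreme, in either direction?
    p_value = np.mean(np.abs(boot_diffs) >= abs(observed_diff))
    return float(observed_diff), float(p_value)

# Hypothetical Likert scores (0-4) for two versions
group_a = [1, 1, 2, 1, 2, 1, 1, 2]
group_b = [3, 4, 3, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3, 4, 3]

diff, p = two_sided_bootstrap_p(group_a, group_b)
print(f"Observed difference: {diff:.3f}, two-sided p-value: {p:.5f}")
```

With a two-sided p-value, a large negative difference and a large positive difference are treated the same way, so the direction of the subtraction no longer decides whether an effect can be detected.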

In [9]:
# Perform the independent samples t-test for each question
t_test_results = {}
for column in df_A.columns:
    t_stat, p_value = stats.ttest_ind(df_A[column], df_B[column], equal_var=True)
    t_test_results[column] = {
        't_stat': t_stat,
        'p_value': p_value
    }

# Print the results
for column, results in t_test_results.items():
    print(f"{column} - t-statistic: {results['t_stat']:.5f}, p-value: {results['p_value']:.5f}")

q1 - t-statistic: -0.35989, p-value: 0.72253
q2 - t-statistic: -1.42550, p-value: 0.16871
q3 - t-statistic: -1.50155, p-value: 0.14810
q4 - t-statistic: -3.95280, p-value: 0.00073
q5 - t-statistic: 0.35653, p-value: 0.72500
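
To make explicit what `stats.ttest_ind(..., equal_var=True)` computes, the sketch below recomputes the pooled-variance t-statistic and its two-sided p-value by hand and compares them with SciPy's result. The two groups are hypothetical Likert scores, not the survey data, and `pooled_t_test` is a name chosen for illustration.

```python
import numpy as np
from scipy import stats

def pooled_t_test(sample1, sample2):
    """Independent samples t-test with pooled variance (equal variances assumed)."""
    a = np.asarray(sample1, dtype=float)
    b = np.asarray(sample2, dtype=float)
    n1, n2 = len(a), len(b)

    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)

    # Standard error of the difference between the two means
    standard_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

    t_stat = (a.mean() - b.mean()) / standard_error
    degrees_of_freedom = n1 + n2 - 2

    # Two-sided p-value from the t-distribution
    p_value = 2 * stats.t.sf(abs(t_stat), degrees_of_freedom)
    return t_stat, p_value

# Hypothetical Likert scores (0-4) for two versions
group_a = [3, 1, 1, 3, 4, 2, 3, 2]
group_b = [2, 3, 3, 3, 3, 4, 2, 3, 4, 3, 2, 3, 3, 4, 3]

t_manual, p_manual = pooled_t_test(group_a, group_b)
reference = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"manual: t = {t_manual:.5f}, p = {p_manual:.5f}")
print(f"scipy:  t = {reference.statistic:.5f}, p = {reference.pvalue:.5f}")
```

The sign of the t-statistic follows the order of the arguments: a negative value means the first group has the lower mean.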


Q4: "I enjoyed using the application"
The difference on question 4 is statistically significant (t = -3.95, p = 0.00073). This suggests a real difference in users' enjoyment between the two versions of the wireframe model, which could imply that the design or functionality changes made between the two versions had a noticeable impact on the user experience.

The negative t-statistic (mean of A minus mean of B) shows that the second version of the wireframe (Version B) was liked more. I hypothesize that this is because the second version was a simplified version of the first (the statistics page was updated to be less crowded), which made it not only more visually appealing but also easier to comprehend.

However, Version B had more respondents (15 versus 8), both groups are small, and the normality assumption of the t-test was not met, so this result should be interpreted with caution.

In [10]:
# Run Bootstrapped Independent Samples T-test when assumptions are violated
# Define the function for bootstrapped t-test
def bootstrapped_ttest(data1, data2, num_bootstrap=10000):
    observed_diff = np.mean(data1) - np.mean(data2)
    combined = np.concatenate([data1, data2])
    boot_diffs = []

    for _ in range(num_bootstrap):
        boot_sample1 = np.random.choice(combined, size=len(data1), replace=True)
        boot_sample2 = np.random.choice(combined, size=len(data2), replace=True)
        boot_stat, _ = stats.ttest_ind(boot_sample1, boot_sample2, equal_var=True)
        boot_diffs.append(boot_stat)

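    boot_diffs = np.array(boot_diffs)  # despite the name, these are t-statistics
    # One-sided comparison. Note that the resampled t-statistics are compared with the
    # observed difference in means, which is on a different scale.
    p_value = np.mean(boot_diffs >= observed_diff)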
    return observed_diff, p_value, boot_diffs

# Perform the bootstrapped t-test for each question
bootstrapped_results = {}
for column in df_A.columns:
    observed_diff, p_value, boot_diffs = bootstrapped_ttest(df_A[column], df_B[column])
    bootstrapped_results[column] = {
        'observed_diff': observed_diff,
        'p_value': p_value,
        'boot_diffs': boot_diffs
    }

# Print the results
for column, results in bootstrapped_results.items():
    print(f"{column} - Observed Difference: {results['observed_diff']:.5f}, P-value: {results['p_value']:.5f}")


q1 - Observed Difference: -0.18333, P-value: 0.58440
q2 - Observed Difference: -0.55833, P-value: 0.71470
q3 - Observed Difference: -0.45833, P-value: 0.68420
q4 - Observed Difference: -1.12500, P-value: 0.86480
q5 - Observed Difference: 0.15833, P-value: 0.42940


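Taken at face value, none of the questions (q1 through q5) has a p-value below 0.05 in this analysis. However, these p-values should be read with care: the function compares resampled t-statistics with the observed difference in means, two quantities on different scales, and the comparison is again one-sided. The values above are therefore not strong evidence that the differences are due to chance. For q4 in particular, the t-test in In [9] found a significant difference in favour of Version B.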
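
A bootstrapped t-test usually compares the resampled t-statistics with the observed t-statistic, not with the raw difference in means. The sketch below shows one way this could look, with a two-sided p-value. The names `pooled_t` and `bootstrapped_ttest_two_sided`, the `seed` parameter, and the example scores are assumptions made for illustration; resamples where the t-statistic is undefined (no variation in either sample and equal means) are dropped.

```python
import numpy as np

def pooled_t(a, b):
    """Pooled-variance t-statistic along the last axis (works for 1-D and 2-D arrays)."""
    n1, n2 = a.shape[-1], b.shape[-1]
    var1 = a.var(axis=-1, ddof=1)
    var2 = b.var(axis=-1, ddof=1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Constant resamples give a zero standard error; silence the resulting warnings
    with np.errstate(divide='ignore', invalid='ignore'):
        return (a.mean(axis=-1) - b.mean(axis=-1)) / standard_error

def bootstrapped_ttest_two_sided(data1, data2, num_bootstrap=10000, seed=0):
    rng = np.random.default_rng(seed)  # fixed seed so the result is reproducible
    data1 = np.asarray(data1, dtype=float)
    data2 = np.asarray(data2, dtype=float)
    observed_t = pooled_t(data1, data2)

    # Resample both groups from the pooled data (null hypothesis: no difference)
    combined = np.concatenate([data1, data2])
    samples1 = rng.choice(combined, size=(num_bootstrap, len(data1)), replace=True)
    samples2 = rng.choice(combined, size=(num_bootstrap, len(data2)), replace=True)
    boot_t = pooled_t(samples1, samples2)
    boot_t = boot_t[~np.isnan(boot_t)]  # drop undefined t-statistics

    # Compare like with like: resampled t-statistics against the observed t-statistic
    p_value = np.mean(np.abs(boot_t) >= abs(observed_t))
    return float(observed_t), float(p_value)

# Hypothetical Likert scores (0-4) for two versions
group_a = [1, 1, 2, 1, 2, 1, 1, 2]
group_b = [3, 4, 3, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3, 4, 3]

t_obs, p = bootstrapped_ttest_two_sided(group_a, group_b)
print(f"Observed t-statistic: {t_obs:.5f}, two-sided p-value: {p:.5f}")
```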