Objective of the data analysis: I am assuming the role of a product analyst at a mobile application company. I have just finished running an A/B test on two variants of a key feature. The data collected includes click-through rate (CTR) and average Time Spent in minutes, along with some demographic qualifiers.

My approach is to prepare a data analysis report with each step annotated so that a future reader can follow the code.

The report will aim to complete the tasks listed:

1. Analyze the results to determine which feature (if any) results in CTR or Time Spent lift.

2. Conduct statistical testing to determine if there is a statistically significant difference between the features and the control group.

3. Summarize your results. Make a recommendation to the engineering team about which feature to deploy. 

4. Create a roll-out plan. How quickly will you introduce the feature to your audience?

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import math
import scipy.stats

First, all the necessary Python libraries are imported.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/ajkam/schulich_data_science/main/Assignment%202/experiment_dataset.csv")

Next, the dataset is read into a pandas DataFrame. The data is hosted on my GitHub.
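The `df.info()` output later in this report lists an `Unnamed: 0` column, which usually means a row index was written into the CSV when it was saved. A minimal sketch of one way to avoid loading it as data, assuming the first CSV column really is that saved index. The small inline CSV below is made up for illustration and only stands in for the real file:

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the experiment CSV (values are illustrative only).
sample_csv = io.StringIO(
    ",Age,Location,Device,Variant,Time Spent,CTR\n"
    "0,30,Location1,Device1,Control,20.0,0.10\n"
    "1,45,Location2,Device2,Variant A,25.0,0.12\n"
    "2,52,Location3,Device3,Variant B,23.0,0.11\n"
)

# index_col=0 makes pandas use the first column as the row index
# instead of loading it as a data column named "Unnamed: 0".
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

The same `index_col=0` argument could be passed when reading the real URL, if the assumption about the first column holds.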

# Step 1: Analyze the results to determine if there is any lift

This step helps us understand whether either variant changed the key metrics, CTR and Time Spent, that were tracked throughout the experiment.

In [3]:
df.head(20)

Unnamed: 0.1,Unnamed: 0,Age,Location,Device,Variant,Time Spent,CTR
0,0,62,Location2,Device2,Control,13.928669,0.084776
1,1,18,Location1,Device1,Variant B,11.310518,0.096859
2,2,21,Location2,Device1,Variant B,24.8421,0.09763
3,3,21,Location1,Device3,Variant B,20.0613,0.109783
4,4,57,Location1,Device2,Variant B,34.495503,0.068579
5,5,27,Location3,Device1,Variant B,26.129246,0.149341
6,6,37,Location3,Device3,Variant B,20.525362,0.095788
7,7,39,Location2,Device1,Variant A,21.525217,0.149985
8,8,54,Location3,Device2,Control,21.910608,0.135535
9,9,41,Location1,Device2,Variant A,27.642788,0.137266


View the first 20 rows of data to get an idea of what we are working with

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1000 non-null   int64  
 1   Age         1000 non-null   int64  
 2   Location    1000 non-null   object 
 3   Device      1000 non-null   object 
 4   Variant     1000 non-null   object 
 5   Time Spent  1000 non-null   float64
 6   CTR         1000 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 54.8+ KB


Understand key information about the dataset

In [5]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Age,Time Spent,CTR
count,1000.0,1000.0,1000.0,1000.0
mean,499.5,40.715,22.713345,0.109145
std,288.819436,13.655557,5.479292,0.022366
min,0.0,18.0,7.114806,0.012975
25%,249.75,28.0,19.216608,0.094286
50%,499.5,41.0,22.506707,0.108944
75%,749.25,53.0,26.25595,0.124238
max,999.0,64.0,39.39577,0.172728


Describe the main measures of central tendency and other important factors of the dataset

In [6]:
print(df['Location'].unique())

['Location2' 'Location1' 'Location3']


Understand how many unique locations there are

In [7]:
print(df['Device'].unique())

['Device2' 'Device1' 'Device3']


Understand how many unique devices there are

In [8]:
print(df['Variant'].unique())

['Control' 'Variant B' 'Variant A']


Understand how many unique variants there are
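Beyond the list of unique labels, the number of users in each group also matters when the groups are compared. A minimal sketch using `value_counts`, with a made-up `Variant` column (named `toy_variants` here for illustration) standing in for the real data:

```python
import pandas as pd

# Hypothetical group assignments, for illustration only.
toy_variants = pd.DataFrame(
    {"Variant": ["Control", "Variant A", "Variant B", "Control", "Variant A", "Control"]}
)

# Absolute number of rows per group.
group_counts = toy_variants["Variant"].value_counts()

# Share of rows per group (fractions that sum to 1).
group_shares = toy_variants["Variant"].value_counts(normalize=True)

print(group_counts)
print(group_shares)
```

On the real DataFrame the equivalent call would be `df['Variant'].value_counts()`.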

In [12]:
df.groupby('Location')[['Time Spent', 'CTR']].mean()


Location,Time Spent,CTR
Location1,22.707286,0.110217
Location2,22.648998,0.108517
Location3,22.787691,0.108708


Understand whether mean Time Spent and CTR vary by location. There is only marginal variation in Time Spent: Location2 averages 22.6 minutes whereas Location3 averages 22.8. CTR also varies only marginally: Location1 is 0.110 whereas the other two are about 0.109.

In [15]:
df.groupby('Device')[['Time Spent', 'CTR']].mean()


Device,Time Spent,CTR
Device1,22.635032,0.109634
Device2,22.890021,0.109868
Device3,22.612276,0.107993


Understand whether mean Time Spent and CTR vary by device. There is only marginal variation in Time Spent: Device1 averages 22.6 minutes whereas Device2 averages 22.9. There is almost no variation in CTR: Device3 is 0.108 whereas the other two are about 0.110.

In [13]:
df.groupby('Age')[['Time Spent', 'CTR']].mean()


Age,Time Spent,CTR
18,21.80516,0.104431
19,24.301099,0.113928
20,22.658484,0.106769
21,23.174444,0.106317
22,23.339777,0.111566
23,21.203465,0.111024
24,21.505956,0.109765
25,23.541231,0.109907
26,22.598906,0.110061
27,21.72939,0.105247


Understand whether mean Time Spent and CTR vary by age. There is slight variation in Time Spent: the highest average is 25.3 minutes while the lowest is 20.8. There is also slight variation in CTR: the highest average is 0.12 while the lowest is 0.099.
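Grouping by every single age leaves few users per group, so the per-age averages can be noisy. A minimal sketch of grouping ages into bands with `pd.cut` instead. The band edges and labels are an assumption chosen for illustration (the dataset's ages run from 18 to 64), and the rows are made up:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy_ages = pd.DataFrame({
    "Age": [18, 22, 30, 41, 58, 64],
    "Time Spent": [20.0, 22.0, 24.0, 23.0, 21.0, 25.0],
    "CTR": [0.10, 0.12, 0.11, 0.09, 0.10, 0.12],
})

# Assumed band edges; right=False makes each band include its lower edge
# and exclude its upper edge, e.g. [18, 25).
edges = [18, 25, 35, 45, 55, 65]
labels = ["18-24", "25-34", "35-44", "45-54", "55-64"]
toy_ages["Age Band"] = pd.cut(toy_ages["Age"], bins=edges, labels=labels, right=False)

# observed=True keeps only the bands that actually contain rows.
band_means = toy_ages.groupby("Age Band", observed=True)[["Time Spent", "CTR"]].mean()
print(band_means)
```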

Since Age, Device, and Location show no substantial variation in their averages, we need to look into Variant.
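Lift for each variant can be expressed as the percentage change in a group's mean relative to the Control mean. A minimal sketch of that calculation, using made-up numbers (the `toy_results` frame is hypothetical and does not reflect the experiment's results):

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy_results = pd.DataFrame({
    "Variant": ["Control", "Control", "Variant A", "Variant A", "Variant B", "Variant B"],
    "Time Spent": [20.0, 22.0, 24.0, 26.0, 22.0, 24.0],
    "CTR": [0.09, 0.11, 0.11, 0.13, 0.10, 0.12],
})

# Mean of each metric per group.
group_means = toy_results.groupby("Variant")[["Time Spent", "CTR"]].mean()

# Relative lift over Control, in percent: (group mean / control mean - 1) * 100.
# Dividing a DataFrame by a Series aligns on column names, so each metric
# is divided by its own Control mean.
lift_pct = (group_means / group_means.loc["Control"] - 1) * 100
print(lift_pct.round(2))
```

A positive value means the group's average is above Control. Whether that difference is statistically significant is a separate question for the testing step.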

In [None]:
Variant_A = df[df.Variant == 'Variant A']
Variant_A.describe()

In [None]:
Variant_B = df[df.Variant == 'Variant B']
Variant_B.describe()

In [None]:
Control = df[df.Variant == 'Control']
Control.describe()

In [None]:
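# numeric_only=True skips the text columns, which recent pandas versions refuse to average
df.var(numeric_only=True)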


In [None]:
# numeric_only=True skips the text columns (Location, Device, Variant)
Variant_A.var(numeric_only=True)


In [None]:
# numeric_only=True skips the text columns (Location, Device, Variant)
Variant_B.var(numeric_only=True)


In [None]:
# numeric_only=True skips the text columns (Location, Device, Variant)
Control.var(numeric_only=True)
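One common reason to compare group variances is to decide which two-sample test to use. If the variances differ, Welch's t-test, which does not assume equal variances, is a frequent choice. The sketch below assumes that is the purpose here and uses `scipy.stats.ttest_ind` with `equal_var=False`. The arrays are randomly generated placeholders, not the experiment data:

```python
import numpy as np
from scipy import stats

# Randomly generated placeholder samples, for illustration only.
rng = np.random.default_rng(0)
control_time = rng.normal(loc=20.0, scale=5.0, size=300)
variant_time = rng.normal(loc=24.0, scale=5.0, size=300)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(variant_time, control_time, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

With the real data, the inputs would be columns such as `Variant_A['Time Spent']` and `Control['Time Spent']`.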