### Marketing A/B testing dataset
Marketing companies want to run successful campaigns, but the market is complex and several options can work. So they normally run A/B tests: a randomized experimentation process in which two or more versions of a variable (web page, page element, banner, etc.) are shown to different segments of people at the same time to determine which version has the greatest impact and drives business metrics.

The companies are interested in answering two questions:

- Would the campaign be successful?
- If the campaign was successful, how much of that success could be attributed to the ads?

With the second question in mind, we normally do an A/B test. The majority of the people are exposed to ads (the experimental group), and a small portion of people (the control group) instead see a Public Service Announcement (PSA) (or nothing) in the exact size and place the ad would normally be.
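
The group assignment described above can be sketched as a weighted random draw. This is only an illustration: the 96/4 split roughly mirrors the group sizes in this dataset, and the seed, user count, and variable names are assumptions, not details of how the real experiment was run.

```python
import numpy as np

# Hypothetical setup: assign 10,000 users at random,
# ~96% to the ad group and ~4% to the psa (control) group.
rng = np.random.default_rng(seed=0)
n_users = 10_000
groups = rng.choice(["ad", "psa"], size=n_users, p=[0.96, 0.04])

# Count how many users landed in each group.
values, counts = np.unique(groups, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Because the draw is random, the exact group sizes vary from run to run unless the seed is fixed.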

The idea of the dataset is to analyze the groups, find if the ads were successful, how much the company can make from the ads, and if the difference between the groups is statistically significant.

### Data dictionary:

- Index: Row index
- user id: User ID (unique)
- test group: "ad" if the person saw the advertisement, "psa" if they only saw the public service announcement
- converted: True if the person bought the product, otherwise False
- total ads: Number of ads seen by the person
- most ads day: Day on which the person saw the largest number of ads
- most ads hour: Hour of the day at which the person saw the largest number of ads

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/marketing-ab-testing/marketing_AB.csv


In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
sns.set()



In [3]:
df = pd.read_csv('/kaggle/input/marketing-ab-testing/marketing_AB.csv')
df.head()

|   | Unnamed: 0 | user id | test group | converted | total ads | most ads day | most ads hour |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1069124 | ad | False | 130 | Monday | 20 |
| 1 | 1 | 1119715 | ad | False | 93 | Tuesday | 22 |
| 2 | 2 | 1144181 | ad | False | 21 | Tuesday | 18 |
| 3 | 3 | 1435133 | ad | False | 355 | Tuesday | 10 |
| 4 | 4 | 1015700 | ad | False | 276 | Friday | 14 |


In [4]:
df.drop('Unnamed: 0', axis=1, inplace=True)
df.describe(include='all')

|   | user id | test group | converted | total ads | most ads day | most ads hour |
|---|---|---|---|---|---|---|
| count | 588101.0 | 588101 | 588101 | 588101.0 | 588101 | 588101.0 |
| unique |  | 2 | 2 |  | 7 |  |
| top |  | ad | False |  | Friday |  |
| freq |  | 564577 | 573258 |  | 92608 |  |
| mean | 1310692.0 |  |  | 24.820876 |  | 14.469061 |
| std | 202226.0 |  |  | 43.715181 |  | 4.834634 |
| min | 900000.0 |  |  | 1.0 |  | 0.0 |
| 25% | 1143190.0 |  |  | 4.0 |  | 11.0 |
| 50% | 1313725.0 |  |  | 13.0 |  | 14.0 |
| 75% | 1484088.0 |  |  | 27.0 |  | 18.0 |
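
The `Unnamed: 0` column dropped above looks like a row index that was saved into the CSV. As a sketch, assuming the file stores that index as its first, unnamed column (the inline CSV text below is a stand-in built from the first rows shown earlier, not the real file), `index_col=0` reads it back as the index so nothing needs to be dropped afterwards.

```python
import io
import pandas as pd

# Stand-in for the real file: header layout is an assumption,
# the two data rows are copied from the df.head() output.
csv_text = """,user id,test group,converted,total ads,most ads day,most ads hour
0,1069124,ad,False,130,Monday,20
1,1119715,ad,False,93,Tuesday,22
"""

# index_col=0 treats the first (unnamed) column as the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```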


In [5]:
# clean the column names
df.columns = df.columns.str.replace(' ', '_')

In [6]:
df.isnull().sum()

user_id          0
test_group       0
converted        0
total_ads        0
most_ads_day     0
most_ads_hour    0
dtype: int64

**split the data into ad and psa**

In [7]:
ad = df[df.test_group == 'ad'].reset_index(drop=True)
psa = df[df.test_group == 'psa'].reset_index(drop=True)

In [8]:
ad.shape, psa.shape

((564577, 6), (23524, 6))

#### the data is skewed towards ads (about 96% of users are in the ad group). let's see if the difference in conversions is statistically significant

the code below is commented out. it was a ***sanity check***:
I selected the same number of rows from the ad group as there are in the psa group, then checked the conversion rates and the significance of their difference. the results were the same, so the imbalance between the groups does not drive the result.
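
A reproducible version of this sanity check could look like the sketch below. It runs on made-up toy data (the 3% conversion probability, the group sizes, and the `random_state` value are all assumptions), and it uses `DataFrame.sample` to shuffle and select in one step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "test_group": ["ad"] * 9000 + ["psa"] * 1000,
    "converted": rng.random(10_000) < 0.03,
})

toy_ad = toy[toy.test_group == "ad"]
toy_psa = toy[toy.test_group == "psa"]

# Draw as many ad rows as there are psa rows, without replacement.
# random_state makes the draw repeatable.
ad_balanced = toy_ad.sample(n=len(toy_psa), random_state=42)

# Compare the subsample's conversion rate with the full ad group's rate.
print(ad_balanced.converted.mean(), toy_ad.converted.mean())
```

If the subsample's rate stays close to the full group's rate across several seeds, the group imbalance is unlikely to be distorting the comparison.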

In [9]:
# shuffle the ad data rows
##ad = ad.sample(frac=1)
# select the number of rows you need
##ad = ad.iloc[:23524, :]
##ad.shape

In [10]:
converts_ad = ad.groupby('converted')['user_id'].count()
converts_ad

converted
False    550154
True      14423
Name: user_id, dtype: int64

In [11]:
converts_psa = psa.groupby('converted')['user_id'].count()
converts_psa

converted
False    23104
True       420
Name: user_id, dtype: int64

#### we can see that ad is doing better than psa. but let's check this difference for statistical significance
- H0: the ad and psa groups have the same conversion rate
- H1: the conversion rates of the two groups differ
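
Written out, with $p_{ad}$ and $p_{psa}$ standing for the true conversion rates of the two groups (notation introduced here for illustration), the two-sided hypotheses are:

$$
H_0:\; p_{ad} = p_{psa}
\qquad\text{versus}\qquad
H_1:\; p_{ad} \neq p_{psa}
$$

A p-value below the chosen threshold (0.05 in this notebook) counts as evidence against $H_0$.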

In [12]:
from scipy.stats import shapiro, levene, mannwhitneyu, ttest_ind

#### check normality and homogeneity of variance

In [13]:
# normality check
_, pvalue_ad = shapiro(ad.converted)
_, pvalue_psa = shapiro(psa.converted)
pvalue_psa, pvalue_ad



(0.0, 0.0)

In [14]:
# variance homogeneity check
levene(psa.converted, ad.converted)

LeveneResult(statistic=54.322884225785735, pvalue=1.703305262368155e-13)

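##### both p-values are below 0.05, so both tests reject their null hypotheses.

this means the data is not normally distributed (expected, since `converted` is a True/False column) and the two groups do not have equal variances, so the assumptions of the standard t-test are not strictly met. with samples this large the t-test on the means is still a reasonable approximation, but Welch's version (`equal_var=False`) or a test for proportions is the safer choice. note also that `shapiro` is not designed for samples this large, so its p-value is only approximate.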
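
Levene's test returns a very small p-value here, which points to unequal variances between the groups. One option in that situation is Welch's t-test, which `ttest_ind` runs when `equal_var=False` is passed. The sketch below rebuilds the two 0/1 columns from the counts printed earlier so it can run on its own. It is a possible cross-check, not the method used in this notebook.

```python
import numpy as np
from scipy.stats import ttest_ind

# Rebuild the two 0/1 conversion columns from the counts shown above.
ad_conv = np.repeat([1, 0], [14423, 550154])
psa_conv = np.repeat([1, 0], [420, 23104])

# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = ttest_ind(ad_conv, psa_conv, equal_var=False)
print(result.statistic, result.pvalue)
```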

In [15]:
# test for significance
ttest_ind(psa.converted, ad.converted)

Ttest_indResult(statistic=-7.37040597428566, pvalue=1.7033052627831264e-13)

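#### the results are statistically significant!
- the p-value (about 1.7e-13) is far below 0.05, so we reject the null hypothesis
- because users were randomly assigned to the two groups, the difference in conversions can be attributed to the ads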
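
Since `converted` is a yes/no outcome, a test built for counts is a natural cross-check on the t-test above. The sketch below runs a chi-square test of independence on the 2x2 table of counts printed earlier. The variable names are illustrative, and `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.

```python
from scipy.stats import chi2_contingency

# Counts taken from the groupby outputs above.
#         converted, not converted
table = [[14423, 550154],   # ad
         [420, 23104]]      # psa

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
```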

In [16]:
# conversion rates: converted users / all users in the group
print(f"psa: {psa.converted.mean():.4f}, ad: {ad.converted.mean():.4f}")

psa: 0.0179, ad: 0.0255

#### psa has a conversion rate of about 1.8% (420 of 23,524 users), while ad has about 2.6% (14,423 of 564,577 users). the difference is only about 0.8 percentage points, which is a relative lift of roughly 43% over psa, and the ads also reach a larger audience, so it might be worth it.
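
To get a sense of how precisely this difference is estimated, the sketch below computes a 95% confidence interval for the difference between the two rates, using the counts printed earlier. It uses the simple Wald (normal approximation) interval, which is one common choice and an assumption here, not something the analysis above relies on.

```python
import math

# Counts taken from the groupby outputs above.
ad_conv, ad_n = 14423, 14423 + 550154
psa_conv, psa_n = 420, 420 + 23104

p_ad = ad_conv / ad_n
p_psa = psa_conv / psa_n
diff = p_ad - p_psa

# Standard error of the difference between two independent proportions.
se = math.sqrt(p_ad * (1 - p_ad) / ad_n + p_psa * (1 - p_psa) / psa_n)

# 95% Wald interval (1.96 is the normal quantile for 95% coverage).
low, high = diff - 1.96 * se, diff + 1.96 * se

print(f"ad rate  = {p_ad:.4f}")
print(f"psa rate = {p_psa:.4f}")
print(f"difference = {diff:.4f}  (95% CI {low:.4f} to {high:.4f})")
print(f"relative lift = {diff / p_psa:.1%}")
```

An interval that stays above zero agrees with the significance test: the ad group converts at a higher rate than the psa group.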

## let's check the most profitable days for ads

In [17]:
# conversion rate per day of the week (share of users who converted)
days = ad.groupby('most_ads_day')['converted'].mean()
px.bar(days, color=days.index, title='conversion rate by day of the week')

### mondays have the highest conversion rate, so they are the most profitable days.
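
A per-day rate is easier to judge next to the number of users behind it. The sketch below shows one way to get users, conversions, and rate in a single table with a named aggregation. The tiny DataFrame is made up for illustration and only reuses the column names of the cleaned data.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the cleaned data.
toy = pd.DataFrame({
    "most_ads_day": ["Monday", "Monday", "Monday", "Tuesday", "Tuesday", "Friday"],
    "converted":    [True,     False,    True,     False,     True,      False],
})

# The mean of a boolean column is the share of True values,
# which here is the conversion rate.
by_day = (
    toy.groupby("most_ads_day")["converted"]
       .agg(users="count", conversions="sum", rate="mean")
       .sort_values("rate", ascending=False)
)
print(by_day)
```

The same pattern works for `most_ads_hour` by swapping the grouping column.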

#### let's check what hours are the best

In [18]:
# the hours are integers from 0 to 23 (hour of the day), so we expect 24 unique values
ad.most_ads_hour.nunique()

24

#### each hour represents a time. 0 = 12am, 20 = 8pm
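
The mapping from the 0 to 23 values to clock times can be written as a small helper, for example to label a chart axis. `hour_label` is a hypothetical name introduced here and is not part of the notebook.

```python
def hour_label(hour: int) -> str:
    """Convert an hour of day (0-23) to a 12-hour clock label."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    suffix = "am" if hour < 12 else "pm"
    hour_12 = hour % 12 or 12  # 0 and 12 both display as 12
    return f"{hour_12}{suffix}"

print([hour_label(h) for h in (0, 12, 20)])
```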

In [19]:
# conversion rate per hour of the day (share of users who converted)
hours = ad.groupby('most_ads_hour')['converted'].mean()
px.bar(hours, color=hours.index, title='conversion rates by hour of day')

#### we can see that the peak time is during the day, between 1pm and 6pm.

##### another peak is between 7pm and 11pm

## conclusions and recommendations
- the difference between the conversion rates of the ad and psa groups is statistically significant.
- the ads are successful. they had better reach and conversions.
- mondays saw the highest conversion rate, and peak hours are between 1pm and 11pm.
- i suggest we run most of our ads on mondays or tuesdays between these hours.