In [436]:
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

In [438]:
#In this notebook I run an A/B test to determine whether a new marketing campaign is more effective than the current one. I will use standard methods to explore the impact (if any) of the new campaign, and I will explain each step as we go along.

In [440]:
#First, let's import the data! We will use two data sets: Control Group (data on the current marketing campaign and its results) and Test Group (data on the results of our new marketing campaign).

In [442]:
control_group = pd.read_csv('C:\\Users\\diamo\\datasets\\AB Portfolio Project Data\\control_group.csv', delimiter = ';')

In [444]:
test_group = pd.read_csv('C:\\Users\\diamo\\datasets\\AB Portfolio Project Data\\test_group.csv', delimiter = ';')

In [446]:
#Great! Now let's take a quick look at the first few rows of each data set to get an idea of the data they contain.

In [448]:
print(control_group.head())

      Campaign Name       Date  Spend [USD]  # of Impressions     Reach  \
0  Control Campaign  1.08.2019         2280           82702.0   56930.0   
1  Control Campaign  2.08.2019         1757          121040.0  102513.0   
2  Control Campaign  3.08.2019         2343          131711.0  110862.0   
3  Control Campaign  4.08.2019         1940           72878.0   61235.0   
4  Control Campaign  5.08.2019         1835               NaN       NaN   

   # of Website Clicks  # of Searches  # of View Content  # of Add to Cart  \
0               7016.0         2290.0             2159.0            1819.0   
1               8110.0         2033.0             1841.0            1219.0   
2               6508.0         1737.0             1549.0            1134.0   
3               3065.0         1042.0              982.0            1183.0   
4                  NaN            NaN                NaN               NaN   

   # of Purchase  
0          618.0  
1          511.0  
2          372.0  
3          340.0  
4            NaN  

In [450]:
print(test_group.head())

   Campaign Name       Date  Spend [USD]  # of Impressions  Reach  \
0  Test Campaign  1.08.2019         3008             39550  35820   
1  Test Campaign  2.08.2019         2542            100719  91236   
2  Test Campaign  3.08.2019         2365             70263  45198   
3  Test Campaign  4.08.2019         2710             78451  25937   
4  Test Campaign  5.08.2019         2297            114295  95138   

   # of Website Clicks  # of Searches  # of View Content  # of Add to Cart  \
0                 3038           1946               1069               894   
1                 4657           2359               1548               879   
2                 7885           2572               2367              1268   
3                 4216           2216               1437               566   
4                 5863           2106                858               956   

   # of Purchase  
0            255  
1            677  
2            578  
3            340  
4            768  


In [452]:
#Looks good. However, there are NaN values in row 4 of control_group. Since this data set is only 30 rows, let's print both data frames in full to see if there are any other NaN values in either one.

In [454]:
print(control_group)
print(test_group)

       Campaign Name        Date  Spend [USD]  # of Impressions     Reach  \
0   Control Campaign   1.08.2019         2280           82702.0   56930.0   
1   Control Campaign   2.08.2019         1757          121040.0  102513.0   
2   Control Campaign   3.08.2019         2343          131711.0  110862.0   
3   Control Campaign   4.08.2019         1940           72878.0   61235.0   
4   Control Campaign   5.08.2019         1835               NaN       NaN   
5   Control Campaign   6.08.2019         3083          109076.0   87998.0   
6   Control Campaign   7.08.2019         2544          142123.0  127852.0   
7   Control Campaign   8.08.2019         1900           90939.0   65217.0   
8   Control Campaign   9.08.2019         2813          121332.0   94896.0   
9   Control Campaign  10.08.2019         2149          117624.0   91257.0   
10  Control Campaign  11.08.2019         2490          115247.0   95843.0   
11  Control Campaign  12.08.2019         2319          116639.0  100189.0   
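
#Printing the whole frame works for 30 rows, but counting the NaN values is less error-prone than scanning by eye. Here is a minimal sketch of that check. The small `sample` frame is only an illustration built from the first five control rows printed above (with a subset of the columns); it is not the full data set.

```python
import numpy as np
import pandas as pd

# First five control rows as printed above (a subset of columns, for illustration).
sample = pd.DataFrame({
    "Spend [USD]": [2280, 1757, 2343, 1940, 1835],
    "# of Impressions": [82702.0, 121040.0, 131711.0, 72878.0, np.nan],
    "Reach": [56930.0, 102513.0, 110862.0, 61235.0, np.nan],
})

# Number of missing values in each column.
nan_per_column = sample.isna().sum()
print(nan_per_column)

# Index labels of the rows that contain at least one missing value.
rows_with_nan = sample.index[sample.isna().any(axis=1)].tolist()
print(rows_with_nan)
```

#The same two calls should work unchanged on control_group and test_group.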

In [456]:
#Good. It looks like the only NaN values are in row 4 of control_group. There are several ways to handle NaN values, but for this data it makes sense to fill them with each column's average, because that gives a reasonable estimate of what those values could have been.
#Before I replace the NaN values with the averages, I am also going to clean up the column names so the data is easier to read.

In [458]:
control_group.columns = control_group.columns.str.replace('# of ', '')
test_group.columns = test_group.columns.str.replace('# of ', '')

In [460]:
#This just gets rid of '# of' in each column's name. Now to replace the NaN values in row 4.

In [462]:
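#Future versions of pandas will handle this change of data types differently and raise a FutureWarning to let you know. The warning is not needed for this project, so I filter it before running the code that triggers it.
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
#Every column from 'Impressions' onward is a numeric metric that is NaN in row 4.
target_columns = control_group.columns[3:]
#Column averages, rounded to whole numbers (mean() skips NaN values).
target_means = control_group[target_columns].mean().round(0)
#Fill the NaN values in row 4 with the matching column averages.
control_group.loc[4, target_columns] = control_group.loc[4, target_columns].fillna(target_means)
#Make sure the filled columns are stored as floats.
control_group = control_group.astype({col: 'float64' for col in target_columns})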
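
#An alternative way to do the same imputation is to fill whole columns at once instead of a single row. This is only a sketch on an illustrative `sample` frame built from rows 0-4 of two control columns printed earlier, so the filled values differ from the real ones, which use all 30 rows.

```python
import numpy as np
import pandas as pd

# Illustrative frame: rows 0-4 of two control columns as printed earlier.
sample = pd.DataFrame({
    "Impressions": [82702.0, 121040.0, 131711.0, 72878.0, np.nan],
    "Purchase": [618.0, 511.0, 372.0, 340.0, np.nan],
})

# mean() skips NaN by default, so each mean uses only the four observed rows.
column_means = sample.mean().round(0)

# Passing a Series to fillna fills each column with the value at the matching label.
filled = sample.fillna(column_means)
print(filled)
```

#Because the columns stay float64 throughout, this form should not need a separate astype step.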

In [498]:
print(control_group.head())

      Campaign Name       Date  Spend [USD]  Impressions     Reach  \
0  Control Campaign  1.08.2019         2280      82702.0   56930.0   
1  Control Campaign  2.08.2019         1757     121040.0  102513.0   
2  Control Campaign  3.08.2019         2343     131711.0  110862.0   
3  Control Campaign  4.08.2019         1940      72878.0   61235.0   
4  Control Campaign  5.08.2019         1835     109560.0   88845.0   

   Website Clicks  Searches  View Content  Add to Cart  Purchase  
0          7016.0    2290.0        2159.0       1819.0     618.0  
1          8110.0    2033.0        1841.0       1219.0     511.0  
2          6508.0    1737.0        1549.0       1134.0     372.0  
3          3065.0    1042.0         982.0       1183.0     340.0  
4          5321.0    2221.0        1944.0       1300.0     523.0  


In [466]:
#Great! Now the data is easy to read, and the NaN values have been replaced with the average value for each column.
#Now we can start asking some questions. For this example, there are a lot of different approaches.
#What we want to know is whether impressions in our test group convert to purchases at a higher rate than impressions in our control group.
#So let's get started!
#The first thing we are going to do is establish our baseline conversion rate.
#We do that by dividing total purchases by total impressions and expressing the result as a percentage.
#We use the totals rather than the mean of the daily rates because we want the overall performance of the campaign, with every impression weighted equally.

In [468]:
control_group_impressions_to_purchases = control_group['Purchase'].sum() / control_group['Impressions'].sum() * 100

In [470]:
print(control_group_impressions_to_purchases)

0.4771824693553869
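
#To see why pooling the totals is not the same as averaging daily rates, here is a small sketch. The `sample` frame holds only rows 0-3 of the control group as printed earlier, so the numbers are illustrative and will not match the full-data rate above.

```python
import pandas as pd

# Rows 0-3 of the control group as printed earlier.
sample = pd.DataFrame({
    "Impressions": [82702.0, 121040.0, 131711.0, 72878.0],
    "Purchase": [618.0, 511.0, 372.0, 340.0],
})

# Pooled rate: total purchases over total impressions, as a percentage.
pooled_rate = sample["Purchase"].sum() / sample["Impressions"].sum() * 100

# Mean of daily rates: every day counts equally, whatever its traffic.
mean_daily_rate = (sample["Purchase"] / sample["Impressions"]).mean() * 100

print(pooled_rate, mean_daily_rate)
```

#The two numbers differ because a low-traffic day with a high rate pulls the mean of daily rates up, while the pooled rate weights every impression equally.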


In [472]:
#So there is our baseline conversion rate as a percent.
#Now we have to ask 'What is the minimum increase in our conversion rate that would make this new campaign worth the time and money?'
#That is known as the Minimum Detectable Effect (MDE): the smallest lift we care about detecting.
#This is not a set-in-stone number; it depends on what you or your company wants to accomplish. It could be 0.0001% or it could be 50%.
#Since our baseline conversion rate of impressions to purchases is so low (less than 0.5%), let's say that the MDE will be a 2% relative increase.
#That means that if the Test Group's conversion rate is at least 2% higher than the baseline (0.477% * 1.02, or about 0.487%), we will consider it a success and run the new campaign.
#Now let's calculate the conversion rate for the test group.

In [474]:
test_group_impressions_to_purchases = test_group['Purchase'].sum() / test_group['Impressions'].sum() * 100

In [476]:
print(test_group_impressions_to_purchases)

0.6988465925139349


In [478]:
#Great. Now let's calculate the absolute difference between the two conversion rates, in percentage points.
print(test_group_impressions_to_purchases - control_group_impressions_to_purchases)

0.221664123158548
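
#The number above is an absolute difference in percentage points, not a relative increase. Here is a short sketch that computes both from the two rates printed above. It assumes the 2% MDE is meant as a relative increase over the baseline; the name `mde_relative_pct` is just for illustration.

```python
# Conversion rates in percent, copied from the outputs above.
control_rate = 0.4771824693553869
test_rate = 0.6988465925139349

# Assumption: the 2% MDE is read as a relative increase over the baseline.
mde_relative_pct = 2.0

# Absolute difference, in percentage points.
absolute_diff_pp = test_rate - control_rate

# Relative lift: the absolute difference as a percentage of the baseline rate.
relative_lift_pct = absolute_diff_pp / control_rate * 100

print(f"absolute difference: {absolute_diff_pp:.4f} percentage points")
print(f"relative lift: {relative_lift_pct:.1f}%")
print("meets MDE:", relative_lift_pct >= mde_relative_pct)
```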


In [488]:
#Fantastic! The new campaign clearly outperforms the old one. We wanted a lift of at least 2%, and the conversion rate rose by about 0.22 percentage points (from 0.477% to 0.699%), which is a relative increase of about 46%!
#As good as that looks, let's run another test to make sure the increase is statistically significant and did not occur by chance.
#Even though such strong success seems like a no-brainer, marketing is expensive, so let's be sure!
#We are going to run what is called a two-proportion z-test to check whether a difference this large could plausibly be due to chance alone.

In [494]:
control_purchases = control_group['Purchase'].sum()
test_purchases = test_group['Purchase'].sum()
control_impressions = control_group['Impressions'].sum()
test_impressions = test_group['Impressions'].sum()
purchase_counts = [control_purchases, test_purchases]
impression_counts = [control_impressions, test_impressions]

In [496]:
stat, p_value = proportions_ztest(purchase_counts, impression_counts)
print(p_value)

2.5966752754680143e-254
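
#For intuition, here is a hand-rolled sketch of a pooled, two-sided two-proportion z-test using only the standard library. As far as I know this matches what proportions_ztest does with its default settings, but treat that as an assumption. The function name and the counts in the example call are hypothetical, chosen only to illustrate the calculation; they are not the campaign totals.

```python
import math

def two_proportion_ztest(successes_a, trials_a, successes_b, trials_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (rate_a - rate_b) / std_error
    # Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, chosen only to illustrate the call.
z, p_value = two_proportion_ztest(480, 100_000, 700, 100_000)
print(z, p_value)
```

#With the control group listed first, a negative z means the control rate is lower than the test rate.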


In [None]:
#And there we go! A p-value below .05 is conventionally considered statistically significant, meaning a difference this large would be very unlikely to arise by chance alone!
#This seems intuitive, since our conversion rate rose by about 0.22 percentage points (roughly 46% in relative terms), but the test supports our decision to go ahead with the new campaign!
#To be specific, we operate under the assumption of the Null Hypothesis.
#The Null Hypothesis, in this case, is that the new marketing campaign had ZERO impact (positive or negative) on the conversion rate.
#What we are testing is our Alternative Hypothesis, which is that the new campaign did have an effect on the conversion rate.
#If the p-value is above .05 (the standard threshold), we fail to reject the Null Hypothesis: the data do not give us enough evidence to say that the new campaign had an effect on the conversion rate.
#If the p-value is below .05, we reject the Null Hypothesis (meaning that a difference this large would be extremely unlikely if the new campaign had no effect on the conversion rate).
#In this case, the p-value was drastically below .05, and since the relative increase in our conversion rate was about 46%, our new campaign is a go!
#Time to get this information to the marketing department and project managers so that we can cash in on our new campaign!
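
#The decision logic described above can be sketched as a small helper. The function name `decide`, its verdict strings, and the reading of the MDE as a 2% relative increase are my own assumptions for illustration; the p-value and conversion rates are copied from the outputs earlier in the notebook.

```python
def decide(p_value, relative_lift_pct, alpha=0.05, mde_relative_pct=2.0):
    """Return a short verdict from the p-value and the observed relative lift."""
    if p_value >= alpha:
        # Not enough evidence of any effect at the chosen significance level.
        return "fail to reject the null hypothesis"
    if relative_lift_pct < mde_relative_pct:
        # A real effect, but smaller than the lift we decided was worth acting on.
        return "significant, but below the MDE"
    return "significant and at least the MDE"

# p-value and conversion rates (in percent) copied from the outputs above.
p_value = 2.5966752754680143e-254
control_rate = 0.4771824693553869
test_rate = 0.6988465925139349
relative_lift_pct = (test_rate - control_rate) / control_rate * 100

verdict = decide(p_value, relative_lift_pct)
print(verdict)
```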