# Introduction: #5 Calculation for Sample Size

This notebook computes the baseline (mean) and the standard deviation of revenue per session, the two parameters needed by the sample-size formula for an A/B test.

According to the result, __the average revenue per session__ from 2021/1/1 to 2021/1/31 in Northern America is __around 5.9__, and __the standard deviation of revenue per session__ is __around 5.3__. Both figures are computed across daily revenue-per-session values, using only sessions that responded to an internal promotion.

Note: If the metric is a ratio (e.g., add-to-cart rate or CVR), [online calculators](https://www.evanmiller.org/ab-testing/sample-size.html) can compute the sample size for us automatically.
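
For reference, the ratio-metric case can also be sketched in a few lines. The block below uses one common normal-approximation formula for comparing two proportions; it is not taken from the linked calculator and may not match its output exactly. The function name `sample_size_two_proportions` and the example rates (10% baseline, 12% target) are hypothetical values chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test of two proportions.

    p1 is the baseline rate and p2 the rate we want to be able to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null hypothesis
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical example: 10% baseline conversion rate, detect a lift to 12%
n_ratio = sample_size_two_proportions(0.10, 0.12)
print('Sessions needed per group (ratio metric):', n_ratio)
```

Smaller differences between `p1` and `p2` drive the required sample size up quickly, since the difference enters the formula squared.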

## Imports
We only use `pandas` in this section.

In [1]:
import pandas as pd

## Read in Data 

In [2]:
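ecommerce_all = pd.read_csv('ecommerce_all.csv')
# keep user ids as strings so they are not treated as floats
ecommerce_all['user_pseudo_id'] = ecommerce_all['user_pseudo_id'].astype(str)
# prefix the session id with the user id so that it identifies one session of one user
ecommerce_all['ga_session_id'] = ecommerce_all['user_pseudo_id'].astype(str) + ecommerce_all['ga_session_id'].astype(str)
ecommerce_all.head()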

Unnamed: 0,event_date,event_time,user_pseudo_id,continent,sub_continent,country,region,city,device_category,mobile_brand_name,...,transaction_id,item_id,item_name,item_brand,item_variant,item_category,price_in_usd,quantity,item_revenue_in_usd,promotion_name
0,20201103,2020-11-03 11:04:19,8825068.227430915,Europe,Southern Europe,(not set),(not set),(not set),desktop,Apple,...,(not set),GGOEGXXX1378,Google Cotopaxi Shell,(not set),(not set),Eco-Friendly,,,,(not set)
1,20201103,2020-11-03 07:43:31,10673312.893252771,Americas,Northern America,United States,Georgia,(not set),desktop,Google,...,(not set),GGOEYHPB121110,YouTube Leather Strap Hat Black,(not set),(not set),Home/Apparel/Hats/,,,,(not set)
2,20201103,2020-11-03 13:50:36,15626554.576808143,Americas,Northern America,United States,New York,New York,mobile,Apple,...,(not set),GGOEGXXX1381,Google Men's Softshell Moss,(not set),(not set),Home/Shop by Brand/Google/,,,,(not set)
3,20201103,2020-11-03 22:34:53,16462431.22835943,Americas,South America,Colombia,(not set),(not set),mobile,Apple,...,(not set),GGOEGXXX1038,Google F/C Long Sleeve Tee Charcoal,(not set),(not set),Home/Apparel/Men's / Unisex/,,,,(not set)
4,20201103,2020-11-03 21:31:24,16765047.752179116,Americas,South America,Ecuador,Pichincha,Quito,desktop,Google,...,(not set),GGOEGXXX1347,Google Sherpa Zip Hoodie Navy,(not set),(not set),Home/Apparel/Men's / Unisex/,,,,(not set)


In [3]:
### add column promotion_or_not to identify whether the customer responded to an internal promotion in this session
internal_promotions = ['Google Mural Collection', 'Act Responsible', 'Reach New Heights', 'Complete Your Collection']
temp = ecommerce_all.copy()
# 1 if the event row carries one of the internal promotions, else 0
temp['promotion_or_not'] = temp['promotion_name'].isin(internal_promotions).astype(int)
# count promotion events per session, then flag sessions with at least one
promotion_or_not = temp.groupby(['event_date', 'user_pseudo_id', 'ga_session_id'])['promotion_or_not'].agg('sum').reset_index()
promotion_or_not['promotion_or_not'] = promotion_or_not['promotion_or_not'].apply(lambda x: 'No' if x == 0 else 'Yes')

# dataframe showing whether the customer responded to internal promotions in each session
session_promotion_or_not = ecommerce_all[['event_date', 'user_pseudo_id', 'ga_session_id']].drop_duplicates().merge(promotion_or_not)


### classify sessions with promotion and sessions without promotion
ecommerce_all['unique_session_id'] = ecommerce_all['event_date'].astype(str) + ecommerce_all['ga_session_id'].astype(str)
session_promotion_or_not['unique_session_id'] = session_promotion_or_not['event_date'].astype(str) + session_promotion_or_not['ga_session_id'].astype(str)
sessions_with_promotion_list = session_promotion_or_not[session_promotion_or_not['promotion_or_not'] == 'Yes']['unique_session_id'].tolist()
sessions_without_promotion_list = session_promotion_or_not[session_promotion_or_not['promotion_or_not'] == 'No']['unique_session_id'].tolist()

# sessions with promotion
sessions_with_promotion = ecommerce_all[ecommerce_all.unique_session_id.isin(sessions_with_promotion_list)]
# sessions without promotion
sessions_without_promotion = ecommerce_all[ecommerce_all.unique_session_id.isin(sessions_without_promotion_list)]

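### keep only Northern America sessions from 2021-01-01 onward
### (the filter has no upper bound, so this is January only if the data ends on 2021-01-31)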
sessions_with_promotion_us_jan = sessions_with_promotion[(sessions_with_promotion.sub_continent == 'Northern America') \
                                                         & (sessions_with_promotion.event_time >= '2021-01-01')]
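
If the data ever extends beyond January, the window can be closed with an explicit upper bound. The following is a minimal sketch on a small made-up frame (`toy` and its rows are hypothetical and only mimic the `sub_continent` and `event_time` columns), parsing `event_time` as datetimes instead of comparing strings:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset
toy = pd.DataFrame({
    'sub_continent': ['Northern America', 'Northern America',
                      'Southern Europe', 'Northern America'],
    'event_time': ['2020-12-31 23:59:59', '2021-01-15 10:00:00',
                   '2021-01-20 08:30:00', '2021-02-01 00:00:00'],
})

# Parse the timestamps so the comparison is chronological, not lexicographic
event_time = pd.to_datetime(toy['event_time'])

# Half-open window [2021-01-01, 2021-02-01) covers all of January 2021
in_january = (event_time >= pd.Timestamp('2021-01-01')) & (event_time < pd.Timestamp('2021-02-01'))

toy_na_jan = toy[(toy['sub_continent'] == 'Northern America') & in_january]
print(toy_na_jan)
```

Only the Northern America row dated 2021-01-15 survives both conditions here.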

## Calculation for Mean & Standard Deviation

In [4]:
# number of sessions that engaged with promotions, per day
total_session_count_by_date = sessions_with_promotion_us_jan.groupby(['event_date'])['unique_session_id'].agg('nunique').\
                              reset_index().rename(columns={'unique_session_id':'total_session_count_by_date'})

# total purchase revenue, per day
total_purchase_revenue_by_date = sessions_with_promotion_us_jan.groupby(['event_date'])['purchase_revenue_in_usd'].agg('sum').\
                                 reset_index().rename(columns={'purchase_revenue_in_usd':'total_purchase_revenue_by_date'})

# revenue per session for each day; the mean and standard deviation below are taken across days
temp = total_session_count_by_date.merge(total_purchase_revenue_by_date)
temp['revenue_per_session'] = temp['total_purchase_revenue_by_date'] / temp['total_session_count_by_date']

print('Mean of revenue per session: ', temp['revenue_per_session'].mean())
print('Standard deviation of revenue per session: ', temp['revenue_per_session'].std())

Mean of revenue per session:  5.923313758085377
Standard deviation of revenue per session:  5.312589959473213
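
To show how these two numbers feed into a sample-size calculation, here is a minimal sketch using the standard two-sample formula for a difference in means, $n = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2 / \delta^2$ per group. Several things in it are assumptions rather than results from this notebook: the function name `sample_size_per_group`, the 5% relative minimum detectable effect, and the usual defaults of 5% significance and 80% power. The formula also assumes that $\sigma$ is measured at the unit of randomization, whereas the standard deviation above is taken across daily values, so the result should be read as a rough guide only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(std, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * std ** 2 / mde_abs ** 2)


baseline = 5.92   # mean of revenue per session (rounded from the output above)
std = 5.31        # standard deviation of revenue per session (rounded from the output above)

mde_relative = 0.05                  # assumption: we want to detect a 5% relative lift
mde_abs = mde_relative * baseline    # the same lift expressed in USD per session

n_per_group = sample_size_per_group(std, mde_abs)
print('Sample size per group: ', n_per_group)
```

Because the minimum detectable effect enters the formula squared, halving it roughly quadruples the required sample size.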
