In [1]:
import pandas as pd

The data were prepared in the "data prep" notebook in this repository. That notebook adds a column `is_adopted`, which is `True` when a user logged into the product on at least three separate days within some rolling 7-day window over the span of recorded logins.
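
As a rough illustration of that definition, the rule can be sketched for a single user in plain Python. This is not the code from the "data prep" notebook: the function name `is_adopted_user` is hypothetical, and reading a "7-day window" as seven consecutive calendar days is an assumption.

```python
from datetime import date, timedelta


def is_adopted_user(login_days, window_days=7, min_days=3):
    """Return True if some `window_days`-day window holds >= `min_days` distinct login days.

    Hypothetical helper: `login_days` is an iterable of datetime.date values for one user.
    """
    days = sorted(set(login_days))  # distinct login days, oldest first
    # Slide over every run of `min_days` consecutive distinct days and check
    # whether the first and last day of the run fit inside one window.
    for i in range(len(days) - min_days + 1):
        if days[i + min_days - 1] - days[i] < timedelta(days=window_days):
            return True
    return False


# Jan 1, 3 and 7 all fall within the seven days Jan 1 to Jan 7.
print(is_adopted_user([date(2014, 1, 1), date(2014, 1, 3), date(2014, 1, 7)]))
# Jan 1, 3 and 8 span eight calendar days, so no single window covers them.
print(is_adopted_user([date(2014, 1, 1), date(2014, 1, 3), date(2014, 1, 8)]))
```

Only runs of `min_days` neighbouring days need checking, because any window that contains three or more distinct login days also contains three that are adjacent in sorted order.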

In [2]:
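df = pd.read_csv('takehome_users_prepped.csv')
# Keep a positional 'index' column; analyze() below counts it as a row id.
df = df.reset_index(drop=True).reset_index(drop=False)
# Move the key columns to the front, keeping the others in their original order.
cols = ['user_id', 'is_adopted']
cols = cols + [x for x in df.columns if x not in cols]
df = df[cols].copy()
df.head(5)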

Unnamed: 0,user_id,is_adopted,index,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,was_invited
0,1,False,0,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,True
1,2,False,1,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,True
2,3,False,2,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,True
3,4,False,3,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,True
4,5,False,4,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,True


Only some of these columns are potentially useful for understanding adoption; identifiers such as `name` or `email`, for example, are unlikely to be related to engagement.

Below we select the columns that are potentially useful for analysis. All of them are categorical:

In [3]:
feats = ['creation_source', 'opted_in_to_mailing_list', 'enabled_for_marketing_drip', 'was_invited']

for col in df[feats]:
    x = pd.DataFrame(df[col].value_counts())
    print(f"for column {col}:")
    display(x)
    print('#'*80)
    print()

for column creation_source:


creation_source,count
ORG_INVITE,4254
GUEST_INVITE,2163
PERSONAL_PROJECTS,2111
SIGNUP,2087
SIGNUP_GOOGLE_AUTH,1385


################################################################################

for column opted_in_to_mailing_list:


opted_in_to_mailing_list,count
0,9006
1,2994


################################################################################

for column enabled_for_marketing_drip:


enabled_for_marketing_drip,count
0,10208
1,1792


################################################################################

for column was_invited:


was_invited,count
True,6417
False,5583


################################################################################



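For each of these columns, we want to see whether adoption differs across the category values. The throwaway function `analyze` below tabulates users per category and computes the adoption rate. In its output, `treatment` is the number of adopted users, `no_treatment` is the number of non-adopted users, and `prob` is the adoption rate: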

In [4]:
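def analyze(col):
    """Count adopted and non-adopted users per value of `col` and compute the adoption rate."""
    # Rows: category values; columns: is_adopted (False/True); cells: user counts.
    pv = pd.pivot_table(
        df,
        index=col,
        columns='is_adopted',
        values='index',
        aggfunc='count',
    )
    # Adoption rate = adopted users / all users in the category.
    pv['prob'] = pv[True] / pv.sum(axis=1)
    # Note: 'treatment' here means adopted and 'no_treatment' means not adopted.
    pv = pv.rename(columns={True:'treatment', False:'no_treatment'})
    return pv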

In [5]:
import statsmodels.api as sm


def test_binary(d, verbose=True):
    """Two-sided two-proportion z-test on the adoption rates of a two-row `analyze` table."""
    count = d['treatment']  # adopted users per group
    nobs = d[['treatment', 'no_treatment']].sum(axis=1)  # all users per group
    z, p = sm.stats.proportions_ztest(count, nobs, alternative='two-sided')
    px = f"p={p:.3f}"
    if verbose:
        if p < 0.05:
            print(f"p < 0.05, reject null ({px})")
        else:
            print(f"p > 0.05, cannot reject null ({px})")
    return p
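
For reference, the test that `test_binary` runs can be reproduced by hand with the standard library. The sketch below assumes the pooled two-proportion z-test, which to our understanding is what `proportions_ztest` computes with its default arguments. The function name `two_proportion_ztest` is made up for illustration, and the counts are the `was_invited` counts from this dataset.

```python
import math


def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a, p_b = count_a / n_a, count_b / n_b
    # Under the null hypothesis both groups share one rate, so pool the counts.
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# was_invited = True: 741 adopted of 6417; was_invited = False: 556 adopted of 5583.
z, p = two_proportion_ztest(741, 6417, 556, 5583)
print(f"z={z:.2f}, p={p:.3f}")
```

With these counts the p-value comes out at about 0.005, in line with what the notebook reports for `was_invited`.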

In [7]:
for col in df[feats]:
    print(f"for column {col}:")
    x = analyze(col)
    display(x)
    y = x['prob'].max() - x['prob'].min()
    print(f"largest probability minus smallest probability: {100*y:.1f}%")
    if len(x) == 2:
        test_binary(x)
    print('#'*80)
    print()

for column creation_source:


creation_source,no_treatment,treatment,prob
GUEST_INVITE,1867,296,0.136847
ORG_INVITE,3809,445,0.104607
PERSONAL_PROJECTS,1984,127,0.060161
SIGNUP,1841,246,0.117873
SIGNUP_GOOGLE_AUTH,1202,183,0.13213


largest probability minus smallest probability: 7.7%
################################################################################

for column opted_in_to_mailing_list:


opted_in_to_mailing_list,no_treatment,treatment,prob
0,8048,958,0.106374
1,2655,339,0.113226


largest probability minus smallest probability: 0.7%
p > 0.05, cannot reject null (p=0.295)
################################################################################

for column enabled_for_marketing_drip:


enabled_for_marketing_drip,no_treatment,treatment,prob
0,9108,1100,0.107759
1,1595,197,0.109933


largest probability minus smallest probability: 0.2%
p > 0.05, cannot reject null (p=0.785)
################################################################################

for column was_invited:


was_invited,no_treatment,treatment,prob
False,5027,556,0.099588
True,5676,741,0.115475


largest probability minus smallest probability: 1.6%
p < 0.05, reject null (p=0.005)
################################################################################



...we can see that the columns `opted_in_to_mailing_list`, `enabled_for_marketing_drip`, and `was_invited` all show small differences in adoption rate across categories. These are weak effect sizes:

* `opted_in_to_mailing_list` - in this sample, associated with an adoption rate about 0.7 percentage points higher (p>0.05)
* `enabled_for_marketing_drip` - associated with an adoption rate about 0.2 percentage points higher (p>0.05)
* `was_invited` - associated with an adoption rate about 1.6 percentage points higher (p<0.05)

Users who opted in to the mailing list or were enabled for the marketing drip adopt at slightly higher rates, but neither difference is statistically significant at the 0.05 level, so this is at best weak evidence that either one lifts adoption.

Column `was_invited` provides evidence that users who were invited by another user are more likely to adopt. However, the difference is small, at only 1.6 percentage points, and it isn't immediately clear which intervention could leverage this observation.
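
To put the size of the `was_invited` difference in context, a confidence interval can be more informative than the p-value alone. The sketch below computes a 95% Wald interval for the difference in adoption rates from the counts in the table above. The function name `diff_ci` is hypothetical, and the Wald (normal-approximation) interval is our choice for illustration, not something the analysis above relies on.

```python
import math


def diff_ci(count_a, n_a, count_b, n_b, z_crit=1.96):
    """Wald confidence interval for p_a - p_b (z_crit=1.96 gives roughly 95%)."""
    p_a, p_b = count_a / n_a, count_b / n_b
    # Unpooled standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z_crit * se, diff + z_crit * se


# was_invited = True: 741 adopted of 6417; was_invited = False: 556 adopted of 5583.
low, high = diff_ci(741, 6417, 556, 5583)
print(f"difference: {100 * low:.1f} to {100 * high:.1f} percentage points")
```

The interval runs from roughly 0.5 to 2.7 percentage points: it excludes zero, but it also confirms that any lift from invitations is at most a few percentage points.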

Adoption rates differ noticeably across `creation_source` values:

In [10]:
d = analyze('creation_source').sort_values(by=['prob'])
d

creation_source,no_treatment,treatment,prob
PERSONAL_PROJECTS,1984,127,0.060161
ORG_INVITE,3809,445,0.104607
SIGNUP,1841,246,0.117873
SIGNUP_GOOGLE_AUTH,1202,183,0.13213
GUEST_INVITE,1867,296,0.136847


In [19]:
import numpy as np
from scipy.stats import chi2_contingency

# 2 x 5 contingency table: rows are not adopted / adopted, columns are the creation sources.
t = np.array([d['no_treatment'].values, d['treatment'].values])
chi2_stat, p, dof, expected = chi2_contingency(t)
print(f"p={p:.3f}, p<0.05: {p<0.05}")

p=0.000, p<0.05: True


...and the difference between the categories is statistically significant. However, part of this variable apparently duplicates `was_invited` (the `ORG_INVITE` and `GUEST_INVITE` counts, 4254 + 2163, sum to exactly the 6417 invited users), and it is unclear what the remaining categories refer to.
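
To make the chi-square test above less of a black box, the statistic can be recomputed from the counts in the `creation_source` table using only the standard library. This is a sketch, not the SciPy implementation: the function name `chi2_independence` is hypothetical, and the closed-form p-value used here is only valid when the degrees of freedom are even (4 in this case).

```python
import math


def chi2_independence(table):
    """Pearson chi-square test of independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if adoption were independent of creation source.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    if dof % 2:
        raise ValueError("the closed-form p-value below needs an even dof")
    # For even dof = 2m: P(X > x) = exp(-x/2) * sum_{k<m} (x/2)^k / k!
    half = stat / 2
    p = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return stat, dof, p


# Rows: not adopted / adopted. Columns: PERSONAL_PROJECTS, ORG_INVITE, SIGNUP,
# SIGNUP_GOOGLE_AUTH, GUEST_INVITE (counts copied from the table above).
table = [
    [1984, 3809, 1841, 1202, 1867],
    [127, 445, 246, 183, 296],
]
stat, dof, p = chi2_independence(table)
print(f"chi2={stat:.1f}, dof={dof}, p={p:.2e}")
```

The resulting p-value is many orders of magnitude below 0.05, which is why the notebook prints `p=0.000`.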

# Conclusion and Next Steps

There is weak evidence that email marketing efforts have impacted adoption. Users who were invited by other users were slightly more likely to adopt, by about 1-2 percentage points.

Based on this analysis, the recommendation is to consider ways to leverage user invitations (e.g. some form of referral bonus) and to evaluate ways to enhance the effectiveness of marketing efforts.