# Relax Challenge

**Prompt**

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

In [1]:
# imports
import numpy as np
import pandas as pd
import os

In [2]:
DIR = '/Users/allankapoor/Documents/Springboard/springboard/relax_challenge'

engagement = pd.read_csv(os.path.join(DIR, 'takehome_user_engagement.csv'))

users = pd.read_csv(os.path.join(DIR, 'takehome_users.csv'), encoding='latin-1')

In [5]:
engagement.head(5)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [6]:
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [9]:
engagement.user_id.max()

12000

In [24]:
#convert time_stamp to datetime
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'])

In [25]:
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   time_stamp  207917 non-null  datetime64[ns]
 1   user_id     207917 non-null  int64         
 2   visited     207917 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.8 MB


In [10]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [11]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [15]:
#confirming that null values in invited_by_user_id are because they weren't invited
null_invite_signups = users.loc[users.invited_by_user_id.isnull(),:].creation_source.unique()
non_null_invite_signups = users.loc[~users.invited_by_user_id.isnull(),:].creation_source.unique()

print(f'Signup categories when invited_by_user_id is null: {null_invite_signups}')
print(f'Signup categories when invited_by_user_id is not null: {non_null_invite_signups}')

Signup categories when invited_by_user_id is null: ['SIGNUP' 'PERSONAL_PROJECTS' 'SIGNUP_GOOGLE_AUTH']
Signup categories when invited_by_user_id is not null: ['GUEST_INVITE' 'ORG_INVITE']


In [16]:
#convert creation_time from string to datetime
users['creation_time'] = pd.to_datetime(users['creation_time'])

#convert last_session_creation_time from unix to datetime
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit='s', origin='unix')

In [22]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0


## Create target variable

Per the prompt, an "adopted user" is a user who has logged into the product on three separate days in at least one seven-day period. The goal is to identify which factors predict adoption.

*In other words:* for a given user, does the number of distinct login days within any 7-day window ever reach 3 or more?

In [100]:
def is_adopted(row):
    '''For a given row in the users df, return 1 if the user logged in on 3+ separate days
    within any 7-day period (based on the engagement df), else 0.'''

    # get user id for that row
    user = row['object_id']

    # filter engagement df to that user
    user_df = engagement.loc[engagement.user_id == user, :].drop(columns=['user_id'])

    # if table is empty, user not adopted
    if len(user_df) == 0:
        return 0

    # resample filtered engagement df to 1 day increments (logins per calendar day)
    logins_interval = user_df.resample("1D", on='time_stamp').count()

    # flag each day with at least one login, so repeat logins on one day count once
    active_days = (logins_interval.visited > 0).astype(int)

    # get rolling sum of active days (7 day period)
    active_days_rolling = active_days.rolling(window=7, min_periods=1).sum()

    # adopted if any 7-day window contains 3+ active days
    return int(active_days_rolling.max() >= 3)

In [101]:
# apply function above to generate target variable for each row
users['is_adopted'] = users.apply(is_adopted, axis=1)
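
The row-wise `apply` above filters `engagement` once per user. As a possible alternative, the sketch below applies the same definition to all users at once: it reduces logins to distinct calendar days per user, then checks whether any login day and the second distinct login day after it fall inside one 7-day window. The helper name `adopted_user_ids`, its parameters, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def adopted_user_ids(engagement, window_days=7, min_days=3):
    """Return the set of user_ids with `min_days` distinct login days
    inside some `window_days`-day window (hypothetical helper)."""
    # one row per (user, calendar day), sorted so shift() walks forward in time
    days = (
        engagement.assign(day=engagement["time_stamp"].dt.normalize())
        [["user_id", "day"]]
        .drop_duplicates()
        .sort_values(["user_id", "day"])
    )
    # for each login day, the (min_days - 1)-th later distinct login day of the same user
    later = days.groupby("user_id")["day"].shift(-(min_days - 1))
    # both days share a window when they are fewer than window_days apart
    hit = (later - days["day"]) < pd.Timedelta(days=window_days)
    return set(days.loc[hit, "user_id"])

# Toy data (made up): user 1 logs in on days 1, 3, 7 (one 7-day window);
# user 2 on days 1, 5, 8 (never 3 days in one window);
# user 3 logs in several times but only on 2 distinct days.
toy_engagement = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-03 09:00", "2014-01-07 09:00",
        "2014-01-01 09:00", "2014-01-05 09:00", "2014-01-08 09:00",
        "2014-01-03 08:00", "2014-01-03 12:00", "2014-01-03 20:00", "2014-01-04 08:00",
    ]),
    "visited": 1,
})
toy_adopted = adopted_user_ids(toy_engagement)
print(toy_adopted)
```

On the real data, something like `users['object_id'].isin(adopted_user_ids(engagement)).astype(int)` should produce the same flag, assuming `time_stamp` has already been converted to datetime.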

In [127]:
# confirming users with null last_session_creation_time never have is_adopted == 1
users[users.last_session_creation_time.isnull()].is_adopted.sum()

0

In [128]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,is_adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0,0


### Feature Generation

Since there aren't a lot of features to predict on, let's create a few.

In [None]:
# explanatory features

# obvious:
#   - creation_source
#   - opted_in_to_mailing_list
#   - enabled_for_marketing_drip

# derived:
#   - month of creation time?
#   - what org? (binary flag for part of org with > X active users)
#   - invited by? (binary flag for invite by user who invites a lot)
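
The "invited by" idea in the list above is not implemented later in the notebook. A minimal sketch of how it might look is below: count how many users each inviter brought in, then flag users whose inviter meets a threshold. The function name `add_inviter_flag`, the new column names, and the `min_invites` threshold are hypothetical choices, and the toy frame is made up.

```python
import pandas as pd

def add_inviter_flag(users, min_invites=5):
    """Add invite-count features (hypothetical helper; threshold is an assumption)."""
    # number of users each inviter brought in (NaN inviters are ignored by value_counts)
    invite_counts = users["invited_by_user_id"].value_counts()
    out = users.copy()
    # users who were not invited get a count of 0
    out["inviter_num_invites"] = (
        out["invited_by_user_id"].map(invite_counts).fillna(0).astype(int)
    )
    out["invited_by_frequent_inviter"] = (
        out["inviter_num_invites"] >= min_invites
    ).astype(int)
    return out

# Toy data (made up): inviter 10 invited three users, inviter 20 invited one
toy_users = pd.DataFrame({
    "object_id": [1, 2, 3, 4, 5, 6],
    "invited_by_user_id": [10.0, 10.0, 10.0, 20.0, None, None],
})
toy_flagged = add_inviter_flag(toy_users, min_invites=3)
print(toy_flagged)
```

The same pattern (count per group, then threshold) would also cover the "what org?" flag, with `org_id` in place of `invited_by_user_id`.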

In [152]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   object_id                   12000 non-null  int64         
 1   creation_time               12000 non-null  datetime64[ns]
 2   name                        12000 non-null  object        
 3   email                       12000 non-null  object        
 4   creation_source             12000 non-null  object        
 5   last_session_creation_time  8823 non-null   datetime64[ns]
 6   opted_in_to_mailing_list    12000 non-null  int64         
 7   enabled_for_marketing_drip  12000 non-null  int64         
 8   org_id                      12000 non-null  int64         
 9   invited_by_user_id          6417 non-null   float64       
 10  is_adopted                  12000 non-null  int64         
dtypes: datetime64[ns](2), float64(1), int64(5), object(3)


In [154]:
#Creation Month
users['creation_month'] = users.creation_time.dt.to_period('M')

In [172]:
#number of users by org
users_by_org = pd.DataFrame(users.groupby('org_id').count().email).sort_values('email', ascending=False)
users_by_org = users_by_org.rename(columns={'email':'org_num_users'})
users = users.merge(users_by_org, on='org_id', how='left')

In [165]:
users

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,is_adopted,creation_month
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0,0,2014-04
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0,1,2013-11
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0,0,2013-03
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0,0,2013-05
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0,0,2013-01
...,...,...,...,...,...,...,...,...,...,...,...,...
11995,11996,2013-09-06 06:14:15,Meier Sophia,SophiaMeier@gustr.com,ORG_INVITE,2013-09-06 06:14:15,0,0,89,8263.0,0,2013-09
11996,11997,2013-01-10 18:28:37,Fisher Amelie,AmelieFisher@gmail.com,SIGNUP_GOOGLE_AUTH,2013-01-15 18:28:37,0,0,200,,0,2013-01
11997,11998,2014-04-27 12:45:16,Haynes Jake,JakeHaynes@cuvox.de,GUEST_INVITE,2014-04-27 12:45:16,1,1,83,8074.0,0,2014-04
11998,11999,2012-05-31 11:55:59,Faber Annett,mhaerzxp@iuxiw.com,PERSONAL_PROJECTS,2012-06-02 11:55:59,0,0,6,,0,2012-05


In [158]:
#number of adopted users by org
#(select is_adopted before summing so non-numeric columns are not summed)
users.groupby('org_id')[['is_adopted']].sum().sort_values('is_adopted', ascending=False).head(60)

org_id,is_adopted
4,16
7,16
2,15
9,14
3,14
13,14
1,14
62,12
5,12
0,11


In [None]:
# statistical tests
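
This cell was left empty. One possible test, sketched here under assumptions, is a chi-square test of independence between a categorical feature such as `creation_source` and `is_adopted`, alongside the adoption rate per category. The helper name `adoption_by_category` and the toy data are made up for illustration; `scipy` is an extra dependency the notebook does not import.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def adoption_by_category(users, column):
    """Adoption rate per category plus a chi-square test (hypothetical helper)."""
    # contingency table: categories as rows, is_adopted (0/1) as columns
    table = pd.crosstab(users[column], users["is_adopted"])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    # share of adopted users within each category, highest first
    rates = users.groupby(column)["is_adopted"].mean().sort_values(ascending=False)
    return rates, chi2, p_value, dof

# Toy data (made up), far too small for a meaningful test
toy_users = pd.DataFrame({
    "creation_source": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "is_adopted": [1, 1, 1, 0, 0, 0, 0, 1],
})
rates, chi2, p_value, dof = adoption_by_category(toy_users, "creation_source")
print(rates)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would suggest that adoption rates differ across categories, but not which category drives the difference.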

In [None]:
# dummy encode
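
This cell was also left empty. A minimal sketch of the encoding step, assuming the three "obvious" features are used and `creation_source` is the only categorical one, could rely on `pd.get_dummies`. The helper name `encode_features` and the toy rows are hypothetical.

```python
import pandas as pd

def encode_features(users):
    """One-hot encode creation_source and keep the binary flags (hypothetical helper)."""
    features = users[
        ["creation_source", "opted_in_to_mailing_list", "enabled_for_marketing_drip"]
    ]
    # drop_first removes one category so the dummies are not perfectly collinear
    return pd.get_dummies(features, columns=["creation_source"], drop_first=True, dtype=int)

# Toy rows (made up) using category names that appear in the data
toy_users = pd.DataFrame({
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "SIGNUP"],
    "opted_in_to_mailing_list": [1, 0, 0],
    "enabled_for_marketing_drip": [0, 0, 1],
})
toy_X = encode_features(toy_users)
print(toy_X)
```

With `drop_first=True`, the dropped category (here `GUEST_INVITE`) becomes the baseline that the other coefficients are read against.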

In [None]:
# transform

In [None]:
# train test split

In [None]:
# logistic regression model

In [None]:
# look at coefficients
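
The last three cells (train/test split, logistic regression, coefficients) were left as placeholders. A rough sketch of how they might be filled in with scikit-learn is below. It is not the author's model: the helper name `fit_adoption_model`, the split size, and `class_weight="balanced"` (chosen on the assumption that adopted users are a minority) are illustrative, and the data is random noise generated only so the block runs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_adoption_model(X, y, seed=0):
    """Split, fit a logistic regression, and return sorted coefficients (hypothetical helper)."""
    # stratify keeps the adopted / not-adopted ratio the same in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_train, y_train)
    # one coefficient per feature; the sign gives the direction of association
    coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
    return model, coefs, model.score(X_test, y_test)

# Synthetic stand-in data (random, carries no real signal)
rng = np.random.default_rng(0)
n = 400
toy_X = pd.DataFrame({
    "opted_in_to_mailing_list": rng.integers(0, 2, n),
    "enabled_for_marketing_drip": rng.integers(0, 2, n),
    "creation_source_ORG_INVITE": rng.integers(0, 2, n),
})
toy_y = pd.Series(rng.integers(0, 2, n), name="is_adopted")

model, coefs, test_accuracy = fit_adoption_model(toy_X, toy_y)
print(coefs)
print(f"test accuracy: {test_accuracy:.2f}")
```

On the real data, `X` would be the encoded feature table and `y` would be `users['is_adopted']`. Accuracy alone can mislead when classes are imbalanced, so a metric such as recall or ROC AUC may be worth checking as well.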