The data is available as two attached CSV files:
takehome_user_engagement.csv
takehome_users.csv

The data has the following two tables:

1] A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:

● name: the user's name

● object_id: the user's id

● email: email address

● creation_source: how their account was created. This takes on one of 5 values:

○ PERSONAL_PROJECTS: invited to join another user's personal workspace

○ GUEST_INVITE: invited to an organization as a guest (limited permissions)

○ ORG_INVITE: invited to an organization (as a full member)

○ SIGNUP: signed up via the website

○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

● creation_time: when they created their account

● last_session_creation_time: unix timestamp of last login

● opted_in_to_mailing_list: whether they have opted into receiving marketing emails

● enabled_for_marketing_drip: whether they are on the regular marketing email drip

● org_id: the organization (group of users) they belong to

● invited_by_user_id: which user invited them to join (if applicable).

2] A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better - no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime,timedelta

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, average_precision_score
import itertools

In [2]:
ls -ltrh

total 13992
-rw-r--r--@ 1 bogdan  staff   1.1M Dec  7  2016 takehome_users.csv
-rw-r--r--@ 1 bogdan  staff   5.6M Dec  7  2016 takehome_user_engagement.csv
-rw-------@ 1 bogdan  staff    98K Dec  7  2016 relax_data_science_challenge.pdf
-rw-r--r--  1 bogdan  staff    59K May  2 18:00 example_take_home_challenge.ipynb


In [3]:
users = pd.read_csv('takehome_users.csv', encoding='latin1')
usage_summary = pd.read_csv('takehome_user_engagement.csv')

In [4]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [5]:
users.set_index('object_id', inplace=True)

In [6]:
users.head()

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   creation_time               12000 non-null  object 
 1   name                        12000 non-null  object 
 2   email                       12000 non-null  object 
 3   creation_source             12000 non-null  object 
 4   last_session_creation_time  8823 non-null   float64
 5   opted_in_to_mailing_list    12000 non-null  int64  
 6   enabled_for_marketing_drip  12000 non-null  int64  
 7   org_id                      12000 non-null  int64  
 8   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(3), object(4)
memory usage: 937.5+ KB


In [8]:
users.describe()

Unnamed: 0,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,8823.0,12000.0,12000.0,12000.0,6417.0
mean,1379279000.0,0.2495,0.149333,141.884583,5962.957145
std,19531160.0,0.432742,0.356432,124.056723,3383.761968
min,1338452000.0,0.0,0.0,0.0,3.0
25%,1363195000.0,0.0,0.0,29.0,3058.0
50%,1382888000.0,0.0,0.0,108.0,5954.0
75%,1398443000.0,0.0,0.0,238.25,8817.0
max,1402067000.0,1.0,1.0,416.0,11999.0


In [9]:
usage_summary.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [10]:
usage_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [11]:
usage_summary.describe()

Unnamed: 0,user_id,visited
count,207917.0,207917.0
mean,5913.314197,1.0
std,3394.941674,0.0
min,1.0,1.0
25%,3087.0,1.0
50%,5682.0,1.0
75%,8944.0,1.0
max,12000.0,1.0


In [12]:
usage_summary.visited.unique()

array([1])

In [13]:
# the 'visited' column from usage_summary is useless, all values are equal to 1

In [14]:
usage_summary.time_stamp.nunique()

207220

In [15]:
usage_summary.time_stamp.count()

207917

In [16]:
# There are rows with duplicate timestamps but this is not an issue
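
One way to back up the comment above is to check whether any single user has two rows on the same calendar day, since a timestamp shared by two different users would not affect the adoption count. The sketch below uses a small made-up frame (`toy`) with the same columns as `usage_summary`; the `day` column is a helper introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for usage_summary; the real frame has the same two columns.
toy = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-04-22 03:53:30",  # user 1
        "2014-04-22 03:53:30",  # user 2, same instant as user 1
        "2014-04-23 03:53:30",  # user 2, next day
    ]),
    "user_id": [1, 2, 2],
})

# The same timestamp may appear for different users ...
shared_timestamps = toy["time_stamp"].duplicated().sum()

# ... what matters is whether one user has two rows on the same calendar day.
toy["day"] = toy["time_stamp"].dt.normalize()
same_user_same_day = toy.duplicated(subset=["user_id", "day"]).sum()

print(shared_timestamps, same_user_same_day)
```

If the second count is zero on the real data, every row is a distinct login day for its user, and repeated timestamps can be ignored.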

In [17]:
usage_summary.head(3)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1


In [18]:
usage_summary.time_stamp[0]

'2014-04-22 03:53:30'

In [19]:
# Convert timestamp column to datetime (vectorized)
usage_summary['time_stamp'] = pd.to_datetime(usage_summary['time_stamp'], format='%Y-%m-%d %H:%M:%S')

In [20]:
usage_summary.head(3)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1


In [21]:
usage_summary.time_stamp[0]

Timestamp('2014-04-22 03:53:30')

In [22]:
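# Find all users with at least 3 recorded logins (fewer cannot qualify as adopted).
# Sum only the 'visited' column; summing the whole frame would also hit the datetime column.
active_users = usage_summary.groupby('user_id')['visited'].sum()
active_users = active_users[active_users >= 3]
active_users_list = active_users.index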

In [23]:
users.head()

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [24]:
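# Flag users with at least 3 logins inside a 7-day window.
# The window is anchored at its first login and restarts on the login after the
# one that falls outside it, so this greedy scan can miss some qualifying users.
# It also assumes each user's rows are in chronological order.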
retained = []

for user in active_users_list:

    user_login_series = usage_summary['time_stamp'][usage_summary['user_id'] == user]
    
    start = 0
    ct = 0
    delta = 0

    for dt in user_login_series:

        if start == 0:
            start = dt
            delta = start + timedelta(days=7)
            ct = 1
        else:
            if dt <= delta:
                ct += 1
            else:
                start = 0

        if ct == 3:
            retained.append(user)
            break

In [25]:
len(retained)

1545

In [26]:
# Only 1545 of the 12,000 users are flagged as retained (at least 3 logins within a 7-day window)
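
The loop in cell 24 anchors each window at a fixed start login. A sliding check is an alternative worth comparing against: sort a user's distinct login days and test whether any day and the day two positions later are less than seven days apart. The sketch below is a minimal version of that idea, not the notebook's method; `is_adopted`, `min_days`, and `window_days` are names made up for this illustration, and it assumes "seven-day period" means seven consecutive calendar days.

```python
from datetime import date

def is_adopted(login_days, min_days=3, window_days=7):
    """Return True if min_days distinct days fall inside some window_days-day span."""
    days = sorted(set(login_days))  # distinct calendar days, ascending
    # Compare each day with the day (min_days - 1) positions later.
    for first, last in zip(days, days[min_days - 1:]):
        if (last - first).days < window_days:
            return True
    return False

# Logins on Jan 1, 7, 9 and 10: the last three fall inside one week,
# although no window that starts on Jan 1 contains three logins.
example = [date(2014, 1, 1), date(2014, 1, 7), date(2014, 1, 9), date(2014, 1, 10)]
print(is_adopted(example))
```

Applied per user (for instance through `groupby('user_id')` on the login dates), this may flag more users than the 1545 found above, so the two counts are worth comparing before modelling.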

In [27]:
# Flag adopted users: 1 if the user id is in the retained list, else 0
users['retained'] = users.index.isin(retained).astype(int)
        
users.head(3)

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,retained
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0


In [28]:
df = users.copy()
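# Fill NaN values
# NOTE: fillna(0) runs first, so notnull() below is True for every row and the
# column ends up as a constant 1 rather than a was-invited flag
df['invited_by_user_id'].fillna(0, inplace=True)
df.loc[df['invited_by_user_id'].notnull(), 'invited_by_user_id'] = 1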

last_created_mode = users.last_session_creation_time.mode()[0]
df['last_session_creation_time'].fillna(last_created_mode, inplace=True)

# Scale number down
df['last_session_creation_time'] = df['last_session_creation_time'].div(10**9)

# Convert to datetime
df['creation_time'] = df['creation_time'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

# Convert creation time to days int
max_time = df['creation_time'].max()
df['creation_time'] = df['creation_time'].apply(lambda x: (max_time - x).days)

# Keep only the email domain (the part after '@')
df['email'] = df['email'].str.split('@').str[1]

# Drop name column
df.drop('name', axis=1, inplace=True)

# Change org column to categorical
df['org_id'] = df['org_id'].astype('category')
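
In the cell above, `fillna(0)` runs before the `notnull()` test, so every row passes the test and `invited_by_user_id` becomes a constant 1. A possible alternative is to derive the flag from the raw column before any filling. The sketch below shows this on a made-up three-row frame; `was_invited` and `has_session` are hypothetical column names, and the second flag is an extra suggestion rather than something the notebook computes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns of the users table that contain NaN.
toy = pd.DataFrame({
    "invited_by_user_id": [10803.0, np.nan, 316.0],
    "last_session_creation_time": [1398139000.0, np.nan, 1396238000.0],
})

# Take the flag from the raw column first, so missing inviters become 0.
toy["was_invited"] = toy["invited_by_user_id"].notna().astype(int)

# Hypothetical extra flag: 1 if the user has any recorded session.
toy["has_session"] = toy["last_session_creation_time"].notna().astype(int)

print(toy[["was_invited", "has_session"]].to_dict("list"))
```

Keeping an explicit missing-value flag preserves the information that a value was absent, which a mode fill on its own would hide.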

In [29]:
df_dummies = pd.get_dummies(df)

In [30]:
X = df_dummies.drop(['retained'], axis=1)
y = df_dummies['retained']

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=77, stratify=y)
    
clf = RandomForestClassifier(random_state=77)
    
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
    
class_names = ['Not_Adopt', 'Adopt']
    
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Average Precision: {:.2f}'.format(average_precision_score(y_test, y_pred)))
print(classification_report(y_test, y_pred, target_names=class_names))

Accuracy: 0.95
Average Precision: 0.64
              precision    recall  f1-score   support

   Not_Adopt       0.94      1.00      0.97      3136
       Adopt       0.98      0.59      0.74       464

    accuracy                           0.95      3600
   macro avg       0.96      0.80      0.86      3600
weighted avg       0.95      0.95      0.94      3600



In [32]:
# Last thing to do: calculate feature importance
feature_imp = pd.Series(clf.feature_importances_,index=list(X.columns)).sort_values(ascending=False)
print("First 20 features ordered by importance:")
feature_imp[:20]

First 20 features ordered by importance:


last_session_creation_time            0.372512
creation_time                         0.119232
opted_in_to_mailing_list              0.014543
creation_source_ORG_INVITE            0.013102
creation_source_SIGNUP_GOOGLE_AUTH    0.013028
creation_source_SIGNUP                0.010774
enabled_for_marketing_drip            0.010454
email_gmail.com                       0.009944
creation_source_PERSONAL_PROJECTS     0.009031
creation_source_GUEST_INVITE          0.008558
email_yahoo.com                       0.008081
email_hotmail.com                     0.006699
email_gustr.com                       0.006538
email_jourrapide.com                  0.006525
email_cuvox.de                        0.006442
org_id_270                            0.002746
org_id_58                             0.002330
org_id_61                             0.002293
org_id_306                            0.002232
org_id_0                              0.002181
dtype: float64
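
Because `pd.get_dummies` splits `creation_source`, `email`, and `org_id` into many indicator columns, the importance of each original column is spread over its dummies. Summing the dummy importances per source column gives a view that is easier to compare. The sketch below does this on a made-up importance series shaped like `feature_imp`; the values are illustrative only, and `source_column` is a helper introduced here.

```python
import pandas as pd

# Toy importances in the same shape as feature_imp (values are made up).
toy_imp = pd.Series({
    "last_session_creation_time": 0.40,
    "creation_source_ORG_INVITE": 0.10,
    "creation_source_SIGNUP": 0.05,
    "org_id_0": 0.20,
    "org_id_58": 0.25,
})

# Original columns that pd.get_dummies expanded into indicator columns.
categorical = ["creation_source", "email", "org_id"]

def source_column(feature):
    """Map a dummy column name back to the column it came from."""
    for col in categorical:
        if feature.startswith(col + "_"):
            return col
    return feature

# Series.groupby accepts a function, which it applies to each index label.
grouped = toy_imp.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

On the real `feature_imp`, the same grouping would show how much the several hundred `org_id_*` columns contribute in total, which the top-20 listing cannot.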