# Relax Take Home Challenge

"takehome_users.csv" : data on 12,000 users who signed up for the
product in the last two years. This table includes:
* name: the user's name
* object_id: the user's id
* email: email address
* creation_source: how their account was created. This takes on one
of 5 values:
     * PERSONAL_PROJECTS: invited to join another user's
personal workspace
     * GUEST_INVITE: invited to an organization as a guest
(limited permissions)
     * ORG_INVITE: invited to an organization (as a full member)
     * SIGNUP: signed up via the website
     * SIGNUP_GOOGLE_AUTH: signed up using Google
Authentication (using a Google email account for their login
id)
* creation_time: when they created their account
* last_session_creation_time: unix timestamp of last login
* opted_in_to_mailing_list: whether they have opted into receiving
marketing emails
* enabled_for_marketing_drip: whether they are on the regular
marketing email drip
* org_id: the organization (group of users) they belong to
* invited_by_user_id: which user invited them to join (if applicable).

"takehome_user_engagement.csv" : A usage summary table that has a row for each day
that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one sevenday
period , identify which factors predict future user
adoption

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [2]:
#load in user data
users = pd.read_excel('takehome_users.xlsx', index_col=0)

In [3]:
users.head()

Unnamed: 0_level_0,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,2014-04-22 03:53:00,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
2,2013-11-15 03:45:00,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
3,2013-03-19 23:14:00,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
4,2013-05-21 08:09:00,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
5,2013-01-17 10:14:00,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   creation_time               12000 non-null  datetime64[ns]
 1   name                        12000 non-null  object        
 2   email                       12000 non-null  object        
 3   creation_source             12000 non-null  object        
 4   last_session_creation_time  8823 non-null   float64       
 5   opted_in_to_mailing_list    12000 non-null  int64         
 6   enabled_for_marketing_drip  12000 non-null  int64         
 7   org_id                      12000 non-null  int64         
 8   invited_by_user_id          6417 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 937.5+ KB


In [5]:
#replace na values with 0
users['invited_by_user_id'] = users.invited_by_user_id.fillna(0)

In [6]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   creation_time               12000 non-null  datetime64[ns]
 1   name                        12000 non-null  object        
 2   email                       12000 non-null  object        
 3   creation_source             12000 non-null  object        
 4   last_session_creation_time  8823 non-null   float64       
 5   opted_in_to_mailing_list    12000 non-null  int64         
 6   enabled_for_marketing_drip  12000 non-null  int64         
 7   org_id                      12000 non-null  int64         
 8   invited_by_user_id          12000 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 937.5+ KB


In [7]:
users.head()

Unnamed: 0_level_0,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,2014-04-22 03:53:00,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
2,2013-11-15 03:45:00,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
3,2013-03-19 23:14:00,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
4,2013-05-21 08:09:00,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
5,2013-01-17 10:14:00,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [8]:
#load in engagement data
engagement = pd.read_csv('takehome_user_engagement.csv', index_col=0)
#convert index to datetime index
engagement.index = pd.to_datetime(engagement.index)

In [9]:
#create label column
users['adopted'] = 0
#aggregate total number of engagements per user
total_engagements = engagement.groupby('user_id').sum()['visited']
#iterate through every user_id in engagement
for i in engagement.user_id.unique():
    #skip users who had less than 3 total engagements
    if total_engagements[i] >= 3:
        #resample engagements daily so that same-day logins are not double counted
        user_engagements = engagement[engagement.user_id == i].resample('D').first().dropna()
        last_login = user_engagements.iloc[-1].name
        #create 7 day window from first login
        window_start = user_engagements.iloc[0].name
        window_end = window_start + datetime.timedelta(7)
        while window_end <= last_login:
            #aggregate number of logins during 7 day window
            logins = len(user_engagements[window_start:window_end])
            #if a user had 3 or more logins in a 7 day window, change status of user to adopted and skip to next user
            if logins >= 3:
                users.loc[i, 'adopted'] = 1
                break
            #if a user had less than 3 logins during 7 day window, shift time window by one day
            else:
                window_start = window_start + datetime.timedelta(1)
                window_end = window_end + datetime.timedelta(1)

In [10]:
#one hot encoding
dummies = pd.get_dummies(users.creation_source, drop_first=True)
model_df = pd.concat([users, dummies], axis=1)
#drop irrelevant features
model_df = model_df.drop(['creation_time', 'last_session_creation_time', 'email', 'name', 'invited_by_user_id', 'org_id', 'creation_source'], axis=1)

In [11]:
model_df

Unnamed: 0_level_0,opted_in_to_mailing_list,enabled_for_marketing_drip,adopted,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,1,0,0,0,0,0,0
2,0,0,1,1,0,0,0
3,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...
11996,0,0,0,1,0,0,0
11997,0,0,0,0,0,0,1
11998,1,1,0,0,0,0,0
11999,0,0,0,0,1,0,0


In [12]:
#split data
X = model_df.drop('adopted', axis=1)
y = model_df.adopted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)

In [13]:
#train model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
print(classification_report(y_test, rf_preds))
print(confusion_matrix(y_test, rf_preds))
print(roc_auc_score(y_test, rf_preds))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      4101
           1       0.00      0.00      0.00       699

    accuracy                           0.85      4800
   macro avg       0.43      0.50      0.46      4800
weighted avg       0.73      0.85      0.79      4800

[[4101    0]
 [ 699    0]]
0.5


  _warn_prf(average, modifier, msg_start, len(result))


In [14]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
print(classification_report(y_test, lr_preds))
print(confusion_matrix(y_test, lr_preds))
print(roc_auc_score(y_test, lr_preds))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      4101
           1       0.00      0.00      0.00       699

    accuracy                           0.85      4800
   macro avg       0.43      0.50      0.46      4800
weighted avg       0.73      0.85      0.79      4800

[[4101    0]
 [ 699    0]]
0.5


  _warn_prf(average, modifier, msg_start, len(result))


In [15]:
svm = SVC()
svm.fit(X_train, y_train)
svm_preds = svm.predict(X_test)
print(classification_report(y_test, svm_preds))
print(confusion_matrix(y_test, svm_preds))
print(roc_auc_score(y_test, svm_preds))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      4101
           1       0.00      0.00      0.00       699

    accuracy                           0.85      4800
   macro avg       0.43      0.50      0.46      4800
weighted avg       0.73      0.85      0.79      4800

[[4101    0]
 [ 699    0]]
0.5


  _warn_prf(average, modifier, msg_start, len(result))
