In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

# Engagement data schema
● name: the user's name

● object_id: the user's id

● email: email address

● creation_source: how their account was created. This takes on one
of 5 values:
    
    ○ PERSONAL_PROJECTS: invited to join another user's personal workspace
    
    ○ GUEST_INVITE: invited to an organization as a guest (limited permissions)
    
    ○ ORG_INVITE: invited to an organization (as a full member)
    
    ○ SIGNUP: signed up via the website
    
    ○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

● creation_time: when they created their account

● last_session_creation_time: unix timestamp of last login

● opted_in_to_mailing_list: whether they have opted into receiving
marketing emails

● enabled_for_marketing_drip: whether they are on the regular
marketing email drip

● org_id: the organization (group of users) they belong to

● invited_by_user_id: which user invited them to join (if applicable).

# Target: 
## Adopted User:

Has logged in to product on three separate days in at least one seven-day period.

In [2]:
#read in relevant data
users = pd.read_csv('./takehome_users.csv', encoding = 'latin1')
engagement = pd.read_csv('./takehome_user_engagement.csv')

# Data Preparation

The first step in this process is to create the target feature. My methodology will be simple: using the __engagement__ dataframe, I will create a DatetimeIndex from the time_stamp feature, downsample it into 7-day bins, and count the number of logins per user_id in each bin, recording every user_id that appears at least three times in one of those bins. Because the bins are fixed, this approximates the definition: three logins that straddle a bin boundary are not counted together. Then, with the set of user_ids that qualify as "adopted users", I will add a binary feature to the __users__ dataframe indicating whether a given user qualifies as an "adopted user".
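
Fixed 7-day bins can miss three logins that fall on either side of a bin boundary. As a point of comparison, here is a minimal sketch of a rolling-window check on toy data. The helper name `find_adopted_users` and the sample rows are hypothetical, the sketch assumes `time_stamp` is still an ordinary column, and it is not the method the notebook runs below.

```python
import pandas as pd

def find_adopted_users(engagement, window_days=7, min_days=3):
    # Hypothetical helper: returns the user_ids that logged in on at least
    # `min_days` distinct calendar days within some `window_days`-day span.
    days = pd.DataFrame({
        "user_id": engagement["user_id"],
        "day": pd.to_datetime(engagement["time_stamp"]).dt.normalize(),
    })
    # Keep one row per user per calendar day, ordered in time per user.
    days = days.drop_duplicates().sort_values(["user_id", "day"])
    # For each login day, look up the login day (min_days - 1) rows earlier.
    earlier = days.groupby("user_id")["day"].shift(min_days - 1)
    span = days["day"] - earlier
    # NaT spans (too few login days so far) compare as False.
    hit = span < pd.Timedelta(days=window_days)
    return set(days.loc[hit, "user_id"])

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "time_stamp": [
        "2014-01-06 08:00:00", "2014-01-08 08:00:00", "2014-01-10 08:00:00",
        "2014-01-01 08:00:00", "2014-01-05 08:00:00", "2014-01-09 08:00:00",
        "2014-02-01 09:00:00", "2014-02-01 12:00:00", "2014-02-01 18:00:00",
    ],
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})
adopted_toy = find_adopted_users(toy)
```

In this toy data only user 1 qualifies: user 2's three login days span more than a week, and user 3's three logins all fall on the same day.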

In [3]:
#view datatable head
engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [None]:
#convert time stamp to datetime object and set as a DatetimeIndex
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'], format = '%Y-%m-%d %H:%M:%S')
engagement = engagement.set_index(engagement.time_stamp).drop(columns = ['time_stamp'])

#usage counts
counts = engagement.resample('7D').user_id.value_counts()

#extract multi users
adopted_indices = counts[counts >= 3].index

#create set of adopted user ids 
adopted = set()
for index in adopted_indices:
    adopted.add(index[1]) 
    
#create target variable
users['adopted_user'] = users.object_id.apply(lambda x: 1 if x in adopted else 0)
#view information for dataframe 
users.info()

In [5]:
#data table head
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0


There are several features such as __object_id__, __name__, and __email__ that either hold no value in modelling or were only used as a merge key. These will be dropped. I will also convert __creation_source__, a categorical variable with five levels, to binary features via one-hot encoding.

In [6]:
#drop merge keys and trivial object features
users = users.drop(columns = ['object_id', 'name', 'email'])

#create dummy variable from creation_source
creation_source = pd.get_dummies(users.creation_source)
#add back to main frame
users = pd.concat([users, creation_source], axis = 1).drop(columns = ['creation_source'])

# Feature Engineering
On their own, __last_session_creation_time__ and __invited_by_user_id__ add little value to a model. However, they can yield two potentially important features: the timedelta between account creation and the last session, and whether the user who sent the invitation is an adopted user themselves. The code below will extract these two features in a format that the model can make use of.

There is also the question of what to do with __org_id__. Because __org_id__ is a categorical variable with hundreds of levels, it would be unwise to leave the feature as is: machine learning algorithms would interpret higher __org_id__ values as "larger", which is not correct here, since the values merely identify different organizations. But because there are so many levels, encoding them means creating hundreds of features. The solution will be to create binary variables from __org_id__ and convert the dataframe to a sparse matrix to conserve memory. Most of these will likely not end up being important to modelling, so using an algorithm down the road that penalizes and removes unimportant features (e.g., Lasso or anything else with L1 regularization) will help prune them.
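
As an illustration of the sparse encoding idea, the sketch below builds a one-hot matrix for a categorical id column directly in sparse form, without first materializing a dense dataframe. The helper name `one_hot_sparse` and the sample ids are hypothetical, and the sketch assumes the column has no missing values. The notebook itself uses `pd.get_dummies` followed by `csr_matrix`.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def one_hot_sparse(values):
    # Hypothetical helper: map each distinct level to a column index,
    # then place a single 1 per row in that column.
    codes, levels = pd.factorize(values, sort=True)
    n_rows = len(codes)
    data = np.ones(n_rows, dtype=np.int8)
    rows = np.arange(n_rows)
    matrix = csr_matrix((data, (rows, codes)), shape=(n_rows, len(levels)))
    return matrix, list(levels)

# Made-up org ids for illustration only.
org_ids = pd.Series([11, 1, 94, 1])
org_matrix, org_levels = one_hot_sparse(org_ids)
```

Each row stores exactly one non-zero entry, so memory grows with the number of users rather than with users times organizations.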

## Creating Timedelta feature to analyze minimum early activity

In [7]:
#quick check on null value login times
check_nulls = users[(users.last_session_creation_time.isnull()) & (users.adopted_user == 1)]
print(len(check_nulls))

0


I ran the block of code above to check whether any users had null values for __last_session_creation_time__ but were classified as adopted users. In creating features for minimum user activity, I needed to check my assumption that a null value in __last_session_creation_time__ means the user never logged in, so that I can fill null values with a timedelta of $0$. While this is not a perfect test, it was a quick data verification check before proceeding.
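
The fill-with-zero logic can be sketched on a few toy rows. The values below are made up, and the comparison is written in vectorized form rather than with `apply`; it is a sketch of the idea, not a replacement for the cell that follows.

```python
import numpy as np
import pandas as pd

# Made-up rows: the third user has no recorded session at all.
toy_users = pd.DataFrame({
    "creation_time": ["2014-01-01 00:00:00"] * 3,
    # Unix seconds: 2014-01-11, 2014-01-03, and missing.
    "last_session_creation_time": [1389398400.0, 1388707200.0, np.nan],
})

created = pd.to_datetime(toy_users["creation_time"])
last_seen = pd.to_datetime(toy_users["last_session_creation_time"], unit="s")

# A missing last session becomes NaT, which is then treated as zero elapsed time.
delta = (last_seen - created).fillna(pd.Timedelta(0))
toy_users["min_1week"] = (delta > pd.Timedelta(days=7)).astype(int)
```

Only the first toy user, whose last session came ten days after account creation, is flagged as active beyond one week.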

In [8]:
#transform date features into usable datetime objects
users['creation_time'] = pd.to_datetime(users['creation_time'], format = '%Y-%m-%d %H:%M:%S')
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit = 's')

#get time difference between last login and account creation
users['timedeltas'] = users.last_session_creation_time - users.creation_time

#fill null values (assuming zero logins)
zero_logins = pd.Timedelta(0, unit = 's')
users['timedeltas'] = users.timedeltas.fillna(zero_logins)

#create timedelta objects for 1 week and 1 month
week = pd.Timedelta(7, unit = 'D')
month = pd.Timedelta(30, unit = 'D')

print('{} users did not login after their first week'.format(len(users.timedeltas[users.timedeltas < week])))
print()
print('{} users did not login after their first month'.format(len(users.timedeltas[users.timedeltas < month])))

#create feature: users still active for 1 week, still active after 1 month 
users['min_1week'] = users.timedeltas.apply(lambda x: 1 if x > week  else 0)
users['min_1month'] = users.timedeltas.apply(lambda x: 1 if x > month  else 0)
users = users.drop(columns = ['last_session_creation_time', 'creation_time', 'timedeltas'])

9437 users did not login after their first week

10082 users did not login after their first month


## Creating invited by adopted user feature

In [9]:
users['invited_by_adopted'] = users.invited_by_user_id.apply(lambda x: 1 if x in adopted else 0)
users = users.drop(columns = ['invited_by_user_id'])

# Dealing with org_ids

In [10]:
#check number of levels
print('There are {} different levels of "org_id"'.format(len(users.org_id.unique())))

There are 417 different levels of "org_id"


In [11]:
#get dummy variables 
orgs = pd.get_dummies(users.org_id)

#concatenate to main frame and drop original column
users = pd.concat([users, orgs], axis = 1).drop(columns = ['org_id'])

# Preparing for modelling

In [12]:
#extract target feature
target = users.adopted_user

#create sparse feature matrix
users = users.drop(columns = ['adopted_user'])
feature_cols = list(users.columns)
X = csr_matrix(users)

# Baseline Model
I will begin with a simple Logistic Regression with L1 regularization. I have selected L1 regularization here because the matrix is sparse and the L1 penalty drives the coefficients of unimportant features to exactly zero.
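
To illustrate why an L1 penalty suits a feature matrix with many uninformative columns, the sketch below compares how many coefficients end up exactly zero under L1 and L2 penalties. It uses synthetic data from `make_classification`, and the value `C=0.05` is an arbitrary choice for the demonstration; neither comes from the notebook's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=50, n_informative=3,
    n_redundant=0, random_state=43,
)

# Same solver and regularization strength, differing only in the penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, max_iter=1000)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05, max_iter=1000)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

# Count coefficients that were shrunk to exactly zero.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

The L2 penalty shrinks coefficients toward zero but rarely sets them to exactly zero, which is why the non-zero filter on the coefficients later in this notebook is only meaningful with the L1 penalty.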

In [13]:
xtrain, xtest, ytrain, ytest = train_test_split(X, target, test_size = 0.3, random_state = 43)
lr = LogisticRegression(penalty = 'l1', solver = 'liblinear', max_iter = 1000)
lr.fit(xtrain, ytrain)
ypreds = lr.predict(xtest)
print('The accuracy of the baseline linear model was {}'.format(accuracy_score(ytest, ypreds)))
print()
print('The area under the ROC curve for the baseline linear model was {}'.format(roc_auc_score(ytest, ypreds)))

The accuracy of the baseline linear model was 0.9455555555555556

The area under the ROC curve for the baseline linear model was 0.944126984126984
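
Note that the cell above passes hard 0/1 predictions to `roc_auc_score`, which evaluates a single operating point rather than the full curve. A minimal sketch of the more common usage, scoring with predicted probabilities from `predict_proba`, is shown below on synthetic imbalanced data. The data, the `stratify` option, and the variable names are assumptions for illustration, so the numbers it produces say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 87% negatives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.87, 0.13], random_state=43,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=43, stratify=y_demo,
)

model = LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)
model.fit(Xtr, ytr)

# AUC from thresholded labels: one point on the ROC curve.
auc_from_labels = roc_auc_score(yte, model.predict(Xte))
# AUC from probabilities of the positive class: the full ROC curve.
auc_from_scores = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
print(auc_from_labels, auc_from_scores)
```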


In [14]:
#look into feature importances for baseline model 
coefficients = lr.coef_[0]
log_feature_importances = pd.DataFrame({'feature' : feature_cols, 'importance' : coefficients})

In [15]:
log_feature_importances = log_feature_importances[log_feature_importances.importance != 0].reset_index(drop = True).\
sort_values(by = ['importance'], ascending = False)
log_feature_importances.head(20)

Unnamed: 0,feature,importance
6,min_1week,4.538556
7,min_1month,4.242808
36,118,1.114544
62,339,1.078317
64,341,0.811375
41,131,0.710513
27,82,0.630841
13,7,0.456881
52,219,0.4296
29,89,0.404141


# Test on single feature

In [16]:
#create single feature to test on
min1_week = users.min_1week
min1_week = np.array(min1_week).reshape(-1, 1)

In [17]:
xtrain, xtest, ytrain, ytest = train_test_split(min1_week, target, test_size = 0.3, random_state = 43)
lr = LogisticRegression(penalty = 'l1', solver = 'liblinear', max_iter = 1000)
lr.fit(xtrain, ytrain)
ypreds = lr.predict(xtest)
print('The accuracy of the baseline linear model was {}'.format(accuracy_score(ytest, ypreds)))
print()
print('The area under the ROC curve for the baseline linear model was {}'.format(roc_auc_score(ytest, ypreds)))

The accuracy of the baseline linear model was 0.9044444444444445

The area under the ROC curve for the baseline linear model was 0.9434920634920635


# Test Removing Primary Feature

In [20]:
new = users.drop(columns = ['min_1week', 'min_1month'])
X2 = csr_matrix(new)

xtrain, xtest, ytrain, ytest = train_test_split(X2, target, test_size = 0.3, random_state = 43)
lr = LogisticRegression(penalty = 'l1', solver = 'liblinear', max_iter = 1000)
lr.fit(xtrain, ytrain)
ypreds = lr.predict(xtest)
print('The accuracy of the baseline linear model was {}'.format(accuracy_score(ytest, ypreds)))
print()
print('The area under the ROC curve for the baseline linear model was {}'.format(roc_auc_score(ytest, ypreds)))

The accuracy of the baseline linear model was 0.8738888888888889

The area under the ROC curve for the baseline linear model was 0.49936507936507935


An area under the ROC curve of nearly $0.5$ indicates that, without the two activity features, the model separates the classes no better than chance. The accuracy of about $0.87$ reflects heavy class imbalance rather than predictive skill: the model is just predicting the more common value. If we look at the distribution of the predicted values versus the actual distribution of the test set, we will see this is the case.

In [23]:
#check predicted values distribution
series = pd.Series(ypreds)
series.value_counts()

0    3596
1       4
dtype: int64

In [22]:
true_series = pd.Series(ytest)
true_series.value_counts()

0    3150
1     450
Name: adopted_user, dtype: int64
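
The arithmetic behind these counts can be checked with a short sketch. It rebuilds a label vector with the test-set class counts shown above (3150 negatives, 450 positives) and scores a hypothetical predictor that always outputs 0. The constant predictor is an assumption for illustration and is not the fitted model, which predicted 1 four times.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# Labels with the same class counts as the test set above.
y_true = np.array([0] * 3150 + [1] * 450)
# Hypothetical predictor that always returns the majority class.
y_const = np.zeros_like(y_true)

const_accuracy = accuracy_score(y_true, y_const)           # 3150 / 3600
const_auc = roc_auc_score(y_true, y_const)                 # no ranking information
const_balanced = balanced_accuracy_score(y_true, y_const)  # mean of per-class recall
print(const_accuracy, const_auc, const_balanced)
```

Always predicting 0 already yields an accuracy of $0.875$, so accuracy alone is a weak signal on this split. When `roc_auc_score` is given hard labels, as in the cells above, it equals the balanced accuracy, which is $0.5$ for any constant predictor.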

# Findings

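Far and away the most important indicator of whether or not someone will become an adopted user is whether they continue to use the product after the first week. A model using that single feature alone reaches an accuracy of about $0.90$ and an area under the ROC curve of about $0.94$, nearly matching the area under the ROC curve of the full baseline model.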