In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_curve, auc, confusion_matrix

# Relax Data Challenge

## Problem:
Identify which factors predict future user adoption
+ Adopted user: user who has logged into the product on three separate days in at least one seven-day period

## Data

The first step is to import the data into Python and explore it for missing values or irregularities.

### User Engagement

In [2]:
# Import user engagement
user_engage = pd.read_csv('takehome_user_engagement.csv',
                          index_col = 'time_stamp',
                         parse_dates = True)

user_engage.head()

| time_stamp | user_id | visited |
|---|---|---|
| 2014-04-22 03:53:30 | 1 | 1 |
| 2013-11-15 03:45:04 | 2 | 1 |
| 2013-11-29 03:45:04 | 2 | 1 |
| 2013-12-09 03:45:04 | 2 | 1 |
| 2013-12-25 03:45:04 | 2 | 1 |


In [3]:
user_engage.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207917 entries, 2014-04-22 03:53:30 to 2014-01-26 08:57:12
Data columns (total 2 columns):
user_id    207917 non-null int64
visited    207917 non-null int64
dtypes: int64(2)
memory usage: 4.8 MB


In [4]:
user_engage.describe()

| | user_id | visited |
|---|---|---|
| count | 207917.0 | 207917.0 |
| mean | 5913.314197 | 1.0 |
| std | 3394.941674 | 0.0 |
| min | 1.0 | 1.0 |
| 25% | 3087.0 | 1.0 |
| 50% | 5682.0 | 1.0 |
| 75% | 8944.0 | 1.0 |
| max | 12000.0 | 1.0 |


A brief look at the data shows no missing values or irregularities. The ``user_id`` values run from 1 to 12,000, and ``visited`` is always 1, so each row records exactly one login, as expected. Before moving on, I want to get the date range over which this data was recorded.

In [5]:
# Get the beginning and end date for logs
min_date = user_engage.index.min()
max_date = user_engage.index.max()

print('The log started recording on', min_date)
print('The log finished recording on', max_date)

The log started recording on 2012-05-31 08:20:06
The log finished recording on 2014-06-06 14:58:50


### Users

In [6]:
# Import user information
users = pd.read_csv('takehome_users.csv', 
                    index_col = 'object_id', 
                    parse_dates = [1], 
                    encoding = 'iso-8859-1')

users.head()

| object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 9 columns):
creation_time                 12000 non-null datetime64[ns]
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 937.5+ KB


In [8]:
users.describe()

| | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|
| count | 8823.0 | 12000.0 | 12000.0 | 12000.0 | 6417.0 |
| mean | 1379279000.0 | 0.2495 | 0.149333 | 141.884583 | 5962.957145 |
| std | 19531160.0 | 0.432742 | 0.356432 | 124.056723 | 3383.761968 |
| min | 1338452000.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 25% | 1363195000.0 | 0.0 | 0.0 | 29.0 | 3058.0 |
| 50% | 1382888000.0 | 0.0 | 0.0 | 108.0 | 5954.0 |
| 75% | 1398443000.0 | 0.0 | 0.0 | 238.25 | 8817.0 |
| max | 1402067000.0 | 1.0 | 1.0 | 416.0 | 11999.0 |


From the initial findings, the data has missing values in ``last_session_creation_time`` and ``invited_by_user_id``: roughly a quarter of the former (3,177 of 12,000) and nearly half of the latter (5,583 of 12,000). Dropping those rows would discard valuable observations, so it should be avoided where possible. Also, according to the data documentation, ``last_session_creation_time`` is encoded as a Unix timestamp, so it makes more sense to convert it from float64 to a datetime.

In [9]:
# Convert last_session_creation_time to datetime
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit = 's')
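
As a quick check of this conversion, here is a minimal sketch on two illustrative values (they are not read from the file). It suggests that `unit='s'` treats the floats as seconds since the Unix epoch and that a missing entry becomes `NaT` instead of raising an error:

```python
import numpy as np
import pandas as pd

# Illustrative input: one Unix timestamp in seconds and one missing entry
raw = pd.Series([1398138810.0, np.nan])

# unit='s' interprets the numbers as seconds since 1970-01-01 00:00:00
converted = pd.to_datetime(raw, unit='s')
print(converted)
```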

#### ``invited_by_user_id``

I'll start by investigating the missing values for ``invited_by_user_id``, as they might be easier to resolve. One likely reason these values are missing is that the users weren't invited by anyone. In that case, the values under ``creation_source`` for these observations would be _SIGNUP_ or *SIGNUP_GOOGLE_AUTH*.

In [10]:
# Select the rows where invited_by_user_id is missing
missing_invited_by = users[users['invited_by_user_id'].isna()]

In [11]:
# Group by creation_source and obtain count
missing_invited_by.groupby('creation_source')['name'].count()

creation_source
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: name, dtype: int64

In [12]:
# Compare counts to counts for users
users.groupby('creation_source')['name'].count()

creation_source
GUEST_INVITE          2163
ORG_INVITE            4254
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: name, dtype: int64

As suspected, the missing values correspond to direct signups through the website or through Google, plus one group I hadn't anticipated: users who signed up for personal projects also have no inviter. The counts match the totals for these three sources exactly, so no invited user is missing an inviter ID.

Now that we know why values of ``invited_by_user_id`` are missing, we can fill them in. Since there's no ``object_id`` (or ``user_id``) of $0$, I'll use that value to mean *self*. Also, I don't need the ID of the inviting user, only whether the user was invited at all, so the column is then reduced to a 0/1 flag.

In [13]:
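# Fill NaN values in invited_by_user_id with 0
# (assigning the result back avoids calling fillna(inplace=True) on a column selection)
users['invited_by_user_id'] = users['invited_by_user_id'].fillna(0)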

In [14]:
# Greater than 0?
users.loc[users['invited_by_user_id'] > 0, 'invited_by_user_id'] = 1
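
The two cells above could also be collapsed into one step. The sketch below uses a small made-up frame (`demo` is a hypothetical stand-in for `users`, and `invited` is an assumed column name) and derives the flag from whether an inviter ID is present, so it does not rely on 0 as a sentinel:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the users frame
demo = pd.DataFrame({'invited_by_user_id': [10803.0, np.nan, 316.0, np.nan]})

# 1 if an inviter ID is present, 0 if it is missing
demo['invited'] = demo['invited_by_user_id'].notna().astype(int)
print(demo)
```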

#### ``last_session_creation_time``

The missing values in ``last_session_creation_time`` pose a harder problem. Although they make up roughly a quarter of the data, the only real way to fill them in would be the user engagement data set, which is also what determines adopted users. The most likely explanation is that these users signed up but never logged in. For that reason, these observations are removed, as they can't provide any insight into adoption.

In [15]:
# Drop nan's from the data frame
users.dropna(inplace=True)
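
Before dropping these rows, the "signed up but never logged in" explanation can be checked against the engagement log. The following is only a sketch on made-up data (`users_demo` and `engage_demo` are hypothetical stand-ins for `users` and `user_engage`). It counts users who have no last session but still appear in the log. If the explanation holds, the count should be zero:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two data sets
users_demo = pd.DataFrame(
    {'last_session_creation_time': [1398138810.0, np.nan, 1363734892.0]},
    index=pd.Index([1, 2, 3], name='object_id'),
)
engage_demo = pd.DataFrame({'user_id': [1, 1, 3], 'visited': [1, 1, 1]})

# IDs of users with no recorded last session
no_session = users_demo.index[users_demo['last_session_creation_time'].isna()]

# How many of those users nevertheless appear in the engagement log?
n_conflicts = int(no_session.isin(engage_demo['user_id']).sum())
print(n_conflicts)
```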

### ``adopted_users``

Relax wants to know which of their users adopted the product and what factors play into it. A column has to be added to the users data frame to mark whether each user was adopted. This column will eventually be used as the target variable of our model.

In [16]:
# Create a list of adopted users
adopted_users = []
for i in users.index:
    # Keep only this user's rows and resample their engagement to daily frequency
    user_activity = user_engage[user_engage['user_id'] == i].resample('D').min().sort_index()
    
    # Get a rolling count with window of 7 days
    rolling_count = user_activity['visited'].rolling(window=7, min_periods=1).sum()
    
    # Extract the most days the user logged-in in a 7 day window
    max_days_active = rolling_count.max()
    
    # Append 1 if the user was active on 3 or more days within some 7-day window, else 0
    adopted_users.append(int(max_days_active >= 3))
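
The loop above rescans the whole engagement frame once per user. As an alternative sketch (not the method used in this notebook), the same definition can be tested per user by comparing each distinct login day with the login day two positions later. The helper name `is_adopted` and the small `engagement` frame are made up for illustration:

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """True if the logins fall on at least `min_days` distinct calendar
    days inside some span of `window_days` consecutive days."""
    days = (pd.Series(pd.to_datetime(timestamps))
              .dt.normalize()        # keep the date, drop the time of day
              .drop_duplicates()
              .sort_values()
              .reset_index(drop=True))
    if len(days) < min_days:
        return False
    # Gap between each login day and the login day (min_days - 1) positions later
    span = days.shift(-(min_days - 1)) - days
    return bool((span <= pd.Timedelta(days=window_days - 1)).any())

# Made-up log: user 1 qualifies, user 2 is one day too spread out,
# user 3 has a single login
engagement = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3],
    'time_stamp': pd.to_datetime([
        '2014-01-01 08:00', '2014-01-03 09:00', '2014-01-07 10:00',
        '2014-01-01 08:00', '2014-01-03 09:00', '2014-01-08 10:00',
        '2014-01-01 08:00',
    ]),
})

adopted = engagement.groupby('user_id')['time_stamp'].apply(is_adopted)
print(adopted)
```

On the real data this would presumably be applied to `user_engage.reset_index()`, since `time_stamp` is the index there.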

In [17]:
# Added adopted_users column
users['adopted_users'] = adopted_users

### ``activity_amount``

An important factor that may be useful in predicting whether a user was adopted is the amount of activity on their account. The following column is added to the user data frame to capture that.

In [18]:
# Get total logins
activity_grouped = user_engage.reset_index().groupby('user_id')['visited'].sum()

In [19]:
# Add activity_amount column
users['activity_amount'] = activity_grouped

## EDA

Here, I want to briefly check for patterns or correlations between the variables and being an adopted user.

In [20]:
users.head()

| object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_users | activity_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 2014-04-22 03:53:30 | 1 | 0 | 11 | 1.0 | 0 | 1 |
| 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 2014-03-31 03:45:04 | 0 | 0 | 1 | 1.0 | 1 | 14 |
| 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 2013-03-19 23:14:52 | 0 | 0 | 94 | 1.0 | 0 | 1 |
| 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 2013-05-22 08:09:28 | 0 | 0 | 1 | 1.0 | 0 | 1 |
| 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 2013-01-22 10:14:20 | 0 | 0 | 193 | 1.0 | 0 | 1 |


In [21]:
# Drop unnecessary columns
users.drop(['creation_time','last_session_creation_time', 'name', 'email'], inplace=True, axis=1)

In [22]:
# Create dummy variables for creation_source
users_dummies = pd.get_dummies(users, columns=['creation_source'])

In [23]:
# Get mean of variables by adopted_users
adopted_grouped = users_dummies.groupby('adopted_users').mean()

adopted_grouped.transpose()

| adopted_users | 0 | 1 |
|---|---|---|
| opted_in_to_mailing_list | 0.250935 | 0.258427 |
| enabled_for_marketing_drip | 0.151641 | 0.153558 |
| org_id | 138.200803 | 162.276529 |
| invited_by_user_id | 0.534967 | 0.569913 |
| activity_amount | 1.384711 | 123.54432 |
| creation_source_GUEST_INVITE | 0.17006 | 0.224719 |
| creation_source_ORG_INVITE | 0.364908 | 0.345194 |
| creation_source_PERSONAL_PROJECTS | 0.083091 | 0.102372 |
| creation_source_SIGNUP | 0.222268 | 0.182896 |
| creation_source_SIGNUP_GOOGLE_AUTH | 0.159673 | 0.144819 |


In the table above, there aren't huge differences between the averages of the variables grouped by adopted users. The only variable notably different is ``activity_amount``.

## Modeling

The goal for a model should be to outperform guessing. For a binary classification problem like this, that means its accuracy should exceed the share of the majority class.

In [24]:
# Percent of adopted users
users_dummies['adopted_users'].mean()

0.18157089425365522
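
To make the baseline concrete, here is a minimal sketch using scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The labels are made up (82 negatives and 18 positives, chosen only to resemble the class balance above), and the features are placeholders because this strategy ignores them:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the class balance reported above
y = np.array([0] * 82 + [1] * 18)
X = np.zeros((len(y), 1))  # placeholder features; 'most_frequent' ignores them

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = baseline.score(X, y)
print('Majority-class accuracy: %.2f' % baseline_accuracy)
```

A model is only useful here if its accuracy clears this number, which on the real data would be about 1 - 0.18 = 0.82.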

In [25]:
# Extract predictor and target variables
predictors = users_dummies.drop('adopted_users', axis=1)
target = users_dummies['adopted_users']

In [26]:
# Split the data to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(predictors,
                                                   target,
                                                   test_size = 0.3, 
                                                   random_state = 123,
                                                   stratify = target)

In [27]:
# Create a randomized grid of parameters to help tune a random forest classifier

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start = 5, stop = 50, num = 10)]

# Max depth of trees
max_depth = [int(x) for x in np.linspace(start = 5, stop = 50, num = 10)]

# Minimum number of samples required to split
min_samples_split = [2, 5, 10]

# Create the random grid
random_grid = {'n_estimators':n_estimators,
              'max_depth':max_depth,
              'min_samples_split':min_samples_split,
              'class_weight':['balanced']}

In [28]:
# Instantiate RandomForestClassifier model
rf = RandomForestClassifier()

# Create RandomizedSearchCV
rf_random = RandomizedSearchCV(estimator = rf,
                              param_distributions = random_grid,
                              n_iter = 200,
                              cv = 5,
                              random_state = 123)

# Fit training data to rf_random
rf_random.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid='warn', n_iter=200, n_jobs=None,
          param_distributions={'n_estimators': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50], 'max_depth': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50], 'min_samples_split': [2, 5, 10], 'class_weight': ['balanced']},
          pre_dispatch='2*n_jobs', random_state=123, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [29]:
# Return the best parameters
rf_random.best_params_

{'n_estimators': 45,
 'min_samples_split': 10,
 'max_depth': 25,
 'class_weight': 'balanced'}

In [30]:
def model_eval(model, X_test, y_test):
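    '''Print accuracy and ROC AUC for a fitted model on the test set.'''
    y_pred = model.predict(X_test)
    
    # Note: y_pred holds hard class labels, so this ROC curve has a single
    # operating point; scores from predict_proba would give the usual AUC
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)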
    roc_auc = auc(fpr, tpr)
    
    accuracy = model.score(X_test, y_test)
    
    print('Accuracy Score: %.4f' %(accuracy))
    print('AUC Score: %.4f' %(roc_auc))
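
ROC AUC is normally computed from continuous scores rather than from hard 0/1 predictions, because the curve is traced by sweeping a threshold over the scores. The sketch below illustrates the difference on synthetic data from `make_classification`, so the numbers it prints are unrelated to the results in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data purely for illustration
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: the ROC curve has one operating point
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from predicted probabilities of the positive class
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print('AUC from labels: %.4f' % auc_from_labels)
print('AUC from scores: %.4f' % auc_from_scores)
```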

In [31]:
# Print the evaluation metrics for rf_random
model_eval(rf_random.best_estimator_, X_test, y_test)

Accuracy Score: 0.9766
AUC Score: 0.9687


In [32]:
# Get the model predictions of the test set and print the confusion matrix
predictions = rf_random.best_estimator_.predict(X_test)

confusion_matrix(y_test, predictions)

array([[2125,   41],
       [  21,  460]], dtype=int64)

In [33]:
# Create a parameter grid for GridSearchCV
param_grid = {'n_estimators':[22, 25, 28],
             'min_samples_split':[4, 5, 6],
             'max_depth':[18, 20, 22]}

In [34]:
# Instantiate a Random Forest model
rf = RandomForestClassifier()

# Create GridSearchCV
grid_search = GridSearchCV(estimator=rf,
                          param_grid = param_grid,
                          cv = 5)

In [35]:
# Fit the training data to grid_search
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [22, 25, 28], 'min_samples_split': [4, 5, 6], 'max_depth': [18, 20, 22]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
# Return best parameters
grid_search.best_params_

{'max_depth': 20, 'min_samples_split': 4, 'n_estimators': 28}

In [37]:
# Print model evaluation scores
model_eval(grid_search.best_estimator_, X_test, y_test)

Accuracy Score: 0.9777
AUC Score: 0.9565


In [38]:
# Print confusion matrix of predictions
predictions = grid_search.best_estimator_.predict(X_test)

confusion_matrix(y_test, predictions)

array([[2144,   22],
       [  37,  444]], dtype=int64)

Both models reach roughly 97.7% test accuracy, well above the 81.8% obtained by always predicting the majority (non-adopted) class. The next step is to extract the feature importances, per Relax's request.

In [39]:
# Print feature_importance
feature_importance = grid_search.best_estimator_.feature_importances_
feature_importance_df = pd.DataFrame(feature_importance, index = X_train.columns, columns = ['importance'])

feature_importance_df.sort_values('importance', ascending = False)

| | importance |
|---|---|
| activity_amount | 0.935542 |
| org_id | 0.050323 |
| opted_in_to_mailing_list | 0.002906 |
| enabled_for_marketing_drip | 0.002893 |
| creation_source_SIGNUP | 0.001569 |
| creation_source_ORG_INVITE | 0.001549 |
| creation_source_PERSONAL_PROJECTS | 0.001519 |
| creation_source_GUEST_INVITE | 0.00142 |
| creation_source_SIGNUP_GOOGLE_AUTH | 0.001385 |
| invited_by_user_id | 0.000895 |
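
The impurity-based `feature_importances_` used above are computed on the training data and can favour features with many distinct values, such as `org_id`. One possible cross-check, which this notebook does not run, is permutation importance on held-out data. The sketch below uses synthetic data, so its output says nothing about the Relax features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the test set and record the drop in score
result = permutation_importance(model, X_te, y_te,
                                n_repeats=5, random_state=0)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
```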


## Remarks

Relax wanted to know which factors predict future user adoption. According to the model's feature importances, the amount of activity is by far the most important predictor. That makes sense: the more frequently a user logs in, the higher the probability that they'll log in on at least 3 days in a 7-day period. The caveat is that ``activity_amount`` is derived from the same login records that define an adopted user, so it leaks the target into the predictors and invalidates the model. It is also hard to use in practice because it's a "resulting" variable, measured after the fact, which makes it useless for predicting adoption among new users.

In the future, a model should exclude ``activity_amount`` and integrate better predictor variables. The variables in the available data aren't enough to reliably predict user adoption. Variables worth adding could include the amount of time the user spent in the product right after signup, why the user signed up, or the rating the inviting user gave the product before sending the invitation.
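
As a rough sketch of that first step, the leaky column can be dropped before fitting. The frame below is made up (`demo` is a hypothetical stand-in for `users_dummies` with only a few columns), so the fitted model only illustrates the mechanics and is not a result:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up frame with the same kind of columns as users_dummies
demo = pd.DataFrame({
    'opted_in_to_mailing_list': [1, 0, 0, 1, 0, 1],
    'org_id': [11, 1, 94, 1, 193, 7],
    'activity_amount': [1, 14, 1, 1, 1, 30],
    'adopted_users': [0, 1, 0, 0, 0, 1],
})

# Columns derived from the same logins that define the target
leaky = ['activity_amount']

X = demo.drop(columns=leaky + ['adopted_users'])
y = demo['adopted_users']

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```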