<h1>Take Home Challenge 2: Relax, Inc.</h1>
<h2>Springboard Data Science Career Track</h2>

My responses are labeled <b>(Me)</b>.

<h3>Data description</h3>
The data has the following two tables:

1. A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:

<ul>
<li><b>name:</b> the user's name</li>
<li><b>object_id:</b> the user's id</li>
<li><b>email:</b> email address</li>
<li><b>creation_source:</b> how their account was created. This takes on one of 5 values:</li>
<ul>
<li><b>PERSONAL_PROJECTS:</b> invited to join another user's personal workspace</li>
<li><b>GUEST_INVITE:</b> invited to an organization as a guest (limited permissions)</li>
<li><b>ORG_INVITE:</b> invited to an organization (as a full member)</li>
<li><b>SIGNUP:</b> signed up via the website</li>
<li><b>SIGNUP_GOOGLE_AUTH:</b> signed up using Google Authentication (using a Google email account for their login id)</li>
</ul>
<li><b>creation_time:</b> when they created their account</li>
<li><b>last_session_creation_time:</b> unix timestamp of last login</li>
<li><b>opted_in_to_mailing_list:</b> whether they have opted into receiving marketing emails</li>
<li><b>enabled_for_marketing_drip:</b> whether they are on the regular marketing email drip</li>
<li><b>org_id:</b> the organization (group of users) they belong to</li>
<li><b>invited_by_user_id:</b> which user invited them to join (if applicable)</li>
</ul>

2. A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who <i>has logged into the product on three separate days in at least one seven-day period</i>, <b>identify which factors predict future user adoption.</b>

We suggest spending 1-2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

In [1]:
import pandas as pd
import numpy as np

users = pd.read_csv("takehome_users.csv", encoding='latin1')
logins = pd.read_csv("takehome_user_engagement.csv")
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [2]:
logins.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [3]:
# Determine the status of users as 'adopted' or not
# Clean up both tables first

logins['time_stamp'] = pd.to_datetime(logins['time_stamp'])
# Group by the column name (not a one-element list) so each group key is a scalar user_id
grouped = logins.groupby('user_id')

users['user_id'] = users['object_id']
users.drop(['object_id'], axis=1, inplace=True)
users['adopted'] = 0

In [4]:
# Loop over each user's logins

for key, table in grouped:
    # Only consider users who have logged in at least three times
    if table.shape[0] < 3:
        continue
    # Pull the time_stamp column into a numpy array
    timestamps = np.array(table['time_stamp'])
    for i in range(len(timestamps) - 2):
        # The login rows are ordered chronologically (one row per login day), so the
        # user is adopted if the login two rows ahead is within 7 days of this one
        if pd.Timedelta(timestamps[i+2] - timestamps[i]) <= pd.Timedelta('7 days'):
            users.loc[users['user_id'] == key, 'adopted'] = 1
            break
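
<b>(Me)</b> The loop above can also be written without explicit iteration. The sketch below is one possible vectorized version, not the code that produced the results in this notebook. The function name <code>find_adopted_users</code>, its <code>window</code> parameter, and the toy data are made up for illustration. It keeps the same 7-day threshold as the loop, and it additionally collapses multiple logins on one calendar day, which the loop assumes never happens.

```python
import pandas as pd


def find_adopted_users(logins: pd.DataFrame, window=pd.Timedelta(days=7)) -> list:
    """Return user_ids with three distinct login days spanning at most `window`."""
    days = (
        logins.assign(day=pd.to_datetime(logins["time_stamp"]).dt.normalize())
        .drop_duplicates(["user_id", "day"])   # one row per user per calendar day
        .sort_values(["user_id", "day"])
    )
    # Time between each login day and the login day two rows later for the same user
    span = days.groupby("user_id")["day"].shift(-2) - days["day"]
    # Users with fewer than three login days only produce NaT spans, which compare False
    return days.loc[span <= window, "user_id"].unique().tolist()


# Toy data (hypothetical): only user 1 has three distinct days within a week
demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time_stamp": [
        "2014-01-01 08:00:00", "2014-01-03 09:00:00", "2014-01-06 10:00:00",
        "2014-01-01 08:00:00", "2014-01-01 20:00:00", "2014-01-02 08:00:00",
        "2014-01-01 08:00:00", "2014-01-10 08:00:00", "2014-01-20 08:00:00",
    ],
})
adopted_ids = find_adopted_users(demo)
print(adopted_ids)
```

With the real tables, the flag could then be set with something like <code>users['user_id'].isin(adopted_ids).astype(int)</code>. A stricter calendar-day reading of "seven-day period" would pass <code>window=pd.Timedelta(days=6)</code> instead.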

In [5]:
# Clean the users table to prepare it for a ML model
users.set_index('user_id', inplace=True)
creation_times = pd.to_datetime(users['creation_time'])
users['creation_year'] = creation_times.dt.year
users['creation_month'] = creation_times.dt.month
users['creation_day'] = creation_times.dt.day
users['creation_hour'] = creation_times.dt.hour
drops = ['creation_time', 'name', 'email']
users.drop(drops, axis=1, inplace=True)
users = pd.get_dummies(users, prefix=['creation_source'], columns=['creation_source'], drop_first=True)
users.head()

user_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted,creation_year,creation_month,creation_day,creation_hour,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
1,1398139000.0,1,0,11,10803.0,0,2014,4,22,3,0,0,0,0
2,1396238000.0,0,0,1,316.0,1,2013,11,15,3,1,0,0,0
3,1363735000.0,0,0,94,1525.0,0,2013,3,19,23,1,0,0,0
4,1369210000.0,0,0,1,5151.0,0,2013,5,21,8,0,0,0,0
5,1358850000.0,0,0,193,5240.0,0,2013,1,17,10,0,0,0,0


In [6]:
users.isnull().sum()

last_session_creation_time            3177
opted_in_to_mailing_list                 0
enabled_for_marketing_drip               0
org_id                                   0
invited_by_user_id                    5583
adopted                                  0
creation_year                            0
creation_month                           0
creation_day                             0
creation_hour                            0
creation_source_ORG_INVITE               0
creation_source_PERSONAL_PROJECTS        0
creation_source_SIGNUP                   0
creation_source_SIGNUP_GOOGLE_AUTH       0
dtype: int64

In [7]:
# Lots of null values. Let's fill them in with a placeholder.

users.fillna(-2, inplace=True)
users.isnull().sum()

last_session_creation_time            0
opted_in_to_mailing_list              0
enabled_for_marketing_drip            0
org_id                                0
invited_by_user_id                    0
adopted                               0
creation_year                         0
creation_month                        0
creation_day                          0
creation_hour                         0
creation_source_ORG_INVITE            0
creation_source_PERSONAL_PROJECTS     0
creation_source_SIGNUP                0
creation_source_SIGNUP_GOOGLE_AUTH    0
dtype: int64

In [8]:
# Determine the proportion of users who adopted the platform
users['adopted'].value_counts()

0    10344
1     1656
Name: adopted, dtype: int64

In [9]:
# Compute the share from the column itself rather than hard-coding the counts
pct_adopted = round(users['adopted'].mean() * 100, 2)
print(f'{pct_adopted}% of users adopted the platform.')

13.8% of users adopted the platform.


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
import numpy as np

np.random.seed(0)

target = users['adopted']
features = users.drop('adopted', axis=1)

X_train, X_test, y_train, y_test = train_test_split(features, target)

In [11]:
lr = LogisticRegressionCV(solver='liblinear', cv=10)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.588

In [12]:
cols = [col for col in features.columns]
coefs = [round(coef, 2) for coef in lr.coef_[0]]
coefficients = zip(cols, coefs)

print(list(coefficients))

[('last_session_creation_time', -0.0), ('opted_in_to_mailing_list', -0.0), ('enabled_for_marketing_drip', -0.0), ('org_id', -0.0), ('invited_by_user_id', -0.0), ('creation_year', -0.0), ('creation_month', -0.0), ('creation_day', -0.0), ('creation_hour', -0.0), ('creation_source_ORG_INVITE', -0.0), ('creation_source_PERSONAL_PROJECTS', -0.0), ('creation_source_SIGNUP', -0.0), ('creation_source_SIGNUP_GOOGLE_AUTH', -0.0)]


In [13]:
from sklearn.metrics import confusion_matrix

y_pred = lr.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1764  828]
 [ 408    0]]


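<b>(Me)</b> Our first model fails to identify a single adopter. It does classify 828 test users as "adopted", but every one of them is actually a non-adopter, and all 408 true adopters are classified as non-adopters. Every coefficient also rounds to zero, so the model has learned essentially nothing useful from the features. Let's try XGB.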
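
<b>(Me)</b> As an aside before switching models: one possible reason the logistic regression coefficients all round to zero (a guess, not something verified here) is that the features sit on very different scales, with a Unix timestamp around 1.4e9 next to 0/1 flags. A common remedy is to standardize the features inside a pipeline. The sketch below uses synthetic data, not the Relax tables, and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: a timestamp-sized feature and a 0/1 flag
big = rng.normal(1.38e9, 2e7, size=n)
flag = rng.integers(0, 2, size=n)
X = np.column_stack([big, flag])
# Made-up target that depends only on the large-scale feature
y = (big > 1.39e9).astype(int)

# Standardize each column to zero mean and unit variance before fitting
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)

scaled_accuracy = scaled_lr.score(X, y)
print(scaled_accuracy)
print(scaled_lr.named_steps["logisticregression"].coef_)
```

Because the scaler lives inside the pipeline, it is fit only on whatever data is passed to <code>fit</code>, so the same object could be cross-validated without leaking test-set statistics.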

In [14]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb = XGBClassifier()

param_grid = {'max_depth': [2, 3, 4, 5],
              'learning_rate': [0.5, 0.1, 0.2],
              'n_estimators': [30, 100, 300]}

xgb_grid = GridSearchCV(xgb, param_grid=param_grid, cv=3)

xgb_grid.fit(X_train, y_train)
xgb_grid.score(X_test, y_test)

0.9646666666666667

In [15]:
xgb_grid.best_params_

{'learning_rate': 0.5, 'max_depth': 2, 'n_estimators': 300}

In [16]:
xgb = XGBClassifier(learning_rate=0.5, max_depth=2, n_estimators=300)
xgb.fit(X_train, y_train)

cols = [col for col in features.columns]
weights = [round(weight, 2) for weight in xgb.feature_importances_]
importances = zip(cols, weights)

print(list(importances))

[('last_session_creation_time', 0.29), ('opted_in_to_mailing_list', 0.03), ('enabled_for_marketing_drip', 0.03), ('org_id', 0.03), ('invited_by_user_id', 0.03), ('creation_year', 0.34), ('creation_month', 0.11), ('creation_day', 0.04), ('creation_hour', 0.02), ('creation_source_ORG_INVITE', 0.02), ('creation_source_PERSONAL_PROJECTS', 0.03), ('creation_source_SIGNUP', 0.04), ('creation_source_SIGNUP_GOOGLE_AUTH', 0.02)]


In [17]:
y_pred2 = xgb.predict(X_test)
cm2 = confusion_matrix(y_test, y_pred2)
print(cm2)

[[2564   28]
 [  78  330]]


<b>(Me)</b> The tuned XGB classifier is markedly more accurate: 96.5% on the test set versus 58.8% for logistic regression. Its feature importances say that the strongest predictors of adoption are the user's last session time and the year the account was created. Likely, the model has found that users who adopted are <i>still using the site,</i> so their last session was recent. Additionally, I would assume that creation year matters because the older an account is, the more time its owner has had to meet our three-logins-in-seven-days criterion. This leaves us with a model that classifies the current data accurately but might struggle with new data, because it leans so heavily on time-based features. The model also still has some difficulty with sensitivity: it misses 78 of the 408 adopters in the test set.
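
<b>(Me)</b> To put numbers on the sensitivity remark, the usual metrics can be read straight off the XGB confusion matrix printed above. This is plain arithmetic on those four counts; the "baseline" line is the accuracy of a hypothetical model that always predicts non-adopted.

```python
import numpy as np

# XGB confusion matrix from above: rows = actual class, columns = predicted class
cm2 = np.array([[2564, 28],
                [78, 330]])
tn, fp, fn, tp = cm2.ravel()

accuracy = (tp + tn) / cm2.sum()    # share of all test users classified correctly
recall = tp / (tp + fn)             # sensitivity: share of true adopters that were found
precision = tp / (tp + fp)          # share of predicted adopters that really adopted
baseline = (tn + fp) / cm2.sum()    # accuracy of always predicting "not adopted"

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} baseline={baseline:.3f}")
# accuracy=0.965 recall=0.809 precision=0.922 baseline=0.864
```

Since an always-negative model already reaches 86.4% accuracy on this split, recall and precision are more informative than accuracy alone for this imbalanced target.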

The next steps would be:

1. To tune it even further by creating a more sophisticated approach to missing values; perhaps imputation rather than replacement.

2. To force it to rely more on the other features, either by removing the two most important ones or by restricting the grid search to max_depth values above 2.
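
<b>(Me)</b> A rough sketch of how both steps might start, using a toy frame with the notebook's column names (the values are made up, and <code>was_invited</code> is a hypothetical new column). It records missingness as an explicit flag instead of a -2 placeholder, then drops the two dominant time-based features before any refit.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the users table (values are invented)
toy = pd.DataFrame({
    "last_session_creation_time": [1398139000.0, np.nan, 1363735000.0],
    "invited_by_user_id": [10803.0, np.nan, np.nan],
    "creation_year": [2014, 2013, 2013],
    "org_id": [11, 1, 94],
})

# Step 1: keep "was this value missing?" as its own feature
toy["was_invited"] = toy["invited_by_user_id"].notna().astype(int)

# Step 2: remove the two features the XGB model leaned on most
reduced = toy.drop(columns=["last_session_creation_time", "creation_year"])

print(reduced.columns.tolist())
print(reduced["was_invited"].tolist())
```

Whether the remaining features carry enough signal on their own is exactly what a refit on the reduced table would show.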