In [941]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

**Data**

A user table ("takehome_users") with data on 12,000 users who signed up for the
product in the last two years.

A usage summary table ("takehome_user_engagement") that has a row for each day
that a user logged into the product.

**Task**
Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.

In [942]:
# load dataset
user_df = pd.read_csv("takehome_users.csv", encoding = "ISO-8859-1")
user_engage_df = pd.read_csv("takehome_user_engagement.csv", encoding = "ISO-8859-1")      

In [943]:
user_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [944]:
user_engage_df.head(10)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
5,2013-12-31 03:45:04,2,1
6,2014-01-08 03:45:04,2,1
7,2014-02-03 03:45:04,2,1
8,2014-02-08 03:45:04,2,1
9,2014-02-09 03:45:04,2,1


In [945]:
user_df.describe()

Unnamed: 0,object_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,12000.0,8823.0,12000.0,12000.0,12000.0,6417.0
mean,6000.5,1379279000.0,0.2495,0.149333,141.884583,5962.957145
std,3464.24595,19531160.0,0.432742,0.356432,124.056723,3383.761968
min,1.0,1338452000.0,0.0,0.0,0.0,3.0
25%,3000.75,1363195000.0,0.0,0.0,29.0,3058.0
50%,6000.5,1382888000.0,0.0,0.0,108.0,5954.0
75%,9000.25,1398443000.0,0.0,0.0,238.25,8817.0
max,12000.0,1402067000.0,1.0,1.0,416.0,11999.0


In [946]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [947]:
user_engage_df.describe()

Unnamed: 0,user_id,visited
count,207917.0,207917.0
mean,5913.314197,1.0
std,3394.941674,0.0
min,1.0,1.0
25%,3087.0,1.0
50%,5682.0,1.0
75%,8944.0,1.0
max,12000.0,1.0


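The dataset contains null values (`last_session_creation_time` and `invited_by_user_id` are only partly filled). Before handling them, let's first identify the users who logged into the product on three separate days in at least one seven-day period.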

In [948]:
# rename object_id to user_id so it matches the engagement table
user_df = user_df.rename(columns={"object_id": "user_id"})

In [949]:
# left merge so that users with no engagement rows are kept
df = pd.merge(user_df, user_engage_df, on="user_id", how="left")

In [950]:
df.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,time_stamp,visited
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,2014-04-22 03:53:30,1.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-11-15 03:45:04,1.0
2,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-11-29 03:45:04,1.0
3,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-12-09 03:45:04,1.0
4,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-12-25 03:45:04,1.0


**Users who visited on at least 3 days within a week**

In [951]:
# rename
df = df.rename(columns={"creation_time": "start_day"})
# convert to datetime
df['start_day'] = pd.to_datetime(df['start_day'])
df['time_stamp'] = pd.to_datetime(df['time_stamp'])

# count visits per user per calendar week (freq="W" bins end on Sunday);
# this approximates "any seven-day period" and misses windows that straddle two weeks
freq_df = df.groupby([pd.Grouper(freq="W", key="time_stamp"), "user_id"])[["visited"]].sum()
# keep user-weeks with at least 3 visits
freq_visiter = freq_df[freq_df["visited"] >= 3]

# unique users
freq_user = freq_visiter.reset_index()["user_id"].unique()
# build a flag table to merge with the main df
freq_df = pd.DataFrame({"user_id": freq_user, "adopted_user": 1})

# merge; users without a flag are not adopted
df = pd.merge(df, freq_df, on="user_id", how="left")
df["adopted_user"] = df["adopted_user"].fillna(0).astype(int)
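The weekly grouping above bins logins by calendar week, so three logins that straddle a week boundary are not counted together. Below is a minimal sketch of a rolling check on toy data, assuming one row per login day as in `user_engage_df`. The helper name `is_adopted` and its parameters are hypothetical and not part of the notebook.

```python
import pandas as pd


def is_adopted(timestamps, window_days=7, min_days=3):
    """True if `min_days` distinct login days fall inside any `window_days`-day span.

    Hypothetical helper for illustration only.
    """
    # Reduce the logins to distinct calendar days, sorted ascending
    days = (
        pd.Series(pd.to_datetime(timestamps))
        .dt.normalize()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    if len(days) < min_days:
        return False
    # Distance between each login day and the one (min_days - 1) logins earlier
    span = days - days.shift(min_days - 1)
    # Day d and day d + 6 belong to the same seven-day period
    return bool((span <= pd.Timedelta(days=window_days - 1)).any())


toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time_stamp": [
        # Saturday, Sunday, Tuesday: within 7 days but across two calendar weeks
        "2014-02-08", "2014-02-09", "2014-02-11",
        # three logins spread over almost three weeks
        "2014-01-01", "2014-01-10", "2014-01-20",
    ],
})
adopted = toy.groupby("user_id")["time_stamp"].apply(is_adopted)
print(adopted)
```

User 1 is flagged here even though the calendar-week count would see only two visits in one week and one in the next. Applying the same helper to the full engagement table would likely change the labels and therefore the scores reported below.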

In [952]:
# keep one row per user
df = df.drop_duplicates(subset="user_id", keep="first")
# one-hot encode creation_source
create_col = pd.get_dummies(df["creation_source"])
df = df.join(create_col)

In [953]:
df = df.drop(["name", "email", "last_session_creation_time", "start_day", "time_stamp", "visited", "creation_source", "user_id"], axis=1)

In [954]:
df["invited_by_user_id"] = df["invited_by_user_id"].fillna(0)

In [955]:
df.loc[df["invited_by_user_id"] > 0, "invited_by_user_id" ] = 1
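The two cells above turn `invited_by_user_id` into a 0/1 flag meaning "was invited by another user". An equivalent one-step sketch on toy values follows; the column name `was_invited` is made up for illustration, and the equivalence assumes all real inviter ids are positive, as the `describe()` output suggests.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"invited_by_user_id": [10803.0, np.nan, 316.0, np.nan]})

# 1 if an inviter id is recorded, 0 if the field is missing
toy["was_invited"] = toy["invited_by_user_id"].notna().astype(int)
print(toy)
```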

In [956]:
y = df["adopted_user"]
x = df.drop("adopted_user", axis=1)

In [957]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x.values, y.values, test_size = 0.2, random_state=42)
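Adopted users may be a minority of the 12,000 accounts, in which case it can help to keep the class ratio identical in the train and test splits. Here is a sketch using the `stratify` argument on synthetic labels; the 15% positive rate is invented for illustration and is not taken from the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))        # synthetic features
y_toy = np.array([1] * 150 + [0] * 850)   # synthetic labels, 15% positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Both splits keep roughly the same share of positives
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to passing `stratify=y.values` to the existing call.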

In [958]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

params = {
    "min_child_weight": [1, 3],
    "gamma": [0.5, 1, 3],
    "learning_rate": [1, 0.1],
}

xgb = XGBClassifier()

xgb = GridSearchCV(estimator=xgb, param_grid=params, cv=5,verbose=1)
xgb.fit(X_train, Y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   28.7s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'gamma': [0.5, 1, 3], 'learning_rate': [1, 0.1],
                         'min_child_weight': [1, 3]},
             pre_dispatch='2*n_jobs', refit=True,

In [959]:
print("best parameter is {}".format(xgb.best_params_))
# best_score_ is the mean cross-validated accuracy over the 5 training folds
print("accuracy is {}".format(xgb.best_score_))

best parameter is {'gamma': 0.5, 'learning_rate': 0.1, 'min_child_weight': 1}
accuracy is 0.8798958333333333
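An accuracy figure is hard to read without knowing how often the majority class occurs. The sketch below computes a majority-class baseline with scikit-learn's `DummyClassifier` on made-up labels; the 85/15 split is an assumption for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 85 + [1] * 15)   # made-up labels
X_dummy = np.zeros((len(y_true), 1))     # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))
print(baseline_acc)
```

In the notebook, the same baseline could be fitted on `Y_train` and compared against `xgb.score(X_test, Y_test)`, which would also give a held-out estimate that the cross-validation score alone does not provide.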


In [960]:
# use best param
best_xgb = XGBClassifier(gamma=0.5, learning_rate=0.1, min_child_weight=1)
best_xgb.fit(X_train, Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0.5,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [961]:
best_xgb.feature_importances_

array([0.04698585, 0.08175155, 0.14822811, 0.0988791 , 0.08209372,
       0.15802705, 0.2214874 , 0.06558742, 0.09695983], dtype=float32)

In [962]:
pd.DataFrame({"feature": x.columns, "importance": best_xgb.feature_importances_}).sort_values("importance", ascending=False)

Unnamed: 0,feature,importance
6,PERSONAL_PROJECTS,0.221487
5,ORG_INVITE,0.158027
2,org_id,0.148228
3,invited_by_user_id,0.098879
8,SIGNUP_GOOGLE_AUTH,0.09696
4,GUEST_INVITE,0.082094
1,enabled_for_marketing_drip,0.081752
7,SIGNUP,0.065587
0,opted_in_to_mailing_list,0.046986


# Conclusion

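The feature importances show which inputs the model relies on most for prediction.
How the account was created matters most: the `PERSONAL_PROJECTS` and `ORG_INVITE` creation sources rank first and
second (0.22 and 0.16), followed by `org_id` (0.15). Whether a user opted in to receiving marketing emails has the
lowest importance (0.05), so it appears to be only weakly related to adoption.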
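Feature importances rank how much the model uses each column but not the direction of the effect. One way to check the direction is to compare adoption rates per creation source. This is a sketch on made-up rows: the column names follow the notebook, while the values are illustrative and say nothing about the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "creation_source": [
        "PERSONAL_PROJECTS", "PERSONAL_PROJECTS",
        "ORG_INVITE", "ORG_INVITE",
        "GUEST_INVITE", "GUEST_INVITE",
    ],
    "adopted_user": [0, 0, 1, 0, 1, 1],   # made-up labels
})

# Share of adopted users and group size for each creation source
rates = (
    toy.groupby("creation_source")["adopted_user"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(rates)
```

On the real data this would need to run before the cell that drops `creation_source`.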