# Relax Take-home Challenge

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less.

Please send us a brief writeup of your findings (the more concise, the better; no more
than one page), along with any summary tables, graphs, code, or queries that can help
us understand your approach. Please note any factors you considered or investigation
you did, even if they did not pan out. Feel free to identify any further research or data
you think would be valuable.

In [295]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [296]:
df_users = pd.read_csv('./data/takehome_users.csv', parse_dates=['creation_time', 'last_session_creation_time'], encoding='iso-8859-1')
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   object_id                   12000 non-null  int64         
 1   creation_time               12000 non-null  datetime64[ns]
 2   name                        12000 non-null  object        
 3   email                       12000 non-null  object        
 4   creation_source             12000 non-null  object        
 5   last_session_creation_time  8823 non-null   object        
 6   opted_in_to_mailing_list    12000 non-null  int64         
 7   enabled_for_marketing_drip  12000 non-null  int64         
 8   org_id                      12000 non-null  int64         
 9   invited_by_user_id          6417 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(4), object(4)
memory usage: 937.6+ KB


## Users Data Cleaning

In [297]:
# Convert `last_session_creation_time` using UNIX time with seconds
df_users['last_session_creation_time'] = pd.to_datetime(df_users['last_session_creation_time'], unit='s')
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   object_id                   12000 non-null  int64         
 1   creation_time               12000 non-null  datetime64[ns]
 2   name                        12000 non-null  object        
 3   email                       12000 non-null  object        
 4   creation_source             12000 non-null  object        
 5   last_session_creation_time  8823 non-null   datetime64[ns]
 6   opted_in_to_mailing_list    12000 non-null  int64         
 7   enabled_for_marketing_drip  12000 non-null  int64         
 8   org_id                      12000 non-null  int64         
 9   invited_by_user_id          6417 non-null   float64       
dtypes: datetime64[ns](2), float64(1), int64(4), object(3)
memory usage: 937.6+ KB


In [298]:
df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0


In [299]:
df_users.isna().sum()

object_id                        0
creation_time                    0
name                             0
email                            0
creation_source                  0
last_session_creation_time    3177
opted_in_to_mailing_list         0
enabled_for_marketing_drip       0
org_id                           0
invited_by_user_id            5583
dtype: int64

In [300]:
# Fill missing `invited_by_user_id` with 0, which here means "no invite"
df_users['invited_by_user_id'] = df_users['invited_by_user_id'].fillna(0)
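
Filling with 0 keeps the column numeric, but the inviter's ID itself is an arbitrary number. A minimal sketch of an alternative, assuming a separate binary flag is acceptable: record whether the user was invited at all before filling. The `demo_users` frame and the `was_invited` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the users table
demo_users = pd.DataFrame({
    "object_id": [1, 2, 3],
    "invited_by_user_id": [10803.0, None, 316.0],
})

# 1 if the user was invited by someone, 0 otherwise
demo_users["was_invited"] = demo_users["invited_by_user_id"].notna().astype(int)

# Fill afterwards so the flag is derived from the original missing values
demo_users["invited_by_user_id"] = demo_users["invited_by_user_id"].fillna(0).astype(int)

print(demo_users)
```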

In [301]:
# Fill missing `last_session_creation_time` with 0 (users with no recorded session);
# this column is dropped before modeling
df_users['last_session_creation_time'] = df_users['last_session_creation_time'].fillna(0)

## EDA with User Engagement

In [302]:
df_engagement = pd.read_csv('./data/takehome_user_engagement.csv', parse_dates=['time_stamp'], index_col='time_stamp')
df_engagement.head()

time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1


In [303]:
# Add `week` column (ISO week number only, so the same week number in
# different years falls into the same group)
df_engagement['week'] = df_engagement.index.isocalendar().week

In [304]:
# Create a new DataFrame with logins aggregated by user and week
df_logins_per_week = df_engagement.groupby(['user_id', 'week'])['visited'].sum().reset_index()

In [305]:
# Grab all unique user ids with at least 3 logins in a calendar week
# (an approximation of "three separate days in a seven-day period")
adopted_user_ids = df_logins_per_week[df_logins_per_week.visited >= 3].user_id.unique()
print(f"There are {len(adopted_user_ids)} adopted users.")

There are 1445 adopted users.


In [306]:
# Add `adopted` column to Users table
df_users['adopted'] = 0

In [307]:
# Update `adopted` column for adopted users
df_users.loc[df_users['object_id'].isin(adopted_user_ids), 'adopted'] = 1

In [308]:
# Inspect
df_users[df_users['adopted'] == 1]

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0,1
9,10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,2014-06-03 22:08:03,1,1,318,4143.0,1
19,20,2014-03-06 11:46:38,Helms Mikayla,lqyvjilf@uhzdq.com,SIGNUP,2014-05-29 11:46:38,0,0,58,0.0,1
32,33,2014-03-11 06:29:09,Araujo José,JoseMartinsAraujo@cuvox.de,GUEST_INVITE,2014-05-31 06:29:09,0,0,401,79.0,1
41,42,2012-11-11 19:05:07,Pinto Giovanna,GiovannaCunhaPinto@cuvox.de,SIGNUP,2014-05-25 19:05:07,1,0,235,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...
11964,11965,2014-04-25 07:17:35,Storey Lewis,LewisStorey@cuvox.de,GUEST_INVITE,2014-05-21 07:17:35,0,0,65,11251.0,1
11966,11967,2014-01-12 08:12:37,Barbosa Pedro,PedroFernandesBarbosa@gmail.com,GUEST_INVITE,2014-05-31 08:12:37,0,0,15,5688.0,1
11968,11969,2013-06-01 00:48:14,Dickinson Aidan,AidanDickinson@hotmail.com,GUEST_INVITE,2014-05-30 00:48:14,1,1,52,6647.0,1
11974,11975,2013-03-23 11:10:11,Daecher Jürgen,JurgenDaecher@gustr.com,GUEST_INVITE,2014-05-22 11:10:11,1,0,31,6410.0,1


In [310]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_users['creation_source_label'] = label_encoder.fit_transform(df_users.creation_source) 
df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted,creation_source_label
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0,0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0,1,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0,0,1
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0,0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0,0,0
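
`LabelEncoder` maps each source to an integer, which implies an ordering between categories that has no real meaning. A minimal sketch of one-hot encoding as an alternative, using a hypothetical `demo_sources` frame with three of the source values seen above:

```python
import pandas as pd

# Hypothetical sample of the `creation_source` column
demo_sources = pd.DataFrame({
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "SIGNUP", "ORG_INVITE"],
})

# One 0/1 column per category instead of a single ordinal label
source_dummies = pd.get_dummies(demo_sources["creation_source"], prefix="source", dtype=int)

print(source_dummies)
```

Tree-based models can often cope with label-encoded categories, so this is a judgment call rather than a required change.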


In [311]:
# Date features for `creation_time`
df_users['creation_time_hour'] = df_users['creation_time'].dt.hour
df_users['creation_time_day'] = df_users['creation_time'].dt.day
df_users['creation_time_month'] = df_users['creation_time'].dt.month
df_users['creation_time_year'] = df_users['creation_time'].dt.year
df_users['creation_time_dayofweek'] = df_users['creation_time'].dt.dayofweek
df_users['creation_time_weekofyear'] = df_users['creation_time'].dt.isocalendar().week

In [314]:
# Features (note: `object_id` is kept here even though it is an identifier,
# not a behavioral signal)
X = df_users.drop(['name', 'email', 'creation_source', 'creation_time', 'last_session_creation_time', 'adopted'], axis=1)
y = df_users.adopted

In [315]:
# Split into training and validation sets
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(y_train), len(y_val))

9000 3000


In [316]:
# Metrics
from sklearn.metrics import f1_score, accuracy_score

In [320]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

# fit the data
model.fit(X_train, y_train)

# Get predictions
y_preds = model.predict(X_val)

# Get score
print(f"Accuracy: {accuracy_score(y_val, y_preds)}")
print(f"F1 Score: {f1_score(y_val, y_preds)}")

Accuracy: 0.8773333333333333
F1 Score: 0.0
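
An F1 score of 0.0 means the model produced no true positives on the validation set, so the accuracy above comes entirely from the majority (non-adopted) class. A minimal sketch of two common mitigations, a stratified split and `class_weight="balanced"`, shown on synthetic data from `make_classification` rather than the real users table. All `demo_*` and `*_tr` / `*_va` names are hypothetical, and whether this helps on the actual data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 12% positives); an assumption for illustration
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.88, 0.12], random_state=42
)

# `stratify` keeps the class ratio the same in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# `class_weight="balanced"` reweights samples inversely to class frequency
balanced_model = RandomForestClassifier(class_weight="balanced", random_state=42)
balanced_model.fit(X_tr, y_tr)

demo_f1 = f1_score(y_va, balanced_model.predict(X_va))
print(f"F1 on synthetic validation data: {demo_f1:.3f}")
```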


In [321]:
print("Feature Importance")
for idx, col in enumerate(X.columns):
    print(f"{col}: {model.feature_importances_[idx]:.4f}")

Feature Importance
object_id: 0.1739
opted_in_to_mailing_list: 0.0211
enabled_for_marketing_drip: 0.0155
org_id: 0.1668
invited_by_user_id: 0.0972
creation_source_label: 0.0439
creation_time_hour: 0.1206
creation_time_day: 0.1143
creation_time_month: 0.0519
creation_time_year: 0.0270
creation_time_dayofweek: 0.0733
creation_time_weekofyear: 0.0945
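
Impurity-based importances from a random forest tend to favor features with many distinct values, which may be why `object_id` and `org_id` rank highest above. A minimal sketch of permutation importance on held-out data as a cross-check, using synthetic data in which `row_id` is a pure identifier and `signal` drives the target (both names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
demo_X = pd.DataFrame({
    "row_id": np.arange(n),              # identifier, unrelated to the target
    "signal": rng.normal(size=n),        # the only feature that drives the target
    "flag": rng.integers(0, 2, size=n),  # binary noise
})
demo_y = (demo_X["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(demo_X, demo_y, test_size=0.25, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set and measure the drop in score
perm = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=0)
perm_table = pd.Series(perm.importances_mean, index=demo_X.columns).sort_values(ascending=False)
print(perm_table)
```

On the real data the same call would take `model`, `X_val`, and `y_val`, although with an F1 of 0.0 neither kind of importance says much about adoption.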


In [143]:
# WRONG PATHS

# Resampling wasn't aggregating things in the right way
# user_weekly_visits = df_engagement.groupby(['user_id'])['visited'].resample('W').count()
# user_weekly_visits = user_weekly_visits.reset_index()

# Grouping by day and then using a rolling window turned out to be too complicated, though possibly more accurate
# df_user_visits = df_engagement.groupby(['user_id', 'date'])['visited'].sum().reset_index()
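
The rolling-window idea noted above can be sketched without `rolling` at all: keep one row per user per day, then compare each login day with the same user's login day two rows later. This is a minimal sketch, not the approach used in this notebook. The function `find_adopted_users` is a hypothetical helper, and it assumes `time_stamp` is a regular column rather than the index.

```python
import pandas as pd

def find_adopted_users(engagement: pd.DataFrame) -> set:
    """Return user_ids with logins on 3 distinct days inside some 7-day window."""
    # Keep one row per user per calendar day
    days = (
        engagement.assign(date=engagement["time_stamp"].dt.normalize())
        .drop_duplicates(["user_id", "date"])
        .sort_values(["user_id", "date"])
        .reset_index(drop=True)
    )
    # Gap between each login day and the same user's login day two rows later
    span = days.groupby("user_id")["date"].shift(-2) - days["date"]
    # Three distinct days fit in a 7-day window when the first and third
    # are at most 6 days apart
    return set(days.loc[span <= pd.Timedelta(days=6), "user_id"])

demo = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-01-01 08:00", "2014-01-03 09:00", "2014-01-07 10:00",  # user 1: 3 days within 7
        "2014-01-01 08:00", "2014-01-01 18:00", "2014-01-02 08:00",  # user 2: only 2 distinct days
        "2013-12-30 08:00", "2014-01-05 08:00", "2014-01-12 08:00",  # user 3: days too far apart
    ]),
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})
adopted = find_adopted_users(demo)
print(adopted)
```

With the engagement table loaded as in this notebook, the call would be `find_adopted_users(df_engagement.reset_index())`. The resulting count may differ from the calendar-week figure reported earlier.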