### Relax Challenge

The data is available as two attached CSV files:

takehome_user_engagement.csv

takehome_users.csv

The data has the following two tables:

1] A user table ("takehome_users") with data on 12,000 users who signed up for the
product in the last two years. This table includes:

● name: the user's name

● object_id: the user's id

● email: email address

● creation_source: how their account was created. This takes on one of 5 values:

○ PERSONAL_PROJECTS: invited to join another user's personal workspace

○ GUEST_INVITE: invited to an organization as a guest (limited permissions)

○ ORG_INVITE: invited to an organization (as a full member)

○ SIGNUP: signed up via the website

○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

● creation_time: when they created their account

● last_session_creation_time: unix timestamp of last login

● opted_in_to_mailing_list: whether they have opted into receiving
marketing emails

● enabled_for_marketing_drip: whether they are on the regular
marketing email drip

● org_id: the organization (group of users) they belong to

● invited_by_user_id: which user invited them to join (if applicable).

2] A usage summary table ("takehome_user_engagement") that has a row for each day
that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.

We suggest spending 1-2 hours on this, but you're welcome to spend more or less.

Please send us a brief writeup of your findings (the more concise, the better; no more
than one page), along with any summary tables, graphs, code, or queries that can help
us understand your approach.


Please note any factors you considered or investigation
you did, even if they did not pan out. Feel free to identify any further research or data
you think would be valuable.

In [137]:
#importing the modules
import datetime
from datetime import timedelta

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, classification_report, confusion_matrix,
                             log_loss, precision_recall_curve, roc_curve)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

In [29]:
#importing the files
user_eng = pd.read_csv('takehome_user_engagement.csv',parse_dates=True)
take_h_users= pd.read_csv('takehome_users.csv',encoding='latin-1',parse_dates=True)

In [15]:
user_eng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [17]:
user_eng.head(5)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [16]:
take_h_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 9 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 843.8+ KB


In [18]:
take_h_users.head(5)

Unnamed: 0,object_id,creation_time,name,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,4/22/14 3:53,Clausen August,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,11/15/13 3:45,Poole Matthew,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,3/19/13 23:14,Bottrill Mitchell,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,5/21/13 8:09,Clausen Nicklas,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,1/17/13 10:14,Raw Grace,GUEST_INVITE,1358850000.0,0,0,193,5240.0
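
The preview above shows `creation_time` as strings such as `4/22/14 3:53`, while `last_session_creation_time` holds Unix timestamps. A minimal sketch of converting both to pandas datetimes, assuming the strings follow a month/day/two-digit-year layout and the Unix values are in seconds (the two-row `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical two-row sample mimicking the takehome_users columns
sample = pd.DataFrame({
    "creation_time": ["4/22/14 3:53", "11/15/13 3:45"],
    "last_session_creation_time": [1398139000.0, None],
})

# Assumed format: month/day/two-digit year, 24-hour clock
sample["creation_time"] = pd.to_datetime(
    sample["creation_time"], format="%m/%d/%y %H:%M"
)

# Unix timestamps are assumed to be in seconds; missing values become NaT
sample["last_session_creation_time"] = pd.to_datetime(
    sample["last_session_creation_time"], unit="s"
)
```

With both columns as datetimes, quantities such as the time between signup and last login could be derived as candidate features.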


In [14]:
#defining adopted users

#An adopted user is defined as a user who has logged into the product on three separate days in at least one seven-day period

In [30]:
#convert time_stamp to datetime (setting it as the index is left disabled below)
user_eng.time_stamp = pd.to_datetime(user_eng.time_stamp)
#user_eng = user_eng.set_index('time_stamp', drop= True)

In [33]:
#check for null values
#user_eng = user_eng.resample('D').mean().dropna()

In [39]:
#Start with every user marked as not adopted
adopted_dict = {user_id: False for user_id in take_h_users['object_id']}

for user_id, group in user_eng.groupby('user_id'):

    #Sort this user's login timestamps chronologically
    user_times = group['time_stamp'].sort_values().reset_index(drop=True)
    no_visit = len(user_times)

    #If there are fewer than 3 engagements, the user does not qualify
    if no_visit < 3:
        continue

    #Iterate over the engagement timestamps, stopping while two later ones remain
    for i in range(no_visit - 2):

        #Define the seven-day window that starts at this login
        start = user_times[i]
        end = start + pd.Timedelta('7D')

        #The timestamps are sorted, so if the login two rows ahead falls inside
        #the window, the one in between does too
        if user_times[i + 2] < end:
            adopted_dict[user_id] = True
            break
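
The loop above can also be written without explicit iteration. The following is only a sketch of one vectorized alternative, not the method used in this notebook: it assumes logins are first reduced to distinct calendar days, so its counts may differ slightly from the timestamp-based loop. The function name `find_adopted_users` and the `demo` frame are hypothetical.

```python
import pandas as pd

def find_adopted_users(engagement: pd.DataFrame) -> pd.Series:
    """Return a boolean Series indexed by user_id: True if the user logged in
    on 3 distinct calendar days within some 7-day window."""
    days = (
        engagement.assign(day=pd.to_datetime(engagement["time_stamp"]).dt.normalize())
        .drop_duplicates(["user_id", "day"])
        .sort_values(["user_id", "day"])
    )
    # Time between each login day and the same user's login day two rows later
    span = days.groupby("user_id")["day"].shift(-2) - days["day"]
    # Missing spans (fewer than two later login days) compare as False
    within_week = span < pd.Timedelta(days=7)
    return within_week.groupby(days["user_id"]).any()

# Hypothetical data: user 1 qualifies, user 2 is too spread out,
# user 3 has three logins but only two distinct days
demo = pd.DataFrame({
    "time_stamp": ["2014-01-01 08:00", "2014-01-03 09:00", "2014-01-06 10:00",
                   "2014-01-01 08:00", "2014-01-05 08:00", "2014-01-09 08:00",
                   "2014-01-01 08:00", "2014-01-01 20:00", "2014-01-02 08:00"],
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})
adopted = find_adopted_users(demo)
```

Shifting within each user's group compares every login day with the one two positions later, which is enough because the days are sorted.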
            


In [44]:
sum(adopted_dict.values())

1602

There are 1602 adopted users.

#### Feature Selection
Let's see which features determine whether a user will be adopted or not.


In [66]:
#adopted_dict

In [63]:
#adopted_df = pd.DataFrame(adopted_dict.items(), columns=['user_id', 'adopted'])
#data = []
#for row in adopted_dict:
#data.append({('value'): row["user_id"], 'key': row["adopted"]})

In [65]:
adopted_df = pd.DataFrame.from_dict(adopted_dict,orient='index')

In [71]:
adopted_df['user_id'] = adopted_df.index

In [76]:
adopted_df.rename(columns={0: 'adopted'}, inplace=True)

In [80]:
adopted_df = adopted_df*1

In [81]:
adopted_df.head(5)

Unnamed: 0,adopted,user_id
1,0,1
2,1,2
3,0,3
4,0,4
5,0,5
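
The four cells above turn the dictionary into a two-column frame step by step. As a sketch, the same shape can be reached in one chain through a `pd.Series`; the small `adopted_dict` here is a hypothetical stand-in for the real one.

```python
import pandas as pd

# Hypothetical stand-in for the adopted_dict built earlier
adopted_dict = {1: False, 2: True, 3: False}

adopted_df = (
    pd.Series(adopted_dict, name="adopted")
    .astype(int)              # True/False -> 1/0
    .rename_axis("user_id")   # name the index so reset_index labels the column
    .reset_index()
)
```

This keeps `adopted` as an integer column from the start rather than relying on multiplying the whole frame by 1.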


In [82]:
#Merge the adopted user info to users dataframe
users_all = pd.merge(adopted_df, take_h_users, left_on='user_id',right_on='object_id', how='inner')

In [83]:
users_all.head(5)

Unnamed: 0,adopted,user_id,object_id,creation_time,name,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,0,1,1,4/22/14 3:53,Clausen August,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,1,2,2,11/15/13 3:45,Poole Matthew,ORG_INVITE,1396238000.0,0,0,1,316.0
2,0,3,3,3/19/13 23:14,Bottrill Mitchell,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,0,4,4,5/21/13 8:09,Clausen Nicklas,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,0,5,5,1/17/13 10:14,Raw Grace,GUEST_INVITE,1358850000.0,0,0,193,5240.0
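
When joining labels onto the user table, it can help to have pandas check the key relationship and report unmatched rows. A small sketch using the `validate` and `indicator` options of `DataFrame.merge`, on hypothetical toy frames (`users`, `labels`):

```python
import pandas as pd

# Hypothetical toy frames standing in for take_h_users and adopted_df
users = pd.DataFrame({
    "object_id": [1, 2, 3],
    "creation_source": ["SIGNUP", "ORG_INVITE", "SIGNUP"],
})
labels = pd.DataFrame({"user_id": [1, 2], "adopted": [0, 1]})

merged = users.merge(
    labels,
    left_on="object_id",
    right_on="user_id",
    how="left",               # keep every user, even without a label row
    validate="one_to_one",    # raise if either key column has duplicates
    indicator=True,           # adds a _merge column showing each row's origin
)
# Users with no label row are treated as not adopted (an assumption)
merged["adopted"] = merged["adopted"].fillna(0).astype(int)
```

The `_merge` column makes it easy to count how many users had no matching label row.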


In [100]:
users_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4776 entries, 0 to 11997
Data columns (total 15 columns):
adopted                               4776 non-null object
user_id                               4776 non-null int64
object_id                             4776 non-null int64
creation_time                         4776 non-null object
name                                  4776 non-null object
last_session_creation_time            4776 non-null float64
opted_in_to_mailing_list              4776 non-null int64
enabled_for_marketing_drip            4776 non-null int64
org_id                                4776 non-null int64
invited_by_user_id                    4776 non-null float64
creation_source_GUEST_INVITE          4776 non-null uint8
creation_source_ORG_INVITE            4776 non-null uint8
creation_source_PERSONAL_PROJECTS     4776 non-null uint8
creation_source_SIGNUP                4776 non-null uint8
creation_source_SIGNUP_GOOGLE_AUTH    4776 non-null uint8
dtypes: float64

In [86]:
users_all = pd.get_dummies(users_all, columns = ['creation_source'])

In [99]:
users_all= users_all.dropna()

In [103]:
#after dropna() no nulls remain, so this mask leaves users_all unchanged
users_all = users_all[~pd.isnull(users_all)]
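
Note that `dropna()` removes every row with a missing value in any column. According to the `info()` outputs, `invited_by_user_id` is populated for 6417 of the 12000 users and `last_session_creation_time` for 8823, and 4776 rows remain afterwards, so users who were never invited are no longer in the data. One possible alternative, sketched here on a hypothetical toy frame, is to replace those columns with indicator flags instead of dropping rows; the names `was_invited` and `has_logged_in` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kind of gaps as takehome_users
toy = pd.DataFrame({
    "creation_source": ["SIGNUP", "ORG_INVITE", "GUEST_INVITE", "SIGNUP_GOOGLE_AUTH"],
    "invited_by_user_id": [np.nan, 316.0, 10803.0, np.nan],
    "last_session_creation_time": [1398139000.0, np.nan, 1396238000.0, 1363735000.0],
})

# Encode "missing" as information instead of discarding the row
prepared = toy.assign(
    was_invited=toy["invited_by_user_id"].notna().astype(int),
    has_logged_in=toy["last_session_creation_time"].notna().astype(int),
).drop(columns=["invited_by_user_id", "last_session_creation_time"])
```

On this toy frame all four rows survive, whereas `toy.dropna()` would keep only one.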

In [105]:
users_all.head(100)

Unnamed: 0,adopted,user_id,object_id,creation_time,name,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,0,1,1,4/22/14 3:53,Clausen August,1.398139e+09,1,0,11,10803.0,1,0,0,0,0
1,1,2,2,11/15/13 3:45,Poole Matthew,1.396238e+09,0,0,1,316.0,0,1,0,0,0
2,0,3,3,3/19/13 23:14,Bottrill Mitchell,1.363735e+09,0,0,94,1525.0,0,1,0,0,0
3,0,4,4,5/21/13 8:09,Clausen Nicklas,1.369210e+09,0,0,1,5151.0,1,0,0,0,0
4,0,5,5,1/17/13 10:14,Raw Grace,1.358850e+09,0,0,193,5240.0,1,0,0,0,0
5,0,6,6,12/17/13 3:37,Cunha Eduardo,1.387424e+09,0,0,197,11241.0,1,0,0,0,0
9,1,10,10,1/16/13 22:08,Santos Carla,1.401833e+09,1,1,318,4143.0,0,1,0,0,0
12,0,13,13,3/30/14 16:19,Fry Alexander,1.396196e+09,0,0,254,11204.0,0,1,0,0,0
16,0,17,17,4/9/14 14:39,Reynolds Anthony,1.397314e+09,1,0,175,1600.0,1,0,0,0,0
21,0,22,22,2/10/14 6:00,Myers Jordan,1.392012e+09,0,0,7,2994.0,0,1,0,0,0


In [None]:
#building a model that will tell which features play a role in a user becoming an adopted user.

In [106]:
cols=[
 'user_id',
 'opted_in_to_mailing_list',
 'enabled_for_marketing_drip',
 'org_id',                
 'invited_by_user_id',        
 'creation_source_GUEST_INVITE',            
 'creation_source_ORG_INVITE',              
 'creation_source_PERSONAL_PROJECTS',                 
 'creation_source_SIGNUP',  
 'creation_source_SIGNUP_GOOGLE_AUTH'    
]

In [110]:
users_all['adopted'].unique()

array([0, 1], dtype=object)

In [144]:
users_all['test_col']= np.random.choice([0, 1], users_all.shape[0])

In [145]:
users_all['test_col'].unique()

array([0, 1])

In [146]:
X_model = users_all[cols]
#note: the target used here is the random test_col, not users_all['adopted']
y_model = users_all.pop('test_col')

In [None]:
y_model

In [123]:
type(y_model)

pandas.core.series.Series

In [131]:
y_model=y_model.astype('int')

In [147]:
X_train, X_test, y_train, y_test = train_test_split(X_model,y_model, test_size=0.3, random_state=123)
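
The split above uses `test_col`, which was drawn at random, as the target. To study adoption itself, the `adopted` column would be the natural target. A minimal sketch of such a split on a hypothetical toy frame, with `stratify` added as an assumption because adopted users are a minority:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 20 users, 5 of them adopted
toy = pd.DataFrame({
    "opted_in_to_mailing_list": [0, 1] * 10,
    "org_id": range(20),
    "adopted": [1] * 5 + [0] * 15,
})

X = toy.drop(columns="adopted")
y = toy["adopted"].astype(int)

# stratify=y keeps the adopted share the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)
```

Stratifying keeps a few adopted users in the test split, which matters when precision is the metric being tuned.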

In [113]:
X_test.head(5)

Unnamed: 0,user_id,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
6311,6312,0,0,81,6312.0,1,0,0,0,0
7345,7346,0,0,2,3153.0,1,0,0,0,0
637,638,1,0,174,9607.0,0,1,0,0,0
8585,8586,0,0,377,5008.0,0,1,0,0,0
726,727,0,0,3,10435.0,0,1,0,0,0


In [129]:
y_train.unique()

array([0, 1])

In [148]:
gbm = GradientBoostingClassifier(max_features='sqrt',
                                 n_estimators=50,learning_rate=.05,max_depth= 3)
gbm.fit(X_train,y_train)
feature_coef = pd.DataFrame(gbm.feature_importances_).transpose()
feature_coef.columns = list(X_train.columns)
feature_coef.index = ['GBM'] 
feature_coef.transpose().sort_values(by='GBM',ascending=False)

Unnamed: 0,GBM
invited_by_user_id,0.401338
user_id,0.335193
org_id,0.187394
opted_in_to_mailing_list,0.033648
creation_source_GUEST_INVITE,0.023461
enabled_for_marketing_drip,0.013387
creation_source_ORG_INVITE,0.005579
creation_source_PERSONAL_PROJECTS,0.0
creation_source_SIGNUP,0.0
creation_source_SIGNUP_GOOGLE_AUTH,0.0
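
The importance table can be built more directly as a named `pd.Series`. The sketch below uses hypothetical synthetic data in which only the `signal` column determines the label, to show what the ranking looks like when a feature carries real information. Impurity-based importances such as `feature_importances_` are known to favor high-cardinality columns, so identifier-like features deserve extra scrutiny.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical synthetic data: only "signal" is related to the label
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = (X["signal"] > 0).astype(int)

gbm = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.05, max_depth=3, random_state=0
)
gbm.fit(X, y)

# One importance per column, labeled by feature name and sorted
importances = pd.Series(gbm.feature_importances_, index=X.columns, name="GBM")
importances = importances.sort_values(ascending=False)
```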


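From the table above, the features with the highest importance are invited_by_user_id (0.40), user_id (0.34) and org_id (0.19); opted_in_to_mailing_list (0.03) and creation_source_GUEST_INVITE (0.02) follow far behind. Keep in mind that the target used for this fit is test_col rather than adopted.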

In [149]:
graboost = GradientBoostingClassifier(max_features='sqrt')
parametergra = {'n_estimators':[40,45,50,55,60],'learning_rate':[.01,0.02,0.03,0.04,0.05],
              'max_depth':[1,2,3,4,5]}
grid1 = GridSearchCV(estimator=graboost,param_grid=parametergra,
                     scoring='precision',cv=5)
grid1.fit(X_train,y_train)
best_para_gra = grid1.best_params_
best_acc_gra = grid1.best_score_

print('The Tuned Parameters :\n',best_para_gra,'\nAchieved %s Percent Precision' %(best_acc_gra*100))

The Tuned Parameters :
 {'learning_rate': 0.04, 'max_depth': 1, 'n_estimators': 60} 
Achieved 51.835809901041486 Percent Precision
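
The score reported above is the cross-validated precision on the training split. A sketch of also checking the tuned model on the held-out split, using hypothetical synthetic data and a reduced parameter grid to keep it quick:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical synthetic data: only "signal" is related to the label
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=400), "noise": rng.normal(size=400)})
y = (X["signal"] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

grid = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [40, 60], "max_depth": [1, 2]},
    scoring="precision",
    cv=5,
)
grid.fit(X_train, y_train)

# best_estimator_ is refit on the full training split (refit=True by default)
test_precision = precision_score(y_test, grid.best_estimator_.predict(X_test))
```

Comparing `grid.best_score_` with `test_precision` gives a rough sense of whether the tuned model generalizes beyond the folds it was selected on.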


##### Conclusion
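Based on the feature importances, who invites the user (invited_by_user_id) appears to play the largest role in determining whether the user becomes adopted. Being created through a guest invite (creation_source_GUEST_INVITE) also shows up, though with a much smaller importance (about 0.02).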