## Relax_Inc Analysis 
by Ben Bellman for Springboard

## Instructions:

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.

In [1]:
## Import appropriate packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [2]:
## Load our data into dataframes
users = pd.read_csv('takehome_users.csv', encoding='latin-1')
engagement = pd.read_csv('takehome_user_engagement.csv', encoding='latin-1')

In [3]:
## Preview users
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
## We rename object_id to be the user_id
users = users.rename(columns={'object_id':'user_id'})

In [5]:
## Preview Engagement
engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [6]:
## Let's check the info for engagement. 
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [7]:
## we convert the time_stamp to datetime.
engagement.time_stamp = pd.to_datetime(engagement.time_stamp)

In [8]:
## We set time_stamp as the index (time-based rolling windows need a datetime index) and preview the result

df = engagement.set_index('time_stamp')
df.head()

time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1


In [9]:
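## We use pandas' time-based rolling window to count each user's logins in a trailing 7-day window.
## Note: this counts login rows, so it matches the "three separate days" definition only if a user has at most one row per calendar day.
## Reference: https://stackoverflow.com/questions/62369235/using-pandas-to-count-user-orders-that-happen-within-the-hour-from-start-time-wi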
weekly = df.groupby('user_id').visited.rolling('7D').count()
weekly.head()

user_id  time_stamp         
1        2014-04-22 03:53:30    1.0
2        2013-11-15 03:45:04    1.0
         2013-11-29 03:45:04    1.0
         2013-12-09 03:45:04    1.0
         2013-12-25 03:45:04    1.0
Name: visited, dtype: float64
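
The rolling count above works on login rows. If a user can appear more than once on the same calendar day (we have not verified this for the real data), counting distinct days is closer to the "three separate days" definition. A minimal sketch on made-up data, where `toy`, `daily` and `max_days` are hypothetical names:

```python
import pandas as pd

# Toy engagement log (hypothetical data, not taken from the real CSV)
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 2],
    'time_stamp': pd.to_datetime([
        '2014-01-01 08:00', '2014-01-01 20:00',  # user 1 logs in twice on Jan 1
        '2014-01-02 09:00', '2014-01-20 09:00',
        '2014-03-01 10:00', '2014-03-03 10:00', '2014-03-06 10:00',
    ]),
    'visited': 1,
})

# Collapse each timestamp to its calendar day and keep one row per user per day
daily = (toy.assign(day=toy['time_stamp'].dt.normalize())
            .drop_duplicates(['user_id', 'day'])
            .sort_values(['user_id', 'day'])
            .set_index('day'))

# Count distinct login days in a trailing 7-day window, then take each user's maximum
max_days = (daily.groupby('user_id')['visited']
                 .rolling('7D').count()
                 .groupby(level='user_id').max())
print(max_days)
```

In this toy log, user 1 has three login rows within a week but only two distinct days, so a row-based count would label that user as adopted while the day-based count would not.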

In [10]:
## We take each user's maximum number of logins in any 7-day window.
logins = weekly.groupby('user_id').max()
logins.head()

user_id
1    1.0
2    3.0
3    1.0
4    1.0
5    1.0
Name: visited, dtype: float64

In [11]:
## We convert the result to a dataframe and name the column max_one_week_visits
logins = pd.DataFrame(logins).reset_index().rename(columns={'visited':'max_one_week_visits'})
logins.head()

Unnamed: 0,user_id,max_one_week_visits
0,1,1.0
1,2,3.0
2,3,1.0
3,4,1.0
4,5,1.0


In [12]:
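## We merge our logins dataset with the users dataset on user_id.
## pd.merge defaults to an inner join, so users with no engagement records are dropped here.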
merged = pd.merge(users,logins, on ='user_id')
merged.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,max_one_week_visits
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,1.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,3.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,1.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,1.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,1.0


# Supervised Machine Learning: 
- In this notebook, we try to identify which factors influence whether a user becomes adopted, using supervised machine learning: we assign each user a label (adopted or not) and fit a model to predict that label. We then inspect the fitted model's `feature_importances_` attribute to see which factors contributed most to its decision to classify a user as adopted or not.
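
As a rough illustration of that workflow, here is a sketch on synthetic data. The notebook has not chosen a model at this point, so `RandomForestClassifier` is an assumption, and the `signal` and `noise` columns are invented for the example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (hypothetical): 'signal' determines the label, 'noise' is unrelated
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    'signal': rng.integers(0, 2, n),
    'noise': rng.integers(0, 2, n),
})
y = X['signal']  # the label simply copies the 'signal' column

# Fit a tree ensemble and read the impurity-based importances
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = (pd.Series(model.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances)
```

The importances sum to 1, so they describe each feature's relative contribution within this model. They do not indicate the direction of an effect.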

In [13]:
## First, we create our adopted column: 
merged['adopted']= np.where(merged.max_one_week_visits > 2,1,0)

In [14]:
## Now let's preview our df: 
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8823 entries, 0 to 8822
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   user_id                     8823 non-null   int64  
 1   creation_time               8823 non-null   object 
 2   name                        8823 non-null   object 
 3   email                       8823 non-null   object 
 4   creation_source             8823 non-null   object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    8823 non-null   int64  
 7   enabled_for_marketing_drip  8823 non-null   int64  
 8   org_id                      8823 non-null   int64  
 9   invited_by_user_id          4776 non-null   float64
 10  max_one_week_visits         8823 non-null   float64
 11  adopted                     8823 non-null   int32  
dtypes: float64(3), int32(1), int64(4), object(4)
memory usage: 861.6+ KB


The only column with missing values is invited_by_user_id; these are most likely users who were not referred by anyone.
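
One way to check this explanation is to compare the share of missing values across creation_source. A sketch on a small made-up table (`toy_users` and its values are hypothetical; on the real data the same expression would be applied to `merged`):

```python
import numpy as np
import pandas as pd

# Toy users table (hypothetical values): only invite-based sources carry an inviter id
toy_users = pd.DataFrame({
    'creation_source': ['GUEST_INVITE', 'ORG_INVITE', 'SIGNUP',
                        'PERSONAL_PROJECTS', 'SIGNUP_GOOGLE_AUTH', 'ORG_INVITE'],
    'invited_by_user_id': [10803.0, 316.0, np.nan, np.nan, np.nan, 1525.0],
})

# Share of rows with a missing inviter, per creation source
missing_rate = (toy_users['invited_by_user_id'].isna()
                .groupby(toy_users['creation_source']).mean())
print(missing_rate)
```

If the real data shows a rate of 0 for the invite sources and 1 for the others, the missing values carry no information beyond creation_source.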

In [15]:
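## We fill missing invited_by_user_id values with 0 (no inviter). Assigning the result back avoids calling inplace on a column selection.
merged['invited_by_user_id'] = merged['invited_by_user_id'].fillna(0)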

In [16]:
## We flag users who were invited by another user
merged['referred'] = np.where(merged.invited_by_user_id>0,1,0)
merged.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,max_one_week_visits,adopted,referred
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,1.0,0,1
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,3.0,1,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,1.0,0,1
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,1.0,0,1
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,1.0,0,1


In [17]:
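## We drop identifiers and columns we will not use as features. max_one_week_visits must go because adopted is derived from it.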
cols_drop =['name','creation_time','email','last_session_creation_time','invited_by_user_id','max_one_week_visits','user_id']
merged.drop(columns = cols_drop, inplace = True)

In [18]:
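## We one-hot encode creation_source; drop_first=True drops the first category (GUEST_INVITE), which becomes the baseline
dummies = pd.get_dummies(merged.creation_source, drop_first =True)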
dummies.head()

Unnamed: 0,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
0,0,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,0,0
4,0,0,0,0


In [19]:
final = pd.concat([merged,dummies], axis =1)

In [20]:
final.drop(columns = 'creation_source', inplace = True)

In [22]:
final.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,referred,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
0,1,0,11,0,1,0,0,0,0
1,0,0,1,1,1,1,0,0,0
2,0,0,94,0,1,1,0,0,0
3,0,0,1,0,1,0,0,0,0
4,0,0,193,0,1,0,0,0,0


In [28]:
final.org_id.nunique()

417

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
cols_scale = ['']
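
The list of columns to scale is left unfinished above. As a sketch of how the two imports are commonly combined, assuming a hypothetical numeric column `numeric_feature` on toy data (none of these names come from the real dataframe), the scaler is fitted on the training split only and then applied to the test split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical columns), standing in for the final dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'numeric_feature': rng.normal(50, 10, 200),
    'binary_flag': rng.integers(0, 2, 200),
    'adopted': rng.integers(0, 2, 200),
})
X = toy.drop(columns='adopted')
y = toy['adopted']

# Stratify on the label so both splits keep a similar share of adopted users
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit the scaler on the training split only, then reuse it on the test split
cols_to_scale = ['numeric_feature']  # hypothetical choice of columns
scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])
```

Fitting on the training split alone keeps test-set statistics out of preprocessing. Binary indicator columns are usually left unscaled.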