### Take-Home Challenge: Relax Inc.

1) A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years.
This table includes:
● name: the user's name 
● object_id: the user's id 
● email: email address 
● creation_source: how their account was created. This takes on one of 5 values: 
        ○ PERSONAL_PROJECTS: invited to join another user's personal workspace 
        ○ GUEST_INVITE: invited to an organization as a guest (limited permissions) 
        ○ ORG_INVITE: invited to an organization (as a full member) 
        ○ SIGNUP: signed up via the website 
        ○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id) 
            
● creation_time: when they created their account 
● last_session_creation_time: unix timestamp of last login 
● opted_in_to_mailing_list: whether they have opted into receiving marketing emails    
● enabled_for_marketing_drip: whether they are on the regular marketing email drip 
● org_id: the organization (group of users) they belong to 
● invited_by_user_id: which user invited them to join (if applicable).
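
Since last_session_creation_time is stored as a unix timestamp, it usually helps to convert it to a datetime before comparing it with creation_time. The following is a minimal sketch on a made-up two-row frame (the `users` frame and the `last_session` column name are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical two-row sample; the first value is copied from the table preview
users = pd.DataFrame({
    "object_id": [1, 2],
    "last_session_creation_time": [1.398139e+09, None],
})

# unit="s" interprets the numbers as seconds since 1970-01-01 UTC;
# any missing value becomes NaT instead of raising an error
users["last_session"] = pd.to_datetime(
    users["last_session_creation_time"], unit="s"
)
```

The converted column can then be subtracted from a parsed creation_time to get, for example, the time between signup and last login.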

2) A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.

An "adopted user" is defined as a user who has logged into the product on three separate days in at least one seven-day period.
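
One literal reading of this definition is to look for three distinct login days that span at most seven calendar days. The sketch below illustrates that reading on made-up data; the function name `is_adopted` and the toy `logins` frame are assumptions for illustration only. Note that counting logins per fixed calendar week can miss a qualifying window that straddles a week boundary, as user 1 in the toy data does (Friday, Monday, Wednesday).

```python
import pandas as pd

def is_adopted(timestamps: pd.Series) -> bool:
    """Return True if logins fall on 3 distinct days within some 7-day window."""
    # reduce to distinct calendar days, in chronological order
    days = (
        timestamps.dt.normalize()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    # distance from each login day to the login day two positions later
    span = days.shift(-2) - days
    # three days fit in a 7-day window when first and third are at most 6 days apart
    return bool((span <= pd.Timedelta(days=6)).any())

# hypothetical toy data
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time_stamp": pd.to_datetime([
        "2014-01-03", "2014-01-06", "2014-01-08",  # 3 days within 6 days
        "2014-01-01", "2014-01-10", "2014-01-20",  # too spread out
    ]),
})

adopted = logins.groupby("user_id")["time_stamp"].apply(is_adopted)
```

Here `adopted` is a boolean Series indexed by user_id, which could be joined back onto the user table as a label.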

### Problem to be solved

Identify which factors predict future user adoption.

Steps: We will take the following steps to address this problem:

● Import the files
● Clean and wrangle the data
● Identify the top factors that predict future adoption

In [161]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


plt.rcParams['figure.figsize'] = [20,10]

import seaborn as sns
import scipy.stats as stats
import sklearn

# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
plt.style.use('ggplot')

from IPython.display import Image
from IPython.core.display import HTML

In [162]:
data_engagement = pd.read_csv('takehome_user_engagement.csv')
# keep a reference to the user table so later cells can use it
data_users = pd.read_csv("takehome_users.csv", encoding="ISO-8859-1")
data_users

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1.398139e+09,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1.396238e+09,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1.363735e+09,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1.369210e+09,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1.358850e+09,0,0,193,5240.0
...,...,...,...,...,...,...,...,...,...,...
11995,11996,2013-09-06 06:14:15,Meier Sophia,SophiaMeier@gustr.com,ORG_INVITE,1.378448e+09,0,0,89,8263.0
11996,11997,2013-01-10 18:28:37,Fisher Amelie,AmelieFisher@gmail.com,SIGNUP_GOOGLE_AUTH,1.358275e+09,0,0,200,
11997,11998,2014-04-27 12:45:16,Haynes Jake,JakeHaynes@cuvox.de,GUEST_INVITE,1.398603e+09,1,1,83,8074.0
11998,11999,2012-05-31 11:55:59,Faber Annett,mhaerzxp@iuxiw.com,PERSONAL_PROJECTS,1.338638e+09,0,0,6,


In [163]:
data_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [164]:
data_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven day period, identify which factors predict future user adoption.

In [165]:
# converting the "time_stamp" column to datetimes
data_engagement.time_stamp = pd.to_datetime(data_engagement.time_stamp)
data_engagement.index=data_engagement.time_stamp
data_engagement.drop(labels='time_stamp',axis=1,inplace=True)

In [166]:
#Group by calendar week and user_id, then sum the visits in each week
#(fixed weekly bins approximate the "any seven day period" definition)
df_agg = data_engagement.groupby([pd.Grouper(freq='W'),'user_id']).sum()

In [167]:
#find all user id's with a sum of 3 or more indicating an adopted user
df_adopt = df_agg[df_agg.visited>=3].unstack(level=1).melt()
adopted_users = pd.DataFrame(df_adopt.user_id.unique(),index=range(df_adopt.user_id.unique().shape[0]),columns=['user_id'])

In [168]:
adopted_users

Unnamed: 0,user_id
0,1693
1,728
2,11764
3,5297
4,6171
...,...
1440,7868
1441,7927
1442,9870
1443,10746


In [169]:
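#create df of features: user attributes for the adopted users only
#load the user table here so this cell is self-contained
data_users = pd.read_csv("takehome_users.csv", encoding="ISO-8859-1")
df_join = data_users.merge(adopted_users,how='inner',left_on='object_id',right_on='user_id')
df_join.head()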

In [171]:
#drop the first four columns of the user table (object_id, creation_time, name, email)
#and the duplicate user_id key left over from the merge
drop_cols = list(df_join.columns[0:4])
drop_cols.append('user_id')
df_join = df_join.drop(drop_cols,axis=1)

In [None]:
#one hot encode creation_source feature
df_create = pd.get_dummies(df_join['creation_source'])
df_features = pd.concat([df_join,df_create],axis=1)
df_features.drop('creation_source',axis=1,inplace=True)

In [None]:
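#convert all columns to float64
df_features = df_features.astype('float64')
#PCA cannot handle missing values (e.g. invited_by_user_id for users who were not invited);
#filling with 0 is an assumption made here, not a property of the data
df_features = df_features.fillna(0)
df_features.head()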

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#scale data
scaler = StandardScaler()
features = scaler.fit_transform(df_features)

#fit PCA
pca = PCA()
components = pca.fit_transform(features)
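
Once PCA is fitted, the usual next step is to look at how much variance each component explains and which original features weigh most heavily on the leading components. The sketch below shows this on synthetic data; the `toy` frame and its column names (`feat_a`, `feat_b`, `feat_c`) are hypothetical stand-ins for `df_features`, and the numbers it produces say nothing about the Relax dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# hypothetical stand-in for df_features:
# two strongly correlated columns and one independent column
a = rng.normal(size=200)
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": a + rng.normal(scale=0.1, size=200),
    "feat_c": rng.normal(size=200),
})

scaled = StandardScaler().fit_transform(toy)
pca = PCA().fit(scaled)

# share of total variance captured by each component, largest first
explained = pd.Series(pca.explained_variance_ratio_,
                      name="explained_variance_ratio")

# loadings: one row per component, one column per original feature
loadings = pd.DataFrame(pca.components_, columns=toy.columns)

# features ranked by absolute weight on the first component
top_features = loadings.iloc[0].abs().sort_values(ascending=False)
```

Loadings describe how features vary together, so on their own they do not show which features predict adoption; comparing them against an adopted/not-adopted label would be a separate step.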