# Relax Inc Data Challenge
## Objectives
- Identify factors that help predict future user adoption of the Relax Inc product.

*"Adopted user" is defined as a user who has logged into the product on three separate days in at least one seven-day period.*

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1">Objectives</a></span></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-2">Data Preprocessing</a></span></li><li><span><a href="#Predictive-Modeling" data-toc-modified-id="Predictive-Modeling-3">Predictive Modeling</a></span></li></ul></div>

## Data Preprocessing

In [3]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# load the user table, which supplies the independent variables
user = pd.read_csv('takehome_users.csv', encoding='ISO-8859-1')
user.head()

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


In [31]:
# load user engagement data
user_engagement = pd.read_csv('takehome_user_engagement.csv')
user_engagement.head()

| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


In [39]:
user_engagement.visited.value_counts()

1    207917
Name: visited, dtype: int64

In [32]:
user_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [33]:
# generate dependent variable: adopted_user
# a user who has logged into the product on three separate days in at least one seven-day period

# convert date object to datetime data type
user_engagement['time_stamp'] = pd.to_datetime(user_engagement['time_stamp'])

In [61]:
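# an "adopted user" logged in on three separate days within some seven-day period,
# so use a rolling window per user instead of calendar weeks
# (calendar weeks miss windows that straddle a week boundary, and .dt.week was removed in pandas 2.0)

# keep one row per user per calendar day, in chronological order
user_engagement['date'] = user_engagement['time_stamp'].dt.normalize()
daily = user_engagement.drop_duplicates(subset=['user_id', 'date']).sort_values(['user_id', 'date'])

# gap between each login day and the second login day after it (NaT if there is none)
gap = daily.groupby('user_id')['date'].shift(-2) - daily['date']

# three login days fit in one seven-day period when that gap is at most six days
daily['adopted_user'] = (gap <= pd.Timedelta(days=6)).astype(int)
adopt = daily.groupby('user_id')['adopted_user'].max().reset_index()

# only keep the user_id and adopted_user columns
adopt = adopt[['user_id', 'adopted_user']]
adopt.head()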

| | user_id | adopted_user |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 0 |
| 3 | 4 | 0 |
| 4 | 5 | 0 |
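
As a sanity check on the adopted-user definition, the sketch below applies the "three separate days within a seven-day period" rule to a small hand-made login table. The helper name `flag_adopted` and the toy data are hypothetical and not part of the original notebook. User 1 logs in on a Saturday, Sunday, and Tuesday, so the qualifying window straddles a calendar-week boundary. User 2 has three login rows but only two distinct days.

```python
import pandas as pd


def flag_adopted(logins: pd.DataFrame) -> pd.Series:
    """Return 1/0 per user_id: three distinct login days inside some seven-day period."""
    # one row per user per calendar day, sorted chronologically within each user
    days = (
        logins.assign(date=logins["time_stamp"].dt.normalize())
        .drop_duplicates(subset=["user_id", "date"])
        .sort_values(["user_id", "date"])
    )
    # gap between each login day and the second login day after it (NaT if none)
    gap = days.groupby("user_id")["date"].shift(-2) - days["date"]
    # a gap of at most six days means three login days fall inside seven days
    hit = gap <= pd.Timedelta(days=6)
    return hit.groupby(days["user_id"]).max().astype(int).rename("adopted_user")


# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "time_stamp": pd.to_datetime([
        "2014-01-04", "2014-01-05", "2014-01-07",  # Sat, Sun, Tue: spans two calendar weeks
        "2014-01-06", "2014-01-06", "2014-01-08",  # three rows, but only two distinct days
        "2014-01-06",                              # a single login
    ]),
})

toy_adopted = flag_adopted(toy)
print(toy_adopted)
```

Only user 1 is flagged here. Summing `visited` per calendar week would instead miss user 1 and wrongly flag user 2, which is why the check counts distinct days in a rolling window.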


In [64]:
# left-join the adoption flag onto the user table; users with no engagement rows get NaN
df = user.merge(adopt, how='left', left_on='object_id', right_on='user_id')
df = df.drop(columns='user_id')
df.head()

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_user |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 | 0.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 | 1.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 | 0.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 | 0.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 | 0.0 |
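
The float values (0.0, 1.0) in `adopted_user` suggest that the left merge introduced missing values: pandas upcasts an integer column to float when some users have no matching engagement rows. A minimal sketch of one way to handle this, using hypothetical miniature tables (`user_toy`, `adopt_toy`) in place of `user` and `adopt`, and assuming that a user who never logged in should count as not adopted:

```python
import pandas as pd

# hypothetical miniature versions of the `user` and `adopt` tables
user_toy = pd.DataFrame({"object_id": [1, 2, 3], "org_id": [11, 1, 94]})
adopt_toy = pd.DataFrame({"user_id": [1, 2], "adopted_user": [0, 1]})

# same left merge as above: every user is kept, unmatched users get NaN
merged = user_toy.merge(adopt_toy, how="left", left_on="object_id", right_on="user_id")
merged = merged.drop(columns="user_id")

# count users that have no engagement record at all
n_missing = int(merged["adopted_user"].isna().sum())
print(f"users without engagement rows: {n_missing}")

# assumption: no recorded login means the user is not adopted
merged["adopted_user"] = merged["adopted_user"].fillna(0).astype(int)
print(merged)
```

Whether to fill these rows with 0 or drop them is a modeling choice; filling keeps never-active users in the training data as negative examples.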


## Predictive Modeling

Random forest is selected as the primary model for this prediction task because it is robust to noisy data and provides feature importances that can be used to rank candidate predictors.
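
A minimal sketch of how such a model could be fit and its feature importances read off, using scikit-learn. The data below is randomly generated (hypothetical) and only mimics a few columns of the merged table, so the printed numbers carry no meaning. The chosen features, the train/test split, and the hyperparameters are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the merged `df`: random values, for illustration only
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "creation_source": rng.choice(["GUEST_INVITE", "ORG_INVITE"], size=n),
    "opted_in_to_mailing_list": rng.integers(0, 2, size=n),
    "enabled_for_marketing_drip": rng.integers(0, 2, size=n),
    "adopted_user": rng.integers(0, 2, size=n),
})

# one-hot encode the categorical column; the binary flags are used as-is
X = pd.get_dummies(toy.drop(columns="adopted_user"), columns=["creation_source"], dtype=int)
y = toy["adopted_user"]

# hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# class_weight="balanced" is an assumption, useful when adopted users are rare
model = RandomForestClassifier(n_estimators=100, random_state=0, class_weight="balanced")
model.fit(X_train, y_train)

# impurity-based importances, one per input column, normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
print("held-out accuracy:", model.score(X_test, y_test))
```

On the real data, identifier-like columns such as `name` and `email` would normally be excluded before fitting, since they identify individual users and do not generalize.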