# Relax_Data Science Challenge 

In [81]:
# Importing libraries and packages 
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


plt.rcParams['figure.figsize'] = [20,10]

import seaborn as sns
import scipy.stats as stats
import sklearn

# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
plt.style.use('ggplot')

from IPython.display import Image
from IPython.core.display import HTML 

## Importing and Cleaning the Data

## Briefly about the data and the problem

The usage summary table ("takehome_user_engagement") has one row for each day that a user logged into the product.
An "adopted user" is defined as a user who has logged into the product on three separate days in at least one seven-day period.

**Problem to be solved**

**Identify which factors predict future user adoption.**

Steps: We will follow these steps to address this problem:  
- Import the files, 
- Clean and wrangle the data, and 
- Perform recursive feature elimination (RFE) to identify the top factors that predict future adoption.  

In [82]:
#importing and opening the file
data_engagement = pd.read_csv('takehome_user_engagement.csv')
data_users = pd.read_csv('takehome_users.csv',encoding='latin-1')

In [83]:
data_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [84]:
data_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [85]:
data_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [86]:
data_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

### Making a dataframe from data_engagement which contains all the "adopted users".
An "adopted user" is a user who has logged into the product on three separate days in at least one seven-day period.

In [87]:
# converting the "time stamp" column to datetimes 
data_engagement['time_stamp']= pd.to_datetime(data_engagement['time_stamp'])

In [88]:
# setting "time_stamp" as the datetime index
data_engagement = data_engagement.set_index('time_stamp')

In [89]:
# Data wrangling to count the number of days each user logged in within each fixed 7-day ('7D') bin
# (pd.Grouper replaces pd.TimeGrouper, which has been removed from pandas)
data_weekly_count = data_engagement.groupby(['user_id', pd.Grouper(freq='7D')]).sum()

  


In [120]:
data_weekly_count.head()

user_id,time_stamp,visited
1,2014-04-17,1
2,2013-11-14,1
2,2013-11-28,1
2,2013-12-05,1
2,2013-12-19,1


In [91]:
# For adopted users: filter the df to keep user-weeks with 3 or more logins
data_filtered_weekly_count = data_weekly_count.loc[data_weekly_count['visited'] >=3, :]

In [92]:
data_filtered_weekly_count.head()

user_id,time_stamp,visited
10,2013-02-14,3
10,2013-02-28,3
10,2013-03-14,3
10,2013-04-11,4
10,2013-04-25,4


Next, we need the unique user_id values.

In [93]:
data_filtered_weekly_count.shape

(33829, 1)

In [94]:
# extracting unique list of adopted users. 
data_filtered_weekly_count = data_filtered_weekly_count.reset_index()

In [95]:
adopted_users = data_filtered_weekly_count['user_id'].unique()

In [96]:
adopted_users

array([   10,    42,    43, ..., 11969, 11975, 11988])

In [97]:
# make a dataframe containing all adopted users 
df_adopted_users = pd.DataFrame({"user_id": adopted_users, 'user_adoption' : 'yes'})
df_adopted_users.head()

Unnamed: 0,user_id,user_adoption
0,10,yes
1,42,yes
2,43,yes
3,53,yes
4,63,yes


In [98]:
# joining adopted users to the user dataframe
combined_df = pd.merge(data_users, df_adopted_users, left_on = 'object_id', right_on = "user_id", how = 'outer')

In [99]:
combined_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,user_adoption
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,,
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,,
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,,
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,,
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,,


In [100]:
# Filling NaNs in the 'user_adoption' column with 'no'
combined_df["user_adoption"] = combined_df["user_adoption"].fillna("no")
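
The merge followed by a fill is one way to build the label. As an alternative sketch, the same 'yes'/'no' column can be derived directly with `isin`, which avoids the extra 'user_id' column that is dropped later. The frame and ids below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for data_users and adopted_users
users = pd.DataFrame({"object_id": [1, 2, 10, 42]})
adopted_ids = np.array([10, 42])

# Label each user by membership in the adopted-user id array
users["user_adoption"] = np.where(users["object_id"].isin(adopted_ids), "yes", "no")
print(users)
```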

In [102]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 12 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
user_id                       1445 non-null float64
user_adoption                 12000 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 1.2+ MB


In [103]:
#dropping 'user_id' column
combined_df = combined_df.drop("user_id", axis=1)

In [104]:
# Dropping unnecessary columns
combined_df_model = combined_df.drop(["creation_time", 'name', 'email'], axis=1)

In [105]:
# check for NA or null values in the df. 
combined_df_model.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 8 columns):
object_id                     12000 non-null int64
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
user_adoption                 12000 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 843.8+ KB


There are null values in the columns last_session_creation_time and invited_by_user_id. We drop every row that contains a null, which leaves 4,776 of the 12,000 users.

In [106]:
combined_df_model_d= combined_df_model.dropna()

In [107]:
combined_df_model_d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4776 entries, 0 to 11997
Data columns (total 8 columns):
object_id                     4776 non-null int64
creation_source               4776 non-null object
last_session_creation_time    4776 non-null float64
opted_in_to_mailing_list      4776 non-null int64
enabled_for_marketing_drip    4776 non-null int64
org_id                        4776 non-null int64
invited_by_user_id            4776 non-null float64
user_adoption                 4776 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 335.8+ KB
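
`dropna()` with no arguments removes every row that has a null in any column, which is why the row count falls to 4,776 rather than to the 8,823 rows with a non-null last_session_creation_time. The toy example below contrasts dropping on all columns with dropping on a single column. The `was_invited` flag is a hypothetical alternative for keeping uninvited users, not something this notebook does.

```python
import pandas as pd

# Made-up rows: one missing session time, one missing inviter
toy = pd.DataFrame({
    "last_session_creation_time": [1.0, None, 3.0, 4.0],
    "invited_by_user_id": [5.0, 6.0, None, 8.0],
})

# Default behaviour: a null in ANY column removes the row
dropped_any = toy.dropna()

# Alternative: turn the inviter id into a 0/1 flag, then drop only on session time
flagged = (
    toy.assign(was_invited=toy["invited_by_user_id"].notna().astype(int))
       .drop(columns="invited_by_user_id")
       .dropna(subset=["last_session_creation_time"])
)
print(len(dropped_any), len(flagged))
```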


In [108]:
# create a dummy table to run logistic regression 
combined_df_model_w_dum = pd.get_dummies(combined_df_model_d, drop_first = True)


In [109]:
combined_df_model_w_dum.head()

Unnamed: 0,object_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_ORG_INVITE,user_adoption_yes
0,1,1398139000.0,1,0,11,10803.0,0,0
1,2,1396238000.0,0,0,1,316.0,1,0
2,3,1363735000.0,0,0,94,1525.0,1,0
3,4,1369210000.0,0,0,1,5151.0,0,0
4,5,1358850000.0,0,0,193,5240.0,0,0


In [110]:
combined_df_model_w_dum.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4776 entries, 0 to 11997
Data columns (total 8 columns):
object_id                     4776 non-null int64
last_session_creation_time    4776 non-null float64
opted_in_to_mailing_list      4776 non-null int64
enabled_for_marketing_drip    4776 non-null int64
org_id                        4776 non-null int64
invited_by_user_id            4776 non-null float64
creation_source_ORG_INVITE    4776 non-null uint8
user_adoption_yes             4776 non-null uint8
dtypes: float64(2), int64(4), uint8(2)
memory usage: 270.5 KB
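
`pd.get_dummies(..., drop_first=True)` encodes each text column as indicator columns and drops the first category of each, so 'user_adoption' becomes the single column 'user_adoption_yes'. Only 'creation_source_ORG_INVITE' appears in the output above, which suggests that just two creation_source categories remain after the nulls were dropped. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with two text columns and one numeric column
toy = pd.DataFrame({
    "org_id": [11, 1, 94],
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "ORG_INVITE"],
    "user_adoption": ["no", "yes", "no"],
})

# Numeric columns pass through; each text column loses its first category
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```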


## Feature Selection with Recursive Feature Elimination (RFE)

In [121]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [122]:
# Selecting features and target variable
X= combined_df_model_w_dum.drop("user_adoption_yes", axis=1)
Y = combined_df_model_w_dum["user_adoption_yes"]

In [123]:
# Instantiating model & use RFE
model_logreg = LogisticRegression()

In [124]:
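# keep the 4 top-ranked features (n_features_to_select is keyword-only in recent scikit-learn)
rfe = RFE(model_logreg, n_features_to_select=4)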

In [125]:
rfe = rfe.fit(X, Y)

In [126]:
features_selected = rfe.support_

In [127]:
features_selected_rank = rfe.ranking_ 

In [128]:
#Finding column names of features selected
orig_columns = X.columns.values 

In [129]:
selected_column_names = np.array(orig_columns) * features_selected

# cleaning up the list of selected column names by removing empty strings
final_selected_column_names = [x for x in selected_column_names if len(x) > 0]
print("These are the Features that have been selected via RFE:\n\n", final_selected_column_names)

These are the Features that have been selected via RFE:

 ['object_id', 'last_session_creation_time', 'org_id', 'invited_by_user_id']


In [131]:
orig_columns

array(['object_id', 'last_session_creation_time',
       'opted_in_to_mailing_list', 'enabled_for_marketing_drip', 'org_id',
       'invited_by_user_id', 'creation_source_ORG_INVITE'], dtype=object)

In [130]:
selected_column_names

array(['object_id', 'last_session_creation_time', '', '', 'org_id',
       'invited_by_user_id', ''], dtype=object)
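
Multiplying the column names by the boolean mask works, but indexing the columns with the mask gives the same names without the empty strings. In the notebook this would be `X.columns[rfe.support_]`. In the short sketch below the mask is written out by hand to mirror the output above rather than taken from a fitted model.

```python
import numpy as np
import pandas as pd

columns = pd.Index([
    "object_id", "last_session_creation_time", "opted_in_to_mailing_list",
    "enabled_for_marketing_drip", "org_id", "invited_by_user_id",
    "creation_source_ORG_INVITE",
])
# Hand-written stand-in for rfe.support_, matching the selection shown above
support = np.array([True, True, False, False, True, True, False])

# Boolean indexing keeps only the selected column names
selected = columns[support].tolist()
print(selected)
```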

RFE selected the following four features:
- 'object_id' (to be removed),
- 'last_session_creation_time',
- 'invited_by_user_id',
- 'org_id'

We will remove object_id because it is simply a user identifier (equivalent to user_id) and will certainly not affect adoption by users. In hindsight, we should have dropped this column before building the model.  

The remaining three are the most important features picked by RFE for predicting whether a user will be an 'adopted user' or not. 
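
Following the hindsight note about object_id, a rerun could drop the identifier before fitting and standardize the features, since RFE with logistic regression ranks features by coefficient size, which depends on feature scale. The sketch below runs on synthetic data generated purely for illustration (it is not the Relax data and its selection says nothing about real users). The scaling step is an assumption, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values are random and illustrative only
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "object_id": np.arange(n),
    "last_session_creation_time": rng.normal(1.38e9, 1e7, n),
    "opted_in_to_mailing_list": rng.integers(0, 2, n),
    "org_id": rng.integers(0, 400, n),
    "invited_by_user_id": rng.integers(1, 12000, n).astype(float),
})
# Synthetic target tied to one feature so the toy model has something to learn
y = (toy["last_session_creation_time"]
     > toy["last_session_creation_time"].median()).astype(int)

# Drop the identifier, then put all features on a comparable scale
X = toy.drop(columns="object_id")
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y)
selected = X_scaled.columns[rfe.support_].tolist()
print(selected)
```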

**Recommendations:** 

- One recommendation that can be drawn from these features is for 'Relax' to create an incentive referral program to increase the chance of gaining an 'adopted user'. 

- Further analysis is warranted after collecting additional data about users and the organizations they belong to. 