# Relax Inc. Data Science Take-Home Challenge

This notebook analyzes Relax Inc.'s user data and builds a model to find the most important features for predicting an adopted user.

In [11]:
#General packages
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
import seaborn as sns


#Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

#evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

#warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

In [2]:
#import user info data into pandas dataframe
users_df = pd.read_csv(r'C:\Users\Evan\Programming\Jupiter Projects\Springboard Interview Challenge\Relax_challenge\relax_challenge\takehome_users.csv', encoding='latin-1')
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


It's immediately obvious that not every user has created a session. New signups are good, but they mean little if those users never actually use the product. The initial hook to get them in is working, but the reel needs to be improved to keep pulling in their interest. At least 3,177 of the 12,000 signups over the past 2 years never opened a session.

Let's see how signups that were invited compare.

In [3]:
invited = users_df[users_df.invited_by_user_id.notnull()]
invited.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6417 entries, 0 to 11997
Data columns (total 10 columns):
object_id                     6417 non-null int64
creation_time                 6417 non-null object
name                          6417 non-null object
email                         6417 non-null object
creation_source               6417 non-null object
last_session_creation_time    4776 non-null float64
opted_in_to_mailing_list      6417 non-null int64
enabled_for_marketing_drip    6417 non-null int64
org_id                        6417 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 551.5+ KB


So about 74% of invited users (4,776 of 6,417) opened at least one session in the product, only slightly above the rate for all signups (8,823 of 12,000, roughly 73.5%). Invites don't seem to make much of a difference in keeping signups interested.
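The two rates compared above follow directly from the non-null counts in the `info()` output. A minimal sketch of the same comparison, assuming a hypothetical helper `session_rate` and a toy frame standing in for `users_df` (the toy values are illustrative only):

```python
import numpy as np
import pandas as pd

def session_rate(frame):
    """Share of users with at least one recorded session."""
    return frame["last_session_creation_time"].notnull().mean()

# Toy stand-in for users_df (hypothetical values, not the real data).
toy = pd.DataFrame({
    "last_session_creation_time": [1.4e9, np.nan, 1.39e9, np.nan],
    "invited_by_user_id": [10.0, np.nan, 7.0, 3.0],
})

overall = session_rate(toy)
invited_only = session_rate(toy[toy["invited_by_user_id"].notnull()])
print(f"toy overall: {overall:.1%}, toy invited: {invited_only:.1%}")

# The same arithmetic on the counts reported by info() in the notebook.
print(f"all signups: {8823 / 12000:.1%}, invited signups: {4776 / 6417:.1%}")
```

On the real data the helper would be called as `session_rate(users_df)` and `session_rate(invited)`.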

In [7]:
#import user engagement info data into pandas dataframe
engage_df = pd.read_csv(r'C:\Users\Evan\Programming\Jupiter Projects\Springboard Interview Challenge\Relax_challenge\relax_challenge\takehome_user_engagement.csv', encoding='latin-1')
engage_df.time_stamp = pd.to_datetime(engage_df.time_stamp)
engage_df.index=engage_df.time_stamp
engage_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207917 entries, 2014-04-22 03:53:30 to 2014-01-26 08:57:12
Data columns (total 3 columns):
time_stamp    207917 non-null datetime64[ns]
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 6.3 MB


In [8]:
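#look at individual users to sort out adopted users
#defined as logging in on 3 separate days within a 7 day period

users = engage_df["user_id"].unique()
adoption = []

for i in users:
    #rows belonging to this user
    user_mask = engage_df["user_id"] == i
    #logins per calendar day (days without a login count as 0)
    daily = engage_df[user_mask].resample("1D").count()
    #logins in each trailing 7-day window; the first 6 days of a user's history
    #come out as NaN and are dropped, so histories shorter than 7 days are never flagged
    weekly = daily.rolling(window=7).sum()
    weekly = weekly.dropna()
    adoption.append(any(weekly["visited"].values >= 3))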

In [9]:
adoption.count(True)

1597

Close to 1,600 users are considered adopted.
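The definition of an adopted user (logins on 3 separate days within a 7-day period) can also be checked per user directly on the distinct login dates. A minimal sketch, assuming a hypothetical helper `is_adopted` and reading "within a 7-day period" as the first and third login days being at most 6 days apart:

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """True if `min_days` distinct login days fall inside one `window_days` window."""
    # Reduce timestamps to distinct calendar days, in chronological order.
    days = (
        pd.Series(pd.to_datetime(timestamps))
        .dt.normalize()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    if len(days) < min_days:
        return False
    # Gap between each login day and the (min_days - 1)-th login day after it.
    span = days.shift(-(min_days - 1)) - days
    # Trailing NaT values compare as False, so they never trigger adoption.
    return bool((span < pd.Timedelta(days=window_days)).any())

print(is_adopted(["2014-01-01 08:00", "2014-01-02 09:00", "2014-01-03 10:00"]))
print(is_adopted(["2014-01-01 08:00", "2014-01-01 12:00", "2014-01-01 20:00"]))
```

Applied to the notebook's data this would look like `engage_df.groupby("user_id")["time_stamp"].apply(is_adopted)`. Because it also considers histories shorter than 7 days and counts distinct days rather than logins, its total may differ from the 1,597 reported above.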

In [10]:
#create and organize dataframe for analysis

users_df = users_df.rename({"object_id":"user_id"}, axis=1)


adopted = list(zip(users, adoption))

df_adopt = pd.DataFrame(adopted)
df_adopt.columns = ["user_id", "adopted_user"]

df = users_df.merge(df_adopt, on="user_id", how="left")

In [12]:
# changing adopted values from true/false to 0/1
# users with no engagement records come out of the left merge as NaN and count as not adopted
df["adopted_user"] = df["adopted_user"].fillna(False).astype(int)

In [13]:
# changing invited_by_user_id to 0/1 since it doesn't matter who invited them
df["invited_by_user"] = df["invited_by_user_id"].notnull().astype(int)
df.drop('invited_by_user_id', axis=1, inplace=True)

In [14]:
#select columns to use - leave out irrelevant ones
df = df[["adopted_user", "invited_by_user", "creation_source",
         "opted_in_to_mailing_list", "enabled_for_marketing_drip"]]

In [15]:
df.head()

|   | adopted_user | invited_by_user | creation_source | opted_in_to_mailing_list | enabled_for_marketing_drip |
|---|---|---|---|---|---|
| 0 | 0 | 1 | GUEST_INVITE | 1 | 0 |
| 1 | 1 | 1 | ORG_INVITE | 0 | 0 |
| 2 | 0 | 1 | ORG_INVITE | 0 | 0 |
| 3 | 0 | 1 | GUEST_INVITE | 0 | 0 |
| 4 | 0 | 1 | GUEST_INVITE | 0 | 0 |


## Machine Learning

Create a model to find the most important factors in gaining an adopted user.

In [16]:
#define X (independent vars (features)) and y (dependent var (adopted)) and split for train and test data

X = df[df.columns[1:]]
y = df[df.columns[0]]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.6, random_state = 602)

pipeline = Pipeline(steps=[("encoder", OneHotEncoder()), ("rf", RandomForestClassifier(random_state = 602))])

params = {"rf__n_estimators" : [25, 50, 75, 100],
          "rf__max_depth" : [3, 5, 10, 15]}

cv = GridSearchCV(pipeline, param_grid=params, cv=3)
cv.fit(X_train, y_train)

print(f"Best parameters: {cv.best_params_}")
print(f"Accuracy score from tuned model: {cv.best_score_*100:.1f}%")

Best parameters: {'rf__max_depth': 3, 'rf__n_estimators': 25}
Accuracy score from tuned model: 86.7%


In [17]:
# test data
y_pred = cv.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {test_accuracy*100:.2f}%")

Accuracy: 86.69%


In [64]:
target_names = ['Users', 'Adopted_Users']
print(classification_report(y_test, y_pred, target_names=target_names))

               precision    recall  f1-score   support

        Users       0.87      1.00      0.93      6242
Adopted_Users       0.00      0.00      0.00       958

     accuracy                           0.87      7200
    macro avg       0.43      0.50      0.46      7200
 weighted avg       0.75      0.87      0.81      7200
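The report shows precision and recall of 0.00 for `Adopted_Users`, so on this split the model labels every user as not adopted. A quick check, sketched here with a hypothetical helper `majority_baseline_accuracy`, compares the reported accuracy against what always predicting the larger class would score, using the support counts from the report:

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()

# Class supports taken from the classification report (6242 users, 958 adopted users).
y_like_test = np.array([0] * 6242 + [1] * 958)
baseline = majority_baseline_accuracy(y_like_test)
print(f"Majority-class baseline: {baseline * 100:.2f}%")
```

This works out to 86.69%, the same as the test accuracy, so accuracy alone does not show that the model has learned to identify adopted users. Options such as `class_weight="balanced"` on `RandomForestClassifier` or a different `scoring` argument in `GridSearchCV` may be worth trying.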



In [18]:
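# feature importance
# NOTE: this refits the pipeline on the test split instead of reusing the tuned model,
# and the pipeline's OneHotEncoder re-encodes the dummy columns built here, so the
# importance array has more entries than X_.columns and zip() below drops the extras

X_ = pd.get_dummies(X_test)
pipeline.fit(X_, y_test)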

feat = pipeline.named_steps["rf"].feature_importances_

feature_importance = zip(X_.columns, feat)
feature_importance = sorted(feature_importance, key=lambda x:x[1], reverse=True)

for i, j in feature_importance:
    print(f"Weight: {j:.3f} | Feature: {i}")

Weight: 0.129 | Feature: creation_source_ORG_INVITE
Weight: 0.065 | Feature: creation_source_PERSONAL_PROJECTS
Weight: 0.056 | Feature: enabled_for_marketing_drip
Weight: 0.053 | Feature: creation_source_SIGNUP_GOOGLE_AUTH
Weight: 0.041 | Feature: creation_source_GUEST_INVITE
Weight: 0.028 | Feature: invited_by_user
Weight: 0.024 | Feature: opted_in_to_mailing_list
Weight: 0.006 | Feature: creation_source_SIGNUP


## Feature Importance

Based on this model, the most predictive feature is the creation source of the account. Of the five possible sources, an invite to an organization as a full member carried the most weight in predicting an adopted user, followed by an invitation to join another user's personal workspace. This is not entirely surprising: these users had a specific reason to keep logging in, unlike someone who creates an unsolicited account and then decides they aren't interested in continued use. It would be a good idea for the company to increase marketing to organizations as opposed to individuals and to emphasize project collaboration.
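Feature importances measure how much the forest relies on a feature, not whether that feature pushes adoption up or down. One way to check the direction, sketched with a hypothetical helper `adoption_rate_by` and a toy frame standing in for `df`, is to compare raw adoption rates across creation sources:

```python
import pandas as pd

def adoption_rate_by(frame, column, target="adopted_user"):
    """Adoption rate and group size for each value of `column`, highest rate first."""
    grouped = frame.groupby(column)[target].agg(["mean", "size"])
    grouped = grouped.rename(columns={"mean": "adoption_rate", "size": "n_users"})
    return grouped.sort_values("adoption_rate", ascending=False)

# Toy stand-in for df (hypothetical values, not the real data).
toy = pd.DataFrame({
    "creation_source": ["ORG_INVITE", "ORG_INVITE", "SIGNUP", "SIGNUP", "SIGNUP"],
    "adopted_user": [1, 0, 0, 0, 1],
})

rates = adoption_rate_by(toy, "creation_source")
print(rates)
```

On the real data the call would be `adoption_rate_by(df, "creation_source")`, run before the columns are one-hot encoded.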

The feature with the third-highest weight is whether the user is on the marketing drip email list. This may simply be because continuing users want to keep up with current news or updates - but it may also be a good idea to increase email marketing to new users.

While signing up with a Google account has the fourth-highest weight, it is not necessarily a useful metric. It could be argued, however, that increasing ads for Google users (everyone?) may result in an increase in adopted users.

The lowest importance comes from signing up on the website itself. This is probably the population that has no real initial reason to create an account, and once the account is made, the product doesn't hold their interest. While you could read this as the website not having a real impact, I think it would be a good idea to rework the website to better sell the program's best uses and features.