# Relax Challange

## Project Description 


Defining  an  `adopted user`   as  a  user  *who   has  logged  into  the  product  on  three  separate days  in  at  least  one  seven­day  period* ,  **identify  which  factors  predict  future  user adoption**.

We  suggest  spending  1­2  hours  on  this,  but  you're  welcome  to  spend  more  or  less. Please  send  us  a  brief  writeup  of  your  findings  (the  more  concise,  the  better  ­­  no  more than  one  page),  along  with  any  summary  tables,  graphs,  code,  or  queries  that  can  help us  understand  your  approach.  Please  note  any  factors  you  considered  or  investigation you  did,  even  if  they  did  not  pan  out.  Feel  free  to  identify  any  further  research  or  data you  think  would  be  valuable


## Data description

The  data  is  available  as  two  attached  CSV  files:
- `takehome_user_engagement_df. csv`
- `takehome_user.csv`

The  data  has  the  following  two  tables:
The  data  has  the  following  two  tables:

1. A  user  table  ( "takehome_user" )  with  data  on  12,000  user  who  signed  up  for  the product  in  the  last  two  years. This  table  includes:
     - `name`:  the  user's  name
     - `object_id`:   the  user's  id
     - `email`:  email  address
     - `creation_source`:   how  their  account  was  created.  This  takes  on  one of  5  values:
        - `PERSONAL_PROJECTS`:  invited  to  join  another  user's personal  workspace
        - `GUEST_INVITE`:  invited  to  an  organization  as  a  guest (limited  permissions)
        - `ORG_INVITE`:  invited  to  an  organization  (as  a  full  member)
        - `SIGNUP`:  signed  up  via  the  website
        - `SIGNUP_GOOGLE_AUTH`:  signed  up  using  Google Authentication  (using  a  Google  email  account  for  their  loginid)
    - `creation_time`:  when  they  created  their  account
    - `last_session_creation_time`:   unix  timestamp  of  last  login
    - `opted_in_to_mailing_list`:  whether  they  have  opted  into  receiving marketing  emails
    - `enabled_for_marketing_drip`:  whether  they  are  on  the  regular marketing  email  drip
    - `org_id`:   the  organization  (group  of  user)  they  belong  to
    - `nvited_by_user_id`:   which  user  invited  them  to  join  (if  applicable).
    
<b></b>
    
2. A  usage  summary  table  ( "takehome_user_engagement_df" )  that  has  a  row  for  each  day that  a  user  logged  into  the  product.


# 1. Problem Definition

The task is to identify factors that predict future user adoption, defining an adopted user as one who has logged into the product on three separate days in at least one seven-day period. The instruction is to provide a brief writeup of the findings, along with any relevant summary tables, graphs, code, or queries, and to note any factors considered or investigations done, even if they did not lead to a result. Further research or valuable data can also be identified.

# 2. Data Collection

## Installations 

In [1]:
'''Install required packages'''
#!pip install ppscore
#!pip install pycaret==2.3.4
#!pip install chardet

'''Install compatibale vesion of scikit-learn'''
#!pip install -U scikit-learn
#!pip install scikit-learn==0.23.2

'Install compatibale vesion of scikit-learn'

## Import Libraries

In [2]:
#Fundamental libraries
import numpy as np 
import pandas as pd 

#Plot libraries
import seaborn as sns
import matplotlib.pyplot as plt

#Missing data vizualization libraries
import missingno as msno
#import ppscore as pps

# read data
import os 
import json


## Utility Functions

In [3]:
def json_url_to_df (url):
    '''Read json url to df'''

    import urllib.request

    response = urllib.request.urlopen(url)
    data = response.read().decode()
    json_data = json.loads(data)

    df = pd.DataFrame(json_data)

    return df

In [4]:
def json_file_to_df(directory, file_name):
  
    '''Read json url to df'''
    # Change directory one step back and save as the root directory
    root_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)
    print(root_dir)

    # Define the location of data directory
    path = root_dir + '\\data\\'

    # Set the file name
    json_data = path + 'ultimate_data_challenge.json'

    #Read JSON file into a dataframe: df
    df = pd.DataFrame(json_data)

    return df

In [5]:
def csv_url_to_df(url):
    '''Read csv file from url to df'''
    import requests
    import chardet

    response = requests.get(url)
    encoding = chardet.detect(response.content)['encoding']
    df = pd.read_csv(url, encoding=encoding)

    return df

## Read Data

In [6]:
## Read json url to df
#url = "https://raw.githubusercontent.com/faridjn/Springboard/master/Unit%2027%20-%20Interview%20Challanges/1.%20ultimate_challenge/data/ultimate_data_challenge.json"
#df = json_url_to_df(url)

In [7]:
## Read json file to df
#directory = '\\data\\'
#file_name = 'ultimate_data_challenge.json'
#df = json_file_to_df(directory, file_name)

In [8]:
#Read csv from url to df
url_1 = 'https://raw.githubusercontent.com/faridjn/Springboard/master/Unit%2027%20-%20Interview%20Challanges/2.%20relax_challenge/Data/takehome_user_engagement.csv'
url_2 = 'https://raw.githubusercontent.com/faridjn/Springboard/master/Unit%2027%20-%20Interview%20Challanges/2.%20relax_challenge/Data/takehome_users.csv'

engagement_df = csv_url_to_df(url_1)
user_df = csv_url_to_df(url_2)

# 3. Data Wrangling

## Utility functions

In [9]:
def describe_dataframe(df):
    print('Describe non-numeric columns:')
    display(df.describe(include = ['O', 'bool']).round(2).T)
    
    print('\nDescribe numeric columns:')
    display(df.describe().round(2).T)
    
    return None

In [10]:
#Missing data helper function
def count_missing(df):
    ''' Count the number of missing values .isnull() in each column well as the percentages 
    Call pd.concat() to form a single table df with 'count' and '%' columns'''
    
    print('\nMissing data stasts')
    missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    missing.columns=['count', '%']
    missing = missing.loc[missing['count'] > 0]
    missing.sort_values(by='count', inplace = True, ascending = False)
    
    return missing

## Data inspection and exploration

In [11]:
#Check size of the dataframe
print(engagement_df.shape)

#Check size of the dataframe
print(user_df.shape)

(207917, 3)
(12000, 10)


In [12]:
#Display top 10 rows of the df
display('engagement_df', engagement_df.head(10))
print('')
display('user', user_df.head(3).T)

'engagement_df'

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
5,2013-12-31 03:45:04,2,1
6,2014-01-08 03:45:04,2,1
7,2014-02-03 03:45:04,2,1
8,2014-02-08 03:45:04,2,1
9,2014-02-09 03:45:04,2,1





'user'

Unnamed: 0,0,1,2
object_id,1,2,3
creation_time,2014-04-22 03:53:30,2013-11-15 03:45:04,2013-03-19 23:14:52
name,Clausen August,Poole Matthew,Bottrill Mitchell
email,AugustCClausen@yahoo.com,MatthewPoole@gustr.com,MitchellBottrill@gustr.com
creation_source,GUEST_INVITE,ORG_INVITE,ORG_INVITE
last_session_creation_time,1398138810.0,1396237504.0,1363734892.0
opted_in_to_mailing_list,1,0,0
enabled_for_marketing_drip,0,0,0
org_id,11,1,94
invited_by_user_id,10803.0,316.0,1525.0


In [13]:
print('engagement_df', engagement_df.info(), '\n\n')
print('user', user_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB
engagement_df None 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null 

In [14]:
#describe_dataframe(engagement_df)

In [15]:
#describe_dataframe(user)

In [16]:
#Number of unique user
engagement_df['user_id'].nunique()

8823

In [17]:
# Unique objects and number of unique objects
print_unique = ['creation_source', 'opted_in_to_mailing_list', 'enabled_for_marketing_drip']
count_unique = ['object_id', 'name', 'email', 'invited_by_user_id']

In [18]:
for col in count_unique:
    print(col, ':', user_df[col].nunique())

object_id : 12000
name : 11355
email : 11980
invited_by_user_id : 2564


In [19]:
for col in print_unique:
    print(col, ':', list(user_df[col].unique()))

creation_source : ['GUEST_INVITE', 'ORG_INVITE', 'SIGNUP', 'PERSONAL_PROJECTS', 'SIGNUP_GOOGLE_AUTH']
opted_in_to_mailing_list : [1, 0]
enabled_for_marketing_drip : [0, 1]


## Data cleaning

In [20]:
#Drop personal information that are not needed
user_df.drop(columns = ['name', 'email'], inplace=True)

In [21]:
engagement_df.dtypes

time_stamp    object
user_id        int64
visited        int64
dtype: object

In [22]:
user_df.dtypes

object_id                       int64
creation_time                  object
creation_source                object
last_session_creation_time    float64
opted_in_to_mailing_list        int64
enabled_for_marketing_drip      int64
org_id                          int64
invited_by_user_id            float64
dtype: object

## Handling of missing data

In [23]:
# missing data stats
count_missing(user_df)


Missing data stasts


Unnamed: 0,count,%
invited_by_user_id,5583,46.525
last_session_creation_time,3177,26.475


In [24]:
# Replace NaN values with a default value, e.g. -1
user_df['last_session_creation_time'].fillna(0, inplace=True)

In [25]:
# Replace NaN values with a default value 0
user_df['invited_by_user_id'].fillna(-1, inplace=True)

In [26]:
# missing data stats
count_missing(user_df)


Missing data stasts


Unnamed: 0,count,%


In [27]:
# missing data stats
count_missing(engagement_df)


Missing data stasts


Unnamed: 0,count,%


## Feature engineering

### Join dataframes

In [31]:
df = pd.merge(user_df, engagement_df, left_on='object_id', right_on='user_id')

### Date formatting

In [32]:
# Set datetime formt used in the dataset
datetime_format = '%Y-%m-%d %H:%M:%S'

#Change `date_columns` coluumn data type to `datetime`
engagement_df['time_stamp'] = pd.to_datetime(engagement_df['time_stamp'], format=datetime_format, errors="raise")
df['creation_time'] = pd.to_datetime(df['creation_time'], format=datetime_format, errors="raise")
df['last_session_creation_time'] = pd.to_datetime(df['last_session_creation_time'], unit='s')

### Add `adopted_use` feature

In [33]:
def adopted_user_checker(df, days=7, logins=3):
    '''Define a function to see if a user logged in on 3 seperate days in a 7-day period.'''
    #imports
    from datetime import timedelta
    
    #extract date
    df['time_stamp'] = pd.to_datetime(df['time_stamp'])
    df['date'] = df['time_stamp'].dt.date
    
    # first drop duplicate days and sort by day
    df = df.drop_duplicates(subset='date').sort_values('date')
    
    # calculate how many days has passed for every 3 logins
    passed_days = df['date'].diff(periods=logins-1)
    
    # check if any passed time is less than 7 days
    return any(passed_days <= timedelta(days=days))

In [34]:
# run the function on all users
adopted = engagement_df.groupby('user_id').apply(adopted_user_checker)

In [35]:
#pass series to a dataframe and name `adapter_user`
adopted_df = pd.DataFrame(adopted)
adopted_df.columns = ['adapted_user']
adopted_df.head()

Unnamed: 0_level_0,adapted_user
user_id,Unnamed: 1_level_1
1,False
2,True
3,False
4,False
5,False


In [36]:
print('count adopted users :', adopted_df.sum()[0])
print('count total users :', len(adopted_df))
print(f'adopttion ratio : {100*adopted_df.sum()[0]/len(adopted_df):.2f}%')

count adopted users : 1656
count total users : 8823
adopttion ratio : 18.77%


### Create new feature as `by_invitation`

In [37]:
# Convert float column A to integer type
df['invited_by_user_id'] = df['invited_by_user_id'].astype(int)

In [38]:
# create a new column 'by_invitation' based on the values of the 'invited_by_user_id' column
df['by_invitation'] = np.where(df['invited_by_user_id'] == -1, 0, 1)

### New feature to count activity days

In [39]:
#Find the max date value in last session
last_date = max(df['last_session_creation_time'])

df['days_since_creation'] = (last_date - df['creation_time']).dt.days
df['days_last_session'] = (last_date - df['last_session_creation_time']).dt.days

### Drop extra columns

In [40]:
df.drop(columns=['creation_time', 'last_session_creation_time', time_stamp], inplace=True)

In [41]:
df.sample(1).T

Unnamed: 0,101212
object_id,5516
creation_source,ORG_INVITE
opted_in_to_mailing_list,1
enabled_for_marketing_drip,0
org_id,91
invited_by_user_id,5516
time_stamp,2012-09-12 21:10:29
user_id,5516
visited,1
by_invitation,1


# 4. Exploratory Data Analysis (EDA):

### Plot referrals by users

In [None]:
#Count referrals by users
referred_by_df = pd.DataFrame(df['invited_by_user_id'].value_counts())
referred_by_df = referred_by_df.reset_index()
referred_by_df.columns = ['user_id', 'conut']

In [None]:
print('\nNumber of users joined without referrals:')
referred_by_df.loc[[0]]

In [None]:
#drop users without referrals
user_referral_df_plot = referred_by_df.drop([0], axis=0)
user_referral_df_plot.head()

In [None]:
plt.figure(figsize=(10,4))

sns.scatterplot(data=user_referral_df_plot,
                x=user_referral_df_plot.index,
                y='conut')

plt.xlabel('referring user')
plt.ylabel('user count')
plt.show()

### Plot count of users per organization 

In [None]:
#Count users in organizations
org_id_df = pd.DataFrame(df['org_id'].value_counts())
org_id_df = org_id_df.reset_index()
org_id_df.columns = ['org_id', 'conut']

In [None]:
plt.figure(figsize=(10,4))

sns.scatterplot(data=org_id_df,
                x=org_id_df.index,
                y='conut')

plt.xlabel('org_id')
plt.ylabel('count')
plt.show()

Define Categorical vs numerical features

In [None]:
#Define categrical and numerical data
num_columns = ['trips_in_first_30_days', 'avg_rating_of_driver', 'avg_rating_by_driver',
             'avg_surge', 'surge_pct', 'weekday_pct',  'avg_dist',  
             'since_signup_date']

#Seperate categorical data
cat_columns = ['city_Astapor', "city_King's Landing", 'city_Winterfell',  'ultimate_black_user']

## Categorical Features

### Stats

In [None]:
#create a pivot table for categorical columns
dfg_cat = pd.DataFrame(engagement_dfhat.groupby('active')[cat_columns].sum()).reset_index()
display(dfg_cat)

# metlt the pivot table to plotable features
dfg_cat_melt = pd.melt(dfg_cat, id_vars = ['active'], var_name='Feature', value_name = 'Count')
display(dfg_cat_melt)

### Plots

In [None]:
# Set the hue for the 'active' column
hue_order = [True, False]

#Plot the `dfg_melt`
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(data=dfg_cat_melt, y='Feature', x='Count', hue = 'active', hue_order=hue_order)
plt.title('Categorical Features')
plt.show()

## Numerical Features

### Stats

In [None]:
#seperate active and disactive
df_active_num = engagement_dfhat[num_columns].loc[engagement_dfhat['active'] == 1]
df_disactive_num = engagement_dfhat[num_columns].loc[~engagement_dfhat['active'] == 1]

In [None]:
#Calcualte stats 
#Active 
df_active_describe= df_active_num.describe().loc[['count', 'mean', 'std']].T
df_active_describe['cv'] = df_active_describe['std']/df_active_describe['mean']
df_active_describe['active'] = 1

#Disactive
df_disactive_describe= df_disactive_num.describe().loc[['count', 'mean', 'std']].T
df_disactive_describe['cv'] = df_disactive_describe['std']/df_disactive_describe['mean']
df_disactive_describe['active'] = 0

In [None]:
#Concat stat tables
df_num_describe = pd.concat([df_active_describe,df_disactive_describe],axis = 0)

display(df_num_describe)

### Plots

In [None]:
#Plot histogram of all features
engagement_dfhat.hist(figsize=(12,12), bins = 12)
plt.subplots_adjust(hspace=0.5)

In [None]:
# Set the hue for the 'active' column
hue_order = [True, False]

#Plot the stats
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

#plot mean values
sns.barplot(data = df_num_describe,
            y = df_num_describe.index,
            x = 'mean',
            hue = 'active',
            hue_order = hue_order,
            ax=axes[0])
axes[0].set_title('Mean Values')

#plot cv values
sns.barplot(data = df_num_describe,
            y = df_num_describe.index,
            x = 'cv',
            hue = 'active',
            hue_order = hue_order,
            ax=axes[1])
axes[1].set_title('Coefincent of Variance (CV)')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(7,9))

sns.boxplot(data = engagement_dfhat,
            orient = 'h',
            width=0.8,
            palette='crest',
            linewidth= 1,
            sym = '')
plt.show()

In [None]:
df_plot = engagement_dfhat.sample(100)

# Set the style of the plots
sns.set(style="ticks", color_codes=True)

# Set the hue for the 'active' column
hue_order = [True, False]

# Plot histograms of numerical columns
g = sns.pairplot(df_plot, diag_kind="kde", hue='active', vars = num_columns, hue_order=hue_order)
plt.show()

## Multivariate Analysis

In [None]:
def plot_corr_matrix (df, round_vals, mask = True):
    '''This function plots Correlation matrix'''
    
    # Compute the correlation matrix
    corr = df.corr()
        
    # Generate a mask for the upper triangle
    if mask:
        mask = np.triu(np.ones_like(corr, dtype=bool))
    
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(20, 9))

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr.round(round_vals), mask=mask, cmap='coolwarm', vmin = -1, vmax=1, center=0, annot=True,
                square=True, linewidths=.5, cbar_kws={"shrink": .5}).set(title='Pearson Correlation Matrix')

    plt.show()

In [None]:
#Plot Corr matrix
plot_corr_matrix(df=engagement_dfhat, round_vals=2, mask = True)

In [None]:
def plot_pps_matrix(df, round_vals=2, mask = True):
    '''This function gets a df and plot PPS score matrix'''
    
    # Compute the PPS matrix
    matrix = pps.matrix(df)

    #Plot PPS
    matrix_pps = matrix[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

    # Generate a mask for the upper triangle
    if mask:
        mask = np.triu(np.ones_like(matrix_pps, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(20, 9))

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(matrix_pps.round(round_vals), mask = mask, cmap="Blues", vmin = 0, vmax=1, center=0.5,
                square=True, linewidths=.5,annot=True, cbar_kws={"shrink": .5}).set(title='PPS Matrix')
    plt.show()


In [None]:
    #Plot PPS
plot_pps_matrix(df=engagement_dfhat, round_vals=2, mask=True)

# 5. Model Building

## Import PyCaret libraries

In [None]:
from pycaret.classification import *

# check version
from pycaret.utils import version
version()

## Initialize Setup

In [None]:
#set pointer to engagement_dfhat
data = engagement_dfhat
data.head().T

In [None]:
#Setup Pycaret Regression session
#Transform dataset, normalize and split the dataset.
#Log experiments and plots for experiments to be viewed later with MLflow. 

clf1 = setup(data=data,
             target = 'active',
             session_id=123,
             log_experiment=True,
             transformation=True,
             train_size=0.7,
             categorical_features= cat_columns,
             log_plots=True)


## Compare Models

In [None]:
best_model = compare_models()

In [None]:
models()

In [None]:
models(type='ensemble').index.tolist()

## Hyper-parameterization

## Ensemble Model

## Evalute models

# 6. Model Deployment

# 7. Communication of Results