# Data Analysis Interview Challenge

## Part 3 ‐ Predictive modeling

Ultimate Technologies Inc. is a transportation network company that has disrupted the taxi and logistics industry and is considered a prestigious company to work for. This challenge has been adapted from an actual Ultimate Inc. data science challenge.


Ultimate is interested in predicting rider retention. To help explore this question, we have provided a sample dataset of a cohort of users who signed up for an Ultimate account in January 2014. The data was pulled several months later; we consider a user retained if they were “active” (i.e. took a trip) in the preceding 30 days.

We would like you to use this data set to help understand what factors are the best predictors for retention, and offer suggestions to operationalize those insights to help Ultimate.

The data is in the attached file `ultimate_data_challenge.json`. See below for a detailed description of the dataset. Please include any code you wrote for the analysis and delete the dataset when you have finished with the challenge.

1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?
2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.
3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long term rider retention (again, a few sentences will suffice).


### Data description

- `city`: city this user signed up in
- `phone`: primary device for this user
- `signup_date`: date of account registration; in the form ‘YYYYMMDD’
- `last_trip_date`: the last time this user completed a trip; in the form ‘YYYYMMDD’
- `avg_dist`: the average distance in miles per trip taken in the first 30 days after signup
- `avg_rating_by_driver`: the rider’s average rating over all of their trips
- `avg_rating_of_driver`: the rider’s average rating of their drivers over all of their trips
- `surge_pct`: the percent of trips taken with surge multiplier > 1
- `avg_surge`: The average surge multiplier over all of this user’s trips
- `trips_in_first_30_days`: the number of trips this user took in the first 30 days after signing up
- `ultimate_black_user`: TRUE if the user took an Ultimate Black in their first 30 days; FALSE otherwise
- `weekday_pct`: the percent of the user’s trips occurring during a weekday

## 1. Problem Definition

Ultimate Technologies Inc., a transportation network company, wants to predict rider retention from a sample of users who signed up for an Ultimate account in January 2014. The objective is to identify the factors that best predict retention and to suggest how Ultimate can operationalize those insights. The work involves data cleaning, exploratory analysis, and a predictive model that determines whether a user will be active in their 6th month on the system. The solution should discuss the chosen approach, the alternatives considered, and key indicators of model performance, and close with a brief discussion of how Ultimate can use the model's insights to improve its long-term rider retention.

## 2. Data Collection

### Import Libraries

In [1]:
#Fundamental libraries
import numpy as np 
import pandas as pd 

#Plot libraries
import seaborn as sns
import matplotlib.pyplot as plt

#Predictive Power Score (PPS) library
import ppscore as pps

#File path handling
import os 

### Read Data

In [None]:
# Use the parent of the current working directory as the root directory
root_dir = os.path.normpath(os.path.join(os.getcwd(), os.pardir))

# Define the location of the data directory (portable across operating systems)
path = os.path.join(root_dir, 'data')


In [None]:
# Set the file name
file_path = os.path.join(path, 'ultimate_data_challenge.json')

#Read JSON file into a dataframe: df
df = pd.read_json(file_path)

## 3. Data Wrangling

### Utility functions

In [None]:
def describe_dataframe(df):
    print('Describe non-numeric columns:')
    display(df.describe(include = ['O', 'bool']).round(2).T)
    
    print('\nDescribe numeric columns:')
    display(df.describe().round(2).T)
    
    return None

In [None]:
#Missing data helper function
def count_missing(df):
    '''Count the missing values (.isnull()) in each column as well as their percentages.
    Call pd.concat() to form a single table with 'count' and '%' columns.'''
    
    print('\nMissing data stats')
    missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    missing.columns = ['count', '%']
    missing = missing.loc[missing['count'] > 0]
    # Sort without inplace=True to avoid modifying a slice of another frame
    missing = missing.sort_values(by='count', ascending=False)
    
    return missing

### Data inspection and exploration

In [None]:
#Check size of the dataframe
print(df.shape)

In [None]:
#Display top 10 rows of the df
display(df.head(10).T)

In [None]:
print(df.info())

In [None]:
describe_dataframe(df)

### Data cleaning

In [None]:
#No further cleaning is required

### Handling of missing data

In [None]:
# missing data stats
count_missing(df)

In [None]:
#drop 'phone' column
df.drop('phone', axis=1, inplace=True)

In [None]:
#find the median value and replace missing values
median_1 = df['avg_rating_of_driver'].median()
df['avg_rating_of_driver'] = df['avg_rating_of_driver'].fillna(median_1)

In [None]:
#find the median value and replace missing values
median_2 = df['avg_rating_by_driver'].median()
df['avg_rating_by_driver'] = df['avg_rating_by_driver'].fillna(median_2)

In [None]:
count_missing(df)
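
As a quick sanity check of the imputation step, the sketch below fills each rating column with its own median on a small made-up frame. The frame `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rating columns with gaps
toy = pd.DataFrame({
    'avg_rating_of_driver': [5.0, 4.0, np.nan, 3.0],
    'avg_rating_by_driver': [4.8, np.nan, 4.6, 5.0],
})

# Median of each column (median() skips NaN by default)
rating_columns = ['avg_rating_of_driver', 'avg_rating_by_driver']
medians = toy[rating_columns].median()

# fillna with a Series matches values to columns by name
toy[rating_columns] = toy[rating_columns].fillna(medians)

print(medians)
print(toy)
```

Passing a Series of medians to `fillna` keeps each column paired with its own statistic, which avoids filling one column with another column's median by accident.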

### Transformation and formatting

In [None]:
# Set datetime format used in the dataset
datetime_format = '%Y-%m-%d'

#create a list of datetime columns
date_columns = ['signup_date', 'last_trip_date']

#Change `date_columns` column data type to `datetime`
for column in date_columns:
    df[column] = pd.to_datetime(df[column], format=datetime_format, errors="raise")

In [None]:
#Check data type of datetime columns
df[date_columns].dtypes

### Feature engineering

In [None]:
#Replace True/False for ultimate_black_user with 1 and 0
df['ultimate_black_user'] = df['ultimate_black_user'].astype(int)

In [None]:
#Get the last date in data
last_date = df['last_trip_date'].max()

# Define cut-off date as 30 days before that date
threshold_date = last_date - pd.Timedelta(days=30)

# create the 'active' column based on the 'last_trip_date' column and the threshold date
df['active'] = df['last_trip_date'] > threshold_date

In [None]:
# add a new column with the number of days between sign-up and the last date in the data
df['since_signup_date'] = (last_date - df['signup_date']).dt.days

#drop signup date and last_trip_date
df.drop('signup_date', axis=1, inplace=True)
df.drop('last_trip_date', axis=1, inplace=True)

In [None]:
retention_rate = 100 * df['active'].mean()
print(f'Rider retention rate is {retention_rate:0.2f}%')
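
The labeling rule can be checked by hand on a few made-up dates. This is a minimal sketch: the frame `toy` and its dates are hypothetical, and the latest trip in the data is assumed to stand in for the date the data was pulled.

```python
import pandas as pd

# Hypothetical last-trip dates for four riders
toy = pd.DataFrame({
    'last_trip_date': pd.to_datetime(
        ['2014-07-01', '2014-06-02', '2014-06-01', '2014-03-15'])
})

# The latest trip in the data approximates the data pull date
last_date = toy['last_trip_date'].max()
threshold_date = last_date - pd.Timedelta(days=30)

# Retained means the last trip is strictly after the cut-off date
toy['active'] = toy['last_trip_date'] > threshold_date
retention_rate = 100 * toy['active'].mean()

print(threshold_date.date())
print(toy)
print(f'Rider retention rate is {retention_rate:0.2f}%')
```

With a strict `>`, a rider whose last trip falls exactly on the cut-off date is counted as not retained; switching to `>=` would include that boundary day.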

In [None]:
df_1hat = pd.get_dummies(df) 

In [None]:
df_1hat.head(10).T

## 4. Exploratory Data Analysis (EDA)

Define categorical vs. numerical features

In [None]:
#Define categorical and numerical data
num_columns = ['trips_in_first_30_days', 'avg_rating_of_driver', 'avg_rating_by_driver',
             'avg_surge', 'surge_pct', 'weekday_pct',  'avg_dist',  
             'since_signup_date']

#Separate categorical data
cat_columns = ['city_Astapor', "city_King's Landing", 'city_Winterfell',  'ultimate_black_user']

### Categorical Features

#### Stats

In [None]:
#create a pivot table for categorical columns
dfg_cat = pd.DataFrame(df_1hat.groupby('active')[cat_columns].sum()).reset_index()
display(dfg_cat)

# melt the pivot table into a long format suitable for plotting
dfg_cat_melt = pd.melt(dfg_cat, id_vars = ['active'], var_name='Feature', value_name = 'Count')
display(dfg_cat_melt)

#### Plots

In [None]:
# Set the hue for the 'active' column
hue_order = [True, False]

#Plot the `dfg_cat_melt`
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(data=dfg_cat_melt, y='Feature', x='Count', hue = 'active', hue_order=hue_order)
plt.title('Categorical Features')
plt.show()

### Numerical Features

#### Stats

In [None]:
#separate active and inactive users
df_active_num = df_1hat.loc[df_1hat['active'] == 1, num_columns]
df_disactive_num = df_1hat.loc[df_1hat['active'] != 1, num_columns]

In [None]:
#Calculate stats 
#Active 
df_active_describe= df_active_num.describe().loc[['count', 'mean', 'std']].T
df_active_describe['cv'] = df_active_describe['std']/df_active_describe['mean']
df_active_describe['active'] = 1

#Inactive
df_disactive_describe= df_disactive_num.describe().loc[['count', 'mean', 'std']].T
df_disactive_describe['cv'] = df_disactive_describe['std']/df_disactive_describe['mean']
df_disactive_describe['active'] = 0

In [None]:
#Concat stat tables
df_num_describe = pd.concat([df_active_describe,df_disactive_describe],axis = 0)

display(df_num_describe)
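
The same per-group statistics can also be sketched with a single `groupby`, which avoids building two separate tables. The frame `toy` below is a hypothetical stand-in for `df_1hat` with one numeric column; the coefficient of variation is computed as `std / mean`, as in the cells above.

```python
import pandas as pd

# Hypothetical toy data standing in for df_1hat
toy = pd.DataFrame({
    'active':   [True, True, False, False],
    'avg_dist': [2.0, 4.0, 5.0, 15.0],
})

# Count, mean and sample standard deviation per group
stats = toy.groupby('active')['avg_dist'].agg(['count', 'mean', 'std'])

# Coefficient of variation: spread relative to the mean
stats['cv'] = stats['std'] / stats['mean']

print(stats)
```

A higher `cv` for one group means its values are more dispersed relative to their mean, which is easier to compare across features than the raw standard deviation.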

#### Plots

In [None]:
#Plot histogram of all features
df_1hat.hist(figsize=(12,12), bins = 12)
plt.subplots_adjust(hspace=0.5)

In [None]:
# Set the hue for the 'active' column
hue_order = [True, False]

#Plot the stats
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

#plot mean values
sns.barplot(data = df_num_describe,
            y = df_num_describe.index,
            x = 'mean',
            hue = 'active',
            hue_order = hue_order,
            ax=axes[0])
axes[0].set_title('Mean Values')

#plot cv values
sns.barplot(data = df_num_describe,
            y = df_num_describe.index,
            x = 'cv',
            hue = 'active',
            hue_order = hue_order,
            ax=axes[1])
axes[1].set_title('Coefficient of Variation (CV)')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(7,9))

sns.boxplot(data = df_1hat,
            orient = 'h',
            width=0.8,
            palette='crest',
            linewidth= 1,
            showfliers = False)  # hide outlier markers
plt.show()

In [None]:
# Fixed seed so the sampled pair plot is reproducible
df_plot = df_1hat.sample(100, random_state=123)

# Set the style of the plots
sns.set(style="ticks", color_codes=True)

# Set the hue for the 'active' column
hue_order = [True, False]

# Plot pairwise relationships of numerical columns (KDE on the diagonal)
g = sns.pairplot(df_plot, diag_kind="kde", hue='active', vars = num_columns, hue_order=hue_order)
plt.show()

### Multivariate Analysis

In [None]:
def plot_corr_matrix(df, round_vals, mask=True):
    '''Plot the Pearson correlation matrix of df as an annotated heatmap.'''
    
    # Compute the correlation matrix
    corr = df.corr()
        
    # Generate a mask for the upper triangle, or no mask at all
    mask = np.triu(np.ones_like(corr, dtype=bool)) if mask else None
    
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(25, 12))

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr.round(round_vals), mask=mask, cmap='coolwarm', vmin = -1, vmax=1, center=0, annot=True,
                square=True, linewidths=.5, cbar_kws={"shrink": .5}).set(title='Pearson Correlation Matrix')

    plt.show()

In [None]:
#Plot Corr matrix
plot_corr_matrix(df_1hat, 2, False)

In [None]:
def plot_pps_matrix(df, round_vals, mask=True):
    '''Compute the PPS matrix of df and plot it as an annotated heatmap.'''
    
    # Compute the PPS matrix
    matrix = pps.matrix(df)

    # Pivot the long-format scores into a square matrix
    matrix_pps = matrix[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

    # Generate a mask for the upper triangle, or no mask at all
    mask = np.triu(np.ones_like(matrix_pps, dtype=bool)) if mask else None

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(25, 12))

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(matrix_pps.round(round_vals), mask = mask, cmap="Blues", vmin = 0, vmax=1, center=0.5,
                square=True, linewidths=.5,annot=True, cbar_kws={"shrink": .5}).set(title='PPS Matrix')
    plt.show()


In [None]:
#Plot PPS
plot_pps_matrix(df=df_1hat, round_vals=2, mask=False)

## 5. Model Building

In [None]:
from pycaret.classification import *

# check version
from pycaret.utils import version
version()

### Initialize Setup

In [None]:
data = df_1hat
data.head().T

In [None]:
data.dtypes

In [None]:
clf1 = setup(data=data,
             target = 'active',
             session_id=123,
             log_experiment=True,
             transformation=True,
             train_size=0.7,
             categorical_features= cat_columns,
             log_plots=True)
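
The notebook stops at the setup step, so no model is trained or evaluated here. As a hedged illustration of what a baseline and its key performance indicators could look like, the sketch below fits a logistic regression with scikit-learn directly (a swapped-in library, not PyCaret) on synthetic data. The arrays `X` and `y`, the coefficients used to generate them, and the stratified split are all assumptions made for the example; they are not the challenge data or the author's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the 'active' target
rng = np.random.default_rng(123)
X = rng.normal(size=(1000, 3))
y = (2.0 * X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# 70/30 split mirroring train_size=0.7; stratifying is an added assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=123)

# Baseline classifier with default regularization
model = LogisticRegression().fit(X_train, y_train)

# Accuracy uses hard labels; ROC AUC uses predicted probabilities
accuracy = accuracy_score(y_test, model.predict(X_test))
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f'accuracy={accuracy:.3f}, roc_auc={roc_auc:.3f}')
```

Reporting ROC AUC alongside accuracy is useful when the retained class is the minority, because accuracy alone can look good for a model that mostly predicts "not retained".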

## 6. Model Deployment

## 7. Communication of Results