In [1]:
import pandas as pd
import numpy as np

# #1. Build a model that predicts conversion rate

### Explore the data

In [2]:
df = pd.read_csv('data/conversion_data.csv')

In [3]:
df.head()

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
0,UK,25,1,Ads,1,0
1,US,23,1,Seo,5,0
2,US,28,1,Seo,4,0
3,China,39,1,Seo,5,0
4,US,30,1,Seo,6,0


In [39]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, precision_recall_curve, auc
import math

### Data processing: OHE

In [6]:
country_dummy = pd.get_dummies(df['country'],prefix='cnty')
source_dummy = pd.get_dummies(df['source'],prefix='sc')
df = pd.concat([df, country_dummy, source_dummy], axis = 1)
df = df.drop(['country', 'source'], axis = 1)

In [7]:
df.head()

Unnamed: 0,age,new_user,total_pages_visited,converted,cnty_China,cnty_Germany,cnty_UK,cnty_US,sc_Ads,sc_Direct,sc_Seo
0,25,1,1,0,0,0,1,0,1,0,0
1,23,1,5,0,0,0,0,1,0,0,1
2,28,1,4,0,0,0,0,1,0,0,1
3,39,1,5,0,1,0,0,0,0,0,1
4,30,1,6,0,0,0,0,1,0,0,1


### Split the data

In [8]:
X_columns = list(df.columns)
X_columns.remove('converted')
y_column = 'converted'

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df[X_columns], df[y_column], test_size=0.2, random_state=0)

### Try two candidate methods: logistic regression and RandomForest

In [25]:
lr = LogisticRegression()
rf = RandomForestClassifier(n_estimators=100, max_depth=15,oob_score=True, random_state=812)

### Logistic regression

In [54]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train) 
X_test  = scaler.transform(X_test) 
lr.fit(X_train, y_train)
y_pred_prob = lr.predict_proba(X_test)[:,1]
y_pred = lr.predict(X_test)
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_pred_prob)
lr_auc_roc = roc_auc_score(y_test, y_pred_prob)
lr_acc = accuracy_score(y_test, y_pred)
lr_auc_pr = auc(recall, precision)
lr_f1 = f1_score(y_test, y_pred)
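
The scaling and fitting steps in this cell overwrite `X_train` and `X_test` in place. One possible alternative, shown here only as a minimal sketch on synthetic data, is to bundle the scaler and the classifier in a scikit-learn pipeline so the raw feature matrices stay untouched. The names `X_demo`, `y_demo`, `lr_pipe` and `demo_f1` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the conversion data (assumption: any numeric feature matrix works).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fitted on the training split only, then reused when predicting.
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression())
lr_pipe.fit(Xtr, ytr)

demo_proba = lr_pipe.predict_proba(Xte)[:, 1]
demo_f1 = f1_score(yte, lr_pipe.predict(Xte))
print('f1 score on synthetic data:', demo_f1)
```

With this layout the same unscaled `X_train` can be passed to both the pipeline and the random forest.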

In [30]:
rf.fit(X_train, y_train)
y_pred_prob = rf.predict_proba(X_test)[:,1]
y_pred = rf.predict(X_test)
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_pred_prob)
rf_auc_roc = roc_auc_score(y_test, y_pred_prob)
rf_acc = accuracy_score(y_test, y_pred)
rf_auc_pr = auc(recall, precision)
rf_f1 = f1_score(y_test, y_pred)

In [55]:
print('logistic regression: \n auc_pr: ', lr_auc_pr, 'f1 score: ', lr_f1)
print('randomforest: \n auc_pr: ', rf_auc_pr, 'f1 score: ', rf_f1)

logistic regression: 
 auc_pr:  0.8445276914162049 f1 score:  0.7611140867702197
randomforest: 
 auc_pr:  0.8041779743862764 f1 score:  0.7393364928909952


Logistic regression outperforms the random forest on the test set on both PR AUC (0.845 vs. 0.804) and F1 score (0.761 vs. 0.739), so it is used as the model.

---

# #2. Using the model, determine which features are important in predicting conversion.

Logistic regression is a linear model, so the magnitude of each coefficient, fitted on the standardized features, can be used to rank feature importance (assuming the features are not strongly correlated).

In [89]:
# Pair each feature name with its fitted coefficient, then rank by absolute value
feature_coef = pd.concat([pd.Series(X_columns, name='features'), pd.Series(lr.coef_[0], name='coef')], axis=1)
feature_coef['abs_coef'] = feature_coef['coef'].abs()

feature_importance = feature_coef.sort_values('abs_coef', ascending=False)

feature_importance

Unnamed: 0,features,coef,abs_coef
2,total_pages_visited,2.517384,2.517384
3,cnty_China,-1.019989,1.019989
1,new_user,-0.782222,0.782222
0,age,-0.628436,0.628436
5,cnty_UK,0.445047,0.445047
6,cnty_US,0.438446,0.438446
4,cnty_Germany,0.296256,0.296256
8,sc_Direct,-0.063961,0.063961
7,sc_Ads,0.036519,0.036519
9,sc_Seo,0.02096,0.02096


Based on the coefficients, the top 3 features for the model are total_pages_visited (positive impact), whether the user is from China (negative impact), and whether the user is a new user, i.e. new_user = 1 (negative impact).

It is worth mentioning that the coefficient-based approach above only applies to linear models. Some non-linear models expose their own importance attribute, such as `feature_importances_` in a random forest. Other packages, such as SHAP, can also be used to explain feature importance for non-linear models.
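
As a rough illustration of the random forest case, the sketch below ranks features by the `feature_importances_` attribute on synthetic data. The data, the feature names `f0` to `f4`, and the variables `rf_demo` and `ranking` are assumptions made for the example, not results from the conversion dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first two columns are the informative ones.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f'f{i}' for i in range(X_demo.shape[1])]  # hypothetical feature names

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances; they are non-negative and sum to 1.
importances = rf_demo.feature_importances_
ranking = sorted(zip(names, importances), key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(name, round(float(score), 3))
```

Unlike logistic regression coefficients, these importances carry no sign, so they show how much a feature matters but not the direction of its effect.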

# #3. Come up with some recommendations of experiments you might run or changes you might make to the product team and to the marketing team

I have two recommendations.
1. Run an experiment that splits the traffic based on the classification model. The control group should consist of random users from the traffic. The test group should consist of users with a predicted score of 0.5 or higher (the threshold could be set above 0.5 depending on traffic volume and specific product needs). This experiment would verify the performance of the model in production.
2. For the product team, build prompting banners or conversion triggers for higher-scored users to help them reach the conversion page more easily.
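
The group assignment described in the first recommendation could be sketched as follows. This is only one possible reading of the design: `assign_groups`, `control_fraction` and `seed` are hypothetical names, and drawing the control group first and then taking high-scoring users from the remainder is an assumption made to keep the two groups disjoint.

```python
import random

def assign_groups(scores, threshold=0.5, control_fraction=0.5, seed=0):
    """Split users into a random control group and a high-score test group.

    scores: dict mapping user id -> predicted conversion probability.
    """
    rng = random.Random(seed)
    user_ids = sorted(scores)
    n_control = int(len(user_ids) * control_fraction)
    # Control: a random sample of all traffic, regardless of score.
    control = set(rng.sample(user_ids, n_control))
    # Test: remaining users whose predicted score reaches the threshold.
    test = {u for u in user_ids if u not in control and scores[u] >= threshold}
    return control, test

# Toy scores for ten hypothetical users.
demo_scores = {f'u{i}': i / 10 for i in range(10)}
control, test = assign_groups(demo_scores, threshold=0.5, control_fraction=0.3, seed=1)
print(len(control), len(test))
```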

# #4. Conversion rate testing

Based on the statement, the z-value for both test A vs. control and test B vs. control is about 2.576 (assuming a two-sided test).

The z-value can be expressed as:

$$Z = \frac{CR_t - CR_c}{SE} = \frac{CR_t - CR_c}{\sqrt{SE_t^2 + SE_c^2}}$$

where $CR_t$ and $CR_c$ are the conversion rates of the test group and the control group, and $SE_t$ and $SE_c$ are their standard errors.

Assuming the conversion rate and distribution of the holdout are the same in both tests, and the sample sizes of the test group and the holdout group are the same, we get:

$$\text{Test A: } 2.576 = \frac{3.5\%}{SE_a} \;\Rightarrow\; SE_a \approx 1.36\%$$

$$\text{Test B: } 2.576 = \frac{2.5\%}{SE_b} \;\Rightarrow\; SE_b \approx 0.97\%$$

Because the control terms are equal ($SE_{c,a} = SE_{c,b}$), they cancel:

$$SE_a^2 - SE_b^2 = \left(SE_{t,a}^2 + SE_{c,a}^2\right) - \left(SE_{t,b}^2 + SE_{c,b}^2\right) = SE_{t,a}^2 - SE_{t,b}^2 \approx 9.0 \times 10^{-5}$$

$$SE_{a,b} \approx 1.5\%$$

To determine whether the lift from the old methodology to the new methodology is significant, we need to calculate the z-value:

$$z = \frac{CR_{t,a} - CR_{t,b}}{SE_{a,b}} = \frac{1.0\%}{SE_{a,b}} \approx 0.667$$

The corresponding two-sided p-value is about 0.5. The lift is **not significant**.
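
The arithmetic above can be checked with a few lines of standard-library Python. This is a sketch under the same assumptions as the text: the lifts of 3.5% and 2.5%, the critical value 2.576, and the standard error of about 1.5% for the A vs. B comparison are all taken from the text as given; the function name `two_sided_p` is introduced here for illustration.

```python
import math

Z_CRIT = 2.576                 # two-sided critical value used in the text
lift_a, lift_b = 0.035, 0.025  # lifts of test A and test B over control

# Standard errors implied by each test sitting exactly at the critical value.
se_a = lift_a / Z_CRIT
se_b = lift_b / Z_CRIT
var_gap = se_a ** 2 - se_b ** 2  # equals SE_t_a^2 - SE_t_b^2 when the control terms cancel

def two_sided_p(z):
    """Two-sided p-value of a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

se_ab = 0.015                  # standard error of the A vs. B difference, as stated in the text
z_ab = (lift_a - lift_b) / se_ab
p_ab = two_sided_p(z_ab)

print(f'SE_a={se_a:.4%}  SE_b={se_b:.4%}  gap={var_gap:.2e}')
print(f'z={z_ab:.3f}  p={p_ab:.3f}')
```

A p-value this far above common significance levels supports the conclusion that the difference between the two methodologies is not significant.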


# #5.

I would build an automated retraining process that is triggered either when a performance metric, such as F1 score or PR AUC, drops below a preset threshold, or when a maximum period of time, such as 6 months, has elapsed since the last training.
Once retraining is triggered, the process should re-fetch the data, retrain the model, and evaluate it automatically. The final performance metrics should be reported for review. If the retrained model performs well, it should be deployed automatically. Otherwise, further model improvement needs to be considered.
The retraining can be orchestrated in several ways. The one I am familiar with is a DAG on AWS, where several tasks are linked and run sequentially.
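
The two trigger conditions described above could look roughly like this. It is a minimal sketch rather than a production design: `should_retrain`, its parameters, the threshold of 0.75 and the 182-day stand-in for "6 months" are all illustrative assumptions.

```python
from datetime import datetime, timedelta

def should_retrain(current_f1, f1_threshold, last_trained, now,
                   max_age=timedelta(days=182)):
    """Return (trigger, reason) for the retraining check.

    Retraining is triggered when the monitored metric drops below the
    threshold, or when the model is older than max_age.
    """
    if current_f1 < f1_threshold:
        return True, 'metric_below_threshold'
    if now - last_trained >= max_age:
        return True, 'max_age_reached'
    return False, 'ok'

# Example check with made-up monitoring values.
now = datetime(2024, 1, 1)
print(should_retrain(0.70, 0.75, datetime(2023, 12, 1), now))
print(should_retrain(0.80, 0.75, datetime(2023, 1, 1), now))
```

In a DAG, a check like this would typically be the first task, with the fetch, retrain, evaluate and report tasks running only when it returns a trigger.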