# Experiment #1 - Baseline Model vs. Baseline ML Models

## Overview

The purpose of this experiment is to establish a baseline for a domain-driven model and to compare it to more sophisticated machine learning models using baseline features. Our baseline model will simply follow our intuited rule:

> Include the customer if they belong to Tier S or Tier A. With this rule we target 475 customers, which seems a reasonable number to reach in a given period of time.

To estimate the performance of machine learning models, we will train each of the following models under several hyperparameter configurations, select the best configuration per model, and average the scores of the best models:

* Naive Bayes (XXXX)
* Decision tree (XXXX)
* Logistic regression (XXXX)
* Neural network
* SVM (XXXX)

Scores will be based on how well a classifier can prioritize 475 customers considering the entire database.
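As a minimal sketch of the scoring idea, assuming each classifier produces one score per customer, the function below ranks customers by score and reports the share of converters among the top `k`. The name `conversion_rate_at_k` and the toy arrays are illustrative and do not come from the project code.

```python
import numpy as np

def conversion_rate_at_k(y_true, scores, k):
    """Fraction of converters among the k highest-scored customers."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of customers")
    # argsort is ascending, so negate the scores to rank the best first
    top_k = np.argsort(-scores, kind="stable")[:k]
    return y_true[top_k].mean()

# Toy data (assumption): six customers, three of whom converted
y_true = np.array([1, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.05])
rate = conversion_rate_at_k(y_true, scores, k=3)
print(rate)
```

In the toy data, two of the three highest-scored customers converted, so the rate is 2/3. In the experiment, `k` would be 475 on the full database, or the proportional count on a held-out split.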

In [1]:
%load_ext autoreload
%autoreload 2

from utils import code
from plot_libraries import setup_graphics
from datasets import get_data

In [2]:
# load libraries and set plot parameters
import os, random, re, sys, time, warnings
import math
import numpy as np
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

%matplotlib inline
sns.set()
pd.options.display.max_columns = None
setup_graphics()

In [3]:
import scikitplot as skplt

# Model evaluation
from sklearn.metrics import make_scorer, roc_auc_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_predict

# Support
import parameters as params

## Data

In [4]:
X, y = get_data('../data/trainDF.csv')
X.head()

| Account_ID | Activities_Last_30_Days | Employees | ZoomInfo_Employee_Range | ZoomInfo_Revenue_Range | Organic_Visits | Pct_Organic_Visits | SEO_Visits | URLs_Indexed | ZoomInfo_Global_HQ_Country | Annual_Revenue_converted | Adjusted_Industry | Account_ICP_Score | Account_ICP_Tier | Page_Count | Page_Count_Range | Alexa_Rank | Parent_Account_Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0012400000L5cmZ | 0.0 | 10.0 | - | - | 61688430.0 | 0.34 | 61688430.0 | 27700000.0 | - | 3333900.0 | Retail | 91.667 | Tier A | 27700000.0 | >1M | 331.0 | - |
| 00124000004sEH5 | 51.0 | 10000.0 | - | - | 19397082.0 | 0.93 | 28615923.0 | 76200.0 | - | 13335600000.0 | Retail | 100.0 | Tier A | 206300.0 | Between 100K and 250K | 8881.0 | Prospect |
| 00124000004sUGG | 0.0 | 5000.0 | - | - | 49283858.0 | 0.53 | 50132407.0 | 12600000.0 | - | 555650000.0 | Media | 100.0 | Tier A | 12709000.0 | >1M | 1118.0 | Lost Customer |
| 0011p00002SeaiQ | 0.0 | 383.0 | 250 - 500 | $50 mil. - $100 mil. | 177515.0 |  | 177515.0 | 1090000.0 | US | 73600000.0 | Classified | 70.833 | Tier A | 1090000.0 | >1M | 126905.0 | - |
| 0011p00001SghSL | 0.0 | 5000.0 | 1,000 - 5,000 | $500 mil. - $1 bil. | 8052961.0 | 0.59 | 10416602.0 | 2340000.0 | US | 250000000.0 | Classified | 100.0 | Tier A | 3640000.0 | >1M | 4742.0 | Prospect |


In [5]:
n_instances = len(X)
p_instances = y.sum() / len(y)
p_targeted = 475/n_instances
n_targeted = int(n_instances*p_targeted)

print('Number of instances: {:,}'.format(n_instances))
print('Number of conversions: {:,}'.format(y.sum()))
print('Conversion rate: {:.2f}%'.format(p_instances*100.))
print('Expected number of conversions targeting {:,} @ {:.2f}%: {:,}'.format(n_targeted, p_instances*100., int(p_instances * n_targeted)))

Number of instances: 1,849
Number of conversions: 315
Conversion rate: 17.04%
Expected number of conversions targeting 475 @ 17.04%: 80


### Split Dataset

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y, random_state=1)
n_targeted_test = int(len(X_test) * p_targeted)

### Baseline Model

To evaluate our baseline model, we attach an average revenue and an average cost (both loaded from `parameters`) to each targeted customer. This lets us compare its profit with that of the ML models later.

In [7]:
# Setup costs and benefits
avg_revenue = params.AVG_REVENUE
avg_cost = params.AVG_COST

In [8]:
# Get all of the instances with tiers S&A
X_test_SA = X_test[X_test.Account_ICP_Tier.isin(['Tier S', 'Tier A'])]

# Calculate how many more instances we need
n_rest = n_targeted_test - len(X_test_SA)

# Randomly choose from the remaining instances
rest = X_test[~(X_test.index.isin(X_test_SA.index))].sample(n=n_rest, random_state=1)

In [9]:
# Combine the targeted and random groups
baseline_targets = pd.concat([X_test_SA, rest], axis=0)
baseline_ys = y_test.loc[baseline_targets.index]
baseline_outcomes = baseline_ys.apply(lambda x: avg_cost if x == 0 else avg_cost + avg_revenue)
assert len(baseline_targets) == n_targeted_test

In [10]:
# Create the random targets (no random_state is set, so this sample changes between runs)
random_targets = X_test.sample(n=n_targeted_test)
random_ys = y_test.loc[random_targets.index]
random_outcomes = random_ys.apply(lambda x: avg_cost if x == 0 else avg_cost + avg_revenue)

In [11]:
# Compute profit
random_profit = sum(random_outcomes)
baseline_profit = sum(baseline_outcomes)

print('Number of customers targeted: {:,}/{:,}\n'.format(len(baseline_targets), len(X_test)))

print('Conversion rate under random policy: {:.1f}%'.format(random_ys.sum() / len(random_ys)*100.))
print('Expected profit under random policy: ${:,}\n'.format(random_profit))

print('Conversion rate under baseline policy: {:.1f}%'.format(baseline_ys.sum() / len(baseline_ys)*100.))
print('Expected profit under baseline policy: ${:,}'.format(baseline_profit))
print('Lift over random policy: {:.1f} or ${:,}'.format(baseline_profit / random_profit, baseline_profit - random_profit))

Number of customers targeted: 95/370

Conversion rate under random policy: 13.7%
Expected profit under random policy: $12,050

Conversion rate under baseline policy: 25.3%
Expected profit under baseline policy: $23,050
Lift over random policy: 1.9 or $11,000
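The profit figures above follow from simple arithmetic: every targeted customer incurs the average cost, and every converter additionally yields the average revenue. The sketch below reproduces the printed totals under assumed values. The revenue of 1,000, the signed cost of -10, and the converter counts of 24 and 13 out of 95 are inferred from the printed output, not read from `parameters.py`.

```python
def policy_profit(n_conversions, n_targeted, avg_revenue, avg_cost):
    """Total outcome when every targeted customer incurs avg_cost and
    every converter additionally yields avg_revenue."""
    return n_targeted * avg_cost + n_conversions * avg_revenue

# Illustrative values only (assumption): chosen to match the printed totals,
# not read from parameters.py. The cost is signed, so it is negative.
AVG_REVENUE = 1_000
AVG_COST = -10

# 24/95 is about 25.3% and 13/95 is about 13.7%, matching the printed rates
baseline = policy_profit(24, 95, AVG_REVENUE, AVG_COST)
random_policy = policy_profit(13, 95, AVG_REVENUE, AVG_COST)
print(baseline, random_policy, baseline - random_policy)
```

Because the random sample is not seeded, the random-policy figures, and therefore the lift, will vary from run to run.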


### ML Models
Optimize each model we want to evaluate, then choose the best one and estimate its financial impact.
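One possible shape for this step is sketched below, under several assumptions: the data is synthetic, the hyperparameter grids are hypothetical, and ROC AUC is used as the selection metric only because `roc_auc_score` is imported earlier. The real search space and metric may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features (assumption), roughly 17% positives
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.83], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

# Hypothetical grids; the real search space is not given in the text
candidates = {
    "naive_bayes": (GaussianNB(), {"var_smoothing": [1e-9, 1e-7]}),
    "decision_tree": (
        DecisionTreeClassifier(random_state=1),
        {"max_depth": [2, 4, 8]},
    ),
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
}

# Tune each model separately with 5-fold cross-validation
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=5)
    search.fit(X_tr, y_tr)
    results[name] = search

# Keep the model whose best configuration scored highest
best_name = max(results, key=lambda n: results[n].best_score_)
best_model = results[best_name].best_estimator_

# Rank the held-out customers and keep the same share the baseline targets
k = int(len(X_te) * 475 / 1849)
scores = best_model.predict_proba(X_te)[:, 1]
top_k = scores.argsort()[::-1][:k]
print(best_name, y_te[top_k].mean())
```

The conversion rate among the top `k` can then be turned into a profit estimate with the same cost and revenue parameters used for the baseline.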

### Analyzing Train DF

##### Important features
* Pct Organic Visits - If this number is high, the company depends on SEO to generate revenue. Extreme values near 0% or near 100% are probably wrong or unreliable.
* SEO Visits - In general, the higher this number, the stronger the SEO team, and the more sense it makes to invest in a technology like ours. Below 1M organic visits or above 500M it does not make much sense. (Should we discard those, then?)
* Page Count - In general, the higher this number, the stronger the SEO team, and the more sense it makes to invest in a technology like ours. Below 25K it does not make much sense.
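Before deciding whether to discard such accounts, it may help to flag them first. The sketch below marks rows that fall outside the ranges noted above. The function name, the toy frame, and the 5%/95% cutoffs for "near 0%" and "near 100%" are assumptions for illustration.

```python
import pandas as pd

def flag_out_of_range(df):
    """Return boolean columns marking rows outside the ranges noted above."""
    flags = pd.DataFrame(index=df.index)
    # Outside 1M to 500M visits; missing values are flagged as well
    flags["seo_visits_out_of_range"] = ~df["SEO_Visits"].between(1e6, 500e6)
    # Missing page counts compare as False and are therefore not flagged
    flags["page_count_too_low"] = df["Page_Count"] < 25_000
    # Hypothetical cutoffs for "near 0%" and "near 100%" (assumption)
    flags["pct_organic_suspect"] = ~df["Pct_Organic_Visits"].between(0.05, 0.95)
    return flags

# Toy values (assumption), loosely modelled on the preview rows
toy = pd.DataFrame(
    {
        "SEO_Visits": [61_688_430, 177_515, 10_416_602],
        "Page_Count": [27_700_000, 1_090_000, 20_000],
        "Pct_Organic_Visits": [0.34, None, 0.99],
    }
)
flags = flag_out_of_range(toy)
print(flags)
```

Keeping the flags as separate columns leaves the choice between dropping rows and feeding the flags to a model as extra features open.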

### Missing Values

In [12]:
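# Assumption: `dataset` is the raw training frame; the original cell never
# defined it and raised "NameError: name 'dataset' is not defined".
dataset = pd.read_csv('../data/trainDF.csv')
nan_mean = dataset.isna().mean()
nan_mean = nan_mean[nan_mean != 0].sort_values()
nan_mean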

##### Fill NA Methods
* **Alexa_Rank:** MAX
* **Account_ICP_Score:** MEDIAN (industry)
* **Employees:** MEDIAN (industry)
* **Annual_Revenue_converted:** MEDIAN (industry)
* **Organic_Visits:** MIN
* **Page_Count:** MIN
* **Pct_Organic_Visits:**  MIN
* **Combined_Pages:** DROP

In [None]:
fill_max = ['Alexa_Rank']
fill_min = ['Organic_Visits', 'Page_Count', 'Pct_Organic_Visits']
fill_median = ['Account_ICP_Score', 'Employees', 'Annual_Revenue_converted']
drop_missing_values = ['Combined_Pages']

In [None]:
dataset[fill_max] = dataset[fill_max].fillna(dataset[fill_max].max())
dataset[fill_min] = dataset[fill_min].fillna(dataset[fill_min].min())
dataset.drop(columns=drop_missing_values, inplace=True)

In [None]:
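# Median of each column per industry, as {column: {industry: median}}
values_dict = dataset.groupby(['Adjusted_Industry'])[fill_median].median().to_dict()
for col in fill_median:
    # Fill each gap with the median of the row's own industry
    dataset[col] = dataset[col].fillna(dataset['Adjusted_Industry'].map(values_dict[col]))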
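The per-industry median fill can also be written with `groupby(...).transform`, which avoids building the intermediate dictionary. The toy frame below is an assumption used only to show the behaviour; the column names match the ones in the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): two industries, one missing value in each
toy = pd.DataFrame(
    {
        "Adjusted_Industry": ["Retail", "Retail", "Retail", "Media", "Media"],
        "Employees": [10.0, 30.0, np.nan, 5000.0, np.nan],
    }
)

# transform("median") returns one value per row, aligned with the original index
industry_median = toy.groupby("Adjusted_Industry")["Employees"].transform("median")
toy["Employees"] = toy["Employees"].fillna(industry_median)
print(toy["Employees"].tolist())
# [10.0, 30.0, 20.0, 5000.0, 5000.0]
```

With either approach, an industry whose values are all missing has no median, so its gaps remain `NaN` and need a fallback such as the global median.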

### PIPELINE DATACAMP