# 6 Sense Model Training

<ul>  
<li><a href="#intro">Introduction</a></li> 
<li><a href="#data">Data Preprocessing</a></li> 
<li><a href="#model">Training Model</a></li> 
    <ul>
        <li><a href="#call">Predict Call Disposition</a></li>
        <li><a href="#week">Predict Best Day of Week to Call</a></li>
        <li><a href="#hour">Predict Best Hour to Call</a></li>
    </ul>
<li><a href="#conclusion">Conclusion</a></li> 
</ul>

<a id='intro'></a>
## Introduction

> This notebook trains several models (Random Forest, KNN, SVM, Perceptron, and XGBoost) on the 6sense data and compares them to find the best fit for predicting three variables: call disposition, best day of the week to call, and best hour to call. The goal is to help the company engage prospective customers.

### Data Files:
<ul>
    <li>calls.csv: a timeline of outgoing sales calls and the disposition of those calls</li>
    <li>events.csv: any activities that we have on record taking place before the phone calls were made</li>
    <li>companies.csv: the industry and employee count of the companies</li>
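    <li>people.csv: the people who were called, along with their job level and function and the ID of the company they work for</li>
    <li>opportunities.csv: also loaded below; the preprocessing code uses its company_id and created_date columns</li>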
</ul>
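The join keys are not spelled out in the list above, but the loading and preprocessing code joins on `company_id` and `contact_id`. A minimal sketch with made-up rows (all values and the reduced column sets are assumptions for illustration, not the real data) shows how the tables relate and why one call can turn into several rows once events are joined:

```python
import pandas as pd

# Toy stand-ins for the CSV files; every value here is invented.
companies = pd.DataFrame({"company_id": [1], "industry": ["Software"]})
people = pd.DataFrame({"contact_id": [10, 11], "company_id": [1, 1],
                       "job_level": ["Manager", "Director"]})
calls = pd.DataFrame({"contact_id": [10], "call_disposition": ["Connected"]})
events = pd.DataFrame({"contact_id": [10, 10, 10],
                       "activity_type": ["web", "email", "web"]})

# companies -> people via company_id, then calls and events via contact_id
joined = (companies.merge(people, on="company_id")
                   .merge(calls, on="contact_id")
                   .merge(events, on="contact_id"))

print(len(joined))  # one call with three prior events becomes three rows
```

Contact 11 drops out because the default inner join keeps only contacts that appear in every table.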

In [1]:
# Import libraries and load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
events = pd.read_csv("events.csv")
calls = pd.read_csv("calls.csv")
companies1 = pd.read_csv("companies.csv")
people = pd.read_csv("people.csv")
opportunities = pd.read_csv("opportunities.csv")
events = events.rename(columns={"date": "event_date"})
calls = calls.rename(columns={"timestamp": "call_time"})
del calls['date']

<a id='data'></a>
# Data Preprocessing

In [2]:
# Join companies -> people -> calls, then attach opportunities per company
companies_people = pd.merge(companies1, people, on='company_id')
companies_people_calls = pd.merge(companies_people, calls, on='contact_id')
companies_people_calls_opp = pd.merge(companies_people_calls, opportunities, on='company_id', how='outer')
# Flag rows whose company has an opportunity (computed here but not used as a feature below)
opportunity_created = companies_people_calls_opp['created_date'].notnull()
opportunity_label = opportunity_created.astype(int)
# Attach events; each call row is repeated once per event of the same contact
merged = pd.merge(companies_people_calls_opp, events, on='contact_id')
# Drop identifiers and columns that are not used as features
del merged['company_id']
del merged['activity_name']
del merged['contact_id']
del merged['created_date']
# Encode the event date as day-of-week and month category codes
merged['event_date'] = pd.to_datetime(merged['event_date'])
merged['event_day'] = merged['event_date'].dt.strftime('%a').astype('category').cat.codes
merged['event_month'] = merged['event_date'].dt.strftime('%b').astype('category').cat.codes
del merged['event_date']
# Encode the call timestamp as day-of-week, month, and hour
merged['call_time'] = pd.to_datetime(merged['call_time'])
merged['call_day'] = merged['call_time'].dt.strftime('%a').astype('category').cat.codes
merged['call_month'] = merged['call_time'].dt.strftime('%b').astype('category').cat.codes
merged['call_hour'] = merged['call_time'].dt.hour
del merged['call_time']
# Fill missing values, then convert every categorical column to integer codes
merged['industry'] = merged['industry'].fillna('UNKNOWN').astype('category').cat.codes
merged['employee_range'] = merged['employee_range'].fillna('1,000 - 4,999').astype('category').cat.codes
merged['job_level'] = merged['job_level'].astype('category').cat.codes
merged['job_function'] = merged['job_function'].astype('category').cat.codes
merged['activity_type'] = merged['activity_type'].astype('category').cat.codes
merged['activity_action'] = merged['activity_action'].astype('category').cat.codes
merged['call_disposition'] = merged['call_disposition'].astype('category').cat.codes

merged

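| | industry | employee_range | job_level | job_function | call_disposition | activity_action | activity_type | event_day | event_month | call_day | call_month | call_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | 1 | 1 | 9 | 3 | 4 | 3 | 6 | 10 | 4 | 8 | 17 |
| 1 | 14 | 1 | 1 | 9 | 3 | 5 | 5 | 1 | 10 | 4 | 8 | 17 |
| 2 | 14 | 1 | 1 | 9 | 3 | 4 | 3 | 4 | 11 | 4 | 8 | 17 |
| 3 | 14 | 1 | 1 | 9 | 3 | 4 | 3 | 1 | 10 | 4 | 8 | 17 |
| 4 | 14 | 1 | 1 | 9 | 3 | 4 | 3 | 1 | 10 | 4 | 8 | 17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4663 | 9 | 2 | 6 | 9 | 2 | 2 | 4 | 6 | 6 | 6 | 3 | 21 |
| 4664 | 9 | 2 | 6 | 9 | 2 | 4 | 3 | 1 | 0 | 6 | 3 | 21 |
| 4665 | 9 | 2 | 6 | 9 | 2 | 4 | 3 | 0 | 0 | 6 | 3 | 21 |
| 4666 | 9 | 2 | 6 | 9 | 2 | 5 | 5 | 6 | 6 | 6 | 3 | 21 |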
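
Because `.astype('category').cat.codes` assigns codes in sorted category order, the integers in this table do not follow calendar order (for day abbreviations, `Fri` sorts first). A small sketch, using invented day labels rather than the real columns, of how the code-to-label mapping can be recovered before the original column is overwritten:

```python
import pandas as pd

days = pd.Series(["Mon", "Tue", "Fri", "Mon", "Wed"])  # invented sample values

cat = days.astype("category")
codes = cat.cat.codes                          # integer codes, as used in the notebook
mapping = dict(enumerate(cat.cat.categories))  # code -> original label

print(mapping)
print(codes.tolist())
```

Keeping such a mapping makes it possible to translate a predicted code back into a weekday or disposition label. Missing values receive the code -1.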


<a id='model'></a>
# Training Model

In [3]:
# Import the candidate models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

import xgboost as xgb

<a id='call'></a>
# Predict Call Disposition

### 1. Random Forest

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
target=merged['call_disposition']
features=merged.drop(['call_disposition'],axis=1)
x_train,x_test,y_train,y_test=train_test_split(features,target,test_size=0.2,random_state=42)

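# Note: this is a regressor, so model.score() below returns R^2 rather than
# classification accuracy; the values are not directly comparable with the classifiers.
model = RandomForestRegressor()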
model.fit(x_train, y_train)
prediction = model.predict(x_test)
training_accuracy = model.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = model.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.858248921291096
Testing Accuracy: 0.6505634340431581
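
`RandomForestRegressor.score` returns the coefficient of determination (R^2), so the two numbers printed by this cell are not accuracies. A hedged sketch on synthetic data (generated with `make_classification`; nothing here uses the 6sense files) contrasts it with `RandomForestClassifier`, whose `score` is mean accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for an integer-coded target
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
reg = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Classifier: score() is the fraction of correctly predicted labels
clf_acc = clf.score(X_test, y_test)
# Regressor: score() is R^2 of the predicted values against the integer codes
reg_r2 = reg.score(X_test, y_test)

print("classifier accuracy:", clf_acc)
print("regressor R^2:", reg_r2)
```

Swapping in the classifier would make the Random Forest result comparable with the other models, at the cost of changing the numbers reported here.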


### 2. KNN

In [5]:
knn = KNeighborsClassifier(n_neighbors = 3)                 
knn.fit(x_train, y_train)                                     

training_accuracy = knn.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = knn.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.8910016068559186
Testing Accuracy: 0.7837259100642399


### 3. SVM

In [6]:
svc = SVC(kernel='poly', C=100)                                             
svc.fit(x_train, y_train)                                       

training_accuracy = svc.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = svc.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.7546866630958757
Testing Accuracy: 0.6970021413276232


### 4. Perceptron

In [7]:
perceptron = Perceptron()                           
perceptron.fit(x_train, y_train)                                     

training_accuracy = perceptron.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = perceptron.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.4633101231922871
Testing Accuracy: 0.43897216274089934
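
The Perceptron is a linear model trained with incremental updates and is sensitive to feature scale, which may contribute to its low score here (an assumption, not something verified on this data). A minimal sketch on synthetic data of wrapping it in a pipeline with `StandardScaler`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the real features would come from the merged table
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The scaler is fit on the training split only, then applied to the test split
scaled_perceptron = make_pipeline(StandardScaler(), Perceptron(random_state=0))
scaled_perceptron.fit(X_train, y_train)

test_accuracy = scaled_perceptron.score(X_test, y_test)
print("Testing Accuracy:", test_accuracy)
```

Since most features in this notebook are arbitrary category codes, one-hot encoding may matter as much as scaling for linear models.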


### 5. XGBoost

In [8]:
gradboost = xgb.XGBClassifier(n_estimators=1000)          
gradboost.fit(x_train, y_train)                                         

training_accuracy = gradboost.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = gradboost.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.950990894483128
Testing Accuracy: 0.8790149892933619


Overall, XGBoost is the best model for this target, with a testing accuracy of 0.8790.
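
The five cells in this section repeat the same fit-and-score pattern. A sketch of a small helper (the name `evaluate_models` is hypothetical) that runs it for several classifiers on one split; synthetic data and only scikit-learn models are used so the example stays self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, x_train, x_test, y_train, y_test):
    """Fit each model and return {name: (train_accuracy, test_accuracy)}."""
    results = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        results[name] = (model.score(x_train, y_train),
                         model.score(x_test, y_test))
    return results


# Synthetic stand-in for the features and an integer-coded target
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
results = evaluate_models(models, x_train, x_test, y_train, y_test)
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.4f} test={test_acc:.4f}")
```

The same dictionary could hold the SVM, Perceptron, and XGBoost estimators used in this notebook, since they all expose `fit` and `score`.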

<a id='week'></a>
# Predict Best Day of Week to Call

### 1. Random Forest

In [9]:
target=merged['call_day']
features=merged.drop(['call_day'],axis=1)
x_train,x_test,y_train,y_test=train_test_split(features,target,test_size=0.2,random_state=42)

# Note: for this regressor, model.score() returns R^2, not classification accuracy.
model = RandomForestRegressor()
model.fit(x_train, y_train)
prediction = model.predict(x_test)
training_accuracy = model.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = model.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.9688083463728847
Testing Accuracy: 0.8419053968828119


### 2. KNN

In [10]:
knn = KNeighborsClassifier(n_neighbors = 3)                 
knn.fit(x_train, y_train)                                     

training_accuracy = knn.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = knn.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.9065345474022496
Testing Accuracy: 0.7858672376873662


### 3. SVM

In [11]:
svc = SVC(kernel='poly', C=100)                                             
svc.fit(x_train, y_train)                                       

training_accuracy = svc.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = svc.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.7608462774504553
Testing Accuracy: 0.7012847965738758


### 4. Perceptron

In [12]:
perceptron = Perceptron()                           
perceptron.fit(x_train, y_train)                                     

training_accuracy = perceptron.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = perceptron.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.2774504552758436
Testing Accuracy: 0.2569593147751606


### 5. XGBoost

In [13]:
gradboost = xgb.XGBClassifier(n_estimators=1000)          
gradboost.fit(x_train, y_train)                                         

training_accuracy = gradboost.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = gradboost.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.9820567755757901
Testing Accuracy: 0.917558886509636


Overall, XGBoost is the best model for this target, with a testing accuracy of 0.9176.

<a id='hour'></a>
# Predict Best Hour to Call

### 1. Random Forest

In [14]:
target=merged['call_hour']
features=merged.drop(['call_hour'],axis=1)
x_train,x_test,y_train,y_test=train_test_split(features,target,test_size=0.2,random_state=42)

# Note: for this regressor, model.score() returns R^2, not classification accuracy.
model = RandomForestRegressor()
model.fit(x_train, y_train)
prediction = model.predict(x_test)
training_accuracy = model.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = model.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.9380271181342651
Testing Accuracy: 0.8773804008201168


### 2. KNN

In [15]:
knn = KNeighborsClassifier(n_neighbors = 3)                 
knn.fit(x_train, y_train)                                     

training_accuracy = knn.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = knn.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.8754686663095875
Testing Accuracy: 0.7077087794432548


### 3. SVM

In [16]:
svc = SVC(kernel='poly', C=1000)                                             
svc.fit(x_train, y_train)                                       

training_accuracy = svc.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = svc.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.9507230851633637
Testing Accuracy: 0.8008565310492506


### 4. Perceptron

In [17]:
perceptron = Perceptron()                           
perceptron.fit(x_train, y_train)                                     

training_accuracy = perceptron.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = perceptron.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.13042313872522765
Testing Accuracy: 0.12205567451820129


### 5. XGBoost

In [18]:
gradboost = xgb.XGBClassifier(n_estimators=1000)          
gradboost.fit(x_train, y_train)                                         

training_accuracy = gradboost.score(x_train, y_train)
print ('Training Accuracy:',training_accuracy)
test_accuracy = gradboost.score(x_test, y_test)
print ('Testing Accuracy:',test_accuracy)

Training Accuracy: 0.9708087841456883
Testing Accuracy: 0.9057815845824411


Overall, XGBoost is the best model for this target, with a testing accuracy of 0.9058.
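
The preview of `merged` shows consecutive rows that differ only in the event columns, because each call is repeated for every event of the same contact. With a random row-level split, copies of one call can land in both the training and the test set, which may inflate the testing scores reported in this notebook (a possibility worth checking, not a measured effect). A sketch of a group-aware split, assuming an identifier is kept alongside the features; the `call_ids` array is hypothetical, since the notebook drops its ID columns:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: 6 calls, each repeated for 3 events (18 rows in total)
call_ids = np.repeat(np.arange(6), 3)
X = np.arange(18).reshape(-1, 1)   # placeholder features
y = call_ids % 2                   # placeholder target

# Split by group so that all rows of one call stay on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=call_ids))

train_calls = set(call_ids[train_idx])
test_calls = set(call_ids[test_idx])
print(sorted(train_calls), sorted(test_calls))
```

If scores drop noticeably under such a split, the row-level numbers are likely optimistic.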

<a id='conclusion'></a>
# Conclusion

1. Across all three prediction targets, XGBoost performed best, with testing accuracy between about 0.88 and 0.92 (0.8790 for call disposition, 0.9176 for day of week, and 0.9058 for hour of day).
2. If we adopt XGBoost as our preferred model, we could tune its parameters further. (Tutorial: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
3. We could also use Optuna to automate hyperparameter optimization for each model. (Tutorial: https://github.com/optuna/optuna)
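
As a lightweight stand-in for the tuning suggested in this list, here is a sketch that uses scikit-learn's `GridSearchCV` (not Optuna, and with KNN rather than XGBoost, so that scikit-learn is the only dependency). The grid values and the synthetic data are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the features and an integer-coded target
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# Try a few neighbourhood sizes with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

Cross-validated scores also give a steadier basis for comparing models than a single train/test split.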