<a href="https://colab.research.google.com/github/elsaprafitri/repo_practice_elsa/blob/main/Module_4_Take_home_Assignment_Elsa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Packages

In [136]:
import sys
import os
import gdown
import logging
logging.getLogger('matplotlib.font_manager').setLevel(level=logging.CRITICAL)

import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

random_state = 1234 # get reproducible trees


# Prepare Data

**Restartnet** is a telecommunications company that has been the market leader in Wakanda since 1990 and was the first to offer high-speed mobile internet that integrates satellite and ground cable.

Over the last 5 years, competition has become fiercer as new competitors have emerged. Many Restartnet customers are moving to those competitors, and the Restartnet CEO is concerned about this.

After digging into the data, the Restartnet CEO realized that the churn rate is high, at 25%.



As a **CEO Analyst**, we take the initiative to find which customers are likely to churn by building a **customer churn model**, so that we can offer engagement packages to the right customers.

After we provide the list of customers, we calculate the impact for the company.

With these assumptions:

* For each customer who churns, we lose $500.

* The engagement program costs $100 per customer, and

* All customers who receive the engagement program will stay.


The **data** is provided at this [link](https://drive.google.com/file/d/1jAFn03vk055D9gZrrzM70_cdPyUDg-bv/view). It consists of a sample of **unique customers** who bought an internet package from Restartnet between 2010 and 2020. Each record contains the customer's demographic data and a summary of their transactions with Restartnet. The data definition is given below.

Data Definition:

| Field           | Description                                     |
|-----------------|-------------------------------------------------|
| customerID      | Customer's unique identifier                     |
| gender          | Whether the customer is a male or a female      |
| SeniorCitizen   | Whether the customer is a senior citizen or not |
| Partner         | Whether the customer has a partner or not       |
| Dependents      | Whether the customer has dependents or not      |
| tenure          | Number of months the customer has stayed        |
| PhoneService    | Whether the customer has a phone service or not |
| MultipleLines   | Whether the customer has multiple lines or not  |
| InternetService | Customer's internet service provider            |
| OnlineSecurity  | Whether the customer has online security or not |
| OnlineBackup    | Whether the customer has online backup or not   |
| DeviceProtection| Whether the customer has device protection or not |
| TechSupport     | Whether the customer has tech support or not    |
| StreamingTV     | Whether the customer has streaming TV or not    |
| StreamingMovies | Whether the customer has streaming movies or not|
| Contract        | The contract term of the customer               |
| PaperlessBilling| Whether the customer has paperless billing or not |
| PaymentMethod   | The customer's payment method                   |
| MonthlyCharges  | The amount charged to the customer monthly      |
| TotalCharges    | The total amount charged to the customer        |
| Churn           | Whether the customer churned or not              |



In [137]:
# Download Data
gdrive_url = "https://drive.google.com/file/d/1jAFn03vk055D9gZrrzM70_cdPyUDg-bv/view"
file_name = 'churn_data.csv'
gdown.download(gdrive_url, file_name, fuzzy=True)


Downloading...
From: https://drive.google.com/uc?id=1jAFn03vk055D9gZrrzM70_cdPyUDg-bv
To: /content/churn_data.csv
100%|██████████| 977k/977k [00:00<00:00, 30.6MB/s]


'churn_data.csv'

In [138]:
df = pd.read_csv('churn_data.csv')

In [139]:
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod']

features = numeric_features + categorical_features
target = 'Churn'

print("numeric_features : ", numeric_features)
print("categorical_features : ", categorical_features)
print("features: ", features)
print("target: ", target)
print("columns used: ", features + [target])


numeric_features :  ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features :  ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
features:  ['tenure', 'MonthlyCharges', 'TotalCharges', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
target:  Churn
columns used:  ['tenure', 'MonthlyCharges', 'TotalCharges', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']


In [140]:
df = df[ features + [target] ]


In [141]:
# Fill missing values in TotalCharges with 0
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# Handle categorical data
## one-hot encode each categorical feature into one indicator column per category
df = pd.get_dummies(df, columns = categorical_features)
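
To make the one-hot encoding step concrete, here is a minimal sketch on a made-up three-row frame (the rows are illustrative and not taken from the dataset). It shows how `pd.get_dummies` replaces a categorical column with one indicator column per category while leaving numeric columns untouched.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one numeric and one categorical column
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": ["Month-to-month", "Two year", "Month-to-month"],
})

# One indicator column is created per category found in "Contract"
encoded = pd.get_dummies(toy, columns=["Contract"])
print(list(encoded.columns))
```

Tree-based models cope well with these redundant indicator columns. For linear models, passing `drop_first=True` is a common way to avoid perfectly collinear columns.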


In [142]:
# transform target to 1 if Yes, 0 if No
df[target] = (df[target] == 'Yes').astype(int)

In [143]:
# Split data
## Assuming df_test is new, unseen data
df_train, df_test = train_test_split(df, test_size=0.33, random_state=random_state)
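
The split above is purely random. As an optional variation (an assumption, not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the churn ratio nearly identical in both parts. A minimal sketch on synthetic labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame (hypothetical): 100 customers, 25% of them churned
toy = pd.DataFrame({"tenure": range(100), "Churn": [1] * 25 + [0] * 75})

# stratify keeps the share of churners close to 25% in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.33, random_state=1234, stratify=toy["Churn"]
)

train_rate = toy_train["Churn"].mean()
test_rate = toy_test["Churn"].mean()
print(round(train_rate, 3), round(test_rate, 3))
```

With a few thousand rows, as in this dataset, a random split usually lands close to the overall churn rate anyway, so stratifying matters most for small or heavily imbalanced samples.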

In [144]:
df_train.head()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,SeniorCitizen_0,SeniorCitizen_1,Partner_No,Partner_Yes,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
2632,55,64.75,3617.1,0,True,False,True,False,True,False,...,True,False,False,True,False,True,False,False,False,True
1210,17,69.0,1149.65,1,False,True,True,False,False,True,...,False,True,False,False,False,True,False,False,True,False
5018,72,19.7,1379.8,0,True,False,True,False,False,True,...,False,False,False,True,True,False,False,True,False,False
4891,4,65.6,250.1,0,False,True,True,False,False,True,...,False,True,False,False,True,False,False,False,True,False
3794,8,54.75,445.85,0,False,True,True,False,False,True,...,False,True,False,False,False,True,False,False,False,True


In [145]:
df_train.columns

Index(['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn', 'gender_Female',
       'gender_Male', 'SeniorCitizen_0', 'SeniorCitizen_1', 'Partner_No',
       'Partner_Yes', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'InternetService_No', 'OnlineSecurity_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No', 'OnlineBackup_No internet service',
       'OnlineBackup_Yes', 'DeviceProtection_No',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No', 'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No', 'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No', 'StreamingMovies_No internet service',
       'StreamingMovies_Yes', 'Contract_Month-to-month', 'Contract_One year',

In [146]:
features = list(df_train.columns)
features.remove(target)

features

['tenure',
 'MonthlyCharges',
 'TotalCharges',
 'gender_Female',
 'gender_Male',
 'SeniorCitizen_0',
 'SeniorCitizen_1',
 'Partner_No',
 'Partner_Yes',
 'Dependents_No',
 'Dependents_Yes',
 'PhoneService_No',
 'PhoneService_Yes',
 'MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'OnlineSecurity_No',
 'OnlineSecurity_No internet service',
 'OnlineSecurity_Yes',
 'OnlineBackup_No',
 'OnlineBackup_No internet service',
 'OnlineBackup_Yes',
 'DeviceProtection_No',
 'DeviceProtection_No internet service',
 'DeviceProtection_Yes',
 'TechSupport_No',
 'TechSupport_No internet service',
 'TechSupport_Yes',
 'StreamingTV_No',
 'StreamingTV_No internet service',
 'StreamingTV_Yes',
 'StreamingMovies_No',
 'StreamingMovies_No internet service',
 'StreamingMovies_Yes',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaperlessBilling_No',
 'PaperlessBilling_Yes',
 'P

# Evaluation metrics comparison from several models

## Train & Evaluate Decision Tree Classifier

with specs
```
max depth = 7
class weight = balanced
random state = 1234
```

In [147]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4718 entries, 2632 to 2863
Data columns (total 47 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   tenure                                   4718 non-null   int64  
 1   MonthlyCharges                           4718 non-null   float64
 2   TotalCharges                             4718 non-null   float64
 3   Churn                                    4718 non-null   int64  
 4   gender_Female                            4718 non-null   bool   
 5   gender_Male                              4718 non-null   bool   
 6   SeniorCitizen_0                          4718 non-null   bool   
 7   SeniorCitizen_1                          4718 non-null   bool   
 8   Partner_No                               4718 non-null   bool   
 9   Partner_Yes                              4718 non-null   bool   
 10  Dependents_No                            4718 non-

In [148]:
# import model
from sklearn.tree import DecisionTreeClassifier

# Define the specific parameters for the decision tree
tree_param = {
    'max_depth' : 7,
    'class_weight' : 'balanced',
    'random_state' : 1234
}

# initiate model
model_tree = DecisionTreeClassifier(**tree_param)

# Train model (fit on the DataFrame so feature names match at predict time)
model_tree.fit(df_train[features], df_train[target])

In [149]:
# Evaluate Precision, Recall, and F1 using Test Data
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

prediction = model_tree.predict(df_test[features])


In [150]:
print("precision_score\t:" ,precision_score(df_test[target], prediction))
print("recall_score \t:" ,recall_score(df_test[target], prediction))
print("f1_score \t:" ,f1_score(df_test[target], prediction))

print("confusion_matrix:")
confusion_matrix(df_test[target], prediction)

precision_score	: 0.46303901437371664
recall_score 	: 0.754180602006689
f1_score 	: 0.573791348600509
confusion_matrix:


array([[1204,  523],
       [ 147,  451]])
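
The three scores can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below plugs in the counts printed above, reading them in scikit-learn's layout `[[TN, FP], [FN, TP]]`; the variable names are illustrative.

```python
# Counts copied from the decision tree confusion matrix above
tn, fp, fn, tp = 1204, 523, 147, 451

# Precision: share of predicted churners who really churned
precision = tp / (tp + fp)
# Recall: share of real churners the model caught
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The high recall and modest precision mean the tree catches about three out of four churners, at the price of flagging many customers who would have stayed.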

## Train & Evaluate Random Forest

with specs
```
n estimators = 10
max_depth = 3
random_state=random_state
class_weight = 'balanced'
```

In [151]:
# import model
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Define the specific parameters for the random forest
random_param = {
    'n_estimators' : 10,
    'max_depth' : 3,
    'class_weight' : 'balanced',
    'random_state' : 1234
}

# initiate model
model_rf = RandomForestClassifier(**random_param)

# Train model
model_rf.fit(df_train[features].values, df_train[target].values)

In [152]:
#create prediction from test data
prediction = model_rf.predict(df_test[features])


In [153]:
# Evaluate Precision, Recall, and F1 using Test Data

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("precision \t:" ,round(precision_score(df_test[target], prediction), 4))
print("recall \t\t:" ,round(recall_score(df_test[target], prediction), 4))
print("f1_score \t:" ,round(f1_score(df_test[target], prediction), 4))

precision 	: 0.4733
recall 		: 0.801
f1_score 	: 0.595


## Train & Evaluate Your own model

Feel free to pick any classification model from https://scikit-learn.org/stable/supervised_learning.html

However, your model is required to reach an f1_score higher than `0.61`.


In [154]:
# import model
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Define the specific parameters for the tuned random forest
random2_param = {
    'n_estimators' : 47,
    'max_depth' : 9,
    'class_weight' : 'balanced',
    'random_state' : 1234
}

# initiate model
model = RandomForestClassifier(**random2_param)

# Train model
model.fit(df_train[features].values, df_train[target].values)


In [155]:
#create prediction from test data
prediction2 = model.predict(df_test[features])

In [156]:
# Evaluate Precision, Recall, and F1 using Test Data
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("f1_score \t:" ,round(f1_score(df_test[target], prediction2), 5))

f1_score 	: 0.62256
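
Besides tuning hyperparameters, another common lever for F1 is the decision threshold. For the tree-based classifiers used here, `predict` effectively cuts the predicted churn probability at 0.5, and a different cut-off trades precision against recall. The sketch below sweeps a few thresholds on made-up probabilities; the helper `f1_at_threshold` and the toy data are hypothetical and only illustrate the idea.

```python
def f1_at_threshold(y_true, proba, threshold):
    """F1 score when every probability >= threshold is predicted as churn."""
    tp = sum(1 for y, p in zip(y_true, proba) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, proba) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, proba) if p < threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical labels and churn probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
proba = [0.9, 0.7, 0.45, 0.6, 0.3, 0.2, 0.1, 0.35]

# Sweep a few candidate thresholds and keep the one with the highest F1
candidates = [0.3, 0.4, 0.5, 0.6]
scores = {t: f1_at_threshold(y_true, proba, t) for t in candidates}
best_threshold = max(scores, key=scores.get)
print(best_threshold, round(scores[best_threshold], 3))
```

With the model above, `proba` would come from `model.predict_proba(df_test[features])[:, 1]`. To keep the test score honest, the threshold would normally be chosen on a separate validation split rather than on `df_test`.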


# Business impact comparison from several models

Recall the assumptions:

* For each customer who churns, we lose $500.

* The engagement program costs $100 per customer, and

* All customers who receive the engagement program will stay.
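
Under these assumptions, the total cost of any targeting policy reduces to one formula: each missed churner (false negative) costs $500, and every customer we target (true plus false positives) costs $100. A minimal sketch, using a hypothetical helper named `engagement_total_cost`:

```python
def engagement_total_cost(fn, fp, tp, churn_loss=500, engagement_cost=100):
    """Total cost of a targeting policy under the stated assumptions.

    fn: churners we did not target (lost); fp and tp: customers we targeted.
    """
    value_lost = fn * churn_loss
    program_cost = (fp + tp) * engagement_cost
    return value_lost + program_cost


# Example with made-up counts: 10 missed churners, 30 + 40 targeted customers
print(engagement_total_cost(fn=10, fp=30, tp=40))  # 10*500 + 70*100 = 12000
```

The formula treats a targeted customer who would have stayed anyway as pure cost, which is why precision matters alongside recall in this comparison.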

----
We want to compare the business impact of:
* Case 1: if no engagement program
* Case 2: if we send the engagement program to all users
* Case 3: if we send the engagement program based on the decision tree above (`model_tree`)
* Case 4: if we send the engagement program based on the random forest above (`model_rf`)
* Case 5: if we send the engagement program based on the best model above (`model`)

----

First, we count the total customers and the churned customers in the test dataset.

In [157]:
total_customer = len(df_test)
real_churn = len(df_test.loc[df_test[target] == 1])

print("Total customer \t:", total_customer)
print("Total churn \t:", real_churn)

Total customer 	: 2325
Total churn 	: 598


In [158]:
df_test

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,SeniorCitizen_0,SeniorCitizen_1,Partner_No,Partner_Yes,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
6692,41,74.65,3090.65,0,True,False,True,False,True,False,...,True,True,False,False,False,True,True,False,False,False
2624,17,66.70,1077.05,0,False,True,True,False,False,True,...,False,True,False,False,False,True,False,False,False,True
1076,58,24.50,1497.90,0,True,False,True,False,False,True,...,False,False,True,False,False,True,False,True,False,False
1428,1,50.45,50.45,1,False,True,True,False,True,False,...,False,True,False,False,False,True,False,False,True,False
7026,9,44.20,403.35,1,True,False,True,False,True,False,...,False,True,False,False,False,True,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1827,45,50.90,2333.85,0,True,False,True,False,False,True,...,True,True,False,False,False,True,True,False,False,False
2158,57,84.50,4845.40,0,True,False,True,False,False,True,...,True,False,False,True,False,True,False,False,True,False
2396,38,95.00,3591.25,0,True,False,False,True,False,True,...,True,False,True,False,True,False,False,True,False,False
2217,66,89.90,5958.85,0,True,False,True,False,False,True,...,True,False,False,True,True,False,False,True,False,False


Save the assumptions into variables

In [159]:
churn_value_lost_per_customer = 500
engagement_cost_per_customer = 100

print("Churn Value Lost per customer\t:", churn_value_lost_per_customer)
print("Engagement Cost per customer\t:", engagement_cost_per_customer)

Churn Value Lost per customer	: 500
Engagement Cost per customer	: 100


## Case 1: if no engagement program

In [160]:
print("CASE 1: If no engagement program")

value_lost_case1 = real_churn * churn_value_lost_per_customer
engagement_cost_case1 = 0 # because no engagement
total_cost_case1 = value_lost_case1 + engagement_cost_case1
print("\t Value Lost \t: $", value_lost_case1)
print("\t Engagement cost: $", engagement_cost_case1)
print("\t Total cost \t: $",  total_cost_case1)


CASE 1: If no engagement program
	 Value Lost 	: $ 299000
	 Engagement cost: $ 0
	 Total cost 	: $ 299000


## Case 2: if we send the engagement program to all users

In [161]:
print("Case 2: if we send engagement program to all user")

value_lost_case2 = 0 # because no customer lost
engagement_cost_case2 = total_customer * engagement_cost_per_customer
total_cost_case2 = value_lost_case2 + engagement_cost_case2
print("\t Value Lost \t: $", value_lost_case2)
print("\t Engagement cost: $", engagement_cost_case2)
print("\t Total cost \t: $",  total_cost_case2)


Case 2: if we send engagement program to all user
	 Value Lost 	: $ 0
	 Engagement cost: $ 232500
	 Total cost 	: $ 232500


It looks like sending the engagement program to all customers is more beneficial for the company than doing nothing ($232,500 < $299,000).

But let's see how the models perform.
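
Why blanket engagement beats doing nothing follows from a break-even argument: engaging a customer costs $100 and avoids a $500 loss only if that customer would have churned, so on average engagement pays off when the churn probability exceeds 100 / 500 = 20%. A short check with the test-set counts printed earlier:

```python
churn_loss = 500        # loss per churned customer (assumption from above)
engagement_cost = 100   # cost per engaged customer (assumption from above)

# Engaging a group is cheaper than ignoring it when
# churn_rate * churn_loss > engagement_cost
break_even_rate = engagement_cost / churn_loss

# Test-set counts printed earlier
total_customer, real_churn = 2325, 598
test_churn_rate = real_churn / total_customer

print("Break-even churn rate:", break_even_rate)
print("Test churn rate:", round(test_churn_rate, 3))
print("Blanket engagement cheaper:", test_churn_rate > break_even_rate)
```

Under the same assumptions, a model adds value to the extent that it separates customers above roughly 20% churn risk from those below it.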

## Case 3: if we send the engagement program based on the decision tree above (`model_tree`)

Tip: you need to find
* how many customers are predicted to churn (`predict_churn`)
* how many customers actually churn **but** are predicted to stay (`real_churn_predict_stay`)

Hint: you can use the confusion matrix
```python
confusion_matrix(y_true_test, y_pred_test)
```
Explore the indexing of `confusion_matrix`, for example using `[0,0]` to get a single number from the confusion matrix:
```python
confusion_matrix(y_true_test, y_pred_test)[0,0]
```

As a reminder, this is the content of the confusion matrix:
![Confusion matrix](https://miro.medium.com/v2/resize:fit:974/1*H_XIN0mknyo0Maw4pKdQhw.png)
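
As a small illustration of that indexing, the sketch below builds the 2x2 matrix by hand for made-up labels. It follows scikit-learn's convention that rows are true classes and columns are predicted classes, so `[1][0]` holds the churners predicted as stay. A picture may order the cells differently, so it is worth checking against the code.

```python
# Hypothetical true and predicted labels (1 = churn, 0 = stay)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# cm[i][j] counts customers whose true class is i and predicted class is j
cm = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_pred):
    cm[actual][predicted] += 1

tn, fp = cm[0]
fn, tp = cm[1]

predict_churn = fp + tp            # everyone flagged as churn
real_churn_predict_stay = fn       # churners the model missed
print(cm, predict_churn, real_churn_predict_stay)
```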

In [162]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [163]:
# Model training

# Define the specific parameters for the decision tree
tree_param = {
    'max_depth' : 7,
    'class_weight' : 'balanced',
    'random_state' : 1234
}
#initiate model
model_tree = DecisionTreeClassifier(**tree_param)

# Train Model
model_tree.fit(df_train[features], df_train[target]) #no longer use df, but df_train

In [164]:
# Prediction on test data
y_pred_test = model_tree.predict(df_test[features])
y_true_test = df_test[target].values

# Metrics
print("Accuracy Score \t:", accuracy_score(y_true_test, y_pred_test))
print("Precision Score\t:", precision_score(y_true_test, y_pred_test))
print("Recall Score \t:", recall_score(y_true_test, y_pred_test))
print("F1 Score \t:", f1_score(y_true_test, y_pred_test))

# Confusion Matrix
cm = confusion_matrix(y_true_test, y_pred_test)
print("Confusion Matrix:")
print(cm)


Accuracy Score 	: 0.7118279569892473
Precision Score	: 0.46303901437371664
Recall Score 	: 0.754180602006689
F1 Score 	: 0.573791348600509
Confusion Matrix:
[[1204  523]
 [ 147  451]]


In [165]:
print("CASE 3:  if we send engagement program based on above decision tree (model_tree)")

# Extract TN, FP, FN, TP from the confusion matrix
TN, FP, FN, TP = cm.ravel()

# Printing results
print("True Negatives:", TN)
print("False Positives:", FP)
print("False Negatives:", FN)
print("True Positives:", TP)

CASE 3:  if we send engagement program based on above decision tree (model_tree)
True Negatives: 1204
False Positives: 523
False Negatives: 147
True Positives: 451


In [166]:
predict_churn = FP + TP #(all customers predicted to churn)
real_churn_predict_stay = FN #(customers who churned but were predicted to stay)

# Calculating business impacts
value_lost_case3 = real_churn_predict_stay * churn_value_lost_per_customer  # Cost of churns that were not predicted
engagement_cost_case3 = predict_churn * engagement_cost_per_customer    # Cost of sending engagement programs
total_cost_case3 = value_lost_case3 + engagement_cost_case3 #total cost


# Display the costs
print("\t Value Lost \t: $", value_lost_case3)
print("\t Engagement cost: $", engagement_cost_case3)
print("\t Total cost \t: $",  total_cost_case3)

	 Value Lost 	: $ 73500
	 Engagement cost: $ 97400
	 Total cost 	: $ 170900


## Case 4: if we send the engagement program based on the random forest above (`model_rf`)

In [167]:
# import model
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [168]:
# Define the specific parameters for the random forest
random_param = {
    'n_estimators' : 10,
    'max_depth' : 3,
    'class_weight' : 'balanced',
    'random_state' : 1234
}

# initiate model
model_rf = RandomForestClassifier(**random_param)

# Train model
model_rf.fit(df_train[features].values, df_train[target].values)

In [169]:
# Prediction on test data
y_pred_testrf = model_rf.predict(df_test[features])

In [170]:
#create array number from the actual data
y_true_testrf = df_test[target].values

In [171]:
# Evaluate Precision, Recall, and F1 using Test Data (rounded to 4 decimals)
print("precision_score\t:" ,round(precision_score(y_true_testrf, y_pred_testrf), 4))
print("recall_score \t:" ,round(recall_score(y_true_testrf, y_pred_testrf), 4))
print("f1_score \t:" ,round(f1_score(y_true_testrf, y_pred_testrf), 4))

print("confusion_matrix_rf:")
confusion_matrix(y_true_testrf, y_pred_testrf)

precision_score	: 0.4733
recall_score 	: 0.801
f1_score 	: 0.595
confusion_matrix_rf:


array([[1194,  533],
       [ 119,  479]])

In [172]:
# Extract TN, FP, FN, TP from the confusion matrix
cm_rf = confusion_matrix(y_true_testrf, y_pred_testrf)

TN_RF, FP_RF, FN_RF, TP_RF = cm_rf.ravel()

# Printing results
print("True Negatives:", TN_RF)
print("False Positives:", FP_RF)
print("False Negatives:", FN_RF)
print("True Positives:", TP_RF)

True Negatives: 1194
False Positives: 533
False Negatives: 119
True Positives: 479


In [173]:
print("Case 4: if we send engagement program based on above random forest (model_rf) ")

predict_churn = FP_RF + TP_RF #(all customers predicted to churn)
real_churn_predict_stay = FN_RF #(customers who churned but were predicted to stay)

# Calculating business impacts
value_lost_case4 = real_churn_predict_stay * churn_value_lost_per_customer  # Cost of churns that were not predicted
engagement_cost_case4 = predict_churn * engagement_cost_per_customer    # Cost of sending engagement programs
total_cost_case4 = value_lost_case4 + engagement_cost_case4 #total cost

print("\t Value Lost \t: $", value_lost_case4)
print("\t Engagement cost: $", engagement_cost_case4)
print("\t Total cost \t: $",  total_cost_case4)


Case 4: if we send engagement program based on above random forest (model_rf) 
	 Value Lost 	: $ 59500
	 Engagement cost: $ 101200
	 Total cost 	: $ 160700


## Case 5: if we send the engagement program based on the best model above (`model`)

In [174]:
print("The best model (F1 > 0.61) is a Random Forest with the parameters below")

The best model (F1 > 0.61) is a Random Forest with the parameters below


In [175]:
# import model
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Define the specific parameters for the tuned random forest
random2_param = {
    'n_estimators' : 47,
    'max_depth' : 9,
    'class_weight' : 'balanced',
    'random_state' : 1234
}

# initiate model
model_rf2= RandomForestClassifier(**random2_param)

# Train model
model_rf2.fit(df_train[features].values, df_train[target].values)

In [176]:
# Prediction on test data
y_pred_testrf2 = model_rf2.predict(df_test[features])

In [177]:
#create array number from the actual data
y_true_testrf2 = df_test[target].values

In [115]:
# Evaluate Precision, Recall, and F1 using Test Data (rounded to 4 decimals)
print("precision_score\t:" ,round(precision_score(y_true_testrf2, y_pred_testrf2), 4))
print("recall_score \t:" ,round(recall_score(y_true_testrf2, y_pred_testrf2), 4))
print("f1_score \t:" ,round(f1_score(y_true_testrf2, y_pred_testrf2), 4))

print("confusion_matrix_rf2:")
confusion_matrix(y_true_testrf2, y_pred_testrf2)

precision_score	: 0.5334
recall_score 	: 0.7475
f1_score 	: 0.6226
confusion_matrix_rf2:


array([[1336,  391],
       [ 151,  447]])

In [178]:
# Extract TN, FP, FN, TP from the confusion matrix
cm_rf2 = confusion_matrix(y_true_testrf2, y_pred_testrf2)

TN_RF2, FP_RF2, FN_RF2, TP_RF2 = cm_rf2.ravel()

# Printing results
print("True Negatives:", TN_RF2)
print("False Positives:", FP_RF2)
print("False Negatives:", FN_RF2)
print("True Positives:", TP_RF2)

True Negatives: 1336
False Positives: 391
False Negatives: 151
True Positives: 447


In [179]:
print("Case 5: if we send engagement program based on above the best model (model)")

predict_churn = FP_RF2 + TP_RF2 #(all customers predicted to churn)
real_churn_predict_stay = FN_RF2 #(customers who churned but were predicted to stay)

# Calculating business impacts
value_lost_case5 = real_churn_predict_stay * churn_value_lost_per_customer  # Cost of churns that were not predicted
engagement_cost_case5 = predict_churn * engagement_cost_per_customer    # Cost of sending engagement programs
total_cost_case5 = value_lost_case5 + engagement_cost_case5 #total cost

print("\t Value Lost \t: $", value_lost_case5)
print("\t Engagement cost: $", engagement_cost_case5)
print("\t Total cost \t: $",  total_cost_case5)


Case 5: if we send engagement program based on above the best model (model)
	 Value Lost 	: $ 75500
	 Engagement cost: $ 83800
	 Total cost 	: $ 159300


In [135]:
print("If the goal is to minimize overall cost, then based on the calculations above we can cut the total cost from $299,000 (no action) to $170,900 with the Decision Tree model,")
print("$160,700 with the Random Forest model (n_estimators: 10, max_depth: 3), and $159,300 with the optimized model (Random Forest with n_estimators: 47, max_depth: 9).")

print("But if the risk of customer churn (value lost) is more critical, then the original Random Forest (n_estimators: 10, max_depth: 3)")
print("could be considered better, since it has a lower value lost despite a slightly higher total cost.")

If the goal is to minimize overall cost, then based on the calculations above we can cut the total cost from $299,000 (no action) to $170,900 with the Decision Tree model,
$160,700 with the Random Forest model (n_estimators: 10, max_depth: 3), and $159,300 with the optimized model (Random Forest with n_estimators: 47, max_depth: 9).
But if the risk of customer churn (value lost) is more critical, then the original Random Forest (n_estimators: 10, max_depth: 3)
could be considered better, since it has a lower value lost despite a slightly higher total cost.
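
To put all five cases side by side, the sketch below ranks the total costs computed in this notebook. It also adds one reference point that is not part of the original analysis: a hypothetical perfect model that targets exactly the 598 real churners, which would cost 598 × $100 under the same assumptions.

```python
# Total costs copied from the case outputs above
total_costs = {
    "Case 1: no engagement": 299000,
    "Case 2: engage everyone": 232500,
    "Case 3: decision tree": 170900,
    "Case 4: random forest (10 trees, depth 3)": 160700,
    "Case 5: random forest (47 trees, depth 9)": 159300,
}

# Reference point (assumption): a perfect model engages only the real churners
real_churn, engagement_cost = 598, 100
perfect_model_cost = real_churn * engagement_cost

baseline = total_costs["Case 1: no engagement"]
for name, cost in sorted(total_costs.items(), key=lambda item: item[1]):
    print(f"{name:<45} ${cost:>8,}  saves ${baseline - cost:,}")
print(f"{'Perfect model (lower bound)':<45} ${perfect_model_cost:>8,}")
```

The gap between the best model's cost and this lower bound suggests there may still be room to improve, for example through a better model or a different decision threshold.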
