<a href="https://colab.research.google.com/github/ekapolc/exxon_training/blob/master/MLpipeline/ExxonTraining_TelcoChurnPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

  # **INTRODUCTION - PREDICTING CHURN RATE**

In this exercise, you will analyse customer data and predict behavior, in order to develop focused customer retention programs. The dataset contains data from over 7000 customers, including whether they have left within the past month -- a behavior called Churn.

---
There are three parts to this exercise :
* **Part 1 - Data Cleaning** Formatting the data so that our predictive algorithm can understand it.
* **Part 2 - Data Learning** Giving our algorithm the data to learn from,  discover hidden patterns, and make accurate predictions.
* **Part 3 - Play with ROC** Finding the most suitable threshold for this task.
---



## Import Libraries

In [0]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
import sklearn.model_selection

from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
import random 
from google.colab import files 
from collections import OrderedDict

#Classifiers
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

#Set seed
seed = 18

### Dataset Retrieval

You can have a look at the dataset as a spreadsheet <a href="https://github.com/busyML/Predictiong-Customer-Churn/raw/master/Telco_Customer_Churn.xlsx">here</a>. However, this exercise automatically downloads this spreadsheet to the Google Colab runtime for you, so you are not required to download it.
 
 Data Dictionary : <a href="https://www.kaggle.com/blastchar/telco-customer-churn">https://www.kaggle.com/blastchar/telco-customer-churn</a>

The following code downloads the spreadsheet and automatically converts it to a Pandas dataframe. What is Pandas? Think of it as Excel driven by code: much faster on large tables and more practical for repeatable analysis. We use it to manipulate huge spreadsheets of data in just a few seconds.


In [0]:
# We previously uploaded the data to this url; here we simply retrieve it.
data_url= ('https://github.com/busyML/Predictiong-Customer-Churn/raw/master/Telco_Customer_Churn.xlsx')

# Download the spreadsheet using Pandas's importing function.
data=pd.read_excel(data_url)

columnnames=data.columns 
print(columnnames) 

In [0]:
data.set_index('CustomerID', inplace=True)

#### Using the ***.head()*** command to explore data


---




In [0]:
data.head(10)

# Part 1 - DATA CLEANING


### Todo#1 - Label Encoding

Some columns contain binary values, meaning they are either one thing or not (for example: yes or no, true or false). To make these columns compatible with our algorithms, we can encode (represent) their values with **1** and **0**. Here we will set **Yes=1** and **No=0**. 

Let's encode these problem columns.
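As a small illustration of the idea, the sketch below encodes two binary text columns with an explicit mapping. The `toy` dataframe is made up for this example and is not the Telco data. Unlike the `apply(lambda ...)` approach used in this notebook, which sends every value other than the keyword to 0, `map` turns any label missing from the mapping into NaN, which can help reveal unexpected values.

```python
import pandas as pd

# Toy data, not the Telco dataset: two binary text columns.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Gender": ["Female", "Male", "Female"],
})

# Map each text label to 1 or 0. Labels missing from the mapping become NaN,
# which makes unexpected values easy to spot.
toy["Partner"] = toy["Partner"].map({"Yes": 1, "No": 0})
toy["Gender"] = toy["Gender"].map({"Female": 1, "Male": 0})

print(toy)
```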

In [0]:
data['Churn']=data['Churn'].apply(lambda x:1 if x=='Yes' else 0) 
data['Gender']=data['Gender'].apply(lambda x:1 if x=='Female' else 0) # Unlike the other columns, the keyword here is "Female" rather than "Yes", but the column is still binary.
data['Partner']=data['Partner'].apply(lambda x:1 if x=='Yes' else 0)
data['Dependents']=data['Dependents'].apply(lambda x:1 if x=='Yes' else 0)
data['PhoneService']=data['PhoneService'].apply(lambda x:1 if x=='Yes' else 0)
data['MultipleLines']=data['MultipleLines'].apply(lambda x:1 if x=='Yes' else 0)
data['OnlineSecurity']=data['OnlineSecurity'].apply(lambda x:1 if x=='Yes' else 0)
data['OnlineBackup']=data['OnlineBackup'].apply(lambda x:1 if x=='Yes' else 0)
# data['DeviceProtection']=data['DeviceProtection'].apply(lambda x:1 if x=='Yes' else 0)

## By the way, you can use sklearn to do this.
le = LabelEncoder()
le.fit(data['DeviceProtection'])
data['DeviceProtection'] = le.transform(data['DeviceProtection'])

In [0]:
data.head()

### Let's check other problem columns and fix them.
There are other columns that have binary values. Explore the data more to see which other columns need to be encoded, and encode them in the same way as the previous code.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
data['TechSupport']=data['TechSupport'].apply(lambda x:1 if x=='Yes' else 0)
data['StreamingTV']=data['StreamingTV'].apply(lambda x:1 if x=='Yes' else 0)
data['StreamingMovies']=data['StreamingMovies'].apply(lambda x:1 if x=='Yes' else 0)
data['PaperlessBilling']=data['PaperlessBilling'].apply(lambda x:1 if x=='Yes' else 0)
        </code>
      </pre>
</details>

In [0]:
data.head()

### Todo#2 - One Hot Encoding

Some columns contain three or more possible text values. For instance, the column **['InternetService']**, which tells what type of internet service the customer is using, has the following possible outcomes:

*   **Fiber optic**
*   **DSL**
*   **No Internet**

To encode such a column as numerical values, we may use a technique called ***One Hot Encoding***. The trick is to **create** new columns for each possible outcome, each with binary values. So from the column **['InternetService']**, we are going to create three new columns : **['InternetService-FiberOptic']**, **['InternetService-DSL']**, and **['InternetService-NoInternet']**, each with values of either **1** or **0**. So if our customer has DSL Internet, as is the case with the customer ***7590-VHVEG***  in our first row, then our three new columns will look like this in the first row:

*   **['InternetService-FiberOptic']** =  0
*   **['InternetService-DSL']**=  1
*   **['InternetService-NoInternet']**=  0


And finally, don't forget to delete the original column **['InternetService']**.

You can use the **value_counts()** command in pandas to explore the data.


Let's use one hot encoding to deal with it!!
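To make the column-per-value idea concrete, here is a minimal sketch on a made-up `toy` dataframe (not the Telco data). It uses `pd.get_dummies` with a prefix so the new columns are named after both the original column and the value; the exact column names it produces are a consequence of that choice and differ from the names the notebook's loop creates.

```python
import pandas as pd

toy = pd.DataFrame({"InternetService": ["DSL", "Fiber optic", "No", "DSL"]})

# One new 0/1 column per distinct value, named "<column>-<value>".
dummies = pd.get_dummies(toy["InternetService"], prefix="InternetService",
                         prefix_sep="-", dtype=int)

# Replace the original text column with the new indicator columns.
toy = pd.concat([toy.drop(columns=["InternetService"]), dummies], axis=1)
print(toy)
```

Each row has exactly one 1 across the new columns, which is what "one hot" refers to.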

In [0]:
data_tmp = data.copy() 

In [0]:
for x in data['InternetService'].value_counts().keys(): 
      data[x]=data['InternetService'].apply(lambda d: 1 if d==x else 0)
data.drop(columns=['InternetService'], inplace=True)

In [0]:
data.head()

### Let's check other problem columns and fix them.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
for x in data['Contract'].value_counts().keys():
    data[x]=data['Contract'].apply(lambda d: 1 if d==x else 0)
data.drop(columns=['Contract'], inplace=True) 
    
for x in data['PaymentMethod'].value_counts().keys():
    data[x]=data['PaymentMethod'].apply(lambda d: 1 if d==x else 0)
data.drop(columns=['PaymentMethod'], inplace=True) 
        </code>
      </pre>
</details>

In [0]:
#### By the way, you can use "pd.get_dummies(data)" to solve this problem. Yayyyy
pd.get_dummies(data_tmp).head()

### Todo#3 - Splitting the Data

Training : Testing = 80 : 20 

random_state = seed (18)



In [0]:
x=data.drop(columns='Churn')
y=data['Churn']
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

x_training, x_testing, y_training, y_testing = 

<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
x_training, x_testing, y_training, y_testing = train_test_split(x, y, test_size=0.2, random_state=seed)
        </code>
      </pre>
</details>

### Todo#4 - Mean Normalization

While some columns already contain numerical values, they may have very different ranges. Some may range between 0-20 while others may be 100-2,000. Not ideal. 

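To deal with this, let's pick out the columns that already have numerical data and perform **Mean Normalization** on them. This particular form, which divides by the standard deviation, is also known as standardization or z-score scaling. The mean and standard deviation are computed on the training set only and then reused for the testing set.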

Equation :  X_norm = (X - mean) / std 
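The equation can be checked on a few made-up numbers. This sketch uses NumPy only and toy values, so the variable names are illustrative. It also shows one detail worth knowing: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two approaches give slightly different scaled values.

```python
import numpy as np

# Toy values standing in for a numeric column such as tenure.
train = np.array([1.0, 5.0, 9.0, 13.0])
test = np.array([3.0, 20.0])

# Statistics come from the training split only.
mean = train.mean()
std_sample = train.std(ddof=1)      # what pandas Series.std() computes by default
std_population = train.std(ddof=0)  # what StandardScaler uses

train_norm = (train - mean) / std_sample
test_norm = (test - mean) / std_sample  # reuse the training mean and std

print(train_norm.mean(), train_norm.std(ddof=1))
print(std_sample, std_population)
```

After scaling, the training column has mean 0 and standard deviation 1; the testing column generally does not, because it was scaled with the training statistics.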

In [0]:
x_training_Tenure_mean = x_training['Tenure'].mean()
x_training_Tenure_std = x_training['Tenure'].std()

x_training['Tenure']=(x_training['Tenure']-x_training_Tenure_mean)/x_training_Tenure_std
x_testing['Tenure']=(x_testing['Tenure']-x_training_Tenure_mean)/x_training_Tenure_std

## You can use standard scaler
scaler = StandardScaler()
scaler.fit(x_training['MonthlyCharges'].values.reshape(-1,1))
x_training['MonthlyCharges'] = scaler.transform(x_training['MonthlyCharges'].values.reshape(-1,1)).reshape(-1)
x_testing['MonthlyCharges'] = scaler.transform(x_testing['MonthlyCharges'].values.reshape(-1,1)).reshape(-1)

### Let's check other problem columns and fix them.

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################









<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
x_training_TotalCharges_mean = x_training['TotalCharges'].mean()
x_training_TotalCharges_std = x_training['TotalCharges'].std()

x_training['TotalCharges']=(x_training['TotalCharges']-x_training_TotalCharges_mean)/x_training_TotalCharges_std
x_testing['TotalCharges']=(x_testing['TotalCharges']-x_training_TotalCharges_mean)/x_training_TotalCharges_std
        </code>
      </pre>
</details>

### Todo#5 - Exploring Correlations
Now that we have separated our target value (**y**) from the features (**x**), we can explore the correlation between them.

We can use the "x.corr(y)" function for this, which outputs the correlation coefficient between the two columns x and y.

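A correlation coefficient close to 0 means that the column has little linear relationship with our target, **['Churn']**. A strongly negative coefficient is as informative as a strongly positive one, so compare absolute values. Columns with near-zero correlation are candidates for removal, although correlation only captures linear relationships.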
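The following sketch illustrates why the sign of the coefficient matters less than its magnitude when ranking features. The arrays are toy data built for this example, and `pearson` is a hypothetical helper wrapping `np.corrcoef`.

```python
import numpy as np

# Toy data: y is a 0/1 target; feature_a tracks it, feature_b moves against it,
# feature_c is unrelated by construction.
y = np.array([0, 0, 0, 1, 1, 1])
feature_a = np.array([1.0, 2.0, 1.5, 4.0, 5.0, 4.5])
feature_b = -feature_a
feature_c = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])

def pearson(u, v):
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is corr(u, v).
    return np.corrcoef(u, v)[0, 1]

for name, col in [("a", feature_a), ("b", feature_b), ("c", feature_c)]:
    r = pearson(col, y)
    print(name, round(r, 3), round(abs(r), 3))
```

Features `a` and `b` carry the same information about `y` even though their coefficients have opposite signs, which is why taking the absolute value before sorting is a reasonable choice.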

Now, let's explore the usefulness of each column. In this task, you will write some code to calculate correlation coefficients of all the columns, and sort them in descending order. We have provided code to help you plot a bar chart of each column's correlation coefficient.


In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################
corr = {}
for columnname in x_training.columns: 
    corr[columnname] = 
corr = 

################################################################################
name = [x[0] for x in corr]
val = [x[1] for x in corr]

plt.figure(figsize=(15,10))
plt.title('Customer Churn Predictors _ Correlations')
plt.barh(range(len(corr)), val, color='b', align='center')
plt.yticks(range(len(corr)), name)
plt.xlabel('Importance Level')
plt.show()

<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
corr = {}
for columnname in x_training.columns: 
    corr[columnname] = abs(x_training[columnname].corr(y_training))
corr = sorted(corr.items(), key=lambda x: x[1], reverse=False)
        </code>
      </pre>
</details>

### Todo#6 - Feature Importance from Decision Tree
Using a Decision Tree can tell us which features are useful.

In this task, you will train an <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html">ExtraTreesClassifier</a> (an ensemble of randomized decision trees) from sklearn on our data. Then, get the feature importance levels calculated by the model, and sort them like you did with the correlation coefficients. Again, we have code for plotting the bar chart.

Parameter Setup : n_estimators = 100


In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

### Initialize model
feature_importance_indicator = 

### Fit model
feature_importance_indicator

### Find Feature Importance
features = 
importances = 
indices = np.argsort(importances)

################################################################################

plt.figure(figsize=(15,10))
plt.title('Customer Churn Predictors')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Importance Level')
plt.show()

<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
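# Fit on the training split only, so the testing data does not influence feature selection
feature_importance_indicator=ExtraTreesClassifier(n_estimators = 100, random_state=seed)
feature_importance_indicator.fit(x_training,y_training)
features = x_training.columns
importances = feature_importance_indicator.feature_importances_
indices = np.argsort(importances)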
        </code>
      </pre>
</details>

### Todo#7  - Drop Useless Columns

Consider the feature importances based on both the correlation and decision tree approaches. Which columns should you drop? Why? Please justify your answer.

**Ans**

Why do some columns have high correlation, but low importance according to the decision tree?

**Ans**



In [0]:
### Hint for above question

corr = x_training.corr()
plt.figure(figsize=(15,10))
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True, linewidths=.5
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);




<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
          According to the heat map above, some columns are correlated with each other. For example, the columns from StreamingTV to FiberOptic have significantly high correlations among themselves. This means that, if the decision tree picks more than one of these columns, it wouldn't get much more information than it already knows. So it makes sense that the decision tree will only need to pick a subset of these columns.
        </code>
      </pre>
</details>

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#### Drop columns



<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
x_training.drop(columns=['Gender','PhoneService','MultipleLines','OnlineBackup','DeviceProtection','StreamingTV','StreamingMovies'],inplace=True)
x_testing.drop(columns=['Gender','PhoneService','MultipleLines','OnlineBackup','DeviceProtection','StreamingTV','StreamingMovies'],inplace=True)
        </code>
      </pre>
</details>

In [0]:
#Let's print our final data
x_training.head(10)



---



# Part 2 - Data Learning

### Choosing The Evaluation Metric

When a model makes a prediction, there are four possible outcomes: **True Positive**, **False Positive**, **True Negative**, and **False Negative**.

* **True Positive**:  Predict 1, Actually 1
* **False Positive**:  Predict 1, Actually 0
* **True Negative** : Predict 0 , Actually 0
* **False Negative**: Predict 0 , Actually 1

These outcomes are commonly counted in the form of a matrix, called a **Confusion Matrix**. From this matrix, we can calculate three common evaluation metrics:

1.   **Precision**: All correctly identified positives out of all positive predictions
  - True Positive / (True Positive + False Positive)
2. **Recall**: All positives covered by the predictions out of all positives
  - True Positive / (True Positive + False Negative)
3.  **F1 Score**: Harmonic mean of Precision and Recall
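The three metrics can be written out directly from the counts. This is a minimal sketch with made-up counts; `precision_recall_f1` is a hypothetical helper name, and the zero-division guards are a convention chosen for this example.

```python
def precision_recall_f1(tp, fp, fn):
    # Guard against division by zero when there are no positive predictions
    # or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, f1)
```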

For further reading : https://towardsdatascience.com/precision-vs-recall-386cf9f89488



---



Between precision and recall, which evaluation metric do you think is more suitable for this task? Please justify your answer.

**Ans**

### Todo#8 - Creating, training and testing our algorithms 

Use **<a href=https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>Logistic Regression model</a>** in Scikit-Learn to train a model with default parameters. Study the documentation for how to build and train (fit) the model.



In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

### Build Model ###
logistic_regression= 

### Fit Model ###


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
logistic_regression= LogisticRegression()
logistic_regression.fit(x_training,y_training)
        </code>
      </pre>
</details>

### Todo#9 - Model Evaluation

Our model is now trained! 
Next, we need to evaluate how well it can make predictions.

In this task, you will implement a function to calculate the confusion matrix. From the true labels and predictions, count the occurrences of all four outcomes and return all the counts.
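For comparison with the loop you are asked to write, the same four counts can also be obtained with boolean masks. This is a sketch on toy arrays (not model output), assuming both inputs are NumPy arrays of 0s and 1s.

```python
import numpy as np

# Toy labels and predictions, not model output.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Each comparison yields a boolean array; & combines them element-wise,
# and summing booleans counts the True entries.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

print(tp, fp, tn, fn)
```

The four counts always add up to the number of samples, which is a quick sanity check for any implementation.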



In [0]:
def confusionCal(y_testing , pred):
  ################################################################################
  #                            WRITE YOUR CODE BELOW                             #
  ################################################################################

  #We initialize the following variables that help us count the number of True Positives, False Positives, etc.
  TP=0
  FP=0
  TN=0
  FN=0

  #We use a loop to count the number of TP, FP, TN, FN. Each time we detect one, we add 1 to the corresponding variable
  for i in range(len(pred)):

  
  return TP, FP, TN, FN

  ### By the way, you can do this in one line using sklearn.
  ### confusion_matrix(y_testing, pred)

<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
TP=0
FP=0
TN=0
FN=0

for i in range(len(pred)):
  # y_testing is a pandas Series, so use .iloc for positional access
  if pred[i]==1 and y_testing.iloc[i]==1:
    TP=TP+1
  elif pred[i]==1 and y_testing.iloc[i]==0:
    FP=FP+1
  elif pred[i]==0 and y_testing.iloc[i]==0:
    TN=TN+1
  elif pred[i]==0 and y_testing.iloc[i]==1:
    FN=FN+1
</code>
</pre>
</details>

In [0]:
# predict test data and calculate TP, FP, TN and FN
pred = logistic_regression.predict(x_testing)
TP, FP, TN, FN = confusionCal(y_testing, pred)

#Printing our results
print ("Logistic Regression", "True Positives:",TP, "False Positives:",FP,"True Negatives:",TN , "False Negatives:", FN) 


#Calculating the 'Recall' metric with this simple formula
log_recall=TP/(TP+FN)

#Printing the 'Recall' metric in %
print("Logistic Regression Recall On Testing Data:", log_recall *100,'%')

### Todo#10 - Play with Threshold

In reality, our model is not trained to output either 0 or 1; it outputs a probability. The predict() function internally checks if this probability is more than a certain threshold, and outputs 1 if it is, 0 if not. By default, this threshold is 0.5.

In this task, you will apply a different threshold value. You can get this probability from the **predict_proba** function instead of **predict**. Assign all predictions with probabilities greater than or equal to this threshold to 1, and the others to 0.
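The effect of moving the threshold is easy to see on a handful of made-up probabilities. In this sketch, `proba` stands in for the second column of a `predict_proba` result and `apply_threshold` is a hypothetical helper.

```python
import numpy as np

# Made-up churn probabilities, standing in for predict_proba(...)[:, 1].
proba = np.array([0.10, 0.44, 0.45, 0.50, 0.90])

def apply_threshold(proba, threshold):
    # 1 where the probability reaches the threshold, 0 elsewhere.
    return (proba >= threshold).astype(int)

print(apply_threshold(proba, 0.50))
print(apply_threshold(proba, 0.45))
```

Lowering the threshold can only turn 0 predictions into 1 predictions, so recall never decreases, while the number of false alarms may grow.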

In [0]:
threshold = 0.45
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#### using predict_proba and set threshold
pred = 

<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
pred = logistic_regression.predict_proba(x_testing)
pred = pred[:,1]  # probability of class 1 (churn)
pred = (pred >= threshold).astype(int)
</code>
</pre>
</details>

In [0]:
#Calculate TP, FP, TN, FN and print the result
TP, FP, TN, FN = confusionCal(y_testing, pred)

print ("Logistic Regression", "True Positives:",TP, "False Positives:",FP,"True Negatives:",TN , "False Negatives:", FN) 

log_recall=TP/(TP+FN)

print("Logistic Regression Recall On Testing Data:", log_recall *100,'%')

# Part 3 - Playing with ROC

### Todo#11 - Exploring our model using ROC

After exploring different thresholds, we will examine our model in more depth using its <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html">Receiver Operating Characteristic (ROC)</a> curve. In this task, you will be working with two models with slightly different parameters. Take the probability outputs from both models, and use the roc_curve function to calculate the false positive rates (FPR) and true positive rates (TPR).
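Conceptually, an ROC curve is what you get by sweeping the threshold and recording one (FPR, TPR) point per value. The sketch below does this by hand on toy labels and scores; it is an illustration of the idea rather than a reimplementation of `roc_curve`, and `tpr_fpr` is a hypothetical helper.

```python
import numpy as np

# Toy labels and scores, not output from the notebook's models.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

def tpr_fpr(y_true, scores, threshold):
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)   # share of actual positives caught
    fpr = fp / np.sum(y_true == 0)   # share of actual negatives flagged
    return tpr, fpr

for t in [0.9, 0.7, 0.35, 0.0]:
    tpr, fpr = tpr_fpr(y_true, scores, t)
    print(t, round(tpr, 2), round(fpr, 2))
```

As the threshold drops, both rates move from 0 towards 1; a better model reaches a high TPR while the FPR is still low.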

In [0]:
logistic_regression_line1 = LogisticRegression(max_iter=1, class_weight={1:0.9, 0:0.1})
logistic_regression_line1.fit(x_training,y_training)

logistic_regression_line2= LogisticRegression(max_iter=1, class_weight={1:0.99, 0:0.01})
logistic_regression_line2.fit(x_training,y_training)

################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#### Predict and Use roc_curve function to get fpr and tpr of prediction of line1 model
pred = 
fpr, tpr, thresholds = 
roc_auc = 

#### Predict and Use roc_curve function to get fpr and tpr of prediction of line2 model
pred2 = 
fpr2, tpr2, thresholds2 = 
roc_auc2 = 

<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>


##### Predict and Use roc_curve function to get fpr and tpr of prediction of line1 model

pred = logistic_regression_line1.predict_proba(x_testing)
fpr, tpr, thresholds = roc_curve(  y_testing.values, pred[:,1] )
roc_auc = auc(fpr, tpr)

##### Predict and Use roc_curve function to get fpr and tpr of prediction of line2 model

pred2 = logistic_regression_line2.predict_proba(x_testing)
fpr2, tpr2, thresholds2 = roc_curve(  y_testing.values, pred2[:,1] )
roc_auc2 = auc(fpr2, tpr2)
</code>
</pre>
</details>

In [0]:
plt.figure(figsize=(12,8))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot(fpr2, tpr2, lw=1, label='ROC curve (area = %0.2f)' % roc_auc2)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")

plt.show()

According to the graph above, which line (orange or blue) do you think is better? 

**Ans**

In [0]:
### If we would like to set FPR as 0.1

print("With the condition, threshold is ",thresholds[np.argmin(abs(fpr-0.1))])
print("TPR : ",tpr[np.argmin(abs(fpr-0.1))])
print("FPR : ",fpr[np.argmin(abs(fpr-0.1))])

In [0]:
### If we would like to set TPR as 0.8

print("With the condition, threshold is ",thresholds[np.argmin(abs(tpr-0.8))])
print("TPR : ",tpr[np.argmin(abs(tpr-0.8))])
print("FPR : ",fpr[np.argmin(abs(tpr-0.8))])

### Todo#12 - Model Optimization

You may have noticed that the data is imbalanced: there are significantly more 0 labels than 1 labels. One possible workaround for this issue is to set the *class_weight* parameter in the model.
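One common way to pick class weights, instead of choosing them by hand, is to weight each class inversely to its frequency. scikit-learn documents its `class_weight="balanced"` option as `n_samples / (n_classes * np.bincount(y))`; the sketch below reproduces that formula with NumPy on toy labels, so the numbers are illustrative and not taken from the Telco data.

```python
import numpy as np

# Toy labels with the same kind of imbalance: six 0s and two 1s.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

counts = np.bincount(y)                 # samples per class: [6, 2]
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: weight each class inversely to its frequency.
balanced = n_samples / (n_classes * counts)
class_weights = {cls: float(w) for cls, w in enumerate(balanced)}
print(class_weights)
```

A dictionary of this shape can be passed as `class_weight` to `LogisticRegression`, in the same way as the hand-picked weights used in this task.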


In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#we initialize class weights for our two labels : 1 (churned) and 0 (not churned) at 70-30

class_weights ={1: 0.70, 0: 0.30}

### we create a new logistic regression model optimized for a better recall and retrain it on our data
logistic_regression_opt = 
### Fit model


<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
class_weights ={1: 0.70, 0: 0.30}

logistic_regression_opt= LogisticRegression(class_weight=class_weights)
logistic_regression_opt.fit(x_training,y_training)

</code>
</pre>
</details>

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#Re-evaluating the model trained with the 70-30 class weights.

pred = 
TP, FP, TN, FN = 

print ("Logistic Regression", "True Positives:",TP, "False Positives:",FP,"True Negatives:",TN , "False Negatives:", FN) 

log_recall=TP/(TP+FN)

print("Logistic Regression Recall On Testing Data:", log_recall *100,'%')

<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>

pred = logistic_regression_opt.predict(x_testing)
TP, FP, TN, FN = confusionCal(y_testing, pred)

print ("Logistic Regression", "True Positives:",TP, "False Positives:",FP,"True Negatives:",TN , "False Negatives:", FN) 

log_recall=TP/(TP+FN)

print("Logistic Regression Recall On Testing Data:", log_recall *100,'%')


</code>
</pre>
</details>

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#Update the weights to 85-15%
class_weights = {1: 0.85, 0: 0.15}

### Retrain the algorithm
logistic_regression_opt = 
### Fit model 


<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
class_weights ={1: 0.85, 0: 0.15}

logistic_regression_opt= LogisticRegression(class_weight=class_weights)
logistic_regression_opt.fit(x_training,y_training)


</code>
</pre>
</details>

In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################

#Re-evaluating the model trained with the 85-15 class weights.
pred = logistic_regression_opt.predict(x_testing)
TP, FP, TN, FN = confusionCal(y_testing, pred)
print ("Logistic Regression", "True Positives:",TP, "False Positives:",FP,"True Negatives:",TN , "False Negatives:", FN) 

log_recall=TP/(TP+FN)
print("Logistic Regression Recall On Testing Data:", log_recall *100,'%')

<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
pred = logistic_regression_opt.predict(x_testing)
TP, FP, TN, FN = confusionCal(y_testing, pred)

print ("Logistic Regression", "True Positives:",TP, "False Positives:",FP,"True Negatives:",TN , "False Negatives:", FN) 

log_recall=TP/(TP+FN)

print("Logistic Regression Recall On Testing Data:", log_recall *100,'%')

</code>
</pre>
</details>

As you have seen previously, a model's prediction falls into one of these four outcomes: TP,  FP, TN and FN.

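You may have also seen the values TPR and FPR. These rates are related to our four outcomes. TPR (true positive rate) is the fraction of actual positives that the model predicts as positive: TP / (TP + FN), which is the same as recall. FPR (false positive rate) is the fraction of actual negatives that the model wrongly predicts as positive: FP / (FP + TN).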

Adjusting the threshold affects TPR and FPR. Choosing the most suitable threshold therefore depends on the task, as each real-world task may have different operational costs for TP, FP, FN and TN. 



In [0]:
pred = logistic_regression_opt.predict_proba(x_testing)

b_y = np.linspace(0, 1.05)
b_x = 1 - b_y

fpr, tpr, thresholds = roc_curve(  y_testing.values, pred[:,1] )
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(12,8))

plt.plot(b_x, b_y)
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")

plt.show()

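The blue line in this graph marks the points where the miss rate (1 - TPR) equals the false alarm rate (FPR), which is the operating point to aim for when a miss (FN) is as costly as a false alarm (FP). The point where it intersects the ROC curve (orange) is called the Equal Error Rate (EER).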

The following code calculates the threshold most suitable for this condition.

In [0]:
### Calculate Miss
miss = 1 - tpr
print("With the condition, threshold is ",thresholds[(np.nanargmin(abs(miss - fpr)))])

### Todo#13 - Tuning on task (2)

Let's think about how we can prevent churn.

One way is to give promotions to potential churn customers.

Let's say we give out coupons to customers that will reduce the price by 5 USD.
The average payment for a customer is 65 USD.

A false alarm would cost us 5 USD. (why?)
A correct churn prediction would give us 60 USD.

Can you draw a graph and find the threshold that matches with the condition?
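A different way to look at the same trade-off, separate from the graphical approach this task asks for, is to compute the total gain for a few candidate thresholds and keep the best one. The sketch below uses toy labels and probabilities and assumes, for simplicity, that every correctly flagged churner accepts the coupon and stays; `profit`, `GAIN_TP` and `COST_FP` are names made up for this example.

```python
import numpy as np

# Toy labels and churn probabilities; the gain and cost follow the text above.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
proba = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.4, 0.5, 0.1])
GAIN_TP, COST_FP = 60, 5  # USD per correct churn prediction / per false alarm

def profit(threshold):
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return int(GAIN_TP * tp - COST_FP * fp)

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
profits = [profit(t) for t in candidates]
best = candidates[int(np.argmax(profits))]
print(dict(zip(candidates, profits)), best)
```

On real data the candidate thresholds would more naturally come from the `thresholds` array returned by `roc_curve`, and the result should be validated on data not used for choosing the threshold.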


In [0]:
################################################################################
#                            WRITE YOUR CODE BELOW                             #
################################################################################



################################################################################
plt.figure(figsize=(12,8))
plt.plot(b_x, b_y)
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")

plt.show()

<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
pred = logistic_regression_opt.predict_proba(x_testing)

b_y = np.linspace(0, 1.00)
b_x = (1-b_y)/12

fpr, tpr, thresholds = roc_curve(  y_testing.values, pred[:,1], drop_intermediate=False )
roc_auc = auc(fpr, tpr)

</code>
</pre>
</details>

In [0]:
### Calculate Miss


<details>
<summary>SOLUTION HERE!</summary>
<pre>
<code>
miss = (1 - tpr)/12  # same 60:5 ratio as the plotted line
print("With the condition, threshold is ",thresholds[(np.nanargmin(abs(miss - fpr)))])
</code>
</pre>
</details>

In this problem, we also know the current monthly payment of each customer, so we can actually have different thresholds for each customer. How would we accomplish this? What would be the criterion to intervene? This is left as a thought exercise :)