<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/EN/S07_Intro_ML/S07_In_After_Class_Demo_Telco.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 7: Introduction to ML - Classification
---
*Created by Yossiri Adulyasak*

---

## Demo: Classification on Customer Churn Dataset - full-scale with tree exploration and visualization

Classification using the Telco customer churn dataset (https://www.kaggle.com/blastchar/telco-customer-churn).

This notebook demonstrates the pipeline running on the full dataset. Review it once you are familiar with the main pipeline and the building blocks of the predictive task. For a detailed description of each step, please refer to the simpler Telco notebook example.

In [None]:
!pip install -q dtreeviz
import dtreeviz

In [None]:
import sklearn
from sklearn import *

## **Blocks 1 and 2**: data input and feature preparation

In [None]:
import pandas as pd
# Load transformed dataset
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/EN/S07_Intro_ML/data/Telco-Customer-Churn_dummies.csv'
customer_data = pd.read_csv(url)

#take all columns except the first one (ID) until the column prior to the predicted value (the last column is the label y)
selected_features = customer_data.columns[1:-1].values

print(selected_features)
X = customer_data[selected_features].values
y = customer_data['Churn_Yes'].values
customer_data.head(5)

In [None]:
# Split into training and testing data (75/25 by default)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=0)

print("N. Training:", len(y_train), ", N. Testing:", len(y_test))

## **Blocks 3 and 4**: Model and validation

### Model 1: Logistic regression

In addition to precision and recall, we can compute an aggregate of the two called the F-score (see [link](https://en.wikipedia.org/wiki/F-score)). The F1 score is the harmonic mean of precision and recall: `F-Score = 1.0` indicates perfect precision and recall, whereas the lowest value, `F-Score = 0`, occurs when either precision or recall equals zero. The F1 score is calculated as follows:

$F_1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = 2\frac{Precision \times Recall}{Precision + Recall}$

We can use the sklearn function `.f1_score(y_true, y_predict)` to calculate it (see [link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)).
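
As a quick sanity check of the formula, here is a minimal sketch that computes precision, recall and the F1 score directly from confusion-matrix counts. The helper `f1_from_counts` and the counts used below are hypothetical and chosen only for illustration; they are not results from the Telco dataset.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    A ratio with a zero denominator is reported as 0.0 (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives
precision, recall, f1 = f1_from_counts(tp=60, fp=20, fn=40)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

With these counts, precision is 60/80 and recall is 60/100, so the harmonic mean sits between the two and closer to the smaller value.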

In [None]:
# Run Logistic regression
logreg = sklearn.linear_model.LogisticRegression().fit(X_train, y_train)

# Print the results
print("Logistic Regression: Training accuracy: {:.3f}".format(logreg.score(X_train, y_train)))
print("Logistic Regression: Testing accuracy: {:.3f}".format(logreg.score(X_test, y_test)))

# Predict the values y from the trained model on the test set
y_predict = logreg.predict(X_test)
print("Logistic Regression: Precision: {:.3f}".format(sklearn.metrics.precision_score(y_test, y_predict)))
print("Logistic Regression: Recall: {:.3f}".format(sklearn.metrics.recall_score(y_test, y_predict)))
print("Logistic Regression: F1-Score: {:.3f}".format(sklearn.metrics.f1_score(y_test, y_predict)))

# We can also print out a confusion matrix
print("Display confusion matrix")
print(sklearn.metrics.confusion_matrix(y_test, y_predict))

### Model 2: Decision tree

Here we try the decision tree with different `max_depth` to evaluate the results.

In [None]:
# Run decision tree

tree_results = pd.DataFrame(columns = ['Depth','Train-Accuracy','Test-Accuracy','Precision','Recall', 'F-Score'])
best_depth = 1

for depth in range(2,8):
  tree = sklearn.tree.DecisionTreeClassifier(max_depth = depth, random_state=0).fit(X_train, y_train)
  y_predict = tree.predict(X_test)

  # compute each measure and add to the new row in DataFrame
  train_acc = tree.score(X_train, y_train) # accuracy on the training set
  test_acc = tree.score(X_test, y_test)  # accuracy on the test set
  test_precision = sklearn.metrics.precision_score(y_test, y_predict) # precision on the test set
  test_recall = sklearn.metrics.recall_score(y_test, y_predict) # recall on the test set
  f_score = 2*test_precision*test_recall/(test_precision+test_recall) # f-score on the test set (manual calculation)

  # add the results (list) to the last row in the DataFrame
  tree_results.loc[len(tree_results)] = [depth,train_acc,test_acc,test_precision,test_recall,f_score]

  # we keep track of the best depth
  if f_score >= max(tree_results['F-Score']):
    best_depth = depth

print("Best depth = ", best_depth)
tree_results

We can also plot the feature importances to gain insight into the best tree (selected by F-score).

In [None]:
# retrain the model using the best depth
tree = sklearn.tree.DecisionTreeClassifier(max_depth = best_depth, random_state=0).fit(X_train, y_train)

# obtain feature importance scores and save in dataframe
feature_df = pd.DataFrame(index = selected_features)
feature_df['Importance'] = tree.feature_importances_
feature_df = feature_df[feature_df['Importance'] > 0].sort_values('Importance')
display(feature_df)

feature_df.plot(kind='barh', figsize = (5,7))

To obtain the probability associated with the positive class (label = 1) rather than the binary prediction, we can use the function `.predict_proba(X)` ([link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba)). This function returns a 2-D array in which each row holds the probabilities of label 0 and label 1, so we can keep only the second element of each row by slicing with `[:, 1]` (see [link](https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/) on slicing 2-D arrays).
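
To make the slicing concrete, here is a small sketch on a hand-written array that stands in for the output of `.predict_proba(X)`. The array `proba` and the threshold of 0.3 are hypothetical values chosen for illustration; they do not come from the trained models in this notebook.

```python
import numpy as np

# Hypothetical stand-in for the output of `.predict_proba(X)`:
# one row per data point, columns = [P(label 0), P(label 1)]
proba = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.55, 0.45],
    [0.20, 0.80],
])

# Keep all rows (:) and only the second column (index 1)
y_prob = proba[:, 1]
print(y_prob)

# Turn probabilities into labels with a custom threshold (assumed value)
threshold = 0.3
y_label = (y_prob >= threshold).astype(int)
print(y_label)
```

Working with probabilities rather than hard labels lets you pick a threshold other than the default, for example a lower one if you would rather flag more potential churners at the cost of precision.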

In [None]:
# y_prob = logreg.predict_proba(X_test)[:, 1] # use this for logistic regression
y_prob = tree.predict_proba(X_test)[:, 1] # use this for tree
print(y_prob)

### Tree exploration and visualization

In the lecture, we plotted the tree in `sklearn` using the function [`plot_tree`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html). However, the `dtreeviz` package ([link1](https://github.com/parrt/dtreeviz), [link2](https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_sklearn_visualisations.ipynb) and [link3](https://explained.ai/decision-tree-viz/)) offers a richer visualization of the tree. This package can be slow, particularly with a large dataset and a deep tree, so you should limit the tree size when using it.

In [None]:
tree = sklearn.tree.DecisionTreeClassifier(max_depth = 3, random_state=0).fit(X_train, y_train)
viz = dtreeviz.model(tree, X_train, y_train, target_name="Churn", feature_names=selected_features)
viz.view()

You can also save the tree visualization to a file by calling `.save()` on the rendered view.

In [None]:
viz.view().save('tree.svg')

In Colab, the file is saved on the remote runtime, so we also need to download it to the local machine.

In [None]:
from google.colab import files
files.download('tree.svg')

Another use of this library is to display the prediction path for a single data point. Here we highlight that path on the full tree.

In [None]:
viz.view(x=X_test[4])

Alternatively, you can also show a single path (rather than the entire tree).

In [None]:
viz.view(x=X_test[4], show_just_path=True)