<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S7_intro_ML/S07_In_After_Class_Demo_Telco.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 7: Introduction to ML - Classification

## Demo 3: Classification on Customer Churn Dataset

This demo trains classifiers on the Telco customer churn dataset (https://www.kaggle.com/blastchar/telco-customer-churn) to predict whether a customer will churn.

In [None]:
!pip install -q dtreeviz
from dtreeviz.trees import *

In [None]:
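import sklearn
import sklearn.linear_model, sklearn.metrics, sklearn.model_selection, sklearn.tree  # explicit submodules instead of a wildcard import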

In [None]:
import pandas as pd
# Load transformed dataset
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S7_intro_ML/data/Telco-Customer-Churn_dummies.csv'
customer_data = pd.read_csv(url)

selected_features = customer_data.columns[1:-1].values # features: skip the first column and the last column (the last column is the label)
print(selected_features)
X = customer_data[selected_features].values
y = customer_data['Churn_Yes'].values
customer_data.head(5)

In [None]:
# Split into training and testing data (75/25 by default)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=0)
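
The default split ratio can be checked on a small toy array. This is a minimal sketch, not part of the original demo: `X_toy`, `y_toy` and the other variable names are hypothetical, and the `stratify` option is shown only as one possible way to keep the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 20% of them in the positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# With neither test_size nor train_size given, 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
print(len(X_tr), len(X_te))

# stratify=y_toy keeps the share of positives the same in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)
print(y_tr_s.mean(), y_te_s.mean())
```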

### 1. Logistic regression model

In addition to precision and recall, we can report a single measure that combines both: the F-score (see [link](https://en.wikipedia.org/wiki/F-score)). The F1 score is the harmonic mean of precision and recall. `F1 = 1.0` indicates perfect precision and recall, whereas the lowest value, `F1 = 0`, occurs when either precision or recall equals zero. The F1 score is calculated as follows:

$F_1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = 2\frac{Precision \times Recall}{Precision + Recall}$

We can use the sklearn function `sklearn.metrics.f1_score(y_true, y_pred)` to calculate it (see [link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)).
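
As a quick check of the formula, the sketch below computes the F1 score by hand on a small set of made-up labels and compares it with `f1_score`. The labels and the `_toy` variable names are hypothetical and unrelated to the churn data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (hypothetical): 4 actual positives, 3 predicted positives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision_toy = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP) = 2 / 3
recall_toy = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN) = 2 / 4

# Harmonic mean of precision and recall
f1_manual = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)
f1_sklearn = f1_score(y_true_toy, y_pred_toy)

print("Manual F1:  {:.3f}".format(f1_manual))
print("sklearn F1: {:.3f}".format(f1_sklearn))
```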

In [None]:
# Run Logistic regression
logreg = sklearn.linear_model.LogisticRegression().fit(X_train, y_train)

# Print the results
print("Logistic Regression: Training accuracy: {:.3f}".format(logreg.score(X_train, y_train)))
print("Logistic Regression: Testing accuracy: {:.3f}".format(logreg.score(X_test, y_test)))

# Predict the values y from the trained model on the test set
y_predict = logreg.predict(X_test)
print("Logistic Regression: Precision: {:.3f}".format(sklearn.metrics.precision_score(y_test, y_predict)))
print("Logistic Regression: Recall: {:.3f}".format(sklearn.metrics.recall_score(y_test, y_predict)))
print("Logistic Regression: F1-Score: {:.3f}".format(sklearn.metrics.f1_score(y_test, y_predict)))

# We can also print out a confusion matrix
print(sklearn.metrics.confusion_matrix(y_test, y_predict))
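
To read the confusion matrix, it helps to unpack its four cells. The following is a minimal sketch on made-up labels (the `_toy` names are hypothetical). For binary labels 0 and 1, rows are the true classes and columns are the predicted classes, so `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical, not from the churn data)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Rows: true class (0, 1); columns: predicted class (0, 1)
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)

# Precision and recall follow directly from the four counts
precision_from_cm = tp / (tp + fp)
recall_from_cm = tp / (tp + fn)
print("Precision: {:.3f}, Recall: {:.3f}".format(precision_from_cm, recall_from_cm))
```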

### 2. Decision tree model

In [None]:
# Run decision tree

tree_results = pd.DataFrame(columns = ['Depth','Train-Accuracy','Test-Accuracy','Precision','Recall', 'F-Score'])

for depth in range(3,10,2):
  tree = sklearn.tree.DecisionTreeClassifier(max_depth = depth, random_state=0).fit(X_train, y_train)
  y_predict = tree.predict(X_test)

  # compute each measure and add to the new row in DataFrame
  train_acc = tree.score(X_train, y_train)
  test_acc = tree.score(X_test, y_test)
  test_precision = sklearn.metrics.precision_score(y_test, y_predict)
  test_recall = sklearn.metrics.recall_score(y_test, y_predict)
  f_score = sklearn.metrics.f1_score(y_test, y_predict)  # same as 2*P*R/(P+R), but returns 0 instead of dividing by zero
  
  # add the results (list) to the last row in the DataFrame
  tree_results.loc[len(tree_results)] = [depth,train_acc,test_acc,test_precision,test_recall,f_score]

tree_results

Visualize the last tree fitted in the loop above (`max_depth=9`) with dtreeviz; see [link1](https://github.com/parrt/dtreeviz) and [link2](https://colab.research.google.com/github/parrt/dtreeviz/blob/master/notebooks/examples.ipynb).

In [None]:
viz = dtreeviz(tree, X_train, y_train, target_name="Churn", feature_names=selected_features, max_X_features_LR = 2, max_X_features_TD = 2)
viz

#### Explore feature importance and tree

In [None]:
print(selected_features)
print(tree.feature_importances_)
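
Two separate arrays are hard to compare by eye. One option is to pair each importance with its feature name in a pandas Series and sort it. The sketch below uses synthetic data from `make_classification`; `demo_tree`, `demo_names` and the other names are hypothetical stand-ins for `tree` and `selected_features`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 5 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
demo_names = ["feature_{}".format(i) for i in range(5)]

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important
importances = pd.Series(demo_tree.feature_importances_, index=demo_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same pattern would apply with `tree.feature_importances_` and `selected_features`.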

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,12))
sklearn.tree.plot_tree(tree, feature_names=selected_features, fontsize=10, max_depth=2)
plt.show()

### 3. Evaluation - calculate precision and recall

Calculate precision and recall on the test set for one of the fitted models (either logistic regression or decision tree).

In [None]:
# calculate precision and recall

y_predict = logreg.predict(X_test) #use this for logistic regression 
# y_predict = tree.predict(X_test) #use this for decision tree

print("Precision score: {:.3f}".format(sklearn.metrics.precision_score(y_test, y_predict)))
print("Recall score: {:.3f}".format(sklearn.metrics.recall_score(y_test, y_predict)))

Plot the precision-recall curve (see https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py)

In [None]:
import matplotlib.pyplot as plt

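# decision_function returns signed scores rather than probabilities; precision_recall_curve accepts either
y_prob = logreg.decision_function(X_test) #use this for logistic regression
# y_prob = tree.predict_proba(X_test)[:,1]  #use this for decision tree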

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_test, y_prob)

plt.figure()
plt.step(recall, precision)
average_precision = sklearn.metrics.average_precision_score(y_test, y_prob)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Average precision score: AP={0:0.2f}'.format(average_precision))
plt.show()
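
The average precision shown in the title summarizes the curve as a weighted sum of precisions, where each weight is the increase in recall from the previous threshold. As a minimal sketch on four made-up scores (the `_toy` names are hypothetical), the value can be reproduced from the output of `precision_recall_curve`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores (hypothetical, not from the churn data)
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

precision_toy, recall_toy, thresholds_toy = precision_recall_curve(y_true_toy, scores_toy)

# Recall is returned in decreasing order, so np.diff is negative; hence the minus sign
ap_manual = -np.sum(np.diff(recall_toy) * precision_toy[:-1])
ap_sklearn = average_precision_score(y_true_toy, scores_toy)

print("Manual AP:  {:.3f}".format(ap_manual))
print("sklearn AP: {:.3f}".format(ap_sklearn))
```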