# Import & Prepare Data

In [None]:
import pandas as pd
churn = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/churn.csv")

## Check Structure of Data

In [None]:
churn.______



*   No missing data
*   23 Features
* Some features are objects (text data). We need to transform them because scikit-learn cannot work with them



In [None]:
churn.______

## Separate Features and Labels

In [None]:
X = ______ # Features
y = ______ # Target variable

In [None]:
y

In [None]:
X.info()

## Recode pandas "objects"

"Objects" in pandas are textual variables. Sklearn cannot work with them, so we have to recode them into numbers.

In [None]:
churn["occupation"].unique() #unique returns the unique values for a variable

### One-hot encoding

In [None]:
X_onehot = pd.get_dummies(X, drop_first = False)
X_onehot.head()

Check out the feature "customer_suspended". Due to the one-hot encoding there is now a "customer_suspended_Yes" and a "customer_suspended_No" version. Of course, both variables contain the same information.

### Dummy coding

In [None]:
X = pd.get_dummies(X, drop_first = ______)
X.head()

Adding the argument "drop_first = True" deletes the redundant features.

Some algorithms have problems dealing with the redundant information (e.g. linear models). Thus, dummy coding is the safer bet and does not lose any information (= preferred choice)
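A toy example may make the difference concrete. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the churn data) and compares both encodings:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"customer_suspended": ["Yes", "No", "No", "Yes"],
                    "age": [25, 40, 31, 58]})

onehot = pd.get_dummies(toy, drop_first=False)  # one column per category
dummy = pd.get_dummies(toy, drop_first=True)    # first category ("No") is dropped

print(list(onehot.columns))
print(list(dummy.columns))

# The two one-hot columns always add up to 1, so one of them is redundant
row_sums = (onehot["customer_suspended_No"].astype(int)
            + onehot["customer_suspended_Yes"].astype(int))
print(row_sums.tolist())
```

Because "customer_suspended_No" is always 1 minus "customer_suspended_Yes", dropping it removes a column but no information.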

But we now have a larger number of features



In [None]:
X.info()

# Train and Plot a First Decision Tree


## 1) Import Model Function

In [None]:
from sklearn.tree import DecisionTreeClassifier

## 2) Instantiate Model

In [None]:
tree = DecisionTreeClassifier(criterion="entropy", 
                              max_depth=2)

Used hyperparameters:

* **criterion="entropy":** using information gain as the measure for splitting
* **max_depth:** maximum depth of the tree, i.e. the maximum number of consecutive splits between the root and a leaf
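To illustrate what "information gain" measures, here is a minimal sketch in plain Python. The helper names (entropy, information_gain) and the toy labels are assumptions made for this example; sklearn computes the criterion internally and does not expose these functions.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with 4 remaining (0) and 4 churning (1) customers
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]   # one candidate split

print(entropy(parent))
print(information_gain(parent, left, right))
```

A split that separates the classes more cleanly yields purer children and therefore a higher information gain; the tree picks the split with the highest gain at each node.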



## 3) Fit Model to Data

In [None]:
tree.fit(X,y)

## 4) Make Predictions

.predict() returns the predicted class. For a binary problem, sklearn effectively uses a default threshold of 0.5: every predicted probability that is higher is set to 1, every predicted probability that is lower is set to 0.

In [None]:
tree.predict(X)

We can also get the predicted probabilities with .predict_proba(). We get two columns: one for the negative class (remain) and one for the positive class (churn). The two values sum to 1.

The positive class is in the second column:

In [None]:
tree.predict_proba(X)
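The relationship between the two methods can be checked on a small example. The sketch below uses synthetic stand-in data from make_classification (an assumption; it is not the churn dataset) and hypothetical variable names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

proba = demo_tree.predict_proba(X_demo)   # one column per class, ordered like classes_
pred = demo_tree.predict(X_demo)

print(demo_tree.classes_)                 # column order of predict_proba
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1)
# predict() returns the class with the larger predicted probability
same_as_argmax = np.array_equal(pred, demo_tree.classes_[proba.argmax(axis=1)])
print(rows_sum_to_one, same_as_argmax)
```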

## Plot Decision Tree

The following function can be used to draw a DecisionTreeClassifier model:

In [None]:
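def plot_tree_classification(treemodel, X):
    # Imported locally because the notebook also names a fitted model "tree"
    from sklearn import tree
    import matplotlib.pyplot as plt
    fig = plt.figure(figsize=(60,20))  # large canvas so deeper trees stay readable
    _ = tree.plot_tree(treemodel, filled=True, class_names=['0','1'],
                       feature_names=list(X.columns),  # plain list of column names
                       proportion=True, precision=2)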

In [None]:
plot_tree_classification(tree, X)

## Check out other Hyperparameters - How is the tree affected?


Hyperparameters to vary

* **max_depth:** maximum depth of the tree, i.e. the maximum number of consecutive splits between the root and a leaf
* **min_samples_leaf:** The minimum number of samples required at a leaf node. A split is only considered if each child leaf keeps at least min_samples_leaf instances.
* **min_samples_split:** The minimum number of instances required to split an internal node. Must be at least 2.






*  Increasing max_depth: tree becomes bigger and more complex
*  Increasing min_samples_leaf and min_samples_split: tree becomes more compact
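One way to see these effects without plotting is to count the leaves of the fitted tree. The following is a sketch on synthetic data (an assumption, not the churn dataset); the helper n_leaves is hypothetical, while get_n_leaves() is a method of the fitted sklearn tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

def n_leaves(**params):
    """Fit a tree with the given hyperparameters and return its number of leaves."""
    model = DecisionTreeClassifier(criterion="entropy", random_state=0, **params)
    return model.fit(X_demo, y_demo).get_n_leaves()

shallow = n_leaves(max_depth=2)                               # at most 2**2 = 4 leaves
deep = n_leaves(max_depth=10)                                 # may grow much larger
deep_restricted = n_leaves(max_depth=10, min_samples_leaf=50) # at most 500 / 50 = 10 leaves
print(shallow, deep, deep_restricted)
```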



In [None]:
tree = DecisionTreeClassifier(criterion="entropy", 
                              max_depth=______,
                              min_samples_split=______,
                              min_samples_leaf=______)
tree = tree.fit(X,y)
plot_tree_classification(tree, X)

# Evaluate Accuracy of Classifier

## 1) Import Model Functions

In [None]:
from sklearn.tree import ______
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## 2) Instantiate Model

In [None]:
tree_train = DecisionTreeClassifier(criterion="entropy", 
                                    max_depth=4)

## 3) Create Test & Training Data


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)


*   **X:** Features to be split into testing and training data
*   **y:** Labels to be split into testing and training data
*   **test_size:** proportion of the dataset in the test data; usually ~ 30%
*   **random_state:** seed for making results reproducible. Instances are assigned to the training and test data at random, so without a seed every run (and every computer) produces a different split. With the same seed, the data is split in the same way every time.
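The effect of the seed can be verified on a tiny made-up array. This is a minimal sketch with hypothetical variable names; it assumes the same sklearn version is used for both calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical instances, 2 features
y_demo = np.arange(10)

# Same seed -> identical split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

print(a_train.shape, a_test.shape)      # 70% / 30% of the 10 rows
same_split = np.array_equal(a_test, b_test)
print(same_split)
```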




We can use the .shape attribute to check whether the data splitting was successful

In [None]:
X.shape #4863 instances and 25 variables in the entire dataset

In [None]:
X_train.______ #3404 instances and 25 variables in the training dataset

In [None]:
X_test.______ #1459 instances and 25 variables in the test dataset

## 4) Fit Model to Training Data


In [None]:
tree_train.______(______, ______)

## 5) Make Predictions on Testing Data


In [None]:
y_pred = tree_train.predict(______)
y_pred

## 6) Score Accuracy

In [None]:
accuracy_score(y_test, y_pred)

## Determining Overfitting

Calculate the accuracy of the decision tree for a "max_depth" of 10, 20, 30, 40, 50!

In [None]:
tree_train = DecisionTreeClassifier(criterion="entropy", max_depth=______)
tree_train.fit(X_train, y_train)
y_pred = tree_train.predict(X_test)
accuracy_score(y_test, y_pred)
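Overfitting shows up as a gap between training and test accuracy. The sketch below compares both for several depths on synthetic, noisy data (an assumption; the numbers will differ from the churn data), using separate variable names so the notebook's own split is not overwritten:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with label noise, only for illustration
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     flip_y=0.2, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo,
                                                        test_size=0.3, random_state=1)

results = {}
for depth in [2, 5, 10, 20]:
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=1)
    model.fit(Xd_train, yd_train)
    results[depth] = (accuracy_score(yd_train, model.predict(Xd_train)),   # training accuracy
                      accuracy_score(yd_test, model.predict(Xd_test)))     # test accuracy
    print(depth, results[depth])
```

If the training accuracy keeps rising with depth while the test accuracy stagnates or drops, the deeper trees are memorizing the training data.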

# Evaluating Model Performance the Data Scientist Way

## Accuracy Paradox

In [None]:
len(y_test)

In [None]:
accuracy_score(y_test,[0]*1459)

## Instantiating a new model at the "sunshine spot"

In [None]:
tree_educatedguess = DecisionTreeClassifier(criterion="entropy", 
                              max_depth=______,
                              min_samples_leaf=50,
                              random_state=12)
tree_educatedguess.fit(______, ______)
y_pred = tree_educatedguess.______(______)

We provide random_state=12 so that we all get identical results. The seed (= the number) itself does not matter; it just has to be the same for everyone!

min_samples_leaf=50 is added for didactic reasons to show a specific effect

## Get confusion_matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

- True Negatives = 696
- False Positives = 229
- False Negatives = 269
- True Positives = 265

## Calculating accuracy

In [None]:
(696 + 265) / (696 + 265 + 229 + 269)

## Calculating Recall / Sensitivity

- True Positives / (True Positives + False Negatives) 
- Recall / Sensitivity = Proportion of churning customers that we can detect!

In [None]:
265 / (265 + 269)

## Calculate Precision

- True Positives / (True Positives + False Positives)
- Precision = If we flag a customer as churning, what is the chance that this is correct? 

In [None]:
265 / (265 + 229)

## Specificity

- True Negatives / (True Negatives + False Positives)
- Specificity = Proportion of non-churning customers that we can identify
- False Alarm = 1 - Specificity

In [None]:
696 / (696 + 229)
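The calculations above can be collected in one place. This sketch only reuses the four counts from the confusion matrix above; the variable names are chosen for this example.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 696, 229, 269, 265

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)              # sensitivity
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
false_alarm = 1 - specificity

for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("precision", precision), ("specificity", specificity),
                    ("false alarm", false_alarm)]:
    print(f"{name}: {value:.3f}")
```

For a binary problem, confusion_matrix(y_test, y_pred).ravel() returns the counts in exactly this order (tn, fp, fn, tp), so they do not have to be typed in by hand.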

## Classification Report - Getting it all at once

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Additional measures:


*   **F1-Score:** harmonic mean of precision and recall; a very good measure of "overall accuracy" because it balances precision and recall
*   **Support:** Number of instances that churn (1) / not churn (0)
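As a small worked example, the F1-score of the positive class (churn) can be reproduced from the confusion matrix counts used earlier. The helper f1_from is a hypothetical name for this sketch, and the second call uses made-up values.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Positive class (churn), using the confusion matrix counts from above
precision = 265 / (265 + 229)
recall = 265 / (265 + 269)
f1_churn = f1_from(precision, recall)
print(round(f1_churn, 3))

# Hypothetical lopsided model: the harmonic mean stays close to the weaker value
f1_lopsided = f1_from(1.0, 0.1)
print(round(f1_lopsided, 3), "vs. arithmetic mean", (1.0 + 0.1) / 2)
```

The harmonic mean punishes a large gap between precision and recall, which is why a model cannot reach a high F1-score by being good at only one of the two.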



## Varying the threshold for classification


By default, a customer is classified as churning if the predicted churn probability is > 0.5!

Using .predict_proba() we can get the predicted probabilities and try out our own thresholds:

In [None]:
tree_educatedguess.predict_proba(X_test)

In [None]:
y_pred =  (tree_educatedguess.predict_proba(X_test)[:, 1] > 0.1).astype(int)
print(classification_report(y_test, y_pred))

* This model has lower accuracy
* Perfect Recall (all churning customers can be identified)
* Low Precision (only 49% of the customers flagged as churning actually churn)

The threshold of 0.1 is very low: every customer with a churn probability > 10% is flagged as churning (a quite optimistic model)

In [None]:
y_pred =  (tree_educatedguess.predict_proba(X_test)[:, 1] > ______).astype(int)
print(classification_report(y_test, y_pred))

* This model has higher accuracy
* Lower Recall (only 30% of churning customers are identified)
* Higher Precision (59%)

The threshold of 0.7 is quite strict: only customers with a churn probability > 70% are flagged as churning (a quite conservative model)
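The trade-off between precision and recall can be traced by hand on a few made-up probabilities. The arrays and the helper precision_recall_at below are hypothetical and only illustrate the mechanics; they are not results from the churn model.

```python
import numpy as np

# Hypothetical churn probabilities and true labels (made up for illustration)
proba_churn = np.array([0.05, 0.15, 0.30, 0.45, 0.55, 0.65, 0.80, 0.95])
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def precision_recall_at(threshold):
    """Flag everything above the threshold as churn and score the result."""
    y_hat = (proba_churn > threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    return precision, recall

for t in [0.1, 0.5, 0.7]:
    print(t, precision_recall_at(t))
```

In this toy data, raising the threshold flags fewer customers: recall falls from 1.0 to 0.5 while precision rises from 4/7 to 1.0.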

## ROC and AUC - Model performance independent of threshold

In [None]:
def plot_ROC(model, X_test, y_test):
  import matplotlib.pyplot as plt
  from sklearn.metrics import RocCurveDisplay
  tree_ROC = RocCurveDisplay.from_estimator(model, X_test, y_test, color='green', linewidth=3)
  plt.title('ROC Curve')
  plt.xlabel('False Alarm (1 - Specificity)')
  plt.ylabel('Recall (Sensitivity)')
  plt.show()
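The numbers behind such a plot can be inspected directly with roc_curve and roc_auc_score from sklearn.metrics. The labels and scores below are a tiny made-up example, not output of the churn model:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# One (false alarm, recall) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)
auc_demo = roc_auc_score(y_true_demo, scores_demo)

print(fpr)
print(tpr)
print(auc_demo)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of churners above non-churners; this toy example lands at 0.75.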

# Strategies for Improving Model Performance

## Strategy 1: Finding the best model - Hyperparameter Tuning & Cross validation

### 1) Import model functions

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import ______

### 2) Instantiate Model with Cross-Validation Setup

In [None]:
parameters = {'max_depth':range(1,30), 
              'min_samples_leaf':[1, 10, 20, 30, 50, 100]}
tree_CV = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=1), parameters, cv=5)

*  **cv:** number of cross-validation folds

How many models are calculated?

In [None]:
______
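One possible way to count them, sketched in plain Python: every combination of hyperparameter values is fitted once per fold. The variable names (cv_folds, n_candidates, n_fits) are chosen for this example.

```python
from itertools import product

parameters = {'max_depth': range(1, 30),
              'min_samples_leaf': [1, 10, 20, 30, 50, 100]}
cv_folds = 5

n_candidates = len(list(product(*parameters.values())))  # 29 depths x 6 leaf sizes
n_fits = n_candidates * cv_folds                          # one fit per candidate and fold
print(n_candidates, n_fits)
```

By default (refit=True), GridSearchCV additionally refits the best combination once on the full training data.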

### 3) Fit Model to Data using Cross validation 

In [None]:
tree_CV.______(X_train, y_train)

In [None]:
tree_CV.best_params_

### 4) Make Prediction

In [None]:
y_pred = tree_CV.______(______)

### 5) Evaluate Model

In [None]:
print(______(______, ______))

In [None]:
plot_ROC(______, X_test, y_test)

## Strategy 2: Building Many Models - Ensembling (Random Forests)

### 1) Import Model Functions

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ______
from sklearn.metrics import ______

### 2) Instantiate Model

In [None]:
forest = RandomForestClassifier(n_estimators=1000)

* **n_estimators** = Number of Decision Trees in the forest
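To see what the forest does with its trees, the sketch below fits a small forest on synthetic data (an assumption, with only 25 trees to keep it fast) and checks that the forest's predicted probabilities are the average of the individual trees' probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
demo_forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

n_trees = len(demo_forest.estimators_)   # one fitted decision tree per estimator
print(n_trees)

# Average the class probabilities of all individual trees
tree_average = np.mean([t.predict_proba(X_demo) for t in demo_forest.estimators_], axis=0)
matches_forest = np.allclose(tree_average, demo_forest.predict_proba(X_demo))
print(matches_forest)
```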

### 3) Create test and training data

In [None]:
X_train, X_test, y_train, y_test = ______(______, ______, test_size=0.3, random_state=1)

### 4) Fit Model to Training Data

In [None]:
forest.______(______,______)

### 5) Make Prediction on Testing Data

In [None]:
y_pred = forest.______(______)

### 6) Evaluate Performance 

In [None]:
print(classification_report(y_test, y_pred))

Clear improvement over the cross-validated tree in every performance metric -> way better model

In [None]:
plot_ROC(______, X_test, y_test)

### Determine Variable Importances

In [None]:
def plot_variable_importance(model, X_train):
  import pandas as pd
  import matplotlib.pyplot as plt
  importances = pd.Series(data=model.feature_importances_,
                          index=X_train.columns)
  importances.sort_values().plot(kind='barh', color="#00802F")
  plt.title('Feature Importances')

In [None]:
plot_variable_importance(forest, X_train)

## Strategy 3: Learning from Past Prediction Errors - Boosting

### 1) Import Model Functions

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ______
from sklearn.metrics import ______

### 2) Instantiate Model

In [None]:
boost = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.5)

*  **n_estimators:** number of boosted decision trees
*  **learning_rate:** degree to which predictions are updated after each round (usually a rather small number between 0.01 and 1)
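How the two hyperparameters interact can be explored with staged_predict, which yields the predictions after each boosting round. The following is a sketch on synthetic data (an assumption, with fewer trees than above to keep it fast); the resulting accuracies say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise, only for illustration
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     flip_y=0.1, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo,
                                                        test_size=0.3, random_state=1)

for rate in [0.05, 0.5]:
    demo_boost = GradientBoostingClassifier(n_estimators=100, learning_rate=rate,
                                            random_state=1)
    demo_boost.fit(Xd_train, yd_train)
    # Test accuracy after 1, 2, ..., 100 trees
    staged_acc = [accuracy_score(yd_test, y_hat)
                  for y_hat in demo_boost.staged_predict(Xd_test)]
    print(rate, round(staged_acc[9], 3), round(staged_acc[-1], 3))
```

A smaller learning rate typically needs more trees to reach the same fit, so the two values are usually tuned together.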



### 3) Create Test & Training Data

In [None]:
X_train, ______, y_train, ______ = train_test_split(X, y, test_size=0.3, random_state=1)

### 4) Fit Model to Training Data

In [None]:
______.fit(X_train,y_train)

### 5) Make Predictions on Testing Data 

In [None]:
y_pred = boost.predict(X_test)

### 6) Evaluate Performance

In [None]:
print(classification_report(y_test, y_pred))

Higher precision, but lower recall than the random forest > usage depends on business context (cancer or spam?)

In [None]:
plot_ROC(______,X_test,y_test)