# Hands On - Predicting Customer Churn - Introduction to Classification

# Import & Prepare Data

In [None]:
import pandas as pd
churn = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/churn.csv")

## Check Structure of Data

In [None]:
churn.______



*   No missing data
*   23 Features
*   Some features are objects (text data). We need to transform them because scikit-learn cannot work with text columns directly.



In [None]:
churn.______

## Separate Features and Labels

In [None]:
X = ______ # Features
y = ______ # Target variable

In [None]:
y

In [None]:
X.info()

## Recode pandas "objects"

"Objects" in pandas are text variables. scikit-learn cannot work with them directly, so we have to recode them into numbers.

In [None]:
churn["occupation"].unique() #unique returns the unique values for a variable
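To find every column that still needs recoding, one option is to select all non-numeric columns at once. The following is a minimal sketch on a hypothetical toy DataFrame (the values and the `toy` name are made up for illustration and are not taken from the churn data):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [34, 51, 27],
    "occupation": ["retired", "professional", "self-employed"],
    "customer_suspended": ["No", "Yes", "No"],
})

# Everything that is not numeric still has to be recoded
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# Number of distinct categories per text column
for col in text_columns:
    print(col, toy[col].nunique())
```

The number of distinct categories gives a rough idea of how many new columns the recoding will create.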

### One-hot encoding

In [None]:
X_onehot = pd.get_dummies(X, drop_first = False)
X_onehot.head()

Check out the feature "customer_suspended". Due to the one-hot encoding there is now a "customer_suspended_Yes" and a "customer_suspended_No" version. Both variables contain the same information, because each one is the exact opposite of the other.

### Dummy coding

In [None]:
X = pd.get_dummies(X, drop_first = ______)
X.head()

Adding the argument "drop_first = True" deletes the redundant features.

Some algorithms (e.g. linear models) have problems dealing with the redundant information. Thus, dummy coding is the safer bet and does not lose any information (= preferred choice).
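The difference between the two encodings can be sketched on a small example. The toy DataFrame below is hypothetical and only mirrors the "customer_suspended" feature by name:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "customer_suspended": ["Yes", "No", "No"],
    "age": [34, 51, 27],
})

onehot = pd.get_dummies(toy, drop_first=False)  # one column per category
dummy = pd.get_dummies(toy, drop_first=True)    # first category is dropped

print(list(onehot.columns))
print(list(dummy.columns))

# In the one-hot version, exactly one of the two columns is set in each row,
# so either column can be reconstructed from the other
redundant = bool(
    (onehot["customer_suspended_No"].astype(int)
     + onehot["customer_suspended_Yes"].astype(int) == 1).all()
)
print(redundant)
```

For a feature with k categories, one-hot encoding creates k columns and dummy coding creates k - 1.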

But we now have a larger number of features



In [None]:
X.info()

In [None]:
from sklearn.model_selection import train_test_split

## 2) Create Test & Training Data


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)


*   **X:** Features to be split into testing and training data
*   **y:** Labels to be split into testing and training data
*   **test_size:** proportion of the dataset in the test data; usually ~ 30%
*   **random_state:** seed for making results reproducible. Instances are randomly assigned to the testing and training data, so without a seed every run produces a different split. Providing a seed makes results reproducible: with the same seed, the data is split in the same way on every run and on every computer.
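The effect of the seed can be checked with a small sketch. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 instances with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two splits with the same seed
a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# All four returned pieces (X_train, X_test, y_train, y_test) are identical
same_seed_identical = all(np.array_equal(p, q) for p, q in zip(a, b))
print(same_seed_identical)

# Training and test rows together make up the full dataset
print(a[0].shape, a[1].shape)
```

Changing `random_state` to another value will usually assign different instances to the test data.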




We can use the .shape attribute to check whether the data splitting has been successful

In [None]:
X.shape #4863 instances and 25 variables in the entire dataset

In [None]:
X_train.______ #3404 instances and 25 variables in the training dataset

In [None]:
X_test.______ #1459 instances and 25 variables in the test dataset

## 3) Import, Initiate, and Train Models

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
tree_educatedguess = DecisionTreeClassifier(criterion="entropy",
                              max_depth=30,
                              min_samples_leaf=50,
                              random_state=12)
tree_educatedguess.fit(______, y_train)

In [None]:
forest = RandomForestClassifier(n_estimators=1000)
______.______(X_train,y_train)

In [None]:
boost = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.5)
boost.fit(______,______)

# Evaluating Model Performance the Business Way

We need to install the scikit-plot library:


*   **Google Colab:** The following command needs to be executed every time the notebook is started (installations are deleted on Colab after the notebook is closed).
*   **Jupyter:** It has to be installed only once (as it is installed in the local Python environment)



In [None]:
!pip install scikit-plot

## Lift Curve

We import scikit-plot and use predict_proba to get predicted class probabilities at the customer level (y_pred).

With these probabilities we can directly draw the lift curve

In [None]:
import scikitplot as skplt
y_pred = boost.predict_proba(X_test)
skplt.metrics.plot_lift_curve(y_test, y_pred)

In [None]:
y_pred = tree_educatedguess.predict_proba(X_test)
skplt.metrics.plot_lift_curve(y_test, y_pred)
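What a single point on a lift curve means can be sketched by hand: rank customers by predicted churn probability, take the top fraction, and compare the churn rate in that group with the overall churn rate. The helper `lift_at` and the toy labels and scores below are hypothetical and only illustrate the idea; this is not how scikit-plot is implemented internally.

```python
import numpy as np

def lift_at(y_true, scores, fraction):
    """Lift among the top `fraction` of customers ranked by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]              # highest scores first
    n_top = max(1, int(round(fraction * len(y_true))))
    top_rate = y_true[order][:n_top].mean()       # churn rate in the targeted group
    base_rate = y_true.mean()                     # churn rate among all customers
    return top_rate / base_rate

# Hypothetical labels (1 = churn) and predicted churn probabilities
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.2, 0.8, 0.3, 0.1, 0.4, 0.7, 0.05, 0.15, 0.25])

lift_top20 = lift_at(y_true, scores, 0.2)
print(round(lift_top20, 2))
```

A lift of 1 means the model is no better than contacting customers at random; targeting everyone always gives a lift of exactly 1.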

# Expected Value of Models

Define the value of the business outcomes:

In [None]:
value_true_positive = 186
value_false_positive = -30

Define Function for Scoring Model:

In [None]:
def calculate_expected_value_model(matrix, value_true_positive, value_false_positive):
  """ works only for 2x2 confusion matrices as returned by sklearn's confusion_matrix:
      rows = actual class, columns = predicted class, positive class at index 1 """

  # calculate prior probability of positive class
  p_prior_pos = matrix[1,:].sum() / matrix.sum()

  # calculate conditional probabilities of each prediction given the actual class
  p_neg_instances = matrix[0,:]/matrix[0,:].sum()
  p_pos_instances = matrix[1,:]/matrix[1,:].sum()

  # calculate expected values
  pos = p_prior_pos * (value_true_positive * p_pos_instances[1] + 0 * p_pos_instances[0])
  neg = (1 - p_prior_pos) * (value_false_positive * p_neg_instances[1] + 0 * p_neg_instances[0])
  return round(pos + neg, 2)
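A worked example may help to see what the function computes. The sketch below repeats the same logic in a compact helper (`expected_value` is a hypothetical name) and applies it to a made-up confusion matrix. Because true negatives and false negatives are valued at 0, the result equals the total value of all true and false positives divided by the number of customers in the matrix.

```python
import numpy as np

def expected_value(matrix, value_tp, value_fp):
    # Same logic as above: rows = actual class, columns = predicted class
    p_prior_pos = matrix[1, :].sum() / matrix.sum()
    p_neg = matrix[0, :] / matrix[0, :].sum()
    p_pos = matrix[1, :] / matrix[1, :].sum()
    pos = p_prior_pos * value_tp * p_pos[1]
    neg = (1 - p_prior_pos) * value_fp * p_neg[1]
    return round(pos + neg, 2)

# Hypothetical confusion matrix: 80 TN, 20 FP, 10 FN, 40 TP
toy_matrix = np.array([[80, 20],
                       [10, 40]])

via_probabilities = expected_value(toy_matrix, 186, -30)

# Equivalent calculation directly from the counts
tn, fp, fn, tp = toy_matrix.ravel()
via_counts = round((tp * 186 + fp * -30) / toy_matrix.sum(), 2)

print(via_probabilities, via_counts)
```

Both routes give the same number, which is an average over all 150 customers in the toy matrix.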

Get the expected value per customer (averaged over all test customers, contacted or not) for the random forest:

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_test,y_pred)
calculate_expected_value_model(matrix, value_true_positive, value_false_positive)

Get the expected value per customer (averaged over all test customers, contacted or not) for the decision tree:

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = tree_educatedguess.predict(X_test)
matrix = confusion_matrix(y_test,y_pred)
calculate_expected_value_model(matrix, value_true_positive, value_false_positive)