In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

churn = pd.read_csv("https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Logistic regression for binary classification

Last week we explored the utility of applying decision trees as classification models. We applied decision tree classifiers in two cases: (i) to predict a binary response variable indicating whether or not a customer at a subscription-based firm would *churn*, and (ii) to classify handwritten digits in a computer vision problem -- there were 10 different classes in this second example. We saw that decision tree classifiers worked fine in both the binary and multiclass cases without changing any of our workflow. We did note, however, that a drawback to decision tree classifiers was that we did not necessarily obtain a *propensity* estimate (an estimate of the probability with which we believe an observation belongs to the class it was assigned to). In this week's notebook we explore *logistic regression*, a classification model which does provide us with a propensity estimate but which, in its basic form, has the drawback of only being applicable to binary classification. We'll revisit the customer *churn* example here.

At the beginning of last week's notebook on decision tree classifiers we noted that a linear regression model could not be applied to a classification problem. The predictions from any linear regression model aside from the null model ($\mathbb{E}\left[y\right] = \beta_0$) are unbounded, so they can fall below 0 or above 1 and cannot be interpreted as probabilities. Logistic regression is indeed a regression model -- it predicts a numerical output -- but that numerical output is bounded between 0 and 1.

This notebook introduces logistic regression. While logistic regression is not well-suited to the multi-class problem, it is a really common method applied to the two-class case. Logistic regression even provides some potential advantages over a more flexible model, like a tree. The output of a logistic model is a propensity for belonging to the class corresponding to an output of 1 -- the form for a logistic regression model is expressed below.
$$\mathbb{P}\left[Y = 1\right] = \displaystyle{\frac{e^{\beta_0 +\beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k}}{1 + e^{\beta_0 +\beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k}}}$$

Notice that the exponentials are being "raised to [one of our usual] regression models".
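To see why this output stays between 0 and 1 even though the exponent is unbounded, here is a minimal sketch in plain Python. The function name `logistic` and the coefficient values are made up for illustration; they are not fitted to the `churn` data.

```python
import math

def logistic(linear_predictor):
    """Map a real-valued linear predictor to a value between 0 and 1."""
    # Two algebraically equivalent forms, chosen so exp() never overflows
    if linear_predictor >= 0:
        return 1.0 / (1.0 + math.exp(-linear_predictor))
    exp_value = math.exp(linear_predictor)
    return exp_value / (1.0 + exp_value)

# Hypothetical coefficients for a one-predictor model (illustration only)
beta_0, beta_1 = -1.0, 0.5

for x_1 in [-20, -2, 0, 2, 20]:
    linear_predictor = beta_0 + beta_1 * x_1   # unbounded, like a linear regression
    propensity = logistic(linear_predictor)     # squeezed into the interval from 0 to 1
    print(x_1, linear_predictor, round(propensity, 4))
```

The linear predictor ranges from -11 to 9 in this loop, while the propensity never leaves the interval from 0 to 1.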

## A reminder on the `churn` data

We are bringing back last week's first classification dataset, which dealt with customer *churn*. As a reminder, churn is when an existing customer, user, player, subscriber, or any kind of return client stops doing business or ends the relationship with a company. A natural goal is to figure out which customers may be most likely to churn in the future so that we can provide preventative outreach.

We'll start by preparing our data in the same way it was prepared for last week's decision tree classifier.

In [10]:
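# Drop rows with a blank TotalCharges, then convert the column to numeric
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
churn = churn[churn["TotalCharges"] != " "].copy()
churn["TotalCharges"] = pd.to_numeric(churn["TotalCharges"])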

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
unique_id = ["customerID"]
# Everything that is neither numeric nor the unique ID is treated as categorical
cat_cols = [name for name in churn.columns if name not in num_cols and name not in unique_id]

churn = pd.get_dummies(churn, columns = cat_cols, drop_first = True)

churn.drop(["customerID"], axis = 1, inplace = True)

X = churn.drop(["Churn_Yes"], axis = 1)
y = churn["Churn_Yes"]

X_train, X_safe, y_train, y_safe = train_test_split(X, y, test_size = 0.1, random_state = 42)

X_train.head()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,gender_Male,SeniorCitizen_1,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,OnlineBackup_Yes,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
6183,44,54.3,2317.1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,1,0,1,0,1,0
73,62,24.25,1424.6,1,0,1,1,1,0,1,0,1,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0
4799,10,19.8,196.75,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,0,0,1,0,0,0
4991,58,106.45,6145.85,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,1,1,0,1,0,1,0
1405,1,76.0,76.0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


## Logistic Regression

As a recap, a logistic regression model can be used to predict a two-class response variable (for example: *churn* or *no churn*). The outputs of a logistic regression model are bounded between 0 and 1, and they can be interpreted as the propensity (likelihood) for an observation to belong to the class labeled 1. The form for a logistic regression model is expressed below.
$$\mathbb{P}\left[Y = 1\right] = \displaystyle{\frac{e^{\beta_0 +\beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k}}{1 + e^{\beta_0 +\beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k}}}$$

Notice that the exponentials are being "raised to [one of our usual] regression models".

## Building a Logistic Regression Model

Now that our data has been pre-processed and split, we'll fit our first logistic regression model.

In [11]:
#we define the model object
log_reg_clf = LogisticRegression()
#and now we fit the model
log_reg_clf.fit(X_train, y_train)

print("Intercept: ", log_reg_clf.intercept_[0])
print(pd.DataFrame({"Variable" : X_train.columns, "Coefficient" : log_reg_clf.coef_[0]}))

Intercept:  -0.19589531546241928
                                 Variable  Coefficient
0                                  tenure    -0.063332
1                          MonthlyCharges     0.006333
2                            TotalCharges     0.000303
3                             gender_Male    -0.036091
4                         SeniorCitizen_1     0.289154
5                             Partner_Yes     0.054990
6                          Dependents_Yes    -0.193023
7                        PhoneService_Yes    -0.447929
8          MultipleLines_No phone service     0.254309
9                       MultipleLines_Yes     0.197261
10            InternetService_Fiber optic     0.586026
11                     InternetService_No    -0.116405
12     OnlineSecurity_No internet service    -0.116405
13                     OnlineSecurity_Yes    -0.575020
14       OnlineBackup_No internet service    -0.116405
15                       OnlineBackup_Yes    -0.232388
16   DeviceProtection_No interne

From the model output above we can see that the following characteristics **increase** the likelihood of *churn* (because their coefficients are positive):
+ The customer has higher `MonthlyCharges` and `TotalCharges`.
+ The customer is a senior citizen.
+ The customer has a partner.
+ The customer has no phone line (or has multiple phone lines).
+ The customer has fiber-optic internet.
+ The customer streams TV and/or movies.
+ The customer has paperless billing.
+ The customer pays their bill by electronic check.

If we decide that our model is a good one, we can look for profiles fitting these characteristics to prioritize outreach. Check back to last week's notebook -- do those decision trees indicate the same sorts of trends?
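One common way to read the size of a logistic regression coefficient is to exponentiate it: $e^{\beta_j}$ is the multiplicative change in the *odds* of churn for a one-unit increase in $x_j$, holding the other predictors fixed. The sketch below applies this to a few coefficients copied from the printed output; the selection of variables is arbitrary, and because scikit-learn regularizes by default, these values are best read as rough guides.

```python
import math

# Coefficients copied from the model output above (rounded as printed)
coefficients = {
    "tenure": -0.063332,
    "SeniorCitizen_1": 0.289154,
    "InternetService_Fiber optic": 0.586026,
    "OnlineSecurity_Yes": -0.575020,
}

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in that predictor, holding the others fixed
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of churn)")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, so this is the same information as the signs discussed above, expressed on a multiplicative scale.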

We'll continue now with evaluating our model. Similar to last week, we'll use a confusion matrix along with the *accuracy*, *precision*, and *recall* metrics to assess our model's utility. We don't typically do this, but we will use the training data for computing these metrics since this is just our baseline model. We'll use cross-validation shortly in order to tune hyperparameters and compare a few different logistic regression models.

In [12]:
confusion_matrix(y_train, log_reg_clf.predict(X_train))

array([[4184,  468],
       [ 754,  922]])

From the confusion matrix, we see that our accuracy is about 80.69% while our recall is 55.01% and precision is 66.33%. Remember that our goal is not necessarily to build a highly accurate model. In this particular application, we really want a model with high recall so that we can reach out to customers who are at risk of *churn* before we lose their business. This outreach will likely come along with promotional offers, so while we want a high recall metric, we don't want to lose too much precision because that will lead to lost revenue. We have a challenge -- **increase recall without decreasing precision** [too much]!
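The percentages quoted above can be reproduced directly from the four cells of the confusion matrix. This is a small sketch in plain Python; the variable names are my own, and the layout assumed is scikit-learn's convention of actual classes in rows and predicted classes in columns.

```python
# Confusion matrix from the cell above: rows are actual classes, columns are predicted
# [[true negatives,  false positives],
#  [false negatives, true positives]]
matrix = [[4184, 468],
          [754, 922]]

(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
recall = tp / (tp + fn)                      # share of actual churners we caught
precision = tp / (tp + fp)                   # share of predicted churners who really churned

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
```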

## Competing models

We'll build a set of competing models here and try using cross-validation to help us identify a model which will have high recall without leading to poor precision.

### Hyperparameters for logistic regression

Logistic regression has several hyperparameters which can be set. I'll list some of the ones we'll use below, but you can [see an exhaustive list here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). 
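+ The `C` hyperparameter essentially supplies a budget which can be spent on coefficients. In scikit-learn, `C` is the inverse of the regularization strength: a smaller `C` means a tighter budget (stronger regularization), while a larger `C` lets the coefficients grow more freely.
+ The `penalty` hyperparameter determines how we compute our spent budget. The details are beyond the scope of our course but you can look up `l1` and `l2` regularization, or take MAT300 to learn more.
+ Both the `C` and `penalty` hyperparameters can be used to reduce overfitting.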
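As a rough illustration of how the two penalties "price" a set of coefficients, the sketch below computes each penalty on two made-up coefficient vectors. The helper names and the values are hypothetical, and constant factors that a particular solver may apply are ignored.

```python
def l1_penalty(coefficients):
    """Sum of the absolute values of the coefficients."""
    return sum(abs(b) for b in coefficients)

def l2_penalty(coefficients):
    """Sum of the squared coefficients."""
    return sum(b ** 2 for b in coefficients)

# Hypothetical coefficient vectors (illustration only)
sparse_fit = [2.0, 0.0, 0.0, 0.0]   # one large coefficient, the rest zero
spread_fit = [0.5, 0.5, 0.5, 0.5]   # the same total size, spread evenly

for name, betas in [("sparse", sparse_fit), ("spread", spread_fit)]:
    print(name, "l1:", l1_penalty(betas), "l2:", l2_penalty(betas))
```

Both vectors cost the same under `l1`, while `l2` charges the single large coefficient four times as much as the evenly spread ones. This is one way to see why `l1` tends to tolerate a few large coefficients with the rest at zero, whereas `l2` favors many small ones.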

In [13]:
budgets = [0.1, 0.5, 1, 10, 100, 1000]
regularization = ["l1", "l2"]

# Use a separate name for the list so the loop variable does not overwrite it
for budget in budgets:
    for reg in regularization:
        log_reg_clf = LogisticRegression(C = budget, penalty = reg, solver = "liblinear")
        
        cv_scores = cross_val_score(log_reg_clf, X_train, y_train, cv = 10, scoring = "recall")
        
        print("Budget ", budget, " and ", reg, " regularization:")
        print("\t recall rates: ", cv_scores)
        print("\t cv-average-recall: ", cv_scores.mean())

Budget  0.1  and  l1  regularization:
	 recall rates:  [0.52694611 0.51497006 0.58333333 0.53571429 0.56547619 0.50595238
 0.5297619  0.53571429 0.5748503  0.58682635]
	 cv-average-recall:  0.5459545195323638
Budget  0.1  and  l2  regularization:
	 recall rates:  [0.52694611 0.52694611 0.57738095 0.53571429 0.55357143 0.5
 0.5297619  0.54761905 0.59281437 0.58083832]
	 cv-average-recall:  0.547159252922726
Budget  0.5  and  l1  regularization:
	 recall rates:  [0.54491018 0.52694611 0.57142857 0.5297619  0.56547619 0.51190476
 0.52380952 0.54761905 0.5988024  0.59281437]
	 cv-average-recall:  0.5513473053892215
Budget  0.5  and  l2  regularization:
	 recall rates:  [0.52095808 0.53293413 0.58333333 0.53571429 0.58333333 0.48809524
 0.51785714 0.54166667 0.59281437 0.58083832]
	 cv-average-recall:  0.547754491017964
Budget  1  and  l1  regularization:
	 recall rates:  [0.54491018 0.53293413 0.58333333 0.53571429 0.56547619 0.51190476
 0.52380952 0.54761905 0.5988024  0.59281437]
	 cv-av

From the cross-validation output above, we see that `l1` regularization generally does slightly better than `l2` regularization here (the smallest budget, `C = 0.1`, is an exception) and that recall tends to improve as the budget grows, although the differences are small. Let's run another round of cross-validation, but we'll include only `l1` regularization and allow for larger coefficient budgets.

In [16]:
largerBudgets = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]

for budget in largerBudgets:
    log_reg_clf = LogisticRegression(C = budget, penalty = "l1", solver = "liblinear")
    
    cv_scores = cross_val_score(log_reg_clf, X_train, y_train, cv = 10, scoring = "recall")
    print("Budget: ", budget)
    print("\t recall rates: ", cv_scores)
    print("\t cv-average-recall: ", cv_scores.mean())

Budget:  1000
	 recall rates:  [0.54491018 0.53293413 0.58333333 0.53571429 0.56547619 0.51190476
 0.52380952 0.54761905 0.60479042 0.58682635]
	 cv-average-recall:  0.5537318220701455
Budget:  10000
	 recall rates:  [0.54491018 0.53293413 0.58333333 0.53571429 0.55952381 0.51190476
 0.52380952 0.54761905 0.60479042 0.58682635]
	 cv-average-recall:  0.5531365839749073
Budget:  100000
	 recall rates:  [0.54491018 0.53293413 0.58333333 0.53571429 0.56547619 0.51190476
 0.5297619  0.54761905 0.60479042 0.58682635]
	 cv-average-recall:  0.5543270601653836
Budget:  1000000
	 recall rates:  [0.54491018 0.53293413 0.58928571 0.53571429 0.56547619 0.51190476
 0.52380952 0.54761905 0.60479042 0.58682635]
	 cv-average-recall:  0.5543270601653835
Budget:  10000000
	 recall rates:  [0.54491018 0.53293413 0.58928571 0.53571429 0.56547619 0.51190476
 0.52380952 0.54761905 0.60479042 0.58682635]
	 cv-average-recall:  0.5543270601653835
Budget:  100000000
	 recall rates:  [0.54491018 0.53293413 0.5892

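Okay, according to our recall rates, the average recall levels off at about 0.554 once the budget reaches `C = 100000`, and the differences across all of these budgets are tiny (well under 0.01). Since a larger budget buys us essentially nothing, let's re-fit the model with `C = 1000` and analyse our results.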

In [17]:
log_reg_clf = LogisticRegression(C = 1000, penalty = "l1", solver = "liblinear")

log_reg_clf.fit(X_train, y_train)

confusion_matrix(y_safe, log_reg_clf.predict(X_safe))

array([[457,  54],
       [ 95,  98]])

In [19]:
98/(54 + 98)

0.6447368421052632

Okay, on the unseen *safe* data, our final model had about 78.84% accuracy, a recall of about 50.78%, and a precision of about 64.47%. 

## Closing for Logistic Regression

So there it is: you've constructed and evaluated a series of logistic regression models, tuning hyperparameters along the way. There's much more to learn about logistic regression and I really encourage you to read about this classifier in Chapter 10 of our textbook.

Before we leave this topic and move to uplift and ensemble methods, there are two major takeaways you should consider here.

+ **Weakness:** A major weakness of logistic regression models is that they are not well-suited to classification scenarios with more than two classes. There are methods for applying them to these cases though -- the easiest method is to decide that we only really care about whether an observation belongs to a single class of interest or not -- from there we have a *one-versus-all* classification problem, which is binary. If we do indeed care about differentiating all classes from one another we could engage in building a series of *one-versus-one* classifiers and then using them in a pseudo-ensemble fashion. The challenge here, however, is that even with just four classes we end up needing to work with $4\left(3\right)/2 = 6$ individual models.
+ **Strength:** Rather than outputting a *class prediction*, a logistic regression model outputs a likelihood that an observation belongs to the class represented by 1. This means that it is easy to adjust the threshold for belonging to that class -- is greater than 50% likelihood the right choice? Or would we be better off using a really strict threshold like 90%, or a really loose threshold like 20%?$^\dagger$ Furthermore, since we are predicting a likelihood of belonging to a particular class, it is really easy to identify the predictions we are "most certain" about.

<center><span style = "font-size:8pt"><i>$^\dagger$How do you choose the appropriate threshold??? -- the threshold is a hyperparameter -- keep calm and cross-validate!</i></span></center>
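The model counts mentioned in the **Weakness** item can be checked with a short sketch. The class labels here are placeholders; only the number of classes matters.

```python
from itertools import combinations

# Hypothetical class labels (illustration only)
classes = ["A", "B", "C", "D"]

# One-versus-one: one binary classifier per unordered pair of classes
pairs = list(combinations(classes, 2))
print(len(pairs), pairs)

# One-versus-all: one binary classifier per class
print(len(classes))

def one_vs_one_count(k):
    """Number of pairwise classifiers needed for k classes: k(k - 1)/2."""
    return k * (k - 1) // 2

# For the 10 digit classes from last week's second example
print(one_vs_one_count(10))
```

The pairwise count grows quadratically with the number of classes, which is why the *one-versus-one* approach becomes cumbersome quickly.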

In [20]:
log_reg_clf.predict_proba(X_safe)

array([[0.99489071, 0.00510929],
       [0.87939001, 0.12060999],
       [0.31844796, 0.68155204],
       ...,
       [0.93073112, 0.06926888],
       [0.97802079, 0.02197921],
       [0.82923916, 0.17076084]])
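The second column of this array is the propensity for class 1 (*churn*), so changing the classification threshold is just a comparison against that column. The sketch below uses only the six rows displayed above (the rest of the array is elided in the output), and the helper `classify` is a hypothetical name of my own.

```python
# Class-1 propensities copied from the rows displayed above
# (only a subset: the full array is elided in the output)
propensities = [0.00510929, 0.12060999, 0.68155204,
                0.06926888, 0.02197921, 0.17076084]

def classify(probabilities, threshold=0.5):
    """Label an observation 1 when its propensity meets or exceeds the threshold."""
    return [int(p >= threshold) for p in probabilities]

for threshold in [0.1, 0.5, 0.9]:
    print(threshold, classify(propensities, threshold))
```

A looser threshold flags more customers as likely churners (higher recall, usually at the cost of precision), while a stricter one flags fewer. With the fitted model, the same idea could be written as `log_reg_clf.predict_proba(X_safe)[:, 1] >= threshold`.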