# Logistic Regression

Our problem is a binary classification problem: we want to predict whether a customer will leave their current telecom company (churn) or stay. Logistic Regression is well suited to this kind of problem because it outputs a probability for each of the two classes.
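
As a reminder of what the model computes, the sketch below shows the logistic (sigmoid) function that turns a linear score into a probability. The weights, bias, and feature values are made up for illustration and are not taken from the data set.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """Probability of the positive class for one example."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights and scaled feature values, for illustration only
weights = [1.5, -2.0]
bias = -0.5
features = [0.8, 0.3]

p_true = predict_probability(weights, bias, features)
print(round(p_true, 4))
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural default cut-off between the two classes.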

### Import Packages

In [8]:
import pandas as pd
import os 

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score


## Step 1: Build DataFrame and Define ML Problem

#### Load a Data Set and Save it as a Pandas DataFrame

We will work with a data set called "cell2celltrain." This data set is already pre-processed, with the proper formatting, outliers and missing values taken care of, and all numerical columns scaled to the [0, 1] interval.

In [10]:
filename = os.path.join(os.getcwd(), "..", "..", "data", "cell2celltrain.csv")
df = pd.read_csv(filename, header=0)

#### Define the Label

This is a binary classification problem in which we will predict customer churn. The label is the `Churn` column. The label will either have the value `True` or `False`.

#### Identify Features

Scikit-learn's Logistic Regression model requires numeric inputs, so we use only the numeric (`float64`) columns as features.


In [11]:
# Keep only the float64 columns; these are the numeric features
feature_list_df = df.select_dtypes(include='float64')
feature_list = list(feature_list_df.columns)

## Step 2: Create Labeled Examples from the Data Set 

Let's obtain columns from our data set to create labeled examples. 

In [12]:
y = df['Churn']
X = feature_list_df

print("Number of examples: " + str(X.shape[0]))
print("\nNumber of Features: " + str(X.shape[1]))
print(str(list(X.columns)))


Number of examples: 51047

Number of Features: 35
['MonthlyRevenue', 'MonthlyMinutes', 'TotalRecurringCharge', 'DirectorAssistedCalls', 'OverageMinutes', 'RoamingCalls', 'PercChangeMinutes', 'PercChangeRevenues', 'DroppedCalls', 'BlockedCalls', 'UnansweredCalls', 'CustomerCareCalls', 'ThreewayCalls', 'ReceivedCalls', 'OutboundCalls', 'InboundCalls', 'PeakCallsInOut', 'OffPeakCallsInOut', 'DroppedBlockedCalls', 'CallForwardingCalls', 'CallWaitingCalls', 'MonthsInService', 'UniqueSubs', 'ActiveSubs', 'Handsets', 'HandsetModels', 'CurrentEquipmentDays', 'AgeHH1', 'AgeHH2', 'RetentionCalls', 'RetentionOffersAccepted', 'ReferralsMadeBySubscriber', 'IncomeGroup', 'AdjustmentsToCreditRating', 'HandsetPrice']


## Step 3: Create Training and Test Data Sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)

In [14]:
print(X_test.shape)
print(X_train.shape)
print(y_test.shape)
print(y_train.shape)

(16846, 35)
(34201, 35)
(16846,)
(34201,)
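
The shapes above follow from the split arithmetic. The sketch below reproduces the counts by hand, assuming the test set size is the requested fraction of the rows rounded up and the training set receives the remainder; this assumption is consistent with the printed shapes.

```python
import math

n_examples = 51047   # number of rows reported in Step 2
test_size = 0.33     # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_examples)
n_train = n_examples - n_test

print(n_test, n_train)
```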


## Step 4: Train a Logistic Regression Classification Model and Evaluate the Model


The code cell below trains a Logistic Regression classification model on the training data, tests the resulting model on the test data, and prints (1) the log loss of the predicted probabilities on the test data and (2) the accuracy score of the predicted class labels on the test data.

*Note*: Log loss and accuracy measure different things. Accuracy is the fraction of examples whose class label the classifier predicts correctly. Log loss measures the average error of the predicted probabilities over all evaluated examples, so a confident wrong prediction is penalized more heavily than a hesitant one. When log loss is computed on the training examples it is called the training loss; in the cell below, it is computed on the test data.
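
To make the difference concrete, the sketch below computes both metrics by hand for a few made-up predictions. The labels and probabilities are hypothetical, and the helper names `manual_log_loss` and `manual_accuracy` are ours; in the notebook itself, `log_loss` and `accuracy_score` from scikit-learn do this work.

```python
import math

def manual_log_loss(y_true, p_true):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_true):
        # Probability the model assigned to the label that actually occurred
        p_correct = p if y else 1.0 - p
        total += -math.log(p_correct)
    return total / len(y_true)

def manual_accuracy(y_true, p_true, threshold=0.5):
    """Fraction of examples labeled correctly at the given threshold."""
    predicted = [p > threshold for p in p_true]
    correct = sum(1 for y, yhat in zip(y_true, predicted) if y == yhat)
    return correct / len(y_true)

# Hypothetical ground truth and predicted probabilities of class True
y_true = [False, False, True, True]
confident = [0.05, 0.10, 0.90, 0.95]
hesitant = [0.45, 0.40, 0.60, 0.55]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    acc = manual_accuracy(y_true, probs)
    loss = manual_log_loss(y_true, probs)
    print(name, acc, round(loss, 4))
```

Both sets of predictions produce the same accuracy, but the hesitant ones have a higher log loss, because log loss looks at the probabilities and not only at the resulting labels.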

In [15]:
  
# 1. Create the LogisticRegression model object below and assign to variable 'model'
model = LogisticRegression()

# 2. Fit the model to the training data below
model.fit(X_train, y_train)

# 3. Make predictions on the test data using the predict_proba() method and assign the 
# result to the variable 'probability_predictions' below
probability_predictions = model.predict_proba(X_test)

# print the first 5 probability class predictions
df_print = pd.DataFrame(probability_predictions, columns = ['Class: False', 'Class: True'])
print('Class Prediction Probabilities: \n' + df_print[0:5].to_string(index=False))

# 4. Compute the log loss on 'probability_predictions' and save the result to the variable
# 'l_loss' below
l_loss = log_loss(y_test, probability_predictions)
print('Log loss: ' + str(l_loss))


# 5. Make predictions on the test data using the predict() method and assign the result 
# to the variable 'class_label_predictions' below
class_label_predictions = model.predict(X_test)

# print the first 5 class label predictions 
print('Class labels: ' + str(class_label_predictions[0:5]))

# 6. Compute the accuracy score on 'class_label_predictions' and save the result 
# to the variable 'acc_score' below
acc_score = accuracy_score(y_test, class_label_predictions)
print('Accuracy: ' + str(acc_score))




Class Prediction Probabilities: 
 Class: False  Class: True
     0.745478     0.254522
     0.658678     0.341322
     0.724462     0.275538
     0.848644     0.151356
     0.749374     0.250626
Log loss: 0.5878464111333068
Class labels: [False False False False False]
Accuracy: 0.7097233764691915


## Step 5: Thresholds: Map Probabilities to a Class Label

Examine the output of the code cell above.

Note that the `predict_proba()` method returns two columns. The first column contains the probability that an unlabeled example belongs to class `False`, and the second column contains the probability that it belongs to class `True`. The column order follows `model.classes_`, which is printed below.

The `predict()` method outputs the actual class label (`True` or `False`).

In [16]:
print(model.classes_)

[False  True]


Notice how the probabilities map to labels in the table below. The table contains:
* 3 unlabeled examples
* the resulting class probability values from the logistic regression model's `predict_proba()` method
* the corresponding class label from the same logistic regression model's `predict()` method. 

The probability that unlabeled "Example 1" is of class `False` is 0.745386. The probability that unlabeled "Example 1" is of class `True` is 0.254614. The `predict()` method assigns "Example 1" the class label `False`.


<table align=left>
   <tr>
    <th></th>
    <th>Class: False</th>
    <th>Class: True</th>
    <th>Class Label</th>
    </tr>
    <tr>
    <th>Example 1</th>
    <th>0.745386</th>
    <th>0.254614</th>
    <th>False</th>
    </tr>
     <tr>
    <th>Example 2</th>
    <th>0.745386</th>
    <th>0.254614</th>
    <th>False</th>
    </tr>    
     <tr>
    <th>Example 3</th>
    <th>0.436033</th>
    <th>0.563967</th>
    <th>True</th>
    </tr>    
</table>



How does the scikit-learn `predict()` method assign a class label based on probability values? For binary classification, it effectively uses a 0.5 threshold. If the predicted probability for class `False` is greater than or equal to 0.5, the unlabeled example is given the label `False`. Otherwise, the unlabeled example is given the label `True`.
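
The rule can be sketched in a few lines of plain Python. The helper `label_from_p_false` is a hypothetical name, and the inputs are the class `False` probabilities from the table above; this mirrors the rule as described here rather than scikit-learn's internal implementation.

```python
def label_from_p_false(p_false, threshold=0.5):
    """Return False when the class-False probability reaches the threshold."""
    # p_false >= threshold -> label False; otherwise -> label True
    return p_false < threshold

# Class-False probabilities of Examples 1 to 3 in the table above
p_false_values = [0.745386, 0.745386, 0.436033]

labels = [label_from_p_false(p) for p in p_false_values]
print(labels)  # prints [False, False, True]
```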

Sometimes we may want a different threshold, for example when failing to flag a customer who churns is more costly than flagging one who stays.

The function `computeAccuracy()` takes a threshold value as an argument. It does the following:

1. Loops through the array `probability_predictions` (obtained from the Logistic Regression model above)
    * It extracts the first column's probability 
    * It checks if that probability is greater than or equal to the threshold value.
    * If so, it assigns a class label of `False`. Otherwise it assigns a class label of `True`.
    * It saves the new class label to list `labels`.
2. Computes the accuracy score by comparing the new class labels contained in list `labels` with the ground truth labels contained in `y_test`.
3. Returns the accuracy score.

In [17]:
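def computeAccuracy(threshold_value):

    # Map each class-False probability to a label using the given threshold
    labels = []
    for p in probability_predictions[:, 0]:
        if p >= threshold_value:
            labels.append(False)
        else:
            labels.append(True)

    # Compare the new labels with the ground truth labels
    acc_score = accuracy_score(y_test, labels)
    return acc_score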
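
`computeAccuracy()` reads `probability_predictions` and `y_test` from the notebook's global scope. As one possible alternative, the sketch below passes both in as arguments, which makes the function easier to reuse and test. The name `accuracy_at_threshold` and the toy inputs are illustrative assumptions, not part of the original notebook.

```python
def accuracy_at_threshold(p_false_values, true_labels, threshold):
    """Accuracy when examples with P(False) >= threshold are labeled False."""
    # An example is labeled True only when P(False) falls below the threshold
    predicted = [p < threshold for p in p_false_values]
    matches = sum(1 for y, yhat in zip(true_labels, predicted) if y == yhat)
    return matches / len(true_labels)

# Toy class-False probabilities and made-up ground truth labels
p_false_values = [0.9, 0.7, 0.6, 0.4, 0.2]
true_labels = [False, False, True, True, True]

for t in [0.3, 0.5, 0.65, 0.8]:
    print(t, accuracy_at_threshold(p_false_values, true_labels, t))
```

In the notebook, the same function could be called as `accuracy_at_threshold(probability_predictions[:, 0], y_test, 0.5)`.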


The code cell below calls the `computeAccuracy()` function with a few different threshold values. 

In [18]:
thresholds = [0.43, 0.44, 0.50, 0.55, 0.67, 0.75]
for t in thresholds:
    print("Threshold value {:.2f}: Accuracy {}".format(t, str(computeAccuracy(t))))

Threshold value 0.43: Accuracy 0.7115042146503621
Threshold value 0.44: Accuracy 0.7106731568324824
Threshold value 0.50: Accuracy 0.7097233764691915
Threshold value 0.55: Accuracy 0.708298705924255
Threshold value 0.67: Accuracy 0.6491155170366852
Threshold value 0.75: Accuracy 0.4929360085480233
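
One way to read these results: the threshold is applied to the class `False` probability, so raising it labels more examples `True`. The sketch below counts how many of the five test examples printed in Step 4 would be labeled `True` at each threshold. It uses only those five rows, so it illustrates the trend rather than reproducing the accuracies above.

```python
# Class-False probabilities of the first five test examples (from Step 4)
p_false_values = [0.745478, 0.658678, 0.724462, 0.848644, 0.749374]
thresholds = [0.43, 0.44, 0.50, 0.55, 0.67, 0.75]

true_counts = {}
for t in thresholds:
    # An example is labeled True when its class-False probability is below t
    true_counts[t] = sum(1 for p in p_false_values if p < t)

for t, count in true_counts.items():
    print("Threshold {:.2f}: {} of {} labeled True".format(t, count, len(p_false_values)))
```

For these five rows, no example is labeled `True` up to a threshold of 0.55, one is at 0.67, and four are at 0.75, which matches the direction of the accuracy drop at the higher thresholds.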
