# Supervised Learning — How to do a logistic regression in Python

## When can logistic regression be used?

- When the response variable (the one being predicted) is binary or categorical.
- When the observations are independent.

## Which packages can be used for performing logistic regression?

- scikit-learn (used here)
- statsmodels
- PyCaret, TensorFlow, Keras, PyTorch

## Case study: predicting slasher movie deaths

In the 1996 movie [Scream](https://www.imdb.com/title/tt0117571/), characters discuss [the rules](https://scream.fandom.com/wiki/The_Rules) of surviving a slasher movie. Rule number one is "You can never have sex." (Otherwise the villain will kill you.)

Naturally a data scientist decided to analyze this claim, so here we'll look at data from [Welsh (2010)](https://link.springer.com/article/10.1007/s11199-010-9762-x), which looks at survival probabilities from a random sample of 50 slasher movies. ([Data](https://users.stat.ufl.edu/~winner/data/slash_survsex.dat) and its [description](https://users.stat.ufl.edu/~winner/data/slash_survsex.txt).)

We'll need **pandas** for importing and manipulating the data, and **scikit-learn** for splitting the data, modeling, and assessing performance.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

The dataset is imported from a CSV file.

In [None]:
organics = pd.read_csv("organics.csv")
organics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17122 entries, 0 to 17121
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Gender              17122 non-null  object 
 1   Geographic Region   17122 non-null  object 
 2   Loyalty Status      17122 non-null  object 
 3   Affluence           17122 non-null  float64
 4   Age                 17122 non-null  float64
 5   Purchased Organics  17122 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 802.7+ KB


## Data dictionary

Each row corresponds to one customer.

- **Gender**: gender of the customer; **Male**, **Female**, or **Unknown**.
- **Geographic Region**: where in the UK the customer was based; **North**, **Midlands**, **South East**, **South West**, or **Scottish**.
- **Loyalty Status**: which type of loyalty card the customer had; **Tin**, **Silver**, **Gold**, or **Platinum**.
- **Affluence**: a numeric measure of the customer's affluence.
- **Age**: age of the customer in years.
- **Purchased Organics**: whether the customer purchased an organic product; **0** (no) or **1** (yes).

## Converting categorical columns to dummy variables

Scikit-learn can't deal with categorical columns directly. They must be converted to dummy columns of ones and zeroes. The pandas function [`get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) can be used for this.

In [None]:
organics_dum = pd.get_dummies(organics)
organics_dum

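|  | Affluence | Age | Purchased Organics | Gender_Female | Gender_Male | Gender_Unknown | Geographic Region_Midlands | Geographic Region_North | Geographic Region_Scottish | Geographic Region_South East | Geographic Region_South West | Loyalty Status_Gold | Loyalty Status_Platinum | Loyalty Status_Silver | Loyalty Status_Tin |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 0 | 10.0 | 76.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 4.0 | 49.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 5.0 | 70.0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 10.0 | 65.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 11.0 | 68.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17117 | 13.0 | 49.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 17118 | 13.0 | 65.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 17119 | 15.0 | 73.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 17120 | 9.0 | 70.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |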
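To see what `get_dummies()` does on a small scale, here is a minimal sketch using a made-up three-row data frame (the `toy` data is hypothetical, not part of the organics dataset). Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed to get ones and zeroes. The `drop_first=True` variant is an optional choice that the notebook does not make: it drops one level per categorical column, which avoids perfectly collinear dummy columns when fitting without regularization.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Age": [34.0, 51.0, 28.0],
    "Loyalty Status": ["Tin", "Gold", "Tin"],
})

# Numeric columns pass through unchanged; each category level becomes a column.
toy_dum = pd.get_dummies(toy, dtype=int)
print(toy_dum)

# Optional: drop the first level of each categorical column.
toy_dum_dropped = pd.get_dummies(toy, dtype=int, drop_first=True)
print(list(toy_dum_dropped.columns))
```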


## Splitting into response and explanatory columns

The response column is `"Purchased Organics"`. The explanatory (input) columns are all the other columns.

In [None]:
response = organics_dum["Purchased Organics"]
explanatory = organics_dum.drop(columns="Purchased Organics")

## Splitting into training and testing sets

The explanatory and response datasets need to be split into training and testing sets. 

Here we'll use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with the default arguments, which shuffle the rows and hold out 25% of them for testing.

In [None]:
explanatory_train, explanatory_test, response_train, response_test = train_test_split(explanatory, response)
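Because the split is random, each run produces a different training set and therefore slightly different results. The sketch below, on made-up data, shows two optional arguments that the notebook does not use: `random_state` makes the split reproducible, and `stratify` keeps the proportion of each response class the same in both sets. The names `X` and `y` and the value `42` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% of them in the positive class.
X = pd.DataFrame({"x": np.arange(100)})
y = pd.Series([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # same as the default
    random_state=42,   # fixes the shuffle so the split is repeatable
    stratify=y,        # preserves the class balance in both sets
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```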

## Fitting the model to the training set

The data is now ready to model. The first modeling step is to create a `LogisticRegression` object.

Note that scikit-learn applies regularization (a technique that shrinks coefficients toward zero to reduce overfitting) by default. This is a controversial default, so to use standard logistic regression, you need to turn the penalty off. In scikit-learn 1.2 and later this is done with `penalty=None`; earlier versions used the string `penalty="none"`, which was deprecated in 1.2 and removed in 1.4.

In [None]:
mdl = LogisticRegression(penalty=None)

Use the [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) method to fit the model to the training set.

In [None]:
mdl.fit(explanatory_train, response_train)

LogisticRegression(penalty=None)
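Once a model is fitted, its coefficients are available in the `coef_` attribute and the intercept in `intercept_`. The following is a rough sketch on simulated data (the column names `age` and `affluence` and the simulated relationship are assumptions made for illustration, not results from the organics dataset). It keeps scikit-learn's default settings so that it runs across versions, and exponentiates the coefficients to read them as odds ratios.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated explanatory columns (hypothetical values).
X = pd.DataFrame({
    "age": rng.normal(50, 10, 500),
    "affluence": rng.normal(10, 3, 500),
})

# Simulated response: higher affluence makes a 1 more likely; age has no effect.
logit = 0.5 * (X["affluence"] - 10)
y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

toy_mdl = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per explanatory column, on the log-odds scale.
coefs = pd.Series(toy_mdl.coef_[0], index=X.columns)
# exp(coefficient) is the multiplicative change in odds per one-unit increase.
odds_ratios = np.exp(coefs)

print(coefs)
print(odds_ratios)
```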

## Making predictions on the testing set

You can calculate the predicted response with the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) method.

In [None]:
predicted_response = mdl.predict(explanatory_test)
predicted_response

array([0, 0, 0, ..., 0, 1, 0])
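For a binary response, `.predict()` returns class 1 whenever the predicted probability of class 1 exceeds 0.5. The sketch below, on synthetic data from `make_classification()`, uses `.predict_proba()` to get the probabilities themselves and shows that a lower cutoff (0.3 here, an arbitrary value chosen for illustration) labels at least as many rows as positive. This is one possible way to trade precision for recall, not something the notebook itself does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
toy_mdl = LogisticRegression(max_iter=1000).fit(X, y)

# Two columns: P(class 0) and P(class 1); each row sums to 1.
proba = toy_mdl.predict_proba(X)

default_pred = toy_mdl.predict(X)
# Reproduce .predict() by thresholding P(class 1) at 0.5.
manual_pred = (proba[:, 1] > 0.5).astype(int)
# A lower threshold predicts class 1 more often.
lower_threshold_pred = (proba[:, 1] > 0.3).astype(int)

print(default_pred.sum(), lower_threshold_pred.sum())
```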

## Assessing model performance

There are four possible outcomes, depending on whether the actual response and the predicted response are true or false. The confusion matrix, created with [`confusion_matrix()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), shows the counts of each case.

|                     |**predicted false** |**predicted true** |
|:--------------------|:-----------------|:----------------|
|**actual false** |correct           |false positive   |
|**actual true**  |false negative    |correct          |

In [None]:
conf_mat = confusion_matrix(response_test, predicted_response)
conf_mat

array([[2966,  202],
       [ 632,  481]])
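The raw array follows the same layout as the table above: rows are actual values and columns are predicted values, with 0 before 1. As a small sketch, the counts from the output above can be copied into an array, unpacked into named variables with `.ravel()`, and wrapped in a labelled data frame for readability (the label strings are arbitrary choices).

```python
import numpy as np
import pandas as pd

# Counts copied from the confusion matrix output above.
conf_mat = np.array([[2966, 202],
                     [632, 481]])

# Row-major order: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = conf_mat.ravel()

labelled = pd.DataFrame(
    conf_mat,
    index=["actual 0", "actual 1"],
    columns=["predicted 0", "predicted 1"],
)
print(labelled)
```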

A classification report prints a lot of metrics about the performance of the model. There are five numbers we typically care about.

```
                   precision           recall                           f1-score    support

           0  TN / (TN + FN)   TN / (TN + FP)                                   .         .
           1  TP / (TP + FP)   TP / (TP + FN)                                   .         .

    accuracy                                      (TN + TP) / (TN + TP + FN + FP)         .
   macro avg               .                .                                   .         .
weighted avg               .                .                                   .         .
```

- **Accuracy**: What fraction of the values were correctly predicted?
- **Precision 0**: What fraction of the values that were predicted to be negative actually were negative?
- **Precision 1**: What fraction of the values that were predicted to be positive actually were positive?
- **Recall 0** a.k.a. **specificity**: What fraction of the values that were actually negative were predicted to be negative?
- **Recall 1** a.k.a. **sensitivity**: What fraction of the values that were actually positive were predicted to be positive?

In [None]:
print(classification_report(response_test, predicted_response))

              precision    recall  f1-score   support

           0       0.82      0.94      0.88      3168
           1       0.70      0.43      0.54      1113

    accuracy                           0.81      4281
   macro avg       0.76      0.68      0.71      4281
weighted avg       0.79      0.81      0.79      4281
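As a check on the formulas, the headline numbers in this report can be reproduced by hand from the four confusion matrix counts shown earlier. The variable names below are chosen for illustration.

```python
# Counts from the confusion matrix: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = 2966, 202, 632, 481

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)
recall_0 = tn / (tn + fp)   # specificity
recall_1 = tp / (tp + fn)   # sensitivity

# F1 is the harmonic mean of precision and recall for a class.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

for name, value in [
    ("accuracy", accuracy),
    ("precision 0", precision_0),
    ("precision 1", precision_1),
    ("recall 0", recall_0),
    ("recall 1", recall_1),
    ("f1-score 1", f1_1),
]:
    print(f"{name}: {value:.2f}")
```

The rounded values match the report: accuracy 0.81, precision 0.82 and 0.70, recall 0.94 and 0.43, and an F1 score of 0.54 for class 1.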

