# Supervised Learning — How to do a logistic regression in Python

## When can logistic regression be used?

- When the response variable (the one being predicted) is binary or categorical.
- When the observations are independent.
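
The binary-response requirement follows from what the model outputs: a probability between 0 and 1, obtained by passing a linear combination of the explanatory variables through the logistic function. A minimal sketch of that mapping follows; the intercept and slope values are hypothetical and chosen only for illustration.

```python
import math


def logistic(z: float) -> float:
    """Map a linear predictor z (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, not estimated from any data.
intercept = -1.5
slope = 0.8

for x in (0, 1):
    p = logistic(intercept + slope * x)
    print(f"x={x}: P(response = 1) = {p:.3f}")
```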

## Which packages can be used for performing logistic regression?

- scikit-learn (used here)
- statsmodels
- PyCaret, TensorFlow, Keras, PyTorch

## Case study: predicting slasher movie deaths

In the 1996 movie [Scream](https://www.imdb.com/title/tt0117571/), characters discuss [the rules](https://scream.fandom.com/wiki/The_Rules) of surviving a slasher movie. Rule number one is "You can never have sex." (Otherwise the villain will kill you.)

Naturally a data scientist decided to analyze this claim, so here we'll look at data from [Welsh (2010)](https://link.springer.com/article/10.1007/s11199-010-9762-x), which looks at survival probabilities from a random sample of 50 slasher movies. ([Data](https://users.stat.ufl.edu/~winner/data/slash_survsex.dat) and its [description](https://users.stat.ufl.edu/~winner/data/slash_survsex.txt).)

We'll need **pandas** for importing and manipulating the data, **scikit-learn** for modeling, and **plotly.express** for plotting.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import plotly.express as px

The dataset is imported from a CSV file.

In [None]:
slash = pd.read_csv("slash.csv")
slash

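| | gender | sex_act | survived |
|---|---|---|---|
| 0 | male | present | 1 |
| 1 | male | present | 1 |
| 2 | male | present | 1 |
| 3 | male | present | 1 |
| 4 | male | present | 1 |
| ... | ... | ... | ... |
| 480 | female | absent | 0 |
| 481 | female | absent | 0 |
| 482 | female | absent | 0 |
| 483 | female | absent | 0 |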


## Data dictionary

Each row corresponds to one character in a slasher movie.

- **gender**: gender of the character; either **male** or **female**.
- **sex_act**: Was the character involved in a sex act during the movie?; either **present** or **absent**.
- **survived**: Did the character survive through to the end of the movie?; either **1** if they survived or **0** if they died.

## Converting categorical columns to dummy variables

Scikit-learn can't deal with categorical columns directly. They must be converted to dummy columns of ones and zeroes. The pandas function [`get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) can be used for this.

In [None]:
slash_dum = pd.get_dummies(slash)
slash_dum

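| | survived | gender_female | gender_male | sex_act_absent | sex_act_present |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 1 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 480 | 0 | 1 | 0 | 1 | 0 |
| 481 | 0 | 1 | 0 | 1 | 0 |
| 482 | 0 | 1 | 0 | 1 | 0 |
| 483 | 0 | 1 | 0 | 1 | 0 |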
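
Each two-level category above produces two columns that carry the same information (`gender_female` is always `1 - gender_male`). One common way to avoid this redundancy, which matters most for an unregularized model, is the `drop_first` argument of `get_dummies()`. The sketch below uses a small made-up data frame with the same column names rather than the real dataset.

```python
import pandas as pd

# Toy stand-in for the slasher data (hypothetical rows, same column names).
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "sex_act": ["present", "absent", "present", "absent"],
    "survived": [1, 0, 0, 1],
})

# drop_first=True keeps one indicator per two-level category,
# so gender_female is implied by gender_male == 0.
# dtype=int gives 0/1 columns rather than booleans.
toy_dum = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_dum)
```

The result has a single `gender_male` column and a single `sex_act_present` column alongside `survived`.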


## Splitting into response and explanatory columns

In [None]:
response = slash_dum["survived"]
explanatory = slash_dum.drop(columns="survived")

## Splitting into training and testing sets

The explanatory and response datasets need to be split into training and testing sets. 

Here we'll use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with the default arguments.

In [None]:
explanatory_train, explanatory_test, response_train, response_test = train_test_split(explanatory, response)
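
With the default arguments the split is random, so the results change from run to run. As a sketch of two optional arguments that may help here, `random_state` makes the split reproducible and `stratify` keeps the proportion of each response value similar in both sets. The data below is made up for illustration and the seed value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with an imbalanced response (hypothetical values).
X = pd.DataFrame({"x": range(20)})
y = pd.Series([1] * 4 + [0] * 16, name="y")

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # same proportion as the default
    random_state=42,   # arbitrary seed; makes the split reproducible
    stratify=y,        # keeps the 0/1 ratio similar in both sets
)
print(y_train.value_counts())
print(y_test.value_counts())
```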

## Fitting the model to the training set

The data is now ready to model. The first modeling step is to create a `LogisticRegression` object.

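Note that scikit-learn applies regularization (a technique that shrinks coefficients toward zero to reduce overfitting) by default. This is a controversial default, so to fit a standard logistic regression you need to turn it off by setting `penalty="none"`. (In scikit-learn 1.2 and later, this string is deprecated in favor of `penalty=None`.)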

In [None]:
mdl = LogisticRegression(penalty="none")

Use the [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) method to fit the model to the training set.

In [None]:
mdl.fit(explanatory_train, response_train)

LogisticRegression(penalty='none')

In [None]:
responses = pd.DataFrame({
    "actual": response_test,
    "predicted": mdl.predict(explanatory_test)
})
responses

| | actual | predicted |
|---|---|---|
| 445 | 0 | 0 |
| 372 | 0 | 0 |
| 387 | 0 | 0 |
| 296 | 1 | 0 |
| 298 | 1 | 0 |
| ... | ... | ... |
| 323 | 1 | 0 |
| 109 | 1 | 0 |
| 108 | 1 | 0 |
| 482 | 0 | 0 |


In [None]:
responses.value_counts()

actual  predicted
0       0            103
1       0             19
dtype: int64

In [None]:
slash.value_counts()

gender  sex_act  survived
female  absent   0           161
male    absent   0           100
        present  0            72
female  present  0            67
male    absent   1            39
female  absent   1            28
male    present  1            11
female  present  1             7
dtype: int64

In [None]:
confusion_matrix(responses["actual"], responses["predicted"]) 

array([[103,   0],
       [ 19,   0]])

In [None]:
explanatory_train.value_counts()

gender_female  gender_male  sex_act_absent  sex_act_present
1              0            1               0                  136
0              1            1               0                  108
                            0               1                   67
1              0            0               1                   52
dtype: int64

In [None]:
response_train.value_counts()

0    297
1     66
Name: survived, dtype: int64
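
The training response is heavily imbalanced (297 deaths versus 66 survivals), which is one plausible reason the model never predicts survival. As a sketch of one possible remedy that the original analysis does not use, `class_weight="balanced"` reweights rows inversely to class frequency. The data frame below is rebuilt from the counts printed by `slash.value_counts()` above; the model keeps scikit-learn's default regularization for simplicity, and it is fitted on all rows rather than on a training split.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# (gender, sex_act, survived, count), copied from slash.value_counts().
counts = [
    ("female", "absent", 0, 161), ("male", "absent", 0, 100),
    ("male", "present", 0, 72), ("female", "present", 0, 67),
    ("male", "absent", 1, 39), ("female", "absent", 1, 28),
    ("male", "present", 1, 11), ("female", "present", 1, 7),
]
rows = [(g, s, y) for g, s, y, n in counts for _ in range(n)]
full = pd.DataFrame(rows, columns=["gender", "sex_act", "survived"])

# One 0/1 indicator per category: gender_male and sex_act_present.
X = pd.get_dummies(full[["gender", "sex_act"]], drop_first=True, dtype=int)
y = full["survived"]

# Default L2 penalty is left on here; only the class weighting is changed.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Show the prediction for each of the four distinct character types.
groups = X.drop_duplicates().reset_index(drop=True)
print(groups.assign(predicted=weighted.predict(groups)))
```

With the weighting in place, the model no longer assigns every character type to the same class, at the cost of more false "survived" predictions.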

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
oe = OrdinalEncoder()

In [None]:
response = slash["survived"]
explanatory = slash.drop(columns="survived")

In [None]:
explanatory_train, explanatory_test, response_train, response_test = train_test_split(explanatory, response)

In [None]:
explanatory_train2 = oe.fit_transform(explanatory_train)

In [None]:
explanatory_test2 = oe.transform(explanatory_test)
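
To make the encoding step concrete, this sketch applies `OrdinalEncoder` to a tiny made-up data frame with the same column names. Each category is replaced by its position in the sorted list of categories, which the fitted encoder stores in `categories_`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy explanatory data (hypothetical rows, same column names as above).
toy = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "sex_act": ["present", "absent", "absent"],
})

enc = OrdinalEncoder()
encoded = enc.fit_transform(toy)

# Categories are sorted, so female -> 0, male -> 1, absent -> 0, present -> 1.
print(enc.categories_)
print(encoded)
```

Fitting the encoder on the training set only, then reusing it on the test set as in the cells above, ensures both sets use the same mapping.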

In [None]:
mdl.fit(explanatory_train2, response_train)

LogisticRegression(penalty='none')

In [None]:
responses = pd.DataFrame({
    "actual": response_test,
    "predicted": mdl.predict(explanatory_test2)
})
responses.value_counts()

actual  predicted
0       0            101
1       0             21
dtype: int64

In [None]:
mdl.predict_proba(explanatory_test2)

array([[0.87922383, 0.12077617],
       [0.83341757, 0.16658243],
       [0.87922383, 0.12077617],
       [0.83341757, 0.16658243],
       [0.87922383, 0.12077617],
       [0.93113971, 0.06886029],
       [0.93113971, 0.06886029],
       [0.83341757, 0.16658243],
       [0.72924896, 0.27075104],
       [0.87922383, 0.12077617],
       [0.87922383, 0.12077617],
       [0.72924896, 0.27075104],
       [0.87922383, 0.12077617],
       [0.72924896, 0.27075104],
       [0.83341757, 0.16658243],
       [0.83341757, 0.16658243],
       [0.72924896, 0.27075104],
       [0.87922383, 0.12077617],
       [0.83341757, 0.16658243],
       [0.83341757, 0.16658243],
       [0.83341757, 0.16658243],
       [0.83341757, 0.16658243],
       [0.87922383, 0.12077617],
       [0.83341757, 0.16658243],
       [0.87922383, 0.12077617],
       [0.83341757, 0.16658243],
       [0.83341757, 0.16658243],
       [0.93113971, 0.06886029],
       [0.83341757, 0.16658243],
       [0.83341757, 0.16658243],
       ...

In [None]:
mdl.predict(explanatory_test2)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
# Count how often each distinct predicted survival probability occurs.
pd.Series(mdl.predict_proba(explanatory_test2)[:, 1]).value_counts()

0.166582    48
0.270751    30
0.068860    23
0.120776    21
dtype: int64
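
These counts explain the all-zero predictions: for a binary model, `.predict()` is equivalent to asking whether the predicted probability of class 1 exceeds 0.5, and none of the four distinct probabilities does. The sketch below applies that rule by hand to the values printed above, then repeats it with a lower threshold. The value 0.2 is hypothetical and chosen only for illustration, not a recommendation.

```python
import numpy as np

# The four distinct survival probabilities from the output above.
proba_survived = np.array([0.166582, 0.270751, 0.068860, 0.120776])

# Equivalent to .predict() for a binary model: threshold at 0.5.
default_pred = (proba_survived > 0.5).astype(int)

# Hypothetical lower threshold, for illustration only.
threshold = 0.2
custom_pred = (proba_survived > threshold).astype(int)

print(default_pred)  # every character is predicted to die
print(custom_pred)   # only the highest-probability group is predicted to survive
```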