# Supervised Learning — How to do a logistic regression in Python

## When can logistic regression be used?

- When the response variable (the one being predicted) is binary or categorical.
- When the observations are independent.

## Which packages can be used for performing logistic regression?

- scikit-learn (used here)
- statsmodels
- PyCaret, TensorFlow, Keras, PyTorch

## Case study: predicting organic product purchases

A supermarket gave its loyalty program members coupons as an incentive to buy organic products, then recorded whether or not each member actually bought any.

This workspace uses a subset of [data sourced from Kaggle](https://www.kaggle.com/datasets/papercool/organics-purchase-indicator).

We'll need **pandas** for importing the data and doing some manipulation, then modeling with **scikit-learn**.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

The dataset is imported from a CSV file named `"organics.csv"`.

In [None]:
organics = pd.read_csv("organics.csv")
organics

## Data dictionary

Each row corresponds to one loyalty program member (customer).

- **Gender**: gender of the customer; one of **M** (male), **F** (female), or **U** (unknown).
- **Geographic Region**: where in the UK was the customer based? **North**, **Midlands**, **South East**, **South West**, or **Scottish**.
- **Loyalty Status**: what type of loyalty card did the customer have? **Tin**, **Silver**, **Gold**, or **Platinum**.
- **Affluence**: how well off does the supermarket estimate the customer is?
- **Age**: how old was the customer in years?
- **Purchased Organics**: did they purchase an organic product? **0** (no), or **1** (yes).

## Converting categorical columns to dummy variables

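Scikit-learn's `LogisticRegression` needs numeric inputs, so it can't deal with categorical (text) columns directly. They must be converted to dummy columns of ones and zeroes. The pandas function [`get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) can be used for this. From pandas 2.0 onward the dummy columns are returned as `True`/`False` values by default, which scikit-learn treats the same way as ones and zeroes.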

In [None]:
organics_dum = pd.get_dummies(organics)
organics_dum
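To make the conversion concrete, here is a minimal sketch on a made-up three-row frame (the values are hypothetical; only the column names echo the data dictionary). It also shows the `drop_first=True` option, which is not used in this workspace. Keeping every level means the indicator columns for one variable always sum to one, which duplicates the intercept, so some analysts prefer to drop one level when fitting an unpenalized model.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["M", "F", "U"],
    "Age": [34, 51, 27],
})

# Default behaviour: one indicator column per category level.
# Numeric columns such as Age are passed through unchanged.
toy_dum = pd.get_dummies(toy)
print(list(toy_dum.columns))

# drop_first=True drops the first level of each categorical column,
# so the remaining indicators are no longer redundant with the intercept.
toy_dum_dropped = pd.get_dummies(toy, drop_first=True)
print(list(toy_dum_dropped.columns))
```

The dropped level becomes the baseline that the remaining indicators are compared against.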

## Splitting into response and explanatory columns

The response column is `"Purchased Organics"`. The explanatory (input) columns are all the other columns.

In [None]:
response = organics_dum["Purchased Organics"]
explanatory = organics_dum.drop(columns="Purchased Organics")

## Splitting into training and testing sets

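The explanatory and response datasets need to be split into training and testing sets, so that the model can be assessed on data it was not fitted to.

Here we'll use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with the default arguments, which shuffle the rows at random and put 75% of them in the training set and 25% in the testing set.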

In [None]:
explanatory_train, explanatory_test, response_train, response_test = train_test_split(explanatory, response)
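Because the default split is random, rerunning the cell above gives a different split each time. The sketch below, on made-up arrays rather than the organics data, shows two optional arguments that are not used in this workspace: `random_state` to make the split reproducible, and `stratify` to keep the class balance the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, and a binary response with 20% ones.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# test_size=0.25 is the default, written out here for clarity.
# random_state fixes the shuffle; stratify=y preserves the share of ones.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Whether to stratify is a judgment call; it tends to matter most when one class is rare.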

## Fitting the model to the training set

The data is now ready to model. The first modeling step is to create a `LogisticRegression` object.

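Note that scikit-learn applies regularization (a technique that shrinks coefficients toward zero to reduce the effect of less important parameters) by default. This is a controversial default; to fit a standard, unpenalized logistic regression, set `penalty=None`. Versions of scikit-learn before 1.2 expect the string `penalty="none"` instead.

In [None]:
# penalty=None turns off regularization (scikit-learn 1.2+; older versions use penalty="none")
mdl = LogisticRegression(penalty=None)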

Use the [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) method to fit the model to the training set.

In [None]:
mdl.fit(explanatory_train, response_train)
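Once a model has been fitted, its learned parameters are stored as attributes. The following sketch uses synthetic data (not the organics dataset) and scikit-learn's default settings, so the coefficients are regularized; the name `toy_mdl` is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the response depends on the first feature only, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

toy_mdl = LogisticRegression()
toy_mdl.fit(x, y)

print(toy_mdl.classes_)    # the class labels seen during fitting
print(toy_mdl.coef_)       # one coefficient per explanatory column
print(toy_mdl.intercept_)  # the intercept term
```

In this toy setup the first coefficient should come out much larger than the second, reflecting how the data was generated.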

## Making predictions on the testing set

You can calculate the predicted response with the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) method.

In [None]:
predicted_responses = mdl.predict(explanatory_test)
predicted_responses
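Behind each predicted class is an estimated probability, available from the `.predict_proba()` method. As a rough sketch on synthetic data (again not the organics dataset, and with default model settings), the predictions of a binary model correspond to thresholding the probability of class 1 at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with three features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

toy_mdl = LogisticRegression().fit(X, y)

# Column 1 of predict_proba holds the estimated probability of class 1.
prob_one = toy_mdl.predict_proba(X)[:, 1]

# Thresholding at 0.5 reproduces the class labels from predict().
manual = (prob_one > 0.5).astype(int)
print(np.array_equal(manual, toy_mdl.predict(X)))
```

Working with the probabilities directly lets you choose a different threshold if false positives and false negatives have different costs.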

## Assessing model performance

There are four possible outcomes, depending on whether the actual response and the predicted response are each true or false. The confusion matrix, created with [`confusion_matrix()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), shows the count of each case. Its layout matches the table below: rows are actual values and columns are predicted values.

|                     |**predicted false** |**predicted true** |
|:--------------------|:-----------------|:----------------|
|**actual false** |correct           |false positive   |
|**actual true**  |false negative    |correct          |

In [None]:
confusion_matrix(response_test, predicted_responses)
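For a binary response, the four counts can be unpacked from the matrix with `.ravel()`. The labels below are made up so that the result is easy to check by hand.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for ten observations.
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(actual, predicted)
print(cm)

# Row-major order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 2 1 3
```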

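[`classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) returns a text summary of many metrics about the performance of the model. There are five numbers we typically care about. Writing TN, FP, FN, and TP for the counts of true negatives, false positives, false negatives, and true positives, they are calculated as follows.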

```
                   precision           recall                           f1-score    support

           0  TN / (TN + FN)   TN / (TN + FP)                                   .         .
           1  TP / (TP + FP)   TP / (TP + FN)                                   .         .

    accuracy                                      (TN + TP) / (TN + TP + FN + FP)         .
   macro avg               .                .                                   .         .
weighted avg               .                .                                   .         .
```

- **Accuracy**: What fraction of the values were correctly predicted?
- **Precision 0**: What fraction of the values that were predicted to be negative actually were negative?
- **Precision 1**: What fraction of the values that were predicted to be positive actually were positive?
- **Recall 0** a.k.a. **specificity**: What fraction of the values that were actually negative were predicted to be negative?
- **Recall 1** a.k.a. **sensitivity**: What fraction of the values that were actually positive were predicted to be positive?

In [None]:
print(classification_report(response_test, predicted_responses))
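To check the formulas above, the five numbers can be computed by hand from the confusion matrix and compared with scikit-learn's own metric functions. This sketch uses made-up labels rather than the organics predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up actual and predicted labels for ten observations.
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

# The five metrics, written exactly as in the formula table.
by_hand = {
    "accuracy": (tn + tp) / (tn + tp + fn + fp),
    "precision 0": tn / (tn + fn),
    "recall 0": tn / (tn + fp),
    "precision 1": tp / (tp + fp),
    "recall 1": tp / (tp + fn),
}

# The same metrics from scikit-learn; pos_label selects the class.
from_sklearn = {
    "accuracy": accuracy_score(actual, predicted),
    "precision 0": precision_score(actual, predicted, pos_label=0),
    "recall 0": recall_score(actual, predicted, pos_label=0),
    "precision 1": precision_score(actual, predicted, pos_label=1),
    "recall 1": recall_score(actual, predicted, pos_label=1),
}

for name in by_hand:
    print(f"{name}: {by_hand[name]:.3f} by hand, {from_sklearn[name]:.3f} from scikit-learn")
```

The two columns of output should agree for every metric.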

These DataCamp courses cover logistic regression in Python.

- [Machine Learning with scikit-learn](https://app.datacamp.com/learn/courses/machine-learning-with-scikit-learn) provides an introduction to modeling with scikit-learn, including logistic regression.
- [Introduction to Regression with statsmodels in Python](https://app.datacamp.com/learn/courses/introduction-to-regression-with-statsmodels-in-python) and [Intermediate Regression with statsmodels in Python](https://app.datacamp.com/learn/courses/intermediate-regression-with-statsmodels-in-python) provide a deep dive into linear and logistic regression, using the statsmodels package.