<a href="https://colab.research.google.com/github/dymiyata/intro-to-ml-and-ai-2025-2026/blob/main/logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression

Here we learn how to perform logistic regression using `sklearn`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now we import the dataset we will use.  This dataset is built into `sklearn`.  It has measurements for 569 breast tumors.

- We will load all the features into a DataFrame called `X`.  
- The target variable is whether the tumor is malignant (cancerous) or benign.
  - In this dataset, a value of `0` means malignant.
  - A value of `1` means benign (not cancerous).
- We'll load the values of the target variable into `y`
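
Since it's easy to mix up which number means which, here is a small sketch of how to check the encoding yourself. It only uses `target_names` and the label counts from the loaded dataset; the variable name `counts` is just for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[0] is the meaning of label 0, target_names[1] the meaning of label 1
print(data.target_names)

# how many tumors fall in each class
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(name, int(count))
```

The first name printed corresponds to label `0` and the second to label `1`.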

In [3]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

In [4]:
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [6]:
print(y[100:200])

100    0
101    1
102    1
103    1
104    1
      ..
195    1
196    0
197    0
198    0
199    0
Length: 100, dtype: int64


### Split into Training and Testing sets

Now we must split the data into a training set and a testing set.  Let's do a 75/25 train test split.  We'll use 2026 for the `random_state`.

- We'll need to add one more argument to `train_test_split` which is `stratify=y`.  What do you think this does?

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2026, stratify=y
)

`stratify=y` just ensures the proportion of 0s and 1s is about the same in the training and testing sets.
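
To see the effect of `stratify`, here is a minimal sketch on made-up labels (40 zeros and 60 ones, not the tumor data). The names `X_toy`, `y_toy`, `y_tr`, and `y_te` are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up labels: 40% zeros, 60% ones
y_toy = np.array([0] * 40 + [1] * 60)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# fraction of 1s in each split; with stratify both should be close to 0.6
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two fractions can drift apart just by chance, which matters more when one class is rare.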

### Preprocessing the Data

Recall, in linear regression it's important for your features to be similar in scale. Why?

The same holds true for logistic regression.  Before, I had you scale the features manually and imprecisely, but there is a better way to do things.



To do this, we will take each feature variable $x$ then:
- subtract its mean $\bar{x}$ from each training example so that the mean gets shifted to $0$
- divide each training example by the standard deviation $s$ so that the standard deviation becomes 1.

In this way, you will replace each feature $x$ with $$\frac{x - \bar{x}}{s}.$$  Thus, each resulting feature will have mean 0 and standard deviation 1 so they will all be on the same scale.
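
Here is a rough sketch of that formula done by hand with `numpy`, using made-up numbers (two features, four training examples). One assumption to flag: it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses as far as I know.

```python
import numpy as np

# made-up training and testing features (illustration only)
X_tr = np.array([[10.0, 0.10], [20.0, 0.30], [30.0, 0.20], [40.0, 0.40]])
X_te = np.array([[25.0, 0.25]])

# mean and standard deviation of each column, learned from the training set only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population standard deviation (ddof=0)

# apply (x - mean) / s to both sets, using the training statistics
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std

print(X_tr_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_tr_scaled.std(axis=0))   # approximately 1 for each feature
print(X_te_scaled)
```

Notice that the test set is scaled with the mean and standard deviation of the training set, not its own.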

Luckily, `sklearn` has a built-in way to do this automatically:

In [7]:
from sklearn.preprocessing import StandardScaler

In [10]:
scaler = StandardScaler()

# learn the mean and st. dev. for each feature from the training set
scaler.fit(X_train)

# scale the features in both the training set and testing set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [12]:
X_train_scaled

array([[-0.79043824,  0.13774499, -0.72878826, ..., -0.31250645,
         0.13296378, -0.76361944],
       [-1.21147418, -0.05213392, -1.20310801, ..., -1.02623386,
         0.55098372, -0.20595589],
       [-0.27731801,  0.72094453, -0.19372018, ..., -0.01855649,
         2.16755875,  1.32817656],
       ...,
       [-0.37439481, -1.35190033, -0.38401131, ..., -0.44890769,
        -0.54523455, -0.06988598],
       [-1.07057414, -0.71671014, -1.01844917, ..., -0.24755347,
        -0.46891555,  0.73816852],
       [ 1.95544838,  0.83396769,  1.82505547, ...,  1.31524523,
        -0.09772772, -0.47530738]])
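
If you want to convince yourself that the scaler is really applying $(x - \bar{x})/s$ with the training statistics, here is a small sketch on made-up data. It relies on the fitted attributes `mean_` and `scale_` of `StandardScaler`; the arrays and the name `toy_scaler` are just for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data (not the tumor features)
X_tr = np.array([[10.0, 0.10], [20.0, 0.30], [30.0, 0.20], [40.0, 0.40]])
X_te = np.array([[15.0, 0.35], [35.0, 0.15]])

toy_scaler = StandardScaler().fit(X_tr)

# statistics learned from the training set
print(toy_scaler.mean_)
print(toy_scaler.scale_)

# transform() on the test set should match the formula done by hand
manual = (X_te - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(manual, toy_scaler.transform(X_te)))  # True
```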

### Training the Logistic Regression model

Now that we've split the data into training and testing sets and preprocessed the data, we're ready to train the model.  

This part is almost the same as it was for linear regression.

First we import the logistic regression model:

In [13]:
from sklearn.linear_model import LogisticRegression

Now we define the model and fit it to the data.

In [14]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

In [16]:
model.intercept_

array([0.25403613])

With logistic regression, our model predicts the probability that $y=1$.  In `load_breast_cancer`, label `1` is benign, so this is the probability that a tumor is benign.

We can access these values as follows:

In [17]:
y_train_prob = model.predict_proba(X_train_scaled)
y_test_prob = model.predict_proba(X_test_scaled)

In [20]:
# print(y_train_prob)

# sanity check: the two probabilities from one row of y_train_prob add up to (about) 1
3.2763 * 10**(-4) + 9.99672362 * 10**(-1)

0.999999992
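
To make the shape of the `predict_proba` output concrete, here is a sketch on a tiny made-up dataset with one feature (so not the tumor data; `X_toy`, `y_toy`, and `toy_model` are illustrative names). It checks that there is one column per class, that each row sums to 1, and that for a default two-class model the second column is the sigmoid of the intercept plus coefficient times feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny made-up dataset with a single feature
X_toy = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)
proba = toy_model.predict_proba(X_toy)

# columns follow the order of toy_model.classes_: column 0 is P(y=0), column 1 is P(y=1)
print(toy_model.classes_)
print(proba.shape)  # (8, 2)

# each row adds up to 1
row_sums = proba.sum(axis=1)

# P(y=1) computed by hand: sigmoid(intercept + coefficient * x)
z = toy_model.intercept_ + X_toy @ toy_model.coef_.T
manual_p1 = 1 / (1 + np.exp(-z)).ravel()
print(np.allclose(manual_p1, proba[:, 1]))  # True
```

So if you only care about the probability of class `1`, you can take `proba[:, 1]`.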

To get the actual predictions (just 0 or 1), we compare the probabilities to a threshold value (by default we use a threshold of 0.5).
- Values above the threshold are predicted to be 1.
- Values below the threshold are predicted to be 0.
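
The rule above can be written by hand in one line. This is only a sketch with made-up probabilities; `p1` stands for the predicted probabilities of class `1`.

```python
import numpy as np

# hypothetical predicted probabilities that y = 1 (made-up numbers)
p1 = np.array([0.92, 0.08, 0.51, 0.49, 0.30])

# default rule: predict 1 when the probability is above 0.5
pred = (p1 > 0.5).astype(int)
print(pred)  # [1 0 1 0 0]

# a stricter threshold predicts 1 less often
strict_pred = (p1 > 0.9).astype(int)
print(strict_pred)  # [1 0 0 0 0]
```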

`sklearn` does this automatically with just `model.predict()`

In [21]:
y_test_pred = model.predict(X_test_scaled)

In [26]:
y_test_pred

array([1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0])

### Evaluating the model

Now how do we evaluate the model?

A good way is to check our accuracy on the test set.
- Accuracy is just the fraction of examples we classify correctly (`accuracy_score` reports it as a number between 0 and 1).

`sklearn` has a nice function to help us compute this called `accuracy_score`.  Let's import it and use it:

In [27]:
from sklearn.metrics import accuracy_score

In [28]:
accuracy_score(y_test, y_test_pred)

0.972027972027972
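
Accuracy is simple enough to compute by hand. Here is a minimal sketch with made-up labels (the arrays `y_true` and `y_pred` are illustrative, not our test set):

```python
import numpy as np

# made-up true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# True where the prediction matches the label
correct = (y_true == y_pred)

# the mean of a True/False array is the fraction of Trues
accuracy = correct.mean()
print(accuracy)  # 0.75
```

Here 6 of the 8 predictions match, so the accuracy is $6/8 = 0.75$.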

Another nice summary is something called the "Confusion Matrix".  Let's see how that works.

In [29]:
from sklearn.metrics import confusion_matrix

In [30]:
confusion_matrix(y_test, y_test_pred)

array([[51,  2],
       [ 2, 88]])

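`confusion_matrix` outputs a $2\times 2$ matrix of counts showing True/False Positives/Negatives, where "positive" means class `1`.  Rows are the true labels and columns are the predicted labels, so the order goes: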
$$\begin{bmatrix}
\text{TN} & \text{FP} \\
\text{FN} & \text{TP}
\end{bmatrix}$$

I always forget the order and have to look it up...
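
One trick that helps me is to unpack the four entries by name. The sketch below copies the matrix printed above into a `numpy` array (the name `cm` is just for illustration) and recovers the accuracy from it.

```python
import numpy as np

# the confusion matrix printed above, typed in by hand
cm = np.array([[51, 2],
               [2, 88]])

# ravel() flattens row by row, which gives the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 51 2 2 88

# accuracy = correct predictions / all predictions
accuracy = (tn + tp) / cm.sum()
print(accuracy)
```

This gives $139/143 \approx 0.972$, which matches the value from `accuracy_score`.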