# Logistic Regression

## Hi!
In today's workshop we are going to learn about the best-known concept in supervised learning: **classification**.

### What is classification?
Classification is the problem of predicting a discrete value (a class) from given features. It is mainly viewed as a supervised learning problem.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from ipywidgets import interact, fixed
import ipywidgets as widgets

import solutions

%load_ext autoreload
%autoreload 2

Just like last time, we'll work with a real-world dataset, this time describing a few hundred cases of breast cancer. It is an example of a **binary classification** problem.

In [None]:
print(load_breast_cancer().DESCR)

First, we'll split our data into training, validation, and test datasets.

In [None]:
X, y = load_breast_cancer(return_X_y=True)

In [None]:
np.random.seed(0)
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, train_size=0.7)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, train_size=0.66)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_val.shape, y_val.shape

In [None]:
X_test.shape, y_test.shape
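
The two `train_test_split` calls above amount to a roughly 70% / 20% / 10% split, because 66% of the remaining 30% is about 20% of all examples. The same two-stage idea can be sketched with NumPy alone; `two_stage_split` and its parameters are hypothetical names used for illustration, not part of scikit-learn:

```python
import numpy as np


def two_stage_split(n_samples, train_frac=0.7, val_frac_of_rest=0.66, seed=0):
    """Shuffle indices once, then cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    # The validation fraction applies to what is left after the training cut.
    n_val = int(val_frac_of_rest * (n_samples - n_train))
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = two_stage_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index sets are disjoint and together cover every example exactly once.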

### What about applying linear regression for classification?

Let's take a look at the target data:

In [None]:
y

It's a bunch of ones and zeros! Wouldn't it make sense to just train a linear regressor on the data?

In [None]:
from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

In [None]:
X_train[0]

In [None]:
linear_reg.predict(X_val)

How should we interpret these predictions? Maybe we need something different?

![classification_regression](img/clas_reg.png)
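
One reason these predictions are hard to interpret is that a linear regressor is not constrained to the range between 0 and 1. A minimal sketch on made-up 1-D data (the arrays `x_toy` and `y_toy` are illustrative, not taken from the breast cancer dataset):

```python
import numpy as np

# Toy 1-D data: the label flips from 0 to 1 between x = 1 and x = 2,
# and x = 10 is a far-away positive example.
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y_toy = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least squares with an intercept column.
A = np.c_[np.ones_like(x_toy), x_toy]
coef, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
predictions = A @ coef

print(predictions)  # the last value is greater than 1
```

A prediction above 1 cannot be read as a probability, which motivates squashing the output into a bounded range.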

### What is logistic regression?

Logistic regression applies a "squashing" function to the linear hypothesis, so that the prediction always lies between $0$ and $1$, and uses the squashed value when calculating the loss.

### $$h_w(x) = \sum_{j=0}^k w_j x_j = wx$$

### $$\hat{y} = \sigma(h_w(x))$$ 

### One such squashing function is the sigmoid function:
### $$\sigma(x) = \frac{1}{1+e^{-x}}$$

In [None]:
x = np.linspace(-10, 10)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [None]:
plt.plot(x, sigmoid(x))
plt.grid(True)
plt.show()

In [None]:
sigmoid(np.inf), sigmoid(-np.inf)
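
For large negative finite inputs such as `-1000`, the plain formula evaluates `np.exp(1000)`, which overflows to `inf` and triggers a NumPy warning even though the final result is still correct. A possible workaround, sketched here under the name `stable_sigmoid` (a hypothetical helper, not part of the workshop's `solutions` module), is to only ever exponentiate non-positive numbers:

```python
import numpy as np


def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0 the usual formula is safe, because -z <= 0.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0 use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```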

### Because of non-linearities in our hypotheses, we also need to update our loss function.

We'll use a logarithmic loss function, which nicely captures the intuition that we want predictions for data points labeled $0$ to be as close to $0$ as possible and, analogously, predictions for data points labeled $1$ to be as close to $1$ as possible:

### $$ L(w) = -\frac{1}{n}\sum_{i=1}^n \left( y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})} \right), \quad \hat{y}^{(i)} = \sigma(h_w(x^{(i)}))$$

### $$ y^{(i)} \in \{0, 1\}$$
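
To get a feel for the numbers, here is a small worked example of the logarithmic loss computed directly from predicted probabilities. The helper name `log_loss_from_predictions` is hypothetical, and clipping predictions to `[eps, 1 - eps]` is only one assumed way of keeping the logarithm finite; the workshop's own `solutions.loss` may handle `eps` differently.

```python
import numpy as np


def log_loss_from_predictions(y_true, y_hat, eps=1e-12):
    """Mean logarithmic loss; y_hat is clipped to [eps, 1 - eps] to keep log finite."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))


labels = [1, 0, 1]
good = log_loss_from_predictions(labels, [0.9, 0.1, 0.8])
# Same as above, except the last prediction is confidently wrong.
bad = log_loss_from_predictions(labels, [0.9, 0.1, 0.2])
print(good, bad)
```

A single confident mistake raises the mean loss noticeably, which is exactly the behavior the two plots of $-\log(x)$ and $-\log(1-x)$ illustrate.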

In [None]:
# y = 0
x = np.linspace(0, 0.9999, 1000)
plt.plot(x, -np.log(1 - x))
plt.ylim(-1, 10)
plt.show()

In [None]:
# y = 1
x = np.linspace(0.0001, 1, 1000)
plt.plot(x, -np.log(x))
plt.ylim(-1, 10)
plt.show()

Let's try and implement this new loss function!

In [None]:
def loss(
    W: np.ndarray, 
    X: np.ndarray, 
    Y: np.ndarray, 
    eps: float = 0.01 # the epsilon parameter is for numeric stability of logarithm
) -> float:
    return 0

In [None]:
loss = solutions.loss

In [None]:
W = np.random.rand(X.shape[1])
print(loss(W, X, y, eps=0.1))
print(solutions.loss(W, X, y, eps=0.1))

What about the gradient descent procedure? How does it change? Let's derive the gradient!

[we'll do that on the board] 
It turns out, it's very simple!

### $$
\frac{\partial L(w)}{\partial w} = \frac{1}{n}\sum_{i=1}^n x^{(i)} \cdot \left(\sigma(h_w(x^{(i)})) - y^{(i)}\right)
$$
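
A derived gradient can be sanity-checked numerically by comparing it with finite differences of the loss. The sketch below does this on random toy data; all function and variable names are illustrative and independent of the workshop's `solutions` module.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, X, y):
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def log_loss_gradient(w, X, y):
    """Analytic gradient: the mean of x_i * (sigmoid(w . x_i) - y_i)."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)


def numeric_gradient(f, w, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (f(w + step) - f(w - step)) / (2 * h)
    return grad


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = (rng.random(20) < 0.5).astype(float)
w_toy = rng.normal(size=3)

analytic = log_loss_gradient(w_toy, X_toy, y_toy)
numeric = numeric_gradient(lambda v: log_loss(v, X_toy, y_toy), w_toy)
print(np.max(np.abs(analytic - numeric)))
```

If the two gradients differ by more than a tiny amount, the derivation or the implementation most likely contains a mistake.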

In [None]:
def gradient_step(
    W, 
    X, 
    Y,
    learning_rate=0.01
) -> np.ndarray:
    return np.zeros_like(W)

In [None]:
gradient_step = solutions.gradient_step

In [None]:
W = np.random.rand(X.shape[1])

yours = gradient_step(W, X, y, learning_rate=0.1)
provided = solutions.gradient_step(W, X, y, learning_rate=0.1)
print(yours - provided)

Let's not forget about adding the bias feature and normalizing the data!

In [None]:
def add_bias_feature(X):
    return np.c_[np.ones(len(X)), X]

In [None]:
X_train = add_bias_feature(X_train)
X_val = add_bias_feature(X_val)
X_test = add_bias_feature(X_test)

In [None]:
X_train, *norm_parameters = solutions.std_normalization(X_train)
X_val, *_ = solutions.std_normalization(X_val, *norm_parameters)
X_test, *_ = solutions.std_normalization(X_test, *norm_parameters)
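
Note that the validation and test sets are normalized with the parameters computed on the training set, so no information from held-out data leaks into training. The implementation of `solutions.std_normalization` is not shown here; the sketch below is a hypothetical `standardize` function that mimics the same call pattern. Leaving constant columns (such as the bias feature) untouched is an assumption of this sketch.

```python
import numpy as np


def standardize(X, mean=None, std=None):
    """Z-score each column; statistics are computed only when none are passed in."""
    X = np.asarray(X, dtype=float)
    if mean is None or std is None:
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        constant = std == 0
        # Leave constant columns (e.g. the bias feature) unchanged: (x - 0) / 1.
        mean = np.where(constant, 0.0, mean)
        std = np.where(constant, 1.0, std)
    return (X - mean) / std, mean, std


rng = np.random.default_rng(0)
train_toy = np.c_[np.ones(50), rng.normal(5.0, 2.0, size=(50, 2))]
val_toy = np.c_[np.ones(10), rng.normal(5.0, 2.0, size=(10, 2))]

train_norm, *params = standardize(train_toy)
val_norm, *_ = standardize(val_toy, *params)  # reuse the training statistics
print(train_norm.mean(axis=0), train_norm.std(axis=0))
```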

In [None]:
np.random.seed(0)
W = np.random.randn(X_train.shape[1])
train_costs = []
val_costs = []
train_steps = 100
for _ in range(train_steps):
    train_costs.append(loss(W, X_train, y_train, eps=0.001))
    val_costs.append(loss(W, X_val, y_val, eps=0.001))
    W = gradient_step(W, X_train, y_train, learning_rate=0.1)
   

In [None]:
plt.plot(np.arange(train_steps), train_costs)
plt.plot(np.arange(train_steps), val_costs)
plt.show()

In [None]:
accuracy_score(y_train, solutions._hypotheses(W, X_train) >= 0.5)

In [None]:
accuracy_score(y_val, solutions._hypotheses(W, X_val) >= 0.5)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logistic_reg = LogisticRegression(C=10**6)

In [None]:
logistic_reg.fit(X_train, y_train)

In [None]:
logistic_reg.score(X_train, y_train)

In [None]:
logistic_reg.score(X_val, y_val)

### A great score! Or is it?

In [None]:
positive_ind = np.argwhere(y_val == 1).reshape(-1)
negative_ind = np.argwhere(y_val == 0).reshape(-1)
X_val_pos = X_val[positive_ind]
y_val_pos = y_val[positive_ind]
X_val_neg = X_val[negative_ind]
y_val_neg = y_val[negative_ind]

In [None]:
accuracy_score(y_val_pos, solutions._hypotheses(W, X_val_pos) >= 0.5)

In [None]:
accuracy_score(y_val_neg, solutions._hypotheses(W, X_val_neg) >= 0.5)

We achieve higher accuracy on positive examples than on negative ones. In scikit-learn's version of this dataset the label $1$ means *benign* and $0$ means *malignant*, so in practice the model is likelier to call a tumor benign than malignant.

That is the opposite of "better safe than sorry". Can we dig deeper into the performance of our model?

### Precision and recall
We can divide the classifications of our model into four categories:

| Predicted/Actual | 0   | 1   |
|------------------|-----|-----|
| 0                | True negative | False negative|
| 1                | False positive | True positive | 


**Accuracy - a first intuition**

$$
Accuracy = \frac{T_p + T_n}{T_n + T_p + F_n + F_p}
$$
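
The four counts and the accuracy can be computed directly from label arrays. A small sketch on made-up labels (the arrays `y_true_toy` and `y_pred_toy` are illustrative only):

```python
import numpy as np

y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((y_pred_toy == 1) & (y_true_toy == 1)))  # true positives
tn = int(np.sum((y_pred_toy == 0) & (y_true_toy == 0)))  # true negatives
fp = int(np.sum((y_pred_toy == 1) & (y_true_toy == 0)))  # false positives
fn = int(np.sum((y_pred_toy == 0) & (y_true_toy == 1)))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```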

However, as we've just seen, this metric may be deceiving (consider class imbalance!)

Turns out there is a more reliable way to measure the performance of our model:

- **Precision** - *what fraction of our positive classifications is correct?*
$$
Precision = \frac{T_p}{T_p + F_p}
$$

- **Recall** - *what fraction of actual positive examples has been classified correctly?*
$$
Recall = \frac{T_p}{T_p + F_n}
$$

We want both of those values to be as high as possible (duh).

However, sometimes we have to make a trade-off between them and choose a classification method that favors one at the expense of the other.

A metric which nicely mixes the two above is called the **F1 score**. It's high when both precision and recall are high, but low when one of them is sacrificed for the sake of the other.

$$
F1 = \frac{2PR}{P + R}
$$
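
A worked example with made-up counts shows why accuracy alone can be deceiving under class imbalance. The helper `precision_recall_f1` is a hypothetical name, and returning `0.0` when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 100 examples, only 5 of them positive; the classifier finds just one of the five.
tp, fp, fn, tn = 1, 0, 4, 95
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

Accuracy is 0.96 and precision is 1.0, yet recall is only 0.2, and the F1 score of about 0.33 exposes the problem.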

#### Can precision and recall be manipulated without tinkering with the model?

In [None]:
def calc_precision_recall(
    X: np.ndarray,
    y: np.ndarray,
    W: np.ndarray,
    threshold: float
):
    y_pred = solutions._hypotheses(W, X)
    y_pred_bin = y_pred >= threshold
    print('precision', precision_score(y, y_pred_bin))
    print('recall', recall_score(y, y_pred_bin))
    print('F1 score', f1_score(y, y_pred_bin))
    positive_ind = np.argwhere(y == 1).reshape(-1)
    negative_ind = np.argwhere(y == 0).reshape(-1)
    y_pos = y[positive_ind]
    y_neg = y[negative_ind]
    y_pos_pred = y_pred_bin[positive_ind]
    y_neg_pred = y_pred_bin[negative_ind]
    print('total accuracy', accuracy_score(y, y_pred_bin))
    print('positive accuracy', accuracy_score(y_pos, y_pos_pred))
    print('negative accuracy', accuracy_score(y_neg, y_neg_pred))

In [None]:
interact(
    calc_precision_recall,
    X=fixed(X_val),
    y=fixed(y_val),
    W=fixed(W),
    threshold=widgets.FloatSlider(
        value=0.5,
        min=0,
        max=1,
        step=0.01
    )
)
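
The effect of the slider can be reproduced on a tiny made-up example: the model's scores stay fixed and only the threshold changes. The arrays and the helper `precision_recall_at` below are illustrative assumptions, not part of the notebook's `solutions` module.

```python
import numpy as np

labels_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.45, 0.6, 0.7, 0.9])


def precision_recall_at(threshold, labels, scores):
    """Precision and recall after binarizing scores (assumes at least one positive label)."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)


low = precision_recall_at(0.3, labels_toy, scores_toy)
high = precision_recall_at(0.65, labels_toy, scores_toy)
print(low, high)
```

In this example, raising the threshold from 0.3 to 0.65 lifts precision from 0.6 to 1.0 while recall drops from 1.0 to about 0.67.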

#### How does the F1 score depend on the threshold?

In [None]:
thresholds = np.linspace(.01, .99, 100)
scores = []

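# The raw predictions do not depend on the threshold, so compute them once.
y_pred = solutions._hypotheses(W, X_val)
for t in thresholds:
    y_pred_bin = y_pred >= t
    scores.append(f1_score(y_val, y_pred_bin))
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.grid(True)
plt.xlabel('threshold')
plt.ylabel('F1 score')
plt.plot(thresholds, scores)
plt.show()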


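#### To better visualize how precision and recall depend on each other, we can plot a precision-recall curve

A closely related plot is the ROC curve, which shows the true positive rate (recall) against the false positive rate for every threshold. The area under it is known as AUROC:

**A**rea

**U**nder

**R**eceiver

**O**perating

**C**haracteristic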

In [None]:
thresholds = np.linspace(.01, .99, 100)
precisions = []
recalls = []

# The raw predictions do not depend on the threshold, so compute them once.
y_pred = solutions._hypotheses(W, X_val)
for t in thresholds:
    y_pred_bin = y_pred >= t
    precisions.append(precision_score(y_val, y_pred_bin))
    recalls.append(recall_score(y_val, y_pred_bin))
plt.xlim(0, 1.2)
plt.ylim(0, 1.2)
plt.grid(True)
plt.xlabel('precision')
plt.ylabel('recall')
plt.plot(precisions, recalls)
plt.show()
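
The ROC curve behind the AUROC acronym plots the true positive rate (recall) against the false positive rate as the threshold varies, and AUROC is the area under that curve. A minimal NumPy sketch on made-up scores follows; `roc_points` is a hypothetical helper, and a library routine would normally be preferred in practice.

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) arrays, one point per threshold, from strictest to loosest."""
    # Start with an infinite threshold so the curve begins at (0, 0).
    thresholds = np.r_[np.inf, np.unique(scores)[::-1]]
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    tpr = np.array([np.sum((scores >= t) & (labels == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (labels == 0)) / n_neg for t in thresholds])
    return fpr, tpr


labels_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.45, 0.6, 0.7, 0.9])

fpr, tpr = roc_points(labels_toy, scores_toy)
# Trapezoidal rule for the area under the curve.
auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auroc)
```

An AUROC of 1 would mean every positive example is scored above every negative one, while 0.5 corresponds to random ordering.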