# Hi!

In today's workshop we are going to learn about one of the best-known concepts in supervised learning: `classification`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [None]:
print(load_breast_cancer().DESCR)

In [None]:
X, y = load_breast_cancer(return_X_y=True)

In [None]:
# 70% train; the remaining 30% is split roughly 2:1 into validation (~20%) and test (~10%)
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, train_size=0.7)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, train_size=0.66)

###### What is classification?

Classification is the problem of predicting a discrete value (a class) from given features. It is usually treated as a supervised learning problem.

###### What about applying linear regression for classification?

In [None]:
from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

In [None]:
X_train[0]

In [None]:
linear_reg.predict(X_val)

How to interpret these predictions? Maybe we need something different?
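To make the question concrete, here is a minimal sketch on a hypothetical 1-D toy dataset (not the breast cancer data). It fits ordinary least squares with `np.linalg.lstsq`, which is assumed here as a stand-in for `LinearRegression`, and shows that the predictions are not restricted to the labels 0 and 1, or even to the range between them.

```python
import numpy as np

# Toy 1-D data, made up for illustration: negatives on the left, positives on the right.
x_toy = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least squares with an explicit bias column.
A_toy = np.c_[np.ones(len(x_toy)), x_toy]
w_toy, *_ = np.linalg.lstsq(A_toy, y_toy, rcond=None)

# Predict for points further away from the training data.
x_new = np.array([-5.0, 0.0, 5.0])
preds_toy = np.c_[np.ones(len(x_new)), x_new] @ w_toy
print(preds_toy)  # first value is below 0, last value is above 1
```

A value below 0 or above 1 cannot be read as a class or as a probability, which is the motivation for the next section.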

###### What is logistic regression?

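Logistic regression applies a "squashing" function to the linear hypothesis. In linear regression the prediction is the hypothesis itself:

$$\hat{y} = h_w(x)$$

$$h_w(x) = \sum_{j=0}^k w_j x_j = wx$$

In logistic regression we pass it through the sigmoid function $\sigma$, which maps any real number into the interval $(0, 1)$:

$$\hat{y} = \sigma(h_w(x))$$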

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

In [None]:
x = np.linspace(-10, 10)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [None]:
plt.plot(x, sigmoid(x))
plt.grid(True)
plt.show()

In [None]:
sigmoid(np.inf), sigmoid(-np.inf)

What about loss? Is MSE still applicable? 

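MSE is a poor fit here: combined with the sigmoid it produces a non-convex cost function, so gradient descent may get stuck in a local minimum. Instead we use log-loss, which is convex for logistic regression and heavily penalizes confident but wrong predictions.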

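$$ L(w) = -\frac{1}{n}\sum_{i=1}^n \left[ y^{(i)}\log{h_w(x^{(i)})} + (1-y^{(i)})\log{(1-h_w(x^{(i)}))} \right]$$

$$ y^{(i)} \in \{0, 1\}$$

From here on, $h_w(x^{(i)})$ denotes the logistic hypothesis, i.e. the sigmoid output $\sigma(w x^{(i)})$.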
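The log-loss can be sketched in a few lines of NumPy. This is only an illustration, not the workshop's `solutions.cost`: the name `log_loss_sketch` is hypothetical, the loss is averaged over the examples, and the `eps` clipping is an assumption added to avoid evaluating `log(0)`.

```python
import numpy as np

def log_loss_sketch(y_true, p, eps=1e-12):
    """Mean log-loss for labels in {0, 1} and predicted probabilities p."""
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

labels_demo = np.array([1.0, 0.0])
confident_right = log_loss_sketch(labels_demo, np.array([0.9, 0.1]))
confident_wrong = log_loss_sketch(labels_demo, np.array([0.1, 0.9]))
print(confident_right, confident_wrong)
```

The second value is much larger than the first: being confidently wrong is punished far more than being confidently right is rewarded.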

In [None]:
# y = 1

x = np.linspace(0.0001, 1, 1000)
plt.plot(x, -np.log(x))
plt.ylim(-1, 10)
plt.grid(True)
plt.show()

In [None]:
# y = 0

x = np.linspace(0, 0.9999, 1000)
plt.plot(x, -np.log(1 - x))
plt.ylim(-1, 10)
plt.grid(True)
plt.show()

What about the gradient descent procedure? How does it change? Let's derive the gradient on the blackboard.
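Once you have a candidate formula, you can sanity-check it numerically. The sketch below assumes the mean log-loss and the commonly derived gradient $\frac{1}{n}X^T(\sigma(Xw) - y)$, and compares it with central finite differences on random toy data. All names here (`sigmoid_sketch`, `mean_log_loss`, `log_loss_gradient`) are hypothetical and not part of `solutions`.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(w, X, y):
    p = sigmoid_sketch(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_gradient(w, X, y):
    # Assumed analytic gradient of the mean log-loss with respect to w.
    return X.T @ (sigmoid_sketch(X @ w) - y) / len(y)

# Random toy problem: 20 examples, a bias column plus 3 features.
rng = np.random.default_rng(0)
X_demo = np.c_[np.ones(20), rng.normal(size=(20, 3))]
y_demo = (rng.random(20) < 0.5).astype(float)
w_demo = rng.normal(size=4) * 0.1

# Central finite differences, one coordinate direction at a time.
h = 1e-6
numeric = np.array([
    (mean_log_loss(w_demo + h * e, X_demo, y_demo)
     - mean_log_loss(w_demo - h * e, X_demo, y_demo)) / (2 * h)
    for e in np.eye(4)
])
analytic = log_loss_gradient(w_demo, X_demo, y_demo)
print(np.max(np.abs(numeric - analytic)))
```

If the printed difference is tiny, the analytic gradient agrees with the numerical one; a large difference usually points to a sign or indexing mistake in the derivation.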

In [None]:
from importlib import reload

In [None]:
import solutions

In [None]:
solutions = reload(solutions)

In [None]:
def add_bias_feature(X):
    return np.c_[np.ones(len(X)), X]

In [None]:
X_train = add_bias_feature(X_train)
X_val = add_bias_feature(X_val)

In [None]:
X_train, *norm_parameters = solutions.std_normalization(X_train)
X_val, *_ = solutions.std_normalization(X_val, *norm_parameters)

In [None]:
W = np.zeros(X_train.shape[1])
train_costs = []
val_costs = []
train_steps = 100000
for _ in range(train_steps):
    train_costs.append(solutions.cost(W, X_train, y_train, eps=0.001))
    val_costs.append(solutions.cost(W, X_val, y_val, eps=0.001))
    W = solutions.gradient_step(W, X_train, y_train, learning_rate=0.01)
   

In [None]:
plt.plot(np.arange(train_steps), train_costs)
plt.plot(np.arange(train_steps), val_costs)
plt.show()

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_train, solutions._hypotheses(W, X_train) >= 0.5)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logistic_reg = LogisticRegression(C=10**6)

In [None]:
logistic_reg.fit(X_train, y_train)

In [None]:
logistic_reg.score(X_train, y_train)

### How to deal with overfitting?

**Regularization to the rescue!**

In Logistic (as well as Linear) Regression we can make sure that elements of the weights vector don't grow too dramatically. We do this by penalizing their size additionally in the cost function.

#### Regularized cost function

$$
L_{reg}(W) = L(W) + \frac{\lambda}{2n}\sum_{j=1}^k w_j^2
$$

...whatever the original cost function was!

**We don't regularize the first weight element, which is responsible for bias!**

The $\lambda$ parameter is, just like the learning rate, best chosen empirically.

In [None]:
def cost_reg(W, X, Y, l=0.1):
    # implement me!
    raise NotImplementedError

In [None]:
cost_reg = solutions.cost_reg

#### Regularized gradient descent

The only thing that changes is the partial derivative of cost for all $j \in [1.. k]$ (since we don't regularize $w_0$)

$$\epsilon_0 = \frac{\partial}{\partial w_0}L_{reg}(W) = \frac{1}{n} \sum_{i=1}^n(h_W(x^{(i)}) - y^{(i)})x_0^{(i)}$$
$$\epsilon_j = \frac{\partial}{\partial w_j}L_{reg}(W) = \frac{1}{n} \sum_{i=1}^n(h_W(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{n}w_j$$
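To see what the penalty does in practice, here is a small sketch on hypothetical, linearly separable toy data. It is one possible reading of the formulas above, not the workshop's `solutions.gradient_step_reg`. The function name `gradient_step_l2` and its parameters are assumptions, and column 0 of `X` is assumed to be the bias feature.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_l2(w, X, y, learning_rate, lam):
    """One gradient descent step on the L2-regularized mean log-loss.

    Assumes column 0 of X is the bias feature, so w[0] is not penalized.
    """
    n = len(y)
    grad = X.T @ (sigmoid_sketch(X @ w) - y) / n
    grad[1:] += lam / n * w[1:]
    return w - learning_rate * grad

# Tiny separable dataset (made up): the unregularized weight keeps growing.
X_sep = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]
y_sep = np.array([0.0, 0.0, 1.0, 1.0])

w_plain = np.zeros(2)
w_l2 = np.zeros(2)
for _ in range(2000):
    w_plain = gradient_step_l2(w_plain, X_sep, y_sep, learning_rate=0.5, lam=0.0)
    w_l2 = gradient_step_l2(w_l2, X_sep, y_sep, learning_rate=0.5, lam=1.0)

print("weight without regularization:", w_plain[1])
print("weight with lambda = 1:       ", w_l2[1])
```

With `lam=0.0` the feature weight keeps increasing because the data is perfectly separable; with the penalty it settles at a smaller value.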



In [None]:
def gradient_step_reg(W, X, Y, l=0.1):
    # implement me!
    raise NotImplementedError

In [None]:
gradient_step_reg = solutions.gradient_step_reg

### How to measure performance of our model?
#### Metrics

We can divide the predictions of our model into four groups:

| Predicted/Actual | 0   | 1   |
|------------------|-----|-----|
| 0                | True negative | False negative|
| 1                | False positive | True positive | 


**Accuracy - a first intuition**

$$
Accuracy = \frac{T_p + T_n}{total}
$$



In [None]:
def accuracy(actual_predictions, model_predictions):
    # implement me!
    # both arguments are np.arrays of zeros and ones symbolizing
    # results of classification for each example
    raise NotImplementedError

In [None]:
accuracy = solutions.accuracy

###### What problems do you see with such a metric?

Turns out there is a more reliable way to measure the performance of our model:

- **Precision** - *what fraction of our positive classifications is correct?*
$$
Precision = \frac{T_p}{T_p + F_p}
$$

- **Recall** - *what fraction of actual positive examples has been classified correctly?*
$$
Recall = \frac{T_p}{T_p + F_n}
$$

We want both of those values to be as high as possible (duh).
However, sometimes we have to make a trade-off between them and accept that raising one of them lowers the other.

###### Can you think of any simple ways to increase one of those metrics? (without changing the model or the data)

One of the possible metrics which takes those two into account is the **F score**, which will be high if both precision and recall are high, but low if we sacrifice precision to increase recall or the other way around.

$$
\text{F score} = \frac{2PR}{P + R}
$$
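A toy example may help to compare these metrics with accuracy. The sketch below uses made-up labels with 95 negatives and 5 positives and two hypothetical classifiers: a "lazy" one that always predicts 0, and one that finds 4 of the 5 positives at the cost of 6 false alarms. The helper `summarize_sketch` is an assumption for illustration; it treats an undefined ratio (0/0) as 0.

```python
import numpy as np

def summarize_sketch(y_true, y_pred):
    """Accuracy, precision, recall and F score from 0/1 label arrays."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Convention assumed here: a 0/0 ratio counts as 0.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f_score": f_score}

# Imbalanced toy labels: 95 negatives followed by 5 positives.
y_true_demo = np.array([0] * 95 + [1] * 5)

# "Lazy" classifier: always predicts the majority class.
y_lazy = np.zeros(100, dtype=int)

# Second classifier: 6 false alarms, 4 of the 5 positives found.
y_model = np.zeros(100, dtype=int)
y_model[:6] = 1
y_model[95:99] = 1

lazy_report = summarize_sketch(y_true_demo, y_lazy)
model_report = summarize_sketch(y_true_demo, y_model)
print(lazy_report)
print(model_report)
```

The lazy classifier has the higher accuracy even though it never finds a single positive example; recall and the F score expose that.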

In [None]:
def precision(actual_predictions, model_predictions):
    # implement me!
    # both arguments are np.arrays of zeros and ones symbolizing
    # results of classification for each example
    raise NotImplementedError

def recall(actual_predictions, model_predictions):
    # implement me!
    # both arguments are np.arrays of zeros and ones symbolizing
    # results of classification for each example
    raise NotImplementedError

def f_score(actual_predictions, model_predictions):
    # implement me!
    # both arguments are np.arrays of zeros and ones symbolizing
    # results of classification for each example
    raise NotImplementedError

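#### ROC curve and AUROC
Another way to visualize the performance of our model is to plot the **R**eceiver **O**perating **C**haracteristic (ROC) curve. The **A**rea **U**nder the **ROC** curve (AUROC) then summarizes the whole curve in a single number: 1.0 for a perfect classifier, about 0.5 for random guessing.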

This curve indicates the relation between two metrics:

- **True positive rate (TPR)** (which is another name for recall)

*what fraction of actual positive examples has been classified correctly?*

$$
TPR = \frac{T_p}{T_p + F_n} = Recall
$$

- **False positive rate (FPR)**


*what fraction of actual negative examples has been classified incorrectly?*
$$
FPR = \frac{F_p}{F_p + T_n}
$$

The metrics should be calculated for different classification thresholds of the classifier and then plotted against each other (FPR on the x-axis, TPR on the y-axis).

In [None]:
tpr = recall

def fpr(actual_predictions, model_predictions):
    # implement me!
    # both arguments are np.arrays of zeros and ones symbolizing
    # results of classification for each example
    raise NotImplementedError

In [None]:
fpr = solutions.fpr

In [None]:
thresholds = np.arange(0, 1, 0.02)

# predicted probabilities on the validation set from the model trained above
probabilities = solutions._hypotheses(W, X_val)
classifications_for_thresholds = [probabilities >= t for t in thresholds]

tprs = [tpr(y_val, model_predictions) for model_predictions in classifications_for_thresholds]
fprs = [fpr(y_val, model_predictions) for model_predictions in classifications_for_thresholds]

plt.plot(fprs, tprs)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
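The area under the curve can be approximated with the trapezoid rule. The sketch below is self-contained and uses made-up scores and labels. The helper names `roc_points_sketch` and `auc_trapezoid_sketch` are hypothetical, and the thresholds are taken directly from the sorted scores, which is one common choice rather than the only one.

```python
import numpy as np

def roc_points_sketch(scores, labels):
    """Return (fprs, tprs) for thresholds sweeping from high to low."""
    # Start above every score so the curve begins at the point (0, 0).
    thresholds = np.r_[np.inf, np.sort(scores)[::-1]]
    positives = np.sum(labels == 1)
    negatives = np.sum(labels == 0)
    fprs, tprs = [], []
    for t in thresholds:
        predicted = scores >= t
        tprs.append(np.sum(predicted & (labels == 1)) / positives)
        fprs.append(np.sum(predicted & (labels == 0)) / negatives)
    return np.array(fprs), np.array(tprs)

def auc_trapezoid_sketch(fprs, tprs):
    """Area under the curve using the trapezoid rule."""
    return float(np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2))

# Made-up classifier scores and true labels.
scores_demo = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
labels_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1])

fprs_demo, tprs_demo = roc_points_sketch(scores_demo, labels_demo)
auc_demo = auc_trapezoid_sketch(fprs_demo, tprs_demo)
print(auc_demo)  # 0.8125
```

For this toy data 13 of the 16 positive/negative pairs are ranked correctly, which matches the area of 0.8125.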
