<a href="https://colab.research.google.com/github/ch00226855/CMP414765Fall2022/blob/main/Week06_LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 6
# Logistic Regression

We have studied how to use linear regression and polynomial regression to *predict a target numeric value*. Another learning task, **classification**, aims at predicting group membership rather than numeric values. An email spam filter is a good example: it is trained on many example emails labeled with their class (spam or non-spam), and it must learn how to classify new emails.

Linear regression is **not** a good choice for classification tasks. We will introduce the **logistic regression** model and use the iris dataset to illustrate how the model works.

**Readings**: Textbook Chapter 4

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

## Logistic Regression: Intuition
- Picture the data as points on the plane.
- A classifier's job is to determine the decision regions for each class.
- If a point is far from the decision boundary, then the classifier should be fairly confident about its prediction.
- If a point is near the decision boundary, then the classifier may be less confident about its prediction.
- The **logistic regression** model aims to provide a **probability distribution** for each point. A good classifier will produce a **probability distribution with low variance** on most inputs.
- **Probability distribution with high variance**: rolling a die - there is no way to predict the likely outcome
- **Probability distribution with low variance**: hitting the Powerball jackpot - probably not going to happen

<img src="https://miro.medium.com/proxy/1*fBjniQPOKigqxYSKEumXoA.png" width="400">
<img src="https://www.researchgate.net/profile/Tyler-Grear-2/publication/346931728/figure/fig3/AS:967322328113154@1607639023428/An-application-of-SVM-to-non-linearly-separable-distributions-of-two-classes-a-A-2D.ppm" width="400">

In [None]:
# Examples of low-variance probability distributions:
[0.99, 0.001, 0.009] # low variance: the outcome most likely will be 0.
[0.1, 0.8, 0.1] # low variance: the outcome most likely will be 1.

# Examples of high-variance probability distributions:
[0.333, 0.333, 0.334] # high variance
[0.5, 0.5] # high variance
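
The notes use "variance" informally to describe how spread out the probabilities are across the possible outcomes. One common way to put a number on that spread is Shannon entropy, which is an assumption of this sketch rather than a term used above: it is 0 when one outcome is certain and largest when all outcomes are equally likely. The helper name `entropy` is hypothetical.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]              # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))

# The four example distributions listed above
for dist in ([0.99, 0.001, 0.009], [0.1, 0.8, 0.1],
             [0.333, 0.333, 0.334], [0.5, 0.5]):
    print(dist, round(entropy(dist), 3))
```

The two confident distributions should come out with smaller values than the two near-uniform ones.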

## Basic Case: Binary Classifier
- Suppose there are only two classes for the output feature: **Class 0** (the negative class) and **Class 1** (the positive class).
- A **binary classifier** tries to estimate the probability $p$ that a point belongs to Class 1.
- The probability that a point belongs to Class 0 is $1 - p$.
- Given the probability, the binary classifier will compare it with a chosen **threshold** (for example, 0.5), and then predict the class as
    - prediction = 1 if $\hat{p}$ $\ge$ threshold
    - prediction = 0 if $\hat{p}$ < threshold
- The **boundary** of the decision regions is the curve formed by the points whose probability equals the threshold value.
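
The thresholding rule above can be sketched in a few lines of NumPy. The probabilities below are made-up values for illustration, and `predict_with_threshold` is a hypothetical helper name, not part of any library.

```python
import numpy as np

# Hypothetical estimated probabilities of Class 1 for five points
p_hat = np.array([0.05, 0.40, 0.50, 0.75, 0.98])

def predict_with_threshold(p_hat, threshold=0.5):
    """Return 1 where p_hat >= threshold and 0 otherwise."""
    return (p_hat >= threshold).astype(int)

print(predict_with_threshold(p_hat))        # threshold 0.5 -> [0 0 1 1 1]
print(predict_with_threshold(p_hat, 0.8))   # stricter threshold -> [0 0 0 0 1]
```

Note that the point with probability exactly 0.5 is assigned to Class 1, because the rule uses $\ge$.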

## Example: The Iris Dataset

The **Iris dataset** is a famous dataset that contains the sepal and petal lengths and widths of 150 iris flowers from three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica. [wiki page](https://en.wikipedia.org/wiki/Iris_flower_data_set)

- Import the dataset using <code>sklearn.datasets.load_iris()</code>
- Explore the dataset: data description, feature names, data types, data histograms, scatter plots.
- Split the dataset into train_set and test_set
- Apply <code>sklearn.linear_model.LogisticRegression</code> to build a binary classifier on **Iris-Virginica**.
- Evaluate the performance of the model: Accuracy, cross-validation, precision vs. recall, confusion matrix...
- Visualize the model (show decision boundary)

<img src="https://miro.medium.com/max/1000/1*Hh53mOF4Xy4eORjLilKOwA.png" width="600">


In [None]:
# Load the dataset
from sklearn import datasets
iris = datasets.load_iris()

iris.keys()

In [None]:
# Description of the dataset
print(iris['DESCR'])

In [None]:
print(iris['feature_names'])

In [None]:
# Convert the data into a data frame
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
iris_df.head()

In [None]:
# Add the target class
iris_df['target'] = iris['target']
iris_df.head()

In [None]:
# Explore the dataset
# How many examples are there for each type of Iris?

iris_df['target'].value_counts()

In [None]:
# Flower names are contained in the original iris object
iris['target_names']

In [None]:
# Create a function that maps 0-2 to the actual type of iris
def get_target_name(x):
    return iris['target_names'][x]

x = iris_df.loc[124, 'target']
name = get_target_name(x)
print(x, name)

In [None]:
# Apply get_target_name() to all target values
iris_df['target_name'] = iris_df['target'].apply(get_target_name)
iris_df.head()

In [None]:
# Draw scatter plots.
plt.scatter(iris_df.loc[:, 'sepal length (cm)'], iris_df.loc[:, 'sepal width (cm)'],
            c=iris_df['target'])

plt.colorbar()
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.show()

In [None]:
# Draw all scatter plots
from pandas.plotting import scatter_matrix
scatter_matrix(iris_df.iloc[:, :4], figsize=(15, 15), marker='o',
               c=iris_df['target'])
plt.show()

## Build A Binary Classifier for Iris-Virginica

In [None]:
# Define a function is_virginica(target) that returns 1 if target is Virginica
# and 0 otherwise
def is_virginica(target):

    return int(target == 2)

In [None]:
# Apply function is_virginica() to the data frame, creating a new 
# column "Is_Virginica"

iris_df["Is_Virginica"] = iris_df['target'].apply(is_virginica)

# OR
# Define the function using the lambda expression
# iris_df["Is_Virginica"] = iris_df['target'].apply(lambda x: int(x == 2))

iris_df.head()

In [None]:
# Train-test split
# Split the data frame into 85% training data and 15% test data
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(iris_df, test_size=0.15)

In [None]:
# Build the logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(df_train.loc[:, ['sepal length (cm)',
                           'sepal width (cm)',
                           'petal length (cm)',
                           'petal width (cm)']], df_train['Is_Virginica'])

In [None]:
# Since using .loc[] expression requires the full names of the columns, sometimes it
# is easier to use their underlying integer indices in .iloc[] expression

# For example, the expression 
# df_train.loc[:, ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
# is equivalent to
# df_train.iloc[:, :4]

## Model Evaluation
- Classification accuracy
- Cross Validation
- Examine four categories using the confusion matrix:
    - True Positive
    - True Negative
    - False Positive
    - False Negative
- Precision, recall, and F1 score

In [None]:
# 1. Find the prediction accuracy on test set
from sklearn.metrics import accuracy_score

input_cols = ['sepal length (cm)',
            'sepal width (cm)',
            'petal length (cm)',
            'petal width (cm)']

# df_test.head()
# get model's prediction on the test records
test_predictions = model.predict(df_test.loc[:, input_cols])

# model.predict(df_test.iloc[:, :4])

accuracy_score(df_test['Is_Virginica'], test_predictions)

In [None]:
# Let's calculate the accuracy score without sklearn
# Convert both Is_Virginica and predictions into numpy arrays
array1 = np.array(df_test['Is_Virginica'])
array2 = np.array(test_predictions)
print(array1)
print(array2)

# Count the number of pairs that have identical values
count = 0
for i in range(len(array1)):
    actual = array1[i]
    pred = array2[i]
    if actual == pred:
        count = count + 1
print(count)
accuracy = count / len(array1)
print(accuracy)

The accuracy score can be misleading. Consider the following scenario:
- Suppose that the model returns 0 for any input.
- Suppose that 99% of the test set are non-Virginica.
- The accuracy score for this model on this particular test set will be: 0.99

In order to make sure the model is indeed a good one, we need to examine its performance further.
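
The scenario above is easy to reproduce. This is a minimal sketch with a made-up test set of 100 examples; the "model" is just an array of zeros, standing in for a classifier that always predicts non-Virginica.

```python
import numpy as np

# Hypothetical test set: 99 non-Virginica (0) and 1 Virginica (1)
y_true = np.array([0] * 99 + [1])
# A "model" that returns 0 for any input
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
num_virginica_found = int(np.sum((y_true == 1) & (y_pred == 1)))
print(accuracy)             # 0.99
print(num_virginica_found)  # 0
```

The accuracy is high even though the model never identifies a single Virginica flower.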

In [None]:
# 2. confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(df_test['Is_Virginica'], test_predictions)



<img src="https://hackernoon.com/hn-images/1*YV7zy1NGN1-HGQxY56nc_Q.png" width="600">
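
To see where the four numbers come from, the matrix can also be counted by hand. The sketch below uses made-up labels and a hypothetical helper `binary_confusion_matrix`; it follows the same layout that `sklearn.metrics.confusion_matrix` uses for binary labels (rows are actual classes, columns are predicted classes, in the order 0, 1).

```python
import numpy as np

def binary_confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows are actual classes (0, 1), columns are predicted classes (0, 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # true positives
    return np.array([[tn, fp],
                     [fn, tp]])

# Hypothetical labels and predictions
labels = [0, 0, 0, 1, 1, 1, 1, 0]
preds  = [0, 1, 0, 1, 0, 1, 1, 0]
print(binary_confusion_matrix(labels, preds))   # [[3 1], [1 3]]
```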

### 3. cross validation
**Cross validation** is an efficient method that uses limited data to obtain multiple evaluations of the model.

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F4788946%2F82b5a41b6693a313b246f02d79e972d5%2FK%20FOLD.png?generation=1608195745131795&alt=media" width="600">
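
The splitting idea behind k-fold cross validation can be sketched as follows. This is a simplified illustration, not scikit-learn's implementation (which may, for example, stratify the folds by class); the function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import numpy as np

def k_fold_indices(num_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_examples)
    folds = np.array_split(shuffled, k)   # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]                # fold i is held out for evaluation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(num_examples=10, k=3):
    print(len(train_idx), len(val_idx), sorted(val_idx.tolist()))
```

Each example is used for validation exactly once and for training in the other k - 1 rounds, which is how a limited dataset yields k separate evaluations.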

In [None]:
# Perform 5-fold cross validation
from sklearn.model_selection import cross_val_score
input_cols = iris_df.columns[:4]
print(cross_val_score(model, df_train[input_cols], df_train['Is_Virginica'],
                      cv=5)) # accuracy is returned by default

In [None]:
# Display precision score
print(cross_val_score(model, df_train[input_cols], df_train['Is_Virginica'],
                      scoring="precision", cv=3))

In [None]:
# Display recall score
print(cross_val_score(model, df_train[input_cols], df_train['Is_Virginica'],
                      scoring="recall", cv=3))

### 4. Precision, Recall, and F-1 Score
**Precision** and **recall** are two important metrics that evaluate different aspects of the model. The **F-1 score** combines precision and recall into a single number.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/525px-Precisionrecall.svg.png" width="400">

In [None]:
# precision - recall - f1 score
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(df_test['Is_Virginica'], test_predictions) # What fraction of Virginica predictions are correct?
recall = recall_score(df_test['Is_Virginica'], test_predictions) # What fraction of actual Virginica irises are correctly identified?
f1 = f1_score(df_test['Is_Virginica'], test_predictions)
print(precision, recall, f1)

In [None]:
# Calculate the scores ourselves.

# First, we need the number of true positives, false positives, and false negatives.

num_true_positives = 0
for i in range(len(array1)):
    label = array1[i]
    pred = array2[i]
    if label == 1 and pred == 1 :
        num_true_positives = num_true_positives + 1
print(num_true_positives)

num_false_positives = 0
for i in range(len(array1)):
    label = array1[i]
    pred = array2[i]
    if label == 0 and pred == 1:
        num_false_positives = num_false_positives + 1
print(num_false_positives)

precision = num_true_positives / (num_true_positives + num_false_positives)
print(precision)

In [None]:
# Exercise: Calculate the recall score on your own.

# Use a for loop to find the number of true positives
num_true_positives = ???

# Use a for loop to find the number of false negatives
num_false_negatives = ???

# Calculate recall: num_true_positives / (num_true_positives + num_false_negatives)
recall = num_true_positives / (num_true_positives + num_false_negatives)


Consider the following scenario:
- Suppose that the model returns 0 for any input.
- Suppose that there are 99 non-Virginica and 1 Virginica in the test set.
- num_true_positive: 0
- num_false_positive: 0
- num_false_negative: 1
- precision: 0 / (0 + 0) --> undefined
- recall: 0 / (0 + 1) --> 0
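
The scenario above can be checked with a short pure-Python sketch. The helper `precision_recall` is hypothetical, and returning `None` for an undefined score is a choice made here for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall); a value is None when its denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None   # undefined: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) > 0 else None      # undefined: no actual positives
    return precision, recall

# 99 non-Virginica, 1 Virginica, and a model that always predicts 0
y_true = [0] * 99 + [1]
y_pred = [0] * 100
print(precision_recall(y_true, y_pred))   # (None, 0.0)
```

Unlike accuracy, the recall of 0 exposes that this model never finds the positive class.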

### F-1 Score: A Combination of Precision and Recall

Since we expect the model to achieve high precision score and high recall score, we want to combine them into one score.

- $\text{F-1 score} = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}$

In [None]:
f1 = 2 / (1 / 0.875 + 1 / 1.0)
print(f1)

In [None]:
# high precision: 0.9
# low recall: 0.1
f1 = 2 / (1 / 0.9 + 1 / 0.1)
print(f1)

In [None]:
# low precision: 0.1
# high recall: 0.9
f1 = 2 / (1 / 0.1 + 1 / 0.9)
print(f1)

## Logistic Regression: Model Assumption
**Binary classifier model**: The logistic regression model assumes that the log-odds of the positive class is a linear function of the input features, which makes the decision boundary linear:

$\log\frac{\hat{p}}{1 - \hat{p}} = \theta_0 + \theta_1x_1 + \theta_2x_2 +\cdots + \theta_nx_n,$
- n: number of input features.
- $x_1, ..., x_n$: input features
- $\hat{p}$: the estimated probability of data belonging to the class
- $\theta_0,...,\theta_n$: parameters of the model

**Alternative format**:

$\hat{p} = \sigma(\textbf{x}\cdot\theta^T).$

- $\textbf{x} = (1, x_1, ..., x_n)$.
- $\theta = (\theta_0, \theta_1, ..., \theta_n)$.
- $\sigma(t) = \frac{1}{1+e^{-t}}$: logistic function
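
The two formats describe the same model, which a quick numerical check can illustrate. The parameter values and the input point below are made up for this sketch.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical parameters (theta_0, theta_1, theta_2) and one input point
theta = np.array([-4.0, 0.5, 1.5])
x = np.array([1.0, 3.0, 2.0])      # the leading 1 pairs with the intercept theta_0

t = x @ theta                      # theta_0 + theta_1*x_1 + theta_2*x_2
p_hat = sigmoid(t)                 # alternative format
log_odds = np.log(p_hat / (1 - p_hat))

print(t, p_hat, log_odds)          # log_odds should match t up to rounding
```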

In [None]:
# Plot the graph of logistic function

# 1. Pick a list of x coordinates (`np.linspace`)
x = np.linspace(-10, 10, 100)
# 2. For each x, find the value of the function
values = 1 / (1 + np.exp(-x)) # Since x is a numpy array, we can apply
                                # np.exp directly
# 3. Plot the x coordinates against the function values using plt.plot
plt.plot(x, values)


## Logistic Regression: Decision Rule

**Decision rule**: Pick a threshold (for example, 0.5), and then

- prediction = 1 if $\hat{p}$ $\ge$ threshold
- prediction = 0 if $\hat{p}$ < threshold

**Trade-off with threshold**:
- If threshold is chosen closer to 1, then the positive predictions are __more likely__ to be correct (fewer **false positives**). However, the negative predictions are __less likely__ to be correct.
- If threshold is chosen closer to 0, then the negative predictions are __more likely__ to be correct (fewer **false negatives**). However, the positive predictions are __less likely__ to be correct.
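
The trade-off can be seen on a small made-up example. The labels and probabilities below are hypothetical, chosen so that the two classes overlap slightly.

```python
import numpy as np

# Hypothetical true labels and estimated probabilities of Class 1
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_hat  = np.array([0.05, 0.10, 0.30, 0.45, 0.40, 0.60, 0.65, 0.80, 0.90, 0.95])

results = {}
for threshold in (0.2, 0.5, 0.7):
    y_pred = (p_hat >= threshold).astype(int)
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives
    results[threshold] = (fp, fn)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this data, raising the threshold removes false positives but introduces false negatives, and lowering it does the opposite.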

## Logistic Regression: Cost Function and Training Algorithm
For classification tasks, it is no longer appropriate to use MSE as the cost function.

**Cost (loss) function** for logistic regression:

\begin{equation}
c(\theta) = \left\{
\begin{array}{cc}
-\log(\hat{p}) & \textit{if  }y=1,\\
-\log(1-\hat{p}) & \textit{if  }y=0.
\end{array}
\right.
\end{equation}

The cost function $c(\theta)$ is:

- small if $y=1$ (data example belongs to the class) and $\hat{p}$ is close to 1.
- small if $y=0$ (data example does not belong to the class) and $\hat{p}$ is close to 0.
- a convex function no matter what $y$ is.

**Unified expression for the cost function**:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\big[y^{(i)}\log(\hat{p}^{(i)}) + (1-y^{(i)})\log(1-\hat{p}^{(i)})\big]$

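- For a single training example ($m=1$), $J(\theta)$ reduces to $c(\theta)$ for both $y=0$ and $y=1$; in general, $J(\theta)$ is the average of $c(\theta)$ over the $m$ training examples.
- There is no equivalent of the Normal Equation.
- $J(\theta)$ is a convex function, so the *gradient descent algorithm* cannot get stuck in a local minimum: with a suitable learning rate, it converges toward the global minimum.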
- $\frac{\partial J}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\big(\sigma(\textbf{x}^{(i)}\cdot\theta^T) - y^{(i)}\big)x_j^{(i)}$.
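
The cost and gradient formulas above translate almost directly into NumPy. This is a minimal sketch of batch gradient descent on a made-up one-feature dataset; the learning rate and iteration count are assumed values picked by hand, and this is not how scikit-learn's solvers work internally.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(theta, X, y):
    """Average log loss J(theta); X already has a leading column of 1s."""
    p_hat = sigmoid(X @ theta)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def gradient(theta, X, y):
    """Vector of partial derivatives dJ/dtheta_j."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Hypothetical data: the classes overlap, so the minimum of J is finite
x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])
X = np.column_stack([np.ones_like(x1), x1])

theta = np.zeros(2)
learning_rate = 0.1      # assumed value, chosen by hand
initial_cost = cost(theta, X, y)
for _ in range(2000):
    theta = theta - learning_rate * gradient(theta, X, y)

print(initial_cost, cost(theta, X, y), theta)
```

Starting from $\theta = 0$, every $\hat{p}$ is 0.5, so the initial cost is $\log 2$; the cost after training should be lower.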

## Logistic Regression: Model Visualization
- Create a grid of points from a list of x coordinates and y coordinates.
- Use the model to obtain prediction probability on each point from the grid
- Find the points whose predicted probability is close to the threshold (these lie near the decision boundary).
- Plot the grid.

In [None]:
# Train a new logistic regression model on petal length and petal width only
model2 = LogisticRegression(solver='lbfgs')
model2.fit(df_train.iloc[:, 2:4], df_train['Is_Virginica'])

In [None]:
# 1. Create a grid of points
x0, x1 = np.meshgrid(np.linspace(0, 7, 100),
                     np.linspace(0, 2.7, 100))
print(x0.shape, x1.shape)

In [None]:
# Illustration of a meshgrid
x_coordinates = [1, 2, 3, 4]
y_coordinates = [10, 20, 30, 40]
xx, yy = np.meshgrid(x_coordinates, y_coordinates)
# print(xx)
# print(yy)
plt.plot(xx, yy, 'b.')

In [None]:
# 2. Obtain prediction probabilities
X_new = np.hstack([x0.reshape([-1, 1]), x1.reshape([-1, 1])])
y_new_prob = model2.predict_proba(X_new)

In [None]:
# 3. Find boundary points.
# Which points give 0.5 probability?
indices = np.where((y_new_prob[:, 1] > 0.49) & (y_new_prob[:, 1] < 0.51))
X_boundary = X_new[indices]

In [None]:
# 4. Plot the boundary
plt.plot(X_boundary[:, 0], X_boundary[:, 1])
index_virginica = (iris_df['Is_Virginica'] == 1)
index_not_virginica = (iris_df['Is_Virginica'] == 0)
plt.scatter(iris_df.loc[index_virginica, 'petal length (cm)'],
            iris_df.loc[index_virginica, 'petal width (cm)'],
            c='yellow',
            label='Virginica')
plt.scatter(iris_df.loc[index_not_virginica, 'petal length (cm)'],
            iris_df.loc[index_not_virginica, 'petal width (cm)'],
            c='purple',
            label='Not Virginica')
plt.legend()

In [None]:
# 5. Plot probabilities
# Note: column 0 of y_new_prob is P(not Virginica); column 1 is P(Virginica).
plt.scatter(X_new[:, 0], X_new[:, 1], c=y_new_prob[:, 0])
plt.colorbar()
plt.scatter(iris_df.loc[index_virginica, 'petal length (cm)'],
            iris_df.loc[index_virginica, 'petal width (cm)'],
            c='yellow',
            label='Virginica')
plt.scatter(iris_df.loc[index_not_virginica, 'petal length (cm)'],
            iris_df.loc[index_not_virginica, 'petal width (cm)'],
            c='purple',
            label='Not Virginica')
plt.legend()

# Build a Multi-Class Classifier with Logistic Regression

Now consider a classifier for more than 2 classes. Instead of outputting $p$ and $1-p$, this classifier will need to output $p_1, p_2, ..., p_n$, where $p_i$ is the probability of Class $i$. The output must satisfy:
1. Each $p_i$ takes value in $[0, 1]$.
2. The sum of all values must be 1.
3. If the true class of an object is k, then we want $p_k\approx 1$ and $p_i\approx 0$ for all $i\neq k$.

Requirements 1 and 2 are guaranteed if we use the following **softmax** transformation:
$$
(t_1, t_2, ..., t_n) \longmapsto \left(\frac{e^{t_1}}{e^{t_1} + e^{t_2} +\cdots + e^{t_n}}, \frac{e^{t_2}}{e^{t_1} + e^{t_2} +\cdots + e^{t_n}}, ..., \frac{e^{t_n}}{e^{t_1} + e^{t_2} +\cdots + e^{t_n}}\right)
$$

In [None]:
t1 = 3.2 
t2 = 1.2
t3 = -1_000_000

e_t1 = np.exp(t1)
e_t2 = np.exp(t2)
e_t3 = np.exp(t3)

print(e_t1, e_t2, e_t3)

p1 = e_t1 / (e_t1 + e_t2 + e_t3)
p2 = e_t2 / (e_t1 + e_t2 + e_t3)
p3 = e_t3 / (e_t1 + e_t2 + e_t3)

print(p1, p2, p3)

print("Sum:", np.sum([p1, p2, p3]))
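
The same computation can be wrapped in a reusable function. The sketch below adds one common refinement that is not part of the cell above: subtracting the largest score before exponentiating, which leaves the result unchanged mathematically but avoids overflow for large scores. The function name `softmax` is chosen here for illustration.

```python
import numpy as np

def softmax(t):
    """Map a vector of scores to probabilities that sum to 1."""
    t = np.asarray(t, dtype=float)
    shifted = t - np.max(t)        # subtracting a constant leaves the result unchanged
    exp_t = np.exp(shifted)
    return exp_t / np.sum(exp_t)

print(softmax([3.2, 1.2, -1_000_000]))       # same scores as the cell above
print(softmax([1000.0, 1001.0, 1002.0]))     # np.exp(1000.0) on its own would overflow
```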


In [None]:
# Use the Iris dataset as an example
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
iris_df['target'] = iris['target']
iris_df.head()

In [None]:
# Let's switch class label 0 and 1
def label_switch(x):
    if x == 0:
        return 1
    if x == 1:
        return 0
    else:
        return x

iris_df['target'] = iris_df['target'].apply(label_switch)
iris_df.head()

In [None]:
# Split the data into training set, validation set, and test set.
from sklearn.model_selection import train_test_split
training_val_set, test_set = train_test_split(iris_df, test_size=0.2)
training_set, val_set = train_test_split(training_val_set, test_size=0.25)
# print(training_val_set.shape)
print(training_set.shape)
print(val_set.shape)
print(test_set.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
input_cols = iris_df.columns[:4]
model = LogisticRegression(solver="newton-cg")
model.fit(training_set[input_cols], training_set['target'])

In [None]:
from sklearn.metrics import accuracy_score
predictions = model.predict(test_set[input_cols])
accuracy = accuracy_score(test_set['target'], predictions)
print(accuracy)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_set['target'], predictions)