# Topic 25-Pt 1: Intro to Logistic Regression 

- onl01-dtsc-ft-022221
- 05/06/21

## Questions?

- 

## Announcements

- Blog Post Deadline Extended until Monday at 10 AM EST.

## Overview

- For Today:
    - Types of machine learning.
        - Supervised vs Unsupervised Learning
        - Regression vs Classification 
    - From Linear to Logistic Regression - Theory
    - Applying Logistic Regression with `scikit-learn`
        - Proper preprocessing with train-test-split.
    - Evaluating Classifiers
        - Confusion Matrices
        - Accuracy, ~~Precision, Recall, F1-Score~~



- For Next Class:
    - Classification Metrics 
        - Confusion Matrices
        - Accuracy, Precision, Recall, F1-Score
        - ROC curve and AUC
    - Class Imbalance Problems
    - Writing functions to evaluate classification models

# Types of Machine Learning Models

<img src="https://raw.githubusercontent.com/jirvingphd/fsds_pt_100719_cohort_notes/master/Images/ai_machine_learning_deep_learning.png">

## Intro to Supervised Learning

> "The term **_Supervised Learning_** refers to a class of machine learning algorithms that can "learn" a task through **_labeled training data_**."

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-intro-to-supervised-learning-v2-1-online-ds-pt-100719/master/images/new_ml-hierarchy.png" width=60%>

- All supervised learning models fall into one of two categories:
    - Regressors/Regression
    - Classifiers/Classification

### Regression

Trying to find the **relationship** between the features and a continuous target in order to predict a specific numeric value.

- Examples of regressions:
    - House prices
    - Salary
    - Reviews/Ratings

### Classification

Trying to identify what features can predict which class a particular observation/row belongs to.
- Can be a "binary classification" 
    - "yes" or "no"
    - Survived or died.
    - Diabetic or not-diabetic
- Can also be a "multiclass classification"
    - Which type of flower?
    - Will a football game end with one team winning, the other team winning, or a tie?


# From Linear Regression to Logistic Regression


<img src="https://raw.githubusercontent.com/jirvingphd/online-dtsc-pt-041320-cohort-notes/master/assets/images/logistic_vs_linear.jpg">

## Recall Linear Regression

### Formula

$$ \large \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_N x_N = \sum_{i=0}^{N} \beta_i x_i, \quad x_0 = 1 $$

- The output is the **predicted value** of the target.

## Classification: Use Logistic Regression

- The output is the **probability** of belonging to a particular group.

- Visual Example:
    - https://www.desmos.com/calculator/y2ilpxiqys

Transform the linear regression output by passing it through the sigmoid (logistic) function:

$$ \large \hat y = \sum_{i=0}^{N} \beta_i x_i $$

$$\large P = \displaystyle \frac{1}{1+e^{-\hat y}} = \frac{1}{1+e^{-\sum_{i=0}^{N} \beta_i x_i}} $$

$$ \large = \frac{1}{1+e^{-\beta_0}e^{-\beta_1 x_1}\ldots e^{-\beta_N x_N}} $$
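
To make the transformation concrete, here is a minimal sketch of the sigmoid applied to a linear output. The coefficients `beta_0` and `beta_1` and the inputs in `x` are made-up values for illustration only; they are not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued linear output z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: an intercept and a single slope
beta_0, beta_1 = -3.0, 0.1
x = np.array([0.0, 30.0, 60.0])

y_hat = beta_0 + beta_1 * x   # linear part: can be any real number
prob = sigmoid(y_hat)         # logistic part: always between 0 and 1

print("linear output:", y_hat)
print("probability:  ", prob.round(3))
```

A linear output of 0 maps to a probability of exactly 0.5, and large negative or positive outputs approach 0 or 1 without ever reaching them.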

# Implementing Logistic Regression

## Predict Passenger Survival on the Titanic with `scikit-learn`

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 


# import statsmodels.api as sm

In [None]:
url = "https://raw.githubusercontent.com/jirvingphd/dsc-dealing-missing-data-lab-online-ds-ft-100719/master/titanic.csv"
df = pd.read_csv(url,index_col=0,na_values='?')
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
df = df[relevant_columns]
df.head()

In [None]:
df['Survived'].value_counts(normalize=True,dropna=False)

In [None]:
fig,ax= plt.subplots(ncols=2,figsize=(10,4))
sns.scatterplot(data=df, x='Fare',y='Survived',ax=ax[0])
sns.scatterplot(data=df, x='Age',y='Survived',ax=ax[1])
fig.suptitle('X Features vs Survived')

In [None]:
fig,ax= plt.subplots(ncols=2,figsize=(10,4))
sns.regplot(data=df, x='Fare',y='Survived',ax=ax[0])
sns.regplot(data=df, x='Age',y='Survived',ax=ax[1])
fig.suptitle('X Features vs Survived - Regression')

### Q: What are the preprocessing steps I need to perform before I create the model?

- Recast data types
- Train-test-split
- Fill in or drop missing/null values
- Feature Selection / Feature Engineering (interaction terms)
- Handling categorical variables
    - One Hot Encoding 
    - Label Encoding
- Handling Outliers (maybe)
- Normalizing/Standardizing our data

- **Multicollinearity (does it still matter as much for Logistic?)**
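
As a rough sketch of how several of these steps fit together, the example below imputes, one-hot encodes, and scales a tiny made-up dataset. The key idea is that every transformer is fit on the training set only and then applied to the test set. The data values and the helper names (`median_imputer`, `freq_imputer`, `cat_names`) are assumptions for illustration, not the class solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up stand-ins for X_train / X_test (not the real Titanic data)
X_train = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
X_test = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Fare": [13.0, 30.0],
    "Sex": ["female", "male"],
    "Embarked": ["Q", np.nan],
})
num_cols = ["Age", "Fare"]
cat_cols = ["Sex", "Embarked"]

# Impute: fit on the training data only, then transform both sets
median_imputer = SimpleImputer(strategy="median")
train_num = median_imputer.fit_transform(X_train[num_cols])
test_num = median_imputer.transform(X_test[num_cols])

freq_imputer = SimpleImputer(strategy="most_frequent")
train_cat = freq_imputer.fit_transform(X_train[cat_cols])
test_cat = freq_imputer.transform(X_test[cat_cols])

# Encode: categories unseen in training (e.g. "Q") become all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
train_cat = encoder.fit_transform(train_cat).toarray()
test_cat = encoder.transform(test_cat).toarray()
cat_names = [f"{col}_{val}"
             for col, vals in zip(cat_cols, encoder.categories_)
             for val in vals]

# Scale: the mean and standard deviation come from the training data only
scaler = StandardScaler()
train_num = scaler.fit_transform(train_num)
test_num = scaler.transform(test_num)

# Combine numeric and categorical columns back into DataFrames
X_train_tf = pd.DataFrame(np.hstack([train_num, train_cat]),
                          columns=num_cols + cat_names, index=X_train.index)
X_test_tf = pd.DataFrame(np.hstack([test_num, test_cat]),
                         columns=num_cols + cat_names, index=X_test.index)
print(X_train_tf.round(2))
```

Fitting on the training set alone keeps information from the test set from leaking into the preprocessing, which is the reason the train-test split comes before imputing and scaling in the list above.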



In [None]:
## Check out the .info
df.info()

In [None]:
## Check Object cols value_counts
display(df['Embarked'].value_counts(dropna=False),
        df['Sex'].value_counts(dropna=False))

### Preprocessing

In [None]:
## Separate X and y and train-test-split
target = 'Survived'

y = df[target]
X = df.drop(target, axis=1)

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


In [None]:
## Check for nulls in training set


In [None]:
## Specify which values to impute with which method

## Most frequent

## Fill with median


In [None]:
## Copying X_train and X_test as start of X_train_tf,X_test_tf


In [None]:
## Impute the columns with most-frequent value


## Verify it worked


In [None]:
## Impute cols with median

## Verify it worked


In [None]:
## Specifying which cols to encode and which to scale.
#  make cat_cols and num_cols


In [None]:
## Encode cat_cols


In [None]:
## Scaling Num_cols


In [None]:
## Combine Num and Cat Cols


In [None]:
## Check the .describe of X_train_tf for scaling


## Fitting a Logistic Regression with `scikit-learn`

In [None]:
## Fit a logistic regression model with defaults


> ### But how do we know how GOOD it is?

In [None]:
## Check the .score of the model
log_reg.score(X_train_tf,y_train)

In [None]:
## Get the model's .score for training and test set 
print(f"Training Score:\t{log_reg.score(X_train_tf,y_train):.2f}")
print(f"Test Score:\t{log_reg.score(X_test_tf,y_test):.2f}")

> But what "score" is this?

In [None]:
## Get Predictions for training and test data to check metrics functions


In [None]:
## is it r-squared?


> Hmmm...it's not $R^{2}$

In [None]:
## Try root mean squared error (RMSE)


> Hmmm...it's not $\text{RMSE}$

In [None]:
## Try accuracy_score


> Ah-ha! The default `.score` for a classification model is **accuracy**. 
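
A quick way to confirm this is to compare `.score` against `metrics.accuracy_score` directly. The sketch below uses a synthetic dataset from `make_classification` instead of the Titanic data, so the exact numbers are not meaningful; only the equality is.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_hat_test = log_reg.predict(X_test)

score = log_reg.score(X_test, y_test)
accuracy = metrics.accuracy_score(y_test, y_hat_test)
r_squared = metrics.r2_score(y_test, y_hat_test)

print(f".score:\t\t{score:.3f}")
print(f"accuracy:\t{accuracy:.3f}")
print(f"r-squared:\t{r_squared:.3f}")
```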

In [None]:
### Getting our model's coefficients
## Our function from last class
def get_coefficients(model,X_train):
    coeffs = pd.Series(model.coef_.flatten(), index=X_train.columns)
    coeffs['intercept'] = model.intercept_[0]
    return coeffs

In [None]:
## get the model's coefficients


### Understanding Our Model's Mistakes

- For classification tasks, it can be extremely helpful to examine a "Confusion Matrix" to understand how our model is wrong. 

In [None]:
## Use metrics.confusion_matrix


In [None]:
## Use metrics.plot_confusion_matrix (removed in scikit-learn 1.2; newer versions use metrics.ConfusionMatrixDisplay.from_estimator)


>- The confusion matrix separates the correct (true) predictions from the incorrect (false) ones for both the positive class (1) and the negative class (0).

- **_True Positives (TP)_**: The number of observations where the model predicted the person has the disease (1), and they actually do have the disease (1).

- **_True Negatives (TN)_**: The number of observations where the model predicted the person is healthy (0), and they are actually healthy (0).

- **_False Positives (FP)_**: The number of observations where the model predicted the person has the disease (1), but they are actually healthy (0). 

- **_False Negatives (FN)_**: The number of observations where the model predicted the person is healthy (0), but they actually have the disease (1).


In [None]:
## Remake the conf matrix (raw counts)


```
[TN,FP],
[FN,TP]
```
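
The `[TN, FP], [FN, TP]` layout above can be checked on a small example. This is a sketch with made-up labels (`y_true`, `y_pred`), not the Titanic predictions; for binary 0/1 labels, `cm.ravel()` unpacks the four cells row by row.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predictions for ten observations
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# Rows are the true classes, columns are the predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Flatten row by row: first row is [TN, FP], second row is [FN, TP]
TN, FP, FN, TP = cm.ravel()
print(f"TN={TN}, FP={FP}, FN={FN}, TP={TP}")
```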

In [None]:
## Slice out TN/FP/etc from cm
TN = None
FP = None
FN = None
TP = None

In [None]:
## Make sure we got the order right
print(TN,FP)
print(FN,TP)

### Classification Metrics are based on the confusion matrices of our model

#### Accuracy

$$ \large \text{Accuracy} = \frac{\text{Number of True Positives + True Negatives}}{\text{Total Observations}} $$

> "Out of all the predictions our model made, what percentage were correct?"
- "Accuracy is the most common metric for classification. It provides a solid holistic view of the overall performance of our model."

#### When to use?
- **Accuracy** is good for non-technical audiences (but can be misleading with imbalanced classes)
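
As a worked instance of the formula, the sketch below computes accuracy by hand from a confusion matrix and compares it to `metrics.accuracy_score`. The labels are made up for illustration. The correct predictions (TN and TP) sit on the diagonal of the matrix, so their sum is its trace.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predictions for ten observations
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

cm = metrics.confusion_matrix(y_true, y_pred)

# (TP + TN) / total observations
manual_accuracy = np.trace(cm) / cm.sum()
sklearn_accuracy = metrics.accuracy_score(y_true, y_pred)

print(f"manual:  {manual_accuracy:.2f}")
print(f"sklearn: {sklearn_accuracy:.2f}")
```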


In [1]:
## calculate accuracy manually


In [2]:
## compare against the accuracy_score


### Sidebar: Normalized Confusion Matrices

In [3]:
## Try using normalize='true' for confusion matrix


In [4]:
## Try using normalize='true' for plot confusion matrix


### How do I know if my accuracy score is good?

> Does your model predict better than chance, or better than simply guessing the most common class?
- Compare your accuracy to your normalized value counts for y
- Compare your model against a `DummyClassifier` (https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)
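
Here is a minimal sketch of that baseline comparison, assuming a made-up target that is 80% class 0 and 20% class 1. With `strategy="most_frequent"`, the dummy model always predicts the majority class, so its accuracy equals the majority class's share of the data. A real model has to beat that number to be useful.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced target: 80 zeros and 20 ones
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)

baseline_accuracy = dummy.score(X, y)
majority_share = np.bincount(y).max() / len(y)

print(f"Dummy accuracy:\t{baseline_accuracy:.2f}")
print(f"Majority share:\t{majority_share:.2f}")
```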

In [None]:
## Check the class balance for the y_train


In [None]:
## Check the class balance for y_test


In [None]:
from sklearn.dummy import DummyClassifier
## Make and fit  dummy classifier


In [None]:
## Get the model's .score
print(f"Training Score:\t{model.score(X_train_tf,y_train):.2f}")
print(f"Test Score:\t{model.score(X_test_tf,y_test):.2f}")

In [5]:
## Check the confusion matrix


> But accuracy isn't the best metric when you have imbalanced classes. 
- Next class we will introduce more classification metrics

# Intro to Part 2: Classification Metrics / Evaluating Classifiers 

> [The 5 Classification Evaluation metrics every Data Scientist must know](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)

## Evaluation Metrics

### Accuracy

$$ \large \text{Accuracy} = \frac{\text{Number of True Positives + True Negatives}}{\text{Total Observations}} $$

> "Out of all the predictions our model made, what percentage were correct?"
- "Accuracy is the most common metric for classification. It provides a solid holistic view of the overall performance of our model."

#### When to use?
- **Accuracy** is good for non-technical audiences (but can be misleading with imbalanced classes)


In [None]:
metrics.accuracy_score(y_test, y_hat_test)

### Precision

> "**_Precision_** measures what proportion of predicted Positives is truly Positive."


$$ \large \text{Precision} = \frac{\text{Number of True Positives}}{\text{Number of Predicted Positives}} $$


#### When to use?
- **Use precision** when acting on a positive prediction is costly, so false positives are expensive.
   - e.g. Allocating resources/interventions for prisoners who are at-risk for recidivism. 

In [None]:
metrics.precision_score(y_test, y_hat_test)

### Recall

> "**_Recall_** indicates what percentage of the classes we're interested in were actually captured by the model."

$$ \large \text{Recall} = \frac{\text{Number of True Positives}}{\text{Number of Actual Total Positives}} $$ 


#### When to use?
- **Use recall** when the number of true positives/opportunities is small and you don’t want to miss one.
    - e.g. cancer diagnosis. (telling someone they do not have cancer when they actually do can be fatal)
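
To see how precision and recall differ on the same predictions, the sketch below computes both by hand from the counts and checks them against scikit-learn. The labels are made up: there are four actual positives, and the model predicts three positives, two of which are correct.

```python
from sklearn import metrics

# Made-up labels: 4 actual positives, 3 predicted positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = TP / (TP + FP)  # of the predicted positives, how many were right?
recall = TP / (TP + FN)     # of the actual positives, how many were found?

print(f"precision: {precision:.2f} (sklearn: {metrics.precision_score(y_true, y_pred):.2f})")
print(f"recall:    {recall:.2f} (sklearn: {metrics.recall_score(y_true, y_pred):.2f})")
```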

In [None]:
metrics.recall_score(y_test, y_hat_test)

<img src='https://raw.githubusercontent.com/jirvingphd/fsds_100719_cohort_notes/master/images/precisionrecall.png' width=10%>

### $F_1$ Score

F1 score represents the **_Harmonic Mean of Precision and Recall_**.  In short, this means that the F1 score cannot be high without both precision and recall also being high. When a model's F1 score is high, you know that your model is doing well all around. 

> Harmonic Mean: "the reciprocal of the arithmetic mean of the reciprocals of a given set of observations." - *[Wikipedia](https://en.wikipedia.org/wiki/Harmonic_mean)*

#### Arithmetic Mean:

$$\large \bar{X} = \frac{a+b+c}{n} $$

#### Harmonic Mean:

$$ \large \bar{X} = \frac{n}{\frac{1}{a}+ \frac{1}{b}+ \frac{1}{c}}$$


**The formula for F1 score is:**

> $$ \text{F1 score} = \frac{2}{\text{Precision}^{-1} + \text{Recall}^{-1}} = 2\ \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
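
A short worked example shows why the harmonic mean is used. The precision and recall values here are made up, and `f1_from` is a hypothetical helper, not a scikit-learn function. With a precision of 0.9 and a recall of 0.1, the arithmetic mean looks respectable while the F1 score stays low.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean: 2 * P * R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)
# Equivalent reciprocal form: 2 / (1/P + 1/R)
f1_reciprocal = 2 / (1 / precision + 1 / recall)

print(f"arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1 (harmonic):   {f1:.2f}")
```

The harmonic mean is pulled toward the smaller of the two values, which is why F1 cannot be high unless both precision and recall are high.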

#### When to use?
- **F1 score** is often the most informative single metric for overall model quality, because it balances precision and recall.
- BUT it is the most difficult to explain to a non-technical audience.

## Which metric to use?

- **When in doubt, use them all!** 
    - `metrics.classification_report`
 

In [None]:
print(metrics.classification_report(y_test,y_hat_test,target_names=['Died','Survived']))

In [None]:
round(metrics.recall_score(y_test, y_hat_test), 2)

- **But some good rules of thumb:**
    - **Accuracy** is good for non-technical audiences (but can be misleading with imbalanced classes)
    
    - **Use recall** when the number of true positives/opportunities is small and you don’t want to miss one.
        - e.g. cancer diagnosis. (telling someone they do not have cancer when they actually do can be fatal)
    - **Use precision** when acting on a positive prediction is costly, so false positives are expensive.
       - e.g. Allocating resources/interventions for prisoners who are at-risk for recidivism. 

- **F1 score** is often the most informative single metric for overall model quality, but it is the most difficult to explain to a non-technical audience.

# APPENDIX 

## IMPORTANT NOTE ABOUT PACKAGE VERSIONS

### scikit-learn & matplotlib

In [None]:
#### scikit-learn version
## Run conda list to verify which versions are installed and how
# %conda list scikit-learn

- You will need scikit-learn version 0.23 or later to have all of the tools covered in the lessons.
    > Note: sklearn is listed as `scikit-learn`.<br>To update: 
    `conda update scikit-learn`
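
The installed version can also be checked from inside Python. This is a small sketch; the names `sk_version` and `is_recent_enough` are made up for illustration, and the comparison only looks at the major and minor version numbers.

```python
import sklearn

print("scikit-learn version:", sklearn.__version__)

# Compare (major, minor) against the 0.23 minimum mentioned above
sk_version = tuple(int(part) for part in sklearn.__version__.split(".")[:2])
is_recent_enough = sk_version >= (0, 23)
print("0.23 or later:", is_recent_enough)
```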
 

In [None]:
## If you have a version older than 0.23, run this command
# %conda update scikit-learn

In [None]:
#### matplotlib version
# %conda list matplotlib

- You will want to update matplotlib to fix errors with your confusion matrix plots
    > `pip install -U matplotlib`

In [None]:
# !pip install -U matplotlib