# Week 4 Learning Objectives

-----

## Classification & Logistic Regression

### Distinguish between regression and classification problems.
Supervised machine learning problems fall into two broad categories: regression and classification. In both cases we want to make predictions about observations that haven't been seen based on information from observations that have been seen.  

**Regression** problems seek to predict a value, such as the number of minutes until a bus arrives, the number of fishforks set on a table, or the revenue of a division for the next quarter.  

**Classification** problems, on the other hand, try to predict membership in a group, where groups are mutually exclusive. In the case of "Mean Girls," we might look to predict "Plastic" or "Non-Plastic." Any data point we have has to fall into one and only one of these classes.

### Understand how Logistic Regression is similar to & different from Linear Regression.

**Similarities**
- Both are parametric models.
- Both use a linear equation to arrive at a prediction.

**Differences**
- Linear Regression is used to predict a continuous value.
- Logistic Regression is used to predict a probability of a class.  
- Logistic Regression has no closed-form solution, so it is fit iteratively (for example with gradient descent), and it often benefits from regularization.


### Understand the math behind the Logit Function
- The log function finds the exponent needed to raise a base value to a target value.
- $ log_{10}(100) = 2 $ states that to get from the base $10$ to the value $100$ you need to raise the base to the power of $2$.
- The logit function applies this same log idea (using the natural log, base $e$) to convert odds to a log odds value. $logit(p) = log(\frac{p}{1-p})$



### Code and Calculate the Odds and Log Odds ratios.
- **Odds** = $\frac{p}{1-p}$
- **Log Odds** = $log(\frac{p}{1-p})$

_Where $p$ is the probability of some observed event occurring._
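
The two formulas above can be coded directly. This is a minimal sketch that assumes the natural log (the base used by logistic regression); the helper names `odds`, `log_odds`, and `inverse_logit` are made up for illustration.

```python
import math

def odds(p):
    """Odds of an event that occurs with probability p (0 <= p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds, also called the logit (0 < p < 1)."""
    return math.log(odds(p))

def inverse_logit(z):
    """Convert a log odds value back into a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
print(odds(p))                     # about 4: the event is 4 times as likely to occur as not
print(log_odds(p))                 # about 1.386
print(inverse_logit(log_odds(p)))  # about 0.8, back where we started
```

A probability of 0.5 gives odds of 1 and log odds of 0, which is why the sign of the log odds tells you which outcome is more likely.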

### Understand how to interpret the coefficients of a Logistic Regression
- First off, the Logistic Function we're trying to fit is $\frac{1}{1+e^{-(\beta_0 + \sum\limits_{i=1}^j \beta_i X_i)}}$
- The exponent applied to Euler's number $e$ is a linear equation. Its coefficients are optimized iteratively (for example with gradient descent, usually with regularization), and the linear equation itself equals the log odds of the outcome.
- Because there is a linear equation in the Logistic Formula, we can interpret the coefficients.
- In sklearn, after fitting a LogisticRegression model named logreg, you can get the coefficients with `logreg.coef_`
- Suppose the coefficient of a feature is 3.2.

**Interpretation:** A 1-unit increase in the feature is associated with a 3.2-unit increase in the log odds of the outcome variable, all other things held constant. Equivalently, the odds are multiplied by $e^{3.2} \approx 24.5$.
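
To see what a 3.2-unit change in log odds means for the odds themselves, here is a small sketch with a single feature. The intercept of -1.0 is a hypothetical value chosen only for illustration; it is not from a fitted model.

```python
import math

coefficient = 3.2   # the example coefficient from above
intercept = -1.0    # hypothetical intercept, for illustration only

def predicted_probability(x):
    """Logistic function applied to the linear equation intercept + coefficient * x."""
    log_odds = intercept + coefficient * x
    return 1 / (1 + math.exp(-log_odds))

def predicted_odds(x):
    p = predicted_probability(x)
    return p / (1 - p)

# Exponentiating a coefficient turns a change in log odds into a multiplier on the odds
odds_ratio = math.exp(coefficient)
print(round(odds_ratio, 2))  # 24.53

# A 1-unit increase in x multiplies the odds by that same factor
print(predicted_odds(1) / predicted_odds(0))
```

The multiplier on the odds is the same wherever you start, but the change in predicted probability is not, which is why coefficients are interpreted on the log odds (or odds) scale.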


### Know the Benefits of the Logistic Regression as a Classifier
- It's a classification algorithm that shares similar properties to linear regression.
- It's efficient and is a very common classification algorithm.
- The coefficients in a logistic regression model are interpretable (albeit somewhat complex); each one represents the change in log-odds associated with a one-unit change in its input variable.

------

## KNN and Classification Metrics

### Intuition behind the KNN algorithm
- KNN is a simple ML model that takes known values into consideration and makes a prediction based on the "K" most similar (closest) known observations.
- KNN is a non-parametric, distance-based algorithm with 1 very important hyperparameter, "K", and several others of note.
    - `n_neighbors` : This is the "K" argument in SKLearn, which denotes the number of known observations to consider when making a prediction.
    - `weights`     : Allows you to adjust how much each neighbor contributes to the prediction.
    - `metric`      : Allows you to change the distance metric that is used to find the closest points, e.g. Minkowski (which is Euclidean distance with the default `p=2`) or Manhattan.  
    [Available Distance Metrics](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html)


### Implementing KNN with sklearn
- `from sklearn.neighbors import KNeighborsClassifier`
- There is a complementary regression model called `KNeighborsRegressor` which uses the same principle of identifying the most similar observations and taking an average of their target to create a prediction.
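
A minimal sketch of fitting `KNeighborsClassifier` with the hyperparameters described above. The data is synthetic (from `make_classification`) and stands in for a real dataset; the scaling step is an addition, since a distance-based model is sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own X and y)
X, y = make_classification(n_samples=300, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# KNN is distance based, so put the features on a common scale first
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# n_neighbors is "K"; minkowski with p=2 is Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", metric="minkowski", p=2)
knn.fit(X_train_sc, y_train)

predictions = knn.predict(X_test_sc)
test_accuracy = knn.score(X_test_sc, y_test)
print(test_accuracy)
```

Swapping in `KNeighborsRegressor` follows the same pattern with a continuous target.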


### Define true positives, true negatives, false positives, false negatives.  
When working with classification problems, it's important to state a reference (positive) class against which Positives and Negatives are defined. For binary problems this is the distinction between "True & False", "Happens & Doesn't Happen", or "Is & Isn't". In the examples below, the reference class is "Cancerous".
- True Positives : These are observations in the reference class that your model _correctly_ predicted as being in said reference class.
    - Predicted Cancerous : Is Cancerous
- True Negatives : These are observations NOT in the reference class that your model _correctly_ predicted as NOT being in said reference class.
    - Predicted Benign  : Is Benign
- False Positives (Type 1): These are observations NOT in the reference class that your model _incorrectly_ predicted as being in said reference class.
    - Predicted Cancerous  : Is Benign
- False Negatives (Type 2): These are observations in the reference class that your model _incorrectly_ predicted as NOT being in said reference class.
    - Predicted Benign  : Is Cancerous

### Construct a confusion matrix.
- [SKLearn Confusion Matrix Documentation](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)

|n = 175  | Predicted True | Predicted False |
|:---|:---:|:---:|
|**Is True**|91 | 13 |
|**Is False**| 29 | 42 |
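
As a sketch of how this table could be produced with sklearn, the label arrays below are rebuilt from the counts in the table (they are not real data). One thing to watch for: `confusion_matrix` orders rows and columns by sorted label by default, so the negative class would come first unless `labels` is passed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild label arrays that match the counts in the table (1 = True, 0 = False)
y_true = np.array([1] * 91 + [1] * 13 + [0] * 29 + [0] * 42)
y_pred = np.array([1] * 91 + [0] * 13 + [1] * 29 + [0] * 42)

# Rows are actual classes, columns are predicted classes.
# labels=[1, 0] puts the positive class first so the layout matches the table.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm

print(cm)
print(tp, fn, fp, tn)
```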


### Calculate accuracy, misclassification rate, sensitivity, specificity, precision.

_We'll use the above confusion matrix for these examples_


- **Accuracy** : Proportion of _correctly_ labeled predictions.

    #### $\frac{TP+TN}{n} = \frac{91+42}{175} = 0.76$


- **Misclassification Rate** : Proportion of _incorrectly_ labeled predictions. (1 - accuracy)

    #### $\frac{FP+FN}{n} = \frac{29+13}{175} = 0.24$


- **Sensitivity** : Of the actual positive observations, how many did we correctly identify? This is also known as "Recall" or "True Positive Rate"

    #### $\frac{TP}{TP+FN} = \frac{91}{91+13} = 0.875$


- **Specificity** : Of the actual negative observations, how many did we correctly predict?

    #### $\frac{TN}{TN+FP} = \frac{42}{42+29} = 0.592$


- **Precision** : How many of our "Positive" predictions are correct?

    #### $\frac{TP}{TP+FP} = \frac{91}{91+29} = 0.758$
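
The same calculations in code, using the four counts from the confusion matrix above. This is plain arithmetic; the variable names are only for illustration.

```python
# Counts taken from the confusion matrix above
tp, fn, fp, tn = 91, 13, 29, 42
n = tp + fn + fp + tn

metrics = {
    "accuracy": (tp + tn) / n,
    "misclassification rate": (fp + fn) / n,
    "sensitivity (recall)": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "precision": tp / (tp + fp),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```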

Sensitivity & Specificity  |  Precision & Recall
:-------------------------:|:-------------------------:
![](https://raw.githubusercontent.com/dstrodtman/dsi_public/master/images/Sen-Spe.png)  |  ![](https://raw.githubusercontent.com/dstrodtman/dsi_public/master/images/Pre-Rec.png)

### Compare the "cost" or "badness" of false positives versus false negatives.
- Using/visualizing the ROC is a good way to approach the "Cost" associated with _your specific problem_. In almost all situations, your model will never be perfect and you will have some misclassifications.  Depending on your problem, you need to choose whether to optimize to reduce False Positives or False Negatives.  

#### Optimizing to reduce False Negatives.  
  
- Say you are trying to predict whether a tumor is cancerous or benign.  You will want to decrease False Negatives. These are observations your model predicts as being "Non-Cancerous", but turn out to be "Cancerous".  
  
- On the other hand predicting "Cancerous" when a tumor is "benign" is not as detrimental to the individual who is being assessed.


#### Optimizing to Reduce False Positives. 
  
- Say you are a loan provider and are trying to predict whether or not an applicant will repay their loan (the reference class here is "Will Repay").  You will probably want to optimize your model so you make fewer loans to people who will default on them.  

- In this scenario a False Positive is an individual you would provide a loan to who then defaults, and a False Negative is someone not given a loan who would have repaid it.   

- In this imaginary company having a loan defaulted on is more detrimental to the business than missing a loan opportunity.

### Describe the inverse relationship between sensitivity and specificity.
- Sensitivity : Of all the Class 1 observations, how many did we correctly identify?
- Specificity : Of all the Non-Class 1 observations, how many did we correctly Identify?

Typically there is a trade-off: some observations fall in a gray area where the classes are so evenly mixed that they are nearly impossible to fully separate.  Maximizing sensitivity, so that all of the Class 1 observations are correctly classified, typically results in having more false positives. 

The number of actual negatives ($TN + FP$) is fixed, so every additional False Positive is one fewer True Negative. Increasing FPs therefore shrinks the numerator of Specificity and reduces Specificity. This is bad! We want high Specificity!

On the other hand, decreasing False Positives increases the number of True Negatives and thus increases the Specificity. This is good! We want high Specificity!

However, these changes come at the cost of increasing the number of False Negatives (and thus decreasing the Sensitivity). Is this tradeoff acceptable to you? It depends on the problem you are trying to solve.
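
The trade-off can be seen by moving the classification threshold over a fixed set of predicted probabilities. The scores below are simulated (an assumption, not model output): positives tend to score higher than negatives, with some overlap in the gray area.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predicted probabilities for 200 positives and 200 negatives
y_true = np.array([1] * 200 + [0] * 200)
scores = np.concatenate(
    [rng.normal(0.65, 0.15, 200), rng.normal(0.35, 0.15, 200)]
).clip(0, 1)

def sensitivity_specificity(threshold):
    """Classify as positive when score >= threshold, then compute both rates."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

results = {t: sensitivity_specificity(t) for t in (0.3, 0.5, 0.7)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Raising the threshold can only move observations from "predicted positive" to "predicted negative", so sensitivity never goes up and specificity never goes down as the threshold increases.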

## ROC AUC

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR).

### Construct the ROC Curve.
- The Receiver Operating Characteristic curve shows us the relationship between the True Positive Rate and the False Positive Rate as the classification threshold changes.
- Typically we construct this curve to analyze how well a model performs and to choose the optimal threshold. 
- You want to see a curve that has as much area as possible under it. Points closest to the top left corner are ideal. 
- You should set your threshold depending upon whether you want fewer False Positives or False Negatives. 
- A straight diagonal line from bottom left to top right essentially means your model is no better than random guessing. 
- You need the predicted probabilities of a binary classifier model (for example from `predict_proba()`) to construct a ROC curve. 
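
A minimal sketch of getting the points of a ROC curve from `predict_proba()` and sklearn's `roc_curve`. The data is synthetic and the choice of `LogisticRegression` is only an example; any classifier that outputs probabilities would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in your own X and y)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

logreg = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba holds the predicted probability of the positive class
positive_probs = logreg.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, positive_probs)
print(len(thresholds), "thresholds evaluated")

# Plotting fpr on the x-axis against tpr on the y-axis gives the ROC curve,
# e.g. plt.plot(fpr, tpr) with matplotlib (not imported here)
```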

### Understand how AUC ROC is calculated and interpret AUC ROC.
- AUC stands for Area Under the Curve. If AUC = 1 your model separates the classes perfectly; if AUC = 0.5 it is no better than random guessing.
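
One way to interpret the AUC: it equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below checks that interpretation against sklearn's `roc_auc_score` on simulated scores (not real model output).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulated scores: positives are centered higher than negatives
y_true = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 50)])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
wins = pos[:, None] > neg[None, :]
ties = pos[:, None] == neg[None, :]
pairwise_auc = np.mean(wins + 0.5 * ties)

auc = roc_auc_score(y_true, scores)
print(pairwise_auc, auc)  # the two values agree
```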

### Define _imbalanced classes_ and discuss the implications.
- Imbalanced classes are class distributions in which 1 or more class(es) significantly out-numbers the other(s). 
- Highly imbalanced classes make the accuracy score almost worthless.
- Consider a sample with 1000 observations: 990 are Class A and 10 are Class B. Predicting Class A every time, our baseline (null model), will result in an accuracy of 99%.  
- When one class has significantly fewer observations than other classes, it is difficult for the model to learn which features are predictive of that class.  
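
The 990/10 example above, worked through in code. The null model here is just an array that predicts Class A for every observation.

```python
import numpy as np

y_true = np.array(["A"] * 990 + ["B"] * 10)

# Null model: always predict the majority class
y_pred = np.full(y_true.shape, "A")

accuracy = np.mean(y_pred == y_true)
# Recall for Class B: of the actual B observations, how many did we catch?
recall_b = np.mean(y_pred[y_true == "B"] == "B")

print(accuracy)  # 0.99
print(recall_b)  # 0.0
```

An accuracy of 99% looks excellent, yet the model never identifies a single Class B observation.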


### Describe methods for handling imbalanced classes.

- Randomly reducing the number of observations of Class A = downsampling
- Increasing the observations of Class B = upsampling (bootstrapping does this by sampling with replacement)

- Stratifying samples ensures that the training and test sets preserve the class proportions of the full dataset, so the test set contains cases from every class.
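
A sketch of stratifying and upsampling with sklearn, reusing the 990/10 split from the previous section. The features are random numbers (an assumption for illustration). Upsampling is applied to the training set only, so no copies of a test observation leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))       # random stand-in features
y = np.array([0] * 990 + [1] * 10)   # 0 = Class A, 1 = Class B

# stratify=y keeps the 99:1 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Upsample Class B (with replacement) until it matches Class A in size
X_major = X_train[y_train == 0]
X_minor = X_train[y_train == 1]
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_up))

print(np.bincount(y_train), np.bincount(y_test), np.bincount(y_balanced))
```

Downsampling follows the same pattern: call `resample` on the majority class with `replace=False` and a smaller `n_samples`.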

-----

## Regularization

### Define 'loss function'
A loss function or objective function or cost function is the metric that our algorithm optimizes on. It tells us how well (or how badly) our model is performing. In the case of an ordinary least squares regression, the loss function is the sum of squared errors (SSE).

### Define 'regularization'
Regularization is the group of methods for constraining (regularizing) the coefficients of a linear model in order to reduce the error due to variance.

### Least Squares Loss Function

---

Ordinary least squares regression minimizes the sum of squared errors (SSE) to fit the data:

### $$ \text{minimize:}\; SSE(\beta_0, \beta_1, ...) = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_j\right)\right)^2 $$

Where our model predictions for $y$ are based on the sum of the $\beta_0$ intercept and the products of $\beta_j$ with $x_j$.

### Defining the Lasso

---

Lasso regression adds a penalty to the least squares loss function: the sum of the absolute values of the $\beta$ coefficients is added to the SSE:

### $$ \text{minimize:}\; SSE + Lasso = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_j\right)\right)^2 + \alpha\sum_{j=1}^p |\beta_j|$$

**Where:**

$|\beta_j|$ is the absolute value of the $\beta$ coefficient for variable $x_j$.

$\alpha$ is the strength of the regularization penalty component in the loss function.

---

### Defining the Ridge

---

### $$ \text{minimize:}\; SSE+Ridge = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_j\right)\right)^2 + \alpha\sum_{j=1}^p \beta_j^2$$

**Where:**

$\beta_j^2$ is the squared coefficient for variable $x_j$.

$\sum_{j=1}^p \beta_j^2$ is the sum of these squared coefficients for every variable in the model. This does **not** include the intercept $\beta_0$.

$\alpha$ is a constant for the _strength_ of the regularization parameter. The higher the value, the greater the impact of this new component in the loss function. If the value was zero, we would revert back to just the least squares loss function. If the value was a billion, however, the residual sum of squares component would have a much smaller effect on the loss/cost than the regularization term.

---

### Defining the Elastic Net

---

The Elastic Net combines the Ridge and Lasso penalties.  It adds *both* penalties to the loss function:

> "[Elastic Net] allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge."
-- sklearn docs

#### $$ \text{minimize:}\; SSE + Ridge + Lasso = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_j\right)\right)^2 + \alpha\rho\sum_{j=1}^p |\beta_j| + \alpha(1-\rho)\sum_{j=1}^p \beta_j^2$$

In the elastic net, the effect of the ridge versus the lasso is balanced by the $\rho$ parameter.  It is the proportion of the penalty assigned to the Lasso term (the remainder, $1-\rho$, goes to the Ridge term) and must be between zero and one.

- **Ridge** is good at "shrinking" model coefficients.
- **Lasso** is good at eliminating coefficients.
- **ElasticNet** combines Ridge and Lasso.

### Bottom line?

In all cases, "regularization strength" is defined by a parameter $\alpha$ (sometimes called $\lambda$):
- Increase $\alpha$ (turn up regularization) 
    - Increase bias
    - Decrease variance
- Decrease $\alpha$ (turn down regularization) 
    - Decrease bias
    - Increase variance 
    
---

- The Ridge is best suited to deal with multicollinearity. 
- Lasso also deals with multicollinearity between variables, but in a more brutal way (it "zeroes out" the less effective variable).
- The Lasso is particularly useful when you have redundant or unimportant variables. If you have 1000 variables in a dataset the Lasso can perform "feature selection" automatically for you by forcing coefficients to be zero.
- Elastic Net combines both.
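
A small sketch of the difference between shrinking and eliminating coefficients, using sklearn's `Ridge` and `Lasso` on synthetic data where only the first three of ten features affect the target. The data and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three features drive the target; the other seven are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

# Penalties act on coefficient size, so scale the features first
X_sc = StandardScaler().fit_transform(X)

# Note: sklearn scales the two penalties differently, so a Ridge alpha and a
# Lasso alpha are not directly comparable to each other
ridge = Ridge(alpha=10.0).fit(X_sc, y)
lasso_weak = Lasso(alpha=0.01).fit(X_sc, y)
lasso_strong = Lasso(alpha=0.5).fit(X_sc, y)

print("ridge zeros:       ", np.sum(ridge.coef_ == 0))
print("lasso zeros (0.01):", np.sum(lasso_weak.coef_ == 0))
print("lasso zeros (0.5): ", np.sum(lasso_strong.coef_ == 0))
print("lasso (0.5) coefficients:", np.round(lasso_strong.coef_, 2))
```

Ridge shrinks every coefficient but leaves them non-zero, while a strong enough Lasso penalty sets the noise coefficients to exactly zero.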

### Regularization and the Bias-Variance Tradeoff
We can think of regularization as exchanging error due to variance for error due to bias. Regularization does this by reducing or eliminating the coefficients of the independent variables, which reduces the complexity of our model.

##  Gridsearching, Hyperparameters and Pipelines

### Describe what the terms gridsearch and hyperparameter mean.
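- Gridsearch is the act of searching a potentially multi-dimensional grid of possible hyperparameter values for the combination that gives the best model score.
- Hyperparameters are values that are set before fitting rather than learned from the data (for example "K" in KNN or $\alpha$ in regularization); they control how the model is fit and therefore how it performs.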

### Build a gridsearching procedure from scratch.
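```python
from itertools import product

def grid_search(model_class, param_grid, X_train, y_train, X_val, y_val):
    """Fit one model per parameter combination; return (best_score, best_params)."""
    names = list(param_grid)
    results = []
    # product() generates the same combinations as one nested for loop per parameter
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        model = model_class(**params)        # instantiate model with current params
        model.fit(X_train, y_train)          # fit the model
        score = model.score(X_val, y_val)    # score the model on held-out data
        results.append((score, params))      # store the score and params for each model
    # search the stored results for the max score and its associated params
    return max(results, key=lambda result: result[0])
```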

### Apply sklearn's GridSearchCV object with basketball data to optimize a KNN model.
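
The basketball data referenced in this objective is not included in these notes, so the sketch below substitutes synthetic data from `make_classification`. The parameter values in the grid are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the basketball data (an assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

# 4 values of K x 2 weighting schemes = 8 combinations
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # best combination found
print(grid.best_score_)            # its mean cross-validated score
print(grid.score(X_test, y_test))  # score of the refit best model on test data
```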



### Describe the pitfalls of searching large hyperparameter spaces.
- More hyperparameters = more combinations = more models = more time/computational expense
- If you have a lot of data, model building takes time as well.
- If you have a complex or deep model, each model takes more time to build as well.
- Multiply this by however many hyperparameters you're searching over.
- You will want to use model-based EDA to identify small hyperparameter spaces to search through instead of brute forcing and searching through every value in existence. 

### Pipeline Syntax



We set up a pipeline by passing a list of tuples in the format
```python
('string_name', ClassObject())
```
Note that we can also instantiate our steps beforehand (each of the methods that we're using is a class in sklearn) and pass the instance in the tuple.
```python
lasso = Lasso()
('lasso', lasso)
```

We can include as many steps as we'd like. 


`make_pipeline` is a convenience function that builds a `Pipeline` and names the steps for you.

After we create/set up the pipeline, we `fit` it on our training data.

Then we can `score` on our training set and testing set, generate predictions using `predict`, etc. - anything you would do with a "regular" model!
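
Putting the pieces above together, here is a minimal sketch of a two-step pipeline (scaling, then `Lasso`). The data is synthetic and the `alpha=0.1` value is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption; swap in your own X and y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A list of ('string_name', ClassObject()) tuples, applied in order
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.1)),
])

pipe.fit(X_train, y_train)               # fits the scaler, then the model
train_r2 = pipe.score(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)     # test data is scaled with the training fit
predictions = pipe.predict(X_test)
print(train_r2, test_r2)

# make_pipeline names each step after its lowercased class name
auto_pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
print(list(auto_pipe.named_steps))  # ['standardscaler', 'lasso']
```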

### GridSearch Syntax

`GridSearchCV` accepts a `Pipeline` object as an estimator and a param grid.

The param grid uses the `string_name`s from your pipeline followed by a dunder `__` and the argument name for that particular step. You then provide an iterable to search over (generally a list or a range-style object).

For example:  
`parameters = {'step_name__hyperparameter_name': [values to try for that hyperparameter]}`

You can also specify the number of folds using `cv`. The default is 5 in scikit-learn 0.22 and later (it was 3 in earlier versions).

We use this the same as other models, `fit`ting and `score`ing like normal (but now using the hyperparameters that gave us the best results).

Note that we'll use our `best_estimator_` to access the `Pipeline` that was fit with our `best_params_`.

Within the `best_estimator_` there is a dictionary called `named_steps`. We can use our `string_name`s to access the steps in our `Pipeline`. This is where we'll go to access info about the transformations and fitted parameters of each step.
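
A minimal sketch of the syntax described in this section: a `Pipeline` passed to `GridSearchCV`, a param grid keyed by step name plus dunder plus argument name, and `named_steps` used to reach the fitted model. The data is synthetic and the alpha values are arbitrary examples.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption; swap in your own X and y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)

pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso())])

# 'lasso__alpha' = step name 'lasso' + dunder + the Lasso argument 'alpha'
parameters = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]}

grid = GridSearchCV(pipe, parameters, cv=3)
grid.fit(X, y)

# best_estimator_ is the whole Pipeline, refit with the best parameters
best_lasso = grid.best_estimator_.named_steps["lasso"]
print(grid.best_params_)
print(best_lasso.coef_)
```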

### Use and evaluate attributes of the gridsearch object.
- A fit `GridSearchCV` object has the same methods as our normal fit models: `.predict` and `.score` use the best estimator found (refit on the full training data) to generate predictions and scores on test data.  
- The `GridSearchCV` object also has attributes that return the `.best_params_`, the `.best_estimator_`, and the `.best_score_` (the mean cross-validated score) of the estimator chosen.

```python
params = {
    'parameter_1': [3, 5, 7],
    'parameter_2': [0.01, 0.1, 1.0],
    'parameter_3': ('x', 'y', 'z'),
}
```
We have 3 values for each parameter, so there are $3 \times 3 \times 3 = 27$ combinations.  
If we're using CV, then we multiply that by the number of folds (so CV=3 would be $3 \times 3 \times 3 \times 3 = 81$ model fits).
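
The count can be checked for any param grid with a few lines of plain Python. The parameter names are the placeholders from the example above.

```python
import math

params = {
    "parameter_1": [3, 5, 7],
    "parameter_2": [0.01, 0.1, 1.0],
    "parameter_3": ("x", "y", "z"),
}

# Multiply the number of candidate values for each hyperparameter
n_combinations = math.prod(len(values) for values in params.values())

cv_folds = 3
n_fits = n_combinations * cv_folds  # one fit per combination per fold

print(n_combinations, n_fits)  # 27 81
```

With `GridSearchCV`'s default `refit=True`, there is one additional fit at the end, when the best combination is refit on the full training data.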