# Chapter 5: Logistic Regression

* Logistic Regression: Building an equation to do classification
    * Aim: Try to find the best-fit set of parameters

## Classification with logistic regression and the sigmoid function: a tractable step function
* Pros:
    * Computationally inexpensive
    * Easy to implement
    * Knowledge representation easy to interpret
* Cons:
    * Prone to underfitting
    * May have low accuracy
* Works with:
    * Numeric values
    * Nominal values

### 2 Feature Graphs
* Represented by the sigmoid function:
    * $\sigma(z) = \frac{1}{1 + e^{-z}}$
* Classification:
    * Multiply each feature by its weight, then sum the products to get z
    * Pass z into the sigmoid
    * The result lies between 0 and 1 and can be read as a probability estimate
        * $\sigma(z) > 0.5$: Classified as 1
        * $\sigma(z) < 0.5$: Classified as 0
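
As a minimal sketch of the classification rule above, the sigmoid and the 0.5 threshold can be written in a few lines of NumPy. The function names `sigmoid` and `classify` are illustrative choices, not names taken from the chapter.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if sigmoid(z) exceeds the threshold, otherwise 0."""
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0.0))    # 0.5, the decision boundary
print(classify(2.0))   # 1
print(classify(-2.0))  # 0
```

Because the sigmoid is monotonic, thresholding its output at 0.5 is the same as checking whether z is positive.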
    
## Using optimization to find the best regression coefficients
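* $z = w^T x$
    * Equivalent to $z = w_0 x_0 + w_1 x_1 + \dots + w_n x_n$
    * x is the input data (feature vector); w is the vector of weights (regression coefficients)
* We want to find the coefficients w that classify the training data as accurately as possible.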
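
To make the two forms of z concrete, here is a small sketch that computes it both as a dot product and as an explicit sum. The weight and feature values are made up for illustration, and treating x0 = 1.0 as an intercept term is a common convention assumed here rather than something stated in these notes.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])   # weights w0, w1, w2
x = np.array([1.0, 3.0, 2.0])    # features x0, x1, x2 (x0 = 1.0 as intercept)

z_dot = w @ x                                   # vector form: w^T x
z_sum = sum(wi * xi for wi, xi in zip(w, x))    # expanded form: w0x0 + w1x1 + w2x2

print(z_dot)  # 1.5
```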

### Gradient Ascent
* Idea: The best way to find the maximum of a function is to move in the direction of the gradient
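* $\nabla f(x, y) = \left( \frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y} \right)$
    * Moving in the x direction by amount $\frac{\partial f(x, y)}{\partial x}$
    * Moving in the y direction by amount $\frac{\partial f(x, y)}{\partial y}$
* The gradient operator will always point in the direction of the greatest increase.
* Gradient Ascent algorithm: $w := w + \alpha \nabla_w f(w)$
    * Step size: $\alpha$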
* Repeated until a stopping condition is reached:
    * Specified number of steps
    * Certain tolerance margin
* Gradient Descent algorithm: $w := w - \alpha \nabla_w f(w)$
    * For function minimization
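
The update rule and both stopping conditions can be tried on a function whose maximum is known. The sketch below assumes a hypothetical concave function f(x, y) = -(x - 1)^2 - (y + 2)^2, which peaks at (1, -2); the function, the step size, and the tolerance are illustrative choices.

```python
import numpy as np

def grad_f(p):
    """Gradient of the assumed example f(x, y) = -(x - 1)^2 - (y + 2)^2."""
    x, y = p
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])

def gradient_ascent(grad, start, alpha=0.1, max_steps=1000, tol=1e-8):
    """Repeat p := p + alpha * grad(p) until a stopping condition is met."""
    p = np.asarray(start, dtype=float)
    for _ in range(max_steps):            # stopping condition 1: step budget
        step = alpha * grad(p)
        p = p + step
        if np.linalg.norm(step) < tol:    # stopping condition 2: tolerance
            break
    return p

p_max = gradient_ascent(grad_f, start=[0.0, 0.0])
print(p_max)  # close to [1, -2]
```

Flipping the sign of the update (`p - step`) turns the same loop into gradient descent for minimization.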

```
# Assumptions
Data in form (x1, x2, class label)

# Pseudocode
Start with the weights all set to 1
Repeat R times:
    Calculate the gradient of the entire dataset
    Update the weights vector by alpha*gradient
Return the weights vector
```
* Goal: Solve for a set of weights that defines a line separating the different classes of data
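
One way to turn the batch pseudocode into NumPy is sketched below. It assumes the function being maximized is the log-likelihood of the labels, whose gradient with respect to the weights is `X.T @ (y - sigmoid(X @ w))`. The function name, the toy dataset, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.001, num_iters=500):
    """Each iteration uses every row of X to compute one weight update."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.ones(X.shape[1])             # start with all weights set to 1
    for _ in range(num_iters):                # repeat R times
        error = y - sigmoid(X @ weights)      # label minus predicted probability
        weights = weights + alpha * (X.T @ error)
    return weights

# Toy, linearly separable data (made up): columns are x0 = 1.0, x1, x2.
X = np.array([[1.0, -2.0, -1.0],
              [1.0, -1.0, -2.0],
              [1.0, -1.5, -0.5],
              [1.0,  1.0,  2.0],
              [1.0,  2.0,  1.0],
              [1.0,  1.5,  0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

weights = batch_gradient_ascent(X, y, alpha=0.1, num_iters=200)
predicted = (sigmoid(X @ weights) > 0.5).astype(int)
```

The separating line is the set of points where `X @ weights` equals 0, since that is where the sigmoid crosses 0.5.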

### Stochastic Gradient Ascent
* Batch gradient ascent is computationally expensive: every weight update uses the entire dataset
    * Instead, update the weights using only one instance (data point) at a time
    * This makes it an online learning algorithm
        * Online: Updating the classifier incrementally as new data comes in
        * Batch: Processing all the data at once

```
# Assumptions
Data in form (x1, x2, class label)

# Pseudocode
Start with the weights all set to 1
For each data point in the dataset:
    Calculate the gradient of one data point
    Update the weights vector by alpha*gradient
Return the weights vector
```
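
A single-pass version of the pseudocode might look like the sketch below. As with the batch case, it assumes the per-point gradient of the log-likelihood, `(y_i - sigmoid(x_i . w)) * x_i`; the function name and the default step size are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, alpha=0.01):
    """One pass over the data, updating the weights after every single point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.ones(X.shape[1])                 # start with all weights set to 1
    for i in range(X.shape[0]):
        prediction = sigmoid(X[i] @ weights)      # scalar, not a vector
        error = y[i] - prediction
        weights = weights + alpha * error * X[i]  # update from one point only
    return weights

# A single made-up point shows what one update does.
w_after = stochastic_gradient_ascent([[1.0, 2.0, 3.0]], [0], alpha=0.1)
```

Each update touches only one row, so no matrix products over the full dataset are needed and new points can be folded in as they arrive.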

* How can we improve this?
    * A single pass over the data is usually not enough and may leave a large number of samples misclassified
    * Instead, run through the dataset many times until the weights converge
    * Are the parameters reaching a steady value or constantly changing?
        * If they settle to a steady value, the updates have shrunk and the weights have converged
        * If they keep changing, some data points are still not classified correctly, so each pass keeps pushing the weights around
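
To check whether the parameters are settling, one option is to record the weights after every pass and look at how much they move. The sketch below is one hedged way to do that; `sga_with_history`, `max_change_per_pass`, and the toy dataset are hypothetical names and values introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_with_history(X, y, alpha=0.01, num_passes=200):
    """Run stochastic gradient ascent for several passes, saving the weights after each."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.ones(X.shape[1])
    history = [weights.copy()]
    for _ in range(num_passes):
        for i in range(X.shape[0]):
            error = y[i] - sigmoid(X[i] @ weights)
            weights = weights + alpha * error * X[i]
        history.append(weights.copy())
    return weights, np.array(history)

def max_change_per_pass(history):
    """Largest absolute change in any weight between consecutive passes."""
    return np.abs(np.diff(history, axis=0)).max(axis=1)

# Toy, linearly separable data (made up): columns are x0 = 1.0, x1, x2.
X = np.array([[1.0, -2.0, -1.0], [1.0, -1.0, -2.0], [1.0, -1.5, -0.5],
              [1.0,  1.0,  2.0], [1.0,  2.0,  1.0], [1.0,  1.5,  0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

weights, history = sga_with_history(X, y)
changes = max_change_per_pass(history)
print(changes[0], changes[-1])  # the per-pass change shrinks on this data
```

Plotting each column of `history` against the pass number gives a visual version of the same check.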

## Example
* Determine if a horse will die from colic
    * Sample data: 368 instances with 28 features
* Data problem: 30% of the values are missing

* Options for handling missing values
    * Use the feature's mean value from all available data
    * Fill in the unknown with a special value like 0
    * Ignore the instance
    * Use a mean value from similar items
    * Use another ML algorithm to predict the value
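
Three of these options are simple enough to sketch directly. The code below assumes missing entries are encoded as `np.nan`, which is an assumption about the data format rather than something given in these notes; the helper names are illustrative.

```python
import numpy as np

def fill_with_zero(X):
    """Replace every missing entry with the special value 0."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), 0.0, X)

def fill_with_column_mean(X):
    """Replace each missing entry with the mean of its feature's known values."""
    X = np.asarray(X, dtype=float)
    column_means = np.nanmean(X, axis=0)   # mean per column, ignoring NaN
    return np.where(np.isnan(X), column_means, X)

def drop_incomplete_rows(X):
    """Ignore any instance that has at least one missing value."""
    X = np.asarray(X, dtype=float)
    return X[~np.isnan(X).any(axis=1)]

# Made-up data with two missing entries.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

zero_filled = fill_with_zero(X)
mean_filled = fill_with_column_mean(X)
complete_only = drop_incomplete_rows(X)
```

Filling with 0 fits logistic regression well: a zero-valued feature adds nothing to z, and it contributes nothing to a weight update of the form `alpha * error * x`. Note that `np.nanmean` emits a warning for a column with no known values at all.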

## Summary
* Logistic regression is about finding best-fit parameters to a sigmoid function
* Optimization methods can be used to find these best-fit parameters
* Gradient Ascent and Stochastic Gradient Ascent are two common optimization methods for this
* Stochastic Gradient Ascent is computationally cheaper than batch Gradient Ascent and is an online (streaming) algorithm.