# Linear Classifiers in Python

#### Logistic Regression
* Logistic Regression is a linear classifier
* sklearn's Logistic Regression can also output confidence scores rather than "hard" or definite predictions with `.predict_proba()`

```
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)
lr.score(X_test, y_test)
```
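
A minimal sketch of hard versus soft predictions follows. The data set (`load_breast_cancer`) and the scaling step are assumptions made for illustration; the notes do not say which data the snippet above uses.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative binary data set (an assumption, not the data from the notes)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features so the solver converges easily
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)

hard = lr.predict(X_test)        # one class label per example
soft = lr.predict_proba(X_test)  # one probability per class, per example

print(hard[:3])
print(soft[:3])
```

Each row of `soft` sums to 1, and the hard prediction is the class with the larger probability.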

#### Using Linear SVC
* In sklearn the basic SVM classifier is called `LinearSVC()` or Linear Support Vector Classifier
* Note that sklearn's Logistic Regression and SVM implementations handle multiple classes (if a dataset has more than 2 classes) automatically.
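
As a rough sketch of the multi-class point above, `LinearSVC` can be fit directly on a three-class data set. The wine data and the scaling step are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustrative data set (an assumption): 3 classes, 13 features
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

svm = LinearSVC()
svm.fit(X, y)  # no extra work needed for more than 2 classes

print(svm.classes_)      # the class labels found in y
print(svm.coef_.shape)   # one row of coefficients per class
print(svm.score(X, y))   # training accuracy
```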

#### Using SVC
* The SVC class fits a nonlinear SVM by default

* **Underfitting:** model is too simple, training accuracy low
* **Overfitting:** model is too complex, testing accuracy low

#### Linear Decision Boundaries
* A decision boundary tells us what class our classifier will predict for any value of x
* A decision boundary is considered **linear** when it is a line (in any orientation)
    * This definition extends to (classifying) more than 2 features
    * For five features, the space of possible x-values would be five-dimensional. In this case, the boundary would be a higher-dimensional **hyperplane** cutting the space into two halves.
* A **nonlinear** boundary is any other type of boundary.
    * Sometimes this leads to non-contiguous regions of a certain prediction ("islands", etc).
* In their basic forms, logistic regression and SVMs are linear classifiers, which means they learn linear decision boundaries.
    * However in some more complex forms, both may learn nonlinear decision boundaries

#### Vocabulary:
* **Classification:** learning to predict categories
* **Regression:** learning to predict a continuous value
* **Decision boundary:** the surface separating different predicted classes
* **Linear classifier:** a classifier that learns linear decision boundaries 
    * e.g. logistic regression, linear SVM
* **Linearly separable:** A data set is called linearly separable if it can be perfectly explained by a linear classifier **(straight line)**

```
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Define the classifiers
classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]

# Fit the classifiers
for c in classifiers:
    c.fit(X, y)

# Plot the classifiers
# (plot_4_classifiers is a plotting helper that is not defined in these notes)
plot_4_classifiers(X, y, classifiers)
plt.show()
```

#### Linear Classifiers: Prediction Equations

#### Dot products
* Create two arrays, x and y:

```
import numpy as np

x = np.arange(3)
y = np.arange(3, 6)
```
* `x = array([0, 1, 2])`
* `y = array([3, 4, 5])`

* To take the **dot product** of these two arrays, multiply them element-wise and then sum the results:
    * 1. `x` * `y` == `array([0, 4, 10])`
    * 2. The sum of the numbers in this array (0 + 4 + 10), or `np.sum(x*y)`, is `14`
* A convenient notation for this is `@`
    * `x@y` = 14
    * In math notation, this is written x dot y
* You can think of a **dot product** as multiplication in higher dimensions, since x and y are arrays of values
* Using dot products, we can express how linear classifiers make predictions 
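
The equivalence of the different ways of writing the dot product can be checked with a short sketch (`np.dot` is included here as one more spelling of the same operation):

```python
import numpy as np

x = np.arange(3)       # array([0, 1, 2])
y = np.arange(3, 6)    # array([3, 4, 5])

elementwise = x * y             # element-wise products
dot_via_sum = np.sum(x * y)     # sum of the element-wise products
dot_via_at = x @ y              # the @ operator
dot_via_np = np.dot(x, y)       # the equivalent function call

print(elementwise, dot_via_sum, dot_via_at, dot_via_np)
```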

#### Linear classifier predictions:
* `raw model output = coefficients @ features + intercept`
    * Dot product of coefficients and features, plus an intercept.
* Linear classifier prediction: compute raw model output, check the **sign**:
    * If **positive**, predict one class
    * If **negative**, predict the other class
    
* Crucially, this pattern is the same for logistic regression and linear SVMs
* In sklearn terms, we can say logistic regression and linear SVM have different `fit` functions but same `predict` function.
    * The differences in `fit` relate to loss functions
    
* We can get the learned coefficients and intercept with:
    * `lr.coef_`
    * `lr.intercept_`
* To compute raw model output for example 10:
    * `lr.coef_ @ X[10] + lr.intercept_`
        * If the raw model output is negative, then we predict the negative class ("0", for example)
* In general, this is what the predict function does for *any* X: it computes the raw model output, checks if it's positive or negative, and then returns a result based on the names of the classes in your data set (for example, "0" and "1").
* The sign (positive or negative) tells you which side of the decision boundary you're on, and thus your prediction
* Along the decision boundary itself, the raw model output is zero
* Furthermore, the values of the coefficients and intercept determine the boundary 
* When you call `fit` with scikit-learn, the logistic regression coefficients are automatically learned from your dataset.
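
The following sketch reproduces a prediction by hand for one example and compares it with sklearn's own `decision_function` and `predict`. The data set and the scaling step are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative binary data set (an assumption)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

lr = LogisticRegression()
lr.fit(X, y)

# Raw model output for example 10: dot product plus intercept
raw = lr.coef_ @ X[10] + lr.intercept_   # array of length 1

# Check the sign and map it to a class label
manual_prediction = lr.classes_[int(raw[0] > 0)]

print(raw, lr.decision_function(X[10:11]))
print(manual_prediction, lr.predict(X[10:11])[0])
```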

#### Loss functions
* Many machine learning algorithms involve minimizing a loss function (example loss function: least squares)
* You can think of minimizing the loss as jiggling around the coefficients or parameters of the model until this error term, or loss function is as small as possible
* In other words, the loss function is a penalty score that tells us how well (or, to be precise, how poorly) the model is doing on the training data
* We can think of the `fit` function as **running code that minimizes the loss.**
* "Minimization" is with respect to the coefficients or parameters of the model
* **Note** that the `.score` function in sklearn isn't necessarily the loss function
* **The loss is used to fit the model on the data and the score is used to see how well we're doing.**
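* Sometimes the two are closely related (for example, `LinearRegression` minimizes squared error, and its score, R², is computed from squared error), but for classifiers the score is accuracy while the loss is a different function.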

#### Classification errors: the 0-1 loss
* Squared loss is not appropriate for classification problems, because our y-values are categories, not numbers
* For classification, a natural quantity to think about is the number of errors we've made 
* This is the **0-1 loss**: it's 0 for a correct prediction and 1 for an incorrect prediction
* By summing this function over all the training samples, we get the total number of mistakes we've made on the training set, since we add 1 to the total for each mistake
* While the 0-1 loss is important to understand conceptually, it turns out to be very hard to minimize it directly in practice (which is why logistic regression and SVM don't use it)
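
As a small illustration, the 0-1 loss can be computed by comparing predictions with true labels. The label arrays below are made up for the example.

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1])

# 0 for each correct prediction, 1 for each incorrect one
per_example_loss = (y_true != y_pred).astype(int)

# Summing gives the number of mistakes
n_mistakes = per_example_loss.sum()
accuracy = 1 - n_mistakes / y_true.size

print(per_example_loss, n_mistakes, accuracy)
```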

#### Minimizing a loss
* with `scipy.optimize.minimize`
* `minimize(np.square, 0).x`
    * first argument represents equation to be minimized: $y=x^2$
    * the second argument is our initial guess 
    * `.x` at the end to grab the input value that makes the function as small as possible
    * Think of the code as answering the question, "What values of the model coefficients make my squared error as small as possible?"
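
A short sketch of `minimize` in action. The starting points and the shifted parabola are arbitrary choices made for this example.

```python
import numpy as np
from scipy.optimize import minimize

# Minimize y = x^2, starting from an initial guess of 2.0
result = minimize(np.square, 2.0)
print(result.x)   # the input value that makes x^2 smallest

# A second, hypothetical function: a parabola whose minimum is at x = 3
def shifted(x):
    return (x[0] - 3.0) ** 2 + 1.0

result_shifted = minimize(shifted, 0.0)
print(result_shifted.x)
```

The optimizer works numerically, so the returned values are very close to the true minimizers (0 and 3) rather than exactly equal to them.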
    
#### Implementing linear regression "from scratch"
```
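from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

# X (features) and y (targets) are assumed to be NumPy arrays already in memory

# The squared error, summed over training examples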
def my_loss(w):
    s = 0
    for i in range(y.size):
        # Get the true and predicted target values for example 'i'
        y_i_true = y[i]
        y_i_pred = w@X[i]
        s = s + (y_i_pred - y_i_true)**2
    return s

# Returns the w that makes my_loss(w) smallest
w_fit = minimize(my_loss, X[0]).x
print(w_fit)

# Compare with scikit-learn's LinearRegression coefficients
lr = LinearRegression(fit_intercept=False).fit(X,y)
print(lr.coef_)
```

#### Loss function diagrams
* In linear regression, the raw model output is the prediction
* Intuitively, the loss is higher the further the prediction is from the true target value
* This makes sense for linear regression, but not for a linear classifier; we need specialized loss functions for classification, and can't just use the squared error from linear regression
* Logistic loss is like a "smooth version of the 0-1 loss."
* Hinge loss, used in SVM: the general shape is the same as for logistic loss, both penalize incorrect predictions
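
The loss diagrams can be sketched as plain functions. The convention assumed here is that the input is the raw model output multiplied by the true label coded as -1 or +1, so positive values mean a correct prediction; the function names are illustrative.

```python
import numpy as np

def zero_one_loss(margin):
    # 1 when the example is misclassified (negative margin), else 0
    return (margin < 0).astype(float)

def logistic_loss(margin):
    # Smooth, always positive, shrinks as the margin grows
    return np.log(1 + np.exp(-margin))

def hinge_loss(margin):
    # Exactly zero once the margin exceeds 1 (the "flat" part)
    return np.maximum(0, 1 - margin)

margins = np.array([-2.0, 0.0, 1.0, 3.0])
print(zero_one_loss(margins))
print(logistic_loss(margins))
print(hinge_loss(margins))
```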

### Logistic Regression
#### Logistic Regression and Regularization
* Regularization combats overfitting by making the model coefficients smaller
* In sklearn, the hyperparameter `C` is the inverse of the regularization strength
    * In other words, larger `C` means less regularization, and smaller `C` means more regularization.
* How does regularization affect training and testing accuracy?
    
```
lr_weak_reg = LogisticRegression(C = 100)
lr_strong_reg = LogisticRegression(C = 0.01)

lr_weak_reg.fit(X_train, y_train)
lr_strong_reg.fit(X_train, y_train)

lr_weak_reg.score(X_train, y_train)
lr_strong_reg.score(X_train, y_train)
```
* Outputs: 
* `1.0`
* `0.92`
* The model with weak regularization gets a higher training accuracy 
* Regularization is an extra term that we add to the original loss function, which penalizes large values of the coefficients
* Intuitively, without regularization, we are maximizing the training accuracy, so we do well on that metric 
* Regularized loss = original loss + large coefficient penalty 
* But, how does regularization affect test accuracy?

```
lr_weak_reg.score(X_test, y_test)
lr_strong_reg.score(X_test, y_test)
```
* Outputs: 
* `0.86`
* `0.88`
* A moderate amount of regularization often leads to higher test accuracy because it reduces overfitting; too much regularization causes underfitting and lowers test accuracy again
* Regularizing, and thus making your coefficients smaller, is like a compromise between not using a feature at all (setting its coefficient to zero) and fully using it (the un-regularized coefficient value). 
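
To see the shrinking effect directly, one can compare the size of the coefficient vectors for a weakly and a strongly regularized model. This is a sketch: the data set, the scaling step, and `max_iter` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data set (an assumption)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Large C = weak regularization, small C = strong regularization
lr_weak_reg = LogisticRegression(C=100, max_iter=10000).fit(X, y)
lr_strong_reg = LogisticRegression(C=0.01, max_iter=10000).fit(X, y)

# Overall size of the coefficient vector for each model
weak_norm = np.linalg.norm(lr_weak_reg.coef_)
strong_norm = np.linalg.norm(lr_strong_reg.coef_)

print(weak_norm, strong_norm)
```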

#### L1 vs L2 Regularization
* For linear regression, we use the terms Ridge and Lasso for two different types of regularization
* **Lasso = L1 Regularization**
* **Ridge = L2 Regularization**
* Both help reduce overfitting, and L1 also performs feature selection
* **Scaling features is usually good practice, especially when using regularization.**
* Plot coefficients of L1 and L2 regularized models:

```
plt.plot(lr_l1.coef_.flatten())
plt.plot(lr_l2.coef_.flatten())
```
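
The notes do not show how `lr_l1` and `lr_l2` are created. One possible setup is sketched below; the data set, the value `C=0.1`, and the solver choices are assumptions. Counting the nonzero coefficients makes L1's feature selection visible.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data set (an assumption), scaled before regularizing
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# 'liblinear' is one solver that supports the L1 penalty
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
lr_l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=10000).fit(X, y)

# L1 tends to set many coefficients exactly to zero; L2 only shrinks them
n_l1 = np.count_nonzero(lr_l1.coef_)
n_l2 = np.count_nonzero(lr_l2.coef_)

print("Nonzero coefficients with L1:", n_l1)
print("Nonzero coefficients with L2:", n_l2)
```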

```
# Train and validation errors initialized as empty lists
train_errs = list()
valid_errs = list()

# Values of C to try
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# Loop over values of C
for C_value in C_values:
    # Create LogisticRegression object and fit
    lr = LogisticRegression(C=C_value)
    lr.fit(X_train, y_train)
    
    # Evaluate error rates and append to lists
    train_errs.append( 1.0 - lr.score(X_train, y_train) )
    valid_errs.append( 1.0 - lr.score(X_valid, y_valid) )
    
# Plot results
plt.semilogx(C_values, train_errs, C_values, valid_errs)
plt.legend(("train", "validation"))
plt.show()
```

```
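from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Specify L1 regularization (this needs a solver that supports L1, such as 'liblinear')
lr = LogisticRegression(penalty='l1', solver='liblinear')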

# Instantiate the GridSearchCV object and run the search
searcher = GridSearchCV(lr, {'C':[0.001, 0.01, 0.1, 1, 10]})
searcher.fit(X_train, y_train)

# Report the best parameters
print("Best CV params", searcher.best_params_)

# Find the number of nonzero coefficients (selected features)
best_lr = searcher.best_estimator_
coefs = best_lr.coef_
print("Total number of features:", coefs.size)
print("Number of selected features:", np.count_nonzero(coefs))
```

#### Logistic Regression and Probabilities
* **Hard predictions:** we predict either one class or the other; "yes" or "no", "black" or "white", "apple" or "banana"
* **Soft predictions:** interpreting the raw model output as a probability.
* Soft vs. hard predictions are similar to soft vs. hard voting
* Remember: output probabilities with **`.predict_proba()`**
* When we "turn on" regularization, we see the coefficients get smaller. The effect of regularization is that probabilities are closer to 0.5; in other words, smaller coefficients mean less confident predictions
* Connection between overconfidence and overfitting
* If you have multiple features, you have multiple coefficients; **the ratio of the coefficients represents the slope of the line and the magnitude of the coefficients gives us our confidence level.**
* **Regularization affects not only confidence, but also the orientation of the boundary.** 

#### How are logistic regression probabilities computed?
* Logistic regression predictions come from the sign of the raw model output
* The raw model output can be any number, but probabilities range from 0 to 1; so, logistic regression probabilities are "squashed" raw model output ("squashed" to be between 0 and 1)
* **Sigmoid function:** the squashing function is $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z$ is the raw model output. When the raw model output is 0, the probability is 0.5.
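
As a sketch, the squashing step can be written out and compared with `predict_proba` for a binary problem. The data set and the scaling step are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Illustrative binary data set (an assumption)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

lr = LogisticRegression().fit(X, y)

raw = lr.decision_function(X)              # raw model output, any real number
manual_proba = sigmoid(raw)                # squashed to (0, 1)
sklearn_proba = lr.predict_proba(X)[:, 1]  # probability of the positive class

print(sigmoid(0.0))
print(manual_proba[:3], sklearn_proba[:3])
```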

#### Multi-class logistic regression
* Multi-class classification means having more than 2 classes
* Two popular approaches to multi-class classification:

#### 1. **Train a series of binary classifiers for each class**
* aka combine binary classifiers with **one-vs-rest**
   
```
# One vs rest strategy
lr0.fit(X, y == 0)
lr1.fit(X, y == 1)
lr2.fit(X, y == 2)
```
* **The code `y==0` returns an array the size of y, that's True when y equals 0 and False otherwise, so the classifier learns to predict these values.**
    * In other words, it's a binary classifier learning to distinguish between "zero" and "not zero."
* To make predictions using one-vs-rest, we take the class (`y==0`, `y==1`, etc.) whose classifier gives the largest raw model output, or `decision_function` in sklearn terminology.

```
# Get raw model output 
lr0.decision_function(X)[0]
lr1.decision_function(X)[0]
lr2.decision_function(X)[0]
```
* In our example, let's say the largest raw model output comes from classifier 0. 
* This means, it's more confident that the class is 0 than any of the other classes, so we predict class 0.
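* **One-vs-rest was the default multi-class behavior of sklearn's Logistic Regression in older releases; more recent releases default to the multinomial approach when used with the default `lbfgs` solver.**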
* We can just let sklearn do the work by fitting a logistic regression model on the original multi-class dataset. 
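
The one-vs-rest recipe above can be put together end to end as a sketch. The iris data set, the list `binary_models`, and `max_iter=1000` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data set (an assumption) with classes 0, 1 and 2
X, y = load_iris(return_X_y=True)

# One binary classifier per class: "class k" vs "not class k"
binary_models = [
    LogisticRegression(max_iter=1000).fit(X, y == k) for k in range(3)
]

# Raw model outputs, one column per class
raw = np.column_stack([m.decision_function(X) for m in binary_models])

# Predict the class whose classifier gives the largest raw output
ovr_pred = raw.argmax(axis=1)
ovr_accuracy = np.mean(ovr_pred == y)

print(raw[0])
print(ovr_pred[:5], ovr_accuracy)
```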

* Another way to achieve multi-class classification with logistic regression is to modify the loss function so that it directly tries to optimize accuracy on the multi-class problem. Various words related to this concept:
    * Multinomial logistic regression
    * Softmax
    * Cross-entropy loss
    
* **One-vs-rest:**
    * fit a binary classifier for each class
    * predict with all, take largest output
    * **pro:** simple, modular; reuse your binary classifier implementation rather than needing a new one
    * **con:** not directly optimizing accuracy
    * common for SVMs as well
    * can produce probabilities

* **"Multinomial" or "softmax":**
    * fit a single classifier to all classes 
    * prediction directly outputs best class
    * **con:** more complicated, new code
    * **pro:** tackles the problem directly (loss more directly aligned with accuracy than one-vs-rest)
    * In the field of neural networks, the multinomial approach is standard
    * possible for SVMs, but less common
    * can produce probabilities 
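
For intuition, the softmax function used by the multinomial approach turns one raw output per class into probabilities that sum to 1. A minimal sketch, with made-up raw outputs:

```python
import numpy as np

def softmax(raw):
    # Subtract the max first for numerical stability
    exps = np.exp(raw - raw.max())
    return exps / exps.sum()

# Hypothetical raw model outputs, one per class
raw_outputs = np.array([2.0, 1.0, -1.0])

probs = softmax(raw_outputs)
predicted_class = probs.argmax()   # the class with the largest raw output

print(probs, predicted_class)
```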

#### Model coefficients for multi-class

```
lr_mn = LogisticRegression(multi_class = "multinomial", solver= "lbfgs")
lr_mn.fit(X, y)

lr_mn.coef_.shape
lr_mn.intercept_.shape
```
* We can instantiate the multinomial version by setting the `multi_class` argument. In older sklearn versions, whose default solver was "liblinear", this also required switching to a different solver (like "lbfgs"); in more recent versions "lbfgs" is already the default.
* The **solver** hyperparameter specifies the algorithm used to minimize the loss; the "liblinear" algorithm is designed for the binary problem, so it can be used for one-vs-rest but not multinomial.
* The multinomial classifier has the same number of coefficients and intercepts as one-vs-rest
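
With one row of coefficients and one intercept per class, the multi-class raw model output is a matrix product. The sketch below assumes the iris data set and leaves the multi-class strategy at sklearn's default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data set (an assumption): 3 classes, 4 features
X, y = load_iris(return_X_y=True)

lr = LogisticRegression(max_iter=1000).fit(X, y)

print(lr.coef_.shape)        # (number of classes, number of features)
print(lr.intercept_.shape)   # (number of classes,)

# One raw model output per class for every example
raw = X @ lr.coef_.T + lr.intercept_

# The prediction is the class with the largest raw output
manual_pred = lr.classes_[raw.argmax(axis=1)]
```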

```
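# Fit one-vs-rest logistic regression classifier
# (set explicitly, since newer scikit-learn versions no longer default to one-vs-rest)
lr_ovr = LogisticRegression(multi_class='ovr')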
lr_ovr.fit(X_train, y_train)

print("OVR training accuracy:", lr_ovr.score(X_train, y_train))
print("OVR test accuracy    :", lr_ovr.score(X_test, y_test))

# Fit softmax classifier
lr_mn = LogisticRegression(multi_class='multinomial', solver = 'lbfgs')
lr_mn.fit(X_train, y_train)

print("Softmax training accuracy:", lr_mn.score(X_train, y_train))
print("Softmax test accuracy    :", lr_mn.score(X_test, y_test))
```

### Support Vectors
#### What is an SVM?
* Linear SVMs are also **linear classifiers**, but they use **hinge loss** instead (as well as **L2 regularization**)
* RECAP: logistic and hinge loss look very similar, but a key difference is in the "flat" part of the hinge loss, which occurs when the raw model output is greater than one (meaning you predicted some example correctly, beyond some margin of error)
    * If a training example falls in this "zero loss" region, it doesn't contribute to the fit (if we removed that example, nothing would change).
    * $\Uparrow$ **This** is a key property of SVMs.
* **Support Vectors** are defined as examples that are **not** in the flat part of the loss diagram.
    * **Another way of defining support vectors is that they include incorrectly classified examples, as well as correctly classified examples that are close to the boundary.**
    * If you're wondering how close is close enough, this is controlled by the regularization strength.
    * **Support vectors** are the examples that matter to your fit. 
    * If an example is not a support vector, removing it has no effect on the model, because its loss was already zero.
    * **Critically important:** Having a small number of support vectors makes kernel SVMs really fast. 
        * Part of the speed comes from clever algorithms whose running time only scales with the number of support vectors, rather than the total number of training examples.
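
A small sketch of the "removing non-support vectors changes nothing" property, using a made-up eight-point data set. Here one would expect the support vectors to be the points nearest the boundary, and refitting on the support vectors alone should give essentially the same model.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data (an assumption for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [3, 3], [3, 4], [4, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

svm = SVC(kernel="linear")
svm.fit(X, y)
print("Support vector indices:", svm.support_)

# Refit using only the support vectors
X_small = X[svm.support_]
y_small = y[svm.support_]
svm_small = SVC(kernel="linear").fit(X_small, y_small)

print(svm.coef_, svm.intercept_)
print(svm_small.coef_, svm_small.intercept_)
```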
        
#### Max-margin viewpoint
* The SVM maximizes the margin for linearly separable datasets
* **Margin:** distance from the boundary to the closest points.
* If the regularization strength is not too strong, SVMs maximize the margin of linearly separable datasets
* Unfortunately, most datasets are not linearly separable
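
To make the margin definition concrete, the sketch below measures the distance from each point to a linear boundary and takes the smallest one. The toy points and the boundary parameters `w` and `b` are hypothetical values chosen for illustration (they place the closest points at raw output -1 and +1).

```python
import numpy as np

# Made-up, linearly separable toy data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [3, 3], [3, 4], [4, 3], [4, 4]], dtype=float)

# Hypothetical boundary: raw model output = w @ x + b
w = np.array([0.5, 0.5])
b = -2.0

raw = X @ w + b                              # raw model output per point
distances = np.abs(raw) / np.linalg.norm(w)  # distance of each point to the boundary

# Margin: distance from the boundary to the closest points
margin = distances.min()

print(raw)
print(margin)
```

With the closest points sitting at raw output -1 and +1, the margin equals `1 / np.linalg.norm(w)`, which is why keeping the coefficients small corresponds to a wide margin.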