## K-Nearest Neighbors & Supervised Learning

- Supervised learning can be broken down into two categories:
    - Classification - discrete class value
    - Regression - continuous class value
    
    
- KNN makes few assumptions about the structure of the data and gives potentially accurate but sometimes unstable predictions (sensitive to small changes in the training data).


- A linear model (fit using least squares) makes strong assumptions about the structure of the data and gives stable but potentially inaccurate predictions.


- Model - A specific mathematical or computational description that expresses the relationship between a set of input variables and one or more outcome variables that are being studied or predicted.

## Overfitting and Underfitting

- Generalization ability refers to an algorithm's ability to give accurate predictions for new, previously unseen data.


- Assumptions:
    - Future unseen data (test set) will have same properties as current training data.
    - Thus, the models that are accurate on training set are expected to be accurate on test set.
    - May not happen if trained model is tuned too specifically to training set.
    

- Models that are too complex for the amount of training data available are said to **overfit** and are not likely to generalize well to new examples.


- Models that are too simple, that don't even do well on training data, are said to **underfit** and also are not likely to generalize well to new examples.
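
As a rough illustration of the two failure modes above, the sketch below uses a tiny pure-Python KNN regressor on made-up one-dimensional data (the data values, `knn_regress`, and `mse` are all hypothetical, not from the notes). With k = 1 the model memorizes the training set, and with k equal to the training set size it always predicts the training mean.

```python
def knn_regress(X_train, y_train, x, k):
    """Predict the mean target of the k training points closest to x."""
    nearest = sorted(zip(X_train, y_train), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(X, y, X_train, y_train, k):
    """Mean squared error of the k-NN regressor on the points (X, y)."""
    errors = [(yi - knn_regress(X_train, y_train, xi, k)) ** 2 for xi, yi in zip(X, y)]
    return sum(errors) / len(errors)

# Made-up data: roughly y = x with some noise in the training targets.
X_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [0.2, 0.9, 2.3, 2.8, 4.1, 5.2]
X_test = [0.5, 2.5, 4.5]
y_test = [0.5, 2.5, 4.5]

# k = 1 reproduces the training targets exactly (zero training error).
train_mse_k1 = mse(X_train, y_train, X_train, y_train, k=1)
# k = 6 predicts the training mean everywhere (too simple).
train_mse_k6 = mse(X_train, y_train, X_train, y_train, k=6)

test_mse_k1 = mse(X_test, y_test, X_train, y_train, k=1)
test_mse_k2 = mse(X_test, y_test, X_train, y_train, k=2)
test_mse_k6 = mse(X_test, y_test, X_train, y_train, k=6)
```

On this toy data the k = 1 model has zero training error but a higher test error than k = 2, while k = 6 does poorly on both sets. Real data will not always behave this cleanly.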

## $R^2$ Regression Score

- Measures how well a prediction model for regression fits given data


- Score is typically between 0 and 1 (it can be negative when a model fits worse than the constant mean predictor):
    - Value of 0 corresponds to a constant model that predicts the mean value of all training target values.
    - Value of 1 corresponds to perfect prediction.
    

- Also known as "coefficient of determination"


- KNN is a good baseline to compare against more complicated models
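
The $R^2$ score described in this section can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a minimal pure-Python version; the function name and the sample values are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
mean_y = sum(y_true) / len(y_true)

perfect = r2_score(y_true, y_true)                     # 1.0: perfect prediction
constant = r2_score(y_true, [mean_y] * len(y_true))    # 0.0: always predict the mean
worse = r2_score(y_true, [9.0, 7.0, 5.0, 3.0])         # -3.0: worse than the mean
```

The last case shows that a model doing worse than the constant mean predictor gets a negative score.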

## KNeighbors Classifier & KNeighbors Regressor: Important Parameters

- Model Complexity: 
    - `n_neighbors`: # of nearest neighbors (k) to consider
        - Default: k = 5


- Model Fitting:
    - `metric`: Distance function between data points
        - Default: Minkowski distance with power parameter p = 2 (Euclidean distance)
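
To make the two parameters above concrete, here is a small pure-Python sketch of a k-nearest-neighbors classifier. It is not scikit-learn's implementation. The helper names and the toy data are hypothetical, and the arguments `k` and `p` play the roles of `n_neighbors` and the Minkowski power parameter `p`.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(X_train, y_train, query, k=5, p=2):
    """Majority vote among the k training points closest to `query`."""
    dists = sorted(
        (minkowski(x, query, p), label) for x, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two loose clusters labeled "a" and "b".
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
y_train = ["a", "a", "a", "b", "b", "b"]

pred_k1 = knn_predict(X_train, y_train, (2.5, 2.0), k=1)   # nearest point is in cluster "a"
pred_k5 = knn_predict(X_train, y_train, (5.0, 5.0), k=5)   # 3 of the 5 nearest are "b"
```

Smaller k follows the training data more closely (higher model complexity), while larger k smooths the decision over more neighbors.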


## Least Squares

- A **linear model** is a sum of weighted variables that predicts a target output value given an input data instance. Example: Predicting housing prices.
    - Housing features: taxes per year ($X_{Tax}$), age in years ($X_{Age}$)
        - $\hat{Y}$ = 212,000 + 109 $X_{Tax}$ - 2,000 $X_{Age}$
    - A house with feature values ($X_{Tax}$, $X_{Age}$) of (10,000, 75) would have a predicted sale price of:
        - $\hat{Y}_{price}$ = 212,000 + 109(10,000) - 2,000(75) = 1,152,000 
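
The housing example above is just a weighted sum plus an intercept. A minimal sketch, using the weights from the example (the function name is made up):

```python
def predict_price(x_tax, x_age):
    """Weighted sum of the two features plus the intercept from the example."""
    intercept = 212_000
    w_tax = 109
    w_age = -2_000
    return intercept + w_tax * x_tax + w_age * x_age

# 212,000 + 1,090,000 - 150,000 = 1,152,000
price = predict_price(x_tax=10_000, x_age=75)
```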
        

- Input instance - feature vector: x = ($x_0$, $x_1$, ... , $x_n$)
  
  Predicted output: $\hat{y}$ = $\hat{w}_0 x_0 + \hat{w}_1 x_1 + ... + \hat{w}_n x_n + \hat{b}$
  
  Parameters to estimate: $$\hat{w} = (\hat{w}_0, ..., \hat{w}_n ) : feature \ \ weights/model \ \ coefficients $$
  $$\hat{b} = constant \ \ bias \ \ term/intercept $$
  

- $\hat{y} = \hat{w}_0 x_0 + \hat{b}$ <br> $\hat{w}_0$ (slope) <br> $\hat{b}$ (y-int) 


- Least squares (or Ordinary Least Squares, OLS) finds *w* and *b* that minimize the mean squared error of the model on the training data, which is equivalent to minimizing the sum of squared differences between predicted and actual target values.


- No parameters to control model complexity


- Parameters estimated from training data


- Many different ways to estimate *w* and *b*:
    - Different methods correspond to different "fit" criteria and goals and ways to control model complexity
    

- The learning algorithm finds the parameters that optimize an objective function, typically to minimize some kind of loss function of the predicted target values vs. the actual target values


- For LMs, model complexity is based on the nature of the weights, w, on the input features. Simpler LMs have a weight vector, w, that is closer to 0, i.e. where more features are either not used at all (that have zero weight) or have less influence on the outcome, a very small weight.

$$ RSS(w,b) = \sum \limits_{i=1} ^{N} (y_i - (w \cdot x_i + b)) ^{2} $$
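
For a single feature, the w and b that minimize the RSS above have a well-known closed form. The following is a minimal pure-Python sketch under that single-feature assumption; the function names and data points are made up for illustration.

```python
def fit_ols_1d(xs, ys):
    """Closed-form least-squares slope w and intercept b for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b

def rss(xs, ys, w, b):
    """Residual sum of squares: sum over i of (y_i - (w * x_i + b))^2."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w, b = fit_ols_1d(xs, ys)          # w is about 1.94, b is about 1.15
best_rss = rss(xs, ys, w, b)
# Nudging either parameter away from the fitted values can only increase the RSS.
nudged_rss = rss(xs, ys, w + 0.1, b)
```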

## Ridge Regression

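- Ridge regression learns parameters w and b using the same OLS criterion but adds a penalty on large values of the w parameters: $\alpha$ times the sum of squared weights over the $p$ features.

$$ RSS_{Ridge} (w,b) = \sum \limits_{i=1} ^{N} (y_i - (w \cdot x_i + b)) ^{2} + \alpha \sum \limits_{j=1} ^{p} w_j ^{2} $$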

- Once parameters are learned, the ridge regression prediction formula is the **same** as OLS.


- The addition of a parameter penalty is called *regularization*. Regularization **prevents overfitting** by restricting the model, typically to reduce its complexity.


- Ridge regression uses *L2 regularization*: it penalizes the sum of squares of the w entries.


- Influence of the regularization term is controlled by the $\alpha$ parameter.


- Higher $\alpha$ means more regularization and simpler models.
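
To see how $\alpha$ shrinks the weights, consider the single-feature case. If the objective is the RSS plus $\alpha$ times the squared weight, and the intercept is left unpenalized, the minimizer has a simple closed form. The sketch below assumes exactly that setup; the function name and data are made up.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Minimize sum (y - (w*x + b))^2 + alpha * w^2 for one feature.

    The intercept b is not penalized (an assumption of this sketch).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / (s_xx + alpha)      # alpha in the denominator shrinks w toward 0
    b = y_mean - w * x_mean
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w_ols, _ = fit_ridge_1d(xs, ys, alpha=0.0)     # same as least squares, about 1.94
w_mid, _ = fit_ridge_1d(xs, ys, alpha=5.0)     # about 0.97
w_big, _ = fit_ridge_1d(xs, ys, alpha=95.0)    # about 0.097
```

The weight gets smaller as $\alpha$ grows but never reaches exactly zero for finite $\alpha$.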


### Need for Feature Normalization

- Important for some ML methods that all features are on the same scale (e.g. faster convergence in learning, more uniform or "fair" influence for all weights)
    - E.g. Regularized regression, KNN, SVM, NNs, ...
    

- **Can also depend on the data.** For now, we do MinMax scaling on features.
    - For each feature $x_i$: Compute the min value $x_i^{MIN}$ and max value $x_i ^{MAX}$ achieved across all instances in training set.
    - For each feature: Transform a given feature $x_i$ value to a scaled version $x_i^{'}$ using the formula: $$ x_i ^{'} = \frac{(x_i - x_i ^{MIN})}{(x_i ^{MAX} - x_i ^{MIN})} $$
    

- **Feature Normalization: Test set must use identical scaling to training set.**
    - Fit scaler using training set, then apply same scaler to transform test set.
    - Do not scale training and test sets using different scalers: **this could lead to random skew in the data, which will invalidate results.**
    - Do not fit scaler using any part of the test data: **referencing the test data can lead to a form of data leakage.**
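
The fit-on-train, apply-to-test rule above can be sketched in plain Python. The helper names and the data are made up; in scikit-learn the same pattern is `MinMaxScaler` with `fit` on the training set and `transform` on both sets. Each value is scaled as (value minus training min) divided by (training max minus training min).

```python
def fit_min_max(rows):
    """Per-feature (min, max) pairs computed from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def transform_min_max(rows, ranges):
    """Scale each feature with the stored training min/max.

    Assumes max > min for every feature (no constant columns).
    """
    return [
        [(value - lo) / (hi - lo) for value, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]

X_train = [[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]]
X_test = [[2.0, 700.0]]

ranges = fit_min_max(X_train)                       # fit on training data only
X_train_scaled = transform_min_max(X_train, ranges)
X_test_scaled = transform_min_max(X_test, ranges)   # reuse the training ranges
```

Here `X_test_scaled` is `[[0.25, 1.25]]`. Test values can land outside [0, 1] when they fall outside the range seen in training, which is expected.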
    

- Downside to normalization is that resulting model and transformed features may be harder to interpret.


- **In general, regularization works well when you have relatively small amounts of training data compared to the number of features in your model.**


- Regularization becomes less important as the amount of training data you have increases.

## Lasso Regression

- Another form of regularized linear regression is **Lasso regression**. It uses an *L1 regularization* penalty for training (instead of Ridge's L2 penalty).


- L1 penalty: Minimize the sum of *absolute values* of the coefficients (here $p$ is the number of features).
$$ RSS_{Lasso} (w,b) = \sum \limits_{i=1}^{N} (y_i - (w \cdot x_i + b))^{2} + \alpha \ \sum \limits_{j=1} ^{p} |w_j| $$


- This has the effect of setting parameter weights in w to *zero* for the least influential variables. This is called a *sparse* solution: a kind of feature selection.


- Parameter $\alpha$ controls the amount of L1 regularization (default = 1.0).
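
The sparsity effect can be seen in the single-feature case. There, minimizing the RSS plus $\alpha |w|$ (with an unpenalized intercept) gives a soft-thresholding rule. The sketch below assumes that exact objective; the function name and data are made up. Note that scikit-learn's `Lasso` divides the squared-error term by twice the number of samples, so its `alpha` values are not directly comparable to the ones used here.

```python
import math

def fit_lasso_1d(xs, ys, alpha):
    """Minimize sum (y - (w*x + b))^2 + alpha * |w| for one feature.

    The intercept b is not penalized (an assumption of this sketch).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    # Soft-thresholding: shrink |s_xy| by alpha / 2 and clip at zero.
    shrunk = max(abs(s_xy) - alpha / 2.0, 0.0)
    w = math.copysign(shrunk, s_xy) / s_xx
    b = y_mean - w * x_mean
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w_ols, _ = fit_lasso_1d(xs, ys, alpha=0.0)        # least-squares slope, about 1.94
w_small, _ = fit_lasso_1d(xs, ys, alpha=4.0)      # shrunk, about 1.54
w_zero, b_zero = fit_lasso_1d(xs, ys, alpha=20.0) # weight is exactly 0
```

Unlike an L2 penalty, a large enough $\alpha$ sets the weight to exactly zero, leaving only the intercept (the mean of the targets).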


- Prediction formula is same as OLS.


- **When to use Ridge vs. Lasso regression**:
    - Many variables with small/medium sized effects: use Ridge
    - Only a few variables with medium/large effect: use Lasso
    


## Polynomial Features with Linear Regression

- x = ($x_0$, $x_1$) &emsp;====> $x^{'} = (x_0, x_1, x_0 ^{2}, x_0 x_1, x_1 ^{2})$ <br> $\hat{y}$ = $\hat{w}_0 x_0$ + $\hat{w}_1 x_1$ + $\hat{w}_{00}x_0 ^{2}$ + $\hat{w}_{01} x_0 x_1$ + $\hat{w}_{11} x_1 ^{2} + \hat{b} $ 


- Generate new features consisting of all polynomial combinations of the original two features $(x_0, x_1)$


- The degree of the polynomial specifies how many variables participate at a time in each new feature (above example: degree 2).


- This is still a weighted linear combination of features, so it's still a linear model and can use least-squares estimation method for w and b.


- **Why do this kind of transformation?**
    - Called a *polynomial feature transformation*
    - Can use to transform a problem into a higher dimensional regression space
    - In effect, adding these extra polynomial features gives us a much richer set of complex functions that we can use to fit the data
    - Can think of this intuitively as allowing polynomials to be fit to the training data instead of simply a straight line, but using the same least-squares criterion that minimizes mean squared error
    - To capture interactions between original features by adding them as features to the linear model
    

- Polynomial feature expansion is also effective for classification.


- More generally, we can apply other non-linear transformations to create new features (technically, these are called non-linear basis functions).


- **Beware of polynomial feature expansion with higher degree, as this can lead to complex models that overfit**.
    - Thus, polynomial feature expansion is often combined with a regularized method like Ridge regression
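
The degree-2 expansion from the start of this section can be generated mechanically. The sketch below is a pure-Python version with a made-up function name. It produces the same ordering as the example $(x_0, x_1, x_0^2, x_0 x_1, x_1^2)$ and leaves out the constant term, whereas scikit-learn's `PolynomialFeatures` adds a bias column by default.

```python
from itertools import combinations_with_replacement

def polynomial_features(x, degree=2):
    """All products of the input features with total degree 1..degree."""
    features = []
    for d in range(1, degree + 1):
        # Each index tuple picks d features (with repetition) to multiply.
        for idx in combinations_with_replacement(range(len(x)), d):
            value = 1.0
            for i in idx:
                value *= x[i]
            features.append(value)
    return features

# (x0, x1) = (2, 3) -> (x0, x1, x0^2, x0*x1, x1^2) = (2, 3, 4, 6, 9)
expanded = polynomial_features([2.0, 3.0], degree=2)
```

The number of generated features grows quickly with the degree and the number of inputs, which is one reason higher-degree expansions are prone to overfitting.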

## Logistic Regression
- Target is binary instead of continuous.


- Uses the logistic function, which compresses the output of the linear function so that it's limited to a range between 0 and 1, interpreted as the *probability* the input object belongs to the positive class.
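
A minimal sketch of the logistic function applied to a linear model's output follows. The weights, intercept, and input are made-up values for illustration, and the 0.5 threshold is a common convention rather than something stated in these notes.

```python
import math

def logistic(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class: logistic(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return logistic(z)

w = [0.8, -0.5]   # hypothetical learned weights
b = 0.1           # hypothetical intercept

p = predict_proba([2.0, 1.0], w, b)   # z = 1.6 - 0.5 + 0.1 = 1.2, p is about 0.77
label = 1 if p >= 0.5 else 0          # assumed decision threshold of 0.5
```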