# Module 2: Supervised Machine Learning

## Introduction to Supervised Machine Learning
`Suggested: a more detailed and lower-level ML course by Andrew Ng, also on Coursera`

### Learning Objectives
1. understand how a number of different supervised learning algorithms learn by estimating their parameters from data to make new predictions

2. understand the strengths and weaknesses of particular supervised learning methods

3. learn how to apply specific supervised machine learning algorithms in Python with scikit-learn

4. learn about general principles of supervised machine learning, like overfitting and how to avoid it

### Review of important terms
1. feature representation
    - convert an object into a numeric representation that a computer can process
2. data instances/samples/examples (X)
    - one row of variables, or one representation of an object instance
3. target value
    - the label of an object, typically assigned by a human
4. training and test sets
    - a common split is 75% training / 25% test
5. model/estimator
    - model fitting produces a 'trained model'
    - training is the process of estimating model parameters
6. evaluation method
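
As a minimal sketch of the 75/25 split, the snippet below uses scikit-learn's `train_test_split`, which holds out 25% of the rows as the test set when neither `test_size` nor `train_size` is given. The arrays `X` and `y` are made-up placeholders, not course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 instances with 2 features each, and binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# With no test_size/train_size given, 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # (75, 2) (25, 2)
```

Fixing `random_state` only makes the shuffle reproducible; it does not change the split proportions.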

### Classification and Regression
1. Both classification and regression take a set of training instances and learn a mapping to a *target value*

2. For classification, the target value is a discrete class value
    - Binary: target value is 0 (negative class) or 1 (positive class)
        - e.g. detecting a fraudulent credit card transaction
    - Multi-class: target value is one of a set of discrete values
        - e.g. labelling the type of fruit from physical attributes
    - Multi-label: there are multiple target values (labels)
        - e.g. labelling the topics discussed on a Web page
3. For regression, the target value is *continuous* (floating point/real-valued)

4. Looking at the target value's type will guide you on what supervised learning method to use.

5. Many supervised learning methods have 'flavors' for both classification and regression

### Supervised learning methods: Overview
1. Two simple but powerful prediction algorithms:
    - K-nearest neighbors
    - Linear model fit using least-squares
2. These represent two complementary approaches to supervised learning:
    - K-nearest neighbors makes few assumptions about the structure of the data and gives potentially accurate but sometimes unstable predictions (sensitive to small changes in the training data).
    - Linear models make strong assumptions about the structure of the data and give stable but potentially inaccurate predictions.
    
#### What is a model?
It's a specific mathematical or computational description that expresses the relationship between a set of input variables and one or more outcome variables that are being studied or predicted. In statistical terms the input variables are called independent variables and the outcome variables are termed dependent variables.

In machine learning we use the term *features* to refer to the input, or independent, variables, and *target value* or *target label* to refer to the output, or dependent, variables.

Models can be used either to predict target values for new data (supervised learning) or to understand and explore the structure within a given dataset (unsupervised learning).

## Overfitting and Underfitting
### Generalization, Overfitting, and Underfitting
1. `Generalization ability` refers to an algorithm's ability to give accurate predictions for new, previously unseen data.

2. Assumptions:
    - Future unseen data (test set) will have the same properties as the current training sets.
    - Thus, models that are accurate on the training set are expected to be accurate on the test set.
    - But that may not happen if the trained model is tuned too specifically to the training set.

3. Models that are too complex for the amount of training data available are said to *overfit* and are not likely to generalize well to new examples.

4. Models that are too simple, that don't even do well on the training data, are said to *underfit* and are also not likely to generalize well.

<img src="https://img.ceclinux.org/ef/71cece8042a8c604d9ebfd411737fa527f9d26.png">

<img src="https://img.ceclinux.org/20/ccf12e2d6fe70ec1a680d0e94102cb739bb01a.png">

<img src="https://img.ceclinux.org/48/6bfeb4b589147831425eb8f9799af8c87be47b.png"> 
    - In k-nearest neighbors classification, when we decrease K, we increase the risk of overfitting because we're trying to capture very local changes in the decision boundary that may not lead to good generalization behavior for future data.
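
A rough way to see this effect is to compare training and test accuracy for several values of K. The sketch below assumes a synthetic dataset from `make_classification` with some label noise (`flip_y`); it is not the dataset shown in the slides. With K = 1 every training point is its own nearest neighbor, so training accuracy is perfect even though test accuracy is usually lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature binary problem with some label noise (flip_y).
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:2d}  train={scores[k][0]:.2f}  test={scores[k][1]:.2f}")
```

A large gap between the training and test columns is the usual symptom of overfitting.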
    

## Supervised Learning: Datasets

<img src="https://img.ceclinux.org/89/b8b328ea5f2d46790f923929495eeee13513d5.png">

<img src="https://img.ceclinux.org/15/7f3fd778c50a05cdbf0311ff81c8ba41dfbbc5.png">

<img src="https://img.ceclinux.org/24/4130d8719b033a1b9b0143b5bca905dfec355c.png">


## K-Nearest Neighbors: Classification and Regression

### k-Nearest neighbors classification
<img src="https://img.ceclinux.org/ae/1ef55f55e051d656dfc8dd3496dc2b7c0620fa.png">

<img src="https://img.ceclinux.org/d7/53bcfdea65a335cfe9ae3a0b5259dcdb9e7141.png">
    - In the two examples above, as K increases the accuracy on the training data drops a bit but the accuracy on the test data goes up a bit, indicating that the model is more effective at ignoring minor variations in the training data.
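
The prediction rule itself is short enough to write by hand. The following is a minimal sketch of k-nearest neighbors classification using Euclidean distance and a majority vote; the function name `knn_predict` and the fruit data are made up for illustration, and scikit-learn's implementation is considerably more optimized.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D training set: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["apple", "apple", "apple", "lemon", "lemon", "lemon"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # apple
print(knn_predict(points, labels, (3.9, 4.1), k=3))  # lemon
```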
    
### k-Nearest neighbors regression
<img src="https://img.ceclinux.org/ed/c6b9cb3e37b36bdfbbae2a43727caa4d601ee9.png">

The R-squared regression score
<img src="https://img.ceclinux.org/09/b896c23b05e8b12f189fbaed3037d25ccbf12c.png">
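
As a rough sketch of what the R-squared score measures, assuming the usual definition (one minus the ratio of the residual sum of squares to the total sum of squares), the helper `r_squared` below is a hypothetical name and the numbers are made up. A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean target value.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

print(r_squared(y_true, y_true))                # 1.0: perfect predictions
print(r_squared(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: always predicting the mean
```

The `score` method of scikit-learn regressors reports this same quantity.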

1. Pros and Cons of nearest neighbor approach
    - simple and easy to understand why a particular prediction is made
    - could be a reasonable baseline against which to compare the performance of more sophisticated models
    - when the training data has many instances, or each instance has lots of features, a k-nearest neighbors model becomes slow
    - so if your dataset has hundreds or thousands of features, especially if the data is sparse, you should consider alternative models
    
### KNeighborsClassifier and KNeighborsRegressor: important parameters

1. *Model complexity*
    - n_neighbors: number of nearest neighbors (k) to consider
        - default = 5
2. *Model fitting*
    - metric: distance function between data points
        - default: Minkowski distance with power parameter p=2 (Euclidean)
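
For illustration, the parameters above can be passed explicitly when constructing the estimators. The non-default values chosen for the regressor are arbitrary examples, not recommendations from the course.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Spelling out the defaults: k = 5 and Minkowski distance with p = 2 (Euclidean).
clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)

# A variant for regression: more neighbors (smoother fit) and Manhattan distance (p = 1).
reg = KNeighborsRegressor(n_neighbors=10, metric="minkowski", p=1)

print(clf.get_params()["n_neighbors"], reg.get_params()["p"])  # 5 1
```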
 

## Linear Regression: Least-Squares

### Linear Models
A linear model is a *sum of weighted variables* that predicts a target output value given an input data instance. E.g. predicting housing prices
<img src="https://img.ceclinux.org/66/e788ce762b8ea553e8d77d524db933cd3359ea.png">

### Linear Regression is an Example of a Linear Model
<img src="https://img.ceclinux.org/b8/db2255a79bcb9ea43e5026f1a0268350a21eb2.png">

<img src="https://img.ceclinux.org/5c/3be4f292bd715fd18b7a43e1c99adea3766c06.png">

### Least-squares Linear Regression ("Ordinary least-squares")
1. Finds the *w* and *b* that minimize the sum of squared differences between predicted and actual target values (the residual sum of squares, RSS), which is equivalent to minimizing the mean squared error of the linear model

2. No parameters to control model complexity -- both pro and con
<img src="https://img.ceclinux.org/4d/933ff29df41b17904125fe121fd6a21a12c6b8.png">
<img src="https://img.ceclinux.org/4b/8343b730d5ade002ca5e57a34c214cfba6cd29.png">
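
For a single feature, the least-squares *w* and *b* have a simple closed form. The sketch below assumes that one-feature case; the function names `fit_least_squares` and `rss` and the data points are made up for illustration.

```python
def fit_least_squares(xs, ys):
    """Fit y ~ w * x + b for a single feature by minimizing the RSS."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for one feature: w = cov(x, y) / var(x).
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

def rss(xs, ys, w, b):
    """Residual sum of squares of the line y = w * x + b on the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data lying near the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

w, b = fit_least_squares(xs, ys)
print(w, b, rss(xs, ys, w, b))
```

Any other line, including the "true" line with w = 2 and b = 1, has an RSS on this data that is at least as large as the fitted one.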

### How are Linear Regression Parameters *w*, *b* Estimated?
1. Parameters are estimated from training data

2. There are many different ways to estimate *w* and *b*:
    - Different methods correspond to different "fit" criteria and goals and ways to control model complexity
3. The learning algorithm finds the parameters that optimize an `objective function`, typically to minimize some kind of `loss function` of the predicted target values vs. actual target values
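
To make "optimizing an objective function" concrete, here is a minimal sketch that minimizes a mean squared error loss with plain gradient descent. Gradient descent is just one possible optimizer, chosen here for illustration; it is not necessarily the solver used in the course or by scikit-learn, and the learning rate and step count are arbitrary assumptions that happen to work for this tiny dataset.

```python
def mse(xs, ys, w, b):
    """Loss function: mean squared error of the line y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Estimate w and b by repeatedly stepping against the gradient of the MSE."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4), mse(xs, ys, w, b))
```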

### Least-Squares Linear Regression in Scikit-Learn
<img src="https://img.ceclinux.org/fe/cc874229fd0bb6a4ca9cbfd7b38f914610eb8f.png">
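
A minimal sketch of the same workflow in code, assuming synthetic data rather than the dataset in the slide: after fitting, the learned *w* is available as `coef_` and the learned *b* as `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): y = 3 * x0 - 2 * x1 + 5 plus a little noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

print("w (coef_):     ", linreg.coef_)
print("b (intercept_):", linreg.intercept_)
print("R-squared (training):", linreg.score(X_train, y_train))
print("R-squared (test):    ", linreg.score(X_test, y_test))
```

The trailing underscore in `coef_` and `intercept_` is scikit-learn's convention for attributes that are estimated from the training data.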

## Linear Regression: Ridge, Lasso, and Polynomial Regression

### Ridge Regression
1. Ridge regression learns *w*, *b* using the same least-squares criterion but adds a penalty for large variations in *w* parameters
    - $RSS_{RIDGE}(w, b) = \sum_{i=1}^{N} (y_i - (w \times x_i + b))^2 + \alpha \sum_{j=1}^{p} w_j^2$
2. Once the parameters are learned, the ridge regression **prediction** formula is the **same** as ordinary least-squares

3. The addition of a parameter penalty is called **regularization**. Regularization prevents overfitting by restricting the model, typically to reduce its complexity.

4. Ridge regression uses **L2 regularization**: minimize sum of squares of *w* entries

5. The influence of the regularization term is controlled by the $\alpha$ parameter. Default of $\alpha$ is 1. Setting it to zero corresponds to ordinary least-squares linear regression

6. Higher alpha means more regularization and simpler models
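
The effect of alpha can be checked directly by looking at the size of the learned weights. The sketch below assumes a small synthetic dataset with few instances relative to the number of features; the specific alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption): few instances relative to the number of features.
rng = np.random.RandomState(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] * 4.0 + X[:, 1] * -3.0 + rng.normal(scale=1.0, size=30)

coef_norms = {}
for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    # The L2 penalty shrinks the weight vector as alpha grows.
    coef_norms[alpha] = np.sum(ridge.coef_ ** 2)
    print(f"alpha={alpha:<6}  sum of squared weights = {coef_norms[alpha]:.3f}")
```

The sum of squared weights printed for each alpha is exactly the quantity the L2 penalty term multiplies.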

### The Need for Feature Normalization
1. It is important for some machine learning methods that all features are on the same scale (e.g. faster convergence in learning, more uniform or 'fair' influence for all weights)
    - e.g. regularized regression, k-NN, support vector machines, neural networks, ...
2. Can also depend on the data. More on feature engineering later in the course. For now, we do MinMax scaling of the features:
    - for each feature $x_i$: compute the min value $x_i^{MIN}$ and the max value $x_i^{MAX}$ achieved across all instances in the training set.
    - for each feature: transform a given feature $x_i$ value to a scaled version $x_i^{'}$ using the formula $x_i^{'} = (x_i - x_i^{MIN}) / (x_i^{MAX} - x_i^{MIN})$
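
As a small worked example of MinMax scaling, assuming the usual definition (subtract the feature's minimum, then divide by its range), the helper `min_max_scale` and the values below are made up for illustration.

```python
def min_max_scale(values):
    """Scale one feature column to [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    # Assumes the feature is not constant (hi != lo), otherwise this divides by zero.
    return [(v - lo) / (hi - lo) for v in values]

# Made-up feature column, e.g. heights in centimeters.
heights = [150.0, 160.0, 175.0, 200.0]
print(min_max_scale(heights))  # [0.0, 0.2, 0.5, 1.0]
```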

### Feature Normalization: The test set must use identical scaling to the training set
1. Fit the scaler using the training set, then apply the same scaler to transform the test set.

2. Do not scale the training and test sets using different scalers: this could lead to random skew in the data

3. Do not fit the scaler using any part of the test data: referencing the test data can lead to a form of *data leakage*
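
A minimal sketch of these rules with scikit-learn's `MinMaxScaler`, using made-up single-feature data: the scaler is fit on the training set only, and the same fitted scaler transforms the test set. Test values outside the training range can therefore land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the test set contains a value above the training max.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print(X_train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(X_test_scaled.ravel())   # values 0.25, 1.5
```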

*Regularization works especially well when you have relatively small amounts of training data compared to the number of features in the model. Regularization becomes less important as the amount of training data increases.*

### Lasso Regression: another form of regularized linear regression that uses an **L1 regularization** penalty for training (instead of Ridge's L2 penalty)
1. L1 penalty: minimize the sum of the **absolute values** of the coefficients
    - $RSS_{LASSO}(w, b) = \sum_{i=1}^{N} (y_i - (w \times x_i + b))^2 + \alpha \sum_{j=1}^{p} \left| w_j \right|$
2. This has the effect of setting parameter weights in *w* to **zero** for the least influential variables. This is called a **sparse** solution: a kind of feature selection

3. The parameter $\alpha$ controls amount of L1 regularization (default = 1.0)

4. The prediction formula is the same as ordinary least-squares

5. When to use Ridge or Lasso regression:
    - Many small/medium sized effects: use Ridge
    - Only a few variables with medium/large effect: use Lasso
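
The sparsity effect can be seen by counting exact zeros among the learned weights. The sketch below assumes synthetic data in which only two of ten features influence the target, and an arbitrary alpha of 0.5. Note that scikit-learn scales the squared-error term of its Lasso objective by 1/(2N), so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption): only the first 2 of 10 features affect the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Lasso drives uninformative weights to exactly zero; Ridge only shrinks them.
print("Lasso weights:", np.round(lasso.coef_, 2))
print("Lasso zero weights:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero weights:", int(np.sum(ridge.coef_ == 0)))
```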

### Polynomial Features with Linear Regression
$x= (x_0, x_1)$ --> $x^{'}= (x_0, x_1, x_0^2, x_0x_1, x_1^2)$

$\hat y = \hat w_0x_0 + \hat w_1x_1 + \hat w_{00}x_0^2 + \hat w_{01}x_0x_1 + \hat w_{11}x_1^2 + b$

1. Generate new features consisting of all polynomial combinations of the original two features $(x_0, x_1)$

2. The *degree* of the polynomial specifies how many variables participate at a time in each new feature (above example: degree 2)

3. This is still a weighted linear combination of features, so it's **still a linear model**, and can use the same least-squares estimation method for *w* and *b*

<img src="https://img.ceclinux.org/93/0715603fa87e8c4617a7449f8e316d3dbed129.png">
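
The degree-2 expansion above can be reproduced with scikit-learn's `PolynomialFeatures`. This is a minimal sketch on a single made-up instance; `include_bias=False` is chosen here so the output matches the five features listed above, without an extra constant column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One instance with two features: x0 = 2, x1 = 3.
X = np.array([[2.0, 3.0]])

# include_bias=False drops the constant column of ones.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly)  # columns: x0, x1, x0^2, x0*x1, x1^2 -> 2, 3, 4, 6, 9
```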

### Polynomial Features with Linear Regression
1. Why would we want to transform our data this way?
    - To capture interactions between the original features by adding them as features to the linear model
    - To make a classification problem easier
2. More generally, we can apply other non-linear transformations to create new features
    - Technically, these are called non-linear basis functions
3. Beware of polynomial feature expansion with high degree, as this can lead to complex models that overfit
    - Thus polynomial feature expansion is often combined with a regularized learning method like Ridge regression
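
One way to combine the two, sketched under assumptions: the pipeline below chains polynomial expansion, MinMax scaling, and Ridge regression. The synthetic target depends only on the interaction $x_0x_1$, which a linear model on the original features cannot capture; the degree and alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic data (an assumption): the target depends on an interaction x0 * x1.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=100)

# Expand to degree-2 features, rescale them, then fit a regularized linear model.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      MinMaxScaler(), Ridge(alpha=0.1))
model.fit(X, y)

plain = Ridge(alpha=0.1).fit(X, y)
print("R-squared with polynomial features:", model.score(X, y))
print("R-squared with original features:  ", plain.score(X, y))
```

Both scores are computed on the training data, so they show what the features can express rather than how well either model generalizes.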