# Module 2: Supervised Machine Learning

## Introduction to Supervised Machine Learning
`Suggested: a more detailed, lower-level ML course by Andrew Ng, also on Coursera`

### Learning Objectives
1. understand how a number of different supervised learning algorithms learn by estimating their parameters from data to make new predictions

2. understand the strengths and weaknesses of particular supervised learning methods

3. learn how to apply specific supervised machine learning algorithms in Python with scikit-learn

4. learn about general principles of supervised machine learning, like overfitting and how to avoid it

### Review of important terms
1. feature representation
    - convert an object into a numeric representation (feature values) that a computer can process
2. data instances/samples/examples (X)
    - one row of variables, or one representation of an object instance
3. target value
    - the label of an object, typically assigned by a human
4. training and test sets
    - typical split: training set / test set = 75% / 25% (the default in scikit-learn's `train_test_split`)
5. model/estimator
    - model fitting produces a 'trained model'
    - training is the process of estimating model parameters
6. evaluation method
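
The terms above can be tied together in a minimal sketch. The Iris dataset and the choice of `KNeighborsClassifier` are assumptions made for illustration; any feature matrix `X` and target vector `y` would follow the same split, fit, evaluate pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Feature representation: X has one row per instance, y holds one target value per row.
X, y = load_iris(return_X_y=True)

# With no test_size given, train_test_split holds out 25% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model fitting: estimate the model from the training set only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Evaluation method: accuracy on the held-out test set.
test_accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, round(test_accuracy, 3))
```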

### Classification and Regression
1. Both classification and regression take a set of training instances and learn a mapping to a *target value*

2. For classification, the target value is a discrete class value
    - Binary: target value is 0 (negative class) or 1 (positive class)
        - e.g. detecting a fraudulent credit card transaction
    - Multi-class: target value is one of a set of discrete values
        - e.g. labelling the type of fruit from physical attributes
    - Multi-label: there are multiple target values (labels)
        - e.g. labelling the topics discussed on a Web page
3. For regression, the target value is *continuous* (floating point/real-valued)

4. Looking at the target value's type will guide you on what supervised learning method to use.

5. Many supervised learning methods have 'flavors' for both classification and regression

### Supervised learning methods: Overview
1. Two simple but powerful prediction algorithms:
    - K-nearest neighbors
    - Linear model fit using least-squares
2. These represent two complementary approaches to supervised learning:
    - K-nearest neighbors makes few assumptions about the structure of the data and gives potentially accurate but sometimes unstable predictions (sensitive to small changes in the training data).
    - Linear models make strong assumptions about the structure of the data and give stable but potentially inaccurate predictions.
    
#### What is a model?
It's a specific mathematical or computational description that expresses the relationship between a set of input variables and one or more outcome variables that are being studied or predicted. In statistical terms the input variables are called independent variables and the outcome variables are termed dependent variables.

In machine learning we use the term *features* for the input (independent) variables, and *target value* or *target label* for the output (dependent) variables.

Models can be used either to predict an outcome from labelled examples (supervised learning) or to understand and explore the structure within a given dataset (unsupervised learning).

## Overfitting and Underfitting
### Generalization, Overfitting, and Underfitting
1. `Generalization ability` refers to an algorithm's ability to give accurate predictions for new, previously unseen data.

2. Assumptions:
    - Future unseen data (test set) will have the same properties as the current training sets.
    - Thus, models that are accurate on the training set are expected to be accurate on the test set.
    - But that may not happen if the trained model is tuned too specifically to the training set.

3. Models that are too complex for the amount of training data available are said to *overfit* and are not likely to generalize well to new examples.

4. Models that are too simple, that don't even do well on the training data, are said to *underfit* and are also not likely to generalize well.

<img src="https://img.ceclinux.org/ef/71cece8042a8c604d9ebfd411737fa527f9d26.png">

<img src="https://img.ceclinux.org/20/ccf12e2d6fe70ec1a680d0e94102cb739bb01a.png">

<img src="https://img.ceclinux.org/48/6bfeb4b589147831425eb8f9799af8c87be47b.png"> 
    - In k-nearest neighbors classification, decreasing K increases the risk of overfitting, because the model tries to capture very local changes in the decision boundary that may not generalize well to future data.
    

## Supervised Learning: Datasets

<img src="https://img.ceclinux.org/89/b8b328ea5f2d46790f923929495eeee13513d5.png">

<img src="https://img.ceclinux.org/15/7f3fd778c50a05cdbf0311ff81c8ba41dfbbc5.png">

<img src="https://img.ceclinux.org/24/4130d8719b033a1b9b0143b5bca905dfec355c.png">


## K-Nearest Neighbors: Classification and Regression

### k-Nearest neighbors classification
<img src="https://img.ceclinux.org/ae/1ef55f55e051d656dfc8dd3496dc2b7c0620fa.png">

<img src="https://img.ceclinux.org/d7/53bcfdea65a335cfe9ae3a0b5259dcdb9e7141.png">
    - In the two examples above, as K increases the accuracy on the training data drops a bit while the accuracy on the test data goes up a bit, indicating that the model is more effective at ignoring minor variations in the training data.
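
The effect of K on training and test accuracy can be explored with a short sketch. The synthetic dataset from `make_classification` (with 10% label noise) and the particular values of K are assumptions chosen for illustration, not the data shown in the figures.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-feature binary problem with some label noise (flip_y).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a very local model (k=1) with smoother ones (larger k).
results = {}
for k in (1, 11, 55):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"k={k:2d}  train accuracy={results[k][0]:.2f}  test accuracy={results[k][1]:.2f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect; whether the test accuracy improves with larger k depends on the data.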
    
### k-Nearest neighbors regression
<img src="https://img.ceclinux.org/ed/c6b9cb3e37b36bdfbbae2a43727caa4d601ee9.png">

The R-squared regression score
<img src="https://img.ceclinux.org/09/b896c23b05e8b12f189fbaed3037d25ccbf12c.png">

1. Pros and Cons of nearest neighbor approach
    - simple and easy to understand why a particular prediction is made
    - could be a reasonable baseline against which to compare the performance of more sophisticated models
    - when the training data has many instances, or each instance has many features, k-nearest neighbors prediction becomes slow
    - so if your dataset has hundreds or thousands of features, especially if the data is sparse, you should consider alternative models
    
### KNeighborsClassifier and KNeighborsRegressor: important parameters

1. *Model complexity*
    - n_neighbors: number of nearest neighbors(k) to consider 
        - default = 5
2. *Model fitting*
    - metric: distance function between data points
        - default: Minkowski distance with power parameter p=2 (Euclidean)
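
As a sketch of how these parameters are passed, the example below sets `n_neighbors`, `metric` and `p` explicitly on a `KNeighborsRegressor` and checks the R-squared score by hand. The noisy sine data is an assumption for illustration; the manual formula is the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D regression problem: a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# These are the documented defaults, written out explicitly.
reg = KNeighborsRegressor(n_neighbors=5, metric="minkowski", p=2)
reg.fit(X_train, y_train)

# R-squared computed manually from the residuals, then via score().
y_pred = reg.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = reg.score(X_test, y_test)
print(round(r2_manual, 4), round(r2_sklearn, 4))
```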
 

## Linear Regression: Least-Squares

### Linear Models
A linear model is a *sum of weighted variables* that predicts a target output value given an input data instance. E.g. predicting housing prices
<img src="https://img.ceclinux.org/66/e788ce762b8ea553e8d77d524db933cd3359ea.png">

### Linear Regression is an Example of a Linear Model
<img src="https://img.ceclinux.org/b8/db2255a79bcb9ea43e5026f1a0268350a21eb2.png">

<img src="https://img.ceclinux.org/5c/3be4f292bd715fd18b7a43e1c99adea3766c06.png">

### Least-squares Linear Regression ("Ordinary least-squares")
1. Finds the w and b that minimize the sum of squared differences between predicted and actual target values (the residual sum of squares, RSS); this is equivalent to minimizing the mean squared error of the linear model

2. No parameters to control model complexity -- both pro and con
<img src="https://img.ceclinux.org/4d/933ff29df41b17904125fe121fd6a21a12c6b8.png">
<img src="https://img.ceclinux.org/4b/8343b730d5ade002ca5e57a34c214cfba6cd29.png">

### How are Linear Regression Parameters *w*, *b* Estimated?
1. Parameters are estimated from training data

2. There are many different ways to estimate *w* and *b*:
    - Different methods correspond to different "fit" criteria and goals and ways to control model complexity
3. The learning algorithm finds the parameters that optimize an `objective function`, typically to minimize some kind of `loss function` of the predicted target values vs. actual target values

### Least-Squares Linear Regression in Scikit-Learn
<img src="https://img.ceclinux.org/fe/cc874229fd0bb6a4ca9cbfd7b38f914610eb8f.png">
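
A minimal sketch of least-squares fitting with `LinearRegression` follows. The noise-free line $y = 3x + 2$ is an assumption chosen so that the recovered *w* (`coef_`) and *b* (`intercept_`) are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 3x + 2.
rng = np.random.RandomState(0)
X = rng.uniform(-5, 5, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0

# fit() estimates w (coef_) and b (intercept_) by ordinary least squares.
linreg = LinearRegression().fit(X, y)
print("w:", linreg.coef_, "b:", linreg.intercept_)
print("R-squared on the training data:", linreg.score(X, y))
```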

## Linear Regression: Ridge, Lasso, and Polynomial Regression

### Ridge Regression
1. Ridge regression learns *w*, *b* using the same least-squares criterion but adds a penalty for large variations in *w* parameters
    - $ RSS_{RIDGE}(w, b) = \sum_{i=1}^{N} (y_i - (w \times x_i + b))^2 + \alpha \sum_{j=1}^{p} w_j^2$
2. Once the parameters are learned, the ridge regression **prediction** formula is the **same** as ordinary least-squares

3. The addition of a parameter penalty is called **regularization**. Regularization prevents overfitting by restricting the model, typically to reduce its complexity.

4. Ridge regression uses **L2 regularization**: minimize sum of squares of *w* entries

5. The influence of the regularization term is controlled by the $\alpha$ parameter. Default of $\alpha$ is 1. Setting it to zero corresponds to ordinary least-squares linear regression

6. Higher alpha means more regularization and simpler models
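
The influence of $\alpha$ can be seen by fitting `Ridge` with several values and comparing the size of the learned weight vector. The synthetic data and the three alpha values below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical regression problem: 40 instances, 10 features, noisy linear target.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=0.5, size=40)

# Larger alpha means a stronger L2 penalty, which shrinks the weights toward zero.
norms = {}
for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = np.linalg.norm(ridge.coef_)
    print(f"alpha={alpha:6.2f}  ||w|| = {norms[alpha]:.3f}")
```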

### The Need for Feature Normalization
1. Important for some machine learning methods that all features are on the same scale (e.g. faster convergence in learning, more uniform or 'fair' in influence for all weights)
    - e.g. regularized regression, k-NN, support vector machines, neural networks, ...
2. Can also depend on the data. More on feature engineering later in the course. For now, we do MinMax scaling of the features:
    - for each feature $x_i$: compute the min value $x_i^{MIN}$ and the max value $x_i^{MAX}$ achieved across all instances in the training set.
    - for each feature: transform a given feature value $x_i$ to a scaled version $x_i^{'}$ using the formula $x_i^{'} = (x_i - x_i^{MIN}) / (x_i^{MAX} - x_i^{MIN})$

### Feature Normalization: The test set must use identical scaling to the training set
1. Fit the scaler using the training set, then apply the same scaler to transform the test set.

2. Do not scale the training and test sets using different scalers: this could lead to random skew in the data

3. Do not fit the scaler using any part of the test data: referencing the test data can lead to a form of *data leakage*
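
The three rules above can be sketched with `MinMaxScaler`: fit on the training set, then reuse the same fitted scaler on the test set. The two-feature random data (with very different ranges) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales: [0, 1] and [100, 1000].
rng = np.random.RandomState(0)
X = rng.uniform(low=[0.0, 100.0], high=[1.0, 1000.0], size=(100, 2))
X_train, X_test = train_test_split(X, random_state=0)

# Fit the scaler on the training set only...
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...and apply that same scaler to the test set (no refitting).
X_test_scaled = scaler.transform(X_test)

# The same transform written out: (x - min) / (max - min), using training-set min and max.
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
manual = (X_train - x_min) / (x_max - x_min)
print(np.allclose(manual, X_train_scaled))
```

Because the minimum and maximum come from the training set, scaled test values can fall slightly outside the range 0 to 1; that is expected.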

*Regularization works especially well when you have relatively small amounts of training data compared to the number of features in the model. Regularization becomes less important as the amount of training data increases.*

### Lasso Regression: another form of regularized linear regression that uses an **L1 regularization** penalty for training (instead of Ridge's L2 penalty)
1. L1 penalty: minimize the sum of the **absolute values** of the coefficients
    - $ RSS_{LASSO}(w, b) = \sum_{i=1}^{N} (y_i - (w \times x_i + b))^2 + \alpha \sum_{j=1}^{p} \left| w_j\right |$
2. This has the effect of setting parameter weights in *w* to **zero** for the least influential variables. This is called a **sparse** solution: a kind of feature selection

3. The parameter $\alpha$ controls amount of L1 regularization (default = 1.0)

4. The prediction formula is the same as ordinary least-squares

5. When to use Ridge or Lasso regression:
    - many small/medium sized effects: use Ridge
    - only a few variables with medium/large effect: use Lasso
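
The sparse solution produced by the L1 penalty can be illustrated with a small sketch. The synthetic data, in which only the first two of ten features influence the target, and the value `alpha=0.1` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only features 0 and 1 affect the target; the other 8 are noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)

# Weights of uninformative features are driven to exactly zero.
n_zero = int(np.sum(lasso.coef_ == 0))
print("coefficients:", np.round(lasso.coef_, 3))
print("number of zero weights:", n_zero)
```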

### Polynomial Features with Linear Regression
$x= (x_0, x_1)$ --> $x^{'}= (x_0, x_1, x_0^2, x_0x_1, x_1^2)$

$\hat y = \hat w_0x_0 + \hat w_1x_1 + \hat w_{00}x_0^2 + \hat w_{01}x_0x_1 + \hat w_{11}x_1^2 + b$

1. Generate new features consisting of all polynomial combinations of the original two features $(x_0, x_1)$

2. The *degree* of the polynomial specifies how many variables participate at a time in each new feature (above example: degree 2)

3. This is still a weighted linear combination of features, so it's **still a linear model**, and can use the same least-squares estimation method for *w* and *b*
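
The degree-2 expansion above can be reproduced with `PolynomialFeatures`. The single input row `(2, 3)` is an assumption chosen so that each new column is easy to verify by hand; `include_bias=False` drops the constant column so the output matches $x^{'}$ as written.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One instance with x0 = 2 and x1 = 3.
X = np.array([[2.0, 3.0]])

# Degree-2 expansion without the constant (bias) column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Columns are ordered x0, x1, x0^2, x0*x1, x1^2.
print(X_poly)
```

The expanded matrix can then be passed to any linear model, such as `LinearRegression` or `Ridge`, in place of the original features.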

<img src="https://img.ceclinux.org/93/0715603fa87e8c4617a7449f8e316d3dbed129.png">

### Polynomial Features with Linear Regression
1. Why would we want to transform our data this way?
    - To capture interactions between the original features by adding them as features to the linear model
    - To make a classification problem easier
2. More generally, we can apply other non-linear transformations to create new features
    - Technically, these are called non-linear basis functions
3. Beware of polynomial feature expansion with high degree, as this can lead to complex models that overfit
    - Therefore, polynomial feature expansion is often combined with a regularized learning method like Ridge regression

## Logistic Regression
<img src="https://img.ceclinux.org/f8/33ad9a4e2342d4795598c6ad820a789b747ade.png">
<img src="https://img.ceclinux.org/d8/d3db156fe37554edf73014261f729d5ef55ca7.png">
<img src="https://img.ceclinux.org/e6/95884b87d69a7eae2ebec6f52315f4c3edf82f.png">
<img src="https://img.ceclinux.org/36/ce48434d5b467ff8e7744b6e2cad8f25d0d921.png">
<img src="https://img.ceclinux.org/8b/995cc768612c0336dd9334e8e43bdf92562972.png">
<img src="https://img.ceclinux.org/89/2439f9857f840d4ae6770c304530da4e08ef50.png">
<img src="https://img.ceclinux.org/11/e17cb99cc4fd015605d5947c4a929277210398.png">

### Logistic Regression: Regularization
1. L2 regularization is 'on' by default (like Ridge regression)

2. Parameter C controls the amount of regularization (default 1.0). For both logistic regression and linear support vector machines, higher values of C correspond to less regularization

3. As with regularized linear regression, it can be important to normalize all features so that they are on the same scale.
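
A short sketch of how C affects `LogisticRegression`: smaller C means stronger L2 regularization and therefore smaller weights. The synthetic dataset and the three values of C are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary classification problem with 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Higher C = less regularization = weights are free to grow larger.
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C:6.2f}  ||w|| = {coef_norms[C]:.3f}  train accuracy = {clf.score(X, y):.2f}")
```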

<img src="https://img.ceclinux.org/92/51b7e069e7aca3db23f0f5f3091051578651af.png">

## Linear Classifiers: Support Vector Machines
<img src="https://img.ceclinux.org/45/66990376c9795ff0147edde5d0abca4abee659.png">
<img src="https://img.ceclinux.org/20/1f6e6c8d619be4b5adbcc97db919f2c5aeea46.png">

### Classifier Margin
Defined as the maximum width the decision boundary area can be increased before hitting a data point
<img src="https://img.ceclinux.org/c5/fc19185c8acfa2f6bcc9e4c7e9e7480d0e2d19.png">
<img src="https://img.ceclinux.org/e1/f7ac6b2048b1291faf309557519077d5bca48b.png">

### Maximum Margin Linear Classifier: Linear Support Vector Machine
Maximum margin classifier: the linear classifier with maximum margin is a linear Support Vector Machine (LSVM)

### Regularization for SVMs: the C parameter
1. The strength of regularization is determined by C

2. Larger values of C: less regularization 
    - fit the training data as well as possible
    - each individual data point is important to classify correctly

3. Smaller values of C: more regularization
    - more tolerant of errors on individual data points
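
The relationship between C and the margin can be sketched with a linear-kernel `SVC`, whose margin width is $2 / \lVert w \rVert$. The two overlapping Gaussian clusters and the two values of C are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two overlapping 2-D clusters of 50 points each.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 1.5, size=(50, 2)),
    rng.normal(2.0, 1.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Small C: more regularization, wider margin, more tolerance of individual errors.
# Large C: less regularization, narrower margin, tries to fit every point.
margins = {}
for C in (0.001, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(svm.coef_)
    print(f"C={C:7.3f}  margin width = {margins[C]:.2f}  support vectors = {svm.n_support_.sum()}")
```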

### Linear Models: Pros and Cons
#### Pros:
1. simple and easy to train

2. fast prediction

3. scales well to very large datasets

4. works well with sparse data

5. reasons for prediction are relatively easy to interpret

#### Cons:
1. for lower-dimensional data, other models may have superior generalization performance

2. for classification, data may not be linearly separable

## Multi-Class Classification

<img src="https://img.ceclinux.org/c7/9e2f2d41903ef94e5568a9e7147918bb60b09b.png">

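Basically, scikit-learn trains one binary classifier per class (each class against all the others). To predict, it runs every binary classifier on the instance and assigns the class whose classifier gives the highest score.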
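
The one-classifier-per-class behaviour can be inspected directly. The sketch below uses `LinearSVC` on the Iris dataset; both choices are assumptions for illustration rather than the dataset shown in the figure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# Three classes, four features.
X, y = load_iris(return_X_y=True)
clf = LinearSVC(C=1.0, max_iter=10000, random_state=0).fit(X, y)

# One weight vector and one intercept per class (one-vs-rest).
print(clf.coef_.shape, clf.intercept_.shape)

# Each row holds the three binary classifier scores for one instance;
# the predicted class is the one with the highest score.
scores = clf.decision_function(X[:5])
predicted = clf.classes_[np.argmax(scores, axis=1)]
print(predicted, clf.predict(X[:5]))
```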

## Kernelized Support Vector Machines

Linear support vector classifiers can effectively find a decision boundary with maximum margin. But how about more complex binary classification problems?
<img src="https://img.ceclinux.org/43/4549378159c95782c187b1ca5a95f3c743bfe0.png">

What kernelized SVMs do: they take the original input data space and transform it to a new higher dimensional feature space, where it becomes much easier to classify the transformed data using a linear classifier.

<img src="https://img.ceclinux.org/f1/5dc635085f4f1cffb9de9a54d35b434651fe0c.png">
<img src="https://img.ceclinux.org/29/bb4489b7efe53d574204db8814218a541d0ff8.png">
<img src="https://img.ceclinux.org/b6/3d76b30545a0c3622ba3f192278a6de78e256d.png">
<img src="https://img.ceclinux.org/1f/5b2eae27bf6b68de0ad11ee9613416823756e5.png">
<img src="https://img.ceclinux.org/c2/f5881a9a95f9b0477e324060c69ac47b4d7e43.png">
<img src="https://img.ceclinux.org/89/9c61ee6e677ea227c91e423d4f909dfb3eff20.png">
<img src="https://img.ceclinux.org/08/36a2d959a0f59c7a3883bc460248ea81dbcd2d.png">
<img src="https://img.ceclinux.org/3a/3bc7ee0837c10aee5484de7579ab2cc01972ed.png">

### Radial Basis Function Kernel
<img src="https://img.ceclinux.org/8a/1621f4aaae9581f6534200c227bb2b8b64b671.png"> 

"Now, one of the mathematically remarkable things about kernelized support vector machines, something referred to as the kernel trick, is that internally, the algorithm doesn't have to perform this actual transformation on the data points to the new high dimensional feature space. Instead, the kernelized SVM can compute these more complex decision boundaries just in terms of similarity calculations between pairs of points in the high dimensional space where the transformed feature representation is implicit." 
<img src="https://img.ceclinux.org/ee/1f0e365bf03f4c7ef5452e2565253f3d05fbbd.png">

### Radial Basis Function Kernel: Gamma Parameter
Gamma controls how far the influence of a single training example reaches, which in turn affects how tightly the decision boundaries end up surrounding points in the input space.

<img src="https://img.ceclinux.org/b1/84fcb7df2d691124917e9916ee0d8947239eea.png">

Small gamma means a larger similarity radius: points farther apart are considered similar. Thus, more points are grouped together and the decision boundaries are smoother.

It is important to normalize the data!
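
The role of gamma can be checked numerically. The sketch below assumes the standard RBF kernel definition $K(x, x^{'}) = \exp(-\gamma \lVert x - x^{'} \rVert^2)$; the helper `rbf_similarity` and the two example points are hypothetical, and `rbf_kernel` from scikit-learn is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_similarity(x, x_prime, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

# Two points at squared distance 1^2 + 2^2 = 5.
a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])

# Small gamma: distant points still count as similar (large radius).
# Large gamma: similarity falls off quickly (small radius).
for gamma in (0.01, 1.0, 10.0):
    print(f"gamma={gamma:5.2f}  similarity = {rbf_similarity(a, b, gamma):.6f}")

# Cross-check against scikit-learn's pairwise implementation.
sk_value = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=1.0)[0, 0]
```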


### Kernelized Support Vector Machines: pros and cons
#### Pros:
1. Can perform well on a range of datasets

2. Versatile: different kernel functions can be specified, or custom kernels can be defined for specific data types

3. Works well for both low- and high-dimensional data

#### Cons:
1. Efficiency (runtime speed and memory usage) decreases as training set size increases (e.g. over 50,000 samples)

2. Needs careful normalization of input data and parameter tuning

3. Does not provide direct probability estimates (but they can be estimated using e.g. Platt scaling)

4. Difficult to interpret why a prediction was made

### Kernelized Support Vector Machines (SVC): Important parameters
#### **Model complexity**
1. Kernel: Type of kernel function to be used
    - default = 'rbf' for radial basis function
    - other types include 'poly' (polynomial)
2. kernel parameters 
    - gamma ($\gamma$): RBF kernel width
3. C: regularization parameter

4. Typically C and gamma are tuned at the same time
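
One way to tune C and gamma together is a grid search over both. This is a sketch under assumptions: `GridSearchCV`, the `Pipeline` wrapper, the Iris dataset and the grid values are not part of the notes above and are used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])

# Hypothetical grid: every combination of C and gamma is evaluated.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", round(search.score(X_test, y_test), 3))
```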

## Cross-Validation
Cross-validation goes beyond evaluating a single model using a single Train/Test split of the data by using multiple Train/Test splits, each of which is used to train and evaluate a separate model.

<img src="https://img.ceclinux.org/15/e0754306c962190aeb215806f300fd96f3ba1f.png">
<img src="https://img.ceclinux.org/39/c6773d29c8fc971bdc569ac12e231a4591b8b0.png">
<img src="https://img.ceclinux.org/f5/8a607ca9adc89b808a40da5bc3c15aa820dc91.png">
<img src="https://img.ceclinux.org/df/5851222d6d18b9b5c1c129e807aec4fb465793.png">
<img src="https://img.ceclinux.org/74/5c4aa11dd3453caf373005c0b691fa6e9da1fd.png">

Note: cross-validation is used to evaluate a model, not to learn or tune a new model.

### A note on performing cross-validation for more advanced scenarios
In some cases (e.g. when feature values have very different ranges), we've seen the need to scale or normalize the training and test sets before use with a classifier. The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately. To do this, the easiest way in scikit-learn is to use pipelines. While these are beyond the scope of this course, further information is available in the scikit-learn documentation here:

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

or the Pipeline section in the recommended textbook: Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (O'Reilly Media).
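
A minimal sketch of cross-validation with per-fold scaling, assuming `make_pipeline` with a `MinMaxScaler` followed by an RBF `SVC` on the Iris dataset (all chosen here for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The pipeline refits the scaler on the training part of every fold,
# so no information from the held-out fold leaks into the scaling.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0))

# Five folds: five separate models are trained and evaluated.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```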

## Decision Trees
Decision trees learn a series of explicit if-then rules on feature values that result in a decision that predicts the target value.

<img src="https://img.ceclinux.org/65/d305b77a93bb3296aaedcb6ba1f32f5a7d421b.png">
<img src="https://img.ceclinux.org/a6/3c808285e323c6e46c5e3453c78ad8e301bc09.png">

### The Iris Dataset
<img src="https://img.ceclinux.org/a0/fa92d3e432d76f15fcfc1d8a756cda1c76a057.png">
<img src="https://img.ceclinux.org/e9/f560bf57687738f3b3dc02521c8b7dd8baa0ce.png">

A best split should produce as homogeneous a set of classes as possible. 

<img src="https://img.ceclinux.org/10/ec127f92e167fc61f3e8b7a419f57af54f466b.png">

<img src="https://img.ceclinux.org/3c/341735b012c3904010f8b7ab5f1fdb74b4162c.png">
<img src="https://img.ceclinux.org/fe/787a8fa19ef5e53690cdd567572f60d4597a6b.png">

Can also use plot_decision_tree() function in adspy_shared_utilities.py code to visualize the decision tree.

### Feature Importance: How important is a feature to overall prediction accuracy?
1. A number between 0 and 1 assigned to each feature

2. Feature importance of 0 --> the feature was not used in prediction

3. Feature importance of 1 --> the feature predicts the target perfectly

4. All feature importances are normalized to sum to 1.

Note 1. In scikit-learn, it is called feature\_importances_ (the underscore at the end of the name indicates a property of the object that is set as a result of fitting the model, not a user-defined property).

Note 2. If a feature has a low feature importance value, that doesn't necessarily mean that the feature is not important for prediction. It simply means that the particular feature wasn't chosen at an early level of the tree, and this could be because the feature may be identical or highly correlated with another informative feature and so doesn't provide any new additional signal for prediction.
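
Feature importances can be read off a fitted tree as follows. The choice of `max_depth=3` and fitting on the full Iris dataset are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# feature_importances_ is set by fit(); one value per feature, summing to 1.
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name:20s} {importance:.3f}")
print("sum:", tree.feature_importances_.sum())
```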

### Decision Trees: Pros and Cons
#### Pros:
1. Easily visualized and interpreted

2. No feature normalization or scaling typically needed

3. Work well with datasets using a mixture of feature types (continuous, categorical, binary)

#### Cons:
1. Even after tuning, decision trees can often still overfit

2. Usually need an ensemble of trees for better generalization performance

### Decision Trees: DecisionTreeClassifier Key Parameters
1. **max_depth**: controls maximum depth (number of split points). Most common way to reduce tree complexity and overfitting.

2. **min_samples_leaf**: threshold for the minimum number of data instances a leaf can have to avoid further splitting

3. **max_leaf_nodes**: limits total number of leaves in the tree

4. In practice, adjusting only one of these (e.g. max_depth) is enough to reduce overfitting. 
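
The effect of `max_depth` can be sketched by comparing a fully grown tree with a depth-limited one. The Iris dataset, the train/test split and the depth limit of 3 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No limits: the tree keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting max_depth restricts the tree's complexity.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in (("full", full_tree), ("max_depth=3", shallow_tree)):
    print(
        f"{name:12s} depth={model.get_depth()}  "
        f"train accuracy={model.score(X_train, y_train):.2f}  "
        f"test accuracy={model.score(X_test, y_test):.2f}"
    )
```

A large gap between training and test accuracy for the full tree would be a sign of overfitting; on a dataset as small and clean as Iris the gap may be small.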
