
### How to decide which algo performs better for our given dataset?
* divide data into training and validation
* train models and build confusion matrix

#### True Positives
* model correctly (truly) predicted as 1

#### True Negatives
* model correctly (truly) predicted as 0

#### False Positives
* model incorrectly (falsely) predicted as 1 (prediction 1 is incorrect)

#### False Negatives
* model incorrectly (falsely) predicted as 0 (prediction 0 is incorrect)

#### Positives
* model predicted as 1

#### Negatives
* model predicted as 0

#### True
* model correctly predicted

#### False
* model incorrectly predicted

#### Sensitivity = Recall = (True Positive Rate)
* what percentage of examples with label 1 were correctly classified by the model
* proportion of correctly classified label 1 examples wrt number of 1s in the data
* $\frac{TP}{TP + FN}$
* model with high sensitivity is better at predicting label 1

#### Specificity
* what percentage of examples with label 0 were correctly classified
* $\frac{TN}{TN + FP}$
* model with high specificity is better at predicting label 0

#### Model comparison
* if it is more important to correctly predict label 1 then model with high sensitivity would be better
* if it is more important to correctly predict label 0 then model with high specificity would be better
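
The definitions above can be sketched in a few lines of plain Python. The labels and the two sets of predictions (`model_a`, `model_b`) are made-up values for illustration only, not results from a real model.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    # TP / (TP + FN): share of label 1 examples that were classified correctly
    return tp / (tp + fn)


def specificity(tn, fp):
    # TN / (TN + FP): share of label 0 examples that were classified correctly
    return tn / (tn + fp)


# Hypothetical validation labels and predictions from two models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

for name, y_pred in (("A", model_a), ("B", model_b)):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print(name, "sensitivity:", sensitivity(tp, fn), "specificity:", specificity(tn, fp))
```

On this toy data model A has the higher sensitivity and model B the higher specificity, so the choice depends on which label matters more.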

#### Multi class metrics
* for multi-class problems we need to calculate the above metrics per class, treating that class as label 1 and all other classes as label 0
* the confusion matrix for n-class problem has n rows and n columns
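
As a rough sketch of the per-class calculation, the snippet below builds an n x n confusion matrix and derives sensitivity and specificity for each class in a one-vs-rest fashion. The labels are invented for illustration, and the code assumes every class appears at least once in `y_true`, otherwise a division by zero would occur.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_metrics(matrix):
    """Return a list of (sensitivity, specificity), one pair per class."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # true class k, predicted as another class
        fp = sum(matrix[r][k] for r in range(n)) - tp   # another class, predicted as k
        tn = total - tp - fn - fp
        results.append((tp / (tp + fn), tn / (tn + fp)))
    return results


# Hypothetical 3-class labels and predictions.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
metrics = per_class_metrics(matrix)
for k, (sens, spec) in enumerate(metrics):
    print("class", k, "sensitivity:", round(sens, 3), "specificity:", round(spec, 3))
```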

### How to decide which hyperparameter values are better?
* Done with **ROC: Receiver Operating Characteristic** graphs
* Y axis: **True Positive Rate = Sensitivity**
* X axis: **False Positive Rate = 1 - Specificity**
* $False\ Positive\ Rate = \frac{FP}{FP+TN}$
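* Each point on the ROC curve comes from the confusion matrix at one classification threshold, so the curve summarizes the confusion matrices across all thresholds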

### How to decide which ROC curve is better?
* The ROC curve with the larger **AUC: Area Under the Curve** is better
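
A minimal sketch of how an ROC curve and its AUC could be computed by hand: sweep the classification threshold over the predicted scores, record one (FPR, TPR) point per threshold, and integrate with the trapezoidal rule. The scores are invented, and the code assumes both labels are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per threshold, running from (0, 0) to (1, 1)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing predicted as 1), then lower the threshold.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted scores from two hyperparameter settings.
y_true = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
scores_b = [0.9, 0.4, 0.7, 0.3, 0.6, 0.2]

auc_a = auc(roc_points(y_true, scores_a))
auc_b = auc(roc_points(y_true, scores_b))
print("AUC A:", round(auc_a, 3), "AUC B:", round(auc_b, 3))
```

On this toy data setting A has the larger AUC, so it would be preferred.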

#### Precision
* proportion of positive (label 1) predictions that were correct
* $\frac{TP}{TP+FP}$
* if there are lots of samples with label 0, precision may be a more useful metric than the false positive rate, since precision does not include $TN$
* so the large number of negatives in an imbalanced dataset does not skew the metric
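
A small worked example of the point about $TN$: the counts below are invented. Keeping $TP$, $FN$ and $FP$ fixed while adding many more true negatives leaves precision unchanged, whereas the false positive rate shrinks simply because the dataset contains more label 0 samples.

```python
def precision(tp, fp):
    # TP / (TP + FP): does not involve TN
    return tp / (tp + fp)


def false_positive_rate(fp, tn):
    # FP / (FP + TN): shrinks as TN grows
    return fp / (fp + tn)


# Hypothetical counts: same model errors, different numbers of true negatives.
tp, fp = 80, 100
for tn in (900, 90_000):
    print(
        "TN =", tn,
        "precision:", round(precision(tp, fp), 3),
        "FPR:", round(false_positive_rate(fp, tn), 5),
    )
```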

### What is Kernel Trick & How is it useful?
* The SVM optimization objective (in its dual form) depends on the data only through dot products of pairs of samples $x_i \cdot x_j$
* By default, an SVM tries to find the linear separator that maximizes the margin
* To find non-linear decision boundaries we could transform each feature vector $x_i$ via a non-linear function $\phi$ and look for a separating hyperplane in the transformed space
* Generally, computing $\phi(x)$ is computationally expensive
* A kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ computes the dot product in the transformed space without explicitly computing $\phi$ (hence the name kernel trick)
* It allows us to operate in the original feature space without computing the coordinates of the data in a higher dimensional space.
* When we map data to a higher dimension, there is a chance that we overfit the model. Thus choosing the right kernel function (including the right parameters) and regularization is of great importance.
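
To make the trick concrete, here is a small sketch using the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on 2-D inputs. This kernel and its explicit feature map $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$ are chosen for illustration. Both routes give the same number, but the kernel never builds the 3-D vectors.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))   # map to 3-D first, then take the dot product
via_kernel = poly_kernel(x, z)   # same value without computing phi
print(explicit, via_kernel)
```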


### SVM
* We maximize the margin, the distance separating the closest pair of data points belonging to opposite classes. These points are called the support vectors, because they are the data observations that “support”, or determine, the decision boundary. To train a support vector classifier, we find the maximal margin hyperplane (the optimal separating hyperplane): the boundary that separates the two classes with the largest margin, so that it generalizes to new data and makes accurate classification predictions.

https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f#:~:text=The%20%E2%80%9Ctrick%E2%80%9D%20is%20that%20kernel,the%20data%20by%20these%20transformed
https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d

### Can Kernel trick be used in any other algo?
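* any algorithm whose optimization objective can be written in a dual form that depends on the samples only through dot products can use the kernel trick, as SVM does
* examples: kernelized PCA, kernelized logistic regression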

### When should we use which kernel?
https://stats.stackexchange.com/questions/18030/how-to-select-kernel-for-svm
* For a dataset with more than 2 dimensions, we want to keep an eye on our objective function: minimizing the hinge loss. We would set up a hyperparameter search (grid search, for example) and compare different kernels to each other. Based on the loss function (or a performance metric such as accuracy, F1, MCC, ROC AUC, etc.) we can determine which kernel is "appropriate" for the given task.
https://www.kdnuggets.com/2016/06/select-support-vector-machine-kernels.html
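
The kernel comparison described above could look roughly like this with scikit-learn. This is a sketch under assumptions: the synthetic `make_moons` dataset, the candidate kernels, the `C` values and the F1 scoring are all illustrative choices, not recommendations from the linked articles.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, non-linearly separable toy data (an assumption for this sketch).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Candidate kernels and regularization strengths to compare.
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1.0, 10.0],
}

# 5-fold cross-validated grid search, scored with F1.
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean F1:", round(search.best_score_, 3))
```

The `scoring` argument can be swapped for another metric (for example `"accuracy"` or `"roc_auc"`) depending on what matters for the task.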

### Why lasso does feature selection and why ridge does not?
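Ridge penalizes the squared weights, so the pull toward zero on each weight is proportional to the weight itself. The consequence of this is that ridge regression will tend to shrink the large weights while hardly shrinking the smaller weights at all, so weights become small but rarely exactly 0. Lasso penalizes the absolute values of the weights, so every weight is pulled toward zero with the same strength regardless of its size. The weights of unimportant features are therefore driven exactly to 0, which removes those features from the model.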
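
A simplified one-weight illustration of the difference, assuming the unregularized (least squares) weight is `w_ols` and we minimize $\frac{1}{2}(w - w_{ols})^2$ plus a penalty. With the ridge penalty $\frac{\lambda}{2} w^2$ the solution is $w_{ols} / (1 + \lambda)$. With the lasso penalty $\lambda |w|$ the solution is the soft-thresholded value. The weights and `lam` below are arbitrary.

```python
def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5 * (w - w_ols)**2 + 0.5 * lam * w**2."""
    return w_ols / (1.0 + lam)


def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5 * (w - w_ols)**2 + lam * abs(w) (soft thresholding)."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0  # small weights are set exactly to zero


lam = 0.5
for w in (3.0, 0.4, -0.1):
    print(w, "ridge:", round(ridge_shrink(w, lam), 4), "lasso:", lasso_shrink(w, lam))
```

Ridge scales every weight by the same factor, so small weights stay non-zero. Lasso subtracts a fixed amount, so any weight smaller than `lam` in magnitude becomes exactly 0.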

## Linear Regression and $R^2$

* When we fit a line $f(x)$ to $m$ data points using linear regression, the sum of squared residuals is:
\begin{equation}
SS(fit) = \sum_{i=1}^{m} (f(x_i) - y_i)^2
\end{equation}
* the variation around the fitted line is:
\begin{equation}
Var(fit) = \frac{SS(fit)}{m}
\end{equation}

* the sum of squares around the mean value of $y$ is:
\begin{equation}
SS(mean) = \sum_{i=1}^{m} (y_{mean} - y_i)^2
\end{equation}
* the variation around the mean is:
\begin{equation}
Var(mean) = \frac{SS(mean)}{m}
\end{equation}

* $R^2$ is defined as:
\begin{equation}
R^2 = \frac{Var(mean) - Var(fit)}{Var(mean)}
\end{equation}
* alternatively
\begin{equation}
R^2 = \frac{SS(mean) - SS(fit)}{SS(mean)}
\end{equation}
since the factor $\frac{1}{m}$ cancels out

* always predicting the mean value of $y$ gives $Var(fit) = Var(mean)$, so $R^2$ is 0: the features explain none of the variation in $y$
* when $R^2$ is 1, i.e. $Var(fit) = 0$, the features fully explain the variation in $y$
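
The formulas above can be checked numerically with a short sketch. The data points are made up, and `fit_line` is a plain least squares fit for a single feature written for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def r_squared(ys, predictions):
    """R^2 = (SS(mean) - SS(fit)) / SS(mean)."""
    y_mean = sum(ys) / len(ys)
    ss_fit = sum((p - y) ** 2 for p, y in zip(predictions, ys))
    ss_mean = sum((y_mean - y) ** 2 for y in ys)
    return (ss_mean - ss_fit) / ss_mean


# Hypothetical data lying close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]
mean_only = [sum(ys) / len(ys)] * len(ys)

print("R^2 of fitted line:", round(r_squared(ys, fitted), 4))
print("R^2 when always predicting the mean:", r_squared(ys, mean_only))
```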
