# Machine Learning

### Cross Validation

What:
- Cross Validation lets us compare different Machine Learning methods and estimate how well they will perform in practice.
- In the real world, you DO NOT reuse the same data for both training and testing; you create a train-test split. However, the result depends on which samples happen to land on each side of the split.
- One workaround is 10-fold Cross Validation, where every sample is used for testing exactly once.
- You also use Cross Validation for hyper-parameter tuning: you train the model across the folds with various hyper-parameter configurations, then pick the configuration that yielded the best results.


How:
- For 10-fold cross validation, you split the dataset into 10 blocks (folds), train on 9 blocks, and test the model on the remaining 1. You repeat this so that each block is used as the test set once.
- You summarize the results from each block/fold, for example by averaging the test scores.
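The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the mean-predicting baseline "model", and the toy data are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=10, seed=0):
    """Return one test score (mean squared error) per fold."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        errors = [(predict(model, xs[j]) - ys[j]) ** 2 for j in test_idx]
        scores.append(sum(errors) / len(errors))
    return scores

# Hypothetical baseline "model": always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

# Toy data, made up for the example.
xs = list(range(50))
ys = [2.0 * x + 1.0 for x in xs]

scores = cross_validate(xs, ys, fit_mean, predict_mean, k=10)
mean_score = sum(scores) / len(scores)
print("per-fold MSE:", [round(s, 1) for s in scores])
print("mean MSE across folds:", round(mean_score, 1))
```

For hyper-parameter tuning, the same loop would be run once per candidate configuration, and the configuration with the best summarized score would be kept.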


### Confusion Matrix

|                        | Actual Positive          | Actual Negative          |
|------------------------|--------------------------|--------------------------|
| Predicted Positive     | True Positive (TP)       | False Positive (FP)      |
| Predicted Negative     | False Negative (FN)      | True Negative (TN)       |


$$ \text{Sensitivity (Recall)} = \frac{TP}{TP + FN} $$

Sensitivity is the proportion of actual positives (e.g. items with treatment) that were correctly identified as positive.

$$ \text{Specificity} = \frac{TN}{TN + FP} $$

Specificity is the proportion of actual negatives (e.g. items without treatment) that were correctly identified as negative.

$$ \text{Precision} = \frac{TP}{TP + FP} $$

Precision is the proportion of predicted positives that are actually positive.
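As a quick worked example, the three ratios can be computed directly from the four cells of the confusion matrix. The labels below are made up for illustration, and `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels: 4 actual positives followed by 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)  # 3, 1, 1, 5

sensitivity = tp / (tp + fn)  # 3 / 4
specificity = tn / (tn + fp)  # 5 / 6
precision = tp / (tp + fp)    # 3 / 4

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} precision={precision:.3f}")
```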

### Bias 

- Bias is the difference between the average prediction of our model and the correct value.
- A model with high bias pays very little attention to the training data and oversimplifies the relationship. It leads to high error on both the training and test datasets.
- Ex. Fitting a straight line to curved training data leads to high bias, because the line fits the curved data poorly.

### Variance

- Variance is the variability of the model's prediction for a given data point, which tells us how much the predictions spread when the model is trained on different datasets.
- A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
- Ex. Fitting a highly curved line to a specific training dataset. The model overfits the training dataset and performs poorly on the test dataset.

|                        | Low Variance         | High Variance          |
|------------------------|--------------------------|--------------------------|
| High Bias     | Underfitting       | Horrible      |
| Low Bias     | Perfect      | Overfitting       |


### Bias-Variance Tradeoff

- If the model is too simple and has very few parameters, it may have high bias and low variance.
- On the other hand, if the model has a large number of parameters, it could have high variance and low bias.

- A good ML model has low bias and low variance, so it produces consistent results across different datasets.
- This is done by finding the sweet spot between simple models and complex models, usually with the help of regularization, boosting, and bagging.
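One way to see the tradeoff is to fit polynomials of increasing degree to the same noisy curved data and compare training and test error. The sketch below assumes NumPy is available; the sine-shaped data, the noise level, and the degrees 1, 3, and 9 are all choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed "curved" ground truth for the demo.
    return np.sin(np.pi * x)

# Synthetic train and test sets drawn from the same noisy curve.
x_train = np.linspace(-1, 1, 20)
y_train = true_fn(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 20)
y_test = true_fn(x_test) + rng.normal(0, 0.2, size=x_test.shape)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit: few parameters (degree 1) up to many (degree 9).
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, test MSE={results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The straight line (degree 1) is the high-bias case from the table above. Whether the degree-9 fit does worse on the test set depends on the noise, which is why the comparison is made on held-out data.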

### ROC (Receiver Operating Characteristic) Curve

- A curve used to evaluate the performance of a binary classification model. 
- ROC curve plots:
    1. True Positive Rate (TPR) on the Y-axis, also called sensitivity or recall. 
    2. False Positive Rate (FPR) on the X-axis, calculated as (1 - specificity).

    Each point on the curve represents a different 'threshold' for deciding whether a predicted probability counts as a positive or negative classification. 


What the curve tells you:
- Ideal: Passes through the top-left corner (TPR=1, FPR=0).
- Random model: Forms a diagonal line from (0, 0) to (1, 1).
- Area Under Curve (AUC): A number between 0 and 1; higher is better.
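To make the threshold idea concrete, here is a small sketch that sweeps a threshold over predicted scores, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule. The function names and the six labelled scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Each distinct score is a threshold; infinity gives the (0, 0) starting point.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        # A sample is predicted positive when its score is at or above the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up example: 1 = positive, 0 = negative, with predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(f"AUC = {area:.3f}")  # 8/9, about 0.889
```

Here one negative (0.7) outranks one positive (0.6), so 8 of the 9 positive-negative pairs are ordered correctly, which matches the AUC of 8/9.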

### Entropy

To derive Entropy, we first need to understand 'Surprise'. How surprised are you when a very unlikely outcome occurs?

- Surprise is the log of the inverse of the probability of an event occurring, so rarer events are more surprising.

$$ \text{Surprise} = \log\left(\frac{1}{\text{Probability}}\right) $$

Entropy is the expected value of the surprise. In other words, you multiply the probability of each event by its surprise and sum over all events.

$$ \text{Entropy} = P(\text{Event}) \log\left(\frac{1}{P(\text{Event})}\right) + P(\text{No Event}) \log\left(\frac{1}{P(\text{No Event})}\right) $$
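A short numeric check of these formulas, assuming a base-2 logarithm so the result is in bits (the notes do not fix a base):

```python
import math

def surprise(p):
    """Surprise of an event with probability p, in bits (base-2 log assumed)."""
    return math.log2(1 / p)

def entropy(probabilities):
    """Expected surprise: sum of p * surprise(p) over all events."""
    # Events with probability 0 never occur, so they contribute nothing.
    return sum(p * surprise(p) for p in probabilities if p > 0)

fair = entropy([0.5, 0.5])    # fair coin: 1 bit
biased = entropy([0.9, 0.1])  # biased coin: about 0.469 bits

print(f"surprise(0.5) = {surprise(0.5):.3f}, surprise(0.1) = {surprise(0.1):.3f}")
print(f"entropy fair = {fair:.3f}, entropy biased = {biased:.3f}")
```

The biased coin has lower entropy than the fair one: the rare outcome is very surprising, but it happens so seldom that the expected surprise is small.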

### R^2

- R-squared is defined as:
$$R^2 = \frac{Var(Mean) - Var(Fit)}{Var(Mean)}$$

R^2 tells you how much of the variation in one variable is explained by another. Var(Mean) is the variation of the data around its mean, and Var(Fit) is the variation of the data around the fitted line. For example, how much of the variation in mouse weight is explained by mouse size.
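The definition can be checked on a tiny example. The mouse sizes and weights below are invented, and `fit_line` is a hypothetical helper that fits a one-feature least-squares line. The sums of squares are used in place of variances because the 1/n factor cancels in the ratio.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys):
    """(variation around the mean - variation around the fit) / variation around the mean."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_mean = sum((y - mean_y) ** 2 for y in ys)
    ss_fit = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return (ss_mean - ss_fit) / ss_mean

# Made-up data: mouse size (x) and mouse weight (y).
sizes = [1, 2, 3, 4, 5]
weights = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(sizes, weights)
print(f"R^2 = {r2:.3f}")
```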

### Linear Regression

- Use Ordinary Least Squares (OLS) to fit a line to the data.
- Calculate R^2
- Calculate p-value for R^2

Equation of Line:
$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i
$$

Equation of a line in matrix representation.
$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$

where:
- y is an n x 1 vector of observed values.
- X is an n x (p+1) matrix (including a column of ones for the intercept).
- Beta/B is a (p+1) x 1 vector of coefficients.
- Epsilon is an n x 1 vector of error terms.

The goal of OLS:
- The goal is to choose $\beta$ such that the sum of squared residuals is minimized. 

$$
S(\boldsymbol{\beta}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2
$$

### Derivation of OLS Estimator:

Step 1: Express the cost function

$$
S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})
$$

Step 2: Take the derivative with respect to $\boldsymbol{\beta}$

$$
\frac{\partial S}{\partial \boldsymbol{\beta}} = -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})
$$
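For readers who want the intermediate step, the derivative above can be obtained by expanding the quadratic form first. This uses two standard matrix-calculus identities, stated here without proof: $\partial (\boldsymbol{\beta}^T \mathbf{a}) / \partial \boldsymbol{\beta} = \mathbf{a}$, and $\partial (\boldsymbol{\beta}^T \mathbf{A} \boldsymbol{\beta}) / \partial \boldsymbol{\beta} = 2 \mathbf{A} \boldsymbol{\beta}$ for symmetric $\mathbf{A}$.

$$
S(\boldsymbol{\beta}) = \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T \mathbf{X}^T \mathbf{y} + \boldsymbol{\beta}^T \mathbf{X}^T \mathbf{X} \boldsymbol{\beta}
$$

$$
\frac{\partial S}{\partial \boldsymbol{\beta}} = -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})
$$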

Step 3: Set the derivative equal to 0 and solve

$$
\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \boldsymbol{\beta}
$$

Step 4: Solve for the OLS estimator

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

### Results

- **Estimated coefficients**:

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

- **Predicted values**:

$$
\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}
$$

- **Residuals**:

$$
\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}
$$
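The three results can be reproduced numerically. The sketch below assumes NumPy and uses synthetic data with made-up "true" coefficients. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same estimator and is generally better behaved numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, p features (values are made up for the demo).
n, p = 100, 2
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])   # n x (p+1), first column of ones = intercept
true_beta = np.array([1.0, 2.0, -3.0])        # assumed coefficients used to generate y
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Estimated coefficients: solve X^T X beta = X^T y (the normal equations).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predicted values and residuals.
y_hat = X @ beta_hat
residuals = y - y_hat

print("beta_hat:", np.round(beta_hat, 3))
print("sum of squared residuals:", float(residuals @ residuals))
```

A useful sanity check is that `X.T @ residuals` is numerically zero: that is exactly the condition obtained by setting the derivative to 0.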

### Assumptions of OLS

1. **Linearity**: The model is linear in parameters  
2. **Full rank**: $\mathbf{X}^T \mathbf{X}$ is invertible  
3. **Exogeneity**: $\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = 0$  
4. **Homoscedasticity**: Constant variance of errors  
5. **No autocorrelation**: Errors are uncorrelated 
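The full-rank assumption is the easiest one to check mechanically. A minimal sketch, assuming NumPy: `has_full_column_rank` is a hypothetical helper, and the two small design matrices are made up to show a passing and a failing case.

```python
import numpy as np

def has_full_column_rank(X):
    """True when the columns of X are linearly independent, so X^T X is invertible."""
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

# Intercept column plus one feature: columns are independent.
X_ok = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])

# Add a third column that is an exact multiple of the second (perfect collinearity).
X_bad = np.column_stack([X_ok, 2.0 * X_ok[:, 1]])

print(has_full_column_rank(X_ok))   # True
print(has_full_column_rank(X_bad))  # False
```

When the check fails, $(\mathbf{X}^T \mathbf{X})^{-1}$ does not exist and the OLS estimator is not uniquely defined.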