# Notations

### **Input Space:** $\mathcal{X}$

- The input space contains the set of all possible examples/instances in a **population**.

- This is generally unknown.

---

### **Output Space:** $\mathcal{Y}$

- The output space is the set of all possible labels/targets that correspond to the points in $\mathcal{X}$.

---

### **Distribution**: $\mathcal{P}$

- **Reference: Learning From Data p43.**
- The unknown distribution that generates the points in our input space $\mathcal{X}$.
- In general, instead of the mapping $\mathrm{y} = f(\mathrm{x})$, we can take the output $\mathrm{y}$ to be a random variable that is affected by, rather than determined by, the input $\mathrm{x}$.
- Formally, we have a **target distribution** $\mathcal{P}(\mathrm{y} | \mathrm{x})$ instead of just $\mathrm{y} = f(\mathrm{x})$. Any point $(\mathrm{x}, \mathrm{y})$ in $\mathcal{X} \times \mathcal{Y}$ is then generated by the **joint distribution** $$\mathcal{P}(\mathrm{x}, \mathrm{y}) = \mathcal{P}(\mathrm{x})\mathcal{P}(\mathrm{y} | \mathrm{x})$$
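
A minimal sketch of how a point $(\mathrm{x}, \mathrm{y})$ arises from the joint distribution: first draw $\mathrm{x}$ from $\mathcal{P}(\mathrm{x})$, then draw $\mathrm{y}$ from $\mathcal{P}(\mathrm{y} | \mathrm{x})$. The concrete choices here (a uniform $\mathcal{P}(\mathrm{x})$, the target `f`, Gaussian noise, and the helper name `sample_point`) are assumptions made only for illustration.

```python
import math
import random


def f(x):
    """Hypothetical target function, chosen only for illustration."""
    return math.sin(math.pi * x)


def sample_point(rng, noise_std=0.1):
    """Draw one (x, y) pair from P(x, y) = P(x) P(y | x)."""
    x = rng.uniform(-1.0, 1.0)       # x ~ P(x), assumed uniform on [-1, 1]
    y = rng.gauss(f(x), noise_std)   # y ~ P(y | x), assumed Gaussian around f(x)
    return x, y


rng = random.Random(0)
sample = [sample_point(rng) for _ in range(5)]
```

Setting `noise_std=0.0` recovers the deterministic mapping $\mathrm{y} = f(\mathrm{x})$ as a special case.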

---

### **Data:** $\mathcal{D}$ 
- This is the set of samples drawn from $\mathcal{X} \times \mathcal{Y}$ according to a distribution $\mathcal{P}$.
- The general notation is as follows:
    $$\mathcal{D} = [(\mathrm{x^{(1)}}, \mathrm{y^{(1)}}), (\mathrm{x^{(2)}}, \mathrm{y^{(2)}}), \ldots, (\mathrm{x^{(N)}}, \mathrm{y^{(N)}})]$$
    where $N$ denotes the number of training samples, and each $\mathrm{x}^{(i)} \in \mathbb{R}^{n}$ has $n$ features. In general, $\mathrm{y}^{(i)} \in \mathbb{R}$ is a single label.
- We can split $\mathcal{D}$ into two parts: $\mathrm{X}$, which collects all the $\mathrm{x}$, and $\mathrm{Y}$, which collects all the $\mathrm{y}$. We will see this next.

---


### **Design Matrix:** $\mathrm{X}$
- Let $\mathrm{X}$ be the design matrix of dimensions $m \times (n + 1)$, where $m$ is the number of observations (training samples, written $N$ in the definition of $\mathcal{D}$) and $n$ is the number of independent feature/input variables. Note the inconsistency in the matrix sizes below: the second matrix has an extra first column of ones because we usually include a bias term $\mathrm{x_0}$, which we set to 1.

$$\mathrm{X} = \begin{bmatrix} (\mathbf{x^{(1)}})^{T} \\ (\mathbf{x^{(2)}})^{T} \\ \vdots \\ (\mathbf{x^{(m)}})^{T}\end{bmatrix}_{m \times n} = \begin{bmatrix} 1 &  x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\
                 1 &  x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\
                \vdots & \vdots & \vdots & \ddots & \vdots \\
                1 &  x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}_{m \times (n+1)} 
$$
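
As a small sketch of the construction above, the bias column can be prepended to a raw feature array with NumPy. The array `features` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical raw features: m = 3 samples, n = 2 features.
features = np.array([[2.0, 3.0],
                     [4.0, 5.0],
                     [6.0, 7.0]])
m, n = features.shape

# Prepend the bias column x_0 = 1 to obtain the m x (n + 1) design matrix.
X = np.hstack([np.ones((m, 1)), features])
```

Each row `X[i]` is then the transpose of one training vector with the leading 1 attached.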

---

### **Single Training Vector:** $\mathrm{x}$

- It is worth noting that $\mathrm{x}^{(i)}$ is the $i$-th training sample, represented as an $n \times 1$ **column vector**. It is not a column of $\mathrm{X}$: the way we define the Design Matrix, the $i$-th row of $\mathrm{X}$ is the transpose of $\mathrm{x}^{(i)}$.
- Note $x^{(i)}_j$ is the value of feature/attribute $j$ in the $i$-th training instance.

$$\mathbf{x^{(i)}} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}_{n \times 1}$$

---

### **Target/Label:** $\mathrm{Y}$

- This is the target vector. By default, it is a column vector of size $m \times 1$.

$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}_{m \times 1}$$

---

### **Hypothesis Set:** $\mathcal{H}$

- The set of all candidate functions used to approximate our true function $f$. Note that the Hypothesis Set can be either a finite or an infinite set, but in reality it is almost always infinite.

---

### **Hypothesis:** $h: \mathcal{X} \to \mathcal{Y}$ where $\mathrm{x} \mapsto \mathrm{y}$

- Note that this $h \in \mathcal{H}$ is the hypothesis function.
- The final best hypothesis function is called $g$, which approximates the true function $f$.

---

### **Learning Algorithm:** $\mathcal{A}$

- From the hypothesis set $\mathcal{H}$, the **learning algorithm's role is to pick one** $h \in \mathcal{H}$ to serve as the hypothesis function.
- More often, we call the final hypothesis learned by $\mathcal{A}$ simply $g$.

### **Hypothesis Subscript $\mathcal{D}$:** $h_{\mathcal{D}}$

- This is no different from the previous hypothesis; the previous $h$ is simply a shorthand for this notation.
- It means that the hypothesis we choose depends on the sample data given to us: given a $\mathcal{D}$, we use $\mathcal{A}$ to learn an $h_{\mathcal{D}}$ from $\mathcal{H}$.
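
To make the dependence on $\mathcal{D}$ concrete, here is a toy sketch in which $\mathcal{A}$ is a function that takes a dataset and returns a hypothesis. The hypothesis set (constant functions), the function name `learning_algorithm`, and both datasets are assumptions chosen to keep the example tiny.

```python
def learning_algorithm(D):
    """A: pick from H = {h_b(x) = b} the constant b with the lowest squared error on D."""
    b = sum(y for _, y in D) / len(D)  # the sample mean minimizes squared error
    return lambda x: b                 # this function is h_D


# Two hypothetical datasets drawn from the same problem.
D1 = [(0.0, 1.0), (1.0, 3.0)]
D2 = [(0.0, 2.0), (1.0, 6.0)]

h_D1 = learning_algorithm(D1)
h_D2 = learning_algorithm(D2)
```

The same $\mathcal{A}$ and $\mathcal{H}$ give different hypotheses for different datasets, which is exactly what the subscript in $h_{\mathcal{D}}$ records.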

### **Generalization Error/Test Error/Out-of-Sample Error:** $\mathcal{E}_{\text{out}}(h)$

- **Reference from Foundations of Machine Learning**.
- Given a hypothesis $h \in \mathcal{H}$, a true function $f \in \mathcal{C}$ (with $\mathcal{C}$ the concept class in the reference's notation), and an underlying distribution $\mathcal{P}$, the test/out-of-sample error of $h$ is defined by 
$$\begin{aligned}\mathcal{E}_{\text{out}}(h) = \underset{\mathrm{x} \sim \mathcal{P}}{\mathrm{Pr}}[h(\mathrm{x}) \neq f(\mathrm{x})]\end{aligned}$$
- Note that the above equation is just the error rate between the hypothesis function $h$ and the true function $f$. As a result, the test error of a hypothesis cannot be computed directly, because both the distribution $\mathcal{P}$ and the true function $f$ are unknown.
- This brings us to the next best thing we can measure, the In-sample/Empirical/Training Error.
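
In a simulation where we choose $f$ and $\mathcal{P}$ ourselves, the probability above can be approximated by Monte Carlo sampling. The sketch below assumes a uniform $\mathcal{P}$ on $[0, 1)$ and two made-up threshold functions; none of these come from the reference.

```python
import random


def f(x):
    """Hypothetical true function: label 1 when x > 0.5."""
    return int(x > 0.5)


def h(x):
    """Hypothetical hypothesis: threshold placed slightly off, at 0.6."""
    return int(x > 0.6)


def estimate_e_out(h, f, n_draws=100_000, seed=0):
    """Approximate Pr[h(x) != f(x)] by sampling x from the assumed P."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_draws):
        x = rng.random()              # x ~ P, assumed uniform on [0, 1)
        disagreements += h(x) != f(x)
    return disagreements / n_draws


e_out = estimate_e_out(h, f)
```

Under these assumptions the two functions disagree only on $(0.5, 0.6]$, so the estimate should land close to 0.1.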

---

More formally, in a regression setting where we use Mean Squared Error, $$\begin{aligned}\mathcal{E}_{\text{out}}(h) = \mathbb{E}_{\mathrm{x}}\left[(h_{\mathcal{D}}(\mathrm{x}) - f(\mathrm{x}))^2 \right]
\end{aligned}$$

---

This can be difficult and confusing to understand. To water down the formal definition, it is worth taking an example: in $\mathcal{E}_{\text{out}}(h)$ we are only talking about the **Expected Test Error** over the Test Set and nothing else. **Think of a test set with only one query point**, which we call $\mathrm{x}_{q}$; then the above equation is just $$\begin{aligned}\mathcal{E}_{\text{out}}(h) = \mathbb{E}_{\mathrm{x}_{q}}\left[(h_{\mathcal{D}}(\mathrm{x}_{q}) - f(\mathrm{x}_{q}))^2 \right]
\end{aligned}$$

taken over the single point $\mathrm{x}_{q}$. That is, if $\mathrm{x}_{q} = 3$ and $h_{\mathcal{D}}(\mathrm{x}_{q}) = 2$ and $f(\mathrm{x}_{q}) = 5$, then $(h_{\mathcal{D}}(\mathrm{x}_{q}) - f(\mathrm{x}_{q}))^2 = 9$ and it follows that $$\mathcal{E}_{\text{out}}(h) =  \mathbb{E}_{\mathrm{x}_{q}}\left[(h_{\mathcal{D}}(\mathrm{x}_{q}) - f(\mathrm{x}_{q}))^2 \right] = \mathbb{E}_{\mathrm{x}_{q}}[9] = \frac{9}{1} = 9$$

Note that I purposely wrote the denominator as 1 because we have only one test point. If we were to have two test points, say $\mathrm{x} = [x_{q}, x_{p}] = [3, 6]$, with $h_{\mathcal{D}}(x_{p}) = 4$ and $f(x_{p}) = 6$, then $(h_{\mathcal{D}}(\mathrm{x}_{p}) - f(\mathrm{x}_{p}))^2 = 4$. 

Then our $$\mathcal{E}_{\text{out}}(h) =  \mathbb{E}_{\mathrm{x}}\left[(h_{\mathcal{D}}(\mathrm{x}) - f(\mathrm{x}))^2 \right] = \mathbb{E}_{\mathrm{x}}[[9, 4]] = \frac{1}{2} [9 + 4] = 6.5$$

Note how I secretly removed the subscript in $\mathrm{x}$, and how, when there are two points, we take the expectation over both of them. So if we have $m$ test points, then the expectation is taken over all the test points.
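
The arithmetic of the two-point example can be checked in a few lines. The lists below simply restate the numbers used in the text.

```python
# Predictions and true values at the two test points from the text.
predictions = [2, 4]   # h_D(x_q), h_D(x_p)
targets = [5, 6]       # f(x_q), f(x_p)

squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]

# With one test point the average has denominator 1; with two, denominator 2.
one_point_error = squared_errors[0] / 1
two_point_error = sum(squared_errors) / len(squared_errors)
```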

Until now, our hypothesis $h$ has been fixed by a particular sample set $\mathcal{D}$. We will now move on to the next concept, **Expected Generalization Error** (adding the word Expected in front makes a lot of difference).

### **Expected Generalization Error/Test Error/Out-of-Sample Error:** $\mathbb{E}_{\mathcal{D}}[\mathcal{E}_{\text{out}}(h)]$

For the previous generalization error, we were only talking about a fixed hypothesis generated by one particular $\mathcal{D}$. In order to remove this dependency, we take the expectation of the Generalization Error of $h_{\mathcal{D}}$ over all such datasets $\mathcal{D}_{i}$, $i = 1, 2, 3, \ldots, K$.

---

Then the **Expected Generalization Test Error** is independent of any particular realization of $\mathcal{D}$:

$$\begin{aligned}\mathbb{E}_{\mathcal{D}}[\mathcal{E}_{\text{out}}(h)] = \mathbb{E}_{\mathcal{D}}[\mathbb{E}_{\mathrm{x}}\left[(h_{\mathcal{D}}(\mathrm{x}) - f(\mathrm{x}))^2 \right]]
\end{aligned}$$

---

When the Error is taken to be Mean Squared Error, calculating the Expected Generalization Error amounts, in essence, to finding the expected MSE.
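
A rough simulation sketch of this quantity: draw $K$ training sets, fit one hypothesis per set, compute each hypothesis's MSE against $f$ on test points, and average over the sets. The true function, the uniform input distribution, the straight-line hypothesis set, and the helper names are all assumptions made for illustration.

```python
import numpy as np


def f(x):
    """Hypothetical true function."""
    return x ** 2


def fit_hypothesis(rng, n_samples=20):
    """Draw one training set D and fit a straight line h_D to it."""
    x = rng.uniform(-1.0, 1.0, n_samples)
    slope, intercept = np.polyfit(x, f(x), deg=1)
    return lambda t: slope * t + intercept


def e_out(h, x_test):
    """MSE of h against f, averaged over test points drawn from P."""
    return float(np.mean((h(x_test) - f(x_test)) ** 2))


rng = np.random.default_rng(0)
x_test = rng.uniform(-1.0, 1.0, 5_000)

K = 200  # number of simulated training sets D_1, ..., D_K
errors = [e_out(fit_hypothesis(rng), x_test) for _ in range(K)]
expected_e_out = float(np.mean(errors))
```

Each entry of `errors` is the generalization error of one $h_{\mathcal{D}_i}$; their mean approximates the outer expectation over $\mathcal{D}$.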

### **Empirical Error/Training Error/In-Sample Error:** $\mathcal{E}_{\text{in}}(h)$

- Given a hypothesis $h \in \mathcal{H}$, a true function $f \in \mathcal{C}$, an underlying distribution $\mathcal{P}$, and a sample $\mathrm{X}$ drawn i.i.d. from $\mathcal{X}$ with distribution $\mathcal{P}$, the empirical/in-sample error of $h$ is defined by 
$$\begin{aligned}\mathcal{E}_{\text{in}}(h) = \frac{1}{\mathrm{m}}\sum_{i=1}^{\mathrm{m}}\mathbf{1}[h(\mathrm{x}^{(i)}) \neq f(\mathrm{x}^{(i)})]\end{aligned}$$
- Here $\mathbf{1}[\cdot]$ is the indicator function, mainly used for classification: if $h$ and $f$ disagree at a point $x^{(i)}$, then $\mathbf{1}[h(\mathrm{x}^{(i)}) \neq f(\mathrm{x}^{(i)})]$ evaluates to 1, and to 0 otherwise. We take the sum of all disagreements and divide by the total number of samples. In short, that is just the misclassification/error rate.
- The empirical error of $h \in \mathcal{H}$ is its average error over the sample $\mathrm{X}$; in contrast, the generalization error is its expected error based on the distribution $\mathcal{P}$.
- Take careful note here that $h(x^{(i)})$ is the prediction made by our hypothesis (model), which we conventionally call $\hat{y}^{(i)}$, whereas $f(x^{(i)})$ is our ground truth label $y^{(i)}$. I believe that this ground truth label is realized once we draw the sample from $\mathcal{X}$, even though we do not know what $f$ is.
- An additional note here is that the summand of the in-sample error function is not fixed to the indicator function. In fact, I believe you can define any loss function to calculate the "error". As an example, if we are dealing with regression, then we can modify the summand to our favourite Mean Squared Error.

$$\begin{aligned}\mathcal{E}_{\text{in}}(h) = \frac{1}{\mathrm{m}}\sum_{i=1}^{\mathrm{m}}[h(\mathrm{x}^{(i)}) - f(\mathrm{x}^{(i)})]^2\end{aligned}$$
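
Both versions of the in-sample error are plain averages over the sample, as this small sketch shows. The function names and the example predictions and labels are hypothetical.

```python
def zero_one_error(predictions, labels):
    """Fraction of samples where the prediction disagrees with the label."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


# Hypothetical predictions h(x^(i)) and ground-truth values y^(i).
clf_error = zero_one_error([1, 0, 1, 1], [1, 1, 1, 0])   # 2 of 4 disagree
reg_error = mean_squared_error([2.0, 4.0], [5.0, 6.0])
```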

---

### **Bias - Variance Decomposition**

- This is a decomposition of the **Expected** Generalization Error. For the formal proof, please read Learning From Data.
- Unless otherwise stated, we consider only the univariate case, where $\mathrm{x}$ is a single test point.

---

$$\begin{align*}
\mathbb{E}_{\mathcal{D}}[\mathcal{E}_{\text{out}}(h)] &= \mathbb{E}_{\mathcal{D}}[\mathbb{E}_{\mathrm{x}}\left[(h_{\mathcal{D}}(\mathrm{x}) - f(\mathrm{x}))^2 \right]] 
\\ &= \big(\;\mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;] - f(x)\;\big)^2 + \mathbb{E}_{\mathcal{D}}\big[\;(\;h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;])^2\;\big] + \mathbb{E}\big[(y-f(x))^2\big]
\\ &= \big(\;\bar{h}(\mathrm{x}) - f(x)\;\big)^2 + \mathbb{E}_\mathcal{D}\big[\;(\;h_{\mathcal{D}}(x) - \bar{h}(\mathrm{x})\;)^2\;\big]+ \mathbb{E}\big[(y-f(x))^2\big]
\end{align*} 
$$

---

Where $\bar{h}(\mathrm{x}) = \mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;]$ is the average hypothesis, $\big(\;\mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;] - f(x)\;\big)^2 $ is the (squared) Bias, $\mathbb{E}_\mathcal{D}\big[\;(\;h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;])^2\;\big]$ is the Variance and $\mathbb{E}\big[(y-f(x))^2\big]$ is the irreducible error $\epsilon$. Because $\mathrm{x}$ is a single test point, the inner expectation over $\mathrm{x}$ is trivial. The irreducible term appears only when the targets are noisy, that is, when $y$ is drawn from $\mathcal{P}(\mathrm{y} | \mathrm{x})$ and the error is measured against $y$ rather than $f(x)$; for noiseless targets it is zero.
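
As a sketch of where the first two terms come from (noiseless case, fixed test point $x$): add and subtract $\bar{h}(x)$ inside the square and expand. The cross term vanishes because $\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x) - \bar{h}(x)] = 0$ by the definition of $\bar{h}$.

$$\begin{align*}
\mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - f(x))^2\big] &= \mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - \bar{h}(x) + \bar{h}(x) - f(x))^2\big] \\
&= \mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - \bar{h}(x))^2\big] + \big(\bar{h}(x) - f(x)\big)^2 + 2\big(\bar{h}(x) - f(x)\big)\,\mathbb{E}_{\mathcal{D}}\big[h_{\mathcal{D}}(x) - \bar{h}(x)\big] \\
&= \mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - \bar{h}(x))^2\big] + \big(\bar{h}(x) - f(x)\big)^2
\end{align*}$$

This is only the outline of the argument; the full proof, including the noise term, is in Learning From Data.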


### **Bias:** $\big(\;\mathbb{E}_\mathcal{D}[\;h_{\mathcal{D}}(x)\;] - f(x)\;\big)^2$ 

In another form, we can express Bias as 
$$\big(\;\mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;] - f(x)\;\big)^2 = \big(\;\bar{h}(\mathrm{x}) - f(x)\;\big)^2$$

---

See simulation on Bias-Variance Tradeoff to understand.

--- 
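If our test point is $x_{q} = 0.9$, then our bias, estimated over $n_{\texttt{sims}}$ simulated training sets with $\hat{f}_k^{[i]}$ the model fitted on the $i$-th set, is as such (this is the unsquared bias; squaring it gives the term in the decomposition):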

$$
\widehat{\text{bias}} \left(\hat{f}(0.90) \right)  = \frac{1}{n_{\texttt{sims}}}\sum_{i = 1}^{n_{\texttt{sims}}} \left(\hat{f}_k^{[i]}(0.90) \right) - f(0.90)
$$


### **Variance:** $\mathbb{E}_\mathcal{D}\big[\;(\;h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;])^2\;\big]$

---

This is more confusing, but we first express Variance as:

$$\mathbb{E}_\mathcal{D}\big[\;(\;h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\;h_{\mathcal{D}}(x)\;])^2\;\big] = \mathbb{E}_\mathcal{D}\big[\;(\;h_{\mathcal{D}}(x) - \bar{h}(\mathrm{x})\;)^2\;\big]$$

---

If our test point is $x_{q} = 0.9$, then our variance is as such:

$$
\widehat{\text{var}} \left(\hat{f}(0.90) \right) = \frac{1}{n_{\texttt{sims}}}\sum_{i = 1}^{n_{\texttt{sims}}} \left(\hat{f}_k^{[i]}(0.90) - \frac{1}{n_{\texttt{sims}}}\sum_{j = 1}^{n_{\texttt{sims}}}\hat{f}_k^{[j]}(0.90) \right)^2 
$$
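
The two estimators at $x_{q} = 0.9$ can be sketched as a short simulation. Everything about the setup is an assumption for illustration: the true function `f`, the noise level, the sample size, and the choice of a degree-1 polynomial as the model.

```python
import numpy as np


def f(x):
    """Hypothetical true function, assumed for this simulation."""
    return x ** 2


def fit_and_predict(rng, x_query, degree=1, n_samples=30, noise_std=0.3):
    """Draw one noisy training set, fit a polynomial, and predict at x_query."""
    x = rng.uniform(0.0, 1.0, n_samples)
    y = f(x) + rng.normal(0.0, noise_std, n_samples)
    coefficients = np.polyfit(x, y, deg=degree)
    return np.polyval(coefficients, x_query)


rng = np.random.default_rng(42)
x_q, n_sims = 0.90, 2_000
predictions = np.array([fit_and_predict(rng, x_q) for _ in range(n_sims)])

bias_hat = predictions.mean() - f(x_q)                      # unsquared bias
var_hat = np.mean((predictions - predictions.mean()) ** 2)  # variance
mse_hat = np.mean((predictions - f(x_q)) ** 2)              # error against f
```

For these sample estimates, `mse_hat` equals `bias_hat ** 2 + var_hat` up to floating-point error, which mirrors the bias and variance terms of the decomposition.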


# Pseudo Code

## Cross-Validation

- Define $Z$ as the set of hyperparameter combinations. Define the number of splits to be $K$, and partition the data into folds $F_1, \ldots, F_K$.
- For each hyperparameter combination $z \in Z$:
    - For each fold $j = 1, \ldots, K$:
        - Set $F_{\text{train}}=\bigcup\limits_{i\neq j} F_{i}$ as the training set
        - Set $F_{\text{val}} = F_{j}$ as the validation set
        - Perform Standard Scaling on $F_{\text{train}}$ and find the mean and std
        - Perform VIF recursively on $F_{\text{train}}$ and find the selected features
        - Transform $F_{\text{val}}$ using the mean and std found using $F_{\text{train}}$
        - Transform $F_{\text{val}}$ to have only the selected features from $F_{\text{train}}$
        - Train and fit on $F_{\text{train}}$ 
        - Evaluate the fitted parameters on $F_{\text{val}}$ to obtain the fold metric $\mathcal{M}_j$
    - Average $\mathcal{M}_j$ over the $K$ folds to obtain $\mathcal{M}$ for $z$
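
The loop above can be sketched in NumPy as follows. The model (closed-form ridge regression with penalty `alpha` as the only hyperparameter), the VIF threshold of 10, the MSE metric, and all function names are assumptions chosen for illustration; the notes do not fix any of them.

```python
import numpy as np


def vif(X, j):
    """Variance inflation factor of column j given the other columns."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    residual = X[:, j] - A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
    r2 = 1.0 - residual.var() / X[:, j].var()
    return 1.0 / max(1.0 - r2, 1e-12)


def select_by_vif(X, threshold=10.0):
    """Repeatedly drop the column with the largest VIF above the threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = [vif(X[:, keep], j) for j in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        keep.pop(worst)
    return keep


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)


def cross_validate(X, y, alphas, K=5, seed=0):
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), K)
    scores = {}
    for alpha in alphas:                       # each hyperparameter setting z
        fold_mse = []
        for j in range(K):
            val = folds[j]
            train = np.concatenate([folds[i] for i in range(K) if i != j])
            # Scaling statistics and feature selection come from the training folds only.
            mean, std = X[train].mean(axis=0), X[train].std(axis=0)
            X_train = (X[train] - mean) / std
            keep = select_by_vif(X_train)
            X_val = ((X[val] - mean) / std)[:, keep]
            beta = fit_ridge(X_train[:, keep], y[train], alpha)
            pred = beta[0] + X_val @ beta[1:]
            fold_mse.append(np.mean((pred - y[val]) ** 2))
        scores[alpha] = float(np.mean(fold_mse))  # average metric over the K folds
    return scores


# Synthetic data (assumed): the fourth column nearly duplicates the first.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=100)])
y = X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

scores = cross_validate(X, y, alphas=[0.01, 1.0, 100.0])
best_alpha = min(scores, key=scores.get)
```

Fitting the scaler and the feature selection inside each fold, rather than once on the full data, keeps information from the validation fold out of training.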

# Logistic Regression

Given target variable $Y \in \{0, 1\}$ and predictors $X$, denote $\mathbb{P}(X) = P(Y = 1 | X)$, the probability that $Y$ belongs to the positive (malignant) class. LR expresses $\mathbb{P}$ as a function of the predictors $X$: $\mathbb{P}(X) = \sigma(\hat{\beta}^T X) = \frac{1}{1 + \exp(-\hat{\beta}^T X)}$, where $\hat{\beta}$ is the vector of estimated coefficients of the model. One thing worth mentioning is that the logistic function $\sigma(z) = \frac{1}{1 + \exp(-z)}$ outputs values between 0 and 1; it is the functional form of our hypothesis, and therefore makes up the **Hypothesis Space** $\mathcal{H}$. We then use a learning algorithm $\mathcal{A}$, **Maximum Likelihood Estimation (MLE)**, to estimate the coefficients of our predictors. However, since there is no closed-form solution to MLE here, the learning algorithm uses optimization techniques like **Gradient Descent** to find $\hat{\beta}$ (we can use Gradient Descent if we instead minimize the negative log-likelihood function, which is the same as maximizing the likelihood).
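
A minimal sketch of this procedure, minimizing the average negative log-likelihood with plain gradient descent. The synthetic data, the learning rate, the iteration count, and the function names are assumptions for illustration, not the settings used in any particular analysis.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def negative_log_likelihood(beta, X, y):
    """Average negative log-likelihood of labels y under coefficients beta."""
    p = sigmoid(X @ beta)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def fit_logistic_regression(X, y, learning_rate=0.1, n_iters=5_000):
    """Gradient descent on the average negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (sigmoid(X @ beta) - y) / len(y)
        beta -= learning_rate * gradient
    return beta


# Synthetic data (assumed): a bias column plus one feature.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 2.0])
y = (rng.random(500) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logistic_regression(X, y)
probabilities = sigmoid(X @ beta_hat)  # estimated P(Y = 1 | X)
```

The gradient used here, $X^T(\sigma(X\beta) - y)/m$, is the derivative of the average negative log-likelihood, so each step moves $\beta$ toward the maximum likelihood estimate.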
