# Predictive Modeling

In predictive modeling, the goal is to build a machine learning model that predicts the response variable $Y$ from the independent variable $X$. Both $X$ and $Y$ can be multi-dimensional. Mathematically, we assume the response is a fixed but unknown function $f$ of $X$ plus a random error term $\epsilon$ that has mean zero and is independent of $X$ -

$$ Y = f(X) + \epsilon $$

We may be interested in estimating $f$ for **Prediction**, **Inference**, or a combination of both. We denote the estimate of $f$ by $\hat{f}$, so the predicted response is -

$$ \hat{Y} = \hat{f}(X) $$

Depending on the type of the response variable, problems fall into two groups -
* **Regression** - Problems where the response variable is quantitative
* **Classification** - Problems where the response variable is qualitative

### Error in the estimate

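Treating $\hat{f}$ and $X$ as fixed and using $E[\epsilon] = 0$, the expected squared error of the estimate splits into two parts -

$$ \begin{align*} E[(Y-\hat{Y})^2] &= E[(f(X) + \epsilon - \hat{f}(X))^2] \\  &= [f(X) - \hat{f}(X)]^2 + 2[f(X) - \hat{f}(X)]E[\epsilon] + E[\epsilon^2] \\  &= [f(X) - \hat{f}(X)]^2 + Var(\epsilon) \end{align*} $$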

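Here, the first term is the *reducible error*, which a better estimate $\hat{f}$ can shrink. The second term is the *irreducible error*, which no model can remove because $\epsilon$ cannot be predicted from $X$.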
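
The split above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the functions `f` and `f_hat`, the point `x`, and the noise level `noise_sd` are all hypothetical choices made for illustration. It holds $X$ and $\hat{f}$ fixed, draws many values of $\epsilon$, and compares the simulated mean squared error with the sum of the reducible and irreducible parts.

```python
import random


def f(x):
    """Hypothetical true function (an assumption for this sketch)."""
    return 2.0 * x + 1.0


def f_hat(x):
    """Hypothetical fixed estimate of f, deliberately a little off."""
    return 1.8 * x + 1.2


def simulate_mse(x, noise_sd, n_draws, seed=0):
    """Average (Y - Y_hat)^2 over many draws of the noise term at a fixed x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = f(x) + rng.gauss(0.0, noise_sd)  # Y = f(X) + epsilon
        total += (y - f_hat(x)) ** 2
    return total / n_draws


x = 3.0
noise_sd = 0.5

reducible = (f(x) - f_hat(x)) ** 2  # [f(X) - f_hat(X)]^2
irreducible = noise_sd ** 2         # Var(epsilon)
mse = simulate_mse(x, noise_sd, n_draws=200_000)

print(f"simulated MSE:            {mse:.3f}")
print(f"reducible + irreducible:  {reducible + irreducible:.3f}")
```

With enough draws the two printed numbers should agree closely. Improving `f_hat` only lowers the first part; the `noise_sd ** 2` part stays.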

### Parametric and Non-Parametric Models

To estimate $f$ we can choose a parametric or non-parametric approach.

1. Parametric Models - In this approach, we assume a functional form for $f$, which reduces the problem of estimating $f$ to one of estimating a set of parameters
    * The assumption about the functional form of $f$ introduces 'model bias'.
    * We only have a few parameters to fit in this approach
2. Non-parametric Models - In this approach, we do not make an explicit assumption about the functional form of $f$. Instead, we look for an estimate of $f$ that gets as close to the data points as possible
    * Since the problem is no longer reduced to estimating a small number of parameters, a very large number of observations is required for an accurate estimate of $f$
    * The added flexibility brings more options for tuning the model, known as hyper-parameters
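
To make the contrast concrete, here is a small sketch using only the Python standard library. It is an illustration under assumptions, not a reference implementation: the helper names `fit_line`, `knn_predict`, and `train_mse` are hypothetical, the data come from an assumed sine curve with Gaussian noise, and k-nearest-neighbours is used as just one example of a non-parametric method. The parametric model reduces to two numbers (intercept and slope), while the non-parametric one keeps the whole training set and exposes `k` as a hyper-parameter.

```python
import math
import random


def fit_line(xs, ys):
    """Parametric: assume f(x) = intercept + slope * x and estimate both by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def knn_predict(xs, ys, x0, k):
    """Non-parametric: average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def train_mse(predict, xs, ys):
    """Mean squared error of a prediction function on the training data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: a sine curve plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(50)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]

intercept, slope = fit_line(xs, ys)
line_mse = train_mse(lambda x: intercept + slope * x, xs, ys)
knn_mse = train_mse(lambda x: knn_predict(xs, ys, x, k=3), xs, ys)

print(f"linear model: intercept={intercept:.3f}, slope={slope:.3f}, train MSE={line_mse:.3f}")
print(f"3-NN model:   train MSE={knn_mse:.3f}")
```

A low training error for the k-NN model does not by itself mean it predicts better on new data. Changing `k` trades flexibility against stability.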

#### Accuracy vs. Interpretability

As the complexity of the model increases -
* The accuracy of the predictions may increase
* The interpretability of the model may decrease

### Bias Variance Trade-off

* **Variance** refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set.
    * As the complexity of the model increases, the variance of the model increases
    * Error due to model variance can be reduced by considering more observations
* **Bias** refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler mathematical model.
    * As the complexity of the model increases, the bias of the model decreases
    * Error due to model bias can be reduced by increasing the complexity of the model (which also increases the variance)

When measuring the error on the test set, we can write the *expected test MSE* as follows -

$$
E_{\mathit{D},\epsilon}[\left( y(x|\mathit{D}) - h(x) - \epsilon \right)^2] = [E_{\mathit{D}}[y(x|\mathit{D})] - h(x)]^2 + E_{\mathit{D}}[\left( y(x|\mathit{D}) - E_{\mathit{D}}[y(x|\mathit{D})] \right)^2] + Var(\epsilon)
$$

Here,
* $\mathit{D}$ represents the training dataset under consideration
* $y(x|\mathit{D})$ represents the response predicted at $x$ by the model trained on $\mathit{D}$
* $h(x)$ represents the true function's response at $x$ (without the irreducible error), so the observed test response is $h(x) + \epsilon$

<br>

* The first term is the squared bias, $(\text{bias})^2$, where the expectation averages over the possible training datasets $\mathit{D}$. It measures how far the average model is from the true function.
* The second term is the model variance, where the expectation again averages over the possible training datasets $\mathit{D}$. It measures how sensitive the model is to variations in the training dataset.
* The third term is the irreducible error
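
The three terms can be estimated by simulation. The sketch below is illustrative only and rests on several assumptions that are not part of the notes above: the true function `h` is taken to be a sine curve, the model is a k-nearest-neighbours average (the hypothetical helper `knn_predict`), and the test point `x0`, training size, and noise level are arbitrary. It repeatedly draws a fresh training dataset $\mathit{D}$, records the prediction at `x0`, and compares the simulated test MSE against the sum of squared bias, variance, and noise variance.

```python
import math
import random


def h(x):
    """Assumed true function for this sketch."""
    return math.sin(x)


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def decompose(k, x0=1.0, n_train=30, noise_sd=0.3, n_datasets=2000, seed=0):
    """Estimate (bias^2, variance, Var(eps), test MSE) at x0 over many training sets."""
    rng = random.Random(seed)
    preds = []
    sq_errs = []
    for _ in range(n_datasets):
        # Draw a fresh training dataset D.
        xs = [rng.uniform(0.0, 3.0) for _ in range(n_train)]
        ys = [h(x) + rng.gauss(0.0, noise_sd) for x in xs]
        pred = knn_predict(xs, ys, x0, k)  # y(x0 | D)
        preds.append(pred)
        # Fresh noisy test response at x0: h(x0) + eps.
        target = h(x0) + rng.gauss(0.0, noise_sd)
        sq_errs.append((pred - target) ** 2)
    mean_pred = sum(preds) / n_datasets
    bias_sq = (mean_pred - h(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_datasets
    mse = sum(sq_errs) / n_datasets
    return bias_sq, variance, noise_sd ** 2, mse


# k=1 is a very flexible model; k=30 averages the whole training set.
results = {k: decompose(k) for k in (1, 30)}
for k, (bias_sq, variance, noise_var, mse) in results.items():
    total = bias_sq + variance + noise_var
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"noise={noise_var:.4f}  sum={total:.4f}  simulated MSE={mse:.4f}")
```

In this setup the flexible model (`k=1`) should show the larger variance and the rigid one (`k=30`) the larger squared bias, with the sum of the three terms landing close to the simulated test MSE in both cases.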