### Variable Types

Quantitative. e.g., glucose prediction.

Qualitative. e.g., handwritten digits, denoted by $\mathcal{G} = \{0, 1, \dots, 9\}$.

When we predict a quantitative variable, the task is regression; when we predict a qualitative variable, the task is classification. Some prediction methods suit only one of these tasks, while others work for both.

Some variables are $\textit{ordered categorical}$, such as $\textit{small}$, $\textit{medium}$ and $\textit{large}$.

Qualitative variables are typically represented numerically by codes. The most useful and commonly used coding is via $\textit{dummy variables}$: a $K$-level qualitative variable is represented by a vector of $K$ binary variables or bits, only one of which is "on" at a time. More compact coding schemes are possible.
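
As a minimal sketch of dummy-variable coding, the snippet below maps each value of a $K$-level qualitative variable to a vector of $K$ bits with exactly one bit set. The function name `dummy_code` and the size labels are hypothetical, chosen only for illustration.

```python
def dummy_code(values, levels):
    """Return one K-bit indicator vector per value, where K = len(levels)."""
    index = {level: k for k, level in enumerate(levels)}
    coded = []
    for v in values:
        bits = [0] * len(levels)
        bits[index[v]] = 1  # exactly one bit is "on" per value
        coded.append(bits)
    return coded


levels = ["small", "medium", "large"]  # hypothetical 3-level variable
coded = dummy_code(["medium", "large", "small"], levels)
print(coded)
```

Each row has length $K = 3$ and sums to 1, which is what distinguishes this coding from more compact integer codes.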

### Terminology

$X$: input variable

$X_j$: individual component of vector $X$

$Y$: quantitative output

$G$: qualitative output

$x_i$: observed value of $X$, (where $x_i$ can be a scalar or a vector)

$\textbf{X}$: a matrix, e.g., a set of $N$ input $p$-vectors $x_i, i = 1, \dots, N$ would be represented by $N \times p$ matrix $\textbf{X}$

Vectors are not bold, except when they have $N$ components. The $p$-vector of inputs for the $i$th observation is denoted by $x_i$, whereas the $N$-vector $\textbf{x}_j$ collects all the observations on variable $X_j$.

\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9 \\
\end{bmatrix}

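For example, if the matrix above is $\textbf{X}$ (so $N = p = 3$), the $p$-vector of inputs for the first observation is $x_1 = (1, 2, 3)^T$, the first row of $\textbf{X}$, and the $N$-vector of all observations on $X_1$ is $\textbf{x}_1 = (1, 4, 7)^T$, the first column.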

Since all vectors are assumed to be column vectors, the $i$th row of $\textbf{X}$ is $x^{T}_{i}$, the vector transpose of $x_i$
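
The row and column conventions can be checked with a small sketch that treats the $3 \times 3$ matrix above as $\textbf{X}$. The helper names `observation` and `variable` are hypothetical; indices are 1-based to match the notation.

```python
# Rows of X are observations x_i^T; columns are the N-vectors for each variable.
X = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
N, p = len(X), len(X[0])


def observation(X, i):
    """p-vector x_i (1-based i): the i-th row of X."""
    return X[i - 1]


def variable(X, j):
    """N-vector of all observations on X_j (1-based j): the j-th column of X."""
    return [row[j - 1] for row in X]


x_1 = observation(X, 1)     # inputs for the first observation
bold_x_1 = variable(X, 1)   # all observations on the first variable
print(x_1, bold_x_1)
```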

### Learning Task

Given the value of an input vector $X$, the task is to make a good prediction of the output $Y$, denoted by $\hat{Y}$ (pronounced "y-hat").

If $Y$ takes values in $\mathbb{R}$ then so should $\hat{Y}$; likewise for categorical outputs, $\hat{G}$ should take values in the same set $\mathcal{G}$ associated with $G$.

For a two-class $G$, one approach is to denote the binary coded target as $Y$, and then treat it as a quantitative output.

The predictions $\hat{Y}$ will typically lie in $\left[0, 1\right]$, and we can assign to $\hat{G}$ the class label according to whether $\hat{y} > 0.5$.
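
A minimal sketch of this thresholding rule follows. The function name `classify` and the labels `class_0` and `class_1` are hypothetical placeholders for the two classes of $G$.

```python
def classify(y_hat, labels=("class_0", "class_1"), threshold=0.5):
    """Map a fitted value for the 0/1-coded target to a class label."""
    # Values strictly above the threshold go to the class coded as 1.
    return labels[1] if y_hat > threshold else labels[0]


predictions = [classify(v) for v in (0.1, 0.5, 0.73)]
print(predictions)
```

Because the comparison is strict, a fitted value of exactly 0.5 is assigned to the class coded as 0 in this sketch; how to break that tie is a design choice, not something the rule itself fixes.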

### Least Squares and Nearest Neighbors

The linear model fit by Least Squares makes huge assumptions about structure and yields stable but possibly inaccurate predictions.

The method of $k$-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

### Linear Models and Least Squares

Given a vector of inputs $X^T = (X_1, X_2, \dots, X_p)$, we predict the output $Y$ via the model:

\begin{equation}
\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p}X_{j}\hat{\beta}_{j}
\end{equation}

The term $\hat{\beta}_0$ is the intercept or bias. $X^T$ denotes vector or matrix transpose ($X$ being a column vector).

Including a constant variable 1 in $X$ and $\hat{\beta}_0$ in the vector of coefficients $\hat{\beta}$, we can write the model in vector form as an inner product:

\begin{equation}
\hat{Y} = X^T\hat{\beta}
\end{equation}

We pick the coefficients $\beta$ to minimize the residual sum of squares:

\begin{equation}
RSS(\beta) = \sum_{i=1}^{N}(y_i - x^T_i\beta)^{2}
\end{equation}

In matrix notation

\begin{equation}
RSS(\beta) = (y - \textbf{X}\beta)^{T}(y - \textbf{X}\beta)
\end{equation}

Differentiating w.r.t. $\beta$ we get the normal equations

\begin{equation}
\textbf{X}^{T}(y - \textbf{X}\beta) = 0
\end{equation}

If $\textbf{X}^T\textbf{X}$ is nonsingular, the unique solution is

\begin{equation}
\hat{\beta} = (\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^Ty
\end{equation}

and the fitted value at the $i$th input $x_i$ is $\hat{y}_i = x^T_i \hat{\beta}$. At an arbitrary input $x_0$ the prediction is $\hat{y}_0 = x^T_0 \hat{\beta}$.
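
The least squares fit can be sketched in NumPy as follows. The synthetic data, the coefficient values, and all variable names are assumptions made for illustration. The sketch solves the normal equations with `np.linalg.solve` rather than forming the explicit inverse, which gives the same $\hat{\beta}$ when $\textbf{X}^T\textbf{X}$ is nonsingular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): N = 50 observations, 2 inputs.
N = 50
inputs = rng.normal(size=(N, 2))
true_beta = np.array([1.0, 2.0, -3.0])  # intercept first

# Include the constant variable 1 so the intercept is part of beta.
X = np.column_stack([np.ones(N), inputs])
y = X @ true_beta  # noise-free, so the fit should recover true_beta

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values at the training inputs and a prediction at an arbitrary x_0.
y_fit = X @ beta_hat
x_0 = np.array([1.0, 0.5, -0.2])  # leading 1 is the constant variable
y_hat_0 = x_0 @ beta_hat

print(beta_hat, y_hat_0)
```

With noisy responses the recovered coefficients would only approximate the generating ones, but the residual $y - \textbf{X}\hat{\beta}$ would still be orthogonal to the columns of $\textbf{X}$, as the normal equations require.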

Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients.

In the $(p + 1)$-dimensional input–output space, $(X, \hat{Y})$ represents a hyperplane.

If the constant is included in $X$, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point $(0, \hat{\beta}_0)$.

### Nearest-Neighbor Methods

The $k$-nearest-neighbor fit for $\hat{Y}$ is defined as

\begin{equation}
\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i
\end{equation}

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.

Closeness requires a metric; here we take it to be Euclidean distance.

So, we find the $k$ observations with $x_i$ closest to $x$ in input space, and average their responses.
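
A minimal sketch of this procedure, assuming Euclidean distance and a brute-force search over the training sample, is given below. The function name `knn_predict` and the toy data are hypothetical.

```python
import math


def knn_predict(x, train_x, train_y, k):
    """Average the responses of the k training points closest to x."""
    # Rank training indices by Euclidean distance to the query point x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    neighbors = order[:k]  # indices of the points in N_k(x)
    return sum(train_y[i] for i in neighbors) / k


# Toy training sample: three points near the origin and one far away.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

y_hat = knn_predict((0.1, 0.1), train_x, train_y, k=3)
print(y_hat)
```

The distant point at $(5, 5)$ is excluded from the neighborhood for $k = 3$, so its response does not influence the prediction; with $k = 4$ every training point would be averaged.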

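Where the least squares fit has $p$ parameters, the $k$-nearest-neighbor fit appears to have a single parameter, the number of neighbors $k$. Its effective number of parameters, however, is $N/k$, which is generally larger than $p$ and decreases as $k$ increases.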

### Statistical Decision Theory

We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. 

Let $X \in \mathbb{R}^p$ denote a real-valued random input vector and $Y \in \mathbb{R}$ a real-valued random output variable, with joint distribution $Pr(X, Y)$.

This theory requires a loss function $L(Y, f(X))$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$.

This leads to a criterion for choosing $f$, the expected (squared) prediction error:

\begin{align*}
    EPE(f) &= E(Y - f(X))^2 \\
           &= \int \left[y - f(x)\right]^2 Pr(dx, dy) \\
           &= E_X E_{Y|X}\left(\left[Y - f(X)\right]^2 \mid X\right), \quad \text{conditioning on } X
\end{align*}

It suffices to minimize EPE pointwise:
\begin{align*}
    f(x) = \arg\min_c E_{Y|X}\left(\left[Y - c\right]^2 \mid X = x\right)
\end{align*}

The solution is
\begin{align*}
    f(x) = E(Y|X = x) 
\end{align*}

Thus the best prediction of $Y$ at any point $X = x$ is the conditional mean, when best is measured by average squared error.
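
As a numerical sanity check of this result, the sketch below scans a grid of candidate constants $c$ and confirms that the average squared error over a sample of responses is smallest at the sample mean. The response values and the grid are hypothetical, and the sample mean stands in for the conditional mean at a single fixed $x$.

```python
def mean_squared_error(ys, c):
    """Average of (y - c)^2 over the sample."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


# Hypothetical responses observed at a single fixed input x.
ys = [2.0, 3.0, 7.0, 8.0]
mean = sum(ys) / len(ys)

# Scan candidate constants c = 0.00, 0.01, ..., 10.00 and keep the best one.
grid = [i / 100 for i in range(0, 1001)]
best_c = min(grid, key=lambda c: mean_squared_error(ys, c))

print(mean, best_c)
```

The agreement follows from the decomposition of the average squared error into the sample variance plus $(\bar{y} - c)^2$, which is minimized when $c$ equals the mean.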

What do we do when the output is a categorical variable $G$? 

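We use the same paradigm with a different loss function $L(G, \hat{G}(X))$, where $K$ is the number of classes in $\mathcal{G}$. The expected prediction error is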
\begin{align*}
    EPE &= E\left[L(G, \hat{G}(X))\right] \\
        &= E_X \sum_{k=1}^{K}L\left[\mathcal{G}_k, \hat{G}(X)\right]Pr(\mathcal{G}_k \mid X), \quad \text{conditioning on } X
\end{align*}

Again it suffices to minimize EPE pointwise:
\begin{align*}
    \hat{G}(x) = \arg\min_{g \in \mathcal{G}}\sum_{k=1}^{K}L(\mathcal{G}_k, g)Pr(\mathcal{G}_k \mid X=x)
\end{align*}
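
The pointwise minimization can be sketched for a small discrete case as follows. The posterior probabilities and the loss values are hypothetical; `loss[k][g]` plays the role of $L(\mathcal{G}_k, g)$ and `posterior[k]` the role of $Pr(\mathcal{G}_k|X=x)$, with classes indexed from 0.

```python
def bayes_decision(posterior, loss):
    """Return the class index g minimizing sum_k loss[k][g] * posterior[k]."""
    K = len(posterior)
    expected = [
        sum(loss[k][g] * posterior[k] for k in range(K)) for g in range(K)
    ]
    g_hat = min(range(K), key=lambda g: expected[g])
    return g_hat, expected


# Hypothetical 3-class example at a single point x.
posterior = [0.2, 0.5, 0.3]
loss = [
    [0, 1, 1],
    [1, 0, 1],
    [4, 1, 0],  # predicting class 0 when the truth is class 2 costs 4
]

g_hat, expected = bayes_decision(posterior, loss)
print(g_hat, expected)
```

The expected losses for predicting classes 0, 1 and 2 are 1.7, 0.5 and 0.7, so the rule selects class 1 here.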
