# Instance-Based Models

### **Introduction**

**Recall:**

**Parametric methods:**
- Assume a parameterized model for the density function.
- A fixed number of parameters is optimized by fitting the model to the data set.

**Non-parametric methods:**
- No specific parametric model is assumed.
- The form of the density function is determined entirely by the data.
- The training phase often just stores the data.

Both *supervised* and *unsupervised* learning methods can be categorized into parametric and non-parametric methods.

**Instance-Based (Memory-Based) Learning**

These methods store the entire training dataset and make predictions by retrieving and analyzing similar instances from memory.

**Key Characteristics**
- **No Explicit Training:**  
    Unlike parametric models, no parameters are learned; the training phase simply stores data.

- **Lazy Learning:**  
    Postpones building a model until prediction time.  
    Training is fast, but prediction is slow.  
    (Almost) all the work happens at test time.

- **Instance-Based Decision Making:**  
    Predictions are based on local similarity to stored training examples (e.g., nearest neighbors).

**Histogram Approximation**

Divide the data into $L$ bins and count the samples in each bin $b_l$:
$$p(b_l) \approx \frac{k_n(b_l)}{n}, \quad l = 1, ..., L$$
Where:
- $k_n(b_l)$: number of samples (out of $n$) in bin $b_l$

**Estimated Probability Density Function (PDF)**

For a fixed point $x$ within $b_l$:
$$\hat{p}(x) = \frac{p(b_l)}{h}$$
Where:
- $h$: bin width
- $|x - \bar{x}_{bl}| \le \frac{h}{2}$
- $\bar{x}_{bl}$: Mid-point of the bin $b_l$
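
The two formulas above can be combined into a small one-dimensional sketch. The function name `histogram_density`, the `origin` parameter, and the half-open bin convention are assumptions made for illustration; they are not part of the notes.

```python
import math

def histogram_density(samples, x, h, origin=0.0):
    """Histogram estimate p_hat(x) = k_n(b_l) / (n * h) for the bin b_l containing x."""
    n = len(samples)
    # Bins are [origin + l*h, origin + (l+1)*h); find the index l of x's bin.
    bin_index = math.floor((x - origin) / h)
    # k_n(b_l): number of samples that fall into the same bin as x.
    count = sum(1 for s in samples if math.floor((s - origin) / h) == bin_index)
    return count / (n * h)

samples = [0.1, 0.2, 0.3, 0.7, 1.2, 1.4, 2.6, 2.9]
# 3 of the 8 samples lie in [0, 0.5), so the estimate is 3 / (8 * 0.5).
print(histogram_density(samples, x=0.25, h=0.5))
```

Every point in the same bin receives the same estimate, which is why the result is piecewise constant and depends on where the bin edges are placed.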

<div style="text-align:center">
  <img src="images/histogram.png" alt="Histogram">
</div>

### **Density Function**

#### **Introduction**

**Given:**
- $D = \{x^{(i)}\}_{i=1}^n$: A set of samples drawn i.i.d. according to $p(x)$

Probability of Falling in a Region $R$:
$$P = \int_R p(x') dx'$$

<br>

**Binomial Distribution for Sample Counts**  
Probability of $k$ of the $n$ samples falling in $R$ follows a binomial distribution:
$$P_k = \binom{n}{k}P^k(1 - P)^{n-k}$$
Where:
- **Mean:** $E[k] = nP$
- **Variance:** $Var(k) = nP(1-P)$

<br>

**Key Insight:**  
For large $n$, the distribution of $k$ peaks sharply around its mean $nP$. Thus:
$$k \approx nP \implies \frac{k}{n} \approx P$$
- This makes $\frac{k}{n}$ a consistent estimator for $P$
- More accurate for large $n$
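
The sharpness of the peak can be checked with a short derivation that uses only the binomial mean and variance stated above:

$$E\left[\frac{k}{n}\right] = \frac{nP}{n} = P, \qquad Var\left(\frac{k}{n}\right) = \frac{nP(1-P)}{n^2} = \frac{P(1-P)}{n} \xrightarrow{n \rightarrow \infty} 0$$

The ratio $\frac{k}{n}$ is therefore unbiased for $P$, and its spread shrinks at the rate $\frac{1}{\sqrt{n}}$.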

**Estimating $p(x)$ from $P$**

**Assumptions:**
- $p(x)$ is **continuous** (smooth)
- Region $R$ is small enough that $p(x)$ is approximately constant within it.

**Approximation**
$$P = \int_R p(x')dx' \approx p(x) V$$
Where:
$$V = Vol(R)$$

Result:
$$x \in R \implies p(x) \approx \frac{P}{V} \approx \frac{\frac{k}{n}}{V}$$

Let:
- $p_n(x)$: Estimate of $p(x)$ using $n$ samples
- $V_n(x)$: Volume of region around $x$
- $k_n$: Number of samples falling in the region

Then:
$$p_n(x) = \frac{k_n}{n V_n}$$

Conditions for convergence of $p_n(x)$ to $p(x)$:
$$\lim_{n \rightarrow \infty} V_n = 0$$
$$\lim_{n \rightarrow \infty} k_n = \infty$$
$$\lim_{n \rightarrow \infty} \frac{k_n}{n} = 0$$

**Main Approaches to Satisfying the Conditions**

- **K-Nearest Neighbor Density Estimator:**
    - Fix $K$ and determine the value of $V$ from the data
    - Volume grows until it contains $K$ neighbors of $x$

- **Kernel Density Estimator (Parzen Window):**
    - Fix $V$ and determine $K$ from the data
    - Number of points falling inside the volume can vary from point to point

#### **Parzen Window**

##### **Window Function Conditions**

Let $\varphi(x)$ be a **window function** (also called a **kernel** or **potential function**).  
It must satisfy the following conditions:
- **Non-negativity:**
    $$\varphi(x) \ge 0$$
- **Normalization:**
    $$\int \varphi(x) dx = 1$$

These conditions ensure that $\varphi(x)$ behaves like a probability density function (PDF).

##### **Hypercube Window Function for Density Estimation**

**Definition**  
The hypercube window function is defined as:

$$
\varphi(u) = 
\begin{cases} 
1 & \text{if } \left( |u_1| \leq \frac{1}{2} \land \dots \land |u_d| \leq \frac{1}{2} \right) \\ 
0 & \text{otherwise}
\end{cases}
$$

- **Meaning**: Returns 1 when point $u$ is inside a unit hypercube centered at origin, else 0.

**Density Estimation Formula**  
The density estimate at point $x$:

$$
p_n(x) = \frac{k_n}{n V_n} = \frac{1}{n V_n} \sum_{i=1}^n \varphi \left( \frac{x - x^{(i)}}{h_n} \right)
$$

Where:
- $k_n = \sum_{i=1}^n \varphi \left( \frac{x - x^{(i)}}{h_n} \right)$ → sample count in hypercube
- $V_n = (h_n)^d$ → Volume of $d$-dimensional hypercube
- $h_n$: The edge length of the hypercube (bandwidth).
- $\frac{x - x^{(i)}}{h_n}$: Displacement between $x$ and $x^{(i)}$, scaled by $h_n$

**Key Result:**
- The final density estimate is expressed through $\varphi$ and $h_n$ alone; $k_n$ does not appear explicitly.
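
A minimal sketch of the hypercube estimator, assuming samples are given as tuples of floats. The function names are illustrative and not taken from the notes.

```python
def hypercube_window(u):
    """phi(u) = 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(u_j) <= 0.5 for u_j in u) else 0.0

def parzen_hypercube_density(samples, x, h):
    """p_n(x) = (1 / (n * h^d)) * sum_i phi((x - x_i) / h)."""
    n, d = len(samples), len(x)
    # k_n: number of samples inside the hypercube of edge h centered at x.
    k_n = sum(
        hypercube_window([(x_j - s_j) / h for x_j, s_j in zip(x, s)])
        for s in samples
    )
    return k_n / (n * h ** d)

samples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (3.0, 3.0)]
# Two of the four samples lie in the unit square centered at the origin.
print(parzen_hypercube_density(samples, x=(0.0, 0.0), h=1.0))
```

Here the volume `h ** d` is fixed in advance and the count `k_n` is read off the data, which is the defining choice of the Parzen approach.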

##### **Gaussian Kernel**

- The hypercube window function simply counts samples in a neighborhood of $x$
- The Gaussian kernel weights samples based on their distance to $x$

**Formula:**  
The density estimate at point $x$ (one-dimensional case):
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^n N(x | x^{(i)}, \sigma^2)$$
$$ = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x - x^{(i)})^2}{2 \sigma^2}}$$

**Key Parameters:**
- $\sigma$ (bandwidth): Controls the smoothing scale (analogous to $h_n$ in the hypercube method)
- **Larger** $\sigma$ $\rightarrow$ Smoother estimates, Low resolution, **High bias**
- **Smaller** $\sigma$ $\rightarrow$ More resolution but potentially noisy estimates, **High variance**
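
The one-dimensional Gaussian estimate can be sketched as follows. The function name `gaussian_kde` and the toy data are assumptions for illustration.

```python
import math

def gaussian_kde(samples, x, sigma):
    """1-D Gaussian kernel density estimate: the average of N(x | x_i, sigma^2)."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = sum(math.exp(-((x - s) ** 2) / (2.0 * sigma ** 2)) for s in samples)
    return norm * total / len(samples)

samples = [0.0, 1.0, 4.0]
# Evaluate in the gap between the samples for a narrow and a wide bandwidth.
for sigma in (0.1, 1.0):
    print(sigma, gaussian_kde(samples, x=2.5, sigma=sigma))
```

With the narrow bandwidth the estimate in the gap is practically zero, while the wide bandwidth spreads mass into it, which matches the resolution versus smoothness trade-off listed above.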

<div style="text-align:center">
  <img src="images/gaussianDensityFunction.png" alt="Gaussian Window">
</div>

##### **Key Results**

**General Density Estimator Formula**
$$p_n(x) = \frac{k_n}{n V_n} = \frac{1}{n h_n^d} \sum_{i=1}^n \varphi \left( \frac{x - x^{(i)}}{h_n} \right)$$

Where:
- $k_n$: Number of samples within the window
- $V_n = h_n^d$: Volume of the $d$-dimensional window
- $\varphi$: Kernel function (e.g., hypercube, Gaussian)
- $h_n$: Bandwidth controlling window volume

**Bandwidth Trade-offs**   
For a fixed sample size $n$, bandwidth $h_n$:
- **Too large:**
    - Low resolution
    - **High bias**
    - **Low Variance**
- **Too small:**
    - High variability
    - **Low Bias**
    - **High Variance**

For a fixed $h$, the **variance decreases** as the number of samples $n$ tends to $\infty$

**Practical Considerations**
- **Finite samples:** must balance between $h_n$ and $n$
- **Bandwidth Selection:**
    - Cross-validation can be used when the density estimate feeds a learning task such as classification
    - A smaller $h_n$ improves accuracy only when $n$ is sufficiently large

**Issue: Curse of Dimensionality**

To keep the estimate equally accurate, $n$ must **grow exponentially** with the dimensionality $d$.  
This phenomenon fundamentally limits non-parametric methods in high-dimensional spaces.

For example, with the **hypercube window function**:
- **One-dimensional:** suppose $n$ points are required to densely fill an interval
- **$d$-dimensional:** then $n^d$ points are required to fill the corresponding hypercube equally densely
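
The growth is easy to tabulate. The sketch below assumes 10 points per axis and, as a second illustration, data spread uniformly over the unit hypercube; both helper names are hypothetical.

```python
def points_needed(points_per_axis, d):
    """Grid points needed to keep the same per-axis spacing in d dimensions."""
    return points_per_axis ** d

def edge_for_fraction(fraction, d):
    """Edge length of a sub-cube that covers `fraction` of the unit cube's volume."""
    return fraction ** (1.0 / d)

for d in (1, 2, 3, 10):
    # Columns: dimension, points for equal spacing, edge needed to cover 1% of the volume.
    print(d, points_needed(10, d), round(edge_for_fraction(0.01, d), 3))
```

In 10 dimensions a window that covers only 1% of the volume already spans about 63% of each axis, so a "local" neighborhood stops being local.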

<div style="text-align:center">
  <img src="images/curseOfDim.png" alt="Curse of Dim Example">
</div>

#### **K-Nearest Neighbors**

Cell volume is a function of the point location:  
To estimate $p(x)$, let the cell around $x$ grow until it captures $k_n$ samples, called the $k_n$ nearest neighbors of $x$.

Two possibilities:
- **High density near $x$:**  
    The cell volume $V_n(x)$ remains small, preserving fine details.

- **Low density near $x$:**  
    The cell expands until it captures $k_n$ points, ensuring stable estimates.

**Formulation**

$$p_n(x) = \frac{k_n}{n V_n}$$
Where:
- $V_n(x)$ is the volume of the cell containing $k_n$ nearest neighbors of $x$

<br>

**Choosing $k_n$:**  
A common choice that satisfies the convergence conditions scales $k_n$ with the sample size:
$$k_n = k_1 \sqrt{n}$$
*(Assume $k_1 = 1$ for simplicity)*

<br>

**Volume Behavior:**
$$p_n(x) = \frac{k_n}{n V_n} \implies V_n \approx \frac{1}{p(x) \sqrt{n}}$$
Properties:
- Automatically smaller in dense regions ($p(x)$ large)
- Larger in sparse regions ($p(x)$ small)
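
A one-dimensional sketch of the estimator, where the cell is assumed to be the interval $[x - r, x + r]$ and $r$ is the distance to the $k$-th nearest sample. The function name and the data are illustrative.

```python
def knn_density_1d(samples, x, k):
    """kNN density estimate p_n(x) = k / (n * V), with V = 2 * r_k in one dimension."""
    n = len(samples)
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]   # distance to the k-th nearest sample
    volume = 2.0 * radius       # length of [x - r, x + r]; zero if x coincides with k samples
    return k / (n * volume)

samples = [0.0, 0.1, 0.2, 0.3, 2.0, 4.0]
print(knn_density_1d(samples, x=0.15, k=2))  # dense region: small cell, large estimate
print(knn_density_1d(samples, x=3.0, k=2))   # sparse region: large cell, small estimate
```

The cell adapts to the data: the same `k` yields a short interval near the cluster and a long one between the isolated samples.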

**Effect of $K$**

In practice, $K$ is tuned via **cross-validation**.

- Small $K$:
    - High variance
    - Low bias
- Large $K$:
    - Low variance
    - High bias

<div style="text-align:center">
  <img src="images/KNNExample.png" alt="KNN Example">
</div>

#### **Summary**

- **Generality of Distribution**
    - Converges to the true density given enough samples
    - Makes no assumptions about the underlying distributional form of the data

- **Number of required samples**
    - To ensure convergence, the number of samples must be very large
    - Grows exponentially with the dimensionality of the feature space

- Sensitive to Choice of **Window** or **Number of Nearest Neighbors**

- Need to **save all training samples**
    - Training phase simply requires storage of the training set
    - Computational cost of evaluating $p(x)$ grows linearly with the size of the dataset

### **Classification**

#### **Parzen Window**

**Generative Classification**

Recall Bayes' theorem:
$$p(y | x) = \frac{p(x|y)p(y)}{p(x)}$$

For each class $C_i$, we estimate:
- Density
$$
p(x | C_i) = \frac{k_i}{n_i V} = \frac{1}{n_i h^d} \sum_{x^{(j)} \in {D_i}} \varphi \left( \frac{x - x^{(j)}}{h} \right)
$$
- Class Prior
$$p(C_i) = \frac{n_i}{n}$$
Where
- $D_i$: Training samples in class $C_i$
- $n_i$: Number of training samples in class $C_i$
- $k_i$: Number of samples of $D_i$ falling inside the window around $x$
- $h$: Bandwidth, with $V = h^d$
- $\varphi$: Kernel window function

Decision Rule: Assign $x$ to the class with higher posterior probability:
$$
\text{label} = 
\begin{cases} 
C_1 & \text{if } {p(x|C_1)p(C_1)} \gt {p(x|C_2)p(C_2)} \\ 
C_2 & \text{otherwise}
\end{cases}
$$

The comparison simplifies to:
$${p(x|C_1)p(C_1)} \gt {p(x|C_2)p(C_2)} \rightarrow \frac{p(x|C_1)}{p(x|C_2)} \gt \frac{p(C_2)}{p(C_1)}$$

Substitute $p(x | C_i)$ and $p(C_i)$:
$$\frac{\frac{1}{n_1 h^d} \sum_{x^{(i)} \in {D_1}} \varphi \left( \frac{x - x^{(i)}}{h} \right)}{\frac{1}{n_2 h^d} \sum_{x^{(i)} \in {D_2}} \varphi \left( \frac{x - x^{(i)}}{h} \right)} \gt \frac{n_2}{n_1}$$

After cancellation of common terms:
$$\sum_{x^{(i)} \in {D_1}} \varphi \left( \frac{x - x^{(i)}}{h} \right) \gt \sum_{x^{(i)} \in {D_2}} \varphi \left( \frac{x - x^{(i)}}{h} \right)$$

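**Key Insight**
- The final decision rule reduces to comparing:
    - The summed kernel responses of the $C_1$ samples vs. the $C_2$ samples around $x$ (for the hypercube window, simply the number of $C_1$ neighbors vs. $C_2$ neighbors inside the window)
    - This shows how a **generative approach** leads to a **discriminative solution**
- For large $n$, both prediction time and memory cost are high, since every stored sample is evaluated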
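
The simplified rule can be sketched directly: sum the window responses per class and pick the largest. The dictionary layout `samples_by_class` and the function names are assumptions for illustration, and ties are broken in favor of the class listed first.

```python
def hypercube_window(u):
    """phi(u) = 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(u_j) <= 0.5 for u_j in u) else 0.0

def kernel_sum(class_samples, x, h):
    """Sum of phi((x - x_i) / h) over the samples of one class."""
    return sum(
        hypercube_window([(x_j - s_j) / h for x_j, s_j in zip(x, s)])
        for s in class_samples
    )

def parzen_classify(samples_by_class, x, h):
    """Assign x to the class with the largest summed window response."""
    scores = {label: kernel_sum(pts, x, h) for label, pts in samples_by_class.items()}
    return max(scores, key=scores.get)

samples_by_class = {
    "C1": [(0.0,), (0.2,), (0.4,)],
    "C2": [(1.0,), (1.1,)],
}
print(parzen_classify(samples_by_class, x=(0.3,), h=0.5))
print(parzen_classify(samples_by_class, x=(1.05,), h=0.5))
```

No density is ever normalized here, because the factors $\frac{1}{n_i h^d}$ and the priors $\frac{n_i}{n}$ cancel as shown in the derivation.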

#### **KNN**

##### **Generative Classification**

Recall Bayes' theorem:
$$p(y | x) = \frac{p(x|y)p(y)}{p(x)}$$

For each class $C_i$, we estimate:
- Density
$$p(x | C_i) = \frac{k}{n_i V_i}$$
- Class Prior
$$p(C_i) = \frac{n_i}{n}$$
Where
- $n_i$: Number of training samples in class $C_i$
- $V_i$: Volume of the smallest cell around $x$ that contains $k$ samples of class $C_i$

Decision Rule: Assign $x$ to the class with higher posterior probability:
$$
\text{label} = 
\begin{cases} 
C_1 & \text{if } {p(x|C_1)p(C_1)} \gt {p(x|C_2)p(C_2)} \\ 
C_2 & \text{otherwise}
\end{cases}
$$

The comparison simplifies to:
$$\frac{p(x|C_1)}{p(x|C_2)} \gt \frac{p(C_2)}{p(C_1)}$$

Substitute $p(x | C_i)$ and $p(C_i)$:
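$$\frac{k n_2 V_2}{k n_1 V_1} \gt \frac{n_2}{n_1}$$

After cancellation of common terms:
$$V_2 \gt V_1$$

So $x$ is assigned to $C_1$ when the cell holding its $k$ nearest $C_1$ samples is smaller than the one holding its $k$ nearest $C_2$ samples. As with the *Parzen Window*, the decision reduces to a simple comparison of local neighborhoods.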
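
One way to read this rule in code, assuming each class gets its own cell that grows until it holds $k$ samples of that class: the product $p(x|C_i)p(C_i) = \frac{k}{n V_i}$ is largest for the class with the smallest cell, so it suffices to compare the distance to the $k$-th nearest sample of each class. The names below are hypothetical.

```python
import math

def kth_neighbor_distance(class_samples, x, k):
    """Radius of the smallest ball around x that holds k samples of one class."""
    return sorted(math.dist(x, s) for s in class_samples)[k - 1]

def knn_generative_classify(samples_by_class, x, k):
    """Pick the class whose k-th nearest sample is closest (smallest cell volume)."""
    radii = {
        label: kth_neighbor_distance(pts, x, k)
        for label, pts in samples_by_class.items()
    }
    return min(radii, key=radii.get)

samples_by_class = {
    "C1": [(0.0,), (0.5,), (1.0,)],
    "C2": [(3.0,), (3.5,), (4.0,)],
}
print(knn_generative_classify(samples_by_class, x=(1.2,), k=2))
print(knn_generative_classify(samples_by_class, x=(3.2,), k=2))
```

Comparing radii is equivalent to comparing volumes because the volume of a ball grows monotonically with its radius.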

##### **Discriminative Classification**

**Given**
- Training dataset: $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$
- Test sample: $x$

**Classification Steps**
- Find $k$ nearest training samples to $x$
- Among these $k$ neighbors, count how many belong to each class:  
    $$k_j = \text{Number of neighbors in class } C_j \quad (j = 1, \ldots, C)$$

- Assign $x$ to the class $C_{j^*}$; where:
    $$
    j^* = \operatorname*{argmax}_{j=1,\ldots,C} k_j
    $$
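
The classification steps above can be sketched with the standard library alone. Euclidean distance is assumed, and ties between classes fall to whichever label `Counter.most_common` returns first.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, x, k):
    """Majority vote among the k training samples nearest to x."""
    # Indices of the training samples, sorted by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    # k_j: how many of the k nearest neighbors carry each label.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train_x, train_y, x=(0.5, 0.5), k=3))
print(knn_classify(train_x, train_y, x=(5.0, 5.5), k=3))
```

Sorting all $n$ distances costs $O(n \log n)$ per query, which reflects the lazy-learning trade-off: nothing is computed until a prediction is requested.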

##### **Effect of $K$**

In practice, $K$ is tuned via **cross-validation**.

- Small $K$:
    - High variance
    - Low bias
- Large $K$:
    - Low variance
    - High bias

<div style="text-align:center">
  <img src="images/effectOfK.png" alt="Effect of k">
</div>

#### **Distance Measures**

**Euclidean Distance**
$$d(x, x') = \sqrt[2]{||x - x'||_{2}^{2}} = \sqrt[2]{(x_1 - x'_1)^2 + \ldots + (x_d - x'_d)^2}$$

**Distance Learning Methods**

- **Weighted Euclidean Distance**
    $$d(x, x') = \sqrt[2]{w_1(x_1 - x'_1)^2 + \ldots + w_d(x_d - x'_d)^2}$$

- **Mahalanobis Distance**
    $$d(x, x') = \sqrt[2]{(x - x')^T A (x - x')}$$
    Where:
    - The matrix $A$ acts similarly to the inverse covariance matrix
    - $A$: $d \times d$ matrix
    - If $A$ is diagonal (uncorrelated features), it reduces to weighted Euclidean distance
    - Off-diagonal elements capture feature correlations

**Minkowski Distance**
$$d(x, x') = \sqrt[p]{\sum_{i=1}^d |x_i - x'_i|^p}$$

- For $p \ge 1$
- Minkowski with $p = 2$ is the same as Euclidean distance
- Minkowski distance is the same as $L^p$ norm of $(x - x')$

**$L^p$ norm:**
$$||x||_p = \sqrt[p]{|x_1|^p + \ldots + |x_d|^p}$$
Some famous ones:
$$||x||_1 = \sum_{i=1}^d |x_i|$$
$$||x||_2 = \sqrt{x_1^2 + \ldots + x_d^2}$$
$$||x||_{\infty} = \max{\{|x_1|, |x_2|, \ldots, |x_d|\}}$$

**Cosine distance (angle)**
$$d(x, x') = 1 - \text{cosine similarity}(x, x')$$
Where:
$$\text{cosine similarity}(x, x') = \frac{x \cdot x'}{||x||_2 ||x'||_2} = \frac{\sum_{i=1}^d x_i x'_i}{\sqrt{\sum_{i=1}^d x_i^2} \sqrt{\sum_{i=1}^d {x'_i}^2}}$$
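
The distance measures in this section translate into a few lines each. The sketch below assumes vectors are plain sequences of floats and that the matrix $A$ is given as a nested list; the function names are illustrative.

```python
import math

def minkowski(x, y, p):
    """L^p distance; p = 2 gives Euclidean and p = 1 gives Manhattan distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def weighted_euclidean(x, y, w):
    """Euclidean distance with a non-negative weight per feature."""
    return math.sqrt(sum(w_i * (a - b) ** 2 for w_i, a, b in zip(w, x, y)))

def mahalanobis(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a d x d matrix A given as nested lists."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product A @ diff, written out explicitly.
    a_diff = [sum(A[i][j] * diff[j] for j in range(len(diff))) for i in range(len(diff))]
    return math.sqrt(sum(d_i * ad_i for d_i, ad_i in zip(diff, a_diff)))

def cosine_distance(x, y):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 2), minkowski(x, y, 1))
print(mahalanobis(x, y, [[4.0, 0.0], [0.0, 1.0]]), weighted_euclidean(x, y, (4.0, 1.0)))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

The second line prints the same value twice, illustrating that a diagonal $A$ reduces the Mahalanobis distance to the weighted Euclidean distance.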

<div style="text-align:center">
  <img src="images/cosineSimilarity.png" alt="Cosine Similarity">
</div>

#### **Weighted (Kernel) KNN**

Recall: **Weighted Euclidean distance** measure:
$$d(x, x') = \sqrt[2]{w_1(x_1 - x'_1)^2 + \ldots + w_d(x_d - x'_d)^2}$$

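Weight **nearer** neighbors more **heavily** (here the weights $w_j(x)$ are attached to neighbors, not to features):
$$\hat{y} = f(x)= \operatorname*{argmax}_{c = 1, \ldots, C} \sum_{j \in N_k(x)} w_j(x) I(c = y^{(j)})$$
where $N_k(x)$ is the index set of the $k$ nearest neighbors of $x$ and $I(\cdot)$ is the indicator function.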

Example of weighting function:
$$w_j(x) = \frac{1}{||x - x^{(j)}||^2}$$

**Shepard's Method**

In weighted kNN, we can use all $n$ training examples instead of just the $k$ nearest:
$$\hat{y} = f(x)= \operatorname*{argmax}_{c = 1, \ldots, C} \sum_{j = 1}^{n} w_j(x) I(c = y^{(j)})$$

Weights can be found using a **kernel function**
$$w_j(x) = K(x, x^{(j)})$$
Example:
$$K(x, x^{(j)}) = e^{-\frac{d(x, x^{(j)})^2}{\sigma^2}}$$
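
A sketch of the weighted vote. The kernel is assumed to use the squared Euclidean distance in the exponent, and the names `gaussian_weight` and `weighted_knn_classify` are illustrative. Passing `k = len(train_x)` uses every training example, as in the all-samples variant above.

```python
import math
from collections import defaultdict

def gaussian_weight(x, x_j, sigma):
    """Assumed kernel: exp(-||x - x_j||^2 / sigma^2)."""
    return math.exp(-(math.dist(x, x_j) ** 2) / sigma ** 2)

def weighted_knn_classify(train_x, train_y, x, k, sigma=1.0):
    """Each of the k nearest neighbors votes with weight w_j(x)."""
    order = sorted(range(len(train_x)), key=lambda j: math.dist(x, train_x[j]))
    scores = defaultdict(float)
    for j in order[:k]:
        scores[train_y[j]] += gaussian_weight(x, train_x[j], sigma)
    return max(scores, key=scores.get)

train_x = [(0.0,), (2.0,), (2.1,)]
train_y = ["a", "b", "b"]
# An unweighted 3-NN vote would pick "b" (two votes to one);
# weighting by distance lets the single close neighbor win.
print(weighted_knn_classify(train_x, train_y, x=(0.5,), k=3, sigma=1.0))
```

As `sigma` grows the weights become nearly equal and the rule falls back to an ordinary majority vote.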

<div style="text-align:center">
  <img src="images/weightingFuncs.png" alt="Weighing Functions">
</div>

### **Regression**

#### **KNN Regression**

Let
- $\{x'^{(1)}, \ldots, x'^{(k)}\}$: $k$ nearest neighbors of $x$
- $\{y'^{(1)}, \ldots, y'^{(k)}\}$: Corresponding labels

Predict $\hat{y}$ as the average of the neighbors' labels:
$$\hat{y} = \frac{1}{k} \sum_{j = 1}^k y'^{(j)}$$
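
A one-dimensional sketch of the averaging rule, with illustrative names and toy data drawn from $y = 2x$:

```python
def knn_regress(train_x, train_y, x, k):
    """Predict the mean label of the k training points nearest to x (1-D inputs)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(x - train_x[i]))
    return sum(train_y[i] for i in order[:k]) / k

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(knn_regress(train_x, train_y, x=2.2, k=2))   # averages the labels at x = 2 and x = 3
print(knn_regress(train_x, train_y, x=10.0, k=2))  # outside the data: stuck at the last labels
```

The second query shows the prediction leveling off beyond the training range, even though the underlying relationship is a straight line.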

**Challenges of kNN Regression**


- **Discontinuities in Estimated Function**
    - **Problem:**  
    The predicted function $\hat{y}$ is piecewise constant, leading to abrupt jumps.

    - **Solution:**  
    **Weighted (Kernel) regression**: Assign weights to neighbors based on distance

- **Noise Sensitivity**
    - **1NN:**  
        - Overfit to noise
        - High variance
        - Adopting the label of the single nearest neighbor, even if it's an outlier.

    - **kNN (k $\gt$ 1):**
        - Smooths away noise, but has other deficiencies.
        - Lower variance
        - Disadvantage: **Flattens the ends** (predictions level off near the boundaries of the data)


<div style="text-align:center">
  <img src="images/kNNRegression.png" alt="kNN Regression examples">
</div>

#### **Weighted kNN Regression**

**Recall:**  
Standard kNN regression estimate:
$$\hat{y} = \frac{1}{k} \sum_{j = 1}^k y'^{(j)}$$
Where $\{y'^{(1)}, \ldots, y'^{(k)}\}$ are the labels of the $k$ nearest neighbors.

**Weighted kNN Regression**

Give higher weights to nearer neighbors:
$$\hat{y} = \frac{\sum_{j \in N_k(x)} w_j(x) y^{(j)}}{\sum_{j \in N_k(x)} w_j(x)} $$

In the weighted kNN regression, we can use all training examples instead of just $k$:
$$\hat{y} = \frac{\sum_{j = 1}^{n} w_j(x) y^{(j)}}{\sum_{j = 1}^{n} w_j(x)} $$

Weights can be found using a **kernel function**
$$w_j(x) = K(x, x^{(j)})$$

**Common Kernel Choices:**
- Gaussian Kernel:
$$K(x, x^{(j)}) = e^{-\frac{d(x, x^{(j)})^2}{\sigma^2}}$$
- Bandwidth $\sigma$ controls smoothness:
    - Small $\sigma$: More localized fit (risk of overfitting)
    - Large $\sigma$: Smoother fit (risk of underfitting)
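
The all-samples version of the weighted average can be sketched for one-dimensional inputs as follows. The kernel is assumed to use the squared distance in the exponent, and `kernel_regress` is an illustrative name.

```python
import math

def kernel_regress(train_x, train_y, x, sigma):
    """Kernel-weighted average of all training labels (1-D inputs)."""
    weights = [math.exp(-((x - x_j) ** 2) / sigma ** 2) for x_j in train_x]
    # Note: if x is very far from all samples, every weight can underflow to zero.
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 2.0]
for sigma in (0.1, 1.0, 1000.0):
    print(sigma, kernel_regress(train_x, train_y, x=0.0, sigma=sigma))
```

A tiny bandwidth reproduces the label of the closest sample, while a huge bandwidth returns roughly the global mean of the labels, which mirrors the overfitting and underfitting extremes listed above.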
    
<div style="text-align:center">
  <img src="images/KernelKNN.png" alt="Gaussian Kernel">
</div>

**Disadvantages:**

- **Inability to Capture Simple Global Structures**  
    kNN fails to identify fundamental patterns in the data (e.g., linear, polynomial, or periodic relationships), even when they are obvious.

- **Unreliable Answers at Edges**  
    Fails to extrapolate beyond the range of the training data

#### **Locally Weighted Regression**

For each test sample:
- Fits a **local parametric** linear model to nearby training points

**Advantages Over kNN Regression**
- Avoids piecewise-constant artifacts of kNN
- Converges to the true function as $n \rightarrow \infty$ (provided the bandwidth $\sigma$ shrinks at a suitable rate)
- Better estimates at the edges than kNN

**Locally Unweighted Linear Regression**

For a neighborhood $N_k(x)$ of $k$ nearest neighbors:
$$\hat{y} = f(x;w) = w_0 + w_1x_1 + \ldots + w_dx_d$$

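Objective, minimized over $w$ (each $x^{(i)}$ is augmented with a leading 1 so that $w^Tx^{(i)}$ includes $w_0$):
$$J(w) = \sum_{i \in N_k(x)} (y^{(i)} - w^Tx^{(i)})^2$$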

**Locally Weighted Linear Regression**

Introduces kernel weights $K(x,x^{(i)})$ to emphasize nearby points:
$$J(w) = \sum_{i \in N_k(x)} K(x, x^{(i)}) (y^{(i)} - w^Tx^{(i)})^2$$

**Global Kernel Extension**

Uses all $n$ training samples with distance-based weighting:
$$J(w) = \sum_{i = 1}^n K(x, x^{(i)}) (y^{(i)} - w^Tx^{(i)})^2$$
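
For a single input feature, the weighted objective has a closed-form minimizer, which keeps a sketch short. The kernel with a squared distance in the exponent and the name `lwr_predict` are assumptions for illustration.

```python
import math

def lwr_predict(train_x, train_y, x, sigma):
    """Fit w0 + w1 * x by kernel-weighted least squares around the query x, then predict."""
    k = [math.exp(-((x - x_i) ** 2) / sigma ** 2) for x_i in train_x]
    total = sum(k)
    # Kernel-weighted means of inputs and labels.
    x_bar = sum(k_i * x_i for k_i, x_i in zip(k, train_x)) / total
    y_bar = sum(k_i * y_i for k_i, y_i in zip(k, train_y)) / total
    # Weighted covariance and variance give the slope; the intercept follows.
    s_xy = sum(k_i * (x_i - x_bar) * (y_i - y_bar) for k_i, x_i, y_i in zip(k, train_x, train_y))
    s_xx = sum(k_i * (x_i - x_bar) ** 2 for k_i, x_i in zip(k, train_x))
    w1 = s_xy / s_xx
    w0 = y_bar - w1 * x_bar
    return w0 + w1 * x

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]   # y = 2x + 1
print(lwr_predict(train_x, train_y, x=6.0, sigma=2.0))
```

Because a line is fitted for every query, the prediction at `x = 6.0` follows the linear trend beyond the training range, where a plain weighted average of labels could never exceed the largest label. A new fit is required per query, so prediction is more expensive than in kNN regression.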