# Tutorial 2: Regression with kNN and Linear Regression
[![View notebooks on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/amonroym99/uva-applied-ml/blob/main/docs/notebooks/2_reg_knn_linreg.ipynb)

**Author:** Alejandro Monroy

In this notebook we will cover two of the most basic regression models: kNN and Linear Regression. Furthermore, we will see some metrics to evaluate regression models.

In [1]:
import numpy as np

## 1. Loading and preparing the data
We will use the `diabetes` dataset from Sklearn as we did in the previous tutorial. This time, we will set `scaled=True` to skip the normalization step:

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes(as_frame=True, scaled=True)
diabetes.data

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 | 0.044485 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017293 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046883 | 0.015491 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044529 | -0.025930 |


In [3]:
X = diabetes.data.values
y = diabetes.target.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 2. K-Nearest Neighbors
K-Nearest Neighbors (kNN) is a simple yet powerful algorithm used for both classification and regression tasks. In regression, kNN predicts the target value of a point as the average of the target values of its k nearest neighbors in the training set. The "neighbors" are determined by computing the distance between data points, typically the Euclidean distance. kNN regression is non-parametric, meaning it makes no assumptions about the underlying data distribution, which makes it versatile for various types of data.

### 2.1 Implementation from scratch
An important step in the kNN algorithm is computing distances between data points. We will use the Euclidean distance. The Euclidean distance between two points $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ and $\mathbf{q} = (q_1, q_2, \ldots, q_n)$ in Euclidean n-space is given by:


$$ d(\mathbf{p}, \mathbf{q}) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} $$

The following function implements the Euclidean distance by computing the element-wise difference between the points and then taking its norm with `np.linalg.norm`:

In [4]:
def euclidean_distance(p, q):
    """
    Calculates the Euclidean distance between two points.

    Args:
        p (np.ndarray): First point.
        q (np.ndarray): Second point.

    Returns:
        float: Euclidean distance between the two points.
    """
    return np.linalg.norm(p - q)

# Sample usage
point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
distance = euclidean_distance(point1, point2)
print("Euclidean distance:", distance)

Euclidean distance: 5.196152422706632
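
As a quick sanity check, the printed value can be reproduced by hand by plugging the two sample points $(1, 2, 3)$ and $(4, 5, 6)$ into the formula above:

$$ d = \sqrt{(1 - 4)^2 + (2 - 5)^2 + (3 - 6)^2} = \sqrt{9 + 9 + 9} = \sqrt{27} \approx 5.196 $$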


We can now implement the k-NN algorithm:

In [5]:
import numpy as np

def knn_regressor(X_train, y_train, X_test, k=5):
    """
    K-Nearest Neighbors regressor.

    Args:
        X_train (np.ndarray): Training data features.
        y_train (np.ndarray): Training data labels.
        X_test (np.ndarray): Data to predict.
        k (int): Number of neighbors to use for prediction. Default is 5.

    Returns:
        np.ndarray: Predicted target values.
    """
    # Calculate predictions for each test point
    predictions = []
    for x in X_test:
        # Compute distances from the test point to all training points
        distances = np.linalg.norm(X_train - x, axis=1)
        # Find the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:k]
        # Get the target values of the k nearest neighbors
        k_nearest_values = y_train[k_indices]
        # Compute the mean of the k nearest values
        prediction = np.mean(k_nearest_values)
        predictions.append(prediction)
    
    return np.array(predictions)

# Predict on the diabetes dataset
y_pred = knn_regressor(X_train, y_train, X_test, k=2)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])

First 5 predicted labels: [148.5 148.  183.  248.5 123.5]
First 5 true labels: [219.  70. 202. 230. 111.]
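
The loop above computes one row of distances per test point. As an optional variation, the same idea can be sketched without an explicit Python loop by using NumPy broadcasting. The function name `knn_regressor_vectorized` and the toy data below are made up for illustration; this is a minimal sketch, and it builds an array of shape `(n_test, n_train, n_features)`, so it trades memory for speed.

```python
import numpy as np

def knn_regressor_vectorized(X_train, y_train, X_test, k=5):
    """Illustrative vectorized variant of the loop-based kNN regressor."""
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    # Euclidean distances: shape (n_test, n_train)
    distances = np.linalg.norm(diffs, axis=2)
    # Indices of the k nearest training points for each test point
    k_indices = np.argsort(distances, axis=1)[:, :k]
    # Average the targets of those neighbors
    return y_train[k_indices].mean(axis=1)

# Toy 1-D data (made up for illustration)
X_toy_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy_train = np.array([0.0, 10.0, 20.0, 100.0])
X_toy_test = np.array([[0.4], [9.0]])

toy_pred = knn_regressor_vectorized(X_toy_train, y_toy_train, X_toy_test, k=2)
print(toy_pred)  # [ 5. 60.]
```

For the first toy point, the two nearest training points have targets 0 and 10, so the prediction is 5; for the second, the nearest targets are 100 and 20, giving 60.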


Recall that in the previous homework assignment we implemented the `DummyRegressor` mimicking Sklearn's implementation. Let's do the same here. In this case, fitting the model just means storing the training set in the regressor object, and predictions are made in a similar way as in our previous implementation, but accessing the training set that is stored in the class:

In [6]:
class KNeighborsRegressor:
    """
    K-Nearest Neighbors regressor.
    """

    def __init__(self, n_neighbors=5):
        """
        Initializes the KNeighborsRegressor with the specified number of neighbors.

        Args:
            n_neighbors (int): Number of neighbors to use for prediction. Default is 5.
        """
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """
        Fit the KNN regressor on the training data.

        Args:
            X (np.ndarray): Training data features.
            y (np.ndarray): Training data labels.
        """
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """
        Predict the target for the given data.

        Args:
            X (np.ndarray): Data to predict.

        Returns:
            np.ndarray: Predicted target values.
        """
        predictions = []
        for x in X:
            distances = np.linalg.norm(self.X_train - x, axis=1)
            k_indices = np.argsort(distances)[:self.n_neighbors]
            k_nearest_values = self.y_train[k_indices]
            prediction = np.mean(k_nearest_values)
            predictions.append(prediction)
        return np.array(predictions)
    
# Sample usage
knn_regressor = KNeighborsRegressor(n_neighbors=2)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])

First 5 predicted labels: [148.5 148.  183.  248.5 123.5]
First 5 true labels: [219.  70. 202. 230. 111.]


### 2.2. Importing the model from Sklearn
Now that we know how the model works, we can simply import it from the `sklearn.neighbors` module so we don't have to implement it from scratch every time we use it:

In [7]:
from sklearn.neighbors import KNeighborsRegressor

knn_regressor = KNeighborsRegressor(n_neighbors=2)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])

First 5 predicted labels: [148.5 148.  183.  248.5 123.5]
First 5 true labels: [219.  70. 202. 230. 111.]


Our results and Sklearn's coincide :) Hurray!
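
Comparing the first five printed values by eye works here, but the agreement can also be checked programmatically over all predictions. The sketch below does so with `np.allclose` on synthetic data; the helper `knn_predict` and the random arrays are assumptions introduced for illustration, and the comparison relies on Sklearn's default settings (uniform weights, Euclidean distance).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(X_train, y_train, X_test, k):
    """Loop-based kNN regression, mirroring the from-scratch version."""
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)
        k_indices = np.argsort(distances)[:k]
        predictions.append(np.mean(y_train[k_indices]))
    return np.array(predictions)

# Synthetic data (made up for illustration)
rng = np.random.default_rng(0)
X_a = rng.normal(size=(50, 3))
y_a = rng.normal(size=50)
X_b = rng.normal(size=(10, 3))

scratch_pred = knn_predict(X_a, y_a, X_b, k=2)
sklearn_pred = KNeighborsRegressor(n_neighbors=2).fit(X_a, y_a).predict(X_b)

# True when every prediction agrees up to floating-point tolerance
predictions_match = np.allclose(scratch_pred, sklearn_pred)
print("Predictions match:", predictions_match)
```

Exact ties in distance could in principle be broken differently by the two implementations, which is why continuous random data is used here.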

## 3. Evaluating regression models
Evaluating an ML model consists of assessing how well the model's predictions match the actual values. The most common metrics for evaluating regression models are:

- **Mean Squared Error (MSE)**: Measures the average squared difference between the actual and predicted values.
  
  $$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2.$$

- **Root Mean Squared Error (RMSE)**: The square root of the MSE, providing a measure in the same units as the target variable.

  $$RMSE = \sqrt{MSE}.$$

- **R-squared (**$R^2$**)**: Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
  
  $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2}{\sum_{i=1}^{n} (y_{i} - \bar{\mathbf{y}})^2},$$

where $n$ is the number of samples, $y_{i}$ is the true target value for the $i$-th sample, $\hat{y_{i}}$ is the predicted target value for the $i$-th sample, and $\bar{\mathbf{y}}$ is the mean of the true target values.

The implementation for these metrics is available in the `metrics` module of Sklearn:


In [8]:
from sklearn.metrics import mean_squared_error, r2_score

# y_test and y_pred hold the true and predicted target values, respectively
# (y_pred currently contains the kNN predictions from the previous cell)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)


Mean Squared Error: 3537.612359550562
Root Mean Squared Error: 59.477830824186604
R-squared: 0.3322931226835779
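
To connect the formulas for MSE, RMSE and $R^2$ with code, the following minimal sketch computes the three metrics directly with NumPy on a tiny hand-checkable example. The function names and the toy arrays are made up for illustration; in practice the Sklearn functions used above are the more convenient choice.

```python
import numpy as np

def mse_score(y_true, y_pred):
    # Average squared difference between true and predicted values
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Toy values (made up for illustration)
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 11.0])

mse_toy = mse_score(y_true_toy, y_pred_toy)
rmse_toy = np.sqrt(mse_toy)
r2_toy = r2(y_true_toy, y_pred_toy)

print("MSE:", mse_toy)    # squared errors 1, 0, 1, 4 -> mean 1.5
print("RMSE:", rmse_toy)  # square root of 1.5
print("R-squared:", r2_toy)  # 1 - 6/20 = 0.7
```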


## 4. Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting linear equation that describes how the dependent variable changes as the independent variables change. The equation of a multiple linear regression model is:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon $$

where:

- $y$ is the dependent variable or target,
- $x_1, x_2, \ldots, x_p$ are the independent variables,
- $\beta_0$ is the y-intercept,
- $\beta_1, \beta_2, \ldots, \beta_p$ are the coefficients,
- $\epsilon$ is the error term.

In the context of Machine Learning, we can use a training set to compute estimates of the parameters ($\hat{\beta_0}, \hat{\beta_1}, \hat{\beta_2}, \ldots, \hat{\beta_p}$), which can then be used to make predictions for new test points. In this specific case, we could compute them either with the closed-form Ordinary Least Squares solution or with a numerical optimization algorithm such as gradient descent. However, we will not implement this model from scratch; instead, we will directly use Sklearn's implementation:

In [9]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r_squared = r2_score(y_test, y_pred)
print(f"\nMean Squared Error (MSE): {mse:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")
print(f"R-squared: {r_squared:.3f}")

First 5 predicted labels: [139.5475584  179.51720835 134.03875572 291.41702925 123.78965872]
First 5 true labels: [219.  70. 202. 230. 111.]

Mean Squared Error (MSE): 2900.194
Root Mean Squared Error (RMSE): 53.853
R-squared: 0.453


The MSE is lower than for kNN with $k=2$ (2900.194 vs. 3537.612) and the R-squared is higher (0.453 vs. 0.332), which indicates that this model performs better on this test set.
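
As a side note on the Ordinary Least Squares option mentioned earlier, the sketch below shows one way the parameter estimates could be computed with NumPy's `np.linalg.lstsq`. The synthetic data and the names `X_demo`, `y_demo`, `true_beta` and `beta_hat` are assumptions made for illustration, and this is not necessarily how Sklearn computes its solution internally.

```python
import numpy as np

# Noise-free synthetic data (made up): y = 2 + 3*x1 - 1*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
true_beta = np.array([2.0, 3.0, -1.0])  # intercept, beta_1, beta_2
y_demo = true_beta[0] + X_demo @ true_beta[1:]

# Prepend a column of ones so that the intercept is estimated as well
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares estimate: minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
print("Estimated parameters:", beta_hat)
```

Because the toy targets contain no noise, the estimates recover the intercept and coefficients used to generate the data up to floating-point error.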

We can inspect the `lin_reg` object to see the estimated parameters:

In [10]:
print("Estimated coefficients: ", lin_reg.coef_)
print("Estimated intercept: ", lin_reg.intercept_)

Estimated coefficients:  [  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588
  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]
Estimated intercept:  151.34560453985995


Recall that the features are:

In [11]:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Therefore, the formula for the linear regression model (i.e., the one that is applied when we call `lin_reg.predict`) is

$$
\begin{align*}
\hat{y} = & \ 151.35 + 37.90 \cdot \text{age} - 241.96 \cdot \text{sex} + 542.43 \cdot \text{bmi} \\
          & + 347.70 \cdot \text{bp} - 931.49 \cdot \text{s1} + 518.06 \cdot \text{s2} \\
          & + 163.42 \cdot \text{s3} + 275.32 \cdot \text{s4} + 736.20 \cdot \text{s5} \\
          & + 48.67 \cdot \text{s6}
\end{align*}
$$
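
The formula can be applied by hand to a single sample. The sketch below uses the estimates printed above and the first row of the feature table from Section 1 (with values rounded as displayed there); the variable names are made up for illustration.

```python
import numpy as np

# Estimates copied from the printed `lin_reg.coef_` and `lin_reg.intercept_`
coef = np.array([37.90402135, -241.96436231, 542.42875852, 347.70384391,
                 -931.48884588, 518.06227698, 163.41998299, 275.31790158,
                 736.1988589, 48.67065743])
intercept = 151.34560453985995

# First row of the feature table (age, sex, bmi, bp, s1, ..., s6)
x_row0 = np.array([0.038076, 0.050680, 0.061696, 0.021872, -0.044223,
                   -0.034821, -0.043401, -0.002592, 0.019907, -0.017646])

# y_hat = intercept + sum over features of (coefficient * feature value)
y_hat_row0 = intercept + coef @ x_row0
print(f"Prediction for row 0: {y_hat_row0:.2f}")
```

Up to the rounding of the displayed feature values, this is the same computation that `lin_reg.predict` performs for that row.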

🔎 **Observation:** When writing math formulas for ML/statistics, we usually use the hat (  $\hat{ }$  ) to denote estimates/predictions. For example, $\hat{\beta_1}$ is the estimate of $\beta_1$ that we learn from data.