<a href="https://colab.research.google.com/github/fenix-hub/ml-foundations-cours-2026/blob/main/ML_Tests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import time
print(time.ctime())

print("SHIFT + ENTER: New Cell")
print("CTRL + ENTER: Execute cell")

Tue Feb 10 08:15:08 2026
SHIFT + ENTER: New Cell
CTRL + ENTER: Execute cell


This is **bold**.
This is *italic*.
This is ~strikethrough~.
The output of the above commands is rendered on the right-hand side of the cell, as shown here.

$\sqrt{3x-1}+(1+x)^2$

In [9]:
import pandas as pd
pd.read_csv("/content/sample_data/california_housing_test.csv")

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0


Data is collected in a training set defined as:

$\mathcal{T} = \{(x_i, y_i)\}_{i=1}^n$

where $n$ is the number of observed samples. Each pair represents one realization of the relationship between input and output.  
Assumption: the samples in $\mathcal{T}$ are independent and identically distributed (i.i.d.).

# Input Variables

Inputs are represented as vectors:

$x = [x_1, x_2, \ldots, x_p] \in \mathbb{R}^p$

where each component $x_j$ corresponds to a **feature** describing one aspect of the observation. The dimension $p$ is the number of features.  

If the input variable takes ordered, meaningful values, it is a **numerical** variable. It can be continuous or discrete, and arithmetic operations and distances are well defined.

If the input variable takes values in a finite set of labels with no intrinsic ordering, it is a **categorical** variable. Labels are *identifiers* rather than quantities.

# Output Variables
The output variable $y$ is usually a scalar.
Supervised learning problems are commonly categorised according to the nature of the output variable.

If the output is numerical we deal with **regression** problems.  The goal of regression is to learn a function that maps inputs to real-valued outputs.

$y \in \mathbb{R}$


If the output takes values in a finite set of classes, we deal with **classification** problems.

$y \in \{1, \ldots, K\}$

The goal is to assign each input to one of the available classes.  
Both *classification* and *regression* belong to the machine learning paradigm of **supervised learning**.

# Supervised Learning Workflow

A supervised learning workflow typically consists of:  
• Collecting labeled training data $(x_i, y_i)$.  
• Choosing a learning algorithm and model type.  
• Training the model $\hat{f}$ by adapting its parameters to minimize prediction error.  
• Using the trained model to make predictions on new inputs $x^*$.  

## Prediction

At prediction time, the learning scenario is the following:

• a new input $x^*$ is observed;  
• the corresponding true output $y^*$ is unknown;  
• the model produces a prediction  

$ \hat{y}^* = \hat{f}(x^*) $


## Generalization and Overfitting

To be useful, models must not just fit the training data but generalize to new, unseen data.  
Overfitting occurs when the model captures noise or peculiarities of the training set rather than the underlying pattern, resulting in poor performance on new inputs.  
Techniques to monitor and prevent overfitting include:  
• Hold-out validation sets.  
• Cross-validation.  
• Regularization methods.  
• Early stopping during training.  
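
As a rough illustration of the first technique, the sketch below holds out part of a synthetic dataset and compares training and validation error for a deliberately overfitting model that simply memorizes the training pairs. The helper names (`hold_out_split`, `mse`, `predict`), the data, and the memorizing model are all assumptions made for this example; they are not part of the course material.

```python
import random

def hold_out_split(X, y, validation_fraction=0.25, seed=0):
    """Shuffle the sample indices and set aside a fraction for validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(X) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    X_tr = [X[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    y_val = [y[i] for i in val_idx]
    return X_tr, y_tr, X_val, y_val

def mse(y_true, y_pred):
    """Mean squared error between true and predicted outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (made up): a linear trend plus an alternating perturbation.
X = [float(i) for i in range(20)]
y = [2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(X)]

X_tr, y_tr, X_val, y_val = hold_out_split(X, y)

# A model that memorizes the training pairs and falls back to the
# training mean for inputs it has never seen.
lookup = dict(zip(X_tr, y_tr))
fallback = sum(y_tr) / len(y_tr)

def predict(x):
    return lookup.get(x, fallback)

train_mse = mse(y_tr, [predict(x) for x in X_tr])
val_mse = mse(y_val, [predict(x) for x in X_val])
print("training MSE:", train_mse)
print("validation MSE:", val_mse)
```

The training error is zero because every training pair is reproduced exactly, while the validation error is not: a gap of this kind between the two is the usual symptom of overfitting.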

Although regression and classification differ in the type of output and in the evaluation metrics used, many learning principles, such as generalisation, model complexity, and overfitting, apply to both settings.

# Matrix representation of the training data

For computational and conceptual clarity, the training data are often represented in matrix form. The inputs are arranged into a matrix  
$X \in \mathbb{R}^{n \times p}$  
where each of the $n$ rows corresponds to one observation and each of the $p$ columns corresponds to one feature.  
The outputs are collected into a vector  
$y \in \mathbb{R}^n$  
This representation makes explicit the distinction between samples and features and underlies many algorithmic operations, such as distance computation, feature scaling, and optimisation procedures.
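
The row and column convention can be made concrete with a tiny example. This is only a sketch: the numbers are made up, and plain nested lists stand in for the matrix (in practice a NumPy array or a pandas DataFrame would be the usual choice).

```python
# Toy training set: n = 4 observations, p = 2 features (values are made up).
X = [
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
]
y = [10.0, 12.0, 15.0, 21.0]

n = len(X)      # number of rows    = number of observations
p = len(X[0])   # number of columns = number of features

# A row is one observation; a column collects one feature across all samples.
# Note that Python indices start at 0, while the text counts from 1.
x_2 = X[1]                           # second observation, x_2
feature_1 = [row[0] for row in X]    # first feature for every sample

print(n, p)
print(x_2)
print(feature_1)
```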

# A distance based method: k-Nearest Neighbours

k-NN is one of the simplest and most intuitive methods in supervised learning, yet it provides common ground for more specialised and powerful methods.  
The idea is simple:  
if two inputs $x$ and $x^*$ are **close** to each other in the input space, then their corresponding outputs $y$ and $y^*$ should be similar.  

This assumption reflects a form of smoothness of the underlying input–output relationship.  
Rather than explicitly learning a parametric function, k-NN bases its predictions directly on the observed training data.
Given a new input $x^*$, the idea is to:  
• search for training inputs $x_i$ that are close to $x^*$;  
• combine their associated outputs $y_i$ to produce a prediction $\hat{y}^*$.  

## Distance measure

k-NN needs a notion of **closeness**. For simplicity we can use the *Euclidean* distance between two input vectors $x_i$ and $x^*$:  

$ ||x_{i} - x^*||_{2}   = \sqrt {\sum _{j=1}^{p}  \left( x_{ij}-x^*_{j}\right)^2 } $
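
The formula translates almost literally into code. The function name `euclidean_distance` and the two example vectors below are assumptions chosen for illustration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    # Sum the squared differences feature by feature, then take the square root.
    return math.sqrt(sum((a_j - b_j) ** 2 for a_j, b_j in zip(a, b)))

x_i = [0.0, 0.0]
x_star = [3.0, 4.0]
print(euclidean_distance(x_i, x_star))  # 5.0
```

The standard library function `math.dist` (Python 3.8 and later) computes the same quantity.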

## The set of k nearest neighbours

To compute the prediction $\hat{y}^*$, we do not use the whole output vector $y$, but only a *subset* of it. The index set, i.e. the set of the indices of the $k$ nearest neighbours, is defined as:  
$ \mathcal{N}^* = \{ i : x_{i} \text{ is among the } k \text{ closest points to } x^* \} $  
where $k$ is the number of neighbours.  
So we evaluate only the outputs in the set:  
$ \{y_i : i \in \mathcal{N}^* \} $
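
One possible way to build the index set is to sort the training indices by distance and keep the first $k$. In this sketch the function name `nearest_neighbour_indices`, the toy data, and the tie-breaking rule (lower index first) are assumptions; indices are 0-based, unlike the 1-based notation of the text.

```python
import math

def nearest_neighbour_indices(X, x_star, k):
    """Return the indices of the k training inputs closest to x_star."""
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of training samples")
    distances = [math.dist(x_i, x_star) for x_i in X]
    # sorted() is stable, so equal distances keep their original (index) order.
    order = sorted(range(len(X)), key=lambda i: distances[i])
    return order[:k]

# Toy data (made up): 4 training inputs with 2 features each.
X = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
y = [1.0, 2.0, 10.0, 1.5]
x_star = [0.4, 0.1]

N_star = nearest_neighbour_indices(X, x_star, k=2)
neighbour_outputs = [y[i] for i in N_star]
print(N_star)             # [3, 0]
print(neighbour_outputs)  # [1.5, 1.0]
```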

# Prediction Rules

Once we have defined the set of outputs to use in order to compute the prediction, how do we actually compute it?  

For **regression** problems, where the output consists of numerical values, we use the average of the outputs of the $k$ nearest neighbours:  
$ \hat{y}^* = \frac{1}{k}\sum_{i\in\mathcal{N}^*} y_i $  
  

For **classification** problems, where the output consists of categorical values, we use the most frequent class among the neighbours:  
$ \hat{y}^* = \arg\max_{c\in\mathcal{C}} \sum_{i\in\mathcal{N}^*} \mathbf{1}\{y_i = c\} $  
where $\mathcal{C}$ denotes the set of classes and $\mathbf{1}\{\cdot\}$ denotes the indicator function:  

$ \mathbf{1}\{y_i = c\}=\begin{cases}1&\text{if }y_i = c\\ 0&\text{otherwise}\end{cases} $
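
Both rules fit in a few lines once the neighbours' outputs are available. The function names below are hypothetical, and the tie-breaking behaviour of the majority vote (the label encountered first wins) is an arbitrary choice that the text does not specify.

```python
from collections import Counter

def predict_regression(neighbour_outputs):
    """Average of the neighbours' numerical outputs."""
    return sum(neighbour_outputs) / len(neighbour_outputs)

def predict_classification(neighbour_labels):
    """Most frequent class among the neighbours (majority vote)."""
    counts = Counter(neighbour_labels)  # counts[c] = sum of 1{y_i = c}
    # most_common(1) returns the label with the highest count; among equal
    # counts, the label that was seen first is returned.
    return counts.most_common(1)[0][0]

print(predict_regression([1.0, 2.0, 3.0]))        # 2.0
print(predict_classification(["a", "b", "a"]))    # a
```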

# The k-NN algorithm

Given a training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^n$, a positive integer $k$, and a test input $x^*$:  
1. Compute the Euclidean distance $\|x_i - x^*\|_2$ between $x^*$ and every training input $x_i$, for $i = 1, \ldots, n$ (the whole training set);  
2. Identify the index set $\mathcal{N}^*$ of the $k$ closest training points;  
3. Predict $\hat{y}^*$ with the rule for the task at hand (majority vote for classification, average for regression).  
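
The three steps can be combined into a single function. This is a minimal brute-force sketch in plain Python, not an optimised implementation: the name `knn_predict`, its `task` argument, and the toy data are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_star, k, task="regression"):
    """Brute-force k-NN prediction for a single test input x_star."""
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Step 1: distance from x_star to every training input.
    distances = [math.dist(x_i, x_star) for x_i in X_train]
    # Step 2: indices of the k closest training points (0-based).
    N_star = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    neighbours = [y_train[i] for i in N_star]
    # Step 3: combine the neighbours' outputs according to the task.
    if task == "regression":
        return sum(neighbours) / k
    if task == "classification":
        return Counter(neighbours).most_common(1)[0][0]
    raise ValueError("task must be 'regression' or 'classification'")

# Toy one-dimensional data (made up).
X_train = [[1.0], [2.0], [3.0], [10.0]]
y_values = [1.0, 2.0, 3.0, 10.0]
y_labels = ["a", "a", "b", "b"]

print(knn_predict(X_train, y_values, [2.2], k=3))                         # 2.0
print(knn_predict(X_train, y_labels, [2.2], k=3, task="classification"))  # a
```

Each prediction scans the whole training set, so the cost grows linearly with $n$; library implementations often use spatial index structures to speed up the neighbour search.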

## The value of $k$
But *how* do we choose the value of $k$?  
$k$ is a **hyperparameter**, since it is chosen by the user rather than learned, and it has a direct impact on the behavior of the **predictor**. From a statistical perspective, the effect of $k$ can be interpreted in terms of the *bias-variance trade-off*:  
- a small value of $k$ => low bias, high variance (flexible model that follows the training data closely, with abrupt changes in the predictions as the input varies)  
- a large value of $k$ => high bias, low variance (smoother predictions, since many samples are averaged)  

In practice, the value of k is typically selected using cross-validation on the
training set.
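
A minimal sketch of this selection procedure for a regression task is given below. Everything in it is an assumption made for illustration: the helper names, the synthetic data, the candidate values of $k$, the use of 5 interleaved folds (a shuffled split would be more common in practice), and the mean squared error as the score.

```python
import math

def knn_regress(X_train, y_train, x_star, k):
    """k-NN regression prediction for a single input."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x_star))
    return sum(y_train[i] for i in order[:k]) / k

def cross_validated_mse(X, y, k, n_folds=5):
    """Average squared error of k-NN over n_folds validation folds."""
    n = len(X)
    squared_errors = []
    for fold in range(n_folds):
        # Interleaved folds: every n_folds-th sample goes to validation.
        val_idx = list(range(fold, n, n_folds))
        held_out = set(val_idx)
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        for i in val_idx:
            prediction = knn_regress(X_tr, y_tr, X[i], k)
            squared_errors.append((prediction - y[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Synthetic data (made up): a linear trend plus an alternating perturbation.
X = [[i / 2] for i in range(30)]
y = [2 * x[0] + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(X)]

candidate_ks = [1, 3, 5, 9]
scores = {k: cross_validated_mse(X, y, k) for k in candidate_ks}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: cross-validated MSE = {score:.3f}")
print("selected k:", best_k)
```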

## Feature scaling
Since k-NN relies entirely on the *closeness* of the inputs, the scale of the input features has a strong influence on the predictions. If we use the Euclidean distance without preprocessing, the feature with the largest numerical scale will dominate.  
This is why, for distance-based methods, we need to apply *feature scaling* beforehand.  

The most common strategies are **standardization (z-score)** and **normalization (to [0,1])**. The former is preferred for continuous features.  

*Standardization*: $x_{ij}^{\mathrm{std}} = \frac{x_{ij}-\mu_j}{\sigma_j}$ where $\mu_j$ and $\sigma_j$ denote the mean and standard deviation of feature $j$ computed on the training set.  
*Normalization*: $ x_{ij}^{\mathrm{norm}}=\frac{x_{ij}-\min(x_j)}{\max(x_j)-\min(x_j)} $
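
Both formulas can be sketched in plain Python as follows. The function names, the toy matrix, the use of the population standard deviation, and the handling of constant features (their divisor is replaced by 1) are assumptions made for this example. The statistics are computed on the training data and then applied to whatever data needs scaling.

```python
import statistics

def standardize(X_train, X_new):
    """Z-score X_new using per-feature mean and std computed on X_train."""
    columns = list(zip(*X_train))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; a constant feature would give 0,
    # so its divisor is replaced by 1.0 (an assumption of this sketch).
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X_new]

def min_max_normalize(X_train, X_new):
    """Rescale X_new using per-feature min and max computed on X_train."""
    columns = list(zip(*X_train))
    mins = [min(col) for col in columns]
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / r for v, lo, r in zip(row, mins, ranges)] for row in X_new]

# Toy data (made up): the second feature is 100 times larger than the first.
X_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]

X_std = standardize(X_train, X_train)
X_norm = min_max_normalize(X_train, X_train)
print(X_std)
print(X_norm)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

After either transformation the two features live on comparable scales, so neither dominates the Euclidean distance.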

# Decision boundaries - Classification

In classification problems the predictor defines a mapping that assigns a *class label* to each point of the input space.

$ \hat{f} : \mathbb{R}^p \to \mathcal{C} $

This mapping implicitly partitions the input space into *decision regions*  
$ R_c = \{ x : \hat{f}(x) = c \} $ for each class $ c \in \mathcal{C} $.  
The *decision regions* are separated by *decision boundaries*.

For $ k = 1 $, each training sample defines its own region of influence (formally, a Voronoi cell), and each point is assigned the label of its nearest neighbour.  
As $k$ increases, decision boundaries become smoother and less sensitive to individual training points. This is the geometric interpretation of the bias-variance trade-off.  

An important observation is that the k-NN method produces predictions that are *piecewise constant* functions of the input.

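As a reminder of the notation, a piecewise-defined function uses a different expression on each part of its domain. The absolute value is an example (it is piecewise *linear*, not piecewise constant):  
$ |x| = \begin{cases} -x, & x < 0 \\ x, & x \ge 0 \end{cases} $  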

The input space is divided into $M$ disjoint regions $\{ R_m \}_{m=1}^M $. In each region, for any $ x \in R_m $, the output $\hat{y}$ of the predictor is constant.  

$ \hat{y}=c_m \text{  for all } x \in R_m $ where $c_m$ depends on the outputs of the $k$ nearest neighbours associated with the region $ R_m $.

For k-NN these regions are defined implicitly by the distance metric and the positions of the training data in the input space. Other algorithms (like *decision trees*) use explicit rules to define the decision regions.
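
The piecewise-constant behaviour is easy to observe in one dimension with $k = 1$. In the sketch below, the function name `one_nn_predict`, the three training points, and the evaluation grid are made up for illustration.

```python
def one_nn_predict(x_train, y_train, x_star):
    """1-NN prediction for scalar inputs: output of the closest training point."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x_star))
    return y_train[nearest]

# Three training points on the real line (made up).
x_train = [1.0, 3.0, 6.0]
y_train = [10.0, 20.0, 30.0]

# Evaluate the predictor on a grid from 0.25 to 7.25 in steps of 0.5.
grid = [0.25 + 0.5 * i for i in range(15)]
predictions = [one_nn_predict(x_train, y_train, x) for x in grid]

for x, y_hat in zip(grid, predictions):
    print(f"x = {x:4.2f} -> y_hat = {y_hat}")
```

The prediction stays at 10.0 up to the midpoint between the first two training inputs ($x = 2$), then at 20.0 up to the next midpoint ($x = 4.5$), and at 30.0 afterwards. No rule ever states these regions explicitly: they follow from the distances alone.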


# Key Note
k-nearest neighbours defines piecewise-constant predictors through implicit, distance-based regions.


In [11]:
import pandas as pd
df = pd.read_csv("/content/sample_data/california_housing_test.csv")
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0


**Research question**: can we *predict* the `median_house_value` based on the features that we have in our dataset?

A) We assume that the samples in our dataset $\mathcal{T}$ are independent and identically distributed.  
B) This is a *regression problem* since the output variable is numerical.