## Feature selection

## Exam question 23-24 (1st session)

Let us consider a regression task with $n=10$ inputs and one target.

The regression dataset is stored in the variable <span style="font-family:Courier; "> Q3_D </span> of the <span style="font-family:Courier; "> fsel1.pkl </span> file.

Note that the 11th column contains the target.


Consider a **wrapper backward selection** strategy where the learner is a locally linear regression algorithm which returns the prediction of a linear model fitted to the K nearest neighbours (K=10 and Euclidean distance), and the assessment is based on leave-one-out.
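
The selection criterion can be written compactly. The notation here is introduced only for illustration and is not part of the exam text: $S$ is the current feature subset, $N$ the number of samples, and $\hat{y}^{(-i)}_{S}(x_i)$ the prediction for sample $i$ made by the learner trained without sample $i$ and restricted to the features in $S$. Under these assumptions, the leave-one-out error and one backward elimination step are:

$$
\widehat{\mathrm{MSE}}_{\mathrm{loo}}(S) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}^{(-i)}_{S}(x_i)\right)^2,
\qquad
j^\star = \arg\min_{j \in S}\ \widehat{\mathrm{MSE}}_{\mathrm{loo}}\left(S \setminus \{j\}\right),
\qquad
S \leftarrow S \setminus \{j^\star\}.
$$

Starting from all $n=10$ features, the step is repeated until the subset has the required size.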

Return the indices of the five most relevant features according to this feature selection strategy.

Use the instructions
```python
import pickle
with open("fsel1.pkl", 'rb') as f:
    data = pickle.load(f)
Q3_D=data["Q3_D"]
```
to load the <span style="font-family:Courier; "> Q3_D </span> variable in Python.

In [4]:
import pickle
with open("fsel1.pkl", 'rb') as f:
    data = pickle.load(f)
Q3_D=data["Q3_D"]


In [8]:

import numpy as np
from numpy.linalg import solve




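def lsq(X, Y, q):
    # Least-squares fit of a linear model (with intercept) on (X, Y),
    # followed by the prediction at the query point q
    N = X.shape[0]
    XX = np.column_stack((np.ones(N), X))
    # Solve the normal equations (XX^T XX) beta = XX^T Y
    beta = solve(XX.T @ XX, XX.T @ Y)
    yhat=np.concatenate((np.ones(1), q.flatten())) @ beta
    return float(yhat[0])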



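def LL(X, Y, q, k):
    # Locally linear prediction at q: linear model fitted to the k nearest neighbours of q
    # Euclidean distance from q to every row of X (q is broadcast across the rows)
    d = np.sqrt(np.sum((X - q.reshape(1, -1))**2, axis=1))
    index = np.argsort(d)[:k]
    LLhat=lsq(X[index], Y[index], q)
    return LLhat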


XY = Q3_D 
nn = XY.shape[1]
X = XY[:, :nn-1]
Y = XY[:, nn-1]

n = X.shape[1]
N = X.shape[0]
nfeat = 10
K = 10

fsub = list(range(n))

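# Backward elimination: remove one feature per pass until a single feature is left
for ss in range(nfeat-1):
    # Eloo[j] = leave-one-out MSE obtained when feature j is removed
    # (it stays at inf for features that were already removed)
    Eloo = np.full(n, np.inf)
    for j in fsub:
        # Candidate subset: the current features without j
        remaining_features = [f for f in fsub if f != j]
        Eloo[j] = 0

        for i in range(N):
            # Hold out sample i and train on the other N-1 samples
            Xi = np.delete(X, i, axis=0)
            Yi = np.delete(Y, i)
            Xi_subset = Xi[:, remaining_features]
            X_i_subset = X[i:i+1, remaining_features]
            Yhati = LL(Xi_subset, Yi.reshape(-1,1), X_i_subset, K)
            Eloo[j] += (Y[i] - Yhati)**2

        Eloo[j] = Eloo[j]/N

    # Drop the feature whose removal gives the lowest leave-one-out error
    min_index = np.argmin(Eloo)
    fsub.remove(min_index)
    # Print the surviving features with 1-based indices
    print([f+1 for f in fsub])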


[1, 2, 3, 4, 5, 6, 7, 9, 10]
[1, 3, 4, 5, 6, 7, 9, 10]
[1, 3, 4, 5, 6, 7, 9]
[1, 3, 4, 5, 7, 9]
[1, 3, 5, 7, 9]
[3, 5, 7, 9]
[5, 7, 9]
[5, 9]
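[9]

Each printed line lists the features (1-based) still in the subset after one more removal. The subset of size five, on the fifth line, is the answer: features 1, 3, 5, 7 and 9.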
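
A quick sanity check for a locally linear learner is to run it on a noiseless linear target, where the local fit should reproduce the true value at any query point. The sketch below is an illustration, not the solution code: `local_linear_predict`, `X_toy`, `y_toy` and `q_toy` are hypothetical names, the data are synthetic, and `np.linalg.lstsq` replaces the normal equations used in `lsq` because it still returns a solution when the local design matrix is rank deficient.

```python
import numpy as np

def local_linear_predict(X, y, q, k):
    """Predict at query q with a linear model fitted to the k nearest neighbours."""
    d = np.sqrt(((X - q) ** 2).sum(axis=1))            # Euclidean distances to q
    idx = np.argsort(d)[:k]                            # indices of the k nearest points
    A = np.column_stack((np.ones(len(idx)), X[idx]))   # local design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y[idx], rcond=None)  # least-squares coefficients
    return float(np.concatenate(([1.0], q)) @ beta)

# Synthetic, noiseless linear data (assumption: used only for this check)
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
X_toy = rng.normal(size=(50, 3))
y_toy = 1.0 + X_toy @ w_true
q_toy = rng.normal(size=3)

pred = local_linear_predict(X_toy, y_toy, q_toy, k=10)
truth = 1.0 + q_toy @ w_true
print(abs(pred - truth) < 1e-8)
```

In the exam setting the margin is much thinner: with K=10 neighbours and nine features plus an intercept, the local model has as many parameters as points, so the local system can become ill-conditioned.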


## Exam question 22-23 (1st session)

Let us consider a regression task with $n=10$ inputs and one target whose dataset is in the variable <span style="font-family:Courier; "> D </span> of the <span style="font-family:Courier; "> fsel2.pkl </span> file.


Note that the 11th column contains the target.

Consider a wrapper forward selection strategy where the learner is a 5NN (KNN with $K=5$ and Euclidean distance), and the assessment is based on leave-one-out.

Return the indices of the five most relevant features according to this feature selection strategy.

Use the instructions
```python
import pickle
with open("fsel2.pkl", 'rb') as f:
    data = pickle.load(f)
D=data["D"]
```
to load the <span style="font-family:Courier; "> D </span> variable in Python.


In [None]:
import pickle
with open("fsel2.pkl", 'rb') as f:
    data = pickle.load(f)
D=data["D"]

In [14]:
import numpy as np

# Define the KNN function
def KNN(X, Y, q, k):
    # X: training features (numpy array)
    # Y: training labels (numpy array)
    # q: query point (numpy array, expected to be 2D, e.g., 1xM)
    # k: number of neighbors
    N = X.shape[0]

    ones_matrix = np.ones((N, 1))
    # Ensure q is treated as a row vector for matrix multiplication
    q_matrix = q.reshape(1, -1) if q.ndim == 1 else q
    d = np.sqrt(((X - (ones_matrix @ q_matrix))**2).sum(axis=1))

    index = np.argsort(d)[:k]
    return Y[index].mean()

n = 10
X = D[:, 0:n]
Y = D[:, n]

K = 5
N = len(Y)
n = X.shape[1]

fsub = []
nfeat = 5
for ss in range(nfeat):

    # Initialize Eloo (leave-one-out error) with infinity for each feature
    # Eloo is a numpy array of size n (number of columns in X)
    Eloo = np.full(n, np.inf)

    for j in range(n):
        if j in fsub:
            continue # Skip this feature if it's already selected

        Eloo[j] = 0

        for i in range(N):

            # Create the training data (Xi) and labels (Yi) by removing the i-th row/element
            # np.delete removes the specified index along the specified axis
            Xi = np.delete(X, i, axis=0)
            Yi = np.delete(Y, i)

            # Select the columns for the current subset of features (already selected fsub + candidate j)
            # fsub contains 0-based indices, j is the current 0-based index
            cols_subset = fsub + [j]

            # Training data subset for KNN: select rows from Xi and the columns in cols_subset
            Xi_subset = Xi[:, cols_subset]

            # Test point for KNN: select the i-th row from the original X and the columns in cols_subset
            # X[i, cols_subset] results in a 1D numpy array
            # .reshape(1, -1) reshapes it into a 1xM matrix as expected by the KNN function's 'q' parameter
            qi_subset = X[i, cols_subset].reshape(1, -1)

            # Perform KNN prediction for the left-out data point
            Yhati = KNN(Xi_subset, Yi, qi_subset, K)

            # Accumulate the squared error for the current candidate feature j
            # Y[i] is the actual label for the i-th data point (0-based index)
            Eloo[j] = Eloo[j] + (Y[i] - Yhati)**2

        # Calculate the mean squared error for the current candidate feature j
        Eloo[j] = Eloo[j] / N

    # Find the index (0-based) of the feature with the minimum mean squared error in Eloo
    # (cast to a plain int so the printed list shows integers under any NumPy version)
    best_j = int(np.argmin(Eloo))
    # Append the index of the best feature found in this iteration to the list of selected features
    fsub.append(best_j)

# Print the list of selected features (converting 0-based indices back to 1-based for output) and the value of K
print(f"bestfs={[x + 1 for x in fsub]} K={K}")



bestfs=[2, 3, 1, 6, 8] K=5
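
The triple loop above recomputes every distance for each held-out sample. A possible speed-up, sketched here under assumptions, is to build one pairwise distance matrix per candidate subset and exclude each point from its own neighbourhood. The names `loo_mse_knn` and `forward_select` and the toy data are hypothetical and do not come from the exam material. Neighbour ordering is unchanged by using squared distances, but results may differ from the loop version when distances are exactly tied.

```python
import numpy as np

def loo_mse_knn(X, y, k):
    """Leave-one-out MSE of a k-NN regressor from one pairwise distance matrix."""
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)           # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)           # a point cannot be its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]     # k nearest neighbours of each held-out point
    y_hat = y[nn].mean(axis=1)             # k-NN prediction for every held-out point
    return float(np.mean((y - y_hat) ** 2))

def forward_select(X, y, k, n_select):
    """Greedy forward selection driven by the leave-one-out MSE."""
    selected = []
    for _ in range(n_select):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        errors = [loo_mse_knn(X[:, selected + [j]], y, k) for j in candidates]
        selected.append(candidates[int(np.argmin(errors))])
    return selected

# Synthetic data (assumption): only column 2 carries information about the target
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1.0, 1.0, size=(80, 4))
y_toy = X_toy[:, 2]

selected = forward_select(X_toy, y_toy, k=5, n_select=2)
print(selected)  # the first entry is expected to be 2, the informative column
```

The distance tensor `diff` needs memory proportional to N² times the number of selected features, which is fine for small exam datasets but may need chunking on large ones.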
