**Group: Triangle of Sadness**

# HomeWork08
<div class="alert alert-block alert-warning" style="margin-top: 20px">

**Exercise 1: Brute force forward feature selection**

_Forward Selection_ is an iterative process to identify the best subset of features for a predictive model.
At each step, one feature is added to the selected set based on its performance improvement.
<br><br>

**Objective:** In this case, we want to find the order of importance of features reducing MAE in a Random Forest Regressor.
<br><br>

**Task**: You will have to investigate one feature at a time, adding it to the bucket of good features $-$ Proceed as follows:
<br><br>
    
1 $-$ **Initial Preparations:**

- Determine the total number of features, `n_features`, from the dataset (`X`).

- Create two lists:
    - `idxs_selected_features`: to store indices of selected features, sorted by importance.
    - `idxs_remaining_features`: to store indices of features not yet selected.<br><br>

2 $-$ **Outer Loop (iterate over all features):**

&ensp;&ensp;&ensp;&ensp; Run a loop for a maximum of `n_features` iterations. Each iteration will select the "best" feature and add it to the selected "bucket".<br>
&ensp;&ensp;&ensp;&ensp; $\rightarrow$ This will be naturally sorted by importance.<br>

- **Inner Loop (find the best feature to add to the bucket at this stage):**

    - For each feature in `idxs_remaining_features`, temporarily add it to `idxs_selected_features` to test its performance.
    - Extract the data corresponding to the current feature set (`idxs_features_to_test`) from `X` $\rightarrow$ `X_iter`.
    - Train and Evaluate:
        - Create a new `RandomForestRegressor` instance (_recreate it inside the loop for a clean start each time_).
        - Train the model using only the current subset of features (`X_iter`).
        - Predict the outputs and calculate the MAE to evaluate the model's performance with the current feature set.
        - Compare Performance: Check if the feature you added improves the score (lower MAE). Keep track by updating a `best_feature` and `best_score` object.
    
- **Update Feature Sets:**
    - After the inner loop, add the `best_feature` to `idxs_selected_features` and remove it from `idxs_remaining_features`.

- **Store Results**

- **Repeat Until Completion**

**3 $-$ Report the feature indexes sorted by importance**

- - -

**Hints:**
- No need to split train/test, just work with the whole `X`
- Use whatever hyperparameter you want for RF (_even the default_)
- The bucket of selected features shall grow at every iteration of the outer loop
- At the end of the outer loop, the bucket shall contain all features, in order of importance    
<br>

**[Bonus] Exercise 2**
    
**Objective:** Compare your results against the feature importance from inside RF**

**Task:** This is actually very simple. Retrain on the exact same RF hyperparamters. Then, exctract the feature importances as:    
    
```python
feature_importances = regr.feature_importances_
```

You will just need to sort them. Did you obtain similar sorting?

In [2]:
import tarfile
import pandas as pd
import numpy as np
from prettytable import PrettyTable
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor


with tarfile.open("L08_data_inclass.tar.gz", 'r:gz') as tar:
    filename = tar.extractfile("StudentsPerformance.csv")
    if filename:
        df = pd.read_csv(filename)

print('\nOriginal Data:')
display(df.head(5))

# Encode
df_enc = pd.get_dummies(df.drop(columns=['math score']), drop_first=True)

# Convert from boolean to int
columns_bool = df_enc.select_dtypes(include=['bool']).columns
df_enc[columns_bool] = df_enc[columns_bool].astype(int)

print('\nOne Hot Encoded data:')
display(df_enc.head(5))

X = df_enc.values
y = df['math score'].values

classes = np.unique(y)
print('There are %s classes' % len(classes))

table = PrettyTable()
table.title = str('Data shape')
table.field_names = ['X', 'y']
table.add_row([np.shape(X), np.shape(y)])
print(table)

n_features=X.shape[1]
print(f"Number of features: {n_features}")

idxs_selected_features=[] #to store indices of selected features, sorted by importance.
idxs_remaining_features=list(range(n_features)) #to store indices of features not yet selected

#Storing
maes=[]
features=[]

#looping
for i in range(n_features):
    
    best_score=float('inf')
    best_feature=None
    
    for j in idxs_remaining_features:
        
        idxs_features_to_test=idxs_selected_features+[j]
        X_iter=X[:,idxs_features_to_test]
        
        model=RandomForestRegressor(random_state=42)
        model.fit(X_iter,y)
        y_pred=model.predict(X_iter)
        
        mae=mean_absolute_error(y,y_pred)
        if mae<best_score:
            best_score=mae
            best_feature=j
            
    idxs_selected_features.append(best_feature)
    idxs_remaining_features.remove(best_feature)

    maes.append(best_score)
    features.append(best_feature)
    maes.sort()
    features.reverse()
    
print(f"\nBest MAEs in order of importance:\n{maes}\n")
print(f'Best Features idxs in order of importance:\n{features}')


Original Data:


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



One Hot Encoded data:


Unnamed: 0,reading score,writing score,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,parental level of education_some high school,lunch_standard,test preparation course_none
0,72,74,0,1,0,0,0,1,0,0,0,0,1,1
1,90,88,0,0,1,0,0,0,0,0,1,0,1,0
2,95,93,0,1,0,0,0,0,0,1,0,0,1,1
3,57,44,1,0,0,0,0,0,0,0,0,0,0,1
4,78,75,1,0,1,0,0,0,0,0,1,0,1,1


There are 81 classes
+----------------------+
|      Data shape      |
+------------+---------+
|     X      |    y    |
+------------+---------+
| (1000, 14) | (1000,) |
+------------+---------+
Number of features: 14

Best MAEs in order of importance:
[1.8152307857142858, 1.8227706904761904, 1.8246475, 1.8304388095238096, 1.8406091071428572, 1.8523603055555555, 1.8788861587301589, 1.9242325396825395, 2.003529388888889, 2.135162353174603, 2.346227594516595, 2.6809958532488745, 4.420272748859801, 6.766487490802181]

Best Features idxs in order of importance:
[4, 3, 7, 5, 6, 13, 1, 0, 2, 12, 10, 8, 9, 11]
