### Boston Housing Dataset

The Boston Housing dataset is a classic dataset that contains information about housing prices in various areas of Boston, along with related socio-economic and geographic factors. It has 14 columns: 13 features and one target variable (MEDV).

1. CRIM (Per Capita Crime Rate):
   - This column represents the crime rate per capita by town. Higher values mean more crime in the area.

2. ZN (Proportion of Residential Land Zoned for Large Lots):
   - This gives the proportion of land in the town that is zoned for residential housing where each house has a large lot size (more than 25,000 square feet). Higher values indicate more land is set aside for large homes.

3. INDUS (Proportion of Non-Retail Business Acres):
   - This represents the proportion of land in the town that is used for businesses that are not related to retail (such as factories). Higher values suggest more industrial areas.

4. CHAS (Charles River Dummy Variable):
   - This is a binary variable that tells whether the property is near the Charles River.
     - 1 = Yes (property is near the river)
     - 0 = No (property is not near the river)

5. NOX (Nitric Oxide Concentration):
   - This represents the air pollution levels, specifically the concentration of nitric oxides (NOx). Higher values indicate worse air quality.

6. RM (Average Number of Rooms per Dwelling):
   - This shows the average number of rooms in houses in the area. Higher values suggest larger homes.

7. AGE (Proportion of Owner-Occupied Units Built Before 1940):
   - This gives the percentage of owner-occupied homes in the area that were built before 1940. Higher values indicate older housing stock.

8. DIS (Weighted Distances to Employment Centers):
   - This represents the weighted distance to five Boston employment centers. Higher values suggest the homes are further away from the main business areas.

9. RAD (Index of Accessibility to Highways):
   - This is an index that shows how easily accessible major highways are from the area. Higher values suggest better highway access.

10. TAX (Property Tax Rate):
    - This represents the full-value property tax rate per $10,000 of the value of the property. Higher values mean higher property taxes.

11. PTRATIO (Pupil-Teacher Ratio by Town):
    - This is the ratio of students to teachers in local schools. A higher value suggests larger class sizes, and potentially lower individual attention for students.

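12. Black (Transformed Proportion of African American Population):
    - It represents a transformation of the proportion of African American residents in the town (Bk):
      - B = 1000 × (Bk - 0.63)²
    - The transformation is not monotonic: B peaks at about 396.9 when Bk = 0, falls to 0 at Bk = 0.63, and rises again to about 136.9 at Bk = 1. Values above roughly 136.9 therefore correspond to a low proportion of African American residents.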

13. LSTAT (Percentage of Lower Status Population):
    - This represents the percentage of the population that is considered to be of lower socioeconomic status. Higher values indicate a poorer area.

14. MEDV (Median Value of Owner-Occupied Homes):
    - This is the target variable. 
    - It represents the median value of homes in $1,000s. For example, if `MEDV = 30`, that means the median house price is $30,000.

In [13]:
import numpy as np
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

In [14]:
boston_house_price = pd.read_csv("./csv-files/boston-house-price.csv", header=0)
print(boston_house_price.head())

   Unnamed: 0     crim    zn  indus  chas    nox     rm   age     dis  rad  \
0           1  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1   
1           2  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2   
2           3  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2   
3           4  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3   
4           5  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3   

   tax  ptratio   black  lstat  medv  
0  296     15.3  396.90   4.98  24.0  
1  242     17.8  396.90   9.14  21.6  
2  242     17.8  392.83   4.03  34.7  
3  222     18.7  394.63   2.94  33.4  
4  222     18.7  396.90   5.33  36.2  


In [15]:
# `Unnamed: 0` is just a sequential row index and carries no predictive information,
# so we drop it before splitting the data into features and target.
boston_house_price.drop(boston_house_price.columns[0], axis=1, inplace=True)

features = boston_house_price.iloc[:, :13]
target = boston_house_price.iloc[:, -1]
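
Selecting columns by position works here, but it breaks silently if the column order ever changes. A minimal sketch of a name-based split, assuming the target column is called `medv` as in the output above; the small `df` is a hypothetical stand-in for the real CSV, with values copied from the `head()` output:

```python
import pandas as pd

# Hypothetical three-row stand-in for the CSV (values taken from the head() output).
df = pd.DataFrame(
    {
        "Unnamed: 0": [1, 2, 3],
        "rm": [6.575, 6.421, 7.185],
        "lstat": [4.98, 9.14, 4.03],
        "medv": [24.0, 21.6, 34.7],
    }
)

# Drop the index-like column by name, then separate predictors from the target.
df = df.drop(columns=["Unnamed: 0"])
features = df.drop(columns=["medv"])
target = df["medv"]

print(list(features.columns))
print(target.name)
```

Splitting by name makes the intent explicit and avoids relying on `medv` being the last column.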

- SequentialFeatureSelector is a feature selection algorithm implemented in the mlxtend (Machine Learning Extensions) library.
- It's used for automatically selecting a subset of the most relevant features from a larger set of features in a dataset.
- This can be particularly useful in machine learning tasks where you want to reduce the dimensionality of your data or identify the most important predictors for your model.
- **Purpose:**
  - Reduce overfitting by removing irrelevant features.
  - Improve model performance by focusing on the most important features.
  - Reduce computational complexity and training time.
  - Enhance model interpretability by identifying key predictors.
- **How it works:**
  - It uses a greedy approach to iteratively select (or remove) features.
  - Two main variants:
    - Forward Selection: Starts with no features and adds them one by one
    - Backward Elimination: Starts with all features and removes them one by one
- **Key Parameters:**
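  - k_features: The desired number of features to select (an integer; mlxtend also accepts a `(min, max)` tuple or the strings "best" and "parsimonious")
  - forward: Boolean to choose between forward selection (True) or backward elimination (False)
  - floating: Enables the floating variant, in which a previously removed feature can be added back (or a previously added feature removed) if doing so improves the score
  - scoring: The metric used to evaluate feature subsets, given as a scikit-learn scoring string (e.g., "accuracy", "r2")
  - cv: Number of folds for cross-validation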

In [16]:
SFS = SequentialFeatureSelector(
    LinearRegression(), k_features=4, forward=False, floating=False, scoring="r2", cv=5
)



SFS.fit(features, target)



#### How it works in detail

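- The algorithm first trains the model with all features (say f0, f1, f2, ..., f9) and records a baseline score. Suppose this initial R^2 score is 0.80.
- It then retrains the model once per feature, leaving that feature out each time, and records the score of each reduced model: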

```
Performance after removing `f0`:  0.79
Performance after removing `f1`:  0.78
Performance after removing `f2`:  0.81  (improved!)
Performance after removing `f3`:  0.76
Performance after removing `f4`:  0.79
Performance after removing `f5`:  0.77
Performance after removing `f6`:  0.78
Performance after removing `f7`:  0.79
Performance after removing `f8`:  0.77
Performance after removing `f9`:  0.75
```
- If removing a feature leaves the performance unchanged or improves it, that feature is contributing little to the model's predictive power.
- In this example, removing f2 gives the highest score (R^2 rises from 0.80 to 0.81), so the algorithm removes f2 in this step. The feature with the best post-removal score is removed even when every removal lowers the score, because elimination continues until only k_features remain.
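
The selection rule for a single step can be sketched in a few lines, using the toy scores from the example above (the variable names are illustrative only):

```python
# R^2 after removing each feature in turn, copied from the toy example.
scores_after_removal = {
    "f0": 0.79, "f1": 0.78, "f2": 0.81, "f3": 0.76, "f4": 0.79,
    "f5": 0.77, "f6": 0.78, "f7": 0.79, "f8": 0.77, "f9": 0.75,
}

# Greedy rule: drop the feature whose removal leaves the highest score.
feature_to_drop = max(scores_after_removal, key=scores_after_removal.get)
remaining = [f for f in scores_after_removal if f != feature_to_drop]

print(feature_to_drop)  # f2
print(remaining)
```

The next step repeats the same comparison on the nine remaining features.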


In [17]:
SFS_results = pd.DataFrame(SFS.subsets_).transpose()
SFS_results

| subset size | feature_idx | cv_scores | avg_score | feature_names |
|---|---|---|---|---|
| 13 | (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) | [0.6391999371396757, 0.7138669803833209, 0.587... | 0.353276 | (crim, zn, indus, chas, nox, rm, age, dis, rad... |
| 12 | (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12) | [0.46450600715655155, 0.6152206369052179, 0.43... | 0.493532 | (crim, zn, indus, chas, nox, age, dis, rad, ta... |
| 11 | (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12) | [0.4691034708133358, 0.6668237424199122, 0.437... | 0.505275 | (crim, zn, chas, nox, age, dis, rad, tax, ptra... |
| 10 | (0, 1, 4, 6, 7, 8, 9, 10, 11, 12) | [0.4625143560945083, 0.6708087243341195, 0.501... | 0.50465 | (crim, zn, nox, age, dis, rad, tax, ptratio, b... |
| 9 | (0, 1, 4, 7, 8, 9, 10, 11, 12) | [0.43957006983905156, 0.6694217709715935, 0.50... | 0.502931 | (crim, zn, nox, dis, rad, tax, ptratio, black,... |
| 8 | (1, 4, 7, 8, 9, 10, 11, 12) | [0.4369431763159647, 0.6697584476500276, 0.497... | 0.491715 | (zn, nox, dis, rad, tax, ptratio, black, lstat) |
| 7 | (1, 4, 7, 8, 10, 11, 12) | [0.39643171047698034, 0.6508563360831714, 0.44... | 0.477113 | (zn, nox, dis, rad, ptratio, black, lstat) |
| 6 | (4, 7, 8, 10, 11, 12) | [0.41195496456786707, 0.6349946822306836, 0.43... | 0.468274 | (nox, dis, rad, ptratio, black, lstat) |
| 5 | (4, 7, 8, 10, 12) | [0.4125635915078121, 0.6195305016217024, 0.426... | 0.430104 | (nox, dis, rad, ptratio, lstat) |
| 4 | (4, 7, 10, 12) | [0.4023044242500534, 0.5900031154648442, 0.407... | 0.447941 | (nox, dis, ptratio, lstat) |
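
The run stops at 4 features only because `k_features=4` was requested; the average scores in the table above actually peak at a larger subset. A quick check, with the `avg_score` values copied by hand from the table (the dictionary is illustrative, not an mlxtend object):

```python
# Average CV scores copied from the table, keyed by subset size.
avg_scores = {
    13: 0.353276, 12: 0.493532, 11: 0.505275, 10: 0.50465, 9: 0.502931,
    8: 0.491715, 7: 0.477113, 6: 0.468274, 5: 0.430104, 4: 0.447941,
}

# Subset size with the highest average cross-validated R^2.
best_size = max(avg_scores, key=avg_scores.get)
print(best_size, avg_scores[best_size])  # 11 0.505275
```

If the goal is the best-scoring subset rather than a fixed size, mlxtend's `k_features="best"` option is worth considering; it lets the selector choose the size itself.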


1. Since this is backward elimination (forward=False), it starts with all features.
2. In each iteration, it considers dropping one feature at a time.
3. For each candidate feature subset (the current subset minus one feature):
    1. The data is split into 5 folds (because cv=5).
    2. 4 folds are used for training.
    3. 1 fold is used for validation.
    4. A score (R^2 in this case) is calculated on the validation fold, and the process rotates so that each fold serves as the validation fold once.
4. After the 5-fold process, we have 5 scores (one from each validation fold).
5. These 5 scores are averaged to get the final score for this feature subset.
6. Steps 3 to 5 are repeated for each feature that could be dropped.
7. The feature whose removal leads to the best average score is the one actually dropped.
8. This continues until we are left with 4 features (because k_features=4).
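
The steps above can be sketched from scratch with NumPy. This is a simplified illustration under stated assumptions, not mlxtend's actual implementation: the folds are contiguous and unshuffled, the model is ordinary least squares with an intercept, the data is synthetic, and the helper names (`r2`, `cv_score`, `backward_eliminate`) are made up for this example.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cv_score(X, y, cols, n_folds=5):
    """Average validation R^2 of a least-squares fit on the given columns."""
    all_idx = np.arange(len(y))
    folds = np.array_split(all_idx, n_folds)  # contiguous, unshuffled folds
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(all_idx, val_idx)
        # Prepend a column of ones so the fit includes an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx][:, cols]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx][:, cols]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(r2(y[val_idx], A_val @ coef))
    return float(np.mean(scores))


def backward_eliminate(X, y, k_features, n_folds=5):
    """Greedily drop one feature per iteration until k_features remain."""
    selected = list(range(X.shape[1]))
    while len(selected) > k_features:
        # Score every subset obtained by dropping exactly one feature.
        candidates = {
            f: cv_score(X, y, [c for c in selected if c != f], n_folds)
            for f in selected
        }
        # Drop the feature whose removal leaves the best average score.
        selected.remove(max(candidates, key=candidates.get))
    return selected


# Synthetic data (an assumption): only columns 0 and 3 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

print(backward_eliminate(X, y, k_features=2))
```

On this synthetic data the two informative columns should be the ones that survive, since removing either of them lowers the score far more than removing a noise column.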