# SPRINT 3: Feature Selection

## Step 1: Load Data and Prepare for Selection

**Objective:** In this notebook, we use feature selection to pick the most important features from our processed dataset (`heart_disease_processed.csv`). Unlike PCA, which transforms the features into new components, feature selection *chooses* a subset of the original features. This can improve model performance by removing irrelevant features, and it yields a more interpretable model because every retained column keeps its original meaning.

**Method:** We start with **Recursive Feature Elimination (RFE)**. RFE fits a model on the current set of features, removes the least important one, and repeats on the remaining features until the desired number of features is reached.
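To make the elimination loop concrete, here is a minimal hand-written sketch of the idea. It is not scikit-learn's implementation: the helper `manual_rfe` and the synthetic data from `make_classification` are assumptions made for illustration, and importance is taken to be the absolute size of each logistic regression coefficient.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: return the column indices that survive elimination."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit the model on the features that are still in play.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[:, remaining], y)
        # Score each remaining feature by the absolute size of its coefficient.
        importance = np.abs(model.coef_).ravel()
        # Drop the single weakest feature, then loop and refit.
        weakest = int(np.argmin(importance))
        remaining.pop(weakest)
    return remaining


# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
kept = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("Kept column indices:", kept)
```

Because coefficient sizes are only comparable when the features share a scale, this kind of ranking assumes scaled inputs, which is why the notebook starts from the already-scaled dataset.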

In [1]:
# --- 1. Import Libraries ---
import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# --- 2. Load the Processed Data ---
# We use the data that was encoded and scaled, but NOT transformed by PCA.
PROCESSED_DATA_PATH = '../data/heart_disease_processed.csv'
df = pd.read_csv(PROCESSED_DATA_PATH)

# --- 3. Separate Features (X) and Target (y) ---
X = df.drop('target', axis=1)
y = df['target']

# --- 4. Verify the separation ---
print("--- Shape of our data for Feature Selection ---")
print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

--- Shape of our data for Feature Selection ---
Features (X) shape: (303, 18)
Target (y) shape: (303,)


## Step 2: Applying Recursive Feature Elimination (RFE)

**Objective:** We will now apply the RFE algorithm to our dataset. To do this, we need to make two key decisions:
1.  **Estimator:** RFE needs a model to rank the features. We will use `LogisticRegression` because it trains quickly and exposes a `coef_` attribute, which RFE uses as the importance score for each feature.
2.  **Number of Features:** We need to decide how many of the best features we want to select. To make a fair comparison with our PCA results, let's aim to select the top **12 features**.
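The target of 12 is chosen here to match the PCA comparison. If the number of features were instead left open, one option is scikit-learn's `RFECV`, which runs RFE under cross-validation and keeps the feature count with the best score. The sketch below uses synthetic data as a stand-in (an assumption), so the count it reports says nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 18 features, like the processed dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=18, n_informative=6, n_redundant=2, random_state=42
)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,  # remove one feature per round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X_demo, y_demo)

print("Feature count chosen by cross-validation:", selector.n_features_)
```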

In [2]:
# --- 1. Instantiate the base model (estimator) ---
# We'll use Logistic Regression to evaluate features. max_iter is set to avoid convergence warnings.
estimator = LogisticRegression(max_iter=1000, random_state=42)

# --- 2. Instantiate and apply RFE ---
# We want to select the top 12 features.
rfe = RFE(estimator=estimator, n_features_to_select=12)
rfe.fit(X, y)

# --- 3. Get the list of selected features ---
# The 'support_' attribute of the fitted rfe object is a boolean mask.
selected_features_mask = rfe.support_
selected_features = X.columns[selected_features_mask]

# --- 4. Print the results ---
print(f"Number of selected features: {len(selected_features)}")
print("\nThe selected features are:")
print(list(selected_features))

Number of selected features: 12

The selected features are:
['trestbps', 'thalach', 'oldpeak', 'ca', 'sex_1.0', 'cp_2.0', 'cp_4.0', 'fbs_1.0', 'restecg_2.0', 'exang_1.0', 'slope_2.0', 'thal_7.0']
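The `support_` mask only says which features were kept. A fitted `RFE` object also exposes `ranking_`, where every selected feature has rank 1 and larger ranks mark features that were eliminated earlier. The sketch below shows how to read it. It uses synthetic data and made-up column names (both assumptions), so the ranks are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names (assumption).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=18, n_informative=6, n_redundant=2, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(18)])

rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=12)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 = selected; rank 2 = last feature eliminated; the highest rank = first eliminated.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
dropped = ranking[ranking > 1]

print("Eliminated features, most important first:")
print(dropped)
```

Applied to the fitted `rfe` object from the cell above, the same two lines would show which of the 18 original columns came closest to being selected.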


## Step 3: Creating and Saving the Final DataFrame

**Objective:** Now that RFE has selected 12 features, our final step is to create a new DataFrame that contains only those features plus our target variable. This smaller dataset keeps the original, interpretable column names and is ready for the modeling phase.

In [3]:
# --- 1. Create a new DataFrame with only the selected features ---
X_rfe = X[selected_features]

# --- 2. Concatenate the selected features with the target variable ---
df_rfe_final = pd.concat([X_rfe, y], axis=1)

# --- 3. Display the final DataFrame ---
print(f"Shape of the final RFE DataFrame: {df_rfe_final.shape}")
print("\n--- First 5 rows of the final DataFrame ---")
print(df_rfe_final.head())

# --- 4. Save the final DataFrame to a new CSV file ---
OUTPUT_PATH_RFE = '../data/heart_disease_rfe_features.csv'
df_rfe_final.to_csv(OUTPUT_PATH_RFE, index=False)

print(f"\nFeature-selected data successfully saved to: {OUTPUT_PATH_RFE}")
print("\n--- Mission Accomplished for Notebook 03! ---")

Shape of the final RFE DataFrame: (303, 13)

--- First 5 rows of the final DataFrame ---
   trestbps   thalach   oldpeak   ca  sex_1.0  cp_2.0  cp_4.0  fbs_1.0  \
0  0.757525  0.017197  1.087338  0.0     True   False   False     True   
1  1.611220 -1.821905  0.397182  3.0     True   False    True    False   
2 -0.665300 -0.902354  1.346147  2.0     True   False    True    False   
3 -0.096170  1.637359  2.122573  0.0     True   False   False    False   
4 -0.096170  0.980537  0.310912  0.0    False    True   False    False   

   restecg_2.0  exang_1.0  slope_2.0  thal_7.0  target  
0         True      False      False     False       0  
1         True       True       True     False       1  
2         True       True       True      True       1  
3        False      False      False     False       0  
4         True      False      False     False       0  

Feature-selected data successfully saved to: ../data/heart_disease_rfe_features.csv

--- Mission Accomplished for Notebook 
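As an optional sanity check, a saved CSV can be read back and compared with the DataFrame that produced it. The sketch below uses a small made-up DataFrame and a temporary file (both assumptions) so that it runs anywhere. In the notebook, the same comparison could be made between `df_rfe_final` and `pd.read_csv(OUTPUT_PATH_RFE)`.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame that mimics the mix of float, boolean, and integer columns.
df_demo = pd.DataFrame(
    {
        "oldpeak": [1.087338, 0.397182, 1.346147],
        "sex_1.0": [True, True, False],
        "target": [0, 1, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rfe_features_demo.csv")
    # index=False keeps the row index out of the file, as in the notebook.
    df_demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(df_demo.columns)
same_shape = reloaded.shape == df_demo.shape
print("Columns match:", same_columns)
print("Shape matches:", same_shape)
```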