# 👩‍💻 Feature Selection and Extraction for Housing Price Prediction
## 📋 Overview
In this lab, you'll tackle a real-world machine learning challenge: reducing the number of features in the Ames Housing Dataset while maintaining or improving predictive performance. You'll implement feature selection using Recursive Feature Elimination (RFE) and dimensionality reduction using Principal Component Analysis (PCA), then compare their effectiveness for a linear regression model. Both techniques matter when you work with high-dimensional datasets: fewer inputs can make a model cheaper to train and less prone to overfitting, and feature selection in particular keeps the model easier to interpret.
## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- Apply Recursive Feature Elimination to identify the most important features in a dataset
- Implement Principal Component Analysis for dimensionality reduction
- Compare and evaluate the performance of models using different feature selection techniques
- Make informed decisions about feature selection trade-offs in real-world scenarios

## 🚀 Starting Point
Access the starter code provided below. You'll need a Python environment with the following libraries:

- pandas
- numpy
- scikit-learn
- matplotlib (optional, for visualization)

```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Load the Ames Housing dataset
ames = fetch_openml(name="house_prices", as_frame=True)
data = ames.data
data['SalePrice'] = ames.target
```

## Task 1: Explore the Dataset
**Context:** Before applying any feature selection technique, it's important to understand the dataset you're working with. Real estate analysts often start by exploring housing data to get a sense of the available features and their distributions.

**Steps:**

1. Examine the first few rows of the dataset using the `head()` method
2. Get statistical summaries of numerical features using `describe()`
3. Check for missing values in the dataset using methods like `isna().sum()`
4. Handle missing values in numerical features using appropriate strategies

```python
# Explore the dataset
# YOUR CODE HERE

# Handle missing values
# YOUR CODE HERE

# Define features and target
# YOUR CODE HERE
```

**💡 Tip:** For simplicity in this lab, consider using only numerical features and handling missing values with median imputation.
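
To make the tip concrete without giving away the task, here is a minimal sketch of numeric-only selection and median imputation on a small made-up table. The `toy` DataFrame and its values are hypothetical and only stand in for the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing table
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "YearBuilt": [2003, 1976, 2001, np.nan],
    "Street": ["Pave", "Pave", None, "Grvl"],  # non-numeric, dropped below
})

# Keep only the numeric columns
numeric = toy.select_dtypes(include=[np.number])

# Count missing values per column before imputing
missing_before = numeric.isna().sum()

# Replace each missing value with the median of its own column
imputed = numeric.fillna(numeric.median())

print(missing_before)
print(imputed)
```

The median is a common choice here because, unlike the mean, it is not pulled around by a few unusually large or small values.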

**⚙️ Test Your Work:**

- Print the shape of your processed dataset
- Verify that there are no missing values in the data you'll use for modeling
- Expected output: A confirmation of the dataset dimensions and features you'll be working with

## Task 2: Apply Recursive Feature Elimination (RFE)
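**Context:** Real estate companies often want to know which housing attributes are most predictive of sales price. RFE answers this by fitting the model, discarding the feature with the weakest importance (for linear regression, the smallest coefficient magnitude), and repeating until only the requested number of features remains. Because coefficients depend on each feature's units, the ranking is easiest to trust when features are on comparable scales.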

**Steps:**

1. Create a Linear Regression model to use as the base estimator
2. Initialize the RFE with the model, specifying to select the top 5 features
3. Fit RFE to your data
4. Extract and display the selected features

```python
# Apply Recursive Feature Elimination
# YOUR CODE HERE

# Extract selected features
# YOUR CODE HERE
```

**💡 Tip:** The `RFE` class has a `support_` attribute that shows which features were selected. You can use this with your original feature names to identify the selected features.
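
The sketch below illustrates the `support_` attribute on synthetic data rather than the housing data. The feature names `f0` to `f7` are hypothetical, and `make_regression` is used only because it produces features on a common scale.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which drive the target
X_arr, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
feature_names = [f"f{i}" for i in range(8)]  # hypothetical names
X_demo = pd.DataFrame(X_arr, columns=feature_names)

selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns: True means "kept"
kept = X_demo.columns[selector.support_].tolist()

# ranking_ is 1 for kept features; larger values were eliminated earlier
ranking = pd.Series(selector.ranking_, index=feature_names).sort_values()

print(kept)
print(ranking)
```

Alongside `support_`, the `ranking_` attribute is useful when you want to see the order in which features were dropped, not just the final selection.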

**⚙️ Test Your Work:**

- Print the names of the selected features
- Expected output: A list of the 5 most important features for predicting house prices

## Task 3: Implement Principal Component Analysis (PCA)
**Context:** In many real-world datasets including real estate data, features may be correlated. PCA transforms the original features into uncorrelated principal components that capture the maximum variance in the data.

**Steps:**

1. Standardize the features using `StandardScaler`
2. Initialize PCA with 2 components to start
3. Fit and transform the data using PCA
4. Examine the explained variance ratio to understand how much information is retained

```python
# Standardize the features
# YOUR CODE HERE

# Apply PCA
# YOUR CODE HERE

# Examine explained variance
# YOUR CODE HERE
```

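**💡 Tip:** Always standardize your data before applying PCA. PCA looks for directions of maximum variance, so a feature measured in large units would otherwise dominate the components regardless of how informative it is.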
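
A small experiment shows the effect of scale. This is a sketch on made-up data: the two features (`area` and `rooms`) are hypothetical, generated independently, and differ only in their units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent, hypothetical features on very different scales
area = rng.normal(10_000, 3_000, size=500)  # e.g. square feet
rooms = rng.normal(6, 1.5, size=500)        # e.g. room count
X_demo = np.column_stack([area, rooms])

# PCA on the raw values: the large-scale feature dominates
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: both features contribute comparably
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

Without scaling, the first component is essentially just `area`, because its variance is millions of times larger than that of `rooms`.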
    
**⚙️ Test Your Work:**

- Print the explained variance ratio of the principal components
- Expected output: The fraction of total variance explained by each principal component (values between 0 and 1; multiply by 100 for percentages)

## Task 4: Evaluate Model Performance
**Context:** Data scientists must compare different approaches to determine which yields the best model. Here, you'll evaluate whether feature selection with RFE or dimensionality reduction with PCA results in better predictive performance.

**Steps:**

1. Split the RFE-selected data and the PCA-transformed data into training and testing sets
2. Train a Linear Regression model on each training set
3. Make predictions on the test sets
4. Calculate and compare performance metrics (R-squared and MSE) for both approaches

```python
# Evaluate RFE model
# YOUR CODE HERE

# Evaluate PCA model
# YOUR CODE HERE

# Compare performance
# YOUR CODE HERE
```

**💡 Tip:** Use the same random state when splitting data to ensure a fair comparison between models.
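
To see why a shared random state gives a fair comparison, the sketch below splits two different feature matrices built from the same rows and checks that the same rows end up in each test set. The arrays `X_a` and `X_b` are hypothetical stand-ins for the RFE-selected and PCA-transformed data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(10)

# Two feature matrices over the same 10 rows; column 0 stores the row id
X_a = np.column_stack([row_ids, row_ids * 2])     # stand-in for RFE features
X_b = np.column_stack([row_ids, row_ids * -1.0])  # stand-in for PCA components
y_demo = row_ids * 10

Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_a, y_demo, test_size=0.2, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_b, y_demo, test_size=0.2, random_state=42
)

# Same random_state and same number of rows: the same rows are held out
print(Xa_test[:, 0], Xb_test[:, 0])
```

An alternative is to pass both matrices to a single call, as in `train_test_split(X_a, X_b, y_demo, ...)`, which splits all of them with one shuffle.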
    
**⚙️ Test Your Work:**

- Print the R-squared and MSE values for both models
- Expected output: Performance metrics showing how well each model predicts house prices

## Task 5: Analyze and Document Findings
**Context:** In a real-world scenario, you would need to communicate your findings to stakeholders. This involves analyzing the trade-offs between different approaches and making recommendations.

**Steps:**

1. Compare the performance of the RFE and PCA approaches
2. Discuss the interpretability advantage of RFE (it names the specific features it kept) versus PCA's ability to retain variance from all features in a few components
3. Document which features RFE selected and why they might be important for house price prediction

```python
# Document your findings
# YOUR CODE HERE
```

**💡 Tip:** Consider both quantitative metrics and qualitative aspects like interpretability in your analysis.
    
**⚙️ Test Your Work:**

- Write a concise summary of your findings
- Expected output: A clear analysis comparing the two approaches with specific metrics and insights

## ✅ Success Checklist
- Successfully loaded and preprocessed the Ames Housing dataset
- Applied RFE to identify the 5 most important features
- Implemented PCA for dimensionality reduction
- Trained and evaluated linear regression models using both approaches
- Compared performance metrics between RFE and PCA approaches
- Documented insights about feature importance and selection trade-offs
- Code runs without errors

## 🔍 Common Issues & Solutions
**Problem:** RFE takes a long time to run. 

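**Solution:** Start with a smaller subset of features, or raise RFE's `step` parameter so that several features are removed per iteration. Note that `RFECV` uses cross-validation to choose the number of features, but it costs more to run than plain RFE, so it is not a speed fix.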
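
As a rough illustration of how the elimination step size changes the amount of work RFE does, the sketch below runs RFE twice on synthetic data. The dataset and the choice of `step=5` are assumptions made for the demo, not recommendations for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(
    n_samples=300, n_features=40, n_informative=5, random_state=0
)

# step=1 (the default) drops one feature per round:
# 35 elimination rounds to go from 40 features to 5
slow = RFE(LinearRegression(), n_features_to_select=5, step=1).fit(X_demo, y_demo)

# step=5 drops five features per round: 7 elimination rounds
fast = RFE(LinearRegression(), n_features_to_select=5, step=5).fit(X_demo, y_demo)

# The largest value in ranking_ grows with the number of elimination rounds
print(slow.ranking_.max(), fast.ranking_.max())
```

A larger `step` trades precision for speed: features removed in the same round are not ranked against each other, so the final selection can differ from the one-at-a-time result.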

**Problem:** Poor model performance even after feature selection. 

**Solution:** Consider trying different base estimators for RFE or exploring other preprocessing techniques for the dataset.

**Problem:** PCA components are difficult to interpret. 

**Solution:** This is a natural trade-off with PCA. If interpretability is critical, feature selection methods like RFE might be more appropriate than PCA.
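
Some interpretation is still possible by inspecting the component loadings, which show how strongly each original feature contributes to each component. The sketch below uses made-up data: the column names and the shared `size` factor are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
size = rng.normal(size=300)

# Hypothetical features: two track a shared "size" factor, one is unrelated
demo = pd.DataFrame({
    "LivingArea": size + rng.normal(scale=0.1, size=300),
    "GarageArea": size + rng.normal(scale=0.1, size=300),
    "YearSold": rng.normal(size=300),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(demo))

# Rows are components, columns are original features
loadings = pd.DataFrame(pca.components_, columns=demo.columns, index=["PC1", "PC2"])
print(loadings.round(2))
```

Here the first component loads mainly on the two correlated area features, so it can loosely be read as an overall size axis. With dozens of real features the loadings are rarely this clean, which is the trade-off described above.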

## 🔑 Key Points
- Feature selection techniques like RFE help identify the most predictive features, improving model interpretability.
- PCA reduces dimensionality while preserving variance but sacrifices the direct interpretability of features.
- The choice between feature selection and dimensionality reduction depends on your specific goals and requirements.
- Always evaluate and compare model performance to make data-driven decisions about feature engineering.

## 💻 Exemplar Solution

<details>

<summary><strong>Click HERE to see an exemplar solution</strong></summary>    
    
```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


# Load the Ames Housing dataset
ames = fetch_openml(name="house_prices", as_frame=True)
data = ames.data
data['SalePrice'] = ames.target
print(data.head())
print(data.describe())


# Basic data preprocessing
# Keep only numerical features for simplicity, then fill gaps with column medians
data = data.select_dtypes(include=[np.number])
data = data.fillna(data.median())


# Define features and target
X = data.drop('SalePrice', axis=1)
y = data['SalePrice'].astype(float)


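# Apply Recursive Feature Elimination (RFE)
# RFE ranks features by the magnitude of the fitted coefficients, so with
# unscaled inputs the ranking also reflects each feature's units.
model = LinearRegression()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)


# Names of the features RFE kept
selected_features = X.columns[rfe.support_]
print("Selected Features:", list(selected_features))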


# Explore PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)


print("Explained Variance Ratio by PCA:", pca.explained_variance_ratio_)


# Evaluate Model Performance with RFE features
X_train, X_test, y_train, y_test = train_test_split(X_rfe, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
y_pred_rfe = model.predict(X_test)


print("RFE Model - R-squared:", r2_score(y_test, y_pred_rfe))
print("RFE Model - MSE:", mean_squared_error(y_test, y_pred_rfe))


# Evaluate Model Performance with PCA-transformed data
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)
model.fit(X_train_pca, y_train_pca)
y_pred_pca = model.predict(X_test_pca)


print("PCA Model - R-squared:", r2_score(y_test_pca, y_pred_pca))
print("PCA Model - MSE:", mean_squared_error(y_test_pca, y_pred_pca))


# Analyze and Document Findings
print("\nAnalysis of Results:")
print(f"The top 5 features selected by RFE are: {', '.join(selected_features)}")
print(f"RFE model performance: R² = {r2_score(y_test, y_pred_rfe):.4f}, MSE = {mean_squared_error(y_test, y_pred_rfe):.2f}")
print(f"PCA model performance: R² = {r2_score(y_test_pca, y_pred_pca):.4f}, MSE = {mean_squared_error(y_test_pca, y_pred_pca):.2f}")

```

</details>