# 📌 Feature Selection in Machine Learning  

Feature selection is the process of choosing the subset of **relevant features (variables)** in your dataset that contribute most to predicting the target variable.  

Done well, this step can:  
- ✅ Improve model accuracy  
- ✅ Speed up training  
- ✅ Make the model easier to interpret  
- ✅ Reduce overfitting  

---

## 🔎 Types of Feature Selection Methods  

### 1️⃣ Filter Methods  
These rely on statistical measures to evaluate the relationship between independent variables and the target.  

- **Correlation Coefficient** (for numerical variables)  
- **Chi-Square Test** (for categorical variables)  
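- **ANOVA F-test** (for numerical features against a categorical target)  

👉 These are fast, but they score each feature without training a model, so they don't account for model performance or interactions between features.  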
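
As a rough illustration of a correlation-based filter, the sketch below ranks features by their absolute Pearson correlation with the target. The bundled scikit-learn diabetes dataset and the cutoff of five features are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import load_diabetes

# Assumption: the bundled diabetes regression dataset serves as sample data
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Absolute Pearson correlation between each feature and the target
correlations = X.corrwith(y).abs().sort_values(ascending=False)

# Keep the five most strongly correlated features (illustrative cutoff)
top_features = correlations.head(5).index.tolist()

print(correlations)
print("Top 5 features by |correlation|:", top_features)
```

No model is trained here, which is what makes this kind of scoring cheap.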

---

### 2️⃣ Wrapper Methods  
These use a predictive model to evaluate candidate subsets of features.  

- **Recursive Feature Elimination (RFE)**: Iteratively removes the least important features.  

👉 These are often more accurate than filter methods but computationally expensive, because the model is retrained for each candidate subset.  
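
To make "iteratively removes" concrete, here is a hand-rolled sketch of backward elimination in the spirit of RFE. It assumes a linear model whose absolute coefficients serve as importance scores, features on comparable scales, and a hypothetical target of five features. In practice scikit-learn's `RFE` class handles this loop for you.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Assumption: the bundled diabetes dataset, whose features are already scaled
X, y = load_diabetes(return_X_y=True, as_frame=True)

remaining = list(X.columns)
n_to_keep = 5  # hypothetical number of features to retain

while len(remaining) > n_to_keep:
    # Refit the model on the features that are still in play
    model = LinearRegression().fit(X[remaining], y)
    # Drop the feature with the smallest absolute coefficient
    weakest = remaining[int(np.argmin(np.abs(model.coef_)))]
    remaining.remove(weakest)
    print(f"Dropped {weakest!r}; {len(remaining)} features left")

print("Remaining features:", remaining)
```

Each pass retrains the model, which is where the extra computational cost comes from.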

---

### 3️⃣ Embedded Methods  
Feature selection occurs during model training.  

- **Lasso (L1 Regularization)** shrinks the coefficients of less important features and can drive some of them exactly to zero, which removes those features from the model.  
- **Tree-based models (RandomForest, XGBoost)** provide built-in feature importance.  
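
The shrinking effect of L1 regularization can be seen by counting non-zero coefficients as the penalty grows. This is a minimal sketch; the dataset and the three `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

# Assumption: the bundled diabetes dataset, whose features are already scaled
X, y = load_diabetes(return_X_y=True, as_frame=True)

nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0]:  # illustrative penalty strengths
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # Features whose coefficient is exactly zero are effectively dropped
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    kept = list(X.columns[lasso.coef_ != 0])
    print(f"alpha={alpha}: {nonzero_counts[alpha]} features kept -> {kept}")
```

A stronger penalty generally leaves fewer features in the model, so `alpha` acts as the knob that controls how aggressive the selection is.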

---

### 4️⃣ Dimensionality Reduction  
Instead of selecting features, this **creates new features** by combining existing ones.  

- **Principal Component Analysis (PCA)**  
- **t-SNE, UMAP** (mainly for visualizing high-dimensional data)  
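
The sketch below shows what "creating new features" means for PCA: each component is a weighted combination of all the original features. The dataset, the standardization step, and the choice of two components are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled diabetes dataset serves as sample data
X, y = load_diabetes(return_X_y=True, as_frame=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)  # illustrative number of components
X_pca = pca.fit_transform(X_scaled)

# Each row of components_ holds one weight per original feature
print("Component weights shape:", pca.components_.shape)

# Reproduce the projection by hand: center the data, then apply the weights
manual = (X_scaled - pca.mean_) @ pca.components_.T
print("Manual projection matches PCA output:", np.allclose(manual, X_pca))
```

Because every component mixes all the inputs, the new features are harder to interpret than a selected subset of the original columns.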

---

## 🧑‍💻 Hands-on Code Examples  

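### Import Libraries
```python
import pandas as pd
# load_boston was removed in scikit-learn 1.2; the bundled diabetes
# regression dataset is used here instead.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
```

### Load Dataset
```python
# Load sample regression dataset (10 numeric features, already scaled)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)

# Hold out a test set; selectors below are fitted on the training split only
# so that the test data does not leak into feature selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### 1. Filter Method – SelectKBest
```python
# Univariate linear-regression F-test (f_regression);
# for a categorical target, the ANOVA F-test is f_classif
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X_train, y_train)

selected_features = X.columns[selector.get_support()]
print("Selected Features (Filter Method):", list(selected_features))
```

### 2. Wrapper Method – RFE
```python
# Repeatedly fit the model and drop the weakest feature until 5 remain
model = LinearRegression()
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X_train, y_train)

selected_features_rfe = X.columns[fit.support_]
print("Selected Features (Wrapper Method - RFE):", list(selected_features_rfe))
```

### 3. Embedded Method – Lasso Regularization
```python
# Lasso penalizes coefficient size, so features should be on comparable
# scales (these already are; standardize raw data first)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Features with a non-zero coefficient are the ones Lasso kept
selected_features_lasso = X.columns[lasso.coef_ != 0]
print("Selected Features (Embedded Method - Lasso):", list(selected_features_lasso))
```

### 4. Embedded Method – RandomForest Feature Importance
```python
# Fix the seed so the importance ranking is reproducible
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```

### 5. Dimensionality Reduction – PCA
```python
# PCA is scale-sensitive; standardize raw features before applying it
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_train)

print("Explained Variance Ratio (Top 5 PCA components):", pca.explained_variance_ratio_)
```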


## Key Takeaways

- Feature selection reduces noise and redundancy in data.
- Use Filter methods for quick pre-selection.
- Apply Wrapper and Embedded methods for model-based refinement.
- Consider PCA or other dimensionality reduction when dealing with very high-dimensional data.