# Feature Selection Techniques
Feature selection techniques are methods for choosing the most important features (columns) of a dataset and removing the unnecessary or less useful ones.

🔹 Forward Selection

Forward selection is a feature selection technique in which we start with no features in the model and then add features one by one. At each step, the feature that gives the most significant improvement in the model’s performance is included. The process stops when adding new features does not improve the model.

Key points:

*   Start with no features.
*   Add one feature at a time → the one that improves model performance the most.
*   Keep adding until no improvement.

Example:
Dataset = [Age, Salary, Experience, Education]

1.   Start with nothing.
2.   Add "Experience" → best result
3.   Add "Age" → model improves.
4.   Add "Education" → no improvement → stop.

     👉 Final model = [Experience, Age]
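
The steps above can be sketched as a small greedy loop. This is only an illustration: the `forward_selection` helper, the `min_gain` tolerance, and the `usefulness` numbers are hypothetical and were chosen just to reproduce the example. In practice the score would come from fitting a real model on each candidate subset.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedy forward selection.

    `score(subset)` must return a number where higher is better.
    """
    selected = []
    remaining = list(features)
    best = float("-inf")
    while remaining:
        # Try each remaining feature on top of the current selection
        candidate_score, candidate = max(
            (score(selected + [f]), f) for f in remaining
        )
        if candidate_score - best <= min_gain:
            break  # adding anything else does not improve the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Hypothetical usefulness of each feature, chosen only to mirror the example
usefulness = {"Age": 0.25, "Salary": 0.0, "Experience": 0.60, "Education": 0.0}


def toy_score(subset):
    # Stand-in for "fit a model on these features and measure performance"
    return sum(usefulness[f] for f in subset)


final_model = forward_selection(
    ["Age", "Salary", "Experience", "Education"], toy_score
)
print(final_model)  # ['Experience', 'Age']
```

The loop stops at the third round because neither of the remaining features raises the score, which matches step 4 of the example.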


🔹 Backward Selection (Backward Elimination)

Backward elimination is a feature selection technique in which we start with all the features in the model and then remove the least significant feature one at a time (for example, the one with the highest p-value or the smallest contribution), refitting the model after each removal. This process continues until only the most important features remain in the model.

Key points:

*   Start with all features.
*   Remove the least important one (statistically insignificant or weak).
*   Keep removing until only important ones remain.

Example:
Dataset = [Age, Salary, Experience, Education]



1.   Start with all.
2.   "Education" is least useful → remove.
3.   "Salary" not significant → remove.

      👉 Final model = [Age, Experience]

Simple summary:
▶ Forward = Start empty → keep adding features.
◀ Backward = Start full → keep removing features.



For example:

Create a simple dataset

In [None]:
import pandas as pd
data = {
    "Age":     [25, 30, 35, 40, 45],
    "Experience": [1, 3, 5, 7, 9],
    "Education":  [10, 12, 14, 16, 18],
    "Salary":  [20, 40, 60, 80, 100]   # Target variable
}
df = pd.DataFrame(data)

🔹 Backward Elimination (using statsmodels)

In [None]:
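import statsmodels.api as sm

X = df[["Age", "Experience", "Education"]]  # Features
y = df["Salary"]                            # Target

# Add a constant column so the model fits an intercept
X = sm.add_constant(X)

# Fit the full model with all features. This is only the first step of
# backward elimination: next, drop the feature with the highest p-value and
# refit, repeating until every remaining p-value is below the chosen threshold.
# Note: in this toy data the three features are exact linear functions of one
# another, so the p-values printed here are not meaningful.
model = sm.OLS(y, X).fit()
print(model.summary())  # the P>|t| column lists each feature's p-value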
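
The cell above fits the full model once, so here is a minimal sketch of the remove-and-refit loop itself. To stay self-contained it uses NumPy only and measures "contribution" as the drop in R² when a feature is removed, instead of p-values. The `backward_elimination` helper, the `min_drop` threshold, and the eight-row dataset are all assumptions made for illustration (the five-row dataset above is perfectly collinear, so it cannot show a meaningful elimination). In this made-up data, Salary is built from Experience and Age only.

```python
import numpy as np


def r_squared(X, y):
    """R² of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def backward_elimination(data, target, min_drop=0.01):
    """Remove the weakest feature while removing it costs less than `min_drop` in R²."""
    features = [name for name in data if name != target]
    y = np.asarray(data[target], dtype=float)

    def score(names):
        X = np.column_stack([data[n] for n in names]).astype(float)
        return r_squared(X, y)

    current = score(features)
    while len(features) > 1:
        # How much R² is lost if each feature is left out in turn?
        drops = [
            (current - score([f for f in features if f != cand]), cand)
            for cand in features
        ]
        smallest_drop, weakest = min(drops)
        if smallest_drop >= min_drop:
            break  # every remaining feature matters
        features.remove(weakest)
        current = score(features)
    return features


# Hypothetical data: Salary = 5 * Experience + Age, Education plays no role
data = {
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8],
    "Age":        [40, 25, 35, 50, 30, 45, 20, 55],
    "Education":  [12, 16, 14, 12, 16, 14, 12, 16],
    "Salary":     [45, 35, 50, 70, 55, 75, 55, 95],
}

kept = backward_elimination(data, "Salary")
print("Kept features (Backward Elimination):", kept)
```

A p-value version would follow the same loop: refit with statsmodels after each removal and drop the feature whose entry in `model.pvalues` is largest, as long as it is above the chosen significance level.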


🔹 Forward Selection (using sklearn)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df.drop("Salary", axis=1)
y = df["Salary"]

selected_features = []
remaining_features = list(X.columns)
best_score = float("-inf")
# Training R² never decreases when a feature is added, so require a real gain
# instead of accepting tiny floating-point differences as "improvement".
min_improvement = 1e-6

while remaining_features:
    # Score every candidate by the training R² of the model that includes it
    scores = []
    for feature in remaining_features:
        model = LinearRegression()
        temp_features = selected_features + [feature]
        model.fit(X[temp_features], y)
        score = r2_score(y, model.predict(X[temp_features]))
        scores.append((score, feature))

    # Best candidate first
    scores.sort(reverse=True)
    if scores[0][0] - best_score > min_improvement:
        best_score, best_feature = scores[0]
        selected_features.append(best_feature)
        remaining_features.remove(best_feature)
    else:
        break  # no candidate improves the model enough, so stop

print("Selected Features (Forward Selection):", selected_features)


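The output is "Experience".

1.   Forward Selection starts with nothing.
2.   It checks: “Which single feature predicts Salary best?”
3.   In this data, Experience alone predicts Salary perfectly (Salary = 10 × Experience + 10).
4.   When it tries to add Age or Education, the model cannot get any better. In this toy dataset Age and Education are themselves exact linear functions of Experience, so they carry no extra information.
5.   So it stops and keeps only Experience.

👉 Meaning: one feature is enough here. Because all three features are perfectly correlated, each of them would predict Salary equally well on its own, so the choice of Experience over the others is effectively a tie-break rather than proof that it is the strongest feature.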
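
One way to see why only a single feature is kept is to check how strongly the toy columns are correlated with each other and with Salary. The sketch below repeats the same values as the dataset above and uses NumPy directly so that it runs on its own. The variable names are just for illustration.

```python
import numpy as np

# Same toy values as the dataset above
age        = [25, 30, 35, 40, 45]
experience = [1, 3, 5, 7, 9]
education  = [10, 12, 14, 16, 18]
salary     = [20, 40, 60, 80, 100]

# Each row is one variable, so corr[i, j] is the Pearson correlation
# between variable i and variable j
corr = np.corrcoef([age, experience, education, salary])
print(np.round(corr, 3))
```

Every entry comes out as 1 (up to rounding): each column is an exact linear function of the others, so once one feature is in the model the remaining ones add nothing. With the DataFrame from earlier, `df.corr()` gives the same table.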







Next, I apply the feature selection techniques to my original dataset, which is stored in Google Drive.

In [None]:
import pandas as pd

# Load the dataset from Google Drive (reading .xlsx files requires the openpyxl package)
url = "https://docs.google.com/spreadsheets/d/1Mdlhhgd7ViNVAsktM133-jBuOOCLgvxg/export?format=xlsx"
dataset = pd.read_excel(url)
print(dataset.head(5))

In [None]:
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# ✅ Load dataset
url = "https://docs.google.com/spreadsheets/d/1Mdlhhgd7ViNVAsktM133-jBuOOCLgvxg/export?format=xlsx"
dataset = pd.read_excel(url)

# Features = every column except the target, wherever "Outcome" sits in the sheet
X = dataset.drop(columns=["Outcome"])
y = dataset["Outcome"]

# ✅ Logistic Regression Model
lr = LogisticRegression(max_iter=1000)

# ✅ Sequential Feature Selector
fs = SequentialFeatureSelector(
    lr,
    k_features=5,
    forward=True,
    scoring='accuracy',
    cv=5
)

Once the features are selected, we can create a pie chart to see which features have been chosen and what their ratio looks like.

In [None]:
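# ✅ Fit the selector
fs = fs.fit(X, y)

# ✅ Selected & Not Selected Features
selected = [X.columns[i] for i in fs.k_feature_idx_]
not_selected = [col for col in X.columns if col not in selected]

print("Selected Features:", selected)
print("Not Selected Features:", not_selected)

# ✅ Pie Chart: one slice per group, sized by the number of features in it
labels = [f"Selected ({len(selected)})", f"Not selected ({len(not_selected)})"]
sizes = [len(selected), len(not_selected)]

plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title("Feature Selection Ratio")
plt.show()


👉 This will display a pie chart with two slices:

Selected Features ('Glucose', 'Blood_preser', 'Skin thikness', 'BMI', 'Age')

Not Selected Features (the remaining columns, whose names are printed above the chart)

The selector keeps exactly five features because k_features=5 was passed, so the selected share is always 5 divided by the total number of feature columns. It does not mean Forward Selection judged every feature useful.
If each selected feature is instead given a size of 1 and each unselected feature a size of 0, the percentages add up to 100% for the selected features alone, which is easy to misread as "100% of the features are selected". To let the cross-validated accuracy decide how many features to keep, mlxtend also accepts k_features="best".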