<a href="https://colab.research.google.com/github/amzad-786githumb/AI_and_ML_by-Microsoft/blob/main/14_Implementing_forward_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2>Tasks:</h2>


*   **Implement forward selection:** Apply forward selection to add the most significant features to an ML model, enhancing its predictive power.
*   **Evaluate model performance:** Use the R-squared metric to assess the impact of each feature on the model's performance during the selection process.
*   **Build an efficient model:** Identify and include only the most relevant features, improving model efficiency and interpretability.


<h3>Step 1: Import the required libraries</h3>

In [2]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

<h3>Step 2: Load and prepare the data</h3>

In [3]:
# Sample dataset: Study hours, previous exam scores, and pass/fail labels
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

<h3>Step 3: Implement forward selection</h3>

**Step-by-step process:**

1. Start with an empty model (no features).

2. For each remaining feature, train a model with that feature added to the current selection and evaluate it using R-squared on the test set.

3. Add the feature that improves the R-squared value the most.

4. Repeat the process, adding features one by one, until no remaining feature improves the score.

In [5]:
def forward_selection(X, y):
    # Split once; the fixed random_state means every candidate is scored on the same rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    remaining_features = list(X.columns)  # a list keeps the iteration order deterministic
    selected_features = []
    current_score = 0.0  # baseline: a candidate must reach a test R-squared above 0

    while remaining_features:
        scores_with_candidates = []

        # Score each remaining feature when added to the current selection
        for feature in remaining_features:
            features_to_test = selected_features + [feature]

            # Train the model on the candidate feature set
            model = LinearRegression()
            model.fit(X_train[features_to_test], y_train)

            # Predict on the held-out rows and calculate R-squared
            y_pred = model.predict(X_test[features_to_test])
            score = r2_score(y_test, y_pred)

            # Record the score together with the candidate feature
            scores_with_candidates.append((score, feature))

        # Pick the candidate with the highest score
        best_score, best_feature = max(scores_with_candidates)

        # Keep the feature only if it strictly improves on the current score
        if best_score > current_score:
            remaining_features.remove(best_feature)
            selected_features.append(best_feature)
            current_score = best_score
        else:
            break

    return selected_features

# Run forward selection
best_features = forward_selection(X, y)
print("Selected features using Forward Selection:", best_features)

Selected features using Forward Selection: ['PrevExamScore']
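
For comparison, scikit-learn ships a built-in `SequentialFeatureSelector` that can run forward selection with cross-validation instead of a single train/test split. The sketch below is not the function defined above. It assumes 5-fold cross-validation, negative mean squared error as the scoring metric (R-squared can be undefined on a two-row fold that contains only one class), and a fixed target of one selected feature. It may or may not pick the same feature as the hand-written loop.

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Same sample dataset as in Step 2
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

# Assumed settings: 1 feature to keep, 5-fold CV, negative MSE as the score
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=1,
    direction='forward',
    scoring='neg_mean_squared_error',
    cv=5,
)
sfs.fit(X, y)

# get_support() returns a boolean mask over the input columns
sfs_features = list(X.columns[sfs.get_support()])
print("Selected by SequentialFeatureSelector:", sfs_features)
```

Unlike the loop above, this selector stops at a fixed number of features rather than when the score stops improving.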


<h3>Step 4: Explain the process</h3>

**How forward selection works**

1. **Initialize with no features:** The process begins with an empty set of features.

2. **Evaluate each feature:** At each iteration, the model is trained with one additional feature at a time, and the R-squared value is calculated.

3. **Add the best feature:** The feature that provides the highest improvement in R-squared is added to the model.

4. **Repeat:** The process repeats by adding the next best feature until no further improvement can be made.
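
To make the first iteration concrete, the following sketch scores each feature on its own, using the same split settings as `forward_selection` (`test_size=0.2`, `random_state=42`). The dictionary name `first_round_scores` is only illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same sample dataset as in Step 2
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
y = df['Pass']

# Round 1: train a single-feature model for every candidate
first_round_scores = {}
for feature in ['StudyHours', 'PrevExamScore']:
    X_train, X_test, y_train, y_test = train_test_split(
        df[[feature]], y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    first_round_scores[feature] = r2_score(y_test, model.predict(X_test))

for feature, score in first_round_scores.items():
    print(f"{feature}: R-squared = {score:.4f}")

# The candidate with the highest score is the one added in round 1
best_first = max(first_round_scores, key=first_round_scores.get)
print("Best first feature:", best_first)
```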

**Metrics used**

**R-squared (coefficient of determination):** This metric shows how much of the variance in the target variable is explained by the features. The higher the R-squared, the better the model explains the data.
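
As a quick check on the definition, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up values for `y_true` and `y_pred` (they are not from the dataset above) and compares the manual result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, chosen only to illustrate the formula
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: squared errors of the predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot
print("Manual R-squared: ", manual_r2)
print("sklearn r2_score: ", r2_score(y_true, y_pred))
```

A score of 1 means the predictions match the targets exactly, and the score turns negative when the model does worse than always predicting the mean.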

<h3>Step 5: Analyze the results</h3>

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X[best_features], y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
final_r2_score = r2_score(y_test, y_pred)

print(f'Final R-squared score with selected features: {final_r2_score}')

Final R-squared score with selected features: 1.0


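In this case, forward selection picked PrevExamScore first because it gave the highest test R-squared on its own. Adding StudyHours did not raise the R-squared any further, so the loop stopped with only PrevExamScore selected. The algorithm checks for a strict increase in the score, not for statistical significance. With 10 samples and a 20% test split, the score is computed on just two test rows, so treat the value of 1.0 as an illustration and not as a reliable estimate of model quality.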
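
To see why the loop stopped after one feature, the sketch below compares the test R-squared of the selected feature with that of both features on the same split. The helper name `holdout_r2` is hypothetical and used only for this illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same sample dataset as in Step 2
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
y = df['Pass']

def holdout_r2(features):
    """Fit a linear model on the given columns and return its test R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

r2_selected = holdout_r2(['PrevExamScore'])
r2_both = holdout_r2(['PrevExamScore', 'StudyHours'])

print(f"PrevExamScore only: {r2_selected}")
print(f"Both features:      {r2_both}")
```

Because R-squared cannot exceed 1, a second feature has no room to improve a model that already scores 1.0 on the test rows.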