<h2>Tasks:</h2>


*   **Implement backward elimination:** identify and remove the least significant features from a dataset.
*   **Apply statistical modeling:** fit a linear regression model using the statsmodels library and interpret the p-values to determine feature significance.
*   **Refine and simplify models:** analyze the impact of removing irrelevant features on model performance and interpret the results to improve model efficiency.


<h3>Step 1: Import the required libraries</h3>

In [7]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

<h3>Step 2: Load and prepare the data</h3>

In [8]:
# Sample dataset
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

<h3>Step 3: Add a constant to the model</h3>

In [9]:
# Add a constant column to X so the model can estimate an intercept

X = sm.add_constant(X)

<h3>Step 4: Fit the initial model</h3>

In [11]:
# Fit the model using ordinary least squares (OLS) regression

model = sm.OLS(y, X).fit()

# Display the summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Pass   R-squared:                       0.758
Model:                            OLS   Adj. R-squared:                  0.688
Method:                 Least Squares   F-statistic:                     10.94
Date:                Tue, 07 Oct 2025   Prob (F-statistic):            0.00701
Time:                        07:35:26   Log-Likelihood:               -0.17258
No. Observations:                  10   AIC:                             6.345
Df Residuals:                       7   BIC:                             7.253
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.3333      1.464     -0.228

<h3>Step 5: Implement backward elimination</h3>

**Step-by-step process:**

1. Fit the model with all features.

2. Identify the feature with the highest p-value.

3. Remove the feature with the highest p-value.

4. Refit the model and repeat until all remaining features are statistically significant.

In [13]:
# Define a significance level
significance_level = 0.05

# Perform backward elimination
while True:
    # Fit the model
    model = sm.OLS(y, X).fit()
    # Get the highest p-value in the model (the 'const' term is included)
    max_p_value = model.pvalues.max()

    # Check if the highest p-value is greater than the significance level
    if max_p_value > significance_level:
        # Identify the feature with the highest p-value
        feature_to_remove = model.pvalues.idxmax()
        print(f"Removing feature: {feature_to_remove} with p-value: {max_p_value}")

        # Drop the feature
        X = X.drop(columns=[feature_to_remove])
    else:
        break

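# Columns still in X after the first one. The loop does not protect 'const',
# so the intercept can be dropped like any other term.
print("Remaining columns after the first:")
print(X.columns[1:])
# Display the final model summary
print(model.summary())

Remaining columns after the first:
Index([], dtype='object')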
                                 OLS Regression Results                                
Dep. Variable:                   Pass   R-squared (uncentered):                   0.831
Model:                            OLS   Adj. R-squared (uncentered):              0.812
Method:                 Least Squares   F-statistic:                              44.31
Date:                Tue, 07 Oct 2025   Prob (F-statistic):                    9.31e-05
Time:                        07:39:13   Log-Likelihood:                         -1.8294
No. Observations:                  10   AIC:                                      5.659
Df Residuals:                       9   BIC:                                      5.961
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.

<h3>Step 6: Analyze the results</h3>

After completing the backward elimination process, we examined the final model summary. The remaining features in the model should all have p-values below 0.05, indicating that they are statistically significant predictors of the target variable.

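In this run, the final summary reports an uncentered R-squared with one model degree of freedom and nine residual degrees of freedom, which means the fitted model contains a single predictor and no intercept: the loop removed the constant term (const) along with one feature. The reported uncentered R-squared of 0.831 is the value a no-intercept fit on StudyHours produces (PrevExamScore alone would give about 0.721), so StudyHours is the remaining feature. PrevExamScore and the constant were removed because their p-values exceeded the 0.05 threshold at the time each was dropped.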
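
The uncentered R-squared in the final summary can be checked by hand. For a single predictor x fitted without an intercept, it equals (Σxy)² / (Σx² · Σy²). The sketch below evaluates that expression for each feature using NumPy only; the `uncentered_r2` helper is a name made up for this check.

```python
import numpy as np

# Same sample data as in Step 2
study_hours = np.arange(1, 11)
prev_exam_score = np.array([30, 40, 45, 50, 60, 65, 70, 75, 80, 85])
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


def uncentered_r2(x, y):
    """R-squared of a single-predictor least-squares fit with no intercept."""
    return (x @ y) ** 2 / ((x @ x) * (y @ y))


r2_hours = uncentered_r2(study_hours, passed)      # 40**2 / (385 * 5)
r2_score = uncentered_r2(prev_exam_score, passed)  # 375**2 / (39000 * 5)

print(round(r2_hours, 3), round(r2_score, 3))  # 0.831 0.721
```

Only the StudyHours value matches the 0.831 shown in the final summary, which identifies it as the predictor left in the model.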