<a href="https://colab.research.google.com/github/amzad-786githumb/AI_and_ML_by-Microsoft/blob/main/16_Implementing_feature_selection_techniques_on_a_given_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2>Tasks:</h2>


*   **Implement backward elimination:** Use p-values from an OLS model to remove features that are not statistically significant.
*   **Implement forward selection:** Add features one at a time, keeping each only if it improves the R-squared score on held-out data.
*   **Apply LASSO:** Use L1 regularization to shrink the coefficients of less important features to zero and read the selected features from the nonzero coefficients.


<h3>1. Setting up your environment</h3>

In [2]:
%pip install pandas numpy statsmodels scikit-learn



<h3>2. Importing the required libraries</h3>

In [3]:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

<h3>3. Load and prepare the data</h3>

In [8]:
# Sample dataset: Study hours, previous exam scores, and pass/fail labels
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]  # Features
y = df['Pass']  # Target variable

<h3>4. Implementing backward elimination</h3>

Backward elimination starts with all features and removes those that are not statistically significant based on their p-values.

**Steps for backward elimination**

1.  Add a constant (intercept) to the feature set.

2.  Fit the model and check p-values.

3.  If the highest p-value among the features is greater than 0.05, remove that feature and refit the model.

4.  Repeat until all remaining features have p-values below 0.05.

In [9]:
# Add a constant (intercept) column; X itself stays unchanged for the later sections
X_be = sm.add_constant(X)

# Fit the initial model using OLS
model = sm.OLS(y, X_be).fit()

# Model summary
print(model.summary())

# Repeatedly remove the feature with the highest p-value greater than 0.05
while X_be.shape[1] > 1:
    pvalues = model.pvalues.drop('const')  # the intercept is never a removal candidate
    worst_feature = pvalues.idxmax()
    if pvalues[worst_feature] <= 0.05:
        break
    X_be = X_be.drop(columns=worst_feature)
    model = sm.OLS(y, X_be).fit()

# Final model after backward elimination
print("****************************************************************")
print("Model summary after backward elimination")
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Pass   R-squared:                       0.758
Model:                            OLS   Adj. R-squared:                  0.688
Method:                 Least Squares   F-statistic:                     10.94
Date:                Thu, 09 Oct 2025   Prob (F-statistic):            0.00701
Time:                        04:10:27   Log-Likelihood:               -0.17258
No. Observations:                  10   AIC:                             6.345
Df Residuals:                       7   BIC:                             7.253
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.3333      1.464     -0.228

<h3>5. Implementing forward selection</h3>

Forward selection adds features one at a time based on their contribution to the model’s performance, measured here by the R-squared score on a held-out test split.

**Steps for forward selection**

1.  Start with an empty model.

2.  Add one feature at a time that improves the model’s performance.

3.  Stop when adding features no longer improves the model.

In [10]:
def forward_selection(X, y):
    """Greedily add the feature that most improves the held-out R-squared score."""
    remaining_features = list(X.columns)
    selected_features = []
    current_score = 0.0

    # Split once so every candidate is scored on the same train/test rows
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    while remaining_features:
        scores_with_candidates = []

        for feature in remaining_features:
            features_to_test = selected_features + [feature]

            # Train on the candidate feature set and score it on the test split
            model = LinearRegression()
            model.fit(X_train[features_to_test], y_train)
            y_pred = model.predict(X_test[features_to_test])
            score = r2_score(y_test, y_pred)

            scores_with_candidates.append((score, feature))

        # Select the candidate with the highest score
        best_score, best_feature = max(scores_with_candidates)

        # Keep the candidate only if it improves on the current best score
        if current_score < best_score:
            remaining_features.remove(best_feature)
            selected_features.append(best_feature)
            current_score = best_score
        else:
            break

    return selected_features

best_features = forward_selection(X, y)
print(f"Selected features using Forward Selection: {best_features}")

Selected features using Forward Selection: ['PrevExamScore']
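
The manual loop above scores candidates on a single train/test split. As a point of comparison, scikit-learn ships `SequentialFeatureSelector`, which performs the same greedy search with cross-validation. The sketch below is illustrative only: the names `sample_df`, `X_raw`, `y_raw`, and `sfs_features` are introduced here, and `n_features_to_select=1`, `cv=5`, and the negative-MSE scoring are assumptions chosen to suit a 10-row dataset, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Same sample data as in section 3 (rebuilt so this cell is self-contained)
sample_df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X_raw = sample_df[['StudyHours', 'PrevExamScore']]
y_raw = sample_df['Pass']

# Forward selection scored by cross-validated negative mean squared error.
# n_features_to_select=1 and cv=5 are illustrative assumptions, not tuned values.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=1,
    direction='forward',
    scoring='neg_mean_squared_error',
    cv=5,
)
selector.fit(X_raw, y_raw)

# get_support() returns a boolean mask over the input columns
sfs_features = list(X_raw.columns[selector.get_support()])
print(f"Selected features using SequentialFeatureSelector: {sfs_features}")
```

Because the selector averages scores over several folds, its choice can differ from the single-split result on a dataset this small.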


<h3>6. Implementing LASSO</h3>







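LASSO (least absolute shrinkage and selection operator) is a regularization technique that adds an L1 penalty to the regression loss. The penalty can shrink the coefficients of less important features to exactly zero, effectively performing feature selection.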

**Steps for LASSO**

1.  Initialize the LASSO model with a regularization parameter.

2.  Fit the model on the training data.

3.  Analyze which features have nonzero coefficients.

In [11]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the LASSO model with alpha (regularization parameter)
lasso_model = Lasso(alpha=0.1)

# Train the LASSO model
lasso_model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = lasso_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f'R-squared score: {r2}')

# Display the coefficients of the features
print(f'LASSO Coefficients: {lasso_model.coef_}')

R-squared score: 0.9997884297520662
LASSO Coefficients: [0.         0.02463636]
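
The regularization parameter `alpha` controls how aggressively coefficients are pushed to zero. The following sketch refits LASSO on the full sample data for a few `alpha` values to show that effect. The list of values and the names `sample_df`, `X_raw`, `y_raw`, and `nonzero_counts` are assumptions made for illustration; `max_iter` is raised only to give the solver room on these correlated features.

```python
import pandas as pd
from sklearn.linear_model import Lasso

# Same sample data as in section 3 (rebuilt so this cell is self-contained)
sample_df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X_raw = sample_df[['StudyHours', 'PrevExamScore']]
y_raw = sample_df['Pass']

# Illustrative alpha values, from weak to very strong regularization
alphas = [0.001, 0.1, 1.0, 10.0]
nonzero_counts = {}

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_raw, y_raw)

    # A feature counts as kept when its coefficient is not exactly zero
    kept = [name for name, coef in zip(X_raw.columns, lasso.coef_) if coef != 0]
    nonzero_counts[alpha] = len(kept)
    print(f"alpha={alpha}: kept features {kept}")
```

With a large enough `alpha` every coefficient is driven to zero and the model predicts only the intercept, so in practice `alpha` is usually chosen by validation rather than fixed by hand.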


<h3>7. Analyzing the results</h3>

In [None]:
# Display the coefficients of the features
print(f'LASSO Coefficients: {lasso_model.coef_}')

LASSO Coefficients: [0.         0.02463636]




*   The coefficient for StudyHours is 0: LASSO shrank it to zero, which effectively removes the feature from the model.
*   The coefficient for PrevExamScore is nonzero (about 0.025), so the feature was retained in the model.
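
A raw coefficient array is easy to misread, because each value has to be matched to its column by position. The sketch below labels each coefficient with its feature name and lists the retained features. It assumes the feature matrix holds exactly the two columns `StudyHours` and `PrevExamScore`; the names `sample_df`, `features`, `target`, `coef_by_feature`, and `retained` are introduced here for illustration.

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Same sample data as in section 3 (rebuilt so this cell is self-contained)
sample_df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
features = sample_df[['StudyHours', 'PrevExamScore']]
target = sample_df['Pass']

# Same split and alpha as the LASSO cell in section 6
F_train, F_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
lasso = Lasso(alpha=0.1).fit(F_train, t_train)

# Pair each coefficient with the name of the column it belongs to
coef_by_feature = pd.Series(lasso.coef_, index=F_train.columns)
print(coef_by_feature)

# Features whose coefficient is not exactly zero were retained
retained = coef_by_feature[coef_by_feature != 0].index.tolist()
print(f"Features retained by LASSO: {retained}")
```

Indexing by `F_train.columns` ties the labels to whatever columns were actually passed to `fit`, which guards against misattributing a coefficient if the feature matrix was modified earlier in the notebook.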


