<a href="https://colab.research.google.com/github/AbdhMohammady/DataScience/blob/main/forward_selection_with_significance_level.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**How to implement forward selection**

In this notebook, I implement forward feature selection driven by p-values.
I first run it on the California housing dataset from the `sklearn.datasets` library, and then test the same code on two other sample datasets.

The flowchart for this code is available on [my Google Drive](https://drive.google.com/file/d/1tXxCP1qk4bMWCPJKon-XW03WeBAqMX_0/view?usp=share_link).

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm  # used to fit the OLS models
from sklearn.datasets import fetch_california_housing  # loads the California housing dataset

In [None]:
# The California housing dataset
california_housing = fetch_california_housing(as_frame=True)
# The dataset description is available via california_housing.DESCR

In [None]:
california_housing.frame.head()
# According to the description at https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html,
# the target variable is the 'MedHouseVal' column. The features are in 'california_housing.data'
# and the target is in 'california_housing.target' (call '.head()' on either to preview it).

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [None]:
# Forward feature selection driven by p-values.
# Returns a dict that maps each selected feature to the p-value it had
# at the step where it entered the model.
def forward_selection(data, target, significance_level=0.05):
    # candidate features not yet selected, kept in column order for reproducibility
    remaining = list(data.columns)
    # selected features, in the order they were added
    selected_features = []
    # helper dictionary that stores feature names and their p-values
    selected_dict = dict()

    # looping on the remaining candidates avoids calling min() on an empty dict
    # when every feature turns out to be significant
    while remaining:
        # p-value of each candidate when added to the current selection
        p_values = dict()

        for feature in remaining:
            # current selection plus the candidate feature
            candidates = selected_features + [feature]
            # fit OLS with an intercept on those columns
            model = sm.OLS(target, sm.add_constant(data[candidates])).fit()
            # store the candidate's p-value in this model
            p_values[feature] = model.pvalues[feature]

        # candidate with the smallest p-value
        min_feature_name = min(p_values, key=p_values.get)
        min_p_value = p_values[min_feature_name]

        # stop when even the best candidate is not significant
        if min_p_value >= significance_level:
            break

        # otherwise select the feature and remove it from the candidates
        selected_features.append(min_feature_name)
        remaining.remove(min_feature_name)
        selected_dict[min_feature_name] = min_p_value

    # feature names and p-values, in selection order
    return selected_dict


In [None]:
# Run the code on the California housing dataset
features = forward_selection(california_housing.data, california_housing.target, 0.05)


In [None]:
features

{'MedInc': 0.0,
 'HouseAge': 0.0,
 'Latitude': 2.120189499314217e-76,
 'Longitude': 0.0,
 'AveBedrms': 1.881858803058202e-55,
 'AveRooms': 6.884823168164486e-72,
 'AveOccup': 3.985591607998472e-15}

# **Testing the code using 50_Startups.csv**

In [None]:
stu = pd.read_csv("/content/drive/MyDrive/DATA/50_Startups.csv")

stu.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [None]:
data = stu.iloc[:, :-2]    # drop the last two columns (State and Profit)
target = stu.iloc[:, 4:5]  # Profit, kept as a one-column DataFrame
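Positional slicing works here but breaks silently if the column order changes. As a hedged alternative, the same split can be written with column names. The sketch below rebuilds the five rows shown in the output above so that it runs on its own; `stu_head` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# First five rows of 50_Startups.csv, copied from the output above.
stu_head = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.8, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.1, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Select by name instead of by position.
data = stu_head.drop(columns=["State", "Profit"])  # numeric predictors only
target = stu_head["Profit"]                        # response as a Series

# Same predictor columns as the positional version.
print(list(data.columns) == list(stu_head.iloc[:, :-2].columns))
```

Note that `State` is a text column, so it is left out of the predictors in both versions.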


In [None]:
# note: the significance level here is 0.5, not the 0.05 used for the other datasets
features = forward_selection(data, target, 0.5)


In [None]:
features

{'R&D Spend': 3.50032224369015e-32, 'Marketing Spend': 0.06003039719113171}

# **Testing the code using Alcohol.csv**

In [None]:
alcohol = pd.read_csv("/content/drive/MyDrive/DATA/Alcohol.csv")

data = alcohol.iloc[:, :-1]      # every column except the last one
target = alcohol.iloc[:, 13:14]  # column at position 13, kept as a one-column DataFrame
features = forward_selection(data, target, 0.05)


In [None]:
features

{'Flavanoids': 2.7366522617003495e-50,
 'Proline': 6.52009150108951e-11,
 'Color_Intensity': 1.1655178780367606e-16,
 'Ash_Alcanity': 5.721854844894178e-06,
 'OD280': 5.390825157579003e-06,
 'Alcohol': 0.0028329142579628683,
 'Total_Phenols': 0.026647245418236754,
 'Hue': 0.04213945137268148,
 'Ash': 0.04773083859669257}