## Feature Selection

Feature selection is primarily focused on removing non-informative or redundant predictors from the model. When you are done creating hundreds of thousands of features, it’s time to select a few of them. Well, we should never create hundreds of thousands of useless features. Having too many features pose a problem well known as the curse of dimensionality. If you have a lot of features, you must also have a lot of training samples to capture all the features.

### Step for forwarding Selection
- Start with the empty feature set
- Try the remaining feature
- Estimate classification/regression error for adding each feature in the model
- Select a feature that gives maximum improvement
- Stop when there is no significant improvement

### Step for Backward Selection

- Start with a complete feature set
- Try the remaining feature
- Estimate classification/regression error for adding each feature in the model
- Drop feature that gives less improvement
- Stop when there is no significant improvement

### Methods/Technique

#### Univariate

- Pearson Correlation
- F-score
- Chi-square
- Signal to noise ratio

#### Multivariate

- Compute 'w' on all features
- Remove feature with smallest 'w'
- Recompute 'w' on reduced data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("salary.csv")
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
6758,47,Self-emp-not-inc,121124,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,>50K
2448,29,Local-gov,251854,HS-grad,9,Never-married,Protective-serv,Not-in-family,Black,Female,0,0,40,United-States,<=50K
29546,25,Private,231638,Some-college,10,Never-married,Tech-support,Unmarried,White,Female,0,0,24,United-States,<=50K
7746,54,?,196975,HS-grad,9,Divorced,?,Other-relative,White,Male,0,0,45,United-States,<=50K
20219,24,Private,35603,Some-college,10,Divorced,Other-service,Not-in-family,Black,Male,0,0,40,United-States,<=50K
14522,39,Private,32650,Assoc-voc,11,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,60,United-States,<=50K
26670,34,Private,174789,Bachelors,13,Never-married,Other-service,Not-in-family,White,Male,0,2001,40,United-States,<=50K
26772,28,Private,412149,10th,6,Never-married,Farming-fishing,Other-relative,White,Male,0,0,35,Mexico,<=50K
6772,28,Private,192591,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,45,United-States,<=50K
4561,32,Local-gov,49325,7th-8th,4,Married-civ-spouse,Other-service,Husband,White,Male,0,0,40,United-States,<=50K


In [3]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

In [4]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### Data Preprocessing

In [6]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

In [7]:
data.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
native-country       42
salary                2
dtype: int64

In [8]:
categorial_data = []
numerical_data = []
for col in data.columns:
    if data[col].dtype == "O":
        categorial_data.append(col)
    else:
        numerical_data.append(col)

In [9]:
le = LabelEncoder()

In [10]:
for category in categorial_data:
    data[category] = le.fit_transform(data[category])

In [11]:
data.sample(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
6908,35,4,108293,15,10,4,1,3,4,0,0,2205,40,39,0
24657,24,4,283092,1,7,4,3,1,2,1,0,0,35,23,0
4777,22,4,287988,15,10,4,12,1,4,1,0,0,20,39,0


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  int32
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  int32
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  int32
 6   occupation      32561 non-null  int32
 7   relationship    32561 non-null  int32
 8   race            32561 non-null  int32
 9   sex             32561 non-null  int32
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  int32
 14  salary          32561 non-null  int32
dtypes: int32(9), int64(6)
memory usage: 2.6 MB


In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X = data.drop('salary',axis=1)
y = data.salary

In [15]:
minmax = MinMaxScaler()
X = minmax.fit_transform(X)

## Feature Selection Methods

In [16]:
def model_train(X,y):
    from sklearn.linear_model import LogisticRegression
    x_train,x_test,y_train,y_test = train_test_split(X,y,train_size=0.78,random_state=42)
    model = LogisticRegression()
    model = model.fit(X,y)
    return model,x_test,y_test

In [17]:
def scoring(y_actual,y_pred):
    from sklearn.metrics import accuracy_score,roc_auc_score,precision_score
    print("Accuracy:",accuracy_score(y_pred=y_pred,y_true=y_actual))
    print("Precision:",precision_score(y_pred=y_pred,y_true=y_actual))
    print("ROC AUC Score:",roc_auc_score(y_true=y_actual,y_score=y_pred))

In [18]:
from sklearn.feature_selection import chi2,mutual_info_classif,f_classif,VarianceThreshold

In [19]:
from sklearn.feature_selection import SelectKBest,SelectPercentile

### 1. Variance Threshold

In [20]:
varThresh = VarianceThreshold()
tranform_data = varThresh.fit_transform(X)

In [21]:
tranform_data.shape

(32561, 14)

In [22]:
model,x_test,y_test = model_train(tranform_data,y)

In [23]:
predict = model.predict(x_test)

In [24]:
scoring(y_test,predict)

Accuracy: 0.8234226689000559
Precision: 0.7129455909943715
ROC AUC Score: 0.6929595815364497


### 2. Chi-square

In [25]:
selectK = SelectKBest(chi2,k=5) 
#k is hyperparameter should be equal more than 0 and less than n_features in the dataset

In [26]:
#in our case we have 14 features

In [27]:
selectK.fit(X,y)
x_trans = selectK.transform(X)

In [28]:
model2,x_test,y_test = model_train(x_trans,y)

In [29]:
predict2 = model2.predict(x_test)

In [30]:
scoring(y_test,predict2)

Accuracy: 0.8055555555555556
Precision: 0.7595541401273885
ROC AUC Score: 0.624877523449632


### 3. F score for classification

In [31]:
f_score_method = SelectKBest(f_classif,k=6) 

In [32]:
f_score_method.fit(X,y)
x_f_score = f_score_method.transform(X)

In [33]:
model3,x_test,y_test = model_train(x_f_score,y)

In [34]:
predict3 = model3.predict(x_test)

In [35]:
scoring(y_test,predict3)

Accuracy: 0.8218872138470128
Precision: 0.7152575315840622
ROC AUC Score: 0.6871725344833388


### 4. Mutual Information for classification

In [36]:
mutu_method = SelectKBest(mutual_info_classif,k=8) 

In [37]:
mutu_method.fit(X,y)
x_mut = mutu_method.transform(X)

In [38]:
model4,x_test,y_test = model_train(x_mut,y)

In [39]:
predict4 = model4.predict(x_test)

In [40]:
scoring(y_test,predict4)

Accuracy: 0.818676716917923
Precision: 0.7034883720930233
ROC AUC Score: 0.6830701109139948
