Feature Selection

https://chat.openai.com/share/e9914f54-8741-479c-89af-bbedec638295

There are various techniques for feature selection. We will discuss each of them in turn.

### Feature Selection - Dropping Constant features

In this step we remove the features that hold the same value in every row. A constant feature carries no information that could help solve the problem at hand.

### Handling a small dataset

In [1]:
# import pandas to create dataframe

import pandas as pd

data = pd.DataFrame({'A':[1,2,4,1,2,4],'B':[4,5,6,7,8,9],'C':[0,0,0,0,0,0],'D':[1,1,1,1,1,1]})

In [2]:
data.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 4 | 0 | 1 |
| 1 | 2 | 5 | 0 | 1 |
| 2 | 4 | 6 | 0 | 1 |
| 3 | 1 | 7 | 0 | 1 |
| 4 | 2 | 8 | 0 | 1 |


Columns C and D each hold a single value in every row, so both features have zero variance.

In [3]:
# To calculate the variance

import numpy as np

# Columns C and D from the DataFrame above, as NumPy arrays
feature_C = np.array([0,0,0,0,0,0])
feature_D = np.array([1,1,1,1,1,1])

# Calculate the variance for each feature
variance1 = np.var(feature_C)
variance2 = np.var(feature_D)

print(f'Variance of Feature 1: {variance1}')
print(f'Variance of Feature 2: {variance2}')


Variance of Feature 1: 0.0
Variance of Feature 2: 0.0
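The same check can be run on the whole DataFrame at once instead of one array per column. The sketch below is only an illustration: the names `variances` and `constant_cols` are not from the original notebook. Note that `np.var` defaults to the population variance (`ddof=0`) while pandas `var()` defaults to the sample variance (`ddof=1`), so `ddof=0` is passed explicitly to match the cell above.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 4, 1, 2, 4],
                     'B': [4, 5, 6, 7, 8, 9],
                     'C': [0, 0, 0, 0, 0, 0],
                     'D': [1, 1, 1, 1, 1, 1]})

# Population variance of every column (ddof=0 matches np.var's default)
variances = data.var(ddof=0)
print(variances)

# A column is constant exactly when it holds a single distinct value
constant_cols = data.columns[data.nunique() == 1].tolist()
print(constant_cols)
```

Counting distinct values with `nunique()` also works for non-numeric columns, where a variance is not defined.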


### Why remove constant features

Removing constant features is a common practice in data preprocessing and feature selection for several important reasons:

#### No Discriminatory Power: 

   Constant features, by definition, have the same value for all observations in your dataset. They do not provide any discriminatory information that can help your machine learning model distinguish between different classes or make meaningful predictions. Essentially, they contain no useful information for the model.

#### Redundancy: 

   Constant features are redundant because they don't add any new information. Keeping them in the dataset does not improve the model's performance, and they can cause numerical issues in some workflows, such as a division by zero when a feature is standardized by its standard deviation.

#### Dimensionality Reduction: 

   Removing constant features reduces the dimensionality of your dataset. Fewer features can lead to faster model training times, lower memory usage, and simpler models that are less prone to overfitting.

#### Improved Interpretability: 

   With fewer features, it's easier to interpret the model's results and understand which features are most important for making predictions. Constant features can clutter your dataset and make it harder to extract meaningful insights.

#### Avoiding Collinearity: 

   A constant feature is perfectly collinear with the intercept (bias) term of a linear model, because both take the same value in every row. More generally, collinearity (multicollinearity) occurs when two or more features are highly correlated, which can result in unstable model coefficients and difficulty in interpreting the importance of individual features.
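The collinearity point can be checked numerically. The following is a minimal sketch, assuming a hypothetical linear-model design matrix with an intercept column; the variable names are illustrative and not part of the original notebook. A constant feature duplicates the intercept column, so the matrix loses full column rank.

```python
import numpy as np

# Hypothetical design matrix: intercept, one informative feature, one constant feature
intercept = np.ones(6)
feature_a = np.array([1, 2, 4, 1, 2, 4], dtype=float)
feature_d = np.ones(6)  # constant feature, same values as column D

X = np.column_stack([intercept, feature_a, feature_d])

# Full column rank would equal the number of columns
rank = np.linalg.matrix_rank(X)
print(f'columns: {X.shape[1]}, rank: {rank}')
```

A rank lower than the number of columns means the least-squares coefficients are not uniquely determined.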

#### Reducing Noise: 

   Constant features carry no signal, so they only add clutter to the dataset. Removing them leaves the model with fewer irrelevant inputs and lets it focus on the features that actually vary.

Overall, removing constant features is a preprocessing step that can help you create a cleaner, more efficient, and more informative dataset for training your machine learning models. It's particularly important when working with large datasets or when you're building models that are sensitive to feature redundancy or irrelevant information.

### NOTE 
With a small dataset we can spot the constant columns by eye and drop them directly. With a large dataset that is not practical, so we use the `VarianceThreshold` class from the sklearn library. It computes the variance of every feature and removes the features whose variance does not exceed the chosen threshold.

### Variance Threshold
In scikit-learn, you can use the VarianceThreshold class from the sklearn.feature_selection module to apply a variance threshold to your dataset and remove low-variance or constant features.

In [4]:
# Fit a VarianceThreshold selector on the small DataFrame

from sklearn.feature_selection import VarianceThreshold

var_thres = VarianceThreshold(threshold=0.3)  # absolute variance cutoff of 0.3, not 30 %
var_thres.fit(data)

In [5]:
var_thres.get_support()  # boolean mask: True = feature kept, False = feature removed

array([ True,  True, False, False])

The mask follows the column order of the DataFrame:

   - A: True (keep)
   - B: True (keep)
   - C: False (remove)
   - D: False (remove)

`True` means the feature's variance is above the threshold, so the feature is kept. `False` means its variance is at or below the threshold, so the feature is removed.
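To see why each feature gets its `True` or `False`, the fitted selector exposes the computed variances in its `variances_` attribute. The sketch below lines them up with the mask; the names `selector` and `report` are illustrative only.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({'A': [1, 2, 4, 1, 2, 4],
                     'B': [4, 5, 6, 7, 8, 9],
                     'C': [0, 0, 0, 0, 0, 0],
                     'D': [1, 1, 1, 1, 1, 1]})

selector = VarianceThreshold(threshold=0.3).fit(data)

# One row per feature: its variance and whether the selector keeps it
report = pd.DataFrame({'variance': selector.variances_,
                       'kept': selector.get_support()},
                      index=data.columns)
print(report)
```

A and B have variances well above 0.3, while C and D have a variance of exactly 0.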

In [6]:
# Collect the columns the selector rejects (the constant columns)

constant_colunms = [column for column in data.columns 
                   if column not in data.columns[var_thres.get_support()]]



In [7]:
constant_colunms  # the columns that will be removed

['C', 'D']

In [8]:
len(constant_colunms)

2

In [9]:
# To get the feature using for loop

for features in constant_colunms:
    print(features)

C
D


In [10]:
# Drop the constant columns (returns a new DataFrame; `data` itself is unchanged)

data.drop(constant_colunms, axis=1)

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 4 | 6 |
| 3 | 1 | 7 |
| 4 | 2 | 8 |
| 5 | 4 | 9 |
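As an alternative to building a list of rejected columns and dropping them, the boolean mask can select the kept columns directly. This is a sketch of that shortcut, not the notebook's own method; `selector` and `reduced` are illustrative names. Unlike a bare `drop` call, the result is assigned to a variable so it can be used later.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({'A': [1, 2, 4, 1, 2, 4],
                     'B': [4, 5, 6, 7, 8, 9],
                     'C': [0, 0, 0, 0, 0, 0],
                     'D': [1, 1, 1, 1, 1, 1]})

selector = VarianceThreshold(threshold=0.3).fit(data)

# Keep only the columns whose mask entry is True; column names and dtypes are preserved
reduced = data.loc[:, selector.get_support()]
print(reduced.columns.tolist())
print(reduced.shape)
```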


### Handling a large dataset

The full training set has about 76,020 rows. For this example I load only the first 10,000 rows to practise on.

In [11]:
df = pd.read_csv('D:\\Feature Selection\\Datasets\\Santander Dataset\\train.csv',nrows = 10000)

In [12]:
df.shape  # 10,000 rows, 371 columns (including TARGET)

(10000, 371)

In [13]:
df.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


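For the small example above we skipped both separating the independent features from the dependent feature and the train/test split.

##### Remember:

On a real dataset, always separate the independent features (X) from the dependent feature (y) and perform the train/test split first, then fit the variance threshold on the training set only.
Fitting on the training data alone keeps information from the test set out of the preprocessing step, which is the practice recommended in the sklearn documentation.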
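The order of operations can be sketched on synthetic data. Everything in this block is a made-up stand-in for the real dataset: the column names `informative` and `constant`, the random values, and the variable names are assumptions for illustration. The selector is fitted on the training rows only, and the columns it keeps are then applied unchanged to the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one varying feature and one constant feature
rng = np.random.default_rng(0)
X = pd.DataFrame({'informative': rng.normal(size=100),
                  'constant': np.zeros(100)})
y = pd.Series(rng.integers(0, 2, size=100))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=43)

# Learn which columns to keep from the training rows only
selector = VarianceThreshold(threshold=0)
selector.fit(X_train)
keep = selector.get_support()

# Apply the same column selection to both splits
X_train_sel = X_train.loc[:, keep]
X_test_sel = X_test.loc[:, keep]
print(X_train_sel.shape, X_test_sel.shape)
```

Because the mask comes from the training set alone, the test set never influences which features are kept.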

In [14]:
# Independent and dependent feature

X = df.drop(['TARGET'],axis=1)
y = df['TARGET']

In [15]:
X.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016


In [16]:
pd.DataFrame(y.head())

|   | TARGET |
|---|--------|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [17]:
# Performing train_test split

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 43)

In [18]:
X_train.shape

(7000, 370)

In [19]:
X_test.shape

(3000, 370)

### Let's apply Variance Threshold

In [20]:
from sklearn.feature_selection import VarianceThreshold

var_thres = VarianceThreshold(threshold = 0)
var_thres.fit(X_train)

In [21]:
# Boolean mask: True marks a non-constant feature
var_thres.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False, False,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True,  True,
        True,  True, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [22]:
sum(var_thres.get_support())  # total number of non-constant columns

272

#### Out of the 370 feature columns in X_train (371 minus TARGET), 272 are non-constant and are kept.

Constant columns = 370 - 272 = 98 (these are the columns to remove).

In [23]:
# Collect the columns the selector rejects (the constant columns)

constant_colunms = [column for column in X_train.columns 
                   if column not in X_train.columns[var_thres.get_support()]]



In [24]:
constant_colunms  # the columns with zero variance, which will be removed

['ind_var2_0',
 'ind_var2',
 'ind_var6',
 'ind_var13_medio_0',
 'ind_var13_medio',
 'ind_var18_0',
 'ind_var18',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var29',
 'ind_var34_0',
 'ind_var34',
 'ind_var41',
 'ind_var46_0',
 'ind_var46',
 'num_var6',
 'num_var13_medio_0',
 'num_var13_medio',
 'num_var18_0',
 'num_var18',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var29',
 'num_var34_0',
 'num_var34',
 'num_var41',
 'num_var46_0',
 'num_var46',
 'saldo_var6',
 'saldo_var13_medio',
 'saldo_var18',
 'saldo_var28',
 'saldo_var27',
 'saldo_var29',
 'saldo_var34',
 'saldo_var41',
 'saldo_var46',
 'delta_imp_amort_var18_1y3',
 'delta_imp_amort_var34_1y3',
 'delta_imp_reemb_var17_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_imp_trasp_var17_out_1y3',
 'delta_imp_trasp_var33_in_1y3',
 'delta_imp_trasp_var33_out_1y3',
 'delta_num_reemb_var17_1y3',
 'delta_num_reemb_var33_1y3',
 'delta_num_trasp_var17_out_1y3',
 'delta_num_trasp_var33_in_1y3',
 'delta_n

In [25]:
len(constant_colunms)  # number of columns to remove

98

In [26]:
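# Drop the constant columns from the training set (the same columns should also be dropped from X_test)
X_train.drop(constant_colunms, axis=1)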

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
2724,5427,2,23,3.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,207370.380000
5057,10100,2,64,0.0,54.33,77.46,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,36737.850000
6027,12112,2,23,0.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46381.860000
6729,13511,2,25,0.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,198533.370000
9132,18381,2,37,0.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,105892.620000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8499,17120,2,39,30.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65245.980000
2064,4120,2,26,0.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70188.900000
7985,16098,2,23,0.0,0.00,0.00,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,87213.870000
2303,4628,2,60,0.0,2110.05,2743.26,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,175329.960000


### Threshold Values

The threshold value is chosen for the dataset at hand, typically by comparing model performance under cross validation.

The threshold is an absolute variance, not a percentage, so what counts as "low" depends on the scale of each feature.

- Threshold 0 removes only the features with zero variance (the constant features). This is the default.
- Threshold 0.3 removes the features whose variance is 0.30 or below.
- Threshold 0.5 removes the features whose variance is 0.50 or below.
- Threshold 0.7 removes the features whose variance is 0.70 or below.
- Threshold 1.0 removes the features whose variance is 1.00 or below.

Features whose variance is low but not zero are known as quasi-constant features, and a threshold above 0 removes them along with the constant ones.
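For binary (0/1) features there is a convenient way to pick a non-zero threshold: the variance of such a feature is p(1 - p), where p is the fraction of ones. The sketch below follows the example in the scikit-learn feature selection guide and removes features that take the same value in more than 80% of the rows. The toy array and the names `p`, `selector` and `X_reduced` are illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary data: the first column is 0 in five of six rows
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

# Variance of a 0/1 feature is p * (1 - p); p = 0.8 targets "same value in over 80% of rows"
p = 0.8
selector = VarianceThreshold(threshold=p * (1 - p))
X_reduced = selector.fit_transform(X)

print(selector.variances_)
print(selector.get_support())
print(X_reduced.shape)
```

Only the first column falls below the cutoff, so the reduced array keeps the other two columns.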