# Feature Selection

<img src="../imgs/03.3.02_1.png" align="right"  width=600><br>

<p><font size="3" face="Arial" font-size="large"><ul type="square">
    
<li><a href="#1"> Purpose </a></li>

<li><a href="#g1"> Types of feature filtering</a>
<ul>
<li><a href="#g01">Linear Correlation</a></li>
<li><a href="#g02">Phik Correlation</a></li>
<li><a href="#g1">Feature importance</a></li>
<li><a href="#g2">Permutation importance</a></li>
<li><a href="#g3">SHAP values</a></li>
<li><a href="#g4">CatBoost feature selection </a></li>    
<li><a href="#g4">Boruta</a></li>
<li><a href="#g4">BoostARoota</a></li>
    
</ul>
</li>
    
<li><a href="#check1">Review features</a></li>
<li><a href="#6">Conclusion</a></li>

</ul></font></p>

### Purpose

<div class="alert alert-info">

As we saw in the lessons on **Feature engineering** and automatic feature generation, it is easy to generate a large number of features quickly, especially in automatic mode. Why, then, throw some of them away? <br> 
There are several main reasons:
* The most obvious reason is resources: with a large number of features, the data may no longer fit in memory, and training time can grow significantly, especially if we want to test several different algorithms or an ensemble. This matters even more on platforms that limit session duration (12 hours on Kaggle) and memory consumption.
* The main reason, however, is that prediction accuracy often decreases as the number of features grows, especially when the data contains many junk features (ones that barely correlate with the target). Some algorithms stop working adequately altogether when the number of features grows sharply. In other words, the result is **overfitting**.
* Even if accuracy does not decrease, there is a risk that the model relies on noisy features, which makes its predictions less stable on the private sample.
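
The effect of junk features on accuracy can be illustrated with a small experiment. This is only a sketch on synthetic data: the dataset, the choice of a k-nearest-neighbours model (which is particularly sensitive to irrelevant dimensions), and the number of noise columns are all assumptions made for the illustration, not part of the lesson's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data with five informative features and nothing else
X_clean, y = make_classification(
    n_samples=500, n_features=5, n_informative=5, n_redundant=0,
    random_state=0,
)

# Append 200 pure-noise columns that carry no information about the target
X_noisy = np.hstack([X_clean, rng.normal(size=(500, 200))])

model = KNeighborsClassifier(n_neighbors=5)

# Mean 5-fold cross-validated accuracy with and without the junk columns
acc_clean = cross_val_score(model, X_clean, y, cv=5).mean()
acc_noisy = cross_val_score(model, X_noisy, y, cv=5).mean()

print(f"accuracy without junk features:  {acc_clean:.3f}")
print(f"accuracy with 200 junk features: {acc_noisy:.3f}")
```

Distance-based models suffer most in this setup; tree ensembles usually degrade more slowly, so the size of the gap depends on the algorithm.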

## 1. Feature Selection Methods
There are three main groups of feature selection methods:
* Filter methods
* Wrapper methods
* Embedded methods

<div class="alert alert-info">    
    
## **1.1 - Filter methods**

These methods rely on statistical measures and, as a rule, consider each feature independently. They let you score and rank features by significance, which is usually taken to be the strength of the relationship (for example, correlation) between the feature and the target variable. The main advantage of this group is low computational cost, which grows linearly with the total number of features, so they are significantly faster than both wrapper and embedded methods. They also work well when the number of features exceeds the number of examples in the training sample, which is not always true of methods in the other categories. <br>
The main disadvantage is that each feature is considered in isolation from the others, so these methods are less accurate; their strength is fast, coarse screening of features. Newer filter methods try to address this problem in different ways, for example by using mutual information between features or by accounting for feature redundancy (the mRMR method: minimum redundancy, maximum relevance). <br>
Some of these methods are implemented in the `sklearn.feature_selection` module of the scikit-learn library, for example `SelectKBest` with the `chi2` scoring function. **Feature importance** from gradient boosting libraries can arguably be placed in this group as well, although it can just as reasonably be classed as an embedded method; the classification is debatable.
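
A minimal sketch of a filter method with `SelectKBest` follows. The synthetic dataset and the choice of `f_classif` (the ANOVA F-test) as the scoring function are assumptions for the illustration; `chi2` could be used instead, but it requires non-negative feature values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    random_state=0,
)

# Score every feature independently against the target and keep the top 5
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Column indices sorted from the highest to the lowest univariate score
ranking = np.argsort(selector.scores_)[::-1]
selected_idx = selector.get_support(indices=True)

print("selected columns:", selected_idx)
print("reduced shape:", X_selected.shape)
```

Because each feature is scored in isolation, a feature that is useful only in combination with others can receive a low score here, which is exactly the limitation described above.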

<div class="alert alert-info">    

## **1.2 - Embedded methods**

Embedded in what? Directly in the model's training process. These methods do not separate training from feature selection: selection happens during training, and the resulting model knows which features deserve more attention and which are garbage. They require less computation than wrapper methods but more than filter methods. As you have probably guessed, the main methods in this group are our "old friends": regularization (for example, LASSO and Ridge regression; note that only the L1 penalty used by LASSO can shrink coefficients exactly to zero, while Ridge only makes them smaller), the DecisionTree and RandomForest algorithms, and the regularization built into boosting models and neural networks.<br>
Regularization also has disadvantages: you have to train the model on all features at least once and inspect the coefficients, which is not always convenient or feasible. A model trained on all features will also be slower at inference. In general, though, this approach captures interdependencies between variables better than filter methods do.
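
As a sketch of an embedded method, L1 regularization can be combined with scikit-learn's `SelectFromModel`, which keeps the features whose coefficients were not shrunk to (near) zero. The synthetic regression data and the value `alpha=2.0` are assumptions chosen for the illustration; in practice the penalty strength would be tuned, for example by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 of which influence the target
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=10.0,
    random_state=0,
)

# Put features on one scale so the L1 penalty treats them equally
X_scaled = StandardScaler().fit_transform(X)

# Training the LASSO model and selecting features happen in a single fit
selector = SelectFromModel(Lasso(alpha=2.0)).fit(X_scaled, y)
X_reduced = selector.transform(X_scaled)

kept_idx = selector.get_support(indices=True)
print("kept columns:", kept_idx)
print("reduced shape:", X_reduced.shape)
```

A larger `alpha` zeroes out more coefficients and therefore keeps fewer features.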

<div class="alert alert-info">    

## **1.3 - Wrapper methods**
What do they wrap? They wrap model training in a loop that sequentially removes features (backward feature selection) or adds them (forward feature selection). Backward feature selection is better at tracking the relationships between features, but it is much more computationally expensive.<br>
The main disadvantage of all wrapper methods is long computation time. In addition, with a large number of features and a small training dataset, these methods are prone to overfitting.<br>
Some examples of such methods: RFE (recursive feature elimination) from the **scikit-learn** package, **Boruta** from the **BorutaPy** package (built around the **RandomForest** algorithm), etc.<br>
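
Backward elimination with RFE can be sketched as follows. The synthetic dataset, the logistic regression base estimator, and the target of four features are assumptions for the illustration; any estimator that exposes `coef_` or `feature_importances_` can be wrapped in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 15 features, only 4 of which are informative
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=4, n_redundant=0,
    random_state=0,
)

# Refit the model repeatedly, dropping the weakest feature each round
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    step=1,
)
rfe.fit(X, y)

selected_idx = rfe.get_support(indices=True)
print("selected columns:", selected_idx)
# ranking_ is 1 for kept features; larger values were eliminated earlier
print("elimination ranking:", rfe.ranking_)
```

With `step=1` the model is refitted once per removed feature, which is where the long computation time of wrapper methods comes from.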

<div class="alert alert-info">    

Sometimes it is hard to say unambiguously which group a particular method belongs to, or whether it is a hybrid that combines several. For example, the **CatBoost** gradient boosting library lets you use all three types of methods or a combination of them. If you simply take feature importance with default parameters and regularization disabled, you get a filter method. If you add regularization parameters to the model, you get a kind of hybrid: a filter method combined with regularization (an embedded method). If you use CatBoost's built-in feature selection (the `select_features()` method), you get a wrapper method with backward feature selection.