# Feature selection
Feature selection is the process of selecting a subset of relevant features
(variables, predictors) for use in building a machine learning model.

WHY SHOULD WE SELECT FEATURES?
- Simple models are easier to interpret
- Shorter training times
- Enhanced generalisation by reducing overfitting
- Easier for software developers to implement
- Reduced risk of data errors during model use
- Removes redundant variables
- Avoids poor learning behaviour in high-dimensional spaces

A feature selection algorithm can be seen as the combination of a search
technique for proposing new feature subsets, along with an evaluation measure
which scores the different feature subsets.
- Searching over feature subsets is computationally expensive
- Different feature subsets give optimal performance for different machine learning algorithms

FEATURE SELECTION METHODS:
- Filter Methods
- Wrapper Methods
- Embedded Methods

**********************
FILTER METHODS
- Rely ONLY on the characteristics of the data (the features themselves)
- Do not use machine learning algorithms: variables are selected independently of the model
- Model agnostic (adv): features selected by these procedures can be used with any ML algorithm
- Tend to be less computationally expensive (fast computation)
- Usually give lower prediction performance than wrapper methods
- Are very well suited for a quick screen and removal of irrelevant features
- These methods use a statistical measure, such as the correlation between each feature and the target, to assign a score to each feature. Features are then sorted by score, and the ranking is used to keep or eliminate specific features
- Common methods
  - Pearson correlation coefficient
  - Chi-square statistic
  - Mutual information
- Limitations: filter methods tend to select redundant variables, because the relationships between features are not considered
- disadv: they evaluate variables individually and ignore the effect of the selected feature subset on the performance of the algorithm
- Filter: select features by looking at their variance, correlation, or univariate scores
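
As a minimal sketch of variance-based filtering, the example below uses scikit-learn's `VarianceThreshold` on a toy DataFrame. The column names, the synthetic data, and the threshold of 0.01 are assumptions chosen for illustration, not values from these notes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

# Toy data (hypothetical columns, for illustration only)
X = pd.DataFrame({
    "informative": rng.normal(size=n),                      # varies freely
    "constant": np.ones(n),                                 # same value in every row
    "quasi_constant": np.r_[np.ones(5), np.zeros(n - 5)],   # 99.5% zeros
})

# A 0/1 feature with 0.5% ones has variance 0.005 * 0.995, about 0.005,
# so a threshold of 0.01 removes it together with the constant feature.
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

kept = X.columns[selector.get_support()].tolist()
print(kept)
```

Only the data is inspected here; no target and no model are involved, which is what makes this a filter method.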

- A typical univariate filter algorithm consists of 2 steps:
   - Rank features according to a certain criterion (RANKING CRITERIA); each feature is ranked independently (univariately) of the rest of the feature space
   - Select the highest-ranking features
   - disadv: may select redundant variables, because the relationships between features are not considered
- RANKING CRITERIA (ranking features): feature scores on various statistical tests:
  - Chi-square | Fisher Score
  - Univariate parametric tests (ANOVA)
  - Mutual information
  - Variance (to detect constant and quasi-constant features)
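
The two-step univariate procedure (rank, then keep the top features) can be sketched with scikit-learn's `SelectKBest`. The synthetic dataset, the choice of the ANOVA F-test (`f_classif`) as ranking criterion, and `k=3` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data (an assumption, not data from these notes)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# Step 1: rank features, each one scored independently with the ANOVA F-test.
# Step 2: keep the k highest-ranking features.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Feature indices sorted from highest to lowest score
ranking = selector.scores_.argsort()[::-1]
print("ranking:", ranking)
print("kept:", selector.get_support(indices=True))
```

Swapping `f_classif` for `mutual_info_classif` or `chi2` (the latter needs non-negative features) changes the ranking criterion without changing the procedure.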
  
  
- MULTIVARIATE filter selection methods:
  - Handle redundant features by considering each feature in relation to the other features of the dataset
  - Scan for duplicated features or correlated features
  - adv: simple yet powerful methods to quickly remove irrelevant and redundant features
  - Filter methods are usually the first step in a feature selection procedure
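
A rough sketch of scanning for duplicated and correlated features with pandas follows. The toy columns and the 0.9 correlation cut-off are assumptions; in practice the cut-off depends on the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)

# Toy data (hypothetical columns, for illustration only)
X = pd.DataFrame({
    "a": a,
    "a_copy": a,                                     # exact duplicate of "a"
    "a_noisy": a + rng.normal(scale=0.01, size=n),   # almost identical to "a"
    "b": rng.normal(size=n),                         # unrelated to "a"
})

# 1. Duplicated features: identical columns are identical rows of X.T
duplicated = X.columns[X.T.duplicated()].tolist()
X = X.drop(columns=duplicated)

# 2. Correlated features: keep only the upper triangle of the absolute
#    correlation matrix, so each pair is checked once and the first
#    feature of a correlated pair is the one that is kept.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = X.drop(columns=correlated)

print(duplicated, correlated, X.columns.tolist())
```

Transposing the frame to find duplicates is simple but memory-hungry on wide datasets; it is used here only because the example is small.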

***********************

************************
WRAPPER METHODS
- Use predictive machine learning models to score each feature subset
- Train a new model on each feature subset and select the subset that gives the best-performing model
- Tend to be very computationally expensive, because several ML models are built at each round of feature selection
- Usually provide the best-performing feature subset for a given machine learning algorithm
- They may not produce the best feature combination for a machine learning model other than the one used to select the features
- (So use the same model type to select the features and to build the final model on them)
- Wrapper methods treat feature selection as a search problem in which different combinations are evaluated and compared
- A predictive model is used to evaluate each combination of features and assign it a score based on model performance (for example, accuracy)
- Common methods
   - Recursive feature elimination (RFE)
- Limitations:
   - Wrapper methods train a new model for each subset, resulting in a huge number of computations
   - The best-performing feature set is usually specific to one type of model
- Wrapper: Forward selection, Backward selection, Exhaustive search
- adv: detect interactions between variables; find the optimal feature subset for the desired classifier
- Steps and procedure:
    - Search for a subset of features
    - Build a machine learning model on the selected feature subset
    - Evaluate model performance
    - Repeat for a different subset of features
- BUT, How to search for the subset of features?
- BUT, How to stop the search? 
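
The generic wrapper loop (pick a subset, build a model, evaluate, repeat) might look like the sketch below. The candidate subsets are hand-picked placeholders and the classifier, dataset, and 5-fold cross-validation are assumptions; a real wrapper method generates the subsets with a search strategy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption, not data from these notes)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)

# Hypothetical candidate subsets, given as tuples of column indices
candidate_subsets = [(0, 1), (2, 3, 4), (0, 2, 5), (0, 1, 2, 3, 4, 5)]

scores = {}
for subset in candidate_subsets:
    # Build a model on the selected feature subset and evaluate it
    model = LogisticRegression(max_iter=1000)
    scores[subset] = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

# Select the subset with the best cross-validated accuracy
best_subset = max(scores, key=scores.get)
print(best_subset, round(scores[best_subset], 3))
```

Every subset costs one full round of model training and evaluation, which is why the number of subsets explored drives the cost of a wrapper method.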

Search: 3 mechanisms to search for the feature subsets:
- Forward feature selection
   - Starts with no features and adds 1 feature at a time until a predefined criterion is met
- Backward feature elimination
   - Starts with all the features and removes the least significant feature at each iteration until a criterion is met
- Exhaustive feature search
   - Searches across all possible feature combinations, builds a model for each of them, and selects the best
   
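- These SEARCH algorithms:
  - Forward selection and backward elimination are greedy algorithms: at each step they make the best local choice (add or remove one feature) without evaluating all possible combinations
  - Exhaustive search evaluates all possible combinations (2^n - 1 non-empty subsets for n features), so it finds the best subset but is often impracticable
  - All of them aim to find the best possible combination and are computationally expensive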
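- STOPPING CRITERIA (somewhat arbitrary, to be defined by the user):
  - Performance no longer increases by more than a chosen threshold (forward selection)
  - Performance decreases by more than a chosen threshold (backward elimination)
  - Predefined number of features is reached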
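
To make the search and the stopping criteria concrete, here is a hand-written sketch of greedy forward selection. The function name `forward_selection`, the `min_gain` threshold, and the synthetic data are all assumptions for illustration; this is not the implementation used by any particular library.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, model, max_features, min_gain=0.01, cv=5):
    """Greedy forward selection (illustrative helper, not a library API)."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    # Stopping criterion 1: predefined number of features is reached
    while remaining and len(selected) < max_features:
        # Score every candidate obtained by adding one more feature
        trial_scores = {
            f: cross_val_score(model, X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        best_feature = max(trial_scores, key=trial_scores.get)
        # Stopping criterion 2: performance increase is below the threshold
        if trial_scores[best_feature] - best_score < min_gain:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score


X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)
selected, score = forward_selection(
    X, y, LogisticRegression(max_iter=1000), max_features=4
)
print(selected, round(score, 3))
```

Both scikit-learn and MLXtend provide a `SequentialFeatureSelector` class that covers forward and backward search, so a hand-written loop like this is mainly useful for understanding the mechanics.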

- SUMMARY:
- Better predictive accuracy than filter methods
- Best-performing feature subset for the predefined classifier
- Computationally expensive
- Stopping criteria are relatively arbitrary
***************************

************************                
EMBEDDED METHODS
- Perform feature selection as part of the model construction process (embedded in the algorithm)
- Consider the interaction between features and models (an advantage shared with wrapper methods)
- They are less computationally expensive than wrapper methods, because they fit the machine learning model only once (single ML model)
- The most common type of embedded feature selection method is the regularization method
- Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm; these bias the model toward lower complexity and can reduce the number of features
- Common methods
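   - Lasso regression
   - Ridge regression (shrinks coefficients without setting them exactly to zero, so selecting features requires a threshold on coefficient size)
   - Embedded: LASSO, tree importance (random forest), regression coefficients of linear models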
- Embedded methods have the advantages of both filter and wrapper methods
- Include the interaction of the feature with the classifier or the regressor, LIKE wrapper methods
- They are less computationally intensive, LIKE filter methods
- Faster than wrapper methods
- More accurate than filter methods
- Detect interactions between variables
- Find the feature subset for the algorithm being trained
- Embedded methods are part of training the ML algorithm. PROCEDURE:
   - Train a machine learning algorithm using all the features
   - Derive the feature importance according to the algorithm used
   - Remove non-important features following some criteria
- SUMMARY:
- Better predictive accuracy than filter methods
- Faster than wrapper methods and less computationally expensive
- Generally give good feature subsets for the algorithm used
- Constrained by the limitations of the algorithm
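
The embedded procedure (train on all features, derive importance, remove non-important features) can be sketched with Lasso and scikit-learn's `SelectFromModel`. The synthetic regression data and `alpha=1.0` are assumptions; the penalty strength normally has to be tuned.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, not data from these notes)
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0,
    random_state=0,
)
# Put features on a common scale so the L1 penalty treats them equally
X = StandardScaler().fit_transform(X)

# Train one model on all features; the L1 penalty drives some
# coefficients to zero, and those features are removed.
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X, y)

coefs = selector.estimator_.coef_
kept = selector.get_support(indices=True)
X_selected = selector.transform(X)
print("coefficients:", coefs.round(2))
print("kept features:", kept)
```

Replacing `Lasso` with a tree ensemble such as `RandomForestRegressor` makes `SelectFromModel` use the model's feature importances instead of its coefficients.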
*************************

# Open-source libraries for feature selection / engineering

Scikit-learn - MLXTEND - Feature-engine

- fit() >>> finds important features
- transform() >>> transforms data, Removes unwanted features
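
A small self-contained sketch of the `fit()` / `transform()` pattern with a scikit-learn `Pipeline` of two selectors is shown here. The name `selection_pipe`, the step names, the synthetic data, and `k=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption, not data from these notes)
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical pipeline: drop constant features, then keep the 5 best-scoring
selection_pipe = Pipeline([
    ("drop_constant", VarianceThreshold(threshold=0.0)),
    ("top_k", SelectKBest(score_func=f_classif, k=5)),
])

# fit() learns which features to keep, using the training set only
selection_pipe.fit(X_train, y_train)

# transform() removes the unwanted features from any dataset
X_train_sel = selection_pipe.transform(X_train)
X_test_sel = selection_pipe.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting on the training set only and then transforming both sets keeps information from the test set out of the selection step.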

![image.png](attachment:image.png)

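```python
# price_pipe is assumed to be a pipeline of feature-selection steps
# defined earlier (its definition is not shown in these notes).

# train pipeline: each step learns which features to keep
price_pipe.fit(X_train, y_train)

# transform data: remove the unwanted features from both sets
X_train_selected = price_pipe.transform(X_train)
X_test_selected = price_pipe.transform(X_test)
```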