# **Feature selection**
Features, attributes, columns and properties are all synonyms for us.

The significance of attributes for the purposes of data mining can vary widely:
* **Irrelevant attributes**: they can alter the results of some mining algorithms, particularly when overfitting is not sufficiently controlled;
* **Redundancies**: some attributes can be strongly related to other useful attributes;
> * **Alteration**: some mining algorithms (e.g. Naive Bayes) are strongly affected by highly correlated attributes, since they combine per-attribute probabilities as if the attributes were independent.
* **Confounding**: some attributes can be misleading, for example by having a **hidden effect** on the outcome variable;
> **Mixed effect**: e.g. one attribute could be strongly related to the class in 65% of the cases and random in the other cases.

**Why feature selection?** Sometimes less is better. It may:
* Enable the machine learning algorithm to train faster;
* Reduce the complexity of a model, making it easier to interpret;
* Improve the accuracy of a model;
* Reduce overfitting.

Note: a specific selection action may achieve only one of the above effects.

### Supervised or not?
* **Unsupervised**: lots of methods are available (e.g. clustering, feature transformations such as PCA);
* **Supervised**: considers the relationship between each attribute and the class (e.g. filter methods, scheme-dependent and scheme-independent selection, wrapper methods, embedded methods such as Lasso and Ridge regression).

## Filter methods
The assessment is based on general characteristics of the data.
A filter method selects the subset of attributes independently of the mining model that will be used.

* **Pearson's correlation**: quantifies the amount of linear dependence between two variables;
* **LDA - Linear discriminant analysis**: finds a linear combination of features that characterizes or separates two or more classes;
* **ANOVA - Analysis of variance**: similar to LDA, but uses independent categorical features and a dependent continuous feature;
* **Chi-Square**: a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them, using their frequency distribution.

![](https://i.ibb.co/gr4F0t1/zin.jpg)
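As a minimal sketch of a filter method, the snippet below scores each column by its Pearson correlation with the target and keeps the ones whose absolute correlation reaches a threshold. The helper names (`pearson`, `filter_by_correlation`), the threshold value and the toy data are assumptions made for illustration, not part of these notes.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def filter_by_correlation(columns, target, threshold):
    """Keep the columns whose |r| with the target reaches the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores

# Toy data, made up for illustration.
columns = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],       # grows linearly with the target
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],    # symmetric, so no linear trend
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],   # no variance at all
}
target = [10.0, 20.0, 30.0, 40.0, 50.0]

selected, scores = filter_by_correlation(columns, target, threshold=0.5)
print(selected)  # ['size']
```

No model is trained at any point: the decision depends only on a statistic of the data, which is what makes the approach fast.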

## Wrapper methods
Wrapper methods train a model on a subset of features and, depending on how it performs, add features to or remove features from the subset. This is essentially a search problem, usually tackled greedily, with a high computational cost.
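A possible sketch of the greedy search is forward selection: start from an empty subset and repeatedly add the feature that most lowers the error of a model trained on the subset. The code assumes a least-squares linear model as the learner and a single train/validation split instead of full cross-validation; `validation_error`, `forward_selection` and the synthetic data are hypothetical names and values chosen for the example.

```python
import numpy as np

def validation_error(X_train, y_train, X_val, y_val, subset):
    """Fit least squares on the chosen columns and return the validation MSE."""
    A_train = np.column_stack([X_train[:, subset], np.ones(len(X_train))])
    A_val = np.column_stack([X_val[:, subset], np.ones(len(X_val))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X, y, k, n_train):
    """Greedily add, one at a time, the feature that most lowers validation error."""
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:], y[n_train:]
    selected = []
    candidates = list(range(X.shape[1]))
    while len(selected) < k:
        # One model is trained per candidate: this is where the cost comes from.
        errors = {
            f: validation_error(X_train, y_train, X_val, y_val, selected + [f])
            for f in candidates
        }
        best = min(errors, key=errors.get)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Assumed ground truth: only columns 0 and 3 influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

chosen = forward_selection(X, y, k=2, n_train=100)
print(chosen)
```

Selecting `k` features out of `d` this way trains roughly `k * d` models, which is far cheaper than trying every subset but still much more expensive than a filter.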


### RFE - Recursive feature elimination
RFE ranks features by recursively eliminating them.
It uses an external estimator to assign weights to the features and considers smaller and smaller sets of features.

The estimator is first trained on the initial set of features and the importance of each feature is obtained: the least important ones are pruned.

The procedure is then repeated on the pruned set and stops when the desired number of features is reached.
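The loop described above can be sketched as follows. This is an illustrative from-scratch version, not a library implementation: it assumes a least-squares linear model as the external estimator, uses the absolute value of each coefficient as the feature weight (which is only meaningful when the features are on comparable scales), and the function name `rfe` and its `step` parameter are choices made for the example.

```python
import numpy as np

def rfe(X, y, n_features_to_select, step=1):
    """Recursive feature elimination with a least-squares linear model.

    Returns the indices (into the original columns) of the surviving features.
    """
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Train the estimator on the current subset of features.
        A = np.column_stack([X[:, remaining], np.ones(len(X))])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        importance = np.abs(coef[:-1])  # ignore the intercept
        # Prune the least important features, never going below the target size.
        n_drop = min(step, len(remaining) - n_features_to_select)
        drop = set(np.argsort(importance)[:n_drop].tolist())
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Assumed ground truth: only columns 0 and 3 influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

kept = rfe(X, y, n_features_to_select=2)
print(kept)
```

A larger `step` prunes several features per iteration, trading some ranking precision for fewer training rounds.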

### Difference between filter and wrapper methods

Filter | Wrapper
--- | ---
Measure the relevance of features by their correlation with the dependent variable | Measure the usefulness of features by actually training a model (stronger)
Faster (no training) | Slower and expensive
Statistical methods for evaluation | Cross-validation for evaluation
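Might fail to find the best subset (suboptimal) | Can find a better subset for the chosen model (the best one is guaranteed only by an exhaustive search)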
Less prone to overfit | More prone to overfit

## Dimensionality reduction
Instead of deciding which subset of attributes to ignore, it is possible to map the dataset into a new space with fewer attributes (e.g. PCA).

### PCA - Principal component analysis
![](https://i.ibb.co/5RxHrqJ/photo-2021-01-03-18-33-47.jpg)

The covariance matrix is positive semidefinite, so its eigenvalues are non-negative. They are sorted in decreasing order, and the eigenvectors are sorted according to the same order.
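A minimal sketch of these steps with NumPy, assuming the usual formulation (center the data, eigendecompose the covariance matrix, project onto the leading eigenvectors). The function name `pca` and the toy data are illustrative.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in
    # ascending order, so they are re-sorted in decreasing order here.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]  # keep eigenvectors aligned
    components = eigenvectors[:, :n_components]
    return X_centered @ components, eigenvalues

rng = np.random.default_rng(0)
# Toy data: the second column is almost a copy of the first, so a single
# component should capture nearly all of the variance.
a = rng.normal(size=300)
X = np.column_stack([a, a + 0.05 * rng.normal(size=300)])

projected, eigenvalues = pca(X, n_components=1)
explained = eigenvalues / eigenvalues.sum()
print(projected.shape, explained)
```

Each eigenvalue is the variance of the data along the corresponding eigenvector, so the ratio `explained` is a common guide for choosing how many components to keep.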

### MDS - Multi-dimensional scaling
It's a presentation technique, useful even just to obtain a visual representation of the data.

Starting from the distances among the elements of the dataset, it fits a projection of the elements into an $m$-dimensional space in such a way that the distances among the elements are preserved as well as possible.

Variants exist for both metric and non-metric spaces.
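As an illustration, the sketch below implements classical (Torgerson) MDS, which is only one metric variant of the technique: it double-centers the squared distances and takes the leading eigenvectors. The function names and the four-point example are assumptions made for the demonstration.

```python
import numpy as np

def classical_mds(D, m):
    """Embed n items in m dimensions from an n x n matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    top = np.argsort(eigenvalues)[::-1][:m]
    # Clip tiny negative eigenvalues caused by rounding (or non-Euclidean input).
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Four corners of a 3 x 4 rectangle: only their distances are given to MDS.
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = pairwise_distances(points)

embedding = classical_mds(D, m=2)
print(np.allclose(pairwise_distances(embedding), D))  # True
```

The recovered coordinates generally differ from the original ones by a rotation, reflection or translation, since only the distances are constrained.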

## Univariate feature selection
Selects the best set of features based on univariate statistical tests:
* Consider the original set of features and the target;
* For each feature, return a score and a p-value;
* Then apply one of:
> * **Select k-best**: removes all but the $k$ highest scoring features;
> * **Select percentile**: removes all but a user-specified highest scoring percentage of features.

**Scoring function**: used by the feature selector to evaluate how useful a feature is for predicting the target (e.g. **mutual information**, a generalization of information gain).
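A small sketch of both selectors with mutual information as the scoring function, assuming discrete features and a discrete target. P-values are left out of this sketch, and the function names, the rounding rule in `select_percentile` and the toy data are illustrative choices.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

def select_k_best(features, target, k, score_func=mutual_information):
    """Keep the k features with the highest score."""
    scores = {name: score_func(values, target) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

def select_percentile(features, target, percentile, score_func=mutual_information):
    """Keep the given top percentage of features (rounded down, at least one)."""
    k = max(1, int(len(features) * percentile / 100))
    return select_k_best(features, target, k, score_func)

# Toy data, made up for illustration.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "copy": [0, 0, 0, 0, 1, 1, 1, 1],     # identical to the target
    "partial": [0, 0, 0, 1, 1, 1, 1, 0],  # agrees with the target 6 times out of 8
    "random": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of the target
}

best, scores = select_k_best(features, target, k=2)
top_half, _ = select_percentile(features, target, percentile=50)
print(best, top_half)
```

Because every feature is scored on its own, two redundant features can both end up selected: univariate scores say nothing about the interactions between features.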