# PreProcessing: Feature Selection
Feature selection is an important step in data pre-processing. It consists of selecting the subset of input variables that is most relevant to the problem. Discarding irrelevant data before applying a machine learning algorithm helps to:
* *Reduce Overfitting*: there is less opportunity to make decisions based on noise;
* *Improve Accuracy*: less misleading data means better modelling accuracy, since predictions can be greatly distorted by redundant attributes;
* *Reduce Training Time*: with less data, the algorithms train faster.


### Notebook Configuration
Before running the notebook, check the following parameter values, which configure the dataset and the method options:
- **DATA_NORMALIZED**: if *True*, data will be normalized between 0 and 1 before feature selection;
- **DATA_STANDARDIZED**: if *True*, data will be standardized before feature selection;
- **RESULTS_NORMALIZED**: if *True*, the results of feature selection will be normalized between 0 and 1;
- **TARGET_VARIABLE**: the name of the target variable used for the regression analysis.

In [15]:
DATA_NORMALIZED = False
DATA_STANDARDIZED = True
RESULTS_NORMALIZED = True
TARGET_VARIABLE = 'pm25_int'

### Import Libraries

In [2]:
import scipy.stats as stats
import geopandas as gpd
import warnings
from fs import methods as m
import pandas as pd
warnings.filterwarnings("ignore")

### Data import
Data are imported from a .gpkg file and split into **train_set** and **test_set**, which are used by the supervised algorithms. With unsupervised algorithms, the entire dataset is processed instead.

In [3]:
#read gpkg file
data = gpd.read_file('grids/test_grid.gpkg')
#read variables which are not null
labels = m.check_NotNull(pd.DataFrame(data))


### Definition of dataframe used for feature selection


In [None]:
X = pd.DataFrame(data, columns=labels)
Y = X[[TARGET_VARIABLE]]
Y = Y.values.ravel()
X.pop(TARGET_VARIABLE)
X.pop('geometry')
#coordinates definition used for mgwr
coords = list(zip(data['lat_cen'], data['lng_cen']))

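# Fallback copy so that X_notStand exists even when DATA_STANDARDIZED is False
# (the variance threshold cell below relies on it)
X_notStand = X.copy()

if DATA_NORMALIZED:
    # keep the original values before rescaling
    X_notNorm = X.copy()
    X = m.NormalizeData2D(X)

if DATA_STANDARDIZED:
    # keep the values as they were before standardization
    X_notStand = X.copy()
    X = X.apply(stats.zscore)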
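
The two scaling options can be illustrated on a toy frame. This is only a minimal sketch: the min-max formula is an assumption about what `m.NormalizeData2D` does (its implementation is not shown in this notebook), and the column names are made up.

```python
import pandas as pd
import scipy.stats as stats

# Toy data; the column names are hypothetical
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# Min-max normalization: rescale each column to the range between 0 and 1
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())

# Standardization: zero mean and unit (population) standard deviation per column
toy_std = toy.apply(stats.zscore)

print(toy_norm)
print(toy_std)
```

Normalization preserves the shape of each distribution within fixed bounds, while standardization makes columns comparable in units of standard deviation.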

Since no single feature selection technique is best in every case, several different methods are applied. The aim of this part is to discover by experimentation which ones work best for this specific problem.
In this study, I chose mainly supervised methods, which are classified into three groups according to their approach.


## Filter Methods
Filter-based feature selection methods use statistical measures to evaluate the correlation or dependence between the input variables and the target.
They select features from the dataset without using any machine learning algorithm. They are computationally very fast and well suited to removing duplicated, correlated and redundant variables. However, these methods do not remove multicollinearity.


### Pearson correlation index
The Pearson index is a measure of linear correlation, defined as the ratio between the covariance of two variables and the product of their standard deviations. The algorithm evaluates the index for each feature variable.

In [None]:
# `scores` avoids shadowing the built-in `tuple`
scores = m.pearson(X, Y, RESULTS_NORMALIZED)
# save the results to an external .csv file
scores.to_csv(r'results/pearson.csv', index=False)
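
As an illustration of the definition above, the following sketch computes the Pearson index feature by feature with `scipy.stats.pearsonr` on synthetic data. It is not the implementation of `m.pearson`, which is not shown here; the data and feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
signal = rng.normal(size=200)

# Hypothetical features: one linearly related to the target, one pure noise
features = pd.DataFrame({
    "linear": signal,
    "noise": rng.normal(size=200),
})
target = 2.0 * signal + 0.1 * rng.normal(size=200)

# Pearson index of each feature against the target
pearson_scores = {}
for name in features.columns:
    r, _ = stats.pearsonr(features[name], target)
    pearson_scores[name] = r

# Same quantity from the definition: cov(x, y) / (std(x) * std(y))
manual_r = np.cov(features["linear"], target)[0, 1] / (
    features["linear"].std(ddof=1) * target.std(ddof=1)
)

print(pearson_scores)
```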


### Spearman correlation coefficient
The Spearman correlation coefficient is a measure of the monotonicity of the relationship between two datasets.
The algorithm evaluates the coefficient for each feature variable.



In [None]:
# `scores` avoids shadowing the built-in `tuple`
scores = m.spearmanr(X, Y, RESULTS_NORMALIZED)
# save the results to an external .csv file
scores.to_csv(r'results/spearmanr.csv', index=False)
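
The difference between monotonic and linear association can be seen on a small synthetic example. This sketch uses `scipy.stats.spearmanr` directly and is independent of the `m.spearmanr` helper.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5.0, 50)
y = np.exp(x)  # strictly increasing, but strongly non-linear

rho, _ = stats.spearmanr(x, y)
r, _ = stats.pearsonr(x, y)

# Spearman reaches 1 because the ranks agree perfectly; Pearson stays below 1
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```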


### Kendall tau
Like the Spearman correlation coefficient, Kendall’s tau is based on the ranks of the data.
In most situations, the interpretations of Kendall’s tau and Spearman’s rank correlation coefficient are very similar and usually lead to the same inferences.
The algorithm evaluates the coefficient for each feature variable.


In [None]:
# `scores` avoids shadowing the built-in `tuple`
scores = m.kendall(X, Y, RESULTS_NORMALIZED)
# save the results to an external .csv file
scores.to_csv(r'results/kendall.csv', index=False)
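
A tiny worked example shows how the two rank-based coefficients relate. The numbers are made up, and the calls are plain `scipy.stats` functions rather than the `m.kendall` helper.

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Of the 10 pairs, 8 are concordant and 2 are discordant: tau = (8 - 2) / 10
tau, _ = stats.kendalltau(x, y)

# Squared rank differences sum to 4: rho = 1 - 6 * 4 / (5 * (25 - 1))
rho, _ = stats.spearmanr(x, y)

print(f"Kendall tau = {tau:.2f}, Spearman rho = {rho:.2f}")
```

Both coefficients agree on the direction and approximate strength of the association, although their numerical values differ.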


### Chi-square test (not applicable to non-categorical input)
A chi-square test checks whether two events are independent. In feature selection we instead aim to select the features on which the target depends most strongly. Although the test is better suited to categorical input, the computation can be performed by casting feature values to *int values*.

In [None]:
#tuple = m.chi2_test(X, Y)
#score_frame.append(tuple)
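
To show what the test measures on genuinely categorical data, here is a minimal sketch based on a contingency table and `scipy.stats.chi2_contingency`. It is not the `m.chi2_test` helper, and the synthetic variables are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 500
category = rng.integers(0, 3, size=n)                      # categorical variable, 3 levels
dependent = (category + rng.integers(0, 2, size=n)) % 3    # constructed from `category`
independent = rng.integers(0, 3, size=n)                   # unrelated to `category`

def chi2_pvalue(a, b):
    """p-value of the chi-square test of independence between two categorical arrays."""
    table = pd.crosstab(a, b)
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
    return p_value

p_dep = chi2_pvalue(category, dependent)
p_ind = chi2_pvalue(category, independent)

# A small p-value rejects independence, which marks the variable as relevant
print(f"dependent: p = {p_dep:.3g}, independent: p = {p_ind:.3g}")
```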

### F-Test
The F-test is a statistical test that checks, through a linear regression of the target on each feature, whether the relationship between them is significant. The algorithm evaluates the F-score for each feature variable.

In [None]:
# `scores` avoids shadowing the built-in `tuple`
scores = m.f_test(X, Y, RESULTS_NORMALIZED)
# save the results to an external .csv file
scores.to_csv(r'results/fisher.csv', index=False)
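
For a single feature, the regression F statistic can be derived from the Pearson correlation as F = r² / (1 - r²) · (n - 2). The sketch below assumes that `m.f_test` performs this kind of univariate regression test; its actual implementation is not shown here, and the function name `univariate_f_score` is hypothetical.

```python
import numpy as np
from scipy import stats

def univariate_f_score(x, y):
    """F statistic and p-value for the simple linear regression of y on x."""
    n = len(x)
    r, _ = stats.pearsonr(x, y)
    dof = n - 2
    f_stat = r**2 / (1.0 - r**2) * dof
    p_value = stats.f.sf(f_stat, 1, dof)  # upper tail of F(1, n - 2)
    return f_stat, p_value

rng = np.random.default_rng(1)
x_rel = rng.normal(size=100)              # feature that drives the target
x_irr = rng.normal(size=100)              # feature unrelated to the target
y_demo = 3.0 * x_rel + rng.normal(size=100)

f_rel, p_rel = univariate_f_score(x_rel, y_demo)
f_irr, p_irr = univariate_f_score(x_irr, y_demo)

print(f"relevant:   F = {f_rel:.1f}, p = {p_rel:.3g}")
print(f"irrelevant: F = {f_irr:.1f}, p = {p_irr:.3g}")
```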


### Dispersion Ratio (Unsupervised)
The dispersion ratio is the ratio between the arithmetic mean and the geometric mean of a variable. It is very useful for checking the dispersion of the data: higher dispersion implies a higher value of this coefficient, and thus a more relevant variable.

In [None]:

#tuple = m.compute_dispersion_ratio(preprocessing.normalize(X))
#score_frame.append(tuple)
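
The ratio can be sketched in a few lines with `scipy.stats.gmean`. The helper name `dispersion_ratio` is hypothetical and is not the `m.compute_dispersion_ratio` function; note that the geometric mean requires strictly positive values.

```python
import numpy as np
from scipy import stats

def dispersion_ratio(values):
    """Arithmetic mean divided by geometric mean (strictly positive values only)."""
    values = np.asarray(values, dtype=float)
    return values.mean() / stats.gmean(values)

constant = [5.0, 5.0, 5.0, 5.0]   # no dispersion: the ratio equals 1
spread = [1.0, 2.0, 4.0, 8.0]     # dispersed values: the ratio exceeds 1

ratio_constant = dispersion_ratio(constant)
ratio_spread = dispersion_ratio(spread)

print(ratio_constant, ratio_spread)
```

Since the arithmetic mean is never smaller than the geometric mean, the ratio is always at least 1.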

### Variance Threshold (Unsupervised)
This approach removes all features whose variance doesn’t meet a given threshold. Usually it removes all zero-variance features, that is, variables that contain no useful information.


In [None]:
# the unstandardized data is used, since standardized columns all have unit variance
scores = m.variance_threshold(X_notStand, RESULTS_NORMALIZED)
# save the results to an external .csv file
scores.to_csv(r'results/variance_threshold.csv', index=False)
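
The idea can be sketched with plain pandas. The threshold value and column names below are hypothetical, and this is not the `m.variance_threshold` implementation.

```python
import pandas as pd

frame = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "almost_constant": [0.0, 0.0, 0.0, 0.1],
    "informative": [1.0, 5.0, 2.0, 8.0],
})

threshold = 0.01  # hypothetical cut-off

# Population variance of each column
variances = frame.var(ddof=0)

# Keep only the columns whose variance exceeds the threshold
kept = variances[variances > threshold].index.tolist()

print(variances)
print(kept)
```

The variances must be computed on unstandardized data, because after standardization every column has variance 1 and the threshold can no longer discriminate.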


## Wrapper Methods (and Embedded)
Wrapper methods, as the name suggests, wrap a machine learning model and fit it on different subsets of input features; each subset is then evaluated according to the performance of the resulting model.
Embedded methods instead combine the benefits of both wrapper and filter methods: they account for interactions between features while keeping a reasonable computational cost.


### Exhaustive feature selection for regression analysis
This algorithm follows the exhaustive feature selection approach, with brute-force evaluation of feature subsets; the best subset is selected by optimizing a specified metric for an arbitrary regressor or classifier. In this case, a transformer is used to perform Sequential Feature Selection.
The final outputs are:
* Accuracy for the chosen subset;
* Indices of the chosen features;
* Corresponding names of the chosen features.



In [None]:
#m.exhaustive_feature_selection(X, Y)
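
The brute-force idea can be sketched with `itertools.combinations` and an ordinary least squares fit. This is only an illustration under simplifying assumptions: the subset size is fixed, the metric is the in-sample R², and the helper `r2_of_subset` is hypothetical. It does not reproduce `m.exhaustive_feature_selection`.

```python
from itertools import combinations
import numpy as np

def r2_of_subset(X, y, columns):
    """In-sample R^2 of a least squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - residuals.var() / y.var()

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# Only columns 0 and 2 actually drive the target
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + 0.1 * rng.normal(size=200)

subset_size = 2
subset_scores = {
    cols: r2_of_subset(X_demo, y_demo, list(cols))
    for cols in combinations(range(X_demo.shape[1]), subset_size)
}
best_subset = max(subset_scores, key=subset_scores.get)

print(best_subset, round(subset_scores[best_subset], 4))
```

The number of subsets grows combinatorially with the number of features, which is why this approach is only practical for small feature sets.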

### Random Forest importance
This method fits a forest of decision trees and returns the importance of each feature as output.

In [None]:
# `scores` avoids shadowing the built-in `tuple`
scores = m.RF_importance(X, Y, RESULTS_NORMALIZED)
# save the results to an external .csv file
scores.to_csv(r'results/RF_importance.csv', index=False)
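
A minimal sketch of impurity-based importances with scikit-learn's `RandomForestRegressor` follows. Whether `m.RF_importance` uses this estimator and these settings is an assumption; the synthetic data is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only the first column influences the target
y_demo = 5.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# One importance per feature; the values sum to 1
importances = forest.feature_importances_
print(importances)
```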


### Recursive Feature Elimination (RFE)
The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
The final output for each variable is:
* *Label*;
* A *boolean* expressing whether it is selected or not;
* *Ranking value* based on its score.

In [None]:
# `result` avoids shadowing the built-in `tuple`
result = m.recursive_feature_selection(X, Y.astype(int), 5)
# save the results to an external .csv file
result.to_csv(r'results/rfe.csv', index=False)
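
The selection mask and ranking described above correspond to the `support_` and `ranking_` attributes of scikit-learn's `RFE`. The sketch below uses a linear regression as the wrapped estimator, which is an assumption: the estimator used inside `m.recursive_feature_selection` is not shown here.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Only columns 1 and 4 drive the target
y_demo = 4.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4] + 0.1 * rng.normal(size=200)

# Repeatedly fit the model and drop the weakest feature until 2 remain
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.support_).tolist()  # indices of kept features
ranking = selector.ranking_                            # 1 marks a selected feature

print(selected, ranking)
```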

## Multiscale Geographically Weighted Regression
Because this study deals with geographic and spatial data, the observations are very sensitive to the geographic distance between them. The use of MGWR methods could therefore be innovative, since multivariate models are increasingly encountered in geographical research to estimate spatially varying relationships between a target and its predictor variables.




In [None]:
#m.mgwr(data, list(X.columns), coords, Y)
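
The notion of a spatially varying relationship can be illustrated with a single local fit of a geographically weighted regression: observations are weighted by a Gaussian kernel of their distance from a chosen location, then a weighted least squares model is fitted. This is a simplified single-bandwidth sketch in NumPy, not MGWR itself (which calibrates a separate bandwidth per covariate) and not the `m.mgwr` helper; the function `local_fit`, the bandwidth and the synthetic data are all hypothetical.

```python
import numpy as np

def local_fit(coords, x, y, location, bandwidth):
    """Weighted least squares fit centred on `location` (Gaussian distance kernel)."""
    distances = np.linalg.norm(coords - location, axis=1)
    weights = np.exp(-0.5 * (distances / bandwidth) ** 2)
    A = np.column_stack([np.ones(len(y)), x])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope]

rng = np.random.default_rng(0)
n = 400
coords_demo = rng.uniform(0.0, 10.0, size=(n, 2))
x_demo = rng.normal(size=n)

# The true slope changes across space: 1 in the western half, 3 in the eastern half
true_slope = np.where(coords_demo[:, 0] < 5.0, 1.0, 3.0)
y_demo = true_slope * x_demo + 0.05 * rng.normal(size=n)

west = local_fit(coords_demo, x_demo, y_demo, np.array([1.0, 5.0]), bandwidth=1.0)
east = local_fit(coords_demo, x_demo, y_demo, np.array([9.0, 5.0]), bandwidth=1.0)

print(f"local slope in the west: {west[1]:.2f}, in the east: {east[1]:.2f}")
```

A single global regression would average the two regimes, whereas the local fits recover a different slope in each area.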
