# PreProcessing: Feature Selection
Feature selection is an important step in data pre-processing. It consists of selecting the subset of input variables that is most relevant to the problem. Discarding irrelevant data before applying a machine learning algorithm helps to:
* *Reduce overfitting*: fewer opportunities to make decisions based on noise;
* *Improve accuracy*: less misleading data means better modelling accuracy, since predictions can be greatly distorted by redundant attributes;
* *Reduce training time*: with less data, the algorithms train faster.


### Notebook Configuration
Before running the notebook, check the following parameters, which configure the dataset and the method options:
- **DATA_NORMALIZED**: if *True*, data are normalized between 0 and 1 before feature selection;
- **DATA_STANDARDIZED**: if *True*, data are standardized before feature selection;
- **RESULTS_NORMALIZED**: if *True*, the feature selection results are normalized between 0 and 1;
- **TARGET_VARIABLE**: the target variable used for regression analysis;
- **NUMBER_OF_FEATURE_RFE**: the number of features in the best subset returned by *recursive feature elimination* (RFE);
- **DATASET_PATH**: path of the .gpkg file to import.

In [17]:
DATA_NORMALIZED = False
DATA_STANDARDIZED = True
RESULTS_NORMALIZED = True
DATASET_PATH = 'grids/grid_cams_42_43_2020.gpkg'
TARGET_VARIABLE = 'pm25_cams'
NUMBER_OF_FEATURE_RFE = 5
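
As a rough illustration of what **DATA_NORMALIZED** and **DATA_STANDARDIZED** imply, the sketch below applies min-max scaling and z-score standardization to a toy DataFrame. The column names and values are made up, and `min_max_scale` is a hypothetical helper written for this example; it is not taken from the `fs.methods` module used in this notebook.

```python
import pandas as pd
from scipy import stats

# Toy data; column names and values are illustrative only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})


def min_max_scale(df):
    """Rescale every column so that its minimum is 0 and its maximum is 1."""
    return (df - df.min()) / (df.max() - df.min())


toy_norm = min_max_scale(toy)      # each column now spans 0..1
toy_std = toy.apply(stats.zscore)  # each column now has mean 0 and population std 1
```

Normalization preserves the shape of each distribution inside a fixed range, while standardization centres each column and expresses it in units of its own standard deviation.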

### Import Libraries

In [18]:
import plotly.graph_objects as go
import scipy.stats as stats
import geopandas as gpd
import warnings
from fs import methods as m
import pandas as pd
import plotly.express as px
warnings.filterwarnings("ignore")

### Data import
Data are imported from a .gpkg file and split into **train_set** and **test_set**, which are used by supervised algorithms. With unsupervised algorithms, the entire dataset is processed instead.

In [19]:
#read gpkg file
data = gpd.read_file(DATASET_PATH)
#read variables which are not null
labels = m.noSensor_features(data.columns.tolist())
score_results = pd.DataFrame()

### Definition of dataframe used for feature selection


In [20]:
X = m.clean_dataset_nosensor(data, labels)
# Target as a 1-D array
Y = X[[TARGET_VARIABLE]].values.ravel()
# Remove the target and the excluded 'dsf*' columns from the feature matrix
X = X.drop(columns=[TARGET_VARIABLE, 'dsf4', 'dsf5', 'dsf11', 'dsf12', 'dsf13', 'dsf14'])
score_results['Features'] = X.columns.tolist()

# Coordinates used for MGWR
coords = list(zip(X['lat_cen'], X['lng_cen']))
# Keep an untransformed copy (used later by the variance threshold)
X_notStand = X.copy()

if DATA_NORMALIZED:
    X_notNorm = X.copy()
    X = m.NormalizeData2D(X)

if DATA_STANDARDIZED:
    X = X.apply(stats.zscore)
Since no single feature selection technique is best for every problem, several different methods are applied. The aim of this part is to find out by experimentation which ones work best for this specific problem.
In this study, I chose supervised methods, which fall into three groups according to their approach.


## Filter Methods
Filter-based feature selection methods use statistical measures to evaluate the correlation or dependence between each input variable and the target.
They select features from the data without training a machine learning algorithm. They are computationally very fast and well suited to removing duplicated, correlated, or redundant variables. On the other hand, these methods do not remove multicollinearity.


### Pearson correlation index
It is a measure of linear correlation, defined as the ratio between the covariance of two variables and the product of their standard deviations. The algorithm evaluates the index between each feature variable and the target.

In [21]:
score_results['pearson'] = m.pearson(X, Y, RESULTS_NORMALIZED)
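
To make the definition concrete, here is a minimal sketch that scores each column of a toy feature matrix against a toy target with `scipy.stats.pearsonr`, and also recomputes one coefficient by hand as covariance over the product of standard deviations. The data and names (`toy_X`, `toy_y`) are invented for the example, and taking the absolute value is an assumption about how a score could be built; `m.pearson` may work differently.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x_lin = rng.normal(size=200)
toy_X = pd.DataFrame({
    "linear": x_lin,                 # strongly related to the target
    "noise": rng.normal(size=200),   # unrelated to the target
})
toy_y = 2.0 * x_lin + rng.normal(scale=0.1, size=200)

# Absolute Pearson r of each feature against the target
pearson_scores = toy_X.apply(lambda col: abs(stats.pearsonr(col, toy_y)[0]))

# Same coefficient from the definition: cov(x, y) / (std(x) * std(y))
manual_r = np.cov(x_lin, toy_y, ddof=0)[0, 1] / (x_lin.std() * toy_y.std())
```

The `linear` column should score close to 1 and the `noise` column close to 0.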


### Spearman correlation coefficient
The Spearman correlation coefficient is a measure of the monotonicity of the relationship between two variables.
The algorithm evaluates the index between each feature variable and the target.



In [22]:
score_results['spearmanr'] = m.spearmanr(X, Y, RESULTS_NORMALIZED)
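
The difference between monotonic and linear association can be seen on a small synthetic example: a strictly increasing but non-linear relationship gets a Spearman coefficient of 1, while the Pearson coefficient stays below 1. The data below are made up for illustration.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5.0, 50)
y = np.exp(x)  # strictly increasing, but far from linear

rho = stats.spearmanr(x, y)[0]  # rank-based: only the ordering matters
r = stats.pearsonr(x, y)[0]     # sensitive to the non-linearity
```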


### Kendall tau
Like the Spearman correlation coefficient, Kendall's tau is based on the ranks of the data.
In most situations, the interpretations of Kendall’s tau and Spearman’s rank correlation coefficient are very similar and generally lead to the same inferences.
The algorithm evaluates the index between each feature variable and the target.


In [23]:
score_results['kendall'] = m.kendall(X, Y, RESULTS_NORMALIZED)
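
As a small check of how the two rank-based coefficients relate, the sketch below computes both on the same synthetic noisy data. The values differ in magnitude (Kendall's tau is typically smaller), but they agree in sign and in the ordering of strong versus weak relationships. The data are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x + rng.normal(scale=0.5, size=300)  # noisy increasing relationship

tau = stats.kendalltau(x, y)[0]  # based on concordant vs discordant pairs
rho = stats.spearmanr(x, y)[0]   # Pearson correlation of the ranks
```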


### Chi-square test (not applicable to non-categorical input)
A chi-square test is used to test the independence of events. In feature selection, the aim is instead to select the features on which the target depends most strongly. Although the test is better suited to categorical input, it can be computed by casting feature values to *integer values*.

In [24]:
#tuple = m.chi2_test(X, Y)
#score_frame.append(tuple)

### F-Test
It is a statistical test that uses regression to compare models built from X and Y and to check whether the difference between them is significant. The algorithm evaluates the Fisher score for each feature variable.

In [25]:
score_results['fisher'] = m.f_test(X, Y, RESULTS_NORMALIZED)
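
One common way to obtain per-feature F statistics for a regression target is `sklearn.feature_selection.f_regression`; whether `m.f_test` relies on it is an assumption. The sketch below uses synthetic data in which only the first column influences the target.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 3))
# Only column 0 drives the target
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# One univariate F statistic and p-value per feature
f_scores, p_values = f_regression(X_toy, y_toy)
```

A large F score with a small p-value indicates a significant linear relationship between that feature and the target.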


### Dispersion Ratio (Unsupervised)
It is the ratio between the arithmetic mean and the geometric mean of a feature. It is very useful for checking dispersion in the data: higher dispersion implies a higher value of this coefficient, and thus a more relevant variable.

In [26]:

#results['dispersion_ratio'] = m.compute_dispersion_ratio(preprocessing.normalize(X))
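
The ratio described above can be sketched in a few lines of NumPy. The function name `dispersion_ratio` and the toy values are hypothetical, and the geometric mean requires strictly positive inputs, so real data would need to be shifted or rescaled first.

```python
import numpy as np


def dispersion_ratio(X):
    """Arithmetic mean divided by geometric mean, per column (values must be > 0)."""
    X = np.asarray(X, dtype=float)
    arithmetic_mean = X.mean(axis=0)
    geometric_mean = np.exp(np.log(X).mean(axis=0))
    return arithmetic_mean / geometric_mean


toy = np.array([[1.0, 5.0],
                [10.0, 5.0],
                [100.0, 5.0]])
ratios = dispersion_ratio(toy)  # approximately [3.7, 1.0]
```

A constant column gives a ratio of exactly 1, which is the minimum possible value; the more spread out the values, the larger the ratio.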

### Variance Threshold 
It is an approach that removes all features whose variance does not meet a given threshold. Usually it removes all zero-variance features, that is, variables that contain no useful information.


In [27]:
score_results['variance_threshold'] = m.variance_threshold(X_notStand)
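
A minimal sketch of the idea with scikit-learn's `VarianceThreshold`, on a made-up matrix whose first column is constant. How `m.variance_threshold` turns variances into scores is not shown here, so this only illustrates the filtering step.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_toy = np.array([[0.0, 1.0, 3.0],
                  [0.0, 2.0, 1.0],
                  [0.0, 3.0, 4.0],
                  [0.0, 4.0, 1.0]])

selector = VarianceThreshold(threshold=0.0)  # drop features with zero variance
X_reduced = selector.fit_transform(X_toy)
kept = selector.get_support()    # boolean mask of the retained columns
variances = selector.variances_  # per-feature variance seen during fit
```

Note that after z-score standardization every column has unit variance, so a variance comparison is only meaningful on the untransformed data; this is presumably why the cell above passes `X_notStand`.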

## Wrapper Methods (and Embedded)
Wrapper methods, as the name suggests, wrap a machine learning model and train it on different subsets of input features; the subsets are then ranked by the performance of the resulting model.
Embedded methods combine the benefits of wrapper and filter methods: they account for interactions between features while keeping a reasonable computational cost.


### Exhaustive feature selection for regression analysis
This algorithm follows the exhaustive feature selection approach, a brute-force evaluation of feature subsets: the best subset is selected by optimizing a specified metric for an arbitrary regressor or classifier. In this case a transformer is used to perform the sequential feature selection.
The final outputs are:
* Accuracy for the chosen subset;
* Indices of the chosen features;
* Corresponding names of the chosen features.



In [28]:
#score_results = m.exhaustive_feature_selection(X, Y)
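
The brute-force idea can be sketched with `itertools.combinations` and cross-validation. This is an illustration on synthetic data, not the implementation behind `m.exhaustive_feature_selection`; the regressor, the R² metric, and the limit on subset size are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(150, 4))
# Only columns 0 and 2 drive the target
y_toy = 2.0 * X_toy[:, 0] - 1.5 * X_toy[:, 2] + rng.normal(scale=0.1, size=150)

best_score, best_subset = -np.inf, None
for k in (1, 2):  # subset sizes to evaluate
    for subset in combinations(range(X_toy.shape[1]), k):
        # Mean cross-validated R^2 of a linear model on this subset
        score = cross_val_score(
            LinearRegression(), X_toy[:, list(subset)], y_toy, cv=5, scoring="r2"
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset
```

The number of subsets grows combinatorially with the number of features, which is why this approach quickly becomes impractical on a dataset with dozens of variables.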

### Random Forest importance
It uses a forest of trees to evaluate the importance of each feature and outputs one importance score per feature.

In [29]:
score_results['RF_importance'] = m.RF_importance(X, Y, RESULTS_NORMALIZED)
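
As a hedged sketch of how such scores can be obtained, scikit-learn's `RandomForestRegressor` exposes impurity-based importances through `feature_importances_`. The data and hyperparameters below are invented, and `m.RF_importance` may be configured differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X_toy = rng.normal(size=(300, 3))
# Only column 1 drives the target
y_toy = 5.0 * X_toy[:, 1] + rng.normal(scale=0.2, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances; they are non-negative and sum to 1
importances = forest.feature_importances_
```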


### Recursive Feature Elimination (RFE)
RFE selects features by recursively considering smaller and smaller sets of features.
The final output for each variable is:
* its *label*;
* a *boolean* expressing whether it is selected or not;
* a *ranking value* based on its score.

In [30]:
m.recursive_feature_selection(X, Y.astype(int), NUMBER_OF_FEATURE_RFE)

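| | Variables | isSelected | Ranking |
|---|---|---|---|
| 0 | left | False | 85 |
| 1 | bottom | False | 84 |
| 2 | right | False | 83 |
| 3 | top | False | 82 |
| 4 | lng_cen | False | 81 |
| ... | ... | ... | ... |
| 84 | prec_int | False | 50 |
| 85 | rad_glob_int | False | 61 |
| 86 | temp_int | False | 66 |
| 87 | wind_dir_int | False | 69 |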
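
A table with the same three columns can be produced with scikit-learn's `RFE`. The sketch below is an assumption about the general approach, not the code of `m.recursive_feature_selection`: the estimator, the synthetic data, and the feature names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
names = ["f0", "f1", "f2", "f3", "f4"]
X_toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=names)
# Only f1 and f3 drive the target
y_toy = 4.0 * X_toy["f1"] - 3.0 * X_toy["f3"] + rng.normal(scale=0.1, size=200)

# Remove one feature per iteration until two remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X_toy, y_toy)

rfe_table = pd.DataFrame({
    "Variables": names,
    "isSelected": rfe.support_,  # True for the retained features
    "Ranking": rfe.ranking_,     # 1 = selected; larger = eliminated earlier
})
```

Selected features all share rank 1, and the remaining ranks record the order in which features were eliminated, with the highest rank removed first.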


## Multiscale Geographically Weighted Regression
Because this study deals with geographic and spatial data, each observation is sensitive to its geographic distance from the others. The use of multiscale geographically weighted regression (MGWR) could therefore be innovative, since multivariate models are increasingly used in geographical research to estimate spatially varying relationships between a target and its predictive variables.




In [31]:
#m.mgwr(data, list(X.columns), coords, Y)


## Results
The results above are collected in a pandas DataFrame called **score_results**. Each column holds the scores for one method, and each row holds the results for one feature variable.
* **Features**: feature names (strings);
* **pearson**: Pearson correlation index scores;
* **spearmanr**: Spearman correlation coefficient scores;
* **kendall**: Kendall tau scores;
* **fisher**: F-Test scores;
* **variance_threshold**: Variance Threshold scores;
* **RF_importance**: Random Forest importance scores;
* **avg_scores**: mean of the previous scores for each variable; these values are displayed in a bar plot.

In [32]:
# Row-wise mean over the numeric score columns only ('Features' holds strings)
score_results['avg_scores'] = score_results.mean(axis=1, numeric_only=True)

# Round every numeric column to 3 decimals; non-numeric columns are left untouched
score_results = score_results.round(decimals=3)
#TODO insert result of rfe
fig = px.bar(score_results, x='Features', y='avg_scores', title='Score average')
fig.show()


fig = go.Figure(data=[go.Table(
    header=dict(values=list(score_results.columns),
                fill_color='paleturquoise',
                align='left'),
    # One list of cell values per column, in the same order as the header
    cells=dict(values=[score_results[column] for column in score_results.columns],
               fill_color='lavender',
               align='left'))
])
fig.update_layout(
    autosize=False,
    width=1000,
    height=800,)
fig.show()