> #### Feature Selection


A high-dimensional dataset often contains features that are irrelevant or insignificant. Such features contribute little to a predictive model compared with the critical features, and some contribute nothing at all. They cause several problems that get in the way of efficient predictive modeling:

- Resources are spent unnecessarily on these features.
- These features act as noise, which can make a machine learning model perform poorly.
- The model takes longer to train.

The most economical solution is **Feature Selection**: the process of selecting the most significant features from a given dataset. In many cases, feature selection also improves the performance of a machine learning model.

"The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."

The importance of feature selection:
 - It enables the machine learning algorithm to train faster.
 - It reduces the complexity of a model and makes it easier to interpret.
 - It can improve the accuracy of a model if the right subset is chosen.
 - It reduces overfitting.




>  **1. Difference between dimensionality reduction and feature selection**
  A dimensionality reduction method creates new combinations of attributes (sometimes known as feature transformation), whereas a feature selection method includes or excludes attributes already present in the data without changing them.
     - Examples of dimensionality reduction methods are Principal Component Analysis, Singular Value Decomposition, Linear Discriminant Analysis, etc.
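
The contrast can be seen on a small example. The sketch below is illustrative only: it assumes a synthetic dataset from `make_regression` and arbitrary choices of `k=3` and `n_components=3`, none of which come from the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): 8 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Feature selection keeps a subset of the original columns, unchanged.
selector = SelectKBest(f_regression, k=3)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
print("columns kept by selection:", kept)
print("selected columns equal the originals:",
      np.array_equal(X_selected, X[:, kept]))

# Dimensionality reduction builds new columns, each a linear
# combination of all 8 original features.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print("PCA output shape:", X_reduced.shape)
print("component weight matrix shape:", pca.components_.shape)
```

Both outputs have three columns, but only the selected ones can still be read as original features; the PCA columns are mixtures.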

> **2. Types of general feature selection methods** - Filter methods, Wrapper methods, and Embedded methods.
 - 1) Filter methods:
  Filter methods are generally used as a data preprocessing step. The selection of features is independent of any machine learning algorithm. Features are ranked by statistical scores that measure each feature's relationship with the outcome variable, and the features with the strongest relationships are kept.

     - Examples of filter methods include the Chi-squared test, information gain, and correlation coefficient scores.

  - 2) Wrapper methods: a wrapper method needs a machine learning algorithm and uses its performance as the evaluation criterion. Typical examples are forward feature selection, backward feature elimination, and recursive feature elimination.
     - Recursive feature elimination: performs a greedy search for the best-performing feature subset by repeatedly fitting the model and dropping the weakest features.

   - 3) Embedded methods: perform feature selection as part of the model training process itself, keeping the features that contribute the most to training. Regularization methods are the most commonly used embedded methods: they penalize coefficient size, and features whose coefficients fall below a threshold are dropped. Examples of regularization algorithms are LASSO, Elastic Net, and Ridge Regression. Note that LASSO and Elastic Net can shrink coefficients to exactly zero, whereas Ridge only shrinks them towards zero.
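
The three families above can be compared side by side with scikit-learn. This is a minimal sketch on synthetic data, not the notebook's dataset; the sample sizes, the choice of 5 features, and the use of `LinearRegression` inside `RFE` are assumptions made only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): 20 features, 5 informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Filter: rank features by a univariate F-test, no model involved.
filter_sel = SelectKBest(f_regression, k=5).fit(X, y)

# Wrapper: recursive feature elimination around a linear model.
wrapper_sel = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)

# Embedded: keep features whose LassoCV coefficients are non-zero.
embedded_sel = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)

for name, sel in [("filter", filter_sel),
                  ("wrapper", wrapper_sel),
                  ("embedded", embedded_sel)]:
    print(name, sel.get_support(indices=True))
```

The filter and wrapper selectors return exactly the number of features requested, while the embedded selector keeps however many coefficients the L1 penalty leaves non-zero.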
  


%%html
### filter methods
<img src="filter.png" width="300" height="400">

https://www.datacamp.com/community/tutorials/feature-selection-python

> **Difference between filter and wrapper methods**

 - Filter methods do not use a machine learning model to decide whether a feature is good or bad, whereas wrapper methods train a machine learning model on the features to decide whether each one is essential.
 - Filter methods are much faster than wrapper methods because they do not involve training models. Wrapper methods are computationally costly, and for massive datasets they are often not the most practical feature selection method.
 - Filter methods may fail to find the best subset of features when there is not enough data to model the statistical correlation of the features. Wrapper methods evaluate candidate subsets directly against model performance, so they often find a better subset, although a greedy search does not guarantee the best one.
 - Using features from wrapper methods in your final machine learning model can lead to overfitting, because the wrapper has already trained models on those features and tuned the subset to the training data. Features from filter methods are less likely to cause overfitting.
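
The cost difference can be measured directly. The following sketch times a filter (`SelectKBest`) against a wrapper (forward selection via `SequentialFeatureSelector`, available in recent scikit-learn versions) on synthetic data; the dataset, the choice of 4 features, and `cv=3` are assumptions for illustration, and the timings will vary by machine.

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       f_regression)
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): 15 features, 4 informative.
X, y = make_regression(n_samples=300, n_features=15, n_informative=4,
                       noise=10.0, random_state=0)

# Filter: one pass of univariate statistics, no model training.
t0 = perf_counter()
filter_sel = SelectKBest(f_regression, k=4).fit(X, y)
filter_seconds = perf_counter() - t0

# Wrapper: forward selection refits the model for every candidate
# feature at every step, with cross-validation each time.
t0 = perf_counter()
wrapper_sel = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=3
).fit(X, y)
wrapper_seconds = perf_counter() - t0

print("filter :", filter_sel.get_support(indices=True),
      "in %.4f seconds" % filter_seconds)
print("wrapper:", wrapper_sel.get_support(indices=True),
      "in %.4f seconds" % wrapper_seconds)
```

With a heavier estimator or more features, the wrapper's cost grows quickly because the number of model fits scales with both.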

Example 1: use sklearn.feature_selection.SelectFromModel (with LassoCV) to select features, then fit a random forest on the selected features

https://hub.packtpub.com/4-ways-implement-feature-selection-python-machine-learning/

In [None]:

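import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import f_regression, mutual_info_regression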


# Import the necessary modules
from time import time
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split


start = time()


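# Embedded selection: keep the features that LassoCV gives non-zero coefficients
clf = LassoCV(cv=5)
sfm = SelectFromModel(clf)

# Alternative filter-based selectors (ANOVA F-test and mutual information);
# defined here for comparison but not used in the pipeline below
anova_filter = SelectKBest(f_regression)
mutual = SelectKBest(mutual_info_regression)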

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('filter', sfm),
#          ('ridge', Ridge())
        ('rf', RandomForestRegressor())
        ]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# # Specify the hyperparameter space
# parameters = {
#              'ridge__alpha':[1e-2, 1, 5, 10, 20, 30, 40]
# }

# Specify the hyperparameter space
parameters = {
#                 'filter__k': [10, 20 ,30],
              'rf__n_estimators':[200, 500, 1000]
#               ,'rf__max_features':['auto','sqrt']
             }

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_final, y_refund, test_size = 0.4, random_state = 42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid= parameters, cv = 5)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics (score() returns R squared for a regressor)
r2 = gm_cv.score(X_test, y_test)
y_pred = gm_cv.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Best parameters: {}".format(gm_cv.best_params_))
print("Tuned R squared: {}".format(r2))
print("Tuned MSE: {}".format(mse))

print("this takes %.2f seconds" %(time()-start))

In [None]:
regressor = gm_cv.best_estimator_.named_steps['rf']
selector = gm_cv.best_estimator_.named_steps['filter']
print('number of features after selection: %d' % selector.get_support().sum())
print('number of original features: %d' % X_final.shape[1])

In [None]:
# to get what features are selected or removed
pd.DataFrame(sorted(zip(selector.get_support(), X_train.columns)
                          , key=lambda x: x[0], reverse = True))


In [None]:
# function to plot feature_importance.
def plot_feature_importance(feature_importances,feature_names, title):
    ftr_imp_df = pd.DataFrame(sorted(zip(feature_names,feature_importances)
                          , key=lambda x: x[1], reverse = False)
                   )
    y_pos = np.arange(ftr_imp_df.shape[0])

    plt.barh(y_pos, ftr_imp_df[1], align='center', alpha=0.4)
    plt.yticks(y_pos, ftr_imp_df[0])
    plt.xlabel('Feature Importance')
    plt.title(title)
    plt.grid()
    plt.show()

    
plt.subplots(figsize=(15,10))    
rf = gm_cv.best_estimator_.named_steps['rf']
selector = gm_cv.best_estimator_.named_steps['filter']
feature_importances = rf.feature_importances_
# get_support() already returns a boolean mask over the input columns
feature_names = X_train.columns[selector.get_support()]

plot_feature_importance(feature_importances,feature_names, title = 'random forest ')


In [None]:
# Check the p-value of each selected feature against the response

feature_names = X_train.columns[selector.get_support()]

# f_regression runs a univariate F-test per feature and returns (F-scores, p-values)
f_test, p_val = f_regression(X_final[feature_names], y_refund)
print('p-values from the F-test against the response')
df = pd.DataFrame(sorted(zip(p_val, feature_names),
                         key=lambda x: x[0], reverse=False))
# Column 0 holds the p-value; flag features significant at the 0.05 level
df['significant'] = np.where(df[0] < 0.05, 'sig', 'not sig')
df