# Random forest selection -- Boston

1. Main idea
Random forests offer high accuracy, good robustness, and ease of use, which makes them one of the most popular machine learning algorithms today. A random forest provides two methods of feature selection: mean decrease in impurity and mean decrease in accuracy.

2. Mean decrease in impurity

Principle introduction
- A random forest is composed of multiple CART decision trees. Each node in a tree is a condition on a single feature that splits the dataset into two parts, so that samples with similar response values end up in the same part.
- CART uses an impurity measure to choose the condition at each node (the locally optimal split). Gini impurity is usually used for classification problems, and variance (the least-squares criterion) is usually used for regression problems.
- When training a decision tree, you can calculate how much each feature reduces the impurity of the tree. For a forest, you can calculate how much impurity each feature reduces on average across the trees and use that average reduction as the criterion for feature selection.
- The impurity-based ranking from a random forest is very pronounced: after the few highest-scoring features, the scores of the remaining features drop sharply.
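
The impurity reduction described above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper names `gini`, `variance`, and `impurity_decrease` and the toy data are assumptions made for the example. It computes the impurity of a parent node minus the size-weighted impurity of its two children, which is the quantity a forest accumulates per feature and averages over its trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Population variance, the impurity measure commonly used for regression."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy classification split (hypothetical data): the condition separates the classes perfectly.
clf_gain = impurity_decrease(["a", "a", "b", "b"], ["a", "a"], ["b", "b"], gini)

# Toy regression split (hypothetical data): each child contains identical targets.
reg_gain = impurity_decrease([1.0, 1.0, 3.0, 3.0], [1.0, 1.0], [3.0, 3.0], variance)

print(clf_gain, reg_gain)
```

A split that leaves both children pure removes all of the parent's impurity, so the decrease equals the parent's own impurity in both toy cases.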

In [2]:
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np

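# Load the Boston housing dataset as an example.
# Note: load_boston is deprecated and was removed in scikit-learn 1.2,
# so this cell needs an older release to run.
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

# Train the random forest model, then read each feature's importance
# score from the feature_importances_ attribute.
rf = RandomForestRegressor()
rf.fit(X, Y)

# Pair each rounded score with its feature name and sort in descending order.
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
             reverse=True))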


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

Features sorted by their score:
[(0.432, 'RM'), (0.3777, 'LSTAT'), (0.0599, 'DIS'), (0.0345, 'CRIM'), (0.0273, 'NOX'), (0.0162, 'PTRATIO'), (0.0153, 'TAX'), (0.0127, 'AGE'), (0.0114, 'B'), (0.0071, 'INDUS'), (0.0038, 'RAD'), (0.0011, 'ZN'), (0.0009, 'CHAS')]


3. Mean decrease in accuracy

Principle introduction
- Feature selection is carried out by directly measuring the effect of each feature on model accuracy.
- The main idea is to shuffle the values of each feature in turn and measure how much the shuffling changes the accuracy of the model.
- For unimportant variables, shuffling has little effect on the accuracy of the model.
- For important variables, shuffling reduces the accuracy of the model.
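
The shuffling procedure above can be tried with scikit-learn's `permutation_importance` helper from `sklearn.inspection`. The sketch below is an illustration under stated assumptions rather than part of the original notebook: the synthetic data, the coefficients, and the train/test split are all made up for the example, and the held-out set is used so that the score drop reflects generalization rather than memorized training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): the target depends
# strongly on column 0, weakly on column 1, and not at all on column 2.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each column of the held-out set n_repeats times and record the
# average drop in the model's score (R^2 for a regressor).
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

# Print the features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

Shuffling column 0 should cost the model far more accuracy than shuffling the pure-noise column 2, whose score drop should stay close to zero.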

Random forest is a very popular and easy-to-use feature selection method, but it has two main problems:
- Important features may receive low scores (the correlated-features problem): when several features are correlated, the importance is shared among them.
- The method favors features with more categories (the bias problem).
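
The correlated-features problem can be made concrete with a small experiment. This is a hedged sketch on synthetic data (the variables `x`, `noise`, and `y` are assumptions for the example, not taken from the Boston data): the same informative feature is given to the forest once, and then twice as an exact copy, and the impurity-based scores are compared.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): y depends only on x.
rng = np.random.RandomState(0)
x = rng.uniform(size=1000)
noise = rng.uniform(size=1000)
y = x + rng.normal(scale=0.05, size=1000)

# Case 1: the informative feature appears once, next to a noise feature.
rf_single = RandomForestRegressor(n_estimators=100, random_state=0)
rf_single.fit(np.column_stack([x, noise]), y)

# Case 2: the informative feature appears twice (a perfectly correlated copy).
rf_dup = RandomForestRegressor(n_estimators=100, random_state=0)
rf_dup.fit(np.column_stack([x, x, noise]), y)

print("single:    ", rf_single.feature_importances_.round(3))
print("duplicated:", rf_dup.feature_importances_.round(3))
```

In the second case the two copies should share the importance that the single feature received in the first case, so each copy looks much less important on its own even though the underlying signal has not changed.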

Finally, feature selection serves either to understand the data or to train a better model, so we should choose a method that suits our goal. If computational resources are sufficient, selecting features only to make training better or easier should be avoided, because it can easily discard useful information. To reduce the influence of uninformative features and avoid over-fitting, ensemble models such as random forest and XGBoost can be used instead.