L1-regularized logistic regression selects features because the L1 penalty shrinks the weights of less important features to values close to 0 or exactly 0.
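To make the shrinking effect concrete, here is a minimal sketch on synthetic data. The data, the `liblinear` solver, and the value `C=0.02` are assumptions chosen for illustration; they do not come from the ad-click dataset used later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): only features 0 and 1 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A small C means a strong L1 penalty.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.02)
clf.fit(X, y)

coef = clf.coef_.ravel()
kept = np.flatnonzero(coef)        # indices of features with non-zero weight
n_zero = int(np.sum(coef == 0))    # weights driven exactly to zero
print("features kept:", kept)
print("weights exactly zero:", n_zero)
```

Features whose weight is exactly zero can be dropped without changing the model's predictions.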

Beyond that, random forest is another commonly used feature selection method.

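As a reminder, a random forest is an ensemble of individual decision trees, and each tree considers a random subset of the features when searching for the best split point at a node.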
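The random feature subset can be sketched as follows. The helper `candidate_features` is hypothetical, and the square-root subset size is a commonly used choice for classification that is assumed here rather than taken from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n_features = 16
k = int(np.sqrt(n_features))  # assumed subset size: sqrt of the feature count

def candidate_features(rng, n_features, k):
    """Draw k distinct feature indices to evaluate at one split (hypothetical helper)."""
    return rng.choice(n_features, size=k, replace=False)

# Each split sees a different random subset of candidate features.
for node in range(3):
    print(node, sorted(candidate_features(rng, n_features, k).tolist()))
```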

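The essence of the decision tree algorithm is that only important features (and their split values) are used to construct tree nodes. Across the whole forest, the more important a feature is, the more often it is chosen as the splitting feature of a tree node.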

In other words, we can rank features by how often they appear as nodes across all trees, and then select the top-ranked, most important ones.
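The frequency-based ranking described here can be sketched by counting split features in a fitted forest. The synthetic data and the hyperparameters are assumptions for illustration. Note that scikit-learn's `feature_importances_` is computed from impurity decrease rather than from raw split counts, so the two rankings need not agree exactly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): only features 0 and 1 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X, y)

# Count how often each feature is used as a split across all trees.
split_counts = np.zeros(X.shape[1], dtype=int)
for est in forest.estimators_:
    feats = est.tree_.feature  # negative entries mark leaf nodes
    split_counts += np.bincount(feats[feats >= 0], minlength=X.shape[1])

print("split counts:", split_counts)
print("ranking by split count:", np.argsort(split_counts)[::-1])
print("impurity-based importances:", forest.feature_importances_.round(3))
```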

### A trained RandomForestClassifier has a .feature\_importances\_ attribute

In [1]:
import csv
import numpy as np

In [3]:
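def read_ad_click_data(n, offset=0):
    """Read n rows from train.csv, skipping the first `offset` rows."""
    X_dict, y = [], []
    with open('train.csv', 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        # Skip the rows that precede the requested window
        for _ in range(offset):
            next(reader)
        for i, row in enumerate(reader, start=1):
            y.append(int(row['click']))
            # Drop the label and the fields that are not used as features
            del row['click'], row['id'], row['device_id'], row['device_ip'], row['hour']
            X_dict.append(row)
            if i >= n:
                break
    return X_dict, y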

In [4]:
n = 10000
X_dict_train, y_train = read_ad_click_data(n)

In [5]:
from sklearn.feature_extraction import DictVectorizer
dict_one_hot_encoder = DictVectorizer(sparse=False)
X_train = dict_one_hot_encoder.fit_transform(X_dict_train)

X_dict_test, y_test = read_ad_click_data(n,n)
X_test = dict_one_hot_encoder.transform(X_dict_test)

X_train_10k = X_train
y_train_10k = np.array(y_train)
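As a small illustration of what the encoder produces, the sketch below one-hot encodes three toy rows. The rows are made up, and `site_category` is a hypothetical column name used only for this example. Because `csv.DictReader` yields string values, every field is treated as categorical and becomes one `name=value` column per distinct value.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy rows (assumption): string values, as csv.DictReader would return them.
rows = [
    {"C21": "33", "site_category": "a"},
    {"C21": "35", "site_category": "a"},
    {"C21": "33", "site_category": "b"},
]

encoder = DictVectorizer(sparse=False)
X = encoder.fit_transform(rows)

print(encoder.feature_names_)  # one "name=value" column per category
print(X)
```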

In [7]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, criterion='gini',
                                       min_samples_split=30, n_jobs=-1)
random_forest.fit(X_train_10k, y_train_10k)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=30, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [8]:
random_forest.feature_importances_

array([  1.16494576e-03,   0.00000000e+00,   3.16679117e-04, ...,
         8.79571036e-06,   5.37803466e-06,   3.01467503e-04])

The 10 lowest importance values and the indices of the corresponding features:

In [9]:
print(np.sort(random_forest.feature_importances_)[:10])
print(np.argsort(random_forest.feature_importances_)[:10])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[1063  148 2467 2468 1031  161 1014 1001 2514 2515]


The top 10:

In [10]:
print(np.sort(random_forest.feature_importances_)[-10:])
print(np.argsort(random_forest.feature_importances_)[-10:])

[ 0.00861919  0.00891563  0.00896739  0.00903449  0.00908059  0.00914879
  0.00932988  0.00955284  0.01217865  0.01382207]
[1923 1540 1503  549  395  275 1085 2307  393  554]


Here, the most important feature is the one at index 554:

In [11]:
print(dict_one_hot_encoder.feature_names_[554])

C21=33
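The lookup above resolves a single index. To list several top features together with their names, one option is to reverse the `argsort` result so the most important comes first. The importance values and names below are toy assumptions, not values from the trained forest.

```python
import numpy as np

# Toy values (assumption) standing in for feature_importances_ and feature_names_.
importances = np.array([0.05, 0.40, 0.00, 0.25, 0.30])
names = ["f0", "f1", "f2", "f3", "f4"]

k = 3
# argsort is ascending, so reverse it and keep the first k indices.
top_k = np.argsort(importances)[::-1][:k]

for idx in top_k:
    print(idx, names[idx], importances[idx])
```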


Going further, select the top 500 most important features:

In [13]:
top500_feature = np.argsort(random_forest.feature_importances_)[-500:]
X_train_10k_selected = X_train_10k[:, top500_feature]
print(X_train_10k_selected.shape)

(10000, 500)
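When the reduced training matrix is used to fit a model, the test matrix presumably needs the same columns in the same order. A minimal sketch with toy arrays (assumptions, not the ad-click data):

```python
import numpy as np

# Toy matrices (assumption) with 4 feature columns each.
X_train = np.arange(12).reshape(3, 4)
X_test = np.arange(12, 20).reshape(2, 4)
importances = np.array([0.1, 0.6, 0.0, 0.3])

# Indices of the 2 most important features, in ascending order of importance.
top2 = np.argsort(importances)[-2:]

# Apply the same index array to both matrices so the columns stay aligned.
X_train_sel = X_train[:, top2]
X_test_sel = X_test[:, top2]
print(X_train_sel.shape, X_test_sel.shape)
```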


In [14]:
top500_feature

array([2269, 1947, 1249,  931, 1401, 1387, 1683, 1569, 1768,  180, 1624,
       1201, 2164, 2239, 1782,  103, 2391,  632, 2051, 1701, 2100, 2486,
        312, 2128, 2174, 2453,  760,    7, 2737, 2399,  418, 2755, 1795,
       1771, 2473, 1167,  984, 1459, 1869,  341, 1975, 2300, 2733, 2551,
        442,  709, 2485, 1383,  545, 2043,  792, 1928,   47, 1328,  828,
       2236, 1636, 1916,  256, 1319,  258, 1224, 2314, 2210,  499,  485,
       1969, 2725, 2249, 2682, 2251, 2812, 1298, 1095,  423, 1234,  557,
       2209, 2221, 1110, 2336, 1788,   10, 1852, 2598,   79, 2584, 2030,
        472, 1308, 2310,  482, 1776,  765, 1100, 1873,  400,  327, 2579,
       2323, 2750,  814, 2296, 2042,  767, 1086, 1629,  319, 2041, 1172,
        927, 2350,  577,  780, 2105, 1362,  383, 2200, 2422,  283, 1685,
       1835, 1347, 2574, 1117, 2368, 1332,  926, 2640,  304,  417,  369,
       2338, 1214,  170, 1710,  644, 1101,  255, 1403,  904, 1217, 2372,
       1666, 2312, 1437, 1846, 1106,  974, 1186, 18