# Automated Feature Engineering

Adapted from: https://www.kaggle.com/liananapalkova/automated-feature-engineering-for-titanic-dataset

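### 1. Introduction
If you have ever hand-crafted hundreds of features for an ML project (and you probably have), you will be glad to know how the Python package "featuretools" can help with that task. The good news is that the package is easy to use. Its goal is to automate feature engineering. Human expertise remains irreplaceable, of course, but "featuretools" can automate a large share of the routine work. For exploration purposes, this notebook uses the forest cover type dataset loaded with scikit-learn's fetch_covtype.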

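This notebook covers two main steps:

First, use automated feature engineering (the "featuretools" package) to grow the feature set from the original 54 features to N.

Second, apply feature reduction and selection methods to pick the X most relevant features out of those N.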
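
As a rough illustration of the second step, the sketch below selects features with an L1-penalized linear SVM, using the `LinearSVC` and `SelectFromModel` classes that this notebook imports. The synthetic data, the names `X_demo` and `y_demo`, and the value `C=0.01` are assumptions chosen for the example, not the settings used on the cover type data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for an engineered feature matrix (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)

# The L1 penalty pushes the weights of uninformative features to zero
svc = LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0)
svc.fit(X_demo, y_demo)

# Keep only the features whose fitted weight is non-negligible
selector = SelectFromModel(svc, prefit=True)
X_selected = selector.transform(X_demo)

print(X_demo.shape, "->", X_selected.shape)
```

A smaller `C` means stronger regularization and therefore fewer surviving features, so `C` is the knob that controls how aggressive the selection is.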

In [1]:
import sys
print(sys.version)  # Python version information

3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]


In [5]:
pip install featuretools

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Note: you may need to restart the kernel to use updated packages.
Collecting featuretools
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8f/32/b5d02df152aff86f720524540ae516a8e15d7a8c53bd4ee06e2b1ed0c263/featuretools-0.26.2-py3-none-any.whl (327 kB)
Installing collected packages: featuretools
Successfully installed featuretools-0.26.2



In [2]:
import numpy as np
import pandas as pd
import time

import featuretools as ft
from featuretools.primitives import *
from featuretools.variable_types import Numeric
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
# Import the models; install any missing package with `pip install <package>`

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb 
import lightgbm as lgb 

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import log_loss

In [3]:
from sklearn.datasets import fetch_covtype
data = fetch_covtype()

In [4]:
# Preprocessing
X, y = data['data'], data['target']
# The models expect class labels to start at 0, so map the labels 1-7 to 0-6
print('7-class task, before encoding:', np.unique(y))
print(y)
ord = OrdinalEncoder()
y = ord.fit_transform(y.reshape(-1, 1))
y = y.reshape(-1, )
print('7-class task, after encoding:', np.unique(y))
print(y)

7-class task, before encoding: [1 2 3 4 5 6 7]
[5 5 2 ... 3 3 3]
7-class task, after encoding: [0. 1. 2. 3. 4. 5. 6.]
[4. 4. 1. ... 2. 2. 2.]
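
Because the raw labels are the contiguous integers 1 to 7, the encoding above amounts to subtracting 1, except that `OrdinalEncoder` returns floats, as the output shows. The sketch below is an alternative that keeps the labels as integers. The array `y_raw` is a small hypothetical sample, not the real target vector.

```python
import numpy as np

# Hypothetical sample of raw labels covering all seven classes
y_raw = np.array([5, 5, 2, 1, 4, 6, 7, 3, 3, 3])

# Shift to 0-based labels while keeping the integer dtype
y_shifted = y_raw - 1

# np.unique(..., return_inverse=True) yields 0-based codes for any label set,
# including labels that are not contiguous
classes, y_codes = np.unique(y_raw, return_inverse=True)

print(y_shifted)
print(y_codes)
```

Both arrays are identical here because every class from 1 to 7 appears in the sample. With gaps in the label set, only the `np.unique` variant would still produce consecutive codes.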


In [8]:
X = pd.DataFrame(X,columns=data.feature_names)
X = X.reset_index()
X.head(2)

Unnamed: 0,index,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type_30,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39
0,0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2. Run automated feature engineering
First, check whether the data contains any NaN values.
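
A minimal way to run that check with pandas is sketched below. The frame `X_demo` is a hypothetical stand-in with one missing value injected on purpose. On the real data, the same two calls would be applied to `X`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature table, with one NaN injected
X_demo = pd.DataFrame({
    "Elevation": [2596.0, 2590.0, np.nan],
    "Aspect": [51.0, 56.0, 139.0],
})

# Number of missing values per column
nan_per_column = X_demo.isna().sum()
print(nan_per_column)

# True if any cell in the whole table is missing
has_nan = bool(X_demo.isna().to_numpy().any())
print("Contains NaN:", has_nan)
```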

In [47]:
# Show the docstring of entity_from_dataframe (works before `es` is created)
ft.EntitySet.entity_from_dataframe?

In [9]:
es = ft.EntitySet(id = 'fetch_covtype_data')
# Columns treated as categorical, and the one-hot columns treated as boolean
categorical_cols = ['Aspect', 'Slope', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
boolean_cols = ([f'Wilderness_Area_{i}' for i in range(4)]
                + [f'Soil_Type_{i}' for i in range(40)])

variable_types = {col: ft.variable_types.Categorical for col in categorical_cols}
variable_types.update({col: ft.variable_types.Boolean for col in boolean_cols})

es = es.entity_from_dataframe(entity_id = 'X', dataframe = X,
                              variable_types = variable_types,
                              index = 'index')

es

Entityset: fetch_covtype_data
  Entities:
    X [Rows: 581012, Columns: 55]
  Relationships:
    No relationships

In [10]:
# Split every one-hot column out of X into its own entity
one_hot_cols = ([f'Wilderness_Area_{i}' for i in range(4)]
                + [f'Soil_Type_{i}' for i in range(40)])
for col in one_hot_cols:
    es = es.normalize_entity(base_entity_id='X', new_entity_id=col, index=col)
es

Entityset: fetch_covtype_data
  Entities:
    X [Rows: 581012, Columns: 55]
    Wilderness_Area_0 [Rows: 2, Columns: 1]
    Wilderness_Area_1 [Rows: 2, Columns: 1]
    Wilderness_Area_2 [Rows: 2, Columns: 1]
    Wilderness_Area_3 [Rows: 2, Columns: 1]
    Soil_Type_0 [Rows: 2, Columns: 1]
    Soil_Type_1 [Rows: 2, Columns: 1]
    Soil_Type_2 [Rows: 2, Columns: 1]
    Soil_Type_3 [Rows: 2, Columns: 1]
    Soil_Type_4 [Rows: 2, Columns: 1]
    Soil_Type_5 [Rows: 2, Columns: 1]
    Soil_Type_6 [Rows: 2, Columns: 1]
    Soil_Type_7 [Rows: 2, Columns: 1]
    Soil_Type_8 [Rows: 2, Columns: 1]
    Soil_Type_9 [Rows: 2, Columns: 1]
    Soil_Type_10 [Rows: 2, Columns: 1]
    Soil_Type_11 [Rows: 2, Columns: 1]
    Soil_Type_12 [Rows: 2, Columns: 1]
    Soil_Type_13 [Rows: 2, Columns: 1]
    Soil_Type_14 [Rows: 2, Columns: 1]
    Soil_Type_15 [Rows: 2, Columns: 1]
    Soil_Type_16 [Rows: 2, Columns: 1]
    Soil_Type_17 [Rows: 2, Columns: 1]
    Soil_Type_18 [Rows: 2, Columns: 1]
    Soil_Type_19 
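
Each new entity is reported with `Rows: 2, Columns: 1` because normalizing on a one-hot column leaves only its distinct values, 0.0 and 1.0. The pandas sketch below imitates that result on a tiny hypothetical frame. It illustrates the idea only and is not how featuretools implements `normalize_entity`.

```python
import pandas as pd

# Hypothetical miniature of the X entity
X_demo = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "Elevation": [2596.0, 2590.0, 2804.0, 2785.0],
    "Soil_Type_0": [0.0, 1.0, 0.0, 0.0],
})

# The new "entity" holds one row per distinct value of the column
soil_type_0 = X_demo[["Soil_Type_0"]].drop_duplicates().reset_index(drop=True)
print(soil_type_0.shape)  # (2, 1)
```

The original table keeps the column as a foreign key, which is what later lets aggregation primitives summarize the rows of `X` per distinct value.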

In [11]:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100
# Show every aggregation primitive
primitives[primitives['type'] == 'aggregation']

Unnamed: 0,name,type,dask_compatible,koalas_compatible,description,valid_inputs,return_type
0,all,aggregation,True,False,Calculates if all values are 'True' in a list.,Boolean,Boolean
1,skew,aggregation,False,False,Computes the extent to which a distribution differs from a normal distribution.,Numeric,Numeric
2,percent_true,aggregation,True,False,Determines the percent of `True` values.,Boolean,Numeric
3,count,aggregation,True,True,"Determines the total number of values, excluding `NaN`.",Index,Numeric
4,num_unique,aggregation,True,True,"Determines the number of distinct values, ignoring `NaN` values.",Discrete,Numeric
5,first,aggregation,False,False,Determines the first value in a list.,Variable,
6,mode,aggregation,False,False,Determines the most commonly repeated value.,Discrete,
7,entropy,aggregation,False,False,Calculates the entropy for a categorical variable,Categorical,Numeric
8,time_since_last,aggregation,False,False,Calculates the time elapsed since the last datetime (default in seconds).,DatetimeTimeIndex,Numeric
9,any,aggregation,True,False,Determines if any value is 'True' in a list.,Boolean,Boolean
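
To make the aggregation primitives more concrete, the sketch below reproduces what a few of them compute (count, num_unique, mode, percent_true) with a plain pandas `groupby`. The frame `rows` and the output column names are hypothetical. Note that the percent_true column here is a fraction between 0 and 1, computed as the mean of the boolean values.

```python
import pandas as pd

# Hypothetical child rows, grouped by the parent key Soil_Type_0
rows = pd.DataFrame({
    "Soil_Type_0": [0.0, 0.0, 1.0, 1.0, 1.0],
    "Aspect": [51.0, 56.0, 51.0, 51.0, 139.0],
    "Wilderness_Area_0": [True, False, True, True, False],
})

grouped = rows.groupby("Soil_Type_0")

summary = pd.DataFrame({
    # count: number of non-NaN values per group
    "COUNT": grouped["Aspect"].count(),
    # num_unique: number of distinct values per group
    "NUM_UNIQUE(Aspect)": grouped["Aspect"].nunique(),
    # mode: most frequent value (ties resolved by taking the smallest)
    "MODE(Aspect)": grouped["Aspect"].agg(lambda s: s.mode().iloc[0]),
    # percent_true: share of True values per group
    "PERCENT_TRUE(Wilderness_Area_0)": grouped["Wilderness_Area_0"].mean(),
})
print(summary)
```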


In [12]:
# Show every transform primitive
primitives[primitives['type'] == 'transform']

Unnamed: 0,name,type,dask_compatible,koalas_compatible,description,valid_inputs,return_type
22,url_to_domain,transform,False,False,Determines the domain of a url.,URL,Categorical
23,cum_mean,transform,False,False,Calculates the cumulative mean.,Numeric,Numeric
24,minute,transform,True,True,Determines the minutes value of a datetime.,Datetime,Numeric
25,cum_max,transform,False,False,Calculates the cumulative maximum.,Numeric,Numeric
26,age,transform,True,False,Calculates the age in years as a floating point number given a,DateOfBirth,Numeric
...,...,...,...,...,...,...,...
79,greater_than_scalar,transform,True,True,Determines if values are greater than a given scalar.,"Numeric, Datetime, Ordinal",Boolean
80,url_to_protocol,transform,False,False,Determines the protocol (http or https) of a url.,URL,Categorical
81,month,transform,True,True,Determines the month value of a datetime.,Datetime,Ordinal
82,divide_numeric_scalar,transform,True,True,Divide each element in the list by a scalar.,Numeric,Numeric
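
Transform primitives work row by row within a single table rather than across a relationship. The sketch below mimics four of the listed primitives (cum_max, cum_mean, greater_than_scalar, divide_numeric_scalar) in plain pandas. The series values, the scalars 2600 and 1000, and the column names are assumptions made for the example, and the cumulative features are computed over the whole series without any grouping.

```python
import pandas as pd

# Hypothetical numeric column
elevation = pd.Series([2596.0, 2590.0, 2804.0, 2785.0], name="Elevation")

features = pd.DataFrame({
    "Elevation": elevation,
    # cum_max: running maximum
    "CUM_MAX": elevation.cummax(),
    # cum_mean: running mean over all rows seen so far
    "CUM_MEAN": elevation.expanding().mean(),
    # greater_than_scalar: element-wise comparison with a constant
    "GREATER_THAN_SCALAR(2600)": elevation > 2600,
    # divide_numeric_scalar: element-wise division by a constant
    "DIVIDE_NUMERIC_SCALAR(1000)": elevation / 1000,
})
print(features)
```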
