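# Task 3: Feature Engineering

This is the Task 3 (feature engineering) part of the beginner data-mining series on heartbeat signal classification. It introduces feature engineering and analysis methods for time series data. Questions and discussion are welcome.

Competition: 零基础入门数据挖掘-心跳信号分类预测 (Introduction to Data Mining for Beginners: Heartbeat Signal Classification)

Project URL:
Competition URL: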

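## 3.1 Learning objectives

* Learn how to preprocess time series data for feature extraction
* Learn how to use tsfresh (Time Series Feature extraction based on scalable hypothesis tests), a tool for building time series features

## 3.2 Contents
* Data preprocessing
	* Reshaping the time series data
	* Adding the time-step feature `time`
* Feature engineering
	* Constructing time series features
	* Feature selection
	* Processing time series features with tsfresh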

## 3.3 Code example

### 3.3.1 Import packages and load the data

In [14]:
# Import packages
import pandas as pd
import numpy as np
# The tsfresh functions (extract_features, impute, select_features) are imported in the cells below, where they are first used

In [15]:
# Load the data
data_train = pd.read_csv('./data/train.csv')
data_test_A = pd.read_csv('./data/testA.csv')

print(data_train.shape)
print(data_test_A.shape)

(100000, 3)
(20000, 2)


In [16]:
data_train.head()
data_train.describe()

Unnamed: 0,id,label
count,100000.0,100000.0
mean,49999.5,0.85696
std,28867.657797,1.217084
min,0.0,0.0
25%,24999.75,0.0
50%,49999.5,0.0
75%,74999.25,2.0
max,99999.0,3.0


In [18]:
data_test_A.head()

Unnamed: 0,id,heartbeat_signals
0,100000,"0.9915713654170097,1.0,0.6318163407681274,0.13..."
1,100001,"0.6075533139615096,0.5417083883163654,0.340694..."
2,100002,"0.9752726292239277,0.6710965234906665,0.686758..."
3,100003,"0.9956348033996116,0.9170249621481004,0.521096..."
4,100004,"1.0,0.8879490481178918,0.745564725322326,0.531..."


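### 3.3.2 Data preprocessing

In [19]:
# Reshape the heartbeat signals from wide to long format (one row per sample)
# and add a `time` column holding the time step of each sample within its signal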
train_heartbeat_df = data_train['heartbeat_signals'].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_heartbeat_df['heartbeat_signals'] = train_heartbeat_df['heartbeat_signals'].astype(float)
train_heartbeat_df.info()
train_heartbeat_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column             Dtype  
---  ------             -----  
 0   time               int64  
 1   heartbeat_signals  float64
dtypes: float64(1), int64(1)
memory usage: 469.2 MB


Unnamed: 0,time,heartbeat_signals
0,0,0.99123
0,1,0.943533
0,2,0.764677
0,3,0.618571
0,4,0.379632
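
To make the reshaping step easier to follow, here is a minimal sketch of the same `split`/`stack` chain on a toy frame. The two three-sample signals and the names `toy` and `long_df` are made up for illustration; only the sequence of pandas calls mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for data_train: two signals with three samples each (made-up values).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.4,0.0"],
})

# split(expand=True) yields one column per time step; stack() turns those
# columns into rows, giving a Series indexed by (row number, time step).
long_df = toy["heartbeat_signals"].str.split(",", expand=True).stack()

# Move both index levels into columns, then use the row number as the index again.
long_df = long_df.reset_index().set_index("level_0")
long_df.index.name = None
long_df = long_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"})

# The split values are still strings at this point, so convert them to floats.
long_df["heartbeat_signals"] = long_df["heartbeat_signals"].astype(float)

print(long_df)
```

Each original row now occupies one row per time step, and the index still points back to the row of the source frame, which is what the later `join` relies on.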


In [20]:
# Attach the reshaped signals to the training data and store the label column separately
data_train_label = data_train['label']
data_train = data_train.drop(['label'], axis=1)
data_train = data_train.drop(['heartbeat_signals'], axis=1)
data_train = data_train.join(train_heartbeat_df)
data_train.info()
data_train.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Dtype  
---  ------             -----  
 0   id                 int64  
 1   time               int64  
 2   heartbeat_signals  float64
dtypes: float64(1), int64(2)
memory usage: 625.6 MB


Unnamed: 0,id,time,heartbeat_signals
0,0,0,0.99123
0,0,1,0.943533
0,0,2,0.764677
0,0,3,0.618571
0,0,4,0.379632
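
The `join` above works because both frames share the original row number as their index, so each `id` is repeated once per time step. A small sketch with made-up values (the frames `meta` and `signals_long` are hypothetical stand-ins for `data_train` and `train_heartbeat_df`):

```python
import pandas as pd

# One row per signal, indexed 0..1 (stand-in for data_train after dropping columns).
meta = pd.DataFrame({"id": [100, 101]})

# Long-format samples whose index repeats the row number of the signal they belong to.
signals_long = pd.DataFrame(
    {"time": [0, 1, 0, 1], "heartbeat_signals": [0.9, 0.5, 1.0, 0.4]},
    index=[0, 0, 1, 1],
)

# DataFrame.join aligns on the index; a repeated index on the right side
# repeats the matching left row once per sample.
joined = meta.join(signals_long)
print(joined)
```

The result has the `id`, `time`, and value columns that `extract_features` is later pointed at through `column_id` and `column_sort`.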


In [21]:
data_train[data_train["id"]==1]
data_train.describe()

Unnamed: 0,id,time,heartbeat_signals
count,20500000.0,20500000.0,20500000.0
mean,49999.5,102.0,0.2253953
std,28867.51,59.1777,0.2601665
min,0.0,0.0,0.0
25%,24999.75,51.0,0.0
50%,49999.5,102.0,0.1343601
75%,74999.25,153.0,0.3943378
max,99999.0,204.0,1.0


In [22]:
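# 1. Feature extraction
# MinimalFCParameters restricts tsfresh to a small set of basic statistics per signal
from tsfresh.feature_extraction import extract_features, MinimalFCParameters
train_features = extract_features(data_train, column_id='id', column_sort='time', default_fc_parameters=MinimalFCParameters())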
train_features.head()

Feature Extraction: 100%|██████████████████████| 10/10 [01:41<00:00, 10.15s/it]


Unnamed: 0,heartbeat_signals__sum_values,heartbeat_signals__median,heartbeat_signals__mean,heartbeat_signals__length,heartbeat_signals__standard_deviation,heartbeat_signals__variance,heartbeat_signals__root_mean_square,heartbeat_signals__maximum,heartbeat_signals__minimum
0,38.927945,0.125531,0.189892,205.0,0.229783,0.0528,0.298093,1.0,0.0
1,19.445634,0.030481,0.094857,205.0,0.16908,0.028588,0.193871,1.0,0.0
2,21.192974,0.0,0.10338,205.0,0.184119,0.0339,0.211157,1.0,0.0
3,42.113066,0.241397,0.20543,205.0,0.186186,0.034665,0.277248,1.0,0.0
4,69.756786,0.0,0.340277,205.0,0.366213,0.134112,0.499901,0.999908,0.0
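
To show what these nine columns mean, the following sketch recomputes them with a plain pandas `groupby` on a toy long-format frame. It is an illustration rather than tsfresh's implementation. It assumes population statistics (`ddof=0`) for the variance and standard deviation, which is consistent with the output above: for signal 0, 0.298093² ≈ 0.0528 + 0.189892². The toy values and the name `manual` are made up.

```python
import numpy as np
import pandas as pd

# Toy long-format data: two signals with four samples each (made-up values).
long_df = pd.DataFrame({
    "id": [0, 0, 0, 0, 1, 1, 1, 1],
    "time": [0, 1, 2, 3, 0, 1, 2, 3],
    "heartbeat_signals": [1.0, 0.5, 0.5, 0.0, 1.0, 0.0, 0.0, 0.0],
})

# Sort by time within each signal, then group the values by signal id.
grouped = long_df.sort_values(["id", "time"]).groupby("id")["heartbeat_signals"]

manual = pd.DataFrame({
    "sum_values": grouped.sum(),
    "median": grouped.median(),
    "mean": grouped.mean(),
    "length": grouped.size().astype(float),
    "standard_deviation": grouped.std(ddof=0),  # population standard deviation (assumption)
    "variance": grouped.var(ddof=0),            # population variance (assumption)
    "root_mean_square": grouped.agg(lambda x: np.sqrt(np.mean(np.square(x)))),
    "maximum": grouped.max(),
    "minimum": grouped.min(),
})

print(manual)
```

Several of these columns are tied together: `sum_values` is `mean` times `length`, and `root_mean_square` squared equals `variance` plus `mean` squared, so they carry overlapping information.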


In [23]:
train_features.info()
train_features.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   heartbeat_signals__sum_values          100000 non-null  float64
 1   heartbeat_signals__median              100000 non-null  float64
 2   heartbeat_signals__mean                100000 non-null  float64
 3   heartbeat_signals__length              100000 non-null  float64
 4   heartbeat_signals__standard_deviation  100000 non-null  float64
 5   heartbeat_signals__variance            100000 non-null  float64
 6   heartbeat_signals__root_mean_square    100000 non-null  float64
 7   heartbeat_signals__maximum             100000 non-null  float64
 8   heartbeat_signals__minimum             100000 non-null  float64
dtypes: float64(9)
memory usage: 7.6 MB


Unnamed: 0,heartbeat_signals__sum_values,heartbeat_signals__median,heartbeat_signals__mean,heartbeat_signals__length,heartbeat_signals__standard_deviation,heartbeat_signals__variance,heartbeat_signals__root_mean_square,heartbeat_signals__maximum,heartbeat_signals__minimum
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,46.20604,0.21235,0.225395,205.0,0.222657,0.053632,0.322009,0.997855,0.000113
std,24.303115,0.206664,0.118552,0.0,0.063688,0.031757,0.121655,0.011635,0.003512
min,2.788435,0.0,0.013602,205.0,0.05974,0.003569,0.080331,0.690931,0.0
25%,27.167043,0.018551,0.132522,205.0,0.175587,0.030831,0.226336,1.0,0.0
50%,40.333951,0.14505,0.196751,205.0,0.208876,0.043629,0.297231,1.0,0.0
75%,61.713348,0.371969,0.301041,205.0,0.25913,0.067148,0.391462,1.0,0.0
max,169.251087,0.931782,0.825615,205.0,0.483643,0.23391,0.84673,1.0,0.363879
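
In the summary above, `heartbeat_signals__length` is 205 for every signal (standard deviation 0), so it cannot help separate the classes. A quick way to spot such constant columns is sketched below on a toy frame; the values and the name `constant_cols` are illustrative only.

```python
import pandas as pd

# Toy feature table: one informative column and one constant column (made-up values).
features = pd.DataFrame({
    "heartbeat_signals__mean": [0.19, 0.09, 0.10],
    "heartbeat_signals__length": [205.0, 205.0, 205.0],
})

# A column with at most one distinct value carries no information about the label.
constant_cols = [
    col for col in features.columns
    if features[col].nunique(dropna=False) <= 1
]

print(constant_cols)
```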


In [24]:
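# 2. Feature selection
# Because MinimalFCParameters was used, train_features holds the 9 basic features shown above. With tsfresh's
# full default settings the same call yields several hundred features per signal (the figure of 779 quoted for
# this step presumably refers to such a run); all of them are described in the official documentation.
# Some features can come out as NaN when the data does not support their calculation. impute() does not drop
# them: it replaces NaN with the column median and -inf/+inf with the column minimum/maximum, in place.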
from tsfresh.utilities.dataframe_functions import impute
impute(train_features)

Unnamed: 0,heartbeat_signals__sum_values,heartbeat_signals__median,heartbeat_signals__mean,heartbeat_signals__length,heartbeat_signals__standard_deviation,heartbeat_signals__variance,heartbeat_signals__root_mean_square,heartbeat_signals__maximum,heartbeat_signals__minimum
0,38.927945,0.125531,0.189892,205.0,0.229783,0.052800,0.298093,1.000000,0.0
1,19.445634,0.030481,0.094857,205.0,0.169080,0.028588,0.193871,1.000000,0.0
2,21.192974,0.000000,0.103380,205.0,0.184119,0.033900,0.211157,1.000000,0.0
3,42.113066,0.241397,0.205430,205.0,0.186186,0.034665,0.277248,1.000000,0.0
4,69.756786,0.000000,0.340277,205.0,0.366213,0.134112,0.499901,0.999908,0.0
...,...,...,...,...,...,...,...,...,...
99995,63.323449,0.388402,0.308895,205.0,0.211636,0.044790,0.374441,1.000000,0.0
99996,69.657534,0.421138,0.339793,205.0,0.199966,0.039986,0.394266,1.000000,0.0
99997,40.897057,0.213306,0.199498,205.0,0.200657,0.040263,0.282954,1.000000,0.0
99998,42.333303,0.264974,0.206504,205.0,0.164380,0.027021,0.263941,1.000000,0.0
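
As a rough picture of what the imputation step does, here is a pandas sketch of the documented replacement rules (NaN to the column median, +inf to the column maximum, -inf to the column minimum, columns without any finite value to 0). The helper `impute_like` is hypothetical and not part of tsfresh; unlike `impute`, it returns a copy instead of modifying the frame in place.

```python
import numpy as np
import pandas as pd

def impute_like(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace NaN/inf column by column and return a copy."""
    out = df.copy()
    for col in out.columns:
        values = out[col].astype(float)
        finite = values[np.isfinite(values)]
        if finite.empty:
            # No usable value in this column: fall back to zeros.
            out[col] = 0.0
            continue
        # Infinities are clipped to the extreme finite values of the column.
        values = values.replace(np.inf, finite.max()).replace(-np.inf, finite.min())
        # Remaining NaN entries take the median of the finite values.
        out[col] = values.fillna(finite.median())
    return out

# Toy feature table with missing and infinite entries (made-up values).
raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf, -np.inf],
    "b": [np.nan] * 5,
})
cleaned = impute_like(raw)
print(cleaned)
```

With the minimal feature set used here, every column is already complete (see the non-null counts above), so the step only becomes important once larger feature sets are extracted.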


In [25]:
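# Next, select features by their relevance to the target variable. This takes two steps: first, the relevance of
# each feature to the target is tested individually, giving one p-value per feature; then the
# Benjamini-Yekutieli procedure [1] decides which features are kept.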
from tsfresh import select_features

# Select features by their relevance to the training labels
train_features_filtered = select_features(train_features, data_train_label)

train_features_filtered

Unnamed: 0,heartbeat_signals__sum_values,heartbeat_signals__median,heartbeat_signals__mean,heartbeat_signals__standard_deviation,heartbeat_signals__variance,heartbeat_signals__root_mean_square,heartbeat_signals__maximum,heartbeat_signals__minimum
0,38.927945,0.125531,0.189892,0.229783,0.052800,0.298093,1.000000,0.0
1,19.445634,0.030481,0.094857,0.169080,0.028588,0.193871,1.000000,0.0
2,21.192974,0.000000,0.103380,0.184119,0.033900,0.211157,1.000000,0.0
3,42.113066,0.241397,0.205430,0.186186,0.034665,0.277248,1.000000,0.0
4,69.756786,0.000000,0.340277,0.366213,0.134112,0.499901,0.999908,0.0
...,...,...,...,...,...,...,...,...
99995,63.323449,0.388402,0.308895,0.211636,0.044790,0.374441,1.000000,0.0
99996,69.657534,0.421138,0.339793,0.199966,0.039986,0.394266,1.000000,0.0
99997,40.897057,0.213306,0.199498,0.200657,0.040263,0.282954,1.000000,0.0
99998,42.333303,0.264974,0.206504,0.164380,0.027021,0.263941,1.000000,0.0
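
The second step, the Benjamini-Yekutieli procedure, can be sketched in a few lines of NumPy. This is a generic illustration of the procedure on made-up p-values, not tsfresh's internal code; the function name `benjamini_yekutieli` is hypothetical, and its `fdr_level` argument is named after the `select_features` parameter of the same name.

```python
import numpy as np

def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a boolean mask marking the p-values accepted as relevant."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, smallest first
    c_m = np.sum(1.0 / np.arange(1, m + 1))    # harmonic correction for dependent tests
    thresholds = np.arange(1, m + 1) * fdr_level / (m * c_m)
    below = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        # Keep everything up to the largest rank whose p-value is under its threshold.
        k = np.max(np.nonzero(below)[0])
        keep[order[: k + 1]] = True
    return keep

# One made-up p-value per feature.
p_values = [0.001, 0.8, 0.004, 0.03, 0.6]
mask = benjamini_yekutieli(p_values)
print(mask)
```

In the run above, the only column removed is `heartbeat_signals__length`, which is the same for every signal and so has no relationship with the label.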


In [26]:
train_features_filtered.info()
train_features_filtered.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   heartbeat_signals__sum_values          100000 non-null  float64
 1   heartbeat_signals__median              100000 non-null  float64
 2   heartbeat_signals__mean                100000 non-null  float64
 3   heartbeat_signals__standard_deviation  100000 non-null  float64
 4   heartbeat_signals__variance            100000 non-null  float64
 5   heartbeat_signals__root_mean_square    100000 non-null  float64
 6   heartbeat_signals__maximum             100000 non-null  float64
 7   heartbeat_signals__minimum             100000 non-null  float64
dtypes: float64(8)
memory usage: 6.9 MB


Unnamed: 0,heartbeat_signals__sum_values,heartbeat_signals__median,heartbeat_signals__mean,heartbeat_signals__standard_deviation,heartbeat_signals__variance,heartbeat_signals__root_mean_square,heartbeat_signals__maximum,heartbeat_signals__minimum
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,46.20604,0.21235,0.225395,0.222657,0.053632,0.322009,0.997855,0.000113
std,24.303115,0.206664,0.118552,0.063688,0.031757,0.121655,0.011635,0.003512
min,2.788435,0.0,0.013602,0.05974,0.003569,0.080331,0.690931,0.0
25%,27.167043,0.018551,0.132522,0.175587,0.030831,0.226336,1.0,0.0
50%,40.333951,0.14505,0.196751,0.208876,0.043629,0.297231,1.0,0.0
75%,61.713348,0.371969,0.301041,0.25913,0.067148,0.391462,1.0,0.0
max,169.251087,0.931782,0.825615,0.483643,0.23391,0.84673,1.0,0.363879
