# Task 3: Feature Engineering

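This is the Task 3 (feature engineering) part of the beginner data mining tutorial on heartbeat signal classification. It introduces feature engineering and analysis methods for time series data. Feedback and discussion are welcome.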

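Competition: 零基础入门数据挖掘-心跳信号分类预测 (Zero-Basics Introduction to Data Mining: Heartbeat Signal Classification Prediction)

Project URL:
Competition URL: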

## 3.1 Learning objectives

* Learn feature preprocessing methods for time series data
* Learn to use Tsfresh (TimeSeries Fresh), a feature processing tool for time series

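## 3.2 Contents
* Data preprocessing
	* Reshaping the time series data format
	* Adding the time-step feature `time`
* Feature engineering
	* Constructing time series features
	* Feature selection
	* Processing time series features with tsfresh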

## 3.3 Code examples

### 3.3.1 Importing packages and reading the data

In [19]:
# Import packages
import pandas as pd
import numpy as np
# tsfresh imports, left commented out in this cell:
# import tsfresh as tsf
# from tsfresh import extract_features, extract_relevant_features, select_features
# from tsfresh.utilities.dataframe_functions import impute
# from tsfresh.feature_extraction import ComprehensiveFCParameters

In [37]:
# Read the data
data_train = pd.read_csv('./data/train.csv')
data_test_A = pd.read_csv('./data/testA.csv')

print(data_train.shape)
print(data_test_A.shape)

(100000, 3)
(20000, 2)


In [38]:
data_train.head(10)

Unnamed: 0,id,heartbeat_signals,label
0,0,"0.9912297987616655,0.9435330436439665,0.764677...",0.0
1,1,"0.9714822034884503,0.9289687459588268,0.572932...",0.0
2,2,"1.0,0.9591487564065292,0.7013782792997189,0.23...",2.0
3,3,"0.9757952826275774,0.9340884687738161,0.659636...",0.0
4,4,"0.0,0.055816398940721094,0.26129357194994196,0...",2.0
5,5,"1.0,0.8675497147252661,0.5128848334259041,0.36...",0.0
6,6,"0.9505940730409403,0.9166910623948625,0.848396...",2.0
7,7,"0.8681707532149519,0.8318642354644805,0.531120...",2.0
8,8,"0.9792414410794537,0.6155508397973931,0.632268...",3.0
9,9,"0.9917559671326545,1.0,0.9740518898684873,0.93...",2.0
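Each `heartbeat_signals` cell in the preview above is one comma-separated string, so it can be worth checking how many samples each signal holds before splitting it. The following is a minimal sketch on a made-up two-row frame; the name `toy` and its values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for data_train; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1,0.0", "1.0,0.7,0.2,0.0"],
})

# Number of samples per signal = number of commas + 1.
signal_lengths = toy["heartbeat_signals"].str.count(",") + 1
print(signal_lengths.tolist())

# True when every signal has the same number of samples.
same_length = signal_lengths.nunique() == 1
print(same_length)
```

If the lengths differ, the split in the preprocessing step would produce missing values for the shorter signals, so this check is a cheap safeguard.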


In [30]:
data_train['label']

0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ... 
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float64

In [31]:
# Preview the training data without the label column (data_train itself is left unchanged)
data_train.drop(['label'], axis=1)

Unnamed: 0,id,heartbeat_signals
0,0,"0.9912297987616655,0.9435330436439665,0.764677..."
1,1,"0.9714822034884503,0.9289687459588268,0.572932..."
2,2,"1.0,0.9591487564065292,0.7013782792997189,0.23..."
3,3,"0.9757952826275774,0.9340884687738161,0.659636..."
4,4,"0.0,0.055816398940721094,0.26129357194994196,0..."
...,...,...
99995,99995,"1.0,0.677705342021188,0.22239242747868546,0.25..."
99996,99996,"0.9268571578157265,0.9063471198026871,0.636993..."
99997,99997,"0.9258351628306013,0.5873839035878395,0.633226..."
99998,99998,"1.0,0.9947621698382489,0.8297017704865509,0.45..."


In [39]:
data_test_A.head()

Unnamed: 0,id,heartbeat_signals
0,100000,"0.9915713654170097,1.0,0.6318163407681274,0.13..."
1,100001,"0.6075533139615096,0.5417083883163654,0.340694..."
2,100002,"0.9752726292239277,0.6710965234906665,0.686758..."
3,100003,"0.9956348033996116,0.9170249621481004,0.521096..."
4,100004,"1.0,0.8879490481178918,0.745564725322326,0.531..."


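### 3.3.2 Data preprocessing

In [40]:
# Reshape the heartbeat signals from one row per signal into one row per sample,
# and add a time-step feature `time` for each heartbeat signal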
train_heartbeat_df = data_train['heartbeat_signals'].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_heartbeat_df['heartbeat_signals'] = train_heartbeat_df['heartbeat_signals'].astype(float)

train_heartbeat_df

Unnamed: 0,time,heartbeat_signals
0,0,0.991230
0,1,0.943533
0,2,0.764677
0,3,0.618571
0,4,0.379632
...,...,...
99999,200,0.000000
99999,201,0.000000
99999,202,0.000000
99999,203,0.000000
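To make the reshaping above easier to follow, here is a minimal sketch of the same `split`/`stack`/`reset_index` chain on a made-up two-row frame. The names `toy`, `stacked`, and `long_df`, the column name `row`, and the values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for data_train; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.7,0.2"],
})

# expand=True gives one column per time step; stack() then moves those
# columns into a second index level: one entry per (row, time step) pair.
stacked = toy["heartbeat_signals"].str.split(",", expand=True).stack()

# reset_index() exposes the two unnamed index levels as level_0 and level_1.
long_df = stacked.reset_index()
long_df = long_df.rename(
    columns={"level_0": "row", "level_1": "time", 0: "heartbeat_signals"}
)

# The split values are still strings, so convert them to floats.
long_df["heartbeat_signals"] = long_df["heartbeat_signals"].astype(float)
print(long_df)
```

Here `level_0` is the original row position and `level_1` is the position of the sample within the signal, which is why the tutorial renames the latter to `time`.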


In [41]:
# Join the reshaped heartbeat signals back onto the training data,
# and store the label column of the training data separately
data_train_label = data_train['label']
data_train = data_train.drop(['label'], axis=1)
data_train = data_train.drop(['heartbeat_signals'], axis=1)
data_train = data_train.join(train_heartbeat_df)

data_train

Unnamed: 0,id,time,heartbeat_signals
0,0,0,0.991230
0,0,1,0.943533
0,0,2,0.764677
0,0,3,0.618571
0,0,4,0.379632
...,...,...,...
99999,99999,200,0.000000
99999,99999,201,0.000000
99999,99999,202,0.000000
99999,99999,203,0.000000


In [42]:
data_train[data_train["id"]==1]

Unnamed: 0,id,time,heartbeat_signals
1,1,0,0.971482
1,1,1,0.928969
1,1,2,0.572933
1,1,3,0.178457
1,1,4,0.122962
...,...,...,...
1,1,200,0.000000
1,1,201,0.000000
1,1,202,0.000000
1,1,203,0.000000
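The reshaping and join steps above could be wrapped in one function so that the same preprocessing can be reused on other frames with the same layout. This is a sketch under assumptions: the helper name `to_long_format` is hypothetical, and the toy frame and its values are made up.

```python
import pandas as pd


def to_long_format(df, signal_col="heartbeat_signals"):
    """Return one row per (id, time) pair. Hypothetical helper, not part of the tutorial."""
    # Split each signal string and stack it into one row per sample.
    long_signals = df[signal_col].str.split(",", expand=True).stack().reset_index()
    # Restore the original row index so join() can align on it.
    long_signals = long_signals.set_index("level_0")
    long_signals.index.name = None
    long_signals = long_signals.rename(columns={"level_1": "time", 0: signal_col})
    long_signals[signal_col] = long_signals[signal_col].astype(float)
    # join() aligns on the index, so each original row is repeated once per time step.
    return df.drop(columns=[signal_col]).join(long_signals)


# Toy stand-in for a frame with `id` and `heartbeat_signals` columns.
toy_test = pd.DataFrame({
    "id": [100000, 100001],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.7,0.2"],
})
long_test = to_long_format(toy_test)
print(long_test)
```

Under these assumptions the helper could be called on `data_test_A` as well, since that frame has the same `id` and `heartbeat_signals` columns.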


In [43]:
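# Extract a minimal set of features for each signal with tsfresh:
# column_id identifies the series, column_sort gives the order of its samples
from tsfresh.feature_extraction import extract_features, MinimalFCParameters
train_features = extract_features(data_train, column_id='id', column_sort='time', default_fc_parameters=MinimalFCParameters())
train_features.head()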

NvvmError: Failed to compile

<unnamed> (114, 19): parse expected comma after load's type
NVVM_ERROR_COMPILATION
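In an environment where tsfresh cannot be used, as in the error output above, a few basic per-signal statistics can still be computed with pandas alone. The sketch below is a pandas-only stand-in and not tsfresh's implementation: the chosen statistics, the resulting column names, and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy long-format frame with the same columns as the reshaped data_train;
# the values are made up for illustration.
long_df = pd.DataFrame({
    "id":   [0, 0, 0, 1, 1, 1],
    "time": [0, 1, 2, 0, 1, 2],
    "heartbeat_signals": [1.0, 0.5, 0.0, 0.2, 0.4, 0.6],
})

# Group the samples of each signal, keeping them in time order.
grouped = long_df.sort_values(["id", "time"]).groupby("id")["heartbeat_signals"]

# One row per id, one column per summary statistic.
basic_features = grouped.agg(["sum", "mean", "median", "max", "min", "count"])

# Population standard deviation (ddof=0); pandas defaults to ddof=1.
basic_features["std"] = grouped.std(ddof=0)
print(basic_features)
```

The result has one row per `id`, which is the same overall shape a per-signal feature table needs before it is joined with the labels.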
