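# Overview

I. Competition Background

Predictive maintenance is often called the "crown jewel" of industrial internet applications, and the key to it is effectively predicting the life of an equipment system or its core components. Predicting the remaining life of the core wear components of construction machinery makes it possible to maintain or replace those components ahead of time. This reduces unplanned downtime for the whole machine and avoids the economic losses that come with it, for example when other equipment on the production site has to wait for the failed component to be repaired.

II. Competition Task

For this competition, 中科云谷科技有限公司 provides working data for the core wear components of one class of construction machinery, covering several kinds of operating-condition data such as component working hours, rotational speed, temperature, voltage, and current. Participants are expected to use big data analysis, machine learning, deep learning, or similar methods to extract suitable features, build a suitable life prediction model, and predict the remaining life of the core wear components.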

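III. Open Data

For the core wear components of this class of construction machinery, the dataset consists of a training set and a test set.

In the training set, each file holds the full-life IoT sampling data of one component of this class, that is, the data from installation through to replacement, in the form of a multivariate time series. The maximum value of the field `部件工作时长` (component working hours), usually found in the last row, is the actual life of that component instance. (See the sample data.)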
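As a minimal sketch of how the label follows from this definition, the snippet below takes the maximum of `部件工作时长` rather than reading the last row, since the text only says the maximum is usually the last record. The helper name `component_life` and the toy frame are illustrative assumptions, not part of the competition material.

```python
import pandas as pd

WORK_HOURS = "部件工作时长"  # component working hours, as named in the data files


def component_life(df: pd.DataFrame) -> float:
    """Return the actual life of one component: the max of its working hours."""
    return float(df[WORK_HOURS].max())


# Toy full-life series with made-up values; the maximum is not on the last row.
toy = pd.DataFrame({WORK_HOURS: [0.0, 120.5, 7942.5, 7940.0]})
life = component_life(toy)
print(life)
```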

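In the test set, each file holds the IoT sampling data of one component of this class over a period of time. The task is to predict, from that segment, the remaining life of the component from that point on.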
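Because training files cover a whole life while test files cover only part of one, one possible way to obtain training samples shaped like the test data is to cut a full-life series at some point and use the time left as the label. The sketch below is an assumption about how this could be done, not the organizers' procedure; `truncate_at` and the cut ratio are hypothetical.

```python
import pandas as pd

WORK_HOURS = "部件工作时长"  # component working hours, as named in the data files


def truncate_at(df: pd.DataFrame, ratio: float):
    """Keep rows up to ratio * life and return (observed part, remaining life)."""
    life = df[WORK_HOURS].max()
    cutoff = life * ratio
    observed = df[df[WORK_HOURS] <= cutoff]
    # Remaining life is measured from the last observed working-hours value.
    remaining = life - observed[WORK_HOURS].max()
    return observed, float(remaining)


# Toy full-life series with made-up values.
toy = pd.DataFrame({WORK_HOURS: [0.0, 10.0, 20.0, 30.0, 40.0]})
observed, remaining = truncate_at(toy, ratio=0.5)
print(len(observed), remaining)
```

Cutting the same file at several ratios would yield several samples per component, at the cost of samples that are strongly correlated with each other.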

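The feature fields are: `部件工作时长` (component working hours), `累积量参数1` (cumulative parameter 1), `累积量参数2` (cumulative parameter 2), `转速信号1` (rotational speed signal 1), `转速信号2` (rotational speed signal 2), `压力信号1` (pressure signal 1), `压力信号2` (pressure signal 2), `温度信号` (temperature signal), `流量信号` (flow signal), `电流信号` (current signal), `开关1信号` (switch 1 signal), `开关2信号` (switch 2 signal), `告警信号1` (alarm signal 1), and `设备类型` (device type). Of these:

Numeric fields: `部件工作时长`, `累积量参数1`, `累积量参数2`, `转速信号1`, `转速信号2`, `压力信号1`, `压力信号2`, `温度信号`, `流量信号`, `电流信号`.

Switch fields (0 or 1): `开关1信号`, `开关2信号`, `告警信号1`.

String field: `设备类型`.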
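The three field groups can be written down once and reused when loading files. The following is a minimal sketch, assuming one wants the 0/1 fields stored as small integers and the device type as a categorical; the constant names and `cast_field_types` are hypothetical.

```python
import pandas as pd

NUMERIC_COLS = ["部件工作时长", "累积量参数1", "累积量参数2", "转速信号1", "转速信号2",
                "压力信号1", "压力信号2", "温度信号", "流量信号", "电流信号"]
SWITCH_COLS = ["开关1信号", "开关2信号", "告警信号1"]
CATEGORY_COLS = ["设备类型"]


def cast_field_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with switch fields as int8 and the device type as category."""
    dtype_map = {col: "int8" for col in SWITCH_COLS}
    dtype_map.update({col: "category" for col in CATEGORY_COLS})
    # Note: the int8 cast raises if a switch column contains missing values.
    return df.astype(dtype_map)


# One made-up row that follows the documented schema.
row = {col: [1.5] for col in NUMERIC_COLS}
row.update({col: [1.0] for col in SWITCH_COLS})
row["设备类型"] = ["S26a"]
typed = cast_field_types(pd.DataFrame(row))
print(typed.dtypes)
```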

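Apart from the switch fields, the specific values of the device type and operating-condition data above have been desensitized to some extent, in a way intended to preserve the relationships in the data as far as possible.

The prediction accuracy of an algorithm is measured with the following formula:

![Construction machinery life prediction formula](工程机械寿命预测公式.png)

Here $r_i$ is the true remaining life of the $i$-th sample and $\hat{r}_i$ is the predicted remaining life of the $i$-th sample.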

# Data Exploration

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
sns.set_style({'font.sans-serif':['Microsoft YaHei','Arial']})

## First Look at the Data

In [2]:
first_train_research = pd.read_csv('./train/00fb58ecd675062e4423.csv')

In [3]:
first_train_research.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55859 entries, 0 to 55858
Data columns (total 14 columns):
部件工作时长    55859 non-null float64
累积量参数1    55859 non-null float64
累积量参数2    55859 non-null float64
转速信号1     55859 non-null float64
转速信号2     55859 non-null float64
压力信号1     55859 non-null float64
压力信号2     55859 non-null float64
温度信号      55859 non-null float64
流量信号      55859 non-null float64
电流信号      55859 non-null float64
开关1信号     55859 non-null float64
开关2信号     55859 non-null float64
告警信号1     55859 non-null float64
设备类型      55859 non-null object
dtypes: float64(13), object(1)
memory usage: 6.0+ MB


In [4]:
pd.isnull(first_train_research).sum()

部件工作时长    0
累积量参数1    0
累积量参数2    0
转速信号1     0
转速信号2     0
压力信号1     0
压力信号2     0
温度信号      0
流量信号      0
电流信号      0
开关1信号     0
开关2信号     0
告警信号1     0
设备类型      0
dtype: int64

In [5]:
first_train_research.head()

| | 部件工作时长 | 累积量参数1 | 累积量参数2 | 转速信号1 | 转速信号2 | 压力信号1 | 压力信号2 | 温度信号 | 流量信号 | 电流信号 | 开关1信号 | 开关2信号 | 告警信号1 | 设备类型 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 10801.19 | 24614.69 | 67.86 | 372.86 | 42.2 | 132.66 | 1627.52 | 0.0 | 0.0 | 0.0 | S26a |
| 1 | 0.0 | 0.0 | 0.0 | 7666.51 | 17452.22 | 76.95 | 374.27 | 42.2 | 135.06 | 1627.52 | 0.0 | 0.0 | 0.0 | S26a |
| 2 | 0.0 | 0.0 | 0.0 | 7661.61 | 17451.8 | 85.17 | 373.47 | 42.2 | 134.68 | 1627.65 | 0.0 | 0.0 | 0.0 | S26a |
| 3 | 0.0 | 0.0 | 0.0 | 7656.61 | 17452.76 | 86.35 | 373.86 | 42.2 | 134.69 | 1627.69 | 0.0 | 0.0 | 0.0 | S26a |
| 4 | 0.0 | 0.0 | 0.0 | 7657.57 | 17448.74 | 86.39 | 374.54 | 42.2 | 132.8 | 1627.63 | 0.0 | 0.0 | 0.0 | S26a |


In [6]:
first_train_research.describe()

| | 部件工作时长 | 累积量参数1 | 累积量参数2 | 转速信号1 | 转速信号2 | 压力信号1 | 压力信号2 | 温度信号 | 流量信号 | 电流信号 | 开关1信号 | 开关2信号 | 告警信号1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 | 55859.0 |
| mean | 4066.770592 | 76622.149242 | 81165.021975 | 7828.778586 | 17146.625721 | 118.529115 | 341.903339 | 60.995861 | 83.571954 | 829.33917 | 0.315527 | 0.0 | 0.029843 |
| std | 2227.611652 | 43287.889501 | 45491.092616 | 2924.623679 | 7760.556261 | 103.325901 | 61.408848 | 11.628993 | 43.019423 | 411.934032 | 0.464729 | 0.0 | 0.170156 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 | 0.0 | 5.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2181.25 | 39674.25 | 42402.75 | 5202.4 | 11845.52 | 67.41 | 334.92 | 55.6 | 49.03 | 620.1 | 0.0 | 0.0 | 0.0 |
| 50% | 4185.5 | 77357.5 | 82797.0 | 5322.17 | 12005.35 | 68.97 | 354.34 | 62.1 | 73.55 | 620.21 | 0.0 | 0.0 | 0.0 |
| 75% | 5880.25 | 112950.75 | 119518.75 | 10728.185 | 24432.87 | 173.175 | 371.64 | 67.9 | 132.35 | 1178.27 | 1.0 | 0.0 | 0.0 |
| max | 7942.5 | 152438.0 | 159607.5 | 12985.95 | 28973.53 | 950.44 | 472.03 | 115.6 | 138.5 | 1627.8 | 1.0 | 0.0 | 1.0 |
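In the summary above, `开关2信号` has a mean, standard deviation, minimum, and maximum of 0, so within this file it never changes. A small sketch for flagging such columns follows; `constant_columns` is a hypothetical helper, the toy frame is made up, and a column that is constant in one file may still vary in others.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy frame with made-up values: one varying switch, one constant switch,
# and a device type that is the same on every row.
toy = pd.DataFrame({
    "开关1信号": [0.0, 1.0, 0.0],
    "开关2信号": [0.0, 0.0, 0.0],
    "设备类型": ["S26a", "S26a", "S26a"],
})
flagged = constant_columns(toy)
print(flagged)
```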


## Merging the Data

In [7]:
df_train_label = pd.read_csv('df_train_label.csv')

# DataFrame.append was removed in pandas 2.0 and copies the whole frame on every call,
# so read every training file first and concatenate once.
parts_df = pd.concat(
    [pd.read_csv('./train/' + name) for name in df_train_label['train_file_name']]
)
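Once the files are merged, rows from different components can no longer be told apart unless the source file is recorded. A minimal sketch, assuming one wants to keep the file name as an index level; it uses in-memory toy frames in place of the CSV files, and the file names and values are made up.

```python
import pandas as pd

# Stand-ins for the frames returned by pd.read_csv, keyed by a made-up file name.
frames = {
    "a.csv": pd.DataFrame({"部件工作时长": [0.0, 5.0]}),
    "b.csv": pd.DataFrame({"部件工作时长": [0.0, 3.0, 9.0]}),
}

# Passing a dict makes its keys the outer index level of the merged frame.
merged = pd.concat(frames, names=["train_file_name", "row"])

# With the file name kept, per-component quantities stay recoverable after merging.
life_per_file = merged.groupby(level="train_file_name")["部件工作时长"].max()
print(life_per_file)
```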