# Task05. 自定义时序数据集的预处理与插补

在本节中，我们将以**合成的 eICU 数据集**为例，演示如何将自定义的医疗时间序列数据预处理为 [PyPOTS](https://github.com/WenjieDu/PyPOTS) 框架所需的输入格式，并使用 PyPOTS 进行插补。

## 关于 eICU 数据集

> The eICU Collaborative Research Database is a freely available multi-center database for critical care research.  
> **Reference**:  
> Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, and Badawi O. (2018). *The eICU Collaborative Research Database: A multi-center critical care database for research*. Scientific Data. DOI: [10.1038/sdata.2018.178](http://dx.doi.org/10.1038/sdata.2018.178)  
> Available at: [https://www.nature.com/articles/sdata2018178](https://www.nature.com/articles/sdata2018178)

eICU 数据库包含来自多家医院的 ICU 病患监护记录，是医疗时间序列研究的重要开源资源。在本示例中，我们使用经过脱敏和合成的 eICU 数据集，以避免隐私风险，同时保证数据结构与真实医疗数据一致。

## 任务目标

- 预处理表格格式的医疗时序数据为 PyPOTS 可用格式。
- 使用 PyPOTS 进行插补并还原数据。
- 生成可供后续分析或模型训练的数据集。

## 主要步骤

1. **数据加载**  
   加载原始时序数据，包括特征、标签和样本标识。

2. **构建三维张量**  
   - 将不同样本的特征对齐到统一的时间步长度。
   - 构造三维张量 `(n_samples, n_steps, n_features)`。

3. **数据插补**  
   使用 PyPOTS 提供的插补算法对张量中的缺失值进行填充。

4. **还原 DataFrame 结构**  
   将插补后的张量转换回 DataFrame 形式，保留样本 ID、时间步、特征和标签。

5. **结果保存**  
   将插补结果保存为 `.csv` 或 `.npy` 以供后续分析或建模使用。

## 结果说明

执行完以上步骤后，你将得到三个预处理完成的数据集：
- `df_train_imputed`：训练集插补结果
- `df_val_imputed`：验证集插补结果
- `df_test_imputed`：测试集插补结果

## 示例输出检查

通过 `.shape` 查看数据集维度，确认处理无误：


### 1. 自定义时序数据集的预处理与插补

In [1]:
import pypots
import numpy as np
import pandas as pd
import tsdb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from benchpots.utils.logging import logger, print_final_dataset_info
from benchpots.utils.missingness import create_missingness # 生成人工缺失值

  from .autonotebook import tqdm as notebook_tqdm


[34m
████████╗██╗███╗   ███╗███████╗    ███████╗███████╗██████╗ ██╗███████╗███████╗    █████╗ ██╗
╚══██╔══╝██║████╗ ████║██╔════╝    ██╔════╝██╔════╝██╔══██╗██║██╔════╝██╔════╝   ██╔══██╗██║
   ██║   ██║██╔████╔██║█████╗█████╗███████╗█████╗  ██████╔╝██║█████╗  ███████╗   ███████║██║
   ██║   ██║██║╚██╔╝██║██╔══╝╚════╝╚════██║██╔══╝  ██╔══██╗██║██╔══╝  ╚════██║   ██╔══██║██║
   ██║   ██║██║ ╚═╝ ██║███████╗    ███████║███████╗██║  ██║██║███████╗███████║██╗██║  ██║██║
   ╚═╝   ╚═╝╚═╝     ╚═╝╚══════╝    ╚══════╝╚══════╝╚═╝  ╚═╝╚═╝╚══════╝╚══════╝╚═╝╚═╝  ╚═╝╚═╝
ai4ts v0.0.2 - building AI for unified time-series analysis, https://time-series.ai [0m



### 1.1 数据加载

In [2]:
df = pd.read_csv('attachments/synthetic_eicu.csv')
df.head()

Unnamed: 0,sample_id,timestamp,apacheadmissiondx,ethnicity,gender,GCS Total,Eyes,Motor,Verbal,admissionheight,...,MAP (mmHg),Invasive BP Diastolic,Invasive BP Systolic,O2 Saturation,Respiratory Rate,Temperature (C),glucose,FiO2,pH,label
0,0,0,17.0,394.0,398.0,,,,,182.9,...,80.0,56.0,119.0,99.0,,,,,,0
1,0,1,17.0,394.0,398.0,,,,,182.9,...,79.0,56.0,112.0,98.0,,,,,,0
2,0,2,17.0,394.0,398.0,413.0,,,,182.9,...,75.0,56.0,112.0,98.0,20.0,35.3,,,,0
3,0,3,17.0,394.0,398.0,,,,,182.9,...,79.0,58.0,108.0,97.0,,,,,,0
4,0,4,17.0,394.0,398.0,,,,,182.9,...,76.0,55.0,111.0,91.0,,,,,,0


In [3]:
'''
确保时间步长的一致性：
如果自定义数据的时间序列长度不一，则需要通过用缺失值 (NaN) 填充较短的序列或截断较长的序列来对其进行标准化。
我们来设置一个最大长度，例如，我们有 48 个时间步长，表示每个患者 48 小时的记录（可以根据数据进行调整）。
'''

max_length = 48

def pad_truncate(df):
    if len(df) > max_length:
        # 如果 DataFrame 超过最大长度，则截断
        # 这里我们选择保留前 max_length 行
        # 你也可以选择其他策略，比如保留最后 max_length 行
        return df.iloc[:max_length]
    else:
        # 如果 DataFrame 少于最大长度，则填充
        # 这里我们用 NaN 填充
        # 你也可以选择其他填充值，比如 0 或者均值等
        padding = pd.DataFrame(
            index=range(max_length - len(df)),
            columns=df.columns
        )
        if not padding.empty:
            return pd.concat([df, padding])
        else:
            return df

# 对每个患者的时间序列进行填充或截断
# 这里假设 'sample_id' 是患者的唯一标识符
# 你需要根据你的数据集中的实际列名进行调整
new_df = df.groupby('sample_id').apply(pad_truncate).reset_index(drop=True)

  new_df = df.groupby('sample_id').apply(pad_truncate).reset_index(drop=True)


### 1.2 数据拆分

In [4]:
unique_sample_ids = new_df['sample_id'].unique()

train_ids, temp_ids = train_test_split(unique_sample_ids, test_size=0.2, random_state=42)
val_ids, test_ids = train_test_split(temp_ids, test_size=0.5, random_state=42)

train_df = new_df[new_df['sample_id'].isin(train_ids)]
val_df = new_df[new_df['sample_id'].isin(val_ids)]
test_df = new_df[new_df['sample_id'].isin(test_ids)]

print(f"Train DataFrame shape: {train_df.shape}")
print(f"Validation DataFrame shape: {val_df.shape}")
print(f"Test DataFrame shape: {test_df.shape}")

# 拆分特征和标签
def separate_features_labels(df, feature_cols, label_col='label'):
    X = df[feature_cols].values.reshape(-1, 48, len(feature_cols))
    # 获取唯一的样本 ID
    unique_ids = df['sample_id'].unique()
    # 获取每个样本 ID 的第一个标签
    y = df.groupby('sample_id')[label_col].first().loc[unique_ids].values
    return X, y

# 选择特征列
feature_columns = [col for col in df.columns if col not in ['sample_id', 'label', 'timestamp']]

train_X, train_y = separate_features_labels(train_df.copy(), feature_columns)
val_X, val_y = separate_features_labels(val_df.copy(), feature_columns)
test_X, test_y = separate_features_labels(test_df.copy(), feature_columns)

print(f"Train features shape: {train_X.shape}, Train labels shape: {train_y.shape}")
print(f"Validation features shape: {val_X.shape}, Validation labels shape: {val_y.shape}")
print(f"Test features shape: {test_X.shape}, Test labels shape: {test_y.shape}")

Train DataFrame shape: (235584, 23)
Validation DataFrame shape: (29472, 23)
Test DataFrame shape: (29472, 23)
Train features shape: (4908, 48, 20), Train labels shape: (4908,)
Validation features shape: (614, 48, 20), Validation labels shape: (614,)
Test features shape: (614, 48, 20), Test labels shape: (614,)


### 1.3 数据标准化

In [5]:
scaler = StandardScaler()
# Flatten the data before scaling and then reshape it into time series samples
train_X = scaler.fit_transform(train_X.reshape(-1, train_X.shape[-1])).reshape(train_X.shape)
val_X = scaler.transform(val_X.reshape(-1, val_X.shape[-1])).reshape(val_X.shape)
test_X = scaler.transform(test_X.reshape(-1, test_X.shape[-1])).reshape(test_X.shape)

In [6]:
processed_dataset = {
        # general info
        "n_classes": len(np.unique(train_y)),
        "n_steps": train_X.shape[-2],
        "n_features": train_X.shape[-1],
        "scaler": scaler,
        # train set
        "train_X": train_X,
        "train_y": train_y.flatten(),
        # val set
        "val_X": val_X,
        "val_y": val_y.flatten(),
        # test set
        "test_X": test_X,
        "test_y": test_y.flatten(),
    }

### 1.4 创建人工缺失值

In [7]:
# 保留原始数据中的ground truth以用于评估
train_X_ori = train_X
val_X_ori = val_X
test_X_ori = test_X

rate = 0.3 # 30%缺失率

# 在训练集上创建缺失值作为ground truth
train_X = create_missingness(train_X, rate, 'point')

# 在验证集上创建缺失值作为ground truth
val_X = create_missingness(val_X, rate, 'point' )

# 在测试集上创建缺失值作为ground truth
test_X = create_missingness(test_X, rate, 'point' )


processed_dataset["train_X"] = train_X
processed_dataset["val_X"] = val_X
processed_dataset["test_X"] = test_X

processed_dataset['train_X_ori'] = train_X_ori
processed_dataset['val_X_ori'] = val_X_ori
processed_dataset['test_X_ori'] = test_X_ori

In [8]:
from pypots.data.saving import pickle_dump

pickle_dump(processed_dataset, "result_saving/processed_synthetic_eicu.pkl")

2025-05-10 23:34:06 [INFO]: Successfully saved to result_saving/processed_synthetic_eicu.pkl


### 1.5 准备用于插补的数据

In [9]:
# 计算掩码来指示X_ori数据中的真实位置，将被用来评估模型性能

train_X_indicating_mask = np.isnan(train_X_ori) ^ np.isnan(train_X)
val_X_indicating_mask = np.isnan(val_X_ori) ^ np.isnan(val_X)
test_X_indicating_mask = np.isnan(test_X_ori) ^ np.isnan(test_X)

# 组装训练集
dataset_for_training = {
    "X": processed_dataset['train_X'],
    'X_ori': processed_dataset['train_X_ori'],
}

# 组装验证集
dataset_for_validating = {
    "X": processed_dataset['val_X'],
    "X_ori": processed_dataset['val_X_ori'],
}

# 组装测试集
dataset_for_testing = {
    "X": processed_dataset['test_X'],
    "X_ori": processed_dataset['test_X_ori'],
  }

test_X_indicating_mask = np.isnan(processed_dataset['test_X_ori']) ^ np.isnan(processed_dataset['test_X'])

# 度量函数不接受 NaN 输入，因此用 0 填充 NaN
test_X_ori = np.nan_to_num(processed_dataset['test_X_ori'])

# 2. 使用SAITS对自定义数据集中的缺失值进行插补

### 2.1 插补数据

In [10]:
from pypots.nn.functional import calc_mae
from pypots.optim import Adam
from pypots.imputation import SAITS

# 设置模型的运行设备为cpu, 如果你有gpu设备可以设置为cuda
DEVICE='cpu'

# 创建 SAITS 模型
# SAITS 模型的参数可以根据需要进行调整
saits = SAITS(
    n_steps=processed_dataset['n_steps'],
    n_features=processed_dataset['n_features'],
    n_layers=1,
    d_model=256,
    d_ffn=128,
    n_heads=4,
    d_k=64,
    d_v=64,
    dropout=0.1,
    # 你可以调整参数ORT_weight和MIT_weight的权重值，以使SAITS模型更多地关注于一个任务。通常你可以让它们保持默认值，比如1
    ORT_weight=1,
    MIT_weight=1,
    batch_size=32,
    # 这里为了快速演示我们将epochs设置为10，你可以将其设置为100或更多以获得更好的结果
    epochs=10,
    # 这里我们设置patience=3，如果连续3个epoch的评估loss没有减少，则提前停止训练。你可以不设置它,则默认为None,禁用早停机制
    patience=3,
    # 设置优化器。不同于torch.optim。在初始化pypots.optimizer时，你不必指定模型的参数。您也可以不设置它, 它将默认初始化一个lr=0.001的Adam优化器。
    optimizer=Adam(lr=1e-3),
    # 这个num_workers参数用于torch.utils.data.Dataloader。它是用于数据加载的子进程的数量。让它默认为0意味着数据加载将在主进程中，即不会有子进程。如果你认为数据加载是模型训练速度的瓶颈，则可以将其增加
    num_workers=0,
    # 如果不设置device, PyPOTS将自动为你分配最佳设备。这里我们将其设置为“cpu”。你也可以设置为'cuda', ‘cuda:0’或‘cuda:1’，如果你有多个cuda设备，甚至并行['cuda:0', 'cuda:1']
    device=DEVICE,
    # 设置保存tensorboard和训练模型文件的路径
    saving_path="result_saving/imputation/saits",
    # 训练完成后只保存最好的模型。你还可以将其设置为“better”，以保存在训练期间每一次在val set上表现得比之前更好的模型
    model_saving_strategy="best",
)

# 训练阶段，使用训练集和验证集
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# 测试阶段，插补缺失值
test_set_imputation = saits.impute(dataset_for_testing)

# 根据真实值（人为缺失的值）计算平均绝对误差
testing_mae = calc_mae(
    test_set_imputation,
    test_X_ori,
    test_X_indicating_mask,
)
print(f"Testing mean absolute error: {testing_mae:.4f}")


2025-05-10 23:34:06 [INFO]: Using the given device: cpu
2025-05-10 23:34:06 [INFO]: Model files will be saved to result_saving/imputation/saits/20250510_T233406
2025-05-10 23:34:06 [INFO]: Tensorboard file will be saved to result_saving/imputation/saits/20250510_T233406/tensorboard
2025-05-10 23:34:06 [INFO]: Using customized MAE as the training loss function.
2025-05-10 23:34:06 [INFO]: Using customized MSE as the validation metric function.
2025-05-10 23:34:06 [INFO]: SAITS initialized with the given hyperparameters, the number of trainable parameters: 691,248
2025-05-10 23:34:18 [INFO]: Epoch 001 - training loss (MAE): 0.7821, validation MSE: 0.2482
2025-05-10 23:34:28 [INFO]: Epoch 002 - training loss (MAE): 0.4996, validation MSE: 0.2092
2025-05-10 23:34:40 [INFO]: Epoch 003 - training loss (MAE): 0.4473, validation MSE: 0.1983
2025-05-10 23:34:52 [INFO]: Epoch 004 - training loss (MAE): 0.4167, validation MSE: 0.1893
2025-05-10 23:35:04 [INFO]: Epoch 005 - training loss (MAE): 0.

Testing mean absolute error: 0.2214


In [11]:
# 插补训练集和验证集
train_set_imputation = saits.impute(dataset_for_training)
val_set_imputation = saits.impute(dataset_for_validating)
test_set_imputation = saits.impute(dataset_for_testing)

In [12]:
from pypots.data.saving import pickle_dump

processed_dataset['train_X'] = train_set_imputation
processed_dataset['val_X'] = val_set_imputation
processed_dataset['test_X'] = test_set_imputation
pickle_dump(processed_dataset, "result_saving/imputed_synthetic_eicu.pkl")

2025-05-10 23:36:06 [INFO]: Successfully saved to result_saving/imputed_synthetic_eicu.pkl


### 2.2 如果需要的话可以将3D NumPy数组还原回原始的DataFrame

In [13]:
def convert_to_dataframe(X, labels, sample_ids, scaler, invers_norm = False, n_steps=48):
    """
    Convert 3D NumPy array to a DataFrame with sample_id, timestamp, and original scale features.

    Parameters:
    - X: 3D NumPy array of shape (n_samples, n_steps, n_features)
    - labels: 1D NumPy array of shape (n_samples,) -> labels for each sample
    - sample_ids: 1D NumPy array with sample IDs corresponding to each sample
    - scaler: Scaler used for normalization (MinMaxScaler/StandardScaler)
    - n_steps: Number of time steps (default: 48)

    Returns:
    - DataFrame with sample_id, timestamp, features, and labels
    """
    n_samples, _, n_features = X.shape

    assert len(feature_columns) == n_features, "Number of features in X does not match feature_columns"
    assert len(labels) == n_samples, "Number of labels does not match number of samples"
    assert len(sample_ids) == n_samples, "Number of sample IDs does not match number of samples"

    # extract the last timestep record for each sample_id  to get one row per sample,
    # using the final timestep’s data (e.g., the last hour if n_steps=48 represents hourly data)

    X_last = X[:, -1, :]  # Shape: (n_samples, n_features)

    # Inverse normalization
    if invers_norm:
      X_original = scaler.inverse_transform(X_last)
    else:
      X_original = X_last


    # Create DataFrame
    df = pd.DataFrame(X_original, columns=feature_columns)
    df['sample_id'] = sample_ids
    df['timestamp'] = n_steps - 1  # Last timestep (e.g., 47 if 0-indexed)
    df['label'] = labels

    # Reorder columns: sample_id, timestamp, features, label
    df = df[['sample_id', 'timestamp'] + feature_columns + ['label']]

    return df

In [14]:
df_train_imputed = convert_to_dataframe(train_set_imputation, train_y, train_ids, scaler)
df_val_imputed = convert_to_dataframe(val_set_imputation, val_y, val_ids, scaler)
df_test_imputed = convert_to_dataframe(test_set_imputation, test_y, test_ids, scaler)

# 检查数据集的形状
print(df_train_imputed.shape, df_val_imputed.shape, df_test_imputed.shape)

(4908, 23) (614, 23) (614, 23)


In [15]:
df_train_imputed.head()

Unnamed: 0,sample_id,timestamp,apacheadmissiondx,ethnicity,gender,GCS Total,Eyes,Motor,Verbal,admissionheight,...,MAP (mmHg),Invasive BP Diastolic,Invasive BP Systolic,O2 Saturation,Respiratory Rate,Temperature (C),glucose,FiO2,pH,label
0,3098,47,-0.676732,0.3022,0.918308,0.514112,0.54517,0.353181,0.649388,1.156228,...,0.318641,-0.023323,0.19702,-0.023408,1.652006,-0.371437,-0.154864,-0.288266,0.489719,0
1,4221,47,-0.516926,0.419554,-1.088959,0.365205,0.246216,0.165949,0.347976,-1.677569,...,-0.578688,-0.616842,-0.466705,-0.496805,-0.905164,-0.806545,-0.528615,-0.356959,0.123013,0
2,3154,47,-0.490291,0.3022,-1.088959,0.489465,0.447765,0.338536,0.656788,-0.16621,...,-0.800381,-1.012521,-0.289712,0.09749,0.85289,0.080816,-0.154911,-0.352732,0.292381,0
3,4041,47,-0.730001,0.3022,-1.088959,0.350713,0.534339,0.34231,0.490669,-1.434142,...,1.144788,1.330135,1.602894,0.09749,3.569884,-0.524262,3.389644,-0.296362,0.271751,1
4,2664,47,-0.78327,0.3022,-1.088959,0.188676,0.107571,0.02747,0.398906,-1.248206,...,-0.70226,-0.682788,-0.484909,-0.793952,1.172537,0.406375,-0.360595,-0.290909,0.234953,0


In [16]:
df_train_imputed.to_csv('result_saving/train_imputed.csv', index=False)
df_val_imputed.to_csv('result_saving/val_imputed.csv', index=False)
df_test_imputed.to_csv('result_saving/test_imputed.csv', index=False)

# 3. 阅读材料

### Wang, J., Du, W., Yang, Y., Qian, L., Cao, W., Zhang, K., Wang, W., Liang, Y. & Wen, Q. (2025) [Deep Learning for Multivariate Time Series Imputation: A Survey](https://arxiv.org/abs/2402.04059). IJCAI 2025.
#### 推荐原因: 该文回顾并总结了深度学习在时序插补领域的发展, 文章被人工智能顶级会议IJCAI 2025收录, 五位审稿人均给出正面评价. 截止2025年5月Google Scholar上引用50+.