In [1]:
print('hello')

hello


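That is an excellent question. The second screenshot you provided (Raw Data) is the **raw source** that the whole quantitative analysis draws on.

For **tick-level raw liquidation data** such as Tardis.dev provides, simply summing it wastes most of the information. Clean the data and build features along four dimensions: **intensity, breadth, distribution shape, and relative position**.

For an **H1 (hourly)** machine-learning model (XGBoost/gplearn), the recommended cleaning and feature-engineering plan is as follows: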

---

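### Step 1: Base Aggregation (Turning Ticks into Bars)

First, get the direction straight.

* **`side='buy'` in the data** → the system buys to close the position → **short liquidation (Short Liq)**.
* **`side='sell'` in the data** → the system sells to close the position → **long liquidation (Long Liq)**.

When resampling to `1H`, compute a set of statistics rather than only `Sum`:

| Aggregation | Example variable | Meaning |
| --- | --- | --- |
| **Sum** | `vol_sum` | **Total fuel.** How much was liquidated this hour in total? Measures absolute intensity. |
| **Count** | `cnt` | **Breadth of panic.** In a retail stampede the amounts may be small while the count is huge. |
| **Max** | `amt_max` | **Whale death.** The largest single liquidation in the hour. One liquidation of 5 million means something very different from 500 liquidations of 10,000 each. |
| **Mean** | `amt_mean` | **Profile of the average victim**, `Sum / Count`. A large mean suggests large accounts were liquidated; a small mean suggests retail accounts. |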
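
The aggregation in the table can be sketched as follows. This assumes a tick-level frame with a `DatetimeIndex` and `side`, `price`, `amount` columns (the shape of the liquidation data loaded later in this notebook). The helper name `aggregate_liquidations`, the `long_`/`short_` column prefixes, and the choice to measure size as notional value (`price * amount`) are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd


def aggregate_liquidations(ticks: pd.DataFrame, freq: str = "1h") -> pd.DataFrame:
    """Resample tick-level liquidations into bars with sum/count/max/mean per side."""
    ticks = ticks.copy()
    # Assumption: size is measured as notional value in quote currency.
    ticks["notional"] = ticks["price"] * ticks["amount"]
    # side == 'sell' means a long was liquidated; 'buy' means a short was liquidated.
    ticks["liq_type"] = ticks["side"].map({"sell": "long", "buy": "short"})

    parts = []
    for liq_type, group in ticks.groupby("liq_type"):
        bars = group["notional"].resample(freq).agg(["sum", "count", "max", "mean"])
        bars.columns = [
            f"{liq_type}_vol_sum",
            f"{liq_type}_cnt",
            f"{liq_type}_amt_max",
            f"{liq_type}_amt_mean",
        ]
        parts.append(bars)
    # Outer-join both sides on the bar timestamp; a side with no data in a bar is NaN.
    return pd.concat(parts, axis=1).sort_index()


# Tiny hand-made example (values are invented for illustration).
ticks = pd.DataFrame(
    {
        "side": ["buy", "sell", "sell", "buy"],
        "price": [100.0, 100.0, 200.0, 100.0],
        "amount": [1.0, 2.0, 1.0, 3.0],
    },
    index=pd.to_datetime(
        ["2025-10-01 00:05", "2025-10-01 00:10", "2025-10-01 00:40", "2025-10-01 01:15"]
    ),
)
bars = aggregate_liquidations(ticks)
print(bars)
```

Keeping long and short liquidations in separate columns preserves the direction information that a single combined sum would discard.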

---

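### Step 2: Distribution Shape Features (Skewness and Kurtosis)

The skewness and kurtosis you mentioned are very useful, because liquidations are rarely evenly distributed.

1. **Skewness:**
   * **Computed over:** the distribution of all liquidation order sizes within the hour.
   * **Interpretation:**
     * **Positive skew (Skew > 0):** most liquidations are small, with a few extreme large ones (long right tail). → **Targeted hunting of large accounts.**
     * **Low skew (Skew ≈ 0):** liquidation sizes are fairly uniform. → **Systemic crash / broad sell-off.**

2. **Kurtosis:**
   * **Interpretation:** intended to tell whether liquidations came as one concentrated burst or as continuous bleeding. This reading applies when kurtosis is computed over how liquidation volume is spread across the hour (for example, per-minute totals); kurtosis of individual order sizes instead describes how heavy the tail of order sizes is.
   * **High kurtosis:** liquidations are concentrated in a single instant (flash crash).
   * **Low kurtosis:** liquidations are spread evenly over the whole hour (a slow grind down).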
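
A minimal sketch of these shape features, assuming a series of liquidation sizes indexed by timestamp. The function name `shape_features` and the column names are hypothetical. Size skewness and kurtosis are taken over individual orders, while the timing kurtosis is taken over per-minute totals inside each bar, which is one possible way to capture "burst versus bleed". Note that `Series.skew()` and `Series.kurt()` return NaN for bars with fewer than 3 and 4 observations respectively, and `kurt()` reports excess kurtosis.

```python
import pandas as pd


def shape_features(notional: pd.Series, freq: str = "1h") -> pd.DataFrame:
    """Per-bar shape statistics of liquidation sizes and of their timing."""
    # Shape of the order-size distribution inside each bar.
    size_skew = notional.resample(freq).apply(lambda x: x.skew())
    size_kurt = notional.resample(freq).apply(lambda x: x.kurt())
    # Timing concentration: excess kurtosis of per-minute totals inside each bar.
    per_minute = notional.resample("1min").sum()
    time_kurt = per_minute.resample(freq).apply(lambda x: x.kurt())
    return pd.DataFrame(
        {"size_skew": size_skew, "size_kurt": size_kurt, "time_kurt": time_kurt}
    )


# Invented example: four small liquidations and one large one at the end of the hour.
idx = pd.to_datetime(
    [
        "2025-10-01 00:00",
        "2025-10-01 00:10",
        "2025-10-01 00:20",
        "2025-10-01 00:30",
        "2025-10-01 00:59",
    ]
)
notional = pd.Series([1.0, 1.0, 1.0, 1.0, 100.0], index=idx)
features = shape_features(notional)
print(features)
```

The per-minute series only spans the first to the last tick, so the first and last bars of a dataset may contain fewer than 60 minutes.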



---

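### Step 3: Relative Strength Features (Ratios and Normalization)

GP and XGBoost are not sensitive to absolute magnitudes (or overfit on them easily); they do better with ratios.

1. **Imbalance Ratio:**

   $$\text{Imbalance} = \frac{\text{LongLiq} - \text{ShortLiq}}{\text{LongLiq} + \text{ShortLiq}}$$

   * Range [-1, 1]. This is the most direct factor for spotting a **one-sided massacre**.

2. **Blood Content:**

   $$\text{BloodContent} = \frac{\text{LiqVolume}}{\text{TotalVolume}}$$

   * Of the volume traded in the period, how much was forced liquidation? The higher this share, **the higher the probability of a reversal**.

3. **Whale Index:**

   $$\text{WhaleIndex} = \frac{\text{MaxLiq}}{\text{SumLiq}}$$

   * If the value is close to 1, the liquidation wave was driven almost entirely by one very large account being wiped out. This is usually a **fake drop (noise)** that rebounds quickly.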
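
A sketch of the three ratios, computed from per-bar aggregates. The column names (`long_vol_sum`, `short_vol_sum`, `amt_max`, `volume`) and the function name `ratio_features` are assumptions for illustration, as is the sign convention (positive imbalance means long liquidations dominate). Liquidation size and traded volume must be in the same unit for the blood-content ratio to make sense.

```python
import numpy as np
import pandas as pd


def ratio_features(bars: pd.DataFrame) -> pd.DataFrame:
    """Imbalance ratio, blood content, and whale index per bar."""
    total_liq = bars["long_vol_sum"] + bars["short_vol_sum"]
    # Zero denominators become NaN so empty bars do not produce inf.
    safe_total = total_liq.replace(0, np.nan)

    out = pd.DataFrame(index=bars.index)
    # (long - short) / (long + short), bounded in [-1, 1].
    out["imbalance"] = (bars["long_vol_sum"] - bars["short_vol_sum"]) / safe_total
    # Share of traded volume that was forced liquidation.
    out["blood_content"] = total_liq / bars["volume"].replace(0, np.nan)
    # Largest single liquidation as a share of all liquidations in the bar.
    out["whale_index"] = bars["amt_max"] / safe_total
    return out


# Invented example: one active bar and one bar without liquidations.
bars = pd.DataFrame(
    {
        "long_vol_sum": [300.0, 0.0],
        "short_vol_sum": [100.0, 0.0],
        "amt_max": [200.0, 0.0],
        "volume": [4000.0, 5000.0],
    }
)
ratios = ratio_features(bars)
print(ratios)
```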



---

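### Step 4: Time-Series Anomaly Features (Rolling Windows and Z-Score)

This is the **"rolling window"** you mentioned. The goal is to decide whether the current liquidation level is abnormal relative to history.

1. **Z-Score (standardized anomaly):**

   $$Z_t = \frac{x_t - \mu_{24h}}{\sigma_{24h}}$$

   where $x_t$ is the current liquidation volume and $\mu_{24h}$, $\sigma_{24h}$ are its rolling mean and standard deviation over the past 24 hours.

   * **A must-have factor.** It tells the model, for example, that the current liquidation volume sits 5 standard deviations above the past 24-hour average, which marks a major event.

2. **Percentile Rank:**
   * Where does the current liquidation volume rank within the past 7 days (168 hours)?
   * If `Rank > 0.99`, it is an extreme event, which often coincides with a bottom.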
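
Both rolling features can be sketched as below, assuming an hourly series of liquidation volume. The function names are hypothetical. One design choice here is an assumption rather than something the text specifies: the Z-score baseline is shifted by one bar so the mean and standard deviation use only past bars, while the percentile rank includes the current bar in its window.

```python
import numpy as np
import pandas as pd


def rolling_zscore(x: pd.Series, window: int = 24) -> pd.Series:
    """Z-score of each bar against the previous `window` bars."""
    past = x.shift(1)  # exclude the current bar from its own baseline
    mean = past.rolling(window).mean()
    std = past.rolling(window).std()
    return (x - mean) / std.replace(0, np.nan)


def rolling_pct_rank(x: pd.Series, window: int = 168) -> pd.Series:
    """Fraction of values in the trailing window (current bar included) <= current."""
    return x.rolling(window).apply(lambda w: (w <= w[-1]).mean(), raw=True)


# Invented example: 24 quiet bars alternating between 1 and 3, then a spike of 12.
x = pd.Series([1.0, 3.0] * 12 + [12.0])
z = rolling_zscore(x, window=24)
rank = rolling_pct_rank(x, window=25)
print(z.iloc[-1], rank.iloc[-1])
```

Both functions return NaN until a full window of history is available, so the first rows of a dataset should be dropped or handled before training.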



---

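### Step 5: Micro Price Interaction (Advanced)

Beyond what you listed, there is a **more advanced** technique: **measure how far the liquidation average price (Liq-VWAP) deviates from the market price.**

1. **Liquidation VWAP (Liq_VWAP):**

   $$\text{Liq\_VWAP} = \frac{\sum_i p_i \, q_i}{\sum_i q_i}$$

   * The volume-weighted average fill price of all forced-liquidation orders within the hour, where $p_i$ and $q_i$ are the price and quantity of liquidation $i$.

2. **Pain Depth:**

   $$\text{PainDepth} = \frac{\text{Close} - \text{Liq\_VWAP}}{\text{Liq\_VWAP}}$$

   * **Interpretation:** If the close is far below the long-liquidation VWAP, price kept falling after the liquidations (possibly past the bankruptcy price), and the market is extremely weak. If the close quickly recovers above the liquidation VWAP, buyers are absorbing the forced selling.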
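
A sketch of the liquidation VWAP and pain depth, assuming tick-level liquidations with `price` and `amount` columns and a separate series of bar closes on the same bar index. The function names, and the choice to normalize pain depth by the liquidation VWAP, are assumptions for illustration.

```python
import pandas as pd


def liq_vwap(ticks: pd.DataFrame, freq: str = "1h") -> pd.Series:
    """Volume-weighted average liquidation price per bar."""
    price_volume = (ticks["price"] * ticks["amount"]).resample(freq).sum()
    volume = ticks["amount"].resample(freq).sum()
    # Bars without liquidations get NaN instead of a division by zero.
    return price_volume / volume.where(volume > 0)


def pain_depth(close: pd.Series, vwap: pd.Series) -> pd.Series:
    """Relative distance of the bar close from the liquidation VWAP."""
    return (close - vwap) / vwap


# Invented example: two long liquidations (side == 'sell') in one hour.
long_liqs = pd.DataFrame(
    {"price": [100.0, 90.0], "amount": [1.0, 3.0]},
    index=pd.to_datetime(["2025-10-01 00:10", "2025-10-01 00:40"]),
)
vwap = liq_vwap(long_liqs)
close = pd.Series([88.0], index=vwap.index)
depth = pain_depth(close, vwap)
print(vwap.iloc[0], depth.iloc[0])
```

In this example the close sits below the long-liquidation VWAP, so the pain depth is negative, matching the "price kept falling" case. Computing the VWAP separately for `side == 'sell'` and `side == 'buy'` keeps long and short liquidations from cancelling each other out.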



---

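### Summary and Recommendations

1. **Always use the Z-Score:** a 10-million liquidation in 2021 and a 10-million liquidation in 2025 mean very different things. Only standardized data should be fed to the model.
2. **Watch Count:** to catch a **retail panic bottom**, `count` is a better signal than `sum`.
3. **Watch Max:** to catch a **wick reversal**, `max` is a better signal than `sum`.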

In [2]:
import os
import re
import sys
import time
import zipfile
from datetime import datetime, timedelta
from enum import Enum
from io import BytesIO
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import talib as ta
from scipy.optimize import minimize
from scipy.stats import zscore, kurtosis, skew, yeojohnson, boxcox
from scipy.stats import tukeylambda, mstats
from sklearn.preprocessing import RobustScaler

In [3]:
class DataFrequency(Enum):
    """Data frequency enum."""
    MONTHLY = 'monthly'  # monthly data
    DAILY = 'daily'      # daily data


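def _generate_date_range(start_date: str, end_date: str, read_frequency: DataFrequency = DataFrequency.MONTHLY) -> List[str]:
    """
    Generate a list of date strings covering a date range.

    Parameters:
    start_date: start date
        - monthly format: 'YYYY-MM' (e.g. '2020-01') or 'YYYY-MM-DD' (truncated to 'YYYY-MM')
        - daily format: 'YYYY-MM-DD' (e.g. '2020-01-01')
    end_date: end date, same formats as above
    read_frequency: data frequency (monthly or daily)

    Returns:
    A list of date strings, inclusive of both endpoints
    """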
    if read_frequency == DataFrequency.MONTHLY:
        # Accept both 'YYYY-MM' and 'YYYY-MM-DD'
        # A 'YYYY-MM-DD' value is truncated to 'YYYY-MM'
        new_start_date = start_date
        new_end_date = end_date
        if len(start_date) == 10:  # 'YYYY-MM-DD' format
            new_start_date = start_date[:7]
        if len(end_date) == 10:
            new_end_date = end_date[:7]
            
        start_dt = datetime.strptime(new_start_date, '%Y-%m')
        end_dt = datetime.strptime(new_end_date, '%Y-%m')
        
        date_list = []
        current_dt = start_dt
        while current_dt <= end_dt:
            date_list.append(current_dt.strftime('%Y-%m'))
            # Advance to the next month
            if current_dt.month == 12:
                current_dt = current_dt.replace(year=current_dt.year + 1, month=1)
            else:
                current_dt = current_dt.replace(month=current_dt.month + 1)
        
        return date_list
    
    elif read_frequency == DataFrequency.DAILY:
        start_dt = datetime.strptime(start_date, '%Y-%m-%d')
        end_dt = datetime.strptime(end_date, '%Y-%m-%d')
        
        date_list = []
        current_dt = start_dt
        while current_dt <= end_dt:
            date_list.append(current_dt.strftime('%Y-%m-%d'))
            current_dt += timedelta(days=1)
        
        return date_list
    
    else:
        raise ValueError(f"Unsupported data frequency: {read_frequency}")

In [None]:
# start_date = '2025-01-01'
# end_date = '2025-11-01'
# read_frequency = DataFrequency.MONTHLY
# date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=read_frequency)
# date_range_list

['2025-01',
 '2025-02',
 '2025-03',
 '2025-04',
 '2025-05',
 '2025-06',
 '2025-07',
 '2025-08',
 '2025-09',
 '2025-10',
 '2025-11']

Processing long/short data
/Users/aming/data/ETHUSDT

takerlongshortRatio
topLongShortPositionRatio
topLongShortAccountRatio

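Share of total long and short positions held by top traders, where top traders are the users ranked in the top 20% by margin balance.
Long position ratio = top traders' long positions / top traders' total positions
Short position ratio = top traders' short positions / top traders' total positions
Long/short position ratio = long position ratio / short position ratio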
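
As a quick arithmetic check of these definitions, `longShortRatio` should be close to `longAccount / shortAccount`, and the two proportions should sum to 1. The sketch below uses two rows copied from the ETHUSDT output later in this notebook. The tolerance of 0.002 is an assumption that allows for the published proportions being rounded to four decimals.

```python
import pandas as pd

# Two sample rows taken from the topLongShortPositionRatio output in this notebook.
rows = pd.DataFrame(
    {
        "longAccount": [0.7353, 0.7451],
        "shortAccount": [0.2647, 0.2549],
        "longShortRatio": [2.7775, 2.9237],
    }
)

# Long/short ratio recomputed from the two proportions.
recomputed = rows["longAccount"] / rows["shortAccount"]
max_gap = (recomputed - rows["longShortRatio"]).abs().max()
# The two proportions should cover all top-trader positions.
proportion_sum = rows["longAccount"] + rows["shortAccount"]

print(recomputed.round(4).tolist(), max_gap)
```

The small gap between the recomputed and reported ratios is consistent with the reported ratio having been calculated from unrounded proportions.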

topLongShortPositionRatio

https://developers.binance.com/docs/zh-CN/derivatives/usds-margined-futures/market-data/rest-api/Top-Trader-Long-Short-Ratio

{
    "symbol":"BTCUSDT",
    "longShortRatio":"1.4342", // top traders' long/short position ratio
    "longAccount": "0.5344", // top traders' long position ratio
    "shortAccount":"0.4238", // top traders' short position ratio
    "timestamp":"1583139600000"
}

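| Name | Type | Required | Description |
| --- | --- | --- | --- |
| symbol | STRING | YES | |
| period | ENUM | YES | "5m","15m","30m","1h","2h","4h","6h","12h","1d" |
| limit | LONG | NO | default 30, max 500 |
| startTime | LONG | NO | |
| endTime | LONG | NO | |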


In [45]:
start_date = '2025-10-01'
end_date = '2025-11-01'
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=DataFrequency.MONTHLY)
dir = '/Users/aming/data/ETHUSDT'
path = 'topLongShortPositionRatio'
df_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{path}/{path}_{date_str}.csv')
    df_list.append(df)

df = pd.concat(df_list)
df['open_time'] = pd.to_datetime(df['open_time'], unit='ns')
# df.sort_values(by='open_time', ascending=True, inplace=True)
df.set_index('open_time', inplace=True)
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df
# df['buySellRatio'].plot()


open_time,symbol,longAccount,longShortRatio,shortAccount
2025-10-01 00:00:00,ETHUSDT,0.7353,2.7775,0.2647
2025-10-01 00:00:00,ETHUSDT,0.7353,2.7775,0.2647
2025-10-01 00:00:00,ETHUSDT,0.7353,2.7775,0.2647
2025-10-01 00:00:00,ETHUSDT,0.7353,2.7775,0.2647
2025-10-01 00:00:00,ETHUSDT,0.7353,2.7775,0.2647
...,...,...,...,...
2025-11-30 23:50:00,ETHUSDT,0.7451,2.9237,0.2549
2025-11-30 23:50:00,ETHUSDT,0.7451,2.9237,0.2549
2025-11-30 23:50:00,ETHUSDT,0.7451,2.9237,0.2549
2025-11-30 23:55:00,ETHUSDT,0.7450,2.9213,0.2550
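
The output above repeats the same timestamp several times. A minimal cleaning sketch follows, assuming the repeats are exact copies (for example from overlapping downloads) and that the last snapshot inside each bar is the value wanted for hourly features. The helper name `dedupe_and_resample` and the sample values are hypothetical.

```python
import pandas as pd


def dedupe_and_resample(df: pd.DataFrame, freq: str = "1h") -> pd.DataFrame:
    """Drop repeated timestamps, then keep the last snapshot in each bar."""
    # Keep the first row per timestamp; assumes repeated rows are identical.
    unique = df[~df.index.duplicated(keep="first")].sort_index()
    # Ratios are snapshots, so take the last one per bar rather than summing.
    return unique.resample(freq).last()


# Invented example with the same duplication pattern as the output above.
idx = pd.to_datetime(
    ["2025-10-01 00:00"] * 3 + ["2025-10-01 00:05"] * 2 + ["2025-10-01 01:00"]
)
raw = pd.DataFrame(
    {
        "longAccount": [0.7353] * 3 + [0.7350] * 2 + [0.7400],
        "shortAccount": [0.2647] * 3 + [0.2650] * 2 + [0.2600],
    },
    index=idx,
)
hourly = dedupe_and_resample(raw)
print(hourly)
```

If the repeated rows can differ, inspect them first instead of silently keeping the first copy.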


topLongShortAccountRatio

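Share of top-trader accounts that are net long versus net short, where top traders are the users ranked in the top 20% by margin balance. Each account is counted once.
Long account ratio = number of top traders net long / total number of top traders holding positions
Short account ratio = number of top traders net short / total number of top traders holding positions
Long/short account ratio = long account ratio / short account ratio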


https://developers.binance.com/docs/zh-CN/derivatives/usds-margined-futures/market-data/rest-api/Top-Long-Short-Account-Ratio

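| Name | Type | Required | Description |
| --- | --- | --- | --- |
| symbol | STRING | YES | |
| period | ENUM | YES | "5m","15m","30m","1h","2h","4h","6h","12h","1d" |
| limit | LONG | NO | default 30, max 500 |
| startTime | LONG | NO | |
| endTime | LONG | NO | |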

{
    "symbol":"BTCUSDT",
    "longShortRatio":"1.8105", // top traders' long/short account ratio
    "longAccount": "0.6442", // top traders' long account ratio
    "shortAccount":"0.3558", // top traders' short account ratio
    "timestamp":"1583139600000"
}
    


In [46]:
start_date = '2025-10-01'
end_date = '2025-11-01'
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=DataFrequency.MONTHLY)
dir = '/Users/aming/data/ETHUSDT'
path = 'topLongShortAccountRatio'
df_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{path}/{path}_{date_str}.csv')
    df_list.append(df)

df = pd.concat(df_list)
df['open_time'] = pd.to_datetime(df['open_time'], unit='ns')
# df.sort_values(by='open_time', ascending=True, inplace=True)
df.set_index('open_time', inplace=True)
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df
# df['buySellRatio'].plot()


open_time,symbol,longAccount,longShortRatio,shortAccount
2025-10-01 00:00:00,ETHUSDT,0.7126,2.4795,0.2874
2025-10-01 00:00:00,ETHUSDT,0.7126,2.4795,0.2874
2025-10-01 00:00:00,ETHUSDT,0.7126,2.4795,0.2874
2025-10-01 00:05:00,ETHUSDT,0.7122,2.4746,0.2878
2025-10-01 00:05:00,ETHUSDT,0.7122,2.4746,0.2878
...,...,...,...,...
2025-11-30 23:50:00,ETHUSDT,0.7028,2.3647,0.2972
2025-11-30 23:50:00,ETHUSDT,0.7028,2.3647,0.2972
2025-11-30 23:55:00,ETHUSDT,0.7037,2.3750,0.2963
2025-11-30 23:55:00,ETHUSDT,0.7037,2.3750,0.2963


{
    buySellRatio: "1.5586",
    buyVol: "387.3300", // taker buy volume
    sellVol: "248.5030", // taker sell volume
    timestamp: "1585614900000",
}

https://developers.binance.com/docs/zh-CN/derivatives/usds-margined-futures/market-data/rest-api/Taker-BuySell-Volume

takerlongshortRatio

In [15]:
start_date = '2025-10-01'
end_date = '2025-11-01'
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=DataFrequency.MONTHLY)
dir = '/Users/aming/data/ETHUSDT'
takerlongshortRatioPath = 'takerlongshortRatio'
df_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{takerlongshortRatioPath}/{takerlongshortRatioPath}_{date_str}.csv')
    df_list.append(df)

df = pd.concat(df_list)
# df.sort_values(by='open_time', ascending=True, inplace=True)
df.set_index('open_time', inplace=True)
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df.head()
# df['buySellRatio'].plot()


open_time,buySellRatio,sellVol,buyVol
2025-10-01 00:00:00,1.5031,4230.999,6359.645
2025-10-01 00:00:00,1.5031,4230.999,6359.645
2025-10-01 00:00:00,1.5031,4230.999,6359.645
2025-10-01 00:05:00,0.8976,3388.384,3041.311
2025-10-01 00:05:00,0.8976,3388.384,3041.311


globalLongShortAccountRatio

https://developers.binance.com/docs/zh-CN/derivatives/usds-margined-futures/market-data/rest-api/Long-Short-Ratio

{
    "symbol":"BTCUSDT",
    "longShortRatio":"0.1960", // long/short account ratio
    "longAccount": "0.6622", // share of accounts that are long
    "shortAccount":"0.3378", // share of accounts that are short
    "timestamp":"1583139600000"
}


In [49]:
start_date = '2025-10-01'
end_date = '2025-11-01'
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=DataFrequency.MONTHLY)
dir = '/Users/aming/data/ETHUSDT'
path = 'globalLongShortAccountRatio'
df_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{path}/{path}_{date_str}.csv')
    df_list.append(df)

df = pd.concat(df_list)
# df.sort_values(by='open_time', ascending=True, inplace=True)
df.set_index('open_time', inplace=True)
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df
# df.head()
# df['buySellRatio'].plot()


open_time,symbol,longAccount,longShortRatio,shortAccount
2025-10-01 00:00:00,ETHUSDT,0.6786,2.1114,0.3214
2025-10-01 00:00:00,ETHUSDT,0.6786,2.1114,0.3214
2025-10-01 00:00:00,ETHUSDT,0.6786,2.1114,0.3214
2025-10-01 00:00:00,ETHUSDT,0.6786,2.1114,0.3214
2025-10-01 00:00:00,ETHUSDT,0.6786,2.1114,0.3214
...,...,...,...,...
2025-11-30 23:50:00,ETHUSDT,0.6650,1.9851,0.3350
2025-11-30 23:55:00,ETHUSDT,0.6655,1.9895,0.3345
2025-11-30 23:55:00,ETHUSDT,0.6655,1.9895,0.3345
2025-11-30 23:55:00,ETHUSDT,0.6655,1.9895,0.3345


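Leverage data

liquidations

side: direction of the forced liquidation.

sell: a long position was liquidated (the system sells to close).

buy: a short position was liquidated (the system buys to close).

price: the fill price at which the liquidation executed.

amount: the liquidated quantity (in coins or contracts).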


In [4]:
start_date = '2025-10-01'
end_date = '2025-11-01'
read_frequency = DataFrequency.DAILY
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=read_frequency)
dir = '/Users/aming/data/ETHUSDT'
channel_path = 'liquidations'
symbol = 'ETHUSDT'
liq_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{channel_path}/binance-futures_{channel_path}_{date_str}_{symbol}.csv.gz')
    liq_list.append(df)

liq_df = pd.concat(liq_list)
# df.sort_values(by='open_time', ascending=True, inplace=True)
liq_df.rename(columns={'timestamp': 'open_time'}, inplace=True)
liq_df['open_time'] = pd.to_datetime(liq_df['open_time'], unit='us')
# df['funding_timestamp'] = pd.to_datetime(df['funding_timestamp'], unit='us')
liq_df.set_index('open_time', inplace=True)
# df.index = pd.to_datetime(df.index)
liq_df.sort_index(inplace=True)
liq_df.drop(columns=['id', 'exchange', 'local_timestamp', 'symbol'], inplace=True)
liq_df.head()

open_time,side,price,amount
2025-10-01 00:00:17.767,buy,4161.21,3.244
2025-10-01 00:01:12.560,buy,4162.28,0.738
2025-10-01 00:01:39.582,buy,4164.18,0.06
2025-10-01 00:03:18.173,buy,4165.67,1.722
2025-10-01 00:05:02.358,buy,4166.02,3.837


In [5]:
from tools import LiquidationFactorEngine as liq
liq_factor_engine = liq.LiquidationFactorEngine(resample_freq = '15m')
bucket_quantiles = [0.90]
bucket_window_hours=[24]
mining_windows=[24]
mining_quantiles=[0.90]

liq_factor_df = liq_factor_engine.process(liq_df, bucket_quantiles=bucket_quantiles, bucket_window_hours=bucket_window_hours, mining_windows=mining_windows, mining_quantiles=mining_quantiles)

liq_factor_df

[*] 启动引擎 | 频率: 15m | 动态分桶回看: [24]H


  hourly_quantiles = df['value'].resample('1H').quantile(quantiles).unstack()


ValueError: window must be an integer 0 or greater

derivative_ticker

In [51]:
start_date = '2025-10-01'
end_date = '2025-11-01'
read_frequency = DataFrequency.DAILY
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=read_frequency)
dir = '/Users/aming/data/ETHUSDT'
channel_path = 'derivative_ticker'
symbol = 'ETHUSDT'
df_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{channel_path}/binance-futures_{channel_path}_{date_str}_{symbol}.csv.gz')
    df_list.append(df)

df = pd.concat(df_list)
# df.sort_values(by='open_time', ascending=True, inplace=True)
df.rename(columns={'timestamp': 'open_time'}, inplace=True)
df['open_time'] = pd.to_datetime(df['open_time'], unit='us')
df['funding_timestamp'] = pd.to_datetime(df['funding_timestamp'], unit='us')
df.set_index('open_time', inplace=True)
# df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
df.drop(columns=['exchange', 'local_timestamp', 'predicted_funding_rate', 'open_interest', 'symbol'], inplace=True)
df.head()
# df['buySellRatio'].plot()


open_time,funding_timestamp,funding_rate,last_price,index_price,mark_price
2025-10-01 00:00:00.000,2025-10-01 00:00:00,-1.6e-05,,4145.183256,4142.98
2025-10-01 00:00:01.001,2025-10-01 08:00:00,-1.6e-05,,4145.183256,4142.74
2025-10-01 00:00:01.632,2025-10-01 08:00:00,-1.6e-05,4142.99,4145.183256,4142.74
2025-10-01 00:00:02.001,2025-10-01 08:00:00,-1.6e-05,4142.99,4145.226512,4142.96
2025-10-01 00:00:03.000,2025-10-01 08:00:00,-1.6e-05,4142.99,4145.49,4143.066357


Processing openInterest
{
	"openInterest": "10659.509", // open interest (number of open contracts)
	"symbol": "BTCUSDT",	// trading pair
	"time": 1589437530011   // matching engine time
}
https://developers.binance.com/docs/zh-CN/derivatives/usds-margined-futures/market-data/rest-api/Open-Interest

In [34]:
start_date = '2025-10-01'
end_date = '2025-11-01'
read_frequency = DataFrequency.MONTHLY
date_range_list = _generate_date_range(start_date=start_date, end_date=end_date, read_frequency=read_frequency)
dir = '/Users/aming/data/ETHUSDT'
channel_path = 'openInterest'
symbol = 'ETHUSDT'
df_list = []

for date_str in date_range_list:
    df = pd.read_csv(f'{dir}/{channel_path}/{channel_path}_{date_str}.csv')
    df_list.append(df)

df = pd.concat(df_list)
df.set_index('open_time', inplace=True)
# df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True)
# df.drop(columns=['exchange', 'local_timestamp', 'symbol'], inplace=True)
df.head()
# df['buySellRatio'].plot()


open_time,symbol,openInterest
2025-10-01 00:00:00.114,ETHUSDT,1827785.572
2025-10-01 00:00:05.408,ETHUSDT,1827799.596
2025-10-01 00:00:12.566,ETHUSDT,1827989.568
2025-10-01 00:00:20.574,ETHUSDT,1827917.962
2025-10-01 00:00:26.319,ETHUSDT,1827886.995
