# Getting the .npy Data

## Overview

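Collect the features and labels, with the Morgan fingerprints removed, into `.npy` or `.pkl` files,
then write a Jupyter notebook that shows how to load the preprocessed data for feature analysis and model training.
Finally, bundle the `.npy` files and the notebook together.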

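**Main steps:**
1. **Data cleaning** (optional): clean the raw CSV data and convert asterisk markers
2. **Add molecular features**: detect monomers/dimers, cyclization, and disulfide bonds; convert labels to minutes
3. **Extract RDKit features**: extract QED properties, physicochemical descriptors, and Morgan/Avalon fingerprints (both fingerprints are disabled in this run)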

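**Input**: `data/raw/*.csv` - raw data files  
**Output**: 
- `data/processed/*.csv` - CSVs with the molecular features added  
- `outputs/npy_datas/{contain_type}/{dataset_name}_processed/*.npy` - RDKit feature matrices in npy format: `feature_names`, `ids`, `X`, `y_sgf` (SGF labels), and `y_sif` (SIF labels)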

**Notes on the .npy files**:
- `X` excludes every non-monomer sample and every sample whose `SIF_minutes` and `SGF_minutes` are both -1
- `X.shape[0] == y_sif.shape[0] == y_sgf.shape[0]`
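
The notes above can be checked when loading one dataset folder. The following is a minimal sketch, not the project's own loader: the helper name `load_npy_dataset` and the toy arrays are made up for illustration, and the file names are assumed to match the ones written later in this notebook. Because `ids` is saved as an object array, it is loaded with `allow_pickle=True`.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_npy_dataset(dataset_dir):
    """Load the five arrays of one dataset folder and check row alignment."""
    dataset_dir = Path(dataset_dir)
    data = {
        "X": np.load(dataset_dir / "X.npy"),
        "y_sif": np.load(dataset_dir / "y_sif.npy"),
        "y_sgf": np.load(dataset_dir / "y_sgf.npy"),
        # ids is an object array, so NumPy pickles it on save
        "ids": np.load(dataset_dir / "ids.npy", allow_pickle=True),
        "feature_names": np.load(dataset_dir / "feature_names.npy", allow_pickle=True),
    }
    n_rows = data["X"].shape[0]
    # Every per-sample array must have one entry per row of X
    for key in ("y_sif", "y_sgf", "ids"):
        assert data[key].shape == (n_rows,), f"{key} is not aligned with X"
    assert len(data["feature_names"]) == data["X"].shape[1]
    return data


# Toy data standing in for a real dataset folder (hypothetical values)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    np.save(tmp / "X.npy", np.zeros((3, 24), dtype=np.float32))
    np.save(tmp / "y_sif.npy", np.array([30, -1, 120], dtype=np.int32))
    np.save(tmp / "y_sgf.npy", np.array([-1, 60, 60], dtype=np.int32))
    np.save(tmp / "ids.npy", np.array(["a", "b", "c"], dtype=object))
    np.save(tmp / "feature_names.npy", [f"f{i}" for i in range(24)])
    demo = load_npy_dataset(tmp)

print(demo["X"].shape, demo["ids"].tolist())
```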
---

## 1. Environment Setup

In [11]:
# Environment check
import sys
from pathlib import Path

# Add the project's src directory to the import path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

# Core library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# Project module imports
from feature_extraction import PeptideFeaturizer
from feature_extraction.utils import (
    get_csv_files, load_csv_safely, extract_molecular_features,
    convert_label_to_minutes, save_features_to_npz
)

# Display options
pd.set_option('display.max_columns', None)
plt.rcParams['figure.figsize'] = (12, 6)
sns.set_style('whitegrid')

print("✓ All libraries imported successfully")
print(f"✓ Project root: {project_root}")

✓ All libraries imported successfully
✓ Project root: d:\RA\feature_extraction


## 2. Configuration

**⚙️ Adjust the parameters below to suit your needs**

In [12]:
# ============== Configuration ==============
# Adjust the parameters below as needed

CONFIG = {
    # Input/output paths
    'raw_dir': project_root / 'data' / 'raw',
    'processed_dir': project_root / 'data' / 'processed',
    'features_dir': project_root / 'outputs' / 'features',
    'figures_dir': project_root / 'outputs' / 'figures' / 'phase1',
    'npy_datas_dir': project_root / 'outputs' / 'npy_datas',
    'pkl_datas_dir': project_root / 'outputs' / 'pkl_datas',
    'contain_type': "NoMorganAndAvalon",  # subfolder under npy_datas_dir naming the feature set
    
    # Feature extraction parameters
    'morgan_bits': 0,        # Morgan fingerprint bits (previously 1024; temporarily 0 to disable this block)
    'avalon_bits': 0,        # Avalon fingerprint bits
    'use_avalon': False,     # whether to use Avalon fingerprints (requires RDKit support)
    
    # Visualization parameters
    'dpi': 300,              # figure resolution
    'format': 'png',         # figure format (png/pdf/svg)
    'display_plots': True,   # whether to show key plots in the notebook
    'max_display_plots': 3,  # maximum number of plots to show
}

# Create output directories
CONFIG['processed_dir'].mkdir(parents=True, exist_ok=True)
CONFIG['features_dir'].mkdir(parents=True, exist_ok=True)
CONFIG['figures_dir'].mkdir(parents=True, exist_ok=True)

print("Configuration:")
for key, value in CONFIG.items():
    if isinstance(value, Path):
        print(f"  {key}: {value.relative_to(project_root) if value.is_relative_to(project_root) else value}")
    else:
        print(f"  {key}: {value}")

Configuration:
  raw_dir: data\raw
  processed_dir: data\processed
  features_dir: outputs\features
  figures_dir: outputs\figures\phase1
  npy_datas_dir: outputs\npy_datas
  pkl_datas_dir: outputs\pkl_datas
  contain_type: NoMorganAndAvalon
  morgan_bits: 0
  avalon_bits: 0
  use_avalon: False
  dpi: 300
  format: png
  display_plots: True
  max_display_plots: 3


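## 3. Step 1: Add Molecular Features

Add structural features for each SMILES molecule and convert the label format.

**New columns**:
- `is_dimer`: whether the molecule is a dimer (bool)
- `is_cyclic`: whether the molecule contains a ring structure (bool)
- `has_disulfide_bond`: whether the molecule contains a disulfide bond (bool)
- `SIF_minutes`: SIF half-life in minutes
- `SGF_minutes`: SGF half-life in minutes

The filter in the cell below also relies on an `is_monomer` column.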
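
How `extract_molecular_features` computes these flags is not shown in this notebook. The sketch below is only one plausible way to derive similar flags with RDKit, under stated assumptions: "cyclic" is taken to mean "has at least one ring", a disulfide bond is matched with the SMARTS pattern `[SX2][SX2]`, and a fragment count is reported instead of guessing the project's dimer rule. The function name `sketch_structure_flags` is hypothetical.

```python
from rdkit import Chem

# Assumption: a disulfide is two bonded sulfur atoms, each with two connections
DISULFIDE = Chem.MolFromSmarts("[SX2][SX2]")


def sketch_structure_flags(smiles):
    """Return illustrative structure flags, or None for missing/unparsable input."""
    if not isinstance(smiles, str):
        return None  # e.g. NaN read from an empty CSV cell
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # RDKit could not parse the string
    return {
        # Assumption: any ring counts as "cyclic"
        "is_cyclic": mol.GetRingInfo().NumRings() > 0,
        "has_disulfide_bond": mol.HasSubstructMatch(DISULFIDE),
        # Number of disconnected fragments; the project's dimer rule may differ
        "n_fragments": len(Chem.GetMolFrags(mol)),
    }


flags_disulfide = sketch_structure_flags("CSSC")      # dimethyl disulfide
flags_ring = sketch_structure_flags("C1CCCCC1")       # cyclohexane
flags_missing = sketch_structure_flags(float("nan"))  # missing SMILES
print(flags_disulfide, flags_ring, flags_missing)
```

Returning `None` for missing input keeps the caller in control of how such rows are dropped or logged.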

In [13]:
def add_molecular_features_to_csv(csv_path: Path, output_dir: Path):
    """
    Add molecular features to a single CSV file.
    
    Args:
        csv_path: path to the input CSV file
        output_dir: output directory
    
    Returns:
        dict: summary statistics
    """
    # Load the CSV
    df, status = load_csv_safely(csv_path, required_columns=["id", "SMILES"])
    if df is None:
        return {"error": status}
    
    original_count = len(df)
    
    # Extract molecular features
    feature_records = []
    for _, row in tqdm(df.iterrows(), total=len(df), desc=f"Processing {csv_path.name}", leave=False):
        smiles = row["SMILES"]
        features = extract_molecular_features(smiles)
        feature_records.append(features)
    
    # Append the feature columns
    feature_df = pd.DataFrame(feature_records)
    df = pd.concat([df, feature_df], axis=1)
    
    # Convert labels to minutes
    sif_col = "SIF_class" if "SIF_class" in df.columns else None
    sgf_col = "SGF_class" if "SGF_class" in df.columns else None
    
    if sif_col:
        df["SIF_minutes"] = df[sif_col].apply(convert_label_to_minutes)
    else:
        df["SIF_minutes"] = -1
    
    if sgf_col:
        df["SGF_minutes"] = df[sgf_col].apply(convert_label_to_minutes)
    else:
        df["SGF_minutes"] = -1
    
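    # 1. Flag samples where both the SIF and SGF labels are missing
    mask_both_missing = (df["SIF_minutes"] == -1) & (df["SGF_minutes"] == -1)

    # 2. Flag monomer molecules
    mask_is_monomer = df["is_monomer"] == True  # or df["is_monomer"] if the column is strictly boolean

    # 3. Combine both conditions:
    # drop rows missing both labels, then keep only rows with is_monomer=True
    df_filtered = df[~mask_both_missing & mask_is_monomer].copy()
    
    # Save the processed CSV
    output_path = output_dir / csv_path.name.replace('.csv', '_processed.csv')
    df_filtered.to_csv(output_path, index=False)
    
    # Summary statistics (dimer_count is computed after the monomer filter)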
    stats = {
        "file": csv_path.name,
        "original_count": original_count,
        "filtered_count": len(df_filtered),
        "dimer_count": df_filtered["is_dimer"].sum(),
        "cyclic_count": df_filtered["is_cyclic"].sum(),
        "disulfide_count": df_filtered["has_disulfide_bond"].sum(),
        "sif_valid_count": (df_filtered["SIF_minutes"] != -1).sum(),
        "sgf_valid_count": (df_filtered["SGF_minutes"] != -1).sum(),
        "output_path": output_path,
    }
    
    return stats

# Run: process every CSV file in batch
csv_files = list(CONFIG['raw_dir'].glob('*.csv'))
print(f"Found {len(csv_files)} CSV files\n")

all_stats = []
for csv_file in csv_files:
    stats = add_molecular_features_to_csv(csv_file, CONFIG['processed_dir'])
    if "error" not in stats:
        all_stats.append(stats)
        print(f"✓ {stats['file']}: {stats['original_count']} → {stats['filtered_count']} samples")

# Aggregate statistics
summary_df = pd.DataFrame(all_stats)
print(f"\n{'='*60}")
print("Overall statistics:")
print(f"  Total samples: {summary_df['original_count'].sum()}")
print(f"  Retained samples: {summary_df['filtered_count'].sum()} ({summary_df['filtered_count'].sum() / summary_df['original_count'].sum() * 100:.1f}%)")
print(f"  Dimer samples: {summary_df['dimer_count'].sum()} ({summary_df['dimer_count'].sum() / summary_df['filtered_count'].sum() * 100:.1f}%)")
print(f"  Cyclic samples: {summary_df['cyclic_count'].sum()} ({summary_df['cyclic_count'].sum() / summary_df['filtered_count'].sum() * 100:.1f}%)")
print(f"  Disulfide-containing samples: {summary_df['disulfide_count'].sum()} ({summary_df['disulfide_count'].sum() / summary_df['filtered_count'].sum() * 100:.1f}%)")
print(f"{'='*60}\n")

# Show the detailed table
display(summary_df[['file', 'original_count', 'filtered_count', 'dimer_count', 'cyclic_count', 'disulfide_count']])

Found 5 CSV files



Processing sif_sgf_second.csv:   0%|          | 0/897 [00:00<?, ?it/s]

✓ sif_sgf_second.csv: 897 → 202 samples


Processing US20140294902A1.csv:   0%|          | 0/5 [00:00<?, ?it/s]

✓ US20140294902A1.csv: 5 → 5 samples


Processing US9624268.csv:   0%|          | 0/775 [00:00<?, ?it/s]

✓ US9624268.csv: 775 → 130 samples


Processing US9809623B2.csv:   0%|          | 0/80 [00:00<?, ?it/s]

✓ US9809623B2.csv: 80 → 32 samples


Processing WO2017011820A2.csv:   0%|          | 0/174 [00:00<?, ?it/s]

✓ WO2017011820A2.csv: 174 → 151 samples

Overall statistics:
  Total samples: 1931
  Retained samples: 520 (26.9%)
  Dimer samples: 0 (0.0%)
  Cyclic samples: 515 (99.0%)
  Disulfide-containing samples: 180 (34.6%)



| | file | original_count | filtered_count | dimer_count | cyclic_count | disulfide_count |
|---|---|---|---|---|---|---|
| 0 | sif_sgf_second.csv | 897 | 202 | 0 | 202 | 27 |
| 1 | US20140294902A1.csv | 5 | 5 | 0 | 5 | 5 |
| 2 | US9624268.csv | 775 | 130 | 0 | 130 | 54 |
| 3 | US9809623B2.csv | 80 | 32 | 0 | 28 | 28 |
| 4 | WO2017011820A2.csv | 174 | 151 | 0 | 150 | 66 |


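## 4. Step 2: Extract RDKit Features

Extract molecular feature vectors from the processed CSVs and save them as `.npy` files, one folder per dataset.

**Feature types**:
- QED properties (8 dimensions)
- Physicochemical descriptors (11 dimensions)
- Gasteiger charge statistics (5 dimensions)
- Morgan fingerprint (1024 dimensions; set to 0 in this run)
- Avalon fingerprint (512 dimensions, optional; disabled in this run)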
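
The block sizes listed above determine the width of `X`. As a quick sanity check, the arithmetic can be sketched as follows; the helper `expected_feature_dim` is hypothetical and assumes that Avalon bits are counted only when `use_avalon` is true, which may not match how `PeptideFeaturizer` computes `n_features` internally.

```python
# Block sizes taken from the feature list above
BASE_BLOCKS = {"qed": 8, "physchem": 11, "gasteiger": 5}


def expected_feature_dim(morgan_bits, avalon_bits, use_avalon):
    """Sum the descriptor blocks and the enabled fingerprint bits."""
    dim = sum(BASE_BLOCKS.values()) + morgan_bits
    if use_avalon:
        # Assumption: Avalon bits are ignored unless the fingerprint is enabled
        dim += avalon_bits
    return dim


current_dim = expected_feature_dim(morgan_bits=0, avalon_bits=0, use_avalon=False)
full_dim = expected_feature_dim(morgan_bits=1024, avalon_bits=512, use_avalon=True)
print(current_dim, full_dim)  # 24 1560
```

With both fingerprints switched off, only the 24 descriptor columns remain.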

In [14]:
from pathlib import Path
import os
def extract_rdkit_features(csv_path: Path, output_dir: Path, featurizer):
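    """
    Extract RDKit features from a CSV and save them as .npy files.
    
    Args:
        csv_path: path to the input CSV file
        output_dir: output directory
        featurizer: a PeptideFeaturizer instance
    
    Returns:
        dict: summary statistics, for example (illustrative values)
        {
            "file": "US9809623B2_processed.csv",
            "total_samples": 80,
            "valid_samples": 76,  # count after dropping the 4 rows whose SMILES is nan
            "feature_dim": 1560,
            "output_path": "features/US9809623B2_processed.npz"  # the function now returns the per-dataset .npy folder
        }
    """
    # Load the CSV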
    df, _ = load_csv_safely(csv_path, required_columns=["id", "SMILES", "SIF_minutes", "SGF_minutes"])
    if df is None:
        return {"error": "Failed to load CSV"}
    
    X = []
    y_sif = []
    y_sgf = []
    ids = []
    valid_count = 0
    
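    # Extract features. A missing SMILES becomes the string 'nan' here,
    # fails to parse, and is skipped because success is False.
    for _, row in tqdm(df.iterrows(), total=len(df), desc=f"Extracting features {csv_path.name}", leave=False):
        smiles = str(row["SMILES"])
        features, success = featurizer.featurize(smiles)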
        
        if success and features is not None:
            X.append(features)
            y_sif.append(int(row["SIF_minutes"]) if not pd.isna(row["SIF_minutes"]) else -1)
            y_sgf.append(int(row["SGF_minutes"]) if not pd.isna(row["SGF_minutes"]) else -1)
            ids.append(str(row["id"]))
            valid_count += 1
    
    # Convert to NumPy arrays
    X = np.array(X, dtype=np.float32)
    y_sif = np.array(y_sif, dtype=np.int32)
    y_sgf = np.array(y_sgf, dtype=np.int32)
    ids = np.array(ids, dtype=object)
    feature_names = featurizer.get_feature_names()

    print("Array shapes after extraction:")
    print(len(feature_names), X.shape, y_sif.shape, y_sgf.shape, ids.shape)

    print("Saving features to .npy files...")
    save_dir = output_dir / csv_path.stem
    save_dir.mkdir(parents=True, exist_ok=True)
    print(save_dir)
    
    # ids is an object array, so np.save pickles it; load it with allow_pickle=True
    np.save(os.path.join(save_dir, "X.npy"), X)
    np.save(os.path.join(save_dir, "y_sif.npy"), y_sif)
    np.save(os.path.join(save_dir, "y_sgf.npy"), y_sgf)
    np.save(os.path.join(save_dir, "ids.npy"), ids)
    np.save(os.path.join(save_dir, "feature_names.npy"), feature_names)

    
    return {
        "file": csv_path.name,
        "total_samples": len(df),
        "valid_samples": valid_count,
        "feature_dim": X.shape[1],
        "output_path": save_dir,
    }

# Initialize the featurizer
featurizer = PeptideFeaturizer(
    morgan_bits=CONFIG['morgan_bits'],
    avalon_bits=CONFIG['avalon_bits'],
    use_avalon=CONFIG['use_avalon']
)


print("Featurizer configuration:")
print(f"  Morgan fingerprint: {CONFIG['morgan_bits']} bits")
print(f"  Avalon fingerprint: {CONFIG['avalon_bits']} bits (enabled: {CONFIG['use_avalon']})")
print(f"  Expected total feature dimension: {featurizer.n_features}\n")

# Run: batch feature extraction; this line only lists the processed dataset paths
processed_csvs = list(CONFIG['processed_dir'].glob('*_processed.csv'))
print(f"Found {len(processed_csvs)} processed CSV files\n")

feature_stats = []
for csv_file in processed_csvs:
    stats = extract_rdkit_features(csv_file, CONFIG['npy_datas_dir']/CONFIG["contain_type"], featurizer)
    if "error" not in stats:
        feature_stats.append(stats)
        print(f"✓ {stats['file']}: {stats['valid_samples']} samples, {stats['feature_dim']} features")


Featurizer configuration:
  Morgan fingerprint: 0 bits
  Avalon fingerprint: 0 bits (enabled: False)
  Expected total feature dimension: 24

Found 5 processed CSV files



Extracting features sif_sgf_second_processed.csv:   0%|          | 0/202 [00:00<?, ?it/s]



Array shapes after extraction:
24 (202, 24) (202,) (202,) (202,)
Saving features to .npy files...
d:\RA\feature_extraction\outputs\npy_datas\NoMorganAndAvalon\sif_sgf_second_processed
✓ sif_sgf_second_processed.csv: 202 samples, 24 features




Extracting features US20140294902A1_processed.csv:   0%|          | 0/5 [00:00<?, ?it/s]

Array shapes after extraction:
24 (5, 24) (5,) (5,) (5,)
Saving features to .npy files...
d:\RA\feature_extraction\outputs\npy_datas\NoMorganAndAvalon\US20140294902A1_processed
✓ US20140294902A1_processed.csv: 5 samples, 24 features




Extracting features US9624268_processed.csv:   0%|          | 0/130 [00:00<?, ?it/s]



Array shapes after extraction:
24 (130, 24) (130,) (130,) (130,)
Saving features to .npy files...
d:\RA\feature_extraction\outputs\npy_datas\NoMorganAndAvalon\US9624268_processed
✓ US9624268_processed.csv: 130 samples, 24 features


Extracting features US9809623B2_processed.csv:   0%|          | 0/32 [00:00<?, ?it/s]

[20:41:19] SMILES Parse Error: syntax error while parsing: nan
[20:41:19] SMILES Parse Error: check for mistakes around position 2:
[20:41:19] nan
[20:41:19] ~^
[20:41:19] SMILES Parse Error: Failed parsing SMILES 'nan' for input: 'nan'
[20:41:19] SMILES Parse Error: syntax error while parsing: nan
[20:41:19] SMILES Parse Error: check for mistakes around position 2:
[20:41:19] nan
[20:41:19] ~^
[20:41:19] SMILES Parse Error: Failed parsing SMILES 'nan' for input: 'nan'
[20:41:19] SMILES Parse Error: syntax error while parsing: nan
[20:41:19] SMILES Parse Error: check for mistakes around position 2:
[20:41:19] nan
[20:41:19] ~^
[20:41:19] SMILES Parse Error: Failed parsing SMILES 'nan' for input: 'nan'
[20:41:19] SMILES Parse Error: syntax error while parsing: nan
[20:41:19] SMILES Parse Error: check for mistakes around position 2:
[20:41:19] nan
[20:41:19] ~^
[20:41:19] SMILES Parse Error: Failed parsing SMILES 'nan' for input: 'nan'


Array shapes after extraction:
24 (28, 24) (28,) (28,) (28,)
Saving features to .npy files...
d:\RA\feature_extraction\outputs\npy_datas\NoMorganAndAvalon\US9809623B2_processed
✓ US9809623B2_processed.csv: 28 samples, 24 features


Extracting features WO2017011820A2_processed.csv:   0%|          | 0/151 [00:00<?, ?it/s]

[20:41:19] SMILES Parse Error: syntax error while parsing: nan
[20:41:19] SMILES Parse Error: check for mistakes around position 2:
[20:41:19] nan
[20:41:19] ~^
[20:41:19] SMILES Parse Error: Failed parsing SMILES 'nan' for input: 'nan'


Array shapes after extraction:
24 (150, 24) (150,) (150,) (150,)
Saving features to .npy files...
d:\RA\feature_extraction\outputs\npy_datas\NoMorganAndAvalon\WO2017011820A2_processed
✓ WO2017011820A2_processed.csv: 150 samples, 24 features
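
The `SMILES Parse Error ... 'nan'` messages in the output above are consistent with empty SMILES cells: a missing value converted with `str(...)` becomes the literal text `nan`, which is not valid SMILES. A minimal sketch of a guard that could be applied before featurizing is shown below; the helper name `is_usable_smiles` and the sample rows are hypothetical, and it only screens out missing values, not chemically invalid strings.

```python
import math


def is_usable_smiles(value):
    """True for a non-empty string that is not a stringified missing value."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    text = str(value).strip()
    return text != "" and text.lower() != "nan"


# A float NaN turns into the text 'nan' once passed through str()
nan_as_text = str(float("nan"))

rows = ["CC(=O)O", float("nan"), "", None, "nan"]
kept = [r for r in rows if is_usable_smiles(r)]
print(nan_as_text, kept)
```

Filtering this way would leave the sample counts unchanged, since those rows are already dropped when parsing fails, but it would keep the log free of parse errors.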




In [15]:
print("Run completed successfully; check the following content", flush=True)


Run completed successfully; check the following content
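
For model training, the per-dataset folders usually need to be merged and the `-1` placeholders removed for the chosen target. The sketch below shows one way to do that; it is an illustration rather than part of the project, the function name `stack_datasets` and the toy folders are made up, and it assumes every folder holds `X.npy` plus a label file named after the target.

```python
from pathlib import Path
import tempfile

import numpy as np


def stack_datasets(root_dir, target="y_sif"):
    """Concatenate X and one label vector across dataset folders, dropping -1 labels."""
    X_parts, y_parts = [], []
    for dataset_dir in sorted(Path(root_dir).iterdir()):
        if not (dataset_dir / "X.npy").exists():
            continue  # skip anything that is not a dataset folder
        X = np.load(dataset_dir / "X.npy")
        y = np.load(dataset_dir / f"{target}.npy")
        keep = y != -1  # -1 marks a missing label for this target
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)


# Two toy dataset folders standing in for the real outputs (hypothetical values)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    toy = {
        "a_processed": (np.zeros((3, 24), dtype=np.float32), [10, -1, 30]),
        "b_processed": (np.ones((2, 24), dtype=np.float32), [-1, 60]),
    }
    for name, (X_toy, y_toy) in toy.items():
        (root / name).mkdir()
        np.save(root / name / "X.npy", X_toy)
        np.save(root / name / "y_sif.npy", np.array(y_toy, dtype=np.int32))
    X_all, y_all = stack_datasets(root, target="y_sif")

print(X_all.shape, y_all.tolist())
```

Masking per target keeps samples that have only one of the two labels available for the model that can use them.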
