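# Data Cleaning

This notebook cleans the data in `processed_data.csv` to prepare it for a classification task.
The main steps are:
1. Load the data
2. Preprocess the text (remove stop words, punctuation, and special characters)
3. **Mask words that are too similar to the labels** (so the model cannot predict directly from label words, i.e. "data leakage")
4. Save the cleaned data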

In [12]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data packages (if not already downloaded)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [13]:
# 1. Load the data
df = pd.read_csv('processed_data.csv')

# Preview the first few rows
print(f"Total rows: {len(df)}")
df.head()


Total rows: 1641


,Content,Paper Name,Label
0,Insulin resistance is a condition characterize...,Brain insulin resistance mediated cognitive im...,Alzheimer's Disease
1,substrate 1 (IRS1)/PI3K/AKT and IGF-1 receptor...,Brain insulin resistance mediated cognitive im...,Alzheimer's Disease
2,Prolactin is a pituitary anterior lobe hormone...,Hyperprolactinemia and Brain Health: Exploring...,Alzheimer's Disease
3,Lecanemab is an amyloid-targeted antibody indi...,Severe Persistent Urinary Retention Following ...,Alzheimer's Disease
4,Glycoprotein 88 (GP88) is a secreted biomarker...,An Impedimetric Immunosensor for Progranulin D...,Alzheimer's Disease
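
Before cleaning, it can help to check class balance, missing text, and duplicate rows. The following is a minimal sketch on a made-up frame; the `toy` DataFrame and its values are assumptions for illustration, and only the `Content` and `Label` column names come from the notebook.

```python
import pandas as pd

# Toy frame standing in for processed_data.csv (values are made up)
toy = pd.DataFrame({
    "Content": ["text a", "text b", None, "text a"],
    "Label": ["Alzheimer's Disease", "Parkinson's Disease",
              "Alzheimer's Disease", "Alzheimer's Disease"],
})

# Number of rows per class
label_counts = toy["Label"].value_counts()

# Rows with no text at all
missing_content = int(toy["Content"].isna().sum())

# Rows that repeat an earlier (Content, Label) pair
duplicate_rows = int(toy.duplicated(subset=["Content", "Label"]).sum())

print(label_counts)
print("missing:", missing_content, "duplicates:", duplicate_rows)
```

The same three expressions should work on `df` directly once the column names match.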


In [14]:
# 2. Define the words to remove and the masking strategy

# Get NLTK's English stop words
stop_words = set(stopwords.words('english'))

# Define words closely tied to the labels (disease names); these will be masked
# Labels: Alzheimer's disease, Frontotemporal dementia, Lewy body dementia, Mild cognitive impairment, Parkinson's disease
label_related_words = {
    'alzheimer', 'alzheimers', 'ad', # Alzheimer's disease
    'frontotemporal', 'ftd', # Frontotemporal dementia
    'lewy', 'lbd', 'dlb', # Lewy body dementia
    'parkinson', 'parkinsons', 'pd', # Parkinson's disease
    'dementia', 'disease', 'syndrome', 'disorder', # generic medical suffixes
    'vascular',  # Vascular dementia
}

# Note: these words are no longer added to the stop-word list for deletion; they are replaced with a mask token in a later step
print(f"Base stop words: {len(stop_words)}")
print(f"Words to mask: {len(label_related_words)}")

Base stop words: 198
Words to mask: 16
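
The mask vocabulary above is written by hand. As a cross-check, candidate words can be derived from the label names themselves. This is only a sketch: `label_tokens` and the initialism heuristic are hypothetical helpers, not part of the notebook, and their output is a starting point for manual review rather than a replacement for the curated set.

```python
import re

# Label names as listed in the cell above
labels = [
    "Alzheimer's disease",
    "Frontotemporal dementia",
    "Lewy body dementia",
    "Mild cognitive impairment",
    "Parkinson's disease",
]

def label_tokens(label):
    """Lowercase a label, drop possessive 's, and split into alphabetic tokens."""
    label = re.sub(r"'s\b", "", label.lower())
    return re.findall(r"[a-z]+", label)

candidates = set()
for label in labels:
    tokens = label_tokens(label)
    candidates.update(tokens)
    # Initialism built from first letters, e.g. "lewy body dementia" -> "lbd"
    # (hypothetical heuristic; it will not find every abbreviation in use)
    candidates.add("".join(t[0] for t in tokens))

print(sorted(candidates))
```

The heuristic also produces generic words such as `body` and `mild`, so each candidate needs a human decision before it is added to the mask set.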


In [15]:
# 3. Define the cleaning function

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    if not isinstance(text, str):
        return ""

    # 1. Convert to lowercase
    text = text.lower()

    # 2. Remove URLs and email addresses (HTML tags, if present, could be stripped here as well)
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)

    # 3. Remove special characters and digits (keep letters and whitespace)
    # The pattern [^a-zA-Z\s] replaces every character that is not a letter or whitespace with a space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)

    # 4. Tokenize, then drop stop words and junk tokens and lemmatize
    words = text.split()
    cleaned_words = []

    for word in words:
        # Drop very short words (single letters carry little meaning apart from
        # common words such as 'a' and 'i', so they are all filtered out here)
        if len(word) < 2:
            continue

        lemma_word = lemmatizer.lemmatize(word)

        # Check for masking first, before the stop-word check, so that a label word
        # is masked even if it also happens to be a stop word (unlikely)
        if word in label_related_words or lemma_word in label_related_words:
            cleaned_words.append('[DISEASE]')
            continue

        # Keep the lemma only if neither form is in the base stop_words set
        if word not in stop_words and lemma_word not in stop_words:
            cleaned_words.append(lemma_word)

    return " ".join(cleaned_words)

# Test the cleaning function
sample_text = "Background: Patients with Alzheimer's disease (AD) often show cognitive impairment. 123 http://test.com"
print("Original text:", sample_text)
print("Cleaned text:", clean_text(sample_text))

Original text: Background: Patients with Alzheimer's disease (AD) often show cognitive impairment. 123 http://test.com
Cleaned text: background patient [DISEASE] [DISEASE] [DISEASE] often show cognitive impairment
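
To confirm that masking worked, the cleaned text can be scanned for any mask-vocabulary word that survived. The sketch below uses only the standard library; `leaked_terms` is a hypothetical helper, and `mask_words` is an assumed subset of `label_related_words` shown for brevity.

```python
# Assumed subset of the notebook's mask vocabulary (for illustration only)
mask_words = {"alzheimer", "alzheimers", "ad", "parkinson", "parkinsons", "pd",
              "dementia", "disease"}

def leaked_terms(cleaned_text, vocabulary=mask_words):
    """Return the vocabulary words that still appear as whole tokens."""
    # Split on whitespace so that the token "[DISEASE]" stays intact and is
    # not mistaken for the bare word "disease"
    tokens = set(cleaned_text.lower().split())
    return sorted(tokens & vocabulary)

masked = "background patient [DISEASE] [DISEASE] [DISEASE] often show cognitive impairment"
unmasked = "patient with parkinsonian symptom and pd diagnosis"

print(leaked_terms(masked))
print(leaked_terms(unmasked))
```

Exact token matching, as used in this sketch, does not flag derived forms such as `parkinsonian`; catching those would need a prefix or stem comparison.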


In [16]:
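# 4. Apply the cleaning function to the Content column

# 'Paper Name' could be cleaned as well, but the focus here is on 'Content';
# the result is stored in a new column, 'Cleaned_Content'
df['Cleaned_Content'] = df['Content'].apply(clean_text)

# Count rows that are empty after cleaning (a row containing only stop words would end up empty)
print("Rows empty after cleaning:", (df['Cleaned_Content'] == "").sum())

# Drop rows whose cleaned content is empty
df = df[df['Cleaned_Content'] != ""]

# Reset the index
df.reset_index(drop=True, inplace=True)

# Compare the text before and after cleaning
df[['Content', 'Cleaned_Content', 'Label']].head()


Rows empty after cleaning: 0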


,Content,Cleaned_Content,Label
0,Insulin resistance is a condition characterize...,insulin resistance condition characterized att...,Alzheimer's Disease
1,substrate 1 (IRS1)/PI3K/AKT and IGF-1 receptor...,substrate irs pi akt igf receptor igf irs pi p...,Alzheimer's Disease
2,Prolactin is a pituitary anterior lobe hormone...,prolactin pituitary anterior lobe hormone play...,Alzheimer's Disease
3,Lecanemab is an amyloid-targeted antibody indi...,lecanemab amyloid targeted antibody indicated ...,Alzheimer's Disease
4,Glycoprotein 88 (GP88) is a secreted biomarker...,glycoprotein gp secreted biomarker overexpress...,Alzheimer's Disease
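
After cleaning, it may be worth checking how much of each text was replaced by `[DISEASE]`, since a large difference in mask rate between labels could itself hint at the class. This is a rough sketch on made-up rows; `mask_rate` and the `rows` list are assumptions for illustration, standing in for the `Cleaned_Content` and `Label` columns.

```python
from collections import defaultdict

MASK = "[DISEASE]"

# Made-up (cleaned_text, label) pairs standing in for Cleaned_Content / Label
rows = [
    ("insulin resistance [DISEASE] [DISEASE] brain", "Alzheimer's Disease"),
    ("lecanemab amyloid antibody", "Alzheimer's Disease"),
    ("dopamine loss [DISEASE] motor symptom", "Parkinson's Disease"),
]

def mask_rate(text):
    """Fraction of whitespace-separated tokens equal to the mask token."""
    tokens = text.split()
    return tokens.count(MASK) / len(tokens) if tokens else 0.0

# Collect the per-row mask rates under each label
rates_by_label = defaultdict(list)
for text, label in rows:
    rates_by_label[label].append(mask_rate(text))

# Print the mean mask rate for each label
for label, rates in rates_by_label.items():
    print(label, round(sum(rates) / len(rates), 3))
```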


In [17]:
# 5. Save the cleaned data
output_file = 'cleaned_data.csv'
df.to_csv(output_file, index=False)
print(f"Cleaned data saved to: {output_file}")


Cleaned data saved to: cleaned_data.csv
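
A quick round trip shows what a downstream notebook will see when it reloads the file. The sketch below writes to an in-memory buffer instead of disk; the `toy` frame and its values are assumptions for illustration, and the second row is deliberately left empty to show why empty rows are dropped before saving.

```python
import io
import pandas as pd

# Toy frame; column names follow the notebook, values are made up
toy = pd.DataFrame({
    "Cleaned_Content": ["insulin resistance [DISEASE]", ""],
    "Label": ["Alzheimer's Disease", "Parkinson's Disease"],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # same arguments as the save cell above
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# index=False means no extra index column appears on reload
print(list(reloaded.columns))

# With default settings, read_csv parses an empty field as NaN, not ""
print(int(reloaded["Cleaned_Content"].isna().sum()))
```

An empty string written to CSV comes back as a missing value, so code that later calls string methods on `Cleaned_Content` would need to handle NaN if such rows were kept.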
