# Data Cleaning

This notebook covers the cleaning process for `processed_data.csv`, preparing it for classification tasks.
Key steps include:
1. Loading data
2. Text preprocessing (removing stopwords, punctuation, special characters)
3. **Masking words highly similar to labels** (replacing them with a `[DISEASE]` token to prevent data leakage, where a model predicts the label directly from label words)
4. Saving cleaned data

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK packages (if not already downloaded)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
# 1. Load Data
df = pd.read_csv('processed_data.csv')

# View first few rows
print(f"Total rows: {len(df)}")
df.head()

Total rows: 1000


,Content,Paper Name,Label,Document Type,Affiliations
0,Insulin resistance is a condition characterize...,Brain insulin resistance mediated cognitive im...,Alzheimer's Disease,Review,"Department of Pharmaceutical Sciences, Maharsh..."
1,Prolactin is a pituitary anterior lobe hormone...,Hyperprolactinemia and Brain Health: Exploring...,Alzheimer's Disease,Review,"School of Pharmacy, Hubei University of Chines..."
2,Lecanemab is an amyloid-targeted antibody indi...,Severe Persistent Urinary Retention Following ...,Alzheimer's Disease,Article,"Department of Psychiatry, Duke University Scho..."
3,Glycoprotein 88 (GP88) is a secreted biomarker...,An Impedimetric Immunosensor for Progranulin D...,Alzheimer's Disease,Article,"University of New Brunswick, Fredericton, NB, ..."
4,Disruption of the blood–brain barrier (BBB) ac...,Regulation of Blood–Brain Barrier Permeability...,Alzheimer's Disease,Review,"Department of Pharmacology, Research Institute..."


In [None]:
# 2. Define words to remove and masking strategy

# Get NLTK English stopwords
stop_words = set(stopwords.words('english'))

# Define words highly related to labels (disease names) for masking
# Labels include: Alzheimer's disease, Frontotemporal dementia, Lewy body dementia, Mild cognitive impairment, Parkinson's disease
label_related_words = {
    'alzheimer', 'alzheimers', 'ad', # Alzheimer's disease
    'frontotemporal', 'ftd', # Frontotemporal dementia
    'lewy', 'lbd', 'dlb', # Lewy body dementia
    'parkinson', 'parkinsons', 'pd', # Parkinson's disease
    'dementia', 'disease', 'syndrome', 'disorder', # Common medical suffixes
    'vascular',  # Vascular dementia
}

# Note: these words are not added to the stopword list for deletion; they are replaced with a mask token in the cleaning step
print(f"Total base stopwords: {len(stop_words)}")
print(f"Number of words to mask: {len(label_related_words)}")

Total base stopwords: 198
Number of words to mask: 16
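
The mask list above is written by hand. As a cross-check, candidate mask words could also be derived from the label strings themselves. The following is a minimal sketch, assuming the five labels named in the comment above; `label_tokens` is a hypothetical helper and not part of the notebook.

```python
import re

def label_tokens(labels, min_len=2):
    """Collect lowercase alphabetic tokens from label strings (hypothetical helper)."""
    tokens = set()
    for label in labels:
        # Same normalization as the cleaning step: lowercase, keep letters only
        for tok in re.sub(r'[^a-z\s]', ' ', label.lower()).split():
            if len(tok) >= min_len:
                tokens.add(tok)
    return tokens

# Assumed label strings, copied from the comment in the cell above
labels = [
    "Alzheimer's disease",
    "Frontotemporal dementia",
    "Lewy body dementia",
    "Mild cognitive impairment",
    "Parkinson's disease",
]
candidates = label_tokens(labels)
print(sorted(candidates))
```

This surfaces tokens such as `mild`, `cognitive`, `impairment`, and `body`, which the hand-written set does not include. Whether to mask them is a judgment call, since they also carry ordinary meaning outside the label.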


In [None]:
# 3. Define data cleaning function

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    if not isinstance(text, str):
        return ""
    
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Remove URLs and email addresses (HTML tags are not handled here)
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    
    # 3. Remove special characters and numbers (keep letters and spaces)
    # The regex [^a-zA-Z\s] means replace all characters except letters and whitespace with a space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    
    # 4. Tokenize and remove stopwords, junk characters, lemmatize
    words = text.split()
    cleaned_words = []
    
    for word in words:
        # Skip single-character tokens (stray letters such as the "s" left over from "alzheimer's")
        if len(word) < 2:
            continue
            
        lemma_word = lemmatizer.lemmatize(word)
        
        # Check for masking first, so a label-related word is masked even if it is also a stopword
        if word in label_related_words or lemma_word in label_related_words:
            cleaned_words.append('[DISEASE]')
            continue

        # Check if in stop words list (using base stop_words)
        if word not in stop_words and lemma_word not in stop_words:
            cleaned_words.append(lemma_word)
            
    return " ".join(cleaned_words)

# Test cleaning function
sample_text = "Background: Patients with Alzheimer's disease (AD) often show cognitive impairment. 123 http://test.com"
print("Original text:", sample_text)
print("Cleaned text:", clean_text(sample_text))

Original text: Background: Patients with Alzheimer's disease (AD) often show cognitive impairment. 123 http://test.com
Cleaned text: background patient [DISEASE] [DISEASE] [DISEASE] often show cognitive impairment
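
In the test output, a single mention ("Alzheimer's disease (AD)") produces three consecutive `[DISEASE]` tokens. If repeated masks turn out to be noisy features, runs of them could be collapsed into one. This is a minimal sketch of an optional post-processing step; `collapse_masks` is a hypothetical helper and is not applied anywhere in this notebook.

```python
def collapse_masks(text, mask='[DISEASE]'):
    """Replace runs of consecutive mask tokens with a single mask (hypothetical helper)."""
    out = []
    for tok in text.split():
        # Drop a mask token that immediately follows another mask token
        if tok == mask and out and out[-1] == mask:
            continue
        out.append(tok)
    return " ".join(out)

cleaned = "background patient [DISEASE] [DISEASE] [DISEASE] often show cognitive impairment"
print(collapse_masks(cleaned))
# background patient [DISEASE] often show cognitive impairment
```

Collapsing discards how many label words appeared in a row, so it is only worth doing if that count is not meant to be a signal.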


In [None]:
# 4. Apply cleaning function to Content column

# 'Paper Name' could be cleaned the same way; here only 'Content' is cleaned
# and the result is stored in a new column, 'Cleaned_Content'
df['Cleaned_Content'] = df['Content'].apply(clean_text)

# Check for empty values after cleaning (rows might be empty if they contained only stopwords)
print("Rows empty after cleaning:", (df['Cleaned_Content'] == "").sum())

# Remove rows that are empty after cleaning
df = df[df['Cleaned_Content'] != ""]

# Reset index
df.reset_index(drop=True, inplace=True)

# Compare before and after cleaning
df[['Content', 'Cleaned_Content', 'Label']].head()

Rows empty after cleaning: 0


,Content,Cleaned_Content,Label
0,Insulin resistance is a condition characterize...,insulin resistance condition characterized att...,Alzheimer's Disease
1,Prolactin is a pituitary anterior lobe hormone...,prolactin pituitary anterior lobe hormone play...,Alzheimer's Disease
2,Lecanemab is an amyloid-targeted antibody indi...,lecanemab amyloid targeted antibody indicated ...,Alzheimer's Disease
3,Glycoprotein 88 (GP88) is a secreted biomarker...,glycoprotein gp secreted biomarker overexpress...,Alzheimer's Disease
4,Disruption of the blood–brain barrier (BBB) ac...,disruption blood brain barrier bbb accompanies...,Alzheimer's Disease
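
Because masking relies on exact token matches, derived forms that are not in `label_related_words` (for example "parkinsonian") would likely pass through unmasked. The sketch below shows one way to look for such leftovers by prefix; `find_leaks` and the stem list are assumptions made for illustration, not part of the notebook.

```python
def find_leaks(texts, stems):
    """Return (position, tokens) for texts with tokens starting with a label stem (hypothetical helper)."""
    stems = tuple(stems)  # str.startswith accepts a tuple of prefixes
    leaks = []
    for pos, text in enumerate(texts):
        hits = sorted({tok for tok in text.split() if tok.startswith(stems)})
        if hits:
            leaks.append((pos, hits))
    return leaks

# Assumed stems: only the long, distinctive label words; short acronyms such as
# 'ad' or 'pd' are left out because they are prefixes of many ordinary words
stems = ('alzheimer', 'parkinson', 'frontotemporal', 'lewy')
sample = [
    "insulin resistance condition [DISEASE]",
    "patient parkinsonian symptom [DISEASE]",
]
leaks = find_leaks(sample, stems)
print(leaks)
# [(1, ['parkinsonian'])]
```

On the real data, the same check could be run as `find_leaks(df['Cleaned_Content'], stems)`. Any hits would be candidates for adding to the mask list.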


In [None]:
# 5. Save cleaned data
output_file = 'cleaned_data.csv'
df.to_csv(output_file, index=False)
print(f"Cleaned data saved to: {output_file}")

Cleaned data saved to: cleaned_data.csv
