# Data Preprocessing Pipeline

This notebook preprocesses medical literature datasets stored as CSV files, one file per class.
The workflow includes:
1. **Loading** each CSV file from the `datasets` folder.
2. **Sampling** up to 200 documents from each class.
3. **Preprocessing** each abstract into a single 100-word chunk (truncated or padded).
4. **Labeling** each record with its source file name and consolidating the data into a single CSV file.

In [16]:
import os
import pandas as pd

# Configuration / Settings
datasets_folder = 'datasets'
output_file = 'processed_data.csv'
chunk_size = 100
sample_size = 200

print(f"Datasets folder: {os.path.abspath(datasets_folder)}")
print(f"Output file: {output_file}")

Datasets folder: c:\Users\Administrator\Desktop\Classification\datasets
Output file: processed_data.csv


## 1. Process Data Files

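We iterate through each `.csv` file in the target folder; the file name (without its extension) becomes the class label.
For each file:
- Check for the required columns (`Title`, `Abstract`) and skip the file if either is missing.
- Randomly sample `sample_size` documents (with a fixed seed) if the file contains more than that.
- Skip rows whose `Abstract` is missing or not a string.
- Keep the first 100 words of each `Abstract`, truncating longer text.
- Pad abstracts shorter than 100 words with `[PAD]` tokens so that every chunk has exactly 100 tokens.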
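The truncate-or-pad step that the code cell below applies to each abstract can be isolated as a small function. This is a minimal sketch: the helper name `first_n_words` and the constant `PAD_TOKEN` are illustrative and do not appear in the notebook.

```python
PAD_TOKEN = "[PAD]"  # same placeholder string the notebook uses


def first_n_words(text: str, n: int = 100, pad_token: str = PAD_TOKEN) -> str:
    """Return exactly n whitespace-separated tokens.

    Longer text is truncated to its first n words; shorter text is
    right-padded with pad_token.
    """
    words = text.split()[:n]
    words += [pad_token] * (n - len(words))
    return " ".join(words)


short = first_n_words("amyloid beta plaques", n=5)
long = first_n_words("a b c d e f g", n=5)
print(short)  # amyloid beta plaques [PAD] [PAD]
print(long)   # a b c d e
```

Note that padding tokens become part of the `Content` string, so a downstream tokenizer or vectorizer will see `[PAD]` as ordinary text unless it is told to ignore it.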

In [17]:
# Initialize list to store results
results = []

# Check if folder exists
if not os.path.exists(datasets_folder):
    print(f"Error: Folder '{datasets_folder}' not found.")
else:
    # Get list of CSV files
    files = [f for f in os.listdir(datasets_folder) if f.endswith('.csv')]
    print(f"Found {len(files)} CSV files in '{datasets_folder}':\n")

    for filename in files:
        file_path = os.path.join(datasets_folder, filename)
        label = os.path.splitext(filename)[0]
        
        print(f"--- Processing: {filename} ---")
        
        try:
            # Read CSV
            df = pd.read_csv(file_path)
            
            # Check for required columns
            if 'Abstract' not in df.columns or 'Title' not in df.columns:
                print("  [Skipped] Required columns 'Abstract' or 'Title' missing.")
                continue
            
            total_docs = len(df)
            
            # Randomly sample up to sample_size documents (fixed seed for reproducibility)
            if total_docs > sample_size:
                sampled_df = df.sample(n=sample_size, random_state=42)
                print(f"  Sampling: Selected {sample_size} out of {total_docs} documents.")
            else:
                sampled_df = df
                print(f"  Taking all: Used all {total_docs} documents.")
            
            # Process Abstract content
            file_record_count = 0
            for _, row in sampled_df.iterrows():
                abstract = row['Abstract']
                paper_name = row['Title']
                
                # Fetch additional fields
                # Using .get() to avoid errors if columns are missing in some files
                doc_type = row.get('Document Type', '')
                affiliations = row.get('Affiliations', '')

                # Handle missing or non-string abstracts
                if pd.isna(abstract) or not isinstance(abstract, str):
                    continue
                
                # Split into words
                words = abstract.split()
                
                # Produce exactly chunk_size tokens per abstract:
                # keep the first chunk_size words, truncating longer abstracts
                # and padding shorter ones with '[PAD]'.
                
                if len(words) >= chunk_size:
                    chunk = words[:chunk_size]
                else:
                    # Pad with a placeholder token
                    chunk = words + ['[PAD]'] * (chunk_size - len(words))
                
                chunk_text = ' '.join(chunk)
                
                results.append({
                    'Content': chunk_text,
                    'Paper Name': paper_name,
                    'Label': label,
                    'Document Type': doc_type,
                    'Affiliations': affiliations
                })
                file_record_count += 1
            
            print(f"  > Generated {file_record_count} valid records from this file.\n")

        except Exception as e:
            # Covers both read failures and errors raised while processing rows
            print(f"  [Error] Failed to process {filename}: {e}\n")

print(f"Processing complete. Total records collected: {len(results)}")

Found 5 CSV files in 'datasets':

--- Processing: Alzheimer's Disease.csv ---
  Sampling: Selected 200 out of 20000 documents.
  > Generated 200 valid records from this file.

--- Processing: Frontotemporal Dementia.csv ---
  Sampling: Selected 200 out of 20000 documents.
  > Generated 200 valid records from this file.

--- Processing: Lewy Body Dementia.csv ---
  Sampling: Selected 200 out of 14268 documents.
  > Generated 200 valid records from this file.

--- Processing: Parkinson's Disease.csv ---
  Sampling: Selected 200 out of 20000 documents.
  > Generated 200 valid records from this file.

--- Processing: Vascular Dementia.csv ---
  Sampling: Selected 200 out of 20000 documents.
  > Generated 200 valid records from this file.

Processing complete. Total records collected: 1000
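In the run above every file yielded 200 records. However, the cell samples first and skips rows with missing abstracts afterwards, so a file containing empty abstracts could contribute fewer than `sample_size` records. One possible adjustment is to filter before sampling. The sketch below assumes the same `Abstract` column; the helper name `sample_valid_abstracts` is hypothetical and not part of the notebook.

```python
import pandas as pd


def sample_valid_abstracts(df: pd.DataFrame, n: int = 200, seed: int = 42) -> pd.DataFrame:
    """Keep rows whose Abstract is a non-empty string, then sample at most n of them."""
    is_text = df["Abstract"].apply(lambda v: isinstance(v, str) and bool(v.strip()))
    valid = df[is_text]
    return valid.sample(n=min(n, len(valid)), random_state=seed)


# Toy frame: two usable abstracts, one missing, one whitespace-only
toy = pd.DataFrame({
    "Title": ["t1", "t2", "t3", "t4"],
    "Abstract": ["some text", None, "more text", "   "],
})
picked = sample_valid_abstracts(toy, n=3)
print(len(picked))  # 2
```

With this ordering, a class falls below `n` records only when the file itself has fewer than `n` usable abstracts.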


## 2. Save Results

We inspect the resulting dataset (shape, class distribution, and first five rows) and save it to a CSV file.
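Beyond shape and class counts, it may be worth confirming that every `Content` value really has the fixed token length. A minimal sketch of such a check, assuming the `Content` and `Label` columns produced earlier; the function name `check_processed` is illustrative only:

```python
import pandas as pd


def check_processed(df: pd.DataFrame, chunk_size: int = 100) -> dict:
    """Summarise row count, fixed-length compliance, and records per label."""
    token_counts = df["Content"].str.split().str.len()
    return {
        "rows": len(df),
        "all_fixed_length": bool((token_counts == chunk_size).all()),
        "per_label": df["Label"].value_counts().to_dict(),
    }


# Toy data with a chunk size of 3 to keep the example readable
toy = pd.DataFrame({
    "Content": ["a b [PAD]", "c d e"],
    "Label": ["X", "Y"],
})
report = check_processed(toy, chunk_size=3)
print(report)
```

In the notebook this would be called as `check_processed(result_df, chunk_size)` before saving.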

In [18]:
if results:
    result_df = pd.DataFrame(results)
    
    # --- Display Info ---
    print(f"Total Rows: {result_df.shape[0]}")
    print(f"Total Columns: {result_df.shape[1]}")
    
    print("\n--- Class Distribution (Records per Label) ---")
    print(result_df['Label'].value_counts())
    
    print("\n--- First 5 Records ---")
    
    try:
        display(result_df.head())
    except NameError:
        print(result_df.head())

    # Save to CSV
    result_df.to_csv(output_file, index=False, encoding='utf-8-sig')
    print(f"\n[Success] Data saved to: {output_file}")
else:
    print("[Warning] No data generated. Review the source files and logic.")

Total Rows: 1000
Total Columns: 5

--- Class Distribution (Records per Label) ---
Label
Alzheimer's Disease        200
Frontotemporal Dementia    200
Lewy Body Dementia         200
Parkinson's Disease        200
Vascular Dementia          200
Name: count, dtype: int64

--- First 5 Records ---


| | Content | Paper Name | Label | Document Type | Affiliations |
|---|---|---|---|---|---|
| 0 | Insulin resistance is a condition characterize... | Brain insulin resistance mediated cognitive im... | Alzheimer's Disease | Review | Department of Pharmaceutical Sciences, Maharsh... |
| 1 | Prolactin is a pituitary anterior lobe hormone... | Hyperprolactinemia and Brain Health: Exploring... | Alzheimer's Disease | Review | School of Pharmacy, Hubei University of Chines... |
| 2 | Lecanemab is an amyloid-targeted antibody indi... | Severe Persistent Urinary Retention Following ... | Alzheimer's Disease | Article | Department of Psychiatry, Duke University Scho... |
| 3 | Glycoprotein 88 (GP88) is a secreted biomarker... | An Impedimetric Immunosensor for Progranulin D... | Alzheimer's Disease | Article | University of New Brunswick, Fredericton, NB, ... |
| 4 | Disruption of the blood–brain barrier (BBB) ac... | Regulation of Blood–Brain Barrier Permeability... | Alzheimer's Disease | Review | Department of Pharmacology, Research Institute... |



[Success] Data saved to: processed_data.csv
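The file is written with `encoding='utf-8-sig'`, which prepends a UTF-8 byte order mark (BOM). Reading it back with the same encoding keeps the BOM out of the first column name. The sketch below demonstrates the round trip on a temporary file with toy data, not on the notebook's actual output:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    "Content": ["blood–brain barrier"],
    "Label": ["Alzheimer's Disease"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_data.csv")
    frame.to_csv(path, index=False, encoding="utf-8-sig")

    # The first three bytes of a utf-8-sig file are the BOM
    with open(path, "rb") as fh:
        has_bom = fh.read(3) == b"\xef\xbb\xbf"

    # Decoding with utf-8-sig strips the BOM again
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(has_bom)                 # True
print(list(restored.columns))  # ['Content', 'Label']
```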
