In [None]:
# === 1. Core Data Handling ===
import pandas as pd
import numpy as np
import os
import re
import string
import json

# === 2. Text Preprocessing & NLP ===
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer

# Vectorization & Embeddings
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec
import torch
from transformers import BertTokenizer, BertModel

# === 3. Machine Learning Tools ===
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer, MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# === 4. Deep Learning (TensorFlow/Keras) ===
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (
    Input, Dense, Dropout, BatchNormalization, 
    Flatten, Concatenate, GlobalAveragePooling1D, SimpleRNN)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# === 5. Multi-Label Evaluation Metrics ===
from sklearn.metrics import (
    f1_score, 
    precision_score, 
    recall_score, 
    hamming_loss, 
    classification_report,
    multilabel_confusion_matrix
)
# === 6. Visualization ===
import matplotlib.pyplot as plt
import seaborn as sns

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

In [None]:
# --- 1. Load Data ---
loader = DataLoader()
data = loader.load_data()

# --- 2. Initial EDA (Dirty Data) ---
data['word_count'] = data['text'].apply(lambda x: len(str(x).split()))
MedicalVisualizer.plot_class_distribution(data)
MedicalVisualizer.plot_length_distribution(data, col='word_count')

# --- 3. Cleaning & Preprocessing ---
print("\n--- Starting Deep Cleaning Pipeline ---")
preprocessor = TextPreprocessor()

# Apply cleaning
data['clean_text'] = data['text'].apply(preprocessor.clean_text)

# Check results
print(f"Sample Clean Text: {data['clean_text'].iloc[0][:100]}...")

## Data Loading and Initial Inspection:
* We start by initializing the environment with the necessary NLP and machine learning libraries, including NLTK, Scikit-Learn, and TensorFlow.
* We then load the dataset, notably utilizing a fallback encoding strategy (latin-1) because the standard UTF-8 encoding failed. This immediate failure to decode suggests the presence of special characters or older file formats in the source documents. 
* The initial inspection reveals a dataset of 7,570 long-form research papers classified into three categories:
    1. Thyroid,
    2. Colon, and
    3. Lung cancer.
* Crucially, the class distribution is reasonably balanced (ranging from roughly 2,200 to 2,800 samples per class), indicating that we do not need to apply aggressive resampling techniques like SMOTE, and a standard stratified split will be sufficient for evaluation.

## Data Quality Audit and Dataset Reduction:
* Upon conducting a data quality audit, we discovered that approximately **87%** of the original dataset consisted of exact duplicate entries, likely resulting from artificial data augmentation or scraping errors in the source. 
* We immediately removed these 6,574 duplicates, reducing the total dataset from **7,570** to **996** unique documents. This step was essential to eliminate data leakage and prepare the dataset for later processing. 

## Document Length Analysis:
* This section analyzes the physical structure of the text data by calculating word counts for every document. The statistics reveal that we are dealing with full-length research papers as well rather than only short abstracts, with a mean length of approximately 3,000 words and a maximum exceeding 5,000 words. 
* This is a critical insight for model selection, as standard transformer models like BERT have a hard input limit of 512 tokens. Since the text far exceeds this limit, we know we must either use truncation strategies or rely on frequency-based methods (like TF-IDF) that can handle the full document context. 
* The visualization confirms that document length is consistent across all three cancer types, meaning length itself is not a predictive feature.

In [None]:
# --- 4. Post-Cleaning Analysis ---
MedicalVisualizer.plot_wordclouds(data, 'clean_text', 'Class Labels')

# --- 5. Feature Engineering (Vectorization) ---
print("\n--- Preparing Features ---")
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['Class Labels'])

# Stratified Split
X_train, X_test, y_train, y_test = train_test_split(
    data['clean_text'], y, test_size=0.3, random_state=42, stratify=y
)

## Artifact and Encoding Scan:
* Here we perform a deep scan for non-standard characters to assess the "cleaneness" of the raw text. The analysis uncovers a widespread encoding issue, with nearly 37% of the documents containing non-ASCII artifacts like ï, ¬, and “. By counting these occurrences, we identified that these are broken ligatures (e.g., the letters "fi" becoming "ï¬") likely caused by PDF-to-text extraction errors. 
* This step is vital because if left unaddressed, these artifacts would corrupt the tokenization process, causing the model to treat words like "identified" and "identifi¬ed" as completely different terms. This analysis proves that standard cleaning would be insufficient and necessitates a custom character-replacement function.

## N-Gram and Boilerplate Detection:
* We generate bigrams and frequency lists to understand the semantic landscape of the raw text before modeling. This step exposes two critical problems that would have otherwise ruined the model's validity:
    1. generic noise and
    2. data leakage. 
* First, we find that words like "cancer," "cell," and "tumor" appear in the top 10 for all classes, providing no discriminatory power. 
* Second, and more dangerously, we detect high-frequency legal and publishing terms like "creative commons," "plos one," and "biochemical society" appearing unevenly across classes. This reveals that specific cancer types in this dataset were sourced from specific journals, meaning a model could potentially "cheat" by learning the publisher's footer format rather than the medical content.

## Domain-Specific Data Cleaning and Verification:
* This block implements the cleaning pipeline derived from our EDA findings to surgically repair the text. We define a custom function that first fixes the broken ligatures (restoring words like "significant"), then strips away the specific legal boilerplate and publisher artifacts identified in the N-gram analysis, and finally removes generic academic stopwords. We also perform a "final scrub" to remove tokenizer glitches and chemical instrument names that were acting as noise. 
* The post-cleaning verification confirms that the top features for all classes are now exclusively biological terms—such as "EGFR mutation" for lung cancer and "signaling pathway" for thyroid cancer—ensuring that our subsequent machine learning models will learn from actual pathology rather than formatting errors.

In [None]:
# Vectorization (TF-IDF Bigrams)
# Note: We use 7000 as derived from your statistical analysis (80% coverage)
tfidf = TfidfVectorizer(
    ngram_range=(1, 2), 
    max_features=7000, 
    stop_words=list(MedicalConfig.get_stop_words())
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF Matrix Shape: {X_train_tfidf.shape}")
print("Ready for Model Training (Batch 2)...")

## Label Encoding and Stratified Splitting:
 We prepare the data for machine learning by converting the categorical string labels into a numerical format using LabelEncoder. This transforms classes like "Thyroid_Cancer" into integers (e.g., 0, 1, 2), which is a requirement for most algorithms. 
* We then split the dataset into training (**70%**) and testing (**30%**) sets.
* Crucially, we utilize the stratify parameter.
    * Given the slight imbalance in class distribution (with Thyroid Cancer having ~**600** more samples than Lung Cancer), stratification ensures that the proportion of each cancer type in the test set exactly matches the original dataset, preventing evaluation bias.

## Statistical Analysis and Architecture Decisions:
* We conducted a rigorous statistical analysis of the training corpus to determine the optimal hyperparameters for our feature extraction and modeling pipeline, ensuring our decisions are data-driven rather than arbitrary.
    * By calculating the cumulative frequency distribution of the vocabulary, we identified that while the total unique vocabulary exceeds **120,000** words, the top **80%** of linguistic coverage is achieved with approximately **7,000** terms. 
* However, to maintain computational efficiency and prevent the "curse of dimensionality" given our training size of 697 samples, we made the decision to cap our vectorization at 7,000 features; this threshold retains the vast majority of the biological signal while discarding rare noise. 

## Feature Extraction Strategy:
* For classical machine learning, we employ two distinct vectorization techniques to represent the text data. Based on our statistical analysis showing that roughly 7,500 terms cover 80% of the corpus vocabulary, we set max_features=7000.
* This constraint is critical: with only ~700 training samples, using the full vocabulary (120k+ words) would lead to massive overfitting (the "curse of dimensionality").
    * **Bag-of-Words (BoW)**: Creates a count matrix. While simple, it biases towards longer documents.
    * **TF-IDF (Term Frequency-Inverse Document Frequency)**: Our primary method. It captures both unigrams and bigrams (e.g., "lung cancer") and penalizes common words, making it ideal for high-dimensional medical text. 

In [None]:
# MODEL BENCHMARKING
from medical_utils import MLTrainer, DLPreprocessor, DeepLearningFactory

# --- 1. Classical Machine Learning ---
print("--- 1. Running Classical ML Benchmark ---")

# Define features to test
feature_sets = {
    "TF-IDF": (X_train_tfidf, X_test_tfidf)
    # You can add "BoW": (X_train_bow, X_test_bow) here if you created it
}
ml_trainer = MLTrainer()
ml_results = ml_trainer.run_benchmark(
    feature_sets, 
    y_train, 
    y_test, 
    label_encoder.classes_
)
print("\nTop ML Models:")
print(ml_results.head())

## Machine Learning Model Benchmarking:
* We train and evaluate four distinct classifiers: **Naive Bayes** (baseline), **Logistic Regression**, **Random Forest**, and **Linear SVM**. We apply class_weight='balanced' to all applicable models to counteract the slight class imbalance between Thyroid and Lung cancer samples.
* We test each model against both **Bag-of-Words** and **TF-IDF** feature sets. 
* The results are ranked by **F1-Score (Weighted)**, which is the most critical metric in medical classification as it balances **Precision** (avoiding false positives) and **Recall** (avoiding missed diagnoses).
* This step determines the optimal combination of vectorization and algorithm to serve as the project's performance benchmark.

In [None]:
# --- 2. Prepare Data for Deep Learning ---
print("\n--- 2. Preparing Deep Learning Sequences ---")

dl_prep = DLPreprocessor(max_features=7000, sequence_length=500)

# A. Convert Text to Integer Sequences
X_train_seq, X_test_seq = dl_prep.prepare_sequences(X_train, X_test)

# B. One-Hot Encode Labels
y_train_hot, y_test_hot = dl_prep.fit_transform_labels(y_train, y_test)

# C. Create Validation Split (15% of Train)
from sklearn.model_selection import train_test_split
X_train_dl, X_val_dl, y_train_dl, y_val_dl = train_test_split(
    X_train_seq, y_train_hot, 
    test_size=0.15, 
    stratify=np.argmax(y_train_hot, axis=1),
    random_state=42
)

print(f"Vocab Size: {dl_prep.vocab_size}")
print(f"Train Shape: {X_train_dl.shape}")

## Data Preparation for Deep Learning:
* Unlike classical Machine Learning models (which accept sparse TF-IDF matrices), Deep Learning models require dense, sequential integer inputs to process the context of words over time. We implemented the following pipeline:
    * **Label Encoding**:
        * Converted class labels (Thyroid, Colon, Lung) into One-Hot Encoded vectors (e.g., (0, 1, 0)). This is required for the categorical_crossentropy loss function used in multi-class neural networks.
    * **Text Vectorization**:
        * We utilized the Keras TextVectorization layer to map string data to integer sequences. Set to 7,000 words. This captures the top ~80% of the medical vocabulary while ignoring rare noise/typos that cause overfitting. Fixed at 500 tokens. Medical papers are long (~1,800 words), but we truncated to 500 to maintain memory efficiency and focus on the introduction/abstract sections where the disease is usually defined. Disabled (standardize=None) to preserve our custom surgical cleaning (e.g., keeping chemical names and units like +/- or u which standard cleaners might remove).
    * **Validation Split**:
        * Created a strict 15% Validation Set stratified by class. We verified that the training set (~600 samples) and validation set (~100 samples) had identical class distributions to prevent evaluation bias.

In [None]:
# --- 3. Train LSTM Model ---
print("\n--- 3. Training LSTM ---")

lstm_model = DeepLearningFactory.build_lstm(
    vocab_size=dl_prep.vocab_size,
    num_classes=len(label_encoder.classes_)
)
# Callbacks
stopper = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history_lstm = lstm_model.fit(
    X_train_dl, y_train_dl,
    validation_data=(X_val_dl, y_val_dl),
    epochs=15, 
    batch_size=16, 
    callbacks=[stopper],
    verbose=1
)
DeepLearningFactory.plot_history(history_lstm, "LSTM Model")

# --- 4. Train CNN Model ---
print("\n--- 4. Training CNN ---")

cnn_model = DeepLearningFactory.build_cnn(
    vocab_size=dl_prep.vocab_size,
    num_classes=len(label_encoder.classes_)
)
history_cnn = cnn_model.fit(
    X_train_dl, y_train_dl,
    validation_data=(X_val_dl, y_val_dl),
    epochs=15, 
    batch_size=16, 
    callbacks=[stopper],
    verbose=1
)
DeepLearningFactory.plot_history(history_cnn, "CNN Model")

## Deep Learning Model Architectures:
* We explored three Deep learning models to determine if capturing "sequential context" (word order) yields better performance than the "bag-of-words" approach.
    * Model A: Standard LSTM (Long Short-Term Memory),
    * Model B: Bidirectional LSTM
    * Model C: CNN
* We also added Early Stopping to monitor val_loss with a patience of 8 epochs. This prevented the model from training too long and memorizing noise, automatically restoring the best weights from the peak performance epoch.

## Deep Learning Evaluation Metrics:
* To evaluate performance on the imbalanced test set, we employed the following metrics:
    * Categorical Crossentropy Loss: Used to monitor convergence during training.
    * Weighted F1-Score (Primary Metric): Why Weighted? The dataset has a slight imbalance (Lung Cancer > Thyroid Cancer). The weighted score ensures the model is penalized if it ignores the minority classes.
    * Precision and Recall. 

In [None]:
# --- 5. Final Evaluation & Comparison ---
print("\n--- 5. Final Leaderboard ---")

# Helper to evaluate Keras models
def eval_keras(model, name, X, y_true):
    preds = np.argmax(model.predict(X, verbose=0), axis=1)
    acc = np.mean(preds == y_true)
    return {"Model": name, "Accuracy": acc, "Features": "Embeddings", "F1-Score": "N/A (See Report)"}

# Evaluate DL models on Test Set (Original y_test integers)
lstm_res = eval_keras(lstm_model, "LSTM", X_test_seq, y_test)
cnn_res = eval_keras(cnn_model, "CNN", X_test_seq, y_test)

# Combine with ML results
final_df = pd.concat([ml_results, pd.DataFrame([lstm_res, cnn_res])], ignore_index=True)
print(final_df.sort_values(by="Accuracy", ascending=False))

## Results:
* The comparative analysis reveals a stark contrast between classical machine learning and deep learning approaches, definitively favoring the former for this specific dataset. 
* The classical models, particularly Naive Bayes and Random Forest utilizing TF-IDF vectors, achieved robust performance with weighted F1-scores ranging from 75% to 76%, demonstrating their ability to effectively leverage distinct medical keywords for classification. 
* In contrast, the deep learning models (Standard LSTM, Bidirectional LSTM, and Tuned CNN) experienced a catastrophic failure known as mode collapse, where all three architectures converged to the exact same predictive pattern. With an identical accuracy of 45.4% and a low F1-score of 28%, these models failed to distinguish between classes and likely defaulted to predicting only the majority class (Lung Cancer) for every sample. 
* This outcome confirms that the dataset, comprising only roughly 600 training samples after duplicate removal, is insufficient to train complex neural networks from scratch. The neural networks could not escape local minima to learn semantic nuances, whereas the classical models successfully utilized the high-dimensional keyword features provided by TF-IDF to create a reliable diagnostic tool.

# Project Summary
* This project aimed to develop an automated classification system for biomedical research papers, categorizing them into Thyroid, Colon, and Lung cancer domains. 
* The study began with a rigorous data cleaning phase that identified and removed over 6,500 duplicate entries, reducing the dataset by 87% and exposing a "small data" constraint that fundamentally shaped the modeling strategy. 
* We further refined the text by repairing PDF encoding artifacts and normalizing specific medical nomenclature to ensure high-quality input features. The modeling phase benchmarked classical machine learning algorithms against deep learning architectures to determine the optimal approach for medical text classification under data constraints. 
* The final evaluation conclusively proved that simpler, feature-engineered models like Random Forest are superior in this low-resource environment, offering high accuracy and interpretability without the computational cost or data requirements of deep neural networks.
* The project successfully established a production-ready baseline model achieving approximately 77% accuracy, validating that rigorous data hygiene and appropriate model selection are more critical than architectural complexity in medical NLP tasks.