# Experimenting with XGBoost Classifier and Feature Extraction Techniques for Primary Progressive Aphasia (PPA) Detection
## Introduction

In this notebook, we explore the application of machine learning models for detecting primary progressive aphasia (PPA) using an XGBoost classifier. Because syntactic structure and linguistic irregularities are core indicators of aphasia, we test a variety of feature extraction techniques that range from basic lexical statistics to advanced semantic embeddings. The goal is to determine how effectively these features capture linguistic signals associated with the condition.

## Objectives

This notebook aims to:

1. Evaluate the performance of different feature extraction methods when paired with an XGBoost classifier.
2. Identify which features best represent the syntactic and structural anomalies typical of aphasic speech.
3. Establish a baseline for XGBoost to be compared with other classifiers in future experiments.

## Feature Extraction Methods

We consider a diverse set of feature extraction techniques:

- **TF-IDF**: Captures word importance relative to document frequency.
- **Bag of Words (BoW)**: Counts raw word occurrences.
- **Word Embeddings**: Word2Vec, GloVe, and FastText vectors, averaged over the tokens of each sample to capture semantic relationships.
- **N-grams**: Bigrams and trigrams (`ngram_range=(2, 3)`) to model local context and syntactic cues.
- **LSA and LDA**: Topic-based models that identify latent semantic structures.
- **Transformers (BERT, RoBERTa, ClinicalBERT, MentalBERT)**: Contextualized embeddings, mean-pooled over the final hidden states of each pretrained model.
- **Dependency Parsing**: Features derived from syntactic dependency relations.
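
As a rough illustration of how the sparse lexical features in this list differ, the sketch below fits bag-of-words, n-gram, and TF-IDF vectorizers from scikit-learn on two toy sentences. The sentences and variable names are invented for the example and are not taken from the study data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy utterances, invented for illustration (not from the study data).
docs = ["the boy is taking the cookie", "the girl is laughing"]

# Bag of words: one column per distinct word, values are raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# N-grams: one column per distinct bigram or trigram.
ngrams = CountVectorizer(ngram_range=(2, 3))
X_ngrams = ngrams.fit_transform(docs)

# TF-IDF: same vocabulary as BoW, but words shared by many documents
# receive a lower inverse-document-frequency weight.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print("BoW vocabulary:", sorted(bow.vocabulary_))
print("shapes:", X_bow.shape, X_ngrams.shape, X_tfidf.shape)
print("idf('the')    =", tfidf.idf_[tfidf.vocabulary_["the"]])
print("idf('cookie') =", tfidf.idf_[tfidf.vocabulary_["cookie"]])
```

Even on this tiny input the n-gram matrix is wider than the unigram one, which is one reason the pipeline casts features to `float32` before handing them to the classifier.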

## Cross-Validation Strategy

To prevent data leakage, we apply **GroupKFold** instead of standard k-fold or stratified k-fold. This ensures that samples from the same participant (identified by a Subject ID) never appear in both training and test sets. This is critical in clinical NLP tasks, where models might otherwise learn speaker-specific artifacts instead of generalizable linguistic patterns relevant to PPA subtypes.
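
The grouping behaviour described above can be checked on a small example. This is a minimal sketch with hypothetical subject IDs and labels (8 samples from 4 participants); it only demonstrates that `GroupKFold` keeps each participant on one side of every split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 samples from 4 participants (2 samples each).
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
subject_ids = np.array(["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"])

cv = GroupKFold(n_splits=4)
folds = list(cv.split(X, y, groups=subject_ids))

for fold, (train_idx, test_idx) in enumerate(folds):
    train_subjects = set(subject_ids[train_idx])
    test_subjects = set(subject_ids[test_idx])
    # No participant may contribute samples to both sides of the split.
    assert train_subjects.isdisjoint(test_subjects)
    print(f"fold {fold}: test subjects = {sorted(test_subjects)}")
```

Note that `GroupKFold` does not stratify by label, so class proportions can differ between folds when participants are unevenly distributed across subtypes.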

## Outline

1. **Data Preprocessing**: Lowercasing, removal of special characters, and tokenization of the dataset.
2. **Feature Extraction**: Generate feature vectors for each method listed above.
3. **Model Training and Evaluation**: Train the XGBoost classifier and compute metrics such as F1-score, balanced accuracy, AUC, precision, and recall.
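
To make the evaluation metrics concrete, the following sketch computes them with `sklearn.metrics` on a hypothetical three-class example. The labels and predicted probabilities are invented for illustration, and the macro and weighted one-vs-rest averaging choices are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score, f1_score, precision_score,
    recall_score, roc_auc_score,
)

# Hypothetical true labels and predicted class probabilities (3 classes).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],  # a class-1 sample misclassified as class 2
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
y_pred = y_proba.argmax(axis=1)

bal_acc = balanced_accuracy_score(y_true, y_pred)            # mean per-class recall
f1 = f1_score(y_true, y_pred, average="macro")
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted")

print(f"balanced accuracy: {bal_acc:.4f}")
print(f"macro F1:          {f1:.4f}")
print(f"macro precision:   {precision:.4f}")
print(f"macro recall:      {recall:.4f}")
print(f"weighted OvR AUC:  {auc:.4f}")
```

Balanced accuracy and macro recall coincide by definition, while AUC is computed from the probabilities rather than the hard predictions, so it can stay high even when one sample is misclassified.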

In [None]:
# standard library
import os
import re
import warnings
from collections import Counter

# third-party
import gensim
import gensim.downloader as api
import nltk
import numpy as np
import pandas as pd
import spacy
import torch
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
# LatentDirichletAllocation is referenced by its full name in LdaFeaturizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    balanced_accuracy_score, f1_score, hamming_loss, make_scorer,
    precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import (
    GroupKFold, StratifiedKFold, cross_val_score, cross_validate,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.svm import SVC
from transformers import (
    AutoModel, AutoTokenizer,
    BertModel, BertTokenizer,
    RobertaModel, RobertaTokenizer,
)
from xgboost import XGBClassifier  # classifier used in the training loop

warnings.filterwarnings("ignore")
nltk.download('punkt_tab')  # tokenizer data required by word_tokenize
print(os.getcwd())

In [None]:
# import data here

In [None]:
df.shape

In [None]:
# drop rows where NaN
df = df.dropna(subset=['Text'])
df.shape

In [None]:
nltk.download('punkt')

In [None]:


def preprocess_text(text):
    text = text.lower()

    # remove special characters but keep punctuation
    text = re.sub(r'[^a-zA-Z0-9\s.,!?;:\'\"-]', '', text)
    tokens = word_tokenize(text)

    return tokens

df['processed_text'] = df['Text'].apply(preprocess_text)


In [None]:
# encode labels
label_encoder = LabelEncoder()
df['encoded_label'] = label_encoder.fit_transform(df['Subtype'])
text_data = df['processed_text'].apply(lambda x: ' '.join(x))
labels = df['encoded_label']

In [None]:

# Clean TF-IDF featurizer
class TfidfFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def fit(self, X, y=None):
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X).astype(np.float32)

class BowFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer()

    def fit(self, X, y=None):
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X).astype(np.float32)

class NgramFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, ngram_range=(2, 3)):
        self.ngram_range = ngram_range
        self.vectorizer = CountVectorizer(ngram_range=self.ngram_range)

    def fit(self, X, y=None):
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X).astype(np.float32)


class LsaFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=100):
        self.n_components = n_components
        self.vectorizer = TfidfVectorizer()
        self.svd = TruncatedSVD(n_components=self.n_components)

    def fit(self, X, y=None):
        X_tfidf = self.vectorizer.fit_transform(X)
        self.svd.fit(X_tfidf)
        return self

    def transform(self, X):
        X_tfidf = self.vectorizer.transform(X)
        return self.svd.transform(X_tfidf).astype(np.float32)


class LdaFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=10):
        self.n_components = n_components
        self.vectorizer = CountVectorizer()
        self.lda = LatentDirichletAllocation(n_components=self.n_components, random_state=42)

    def fit(self, X, y=None):
        X_counts = self.vectorizer.fit_transform(X)
        self.lda.fit(X_counts)
        return self

    def transform(self, X):
        X_counts = self.vectorizer.transform(X)
        return self.lda.transform(X_counts).astype(np.float32)



class Word2VecFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="word2vec-google-news-300"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.model = api.load(self.model_name)  # Will download and cache the model
        return self

    def transform(self, X):
        def avg_vector(text):
            tokens = text.split()
            vectors = [self.model[word] for word in tokens if word in self.model]
            return np.mean(vectors, axis=0) if vectors else np.zeros(self.model.vector_size)
        return np.vstack([avg_vector(doc) for doc in X]).astype(np.float32)


class GloVeFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="glove-wiki-gigaword-100"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.model = api.load(self.model_name)
        return self

    def transform(self, X):
        def avg_vector(text):
            tokens = text.split()
            vectors = [self.model[word] for word in tokens if word in self.model]
            return np.mean(vectors, axis=0) if vectors else np.zeros(self.model.vector_size)
        return np.vstack([avg_vector(doc) for doc in X]).astype(np.float32)



class FastTextFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="fasttext-wiki-news-subwords-300"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.model = api.load(self.model_name)
        return self

    def transform(self, X):
        def avg_vector(text):
            tokens = text.split()
            vectors = [self.model[word] for word in tokens if word in self.model]
            return np.mean(vectors, axis=0) if vectors else np.zeros(self.model.vector_size)
        return np.vstack([avg_vector(doc) for doc in X]).astype(np.float32)

# dense converter
class ToDenseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if hasattr(X, "toarray"):
            return X.toarray()
        return X

In [None]:
@torch.no_grad()
def get_embedding(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

class BERTFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="bert-base-uncased"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.model.eval()
        return self

    def transform(self, X):
        return np.vstack([get_embedding(self.model, self.tokenizer, text) for text in X]).astype(np.float32)

class RoBERTaFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="roberta-base"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.model.eval()
        return self

    def transform(self, X):
        return np.vstack([get_embedding(self.model, self.tokenizer, text) for text in X]).astype(np.float32)

class ClinicalBERTFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="emilyalsentzer/Bio_ClinicalBERT"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.model.eval()
        return self

    def transform(self, X):
        return np.vstack([get_embedding(self.model, self.tokenizer, text) for text in X]).astype(np.float32)

class MentalBERTFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="mental/mental-bert-base-uncased"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.model.eval()
        return self

    def transform(self, X):
        return np.vstack([get_embedding(self.model, self.tokenizer, text) for text in X]).astype(np.float32)

In [None]:
import os
from getpass import getpass

os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token: ")


In [None]:


# load spaCy model once
nlp = spacy.load("en_core_web_sm")

# get all unique dependency labels from your full corpus
def extract_dependency_tags(texts):
    deps = set()
    for doc in nlp.pipe(texts, batch_size=32):
        deps.update([token.dep_ for token in doc])
    return sorted(deps)

# precompute full dependency vocabulary
text_data = df['processed_text'].apply(lambda x: ' '.join(x))
all_deps = extract_dependency_tags(text_data)

# featurizer class
class DependencyFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, dependencies=all_deps):
        self.dependencies = dependencies

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dep_vectors = []
        for doc in nlp.pipe(X, batch_size=32):
            dep_counts = Counter([token.dep_ for token in doc])
            vector = [dep_counts.get(dep, 0) for dep in self.dependencies]
            dep_vectors.append(vector)
        return np.array(dep_vectors)

In [None]:
groups = df['SubjectID']

# define GroupKFold
cv = GroupKFold(n_splits=5)

featurizers = {
    'TFIDF': TfidfFeaturizer(),
    'BoW': BowFeaturizer(),
    'ngrams': NgramFeaturizer(),
    'LSA': LsaFeaturizer(),
    'LDA': LdaFeaturizer(),
    'word2vec': Word2VecFeaturizer(),
    'GloVe': GloVeFeaturizer(model_name="glove-wiki-gigaword-100"),
    'FastText': FastTextFeaturizer(),
    'dependency Parsing': DependencyFeaturizer(),
    'BERT': BERTFeaturizer(),
    'RoBERTa': RoBERTaFeaturizer(),
    'ClinicalBERT': ClinicalBERTFeaturizer(),
    'MentalBERT': MentalBERTFeaturizer()
}


In [None]:
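# note: the 'accuracy' key reports balanced accuracy; AUC is one-vs-rest, weighted by class support
scoring = {
    'accuracy': 'balanced_accuracy',
    'f1': 'f1_macro',
    'precision': 'precision_macro',
    'recall': 'recall_macro',
    'roc_auc_ovr': 'roc_auc_ovr_weighted'
}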

results = {}

for name, featurizer in featurizers.items():
    print(f"\n Testing featurizer: {name}")
    try:
        pipeline = Pipeline([
            ('features', featurizer),
            ('to_dense', ToDenseTransformer()),
            ('clf', XGBClassifier(eval_metric="mlogloss", use_label_encoder=False, random_state=42))
        ])
        scores = cross_validate(pipeline, text_data, labels, cv=cv, scoring=scoring, groups=groups)
        for metric, values in scores.items():
            if metric.startswith('test_'):
                print(f"{name} | {metric}: {np.mean(values):.4f} ± {np.std(values):.4f}")
        results[name] = scores
    except Exception as e:
        print(f" {name} crashed: {e}")


In [None]:
# flatten and organize the results
records = []

for model_name, metrics in results.items():
    for fold_idx in range(len(next(iter(metrics.values())))):
        record = {'Model': model_name, 'Fold': fold_idx + 1}
        for metric_name, values in metrics.items():
            record[metric_name] = values[fold_idx]
        records.append(record)

# convert to DataFrame (named results_df so the dataset in df is not overwritten)
results_df = pd.DataFrame(records)

# save to CSV
results_df.to_csv("pipeline_cv_xgboost_group.csv", index=False)


