# Experimenting with SVM Classifier and Feature Extraction Techniques for Primary Progressive Aphasia (PPA) Detection
## Introduction

In this notebook, we explore the application of machine learning models for detecting primary progressive aphasia (PPA) using an SVM classifier. Because syntactic structure and linguistic irregularities are core indicators of aphasia, we test a variety of feature extraction techniques that range from basic lexical statistics to advanced semantic embeddings. The goal is to determine how effectively these features capture linguistic signals associated with the condition.

## Objectives

This notebook aims to:

1. Evaluate the performance of different feature extraction methods when paired with an SVM classifier.
2. Identify which features best represent the syntactic and structural anomalies typical of aphasic speech.
3. Establish an SVM baseline for comparison with other classifiers in future experiments.

## Feature Extraction Methods

We consider a diverse set of feature extraction techniques:

- **TF-IDF**: Captures word importance relative to document frequency.
- **Bag of Words (BoW)**: Counts raw word occurrence.
- **Word Embeddings**: Including Word2Vec, GloVe, and FastText for capturing semantic relationships.
- **N-grams**: Specifically 2-grams through 4-grams to model local context and syntactic cues.
- **LSA and LDA**: Topic-based models that identify latent semantic structures.
- **Transformers (BERT, RoBERTa, ClinicalBERT, MentalBERT)**: Contextualized embeddings offering deep semantic understanding.
- **Dependency Parsing**: Features derived from syntactic dependency relations.
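
To make the n-gram range concrete, the following minimal sketch enumerates word n-grams of sizes 2 through 4 in plain Python. The helper `word_ngrams` and the sample sentence are illustrative assumptions and are not part of the notebook. The actual features are produced later by `CountVectorizer(ngram_range=(2, 4))`, which also applies its own tokenization.

```python
from collections import Counter

def word_ngrams(tokens, n_min=2, n_max=4):
    """Yield every contiguous word n-gram with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# hypothetical utterance, used only for illustration
tokens = "the boy is taking the cookie".split()
ngram_counts = Counter(word_ngrams(tokens))

print(len(ngram_counts))            # number of distinct n-grams
print(ngram_counts["the boy is"])   # count of one 3-gram
```

Single words never appear as features under this setting, so the n-gram representation complements rather than duplicates the BoW counts.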

## Cross-Validation Strategy

To prevent data leakage, we apply **GroupKFold** instead of standard k-fold or stratified k-fold. This ensures that samples from the same participant (identified by a Subject ID) never appear in both training and test sets. This is critical in clinical NLP tasks, where models might otherwise learn speaker-specific artifacts instead of generalizable linguistic patterns relevant to PPA subtypes.
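
The grouping behaviour can be checked on toy data. The sketch below assumes a hypothetical set of eight samples from four participants; the array names and subject IDs are made up for illustration. It only verifies that no participant is shared between the training and test indices of any fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# hypothetical toy data: 8 samples from 4 participants (2 samples each)
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
subject_ids = np.array(["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"])

cv = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in cv.split(X, y, groups=subject_ids):
    # subjects that appear on both sides of the split (should be none)
    shared = set(subject_ids[train_idx]) & set(subject_ids[test_idx])
    overlaps.append(shared)
    print(sorted(set(subject_ids[test_idx])), "shared subjects:", len(shared))
```

Note that `GroupKFold` does not stratify by label, so class proportions can differ between folds.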

## Outline

1. **Data Preprocessing**: Lowercasing, cleaning, and tokenization of the dataset.
2. **Feature Extraction**: Generate feature vectors for each method listed above.
3. **Model Training and Evaluation**: Train the SVM classifier and compute metrics such as F1-score, balanced accuracy, AUC, precision, and recall.
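
Balanced accuracy and weighted recall can diverge noticeably when classes are imbalanced. The sketch below, written in plain Python with hypothetical labels, computes balanced accuracy as the unweighted mean of per-class recall and weighted recall as the support-weighted mean, which is equivalent to plain accuracy. The helper `per_class_recall` is an illustrative name, not a library function.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label present in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}

# hypothetical imbalanced labels: class "a" dominates
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

recalls = per_class_recall(y_true, y_pred)
support = Counter(y_true)

# unweighted mean over classes: every class counts equally
balanced_accuracy = sum(recalls.values()) / len(recalls)
# support-weighted mean: dominated by the majority class
weighted_recall = sum(recalls[c] * support[c] for c in support) / len(y_true)

print(recalls)
print(balanced_accuracy, weighted_recall)
```

A model that favours the majority class can therefore score well on weighted recall while its balanced accuracy stays low.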



In [14]:
# standard library
import io
import os
import re
from collections import Counter

# third-party
import gensim
import gensim.downloader as api
import nltk
import numpy as np
import pandas as pd
import spacy
import torch
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    balanced_accuracy_score, f1_score, hamming_loss, make_scorer,
    precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import (
    GroupKFold, StratifiedKFold, cross_val_score, cross_validate,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.svm import SVC
from transformers import (
    AutoModel, AutoTokenizer,
    BertModel, BertTokenizer,
    RobertaModel, RobertaTokenizer,
)

print(os.getcwd())
nltk.download('punkt_tab')


In [1]:
# import data here

In [17]:
df.shape

(2262, 4)

In [18]:
# drop rows where NaN
df = df.dropna(subset=['Text'])
df.shape

(2130, 4)

In [19]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ghofranemerhbene/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [20]:
def preprocess_text(text):
    text = text.lower()
    # remove special characters but keep punctuation
    text = re.sub(r'[^a-zA-Z0-9\s.,!?;:\'\"-]', '', text)
    tokens = word_tokenize(text)

    return tokens
df['processed_text'] = df['Text'].apply(preprocess_text)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ghofranemerhbene/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
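
The effect of the cleaning step can be inspected in isolation. The sketch below reuses the regular expression from `preprocess_text` and applies it to a hypothetical transcript fragment; the names `DISALLOWED`, `clean`, and `sample` are illustrative only. Tokenization is left out so that the example does not depend on NLTK data.

```python
import re

# same pattern as in preprocess_text: matches every character that is NOT
# a letter, digit, whitespace, or one of the kept punctuation marks
DISALLOWED = r'[^a-zA-Z0-9\s.,!?;:\'\"-]'

def clean(text):
    """Lowercase, then drop every character matched by DISALLOWED."""
    return re.sub(DISALLOWED, '', text.lower())

# hypothetical transcript fragment, for illustration only
sample = "Well... the BOY's [laughs] taking a cookie-jar & falling!"
print(clean(sample))
```

Brackets and the ampersand are removed, while ellipses, apostrophes, and hyphens survive. Removed characters leave their surrounding whitespace behind, which `word_tokenize` subsequently ignores.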


In [23]:
groups = df['SubjectID']

# define GroupKFold
cv = GroupKFold(n_splits=5)
# encode labels
label_encoder = LabelEncoder()
df['encoded_label'] = label_encoder.fit_transform(df['Subtype'])

# define scoring
scoring = {
    'balanced_accuracy': make_scorer(balanced_accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted'),
    'hamming_loss': make_scorer(hamming_loss),
    'auc': 'roc_auc_ovr_weighted'  
}
# create generic featurizer classes
class TfidfFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def fit(self, X, y=None):
        # fit the wrapped vectorizer, then return self as scikit-learn expects
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X)

class BowFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer()

    def fit(self, X, y=None):
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X)

class NgramFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # word n-grams of length 2, 3, and 4
        self.vectorizer = CountVectorizer(ngram_range=(2, 4))

    def fit(self, X, y=None):
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        return self.vectorizer.transform(X)

class LsaFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=100):
        self.n_components = n_components
        self.vectorizer = TfidfVectorizer()
        self.svd = TruncatedSVD(n_components=self.n_components)


    def fit(self, X, y=None):
        X_tfidf = self.vectorizer.fit_transform(X)
        self.svd.fit(X_tfidf)
        return self

    def transform(self, X):
        X_tfidf = self.vectorizer.transform(X)
        return self.svd.transform(X_tfidf)

class LdaFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=10):
        self.n_components = n_components
        self.vectorizer = CountVectorizer()
        self.lda = LDA(n_components=self.n_components, random_state=42)


    def fit(self, X, y=None):
        X_bow = self.vectorizer.fit_transform(X)
        self.lda.fit(X_bow)
        return self

    def transform(self, X):
        X_bow = self.vectorizer.transform(X)
        return self.lda.transform(X_bow)


In [24]:
# load spaCy model once
nlp = spacy.load("en_core_web_sm")

# get all unique dependency labels from your full corpus
def extract_dependency_tags(texts):
    deps = set()
    for doc in nlp.pipe(texts, batch_size=32):
        deps.update([token.dep_ for token in doc])
    return sorted(deps)

# precompute full dependency vocabulary
text_data = df['processed_text'].apply(lambda x: ' '.join(x))
all_deps = extract_dependency_tags(text_data)

# featurizer class
class DependencyFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, dependencies=all_deps):
        self.dependencies = dependencies

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dep_vectors = []
        for doc in nlp.pipe(X, batch_size=32):
            dep_counts = Counter([token.dep_ for token in doc])
            vector = [dep_counts.get(dep, 0) for dep in self.dependencies]
            dep_vectors.append(vector)
        return np.array(dep_vectors)
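
The mapping from dependency labels to a fixed-length count vector can be illustrated without loading spaCy. In the sketch below, `parsed_docs` is a hypothetical stand-in for the `token.dep_` values that a parser would produce, and `to_count_vector` is an illustrative helper that mirrors the counting logic of `DependencyFeaturizer.transform`.

```python
from collections import Counter

# hypothetical dependency label sequences, standing in for token.dep_ values
parsed_docs = [
    ["det", "nsubj", "aux", "ROOT", "det", "dobj"],
    ["nsubj", "ROOT"],
]

# fixed, sorted label vocabulary so every document maps to the same columns
dep_vocab = sorted({dep for doc in parsed_docs for dep in doc})

def to_count_vector(dep_labels, vocab):
    """Count each label and lay the counts out in vocabulary order."""
    counts = Counter(dep_labels)
    return [counts.get(dep, 0) for dep in vocab]

vectors = [to_count_vector(doc, dep_vocab) for doc in parsed_docs]
print(dep_vocab)
print(vectors)
```

Because the counts are raw, longer transcripts produce larger vectors; dividing by the number of tokens would be one way to remove that length effect.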


In [25]:
#  Word2Vec Featurizer (using GoogleNews vectors)
class Word2VecFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_path='GoogleNews-vectors-negative300.bin.gz'):
        self.model_path = model_path
        self.model = None

    def fit(self, X, y=None):
        self.model = gensim.models.KeyedVectors.load_word2vec_format(
            self.model_path, binary=True
        )
        return self

    def transform(self, X):
        embeddings = []
        for sentence in X:
            vectors = [self.model[word] for word in sentence.split() if word in self.model]
            embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(self.model.vector_size))
        return np.vstack(embeddings)


#  GloVe Featurizer (100d from gensim API)
class GloVeFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None

    def fit(self, X, y=None):
        self.model = api.load("glove-wiki-gigaword-100")
        return self

    def transform(self, X):
        embeddings = []
        for sentence in X:
            vectors = [self.model[word] for word in sentence.split() if word in self.model]
            embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(self.model.vector_size))
        return np.vstack(embeddings)


#  FastText Featurizer (300d from gensim API)
class FastTextFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None

    def fit(self, X, y=None):
        self.model = api.load("fasttext-wiki-news-subwords-300")
        return self

    def transform(self, X):
        embeddings = []
        for sentence in X:
            vectors = [self.model[word] for word in sentence.split() if word in self.model]
            embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(self.model.vector_size))
        return np.vstack(embeddings)
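
All three featurizers above share the same pooling rule: average the vectors of in-vocabulary words and fall back to a zero vector when none are found. The sketch below reproduces that rule with a hypothetical three-dimensional lookup table (`toy_vectors`) so that it runs without downloading any pretrained model.

```python
import numpy as np

# hypothetical 3-dimensional "embeddings"; the real models use 100 or 300
toy_vectors = {
    "boy": np.array([1.0, 0.0, 0.0]),
    "cookie": np.array([0.0, 1.0, 0.0]),
}
VECTOR_SIZE = 3

def mean_embedding(sentence, vectors, size):
    """Average the vectors of in-vocabulary words; fall back to zeros."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    return np.mean(found, axis=0) if found else np.zeros(size)

doc_matrix = np.vstack([
    mean_embedding("the boy takes the cookie", toy_vectors, VECTOR_SIZE),
    mean_embedding("um uh", toy_vectors, VECTOR_SIZE),  # no known words
])
print(doc_matrix)
```

Out-of-vocabulary words are silently skipped, so two transcripts that differ only in unknown tokens receive identical vectors.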


In [28]:
from huggingface_hub import login
login(token="add your token")

In [29]:
# ensure no gradient tracking for inference
@torch.no_grad()
def get_embedding(model, tokenizer, text):
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, padding=True, max_length=512
    )
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# BERT Featurizer
class BERTFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        embeddings = [get_embedding(self.model, self.tokenizer, text) for text in X]
        return np.vstack(embeddings)

# RoBERTa Featurizer
class RoBERTaFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
        self.model = RobertaModel.from_pretrained("roberta-base")

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        embeddings = [get_embedding(self.model, self.tokenizer, text) for text in X]
        return np.vstack(embeddings)

class ClinicalBERTFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="medicalai/ClinicalBERT"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.model.eval()
        return self

    def transform(self, X):
        return np.vstack([get_embedding(self.model, self.tokenizer, text) for text in X])


class MentalBERTFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="mental/mental-bert-base-uncased"):
        self.model_name = model_name

    def fit(self, X, y=None):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.model.eval()
        return self

    def transform(self, X):
        return np.vstack([get_embedding(self.model, self.tokenizer, text) for text in X])
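
`get_embedding` encodes one text at a time, so no padding tokens enter the mean. If the texts were batched instead, padded positions would need to be excluded from the average. The sketch below shows one way to do this, written in NumPy with hypothetical arrays for brevity. In a PyTorch version, the tokenizer's `attention_mask` would play the role of `mask`; the helper name `masked_mean` is an assumption.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim) array; mask: (batch, seq_len) of 0/1.
    """
    mask = mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# hypothetical batch: 2 sequences, 3 token slots, 2 dimensions
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],   # third slot is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],   # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, mask)
print(pooled)                # padding-aware mean
print(hidden.mean(axis=1))   # naive mean includes the padding slot
```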

In [31]:
# prepare featurizers
featurizers = {
    'TFIDF': TfidfFeaturizer(),
    'BoW': BowFeaturizer(),
    'ngrams': NgramFeaturizer(),
    'LSA': LsaFeaturizer(),
    'LDA': LdaFeaturizer(),
    'dependency Parsing': DependencyFeaturizer(),
    'word2vec':  Word2VecFeaturizer(model_path='GoogleNews-vectors-negative300.bin.gz'),
    'GloVe': GloVeFeaturizer(),
    'FastText': FastTextFeaturizer(),
    'BERT': BERTFeaturizer(),
    'RoBERTa': RoBERTaFeaturizer(),
    'ClinicalBERT': ClinicalBERTFeaturizer(),
    'MentalBERT': MentalBERTFeaturizer(),
}

# run cross-validated pipelines
text_data = df['processed_text'].apply(lambda x: ' '.join(x))
labels = df['encoded_label']
results = {}

for name, featurizer in featurizers.items():
    pipeline = Pipeline([
        ('features', featurizer),
        ('clf', SVC(probability=True, kernel='linear', random_state=42))
    ])
    scores = cross_validate(pipeline, text_data, labels, cv=cv, scoring=scoring, groups=groups)
    results[name] = scores

# collapse per-fold test scores into (mean, std) pairs for each featurizer
results_summary = {
    name: {
        metric: (np.mean(values), np.std(values))
        for metric, values in res.items()
        if 'test_' in metric
    }
    for name, res in results.items()
}

df_results = pd.DataFrame(results_summary).T
print("Pipeline CV Results:")
print(df_results.head()) 

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Pipeline CV Results:
                             test_balanced_accuracy  \
TFIDF     (0.4643369414507915, 0.03241735691899943)   
BoW        (0.490445534349549, 0.02101417229468248)   
ngrams   (0.4291673187100683, 0.028455369064697637)   
LSA     (0.43735813410927105, 0.025681005255329722)   
LDA      (0.26859784364379125, 0.01234835536711196)   

                                   test_precision  \
TFIDF   (0.5607766776360436, 0.05959423717297235)   
BoW      (0.556356369951225, 0.05420364256775695)   
ngrams  (0.5300409230853326, 0.07686743202540522)   
LSA     (0.5533237830898401, 0.04598493984944958)   
LDA     (0.2881345989113727, 0.09020280364108672)   

                                       test_recall  \
TFIDF    (0.5601274014260105, 0.06426257195628156)   
BoW     (0.5525900649255489, 0.042083731817883135)   
ngrams   (0.4544524552727277, 0.04351874882498624)   
LSA      (0.5480316017927395, 0.06942814806057615)   
LDA      (0.4574498756159632, 0.08561428633545444)   

    

In [32]:
df_results.to_csv("pipeline_cv_results_svm_group.csv", index=True)
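
The `(mean, std)` tuples stored above are precise but hard to scan. As a possible alternative presentation, the sketch below formats per-fold scores as `mean ± std` strings. The dictionary `fold_scores` is hypothetical data shaped like the output of `cross_validate`, and `summarize` is an illustrative helper rather than part of the notebook.

```python
import numpy as np

# hypothetical per-fold scores shaped like the dict returned by cross_validate
fold_scores = {
    "fit_time": np.array([0.10, 0.12, 0.11]),
    "test_f1": np.array([0.50, 0.60, 0.70]),
    "test_balanced_accuracy": np.array([0.40, 0.40, 0.40]),
}

def summarize(scores, digits=3):
    """Return {metric: 'mean ± std'} for keys that start with 'test_'."""
    return {
        key[len("test_"):]: f"{np.mean(vals):.{digits}f} ± {np.std(vals):.{digits}f}"
        for key, vals in scores.items()
        if key.startswith("test_")
    }

summary = summarize(fold_scores)
print(summary)
```

`np.std` computes the population standard deviation by default, which matches the computation used for `results_summary`.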

