This notebook creates a baseline model for predicting dementia from linguistic data.

Data source: Ram Balasubramanium at Zelar Health.  
Possibly originally from https://dementia.talkbank.org/access/English/Pitt.html

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import pylangacq
from copy import deepcopy
import re

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

### Set Up Data
Extract data from the CHAT transcript files into a pandas DataFrame.

In [2]:
files_control = 'PittData/PittTranscripts/Control/cookie/'
files_dementia = 'PittData/PittTranscripts/Dementia/cookie/'

In [3]:
c = pylangacq.Reader.from_dir(files_control)
d = pylangacq.Reader.from_dir(files_dementia)

In [4]:
# Filenames become DataFrame index
idx_c = [i['Media'].split(',')[0] for i in c.headers()]
idx_d = [i['Media'].split(',')[0] for i in d.headers()]

In [5]:
# Check index structure
print('Index length - control:', len(idx_c))
print('Index length - dementia:', len(idx_d))
print('Are all index values in control group unique?', len(idx_c) == len(set(idx_c)))
print('Are all index values in dementia group unique?', len(idx_d) == len(set(idx_d)))

Index length - control: 243
Index length - dementia: 309
Are all index values in control group unique? True
Are all index values in dementia group unique? True


In [6]:
cols = ['Group', 'MMSE', 'INV_Interjections', 'Repeats']

In [7]:
# Set binary variable for control group vs. dementia group
grp_c = [0] * len(idx_c)
grp_d = [1] * len(idx_d)

In [8]:
# MMSE was coded by transcribers in the 'education' key in each transcript's header
MMSE_c = [int(i['Participants']['PAR']['education']) if i['Participants']['PAR']['education'] != ''
                else 'NaN'
                for i in c.headers()]
MMSE_d = [int(i['Participants']['PAR']['education']) if i['Participants']['PAR']['education'] != ''
                else 'NaN'
                for i in d.headers()]
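The two comprehensions above differ only in the reader they iterate over, so a small helper could remove the duplication. The following is a minimal sketch, assuming each header has the nested `['Participants']['PAR']['education']` shape used above. The name `mmse_from_headers` and the sample headers are hypothetical, written for illustration only. It uses `np.nan` rather than the string `'NaN'`, which keeps the list numeric from the start.

```python
import numpy as np

def mmse_from_headers(headers):
    """Return one MMSE score per header, with np.nan where the field is empty."""
    scores = []
    for header in headers:
        raw = header['Participants']['PAR']['education']
        scores.append(int(raw) if raw != '' else np.nan)
    return scores

# Invented headers that mimic the structure read in the cell above
fake_headers = [
    {'Participants': {'PAR': {'education': '29'}}},
    {'Participants': {'PAR': {'education': ''}}},
]
mmse = mmse_from_headers(fake_headers)
print(mmse)
```

With the real readers this would be called as `mmse_from_headers(c.headers())` and `mmse_from_headers(d.headers())`.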

In [9]:
# Extract number of interjections by investigator per interview
INV_c = [len(i) for i in c.utterances(participants="INV", by_files=True)]
INV_d = [len(i) for i in d.utterances(participants="INV", by_files=True)]

In [10]:
# Extract number of word or phrase repetitions
# Raw strings avoid invalid-escape warnings for the regex patterns
repeat_word = r'\[/\]'
repeat_phrase = r'\[//\]'
def get_repeats(chat_files):
    """Input: List of files from chat transcripts.
       Output: List of the sum of repeated words and phrases in each file."""
    reps = []
    for file in chat_files:
        # Collect the participant tier of each utterance into a single string before searching
        utts_list = []
        for utterance in file:
            utts_list.append(utterance.tiers.get('PAR', ''))
        # Join and count once per file, so a file with no utterances yields 0
        utts = "".join(utts_list)
        reps_file = len(re.findall(repeat_word, utts)) + \
                    len(re.findall(repeat_phrase, utts))
        reps.append(reps_file)
    return reps

reps_c = get_repeats(c.utterances(by_files=True))
reps_d = get_repeats(d.utterances(by_files=True))
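To check what the two patterns actually match, they can be run on a single string. This is a minimal sketch: the sample utterance is invented for illustration (it is not taken from the corpus), and `count_repeats` is a hypothetical helper name.

```python
import re

# Brackets are regex metacharacters, so they are escaped; raw strings keep the backslashes
REPEAT_WORD = re.compile(r"\[/\]")
REPEAT_PHRASE = re.compile(r"\[//\]")

def count_repeats(text):
    """Return the number of [/] markers plus the number of [//] markers in text."""
    return len(REPEAT_WORD.findall(text)) + len(REPEAT_PHRASE.findall(text))

# Invented utterance with two [/] markers and one [//] marker
sample = "the boy [/] the boy is [//] was taking [/] taking cookies ."
n_repeats = count_repeats(sample)
print(n_repeats)
```

The `[/]` pattern does not match inside `[//]`, so the two counts do not overlap.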

In [11]:
data = pd.DataFrame(list(zip(grp_c + grp_d, 
                             MMSE_c + MMSE_d,
                             INV_c + INV_d,
                             reps_c + reps_d
                             )),
                    columns=cols, 
                    index=idx_c + idx_d)

In [12]:
# Check combined index structure
print('Index length:', data.index.size)
print('Are all index values unique?', len(data.index) == len(set(data.index)))

Index length: 552
Are all index values unique? True


In [13]:
# Convert MMSE from object dtype (ints mixed with 'NaN' strings) to numeric
data['MMSE'] = pd.to_numeric(data['MMSE'], errors='coerce')

In [14]:
data.dtypes

Group                  int64
MMSE                 float64
INV_Interjections      int64
Repeats                int64
dtype: object

In [15]:
# Control group Summary statistics
data[data['Group']==0].describe()

       Group       MMSE  INV_Interjections   Repeats
count  243.0      181.0              243.0     243.0
mean     0.0  29.127072           3.193416  1.934156
std      0.0   1.110749           1.865198  2.222115
min      0.0       24.0                0.0       0.0
25%      0.0       29.0                2.0       0.0
50%      0.0       29.0                3.0       1.0
75%      0.0       30.0                4.0       3.0
max      0.0       30.0                9.0      17.0


In [16]:
# Dementia group Summary statistics
data[data['Group']==1].describe()

       Group       MMSE  INV_Interjections   Repeats
count  309.0      277.0              309.0     309.0
mean     1.0  19.779783             5.7411  3.631068
std      0.0   5.668872           4.435427  4.197998
min      1.0        1.0                0.0       0.0
25%      1.0       16.0                3.0       1.0
50%      1.0       20.0                5.0       2.0
75%      1.0       24.0                7.0       5.0
max      1.0       30.0               49.0      29.0


TO DO: Look for missing data
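The `count` rows in the summaries above already show gaps in MMSE: 181 of 243 control rows and 277 of 309 dementia rows have a score, so 62 and 32 values are missing. One possible way to tabulate this directly is sketched below. The `toy` frame is an invented stand-in for `data`, with made-up values, so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; all values are invented for illustration
toy = pd.DataFrame({
    'Group': [0, 0, 1, 1, 1],
    'MMSE': [29.0, np.nan, 20.0, np.nan, 16.0],
    'INV_Interjections': [3, 2, 5, 7, 4],
    'Repeats': [1, 0, 2, 5, 3],
})

# Missing values per column across the whole frame
missing_total = toy.isna().sum()

# Missing values per column within each group
missing_by_group = toy.drop(columns='Group').isna().groupby(toy['Group']).sum()

print(missing_total)
print(missing_by_group)
```

Replacing `toy` with `data` would give the same breakdown for the real features.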

Prepare data for creating model

In [17]:
# Shuffle data
data = data.sample(frac=1)

First model created with only one input feature: the number of interjections by the Investigator.

In [18]:
X_train, X_test, y_train, y_test = train_test_split(data["INV_Interjections"],
                                                    data['Group'], 
                                                    test_size = 0.3,
                                                    stratify = data['Group'], 
                                                    random_state = 8)

In [19]:
# Confirm the proportion of dementia cases is the same in the training and test sets
print('Full group % dementia:', round(data['Group'].mean(), 4))
print('Training set % dementia:', round(y_train.mean(), 4))
print('Test set % dementia:', round(y_test.mean(), 4))

Full group % dementia: 0.5598
Training set % dementia: 0.5596
Test set % dementia: 0.5602


TO DO:
Set up for cross-validation within training data.  Also set random seed for the shuffle in addition to the random state already in the train/test split.
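A minimal sketch of how this TO DO might look, assuming stratified 5-fold cross-validation is acceptable here. The fold count, the `SEED` name, and the synthetic `toy` frame are assumptions made for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

SEED = 8  # same value as the random_state in the split above; any fixed integer works

# Synthetic stand-in for `data` (invented values, for illustration only)
rng = np.random.default_rng(SEED)
group = np.array([0] * 100 + [1] * 100)
toy = pd.DataFrame({
    'INV_Interjections': rng.poisson(lam=np.where(group == 1, 6, 3)),
    'Group': group,
})

# Seeded shuffle, so reruns give the same row order
toy = toy.sample(frac=1, random_state=SEED)

X_train, X_test, y_train, y_test = train_test_split(
    toy[['INV_Interjections']], toy['Group'],
    test_size=0.3, stratify=toy['Group'], random_state=SEED)

# Stratified folds are drawn from the training portion only; the test set stays untouched
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
cv_scores = cross_val_score(LogisticRegression(), X_train, y_train,
                            cv=cv, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())
```

Selecting the feature with double brackets (`toy[['INV_Interjections']]`) keeps `X_train` two-dimensional, so no `pd.DataFrame(...)` wrapper is needed when fitting.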

### Baseline models

For comparison, baseline models built on the dataset for the ADReSS Challenge at the Interspeech 2020 conference reached 75% accuracy and F1 score using a different subset of linguistic data from the same raw dataset.\*

The dataset used for the ADReSS Challenge was an age- and gender-matched subset of the full Pitt dataset and included spontaneous speech. The model used 34 linguistic features extracted from the raw dataset, including duration, total utterances, MLU (mean length of utterance), type-token ratio, open-closed class word ratio, and percentages of 9 parts of speech.

The baseline model created in this notebook uses the portion of the Pitt dataset in which participants are asked to describe the Cookie Theft picture commonly used in aphasia testing. It uses only one feature: the number of interjections by the interviewer.

It is also worth noting that it is unknown whether the interviewer knew the participants' diagnoses. Such knowledge could influence the number of interjections (e.g., if the interviewer expected a participant to need assistance given their diagnosis). For this reason, it would be useful to try at least one other, potentially less biased feature when creating a baseline model.
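One way to compare candidate features is to wrap the split, fit, and scoring steps in a function that takes a column name. The sketch below is an assumption about how that could be organized, not the notebook's method: `single_feature_baseline` is a hypothetical helper, and the `toy` frame holds synthetic values, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def single_feature_baseline(df, feature, target='Group', seed=8):
    """Fit logistic regression on one column and return test-set metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[[feature]], df[target],
        test_size=0.3, stratify=df[target], random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    return {
        'accuracy': model.score(X_te, y_te),
        'roc_auc': roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
        'f1': f1_score(y_te, model.predict(X_te)),
    }

# Synthetic stand-in for `data` (invented values, for illustration only)
rng = np.random.default_rng(0)
group = np.array([0] * 100 + [1] * 100)
toy = pd.DataFrame({
    'Group': group,
    'INV_Interjections': rng.poisson(lam=np.where(group == 1, 6, 3)),
    'Repeats': rng.poisson(lam=np.where(group == 1, 4, 2)),
})

# Same split seed for both features, so the test rows are identical
results = {f: single_feature_baseline(toy, f) for f in ['INV_Interjections', 'Repeats']}
print(pd.DataFrame(results).round(2))
```

Calling `single_feature_baseline(data, 'Repeats')` would score the repetition count, which comes from the participant's own speech rather than the interviewer's behavior.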


\* Luz S, Haider F, de la Fuente S, Fromm D, MacWhinney B. August 2020. *Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge.* https://arxiv.org/abs/2004.06833v3  


#### Logistic Regression prediction of control vs. dementia

In [22]:
logm = LogisticRegression()
log_baseline = logm.fit(pd.DataFrame(X_train), y_train)

In [23]:
# Assess fit of model
print('Accuracy of baseline model is:')
print(round(log_baseline.score(pd.DataFrame(X_test), y_test), 2))
print('Area under the ROC curve is:')
print(round(roc_auc_score(y_test, log_baseline.predict_proba(pd.DataFrame(X_test))[:, 1]), 2))
print('F1 score is:')
print(round(f1_score(y_test, log_baseline.predict(pd.DataFrame(X_test))), 2))

Accuracy of baseline model is:
0.66
Area under the ROC curve is:
0.73
F1 score is:
0.7
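The accuracy is easier to interpret next to a majority-class reference. About 56% of the test set is in the dementia group, so always predicting dementia would already score roughly 0.56. The sketch below illustrates that reference point with scikit-learn's `DummyClassifier`. The labels are invented to mimic the reported class balance; they are not the notebook's test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels with roughly the class balance reported above (56% dementia)
y = np.array([1] * 56 + [0] * 44)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the feature values

# Always predict the most frequent class seen during fit
majority = DummyClassifier(strategy='most_frequent').fit(X, y)
majority_accuracy = majority.score(X, y)
print(majority_accuracy)
```

On the real data, fitting the dummy model on `y_train` and scoring it on `y_test` would give the reference value for the 0.66 reported above.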


#### Prediction of MMSE

TO COME