This notebook creates a baseline model for predicting dementia from linguistic data.

Data source: Ram Balasubramanium at Zelar Health.  
Confirm: data originally from https://dementia.talkbank.org/access/English/Pitt.html

Python library for parsing chat files:
https://pylangacq.org/

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from copy import deepcopy
import re
import pylangacq

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

## Load Data
Extract data from chat transcript files into pandas DataFrame

In [None]:
files_control = 'PittData/PittTranscripts/Control/cookie/'
files_dementia = 'PittData/PittTranscripts/Dementia/cookie/'

In [None]:
# Count files in directories
file_count_control = !ls $files_control | wc -l
file_count_dementia = !ls $files_dementia| wc -l
all_files_count = int(file_count_control[0]) + int(file_count_dementia[0])
print('Control files:', file_count_control[0])
print('Dementia files:', file_count_dementia[0])
print('All files:', all_files_count)

In [None]:
# Load data into chat reader
raw_data = pylangacq.Reader.from_dir(files_control)
raw_data.append(pylangacq.Reader.from_dir(files_dementia))

In [None]:
# Create index from filenames and check index structure
idx = [i['Media'].split(',')[0] for i in raw_data.headers()]
print('Index length:', len(idx))
print('Are all index values unique?', len(idx) == len(set(idx)))

In [None]:
# Set column names for features and output variables
cols = ['Group', 'MMSE', 'INV_Interjections', 'Repeats']

#### Extract output variables
* _Control/Dementia_: Set binary variable for control to 0 and any dementia diagnosis to 1
* _MMSE_: Not all files have a recorded MMSE value

In [None]:
# Examine group values
values = [i['Participants']['PAR']['group'] for i in raw_data.headers()]
print('Unique group values:\n', set(values))

In [None]:
print("Number of occurrences of blank group value:", values.count(''))

Solution: If the blank group value is assigned to Control, the binary variable assignment matches the original file designation.

Optional further exploration: track down the file with the blank group value to confirm it was correctly assigned to the control group, OR delete it from the raw dataset before extracting other values.
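
The optional exploration above could start from a small helper like the one below. This is a minimal sketch: `find_blank_group_files` is a hypothetical helper name, and the example headers and filenames are invented stand-ins that only mimic the nesting of the header dictionaries used in this notebook.

```python
def find_blank_group_files(headers, filenames):
    """Return the filenames whose PAR participant has an empty 'group' field."""
    return [name
            for name, header in zip(filenames, headers)
            if header['Participants']['PAR']['group'] == '']

# Hypothetical stand-ins for raw_data.headers() and idx, for illustration only
example_headers = [
    {'Participants': {'PAR': {'group': 'Control'}}},
    {'Participants': {'PAR': {'group': ''}}},
    {'Participants': {'PAR': {'group': 'ProbableAD'}}},
]
example_idx = ['file_a', 'file_b', 'file_c']

blank_files = find_blank_group_files(example_headers, example_idx)
print(blank_files)
```

In the notebook itself, the call would presumably be `find_blank_group_files(raw_data.headers(), idx)`, assuming the headers and `idx` are in the same file order.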

In [None]:
# Set binary variable: Control = 0, any dementia diagnosis = 1
# (the blank group value is treated as Control, as described above)
group = [0 if i['Participants']['PAR']['group'] in ('Control', '') else 1
         for i in raw_data.headers()]
print('Dementia datapoints:', np.array(group).sum())

In [None]:
# MMSE was coded by transcribers in the 'education' key in each transcript's header
MMSE = [int(i['Participants']['PAR']['education']) if i['Participants']['PAR']['education'] != ''
        else None
        for i in raw_data.headers()]

#### Extract feature variables

In [None]:
# Count the investigator's (INV) utterances in each file
INV = [len(i) for i in raw_data.utterances(participants="INV", by_files=True)]

In [None]:
# Extract number of word or phrase repetitions
# Raw strings keep the backslashes as regex escapes for the literal brackets
repeat_word = r'\[/\]'
repeat_phrase = r'\[//\]'

reps = []
for file in raw_data.utterances(by_files=True):
    # Collect each file's PAR utterances into a single string to search using regular expressions
    utts = "".join(utterance.tiers.get('PAR', '') for utterance in file)
    # Add each file's number of repeated words and phrases to the list
    reps.append(len(re.findall(repeat_word, utts)) +
                len(re.findall(repeat_phrase, utts)))
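
To make the counting step concrete, here is a small standalone check of the two patterns. The sample line is invented for illustration and is not taken from the Pitt transcripts; it only assumes that the `[/]` and `[//]` markers appear inline in the participant tier.

```python
import re

# Hypothetical participant line in CHAT style, invented for illustration
sample = "the boy [/] boy is [//] was reaching [/] reaching for the cookie"

repeat_word = r'\[/\]'      # matches the literal marker [/]
repeat_phrase = r'\[//\]'   # matches the literal marker [//]

n_word = len(re.findall(repeat_word, sample))
n_phrase = len(re.findall(repeat_phrase, sample))
print(n_word, n_phrase, n_word + n_phrase)
```

The two patterns do not overlap: `[//]` never contains the three-character sequence `[/]`, so summing the two counts does not double-count a marker.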

In [None]:
# Combine the extracted variables into a single DataFrame, indexed by filename
data = pd.DataFrame(list(zip(group,
                             MMSE,
                             INV,
                             reps
                             )),
                    columns=cols,
                    index=idx)

In [None]:
# Check combined index structure
print('Index length:', data.index.size)
print('Are all index values unique?', len(data.index) == len(set(data.index)))

In [None]:
# Name index and dataframe
data.index.name = 'Filename'
data.name = 'Dementia Study - Cookie Theft Data'

## Explore Data

In [None]:
print("Rows:", data.shape[0], "Columns:", data.shape[1])
data.head(5)

In [None]:
print('Index data type:', data.index.dtype)
data.dtypes
# Note: MMSE is stored by pandas as float64 because of missing MMSE data (NaN is a float)
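
The dtype behavior noted above can be reproduced on a toy example. The scores below are made up; the `Int64` nullable integer dtype is shown only as one possible alternative, not something this notebook uses.

```python
import pandas as pd

# Hypothetical MMSE scores with one missing value
scores = [30, None, 18]

as_float = pd.Series(scores)                     # None becomes NaN, so dtype is float64
as_nullable = pd.Series(scores, dtype='Int64')   # missing value is stored as <NA>

print(as_float.dtype, as_nullable.dtype)
```

Keeping the default float column is usually fine for modeling; the nullable dtype mainly helps when integer display or exact integer comparisons matter.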

In [None]:
data.info()

In [None]:
# Control group summary statistics
data[data.Group==0].describe()

In [None]:
# Dementia group summary statistics
data[data.Group==1].describe()

In [None]:
# Review missing values
print('{0:<20} {1:^30}'.format('Samples Total', data.shape[0]))
print()
print('{0:<20} {1:^30}'.format('Variable', 'Samples with Missing Data'))
print()
# Data Index
print('{0:<20} {1:^30}'.format(data.index.name, pd.isna(data.index).sum()))
# Each column
for col in cols:
    print('{0:<20} {1:^30}'.format(col, pd.isna(data[col]).sum()))

Are variables normally distributed?  **No**

In [None]:
fig = plt.figure(figsize=(16, 4))
for i, col in enumerate(cols):
    ax = fig.add_subplot(1, len(cols), i + 1)
    ax.hist(data[col].dropna(), color='mediumvioletred')
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["left"].set_visible(False)
    # Hide y-axis tick marks and tick labels
    ax.tick_params(axis='y', left=False, right=False, labelleft=False)
    ax.set_xlabel(col)

Prepare data for creating model

In [None]:
# Shuffle data
data = data.sample(frac=1)

First model created with only one input feature: the number of interjections by the Investigator.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data["INV_Interjections"],
                                                    data['Group'], 
                                                    test_size = 0.3,
                                                    stratify = data['Group'], 
                                                    random_state = 8)

In [None]:
# Confirm the dementia percentage is similar in the full, training, and test sets
print('Full group % dementia:', round(100 * data['Group'].mean(), 2))
print('Training set % dementia:', round(100 * y_train.mean(), 2))
print('Test set % dementia:', round(100 * y_test.mean(), 2))

TO DO:
Set up for cross-validation within training data.  Also set random seed for the shuffle in addition to the random state already in the train/test split.
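
One possible shape for this TO DO is sketched below, using scikit-learn's `StratifiedKFold` and `cross_val_score`. The data is synthetic and the names `X_demo`, `y_demo`, and `shuffled` are placeholders introduced for illustration; in the notebook, the training split would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training split, for illustration only
rng = np.random.default_rng(8)
y_demo = pd.Series(np.repeat([0, 1], 50), name='Group')
X_demo = pd.DataFrame({'INV_Interjections': rng.poisson(2 + 3 * y_demo.to_numpy())})

# Seeded shuffle, so the row order is reproducible between runs
shuffled = X_demo.join(y_demo).sample(frac=1, random_state=8)

# Stratified folds keep the control/dementia ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
scores = cross_val_score(LogisticRegression(),
                         shuffled[['INV_Interjections']],
                         shuffled['Group'],
                         cv=cv,
                         scoring='roc_auc')
print('Fold AUCs:', scores.round(2))
```

Five folds and ROC AUC are arbitrary choices here; with a dataset this small, the spread of the fold scores is as informative as their mean.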

### Baseline models

For comparison, baseline models built on the dataset used for the ADReSS challenge at the 2020 Interspeech conference reached 75% accuracy and F1 scores using a different subset of linguistic data from the same raw dataset.\*

The dataset used for the ADReSS challenge was an age- and gender-matched subset of the full Pitt dataset, and included spontaneous speech. The model used 34 linguistic features extracted from the raw dataset, including duration, total utterances, MLU (mean length of utterance), type-token ratio, open-closed class word ratio, and percentages of 9 parts of speech.

The baseline model created in this notebook uses the portion of the Pitt dataset in which participants are asked to describe the cookie theft picture commonly used in aphasia testing and uses only one feature, the number of interjections by the interviewer.

It's also worth noting that it is unknown whether the interviewer knew the participants' diagnoses. Such knowledge could influence the number of interjections (e.g., if the interviewer expected a participant to need assistance because of their diagnosis). For this reason it would be useful to try at least one other, potentially less biased, feature for creating a baseline model.


\* Luz S, Haider F, de la Fuente S, Fromm D, MacWhinney B. August 2020. *Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge.* https://arxiv.org/abs/2004.06833v3  


#### Logistic Regression prediction of control vs. dementia

In [None]:
logm = LogisticRegression()
log_baseline = logm.fit(pd.DataFrame(X_train), y_train)

In [None]:
# Assess fit of model
print('Accuracy of baseline model is:')
print(round(log_baseline.score(pd.DataFrame(X_test), y_test), 2))
print('Area under the ROC curve is:')
print(round(roc_auc_score(y_test, log_baseline.predict_proba(pd.DataFrame(X_test))[:, 1]), 2))
print('F1 score is:')
print(round(f1_score(y_test, log_baseline.predict(pd.DataFrame(X_test))), 2))
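
Given the possible interviewer bias discussed earlier, the same pipeline could be repeated with the `Repeats` column as the single feature. The sketch below runs on synthetic data, so its scores say nothing about the real dataset; `demo` and `rep_model` are placeholder names, and in the notebook `data[['Repeats']]` and `data['Group']` would replace the synthetic columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the `data` DataFrame, for illustration only
rng = np.random.default_rng(8)
group = np.repeat([0, 1], 100)
demo = pd.DataFrame({'Group': group,
                     'Repeats': rng.poisson(2 + 2 * group)})

# Same stratified 70/30 split as the first model, but with a different feature
X_tr, X_te, y_tr, y_te = train_test_split(demo[['Repeats']], demo['Group'],
                                          test_size=0.3,
                                          stratify=demo['Group'],
                                          random_state=8)

rep_model = LogisticRegression().fit(X_tr, y_tr)
acc = rep_model.score(X_te, y_te)
auc = roc_auc_score(y_te, rep_model.predict_proba(X_te)[:, 1])
f1 = f1_score(y_te, rep_model.predict(X_te))
print(round(acc, 2), round(auc, 2), round(f1, 2))
```

Selecting the feature with double brackets (`demo[['Repeats']]`) keeps it as a one-column DataFrame, which avoids wrapping each split in `pd.DataFrame(...)` before fitting.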

#### Prediction of MMSE

TO COME