# Import Libraries & Set Up
---

In [1]:
import warnings
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
warnings.filterwarnings('ignore')

palette = ['#800080', '#8A2BE2', '#FF69B4', '#DA70D6', '#9370DB', '#DDA0DD', '#BA55D3']
gradient_palette = sns.light_palette('#620080', as_cmap=True)
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=palette)
sns.set_theme(style="whitegrid", palette=palette)

# Dementia Dataset
---

## Import Dataset & Examine
---

### Import Dataset
---

In [3]:
dementia_df = pd.read_csv('data/dementia_data-MRI-features.csv')

### Dataset Info & Structure
---

In [None]:
print(dementia_df.shape)

In [None]:
print(dementia_df.info())

In [None]:
dementia_df.head()

### Statistical Summary
---

In [None]:
dementia_df.describe().T

In [None]:
print(f"Number of unique subjects: {len(dementia_df['Subject ID'].unique())}")

## Preparing the Data
---

### Target Examination
---

In [None]:
sns.countplot(x=dementia_df['Group'], palette=palette)

In [None]:
dementia_df.Group.value_counts()

The converted category consists of 37 records for 14 subjects.

In [None]:
dementia_df.loc[dementia_df.Group == 'Converted']

All those classified as Converted were Nondemented on their first visit and Demented on the final visit according to the data card.

We can hence resolve this category into Nondemented (first visit) and Demented (last visit), dropping nine records which lie between the first and final visits.

In [12]:
nondemented = [33,36,57,81,114,194,218,245,261,271,273,295,297,346]
demented = [35,38,59,83,115,195,220,246,265,272,274,296,298,348]
drop = [34,37,58,82,219,262,263,264,347]

In [13]:
for n in nondemented:
    dementia_df.Group.iloc[n] = 'Nondemented'
for n in demented:
    dementia_df.Group.iloc[n] = 'Demented'

In [14]:
dementia_df = dementia_df.drop(index =[34,37,58,82,219,262,263,264,347])

Now we can drop the unneeded columns.

In [15]:
dementia_df = dementia_df.drop(columns = ['Subject ID','MRI ID'])

Now we can visualise the target following these changes.

In [None]:
sns.countplot(x=dementia_df['Group'], palette=palette)

In [None]:
dementia_df.Group.value_counts()

### Data Types
---

We will change all categorical features to be numerical to make it easier to work with for now.

In [18]:
dementia_df['Group'] = dementia_df['Group'].map({'Nondemented': 0, 'Demented': 1})
dementia_df['M/F'] = dementia_df['M/F'].map({'M': 0, 'F': 1})
dementia_df['Hand'] = dementia_df['Hand'].map({'R': 0, 'L': 1})

In [19]:
dementia_df['Group'] = dementia_df['Group'].astype(int)
dementia_df['M/F'] = dementia_df['M/F'].astype(int)
dementia_df['Hand'] = dementia_df['Hand'].astype(int)

### Missing Values
---

In [None]:
dementia_df.isnull().sum()

Visualise the missing data to see if there is a pattern.

In [None]:
dementia_df[dementia_df.isnull().any(axis=1)]

We have already dropped nine rows, so another 19 would be too many to drop.

All rows with missing values are from demented patients, so we cannot use basic imputation as it would introduce bias.

Imputation by group could be used, but this may over-simplify the data and dilute context-specific patterns.

Therefore, K-Nearest-Neighbours imputation will be used.

In [22]:
from sklearn.impute import KNNImputer

In [23]:
imputer = KNNImputer(n_neighbors=5)

In [24]:
dementia_df = pd.DataFrame(imputer.fit_transform(dementia_df), columns=dementia_df.columns)

Check that there are no more missing values.

In [None]:
dementia_df.isnull().sum()

### Synthetic Minority Over-sampling Technique (SMOTE)
---

In [26]:
from imblearn.over_sampling import SMOTE

In [27]:
X = dementia_df.drop('Group', axis=1)
y = dementia_df['Group']

In [28]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

In [29]:
dementia_df = pd.DataFrame(X_resampled, columns=X.columns)
dementia_df['Group'] = y_resampled

In [None]:
sns.countplot(x=dementia_df['Group'], palette=palette)

## Data Distribution & Correlations
---

### Skewness Analysis
---

In [None]:
dementia_df.skew()

We can see that variables like Hand, EDUC, and ASF are nearly symmetrically distributed, while others show slight to moderate skewness.

MMSE is highly negatively skewed, and CDR is highly positively skewed.

We can compare this to the skewness of features for demented and non-demented patients specifically.

In [32]:
demented = dementia_df[dementia_df['Group'] == 1]
non_demented = dementia_df[dementia_df['Group'] == 0]

In [33]:
skew_comparison = pd.DataFrame({
    'Overall': dementia_df.skew(),
    'Non-Demented': non_demented.skew(),
    'Demented': demented.skew()
})

In [None]:
print(skew_comparison)

We can plot this data to more easily visualise it.

To do this we need to ensure the skew_comparison DataFrame has a column for variable names.

In [35]:
skew_comparison = skew_comparison.reset_index().rename(columns={'index': 'Variable'})

And then reshape the DataFrame.

In [36]:
skew_comparison = pd.melt(skew_comparison, id_vars='Variable', var_name='Group', value_name='Skewness')

In [None]:
plt.figure(figsize=(14, 8))
sns.barplot(x='Variable', y='Skewness', hue='Group', data=skew_comparison)
plt.title('Comparison of Skewness Between Demented and Non-Demented Groups')
plt.xlabel('Variable')
plt.ylabel('Skewness')
plt.xticks(rotation=45)
plt.legend(title='Group')
plt.grid(True)
plt.tight_layout()
plt.show()

The skewness analysis reveals key differences between the Non-Demented and Demented groups. MMSE and CDR show significant skew, with MMSE negatively skewed (indicating lower cognitive scores for the demented group) and CDR positively skewed (suggesting more advanced stages of dementia in demented individuals).

Age is more skewed in the Demented group, indicating that individuals in this group are, on average, older. MR Delay is right-skewed in the Demented group, pointing to longer delays for this group. The M/F distribution is left-skewed in the Non-Demented group, showing a higher proportion of females, while the Demented group has a more balanced gender distribution.

SES shows a higher skew in the Non-Demented group, suggesting that this group generally has a higher socioeconomic status. Finally, the CDR variable has a significant positive skew in the Non-Demented group, with most individuals scoring 0, indicating no dementia. These patterns highlight significant differences in cognitive function, demographics, and clinical measures between the two groups.

### Histogram
---

In [None]:
dementia_df.hist(figsize=(25,20))

As there is no variability in the 'Hand' feature, we will drop this too.

In [39]:
dementia_df = dementia_df.drop(columns='Hand')

### Correlations
---

We can now check the correlations between features in the dataset.

In [None]:
dementia_corr = dementia_df.copy().corr()
dementia_corr['Group'].sort_values(ascending = False)

We can plot this on a heatmap.

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(dementia_corr, annot=True, cmap=gradient_palette)
plt.show()

The correlation analysis reveals that CDR has the strongest positive correlation with the Group, indicating its significant role in predicting dementia severity. MMSE shows a strong negative correlation, with lower scores associated with dementia, making it another key predictor. nWBV also negatively correlates with the Group, suggesting that lower brain volume may be linked to dementia.

EDUC shows a moderate negative correlation, implying that lower education levels could be associated with a higher likelihood of dementia, though the effect is weaker. M/F indicates a slight male predominance in the demented group, but this is a minor factor. SES shows a weak positive correlation, suggesting higher socioeconomic status is slightly linked to the non-demented group, but this relationship is not strong. Other variables like Age, eTIV, Visit, MR Delay, and ASF have minimal correlations, suggesting they are less relevant for predicting dementia in this dataset.

In [42]:
important_features = ['Group', 'EDUC', 'MMSE', 'CDR', 'nWBV']

We can also visualise the important features in a pairplot.

In [None]:
sns.pairplot(dementia_df[important_features], hue='Group', palette=palette)

And finally let's shuffle and save the processed dataset.

In [44]:
dementia_df = dementia_df.sample(frac=1).reset_index(drop=True)

In [45]:
dementia_df.to_csv('data/dementia_data_processed.csv', index=False)

# Parkinson's Disease Dataset
---

## Import Dataset & Examine
---

### Import Dataset
---

In [46]:
parkinsons_df = pd.read_csv('data/parkinsons_data-VOICE-features.csv')

In [47]:
parkinsons_df.rename(columns={'name': 'Name', 'status': 'Status'}, inplace=True)

### Dataset Info & Structure
---

In [None]:
print(parkinsons_df.shape)

In [None]:
print(parkinsons_df.info())

In [None]:
parkinsons_df.head()

### Statistical Summary
---

In [None]:
parkinsons_df.describe().T

In [None]:
print(f"Number of unique subjects: {len(parkinsons_df['Name'].unique())}")

## Preparing the data
---

### Target Examination
---

In [None]:
sns.countplot(x=parkinsons_df['Status'], palette=palette)

In [None]:
parkinsons_df.Status.value_counts()

As there are no repeated patients in this dataset, we can remove the 'name' column.

In [55]:
parkinsons_df = parkinsons_df.drop(columns=['Name'])

### Data Types
---

As we saw from the dataset info, the only non-numerical column has been dropped, so we do not need to change any datatypes for this dataset.

### Missing values
---

In [None]:
parkinsons_df.isnull().sum()

As we can see, there are no missing values in this dataset, so we do not need to do anything here.

### Synthetic Minority Over-sampling Technique (SMOTE)
---

In [57]:
X = parkinsons_df.drop('Status', axis=1)
y = parkinsons_df['Status'] 

In [58]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

In [59]:
parkinsons_df = pd.DataFrame(X_resampled, columns=X.columns)
parkinsons_df['Status'] = y_resampled

In [None]:
sns.countplot(x=parkinsons_df['Status'], palette=palette)

## Data Distribution & Correlation
---

### Skewness Analysis
---

In [None]:
parkinsons_df.skew()

The MDVP-related features, such as MDVP: Fhi(Hz), MDVP: Jitter(%), and MDVP: RAP, exhibit strong positive skew, indicating that most values are clustered at the lower end with some extreme higher values. These features are likely important for prediction, as the spread of values can help distinguish between different conditions.

NHR also shows significant positive skew, while HNR and status have negative skew, with values concentrated towards the higher end.

Other features like RPDE, DFA, spread1, and spread2 have near-zero skew, implying more symmetric distributions.

We can compare this to the skewness of features for healthy and diseased patients specifically.

In [62]:
healthy = parkinsons_df[parkinsons_df['Status'] == 1]
diseased = parkinsons_df[parkinsons_df['Status'] == 0]

In [63]:
skew_comparison = pd.DataFrame({
    'Overall': parkinsons_df.skew(),
    'Healthy': healthy.skew(),
    'Diseased': diseased.skew()
})

In [None]:
print(skew_comparison)

We can plot this data to more easily visualise it.

To do this we need to ensure the skew_comparison DataFrame has a column for variable names.

In [65]:
skew_comparison = skew_comparison.reset_index().rename(columns={'index': 'Variable'})

And then reshape the DataFrame.

In [66]:
skew_comparison = pd.melt(skew_comparison, id_vars='Variable', var_name='Status', value_name='Skewness')

In [None]:
plt.figure(figsize=(14, 8))
sns.barplot(x='Variable', y='Skewness', hue='Status', data=skew_comparison)
plt.title('Comparison of Skewness Between Demented and Non-Demented Groups')
plt.xlabel('Variable')
plt.ylabel('Skewness')
plt.xticks(rotation=45)
plt.legend(title='Status')
plt.grid(True)
plt.tight_layout()
plt.show()

The skewness analysis of the Parkinson's dataset reveals several notable patterns between the Healthy and Diseased groups. MDVP: Fo(Hz) and MDVP: Fhi(Hz) exhibit high skewness in both groups, with the Healthy group showing a more pronounced positive skew, indicating that these features are more variable in the healthy population. MDVP Flo(Hz), MDVP: Jitter(%), and MDVP: Jitter(Abs) also show moderate skewness in both groups, with the Diseased group tending towards less positive skew, which could point to lower variability in these features for individuals with Parkinson's.

Shimmer-related features like MDVP: Shimmer and Shimmer: APQ5 are more skewed in the Healthy group, suggesting more variability in this measure for healthy individuals. On the other hand, MDVP: APQ has higher skewness in the Healthy group, possibly indicating a different vocal pattern or greater variance in healthy individuals compared to the diseased ones.

NHR shows significant positive skew in both groups, but the Healthy group has a higher skew, possibly reflecting more pronounced differences in speech-related features for healthy individuals.

HNR, Status, RPDE, DFA, and spread2 all exhibit negative skew, with HNR showing a more pronounced negative skew in the Diseased group. The negative skew of Status could reflect the distribution of disease severity, with most diseased individuals falling into lower severity levels.

### Histogram
---

In [None]:
parkinsons_df.hist(figsize=(25,20))

### Correlations
---

We can now check the correlations between features in the dataset.

In [None]:
parkinsons_corr = parkinsons_df.copy().corr()
parkinsons_corr['Status'].sort_values(ascending=False)

We can plot this on a heatmap.

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(parkinsons_corr, annot=True, cmap=gradient_palette)
plt.show()

The correlation analysis of the Parkinson's dataset reveals several important patterns related to the Status of the individuals. Spread1 and PPE show the strongest positive correlations with Status, indicating that greater variability in speech features and potentially higher vocal effort are associated with more severe Parkinson's symptoms. Spread2 also shows a moderate positive correlation, suggesting a similar relationship, though slightly weaker.

Speech-related features like MDVP: Shimmer, MDVP: APQ, and Shimmer: APQ5 have moderate positive correlations with Status, implying that these features are linked to disease severity in Parkinson's patients. Notably, MDVP: Shimmer(dB) and Shimmer: APQ3 also correlate moderately with Status, pointing to their potential role in distinguishing between stages of Parkinson's.

D2 and MDVP: Jitter(Abs) show weaker positive correlations, highlighting that vocal features associated with irregularities and pitch variation may also be relevant for assessing the severity of Parkinson's, though their impact is less pronounced than the other speech features.

On the other hand, HNR, MDVP: Fo(Hz), and MDVP: Flo(Hz) show negative correlations with Status, suggesting that lower values of these features may be associated with more severe Parkinson's symptoms. The stronger negative correlation between HNR and Status indicates that speech harmonics, which are influenced by vocal quality, could serve as a significant indicator of disease progression.

In summary, speech features such as Spread1, PPE, and MDVP: Shimmer have the strongest correlations with disease severity in Parkinson's patients, while features like HNR and MDVP: Fo(Hz) show significant negative correlations. This suggests that both the variability and quality of speech may be key indicators for predicting the severity of Parkinson's disease.

In [71]:
important_features = ['Status', 'spread1', 'PPE', 'MDVP:Shimmer', 'MDVP:APQ', 'Shimmer:APQ5', 'Shimmer:DDA', 'MDVP:Shimmer(dB)', 'HNR', 'MDVP:Fo(Hz)', 'MDVP:Flo(Hz)']

We can also visualise the important features in a pairplot.

In [None]:
sns.pairplot(parkinsons_df[important_features], hue='Status', palette=palette)

And finally let's shuffle and save the processed dataset.

In [73]:
parkinsons_df = parkinsons_df.sample(frac=1).reset_index(drop=True)

In [74]:
parkinsons_df.to_csv('data/parkinsons_data_processed.csv', index=False)