Alzheimer's Disease 🧠🔍📈
Prepared by Syahmi, Beatrice & Farriz

Alzheimer's disease is a progressive neurodegenerative disorder that primarily affects older adults, leading to the gradual loss of memory, cognitive function, and the ability to perform everyday activities. It is the most common cause of dementia, accounting for 60-80% of all dementia cases.

The exact cause of Alzheimer's disease is not fully understood, but it is believed to result from a combination of genetic, environmental, and lifestyle factors. Some key risk factors include: age, family history, genetics, lifestyle and heart health, and head injury.

We are representatives of an Alzheimer's organization.

Roles:

Project Manager (Mohamad Farriz Fikri)
Data Analyst (Beatrice Majang)
Machine Learning Engineer (Syahmi Sade)

Problem Statement:

There is a pressing need to analyze the available data to gain insights into the factors contributing to the disease's onset and progression, which can help with early detection and improved management of the condition.
The data we used comes from a hospital in the United States. The hospital's name is withheld for confidentiality reasons.

Objective:

Data visualization
Machine Learning: Compare Decision Tree, Random Forest, K-Nearest Neighbors (KNN), and Logistic Regression models

1. Data Loading

This section imports the Python libraries needed for data manipulation, visualization, preprocessing, and model training, and then loads the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report, f1_score, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from xgboost import XGBClassifier
pd.set_option('display.max_columns', None)
%matplotlib inline


import warnings
warnings.filterwarnings("ignore")

sns.set_theme(context='notebook', palette='deep', style='darkgrid')

In [None]:
# Load the dataset into a DataFrame
df = pd.read_csv('/content/alzheimers_disease_data.csv')

2. Data Inspection

This step examines the dataset to understand its structure, data types, and overall content. We display an initial overview of the data, review the information summary and the data type of each column, and generate summary statistics for the numerical columns to understand their distributions. We also check the number of unique values in categorical columns and identify any duplicate rows.

This inspection provides a better understanding of the dataset and prepares it for further analysis and modeling.
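
The checks described above can be collected into one small summary table. The following is a minimal sketch, not part of the original analysis: the helper name `summarize_columns` is hypothetical, and the `toy` frame uses made-up values that only stand in for the real dataset.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, unique-value count."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing': frame.isna().sum(),
        'n_unique': frame.nunique(),  # NaN is not counted as a unique value
    })


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Age': [70, 82, 70, 65],
    'Gender': [0, 1, 0, 1],
    'MMSE': [21.5, None, 21.5, 28.0],
})

summary = summarize_columns(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows after the first

print(summary)
print(f'Duplicate rows: {n_duplicates}')
```

In the notebook itself, the same helper could be called on `df` right after loading, before any columns are dropped.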

In [None]:
# Displaying the data frame
df

In [None]:
# Displaying the info for data columns
df.info()

In [None]:
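# Format numeric values so that they are easier to read
def format_value(x):
    """Use scientific notation for very small non-zero values, two decimals otherwise."""
    if abs(x) < 1e-2 and x != 0:
        return f'{x:.2e}'
    else:
        return f'{x:.2f}'

# Display the summary statistics of the data frame, one row per column
read = df.describe().T
# Note: in pandas >= 2.1, applymap is deprecated in favor of DataFrame.map
readsummary = read.applymap(format_value)
readsummary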

In [None]:
# Count the duplicated rows in the dataset
df.duplicated().sum()

In [None]:
# Count the occurrences of each value in the DoctorInCharge column
df.DoctorInCharge.value_counts()

3. Data Pre-processing

Data pre-processing is a critical step in preparing a dataset for analysis and modeling. For an Alzheimer's disease dataset, this process typically involves several key steps to clean and structure the data effectively. This includes removing unnecessary columns such as DoctorInCharge and PatientID, as well as identifying the unique values within specific DataFrame columns to understand the data distribution and ensure consistency.

Dropping columns

In [None]:
df.drop(['PatientID', 'DoctorInCharge'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.tail()

Observe unique values in the dataset

In [None]:
# Print the unique values in each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values in column '{column}':")
    print(unique_values)
    print()

4. Data Visualization

Data visualization helps in identifying patterns, trends, and relationships between variables. For an Alzheimer's disease dataset, various types of visualizations can be employed to gain insights into clinical, genetic, and demographic data. In this case, we use bar charts to visualize the frequency distribution of categorical variables such as gender, ethnicity, and education level, and histograms to visualize the distribution of numerical variables.

The dataset comprises 2,149 observations, with all values being non-null and numerical. There are no duplicate records. After excluding the DoctorInCharge and PatientID columns, the dataset has 33 columns, including the Diagnosis target. Visualizing these columns gives insight into how each feature is distributed, which helps in making informed decisions for further analysis, feature engineering, and model development.

In [None]:
# Identify numerical columns: columns with more than 10 unique values are considered numerical
numerical_columns = [col for col in df.columns if df[col].nunique() > 10]

# Identify categorical columns: columns that are not numerical and not 'Diagnosis'
categorical_columns = df.columns.difference(numerical_columns).difference(['Diagnosis']).to_list()

In [None]:
# Custom labels for the categorical columns
custom_labels = {
    'Gender': ['Male', 'Female'],
    'Ethnicity': ['Caucasian', 'African American', 'Asian', 'Others'],
    'EducationLevel': ['None', 'High School', 'Bachelor\'s', 'Higher'],
    'Smoking': ['No', 'Yes'],
    'FamilyHistoryAlzheimers': ['No', 'Yes'],
    'CardiovascularDisease': ['No', 'Yes'],
    'Diabetes': ['No', 'Yes'],
    'Depression': ['No', 'Yes'],
    'HeadInjury': ['No', 'Yes'],
    'Hypertension': ['No', 'Yes'],
    'MemoryComplaints': ['No', 'Yes'],
    'BehavioralProblems': ['No', 'Yes'],
    'Confusion': ['No', 'Yes'],
    'Disorientation': ['No', 'Yes'],
    'PersonalityChanges': ['No', 'Yes'],
    'DifficultyCompletingTasks': ['No', 'Yes'],
    'Forgetfulness': ['No', 'Yes']
}

Categorical Features

In [None]:
custom_palette = sns.color_palette("Set2")

# Number of columns to plot side by side
n_cols = 2
n_rows = (len(categorical_columns) + n_cols - 1) // n_cols  # Calculate the number of rows needed

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, n_rows * 5))

# Flatten the axes array for easy iteration
axes = axes.flatten()

for i, column in enumerate(categorical_columns):
    sns.countplot(data=df, x=column, palette=custom_palette, ax=axes[i])
    axes[i].set_title(f'Countplot of {column}', fontweight='bold')

    # Set custom labels
    labels = custom_labels[column]
    ticks = range(len(labels))
    axes[i].set_xticks(ticks)
    axes[i].set_xticklabels(labels, fontstyle='italic', fontweight='bold')

    # Set axis labels to italic
    axes[i].set_xlabel(axes[i].get_xlabel(), fontstyle='italic')
    axes[i].set_ylabel(axes[i].get_ylabel(), fontstyle='italic')

# Remove empty subplots if the number of categorical columns is not even
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust the spacing between plots (tight_layout computes it automatically,
# so a separate subplots_adjust call is not needed)
plt.tight_layout()
plt.show()

Numerical Features

In [None]:
n_rows = -(-len(numerical_columns) // n_cols)  # Ceiling division to get the number of rows

# Create subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 30))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Plot each numerical column
for idx, column in enumerate(numerical_columns):
    sns.histplot(data=df, x=column, kde=True, bins=20, ax=axes[idx])
    axes[idx].set_title(f'Distribution of {column}', fontsize=14, fontweight='bold')

# Remove any empty subplots
for i in range(len(numerical_columns), len(axes)):
    fig.delaxes(axes[i])

# Adjust layout
plt.tight_layout()
plt.show()

Correlations between features

1. Heatmap
2. Pearson correlation coefficients
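
For reference, the Pearson coefficient used in both plots measures the linear association between two columns. The sketch below computes it by hand and compares it with pandas; the function name `pearson_r` and the `toy` values are illustrative assumptions, not data from the hospital dataset.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


# Made-up example: a score that tends to be lower when the diagnosis is 1
toy = pd.DataFrame({
    'score': [1.0, 2.0, 3.0, 4.0, 5.0],
    'diagnosis': [1, 1, 0, 0, 0],
})

manual = pearson_r(toy['score'], toy['diagnosis'])
builtin = toy['score'].corr(toy['diagnosis'])  # pandas uses Pearson by default

print(f'manual={manual:.4f}, pandas={builtin:.4f}')
```

A negative value, as in this toy example, means that higher scores go together with a diagnosis of 0; values near zero indicate little linear relationship.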

In [None]:
# Create a mask for the upper triangle
mask = np.triu(np.ones_like(df.corr(), dtype=bool))

# Plot heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(),cmap="coolwarm", cbar_kws={"shrink": .5}, mask=mask)

plt.show()

In [None]:
# Compute Pearson correlation coefficients
# (drop the target itself by name rather than relying on it being the last column)
correlations = df.corr(numeric_only=True)['Diagnosis'].drop('Diagnosis').sort_values()

# Set the size of the figure
plt.figure(figsize=(20, 7))

# Create a bar plot of the Pearson correlation coefficients
ax = correlations.plot(kind='bar', width=0.7)

# Set the y-axis limits and labels
ax.set(ylim=[-1, 1], ylabel='Pearson Correlation', xlabel='Features',
       title='Pearson Correlation with Diagnosis')

# Rotate x-axis labels for better readability
ax.set_xticklabels(correlations.index, rotation=45, ha='right')

plt.tight_layout()
plt.show()

Graphs for the top 5 highlighted features

1. Functional Assessment Scores by Diagnosis
2. Activities of Daily Living Score by Diagnosis
3. Mini-Mental State Examination Score by Diagnosis
4. Behavioral Problems by Diagnosis
5. Memory Complaints by Diagnosis

We also show how many patients in the dataset have Alzheimer's disease relative to the total number of patients.

In [None]:
sns.swarmplot(data=df, y='FunctionalAssessment', x='Diagnosis')
plt.title('Distribution of Functional Assessment Scores by Diagnosis Categories')
plt.show()

5. Model Training and Evaluation

In model training, we use historical data to teach a model how to make predictions. In this case, we train decision tree, random forest, K-Nearest Neighbors, and logistic regression models. Several hyperparameter values are tested for each model, and the dataset is split into training and testing sets so that each model's performance can be evaluated on unseen data.

Proper evaluation ensures that the model is reliable, generalizes well to new data, and provides actionable insights. For Alzheimer's disease prediction, we use the F1 score as the main metric. The F1 score evaluates the performance of a classification model and is especially useful for imbalanced datasets. It combines precision and recall into a single metric by calculating their harmonic mean, which balances the two and suits scenarios where both false positives and false negatives matter. A confusion matrix complements the F1 score with a detailed breakdown of prediction performance: the counts of true positives, true negatives, false positives, and false negatives.
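
To make the link between the confusion matrix and the F1 score concrete, here is a minimal sketch in plain Python. The helper name `f1_from_counts` and the label lists are made up for illustration and are not results from our models.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = Alzheimer's, 0 = no Alzheimer's
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

f1 = f1_from_counts(tp, fp, fn)
print(f'TP={tp}, TN={tn}, FP={fp}, FN={fn}, F1={f1:.2f}')
```

Note that true negatives do not enter the formula, which is why the F1 score stays informative even when the negative class dominates.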

Main goal: To identify which model achieves the highest F1 score when detecting whether a patient has Alzheimer's disease

Normalize & Standardize the columns

In [None]:
columns = ['Age', 'BMI', 'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality', 'SystolicBP', 'DiastolicBP',
           'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL', 'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment', 'ADL']

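# Normalize the columns to the [0, 1] range
min_max_scaler = MinMaxScaler()
df[columns] = min_max_scaler.fit_transform(df[columns])

# Standardize the columns to zero mean and unit variance
# Note: both steps are linear rescalings, so the final values are the same as
# applying StandardScaler alone. The scalers are also fitted on the full
# dataset here, before the train/test split.
standard_scaler = StandardScaler()
df[columns] = standard_scaler.fit_transform(df[columns])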

df
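
As a side note on the scaling step, the sketch below checks two things on synthetic data (the arrays `X_demo` and `y_demo` are randomly generated assumptions, not the hospital data): that min-max scaling followed by standardization matches standardization alone, and how a scaler could instead be fitted on the training split only, so that test-set statistics do not influence preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the numeric feature columns
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# 1) MinMaxScaler followed by StandardScaler vs. StandardScaler alone
chained = StandardScaler().fit_transform(MinMaxScaler().fit_transform(X_demo))
direct = StandardScaler().fit_transform(X_demo)
same_result = np.allclose(chained, direct)
print(f'Chained scaling equals standardization alone: {same_result}')

# 2) Alternative: split first, then fit the scaler on the training data only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_tr)   # statistics come from the training split
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test split reuses the training statistics
```

Whether the second approach changes the scores reported in this notebook has not been tested here; it is shown only as a common way to keep the test set unseen.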

Confusion matrices for the models and comparison of their F1 scores

In [None]:
# Assuming df is already defined and contains the necessary data
# Split data into features and target
X = df.drop(columns=['Diagnosis'])
y = df['Diagnosis']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# Define hyperparameter grids for each model
param_grids = {
    'Decision Tree': {'max_depth': [3, 5, 7, 12, None]},
    'Random Forest': {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7, 12, None]},
    'K-Nearest Neighbors': {'n_neighbors': [3, 5, 7]},
    'Logistic Regression': {'C': [0.1, 1, 10]}
}

# Instantiate classification models with default parameters
models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression()
}

# Dictionary to store the F1 scores for each model
f1_scores = {
    'Model': [],
    'Dataset': [],
    'F1 Score': []
}

# Fit models using GridSearchCV for hyperparameter tuning
num_models = len(models)
fig, axes = plt.subplots((num_models + 1) // 2, 2, figsize=(15, (num_models // 2 + num_models % 2) * 6))
axes = axes.flatten()

for idx, (name, model) in enumerate(models.items()):
    grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring='f1')
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

    # Predict on test set
    y_test_pred = best_model.predict(X_test)
    f1_test = f1_score(y_test, y_test_pred)
    f1_scores['Model'].append(name)
    f1_scores['Dataset'].append('Test')
    f1_scores['F1 Score'].append(f1_test)

    # Plot confusion matrix
    ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test, ax=axes[idx])
    axes[idx].set_title(f'{name} Confusion Matrix', fontweight='bold')

    # Print F1 score for test set
    # print(name)
    # print(f'F1 Score for Test: {f1_test:.2f}\n')

plt.tight_layout()
plt.show()

# Convert the F1 scores to a DataFrame for easy plotting
f1_scores_df = pd.DataFrame(f1_scores)

# Keep only the test-set scores for the bar chart
f1_scores_test_df = f1_scores_df[f1_scores_df['Dataset'] == 'Test']

# Plot the F1 scores using a bar chart
plt.figure(figsize=(12, 8))
ax = sns.barplot(x='Model', y='F1 Score', data=f1_scores_test_df, palette='Set2')

# Add F1 scores on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 10),
                textcoords='offset points',
                fontweight='bold')

plt.title('F1 Scores for Different Models (Test Dataset)', fontweight='bold')
plt.xlabel('Model', fontweight='bold')
plt.ylabel('F1 Score', fontstyle='italic')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
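
The tuning loop above can also be written with the scaler inside a scikit-learn `Pipeline`, so that each cross-validation fold refits the scaler on its own training part. This is a hedged sketch on synthetic data from `make_classification`; the step names `scale` and `clf`, the `max_iter` value, and the stratified split are assumptions for illustration, not the configuration used in our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary classification data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# stratify keeps the class ratio similar in the training and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
search = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=5, scoring='f1')
search.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, search.predict(X_te))
print(f'Best C: {search.best_params_["clf__C"]}, test F1: {test_f1:.2f}')
```

The same pattern would apply to the other three models by swapping the `clf` step and its parameter grid.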

Conclusion
1. Exploratory Data Analysis (EDA) of Alzheimer's disease data is a critical first step in understanding the disease's impact and its underlying factors. It lets us uncover significant patterns and trends that inform our understanding of Alzheimer's disease in the United States.

Recommendations
1. Enhanced Data Collection: Improve the collection and availability of high-quality, comprehensive data on Alzheimer's disease in the United States, including demographic, clinical, and socio-economic information.

2. Public Health Strategies: Use insights from EDA to inform targeted public health strategies and interventions aimed at early detection, prevention, and management of Alzheimer's disease.

3. Focused Research: Encourage further research based on the identified trends and significant variables from the EDA to develop predictive models and inferential statistics that can provide deeper insights into the disease.

4. Awareness and Education: Increase awareness and education about Alzheimer's disease among healthcare professionals and the general public to promote early diagnosis and effective management.

5. Policy Development: Support the development of policies that address the needs of Alzheimer's patients and their caregivers, ensuring access to necessary resources and support systems.