# Alzheimer's Disease Prediction: Statistical Learning Midterm Project

R11323024 Fan Wei-Yu

## 1. Data Processing and Exploration

In this section, we import the necessary libraries, load the dataset, and perform preliminary data exploration and preprocessing.

In [ ]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
sns.set_context('talk')

In [ ]:
# Load the Alzheimer's disease dataset
data_path = 'data/alzheimers_disease_data.csv'
alzheimer_data = pd.read_csv(data_path)

# Display the first few rows of data
alzheimer_data.head()

### 1.1 Data Quality Check

First, check for missing values, duplicates, and other data quality issues.

In [ ]:
# Check for missing values
missing_values = alzheimer_data.isnull().sum().sort_values(ascending=False)
print("Missing value count (top 10 columns):")
print(missing_values.head(10))

In [ ]:
# Check data types and basic information
print("Data types and non-null counts:")
alzheimer_data.info()

In [ ]:
# Check for duplicate patient IDs
duplicate_ids = alzheimer_data['PatientID'].duplicated().sum()
print(f"Number of duplicate patient IDs: {duplicate_ids}")

In [ ]:
# View basic statistics of the dataset
summary_stats = alzheimer_data.describe(include='all')
print("Data summary statistics:")
print(summary_stats)

### 1.2 Data Preprocessing

Next, perform feature engineering and data transformation, including:

1. Encoding categorical variables (ethnicity and education level)
2. Converting binary variables to boolean type
3. Standardizing continuous variables

In [ ]:
# Map ethnicity and education level to meaningful names
ethnicity_mapping = {0: 'Caucasian', 1: 'AfricanAmerican', 2: 'Asian', 3: 'Other'}
education_mapping = {0: 'None', 1: 'HighSchool', 2: 'Bachelors', 3: 'Higher'}

# Apply mapping
alzheimer_data['Ethnicity'] = alzheimer_data['Ethnicity'].map(ethnicity_mapping)
alzheimer_data['EducationLevel'] = alzheimer_data['EducationLevel'].map(education_mapping)

# One-hot encode categorical variables
alzheimer_data = pd.get_dummies(alzheimer_data, columns=['Ethnicity', 'EducationLevel'], drop_first=True)

# Display data after encoding
print("Data after categorical encoding:")
print(alzheimer_data.head())
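
One detail of the encoding step is worth making explicit: with `drop_first=True`, `pd.get_dummies` drops the alphabetically first label of each column, not the label that had code 0. The sketch below uses a small toy frame (an assumption for illustration, not the project data) with the same ethnicity mapping to show which category becomes the reference level.

```python
import pandas as pd

# Toy data (hypothetical): one row per ethnicity code, plus a repeat
toy = pd.DataFrame({'Ethnicity': [0, 1, 2, 3, 0]})

# Same mapping as in the notebook
ethnicity_mapping = {0: 'Caucasian', 1: 'AfricanAmerican', 2: 'Asian', 3: 'Other'}
toy['Ethnicity'] = toy['Ethnicity'].map(ethnicity_mapping)

# drop_first=True removes the first category in sorted order ('AfricanAmerican')
encoded = pd.get_dummies(toy, columns=['Ethnicity'], drop_first=True)
dummy_columns = list(encoded.columns)
print(dummy_columns)
```

Under this encoding, the remaining ethnicity dummies are read relative to `AfricanAmerican`, and the education dummies relative to `Bachelors`, which matters when interpreting the logistic regression coefficients.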

In [ ]:
# Drop unnecessary columns (patient ID and doctor information)
alzheimer_data = alzheimer_data.drop(columns=['PatientID', 'DoctorInCharge'])

In [ ]:
# Convert binary columns to boolean type
binary_columns = [
    'Gender', 'Smoking', 'FamilyHistoryAlzheimers', 'CardiovascularDisease',
    'Diabetes', 'Depression', 'HeadInjury', 'Hypertension', 'MemoryComplaints',
    'BehavioralProblems', 'Confusion', 'Disorientation', 'PersonalityChanges',
    'DifficultyCompletingTasks', 'Forgetfulness', 'Diagnosis'
]

# Convert to boolean type
alzheimer_data[binary_columns] = alzheimer_data[binary_columns].astype(bool)

# Display the share of True values in each binary column
print("Distribution of binary variables after conversion:")
print(alzheimer_data[binary_columns].mean().sort_values(ascending=False))

In [ ]:
# Standardize continuous variables
from sklearn.preprocessing import StandardScaler

continuous_columns = [
    'SystolicBP', 'DiastolicBP', 'CholesterolTotal', 'CholesterolLDL',
    'CholesterolHDL', 'CholesterolTriglycerides', 'Age', 'BMI',
    'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality',
    'MMSE', 'FunctionalAssessment', 'ADL'
]

# Use StandardScaler for standardization
scaler = StandardScaler()
alzheimer_data[continuous_columns] = scaler.fit_transform(alzheimer_data[continuous_columns])

# Display statistics after standardization
print("Statistics of continuous variables after standardization:")
print(alzheimer_data[continuous_columns].describe().T[['mean', 'std']])
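
The cell above fits `StandardScaler` on the full dataset, before the train/test split in Section 2.1, so test-set statistics feed into the scaling. A minimal sketch of one common alternative follows: split first, fit the scaler on the training rows only, and reuse it for the test rows. The data here are synthetic and all variable names (`X_demo`, `demo_scaler`, and so on) are hypothetical, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the continuous columns (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaling parameters
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to both parts
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Train means:", X_tr_scaled.mean(axis=0).round(3))
print("Test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance by construction; the test columns will be close to, but generally not exactly, standardized, which is the expected behavior.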

### 1.3 Exploratory Data Analysis

Next, explore the distributions of the data and the relationships between features.

In [ ]:
# Check the distribution of the target variable
diagnosis_counts = alzheimer_data['Diagnosis'].value_counts()
print("Alzheimer's diagnosis distribution:")
print(diagnosis_counts)

# Visualize target variable distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='Diagnosis', data=alzheimer_data)
plt.title('Alzheimer\'s Disease Diagnosis Distribution')
plt.xlabel('Diagnosis (False=No, True=Yes)')
plt.ylabel('Patient Count')
plt.xticks([0, 1], ['No Alzheimer\'s', 'Has Alzheimer\'s'])
plt.show()

In [ ]:
# Calculate correlations with the target variable
target_correlations = alzheimer_data.corr(numeric_only=False)['Diagnosis'].sort_values(ascending=False)
print("Features most correlated with Alzheimer's diagnosis:")
print(target_correlations.head(15))

# Visualize correlation matrix for the 15 features with the largest absolute correlation
plt.figure(figsize=(12, 10))
top_features = target_correlations.abs().sort_values(ascending=False)[:15].index
correlation_matrix = alzheimer_data[top_features].corr()

# Create a mask to hide the upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Plot heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
            linewidths=0.5, fmt='.2f', mask=mask)
plt.title('Correlation Matrix of Top 15 Features')
plt.tight_layout()
plt.show()

In [ ]:
# Visualize distributions of numerical features
# (DataFrame.hist creates its own figure, so no separate plt.figure call is needed)
alzheimer_data[continuous_columns].hist(bins=30, figsize=(20, 15),
                                        color='skyblue', edgecolor='black')
plt.suptitle('Distribution of Continuous Variables', fontsize=20)
plt.tight_layout()
plt.subplots_adjust(top=0.95)
plt.show()

## 2. Model Training and Evaluation

This section implements three classification methods (logistic regression, random forest, and XGBoost) and compares them.

### 2.1 Split the Dataset

In [ ]:
# Import necessary machine learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay

# Split features and target variable
X = alzheimer_data.drop('Diagnosis', axis=1)
y = alzheimer_data['Diagnosis']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
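
The split above is purely random, so the share of positive diagnoses can differ slightly between the training and test sets. As a hedged illustration only (the notebook does not do this), passing `stratify=` to `train_test_split` keeps the class proportions the same in both parts. The data below are synthetic, and the 30% positive rate is an arbitrary assumption, not a property of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 300 positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([True] * 300 + [False] * 700)

# stratify=y_demo preserves the positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate overall: {y_demo.mean():.3f}")
print(f"Positive rate (train): {y_tr.mean():.3f}")
print(f"Positive rate (test):  {y_te.mean():.3f}")
```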

### 2.2 Logistic Regression Model

In [ ]:
# Define logistic regression model
logistic_model = LogisticRegression()

# Define hyperparameter grid
logistic_param_grid = {
    'penalty': ['l2'],
    'C': np.logspace(-2, 3, 20),  # Inverse of regularization strength
    'solver': ['lbfgs', 'liblinear'],
    'max_iter': [1000, 2500]
}

# Use grid search for hyperparameter tuning
logistic_grid_search = GridSearchCV(
    logistic_model,
    param_grid=logistic_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit the model
logistic_grid_search.fit(X_train, y_train)

# Get the best model
best_logistic = logistic_grid_search.best_estimator_

# Display best hyperparameters
print("Best hyperparameters for logistic regression:")
print(logistic_grid_search.best_params_)

In [ ]:
# Evaluate logistic regression model
y_pred_logistic = best_logistic.predict(X_test)
y_prob_logistic = best_logistic.predict_proba(X_test)[:, 1]
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logistic, digits=3))

In [ ]:
# Get coefficients from logistic regression
logistic_coefficients = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': best_logistic.coef_[0]
})

# Sort by absolute coefficient values
logistic_coefficients['AbsCoefficient'] = logistic_coefficients['Coefficient'].abs()
logistic_coefficients = logistic_coefficients.sort_values(by='AbsCoefficient', ascending=False)

# Display coefficients
print("Logistic Regression Feature Importance (sorted by absolute coefficient):")
print(logistic_coefficients[['Feature', 'Coefficient']].head(10))
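
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio. For the standardized continuous features this is the multiplicative change in the odds of a positive diagnosis per one standard deviation; for the boolean features it is the change between False and True. The sketch below shows the conversion on made-up coefficients; the feature names and values are hypothetical and are not results from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only
example_coefficients = pd.DataFrame({
    'Feature': ['FeatureA', 'FeatureB', 'FeatureC'],
    'Coefficient': [-1.2, 0.0, 0.7],
})

# exp(coefficient) is the odds ratio: below 1 lowers the odds, above 1 raises them
example_coefficients['OddsRatio'] = np.exp(example_coefficients['Coefficient'])
print(example_coefficients)
```

The same one-line transformation could be applied to the `Coefficient` column of `logistic_coefficients` to report odds ratios alongside the raw coefficients.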

In [ ]:
# Plot ROC curve for logistic regression
fpr_logistic, tpr_logistic, _ = roc_curve(y_test, y_prob_logistic)
roc_auc_logistic = auc(fpr_logistic, tpr_logistic)

plt.figure(figsize=(10, 8))
plt.plot(fpr_logistic, tpr_logistic, color='blue', lw=2,
         label=f'Logistic Regression ROC curve (AUC = {roc_auc_logistic:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

### 2.3 Random Forest Model

In [ ]:
# Initialize Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
rf_param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Use grid search for hyperparameter tuning
rf_grid_search = GridSearchCV(
    rf_model,
    param_grid=rf_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit the model
rf_grid_search.fit(X_train, y_train)

# Get the best model
best_rf = rf_grid_search.best_estimator_

# Display best hyperparameters
print("Best hyperparameters for Random Forest:")
print(rf_grid_search.best_params_)

In [ ]:
# Evaluate Random Forest model
y_pred_rf = best_rf.predict(X_test)
y_prob_rf = best_rf.predict_proba(X_test)[:, 1]
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, digits=3))

In [ ]:
# Get feature importance from Random Forest
rf_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_rf.feature_importances_
})
rf_importance = rf_importance.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=rf_importance.head(15), palette='viridis')
plt.title('Random Forest - Top 15 Important Features')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

# Print feature importance
print("Random Forest Feature Importance:")
print(rf_importance[['Feature', 'Importance']].head(10))

In [ ]:
# Plot ROC curve for Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
roc_auc_rf = auc(fpr_rf, tpr_rf)

plt.figure(figsize=(10, 8))
plt.plot(fpr_rf, tpr_rf, color='green', lw=2,
         label=f'Random Forest ROC curve (AUC = {roc_auc_rf:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

### 2.4 XGBoost Model

In [ ]:
# Initialize XGBoost classifier
xgb_model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42
)

# Define hyperparameter grid
xgb_param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Use grid search for hyperparameter tuning
xgb_grid_search = GridSearchCV(
    xgb_model,
    param_grid=xgb_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit the model
xgb_grid_search.fit(X_train, y_train)

# Get the best model
best_xgb = xgb_grid_search.best_estimator_

# Display best hyperparameters
print("XGBoost best hyperparameters:")
print(xgb_grid_search.best_params_)

In [ ]:
# Evaluate XGBoost model
y_pred_xgb = best_xgb.predict(X_test)
y_prob_xgb = best_xgb.predict_proba(X_test)[:, 1]
print("XGBoost Classification Report:")
print(classification_report(y_test, y_pred_xgb, digits=3))

In [ ]:
# Get feature importance from XGBoost
xgb_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_xgb.feature_importances_
})
xgb_importance = xgb_importance.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=xgb_importance.head(15), palette='magma')
plt.title('XGBoost - Top 15 Important Features')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

# Print feature importance
print("XGBoost Feature Importance:")
print(xgb_importance[['Feature', 'Importance']].head(10))

In [ ]:
# Plot XGBoost ROC curve
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_prob_xgb)
roc_auc_xgb = auc(fpr_xgb, tpr_xgb)

plt.figure(figsize=(10, 8))
plt.plot(fpr_xgb, tpr_xgb, color='purple', lw=2,
         label=f'XGBoost ROC curve (AUC = {roc_auc_xgb:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('XGBoost Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

## 3. Model Comparison and Evaluation

Next, we compare the performance of all models and evaluate their respective strengths and weaknesses.

In [ ]:
# Define function to evaluate models
def evaluate_model(y_true, y_pred, y_pred_prob=None, model_name="Model"):
    """
    Evaluate a classification model and print several metrics.

    Parameters:
    y_true: True labels
    y_pred: Predicted labels
    y_pred_prob: Predicted probabilities of the positive class, used for ROC AUC
    model_name: Model name
    """
    print(f"===== Evaluation Metrics for {model_name} =====")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
    print(f"Precision: {precision_score(y_true, y_pred):.3f}")
    print(f"Recall: {recall_score(y_true, y_pred):.3f}")
    print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")
    if y_pred_prob is not None:
        print(f"ROC AUC Score: {roc_auc_score(y_true, y_pred_prob):.3f}")
    print("-" * 40)

# Evaluate all models
evaluate_model(y_test, y_pred_logistic, y_prob_logistic, model_name="Logistic Regression")
evaluate_model(y_test, y_pred_rf, y_prob_rf, model_name="Random Forest")
evaluate_model(y_test, y_pred_xgb, y_prob_xgb, model_name="XGBoost")
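
To make the metrics printed by `evaluate_model` concrete, the sketch below derives precision, recall, and F1 by hand from confusion-matrix counts and checks them against scikit-learn. The labels are a tiny made-up example (hypothetical, not project results).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration: 4 positives, 6 negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
```

In a screening setting, recall is often the metric to watch, since a false negative corresponds to a missed diagnosis.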

In [ ]:
# Plot confusion matrices for all models
def plot_confusion_matrix(y_true, y_pred, model_name="Model"):
    """
    Plot the confusion matrix for a model.

    Parameters:
    y_true: True labels
    y_pred: Predicted labels
    model_name: Model name
    """
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No Alzheimer\'s', 'Has Alzheimer\'s'])
    fig, ax = plt.subplots(figsize=(8, 6))
    disp.plot(cmap='Blues', ax=ax)
    plt.title(f"Confusion Matrix for {model_name}")
    plt.tight_layout()
    plt.show()

# Plot confusion matrices for each model
plot_confusion_matrix(y_test, y_pred_logistic, model_name="Logistic Regression")
plot_confusion_matrix(y_test, y_pred_rf, model_name="Random Forest")
plot_confusion_matrix(y_test, y_pred_xgb, model_name="XGBoost")

In [ ]:
# Compare ROC curves for all models
plt.figure(figsize=(12, 8))

# Plot Logistic Regression ROC curve
plt.plot(fpr_logistic, tpr_logistic, color='blue', lw=2,
         label=f'Logistic Regression (AUC = {roc_auc_logistic:.3f})')

# Plot Random Forest ROC curve
plt.plot(fpr_rf, tpr_rf, color='green', lw=2,
         label=f'Random Forest (AUC = {roc_auc_rf:.3f})')

# Plot XGBoost ROC curve
plt.plot(fpr_xgb, tpr_xgb, color='purple', lw=2,
         label=f'XGBoost (AUC = {roc_auc_xgb:.3f})')

# Plot baseline
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--', label='Baseline')

# Enhance chart
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison for All Models', fontsize=16)
plt.legend(loc="lower right", fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 3.1 Feature Importance Comparison

In [ ]:
# Use a Venn diagram to compare the top 10 important features across the three models
from matplotlib_venn import venn3

# Extract top 10 features from each model
top_features_logistic = set(logistic_coefficients['Feature'].head(10))
top_features_rf = set(rf_importance['Feature'].head(10))
top_features_xgb = set(xgb_importance['Feature'].head(10))

# Create Venn diagram
plt.figure(figsize=(10, 8))
venn = venn3(
    [top_features_logistic, top_features_rf, top_features_xgb],
    ('Logistic Regression', 'Random Forest', 'XGBoost')
)

def set_region_label(region_id, features):
    """Write feature names into a Venn region; skip regions that have no label object."""
    label = venn.get_label_by_id(region_id)
    if label is not None:
        label.set_text('\n'.join(sorted(features)))

# Only Logistic Regression
set_region_label('100', top_features_logistic - top_features_rf - top_features_xgb)
# Logistic Regression and Random Forest
set_region_label('110', (top_features_logistic & top_features_rf) - top_features_xgb)
# Random Forest and XGBoost
set_region_label('011', (top_features_rf & top_features_xgb) - top_features_logistic)
# Only XGBoost
set_region_label('001', top_features_xgb - top_features_logistic - top_features_rf)
# Logistic Regression and XGBoost
set_region_label('101', (top_features_logistic & top_features_xgb) - top_features_rf)
# Common to all three models
set_region_label('111', top_features_logistic & top_features_rf & top_features_xgb)

# Only Random Forest: clear the in-region label and annotate outside the circle instead
rf_only = top_features_rf - top_features_logistic - top_features_xgb
rf_label = venn.get_label_by_id('010')
if rf_label is not None:
    rf_label.set_text('')
    if rf_only:
        plt.annotate('\n'.join(sorted(rf_only)),
                     xy=np.array(rf_label.get_position()) + np.array([0, 0.2]),
                     xytext=(-20, 40),
                     ha='center',
                     textcoords='offset points',
                     bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
                     arrowprops=dict(arrowstyle='->', connectionstyle='arc', color='gray'))

plt.title('Top 10 Important Features Comparison Across Models', fontsize=16, y=1.1)
plt.show()
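
The region IDs passed to `get_label_by_id` are three-character strings in which each character marks membership in the first, second, and third set. The sketch below computes all seven regions with plain set operations on small toy sets (hypothetical, chosen only for illustration). It also shows that `a & b - c` parses as `a & (b - c)` because `-` binds tighter than `&` in Python; for sets this happens to equal `(a & b) - c`, but explicit parentheses make the intent clearer.

```python
# Toy sets (hypothetical) standing in for the three models' top features
top_a = {'MMSE', 'ADL', 'Age'}
top_b = {'MMSE', 'ADL', 'BMI'}
top_c = {'MMSE', 'SleepQuality'}

# Region id: '1' means "inside" for set A, B, C respectively
regions = {
    '100': top_a - top_b - top_c,
    '010': top_b - top_a - top_c,
    '001': top_c - top_a - top_b,
    '110': (top_a & top_b) - top_c,
    '101': (top_a & top_c) - top_b,
    '011': (top_b & top_c) - top_a,
    '111': top_a & top_b & top_c,
}

for region_id, members in regions.items():
    print(region_id, sorted(members))

# Operator precedence: '-' is evaluated before '&'
unparenthesized = top_a & top_b - top_c
```

The seven regions are disjoint and together cover the union of the three sets, which is a quick sanity check when labeling a diagram by hand.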

## 4. Model Optimization and Feature Selection

Based on the analysis above, we use feature selection techniques to optimize model performance.

In [ ]:
# Get features identified as important by all three models
common_important_features = list(top_features_logistic & top_features_rf & top_features_xgb)
print(f"Features identified as important by all three models ({len(common_important_features)} features):")
print(common_important_features)

In [ ]:
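# Get the union of important features from all models.
# A set has no defined order, so rank the union by XGBoost importance
# (xgb_importance is already sorted in descending order); slices of this list are then ranked.
union_features = top_features_logistic | top_features_rf | top_features_xgb
all_important_features = [f for f in xgb_importance['Feature'] if f in union_features]
print(f"Union of important features from all models ({len(all_important_features)} features):")
for i, feature in enumerate(all_important_features):
    print(f"{i+1}. {feature}")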

In [ ]:
# Test how different feature counts affect model performance
results = []
for i in range(2, len(all_important_features) + 1, 2):
    # Select the first i features of all_important_features
    selected_features = all_important_features[:i]
    X_train_selected = X_train[selected_features]
    X_test_selected = X_test[selected_features]

    # Train XGBoost model (with best hyperparameters) on selected features
    selected_model = XGBClassifier(
        objective='binary:logistic',
        max_depth=best_xgb.max_depth,
        learning_rate=best_xgb.learning_rate,
        n_estimators=best_xgb.n_estimators,
        subsample=best_xgb.subsample,
        colsample_bytree=best_xgb.colsample_bytree,
        random_state=42
    )

    # Train the model
    selected_model.fit(X_train_selected, y_train)

    # Evaluate the model
    y_pred = selected_model.predict(X_test_selected)
    y_prob = selected_model.predict_proba(X_test_selected)[:, 1]

    # Calculate evaluation metrics
    results.append({
        'FeatureCount': i,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'ROC_AUC': roc_auc_score(y_test, y_prob)
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Plot effect of feature count on each metric
plt.figure(figsize=(12, 8))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1', 'ROC_AUC']
for metric in metrics:
    plt.plot(results_df['FeatureCount'], results_df[metric], marker='o', label=metric)

plt.xlabel('Number of Features')
plt.ylabel('Score')
plt.title('Effect of Feature Count on Model Performance')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

# Print results
print(results_df)

### 4.1 Building Models with Selected Features

In [ ]:
# Select the feature subset size with the highest F1 score
# (idxmax returns an index label, so look it up with .loc rather than .iloc)
optimal_feature_count = int(results_df.loc[results_df['F1'].idxmax(), 'FeatureCount'])
optimal_features = all_important_features[:optimal_feature_count]
print(f"Optimal number of features: {optimal_feature_count}")
print("Selected features:")
for i, feature in enumerate(optimal_features):
    print(f"{i+1}. {feature}")

In [ ]:
# Retrain all models using the optimal feature subset
X_train_optimal = X_train[optimal_features]
X_test_optimal = X_test[optimal_features]

# Train logistic regression model
logistic_optimal = LogisticRegression(
    C=best_logistic.C,
    solver=best_logistic.solver,
    max_iter=best_logistic.max_iter
)
logistic_optimal.fit(X_train_optimal, y_train)
y_pred_logistic_optimal = logistic_optimal.predict(X_test_optimal)
y_prob_logistic_optimal = logistic_optimal.predict_proba(X_test_optimal)[:, 1]

# Train random forest model
rf_optimal = RandomForestClassifier(
    n_estimators=best_rf.n_estimators,
    max_depth=best_rf.max_depth,
    min_samples_split=best_rf.min_samples_split,
    random_state=42
)
rf_optimal.fit(X_train_optimal, y_train)
y_pred_rf_optimal = rf_optimal.predict(X_test_optimal)
y_prob_rf_optimal = rf_optimal.predict_proba(X_test_optimal)[:, 1]

# Train XGBoost model
xgb_optimal = XGBClassifier(
    objective='binary:logistic',
    max_depth=best_xgb.max_depth,
    learning_rate=best_xgb.learning_rate,
    n_estimators=best_xgb.n_estimators,
    subsample=best_xgb.subsample,
    colsample_bytree=best_xgb.colsample_bytree,
    random_state=42
)
xgb_optimal.fit(X_train_optimal, y_train)
y_pred_xgb_optimal = xgb_optimal.predict(X_test_optimal)
y_prob_xgb_optimal = xgb_optimal.predict_proba(X_test_optimal)[:, 1]

In [ ]:
# Evaluate models using optimal features
print("===== Model Evaluation with Optimal Feature Subset =====")
evaluate_model(y_test, y_pred_logistic_optimal, y_prob_logistic_optimal, model_name="Logistic Regression (Optimal Features)")
evaluate_model(y_test, y_pred_rf_optimal, y_prob_rf_optimal, model_name="Random Forest (Optimal Features)")
evaluate_model(y_test, y_pred_xgb_optimal, y_prob_xgb_optimal, model_name="XGBoost (Optimal Features)")

### 4.2 Ensemble Learning Model

In [ ]:
# Create ensemble model using StackingClassifier
from sklearn.ensemble import StackingClassifier

# Define base models
estimators = [
    ('logistic', LogisticRegression(
        C=best_logistic.C,
        solver=best_logistic.solver,
        max_iter=best_logistic.max_iter
    )),
    ('rf', RandomForestClassifier(
        n_estimators=best_rf.n_estimators,
        max_depth=best_rf.max_depth,
        min_samples_split=best_rf.min_samples_split,
        random_state=42
    )),
    ('xgb', XGBClassifier(
        objective='binary:logistic',
        max_depth=best_xgb.max_depth,
        learning_rate=best_xgb.learning_rate,
        n_estimators=best_xgb.n_estimators,
        subsample=best_xgb.subsample,
        colsample_bytree=best_xgb.colsample_bytree,
        random_state=42
    ))
]

# Create stacking classifier
stacking_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
    n_jobs=-1
)

# Train stacking model (using optimal features)
stacking_model.fit(X_train_optimal, y_train)

# Evaluate stacking model
y_pred_stack = stacking_model.predict(X_test_optimal)
y_prob_stack = stacking_model.predict_proba(X_test_optimal)[:, 1]
evaluate_model(y_test, y_pred_stack, y_prob_stack, model_name="Stacking Ensemble Model (Optimal Features)")

## 5. Summary and Discussion

In this Alzheimer's disease prediction project, we compared three classification methods: Logistic Regression, Random Forest, and XGBoost. Through feature selection and model optimization, we can draw the following conclusions:

1. **Model Comparison**: All three models performed well in predicting Alzheimer's disease, with XGBoost and Random Forest slightly outperforming Logistic Regression overall.
2. **Important Features**: Functional Assessment, Memory Complaints, and MMSE test scores were the most important features for predicting Alzheimer's disease.
3. **Feature Selection**: We found that using approximately 10 selected features achieved similar or better prediction performance compared to using all features, while reducing computational complexity.
4. **Ensemble Learning**: The stacking model, which combined the strengths of all three base models, showed more stable performance on the optimal feature subset.

Overall, this study demonstrates the potential of machine learning techniques in early diagnosis and prediction of Alzheimer's disease. In clinical practice, this could provide valuable support for early intervention and treatment planning.