### Hypothesis Testing

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read data into a data frame
df = pd.read_csv("alzheimers_prediction_dataset.csv")
df

Unnamed: 0,Country,Age,Gender,Education Level,BMI,Physical Activity Level,Smoking Status,Alcohol Consumption,Diabetes,Hypertension,...,Dietary Habits,Air Pollution Exposure,Employment Status,Marital Status,Genetic Risk Factor (APOE-ε4 allele),Social Engagement Level,Income Level,Stress Levels,Urban vs Rural Living,Alzheimer’s Diagnosis
0,Spain,90,Male,1,33.0,Medium,Never,Occasionally,No,No,...,Healthy,High,Retired,Single,No,Low,Medium,High,Urban,No
1,Argentina,72,Male,7,29.9,Medium,Former,Never,No,No,...,Healthy,Medium,Unemployed,Widowed,No,High,Low,High,Urban,No
2,South Africa,86,Female,19,22.9,High,Current,Occasionally,No,Yes,...,Average,Medium,Employed,Single,No,Low,Medium,High,Rural,No
3,China,53,Male,17,31.2,Low,Never,Regularly,Yes,No,...,Healthy,Medium,Retired,Single,No,High,Medium,Low,Rural,No
4,Sweden,58,Female,3,30.0,High,Former,Never,Yes,No,...,Unhealthy,High,Employed,Married,No,Low,Medium,High,Rural,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74278,Russia,60,Female,3,22.6,High,Former,Never,No,No,...,Average,High,Unemployed,Widowed,No,Medium,High,Medium,Rural,No
74279,UK,58,Male,18,30.6,Low,Never,Occasionally,Yes,No,...,Average,Medium,Unemployed,Single,No,Medium,High,High,Rural,No
74280,Spain,57,Female,13,28.2,Medium,Never,Regularly,No,No,...,Healthy,Low,Employed,Single,Yes,High,Low,Low,Rural,No
74281,Brazil,73,Female,7,29.0,Low,Never,Regularly,No,No,...,Healthy,Low,Employed,Widowed,No,Low,Low,High,Rural,No


In [3]:
df.columns = df.columns.str.replace("’", "'")  # replace curly apostrophes with straight ones
df = df.rename(columns={"Family History of Alzheimer's": "Family History"})
print(df.columns.tolist())

['Country', 'Age', 'Gender', 'Education Level', 'BMI', 'Physical Activity Level', 'Smoking Status', 'Alcohol Consumption', 'Diabetes', 'Hypertension', 'Cholesterol Level', 'Family History', 'Cognitive Test Score', 'Depression Level', 'Sleep Quality', 'Dietary Habits', 'Air Pollution Exposure', 'Employment Status', 'Marital Status', 'Genetic Risk Factor (APOE-ε4 allele)', 'Social Engagement Level', 'Income Level', 'Stress Levels', 'Urban vs Rural Living', "Alzheimer's Diagnosis"]


#### Split features into numerical and categorical columns.

In [5]:
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
categorical_features = df.select_dtypes(include=['object', 'category', 'bool']).columns

#### Loop through numerical features and run t-tests

In [7]:
from scipy.stats import ttest_ind

# Convert the target (dependent variable) to binary: Yes -> 1, No -> 0
df["Alzheimer's Diagnosis"] = df["Alzheimer's Diagnosis"].map({'Yes': 1, 'No': 0})

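for col in numerical_features:
    # Split the column by diagnosis, dropping missing values
    group1 = df[df["Alzheimer's Diagnosis"] == 1][col].dropna()
    group0 = df[df["Alzheimer's Diagnosis"] == 0][col].dropna()

    # A t-test is undefined with fewer than two rows or zero variance in a group
    if len(group1) > 1 and len(group0) > 1 and group1.std() > 0 and group0.std() > 0:
        # Student's t-test; SciPy's default (equal_var=True) assumes equal variances
        t_stat, p = ttest_ind(group1, group0)
        print(f"{col}: p = {p:.4f}")
    else:
        print(f"{col}: cannot perform t-test (insufficient data or zero variance)")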

Age: p = 0.0000
Education Level: p = 0.3091
BMI: p = 0.6426
Cognitive Test Score: p = 0.7557


Age is the only numerical feature whose mean differs significantly between the diagnosed and undiagnosed groups (p < 0.0001, printed as 0.0000 after rounding to four decimals). Education Level (p = 0.3091), BMI (p = 0.6426), and Cognitive Test Score (p = 0.7557) show no statistically significant difference in means at the 0.05 level. Age may therefore be a meaningful predictor, while the other three features are less likely to contribute individually to classification performance.
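
With tens of thousands of rows, even a small difference in means can produce a very small p-value, so it can help to report an effect size next to the test. The sketch below is not part of the original analysis: it runs Welch's t-test (`equal_var=False`, which does not assume equal group variances) and computes Cohen's d on synthetic data. The helper `cohens_d` and the arrays `diagnosed` and `not_diagnosed` are illustrative names, and the numbers are made up.

```python
import numpy as np
from scipy.stats import ttest_ind


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Synthetic stand-ins for the two diagnosis groups (assumed values, not the real dataset)
rng = np.random.default_rng(0)
diagnosed = rng.normal(loc=75, scale=10, size=500)
not_diagnosed = rng.normal(loc=70, scale=12, size=800)

# equal_var=False requests Welch's t-test instead of Student's t-test
t_stat, p = ttest_ind(diagnosed, not_diagnosed, equal_var=False)
d = cohens_d(diagnosed, not_diagnosed)
print(f"Welch t = {t_stat:.2f}, p = {p:.3g}, Cohen's d = {d:.2f}")
```

On the real data, the same helper could be applied to `group1` and `group0` inside the loop to see whether the Age difference is large in practical terms as well as statistically significant.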

#### Check for any present null values

In [10]:
print(df[["Alzheimer's Diagnosis", "Age", "Education Level", "BMI", "Cognitive Test Score"]].isnull().sum())

Alzheimer's Diagnosis    0
Age                      0
Education Level          0
BMI                      0
Cognitive Test Score     0
dtype: int64


There are no null values in the target or in any of the numerical features.

#### Loop through categorical features and run chi-square tests

In [13]:
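from scipy.stats import chi2_contingency

# Note: categorical_features was built before the target was mapped to 0/1,
# so it still contains "Alzheimer's Diagnosis"; that row tests the target against itself.
for col in categorical_features:
    # Cross-tabulate each categorical feature against the diagnosis
    contingency_table = pd.crosstab(df[col], df["Alzheimer's Diagnosis"])
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    print(f"{col}: p = {p_value:.4f}")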

Country: p = 0.0000
Gender: p = 0.7156
Physical Activity Level: p = 0.7007
Smoking Status: p = 0.5682
Alcohol Consumption: p = 0.2818
Diabetes: p = 0.4721
Hypertension: p = 0.7544
Cholesterol Level: p = 0.5719
Family History: p = 0.0000
Depression Level: p = 0.7476
Sleep Quality: p = 0.9543
Dietary Habits: p = 0.4662
Air Pollution Exposure: p = 0.4601
Employment Status: p = 0.2761
Marital Status: p = 0.9126
Genetic Risk Factor (APOE-ε4 allele): p = 0.0000
Social Engagement Level: p = 0.6845
Income Level: p = 0.1653
Stress Levels: p = 0.3603
Urban vs Rural Living: p = 0.2665
Alzheimer's Diagnosis: p = 0.0000


Chi-square tests indicate that Country, Family History, and the Genetic Risk Factor (APOE-ε4 allele) are significantly associated with Alzheimer's diagnosis (p < 0.0001 in each case, printed as 0.0000). These factors could therefore be important predictors for identifying individuals at risk. The final row, Alzheimer's Diagnosis, compares the target with itself; it is trivially significant and should be ignored. None of the remaining variables, including Gender, Physical Activity Level, and Income Level, reaches significance at the 0.05 level, so they are unlikely to contribute much to predictive models based on this dataset.
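
A chi-square p-value says whether an association exists, not how strong it is. One common companion measure is Cramér's V, which rescales the chi-square statistic to the range 0 to 1. The sketch below is an illustration under stated assumptions rather than part of the original analysis: `cramers_v` is a hypothetical helper, the toy columns are synthetic, and `correction=False` is chosen so that a perfect 2x2 association yields exactly 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        # Undefined when either variable has a single level
        return float("nan")
    # Yates' continuity correction is disabled so V is not shrunk for 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * min_dim)))


# Synthetic toy columns (assumed values, not the real dataset)
toy = pd.DataFrame({
    "feature_perfect": ["a", "b"] * 50,
    "feature_unrelated": ["a", "a", "b", "b"] * 25,
    "diagnosis": ["Yes", "No"] * 50,
})

v_perfect = cramers_v(toy["feature_perfect"], toy["diagnosis"])
v_unrelated = cramers_v(toy["feature_unrelated"], toy["diagnosis"])
print(f"perfect association: V = {v_perfect:.2f}")
print(f"no association:      V = {v_unrelated:.2f}")
```

Applied inside the loop over `categorical_features`, a helper like this would show whether the significant results for Country, Family History, and the APOE-ε4 allele correspond to strong or weak associations.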

### Feature Selection

##### So far, the tests identify four important variables: Age, Country, Family History, and the Genetic Risk Factor (APOE-ε4 allele).

### Prepare data for modeling

In [36]:
from sklearn.preprocessing import StandardScaler

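# One-hot encode the categorical variables; drop_first removes one level per
# variable so the dummy columns are not perfectly collinear
df_encoded = pd.get_dummies(df, drop_first=True)

# Standardize Age and BMI; the other numeric columns keep their original scale.
# Note: the scaler is fit on all rows here, before the train/test split.
scaler = StandardScaler()
df_encoded[['Age', 'BMI']] = scaler.fit_transform(df_encoded[['Age', 'BMI']])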
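
The cell above fits the scaler on the full dataset, so statistics from the future test rows leak into preprocessing. A possible alternative, sketched here on synthetic data, wraps the preprocessing and the model in a scikit-learn `Pipeline` so the scaler and encoder are fit on the training rows only. The frame `toy`, its column values, and the step names are assumptions made for illustration; `OneHotEncoder(handle_unknown="ignore")` is used instead of `drop_first` so that unseen categories in the test set do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "Age": rng.integers(50, 95, size=n),
    "BMI": rng.normal(27, 4, size=n).round(1),
    "Family History": rng.choice(["Yes", "No"], size=n),
})
# Make the toy target depend loosely on Age so there is something to learn
toy["target"] = (toy["Age"] + rng.normal(0, 10, size=n) > 72).astype(int)

numeric_cols = ["Age", "BMI"]
categorical_cols = ["Family History"]

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="target"), toy["target"], test_size=0.2, random_state=42
)

# fit() learns the scaling and encoding from the training rows only
pipe.fit(X_tr, y_tr)
print(f"Toy accuracy: {pipe.score(X_te, y_te):.4f}")
```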

### Model Data

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Separate the features from the target
X = df_encoded.drop("Alzheimer's Diagnosis", axis=1)  # Features
y = df_encoded["Alzheimer's Diagnosis"]  # Target

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model and fit it (max iterations = 1000)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model
print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Accuracy: 0.7145
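
An accuracy figure is easier to interpret next to a trivial baseline, because a model that always predicts the majority class can already score well when the classes are imbalanced. The sketch below shows one way to make that comparison with scikit-learn's `DummyClassifier`. It runs on synthetic data from `make_classification`, and the 60/40 class split is an assumption for illustration, not a property of the Alzheimer's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (assumed 60/40 split, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.6, 0.4], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, clf.predict(X_te))
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")

# Per-class precision and recall show what accuracy alone hides
print(classification_report(y_te, clf.predict(X_te)))
```

Running the same comparison with `X_train`, `X_test`, `y_train`, and `y_test` from the cell above would show how much of the reported accuracy comes from the model rather than from the class balance.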
