## Cognitive Status, Demographics, Race, and PMI-Associated Genes

This Python analysis addresses two questions:

1. Demographic and Racial Associations with Cognitive Status: Logistic regression models the association of sex, APOE genotype, years of education, age at death, and race with the presence of dementia.
2. Post-Mortem Interval (PMI) and Gene Expression: Per-gene linear regression with Benjamini-Hochberg FDR correction identifies genes whose expression is significantly associated with the time elapsed after death.
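
As a reading aid for the logistic models, the sketch below shows how a logit coefficient translates into a probability and an odds ratio. The coefficient values are hypothetical placeholders chosen for illustration; they are not estimates from this dataset, and the helper names `logistic` and `odds_ratio` are not part of the original analysis.

```python
import math


def logistic(log_odds):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(coef):
    """Odds ratio implied by a one-unit increase in a predictor."""
    return math.exp(coef)


# Hypothetical values for illustration only (NOT fitted estimates).
intercept = -1.0
beta_apoe = 0.7  # assumed effect per additional APOE e4 allele

p_zero_alleles = logistic(intercept)
p_one_allele = logistic(intercept + beta_apoe)

print(f"P(dementia | 0 e4 alleles) = {p_zero_alleles:.3f}")
print(f"P(dementia | 1 e4 allele)  = {p_one_allele:.3f}")
print(f"Odds ratio per e4 allele   = {odds_ratio(beta_apoe):.3f}")
```

A positive coefficient raises the predicted probability, and exponentiating it gives the multiplicative change in odds per unit of the predictor.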

In [None]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
from scipy.stats import spearmanr

# Load donor metadata
metadata = pd.read_excel("sea_ad_cohort_donor_metadata_072524.xlsx")

# Define mapping for categorical variables.
# Binary fields are coded 0/1; APOE genotype is coded as the number of e4 alleles (0, 1, or 2).
category_mapping = {
    "Dementia": 1,
    "No dementia": 0,
    "Checked": 1,
    "Unchecked": 0,
    "Female": 1,
    "Male": 0,
    "4/4": 2,
    "2/4": 1,
    "3/4": 1,
    "2/2": 0,
    "3/3": 0,
    "2/3": 0
}

# Apply mapping to categorical columns
categorical_columns = [
    "Cognitive Status", 
    "Race (choice=White)", 
    "Race (choice=Black/ African American)",
    "Race (choice=Asian)", 
    "Race (choice=American Indian/ Alaska Native)", 
    "Race (choice=Native Hawaiian or Pacific Islander)",
    "APOE Genotype",
    "Sex"
]

for col in categorical_columns:
    # Values absent from category_mapping (including missing entries) become NaN
    metadata[col] = metadata[col].map(category_mapping)

# Select relevant columns for analysis
feature_cols = [
    "Sex",
    "APOE Genotype",
    "Race (choice=White)",
    "Race (choice=Black/ African American)",
    "Race (choice=Asian)",
    "Race (choice=American Indian/ Alaska Native)",
    "Race (choice=Native Hawaiian or Pacific Islander)",
    "Years of education",
    "Age at Death",
    "Cognitive Status"
]

# Extract and rename features for analysis
features_df = metadata[feature_cols].rename(columns={
    "Cognitive Status": "Cognitive_Status",
    "APOE Genotype": "APOE_Genotype",
    "Age at Death": "Age_Death",
    "Years of education": "Years_education",
    "Race (choice=White)": "Race_White",
    "Race (choice=Black/ African American)": "Race_Black",
    "Race (choice=Asian)": "Race_Asian",
    "Race (choice=American Indian/ Alaska Native)": "Race_AmIndian",
    "Race (choice=Native Hawaiian or Pacific Islander)": "Race_Hawaiian"
})

# Model 1: Demographic factors associated with cognitive status
model1 = smf.logit(
    "Cognitive_Status ~ Sex + APOE_Genotype + Years_education + Age_Death", 
    data=features_df
).fit()

print("Model 1: Demographics and Cognitive Status")
print("="*70)
print(model1.summary())
print("\n")

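# Model 2: Race factors associated with cognitive status
# (L1-penalized fit; a scalar alpha applies the same penalty weight to every term, including the constant)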
X = sm.add_constant(features_df[['Race_White', 'Race_Black', 'Race_Asian', 
                                'Race_AmIndian', 'Race_Hawaiian']])
model2 = sm.Logit(features_df['Cognitive_Status'], X).fit_regularized(method='l1', alpha=0.1)

print("Model 2: Race and Cognitive Status")
print("="*70)
print(model2.summary())
print("\n")

Optimization terminated successfully.
         Current function value: 0.654460
         Iterations 5
Model 1: Demographics and Cognitive Status
                           Logit Regression Results                           
Dep. Variable:       Cognitive_Status   No. Observations:                   84
Model:                          Logit   Df Residuals:                       79
Method:                           MLE   Df Model:                            4
Date:                Sun, 27 Apr 2025   Pseudo R-squ.:                 0.05581
Time:                        12:09:58   Log-Likelihood:                -54.975
converged:                       True   LL-Null:                       -58.224
Covariance Type:            nonrobust   LLR p-value:                    0.1648
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -0.8052      3.278     -0.246      0
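
The fit statistics in the Model 1 summary can be cross-checked by hand from the two log-likelihoods it reports. The sketch below copies those printed values and recomputes McFadden's pseudo R-squared, the likelihood-ratio statistic with its p-value, and the Wald z for the intercept. The helper `chi2_sf_even_df` is a hypothetical name introduced here; it uses a closed form that only holds for an even number of degrees of freedom, which happens to apply because Df Model is 4.

```python
import math

# Values copied from the Model 1 summary output.
ll_model = -54.975   # Log-Likelihood
ll_null = -58.224    # LL-Null
df_model = 4         # Df Model

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic: 2 * (LL_model - LL_null)
lr_stat = 2 * (ll_model - ll_null)


def chi2_sf_even_df(x, df):
    """Chi-square survival function; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


llr_pvalue = chi2_sf_even_df(lr_stat, df_model)

# Wald z statistic for the intercept: coef / std err
intercept_z = -0.8052 / 3.278

print(f"Pseudo R-squared: {pseudo_r2:.4f}")
print(f"LR statistic:     {lr_stat:.3f}")
print(f"LLR p-value:      {llr_pvalue:.4f}")
print(f"Intercept z:      {intercept_z:.3f}")
```

The recomputed values agree with the printed summary up to rounding of the log-likelihoods. An LLR p-value of about 0.16 means the four demographic predictors together do not improve significantly on the intercept-only model in this sample of 84 donors.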

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

# Load data
expr_df = pd.read_csv("pseudobulk_data.csv", index_col=0)  # Each row is a donor, each column is a gene
meta_df = pd.read_csv("meta_extracted.csv")  # Contains "Donor ID" and "PMI" columns

# Merge expression and PMI information (inner join keeps only donors present in both tables)
meta_df = meta_df.set_index("Donor ID")
combined = expr_df.join(meta_df, how='inner')

# Perform linear regression for each gene (expression ~ PMI)
X = sm.add_constant(combined["PMI"])  # Intercept plus PMI; identical for every gene
results = []
for gene in expr_df.columns:
    model = sm.OLS(combined[gene], X).fit()
    results.append((gene, model.pvalues["PMI"]))

# Organize results and perform multiple testing correction
results_df = pd.DataFrame(results, columns=["Gene", "P_Value"])
results_df["FDR"] = multipletests(results_df["P_Value"], method='fdr_bh')[1]

# Mark significance
results_df["Significant"] = results_df["FDR"] < 0.05

# Count the number of significant genes
significant_count = results_df["Significant"].sum()

# Print the number of significant genes
print(f"Number of significant genes (FDR < 0.05): {significant_count}")

# Save the list of significant genes
significant_genes = results_df.loc[results_df["Significant"], "Gene"]
significant_genes.to_csv("significant_genes.csv", index=False, header=True)

Number of significant genes (FDR < 0.05): 73
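
To make the correction step concrete, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment that `multipletests(..., method='fdr_bh')` applies. The function name `benjamini_hochberg` and the toy p-values are illustrative assumptions; they are not taken from the gene-level results.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down; the running minimum keeps
    # the adjusted values monotone and caps them at 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for illustration only (not from the PMI analysis).
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(toy_pvalues)
significant = [q < 0.05 for q in fdr]

for p, q, s in zip(toy_pvalues, fdr, significant):
    print(f"p = {p:.3f}  FDR = {q:.5f}  significant = {s}")
```

In this toy set, the raw p-values 0.039 and 0.041 fall below 0.05 but their adjusted values do not, which is why the gene count is reported at FDR < 0.05 and not at an unadjusted threshold.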
