<a href="https://colab.research.google.com/github/chizhang9135/CMU_FALL24_SEM_TEAM/blob/master/Copy_of_Additional_Experiments_for_Sprint2_by_LanLan_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab serves as a supplementary experiment to the methods previously adopted by the team. In this approach, I selected Heart Failure as a representative condition under CVD (Cardiovascular Disease) and conducted an analysis to identify common genes between Heart Failure and Alzheimer’s Disease (AD).

The key steps of this approach are as follows:

1. Identify common genes between Heart Failure and AD.
2. Further analyze these common genes to detect those with similar expression patterns.
3. Train an AI model using the 5000+ identified common genes to predict the likelihood of AD.
4. Input the data of common genes with similar expression patterns from Heart Failure into the AI model to estimate the probability of these patients developing AD.

This supplementary method was designed to address some challenges observed in previous methodologies and to provide a more focused exploration of the relationship between AD and Heart Failure.



In [None]:
!pip install scanpy
!pip install cellxgene_census
!pip install --user scikit-misc
!pip install gspread oauth2client
!pip install google-auth

Collecting scanpy
  Downloading scanpy-1.10.4-py3-none-any.whl.metadata (9.3 kB)
Collecting anndata>=0.8 (from scanpy)
  Downloading anndata-0.11.1-py3-none-any.whl.metadata (8.2 kB)
Collecting legacy-api-wrap>=1.4 (from scanpy)
  Downloading legacy_api_wrap-1.4.1-py3-none-any.whl.metadata (2.1 kB)
Collecting pynndescent>=0.5 (from scanpy)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting session-info (from scanpy)
  Downloading session_info-1.0.0.tar.gz (24 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting umap-learn!=0.5.0,>=0.5 (from scanpy)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting array-api-compat!=1.5,>1.4 (from anndata>=0.8->scanpy)
  Downloading array_api_compat-1.9.1-py3-none-any.whl.metadata (1.6 kB)
Collecting stdlib_list (from session-info->scanpy)
  Downloading stdlib_list-0.11.0-py3-none-any.whl.metadata (3.3 kB)
Downloading scanpy-1.10.4-py3-none-any.whl (2.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━

In [None]:
# Import Function

import urllib
import scanpy
import cellxgene_census

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_ind

import anndata
import numpy as np
import scipy.sparse

def anndata_to_numpy(adata):
  """Converts an anndata.AnnData object to a numpy array.

  Args:
    adata: The anndata.AnnData object to convert.

  Returns:
    A numpy array with the same data as the AnnData object.
  """

  # If the data is sparse, convert it to a dense array.
  if scipy.sparse.issparse(adata):
    data = adata.toarray()
  else:
    data = adata

  # If the data is empty, create a numpy array filled with zeros.
  if data.size == 0:
    return np.zeros(adata.shape)

  return data

In [None]:
# For google spreadsheet
import gspread
from google.auth import default
from google.colab import auth

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

Download AD data and then find all significant genes.

In [None]:
# Download Data
cellxgene_census.download_source_h5ad("9813a1d4-d107-459e-9b2e-7687be935f69", to_path="data.h5ad")

Downloading: 100%|██████████| 229M/229M [00:12<00:00, 19.7MB/s]


In [None]:
adata = scanpy.read_h5ad("data.h5ad")

In [None]:
adata.var_names

Index(['ENSG00000278915', 'ENSG00000168454', 'ENSG00000139180',
       'ENSG00000229177', 'ENSG00000204564', 'ENSG00000116717',
       'ENSG00000254418', 'ENSG00000114654', 'ENSG00000257894',
       'ENSG00000198398',
       ...
       'ENSG00000121940', 'ENSG00000261555', 'ENSG00000204516',
       'ENSG00000175147', 'ENSG00000267772', 'ENSG00000160799',
       'ENSG00000272264', 'ENSG00000175792', 'ENSG00000066084',
       'ENSG00000119203'],
      dtype='object', length=33091)

In [None]:
from scipy.stats import ttest_ind_from_stats

# normalize the data
scanpy.pp.normalize_total(adata, target_sum=1e4)

# Split data into Alzheimer’s and normal cohorts
alzheimers = adata[adata.obs['disease'] == 'Alzheimer disease']
normal = adata[adata.obs['disease'] == 'normal']

# Calculate mean expression for each gene in each cohort
alz_mean_expr = np.mean(alzheimers.X, axis=0).A1
norm_mean_expr = np.mean(normal.X, axis=0).A1

# Perform t-tests to find significant differences
p_values = []
# for gene_index in range(adata.shape[1]):
#     _, p_val = ttest_ind(alzheimers[:, gene_index].X.toarray(), normal[:, gene_index].X.toarray())
#     p_values.append(p_val)
# Perform t-tests for all genes at once (vectorized)
alzheimers_np = anndata_to_numpy(alzheimers.X)
normal_np = anndata_to_numpy(normal.X)
_, p_values = ttest_ind(alzheimers_np, normal_np)

# Organize results in a DataFrame
gene_names = adata.var_names
results_df = pd.DataFrame({
    'Gene': gene_names,
    'Alzheimer_Mean': alz_mean_expr,
    'Normal_Mean': norm_mean_expr,
    'p_value': p_values
})

# Filter significant genes (e.g., p < 0.05)
significant_genes_ad = results_df[results_df['p_value'] < 0.05]

In [None]:
print(significant_genes_ad)

                  Gene  Alzheimer_Mean  Normal_Mean       p_value
4      ENSG00000204564        0.563652     0.492174  7.313454e-04
6      ENSG00000254418        0.027734     0.041033  1.146894e-02
8      ENSG00000257894        1.018200     0.915008  5.116708e-04
10     ENSG00000085117        0.048355     0.021033  1.619154e-05
11     ENSG00000092140        0.586979     0.537454  2.293741e-02
...                ...             ...          ...           ...
33080  ENSG00000114786        0.082630     0.057181  3.365352e-03
33084  ENSG00000175147        0.055529     0.091432  2.385167e-06
33085  ENSG00000267772        0.003787     0.000482  2.931256e-02
33086  ENSG00000160799        0.552677     0.493750  4.499366e-03
33089  ENSG00000066084        1.495923     1.777048  6.328411e-16

[11057 rows x 4 columns]


In [None]:
len(significant_genes_ad.Gene.to_list())

11057

Store all AD significant genes into speadsheet in A1
Google sheets: [link text](https://docs.google.com/spreadsheets/d/1hx3zqEDFQyarcQOVpeHuSduayPxV35UCEg2snjcRZkY/edit?usp=sharing)

In [None]:
# Replace 'your_spreadsheet_name' with the actual name of your spreadsheet
sh = gc.open('SEM genes')

# Select the worksheet (tab) where you want to store the array
# Replace 'your_worksheet_name' with the actual name of the worksheet
worksheet = sh.worksheet('name1')

In [None]:
data_to_insert = [[element] for element in significant_genes_ad.Gene.to_list()]
worksheet.update('A1', data_to_insert)

  worksheet.update('A1', data_to_insert)


{'spreadsheetId': '1DvltWJcjen71N_WWRVnt5CuKvpqB5IJgSqomgJHY-QM',
 'updatedRange': 'name1!A1:A11066',
 'updatedRows': 11066,
 'updatedColumns': 1,
 'updatedCells': 11066}

Download heart failure data and find all significant genes.

In [None]:
# Download Data
cellxgene_census.download_source_h5ad("bab7432a-5cfe-45ea-928c-422d03c45cdd", to_path="heart_data.h5ad")

Downloading: 100%|██████████| 832M/832M [00:48<00:00, 18.1MB/s]


In [None]:
heart_adata_whole = scanpy.read_h5ad("heart_data.h5ad")

In [None]:
heart_adata_whole

AnnData object with n_obs × n_vars = 180956 × 27410
    obs: 'orig_cluster', 'orig_sub_cluster', 'broad_lineage', 'author_cell_type', 'dev_state', 'subtype', 'precisest_label', 'tissue_id', 'batch', 'size_factor', 'donor_id', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'is_primary_data', 'author_stage', 'tissue_fragment', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'batch_condition', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_liger', 'X_umap2d', 'X_umap3d'

In [None]:
heart_adata = heart_adata_whole[:, :2000]

In [None]:
from scipy.stats import ttest_ind_from_stats

# normaliza the data
scanpy.pp.normalize_total(heart_adata, target_sum=1e4)

# Split data into heart failure’s and normal cohorts
heart_failure = heart_adata[heart_adata.obs['disease'] == 'heart failure']
heart_normal = heart_adata[heart_adata.obs['disease'] == 'normal']

# Perform t-tests to find significant differences
p_values_heart = []
heart_failure_np = anndata_to_numpy(heart_failure.X)
heart_normal_np = anndata_to_numpy(heart_normal.X)
_, p_values_heart = ttest_ind(heart_failure_np, heart_normal_np)

# Organize results in a DataFrame
heart_gene_names = heart_adata.var_names
heart_results_df = pd.DataFrame({
    'Gene': heart_gene_names,
    'p_value': p_values_heart
})

# Filter significant genes (e.g., p < 0.05)
heart_significant_genes_ad = heart_results_df[heart_results_df['p_value'] < 0.05]

  view_to_actual(adata)


In [None]:
print(heart_significant_genes_ad)

Store all heart failure significant genes into speadsheet in another speadsheet

In [None]:
# Replace 'your_spreadsheet_name' with the actual name of your spreadsheet
sh = gc.open('SEM genes')

# Select the worksheet (tab) where you want to store the array
# Replace 'your_worksheet_name' with the actual name of the worksheet
worksheet = sh.worksheet('name2')

In [None]:
data_to_insert = [[element] for element in heart_significant_genes_ad.Gene.to_list()]
worksheet.update('A1', data_to_insert)

Read gene from name1 and name2 then we find the overlap gene in the name3

In [None]:
# For google spreadsheet
import gspread
from google.auth import default
from google.colab import auth

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

In [None]:
# Open the spreadsheet by its name
spreadsheet = gc.open('SEM genes')

# Select the worksheet you want to read from
worksheet = spreadsheet.worksheet('name1')

# Get all values from a specific column (e.g., column 'A')
ad_column_values = worksheet.col_values(1)  # 1 represents column A, 2 for B, and so on

# Remove the header row if present
ad_column_values = column_values[1:]  # Start from the second element to skip the header

# Select the worksheet you want to read from
worksheet = spreadsheet.worksheet('name2')

# Get all values from a specific column (e.g., column 'A')
heart_column_values = worksheet.col_values(1)  # 1 represents column A, 2 for B, and so on

# Remove the header row if present
heart_column_values = column_values[1:]  # Start from the second element to skip the header

common_gene = list(set(ad_column_values).intersection(heart_column_values))

Store those common gene into name3

In [None]:
# Replace 'your_spreadsheet_name' with the actual name of your spreadsheet
sh = gc.open('SEM genes')

# Select the worksheet (tab) where you want to store the array
# Replace 'your_worksheet_name' with the actual name of the worksheet
worksheet = sh.worksheet('name3')

In [None]:
data_to_insert = [[element] for element in common_gene]
worksheet.update('A1', data_to_insert)

  worksheet.update('A1', data_to_insert)


{'spreadsheetId': '1DvltWJcjen71N_WWRVnt5CuKvpqB5IJgSqomgJHY-QM',
 'updatedRange': 'name3!A1:A11064',
 'updatedRows': 11064,
 'updatedColumns': 1,
 'updatedCells': 11064}

Continue to find common gene with similar expression pattern

In [None]:
# For google spreadsheet
import gspread
from google.auth import default
from google.colab import auth

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Open the spreadsheet by its name
spreadsheet = gc.open('SEM genes')

# Select the worksheet you want to read from
worksheet = spreadsheet.worksheet('name3')

# Get all values from a specific column (e.g., column 'A')
common_gene = worksheet.col_values(1)  # 1 represents column A, 2 for B, and so on

In [None]:
adata = scanpy.read_h5ad("data.h5ad")

In [None]:
heart_adata_whole = scanpy.read_h5ad("heart_data.h5ad")

In [None]:
adata_selected_genes = adata[:,adata.var.index.isin(common_gene)]

In [None]:
heart_adata_selected_genes = heart_adata_whole[:,heart_adata_whole.var.index.isin(common_gene)]

In [None]:
from scipy.stats import ttest_ind_from_stats

# normaliza the data
scanpy.pp.normalize_total(adata_selected_genes, target_sum=1e4)

# Split data into Alzheimer’s and normal cohorts
alzheimers = adata_selected_genes[adata_selected_genes.obs['disease'] == 'Alzheimer disease']
normal = adata_selected_genes[adata_selected_genes.obs['disease'] == 'normal']

# Calculate mean expression for each gene in each cohort
alz_mean_expr = np.mean(alzheimers.X, axis=0).A1
norm_mean_expr = np.mean(normal.X, axis=0).A1

# Organize results in a DataFrame
gene_names = adata_selected_genes.var_names
results_df = pd.DataFrame({
    'Gene': gene_names,
    'Alzheimer_Mean': alz_mean_expr,
    'Normal_Mean': norm_mean_expr,
})

  view_to_actual(adata)


In [None]:
# normaliza the data
scanpy.pp.normalize_total(heart_adata_selected_genes, target_sum=1e4)

# Split data into Heart failure’s and normal cohorts
heart_failure = heart_adata_selected_genes[heart_adata_selected_genes.obs['disease'] == 'heart failure']
heart_normal = heart_adata_selected_genes[heart_adata_selected_genes.obs['disease'] == 'normal']

# Calculate mean expression for each gene in each cohort
heart_failure_mean_expr = np.mean(heart_failure.X, axis=0).A1
heart_normal_mean_expr = np.mean(heart_normal.X, axis=0).A1

# Organize results in a DataFrame
heart_gene_names = heart_adata_selected_genes.var_names
heart_results_df = pd.DataFrame({
    'Gene': heart_gene_names,
    'Heart_Failure_Mean': heart_failure_mean_expr,
    'Heart_Normal_Mean': heart_normal_mean_expr,
})

In [None]:
print(heart_results_df[heart_results_df['Gene'] == 'ENSG00000139973'].iloc[0, 1])

0.24822056


Compare the common genes' expression pattern in AD and CVD

In [None]:
final_common = []

for name in heart_gene_names:
  ad_value1 = results_df[results_df['Gene'] == name].iloc[0, 1]
  ad_value2 = results_df[results_df['Gene'] == name].iloc[0, 2]
  heart_value1 = heart_results_df[heart_results_df['Gene'] == name].iloc[0, 1]
  heart_value2 = heart_results_df[heart_results_df['Gene'] == name].iloc[0, 2]
  if ((ad_value1 > ad_value2 and heart_value1 > heart_value2) or (ad_value1 < ad_value2 and heart_value1 < heart_value2)):
    final_common.append(name)

['ENSG00000272512', 'ENSG00000230415', 'ENSG00000162572', 'ENSG00000131584', 'ENSG00000127054', 'ENSG00000107404', 'ENSG00000175756', 'ENSG00000215915', 'ENSG00000160075', 'ENSG00000272106', 'ENSG00000197530', 'ENSG00000248333', 'ENSG00000008128', 'ENSG00000215790', 'ENSG00000078369', 'ENSG00000162585', 'ENSG00000157916', 'ENSG00000157911', 'ENSG00000149527', 'ENSG00000157870', 'ENSG00000215912', 'ENSG00000236948', 'ENSG00000069424', 'ENSG00000097021', 'ENSG00000162408', 'ENSG00000162413', 'ENSG00000007923', 'ENSG00000171735', 'ENSG00000237728', 'ENSG00000049246', 'ENSG00000238290', 'ENSG00000162426', 'ENSG00000180758', 'ENSG00000234546', 'ENSG00000188807', 'ENSG00000173614', 'ENSG00000054523', 'ENSG00000160049', 'ENSG00000142655', 'ENSG00000171824', 'ENSG00000198793', 'ENSG00000116661', 'ENSG00000116685', 'ENSG00000083444', 'ENSG00000116688', 'ENSG00000175147', 'ENSG00000171729', 'ENSG00000162458', 'ENSG00000037637', 'ENSG00000219481', 'ENSG00000074964', 'ENSG00000117154', 'ENSG000001

In [None]:
print(len(final_common))

5196


store final common gene in name4

In [None]:
# Replace 'your_spreadsheet_name' with the actual name of your spreadsheet
sh = gc.open('SEM genes')

# Select the worksheet (tab) where you want to store the array
# Replace 'your_worksheet_name' with the actual name of the worksheet
worksheet = sh.worksheet('name4')

In [None]:

data_to_insert = [[element] for element in final_common]
worksheet.update('A1', data_to_insert)

  worksheet.update('A1', data_to_insert)


{'spreadsheetId': '1DvltWJcjen71N_WWRVnt5CuKvpqB5IJgSqomgJHY-QM',
 'updatedRange': 'name4!A1:A5196',
 'updatedRows': 5196,
 'updatedColumns': 1,
 'updatedCells': 5196}

Train a model for prediction of AD

In [None]:
# For google spreadsheet
import gspread
from google.auth import default
from google.colab import auth

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Open the spreadsheet by its name
spreadsheet = gc.open('SEM genes')

# Select the worksheet you want to read from
worksheet = spreadsheet.worksheet('name4')

# Get all values from a specific column (e.g., column 'A')
final_common = worksheet.col_values(1)  # 1 represents column A, 2 for B, and so on

In [None]:
adata = scanpy.read_h5ad("data.h5ad")

In [None]:
heart_adata_whole = scanpy.read_h5ad("heart_data.h5ad")

In [None]:
adata_selected_genes = adata[:,adata.var.index.isin(final_common)]

In [None]:
adata_selected_genes

View of AnnData object with n_obs × n_vars = 23197 × 5196
    obs: 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'SORT', 'Amyloid', 'Age', 'RIN', 'nCount_SCT', 'nFeature_SCT', 'nCount_Exon', 'nFeature_Exon', 'PMI', 'Braak', 'Sample.ID', 'Cell.Types', 'tissue_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'donor_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_pca', 'X_umap'

In [None]:
heart_adata_selected_genes = heart_adata_whole[:,heart_adata_whole.var.index.isin(final_common)]

In [None]:
heart_adata_selected_genes

View of AnnData object with n_obs × n_vars = 180956 × 5196
    obs: 'orig_cluster', 'orig_sub_cluster', 'broad_lineage', 'author_cell_type', 'dev_state', 'subtype', 'precisest_label', 'tissue_id', 'batch', 'size_factor', 'donor_id', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'is_primary_data', 'author_stage', 'tissue_fragment', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'batch_condition', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_liger', 'X_umap2d', 'X_umap3d'

In [None]:
import numpy as np
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from scipy.sparse import issparse

# Step 0: order it
# Sort the .var DataFrame by the 'feature_name' column
adata_selected_genes.var = adata_selected_genes.var.sort_values(by='feature_name')

# Reorder the columns of the expression matrix (X) to match the new order in .var
adata_selected_genes = adata_selected_genes[:, adata_selected_genes.var.index]

# Step 1: Create binary target variable (y) for 'Alzheimer disease'
y = (adata_selected_genes.obs['disease'] == 'Alzheimer disease').astype(int)  # Binary encoding: 1 for AD, 0 otherwise

# Step 2: Prepare features (X)
# Assuming adata.X contains gene expression data
X = adata_selected_genes.X.toarray() if hasattr(adata_selected_genes.X, 'toarray') else adata_selected_genes.X  # Convert sparse to dense if needed

# Step 3: Scale features for better performance
scaler = StandardScaler()
X_combined = scaler.fit_transform(X) #X_combined)

# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42, stratify=y)

# Step 5: Train a logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Optional: Display model coefficients for interpretability (depending on number of features)
print("Coefficients of features (diseases and genes):")
print(model.coef_)


Model Accuracy: 0.92
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.92      0.92      2293
           1       0.92      0.92      0.92      2347

    accuracy                           0.92      4640
   macro avg       0.92      0.92      0.92      4640
weighted avg       0.92      0.92      0.92      4640

Coefficients of features (diseases and genes):
[[ 0.15372041 -0.15641501  0.04543275 ... -0.10740753 -0.03403764
  -0.15549407]]


Another ML Model: Neural Network

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Step 0: order it
# Sort the .var DataFrame by the 'feature_name' column
adata_selected_genes.var = adata_selected_genes.var.sort_values(by='feature_name')

# Reorder the columns of the expression matrix (X) to match the new order in .var
adata_selected_genes = adata_selected_genes[:, adata_selected_genes.var.index]

# Step 1: Create binary target variable (y) for 'Alzheimer disease'
y = (adata_selected_genes.obs['disease'] == 'Alzheimer disease').astype(int)  # Binary encoding: 1 for AD, 0 otherwise

# Step 2: Prepare features (X)
# Assuming adata.X contains gene expression data
X = adata_selected_genes.X.toarray() if hasattr(adata_selected_genes.X, 'toarray') else adata_selected_genes.X  # Convert sparse to dense if needed

# Step 3: Scale features for better performance
scaler = StandardScaler()
X_combined = scaler.fit_transform(X) #X_combined)

# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42, stratify=y)

# Step 4: Build a simple neural network model
model = Sequential([
    Dense(64, input_shape=(X_train.shape[1],), activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')  # Binary classification output
])

# Step 5: Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Step 6: Train the model
history = model.fit(X_train, y_train, validation_split=0.2, epochs=10, batch_size=128, verbose=1)

# Step 7: Evaluate the model
y_pred = (model.predict(X_test) > 0.5).astype(int)  # Convert probabilities to binary predictions
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
[1m116/116[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 14ms/step - accuracy: 0.7323 - loss: 0.5190 - val_accuracy: 0.9184 - val_loss: 0.2058
Epoch 2/10
[1m116/116[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 15ms/step - accuracy: 0.9430 - loss: 0.1468 - val_accuracy: 0.9353 - val_loss: 0.1653
Epoch 3/10
[1m116/116[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 14ms/step - accuracy: 0.9795 - loss: 0.0596 - val_accuracy: 0.9313 - val_loss: 0.1873
Epoch 4/10
[1m116/116[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 23ms/step - accuracy: 0.9921 - loss: 0.0291 - val_accuracy: 0.9324 - val_loss: 0.2365
Epoch 5/10
[1m116/116[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.9928 - loss: 0.0218 - val_accuracy: 0.9310 - val_loss: 0.2713
Epoch 6/10
[1m116/116[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.9949 - loss: 0.0166 - val_accuracy: 0.9267 - val_loss: 0.3073
Epoch 7/10
[1m116/116

AttributeError: 'Sequential' object has no attribute 'coef_'

In [None]:
heart_failure = heart_adata_selected_genes[heart_adata_selected_genes.obs['disease'] == 'heart failure']
heart_normal = heart_adata_selected_genes[heart_adata_selected_genes.obs['disease'] == 'normal']

In [None]:
heart_failure

View of AnnData object with n_obs × n_vars = 4594 × 5196
    obs: 'orig_cluster', 'orig_sub_cluster', 'broad_lineage', 'author_cell_type', 'dev_state', 'subtype', 'precisest_label', 'tissue_id', 'batch', 'size_factor', 'donor_id', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'is_primary_data', 'author_stage', 'tissue_fragment', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'batch_condition', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_liger', 'X_umap2d', 'X_umap3d'

In [None]:
heart_normal

View of AnnData object with n_obs × n_vars = 163283 × 5196
    obs: 'orig_cluster', 'orig_sub_cluster', 'broad_lineage', 'author_cell_type', 'dev_state', 'subtype', 'precisest_label', 'tissue_id', 'batch', 'size_factor', 'donor_id', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'is_primary_data', 'author_stage', 'tissue_fragment', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'batch_condition', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_liger', 'X_umap2d', 'X_umap3d'

In [None]:
# Step 0: order it
# Sort the .var DataFrame by the 'feature_name' column
heart_failure.var = heart_failure.var.sort_values(by='feature_name')

# Reorder the columns of the expression matrix (X) to match the new order in .var
heart_failure = heart_failure[:, heart_failure.var.index]

# Step 1: Create binary target variable (y) for 'heart failure'
y = (heart_failure.obs['disease'] == 'heart failure').astype(int)  # Binary encoding: 1 for AD, 0 otherwise

# Step 2: Prepare features (X)
# Assuming adata.X contains gene expression data
X = heart_failure.X.toarray() if hasattr(heart_failure.X, 'toarray') else heart_failure.X  # Convert sparse to dense if needed

# Step 3: Scale features for better performance
scaler = StandardScaler()
X_combined = scaler.fit_transform(X)

# Step 4: Evaluate the model
#y_pred = model.predict(X_combined)
y_pred = (model.predict(X_combined) > 0.5).astype(int)  # Convert probabilities to binary predictions
accuracy = accuracy_score(y, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y, y_pred))

[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step
Model Accuracy: 0.54
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.54      0.70      4594

    accuracy                           0.54      4594
   macro avg       0.50      0.27      0.35      4594
weighted avg       1.00      0.54      0.70      4594



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The "Model Accuracy: 0.54" above actually means the possibility for heart failure patients to get AD.


Comprehensive Findings from AD and CVD Gene Expression Analysis Using AI Model

1. AD Model Training:

A machine learning model was trained using gene expression data of over 5,000
genes that showed significant differences between Alzheimer’s Disease (AD) and normal samples. The goal was to predict the probability of a sample being classified as AD based on gene expression patterns.

The model achieved an impressive accuracy of 92%, with high precision, recall, and F1-scores across both classes (AD and non-AD). This indicates the model's robustness and reliability in distinguishing AD-related samples.
Application to CVD Patients:

2. Application to CVD Patients:

The trained model was applied to gene expression profiles of Cardiovascular Disease (CVD) patients. For this analysis, only genes with common expression patterns across AD and CVD were used, highlighting shared genetic susceptibilities or pathways.

The model predicted that CVD patients had a 54% probability of being classified as having AD, based on the overlapping gene expression patterns.

3. Insights from Feature Importance:

The model's coefficients for features (genes) reveal the relative importance of specific genes in predicting AD. These insights can guide further investigation into the shared genetic mechanisms and their biological relevance in both diseases.

4. Significance of Results:

These findings demonstrate a strong overlap in gene expression patterns between AD and CVD, suggesting potential shared pathological pathways or genetic risk factors.

The high accuracy of the model provides confidence in its predictions and supports its potential use in identifying at-risk populations for AD among CVD patients.

This preliminary work paves the way for further exploration of the genetic relationship between AD and CVD, potentially aiding early diagnosis and personalized interventions.