| Column index | Column names            | How the information was obtained | Description of the column                                                                                                                                                                                                |
| :----------- | :---------------------- | :------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1            | Pubmed ID               | Extract                          | Pubmed PMID of the publication                                                                                                                                                                                           |
| 2            | Record Type             | Infer                            | SNP-based or Gene-based association                                                                                                                                                                                      |
| 3            | SNP                     | Extract                          | rsID shown in the publication                                                                                                                                                                                            |
| 4            | Coordinates             | Extract                          | Chromosome and base-pair information shown in publication (note can vary across genome builds)                                                                                                                           |
| 5            | Locus                   | Extract                          | Reported gene in the association record in the results table in the publication (not the eQTL or gene-based gene)                                                                                                        |
| 6            | Reported gene           | Extract                          | Gene name for a gene-based test; NA for a SNP-based test. |
| 7            | Interactions            | Extract                          | Reported interaction results                                                                                                                                                                                             |
| 8            | Population              | Infer                            | Population description from the publication, mapped to one of the following categories: African American, Arab, Asian, Caribbean Hispanic, Caucasian, Hispanic, Non-Hispanic Caucasian, Non-Hispanic White |
| 9            | Population (detailed)   | Extract                          | Population description from the publication                                                                                                                                                                              |
| 10           | Cohort                  | Infer                            | Cohorts derived from the Cohort (detailed) column; if a consortium name was available (e.g., ADGC, CHARGE), it is used |
| 11           | Cohort (detailed)       | Extract                          | Cohort description from the publication                                                                                                                                                                                  |
| 12           | Sample size             | Extract                          | Total sample size                                                                                                                                                                                                        |
| 13           | Subset Analyzed         | Extract                          | What samples were used for analyses. This is designed for distinguishing between same SNPs with different p-values in the same data table in a publication.                                                              |
| 14           | Phenotype               | Infer                            | Derived from Phenotype (detailed) column. Choose from the following: AD, ADRD, Cognitive, Expression, Fluid biomarker, Imaging, Neuropathology, Non-ADRD, Other                                                          |
| 15           | Phenotype (detailed)    | Extract                          | Defined as outcome of the regression analyses, shown as what is described in text                                                                                                                                        |
| 16           | Association type        | Infer                            | Choose from the following: eQTL, Disease risk, Endophenotype, AAO/Survival, Pleiotropy, Cross phenotype                                                                                                                  |
| 17           | RA1                     | Extract                          | Reported Allele 1 - the first allele reported in the association record in the publication, if any                                                                                                                       |
| 18           | RA2                     | Extract                          | Reported Allele 2 - the second allele reported in the association record in the publication, if any                                                                                                                      |
| 19           | AF                      | Extract                          | Reported allele frequency across all samples (for this association record) in the publication                                                                                                                            |
| 20           | P-value                 | Extract                          | p-value reported; show corrected p-value if available                                                                                                                                                                    |
| 21           | Effect Size             | Extract                          | Effect size type (OR, Beta, FDR etc.) and value of the effect size                                                                                                                                                       |
| 22           | Confidence Interval     | Extract                          | 95% confidence interval of the effect size (if any)                                                                                                                                                                      |
| 23           | Stage                   | Infer                            | Stage of the analysis reported in the publication, e.g., "Stage n" (n=1,2,3). If nothing is reported, choose from the following: "Discovery", "Validation", "Meta-analysis", "Joint-analysis" |
| 24           | Model                   | Extract                          | Description of model: what kind of statistical model, and if the analyses were adjusted for anything                                                                                                                     |
| 25           | Imputation              | Extract                          | How the data is imputed. Choose from the following: 1000G, HapMap, HRC                                                                                                                                                   |
| 26           | View in GenomicsDB      | Compute / cross-reference        | URL link for viewing the record in NIAGADS GenomicsDB                                                                                                                                                                    |
| 27           | Nearest gene            | Compute                          | Distance from the SNP to the Locus (base-pair information included)                                                                                                                                                      |
| 28           | Most severe consequence | Compute / cross-reference        | Functional information (VEP provided by NIAGADS GenomicsDB)                                                                                                                                                              |


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import torch
from torch import nn
from google.colab import drive
import os

In [7]:
# Mount Google Drive; force_remount=True refreshes the mount if it already exists
drive.mount('/content/drive', force_remount=True)
print("Drive mounted.")

# Folder that holds the ADVP files
drive_path = "/content/drive/MyDrive/Colab_Dev/ALZ_Variant"


Mounted at /content/drive
Drive mounted.


In [8]:
# Get a list of all items in the directory
all_items = os.listdir(drive_path)

# Filter out items that are not files or end with '.ipynb'
file_names = [item for item in all_items if os.path.isfile(os.path.join(drive_path, item)) and not item.endswith('.ipynb')]

print("List of filenames (excluding .ipynb):")
for file_name in file_names:
    print(file_name)

List of filenames (excluding .ipynb):
advp.hg38.bed
advp.hg38.tsv


In [None]:
# Now that we have the file names, the next step is to read the files


In [9]:
bed_file_path = os.path.join(drive_path, "advp.hg38.bed")
bed_df = pd.read_csv(bed_file_path, sep='\t', header=None)
display(bed_df.head())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,#dbSNP_hg38_chr,dbSNP_hg38_chrStart,dbSNP_hg38_chrEnd,Top SNP,P-value,LocusName,RA 1(Reported Allele 1),nonref_allele,nonref_effect,OR_nonref,...,Study type,Study Design,Pubmed PMID,Population_map,Cohort_simple3,Sample size,Analysis group,Phenotype,Phenotype-derived,most_severe_consequence
1,chr1,6434682,6434683,rs12074379,0.00726,ESPN,T,T,NR,,...,SNP-based,Disease risk,30636644,Caucasian,"ADGC, CHS, CHARGE, HRS",10191,Plan 3 (only females),AD,AD,intron_variant
2,chr1,6434682,6434683,rs12074379,8.51E-40,NR,T,T,NR,,...,SNP-based,eQTL,30636644,Caucasian,"ADGC, CHS, CHARGE, HRS",10191,Plan 3 (only females),ESPN (ILMN_1806710) expression,Expression,intron_variant
3,chr1,8708070,8708071,rs112053331,0.0009,RERE,NR,NR,NR,,...,SNP-based,Cross phenotype,30010129,Caucasian,IGAP,54162,All,AD,AD,intron_variant
4,chr1,8708070,8708071,rs112053331,0.08392,,NR,NR,NR,,...,Gene-based,Cross phenotype,30010129,Caucasian,IGAP,54162,All,AD,AD,intron_variant


# Task
Prepare the dataframe for training a neural network model based on the provided schema and head display.

## Inspect the data

### Subtask:
Examine the column names and data types to understand the structure and content of the dataframe.


**Reasoning**:
Display the column names, data types, and the first few rows of the dataframe to understand its structure and content.



In [10]:
print("Column Names:")
print(bed_df.columns)
print("\nData Types:")
print(bed_df.dtypes)
print("\nFirst 5 Rows:")
display(bed_df.head())

Column Names:
Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20],
      dtype='int64')

Data Types:
0     object
1     object
2     object
3     object
4     object
5     object
6     object
7     object
8     object
9     object
10    object
11    object
12    object
13    object
14    object
15    object
16    object
17    object
18    object
19    object
20    object
dtype: object

First 5 Rows:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,#dbSNP_hg38_chr,dbSNP_hg38_chrStart,dbSNP_hg38_chrEnd,Top SNP,P-value,LocusName,RA 1(Reported Allele 1),nonref_allele,nonref_effect,OR_nonref,...,Study type,Study Design,Pubmed PMID,Population_map,Cohort_simple3,Sample size,Analysis group,Phenotype,Phenotype-derived,most_severe_consequence
1,chr1,6434682,6434683,rs12074379,0.00726,ESPN,T,T,NR,,...,SNP-based,Disease risk,30636644,Caucasian,"ADGC, CHS, CHARGE, HRS",10191,Plan 3 (only females),AD,AD,intron_variant
2,chr1,6434682,6434683,rs12074379,8.51E-40,NR,T,T,NR,,...,SNP-based,eQTL,30636644,Caucasian,"ADGC, CHS, CHARGE, HRS",10191,Plan 3 (only females),ESPN (ILMN_1806710) expression,Expression,intron_variant
3,chr1,8708070,8708071,rs112053331,0.0009,RERE,NR,NR,NR,,...,SNP-based,Cross phenotype,30010129,Caucasian,IGAP,54162,All,AD,AD,intron_variant
4,chr1,8708070,8708071,rs112053331,0.08392,,NR,NR,NR,,...,Gene-based,Cross phenotype,30010129,Caucasian,IGAP,54162,All,AD,AD,intron_variant


**Reasoning**:
The first row of the dataframe contains the actual column names, and the current column names are just numerical indices. All columns are of type object. I will rename the columns using the values in the first row and drop the first row to fix this issue.



In [11]:
bed_df.columns = bed_df.iloc[0]
bed_df = bed_df[1:].reset_index(drop=True)
print("Column Names after renaming:")
print(bed_df.columns)
print("\nData Types after renaming:")
print(bed_df.dtypes)
print("\nFirst 5 Rows after renaming:")
display(bed_df.head())

Column Names after renaming:
Index(['#dbSNP_hg38_chr', 'dbSNP_hg38_chrStart', 'dbSNP_hg38_chrEnd',
       'Top SNP', 'P-value', 'LocusName', 'RA 1(Reported Allele 1)',
       'nonref_allele', 'nonref_effect', 'OR_nonref', 'nearest_gene_symb',
       'Study type', 'Study Design', 'Pubmed PMID', 'Population_map',
       'Cohort_simple3', 'Sample size', 'Analysis group', 'Phenotype',
       'Phenotype-derived', 'most_severe_consequence'],
      dtype='object', name=0)

Data Types after renaming:
0
#dbSNP_hg38_chr            object
dbSNP_hg38_chrStart        object
dbSNP_hg38_chrEnd          object
Top SNP                    object
P-value                    object
LocusName                  object
RA 1(Reported Allele 1)    object
nonref_allele              object
nonref_effect              object
OR_nonref                  object
nearest_gene_symb          object
Study type                 object
Study Design               object
Pubmed PMID                object
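Population_map             object
Cohort_simple3             object
Sample size                object
Analysis group             object
Phenotype                  object
Phenotype-derived          object
most_severe_consequence    object
dtype: object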

Unnamed: 0,#dbSNP_hg38_chr,dbSNP_hg38_chrStart,dbSNP_hg38_chrEnd,Top SNP,P-value,LocusName,RA 1(Reported Allele 1),nonref_allele,nonref_effect,OR_nonref,...,Study type,Study Design,Pubmed PMID,Population_map,Cohort_simple3,Sample size,Analysis group,Phenotype,Phenotype-derived,most_severe_consequence
0,chr1,6434682,6434683,rs12074379,0.00726,ESPN,T,T,NR,,...,SNP-based,Disease risk,30636644,Caucasian,"ADGC, CHS, CHARGE, HRS",10191,Plan 3 (only females),AD,AD,intron_variant
1,chr1,6434682,6434683,rs12074379,8.509999999999999e-40,NR,T,T,NR,,...,SNP-based,eQTL,30636644,Caucasian,"ADGC, CHS, CHARGE, HRS",10191,Plan 3 (only females),ESPN (ILMN_1806710) expression,Expression,intron_variant
2,chr1,8708070,8708071,rs112053331,0.0009,RERE,NR,NR,NR,,...,SNP-based,Cross phenotype,30010129,Caucasian,IGAP,54162,All,AD,AD,intron_variant
3,chr1,8708070,8708071,rs112053331,0.08392,,NR,NR,NR,,...,Gene-based,Cross phenotype,30010129,Caucasian,IGAP,54162,All,AD,AD,intron_variant
4,chr1,11487006,11487007,rs2379135,0.0156,PTCHD2,NR,NR,NR,,...,SNP-based,Endophenotype,22245343,Caucasian,ADNI,757,All,MRI,Imaging,intron_variant
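
The rename-and-drop step can also be avoided at read time. The sketch below is an assumption about how the file could be loaded, not what this notebook ran: it passes `header=0` so that pandas takes the first line as column names and infers dtypes on its own. `sample_bed` and `read_bed_with_header` are hypothetical names, and the inline text is a tiny stand-in for `advp.hg38.bed`.

```python
import io

import pandas as pd

# Tiny stand-in for advp.hg38.bed (assumption: tab-separated, one header line).
sample_bed = (
    "#dbSNP_hg38_chr\tdbSNP_hg38_chrStart\tdbSNP_hg38_chrEnd\tTop SNP\tP-value\n"
    "chr1\t6434682\t6434683\trs12074379\t0.00726\n"
    "chr1\t8708070\t8708071\trs112053331\t0.0009\n"
)


def read_bed_with_header(source):
    """Read a tab-separated file whose first line holds the column names."""
    return pd.read_csv(source, sep="\t", header=0)


sample_df = read_bed_with_header(io.StringIO(sample_bed))
print(sample_df.dtypes)
```

With the real file, columns that mix numbers with text markers such as `NR` would still come back as `object` and would need `pd.to_numeric(..., errors='coerce')` as in the later cells.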


## Address missing values

### Subtask:
Identify and handle missing values appropriately, considering different strategies for numerical and categorical columns.


**Reasoning**:
Calculate and display the number of missing values for each column.



In [12]:
print("Missing values before handling:")
print(bed_df.isnull().sum())

Missing values before handling:
0
#dbSNP_hg38_chr               0
dbSNP_hg38_chrStart           0
dbSNP_hg38_chrEnd             0
Top SNP                       0
P-value                       8
LocusName                   102
RA 1(Reported Allele 1)     124
nonref_allele               124
nonref_effect               122
OR_nonref                  2956
nearest_gene_symb            10
Study type                    0
Study Design                  0
Pubmed PMID                   0
Population_map                0
Cohort_simple3                0
Sample size                   0
Analysis group               24
Phenotype                     0
Phenotype-derived             0
most_severe_consequence      14
dtype: int64


**Reasoning**:
Identify columns with missing values and determine their data types (or infer based on content) to prepare for handling them.



In [13]:
# Identify columns with missing values
cols_with_missing = bed_df.columns[bed_df.isnull().any()].tolist()
print("\nColumns with missing values:", cols_with_missing)

# Infer data types based on column content and schema description
# P-value, OR_nonref, Sample size are likely numerical
# LocusName, RA 1(Reported Allele 1), nonref_allele, nonref_effect, nearest_gene_symb, Analysis group, most_severe_consequence are likely categorical

numerical_cols_with_missing = ['P-value', 'OR_nonref']
categorical_cols_with_missing = [col for col in cols_with_missing if col not in numerical_cols_with_missing]

print("Inferred numerical columns with missing values:", numerical_cols_with_missing)
print("Inferred categorical columns with missing values:", categorical_cols_with_missing)


Columns with missing values: ['P-value', 'LocusName', 'RA 1(Reported Allele 1)', 'nonref_allele', 'nonref_effect', 'OR_nonref', 'nearest_gene_symb', 'Analysis group', 'most_severe_consequence']
Inferred numerical columns with missing values: ['P-value', 'OR_nonref']
Inferred categorical columns with missing values: ['LocusName', 'RA 1(Reported Allele 1)', 'nonref_allele', 'nonref_effect', 'nearest_gene_symb', 'Analysis group', 'most_severe_consequence']


**Reasoning**:
Handle missing values using appropriate strategies for numerical and categorical columns and then verify that the missing values have been handled.



In [14]:
# Handle missing values

# For numerical columns, impute with the median (more robust to outliers than mean)
for col in numerical_cols_with_missing:
    # Convert to numeric, coercing errors to NaN, before calculating median
    bed_df[col] = pd.to_numeric(bed_df[col], errors='coerce')
    median_val = bed_df[col].median()
    bed_df[col] = bed_df[col].fillna(median_val)

# For categorical columns, impute with a placeholder 'Unknown'
for col in categorical_cols_with_missing:
    bed_df[col] = bed_df[col].fillna('Unknown')

# Verify that missing values have been addressed
print("\nMissing values after handling:")
print(bed_df.isnull().sum())


Missing values after handling:
0
#dbSNP_hg38_chr            0
dbSNP_hg38_chrStart        0
dbSNP_hg38_chrEnd          0
Top SNP                    0
P-value                    0
LocusName                  0
RA 1(Reported Allele 1)    0
nonref_allele              0
nonref_effect              0
OR_nonref                  0
nearest_gene_symb          0
Study type                 0
Study Design               0
Pubmed PMID                0
Population_map             0
Cohort_simple3             0
Sample size                0
Analysis group             0
Phenotype                  0
Phenotype-derived          0
most_severe_consequence    0
dtype: int64
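
Median imputation replaces every gap in a column with the same number, so the model can no longer tell a reported value from a filled one. One common option, shown here only as a sketch on toy data, is to store a boolean flag before filling. `impute_median_with_flag` and the `_was_missing` suffix are hypothetical names, not part of this notebook.

```python
import pandas as pd


def impute_median_with_flag(df, col):
    """Median-impute `col` and record which rows were originally missing."""
    out = df.copy()
    values = pd.to_numeric(out[col], errors="coerce")  # text such as 'NR' becomes NaN
    out[f"{col}_was_missing"] = values.isna()
    out[col] = values.fillna(values.median())
    return out


# Toy column with one true gap and one non-numeric marker.
toy = pd.DataFrame({"OR_nonref": ["1.10", None, "0.90", "NR"]})
toy_imputed = impute_median_with_flag(toy, "OR_nonref")
print(toy_imputed)
```

The flag column can then be passed to the model alongside the imputed value.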


## Convert data types

### Subtask:
Ensure columns have the correct data types for numerical and categorical features. Convert columns to numerical types where necessary.


**Reasoning**:
Convert columns that should be numerical to a numerical data type using `pd.to_numeric()` and then verify the data types.



In [15]:
# Identify numerical columns based on schema and previous steps
numerical_cols = ['P-value', 'Sample size']

# Convert numerical columns to numeric type, coercing errors
for col in numerical_cols:
    bed_df[col] = pd.to_numeric(bed_df[col], errors='coerce')

# Verify the data types of all columns
print("Data Types after conversion:")
print(bed_df.dtypes)

Data Types after conversion:
0
#dbSNP_hg38_chr             object
dbSNP_hg38_chrStart         object
dbSNP_hg38_chrEnd           object
Top SNP                     object
P-value                    float64
LocusName                   object
RA 1(Reported Allele 1)     object
nonref_allele               object
nonref_effect               object
OR_nonref                  float64
nearest_gene_symb           object
Study type                  object
Study Design                object
Pubmed PMID                 object
Population_map              object
Cohort_simple3              object
Sample size                float64
Analysis group              object
Phenotype                   object
Phenotype-derived           object
most_severe_consequence     object
dtype: object
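
The listing above shows the coordinate columns (`dbSNP_hg38_chrStart`, `dbSNP_hg38_chrEnd`) still as `object`, so a later encoding step would treat each position as its own category. If numeric positions are preferred, the same `pd.to_numeric` call can be applied to them. This is a sketch on toy data with a hypothetical helper name, `to_numeric_columns`, and is not what the cells below do.

```python
import pandas as pd


def to_numeric_columns(df, cols):
    """Return a copy of `df` with `cols` converted to numeric (bad values become NaN)."""
    out = df.copy()
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


toy = pd.DataFrame({
    "dbSNP_hg38_chrStart": ["6434682", "8708070"],
    "dbSNP_hg38_chrEnd": ["6434683", "8708071"],
    "Top SNP": ["rs12074379", "rs112053331"],
})
converted = to_numeric_columns(toy, ["dbSNP_hg38_chrStart", "dbSNP_hg38_chrEnd"])
print(converted.dtypes)
```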


## Encode categorical features

### Subtask:
Convert categorical features into a numerical format suitable for the neural network model using techniques like one-hot encoding.


**Reasoning**:
Identify categorical columns, apply one-hot encoding, concatenate with numerical columns, and drop original categorical columns.



In [16]:
# Identify categorical columns (those with object dtype)
categorical_cols = bed_df.select_dtypes(include='object').columns

# Apply one-hot encoding
bed_df_encoded = pd.get_dummies(bed_df, columns=categorical_cols, drop_first=True)

# Display the first few rows of the encoded DataFrame
display(bed_df_encoded.head())

# Display the data types of the encoded DataFrame
print("\nData Types after encoding:")
print(bed_df_encoded.dtypes)

Unnamed: 0,P-value,OR_nonref,Sample size,#dbSNP_hg38_chr_chr10,#dbSNP_hg38_chr_chr11,#dbSNP_hg38_chr_chr12,#dbSNP_hg38_chr_chr13,#dbSNP_hg38_chr_chr14,#dbSNP_hg38_chr_chr15,#dbSNP_hg38_chr_chr16,...,most_severe_consequence_missense_variant,most_severe_consequence_missense_variant&splice_region_variant,most_severe_consequence_non_coding_transcript_exon_variant,most_severe_consequence_splice_region_variant,most_severe_consequence_splice_region_variant&intron_variant,most_severe_consequence_splice_region_variant&synonymous_variant,most_severe_consequence_stop_lost,most_severe_consequence_stop_retained_variant,most_severe_consequence_synonymous_variant,most_severe_consequence_upstream_gene_variant
0,0.00726,1.03,10191.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,8.51e-40,1.03,10191.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0.0009,1.03,54162.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0.08392,1.03,54162.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0.0156,1.03,757.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False



Data Types after encoding:
P-value                                                             float64
OR_nonref                                                           float64
Sample size                                                         float64
#dbSNP_hg38_chr_chr10                                                  bool
#dbSNP_hg38_chr_chr11                                                  bool
                                                                     ...   
most_severe_consequence_splice_region_variant&synonymous_variant       bool
most_severe_consequence_stop_lost                                      bool
most_severe_consequence_stop_retained_variant                          bool
most_severe_consequence_synonymous_variant                             bool
most_severe_consequence_upstream_gene_variant                          bool
Length: 9193, dtype: object
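
The width of a one-hot encoded frame follows directly from the number of distinct values per column, so it can be estimated before encoding. The sketch below uses toy data and a hypothetical helper, `one_hot_width`; it counts levels with `nunique` and subtracts one per column to mirror `drop_first=True`.

```python
import pandas as pd


def one_hot_width(df, cols, drop_first=True):
    """Number of dummy columns pd.get_dummies would create for `cols`."""
    n_levels = df[cols].nunique(dropna=True)  # NaN gets no dummy column by default
    if drop_first:
        n_levels = (n_levels - 1).clip(lower=0)
    return int(n_levels.sum())


toy = pd.DataFrame({
    "Study type": ["SNP-based", "SNP-based", "Gene-based"],
    "Population_map": ["Caucasian", "Asian", "Hispanic"],
    "Sample size": [10191, 54162, 757],
})
cat_cols = ["Study type", "Population_map"]

predicted = one_hot_width(toy, cat_cols)
# Subtract 1 for the untouched numeric column 'Sample size'.
actual = pd.get_dummies(toy, columns=cat_cols, drop_first=True).shape[1] - 1
print(predicted, actual)
```

Running `df[cols].nunique().sort_values()` on the real frame would show which columns contribute most of the width.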


## Select relevant features

### Subtask:
Based on the schema and the goal of training an AI model, select the columns that will be used as features and the target variable.


**Reasoning**:
Select the feature and target columns based on the schema and create the X and y dataframes.



In [17]:
# 1. Define a list of feature column names
feature_cols = [
    'P-value',
    'OR_nonref',
    'Sample size',
    '#dbSNP_hg38_chr',
    'dbSNP_hg38_chrStart',
    'dbSNP_hg38_chrEnd',
    'Top SNP',
    'LocusName',
    'RA 1(Reported Allele 1)',
    'nonref_allele',
    'nonref_effect',
    'Study type',
    'Study Design',
    'Pubmed PMID',
    'Population_map',
    'Cohort_simple3',
    'Analysis group',
    'Phenotype',
    'most_severe_consequence'
]

# Remove columns that are not in the encoded dataframe or are identifiers/descriptive
# We need to use the columns from the *encoded* dataframe for features
feature_cols_encoded = [col for col in bed_df_encoded.columns if col in feature_cols or any(col.startswith(f + '_') for f in feature_cols)]

# Remove specific columns identified as not relevant for prediction
irrelevant_cols = [
    'Pubmed PMID',
    'Population (detailed)', # Not in the encoded df anyway
    'Cohort (detailed)', # Not in the encoded df anyway
    'Subset Analyzed', # Not in the encoded df anyway
    'Phenotype (detailed)', # Not in the encoded df anyway
    'Model', # Not in the encoded df anyway
    'Imputation', # Not in the encoded df anyway
    'View in GenomicsDB', # Not in the encoded df anyway
    'Nearest gene', # Not in the encoded df anyway
    'Most severe consequence' # Schema name; not in the encoded df (the BED column is most_severe_consequence)
]

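# Filter out irrelevant columns from the encoded feature columns.
# Note: the prefix check only considers names still present in bed_df_encoded,
# so one-hot columns such as 'Pubmed PMID_*' are not removed by this line.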
feature_cols_final = [col for col in feature_cols_encoded if col not in irrelevant_cols and not any(col.startswith(f + '_') for f in irrelevant_cols if f in bed_df_encoded.columns)]


# 2. Define the target variable column name
target_col = 'Phenotype-derived'

# 3. Create a pandas DataFrame X containing only the selected feature columns from bed_df_encoded.
X = bed_df_encoded[feature_cols_final]

# 4. Create a pandas Series y containing the target variable column from bed_df.
y = bed_df[target_col]

# Verify the shapes of X and y
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

# Display the first few rows of X and y
print("\nFirst 5 Rows of X:")
display(X.head())

print("\nFirst 5 Rows of y:")
display(y.head())

Shape of X: (6346, 8239)
Shape of y: (6346,)

First 5 Rows of X:


Unnamed: 0,P-value,OR_nonref,Sample size,#dbSNP_hg38_chr_chr10,#dbSNP_hg38_chr_chr11,#dbSNP_hg38_chr_chr12,#dbSNP_hg38_chr_chr13,#dbSNP_hg38_chr_chr14,#dbSNP_hg38_chr_chr15,#dbSNP_hg38_chr_chr16,...,most_severe_consequence_missense_variant,most_severe_consequence_missense_variant&splice_region_variant,most_severe_consequence_non_coding_transcript_exon_variant,most_severe_consequence_splice_region_variant,most_severe_consequence_splice_region_variant&intron_variant,most_severe_consequence_splice_region_variant&synonymous_variant,most_severe_consequence_stop_lost,most_severe_consequence_stop_retained_variant,most_severe_consequence_synonymous_variant,most_severe_consequence_upstream_gene_variant
0,0.00726,1.03,10191.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,8.51e-40,1.03,10191.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0.0009,1.03,54162.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0.08392,1.03,54162.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0.0156,1.03,757.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False



First 5 Rows of y:


Unnamed: 0,Phenotype-derived
0,AD
1,Expression
2,AD
3,AD
4,Imaging
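
After `pd.get_dummies`, an original column name such as `Pubmed PMID` no longer exists in the encoded frame; only its `Pubmed PMID_<value>` dummies do. If the intent is to exclude such a source column, the match has to be made on that prefix. A minimal sketch, with a hypothetical helper `drop_encoded_sources` and made-up column names in the same style:

```python
def drop_encoded_sources(encoded_columns, sources_to_drop):
    """Remove columns that are, or were one-hot encoded from, the given sources."""
    prefixes = tuple(f"{src}_" for src in sources_to_drop)
    return [
        col for col in encoded_columns
        if col not in sources_to_drop and not col.startswith(prefixes)
    ]


toy_columns = [
    "P-value",
    "Pubmed PMID_30010129",
    "Study type_SNP-based",
    "Phenotype-derived_AD",
]
kept = drop_encoded_sources(toy_columns, ["Pubmed PMID", "Phenotype-derived"])
print(kept)
```

The trailing underscore in the prefix keeps `Phenotype` from matching `Phenotype-derived_...` columns.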


## Split the data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the feature matrix X and the target variable y into training and testing sets using train_test_split.



In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (5076, 8239)
Shape of X_test: (1270, 8239)
Shape of y_train: (5076,)
Shape of y_test: (1270,)
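
If the phenotype classes turn out to be imbalanced (not checked in this notebook), `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both parts. The sketch below uses toy data; `toy_X` and `toy_y` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 16 'AD' rows and 4 'Imaging' rows: an 80/20 class mix.
toy_X = pd.DataFrame({"feature": np.arange(20)})
toy_y = pd.Series(["AD"] * 16 + ["Imaging"] * 4, name="Phenotype-derived")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)
print(y_te.value_counts())
```

Stratification raises an error when a class has only one member, so very rare categories may need to be merged or handled separately first.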


## Summary:

### Data Analysis Key Findings

*   Initially, the dataframe columns were numerical indices, and all data was of the 'object' type. The first row contained the true header information.
*   After renaming columns using the first row and removing it, meaningful column names were established, though all data types remained 'object'.
*   Missing values were identified in several columns. Numerical columns ('P-value', 'OR\_nonref') were imputed with the median, while categorical columns were imputed with 'Unknown'.
*   'P-value', 'OR\_nonref', and 'Sample size' are `float64` after conversion; every other column, including the chromosome coordinates, remains 'object'.
*   Categorical columns were one-hot encoded, which expanded the 21 original columns to 9193 encoded columns; 8239 of these were kept as features.
*   Relevant features were selected, and the target variable ('Phenotype-derived') was identified.
*   The data was split into training (80%) and testing (20%) sets: `X_train` (5076, 8239), `X_test` (1270, 8239), `y_train` (5076,), and `y_test` (1270,).

### Insights or Next Steps

*   The high dimensionality of the feature space after one-hot encoding (8239 features) might require dimensionality reduction techniques (e.g., PCA) before training the neural network to potentially improve performance and reduce training time.
*   The target variable `y` is still in its original categorical format. It will need to be encoded (e.g., using label encoding or one-hot encoding if the neural network output layer is designed for multi-class classification) before being used to train the model.
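
As an illustration of the label-encoding option mentioned above, the sketch below maps class names to integer indices with scikit-learn's `LabelEncoder`. The label arrays are toy data and the variable names are hypothetical. Integer class indices are the target format that `torch.nn.CrossEntropyLoss` accepts.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels in the style of the 'Phenotype-derived' column.
labels_train = np.array(["AD", "Expression", "AD", "Imaging"])
labels_test = np.array(["Imaging", "AD"])

label_encoder = LabelEncoder()
y_train_idx = label_encoder.fit_transform(labels_train)  # fit on training labels only
y_test_idx = label_encoder.transform(labels_test)

print(label_encoder.classes_)
print(y_train_idx, y_test_idx)
```

`label_encoder.inverse_transform` converts predicted indices back to class names.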


In [19]:
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder
# Use handle_unknown='ignore' to handle potential new categories in the test set
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit and transform on y_train, then transform y_test
y_train_encoded = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.values.reshape(-1, 1))

# Convert back to DataFrame (optional, but can be helpful)
y_train_encoded_df = pd.DataFrame(y_train_encoded, columns=encoder.get_feature_names_out(['Phenotype-derived']))
y_test_encoded_df = pd.DataFrame(y_test_encoded, columns=encoder.get_feature_names_out(['Phenotype-derived']))


print("Shape of y_train_encoded:", y_train_encoded.shape)
print("Shape of y_test_encoded:", y_test_encoded.shape)

print("\nFirst 5 Rows of y_train_encoded_df:")
display(y_train_encoded_df.head())

print("\nFirst 5 Rows of y_test_encoded_df:")
display(y_test_encoded_df.head())

Shape of y_train_encoded: (5076, 9)
Shape of y_test_encoded: (1270, 9)

First 5 Rows of y_train_encoded_df:


Unnamed: 0,Phenotype-derived_AD,Phenotype-derived_ADRD,Phenotype-derived_Cognitive,Phenotype-derived_Expression,Phenotype-derived_Fluid biomarker,Phenotype-derived_Imaging,Phenotype-derived_Neuropathology,Phenotype-derived_Non-ADRD,Phenotype-derived_Other
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



First 5 Rows of y_test_encoded_df:


|      | Phenotype-derived_AD | Phenotype-derived_ADRD | Phenotype-derived_Cognitive | Phenotype-derived_Expression | Phenotype-derived_Fluid biomarker | Phenotype-derived_Imaging | Phenotype-derived_Neuropathology | Phenotype-derived_Non-ADRD | Phenotype-derived_Other |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0    | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1    | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2    | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3    | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4    | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


# Task
Apply binary encoding to the categorical features to reduce the feature count and prepare the data for training a neural network model.
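
To make the feature-count reduction concrete, here is a minimal library-free sketch of how binary encoding works. The helper `binary_encode` is hypothetical and is not the `category_encoders` implementation; in particular, assigning integer codes from 1 in order of first appearance is an assumption made for illustration.

```python
# Library-free sketch of binary encoding (hypothetical helper, not the
# category_encoders implementation).

def binary_encode(values):
    """Return (rows, width): each value's ordinal code written as `width` bits."""
    codes = {}
    for value in values:
        # Assumption: codes start at 1 and follow order of first appearance.
        codes.setdefault(value, len(codes) + 1)
    width = len(codes).bit_length()  # bits needed to write the largest code
    rows = [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]
    return rows, width

rows, width = binary_encode(["AD", "ADRD", "Fluid biomarker", "AD"])
print(width)
print(rows)

# A feature with n distinct values needs n one-hot columns but only about
# log2(n) binary columns.
for n in (9, 100, 5000):
    print(n, "categories ->", n.bit_length(), "binary columns")
```

Under these assumptions a nine-class column needs four binary columns instead of nine one-hot columns, and the saving grows quickly for high-cardinality columns such as SNP identifiers.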

## Identify categorical columns

### Subtask:
Identify the categorical columns that were one-hot encoded.


**Reasoning**:
Identify the columns in the `bed_df` DataFrame that have an 'object' data type and store them in a list. Then, print the list.



In [20]:
original_categorical_cols = bed_df.select_dtypes(include='object').columns.tolist()
print("Original categorical columns:", original_categorical_cols)

Original categorical columns: ['#dbSNP_hg38_chr', 'dbSNP_hg38_chrStart', 'dbSNP_hg38_chrEnd', 'Top SNP', 'LocusName', 'RA 1(Reported Allele 1)', 'nonref_allele', 'nonref_effect', 'nearest_gene_symb', 'Study type', 'Study Design', 'Pubmed PMID', 'Population_map', 'Cohort_simple3', 'Analysis group', 'Phenotype', 'Phenotype-derived', 'most_severe_consequence']
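
The printed list includes `Phenotype-derived`, which appears to be the column used as the target `y`, so encoding every listed column would place the label inside the feature matrix. A minimal sketch of filtering it out first is shown here. The names `TARGET_COL` and `LEAKY_COLS` are hypothetical, and treating `Phenotype` as the raw source of the target is an assumption about this dataset.

```python
# Sketch: keep the target (and, as an assumption, its source column) out of
# the categorical feature list before encoding.
original_categorical_cols = [
    '#dbSNP_hg38_chr', 'dbSNP_hg38_chrStart', 'dbSNP_hg38_chrEnd', 'Top SNP',
    'LocusName', 'RA 1(Reported Allele 1)', 'nonref_allele', 'nonref_effect',
    'nearest_gene_symb', 'Study type', 'Study Design', 'Pubmed PMID',
    'Population_map', 'Cohort_simple3', 'Analysis group', 'Phenotype',
    'Phenotype-derived', 'most_severe_consequence',
]

TARGET_COL = 'Phenotype-derived'
# Assumption: 'Phenotype' is the raw label from which the target is derived.
LEAKY_COLS = {TARGET_COL, 'Phenotype'}

feature_categorical_cols = [
    col for col in original_categorical_cols if col not in LEAKY_COLS
]
print(len(original_categorical_cols), "->", len(feature_categorical_cols))
```

The filtered list could then be passed as `cols=` to the encoder, with the excluded columns dropped from the DataFrame before it is transformed.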


## Apply binary encoding

### Subtask:
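Apply binary encoding to the original categorical columns, that is, the raw columns as they were before one-hot encoding.


**Reasoning**:
Binary encoding writes each category as a short binary code, so a column with many distinct values needs only a handful of output columns. Use `BinaryEncoder` from `category_encoders` on the columns listed in `original_categorical_cols`.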



In [21]:
from category_encoders import BinaryEncoder

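# Initialize BinaryEncoder
# handle_unknown='value' is the library default; valid options are 'error', 'return_nan', 'value', and 'indicator'
encoder = BinaryEncoder(cols=original_categorical_cols, handle_unknown='value', return_df=True)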

# Fit and transform the bed_df DataFrame
bed_df_binary_encoded = encoder.fit_transform(bed_df)

# Display the head of the bed_df_binary_encoded DataFrame
display(bed_df_binary_encoded.head())

# Print the shape of the bed_df_binary_encoded DataFrame
print("\nShape of bed_df_binary_encoded:", bed_df_binary_encoded.shape)

ModuleNotFoundError: No module named 'category_encoders'

**Reasoning**:
The previous command failed because the `category_encoders` library is not installed. I need to install it before I can use `BinaryEncoder`.



In [22]:
%pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Downloading category_encoders-2.8.1-py3-none-any.whl (85 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.8.1
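
The failed import followed by an install can be avoided by checking for the package up front. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

for name in ("json", "category_encoders"):
    status = "available" if is_installed(name) else "missing (install it first)"
    print(f"{name}: {status}")
```

In a notebook, the result of such a check can decide whether the `%pip install` cell needs to run at all.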


**Reasoning**:
Now that the `category_encoders` library is installed, I can proceed with applying binary encoding to the categorical columns as originally intended.



In [23]:
from category_encoders import BinaryEncoder

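# Initialize BinaryEncoder
# handle_unknown='value' is the library default; valid options are 'error', 'return_nan', 'value', and 'indicator'
encoder = BinaryEncoder(cols=original_categorical_cols, handle_unknown='value', return_df=True)

# Fit and transform the bed_df DataFrame
# Note: every column in original_categorical_cols is encoded, including the target 'Phenotype-derived'
bed_df_binary_encoded = encoder.fit_transform(bed_df)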

# Display the head of the bed_df_binary_encoded DataFrame
display(bed_df_binary_encoded.head())

# Print the shape of the bed_df_binary_encoded DataFrame
print("\nShape of bed_df_binary_encoded:", bed_df_binary_encoded.shape)

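|      | #dbSNP_hg38_chr_0 | #dbSNP_hg38_chr_1 | #dbSNP_hg38_chr_2 | #dbSNP_hg38_chr_3 | #dbSNP_hg38_chr_4 | dbSNP_hg38_chrStart_0 | dbSNP_hg38_chrStart_1 | dbSNP_hg38_chrStart_2 | dbSNP_hg38_chrStart_3 | dbSNP_hg38_chrStart_4 | ... | Phenotype_9 | Phenotype-derived_0 | Phenotype-derived_1 | Phenotype-derived_2 | Phenotype-derived_3 | most_severe_consequence_0 | most_severe_consequence_1 | most_severe_consequence_2 | most_severe_consequence_3 | most_severe_consequence_4 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |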



Shape of bed_df_binary_encoded: (6346, 127)


## Combine encoded features

### Subtask:
Combine the binary-encoded categorical features with the numerical features.


**Reasoning**:
Combine the numerical columns from the original dataframe with the binary-encoded categorical columns to create the final feature set.



In [24]:
# Identify numerical columns in the original bed_df
numerical_cols = bed_df.select_dtypes(include=np.number).columns.tolist()

# Extract numerical columns from the original bed_df
bed_df_numerical = bed_df[numerical_cols]

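# Concatenate numerical and binary-encoded categorical columns.
# Caution: BinaryEncoder passes non-encoded columns through unchanged, so
# bed_df_binary_encoded may already contain these numerical columns; if so,
# this concat duplicates them (check X_binary_encoded.columns.duplicated()).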
X_binary_encoded = pd.concat([bed_df_numerical, bed_df_binary_encoded], axis=1)

# Display the head and shape of the combined DataFrame
print("Head of the combined DataFrame (X_binary_encoded):")
display(X_binary_encoded.head())
print("\nShape of the combined DataFrame (X_binary_encoded):", X_binary_encoded.shape)

Head of the combined DataFrame (X_binary_encoded):


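|      | P-value | OR_nonref | Sample size | #dbSNP_hg38_chr_0 | #dbSNP_hg38_chr_1 | #dbSNP_hg38_chr_2 | #dbSNP_hg38_chr_3 | #dbSNP_hg38_chr_4 | dbSNP_hg38_chrStart_0 | dbSNP_hg38_chrStart_1 | ... | Phenotype_9 | Phenotype-derived_0 | Phenotype-derived_1 | Phenotype-derived_2 | Phenotype-derived_3 | most_severe_consequence_0 | most_severe_consequence_1 | most_severe_consequence_2 | most_severe_consequence_3 | most_severe_consequence_4 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0    | 0.00726 | 1.03 | 10191.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1    | 8.51e-40 | 1.03 | 10191.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2    | 0.0009 | 1.03 | 54162.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3    | 0.08392 | 1.03 | 54162.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4    | 0.0156 | 1.03 | 757.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |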



Shape of the combined DataFrame (X_binary_encoded): (6346, 130)


## Split the data

### Subtask:
Split the data into training and testing sets with the newly encoded features.


**Reasoning**:
Split the feature matrix X_binary_encoded and the target variable y into training and testing sets.



In [25]:
X_train_binary, X_test_binary, y_train, y_test = train_test_split(X_binary_encoded, y, test_size=0.2, random_state=42)

print("Shape of X_train_binary:", X_train_binary.shape)
print("Shape of X_test_binary:", X_test_binary.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train_binary: (5076, 130)
Shape of X_test_binary: (1270, 130)
Shape of y_train: (5076,)
Shape of y_test: (1270,)
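
The row counts in the printed shapes follow from simple arithmetic on the 6346 rows. The sketch below assumes that the test count is rounded up and the remainder goes to training, which is consistent with the shapes above; it is not taken from the scikit-learn source.

```python
import math

n_samples = 6346
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

If the nine classes turn out to be imbalanced, passing `stratify=y` to `train_test_split` is a common way to keep the class proportions similar in both splits.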


## Encode target variable

### Subtask:
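One-hot encode the target variable from the new split for the neural network.


**Reasoning**:
Fit the encoder on `y_train` only, then reuse it to transform `y_test`, so that each of the nine classes maps to the same output column in both sets.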



In [26]:
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder
# Use sparse_output=False for dense output, handle_unknown='ignore' to handle potential new categories in the test set
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit and transform on y_train, then transform y_test
y_train_encoded = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.values.reshape(-1, 1))

# Print the shapes of the encoded target variables
print("Shape of y_train_encoded:", y_train_encoded.shape)
print("Shape of y_test_encoded:", y_test_encoded.shape)

Shape of y_train_encoded: (5076, 9)
Shape of y_test_encoded: (1270, 9)
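
The `handle_unknown='ignore'` setting means that a test label never seen during fitting is encoded as a row of zeros rather than raising an error. The library-free sketch below illustrates that behavior; `fit_classes` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
def fit_classes(train_labels):
    """Learn the sorted set of classes seen during training."""
    return sorted(set(train_labels))

def transform(labels, classes):
    """One-hot encode; a label not seen in training becomes an all-zero row."""
    return [[1.0 if label == c else 0.0 for c in classes] for label in labels]

classes = fit_classes(["AD", "Imaging", "AD", "Other"])
encoded = transform(["Imaging", "Cognitive"], classes)  # "Cognitive" is unseen
print(classes)
print(encoded)
```

An all-zero target row carries no class signal, so it is worth checking whether any row of `y_test_encoded` sums to zero before evaluating a model.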


## Summary:

### Data Analysis Key Findings

*   Binary encoding was applied to the 18 original categorical columns, producing a `bed_df_binary_encoded` DataFrame of shape (6346, 127). This is far fewer columns than the 8239 features obtained with one-hot encoding.
*   The binary-encoded categorical features were successfully combined with the numerical features, creating a single DataFrame `X_binary_encoded` with a shape of (6346, 130).
*   The combined data was split into training and testing sets using an 80/20 ratio, giving feature matrices of shape (5076, 130) for training and (1270, 130) for testing.
*   The target variable `y` was successfully one-hot encoded for both the training and testing sets, resulting in `y_train_encoded` with a shape of (5076, 9) and `y_test_encoded` with a shape of (1270, 9).

### Insights or Next Steps

*   The data is now preprocessed with binary encoding for categorical features and one-hot encoding for the target variable, making it suitable for training a neural network.
*   The next logical step is to define and train a neural network model using the prepared `X_train_binary`, `y_train_encoded`, `X_test_binary`, and `y_test_encoded` datasets.


In [27]:
X_train_binary_np = X_train_binary.to_numpy() if isinstance(X_train_binary, pd.DataFrame) else X_train_binary
X_test_binary_np = X_test_binary.to_numpy() if isinstance(X_test_binary, pd.DataFrame) else X_test_binary
y_train_encoded_np = y_train_encoded.to_numpy() if isinstance(y_train_encoded, pd.DataFrame) else y_train_encoded
y_test_encoded_np = y_test_encoded.to_numpy() if isinstance(y_test_encoded, pd.DataFrame) else y_test_encoded


# Define the filename for the bundled data
bundled_filename = os.path.join(drive_path, 'preprocessed_alz_data.npz')

# Save the arrays into a single .npz file
np.savez(bundled_filename, X_train=X_train_binary_np, X_test=X_test_binary_np, y_train=y_train_encoded_np, y_test=y_test_encoded_np)

print(f"Data bundled and saved to {bundled_filename}")

# You can load the data back later using:
# loaded_data = np.load(bundled_filename)
# X_train_loaded = loaded_data['X_train']
# X_test_loaded = loaded_data['X_test']
# y_train_loaded = loaded_data['y_train']
# y_test_loaded = loaded_data['y_test']

Data bundled and saved to /content/drive/MyDrive/Colab_Dev/ALZ_Variant/preprocessed_alz_data.npz
