# Analysis of Cardiomyopathy and RNA Therapeutics using DNABERT

This notebook demonstrates an end-to-end analysis pipeline that integrates multiple data sources relevant to cardiomyopathy and RNA therapeutics. We will:

- Load RNA-seq data from GEO (GSE55296) to examine gene expression differences in heart tissue.
- Process GTEx data (RNA-seq TPM, sample attributes, and subject phenotypes) to extract heart-specific expression profiles.
- Read in the human reference genome (GRCh38) to enable extraction of nucleotide sequences for target genes.
- Outline preliminary steps for using DNABERT_2 (and DNABERT‑S, an embedding model built on top of it) to generate DNA embeddings and classify on-target/off-target siRNA binding sites.

The following sections provide detailed code and explanations.

## 1. Importing Libraries

We first import the necessary Python libraries. We use `pandas` for data manipulation, `matplotlib` for visualization, and `Bio` (Biopython) for parsing FASTA files. Additional libraries (such as `numpy`) will help with numerical operations.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from Bio import SeqIO

# Set matplotlib style for clarity
plt.style.use('default')

print("Libraries imported successfully.")

## 2. Data Directory Structure

We assume that the data has been downloaded using the provided `data_download.sh` script, which creates the following directories:

- `data/geo/` for the GEO RNA-seq series matrix file (GSE55296)
- `data/encode/` for ENCODE data (manual download required)
- `data/gtex/` for GTEx files (RNA-seq TPM data, sample attributes, subject phenotypes)
- `data/reference/` for the human reference genome (GRCh38)

Let’s confirm that these directories and files exist.

In [None]:
base_dir = 'data'
subdirs = ['geo', 'encode', 'gtex', 'reference']

for sub in subdirs:
    path = os.path.join(base_dir, sub)
    print(f"{path}:", os.listdir(path) if os.path.exists(path) else "Directory not found")
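Listing the directories shows what is present, but not what is missing. The sketch below checks for the specific files that the later cells of this notebook open. The `EXPECTED_FILES` manifest and the `find_missing_files` helper are hypothetical names introduced here for illustration; the file names are copied from the paths used further down.

```python
from pathlib import Path

# Hypothetical manifest: file names are taken from the paths used later in this notebook.
EXPECTED_FILES = {
    "geo": ["GSE55296_series_matrix.txt"],
    "gtex": [
        "GTEx_Analysis_v8_RNA-seq_RNA-SeQCv1.1.9_gene_tpm.gct.gz",
        "GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt",
        "GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt",
    ],
    "reference": ["Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"],
}


def find_missing_files(base_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent under base_dir, as relative paths."""
    base = Path(base_dir)
    missing = []
    for subdir, names in expected.items():
        for name in names:
            if not (base / subdir / name).is_file():
                missing.append(str(Path(subdir) / name))
    return missing


missing = find_missing_files("data")
print("All expected files found." if not missing else f"Missing files: {missing}")
```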

## 3. Loading and Processing GEO RNA-seq Data (GSE55296)

The GEO series matrix file contains gene expression data from heart tissue samples (including patients with cardiomyopathy and controls). In this example, we load the text file and inspect its header information. Note that the series matrix file is a tab-delimited text file with metadata lines starting with an exclamation mark (`!`).

For deeper analysis, you may need to parse the sample information and extract the actual gene expression matrix.

In [None]:
# Define path to the GEO series matrix file
geo_file = os.path.join('data', 'geo', 'GSE55296_series_matrix.txt')

from itertools import islice

# Print the first 20 lines to inspect metadata (without reading the whole file into memory)
with open(geo_file, 'r') as f:
    for line in islice(f, 20):
        print(line.strip())

# Note: Further parsing is required to extract the data matrix from the file.
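The further parsing mentioned above could look roughly like the sketch below. It relies on the series matrix layout: metadata lines start with `!`, and the data table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The function name `parse_series_matrix` is a hypothetical helper, and the example lines are made up. For some series the table section holds only a header row, in which case the returned table is empty.

```python
import io

import pandas as pd


def parse_series_matrix(lines):
    """Split GEO series matrix lines into a metadata dict and an expression table."""
    metadata = {}
    table_lines = []
    in_table = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith("!"):
            key, *values = line.split("\t")
            # Keys such as !Sample_characteristics_ch1 can repeat, so collect each row in a list
            metadata.setdefault(key.lstrip("!"), []).append([v.strip('"') for v in values])
    if table_lines:
        table = pd.read_csv(io.StringIO("\n".join(table_lines)), sep="\t", index_col=0)
    else:
        table = pd.DataFrame()
    return metadata, table


# Made-up lines that mimic the file layout
example_lines = [
    '!Series_title\t"Example series"\n',
    '!Sample_title\t"control 1"\t"case 1"\n',
    '!series_matrix_table_begin\n',
    '"ID_REF"\t"GSM1"\t"GSM2"\n',
    '"geneA"\t1.5\t2.5\n',
    '!series_matrix_table_end\n',
]
metadata, table = parse_series_matrix(example_lines)
print(metadata["Sample_title"])
print(table)
```

On the real file, the same function can be called with the open file handle, for example `with open(geo_file) as f: metadata, table = parse_series_matrix(f)`.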

## 4. Loading and Processing GTEx Data

The GTEx project provides several files. We load the following:

1. **RNA-seq Gene TPM Data (GCT format):** This file contains TPM values for all genes across samples. The first two lines hold the format version and the matrix dimensions; the column header row follows them.
2. **Sample Attributes:** Contains metadata (including tissue type) for each sample.
3. **Subject Phenotypes:** Contains phenotypic data for GTEx donors (optional for this analysis).

We then filter the sample attributes to extract only the heart tissues: `Heart - Left Ventricle` and `Heart - Atrial Appendage`.

In [None]:
import gzip

## 4.1 Load GTEx RNA-seq TPM Data
gtex_tpm_file = os.path.join('data', 'gtex', 'GTEx_Analysis_v8_RNA-seq_RNA-SeQCv1.1.9_gene_tpm.gct.gz')

# GCT files have 2 header rows. We use pandas to read the file, skipping the first 2 rows.
gtex_df = pd.read_csv(gtex_tpm_file, sep='\t', skiprows=2)
print("GTEx TPM Data shape:", gtex_df.shape)
print(gtex_df.head())

## 4.2 Load GTEx Sample Attributes
sample_attr_file = os.path.join('data', 'gtex', 'GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt')
sample_attr_df = pd.read_csv(sample_attr_file, sep='\t')
print("Sample Attributes shape:", sample_attr_df.shape)
print(sample_attr_df.head())

## 4.3 Load GTEx Subject Phenotypes (Optional)
subject_phen_file = os.path.join('data', 'gtex', 'GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt')
subject_phen_df = pd.read_csv(subject_phen_file, sep='\t')
print("Subject Phenotypes shape:", subject_phen_df.shape)
print(subject_phen_df.head())
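The full TPM matrix is large, so it can help to look at the GCT header before loading everything. The sketch below reads only the first three lines of a gzipped GCT file: the version line, the dimensions line, and the column names. `read_gct_header` is a hypothetical helper, and the demo file is a tiny made-up example written to a temporary directory.

```python
import gzip
import os
import tempfile


def read_gct_header(path):
    """Return (version, n_genes, n_samples, column_names) from a gzipped GCT file."""
    with gzip.open(path, "rt") as handle:
        version = handle.readline().strip()
        n_genes, n_samples = (int(x) for x in handle.readline().split()[:2])
        columns = handle.readline().rstrip("\n").split("\t")
    return version, n_genes, n_samples, columns


# Tiny made-up GCT file to show the call
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gct.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("#1.2\n2\t3\nName\tDescription\tS1\tS2\tS3\ng1\tA\t1\t2\t3\ng2\tB\t4\t5\t6\n")

print(read_gct_header(demo_path))
```

With the column names in hand, `pd.read_csv` can be given a `usecols` list (the two identifier columns plus the sample IDs of interest) so that only the needed columns are loaded.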

### Filtering GTEx Data for Heart Tissues

We use the sample attributes file to filter for heart samples. Specifically, we look for samples where the `SMTSD` (Sample Tissue Description) column is either `Heart - Left Ventricle` or `Heart - Atrial Appendage`.

In [None]:
# Filter sample attributes for heart tissues
heart_samples = sample_attr_df[sample_attr_df['SMTSD'].isin(['Heart - Left Ventricle', 'Heart - Atrial Appendage'])]
print("Number of heart tissue samples:", heart_samples.shape[0])

# Display a few examples
heart_samples[['SAMPID', 'SMTSD']].head()
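To connect heart samples to the donor-level phenotype table, a donor ID can be derived from each sample ID. The sketch below assumes that a GTEx sample ID starts with the donor ID (its first two hyphen-separated fields) and that the phenotype table has `SUBJID` and `SEX` columns. The IDs and values in the toy tables are made up, and `subject_id_from_sample_id` is a hypothetical helper.

```python
import pandas as pd


def subject_id_from_sample_id(sample_id):
    """Return the donor ID, taken as the first two hyphen-separated fields of a sample ID."""
    return "-".join(sample_id.split("-")[:2])


# Toy tables with made-up IDs and values
samples = pd.DataFrame({
    "SAMPID": ["GTEX-AAAA-0001-SM-XXXXX", "GTEX-BBBB-0002-SM-YYYYY"],
    "SMTSD": ["Heart - Left Ventricle", "Heart - Atrial Appendage"],
})
subjects = pd.DataFrame({"SUBJID": ["GTEX-AAAA", "GTEX-BBBB"], "SEX": [1, 2]})

samples["SUBJID"] = samples["SAMPID"].map(subject_id_from_sample_id)
merged = samples.merge(subjects, on="SUBJID", how="left")
print(merged)
```

A left merge keeps every sample even if a donor is missing from the phenotype table, which makes unmatched rows easy to spot.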

### Extracting Heart Expression Data

Now that we have the list of heart sample IDs (SAMPID), we filter the GTEx TPM data to retain only the columns corresponding to these samples. The first two columns in the GCT file are typically identifiers (e.g., `Name` and `Description`).

In [None]:
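# Get list of heart sample IDs
heart_sample_ids = heart_samples['SAMPID'].tolist()

# The annotation file can list samples that have no column in the TPM matrix,
# so keep only the IDs that are actually present (avoids a KeyError below)
heart_sample_ids = [s for s in heart_sample_ids if s in gtex_df.columns]
print("Heart samples with TPM data:", len(heart_sample_ids))

# Select identifier columns and heart sample columns
cols_to_keep = ['Name', 'Description'] + heart_sample_ids
gtex_heart_df = gtex_df[cols_to_keep]
print("Shape of GTEx heart TPM data:", gtex_heart_df.shape)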

# Display the first few rows
gtex_heart_df.head()
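Once the heart columns are selected, a per-tissue summary is often easier to work with than the full sample matrix. The sketch below computes the median TPM per gene within each tissue. `median_tpm_by_tissue` is a hypothetical helper; it assumes a frame with `Name` and `Description` columns followed by sample columns, plus a mapping from sample ID to tissue label. The toy values are made up.

```python
import pandas as pd


def median_tpm_by_tissue(tpm_df, sample_to_tissue):
    """Return a genes x tissues table of median TPM values."""
    values = tpm_df.set_index("Name").drop(columns=["Description"])
    # One tissue label per sample column, in column order
    tissues = values.columns.map(sample_to_tissue)
    # Group samples (rows after transposing) by tissue, then transpose back
    return values.T.groupby(tissues.values).median().T


# Toy data with made-up values
toy_tpm = pd.DataFrame({
    "Name": ["g1", "g2"],
    "Description": ["A", "B"],
    "s1": [1.0, 10.0],
    "s2": [3.0, 20.0],
    "s3": [5.0, 30.0],
})
toy_tissue = {
    "s1": "Heart - Left Ventricle",
    "s2": "Heart - Left Ventricle",
    "s3": "Heart - Atrial Appendage",
}
medians = median_tpm_by_tissue(toy_tpm, toy_tissue)
print(medians)
```

On the notebook's data, the mapping could be built from the sample attributes, for example `dict(zip(heart_samples['SAMPID'], heart_samples['SMTSD']))`.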

## 5. Loading the Human Reference Genome (GRCh38)

We now load the human reference genome in FASTA format. This file will allow us to extract the nucleotide sequences for genes of interest. Here, we use Biopython’s `SeqIO` to parse the FASTA file.

In [None]:
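import gzip

ref_genome_file = os.path.join('data', 'reference', 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz')

print("Reading reference genome (this may take a moment)...")
# SeqIO.parse does not decompress gzip files when given a path, so open the file in text mode first.
# list(...) keeps every record in memory; iterate over the parser instead if memory is limited.
with gzip.open(ref_genome_file, "rt") as handle:
    genome_records = list(SeqIO.parse(handle, "fasta"))
print(f"Number of sequences in the reference genome: {len(genome_records)}")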

# Display the first record's header
print(genome_records[0].id)

# For targeted analysis, you might extract sequences for specific chromosomes or genes later.
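Extracting a gene or target site from a chromosome comes down to slicing by coordinates and, for minus-strand features, taking the reverse complement. The sketch below works on plain strings (for example `str(record.seq)` for one parsed record) and assumes 1-based, inclusive coordinates as used in GTF annotation files. Both function names are hypothetical helpers, and the example sequence is made up.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(chrom_seq, start, end, strand="+"):
    """Return the 1-based, inclusive interval [start, end], reverse-complemented for '-'."""
    if start < 1 or end > len(chrom_seq) or start > end:
        raise ValueError(f"interval {start}-{end} is outside the sequence")
    region = chrom_seq[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


# Made-up sequence to show the coordinate convention
toy_chrom = "AACGTT"
print(extract_region(toy_chrom, 2, 4))
print(extract_region(toy_chrom, 2, 4, strand="-"))
```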

## 6. Integrating DNABERT_2 for DNA Embeddings and Sequence Analysis

In this section, we outline the steps to use DNABERT_2 and DNABERT‑S (an embedding model built on top of DNABERT_2) for generating DNA embeddings. We assume that the DNABERT_2 package has been installed (via the environment setup script) and that the model files are available in the cloned repository under `src/DNABERT_2`.

### 6.1 Loading the DNABERT_2 Model

For example, using Hugging Face’s Transformers interface, you can load a pre-trained DNABERT_2 model. (Note: Set `model_path` to the correct model directory or model name if necessary.)

In [None]:
from transformers import AutoTokenizer, AutoModel

# Example: Load DNABERT_2 (or DNABERT-S) model
# Replace with the actual model directory if needed
# (the published DNABERT_2 checkpoint on the Hugging Face Hub is 'zhihan1996/DNABERT-2-117M')
model_path = 'src/DNABERT_2'

print("Loading DNABERT_2 tokenizer and model...")
# DNABERT_2 ships its own model code, so it is loaded through the Auto classes with trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

print("DNABERT_2 model loaded successfully.")

# Example tokenization of a DNA sequence
# (DNABERT_2 uses byte-pair encoding; fixed 6-mer tokens belong to the original DNABERT)
sequence = "ATGCGTACGTAGCTAGCTAGCTAG"
tokens = tokenizer.tokenize(sequence)
print("Tokenized sequence:", tokens)
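A model of this kind returns one vector per token, while downstream analysis usually needs one fixed-length embedding per sequence. A common way to get there is mean pooling over the non-padded tokens. The sketch below shows that step with NumPy on made-up arrays; it assumes the token-level outputs and the attention mask have already been converted to arrays, and `masked_mean_pool` is a hypothetical helper rather than part of DNABERT_2.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: array of shape (batch, tokens, dim)
    attention_mask: array of shape (batch, tokens), 1 for real tokens and 0 for padding
    """
    mask = np.asarray(attention_mask, dtype=float)[..., None]
    summed = (np.asarray(hidden_states, dtype=float) * mask).sum(axis=1)
    # Guard against division by zero for sequences that are entirely padding
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Made-up values: one sequence, three token positions, the last one is padding
toy_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])
embedding = masked_mean_pool(toy_hidden, toy_mask)
print(embedding)
```

Masking matters when sequences of different lengths are batched together, since padded positions would otherwise pull the average toward the padding vector.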

### 6.2 Fine-Tuning for siRNA Target/Off-target Classification (Outline)

The next steps (beyond this notebook) would involve preparing a labeled dataset of on-target and off-target sequences, fine-tuning the DNABERT_2 model on that data, and then evaluating its performance. The general workflow would include:

1. **Data Preparation:** Create a TSV or CSV file with columns for the sequence and the label (e.g., `1` for on-target, `0` for off-target).
2. **Fine-Tuning:** Use Hugging Face’s Trainer API or a custom training loop to fine-tune DNABERT_2 on your dataset.
3. **Evaluation:** Evaluate the model using appropriate metrics (accuracy, precision, recall) and analyze the attention weights to interpret key motifs.
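The data preparation step can be sketched as below. The code writes `(sequence, label)` pairs to a TSV with a header row, after normalizing each sequence. One assumption is made explicit: siRNA sequences are often written in the RNA alphabet, so `U` is mapped to `T` to match a DNA model's vocabulary. The helper names, column names, and example sequences are all hypothetical.

```python
import csv
import io

VALID_BASES = set("ACGT")


def normalize_sequence(seq):
    """Uppercase a sequence, map RNA 'U' to DNA 'T', and reject other characters."""
    dna = seq.strip().upper().replace("U", "T")
    if not dna or set(dna) - VALID_BASES:
        raise ValueError(f"unexpected characters in sequence: {seq!r}")
    return dna


def write_labeled_tsv(examples, handle):
    """Write (sequence, label) pairs as a TSV with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sequence", "label"])
    for seq, label in examples:
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([normalize_sequence(seq), label])


# Made-up sequences and labels, for illustration only
examples = [("augcguacguagcuagcuagc", 1), ("ATGCGTACGTAGCTAGCTAGG", 0)]
buffer = io.StringIO()
write_labeled_tsv(examples, buffer)
print(buffer.getvalue())
```

Writing to a file instead of the in-memory buffer only requires passing an open file handle, for example one created with `open(path, "w", newline="")`.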

Because this is a weekend-scale project, this notebook covers only data loading and the preliminary steps. Detailed training scripts can be added later as needed.
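For the evaluation step, the metrics named above follow directly from the confusion-matrix counts. The sketch below computes accuracy, precision, and recall for binary labels, with `1` as the positive (on-target) class. The function name is a hypothetical helper and the labels are made up; a library such as scikit-learn provides equivalent functions.

```python
def binary_classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive class)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Report 0.0 when a denominator is empty rather than dividing by zero
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels and predictions
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

If off-target sites greatly outnumber on-target sites, accuracy alone can look good for a model that rarely predicts the positive class, which is why precision and recall are reported alongside it.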

## 7. Data Visualization and Preliminary Analysis

Here, we perform some basic visualizations to inspect the GTEx heart expression data. For example, we can plot the distribution of TPM values for a gene of interest across heart samples.

In [None]:
# Let's choose a gene of interest. The 'Name' column in the GTEx TPM file typically contains Ensembl gene IDs.
# For demonstration, we pick the first gene in the dataset.
gene_id = gtex_heart_df.iloc[0]['Name']
print("Gene ID:", gene_id)

# Extract TPM values (skip the identifier columns)
tpm_values = gtex_heart_df.iloc[0, 2:].astype(float)

plt.figure(figsize=(8, 4))
plt.hist(tpm_values, bins=30, color='skyblue', edgecolor='black')
plt.title(f'TPM Distribution for Gene {gene_id}')
plt.xlabel('TPM')
plt.ylabel('Frequency')
plt.show()
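TPM values tend to be strongly right-skewed, so a histogram on the raw scale is often dominated by a few large values. A common remedy is to plot `log2(TPM + 1)` instead. The sketch below shows the transform on made-up numbers; `log2_tpm` is a hypothetical helper and the pseudocount of 1 is a conventional choice, not something fixed by the data.

```python
import numpy as np


def log2_tpm(tpm_values, pseudocount=1.0):
    """Return log2(TPM + pseudocount); the pseudocount keeps zero TPM values finite."""
    return np.log2(np.asarray(tpm_values, dtype=float) + pseudocount)


# Made-up TPM values spanning several orders of magnitude
toy_tpm_values = [0.0, 1.0, 3.0, 1023.0]
print(log2_tpm(toy_tpm_values))
```

The transformed values can be passed to `plt.hist` in place of the raw TPM values, with the x-axis label changed to match.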

## 8. Conclusions and Next Steps

In this notebook, we have:

- Loaded and inspected GEO RNA-seq data (GSE55296) for cardiomyopathy.
- Processed GTEx data to extract heart-specific RNA-seq TPM values and sample metadata.
- Parsed the human reference genome for potential sequence extraction.
- Outlined the integration of DNABERT_2 (and DNABERT‑S) for generating DNA embeddings and fine-tuning on siRNA target classification.

### Next Steps:

1. **Refine Data Parsing:** Develop more robust parsing functions for the GEO series matrix to extract sample-specific expression data.
2. **Sequence Extraction:** Use the reference genome and gene annotations (e.g., from Ensembl) to extract nucleotide sequences for key genes.
3. **Model Fine-Tuning:** Prepare a labeled dataset and fine-tune the DNABERT_2 model for on-target/off-target classification.
4. **Advanced Visualization:** Generate attention heatmaps and sequence logos to interpret key motifs identified by DNABERT_2.

This notebook serves as the starting point for an integrated genomic analysis pipeline aimed at improving RNA therapeutic design.