# Analysis of Cardiac Myopathy and RNA Therapeutics using DNABERT

This notebook demonstrates an end-to-end analysis pipeline that integrates multiple data sources relevant to cardiac myopathy and RNA therapeutics. We will:

- Load RNA-seq data from GEO (GSE55296) to examine gene expression differences in heart tissue.
- Process GTEx data (RNA-seq TPM, sample attributes, and subject phenotypes) to extract heart-specific expression profiles.
- Read in the human reference genome (GRCh38) to enable extraction of nucleotide sequences for target genes.
- Outline preliminary steps for using DNABERT_2 to generate DNA embeddings and classify on-target/off-target siRNA binding sites.

In [None]:
import os
import yaml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from Bio import SeqIO

# Set matplotlib style for clarity
plt.style.use('default')

# Load configuration
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

def get_path(*parts):
    """Build path relative to notebook location"""
    return os.path.join('..', *parts)

print("Libraries and configuration loaded successfully.")
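
The later cells read several nested keys from `config.yaml`. As a sanity check, the sketch below lists the key paths this notebook looks up and reports any that are missing. The helper name `missing_config_keys`, the `REQUIRED_KEYS` list, and the example values are illustrative assumptions; only the key paths themselves come from the lookups in the cells that follow.

```python
# Key paths this notebook reads from config.yaml (collected from later cells)
REQUIRED_KEYS = [
    ("directories", "geo"),
    ("directories", "gtex"),
    ("directories", "reference"),
    ("files", "geo", "series_matrix", "filename"),
    ("files", "gtex", "tpm_data", "filename"),
    ("files", "gtex", "sample_attributes", "filename"),
    ("files", "gtex", "subject_phenotypes", "filename"),
    ("files", "reference", "genome", "filename"),
]

def missing_config_keys(cfg, required=REQUIRED_KEYS):
    """Return the dotted key paths from `required` that are absent in `cfg`."""
    missing = []
    for path in required:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

# Hypothetical, partially filled config used only to demonstrate the check
example_config = {
    "directories": {"geo": "data/geo", "gtex": "data/gtex"},
    "files": {"geo": {"series_matrix": {"filename": "matrix.txt"}}},
}
print(missing_config_keys(example_config))
```

Calling `missing_config_keys(config)` right after loading the real configuration would surface a misnamed key before any of the large files are opened.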

## Data Directory Structure

We assume that the data has been downloaded using the provided `data_download.sh` script, which creates the following directories from the configuration:

- GEO directory for RNA-seq series matrix file
- ENCODE directory for ChIP-seq data (manual download required)
- GTEx directory for RNA-seq TPM data and metadata
- Reference directory for the human genome

Let's confirm these directories exist.

In [None]:
# Check all data directories and list their contents
for dir_name, dir_path in config['directories'].items():
    full_path = get_path(dir_path)
    contents = os.listdir(full_path) if os.path.isdir(full_path) else "Directory not found"
    print(f"{dir_name} ({full_path}):", contents)

In [None]:
# Load GEO series matrix file
geo_file = get_path(config['directories']['geo'], config['files']['geo']['series_matrix']['filename'])

with open(geo_file, 'r') as f:
    lines = f.readlines()

print("First 20 lines of GEO series matrix:")
for line in lines[:20]:
    print(line.strip())
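
A GEO series matrix file mixes metadata lines (prefixed with `!`) with a tab-separated table enclosed between `!series_matrix_table_begin` and `!series_matrix_table_end`. The sketch below separates the two parts using only the standard library. The function name `split_series_matrix` and the toy lines are assumptions made for illustration; for RNA-seq series the table may contain only a header row, with counts distributed as supplementary files.

```python
import csv

def split_series_matrix(lines):
    """Split series-matrix lines into (metadata dict, table header, table rows)."""
    metadata, table_lines, in_table = {}, [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith("!"):
            # Metadata line: "!Key<TAB>"value1"<TAB>"value2" ..."
            fields = next(csv.reader([line], delimiter="\t"))
            metadata.setdefault(fields[0][1:], []).extend(fields[1:])
    reader = csv.reader(table_lines, delimiter="\t")  # csv removes the quotes
    header = next(reader, [])
    rows = [row for row in reader if row]
    return metadata, header, rows

# Toy input with made-up accessions, in the layout described above
example_lines = [
    '!Series_title\t"Example series"\n',
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n',
    '!series_matrix_table_begin\n',
    '"ID_REF"\t"GSM0000001"\t"GSM0000002"\n',
    '"GENE_A"\t1.5\t2.5\n',
    '!series_matrix_table_end\n',
]
metadata, header, rows = split_series_matrix(example_lines)
print(metadata["Sample_geo_accession"], header, rows)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly.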

In [None]:
# Load GTEx data
gtex_dir = config['directories']['gtex']

# Load TPM data
gtex_tpm_file = get_path(gtex_dir, config['files']['gtex']['tpm_data']['filename'])
gtex_df = pd.read_csv(gtex_tpm_file, sep='\t', skiprows=2)
print("GTEx TPM Data shape:", gtex_df.shape)
print(gtex_df.head())

# Load sample attributes
sample_attr_file = get_path(gtex_dir, config['files']['gtex']['sample_attributes']['filename'])
sample_attr_df = pd.read_csv(sample_attr_file, sep='\t')
print("\nSample Attributes shape:", sample_attr_df.shape)
print(sample_attr_df.head())

# Load subject phenotypes
subject_phen_file = get_path(gtex_dir, config['files']['gtex']['subject_phenotypes']['filename'])
subject_phen_df = pd.read_csv(subject_phen_file, sep='\t')
print("\nSubject Phenotypes shape:", subject_phen_df.shape)
print(subject_phen_df.head())
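
To obtain heart-specific expression profiles, the TPM matrix can be restricted to samples whose tissue annotation is heart. The sketch below assumes the standard GTEx column names: `SAMPID`, `SMTS` (tissue) and `SMTSD` (detailed tissue) in the sample attributes, and `Name`/`Description` followed by one column per sample in the TPM table. The helper names and the toy data frames are hypothetical and stand in for `gtex_df` and `sample_attr_df`.

```python
import pandas as pd

def heart_expression(tpm_df, sample_attr_df, tissue="Heart"):
    """Keep the gene annotation columns plus samples annotated with `tissue`."""
    wanted = set(sample_attr_df.loc[sample_attr_df["SMTS"] == tissue, "SAMPID"])
    # Not every annotated sample has RNA-seq data, so intersect with the columns
    sample_cols = [col for col in tpm_df.columns if col in wanted]
    return tpm_df[["Name", "Description"] + sample_cols]

def subject_id(sample_id):
    """Derive the subject ID (e.g. GTEX-XXXX) from a GTEx sample ID."""
    return "-".join(sample_id.split("-")[:2])

# Toy stand-ins with made-up IDs and values
toy_tpm = pd.DataFrame({
    "Name": ["ENSG1.1", "ENSG2.1"],
    "Description": ["GENE_A", "GENE_B"],
    "GTEX-AAAA-0001-SM-1": [1.0, 2.0],
    "GTEX-BBBB-0002-SM-2": [3.0, 4.0],
})
toy_attr = pd.DataFrame({
    "SAMPID": ["GTEX-AAAA-0001-SM-1", "GTEX-BBBB-0002-SM-2"],
    "SMTS": ["Heart", "Lung"],
    "SMTSD": ["Heart - Left Ventricle", "Lung"],
})
heart_df = heart_expression(toy_tpm, toy_attr)
print(list(heart_df.columns), subject_id("GTEX-AAAA-0001-SM-1"))
```

The derived subject ID can then serve as the join key against the `SUBJID` column of the subject phenotypes table.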

In [None]:
# Load reference genome
ref_genome_file = get_path(config['directories']['reference'],
                          config['files']['reference']['genome']['filename'])

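# Note: list() keeps every chromosome and scaffold in memory at once, which can
# require several GB for GRCh38. SeqIO.index() is a lazy alternative for
# uncompressed FASTA files.
print("Reading reference genome (this may take a moment)...")
genome_records = list(SeqIO.parse(ref_genome_file, "fasta"))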
print(f"Number of sequences in the reference genome: {len(genome_records)}")
print("First record:", genome_records[0].id)
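
With the genome loaded, the sequence of a target region can be sliced out by coordinates. The sketch below assumes 1-based, inclusive coordinates (as used in GTF/GFF annotations) and returns the reverse complement for minus-strand features. The function `extract_region` and the toy chromosome are hypothetical; with a real record, the input would be `str(record.seq)`.

```python
# Translation table for complementing DNA while preserving case and N
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def extract_region(chrom_seq, start, end, strand="+"):
    """Return chrom_seq[start..end] (1-based, inclusive) on the given strand."""
    if start < 1 or end > len(chrom_seq) or start > end:
        raise ValueError(f"Invalid region {start}-{end} for length {len(chrom_seq)}")
    region = chrom_seq[start - 1:end]
    if strand == "-":
        # Minus-strand features are read as the reverse complement
        region = region.translate(_COMPLEMENT)[::-1]
    return region

toy_chrom = "AACCGGTTAC"  # made-up sequence for illustration
print(extract_region(toy_chrom, 3, 6))
print(extract_region(toy_chrom, 1, 4, strand="-"))
```

Keeping the coordinate convention explicit matters here, since BED-style coordinates are 0-based and half-open and would need to be shifted before calling this function.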

In [None]:
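# Load DNABERT_2 tokenizer and model
# DNABERT_2 uses a BPE tokenizer and a custom model implementation, so its usage
# instructions load it through the Auto* classes with trust_remote_code=True
# rather than through BertTokenizer/BertModel.
from transformers import AutoTokenizer, AutoModel

model_path = '../src/DNABERT_2'

print("Loading DNABERT_2 tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)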

print("DNABERT_2 model loaded successfully.")

# Example tokenization
sequence = "ATGCGTACGTAGCTAGCTAGCTAG"
tokens = tokenizer.tokenize(sequence)
print("Tokenized sequence:", tokens)
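
Once token-level hidden states are available, a common way to reduce them to one embedding per sequence is mean pooling over the non-padding tokens. The sketch below shows the pooling step on made-up NumPy arrays; it is a generic illustration rather than DNABERT_2's own API, and the function name `mean_pool` is an assumption. With the real model, the hidden states would come from the model output and the mask from the tokenizer.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: array of shape (batch, tokens, dim)
    attention_mask: array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for sequences that are entirely padding
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# One sequence with two real tokens and one padding token (values are made up)
toy_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])
print(mean_pool(toy_hidden, toy_mask))
```

Masking matters when sequences of different lengths are batched together, since padded positions would otherwise pull the average toward arbitrary values.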