# DeepTopic

Sample notebook for training a DeepTopic model.

In [1]:
import enhancerai as enhai

We can use the function {func}`enhancerai.import_topics` to import data into an {class}`anndata.AnnData` object,
with the imported topics as the `AnnData.obs` and the consensus peak regions as the `AnnData.var`.

In [2]:
adata = enhai.import_topics(
    topics_folder="/staging/leuven/stg_00002/lcb/lmahieu/projects/DeepTopic/biccn_test/otsu",
    peaks_file="/staging/leuven/stg_00002/lcb/lmahieu/projects/DeepTopic/biccn_test/consensus_peaks_bicnn.bed",
    compress=True,
    # topics_subset=["topic_1", "topic_2"], # optional subset of topics to import
)
adata



AnnData object with n_obs × n_vars = 80 × 546993
    obs: 'file_path', 'n_open_regions'
    var: 'n_topics', 'chr', 'start', 'end'

The `import_topics` function also adds a few summary columns to `AnnData.obs` and `AnnData.var` (`AnnData.obs.n_open_regions` and `AnnData.var.n_topics`), which you can use to inspect and get a feel for your data.
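
As a rough illustration of what these two columns summarize, the sketch below builds a toy binary topic-by-region matrix and computes per-topic and per-region counts. It assumes that `AnnData.X` holds a binary matrix with one row per topic and one column per region, and that `n_open_regions` and `n_topics` are its row and column sums. Neither assumption is confirmed by the text above, and the variable names are purely illustrative.

```python
import numpy as np

# Hypothetical toy data: 3 topics (rows) x 5 regions (columns).
# A 1 means the region was assigned to that topic.
toy_matrix = np.array(
    [
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 1],
    ]
)

# Per-topic count of regions (assumed analogue of AnnData.obs.n_open_regions)
n_open_regions = toy_matrix.sum(axis=1)

# Per-region count of topics (assumed analogue of AnnData.var.n_topics)
n_topics = toy_matrix.sum(axis=0)

print(n_open_regions)  # [2 2 2]
print(n_topics)        # [3 1 1 0 1]
```

A region whose count is 0 belongs to no topic, which is the kind of pattern these columns make easy to spot.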

To do region-to-topic modelling, we need to add the DNA sequences to our `AnnData` object. We can do this with {func}`enhancerai.pp.add_dna_sequence`, pointing it to a local FASTA file with the `fasta_path=/path/to/local.fasta` argument. Alternatively, we can simply provide the name of a genome, in which case genomepy is used to download a reference genome. The DNA sequences are stored in `AnnData.varm`.
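
Sequence models generally consume DNA as a numeric array rather than as text. The sketch below shows one common encoding, a one-hot matrix with one column per nucleotide. It is an illustration only: the helper `one_hot_encode` and the A, C, G, T column order are assumptions, not the encoding that enhancerai uses internally.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order; a library may use a different one


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) float32 array; unknown bases such as N stay all-zero."""
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = ALPHABET.find(base)
        if index >= 0:
            encoded[position, index] = 1.0
    return encoded


example = one_hot_encode("ACGTN")
print(example.shape)  # (5, 4)
```

Under this encoding, a 500 bp region becomes a 500 × 4 array.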

To train a model, we need to add a *split* column to our dataset, which we can do using {func}`enhancerai.pp.train_val_test_split`.  
We can set a `random_state` so that the data is split the same way in future runs when `shuffle=True` (the default).

In [3]:
# We can split randomly on the regions
enhai.pp.train_val_test_split(
    adata, type="random", val_size=0.1, test_size=0.1, random_state=42
)

# Or, choose the chromosomes for the validation and test sets
# enhai.pp.train_val_test_split(
#     adata, type="chr", chr_val=["chr4", "chrX"], chr_test=["chr2", "chr3"]
# )

print(adata.var["split"].value_counts())
adata.var

split
train    437593
test      54700
val       54700
Name: count, dtype: int64


| region | n_topics | chr | start | end | split |
|---|---|---|---|---|---|
| chr1:3094805-3095305 | 5 | chr1 | 3094805 | 3095305 | train |
| chr1:3095470-3095970 | 0 | chr1 | 3095470 | 3095970 | train |
| chr1:3112174-3112674 | 1 | chr1 | 3112174 | 3112674 | test |
| chr1:3113534-3114034 | 2 | chr1 | 3113534 | 3114034 | train |
| chr1:3119746-3120246 | 8 | chr1 | 3119746 | 3120246 | train |
| ... | ... | ... | ... | ... | ... |
| chrX:169879313-169879813 | 3 | chrX | 169879313 | 169879813 | train |
| chrX:169880181-169880681 | 0 | chrX | 169880181 | 169880681 | train |
| chrX:169925477-169925977 | 1 | chrX | 169925477 | 169925977 | train |
| chrX:169948550-169949050 | 0 | chrX | 169948550 | 169949050 | train |
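
The split sizes printed above are consistent with rounding each fraction up: ceil(0.1 × 546993) = 54700 regions each for validation and test, leaving 437593 for training. The sketch below reproduces that arithmetic and shows why a fixed seed makes a shuffled split repeatable. `random_split_labels` is a hypothetical helper written for this illustration. It is not how `train_val_test_split` is implemented, and it will not assign the same regions to the same sets.

```python
import math
import random
from collections import Counter


def random_split_labels(n_regions, val_size, test_size, random_state):
    """Assign 'train', 'val' or 'test' to each of n_regions indices."""
    n_val = math.ceil(n_regions * val_size)
    n_test = math.ceil(n_regions * test_size)
    indices = list(range(n_regions))
    # A seeded generator yields the same shuffle on every run
    random.Random(random_state).shuffle(indices)
    labels = ["train"] * n_regions
    for i in indices[:n_val]:
        labels[i] = "val"
    for i in indices[n_val:n_val + n_test]:
        labels[i] = "test"
    return labels


labels = random_split_labels(546993, val_size=0.1, test_size=0.1, random_state=42)
counts = Counter(labels)
print(counts["train"], counts["val"], counts["test"])  # 437593 54700 54700
```

Calling the helper again with the same `random_state` returns an identical list of labels, which is the property the `random_state` argument is meant to provide.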


## Train

In [4]:
from enhancerai.tl.zoo import DeepTopicCNN
from enhancerai.tl.dataloaders import AnnDataModule
from enhancerai.tl.tasks import TopicClassification

# Chosen model architecture
architecture = DeepTopicCNN(num_classes=80, seq_len=500)

# Datamodule, containing the train, validation and test dataloaders
datamodule = AnnDataModule(
    adata,
    genome_file="/staging/leuven/res_00001/genomes/10xgenomics/CellRangerARC/refdata-cellranger-arc-mm10-2020-A-2.0.0/fasta/genome.fa",
    batch_size=256,
    num_workers=30,
    in_memory=True,
    random_reverse_complement=True
)

# Task definition (losses, metrics, and how a training step is performed), initialized
# with the chosen model architecture
task = TopicClassification(80, architecture, lr=0.001)

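
The `random_reverse_complement=True` option refers to a common augmentation for DNA models: because regulatory sequence can be read from either strand, a sequence and its reverse complement are treated as carrying the same label. A minimal sketch of that idea follows. The helpers `reverse_complement` and `maybe_reverse_complement`, and the 50% flip probability, are assumptions for illustration rather than the library's implementation.

```python
import random

# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse; other characters (e.g. N) are kept as-is."""
    return sequence.translate(COMPLEMENT)[::-1]


def maybe_reverse_complement(sequence, rng, probability=0.5):
    """Return the reverse complement with the given probability, else the input unchanged."""
    if rng.random() < probability:
        return reverse_complement(sequence)
    return sequence


rng = random.Random(0)
print(reverse_complement("AACGT"))  # ACGTT
augmented = maybe_reverse_complement("AACGT", rng)
```

Applying `reverse_complement` twice returns the original sequence, so the augmentation never changes which region a sequence represents.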


In [5]:
from enhancerai.tl import fit

# Define the Trainer object with run information
fit(task, datamodule, project_name="test-biccn-enhancerai", max_epochs=40)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: lukas-mahieu (lukas-mahieu-vib). Use `wandb login --relogin` to force relogin


wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
/lustre1/project/stg_00002/mambaforge/vsc35862/envs/enhancerai/lib/python3.11/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /lustre1/project/stg_00002/mambaforge/vsc35862/envs/ ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Loading sequences into memory...
Loading sequences into memory...


/lustre1/project/stg_00002/mambaforge/vsc35862/envs/enhancerai/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:653: Checkpoint directory /lustre1/project/stg_00002/lcb/lmahieu/projects/EnhancerAI/docs/notebooks/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type             | Params
---------------------------------------------------
0 | loss          | BCELoss          | 0     
1 | model         | DeepTopicCNN     | 11.7 M
2 | train_metrics | MetricCollection | 0     
3 | val_metrics   | MetricCollection | 0     
4 | test_metrics  | MetricCollection | 0     
---------------------------------------------------
11.7 M    Trainable params
0         Non-trainable params
11.7 M    Total params
46.777    Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]



Epoch 10:   4%|▎         | 60/1710 [00:07<03:18,  8.31it/s, v_num=84qs, train/loss_step=0.133, val/loss=0.134, train/loss_epoch=0.137] 

/lustre1/project/stg_00002/mambaforge/vsc35862/envs/enhancerai/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
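
The model summary above lists `BCELoss` as the loss. Binary cross-entropy fits this task because a region can belong to several topics at once (the `n_topics` column goes above 1), so each topic is scored as an independent yes/no prediction. The sketch below computes that loss by hand for one toy region. The helper `binary_cross_entropy` and the numbers are illustrative and are not taken from the training run.

```python
import numpy as np


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy over all (region, topic) entries."""
    # Clip to avoid log(0) for predictions of exactly 0 or 1
    p = np.clip(np.asarray(predictions, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(targets, dtype=np.float64)
    return float(np.mean(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))))


# One toy region scored against 3 topics; it belongs to topics 0 and 2
targets = [[1, 0, 1]]
confident = [[0.9, 0.1, 0.9]]
uncertain = [[0.5, 0.5, 0.5]]

print(binary_cross_entropy(confident, targets))  # about 0.105
print(binary_cross_entropy(uncertain, targets))  # about 0.693
```

Predictions of 0.5 for every topic give a loss of ln 2 ≈ 0.693, which is a useful reference point when reading loss values from a training log.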


In [7]:
from enhancerai.tl import evaluate, predict
import numpy as np

# Evaluate the model on the test set
evaluate(task, datamodule)

# Predict the labels of the full dataset
results = predict(task, datamodule)
results = np.vstack([x.cpu().numpy() for x in results])
results

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/luna.kuleuven.be/u0166574/miniconda3/envs/enhancerai/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 0: 100%|██████████| 97/97 [00:03<00:00, 24.92it/s]


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 725/725 [00:26<00:00, 27.43it/s]


array([[0.03991739, 0.01989121, 0.9506258 ],
       [0.04678131, 0.01655637, 0.9515035 ],
       [0.26182953, 0.23671702, 0.5713035 ],
       ...,
       [0.22486855, 0.18705001, 0.60592073],
       [0.2437813 , 0.32724276, 0.41594404],
       [0.3834117 , 0.28862736, 0.38204765]], dtype=float32)
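
The cell above converts each element of `results` to a NumPy array and stacks them, which suggests that `predict` returns one tensor per batch. The sketch below shows that stacking step on toy arrays, plus one possible follow-up: picking the highest-scoring topic per region. The arrays are made up, and summarizing with `argmax` is an assumption for illustration, not a step prescribed by enhancerai.

```python
import numpy as np

# Two toy batches of per-topic scores: 2 regions, then 1 region, 3 topics each
batches = [
    np.array([[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]], dtype=np.float32),
    np.array([[0.2, 0.5, 0.3]], dtype=np.float32),
]

# Stack along the first axis to get one row per region
scores = np.vstack(batches)
print(scores.shape)  # (3, 3)

# Index of the highest-scoring topic for each region
top_topic = scores.argmax(axis=1)
print(top_topic)  # [2 0 1]
```

Because a region can belong to several topics, a per-topic threshold may be a more faithful summary than a single `argmax` in practice.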