Currently, genomic-region operations (including or excluding sequences by chromosome coordinates) are implemented for the Agarwal, AgarwalJoint, Malinois, DeepSTARR, Sure, and Kircher datasets.

# Agarwal dataset

Agarwal data uses the hg38 genome assembly.

In [1]:
import mpramnist
from mpramnist.Agarwal.dataset import AgarwalDataset

## Common use

In [2]:
train = AgarwalDataset(split="train", cell_type="HepG2", root="../data/")
val = AgarwalDataset(split="val", cell_type="HepG2", root="../data/")
test = AgarwalDataset(split="test", cell_type="HepG2", root="../data/")
print(train)
print("------------")
print(val)
print("------------")
print(test)

Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 98336
    Root location: ../data/Agarwal
    Using split: [1, 2, 3, 4, 5, 6, 7, 8]
    Split: {'train': 98336, 'val': 12298, 'test': 12298}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task was to predict a scalar value of regulatory activity for the corresponding cell line.
------------
Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 12292
    Root location: ../data/Agarwal
    Using split: [9]
    Split: {'train': 98336, 'val': 12298, 'test': 12298}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regr

## Exclude genomic regions

In [3]:
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 89032143},
]

test_dataset = AgarwalDataset(
    split="test", cell_type="WTC11", genomic_regions=test_regions, root="../data/"
)

train_dataset = AgarwalDataset(
    split="train",
    cell_type="WTC11",
    genomic_regions=test_regions,
    exclude_regions=True,
    root="../data/",
)

In [4]:
print(train_dataset)
print("===================")
print(test_dataset)

Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 41878
    Root location: ../data/Agarwal
    Using split: genomic region
    Split: {'train': 36946, 'val': 4622, 'test': 4622}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task was to predict a scalar value of regulatory activity for the corresponding cell line.
Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 4307
    Root location: ../data/Agarwal
    Using split: genomic region
    Split: {'train': 36946, 'val': 4622, 'test': 4622}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task was t
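
The counts above are consistent with the two selections partitioning the WTC11 data: 41878 + 4307 = 46185, the WTC11 total given in the description. The sketch below illustrates that include/exclude behaviour on toy records in plain Python. It is not mpramnist's implementation: `overlaps_any`, `select`, the record fields, and the choice of an overlap test with inclusive ends are all assumptions made for illustration.

```python
def overlaps_any(record, regions):
    """True if the record's interval overlaps any region on the same chromosome.

    Assumption: both ends are inclusive and any overlap counts as a match.
    """
    return any(
        region["chrom"] == record["chrom"]
        and record["start"] <= region["end"]
        and record["end"] >= region["start"]
        for region in regions
    )


def select(records, regions, exclude_regions=False):
    """Keep records inside the regions, or outside them when exclude_regions=True."""
    return [rec for rec in records if overlaps_any(rec, regions) != exclude_regions]


# Toy records standing in for dataset rows (hypothetical coordinates).
records = [
    {"chrom": "1", "start": 100, "end": 300},
    {"chrom": "1", "start": 90000000, "end": 90000200},
    {"chrom": "10", "start": 5000, "end": 5200},
    {"chrom": "7", "start": 42, "end": 242},
]
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 89032143},
]

held_out = select(records, test_regions)
remaining = select(records, test_regions, exclude_regions=True)
print(len(held_out), len(remaining))
```

Under these assumptions every record lands in exactly one of the two lists, which is the property that makes the same region list usable for both the test and the train call.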

## Include genomic regions

In [5]:
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 1000000000},
    {"chrom": "20", "start": 1, "end": 1000000000},
]
val_regions = [
    {"chrom": "2", "start": 1, "end": 1000000000},
    {"chrom": "3", "start": 1, "end": 1000000000},
    {"chrom": "4", "start": 1, "end": 1000000000},
    {"chrom": "Y", "start": 1, "end": 1000000000},
]
train_regions = [
    {"chrom": "5", "start": 1, "end": 1000000000},
    {"chrom": "6", "start": 1, "end": 1000000000},
    {"chrom": "7", "start": 1, "end": 1000000000},
    {"chrom": "X", "start": 1, "end": 1000000000},
]

test = AgarwalDataset(
    split="test", cell_type="K562", genomic_regions=test_regions, root="../data/"
)

val = AgarwalDataset(
    split="val", cell_type="K562", genomic_regions=val_regions, root="../data/"
)

train = AgarwalDataset(
    split="train", cell_type="K562", genomic_regions=train_regions, root="../data/"
)

In [6]:
print(train)
print("===================")
print(val)
print("===================")
print(test)

Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 22044
    Root location: ../data/Agarwal
    Using split: genomic region
    Split: {'train': 167328, 'val': 19670, 'test': 19670}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task was to predict a scalar value of regulatory activity for the corresponding cell line.
Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 4674
    Root location: ../data/Agarwal
    Using split: genomic region
    Split: {'train': 167328, 'val': 19670, 'test': 19670}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task
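
When separate region lists are passed for train, val, and test, it may be worth checking that the lists do not overlap before constructing the datasets. The helper below is a library-independent sketch of such a check. The names `whole_chromosomes`, `overlaps`, and `find_conflicts` are hypothetical, and interval ends are assumed to be inclusive.

```python
from itertools import combinations

CHROM_END = 1_000_000_000  # same "whole chromosome" upper bound as in the cell above


def whole_chromosomes(names):
    """Build region dicts that cover entire chromosomes."""
    return [{"chrom": name, "start": 1, "end": CHROM_END} for name in names]


def overlaps(a, b):
    """True if two regions share a chromosome and at least one position."""
    return a["chrom"] == b["chrom"] and a["start"] <= b["end"] and b["start"] <= a["end"]


def find_conflicts(splits):
    """Return (split_a, split_b, chrom) for every overlapping region pair across splits."""
    conflicts = []
    for (name_a, regions_a), (name_b, regions_b) in combinations(splits.items(), 2):
        for a in regions_a:
            for b in regions_b:
                if overlaps(a, b):
                    conflicts.append((name_a, name_b, a["chrom"]))
    return conflicts


splits = {
    "train": whole_chromosomes(["5", "6", "7", "X"]),
    "val": whole_chromosomes(["2", "3", "4", "Y"]),
    "test": whole_chromosomes(["10", "1", "20"]),
}
conflicts = find_conflicts(splits)
print(conflicts)  # an empty list means no region is shared between splits
```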

## Load regions from a BED file

In [7]:
test_regions = ["10\t1\t1000000000", "1\t1\t1000000000", "20\t1\t1000000000"]

with open("test_regions.bed", "w") as f:
    for region in test_regions:
        f.write(region + "\n")
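
For reference, the three tab-separated columns written above map directly onto the `chrom`/`start`/`end` keys used in the earlier cells. The sketch below reads such a file back into that list-of-dicts form. `read_bed` is a hypothetical helper, not part of mpramnist, and it copies the coordinates as written; BED files are by convention 0-based and half-open, and whether the library applies any conversion is not shown here.

```python
import os
import tempfile


def read_bed(path):
    """Read the first three BED columns into a list of region dicts."""
    regions = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the optional BED header/comment lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split("\t")[:3]
            regions.append({"chrom": chrom, "start": int(start), "end": int(end)})
    return regions


# Round-trip the same three lines as in the cell above, using a temporary file.
bed_lines = ["10\t1\t1000000000", "1\t1\t1000000000", "20\t1\t1000000000"]
with tempfile.TemporaryDirectory() as tmp_dir:
    bed_path = os.path.join(tmp_dir, "test_regions.bed")
    with open(bed_path, "w") as handle:
        handle.write("\n".join(bed_lines) + "\n")
    regions = read_bed(bed_path)

print(regions)
```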

In [8]:
test_dataset = AgarwalDataset(
    split="test",
    cell_type="HepG2",
    genomic_regions="./test_regions.bed",
    root="../data/",
)
train_dataset = AgarwalDataset(
    split="train",
    cell_type="HepG2",
    genomic_regions="./test_regions.bed",
    exclude_regions=True,
    root="../data/",
)

In [9]:
print(train_dataset)
print("===================")
print(test_dataset)

Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 99736
    Root location: ../data/Agarwal
    Using split: genomic region
    Split: {'train': 98336, 'val': 12298, 'test': 12298}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task was to predict a scalar value of regulatory activity for the corresponding cell line.
Dataset AgarwalDataset (MpraDaraset)
    Number of datapoints: 23190
    Root location: ../data/Agarwal
    Using split: genomic region
    Split: {'train': 98336, 'val': 12298, 'test': 12298}
    Task: Regression
    Description: The Agarwal dataset is based on a lentiviral MPRA system. The total data volume was 122,926 sequences for the HepG2 cell line, 196,664 for K562, and 46,185 for WTC11. Each sequence is 200 nucleotides long. The regression task 

# AgarwalJoint dataset

AgarwalJoint data uses the hg38 genome assembly.

In [10]:
import mpramnist
from mpramnist.AgarwalJoint.dataset import AgarwalJointDataset

In [11]:
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 1000000000},
    {"chrom": "20", "start": 1, "end": 1000000000},
]

test_dataset = AgarwalJointDataset(
    split="test",
    cell_type=["HepG2", "K562", "WTC11"],
    genomic_regions=test_regions,
    root="../data/",
)

train_dataset = AgarwalJointDataset(
    split="train",
    cell_type=["HepG2", "K562", "WTC11"],
    genomic_regions=test_regions,
    exclude_regions=True,
    root="../data/",
)

In [12]:
print(train_dataset)
print("------------")
print(test_dataset)

Dataset AgarwalJointDataset (MpraDaraset)
    Number of datapoints: 45527
    Root location: ../data/AgarwalJoint
    Using split: genomic region
    Split: {'train': 36946, 'val': 4622, 'test': 4622}
    Task: Regression
    Description: The AgarwalJoint dataset contains 55,338 sequences, each 200 nucleotides long, comprising enhancers from the HepG2, K562, and WTC11 cell lines. The regression task is to predict three scalar values of regulatory activity for each of these cell lines.
------------
Dataset AgarwalJointDataset (MpraDaraset)
    Number of datapoints: 9811
    Root location: ../data/AgarwalJoint
    Using split: genomic region
    Split: {'train': 36946, 'val': 4622, 'test': 4622}
    Task: Regression
    Description: The AgarwalJoint dataset contains 55,338 sequences, each 200 nucleotides long, comprising enhancers from the HepG2, K562, and WTC11 cell lines. The regression task is to predict three scalar values of regulatory activity for each of these cell lines.


# Malinois dataset

Malinois data uses the hg19 genome assembly.

In [13]:
import mpramnist
from mpramnist.Malinois.dataset import MalinoisDataset

In [14]:
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 1000000000},
    {"chrom": "X", "start": 1, "end": 1000000000},
]
train_dataset = MalinoisDataset(
    split="train",
    filtration="none",
    root="../data/",
    genomic_regions=test_regions,
    exclude_regions=True,
)
test_dataset = MalinoisDataset(
    split="test", filtration="none", root="../data/", genomic_regions=test_regions
)

In [15]:
print(train_dataset)
print("------------")
print(test_dataset)

Dataset MalinoisDataset (MpraDaraset)
    Number of datapoints: 665160
    Root location: ../data/Malinois
    Using split: genomic region
    Split: {'train': 668946, 'val': 58809, 'test': 62582}
    Task: Regression
    Description: The Malinois dataset includes 798,064 sequences tested in the K562, HepG2, and SK-N-SH cell lines. The original sequence length is approximately 200 nucleotides, and it is recommended to extend them to 600 bp. The task is to predict three normalized regulatory activity values for the respective cell lines.
------------
Dataset MalinoisDataset (MpraDaraset)
    Number of datapoints: 118440
    Root location: ../data/Malinois
    Using split: genomic region
    Split: {'train': 668946, 'val': 58809, 'test': 62582}
    Task: Regression
    Description: The Malinois dataset includes 798,064 sequences tested in the K562, HepG2, and SK-N-SH cell lines. The original sequence length is approximately 200 nucleotides, and it is recommended to extend them to 600 bp.

In [16]:
train_dataset = MalinoisDataset(
    split="train",
    filtration="none",
    root="../data/",
    genomic_regions="./test_regions.bed",
    exclude_regions=True,
)
test_dataset = MalinoisDataset(
    split="test",
    filtration="none",
    root="../data/",
    genomic_regions="./test_regions.bed",
)

In [17]:
print(train_dataset)
print("------------")
print(test_dataset)

Dataset MalinoisDataset (MpraDaraset)
    Number of datapoints: 654074
    Root location: ../data/Malinois
    Using split: genomic region
    Split: {'train': 668946, 'val': 58809, 'test': 62582}
    Task: Regression
    Description: The Malinois dataset includes 798,064 sequences tested in the K562, HepG2, and SK-N-SH cell lines. The original sequence length is approximately 200 nucleotides, and it is recommended to extend them to 600 bp. The task is to predict three normalized regulatory activity values for the respective cell lines.
------------
Dataset MalinoisDataset (MpraDaraset)
    Number of datapoints: 129526
    Root location: ../data/Malinois
    Using split: genomic region
    Split: {'train': 668946, 'val': 58809, 'test': 62582}
    Task: Regression
    Description: The Malinois dataset includes 798,064 sequences tested in the K562, HepG2, and SK-N-SH cell lines. The original sequence length is approximately 200 nucleotides, and it is recommended to extend them to 600 bp.

# DeepSTARR dataset

DeepSTARR data uses the BDGP R5/dm3 genome assembly.

In [18]:
import mpramnist
from mpramnist.DeepStarr.dataset import DeepStarrDataset

In [19]:
val_regions = [
    {"chrom": "2R", "start": 1, "end": 21144049 // 2},
]
test_regions = [
    {"chrom": "2R", "start": 21144049 // 2, "end": 21144049},
]
val_test = val_regions + test_regions
train_dataset = DeepStarrDataset(
    split="train",
    root="../data/",
    use_original_reverse_complement=True,
    genomic_regions=val_test,
    exclude_regions=True,
)
val_dataset = DeepStarrDataset(
    split="val", root="../data/", genomic_regions=val_regions
)
test_dataset = DeepStarrDataset(
    split="test", root="../data/", genomic_regions=test_regions
)

In [20]:
print(train_dataset)
print("===================")
print(val_dataset)
print("===================")
print(test_dataset)

Dataset DeepStarrDataset (MpraDaraset)
    Number of datapoints: 402322
    Root location: ../data/DeepStarr
    Using split: genomic region
    Split: {'train': 201139, 'val': 40570, 'test': 3978}
    Task: Regression
    Description: The DeepSTARR data contains information on enhancer activity in D. melanogaster S2 cells. The dataset includes 282,892 sequences, each 249 nucleotides in length. The regression task is to predict two independent values of enhancer activity for two types of promoters.
Dataset DeepStarrDataset (MpraDaraset)
    Number of datapoints: 40562
    Root location: ../data/DeepStarr
    Using split: genomic region
    Split: {'train': 201139, 'val': 40570, 'test': 3978}
    Task: Regression
    Description: The DeepSTARR data contains information on enhancer activity in D. melanogaster S2 cells. The dataset includes 282,892 sequences, each 249 nucleotides in length. The regression task is to predict two independent values of enhancer activity for two types of prom
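
In the region definitions above, the validation region ends and the test region starts at the same midpoint coordinate. Whether a sequence at that position can match both regions depends on how the library treats interval ends, which is not shown here. The sketch below sidesteps the question by building two adjacent halves that do not overlap if ends are inclusive. `split_in_half` is a hypothetical helper, and the inclusive-end convention is an assumption.

```python
def split_in_half(chrom, length):
    """Split [1, length] into two adjacent, non-overlapping inclusive regions."""
    mid = length // 2
    first = {"chrom": chrom, "start": 1, "end": mid}
    second = {"chrom": chrom, "start": mid + 1, "end": length}
    return first, second


# Same chromosome name and end coordinate as in the cell above.
val_region, test_region = split_in_half("2R", 21144049)
val_regions = [val_region]
test_regions = [test_region]
val_test = val_regions + test_regions

print(val_region["end"], test_region["start"])
```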

# Sure dataset

In [1]:
import mpramnist
from mpramnist.Sure.dataset import SureDataset

In [8]:
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 1000000000},
    {"chrom": "X", "start": 1, "end": 1000000000},
]

GENOME_IDS = SureDataset.GENOME_IDS
task = "regression"

train_dataset = SureDataset(
    task=task,
    genome_id=GENOME_IDS[2],
    split="train",
    root="../data/",
    genomic_regions=test_regions,
    exclude_regions=True,
) 

test_dataset = SureDataset(
    task=task,
    genome_id=GENOME_IDS[2],
    split="test",
    root="../data/",
    genomic_regions=test_regions,
) 

In [9]:
print(train_dataset)
print("===================")
print(test_dataset)

Dataset SureDataset (MpraDaraset)
    Number of datapoints: 473375
    Root location: ../data/Sure
    Using split: genomic region
    Split: {'train': 446214, 'val': 62580, 'test': 62639}
    Task: Regression or Multi-class
    Description: The Sure dataset contains data from the analysis of genomes from 4 individuals. Experiments were conducted on the K562 and HepG2 cell lines. The dataset volume is ~400–600 thousand training and ~50–70 thousand test and validation sequences per individual. The classification task is to predict a class label reflecting the expression level for each cell line. The regression task is to predict a continuous value of the expression level for K562 and HepG2.
Dataset SureDataset (MpraDaraset)
    Number of datapoints: 98058
    Root location: ../data/Sure
    Using split: genomic region
    Split: {'train': 446214, 'val': 62580, 'test': 62639}
    Task: Regression or Multi-class
    Description: The Sure dataset contains data from the analysis of genomes 

# Kircher dataset

Kircher data uses the hg38 genome assembly.

In [24]:
import mpramnist
from mpramnist.Kircher.dataset import KircherDataset

In [25]:
test_regions = [
    {"chrom": "10", "start": 1, "end": 1000000000},
    {"chrom": "1", "start": 1, "end": 1000000000},
]

In [26]:
kircher = KircherDataset(
    length=200, 
    genomic_regions=test_regions,
    root="../data/"
)
kircher

Dataset KircherDataset (MpraDaraset)
    Number of datapoints: 17266
    Root location: ../data/Kircher
    Using split: test
    Split: {'BCL11A': 2062, 'FOXE1': 2048, 'MSMB': 2019, 'ZFAND3': 2012, 'SORT1.2': 1998, 'SORT1-flip': 1974, 'UC88': 1964, 'RET': 1962, 'MYCrs6983267': 1950, 'SORT1': 1927, 'TCF7L2': 1910, 'IRF6': 1906, 'PKLR-48h': 1794, 'PKLR-24h': 1776, 'MYCrs11986220': 1681, 'ZRSh-13h2': 1662, 'ZRSh-13': 1661, 'IRF4': 1510, 'GP1BA': 1268, 'LDLR.2': 1093, 'LDLR': 1083, 'F9': 984, 'HNF4A': 977, 'TERT-HEK': 975, 'TERT-GAa': 974, 'TERT-GSc': 974, 'TERT-GBM': 973, 'HBG1': 907, 'HBB': 623}
    Task: Regression
    Description: The Kircher dataset is based on MPRA saturation mutagenesis results and includes 44,647 variants of regulatory elements tested in various cell lines. These data are proposed to be used as a benchmark set for evaluating the quality of machine learning models. The regression task involved predicting a scalar value of regulatory activity for the corresponding c

In [27]:
kircher = KircherDataset(
    length=200, 
    genomic_regions=test_regions,
    exclude_regions=True,
    root="../data/"
)
kircher

Dataset KircherDataset (MpraDaraset)
    Number of datapoints: 27381
    Root location: ../data/Kircher
    Using split: test
    Split: {'BCL11A': 2062, 'FOXE1': 2048, 'MSMB': 2019, 'ZFAND3': 2012, 'SORT1.2': 1998, 'SORT1-flip': 1974, 'UC88': 1964, 'RET': 1962, 'MYCrs6983267': 1950, 'SORT1': 1927, 'TCF7L2': 1910, 'IRF6': 1906, 'PKLR-48h': 1794, 'PKLR-24h': 1776, 'MYCrs11986220': 1681, 'ZRSh-13h2': 1662, 'ZRSh-13': 1661, 'IRF4': 1510, 'GP1BA': 1268, 'LDLR.2': 1093, 'LDLR': 1083, 'F9': 984, 'HNF4A': 977, 'TERT-HEK': 975, 'TERT-GAa': 974, 'TERT-GSc': 974, 'TERT-GBM': 973, 'HBG1': 907, 'HBB': 623}
    Task: Regression
    Description: The Kircher dataset is based on MPRA saturation mutagenesis results and includes 44,647 variants of regulatory elements tested in various cell lines. These data are proposed to be used as a benchmark set for evaluating the quality of machine learning models. The regression task involved predicting a scalar value of regulatory activity for the corresponding c