## Loading Genomic Benchmarks into AWS HealthOmics (Optional)

[AWS HealthOmics](https://aws.amazon.com/healthomics/) is a purpose-built service that helps healthcare and life science organizations and their software partners store, query, and analyze genomic, transcriptomic, and other omics data and then generate insights from that data to improve health. It supports large-scale analysis and collaborative research.

In this scenario, we will be fine-tuning a pre-trained [Caduceus](https://caduceus-dna.github.io/) model for a range of DNA sequence classification tasks published on [Genomic Benchmarks](https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-023-01123-8) (Grešová, K., Martinek, V., Čechák, D. et al., 2023).

The Genomic Benchmarks datasets are publicly available on [Hugging Face](https://huggingface.co/katarinagresova), but you can optionally use this notebook to convert them into FASTQ format and import them into an AWS HealthOmics Sequence Store. In the training phase, you can then choose to read the datasets from this Sequence Store or download them directly from Hugging Face.

Although this use case relies on publicly available data, the HealthOmics integration demonstrates an alternative workflow that may be useful for genomics research institutions that want to train models on their own proprietary DNA sequence datasets.

## 0. Prerequisites

First, create a bucket that you can use for "staging" the FASTQ files before they are imported into HealthOmics.

In [10]:
import sagemaker

account_id = sagemaker.Session().account_id()

S3_BUCKET = f"genomic-benchmarks-staging-{account_id}"

In [9]:
!aws s3api create-bucket --bucket "$S3_BUCKET"

{
    "Location": "/genomic-benchmarks-staging-767398100082"
}
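
Note that outside `us-east-1`, S3 bucket creation requires an explicit location constraint, so the CLI call above may need adjusting for your region. As a rough boto3 equivalent, the sketch below builds the arguments in a small helper; `create_bucket_kwargs` and `create_staging_bucket` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build the arguments for S3 CreateBucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and must not be passed as a constraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_staging_bucket(bucket: str, region: str) -> None:
    """Create the staging bucket in the given region (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above has no dependencies

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))
```

In the notebook this could be called as `create_staging_bucket(S3_BUCKET, boto3.Session().region_name)`, assuming a default region is configured.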


Next, create an IAM service role that `omics.amazonaws.com` can assume in order to read the staged datasets and import them as read sets. See the [User Guide](https://docs.aws.amazon.com/omics/latest/dev/create-reference-store.html#api-create-reference-store) for more details.

Use the following trust policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "omics.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

Then make sure the role has permission to read objects from the bucket you just created, e.g.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<REPLACE_ME_WITH_S3_BUCKET>",
                "arn:aws:s3:::<REPLACE_ME_WITH_S3_BUCKET>/*"
            ]
        }
    ]
}
```

For this notebook, we will assume the role is named `OmicsImportRole`.

In [None]:
# assumes you named the role `OmicsImportRole`
IMPORT_JOB_ROLE_ARN = f"arn:aws:iam::{account_id}:role/OmicsImportRole"
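
If you would rather create the role programmatically than through the console, the following is a minimal sketch using the IAM `create_role` and `put_role_policy` calls. The helper names and the inline policy name `StagingBucketRead` are assumptions made for this example; the policy documents mirror the JSON shown above.

```python
import json


def trust_policy() -> dict:
    """Trust policy that lets the HealthOmics service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "omics.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_policy(bucket: str) -> dict:
    """Read-only access to the staging bucket and its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetBucketLocation"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


def create_import_role(bucket: str, role_name: str = "OmicsImportRole") -> str:
    """Create the role, attach the inline policy, and return the role ARN."""
    import boto3  # imported lazily; requires IAM permissions to run

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="StagingBucketRead",  # hypothetical policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket)),
    )
    return role["Role"]["Arn"]
```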

## 1. Create the HealthOmics Sequence Store

HealthOmics sequence stores allow you to store genomic files in common formats like FASTQ and BAM. In this case, we will be converting the Genomic Benchmark sequences into gzip-compressed FASTQ files.

Before we can import the genomic files (also called read sets), we need to create the sequence store.

In [None]:
SEQUENCE_STORE_NAME = "genomic_benchmarks"

In [1]:
import io
import os

import boto3
from datasets import load_dataset
from scripts.omics_utils import create_fastq_entry, gzip_fileobj

omics = boto3.client("omics")

seq_store_resp = omics.create_sequence_store(
    name=SEQUENCE_STORE_NAME,
    description="Genomic Benchmarks datasets for DNA sequence classification | https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-023-01123-8#citeas",
)
seq_store_id = seq_store_resp["id"]

# take note of this sequence store ID for the next notebook
print(f"Sequence store ID: {seq_store_id}")

Sequence store ID: 9757315158


Take note of the Sequence Store ID, as you'll need it in the next notebook (training the model).
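
If you lose track of the ID, it can be recovered by name. The sketch below assumes the `name` filter of `list_sequence_stores` and pages through results with `nextToken`; `find_sequence_store_id` is a hypothetical helper, not part of the notebook.

```python
def find_sequence_store_id(omics_client, name: str):
    """Return the ID of the first sequence store with exactly this name, or None."""
    kwargs = {"filter": {"name": name}}
    while True:
        resp = omics_client.list_sequence_stores(**kwargs)
        for store in resp.get("sequenceStores", []):
            # compare names explicitly rather than relying on how the filter matches
            if store.get("name") == name:
                return store["id"]
        token = resp.get("nextToken")
        if not token:
            return None
        kwargs["nextToken"] = token
```

For example, `find_sequence_store_id(omics, SEQUENCE_STORE_NAME)` would look up the store created above.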

## 2. Download the Datasets, Convert to FASTQ, and Upload to S3

The following loop downloads each benchmark task's dataset from Hugging Face, converts it into gzipped FASTQ format (with dummy quality scores, since they are not used in the classification task), and uploads it to S3. This produces one file per task per split (train/test).
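
The helpers `create_fastq_entry` and `gzip_fileobj` come from `scripts.omics_utils`, which is not shown in this notebook. To illustrate the idea, here is a sketch of what such helpers might do, assuming the standard four-line FASTQ record and a constant placeholder quality character; the function names below are stand-ins and the real implementations may differ.

```python
import gzip


def fastq_entry_sketch(sequence: str, sequence_id: str, quality_char: str = "I") -> str:
    """Format one four-line FASTQ record with a dummy quality string."""
    # FASTQ requires the quality line to be as long as the sequence line
    quality = quality_char * len(sequence)
    return f"@{sequence_id}\n{sequence}\n+\n{quality}\n"


def gzip_text_sketch(text: str) -> bytes:
    """Gzip-compress FASTQ text into bytes suitable for an S3 upload body."""
    return gzip.compress(text.encode("utf-8"))


entry = fastq_entry_sketch("ACGT", "label_1_idx_0")
compressed = gzip_text_sketch(entry)
```

Encoding the label in the sequence ID, as the loop below does, keeps the class label recoverable from the FASTQ header when the data is read back.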

In [None]:
s3_client = boto3.client("s3")

# loop through all the tasks
# https://huggingface.co/katarinagresova/Genomic_Benchmarks_{task}
tasks = [
    "demo_coding_vs_intergenomic_seqs",
    "demo_human_or_worm",
    "dummy_mouse_enhancers_ensembl",
    "human_enhancers_cohn",
    "human_enhancers_ensembl",
    "human_ensembl_regulatory",
    "human_nontata_promoters",
    "human_ocr_ensembl",
]

sources = []
for task in tasks:
    print(f"Preparing task: {task}")
    s3_prefix = f"genomic_benchmarks/{task}/"
    s3_uris = []
    for split in ["train", "test"]:
        # download the HF dataset and convert to FASTQ
        ds = load_dataset(f"katarinagresova/Genomic_Benchmarks_{task}", split=split)
    
        # build the FASTQ text in one pass instead of repeated string concatenation
        fastq_content = "".join(
            create_fastq_entry(
                sequence=row["seq"],
                sequence_id=f"label_{row['label']}_idx_{idx}",
            )
            for idx, row in enumerate(ds)
        )
        # upload the FASTQ file to S3
        s3_key = os.path.join(s3_prefix, split, f"{split}_combined.fastq.gz")
        s3_client.put_object(
            Bucket=S3_BUCKET,
            Key=s3_key,
            Body=gzip_fileobj(io.StringIO(fastq_content)),
        )
        s3_uris.append(f"s3://{S3_BUCKET}/{s3_key}")
        print(f"Successfully uploaded {s3_key} with {len(ds)} sequences")

    # include the task name in the metadata for the read set
    # useful for filtering when reading data
    sources += [
        {
            "sourceFiles": {"source1": s3_uri},
            "sourceFileType": "FASTQ",
            "subjectId": task,
            "sampleId": task,
        }
        for s3_uri in s3_uris
    ]

## 3. Create a Read Set Import Job

Finally, we can create a Read Set Import job to take the FASTQ files from our staging bucket and add them to the sequence store. Note that each entry in `sources` (built in the loop above) uses the task name as both its `subjectId` and `sampleId`. This will help us retrieve the read sets for each task later, when training the model.

In [None]:
# create a readset from these FASTQ files
import_job_resp = omics.start_read_set_import_job(
    sequenceStoreId=seq_store_id,
    roleArn=IMPORT_JOB_ROLE_ARN,
    sources=sources,
)
import_job_id = import_job_resp["id"]
print(f"Import job ID: {import_job_id}")

waiter = omics.get_waiter('read_set_import_job_completed')
waiter.wait(
    id=import_job_id,
    sequenceStoreId=seq_store_id,
)

print("Job complete")
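
Once the import has finished, the read sets for a given task can be looked up by the `subjectId` set earlier. The following is a minimal sketch, assuming your boto3 version supports the `subjectId` field in the `list_read_sets` filter; `list_read_set_ids` is a hypothetical helper introduced for illustration.

```python
def list_read_set_ids(omics_client, sequence_store_id: str, task: str) -> list:
    """Collect the IDs of all read sets whose subjectId equals the task name."""
    kwargs = {
        "sequenceStoreId": sequence_store_id,
        "filter": {"subjectId": task},
    }
    read_set_ids = []
    while True:
        resp = omics_client.list_read_sets(**kwargs)
        read_set_ids.extend(rs["id"] for rs in resp.get("readSets", []))
        # follow pagination until the service stops returning a token
        token = resp.get("nextToken")
        if not token:
            return read_set_ids
        kwargs["nextToken"] = token
```

For example, `list_read_set_ids(omics, seq_store_id, "human_nontata_promoters")` should return two IDs, one for each split, if the import succeeded.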