### Notebook adapted from https://github.com/aws-samples/amazon-omics-tutorials/tree/main/notebooks on 2024-09-18

# Running R2R (Ready to Run) scRNA-Seq processing with STARsolo workflow

The scRNAseq with STARsolo workflow is based on the nf-core/scrnaseq pipeline.


![scRNA seq data flow](./images/scrnaseq_dataflow.jpg)

![R2R scRNA STARsolo](./images/r2r_scrnastarsolo.png)

This workflow uses STARsolo to analyze droplet-based single-cell RNA sequencing data. It takes raw FASTQ read files and performs the following operations:

- Error correction and demultiplexing of cell barcodes using the default 10x whitelist
- Mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
- Error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIs)
- Quantification of per-cell gene expression by counting the number of reads per gene
- Quantification of other transcriptomic features
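
To make the UMI collapsing and counting steps above more concrete, here is a toy sketch in plain Python. It is not STARsolo's actual algorithm (which also error-corrects barcodes and UMIs); the record layout and the `count_umis` name are assumptions made only for illustration. Reads that share the same cell barcode, gene, and UMI are counted once.

```python
from collections import defaultdict

def count_umis(reads):
    """Count unique UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples; this
    layout is hypothetical and chosen only for the example.
    """
    seen = defaultdict(set)
    for barcode, umi, gene in reads:
        # Duplicate reads with the same UMI collapse into one set entry.
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

reads = [
    ("AAAC", "TTG", "GAPDH"),
    ("AAAC", "TTG", "GAPDH"),  # duplicate: same barcode, UMI, and gene
    ("AAAC", "CCA", "GAPDH"),
    ("GGGT", "TTG", "ACTB"),
]
counts = count_umis(reads)
print(counts)
```

Four reads go in, but the first cell ends up with a count of 2 for its gene because the duplicated UMI is collapsed.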

## Prerequisites
### Python requirements
* Python >= 3.8
* Packages:
  * boto3 >= 1.26.19
  * botocore >= 1.29.19
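
The package minimums above can be checked programmatically. The following is a minimal sketch; the helper names (`parse_version`, `meets_minimum`, `check_packages`) are hypothetical, and it assumes purely numeric version strings such as `1.26.19`.

```python
def parse_version(version):
    """Turn '1.26.19' into (1, 26, 19) so versions compare as tuples."""
    return tuple(int(part) for part in version.split(".")[:3])

def meets_minimum(installed, minimum):
    return parse_version(installed) >= parse_version(minimum)

# Minimums taken from the list above.
MINIMUMS = {"boto3": "1.26.19", "botocore": "1.29.19"}

def check_packages(installed_versions):
    """Return the names of packages that are older than required."""
    return [name for name, minimum in MINIMUMS.items()
            if not meets_minimum(installed_versions[name], minimum)]

# Example with made-up installed versions.
print(check_packages({"boto3": "1.35.0", "botocore": "1.28.0"}))
```

In the notebook itself, the installed versions can be supplied from `boto3.__version__` and `botocore.__version__`.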

### AWS requirements

#### AWS CLI
You will need the AWS CLI installed and configured in your environment. Supported AWS CLI versions are:

* AWS CLI v2 >= 2.9.3 (Recommended)
* AWS CLI v1 >= 1.27.19

#### Output buckets
You will need an S3 bucket **in the same region** in which you are running this tutorial, to store workflow outputs.
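
One way to check the region requirement before starting a run is to compare the bucket's location with the session region. This is a minimal sketch, assuming the `LocationConstraint` value has already been fetched with `boto3.client('s3').get_bucket_location(Bucket=...)`; the helper names `bucket_region` and `same_region` are hypothetical.

```python
def bucket_region(location_constraint):
    """Map an S3 LocationConstraint value to a region name."""
    # S3 reports buckets in us-east-1 with an empty (None) constraint.
    if not location_constraint:
        return "us-east-1"
    # Legacy value that can still be reported for older eu-west-1 buckets.
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint

def same_region(location_constraint, session_region):
    """True when the bucket lives in the region the notebook runs in."""
    return bucket_region(location_constraint) == session_region

print(same_region("us-west-2", "us-west-2"))
print(same_region(None, "us-west-2"))
```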

## Policy setup
This notebook runs under the role that was created or selected during notebook creation.<br>
Execute the following code snippet to check the role name.

In [4]:
import boto3
boto3.client('sts').get_caller_identity()['Arn']

'arn:aws:sts::851725420776:assumed-role/AmazonSageMakerServiceCatalogProductsUseRole/SageMaker'

We need to enrich this role with policy permissions, so that actions executed in upcoming statements do not fail.<br>
Here is a sample policy that can be added to the role. Note that this is only a sample policy, sufficient for the needs of this project.

In a production environment, the actual policy should follow the principle of least privilege.

In [None]:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetPolicy",
                "iam:CreatePolicy",
                "iam:DeletePolicy",
                "iam:ListPolicyVersions",
                "iam:ListEntitiesForPolicy",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:DeletePolicyVersion",
                "iam:AttachRolePolicy",
                "iam:DetachRolePolicy",
                "iam:ListAttachedRolePolicies",
                "iam:PassRole",
                "omics:*"
            ],
            "Resource": "*"
        }
    ]
}
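
As a first step toward least privilege, it can help to list where a policy uses wildcards. The sketch below is only an illustration: `find_wildcards` is a hypothetical helper, not part of boto3 or IAM, and the sample document is an abbreviated copy of the policy above.

```python
def find_wildcards(policy):
    """Return (sid, field, value) for every wildcard in Allow statements."""
    findings = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        sid = statement.get("Sid", "<no Sid>")
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            # IAM accepts either a single string or a list here.
            if isinstance(values, str):
                values = [values]
            findings += [(sid, field, v) for v in values if "*" in v]
    return findings

sample = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": ["iam:GetPolicy", "iam:PassRole", "omics:*"],
        "Resource": "*",
    }],
}
findings = find_wildcards(sample)
print(findings)
```

Each finding is a candidate for narrowing, for example by replacing `"Resource": "*"` with the ARNs of the specific roles and policies this tutorial creates.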

## Environment setup

Reset the environment, in case you are re-running this tutorial.

In [5]:
%reset -f

Load helper functions from helper notebook.

In [6]:
%run 200-omics_helper_functions.ipynb

Import libraries

In [7]:
import boto3
from urllib.parse import urlparse

## Create a service IAM role
To use Amazon Omics, you need to create an IAM role that grants the Omics service permissions to access resources in your account. We'll do this below using the IAM client.

> **Note**: this step is fully automated from the Omics Workflows Console when you create a run

In [8]:
omics_role_name = 'omics-r2r-tutorial-service-role'
omics_role_trust_policy =  {
        "Version": "2012-10-17",
        "Statement": [{
            "Principal": {
                "Service": "omics.amazonaws.com"
            },
            "Effect": "Allow",
            "Action": "sts:AssumeRole"
        }]
    }

# delete role (if it exists) and create a new one
omics_role = omics_helper_recreate_role(omics_role_name, omics_role_trust_policy)

In [9]:
omics_role

{'Role': {'Path': '/',
  'RoleName': 'omics-r2r-tutorial-service-role',
  'RoleId': 'AROA4MTWKRDUHMQUTRLMG',
  'Arn': 'arn:aws:iam::851725420776:role/omics-r2r-tutorial-service-role',
  'CreateDate': datetime.datetime(2024, 9, 27, 19, 8, 56, tzinfo=tzlocal()),
  'AssumeRolePolicyDocument': {'Version': '2012-10-17',
   'Statement': [{'Principal': {'Service': 'omics.amazonaws.com'},
     'Effect': 'Allow',
     'Action': 'sts:AssumeRole'}]}},
 'ResponseMetadata': {'RequestId': '4543c371-7091-42c9-a5c9-ed015560427a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 27 Sep 2024 19:08:55 GMT',
   'x-amzn-requestid': '4543c371-7091-42c9-a5c9-ed015560427a',
   'content-type': 'text/xml',
   'content-length': '815'},
  'RetryAttempts': 0}}

After creating the role, we next need to add policies to grant permissions. In this case, we are allowing read/write access to all S3 buckets in the account. This is fine for this tutorial, but in a real-world setting you will want to scope this down to only the necessary resources. We are also adding permissions to create CloudWatch Logs, which is where any outputs sent to `STDOUT` or `STDERR` are collected.

In [10]:
s3_policy_name = "omics-r2r-tutorial-s3-access-policy"
s3_policy_permissions = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:Get*",
                    "s3:List*",
                ],
                "Resource": [
                    "arn:aws:s3:::*/*"
                ]
            }
        ]
    }

AWS_ACCOUNT_ID = boto3.client('sts').get_caller_identity()['Account']

logs_policy_name = "omics-r2r-tutorial-logs-access-policy"
logs_policy_permissions = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup"
                ],
                "Resource": [
                    f"arn:aws:logs:*:{AWS_ACCOUNT_ID}:log-group:/aws/omics/WorkflowLog:*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogStreams",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": [
                    f"arn:aws:logs:*:{AWS_ACCOUNT_ID}:log-group:/aws/omics/WorkflowLog:log-stream:*"
                ]
            }
        ]
    }

s3_policy = omics_helper_recreate_policy(s3_policy_name, s3_policy_permissions)
logs_policy = omics_helper_recreate_policy(logs_policy_name, logs_policy_permissions)

# attach policies to role
iam_client = boto3.client("iam")
iam_client.attach_role_policy(RoleName=omics_role['Role']['RoleName'], PolicyArn=s3_policy['Policy']['Arn'])
iam_client.attach_role_policy(RoleName=omics_role['Role']['RoleName'], PolicyArn=logs_policy['Policy']['Arn'])

{'ResponseMetadata': {'RequestId': '4c1ff52b-48b9-4830-9bae-69f9594214a1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 27 Sep 2024 19:09:04 GMT',
   'x-amzn-requestid': '4c1ff52b-48b9-4830-9bae-69f9594214a1',
   'content-type': 'text/xml',
   'content-length': '212'},
  'RetryAttempts': 0}}
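
The S3 policy above grants access to every bucket in the account. As a sketch of how it could be scoped down, the hypothetical helper below builds a policy limited to one input bucket and one output location. Whether this exact set of actions is sufficient for a given workflow is an assumption that should be verified before use. Note that bucket-level actions such as `s3:ListBucket` apply to the bucket ARN, while object-level actions apply to `bucket/*`.

```python
def scoped_s3_policy(input_bucket, output_bucket, output_prefix=""):
    """Build an S3 policy limited to the given buckets (illustrative only)."""
    prefix = output_prefix.strip("/")
    output_objects = (f"arn:aws:s3:::{output_bucket}/{prefix}/*" if prefix
                      else f"arn:aws:s3:::{output_bucket}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # read the input objects
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{input_bucket}/*"],
            },
            {   # listing is a bucket-level action
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{input_bucket}",
                             f"arn:aws:s3:::{output_bucket}"],
            },
            {   # write results only under the output prefix
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [output_objects],
            },
        ],
    }

policy = scoped_s3_policy("omics-us-west-2", "ready2runtestoutput", "run_results")
print(policy["Statement"][2]["Resource"])
```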

## Selecting the STARsolo workflow

In [11]:
omics_client = boto3.client('omics')

r2r_workflows = omics_client.list_workflows(type="READY2RUN")
r2r_workflows_items = r2r_workflows['items']

workflow = [r2r_workflow_item for r2r_workflow_item in r2r_workflows_items if r2r_workflow_item["id"] == "2174942" ][0]
omics_helper_pretty_print(workflow)

{
  "arn": "arn:aws:omics:us-west-2::workflow/2174942",
  "id": "2174942",
  "name": "scRNAseq with STARsolo",
  "status": "ACTIVE",
  "type": "READY2RUN",
  "creationTime": "2023-05-15 00:00:00+00:00",
  "metadata": {
    "licensed": "false",
    "publisher": "nf-core",
    "estimatedDuration": "150",
    "version": "1.0"
  }
}


We get the full details of the specific workflow, in order to examine its parameter template.

In [12]:
workflow_details_parameterTemplate = omics_client.get_workflow(id=workflow['id'], type="READY2RUN")['parameterTemplate']
omics_helper_pretty_print(workflow_details_parameterTemplate)

{
  "samplename": {
    "description": "A string representing the name of the sample."
  },
  "input": {
    "description": "An array of maps, with each element containing two fields: fastq_1 and fastq_2. fastq_1 points to the S3 or Omics Storage URI containing the forward read of paired-end sequencing. fastq_2 points to the S3 or Omics Storage URI containing the reverse read of paired-end sequencing."
  },
  "protocol": {
    "description": "A string representing the 10X Protocol used (case sensitive): 10XV2, 10XV3."
  }
}


The workflow has three parameters, each described in the output above.<br>
We can now run it like any other workflow through Amazon Omics.
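
Before starting a run, the parameters can be checked locally against the template. The sketch below is hypothetical: `validate_parameters` is not an Omics API, and it assumes that all three template parameters are required and that the protocol must be one of the two values named in the template description.

```python
ALLOWED_PROTOCOLS = {"10XV2", "10XV3"}  # from the template description (case sensitive)

def validate_parameters(parameters, template):
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing parameter: {name}"
                for name in sorted(set(template) - set(parameters))]
    problems += [f"unknown parameter: {name}"
                 for name in sorted(set(parameters) - set(template))]
    protocol = parameters.get("protocol")
    if protocol is not None and protocol not in ALLOWED_PROTOCOLS:
        problems.append(f"unsupported protocol: {protocol!r}")
    # Each input element is expected to carry both reads of the pair.
    for index, pair in enumerate(parameters.get("input", [])):
        for key in ("fastq_1", "fastq_2"):
            if key not in pair:
                problems.append(f"input[{index}] is missing {key}")
    return problems

template = {"samplename": {}, "input": {}, "protocol": {}}
parameters = {
    "samplename": "20k_NSCLC_DTC",
    "input": [{"fastq_1": "s3://example-bucket/R1.fastq.gz"}],  # made-up URI
    "protocol": "10xv3",  # wrong case on purpose
}
problems = validate_parameters(parameters, template)
print(problems)
```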

## Executing the STARsolo workflow
Before starting the run, we get the region in which this notebook is operating.<br>
We will use the region name to compose the name of the regional S3 bucket that holds the input test data for this workflow.

In [8]:
region_name = boto3.Session().region_name
print(region_name)

us-west-2


In [9]:
sample_name = "20k_NSCLC_DTC"
input_fastq1_path_uri = f"s3://omics-{region_name}/sample-inputs/2174942/20k_NSCLC_DTC_3p_nextgem_gex_S4_L001_R1_001.fastq.gz"
input_fastq2_path_uri = f"s3://omics-{region_name}/sample-inputs/2174942/20k_NSCLC_DTC_3p_nextgem_gex_S4_L001_R2_001.fastq.gz"
protocol = "10XV3"

# replace with a bucket you own in the same region as this notebook
output_uri = "s3://ready2runtestoutput/run_results"

run = omics_client.start_run(
    workflowId=workflow['id'],
    workflowType='READY2RUN',
    name="2174942 R2R workflow run",
    roleArn=omics_role['Role']['Arn'],
    parameters={
        "samplename": sample_name,
        "input": [{'fastq_1': input_fastq1_path_uri,
                    'fastq_2': input_fastq2_path_uri}],
        "protocol": protocol
    },
    outputUri=output_uri,
)

print(f"running workflow {workflow['id']}, starting run {run['id']}")

running workflow 2174942, starting run 2542738
run 2542738 is running
Waiter RunCompleted failed: Max attempts exceeded. Previously accepted state: For expression "status" we matched expected path: "RUNNING"


In [None]:
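import botocore.exceptions

try:
    # poll every 30 s, up to 60 times, until the run reaches RUNNING
    waiter = omics_client.get_waiter('run_running')
    waiter.wait(id=run['id'], WaiterConfig={'Delay': 30, 'MaxAttempts': 60})

    print(f"run {run['id']} is running")

    # poll every 60 s, up to 120 times, until the run completes
    waiter = omics_client.get_waiter('run_completed')
    waiter.wait(id=run['id'], WaiterConfig={'Delay': 60, 'MaxAttempts': 60*2})

    print(f"run {run['id']} completed")
except botocore.exceptions.WaiterError as e:
    # raised when the run fails or the waiter runs out of attempts
    print(e)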
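
A waiter polls once every `Delay` seconds and gives up after `MaxAttempts` polls, at which point it raises a `WaiterError` with a "Max attempts exceeded" message. The small sketch below shows the arithmetic for sizing these two settings; the function names are hypothetical, and the totals are approximate because the time spent in each API call is ignored.

```python
import math

def max_wait_seconds(delay, max_attempts):
    """Approximate upper bound on how long a waiter will keep polling."""
    return delay * max_attempts

def attempts_for(duration_seconds, delay):
    """Number of polls needed to cover a run of the given length."""
    return math.ceil(duration_seconds / delay)

# The run_completed configuration used above: 60 s x 120 attempts.
print(max_wait_seconds(60, 120))

# Polls needed to cover a three-hour run at one poll per minute.
print(attempts_for(3 * 3600, 60))
```

If a run is expected to take longer than the first number, raise `MaxAttempts` (or `Delay`) accordingly, or check the status later with `get_run`.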

Once the run completes we can verify its status by getting its details:

In [10]:
omics_helper_pretty_print(omics_client.get_run(id=run['id']))

{
  "arn": "arn:aws:omics:us-west-2:851725420776:run/2542738",
  "id": "2542738",
  "status": "COMPLETED",
  "workflowId": "2174942",
  "workflowType": "READY2RUN",
  "roleArn": "arn:aws:iam::851725420776:role/omics-r2r-tutorial-service-role",
  "name": "2174942 R2R workflow run",
  "parameters": {
    "samplename": "20k_NSCLC_DTC",
    "input": [
      {
        "fastq_2": "s3://omics-us-west-2/sample-inputs/2174942/20k_NSCLC_DTC_3p_nextgem_gex_S4_L001_R2_001.fastq.gz",
        "fastq_1": "s3://omics-us-west-2/sample-inputs/2174942/20k_NSCLC_DTC_3p_nextgem_gex_S4_L001_R1_001.fastq.gz"
      }
    ],
    "protocol": "10XV3"
  },
  "storageCapacity": 1200,
  "outputUri": "s3://ready2runtestoutput/run_results",
  "startedBy": "arn:aws:sts::851725420776:assumed-role/AmazonSageMakerServiceCatalogProductsUseRole/SageMaker",
  "creationTime": "2024-09-23 21:07:49.401680+00:00",
  "startTime": "2024-09-23 21:17:48.440000+00:00",
  "stopTime": "2024-09-23 22:51:04.002240+00:00",
  "tags": {},

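The three timestamps in the run details can be used to see how long the run waited before starting and how long it executed. This is a minimal sketch with a hypothetical helper name; it accepts either `datetime` objects (as boto3 returns them) or the string form printed above.

```python
from datetime import datetime

def _as_datetime(value):
    # boto3 returns datetime objects; the pretty-printed output shows strings.
    return datetime.fromisoformat(value) if isinstance(value, str) else value

def run_timings(run):
    """Return (seconds from creation to start, seconds from start to stop)."""
    created = _as_datetime(run["creationTime"])
    started = _as_datetime(run["startTime"])
    stopped = _as_datetime(run["stopTime"])
    return ((started - created).total_seconds(),
            (stopped - started).total_seconds())

# Timestamps copied from the run details above.
run_details = {
    "creationTime": "2024-09-23 21:07:49.401680+00:00",
    "startTime": "2024-09-23 21:17:48.440000+00:00",
    "stopTime": "2024-09-23 22:51:04.002240+00:00",
}
pending_s, running_s = run_timings(run_details)
print(f"pending: {pending_s / 60:.1f} min, running: {running_s / 60:.1f} min")
```

For this run, that is roughly 10 minutes before the start and about 93 minutes of execution.
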

## Validating output of the workflow
We can verify that the correct output was generated by listing the `outputUri` for the workflow run:

In [23]:
run_id = "2542738"

In [24]:
s3uri = urlparse(omics_client.get_run(id=run_id)['outputUri'])
boto3.client('s3').list_objects_v2(Bucket=s3uri.netloc, Prefix='/'.join([s3uri.path[1:], run_id]))['Contents']

[{'Key': 'run_results/2542738/',
  'LastModified': datetime.datetime(2024, 9, 23, 21, 7, 50, tzinfo=tzlocal()),
  'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
  'ChecksumAlgorithm': ['SHA256'],
  'Size': 0,
  'StorageClass': 'STANDARD'},
 {'Key': 'run_results/2542738/logs/engine.log',
  'LastModified': datetime.datetime(2024, 9, 23, 22, 45, 28, tzinfo=tzlocal()),
  'ETag': '"423f123b68b362f71bb7fb1a613f8820"',
  'ChecksumAlgorithm': ['CRC32'],
  'Size': 49220,
  'StorageClass': 'STANDARD'},
 {'Key': 'run_results/2542738/pubdir/fastqc/20k_NSCLC_DTC_1_fastqc.html',
  'LastModified': datetime.datetime(2024, 9, 23, 22, 43, 38, tzinfo=tzlocal()),
  'ETag': '"3f8bcacbc2897f401bd4d7c1fae39ca4"',
  'ChecksumAlgorithm': ['CRC32'],
  'Size': 511452,
  'StorageClass': 'STANDARD'},
 {'Key': 'run_results/2542738/pubdir/fastqc/20k_NSCLC_DTC_1_fastqc.zip',
  'LastModified': datetime.datetime(2024, 9, 23, 22, 43, 38, tzinfo=tzlocal()),
  'ETag': '"c40f6a649503093f76295021a61a4263"',
  'ChecksumAlgori
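
A single `list_objects_v2` call returns at most 1,000 keys, so a run with many output files needs pagination. The sketch below uses the boto3 paginator for `list_objects_v2`; the function name `list_run_outputs` is hypothetical, and the S3 client is passed in as an argument so the function can be exercised without AWS access. It also appends a trailing `/` to the prefix so that run IDs sharing the same leading digits are not matched.

```python
from urllib.parse import urlparse

def list_run_outputs(s3_client, output_uri, run_id):
    """Return (key, size) for every object under <output_uri>/<run_id>/."""
    parsed = urlparse(output_uri)
    parts = [part for part in (parsed.path.strip("/"), run_id) if part]
    prefix = "/".join(parts) + "/"
    paginator = s3_client.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=parsed.netloc, Prefix=prefix):
        # Pages for an empty prefix carry no 'Contents' key.
        objects += [(obj["Key"], obj["Size"]) for obj in page.get("Contents", [])]
    return objects
```

In this notebook it could be called as `list_run_outputs(boto3.client('s3'), omics_client.get_run(id=run_id)['outputUri'], run_id)`.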

Like standard workflows, R2R workflows support all the features of the Amazon Omics platform.<br>
Tasks, logs, and run groups are therefore fully supported. Here, we show how to get the list of tasks and their corresponding log streams.

In [20]:
tasks = omics_client.list_run_tasks(id=run_id)
omics_helper_pretty_print(tasks['items'])

[
  {
    "taskId": "2219876",
    "status": "COMPLETED",
    "name": "NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC",
    "cpus": 2,
    "memory": 7,
    "creationTime": "2024-09-23 22:35:29.338300+00:00",
    "startTime": "2024-09-23 22:39:07.334000+00:00",
    "stopTime": "2024-09-23 22:39:40.969000+00:00",
    "gpus": 0,
    "instanceType": "omics.m.large"
  },
  {
    "taskId": "9116974",
    "status": "COMPLETED",
    "name": "NFCORE_SCRNASEQ:SCRNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS (1)",
    "cpus": 2,
    "memory": 7,
    "creationTime": "2024-09-23 22:27:25.778950+00:00",
    "startTime": "2024-09-23 22:35:13.693000+00:00",
    "stopTime": "2024-09-23 22:35:27.698000+00:00",
    "gpus": 0,
    "instanceType": "omics.m.large"
  },
  {
    "taskId": "8063649",
    "status": "COMPLETED",
    "name": "NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (20k_NSCLC_DTC)",
    "cpus": 8,
    "memory": 57,
    "creationTime": "2024-09-23 22:27:25.662380+00:00",
    "startTime": "2024-09-23 22:32:16.14

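The task list also makes it easy to see where the time went. The sketch below ranks tasks by wall-clock runtime; `task_runtimes` is a hypothetical helper, and the sample entries are abbreviated copies of the first two tasks in the output above.

```python
from datetime import datetime

def _as_datetime(value):
    # boto3 returns datetime objects; the pretty-printed output shows strings.
    return datetime.fromisoformat(value) if isinstance(value, str) else value

def task_runtimes(tasks):
    """Return (task name, runtime in seconds), longest first."""
    rows = []
    for task in tasks:
        start, stop = task.get("startTime"), task.get("stopTime")
        if start is None or stop is None:
            continue  # task has not started or finished yet
        seconds = (_as_datetime(stop) - _as_datetime(start)).total_seconds()
        rows.append((task["name"], seconds))
    return sorted(rows, key=lambda row: row[1], reverse=True)

sample_tasks = [
    {"name": "NFCORE_SCRNASEQ:SCRNASEQ:CUSTOM_DUMPSOFTWAREVERSIONS (1)",
     "startTime": "2024-09-23 22:35:13.693000+00:00",
     "stopTime": "2024-09-23 22:35:27.698000+00:00"},
    {"name": "NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC",
     "startTime": "2024-09-23 22:39:07.334000+00:00",
     "stopTime": "2024-09-23 22:39:40.969000+00:00"},
]
runtimes = task_runtimes(sample_tasks)
for name, seconds in runtimes:
    print(f"{seconds:8.1f} s  {name}")
```
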
We can get the details of a specific task with:

In [21]:
task = omics_client.get_run_task(id=run_id, taskId=tasks['items'][0]['taskId'])
omics_helper_pretty_print(task)

{
  "taskId": "2219876",
  "status": "COMPLETED",
  "name": "NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC",
  "cpus": 2,
  "memory": 7,
  "creationTime": "2024-09-23 22:35:29.338300+00:00",
  "startTime": "2024-09-23 22:39:07.334000+00:00",
  "stopTime": "2024-09-23 22:39:40.969000+00:00",
  "logStream": "arn:aws:logs:us-west-2:851725420776:log-group:/aws/omics/WorkflowLog:log-stream:run/2542738/task/2219876",
  "gpus": 0,
  "instanceType": "omics.m.large"
}


After running the cell above we should see that each task has an associated CloudWatch Logs LogStream. These capture any text generated by the workflow task that has been sent to either `STDOUT` or `STDERR`. These outputs are helpful for debugging any task failures and can be retrieved with:

In [22]:
events = boto3.client('logs').get_log_events(
    logGroupName="/aws/omics/WorkflowLog",
    logStreamName=f"run/{run_id}/task/{task['taskId']}"
)
for event in events['events']:
    print(event['message'])

Task started
/// MultiQC 🔍 | v1.13
|           multiqc | Search path : /mnt/workflow/5d/805705d365195e69c76efc5365afde
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 12/12
|    custom_content | software_versions: Found 1 sample (html)
|    custom_content | nf-core-scrnaseq-summary: Found 1 sample (html)
|    custom_content | nf-core-scrnaseq-methods-description: Found 1 sample (html)
|            snippy | Found 1 reports
|          bargraph | Tried to make bar plot, but had no data: snippy_variants
|            fastqc | Found 2 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | Plots       : multiqc_plots
|           multiqc | MultiQC complete
Task succeeded
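
Instead of rebuilding the log stream name from the run and task IDs, the group and stream names can also be read from the `logStream` ARN returned by `get_run_task`. This is a minimal sketch with a hypothetical helper name, assuming the ARN layout shown in the task details above.

```python
def split_log_stream_arn(arn):
    """Split a CloudWatch Logs stream ARN into (log group, log stream)."""
    prefix, _, stream = arn.partition(":log-stream:")
    _, _, group = prefix.partition(":log-group:")
    if not group or not stream:
        raise ValueError(f"not a log stream ARN: {arn}")
    return group, stream

# ARN copied from the task details above.
arn = ("arn:aws:logs:us-west-2:851725420776:log-group:/aws/omics/WorkflowLog"
       ":log-stream:run/2542738/task/2219876")
group, stream = split_log_stream_arn(arn)
print(group)
print(stream)
```

The two values can be passed directly as `logGroupName` and `logStreamName` to `get_log_events`.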


Run Groups are not covered here, since their functionality is identical to that shown in the workflows notebook tutorial.