# AWS-RoseTTAFold

## I. Introduction

This notebook runs the [RoseTTAFold](https://www.ipd.uw.edu/2021/07/rosettafold-accurate-protein-structure-prediction-accessible-to-all/) algorithm on AWS. The algorithm was developed by Minkyung Baek et al. and is described in [M. Baek et al., Science 10.1126/science.abj8754 2021](https://www.ipd.uw.edu/wp-content/uploads/2021/07/Baek_etal_Science2021_RoseTTAFold.pdf).

<img src="img/RF_workflow.png" alt="RoseTTAFold Workflow" width="800px" />

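The AWS workflow runs as AWS Batch jobs, so it requires a Batch compute environment. The data prep step is submitted to a CPU job queue and the prediction step to a GPU job queue.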

<img src="img/AWS-RoseTTAFold-arch.png" alt="AWS-RoseTTAFold Architecture" width="800px" />

## II. Environment setup

In [None]:
## Install dependencies
!pip install -r requirements.txt

## Import helper functions at src/rfutils.py
from rfutils import rfutils

## Load additional dependencies
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import boto3
import glob
import json
import pandas as pd
import sagemaker

pd.set_option("display.max_colwidth", None)

# Get service clients
session = boto3.session.Session()
sm_session = sagemaker.session.Session()
region = session.region_name
role = sagemaker.get_execution_role()
s3 = boto3.client("s3", region_name=region)
account_id = boto3.client("sts").get_caller_identity().get("Account")

bucket = sm_session.default_bucket()

## III. Input Protein Sequence

Provide the path to a FASTA file that contains a single protein sequence.

In [None]:
#seq = SeqIO.read("data/T1078.fa", "fasta")
seq = SeqIO.read("data/T1036s1.fa", "fasta")

Alternatively, enter a protein sequence manually.

In [None]:
seq = SeqRecord(
    Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
    id="YP_025292.1",
    name="HokC",
    description="toxic membrane protein, small",
)

In [None]:
print(f"Protein sequence for analysis is \n{seq}")
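
`SeqIO.read` expects exactly one record and raises a `ValueError` otherwise, so it can help to check the input before submitting a job. The sketch below mimics that rule using only the standard library and also flags unexpected residue letters. The names `read_single_fasta`, `invalid_residues`, and `VALID_RESIDUES` are hypothetical and not part of `rfutils` or Biopython, and the accepted alphabet is an assumption.

```python
# Assumption: the 20 standard amino acids plus X (unknown) are acceptable input.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def read_single_fasta(text):
    """Parse FASTA text that must contain exactly one record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line[1:], []])  # new record: header, sequence chunks
        elif records:
            records[-1][1].append(line)
        else:
            raise ValueError("Sequence data found before a '>' header line")
    if len(records) != 1:
        raise ValueError(f"Expected exactly one record, found {len(records)}")
    header, chunks = records[0]
    return header, "".join(chunks).upper()


def invalid_residues(sequence):
    """Return the sorted letters that are not in VALID_RESIDUES."""
    return sorted(set(sequence) - VALID_RESIDUES)


example = ">YP_025292.1 HokC\nMKQHKAMIVALIVICITAVVAALVTRKDLC\nEVHIRTGQTEVAVF\n"
header, sequence = read_single_fasta(example)
print(header, len(sequence), invalid_residues(sequence))
```

An empty list from `invalid_residues` means every letter is in the assumed alphabet.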

## IV. Submit RoseTTAFold Jobs

### Generate Job Name

In [None]:
job_name = rfutils.create_job_name(seq.id)
print(f"Automatically-generated job name is: {job_name}")
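
How `rfutils.create_job_name` builds the name is not shown in this notebook. As a rough illustration only, the sketch below shows one way to derive a valid name from a sequence ID. AWS Batch job names may be up to 128 characters long, must start with a letter or number, and may contain only letters, numbers, hyphens, and underscores, so an ID such as `YP_025292.1` needs its period replaced. The function name `make_batch_job_name`, the prefix, and the timestamp format are all assumptions.

```python
import re
from datetime import datetime, timezone

MAX_JOB_NAME_LENGTH = 128  # AWS Batch limit on job name length


def make_batch_job_name(sequence_id, prefix="RoseTTAFold"):
    """Build a Batch-compatible job name (hypothetical helper, not rfutils)."""
    # Replace every character that Batch does not accept with an underscore.
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", sequence_id)
    # A UTC timestamp keeps names distinct across repeated submissions.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    # Trim the ID, not the timestamp, if the name would be too long.
    room_for_id = MAX_JOB_NAME_LENGTH - len(prefix) - len(stamp) - 2
    return f"{prefix}_{safe_id[:room_for_id]}_{stamp}"


print(make_batch_job_name("YP_025292.1"))
```

The prefix is assumed to start with a letter or number, which satisfies the first-character rule.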

### Upload fasta file to S3

In [None]:
input_uri = rfutils.upload_fasta_to_s3(seq, bucket, job_name)

### Submit jobs to AWS Batch queues

In [None]:
two_step_response = rfutils.submit_2_step_job(
    bucket=bucket,
    job_name=job_name,
    data_prep_input_file="input.fa",
    data_prep_job_definition="AWS-RoseTTAFold-CPU",
    data_prep_queue="AWS-RoseTTAFold-CPU",
    data_prep_cpu=16,
    data_prep_mem=60,
    predict_job_definition="AWS-RoseTTAFold-GPU",
    predict_queue="AWS-RoseTTAFold-GPU",
    predict_cpu=32,
    predict_mem=90,
    predict_gpu=2,
)
data_prep_jobId = two_step_response[0]["jobId"]
predict_jobId = two_step_response[1]["jobId"]
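
The internals of `rfutils.submit_2_step_job` are not shown here. The sketch below illustrates one plausible way to chain two Batch jobs with the boto3 `submit_job` call, where the prediction job lists the data prep job under `dependsOn` so that it only starts after data prep succeeds. The function names, the job name suffixes, and the reading of the memory values as GiB are assumptions; AWS Batch itself takes memory in MiB and expects every resource value as a string.

```python
def resource_requirements(vcpus, memory_gib, gpus=0):
    """Translate sizes into the string-valued entries AWS Batch expects."""
    requirements = [
        {"type": "VCPU", "value": str(vcpus)},
        # Assumption: sizes are given in GiB; Batch wants MiB.
        {"type": "MEMORY", "value": str(memory_gib * 1024)},
    ]
    if gpus:
        requirements.append({"type": "GPU", "value": str(gpus)})
    return requirements


def submit_two_step(batch, job_name, cpu_queue, cpu_definition,
                    gpu_queue, gpu_definition):
    """Submit data prep, then a prediction job that depends on it.

    `batch` is a boto3 Batch client, e.g. boto3.client("batch").
    """
    data_prep = batch.submit_job(
        jobName=f"{job_name}-data-prep",
        jobQueue=cpu_queue,
        jobDefinition=cpu_definition,
        containerOverrides={
            "resourceRequirements": resource_requirements(16, 60)
        },
    )
    predict = batch.submit_job(
        jobName=f"{job_name}-predict",
        jobQueue=gpu_queue,
        jobDefinition=gpu_definition,
        # The prediction job waits until the data prep job has succeeded.
        dependsOn=[{"jobId": data_prep["jobId"]}],
        containerOverrides={
            "resourceRequirements": resource_requirements(32, 90, gpus=2)
        },
    )
    return [data_prep, predict]
```

Passing the client in as an argument keeps the function easy to exercise with a stub before any real job is submitted.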

## V. Check Status of Data Prep and Prediction Jobs

In [None]:
rfutils.get_rf_job_info(hrs_in_past=1)

Pause until the data prep job has started.

In [None]:
rfutils.wait_for_job_start(data_prep_jobId)
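
Waiting for a job to start usually comes down to polling its status. The sketch below shows one way to do that with the boto3 Batch `describe_jobs` call; it is not the implementation of `rfutils.wait_for_job_start`. The function name, the polling interval, and the timeout are assumptions. A Batch job moves through `SUBMITTED`, `PENDING`, `RUNNABLE`, and `STARTING` before reaching `RUNNING`, and ends in `SUCCEEDED` or `FAILED`.

```python
import time

# States in which the job has started or has already finished.
STARTED_STATES = {"RUNNING", "SUCCEEDED", "FAILED"}


def wait_until_started(batch, job_id, poll_seconds=10,
                       timeout_seconds=3600, sleep=time.sleep):
    """Poll a Batch job until it is running or finished; return its status.

    `batch` is a boto3 Batch client. `sleep` is injectable for testing.
    """
    waited = 0
    while True:
        jobs = batch.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise ValueError(f"No Batch job found with ID {job_id}")
        status = jobs[0]["status"]
        if status in STARTED_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Job {job_id} still {status} after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

Treating `FAILED` as a stopping state matters here: a job that fails before it runs would otherwise keep the loop polling until the timeout.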

Get the logs for the data prep job. Rerun this cell to follow the job's progress.

In [None]:
data_prep_logStreamName = rfutils.get_batch_job_info(data_prep_jobId)["logStreamName"]
rfutils.get_batch_logs(data_prep_logStreamName).tail(n=10)
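
AWS Batch writes container output to CloudWatch Logs, by default in the `/aws/batch/job` log group. As a rough sketch of what a log helper might do, the code below fetches the most recent messages from a log stream with the boto3 `get_log_events` call. The function name `tail_batch_logs` is hypothetical, it returns a plain list instead of the DataFrame that `rfutils.get_batch_logs` appears to return, and the default log group is assumed to be unchanged in the job definition.

```python
def tail_batch_logs(logs, log_stream_name, limit=10,
                    log_group_name="/aws/batch/job"):
    """Return up to `limit` of the latest log messages for a Batch job.

    `logs` is a boto3 CloudWatch Logs client, e.g. boto3.client("logs").
    """
    response = logs.get_log_events(
        logGroupName=log_group_name,
        logStreamName=log_stream_name,
        limit=limit,
        startFromHead=False,  # read from the end of the stream
    )
    return [event["message"] for event in response["events"]]
```

With the plain boto3 Batch client, the stream name for a job is available from `describe_jobs` under `container.logStreamName` once the job has started.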

## VI. Retrieve and Display Multiple Sequence Alignment (MSA) Results

In [None]:
rfutils.display_msa(data_prep_jobId, bucket)

## VII. Retrieve and Display Predicted Structure

Pause until the predict job has started.

In [None]:
rfutils.wait_for_job_start(predict_jobId)

Get the logs for the predict job. Rerun this cell to follow the job's progress.

In [None]:
predict_logStreamName = rfutils.get_batch_job_info(predict_jobId)["logStreamName"]
rfutils.get_batch_logs(predict_logStreamName).tail(n=10)

In [None]:
rfutils.display_structure(predict_jobId, bucket)

## VIII. Display the Results of Historical Runs

In [None]:
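# Example jobs for demo. Separate variable names keep these values from
# overwriting the job IDs and bucket used elsewhere in the notebook.
demo_data_prep_jobId = "4dce37f9-9ae5-4e56-b06c-babb02259218"
demo_predict_jobId = "ac779444-94c5-49b0-990a-88745e6c70b9"
demo_bucket = "sagemaker-us-east-1-032243382548"

rfutils.display_msa(demo_data_prep_jobId, demo_bucket)
rfutils.display_structure(demo_predict_jobId, demo_bucket)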

## IX. Analyze Multiple Sequences Simultaneously

In [None]:
fasta_files = glob.glob("data/*.fa")
job_ids = []
for file in fasta_files:
    seq = SeqIO.read(file, "fasta")
    job_name = rfutils.create_job_name(seq.id)
    print(f"Automatically-generated job name is: {job_name}")
    input_uri = rfutils.upload_fasta_to_s3(seq, bucket, job_name)
    two_step_response = rfutils.submit_2_step_job(
        bucket=bucket,
        job_name=job_name,
        data_prep_input_file="input.fa",
        data_prep_job_definition="AWS-RoseTTAFold-CPU",
        data_prep_queue="AWS-RoseTTAFold-CPU",
        data_prep_cpu=24,
        data_prep_mem=80,
        predict_job_definition="AWS-RoseTTAFold-GPU",
        predict_queue="AWS-RoseTTAFold-GPU",
        predict_cpu=24,
        predict_mem=80,
        predict_gpu=2,
    )
    # Record the (data prep, predict) job IDs for each sequence
    job_ids.append((two_step_response[0]["jobId"], two_step_response[1]["jobId"]))

In [None]:
output = rfutils.get_rf_job_info(hrs_in_past=1)