# Sequence Read Archive (SRA) on AWS

## Drive New Research Discoveries by Exploring Raw DNA Sequencing Data

The Sequence Read Archive (SRA) data is stored on AWS as part of the Registry of Open Data on AWS. The data is stored in Amazon Simple Storage Service (S3), is publicly available, and does not require an AWS account to access. In this example, however, you specify an S3 bucket in your own AWS account to store a copy of the data.

### Install and Import Required Libraries

To learn more about the standard-library modules imported below, see https://docs.python.org/3/library/index.html. `boto3`, `matplotlib`, and Biopython (`Bio`) are third-party packages with their own documentation.

In [None]:
%pip install biopython

In [None]:
import os
import gzip
import boto3
import matplotlib.pyplot as plt
from Bio import SeqIO
import collections
from collections import defaultdict

### Define Source S3 Bucket and Destination S3 Bucket

The <code>source_bucket</code> variable specifies the S3 bucket from which you will copy the Sequence Read Archive data. The repository contains many buckets holding different data types. For the purpose of this example, you will copy data from one specific bucket. You can replace <code>source_bucket</code> with another bucket name or keep the default below.

The <code>dest_bucket</code> variable is the S3 bucket within your AWS account to which the Sequence Read Archive data will be copied. Before executing, update the <code>dest_bucket</code> variable below with the name of the S3 bucket you created during the CloudFormation stack execution. If you did not use CloudFormation to prepare the environment, you will need to create an S3 bucket manually and enter its name below in <code>dest_bucket</code>.

In [None]:
source_bucket = 'sra-pub-src-1'
dest_bucket = 'enter_the_bucket_name_created_through_cloud_formation_here'

### Initialize Python Boto3 Client

In this example, you use the Boto3 client for Amazon S3 to list and transfer the sequencing data files.

In [None]:
s3 = boto3.client('s3')

### Search the SRA Bucket and Copy FASTQ Files to Your Destination Bucket

Combining a Boto3 paginator with the S3 List Objects operation lets you search the provided <code>source_bucket</code> and return at most <code>MaxItems</code> objects, as set in the code below.

You can increase <code>MaxItems</code> to return a longer list, but the more objects you request, the longer the listing takes to return.

For the purpose of this lab you will choose an object with a <code>key</code> like <code>.fastq.gz.1</code> to populate into the next section for copying the file.

In [None]:
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket=source_bucket,
    PaginationConfig={'MaxItems': 20}  # cap on the total number of objects returned
)
for page in page_iterator:
    # 'Contents' is absent from a page that has no objects
    for obj in page.get('Contents', []):
        print(f's3://{source_bucket}/{obj["Key"]}')
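
The listing prints every key it finds, so picking out FASTQ objects by eye gets tedious as <code>MaxItems</code> grows. The following is a minimal sketch of filtering keys by name. The helper `is_fastq_key` and the sample keys are hypothetical, and the pattern assumes FASTQ objects end in `.fastq.gz` or `.fq.gz`, optionally followed by a numeric suffix such as `.1`.

```python
import re

# Assumption: FASTQ objects end in .fastq.gz or .fq.gz, optionally followed
# by a numeric suffix such as ".1" (as in the example key used in this lab).
FASTQ_KEY = re.compile(r"\.(fastq|fq)\.gz(\.\d+)?$", re.IGNORECASE)

def is_fastq_key(key):
    """Return True if an S3 object key looks like a gzipped FASTQ file."""
    return FASTQ_KEY.search(key) is not None

# Hypothetical keys standing in for the paginator output.
sample_keys = [
    "ERR10009485/Sample_ApoB_Mouse_S4_L003_I1_001.fastq.gz.1",
    "ERR10009485/README.txt",
    "ERR0000001/reads.fq.gz",
    "ERR0000001/reads.bam",
]
fastq_keys = [k for k in sample_keys if is_fastq_key(k)]
print(fastq_keys)
```

In the notebook, the same check could be applied to each `obj["Key"]` inside the paginator loop before printing.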

### Set Variables

Example variables are provided below and can be used as is. Update them if you want to work with a different FASTQ file, keeping <code>key</code> equal to <code>prefix</code>/<code>filename</code>.

In [None]:
filename='Sample_ApoB_Mouse_S4_L003_I1_001.fastq.gz.1'
key='ERR10009485/Sample_ApoB_Mouse_S4_L003_I1_001.fastq.gz.1'
prefix='ERR10009485'
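
The three variables are related: the key is the prefix and the filename joined by a slash. As a hedged sketch, the prefix and filename can be derived from the key so the values cannot drift apart. The helper `split_sra_key` is hypothetical and assumes keys follow an `ACCESSION/filename` layout.

```python
import posixpath

def split_sra_key(key):
    """Split an S3 key of the form 'ACCESSION/filename' into (prefix, filename)."""
    # S3 keys always use '/' as the separator, so posixpath is used
    # instead of os.path, which depends on the local operating system.
    prefix, filename = posixpath.split(key)
    return prefix, filename

key = 'ERR10009485/Sample_ApoB_Mouse_S4_L003_I1_001.fastq.gz.1'
prefix, filename = split_sra_key(key)
print(prefix)
print(filename)
```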

### Copy FASTQ File from Source Bucket to Destination Bucket

In [None]:
s3 = boto3.resource('s3')

In [None]:
copy_source = {
        'Bucket': source_bucket,
        'Key': key
}
s3.meta.client.copy(copy_source, dest_bucket, key)

print(f"Copied s3://{source_bucket}/{key} to s3://{dest_bucket}/{key}")

### List Files of Destination Bucket

In [None]:
s3 = boto3.client('s3')

In [None]:
paginator = s3.get_paginator('list_objects_v2')
operation_parameters = {'Bucket': dest_bucket,
                        'Prefix': prefix}
page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    # 'Contents' is absent when no object matches the prefix
    print(page.get('Contents', []))

### Download FASTQ File from S3 to SageMaker Notebook Instance

Because of the size of the FASTQ files, you will need to download them to the local SageMaker notebook instance already created for you. The copy in your S3 bucket remains available until you manually delete it or remove the CloudFormation stack. For long-term use, consider relying on the Registry of Open Data copy rather than keeping your own duplicate. This reduces data duplication and resource use, in line with sustainability best practices.

In [None]:
s3 = boto3.client("s3") 

In [None]:
s3.download_file( 
    Filename=filename, 
    Bucket=dest_bucket, 
    Key=key
)
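
Because FASTQ files can be large, it may help to confirm that the notebook instance has enough free disk space before downloading. The sketch below uses only the standard library. The helper `has_room_for` and its `headroom` parameter are hypothetical, and the object size is assumed to be known already, for example from the `Size` field of a listing entry.

```python
import shutil

def has_room_for(size_bytes, path=".", headroom=0.10):
    """Return True if `path` has space for size_bytes plus a safety margin.

    headroom is the extra fraction of size_bytes to keep free (10% by default).
    """
    free = shutil.disk_usage(path).free
    return free >= size_bytes * (1 + headroom)

# Check whether a 1 KiB file would fit in the current directory.
print(has_room_for(1024))
```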

### Read Nucleotide Data

The cell below reads the gzip-compressed FASTQ file, extracts the first sequence record, and prints information about it, including the sequence data and the associated quality scores.

In [None]:
records = SeqIO.parse(gzip.open(filename, 'rt', encoding='utf-8'), 'fastq')
record = next(records)
print(record)
print(record.id,record.description,record.seq)
print(record.letter_annotations)
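
To make the record structure concrete, here is a minimal sketch that parses one FASTQ record by hand, without Biopython. The helper `parse_fastq_record` and the example record are made up for illustration, and the quality decoding assumes the Sanger (Phred+33) encoding, which is what Biopython's `'fastq'` format also assumes.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # The read ID is the first whitespace-delimited token after '@'.
    read_id = header[1:].split()[0]
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    phred = [ord(ch) - 33 for ch in quality]
    return read_id, sequence, phred

# Made-up record for illustration only.
example = ["@read1 sample\n", "GATTACA\n", "+\n", "II5+!#I\n"]
read_id, sequence, phred = parse_fastq_record(example)
print(read_id, sequence, phred)
```

With Biopython, the same per-base scores are available as `record.letter_annotations["phred_quality"]`.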

### Plot the Nucleotide Reads

The FASTQ file is opened in read mode with the `gzip` module and read line by line, skipping the header and quality score lines while counting the occurrences of each nucleotide (A, T, C, G) with the `collections.Counter()` class. After counting the nucleotides, the script uses the `matplotlib.pyplot` library to generate a bar graph showing the distribution of the nucleotides. The `plt.figure()` function sets the size of the plot, and the `plt.bar()` function creates the bar graph using the nucleotide counts as the y-axis values and the nucleotides as the x-axis labels. Finally, the `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions add labels and a title to the plot, and the `plt.show()` function displays the plot.

### Define Function to Read a Specific Number of Lines of a FASTQ File

In [None]:
def read_fastq_subset(filename, start_line=0, num_lines=None):
    """
    Read a subset of lines from a gzipped FASTQ file.
    
    Args:
        filename (str): The path to the gzipped FASTQ file.
        start_line (int, optional): The starting line number to read (default is 0).
        num_lines (int, optional): The number of lines to read (default is None, which means read all lines).
    
    Returns:
        generator: A generator that yields the selected lines from the FASTQ file.
    """
    with gzip.open(filename, 'rt') as f:
        # Skip the initial lines
        for _ in range(start_line):
            f.readline()
        
        # Read the selected lines
        if num_lines is None:
            yield from f
        else:
            for _ in range(num_lines):
                line = f.readline()
                if not line:
                    break  # reached the end of the file
                yield line.strip()
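
The function above selects lines by number, whereas FASTQ data is organised in four-line records (header, sequence, separator, quality). The sketch below shows one way to convert a range of records into line numbers and count bases on the sequence lines only. The helper `record_window` and the tiny test file are hypothetical, and `itertools.islice` stands in for the line-skipping logic.

```python
import collections
import gzip
import itertools
import os
import tempfile

LINES_PER_RECORD = 4  # header, sequence, separator, quality

def record_window(first_record, num_records):
    """Convert a range of records into (start_line, num_lines)."""
    return first_record * LINES_PER_RECORD, num_records * LINES_PER_RECORD

# Build a tiny gzipped FASTQ file (made-up reads) to try the idea on.
reads = ["ACGT", "GGCC", "TTAA"]
text = "".join(f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n" for i, seq in enumerate(reads))
path = os.path.join(tempfile.mkdtemp(), "tiny.fastq.gz")
with gzip.open(path, "wt") as handle:
    handle.write(text)

# Read records 1 and 2 only, then count bases on the sequence lines.
start_line, num_lines = record_window(first_record=1, num_records=2)
counts = collections.Counter()
with gzip.open(path, "rt") as handle:
    window = itertools.islice(handle, start_line, start_line + num_lines)
    for offset, line in enumerate(window):
        if offset % LINES_PER_RECORD == 1:  # second line of each record
            counts.update(line.strip())
print(dict(counts))
```

Selecting the sequence line by position avoids miscounting quality strings, which may themselves begin with `@` or `+`.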

### Execute the Function to Plot the Nucleotide Counts

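By default, the entire FASTQ file will be read. If you want to read only a subset, modify `start_line` and `num_lines` below accordingly. Because each FASTQ record spans four lines, keep both values multiples of four so that reading starts and ends on record boundaries.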

**Note:** This cell processes the entire genomics file and may take 10-15 minutes to complete depending on file size. The cell will show [*] while running - please be patient as it counts nucleotides and generates the visualization.

In [None]:
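nucleotide_counts = collections.Counter()
# A FASTQ record spans four lines: header (@), sequence, separator (+), quality.
# The sequence line is selected by position, because quality strings can also
# begin with '@' or '+' and would otherwise be counted as nucleotides.
# start_line must be a multiple of 4 so that reading begins on a header line.
for i, line in enumerate(read_fastq_subset(filename, start_line=0, num_lines=None)):
    if i % 4 == 1:
        # Count the nucleotides in the sequence line
        nucleotide_counts.update(line.strip())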

# Plot the distribution of nucleotides
plt.figure(figsize=(8, 6))
plt.bar(list(nucleotide_counts.keys()), list(nucleotide_counts.values()))
plt.xlabel('Nucleotide')
plt.ylabel('Count')
plt.title('Distribution of Nucleotides')
plt.show()

### Cleanup Environment

To avoid incurring costs for the environment created in your account, delete the CloudFormation stack, which removes all created resources. The stack deletion fails if the S3 buckets created by CloudFormation are not empty, so first execute the cell below to delete every object and object version from them.

In [None]:
# Comprehensive cleanup for versioned S3 buckets
def cleanup_versioned_bucket(bucket_name):
    """Delete all objects and versions from a versioned S3 bucket"""
    print(f"Starting cleanup of bucket: {bucket_name}")
    
    try:
        # First, list all current objects
        response = s3.list_objects_v2(Bucket=bucket_name)
        if 'Contents' in response:
            print(f"Found {len(response['Contents'])} current objects")
            for obj in response['Contents']:
                print(f"  - {obj['Key']}")
        else:
            print("No current objects found")
        
        # List and delete all object versions
        paginator = s3.get_paginator('list_object_versions')
        pages = paginator.paginate(Bucket=bucket_name)
        
        delete_keys = []
        total_versions = 0
        total_markers = 0
        
        for page in pages:
            # Delete object versions
            if 'Versions' in page:
                for version in page['Versions']:
                    delete_keys.append({
                        'Key': version['Key'], 
                        'VersionId': version['VersionId']
                    })
                    total_versions += 1
            
            # Delete delete markers
            if 'DeleteMarkers' in page:
                for marker in page['DeleteMarkers']:
                    delete_keys.append({
                        'Key': marker['Key'], 
                        'VersionId': marker['VersionId']
                    })
                    total_markers += 1
        
        print(f"Found {total_versions} object versions and {total_markers} delete markers")
        
        if delete_keys:
            # Delete in batches of 1000 (AWS limit)
            deleted_count = 0
            for i in range(0, len(delete_keys), 1000):
                batch = delete_keys[i:i+1000]
                if batch:
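                    response = s3.delete_objects(
                        Bucket=bucket_name,
                        Delete={'Objects': batch}
                    )
                    # delete_objects reports per-key failures under 'Errors'
                    errors = response.get('Errors', [])
                    for error in errors:
                        print(f"  Failed to delete {error['Key']}: {error['Message']}")
                    deleted_count += len(batch) - len(errors)
                    print(f"Deleted batch of {len(batch) - len(errors)} objects/versions")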
            
            print(f"Successfully deleted {deleted_count} total objects/versions from {bucket_name}")
        else:
            print(f"No objects or versions to delete in {bucket_name}")
        
    except Exception as e:
        print(f"Error cleaning up {bucket_name}: {str(e)}")
        import traceback
        traceback.print_exc()

# Clean up both buckets
print("=== S3 Bucket Cleanup ===")
cleanup_versioned_bucket(dest_bucket)
print()

# Construct logging bucket name
account_id = dest_bucket.split('-')[0]
logging_bucket = f"{account_id}-opendata-sra-logs"
cleanup_versioned_bucket(logging_bucket)

print()
print("=== Cleanup Complete ===")
print("You can now safely delete the CloudFormation stack.")
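
The cleanup function splits the delete list into groups of 1,000 because a single S3 `DeleteObjects` request accepts at most 1,000 keys. The batching step can be sketched on its own as a small generator. The name `chunked` and the sample delete list are hypothetical.

```python
def chunked(items, size=1000):
    """Yield successive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical delete list with 2,500 entries.
delete_keys = [{"Key": f"object-{i}", "VersionId": "null"} for i in range(2500)]
print([len(batch) for batch in chunked(delete_keys)])
```

Each yielded batch could then be passed as the `Objects` list of one delete request.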