# Fetch SRA Metadata

This notebook fetches the library layout and sequencing platform for each run. It also checks whether each run actually exists in a form that can be fetched and downloaded.

In [57]:
import pandas as pd 
from Bio import Entrez
from io import StringIO

In [58]:
Entrez.email = "your.email@example.com"

In [59]:
sra_df = pd.read_csv("SRAQueryResults.csv")
accessions = sra_df["acc"].tolist()
len(accessions)

9606

**Beware: this takes a really long time to run (~1.5 hours).**

In [None]:
# Download the metadata for each accession using Entrez efetch
metadata_df = pd.DataFrame()
for acc in accessions:
    try:
        handle = Entrez.efetch(db="sra", id=acc, rettype="runinfo", retmode="text")
        data = handle.read()
        handle.close()
        
        # Check if data is empty
        if not data or len(data.decode('utf-8').strip()) == 0:
            print(f"Warning: No data returned for {acc}")
            continue
            
        # Parse and append to dataframe
        temp_df = pd.read_csv(StringIO(data.decode('utf-8')))
        metadata_df = pd.concat([metadata_df, temp_df], ignore_index=True)
        
    except Exception as e:
        print(f"Error fetching data for {acc}: {e}")
        continue

Error fetching data for SRR31362709: HTTP Error 400: Bad Request
Error fetching data for ERR12207680: HTTP Error 400: Bad Request
Error fetching data for ERR14752059: HTTP Error 400: Bad Request
Error fetching data for ERR8050042: HTTP Error 400: Bad Request
Error fetching data for ERR11894314: HTTP Error 400: Bad Request
Error fetching data for SRR13365202: HTTP Error 400: Bad Request
Error fetching data for SRR30558039: HTTP Error 400: Bad Request
Error fetching data for ERR14752306: HTTP Error 400: Bad Request
Error fetching data for SRR18134279: HTTP Error 400: Bad Request
Error fetching data for SRR28279775: HTTP Error 400: Bad Request
Error fetching data for SRR17464150: HTTP Error 400: Bad Request
Error fetching data for ERR7932528: HTTP Error 400: Bad Request
Error fetching data for SRR12450125: HTTP Error 400: Bad Request
Error fetching data for SRR28288233: HTTP Error 400: Bad Request
Error fetching data for SRR400025: HTTP Error 400: Bad Request
Error fetching data for SRR30
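
The loop above grows `metadata_df` with `pd.concat` on every iteration, which re-copies the accumulated rows each time. This is probably minor next to the network calls, but it is easy to avoid. A minimal sketch of the usual alternative, collecting the per-run frames in a list and concatenating once, follows. The helper name `parse_runinfo` and the sample CSV strings are hypothetical stand-ins for decoded `efetch` responses, not real SRA records.

```python
from io import StringIO

import pandas as pd


def parse_runinfo(text):
    """Parse one runinfo CSV payload; return None if it is empty."""
    if not text or not text.strip():
        return None
    return pd.read_csv(StringIO(text))


# Hypothetical payloads standing in for decoded efetch responses.
payloads = [
    "Run,LibraryLayout,Platform\nRUN_A,PAIRED,ILLUMINA\n",
    "",  # an empty response is skipped
    "Run,LibraryLayout,Platform\nRUN_B,SINGLE,OXFORD_NANOPORE\n",
]

frames = []
for text in payloads:
    frame = parse_runinfo(text)
    if frame is not None:
        frames.append(frame)

# Concatenate once at the end instead of once per iteration.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Note that `pd.concat` raises a `ValueError` on an empty list, so a real loop would need to handle the case where no request succeeded.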

Some of the requests fail intermittently. This cell retries the failed runs and appends any metadata that comes back.

In [60]:
missing = set(sra_df.acc) - set(metadata_df.Run)
for acc in missing:
    try:
        handle = Entrez.efetch(db="sra", id=acc, rettype="runinfo", retmode="text")
        data = handle.read()
        handle.close()
        
        # Check if data is empty
        if not data or len(data.decode('utf-8').strip()) == 0:
            print(f"Warning: No data returned for {acc}")
            continue
            
        # Parse and append to dataframe
        temp_df = pd.read_csv(StringIO(data.decode('utf-8')))
        metadata_df = pd.concat([metadata_df, temp_df], ignore_index=True)
        
    except Exception as e:
        print(f"Error fetching data for {acc}: {e}")
        continue

Error fetching data for ERR14752306: HTTP Error 400: Bad Request
Error fetching data for ERR12207680: HTTP Error 400: Bad Request
Error fetching data for ERR14752059: HTTP Error 400: Bad Request
Error fetching data for ERR7932528: HTTP Error 400: Bad Request
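
The retry cell above makes one extra pass over the failed accessions. If intermittent failures are common, one option is to retry each request a few times with an increasing pause. The sketch below shows that pattern in isolation; `fetch_with_retry`, `flaky_fetch`, and the default delays are hypothetical names and values chosen for illustration, not part of Biopython.

```python
import time


def fetch_with_retry(fetch, acc, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(acc) up to `attempts` times, doubling the wait after each failure.

    Returns the fetched value, or None if every attempt raised.
    """
    for attempt in range(attempts):
        try:
            return fetch(acc)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {acc}: {e}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Hypothetical fetcher that fails twice and then succeeds.
calls = {"n": 0}


def flaky_fetch(acc):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP Error 400: Bad Request")
    return f"runinfo for {acc}"


# Record the requested waits instead of actually sleeping.
waits = []
result = fetch_with_retry(flaky_fetch, "RUN_A", sleep=waits.append)
print(result, waits)
```

In the notebook, `fetch` would be a small function wrapping the `Entrez.efetch` call and the `read`/`close` steps from the cells above. Accessions that still return `None` after all attempts are candidates for runs that cannot be fetched at all.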


Some runs simply don't exist under their listed accession: the four accessions above fail on both passes. Finally, remove duplicate rows and build the final dataset, keeping only the accessions that are in the original SRA accession list and that I was able to fetch metadata for.

In [None]:
common_accessions = list(set(sra_df.acc) & set(metadata_df.Run))
result_df = sra_df[sra_df.acc.isin(common_accessions)].merge(metadata_df, left_on='acc', right_on='Run', how='left')
result_df = result_df.drop_duplicates()
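
To make the effect of the intersection, merge, and de-duplication concrete, here is a toy version of the same three steps. The accessions and the frames `sra_toy` and `meta_toy` are made up for illustration and do not come from the real query results.

```python
import pandas as pd

# Hypothetical accession list: RUN_C has no fetched metadata.
sra_toy = pd.DataFrame({"acc": ["RUN_A", "RUN_B", "RUN_C"]})

# Hypothetical metadata: RUN_A appears twice, RUN_D is not in the list.
meta_toy = pd.DataFrame({
    "Run": ["RUN_A", "RUN_A", "RUN_B", "RUN_D"],
    "LibraryLayout": ["PAIRED", "PAIRED", "SINGLE", "PAIRED"],
})

# Keep only accessions present on both sides, then attach the metadata.
common = set(sra_toy.acc) & set(meta_toy.Run)
merged = sra_toy[sra_toy.acc.isin(common)].merge(
    meta_toy, left_on="acc", right_on="Run", how="left"
)
deduped = merged.drop_duplicates()

print(len(merged), len(deduped))  # 3 2
print(sorted(deduped.acc))        # ['RUN_A', 'RUN_B']
```

The duplicated `RUN_A` metadata row produces two merged rows, which is why `drop_duplicates` is applied after the merge. `RUN_C` and `RUN_D` drop out because each is missing from one side.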

In [65]:
# Filter to get only LibraryLayout in ["SINGLE", "PAIRED"]
result_df = result_df[result_df['LibraryLayout'].isin(["SINGLE", "PAIRED"])]
# Filter to get only the Platform in ["ILLUMINA", "PACBIO_SMRT", "OXFORD_NANOPORE"]
result_df = result_df[result_df['Platform'].isin(["ILLUMINA", "PACBIO_SMRT", "OXFORD_NANOPORE"])]
# Save the final result to a CSV file
result_df.to_csv("SRA_Runs.csv", index=False)