# demo ffq for finding sequencing data and metadata from public databases

>"[`ffq` (Fetch FastQ)](https://github.com/pachterlab/ffq) is a command line tool for finding sequencing data from public databases. "ffq receives an accession and returns the metadata for that accession as well as the metadata for all downstream accessions following the connections between GEO, SRA, EMBL-EBI, DDBJ, and Biosample" - SOURCE: https://github.com/pachterlab/ffq

This notebook demonstrates a number of ways to use ffq, including options for situations where the command line is hard to use, such as a remote machine where ffq doesn't install the same way it does locally. At the least, it should make the command line version easy to use.

Note that if you are interested in metadata, you may want to check out Logan Search, as it exposes a lot of detail, more than I see ffq return for an accession (although there may be ways I haven't explored with ffq). See [my logan_results_analysis-binder](https://github.com/fomightez/logan_results_analysis-binder/) for more information about Logan Search.

In [3]:
%pip install ffq -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
!ffq --help

usage: ffq [-h] [-o OUT] [-l LEVEL] [--ftp] [--aws] [--gcp] [--ncbi] [--split]
           [--verbose] [--version]
           IDs [IDs ...]

ffq 0.3.1: A command line tool to find sequencing data from SRA / GEO / ENCODE
/ ENA / EBI-EMBL / DDBJ / Biosample.

positional arguments:
  IDs         One or multiple SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ /
              Biosample accessions, DOIs, or paper titles

options:
  -h, --help  Show this help message and exit
  -o OUT      Path to write metadata (default: standard out)
  -l LEVEL    Max depth to fetch data within accession tree
  --ftp       Return FTP links
  --aws       Return AWS links
  --gcp       Return GCP links
  --ncbi      Return NCBI links
  --split     Split output into separate files by accession (`-o` is a
              directory)
  --verbose   Print debugging information
  --version   show program's version number and exit


Of course, that's how you use it inside a running Jupyter Notebook.  
**If you were on the command line, you'd leave off the exclamation point**. Otherwise, the same commands shown in this section would be used in the same manner and yield the same results.

In [3]:
!ffq ERR5670887	

[2025-04-16 19:58:16,196]    INFO Parsing run ERR5670887
{
    "ERR5670887": {
        "accession": "ERR5670887",
        "experiment": "ERX5386380",
        "study": "ERP125915",
        "sample": "ERS6200127",
        "title": "MinION sequencing",
        "attributes": {
            "ENA-FIRST-PUBLIC": "2021-07-01",
            "ENA-LAST-UPDATE": "2021-07-01"
        },
        "files": {
            "ftp": [],
            "aws": [],
            "gcp": [],
            "ncbi": []
        }
    }
}
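
Since ffq prints JSON, the linked accessions can be pulled out in Python rather than read by eye. This is a minimal sketch, assuming the JSON text shown above has already been captured into a string (here it is simply pasted in). `linked_accessions` is a hypothetical helper name, not part of ffq.

```python
import json

# JSON copied from the `ffq ERR5670887` output shown above.
ffq_output = """
{
    "ERR5670887": {
        "accession": "ERR5670887",
        "experiment": "ERX5386380",
        "study": "ERP125915",
        "sample": "ERS6200127",
        "title": "MinION sequencing",
        "attributes": {
            "ENA-FIRST-PUBLIC": "2021-07-01",
            "ENA-LAST-UPDATE": "2021-07-01"
        },
        "files": {"ftp": [], "aws": [], "gcp": [], "ncbi": []}
    }
}
"""


def linked_accessions(ffq_json_text):
    """Map each run accession to its experiment, study, and sample."""
    records = json.loads(ffq_json_text)
    return {
        run: {key: record[key] for key in ("experiment", "study", "sample")}
        for run, record in records.items()
    }


links = linked_accessions(ffq_output)
print(links["ERR5670887"]["sample"])
```

The `sample` value retrieved this way is the accession to hand back to `ffq` when you want the sample-level metadata.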


In [4]:
!ffq --ncbi ERR5654687	

[2025-04-16 19:58:18,012]    INFO Parsing run ERR5654687
[]


I found I get more associated metadata using the 'sample' accession listed in the ffq results for an SRA entry. (This is not the same as the 'BioSample', which usually begins with `SAMN`.) Here is `ffq` run with the sample listed in the results of `!ffq ERR5670887` earlier:

In [5]:
!ffq ERS6200127

[2025-04-16 19:58:19,610]    INFO Parsing sample ERS6200127
[2025-04-16 19:58:19,920]    INFO Getting Experiment for ERS6200127
[2025-04-16 19:58:19,920]    INFO Parsing Experiment ERX5386380
[2025-04-16 19:58:20,202]    INFO Parsing run ERR5670887
{
    "ERS6200127": {
        "accession": "ERS6200127",
        "title": "OC43-MRC5-STM2120",
        "organism": "Homo sapiens",
        "attributes": {
            "ENA-CHECKLIST": "ERC000011",
            "ENA-FIRST-PUBLIC": "2021-07-01",
            "organism": "Homo sapiens",
            "ENA-LAST-UPDATE": "2021-07-01",
            "scientific_name": "Homo sapiens",
            "common name": "human",
            "cell_type": "MRC5"
        },
        "experiments": {
            "ERX5386380": {
                "accession": "ERX5386380",
                "title": "MinION sequencing",
                "platform": "OXFORD_NANOPORE",
                "instrument": "MinION",
                "runs": {
                    "ERR5670887": {
      

And for a different one:

In [6]:
!ffq SRS3243030

[2025-04-16 19:58:21,822]    INFO Parsing sample SRS3243030
[2025-04-16 19:58:22,329]    INFO Getting Experiment for SRS3243030
[2025-04-16 19:58:22,329]    INFO Parsing Experiment SRX4022539
[2025-04-16 19:58:22,642]    INFO Parsing run SRR7093892
{
    "SRS3243030": {
        "accession": "SRS3243030",
        "title": "98_17yr_Male_Caucasian",
        "organism": "Homo sapiens",
        "attributes": {
            "INSDC secondary accession": "SRS3243030",
            "NCBI submission package": "Generic.1.0",
            "disease": "Normal",
            "ethnicity": "Caucasian",
            "organism": "Homo sapiens",
            "Sex": "male",
            "cell id": "GM07753",
            "age": "17",
            "source_name": "Skin; Unspecified",
            "BioSampleModel": "Generic",
            "ENA-FIRST-PUBLIC": "2022-03-29",
            "ENA-LAST-UPDATE": "2022-03-29"
        },
        "experiments": {
            "SRX4022539": {
                "accession": "SRX4022539",

I detail a fuller example related to accessions and accessing the metadata in JSON form [at the bottom of this post on Biostars](https://www.biostars.org/p/9522636/#9608530).
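
The sample-level output nests experiments and runs inside each other, so a small recursive walk is one way to get at the runs once the JSON is loaded into Python. This is only a sketch: `collect_runs` is a hypothetical helper, and the dictionary below is abbreviated from the `ffq ERS6200127` output above, which the notebook display truncates.

```python
def collect_runs(node):
    """Recursively gather every record stored under a "runs" key.

    Only dictionaries are traversed; any other value yields no runs.
    """
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "runs" and isinstance(value, dict):
                found.update(value)
            else:
                found.update(collect_runs(value))
    return found


# Abbreviated from the `ffq ERS6200127` output above. The displayed output
# cuts off inside the run record, so only its accession is filled in here.
sample_record = {
    "ERS6200127": {
        "accession": "ERS6200127",
        "title": "OC43-MRC5-STM2120",
        "organism": "Homo sapiens",
        "experiments": {
            "ERX5386380": {
                "accession": "ERX5386380",
                "platform": "OXFORD_NANOPORE",
                "instrument": "MinION",
                "runs": {"ERR5670887": {"accession": "ERR5670887"}},
            }
        },
    }
}

runs = collect_runs(sample_record)
print(sorted(runs))
```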

**That should be most of what you need for ffq use.**  
See [the ffq GitHub repo](https://github.com/pachterlab/ffq) for more command examples.



----------

The `bio` package is another package that offers access to metadata at the SRA.

In [4]:
%pip install bio -q

Note: you may need to restart the kernel to use updated packages.


In [5]:
!bio search SRR17607594

[
    {
        "run_accession": "SRR17607594",
        "sample_accession": "SAMN24891916",
        "sample_alias": "CN25-T",
        "sample_description": "Human sample from Homo sapiens",
        "first_public": "2022-08-22",
        "country": "",
        "scientific_name": "Homo sapiens",
        "fastq_bytes": "47696043951;33618545069",
        "base_count": "163951599482",
        "read_count": "542886091",
        "library_name": "Single nuclei RNA-CN25-Tumor",
        "library_strategy": "OTHER",
        "library_source": "TRANSCRIPTOMIC SINGLE CELL",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 4000",
        "study_title": "Radial glial cell signatures with FGFR3 hypomethylation and overexpression characterize central neurocytoma",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/094/SRR17607594/SRR17607594_1.fastq.gz",
            "https://ftp.sra.ebi.ac.uk/vol1/fastq

**That should be most of what you need for bio use.**  
See [the bio package documentation](https://www.bioinfo.help/) for more command examples.

#### Compare and contrast some examples of `ffq` & `bio search` use

I do note that each of them has some advantages in certain ways.   
I'll show some examples of each in turn to contrast.

In [6]:
!ffq SRR23849628

[2025-04-17 02:21:24,174]    INFO Parsing run SRR23849628
{
    "SRR23849628": {
        "accession": "SRR23849628",
        "experiment": "SRX19662564",
        "study": "SRP410260",
        "sample": "SRS17033009",
        "title": "PromethION sequencing; GSM7093690: 35cycle_10X; Homo sapiens; RNA-Seq",
        "attributes": {
            "ENA-FIRST-PUBLIC": "2023-06-23",
            "ENA-LAST-UPDATE": "2023-06-23"
        },
        "files": {
            "ftp": [
                {
                    "accession": "SRR23849628",
                    "filename": "SRR23849628_1.fastq.gz",
                    "filetype": "fastq",
                    "filesize": 10790875511,
                    "filenumber": 1,
                    "md5": "05b77f9a3d01bd63e66b796c16e86b90",
                    "urltype": "ftp",
                    "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR238/028/SRR23849628/SRR23849628_1.fastq.gz"
                }
            ],
            "aws": [],
            "g

In [7]:
!bio search SRR23849628

[
    {
        "run_accession": "SRR23849628",
        "sample_accession": "SAMN33743856",
        "sample_alias": "GSM7093690",
        "sample_description": "35cycle_10X",
        "first_public": "2023-06-23",
        "country": "missing",
        "scientific_name": "Homo sapiens",
        "fastq_bytes": "10790875511",
        "base_count": "10715195111",
        "read_count": "6444874",
        "library_name": "GSM7093690",
        "library_strategy": "RNA-Seq",
        "library_source": "TRANSCRIPTOMIC",
        "library_layout": "SINGLE",
        "instrument_platform": "OXFORD_NANOPORE",
        "instrument_model": "PromethION",
        "study_title": "Counting and correcting errors within unique molecular identifiers to generate absolute numbers of sequencing molecules [scRNA-Seq]",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR238/028/SRR23849628/SRR23849628_1.fastq.gz"
        ],
        "info": "11 GB files; 6.4 million reads; 10715.2 million seq

Note the `bio search` command reports the number of reads as 6.4 million.  
In this case `bio` doesn't give the source of the data as clearly as `ffq` does. That is atypical, though, as seen from the unrelated example `!bio search SRR17607594` run as the first `bio search` example above.

In [8]:
!ffq ERR5670887

[2025-04-17 02:21:26,507]    INFO Parsing run ERR5670887
{
    "ERR5670887": {
        "accession": "ERR5670887",
        "experiment": "ERX5386380",
        "study": "ERP125915",
        "sample": "ERS6200127",
        "title": "MinION sequencing",
        "attributes": {
            "ENA-FIRST-PUBLIC": "2021-07-01",
            "ENA-LAST-UPDATE": "2021-07-01"
        },
        "files": {
            "ftp": [],
            "aws": [],
            "gcp": [],
            "ncbi": []
        }
    }
}


`ffq` will typically offer you more options for retrieval. Even in cases where other options exist, `bio search` won't feature the other URLs, at least currently:

In [9]:
!bio search ERR5670887

[
    {
        "run_accession": "ERR5670887",
        "sample_accession": "SAMEA8515329",
        "sample_alias": "OC43-MRC5-STM2120",
        "sample_description": "MRC5 cells infected with HCoV-OC43 at an MOI of 3 for 24 hrs in the presence of STM2120",
        "first_public": "2021-07-01",
        "country": "",
        "scientific_name": "Homo sapiens",
        "fastq_bytes": "",
        "base_count": "0",
        "read_count": "0",
        "library_name": "OC43-MRC5-STM2120-biorep1",
        "library_strategy": "OTHER",
        "library_source": "TRANSCRIPTOMIC",
        "library_layout": "SINGLE",
        "instrument_platform": "OXFORD_NANOPORE",
        "instrument_model": "MinION",
        "study_title": "Targeting the m6A RNA modification pathway blocks SARS-CoV-2 and HCoV-OC43 replication",
        "bio_error": "invalid data: could not convert string to float: ''",
        "fastq_url": [
            "https://"
        ],
        "info": "0 files; 0.0 million reads; 0.0 milli

Note `bio search ERR5670887` gives no URL, unlike the result from `ffq`. Also, observe that no number of reads is reported even though there are 451,214 reads according to [SRA's RUN page for `ERR5670887`](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=ERR5670887&display=reads).
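
The `fastq_bytes` field in the `bio search` records is a string: semicolon-separated when there are two files, and empty for `ERR5670887`. An empty string like that would be consistent with the `bio_error` shown above, although I have not traced it in the `bio` source. If you post-process these records yourself, a tolerant parser sidesteps the problem. A minimal sketch, where `total_fastq_bytes` is a hypothetical helper and not part of `bio`:

```python
def total_fastq_bytes(fastq_bytes):
    """Sum a semicolon-separated `fastq_bytes` string; empty parts are skipped."""
    return sum(int(part) for part in fastq_bytes.split(";") if part.strip())


# Values copied from the `bio search` outputs above.
paired = total_fastq_bytes("47696043951;33618545069")  # SRR17607594, two files
single = total_fastq_bytes("10790875511")              # SRR23849628, one file
missing = total_fastq_bytes("")                        # ERR5670887, nothing listed
print(paired, single, missing)
```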

That should cover most of what you need for `ffq` and `bio` use.

However, if you need options for calling ffq on remote machines where the command line invocation isn't quite working as expected, or you want to integrate it with Python without calling the command line version (since it is written in Python), read on for some options....


----------

`ffq` normally gets installed in a `bin` directory, which you can reference if you need to call it directly because `ffq` isn't on the path, as can happen on a remote machine. To give you an idea, here is a direct invocation on MyBinder:

In [9]:
!/srv/conda/envs/notebook/bin/ffq

usage: ffq [-h] [-o OUT] [-l LEVEL] [--ftp] [--aws] [--gcp] [--ncbi] [--split]
           [--verbose] [--version]
           IDs [IDs ...]

ffq 0.3.1: A command line tool to find sequencing data from SRA / GEO / ENCODE
/ ENA / EBI-EMBL / DDBJ / Biosample.

positional arguments:
  IDs         One or multiple SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ /
              Biosample accessions, DOIs, or paper titles

options:
  -h, --help  Show this help message and exit
  -o OUT      Path to write metadata (default: standard out)
  -l LEVEL    Max depth to fetch data within accession tree
  --ftp       Return FTP links
  --aws       Return AWS links
  --gcp       Return GCP links
  --ncbi      Return NCBI links
  --split     Split output into separate files by accession (`-o` is a
              directory)
  --verbose   Print debugging information
  --version   show program's version number and exit


--------

## Running ffq from inside Python

ffq seems designed to run on the command line, but I had considered using it on remote machines where for some reason I didn't have access to the command line version. (Although maybe it was in `/bin/` somewhere on the machine and I could point at it with an absolute path? I found that on MyBinder you can point directly at `!/srv/conda/envs/notebook/bin/ffq`.) So I looked for ways to use it from Python and found some. They could also be useful for adapting into other code, and so I include them here...

Besides, since `ffq` is Python-based, it would be nice to have a way to use `ffq` alongside Python without resorting to `os.system()` and then having to get the output back into Python objects.

I worked out how to run the `ffq` equivalent from inside Python in a session launched from [the binder project's 'simple `requirements.txt` based example' repo here](https://github.com/binder-examples/requirements) in June 2023. (I thought it wasn't working the other day from MyBinder, and so I was actually surprised that it worked and returned the result. At the time, I was developing something and was just trying to get to the point where I expected it to time out; I now find it is working to get the metadata. Maybe there was an issue the other day with the service `ffq` accesses.)

Note that in theory [ffq-api](https://github.com/seqeralabs/ffq-api) used via curl could let me use ffq without the ffq command line tool inside OSG and the OSPool. However, I found the current public instance doesn't seem updated or doesn't have all the information that `ffq` makes available. I'll demonstrate that at the bottom.

---------

### Updated options: Option 1 use a script or code where you import it and then give arguments via `sys.argv`

I had worked this out while trying to use, on the command line in an execute node of the OSG OSPool, some of the deepTools scripts that don't have `if __name__ == "__main__"` as part of their code. (The ones that do can just be called with `python -m` with the arguments after, like `python -m deeptools.bamCompare --help`.) I consulted Claude.ai for options for the more complex cases where `python -m` cannot be used, and it suggested `sys.argv`. Note you have to account for the first item in the command line argument list actually being the name of the command called on the command line. (Note that since `main()` here uses `sys.argv` by default, you don't need to feed the arguments in, unlike what I had been doing with `parser.parse_args()`; see below.)

In [10]:
import sys
from ffq.main import main
sys.argv = ['ffq','--ftp', 'ERX5777701']
main()

[2025-04-16 19:58:26,845]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:58:27,165]    INFO Parsing run ERR6140859


[
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 46548794,
        "filenumber": 1,
        "md5": "87fbe7c5b04a66a8f17da0f16bb6bf36",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz"
    },
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_2.fastq.gz",
        "filetype": "fastq",
        "filesize": 48630523,
        "filenumber": 2,
        "md5": "49a6698e9c9099f21f6ace7953925027",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz"
    }
]


As I pointed out above, the first item in the list doesn't get consumed by ffq's `main()`, just as it doesn't behind the scenes when ffq is run on the command line. You still have to account for it, because the first item in the list is the name of the command called on the command line. In other words, in the example here it isn't used even though it is passed in. So whatever you put there is moot, as long as something is there as a placeholder so that the arguments start at the position after the first. Illustrating that with a nonsensical first item:

In [11]:
import sys
from ffq.main import main
sys.argv = ['superMOOT','ERX5777701']
main()

[2025-04-16 19:58:28,201]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:58:28,211]    INFO Parsing run ERR6140859


{
    "ERX5777701": {
        "accession": "ERX5777701",
        "title": "Fly Cell Atlas: single-cell transcriptomes of the entire adult Drosophila - Smartseq2 dataset",
        "platform": "ILLUMINA",
        "instrument": "Illumina NovaSeq 6000",
        "runs": {
            "ERR6140859": {
                "accession": "ERR6140859",
                "experiment": "ERX5777701",
                "study": "ERP130174",
                "sample": "ERS7029664",
                "title": "Illumina NovaSeq 6000 paired end sequencing; Fly Cell Atlas: single-cell transcriptomes of the entire adult Drosophila - Smartseq2 dataset",
                "attributes": {
                    "ENA-FIRST-PUBLIC": "2021-07-02",
                    "ENA-LAST-UPDATE": "2021-07-02"
                },
                "files": {
                    "ftp": [
                        {
                            "accession": "ERR6140859",
                            "filename": "ERR6140859_1.fastq.gz",
             

(SIDENOTE: these links can be used in combination with `curl` on MyBinder to retrieve the fastq files even though the ftp port is blocked. It turns out you can just change `curl -OL ftp:/` to `curl -OL https:/`.)
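
If you are collecting the links in Python anyway, the scheme swap from the sidenote can be done there before handing the URL to `curl`. A minimal sketch, assuming (as the sidenote found for `ftp.sra.ebi.ac.uk`) that the server offers the same path over HTTPS; `ftp_to_https` is a hypothetical helper name:

```python
def ftp_to_https(url):
    """Swap a leading ftp:// scheme for https://; leave other URLs unchanged."""
    prefix = "ftp://"
    if url.startswith(prefix):
        return "https://" + url[len(prefix):]
    return url


# URL copied from the `--ftp` output above.
ftp_url = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz"
https_url = ftp_to_https(ftp_url)
print(https_url)
```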

You can really see that first item in the arguments list coming from the command by running this example from https://stackoverflow.com/a/76443894/8508004 , which also was the source of the trick to put the arguments after:

In [12]:
!python -c "import sys; print([[str(x),a] for x,a in enumerate(sys.argv)])" a b c

[['0', '-c'], ['1', 'a'], ['2', 'b'], ['3', 'c']]


There the `-c` from the command gets consumed as the first argument in the list.
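
One drawback of assigning to `sys.argv` directly is that the change persists for the rest of the session. A small context manager can restore it afterwards. This is a sketch under assumptions: `patched_argv` is a hypothetical helper, and `toy_main` is a stand-in for an argparse-based entry point so that the example runs without ffq installed.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(*args, prog="placeholder"):
    """Temporarily replace sys.argv, restoring the original on exit."""
    saved = sys.argv
    sys.argv = [prog, *args]  # the first item is only a placeholder
    try:
        yield
    finally:
        sys.argv = saved


def toy_main():
    """Stand-in for an entry point that parses sys.argv by default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ftp", action="store_true")
    parser.add_argument("IDs", nargs="+")
    return parser.parse_args()  # no list given, so sys.argv[1:] is used


original_argv = list(sys.argv)
with patched_argv("--ftp", "ERX5777701"):
    parsed = toy_main()

print(parsed.ftp, parsed.IDs)
```

With ffq, the call inside the `with` block would be the `main()` imported from `ffq.main`, as in the cells above.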

### Updated options: Option 2 use `python -c` to run on command line from inside Python 

Note that I worked out that for ffq, `python -c "from ffq.main import main; main()"` will work to give USAGE information. (I had worked this out while trying to use, on the command line in an execute node of the OSG OSPool, some of the deepTools scripts that don't have `if __name__ == "__main__"` as part of their code. The ones that do can just be called with `python -m` with the arguments after, like `python -m deeptools.bamCompare --help`. I consulted Claude.ai for options for the more complex cases where `python -m` cannot be used.) Adding in the trick from https://stackoverflow.com/a/76443894/8508004, I should be able to put the arguments after that, too, in order to run with arguments and not just get USAGE!
                                                     
                                                     
I believe this will also work on OSG OSPool but there it'll be `python3 -c`.

In action:

In [13]:
!python -c "from ffq.main import main; main()" --ftp ERX5777701

[2025-04-16 19:58:29,629]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:58:29,929]    INFO Parsing run ERR6140859
[
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 46548794,
        "filenumber": 1,
        "md5": "87fbe7c5b04a66a8f17da0f16bb6bf36",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz"
    },
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_2.fastq.gz",
        "filetype": "fastq",
        "filesize": 48630523,
        "filenumber": 2,
        "md5": "49a6698e9c9099f21f6ace7953925027",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz"
    }
]
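
The same `python -c` form can also be launched from Python code with `subprocess.run`, which hands the output back as a string instead of only displaying it. The sketch below uses a one-liner that just echoes `sys.argv` as JSON, standing in for `from ffq.main import main; main()`, so it runs without ffq or network access.

```python
import json
import subprocess
import sys

# Child one-liner standing in for "from ffq.main import main; main()".
child_code = "import json, sys; print(json.dumps(sys.argv))"

completed = subprocess.run(
    [sys.executable, "-c", child_code, "--ftp", "ERX5777701"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
child_argv = json.loads(completed.stdout)
print(child_argv)
```

Using `sys.executable` rather than a bare `python` keeps the child in the same environment as the running kernel, which matters when the package is only installed there.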


---------

### EARLIER OPTION I had worked out based on pasting and adapting source code for ffq - THIS IS WAY MORE COMPLEX THAN ABOVE BECAUSE YOU HAVE TO DIG THROUGH PERTINENT CODE AND ADD IT HERE IN ADAPTED FORM (so if ffq is updated significantly in subsequent development, you miss out or have to do it again)

Based on https://github.com/pachterlab/ffq/blob/898f56b0b9a07f152d2771fbb5157bf928febf4f/ffq/main.py and on having seen `from ffq.main import run_ffq` work when I was inside Python running on OSG, where I had also installed ffq in the environment.

Sending the arguments is based on https://docs.python.org/3/library/argparse.html, especially [the 'Parsing arguments' section](https://docs.python.org/3/library/argparse.html#parsing-arguments), which fortunately shows how to pass in the equivalent of what you give when you call something from the command line with arguments. It reminds me of `sh` module use in Python:

```python
parser.parse_args(['--sum', '7', '-1', '42'])
```
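
To make that concrete, here is a self-contained sketch of passing an explicit list to `parse_args()`. The toy parser loosely mirrors a few options from the `ffq --help` text above; it is an illustration only and NOT ffq's actual parser.

```python
import argparse

# Toy parser loosely mirroring a few options from the `ffq --help` text.
# It is an illustration only, not ffq's real parser.
parser = argparse.ArgumentParser(prog="toy")
parser.add_argument("-l", dest="level", type=int, default=None)
parser.add_argument("--ftp", action="store_true")
parser.add_argument("IDs", nargs="+")

# An explicit list is parsed as given, so no program-name placeholder
# is needed at the front, unlike with sys.argv.
args = parser.parse_args(["-l", "1", "--ftp", "ERX5777701"])
print(args.level, args.ftp, args.IDs)
```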

### The next two cells demonstrate running ffq from inside Python code!:

In [14]:
#--------MOST OF THESE SECTIONS ADAPTED FROM THE SOURCE CODE--------------------------------------------------

import argparse
import json  # used by parse_gse_search and parse_gse_summary below
import os
import re
import sys
#------First section mainly from `ffq/ffq/ffq.py`-------------------------------------------------------------
# TODO evenetually [Sic] create an accession class
# TODO better handling DOI parsing
def validate_accessions(accessions, search_types):
    # 1. extract the prefix 2. determine if prefix is valid or its a DOI
    # {accession: str, prefix: str, valid: bool}

    IDs = []
    for input_accession in accessions:
        # encode needs :3 ?
        # bioproject needs :3 ?
        # biosample needs :4 or : 5 ?
        accession = input_accession.upper()

        valid = False
        prefix = re.findall(r"(\D+).+", accession)[0]

        if prefix in search_types:
            valid = True

        elif DOI_PARSER.match(accession) is not None:
            valid = True
            logger.warning("Searching by DOI may result in missing information.")
            prefix = "DOI"
        else:
            prefix = "UNKNOWN"
        # TODO add error if not valid

        IDs.append(
            {"accession": accession, "prefix": prefix, "valid": valid, "error": None}
        )

    return IDs


def parse_run(soup):
    """Given a BeautifulSoup object representing a run, parse out relevant
    information.

    :param soup: a BeautifulSoup object representing a run
    :type soup: bs4.BeautifulSoup

    :return: a dictionary containing run information
    :rtype: dict
    """
    accession = soup.find("PRIMARY_ID", text=RUN_PARSER).text
    experiment = (
        soup.find("PRIMARY_ID", text=EXPERIMENT_PARSER).text
        if soup.find("PRIMARY_ID", text=EXPERIMENT_PARSER)
        else soup.find("EXPERIMENT_REF")["accession"]
    )

    study_parsed = soup.find("ID", text=PROJECT_PARSER)
    if study_parsed:
        study = study_parsed.text
    else:
        #     logger.warning(
        #         'Failed to parse study information from ENA XML. Falling back to '
        #         'ENA search...'
        #     )
        study = search_ena_run_study(accession)
    sample_parsed = soup.find("ID", text=SAMPLE_PARSER)
    if sample_parsed:
        sample = sample_parsed.text
    else:
        # logger.warning(
        #     'Failed to parse sample information from ENA XML. Falling back to '
        #     'ENA search...'
        # )
        sample = search_ena_run_sample(accession)
    title = soup.find("TITLE").text

    attributes = {}

    for attr in soup.find_all("RUN_ATTRIBUTE"):
        try:
            tag = attr.find("TAG").text
            value = attr.find("VALUE").text
            attributes[tag] = value
        except:  # noqa
            pass
    if attributes:
        try:
            attributes["ENA-SPOT-COUNT"] = int(attributes["ENA-SPOT-COUNT"])
            attributes["ENA-BASE-COUNT"] = int(attributes["ENA-BASE-COUNT"])
        except:  # noqa
            pass
    ftp_files = get_files_metadata_from_run(soup)
    # print(ftp_files)
    # ftp_files = [file for file in ftp_files if accession in file['url']]
    # print(ftp_files)
    # for file in ftp_files:
    #     if accession in file['url']:
    # url, md5, size =file['url'], file['md5'], file['size']
    # # we want url last, so we delete they key and include it later
    # del file['url'], file['md5'], file['size']
    # filetype, fileno = parse_url(file['url'])
    # file['filetype'] = filetype
    # file['filenumber'] = fileno

    alt_links_soup = ncbi_fetch_fasta(accession, "sra")

    aws_links = parse_ncbi_fetch_fasta(alt_links_soup, "AWS")
    aws_results = []
    for url in aws_links:
        if accession in url:
            filetype, fileno = parse_url(url)
            aws_results.append(
                {
                    "accession": accession,
                    "filename": url.split("/")[-1],
                    "filetype": filetype,
                    "filesize": None,
                    "filenumber": fileno,
                    "md5": None,
                    "urltype": "aws",
                    "url": url,
                }
            )

    gcp_links = parse_ncbi_fetch_fasta(alt_links_soup, "GCP")
    gcp_results = []
    for url in gcp_links:
        if accession in url:
            filetype, fileno = parse_url(url)
            gcp_results.append(
                {
                    "accession": accession,
                    "filename": url.split("/")[-1],
                    "filetype": filetype,
                    "filesize": None,
                    "filenumber": fileno,
                    "md5": None,
                    "urltype": "gcp",
                    "url": url,
                }
            )

    ncbi_links = parse_ncbi_fetch_fasta(alt_links_soup, "NCBI")
    ncbi_results = []
    for url in ncbi_links:
        if accession in url:
            filetype, fileno = parse_url(url)
            ncbi_results.append(
                {
                    "accession": accession,
                    "filename": url.split("/")[-1],
                    "filetype": filetype,
                    "filesize": None,
                    "filenumber": fileno,
                    "md5": None,
                    "urltype": "ncbi",
                    "url": url,
                }
            )
    files = {
        "ftp": ftp_files,
        "aws": aws_results,
        "gcp": gcp_results,
        "ncbi": ncbi_results,
    }
    return {
        "accession": accession,
        "experiment": experiment,
        "study": study,
        "sample": sample,
        "title": title,
        "attributes": attributes,
        "files": files,
    }


def parse_sample(soup):
    """Given a BeautifulSoup object representing a sample, parse out relevant
    information.

    :param soup: a BeautifulSoup object representing a sample
    :type soup: bs4.BeautifulSoup

    :return: a dictionary containing sample information
    :rtype: dict
    """
    accession = soup.find("PRIMARY_ID", text=SAMPLE_PARSER).text
    title = soup.find("TITLE").text
    organism = soup.find("SCIENTIFIC_NAME").text
    sample_attribute = soup.find_all("SAMPLE_ATTRIBUTE")
    try:
        attributes = {
            attr.find("TAG").text: attr.find("VALUE").text for attr in sample_attribute
        }
    except:  # noqa
        attributes = ""
    if attributes:
        try:
            attributes["ENA-SPOT-COUNT"] = int(attributes["ENA-SPOT-COUNT"])
            attributes["ENA-BASE-COUNT"] = int(attributes["ENA-BASE-COUNT"])
        except:  # noqa
            pass
    try:

        experiment = soup.find(
            re.compile(r"PRIMARY_ID|ID"), text=EXPERIMENT_PARSER
        ).text
        # try:
        #     experiment = soup.find('ID', text=EXPERIMENT_PARSER).text
        # except:  # noqa
        #     experiment = soup.find('PRIMARY_ID', text=EXPERIMENT_PARSER).text

    except:  # noqa
        logger.warning(
            "Failed to parse sample information from ENA XML. Falling back to "
            "ENA search..."
        )
        try:
            experiment = search_ena(
                accession,
                "secondary_sample_accession",
                "read_experiment",
                "experiment_accession",
            )[0]

        except:  # noqa
            experiment = ""
            logger.warning("No experiment found")

    return {
        "accession": accession,
        "title": title,
        "organism": organism,
        "attributes": attributes,
        "experiments": experiment,
    }


def parse_experiment_with_run(soup, level):
    """Given a BeautifulSoup object representing an experiment, parse out relevant
    information.

    :param soup: a BeautifulSoup object representing an experiment
    :type soup: bs4.BeautifulSoup

    :param level: positive integer representing how many downstream accession levels should be fetched.
    :type level: int

    :return: a dictionary containing experiment information
    :rtype: dict
    """
    accession = soup.find("PRIMARY_ID", text=EXPERIMENT_PARSER).text
    title = soup.find("TITLE").text
    platform = soup.find("INSTRUMENT_MODEL").find_parent().name
    instrument = soup.find("INSTRUMENT_MODEL").text

    experiment = {
        "accession": accession,
        "title": title,
        "platform": platform,
        "instrument": instrument,
    }
    if level is None or level > 1:
        # Returns all of the runs associated with an experiment
        runs = srx_to_srrs(accession)
        if len(runs) == 1:
            logger.warning(f"There is 1 run for {accession}")

        else:
            logger.warning(f"There are {len(runs)} runs for {accession}")

        runs = {run: ffq_run(run) for run in runs}

        experiment.update({"runs": runs})
        return experiment
    else:
        return experiment


def parse_study(soup):
    """Given a BeautifulSoup object representing a study, parse out relevant
    information.

    :param soup: a BeautifulSoup object representing a study
    :type soup: bs4.BeautifulSoup

    :return: a dictionary containing study information
    :rtype: dict
    """
    accession = soup.find("PRIMARY_ID", text=PROJECT_PARSER).text
    title = soup.find("STUDY_TITLE").text
    abstract = soup.find("STUDY_ABSTRACT").text if soup.find("STUDY_ABSTRACT") else ""
    return {"accession": accession, "title": title, "abstract": abstract}


def parse_gse_search(soup):
    """Given a BeautifulSoup object representing a geo study, parse out relevant
    information.

    :param soup: a BeautifulSoup object representing a study
    :type soup: bs4.BeautifulSoup

    :return: a dictionary containing geo study unique identifier based on a search
    :rtype: dict
    """
    data = json.loads(soup.text)
    if data["esearchresult"]["idlist"]:
        accession = data["esearchresult"]["querytranslation"].split("[")[0]
        geo_id = data["esearchresult"]["idlist"][-1]
        return {"accession": accession, "geo_id": geo_id}
    else:
        raise InvalidAccession("Provided GSE accession is invalid")


def parse_gse_summary(soup):
    """Given a BeautifulSoup object representing a geo study identifier, parse out relevant
    information.

    :param soup: a BeautifulSoup object representing a study
    :type soup: bs4.BeautifulSoup

    :return: a dictionary containing summary of geo study information
    :rtype: dict
    """
    data = json.loads(soup.text)

    geo_id = data["result"]["uids"][-1]

    relations = data["result"][f"{geo_id}"]["extrelations"]
    sra = None  # stays None when no SRA relation is listed
    for value in relations:
        if value["relationtype"] == "SRA":  # may have many samples?
            sra = value

    if sra:
        srp = sra["targetobject"]
        return {"accession": srp}


def ffq_run(accession, level=0):  # noqa
    """Fetch Run information.

    :param accession: run accession (SRR, ERR or DRR)
    :type accession: str

    :return: dictionary of run information
    :rtype: dict
    """
    logger.info(f"Parsing run {accession}")
    run = parse_run(get_xml(accession))
    return run


def ffq_study(accession, level=None):
    """Fetch Study information.

    :param accession: study accession (SRP, ERP or DRP)
    :type accession: str

    :param level: positive integer representing how many downstream accession levels should be fetched.
    :type level: int

    :return: dictionary of study information. The dictionary contains a
             'samples' key, which is a dictionary of all the samples in the study, as
             returned by `ffq_sample`.
    :rtype: dict
    """
    logger.info(f"Parsing Study {accession}")
    study = parse_study(get_xml(accession))
    if level is None or level != 1:
        try:
            level -= 1
        except:  # noqa
            pass
        logger.info(f"Getting Sample for {accession}")
        sample_ids = get_samples_from_study(accession)
        logger.warning(f"There are {str(len(sample_ids))} samples for {accession}")
        samples = [ffq_sample(sample_id, level) for sample_id in sample_ids]
        study.update({"samples": {sample["accession"]: sample for sample in samples}})
        return study
    else:
        return study


def ffq_gse(accession, level=None):
    """Fetch GSE information.

    This function finds the GSMs corresponding to the GSE and calls `ffq_gsm`.

    :param accession: GSE accession
    :type accession: str

    :param level: positive integer representing how many downstream accession levels should be fetched.
    :type level: int

    :return: dictionary containing GSE information. The dictionary contains a
             'geo_samples' key, which is a dictionary of all the GSMs in the study, as
             returned by `ffq_gsm`.
    :rtype: dict
    """
    logger.info(f"Parsing GEO {accession}")
    gse = parse_gse_search(get_gse_search_json(accession))
    logger.info(f"Finding supplementary files for GEO {accession}")
    time.sleep(1)
    supp = geo_to_suppl(accession, "GSE")
    if len(supp) > 0:
        gse.update({"supplementary_files": supp})
    else:
        logger.info(f"No supplementary files found for {accession}")
    gse.pop("geo_id")
    if level is None or level != 1:
        try:
            level -= 1
        except:  # noqa
            pass
        time.sleep(1)
        gsm_ids = gse_to_gsms(accession)
        logger.warning(f"There are {str(len(gsm_ids))} samples for {accession}")
        gsms = [ffq_gsm(gsm_id, level) for gsm_id in gsm_ids]
        gse.update({"geo_samples": {sample["accession"]: sample for sample in gsms}})
        return gse
    else:
        return gse


def ffq_gsm(accession, level=None):
    """Fetch GSM information.

    This function finds the SRS corresponding to the GSM and calls `ffq_sample`.

    :param accession: GSM accession
    :type accession: str

    :param level: positive integer representing how many downstream accession levels should be fetched.
    :type level: int

    :return: dictionary containing GSM information. The dictionary contains a
             'samples' key, which is a dictionary of the sample associated with the GSM, as
             returned by `ffq_sample`.
    :rtype: dict
    """
    logger.info(f"Parsing GSM {accession}")
    gsm = get_gsm_search_json(accession)
    logger.info(f"Finding supplementary files for GSM {accession}")
    time.sleep(1)
    supp = geo_to_suppl(accession, "GSM")
    if supp:
        gsm.update({"supplementary_files": supp})
    else:
        logger.info(f"No supplementary files found for {accession}")

    gsm.update(gsm_to_platform(accession))
    if level is None or level != 1:
        try:
            level -= 1
        except:  # noqa
            pass
        logger.info(f"Getting sample for {accession}")
        srs = gsm_id_to_srs(gsm.pop("geo_id"))
        if srs:
            sample = ffq_sample(srs, level)
            gsm.update({"samples": {sample["accession"]: sample}})
        else:
            return gsm
        return gsm
    else:
        return gsm


def ffq_experiment(accession, level=None):
    """Fetch Experiment information.

    :param accession: experiment accession (SRX, ERX or DRX)
    :type accession: str

    :param level: positive integer representing how many downstream accession levels should be fetched.
    :type level: int

    :return: dictionary of experiment information. The dictionary contains a
             'runs' key, which is a dictionary of all the runs in the experiment, as
             returned by `ffq_run`.
    :rtype: dict
    """
    logger.info(f"Parsing Experiment {accession}")
    experiment = parse_experiment_with_run(get_xml(accession), level)
    return experiment


def ffq_sample(accession, level=None):
    """Fetch Sample information.

    :param accession: sample accession (SRS, ERS or DRS)
    :type accession: str

    :param level: positive integer representing how many downstream accession levels should be fetched.
    :type level: int

    :return: dictionary of sample information. The dictionary contains an
             'experiments' key, which holds the experiments for the sample, as
             returned by `ffq_experiment`.
    :rtype: dict
    """
    logger.info(f"Parsing sample {accession}")
    xml_sample = get_xml(accession)
    sample = parse_sample(xml_sample)
    if level is None or level != 1:
        try:
            level -= 1
        except:  # noqa
            pass
        logger.info(f"Getting Experiment for {accession}")
        exp_id = sample["experiments"]
        if not exp_id:
            try:
                alias = xml_sample.SAMPLE.attrs["alias"]
                id = get_gsm_search_json(alias)["geo_id"]
                exp_id = ncbi_summary("gds", id)[id]["extrelations"][0]["targetobject"]
            except:  # noqa
                logger.warning(f"No Experiment found for {accession}")
        if "," in exp_id:
            exp_ids = exp_id.split(",")
            experiments = [ffq_experiment(exp_id, level) for exp_id in exp_ids]
            sample.update(
                {
                    "experiments": [
                        {experiment["accession"]: experiment}
                        for experiment in experiments
                    ]
                }
            )
            return sample
        else:
            experiment = ffq_experiment(exp_id, level)
            sample.update({"experiments": {experiment["accession"]: experiment}})
        return sample
    else:
        return sample


def ffq_encode(accession, level=0):
    """Fetch ENCODE ids information. This
    function receives an ENCSR, ENCBS or ENCDO
    ENCODE id and fetches the associated metadata

    :param accession: an ENCODE id (ENCSR, ENCBS or ENCDO)
    :type accession: str

    :return: dictionary of ENCODE id metadata.
    :rtype: dict
    """
    logger.info(f"Parsing {accession}")
    encode = parse_encode_json(accession, get_encode_json(accession))
    return encode


def ffq_bioproject(accession, level=0):  # noqa
    """Fetch bioproject ids information. This
    function receives a CRX accession
    and fetches the associated metadata

    :param accession: a bioproject CRX id
    :type accession: str

    :return: dictionary of bioproject metadata.
    :rtype: dict
    """
    return parse_bioproject(ena_fetch(accession, "bioproject"))


def ffq_biosample(accession, level=None):
    """Fetch biosample ids information. This
    function receives a SAMN accession
    and fetches the associated metadata

    :param accession: a biosample SAMN id
    :type accession: str

    :return: dictionary of biosample metadata.
    :rtype: dict
    """
    # commented below: old implementation using ncbi to fetch biosample data
    # soup = ena_fetch(accession, 'biosample')
    # sample = soup.find('id', text=SAMPLE_PARSER).text
    soup = get_xml(accession)
    sample = soup.SAMPLE.attrs["accession"]
    try:
        level = level - 1
    except:  # noqa
        pass
    sample_data = ffq_sample(sample, level)
    return {"accession": accession, "samples": sample_data}


def ffq_doi(doi, level=0):  # noqa
    """Fetch DOI information.

    This function first searches CrossRef for the paper title, then uses that
    to find any SRA studies that match the title. If there are, all the runs in
    each study are fetched. If there are not, Pubmed is searched for the DOI,
    which may contain GEO IDs. If there are GEO IDs, `ffq_gse` is called for each.
    If not, the Pubmed entry may include SRA links. If there are, `ffq_run` is
    called for each linked run. These runs are then grouped by SRP.

    :param doi: paper DOI
    :type doi: str

    :return: list of SRA or GEO studies that are linked to this paper. If
             there are SRA studies matching the paper title, the returned
             list is a list of SRA studies. If not, and the paper includes
             a GEO link, it is a list of GEO studies. If not, and the paper
             includes SRA links, it is a list of SRPs.
    :rtype: list
    """
    # Sanitize DOI so that it doesn't include leading http or https
    parsed = urlparse(doi)

    if parsed.scheme:
        doi = parsed.path.strip("/")

    logger.info(f"Searching for DOI '{doi}'")
    paper = get_doi(doi)
    title = paper["title"][0]

    logger.info(f"Searching for Study SRP with title '{title}'")
    study_accessions = search_ena_title(title)

    if study_accessions:
        logger.info(
            f'Found {len(study_accessions)} studies that match this title: {", ".join(study_accessions)}'
        )
        return [ffq_study(accession, None) for accession in study_accessions]

    # If no study with the title is found, search Pubmed, which can be linked
    # to a GEO accession.
    logger.warning(
        ("No studies found with the given title. " f"Searching Pubmed for DOI '{doi}'")
    )
    pubmed_ids = ncbi_search("pubmed", doi)

    if not pubmed_ids:
        raise Exception("No Pubmed records match the DOI")
    if len(pubmed_ids) > 1:
        raise Exception(f'{len(pubmed_ids)} match the DOI: {", ".join(pubmed_ids)}')

    pubmed_id = pubmed_ids[0]
    logger.info(f"Searching for GEO record linked to Pubmed ID '{pubmed_id}'")
    geo_ids = ncbi_link("pubmed", "gds", pubmed_id)
    if geo_ids:
        # Convert these geo ids to GSE accessions
        gses = geo_ids_to_gses(geo_ids)
        logger.info(f'Found {len(gses)} GEO Accessions: {", ".join(gses)}')
        if len(gses) != len(geo_ids):
            raise Exception(
                (
                    "Number of GEO Accessions found does not match the number of GEO "
                    f"records: expected {len(geo_ids)} but found {len(gses)}"
                )
            )
        # Sleep for 1sec because NCBI has rate-limiting to 3 requests/sec
        time.sleep(1)
        return [ffq_gse(accession) for accession in gses]

    # If the pubmed id is not linked to any GEO record, search for SRA records
    logger.warning(
        (
            f"No GEO records are linked to the Pubmed ID '{pubmed_id}'. "
            "Searching for SRA record linked to this Pubmed ID."
        )
    )
    time.sleep(1)
    sra_ids = ncbi_link("pubmed", "sra", pubmed_id)
    if sra_ids:
        srrs = sra_ids_to_srrs(sra_ids)
        logger.warning(f"Found {len(srrs)} run accessions.")
        runs = [ffq_run(accession) for accession in srrs]

        # Group runs by project to keep things consistent.
        studies = {}
        for run in runs:
            study = run["study"].copy()  # Prevent recursive dict
            # get the study accession if exists and add the run to the runs
            studies.setdefault(study["accession"], study).setdefault("runs", {})[
                run["accession"]
            ] = run

        return [v for k, v in studies.items()]
    else:
        raise Exception(f"No SRA records are linked to Pubmed ID '{pubmed_id}'")

        
#--------ABOVE AND BELOW FROM DIFFERENT SOURCE IN THE SOURCE CODE------------------------------------------
#------Next section mainly from `ffq/ffq/main.py`---------------------------------------------------------
RUN_TYPES = (
    "SRR",
    "ERR",
    "DRR",
)
PROJECT_TYPES = (
    "SRP",
    "ERP",
    "DRP",
)
EXPERIMENT_TYPES = (
    "SRX",
    "ERX",
    "DRX",
)
SAMPLE_TYPES = ("SRS", "ERS", "DRS", "CRS")
GEO_TYPES = ("GSE", "GSM")
ENCODE_TYPES = ("ENCSR", "ENCBS", "ENCDO")
BIOPROJECT_TYPES = (
    "CRX",
)  # TODO implement CRR and CRP, most dont have public metadata.
BIOSAMPLE_TYPES = ("SAMN", "SAMD", "SAMEA", "SAMEG")
OTHER_TYPES = ("DOI",)
SEARCH_TYPES = (
    RUN_TYPES
    + PROJECT_TYPES
    + EXPERIMENT_TYPES
    + SAMPLE_TYPES
    + GEO_TYPES
    + ENCODE_TYPES
    + BIOPROJECT_TYPES
    + BIOSAMPLE_TYPES
    + OTHER_TYPES
)

# main ffq caller
FFQ = {
    "DOI": ffq_doi,
    "GSM": ffq_gsm,
    "GSE": ffq_gse,
}
FFQ.update({t: ffq_run for t in RUN_TYPES})
FFQ.update({t: ffq_study for t in PROJECT_TYPES})
FFQ.update({t: ffq_experiment for t in EXPERIMENT_TYPES})
FFQ.update({t: ffq_sample for t in SAMPLE_TYPES})
FFQ.update({t: ffq_encode for t in ENCODE_TYPES})
FFQ.update({t: ffq_bioproject for t in BIOPROJECT_TYPES})
FFQ.update({t: ffq_biosample for t in BIOSAMPLE_TYPES})

RUN_PARSER = re.compile(r"(SRR.+)|(ERR.+)|(DRR.+)")
EXPERIMENT_PARSER = re.compile(r"(SRX.+)|(ERX.+)|(DRX.+)")
PROJECT_PARSER = re.compile(r"(SRP.+)|(ERP.+)|(DRP.+)")
SAMPLE_PARSER = re.compile(r"(SRS.+)|(ERS.+)|(DRS.+)")
DOI_PARSER = re.compile(r"^10.\d{4,9}\/[-._;()\/:A-Z0-9]+")  # noqa

#--------ABOVE AND BELOW FROM DIFFERENT SOURCE IN THE SOURCE CODE----------------------------------------------------
#------Next section adapted from `ffq/ffq/main.py` to set up the `args` to use with `run_ffq` imported below--------
parser = argparse.ArgumentParser(
    description=(
        (
            f"ffq : A command line tool to find sequencing data "
            "from SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ / Biosample."
        )
    )
)
parser._actions[0].help = parser._actions[0].help.capitalize()

parser.add_argument(
    "IDs",
    help=(
        "One or multiple SRA / GEO / ENCODE / ENA / EBI-EMBL / DDBJ / Biosample accessions, "
        "DOIs, or paper titles"
    ),
    nargs="+",
)
parser.add_argument(
    "-o",
    metavar="OUT",
    help=("Path to write metadata (default: standard out)"),
    type=str,
    required=False,
)

parser.add_argument(
    "-t",
    metavar="TYPE",
    help=argparse.SUPPRESS,
    type=str,
    required=False,
    choices=SEARCH_TYPES,
)

parser.add_argument(
    "-l",
    metavar="LEVEL",
    help="Max depth to fetch data within accession tree",
    type=int,
)

parser.add_argument("--ftp", help="Return FTP links", action="store_true")

parser.add_argument("--aws", help="Return AWS links", action="store_true")  # noqa

parser.add_argument("--gcp", help="Return GCP links", action="store_true")  # noqa

parser.add_argument("--ncbi", help="Return NCBI links", action="store_true")  # noqa
parser.add_argument(
    "--split",
    help="Split output into separate files by accession  (`-o` is a directory)",  # noqa
    action="store_true",
)
parser.add_argument(
    "--verbose", help="Print debugging information", action="store_true"
)
parser.add_argument(
    "--version", action="version", version=f"%(prog)s 0.3"
)

#--------ABOVE FROM THE SOURCE CODE--------------------------------------------------

from ffq.main import run_ffq
#run_ffq(parser.parse_args(['--ftp','ERX5777701'])) #<-- this works but makes more sense to have a set-up cell for code and then run equivalent process in its own cell
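
# The list handed to `parser.parse_args` is just the command-line words, one
# string per word. The sketch below shows one way to build that list for
# several accessions or flags. `ffq_argv` is a hypothetical helper of mine,
# not part of ffq; it only assembles strings for the flags defined on the
# parser above.

```python
def ffq_argv(ids, link_types=(), level=None):
    """Build the argument list expected by `parser.parse_args`.

    Hypothetical helper (not part of ffq): it mirrors the `--ftp`, `--aws`,
    `--gcp`, `--ncbi` and `-l` options defined on the parser above.
    """
    allowed = ("ftp", "aws", "gcp", "ncbi")
    argv = []
    for link_type in link_types:
        if link_type not in allowed:
            raise ValueError(f"unknown link type: {link_type}")
        argv.append(f"--{link_type}")
    if level is not None:
        # argparse wants strings, even for integer options
        argv += ["-l", str(level)]
    argv += list(ids)
    return argv


print(ffq_argv(["ERX5777701"], link_types=["ftp"]))
```

# With the parser from this cell, that would be used as
# `run_ffq(parser.parse_args(ffq_argv(["ERX5777701"], link_types=["ftp"])))`.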

In [15]:
run_ffq(parser.parse_args(['--ftp','ERX5777701']))

[2025-04-16 19:58:31,166]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:58:31,171]    INFO Parsing run ERR6140859


[{'accession': 'ERR6140859',
  'filename': 'ERR6140859_1.fastq.gz',
  'filetype': 'fastq',
  'filesize': 46548794,
  'filenumber': 1,
  'md5': '87fbe7c5b04a66a8f17da0f16bb6bf36',
  'urltype': 'ftp',
  'url': 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz'},
 {'accession': 'ERR6140859',
  'filename': 'ERR6140859_2.fastq.gz',
  'filetype': 'fastq',
  'filesize': 48630523,
  'filenumber': 2,
  'md5': '49a6698e9c9099f21f6ace7953925027',
  'urltype': 'ftp',
  'url': 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz'}]

##### Gives the same as the command line!!

In [16]:
!ffq --ftp ERX5777701

[2025-04-16 19:58:32,187]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:58:32,488]    INFO Parsing run ERR6140859
[
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 46548794,
        "filenumber": 1,
        "md5": "87fbe7c5b04a66a8f17da0f16bb6bf36",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz"
    },
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_2.fastq.gz",
        "filetype": "fastq",
        "filesize": 48630523,
        "filenumber": 2,
        "md5": "49a6698e9c9099f21f6ace7953925027",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz"
    }
]


I realized later that running it without specifying a link source (i.e., without a flag such as `--ftp`) lets you learn much more about the available resources.

In [17]:
run_ffq(parser.parse_args(['ERX5777701']))

[2025-04-16 19:58:33,647]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:58:33,660]    INFO Parsing run ERR6140859


{'ERX5777701': {'accession': 'ERX5777701',
  'title': 'Fly Cell Atlas: single-cell transcriptomes of the entire adult Drosophila - Smartseq2 dataset',
  'platform': 'ILLUMINA',
  'instrument': 'Illumina NovaSeq 6000',
  'runs': {'ERR6140859': {'accession': 'ERR6140859',
    'experiment': 'ERX5777701',
    'study': 'ERP130174',
    'sample': 'ERS7029664',
    'title': 'Illumina NovaSeq 6000 paired end sequencing; Fly Cell Atlas: single-cell transcriptomes of the entire adult Drosophila - Smartseq2 dataset',
    'attributes': {'ENA-FIRST-PUBLIC': '2021-07-02',
     'ENA-LAST-UPDATE': '2021-07-02'},
    'files': {'ftp': [{'accession': 'ERR6140859',
       'filename': 'ERR6140859_1.fastq.gz',
       'filetype': 'fastq',
       'filesize': 46548794,
       'filenumber': 1,
       'md5': '87fbe7c5b04a66a8f17da0f16bb6bf36',
       'urltype': 'ftp',
       'url': 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz'},
      {'accession': 'ERR6140859',
       'filenam
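
The full result is nested (experiment, then `runs`, then `files`, then a link type such as `ftp`). A minimal sketch for pulling every link out of such a structure without hard-coding the path is a recursive walk that collects anything stored under a `url` key. `collect_urls` is a hypothetical helper, and the `example` dictionary is a trimmed-down stand-in shaped like the output above, not real ffq output.

```python
def collect_urls(node):
    """Recursively gather every string stored under a 'url' key."""
    urls = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                urls.append(value)
            else:
                urls.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(collect_urls(item))
    return urls


# Trimmed-down stand-in shaped like the nested result shown above.
example = {
    "ERX5777701": {
        "accession": "ERX5777701",
        "runs": {
            "ERR6140859": {
                "accession": "ERR6140859",
                "files": {
                    "ftp": [
                        {
                            "filename": "ERR6140859_1.fastq.gz",
                            "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz",
                        }
                    ]
                },
            }
        },
    }
}

print(collect_urls(example))
```

Because the walk does not assume which link types are present, it should pick up links under other keys next to `ftp` as well, if any are returned.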

### Can I parse the Python result, as the documentation suggests, to obtain the links from the ffq results for downloading raw data?

Under [the section 'Downloading data'](https://github.com/pachterlab/ffq/tree/898f56b0b9a07f152d2771fbb5157bf928febf4f#downloading-data):

>"ffq is specifically designed to download metadata and to facilitate obtaining links to sequence files. To download raw data from the links obtained with ffq you can use..." curl among other things

In [the documentation they say](https://github.com/pachterlab/ffq/tree/898f56b0b9a07f152d2771fbb5157bf928febf4f#ftp), "Alternatively, the urls can be extracted from the json output with jq and then piped into cURL."   

This is the command line equivalent example they provide in the documentation:

```shell
ffq --ftp SRR10668798 | jq -r '.[] | .url' | xargs curl -O
```

However, since ffq isn't registering as an executable command on the command line at OSG's access point (and, I assume, at the execution points), is there a way I can accomplish the equivalent with Python to set up for getting the sequence files?

[Here](https://stackoverflow.com/a/65078827/8508004) suggests pyjq will work for the `jq` step.

In [18]:
%pip install pyjq -q

Note: you may need to restart the kernel to use updated packages.


In [19]:
raw_json_output = run_ffq(parser.parse_args(['--ftp','ERX5777701']))

[2025-04-16 19:59:16,945]    INFO Parsing Experiment ERX5777701
[2025-04-16 19:59:16,957]    INFO Parsing run ERR6140859


In [20]:
raw_json_output

[{'accession': 'ERR6140859',
  'filename': 'ERR6140859_1.fastq.gz',
  'filetype': 'fastq',
  'filesize': 46548794,
  'filenumber': 1,
  'md5': '87fbe7c5b04a66a8f17da0f16bb6bf36',
  'urltype': 'ftp',
  'url': 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz'},
 {'accession': 'ERR6140859',
  'filename': 'ERR6140859_2.fastq.gz',
  'filetype': 'fastq',
  'filesize': 48630523,
  'filenumber': 2,
  'md5': '49a6698e9c9099f21f6ace7953925027',
  'urltype': 'ftp',
  'url': 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz'}]

Parsing that based on the example from [here](https://stackoverflow.com/a/65078827/8508004) and the ffq documentation. The example from Stack Overflow:

```python
import pyjq
print(pyjq.all( ".members[] | [.name]", {"members": [ {"name": "foo"} ]} ))
```

Example from ffq documentation with jq:

```shell
$ ffq --ftp SRR10668798 | jq -r '.[] | .url' | xargs curl -O
```

That makes me think this would work:

```python
import pyjq
print(pyjq.all( ".[] | .url",raw_json_output ))
```

In [21]:
import pyjq
print(pyjq.all( ".[] | .url",raw_json_output ))

['ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz', 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz']


That worked!
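
Since the `--ftp` result is already a plain Python list of dictionaries, the same extraction can also be sketched without `pyjq`, which may help where installing `pyjq` is a problem. The `records` list below is a stand-in with the fields trimmed down from `raw_json_output` above; the list comprehension is the equivalent of the jq filter `.[] | .url`.

```python
# Stand-in for `raw_json_output` above, trimmed to the fields needed here.
records = [
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_1.fastq.gz",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz",
    },
    {
        "accession": "ERR6140859",
        "filename": "ERR6140859_2.fastq.gz",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz",
    },
]

# Plain-Python equivalent of the jq filter `.[] | .url`
urls = [record["url"] for record in records]
print(urls)
```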

So to make the curl commands to get each set of specified fastq files, it would be:

In [22]:
import os
list_urls = pyjq.all( ".[] | .url",raw_json_output ) 
for url in list_urls:
    os.system(f"echo curl -OL {url}")

curl -OL ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_1.fastq.gz
curl -OL ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR614/009/ERR6140859/ERR6140859_2.fastq.gz


Note that the above cell would time out with `os.system(f"curl -OL {url}")` when run on MyBinder because FTP is blocked there. (SO FOR NOW JUST ECHOING THE DRAFTED COMMANDS.) Oddly, though, it did work back in June 2023 when I ran it (I had the old version showing curl retrieving the fastq files), but maybe Gesis hadn't blocked the FTP port at that time. (I know that I didn't run it on Hugging Face because installing `pyjq` fails there.)  
Where I'm intending to run this, `curl` works on the command line and the FTP port is open, so I think this is a fine approach for that step. I wouldn't have been able to use `os.system(f"ffq --ftp {acc_id}")` there on OSG OSPool because the issue was that ffq wasn't getting registered as a command in the Singularity/Apptainer container on OSG OSPool.  
At first glance, this may not fully jibe with the 'pure-Python way' of using `ffq` that I was suggesting at the top of this document; however, the curl retrieval is a downstream step [and could be done with Python using the requests module](https://stackoverflow.com/questions/46311212/pythons-requests-equivalent-of-curl-o).
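
As a rough sketch of that downstream step in Python: the links here are `ftp://` URLs, which `requests` does not handle out of the box, so this version swaps in the standard library's `urllib.request.urlretrieve` instead. It also checks the download against the `md5` field that ffq reports for each file. `md5_of_file` and `fetch_and_verify` are hypothetical helper names of mine, and I have not run the download part where FTP is blocked.

```python
import hashlib
import urllib.request
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_and_verify(record, dest_dir="."):
    """Download one ffq file record and compare it with the record's 'md5'.

    `record` is assumed to be one dictionary from the `--ftp` output above,
    with 'url', 'filename' and 'md5' keys.
    """
    target = Path(dest_dir) / record["filename"]
    urllib.request.urlretrieve(record["url"], str(target))
    if md5_of_file(target) != record["md5"]:
        raise ValueError(f"md5 mismatch for {target}")
    return target
```

Usage would be along the lines of `for record in raw_json_output: fetch_and_verify(record)`, run somewhere the FTP port is open.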

------

### ffq-api's public instance used in conjunction with `curl` isn't a replacement for using ffq

OH WAIT, **this now works on MyBinder!!!** The rest of this section was originally drafted when the code below didn't work in the session, so take the rest of this section with a grain of salt (leaving it for now because it may go back to not working as MyBinder changes)....

`curl` works for retrieving via FTP links in the place where I want to use `ffq` (OSG's OSPool), but I cannot seem to get ffq to register as a command-line tool there in the Singularity/Apptainer container. So in theory I can use curl with ffq-api's public instance to use `ffq` even where it isn't available on the command line. However, currently it doesn't get anything for `ERX5777701`, despite the current `ffq` working well directly with `ERX5777701` above.

Showing it doesn't really work:

In [23]:
output = !curl https://ffq.seqera.io/v1alpha1/ERX5777701 
import json
# pretty print json based on https://www.digitalocean.com/community/tutorials/python-pretty-print-json & https://stackoverflow.com/a/36941257/8508004
json_object2 = json.loads(output[-1])
json_formatted_str = json.dumps(json_object2, indent=2)
print(json_formatted_str)

{
  "results": {
    "ERX5777701": {
      "accession": "ERX5777701",
      "title": "Fly Cell Atlas: single-cell transcriptomes of the entire adult Drosophila - Smartseq2 dataset",
      "platform": "ILLUMINA",
      "instrument": "Illumina NovaSeq 6000",
      "runs": {
        "ERR6140859": {
          "accession": "ERR6140859",
          "experiment": "ERX5777701",
          "study": "ERP130174",
          "sample": "ERS7029664",
          "title": "Illumina NovaSeq 6000 paired end sequencing; Fly Cell Atlas: single-cell transcriptomes of the entire adult Drosophila - Smartseq2 dataset",
          "attributes": {
            "ENA-FIRST-PUBLIC": "2021-07-02",
            "ENA-LAST-UPDATE": "2021-07-02"
          },
          "files": {
            "ftp": [
              {
                "accession": "ERR6140859",
                "filename": "ERR6140859_1.fastq.gz",
                "filetype": "fastq",
                "filesize": 46548794,
                "filenumber": 1,
          

Note that none of the useful URLs, or **anything** else, was retrieved!

Yet the example in the `ffq-api` documentation (`curl https://ffq.seqera.io/v1alpha1/SRR9990627 | jq`) works, so I don't know why I get different results than when actually using current ffq.
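
For places where shelling out to `curl` is awkward, the same request can be sketched with only the standard library. This assumes the public instance keeps answering at the URL used above and keeps wrapping the payload in a top-level `results` key, as in the output shown; `parse_ffq_api` and `fetch_ffq_api` are hypothetical helper names, and the timeout is an arbitrary choice. Only the parsing step is exercised below, on a small stand-in string, so nothing here touches the network.

```python
import json
import urllib.request

# Public ffq-api instance used in the curl cell above
FFQ_API_BASE = "https://ffq.seqera.io/v1alpha1"


def parse_ffq_api(text):
    """Return the dictionary under 'results' (empty if the key is absent)."""
    return json.loads(text).get("results", {})


def fetch_ffq_api(accession, timeout=60):
    """Query the public ffq-api instance for one accession (needs network)."""
    url = f"{FFQ_API_BASE}/{accession}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_ffq_api(response.read().decode("utf-8"))


# Stand-in response, trimmed down from the output shown above.
sample = '{"results": {"ERX5777701": {"accession": "ERX5777701"}}}'
print(json.dumps(parse_ffq_api(sample), indent=2))
```

Parsing the whole response body this way also avoids relying on `output[-1]` being the single line that holds the JSON.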

-----

(Note to self: I ended up using `sratoolkit` to fetch the fastqs when I wanted to save space/time on the remote machine.)