
Use no-sign-request flag when downloading from S3. #31

Closed
bsmith89 wants to merge 2 commits

Conversation

bsmith89 (Collaborator)

Added the --no-sign-request flag to several obvious invocations of aws s3 cp where I believe they are used only for downloading from S3 to local storage.

As long as the reference data bucket is public, this should make it possible to run iggtools midas_run_species and the other subcommands on local servers without any AWS credentials at all.
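
For context, the unsigned access this PR relies on can also be expressed in boto3; the sketch below is for illustration only (it is not code from this PR), with the bucket and key taken from the logs further down.

# Sketch of an unsigned (anonymous) S3 download, equivalent to
# `aws s3 cp --no-sign-request`; assumes boto3 is installed.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file(
    Bucket="microbiome-igg",
    Key="2.0/marker_genes/phyeco/phyeco.map.lz4",
    Filename="phyeco.map.lz4",
)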

@bsmith89 bsmith89 changed the title Use no-sign-request flag when downloading from S3. [WIP] Use no-sign-request flag when downloading from S3. Jan 17, 2020
@bsmith89 (Collaborator, Author)

I still get a similar error:


$ iggtools midas_run_species -1 data/SS01117.m.proc.r1.fq.gz -2 data/SS01117.m.proc.r2.fq.gz data/SS01117.m.proc.iggtools/species

1579295513.4:  Doing important work in subcommand midas_run_species with args
1579295513.4:  {
1579295513.4:      "subcommand": "midas_run_species",
1579295513.4:      "force": false,
1579295513.4:      "debug": false,
1579295513.4:      "zzz_slave_mode": false,
1579295513.4:      "batch_branch": "master",
1579295513.4:      "batch_memory": 378880,
1579295513.4:      "batch_vcpus": 48,
1579295513.4:      "batch_queue": "pairani",
1579295513.4:      "batch_ecr_image": "iggtools",
1579295513.4:      "outdir": "data/SS01117.m.proc.iggtools/species",
1579295513.4:      "r1": "data/SS01117.m.proc.r1.fq.gz",
1579295513.4:      "r2": "data/SS01117.m.proc.r2.fq.gz",
1579295513.4:      "word_size": 28,
1579295513.4:      "aln_mapid": null,
1579295513.4:      "aln_cov": 0.75,
1579295513.4:      "max_reads": null
1579295513.4:  }
1579295513.4:  'rm -rf data/SS01117.m.proc.iggtools/species/species/temp/'
1579295513.4:  'mkdir -p data/SS01117.m.proc.iggtools/species/species/temp/'
1579295513.4:  Overwriting pre-existing ./phyeco.fa with reference download.
1579295513.4:  'rm -f ./phyeco.fa'
1579295513.4:  Overwriting pre-existing ./phyeco.fa.bwt with reference download.
1579295513.4:  'rm -f ./phyeco.fa.bwt'
1579295513.4:  Overwriting pre-existing ./phyeco.fa.header with reference download.
1579295513.4:  'rm -f ./phyeco.fa.header'
1579295513.4:  Overwriting pre-existing ./phyeco.fa.sa with reference download.
1579295513.4:  'rm -f ./phyeco.fa.sa'
1579295513.4:  Overwriting pre-existing ./phyeco.fa.sequence with reference download.
1579295513.4:  'rm -f ./phyeco.fa.sequence'
1579295513.4:  Overwriting pre-existing ./phyeco.map with reference download.
1579295513.4:  'rm -f ./phyeco.map'
1579295513.4:  'aws s3 cp --only-show-errors --no-sign-request s3://microbiome-igg/2.0/marker_genes/phyeco/phyeco.fa.lz4 - | lz4 -dc > ./phyeco.fa'
1579295513.4:  'aws s3 cp --only-show-errors --no-sign-request s3://microbiome-igg/2.0/marker_genes/phyeco/phyeco.fa.header.lz4 - | lz4 -dc > ./phyeco.fa.header'
1579295513.4:  'aws s3 cp --only-show-errors --no-sign-request s3://microbiome-igg/2.0/marker_genes/phyeco/phyeco.map.lz4 - | lz4 -dc > ./phyeco.map'
1579295513.4:  'aws s3 cp --only-show-errors --no-sign-request s3://microbiome-igg/2.0/marker_genes/phyeco/phyeco.fa.sequence.lz4 - | lz4 -dc > ./phyeco.fa.sequence'
1579295513.4:  'aws s3 cp --only-show-errors --no-sign-request s3://microbiome-igg/2.0/marker_genes/phyeco/phyeco.fa.sa.lz4 - | lz4 -dc > ./phyeco.fa.sa'
1579295513.5:  'aws s3 cp --only-show-errors --no-sign-request s3://microbiome-igg/2.0/marker_genes/phyeco/phyeco.fa.bwt.lz4 - | lz4 -dc > ./phyeco.fa.bwt'
Unable to locate credentials. You can configure credentials by running "aws configure".
Traceback (most recent call last):
  File "/pollard/home/bsmith/anaconda3/envs/ucfmt4/bin/iggtools", line 11, in <module>
    load_entry_point('iggtools', 'console_scripts', 'iggtools')()
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/__main__.py", line 18, in main
    return subcommand_main(subcommand_args)
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/subcommands/midas_run_species.py", line 298, in main
    midas_run_species(args)
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/subcommands/midas_run_species.py", line 272, in midas_run_species
    db = UHGG()
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/models/uhgg.py", line 13, in __init__
    self.species, self.representatives, self.genomes = _UHGG_load(table_of_contents_tsv)
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/models/uhgg.py", line 20, in _UHGG_load
    with InputStream(toc_tsv) as table_of_contents:
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/common/utils.py", line 104, in __init__
    path = smart_glob(path, expected=1)[0]
  File "/pollard/home/bsmith/Projects/iggtools/iggtools/common/utils.py", line 271, in smart_glob
    assert actual in expected, f"Expected {expected_str} file(s) for {pattern}, found {actual}"
AssertionError: Expected 1 file(s) for s3://microbiome-igg/2.0/genomes.tsv.lz4, found 0

This seems odd to me because the file it fails on, s3://microbiome-igg/2.0/genomes.tsv.lz4, never shows up in the logging output as one of the files being downloaded. It's also unfortunate that I can't tell which aws command printed 'Unable to locate credentials. You can configure credentials by running "aws configure".' to stderr, so it's not clear whether there's a hidden call that's still missing the --no-sign-request flag.

When I try to download the missing file directly I get a weird error:

$ aws s3 cp --no-sign-request s3://microbiome-igg/2.0/genomes.tsv.lz4 .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

@bsmith89 (Collaborator, Author) commented Jan 17, 2020

With @zhaoc1 adding ListObjectsV2 permissions to s3://microbiome-igg/2.0, and one additional --no-sign-request added to the cryptic aws s3 ls invocation in the iggtools.common.utils.smart_glob function, it seems to be working now!

I can run iggtools midas_run_species -1 data/SS01117.m.proc.r1.fq.gz -2 data/SS01117.m.proc.r2.fq.gz data/SS01117.m.proc.iggtools/species without error, at least. :)
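
For reviewers, here is a rough sketch of the shape of that helper (hypothetical code, not the actual smart_glob; the point is only where the flag goes):

# Hypothetical sketch, not the actual iggtools smart_glob: a glob-like
# helper that shells out to `aws s3 ls` and now passes --no-sign-request.
import subprocess

def s3_ls_unsigned(prefix):
    """List object names under an S3 prefix without signing the request."""
    cmd = ["aws", "s3", "ls", "--no-sign-request", prefix]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Keep the last column of each line (the key's basename); this
    # simplification ignores keys containing spaces.
    return [line.split()[-1] for line in out.stdout.splitlines() if line.strip()]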

@bsmith89 bsmith89 changed the title [WIP] Use no-sign-request flag when downloading from S3. Use no-sign-request flag when downloading from S3. Jan 17, 2020
@boris-dimitrov (Contributor) commented Jan 19, 2020

Thanks, this is highly valuable research. I think it will need a bit more work before we integrate it. One danger with making this the default is related to the following question: if the user's environment completely lacks any AWS config, how can we be sure that s3://microbiome-igg/2.0 points to the data we created in the CZ Biohub account? If anyone could create an identically named bucket in their own AWS account, would that break access to our s3://microbiome-igg/2.0 because the name is no longer globally unique?

Again, this is a highly valuable investigation and we need something like it, but I don't think we are quite ready to integrate it yet. We will just have to think carefully about, and abstract appropriately, the AWS S3 reference accesses.

@boris-dimitrov (Contributor)

It turns out AWS S3 bucket names live in a global namespace, so my concern does not apply to downloading references. That's great. I will merge this soon.

Documentation: """
An Amazon S3 bucket name is globally unique, and the namespace is shared by all AWS accounts. This means that after a bucket is created, the name of that bucket cannot be used by another AWS account in any AWS Region until the bucket is deleted.
"""

@boris-dimitrov (Contributor)

The main thing I am trying to figure out now is how to make this work transparently with both public and private buckets. One approach would be to keep a whitelist of known-public buckets and add the no-signature flag when reading from those; we would still need credentials for writing, and for reading non-public buckets. We could also try to access a bucket without credentials a few times, accounting for transient failures, to decide whether it should be added to the whitelist. A sketch of the idea follows.
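
Here is a minimal sketch of that whitelist idea; all names are hypothetical, not proposed iggtools API:

# Hypothetical sketch: unsigned reads for known-public buckets,
# signed requests everywhere else.
import subprocess

KNOWN_PUBLIC_BUCKETS = {"microbiome-igg"}  # assumption: this bucket stays public

def s3_cp_command(s3_uri, dest="-"):
    """Build an `aws s3 cp` command, unsigned if the bucket is whitelisted."""
    bucket = s3_uri.split("/", 3)[2]  # "s3://bucket/key" -> "bucket"
    cmd = ["aws", "s3", "cp", "--only-show-errors"]
    if bucket in KNOWN_PUBLIC_BUCKETS:
        cmd.append("--no-sign-request")
    cmd.extend([s3_uri, dest])
    return cmd

def probe_is_public(bucket, attempts=3):
    """Try a few anonymous listings to decide whether to whitelist a bucket."""
    for _ in range(attempts):
        result = subprocess.run(
            ["aws", "s3", "ls", "--no-sign-request", f"s3://{bucket}/"],
            capture_output=True, text=True)
        if result.returncode == 0:
            return True
    return False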

@bsmith89 (Collaborator, Author)

I'm definitely not grokking all of the trade-offs here. As a presumptive user (assuming my one-iggtools-execution-per-sample workflow is reasonable), a local copy of the whole database seems like the right way to go. The --no-sign-request flag for S3 files would therefore not be particularly important, and the greater complexity of a system that automatically detects whether the flag should be used is probably not worth the reward.

Perhaps my time would be better spent on a PR for using the db locally? My first impression was that it might only require a simple patch to iggtools/params/inputs.py...?
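
Purely as a guess at what that patch could look like (I have not checked what inputs.py actually defines; the names here are hypothetical):

# Hypothetical illustration, not the real iggtools/params/inputs.py:
# let an environment variable redirect the database prefix to a local mirror.
import os

igg = os.environ.get("IGGTOOLS_DB_PREFIX", "s3://microbiome-igg/2.0")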
