crawl of openfmri datasets fails since aggregate-metadata crashes with AttributeError #1930

Closed · Fixed by #1939
yarikoptic opened this issue on Oct 29, 2017 · 3 comments

@yarikoptic (Member) commented:

(git)smaug:…atasets-openfmri-crawl-20171028/datalad/crawl/openfmri/ds000001[master]git
$> datalad crawl         
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
[INFO   ] Creating a pipeline for the openfmri dataset ds000001 
Connecting to bucket: openneuro
[INFO   ] S3 session: Connecting to the bucket openneuro 
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: <Policy: openfmri (owner) = READ, openfmri (owner) = WRITE, openfmri (owner) = READ_ACP, openfmri (owner) = WRITE_ACP, http://acs.amazonaws.com/groups/global/AllUsers = READ, http://acs.amazonaws.com/groups/global/AllUsers = READ_ACP>
ds000001/
[INFO   ] Creating a pipeline for the openfmri bucket 
[WARNING] ATM we assume prefixes to correspond only to directories, adding / 
[INFO   ] Running pipeline [[<function switch_branch at 0x7f40e1039aa0>, [<datalad.crawler.nodes.s3.crawl_s3 object at 0x7f40e0678d90>, sub(ok_missing=True, subs=<<{'url': {'^s3://openfm...>>), switch(default=None, key='datalad_action', mapping=<<{'commit': <function _...>>, re=False)], <function switch_branch at 0x7f40e065ded8>], <function switch_branch at 0x7f40e065df50>, [<datalad.crawler.nodes.crawl_url.crawl_url object at 0x7f40e0678dd0>, [a_href_match(query='.*release_history.txt'), assign(assignments={'filename': 'changelog.txt'}, interpolate=False), <datalad.crawler.nodes.annex.Annexificator object at 0x7f40e101c690>], [a_href_match(query='.*/.*\\.(tgz|tar.*|zip)'), sub(ok_missing=False, subs=<<{'url': {'(http)s?(://...>>), <function func_node at 0x7f40dfff3140>, <datalad.crawler.nodes.annex.Annexificator object at 0x7f40e101c690>], <function _commit_versions at 0x7f40dfff31b8>], <datalad.crawler.nodes.annex._remove_obsolete object at 0x7f40e0678c50>, [{'loop': False}, <function switch_branch at 0x7f40dfff3230>, <function merge_branch at 0x7f40dfff3398>, <function _remove_other_versions at 0x7f40dfff3488>, [{'loop': True}, find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), assign(assignments=<<{'dataset_file': 'ds00...>>, interpolate=True), switch(default=<function>, key='dataset_file', mapping=<<{'.*///ds000030_R1\\.0...>>, re=True)], [find_files(dirs=False, fail_if_none=False, regex=<<'(\\.(tsv|csv|txt|json...>>, topdir='.'), fix_permissions(executable=False, file_re='.*', input='filename', path=None)], <function switch_branch at 0x7f40dfff3668>, <function merge_branch at 0x7f40dfff3758>, <function _finalize at 0x7f40dfff37d0>], <function switch_branch at 0x7f40dfff3848>, <function _finalize at 0x7f40dfff38c0>] 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch incoming-s3-openneuro 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch master 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch incoming 
[INFO   ] Fetching 'https://openfmri.org/dataset/ds000001/' 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch incoming-processed 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch master 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Aggregate metadata for dataset /mnt/btrfs/datasets-openfmri-crawl-20171028/datalad/crawl/openfmri/ds000001 
[ERROR  ] __exit__ [json_py.py:dump2xzstream:64] (AttributeError) 
datalad crawl  5.17s user 2.84s system 53% cpu 15.010 total
@mih (Member) commented on Oct 29, 2017:

Could you run this with --dbg? It looks as if this is related to AutomagicIO.

@yarikoptic (Member, Author) commented:

[DEBUG  ] Dump metadata of <Dataset path=/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001> (merge mode: init) into <Dataset path=/mnt/btrfs/datasets/datalad/crawl> 
[DEBUG  ] no usable BIDS metadata for CHANGES in <Dataset path=/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001>: File '/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001/CHANGES' could not be found in the current BIDS project. [bids_layout.py:get_nearest_helper:33] 
[DEBUG  ] no usable BIDS metadata for README in <Dataset path=/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001>: File '/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001/README' could not be found in the current BIDS project. [bids_layout.py:get_nearest_helper:33] 
[DEBUG  ] no usable BIDS metadata for participants.tsv in <Dataset path=/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001>: File '/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001/participants.tsv' could not be found in the current BIDS project. [bids_layout.py:get_nearest_helper:33] 
Traceback (most recent call last):
  File "/home/yoh/proj/datalad/datalad-master/venvs/dev/bin/datalad", line 8, in <module>
    main()
  File "/home/yoh/proj/datalad/datalad-master/datalad/cmdline/main.py", line 347, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/base.py", line 425, in call_from_parser
    ret = cls.__call__(**kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/crawl.py", line 130, in __call__
    output = run_pipeline(pipeline, stats=stats)
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 114, in run_pipeline
    output = list(xrun_pipeline(*args, **kwargs))
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/yoh/proj/datalad/datalad-master/datalad/crawler/nodes/annex.py", line 1329, in _finalize
    aggregate_metadata(dataset='^', path=self.repo.path)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 437, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 425, in return_func
    results = list(results)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 382, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 449, in _process_results
    for res in results:
  File "/home/yoh/proj/datalad/datalad-master/datalad/metadata/aggregate.py", line 667, in __call__
    to_save)
  File "/home/yoh/proj/datalad/datalad-master/datalad/metadata/aggregate.py", line 238, in _extract_metadata
    store(meta, objpath)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/json_py.py", line 64, in dump2xzstream
    with lzma.LZMAFile(fname, mode='w') as f:
AttributeError: __exit__
()
> /home/yoh/proj/datalad/datalad-master/datalad/support/json_py.py(64)dump2xzstream()
-> with lzma.LZMAFile(fname, mode='w') as f:
(Pdb) p fname
'/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001/.datalad/metadata/objects/11/cn-ceee6877e54bbbe61688bcfa2dadac'
(Pdb) 
[1]  + 13262 suspended  datalad --dbg -l debug crawl
(dev)3 10569 ->148 [1].....................................:Sun 29 Oct 2017 09:41:08 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001[master]git
$> ls -l /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001/.datalad/metadata/objects/11/cn-ceee6877e54bbbe61688bcfa2dadac
-rw------- 1 yoh datalad 32 Oct 29 09:40 /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001/.datalad/metadata/objects/11/cn-ceee6877e54bbbe61688bcfa2dadac
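
The metadata object file was created, so LZMAFile itself was constructed fine; the crash comes when the "with" statement then looks up __exit__ on it. That AttributeError is what Python 2 raises for an object lacking the context-manager protocol, and under Python 2 the lzma module typically comes from the pyliblzma backport, whose LZMAFile has close() but no __enter__/__exit__ (unlike the stdlib lzma on Python 3). Assuming that is the cause here, a minimal sketch of a workaround (the dump2xzstream signature below is a guess, not the actual one in json_py.py) is to wrap the file in contextlib.closing:

import contextlib
import lzma  # pyliblzma under Python 2, stdlib under Python 3

def dump2xzstream_sketch(records, fname, dumper):
    # contextlib.closing supplies __enter__/__exit__ around any object
    # with a close() method, which pyliblzma's LZMAFile does provide
    with contextlib.closing(lzma.LZMAFile(fname, mode='w')) as f:
        for r in records:
            f.write(dumper(r).encode('utf-8'))
            f.write(b'\n')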

FWIW, you could trigger it by running

datalad install  ///openfmri/ds000001; cd ds000001; datalad crawl
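
The lzma failure can also be probed in isolation, without any crawling; assuming a Python 2 environment where pyliblzma provides the lzma module, this should fail the same way:

import lzma  # pyliblzma under Python 2

# Under Python 2 with pyliblzma this raises AttributeError: __exit__,
# since its LZMAFile is not usable as a context manager; under
# Python 3 the stdlib lzma handles the with statement fine.
with lzma.LZMAFile('/tmp/probe.xz', mode='w') as f:
    f.write(b'test')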

@mih (Member) commented on Oct 30, 2017:

I ran this (twice) successfully (using current master) and could not replicate the problem.

mih@meiner /tmp/openfmri/ds000001 (git)-[master] % datalad crawl
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
[INFO   ] Creating a pipeline for the openfmri dataset ds000001 
Connecting to bucket: openneuro
You need to authenticate with 'datalad-test-s3' credentials. team@datalad.org provides information on how to gain access

<REDACTED>
[INFO   ] S3 session: Connecting to the bucket openneuro 
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: <Policy: openfmri (owner) = READ, openfmri (owner) = WRITE, openfmri (owner) = READ_ACP, openfmri (owner) = WRITE_ACP, http://acs.amazonaws.com/groups/global/AllUsers = READ, http://acs.amazonaws.com/groups/global/AllUsers = READ_ACP>
ds000001/
[INFO   ] Creating a pipeline for the openfmri bucket 
[WARNING] ATM we assume prefixes to correspond only to directories, adding / 
[INFO   ] Running pipeline [[<function switch_branch at 0x7fd8bd05ee60>, [<datalad.crawler.nodes.s3.crawl_s3 object at 0x7fd8b91799d0>, sub(ok_missing=True, subs=<<{'url': {'^s3://openfm...>>), switch(default=None, key='datalad_action', mapping=<<{'commit': <function _...>>, re=False)], <function switch_branch at 0x7fd8b9e90488>], <function switch_branch at 0x7fd8b9e671b8>, [<datalad.crawler.nodes.crawl_url.crawl_url object at 0x7fd8b9179910>, [a_href_match(query='.*release_history.txt'), assign(assignments={'filename': 'changelog.txt'}, interpolate=False), <datalad.crawler.nodes.annex.Annexificator object at 0x7fd8c26abcd0>], [a_href_match(query='.*/.*\\.(tgz|tar.*|zip)'), sub(ok_missing=False, subs=<<{'url': {'(http)s?(://...>>), <function func_node at 0x7fd8c43515f0>, <datalad.crawler.nodes.annex.Annexificator object at 0x7fd8c26abcd0>], <function _commit_versions at 0x7fd8b91737d0>], <datalad.crawler.nodes.annex._remove_obsolete object at 0x7fd8b9179b90>, [{'loop': False}, <function switch_branch at 0x7fd8b9173848>, <function merge_branch at 0x7fd8b9173938>, <function _remove_other_versions at 0x7fd8b9173a28>, [{'loop': True}, find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), assign(assignments=<<{'dataset_file': 'ds00...>>, interpolate=True), switch(default=<function>, key='dataset_file', mapping=<<{'.*///ds000030_R1\\.0...>>, re=True)], [find_files(dirs=False, fail_if_none=False, regex=<<'(\\.(tsv|csv|txt|json...>>, topdir='.'), fix_permissions(executable=False, file_re='.*', input='filename', path=None)], <function switch_branch at 0x7fd8b9173c08>, <function merge_branch at 0x7fd8b9173c80>, <function _finalize at 0x7fd8b9173cf8>], <function switch_branch at 0x7fd8b9173d70>, <function _finalize at 0x7fd8b9173de8>] 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Did not find branch 'incoming-s3-openneuro' locally. Checking out remote one 'origin/incoming-s3-openneuro' 
[INFO   ] Checking out an existing branch incoming-s3-openneuro 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch master 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Did not find branch 'incoming' locally. Checking out remote one 'origin/incoming' 
[INFO   ] Checking out an existing branch incoming 
[INFO   ] Fetching 'https://openfmri.org/dataset/ds000001/' 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Did not find branch 'incoming-processed' locally. Checking out remote one 'origin/incoming-processed' 
[INFO   ] Checking out an existing branch incoming-processed 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out an existing branch master 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Aggregate metadata for dataset /tmp/openfmri/ds000001 
[INFO   ] Aggregate metadata for dataset /tmp/openfmri 
[INFO   ] Update aggregate metadata in dataset at: /tmp/openfmri/ds000001 
[INFO   ] Update aggregate metadata in dataset at: /tmp/openfmri 
[INFO   ] Attempting to save 176 files/datasets 
[INFO   ] No git house-keeping performed as no notable changes to git 
[INFO   ] Finished running pipeline: URLs processed: 7,  Files processed: 7, skipped: 6225 
[INFO   ] Total stats: URLs processed: 7,  Files processed: 7, skipped: 6225,  Datasets crawled: 1 
datalad crawl  7,84s user 4,73s system 18% cpu 1:09,61 total
