WTF - bring back and extend information on metadata extractors etc #7309

Merged
merged 8 commits into from
Jun 3, 2023

Conversation

Member

@yarikoptic yarikoptic commented Feb 28, 2023

Ref datalad/datalad-metalad#340 .

In general, I feel that the implementation for that issue is a good fit for datalad wtf, so I looked inside and discovered a function (_describe_metadata_elements) that already did what we want but was not used anywhere! Its use (but not its definition/body) was removed in 616fb1b, so as a first step I brought its use back. Then I

  • extended it also with datalad.metadata.indexers
  • throughout (not only in metadata but also in extensions) stopped populating load_error with None when there was no error. IMHO it is a pointless waste of screen real estate. Since our formatter is generic, we probably do not want it to always skip every field whose value is None, so I decided to drop the field here instead.
  • added gathering information on the possible version of an extractor
  • added gathering information on the possible generation of an extractor, as proposed in "Add __generation__ to metadata extractors to be able to tell one from another" (datalad-metalad#351), so this PR is contingent on reaching a conclusion there
    • we discovered that the 'legacy-legacy' specification exists in 2 places, in metalad and in datalad_deprecated, and ideally should be centralized, most likely within metalad, with datalad_deprecated depending on it (attn @christian-monch )
  • moved from section_subsection to section.subsection so it matches better and added functionality to request an entire section, e.g. -S metadata returns all subsections metadata.*
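
To illustrate the reporting behavior described in the list above, here is a minimal sketch, not the actual body of _describe_metadata_elements. The helper name describe_element and the attribute name __version__ are assumptions made for illustration; __generation__ is taken from the title of datalad-metalad#351. The point is only that load_error is added when loading fails and is otherwise left out, and that version and generation are reported only when present.

```python
import importlib


def describe_element(module_name, class_name=None):
    """Collect a compact description of one pluggable element.

    Hypothetical helper: keys are added only when there is something to
    report, so a successful load produces no ``load_error`` entry at all.
    """
    info = {"module": module_name}
    try:
        obj = importlib.import_module(module_name)
        if class_name is not None:
            obj = getattr(obj, class_name)
    except Exception as exc:
        # Report any load problem instead of failing the whole report
        info["load_error"] = f"{type(exc).__name__}({exc})"
        return info
    # Optional attributes (assumed names), reported only if defined
    for key, attr in (("version", "__version__"),
                      ("generation", "__generation__")):
        value = getattr(obj, attr, None)
        if value is not None:
            info[key] = str(value)
    return info
```

With this shape, an extractor without version or generation information yields only its module entry, which keeps the rendered WTF output short.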

Example of the output you would get after running yq '.extensions| keys_unsorted | .[] | @sh' < extensions.yaml | sed -e "s,[\"'],,g" | xargs pip install, i.e. after installing all datalad extensions registered in datalad-extensions:

❯ datalad wtf -S metadata

WTF

metadata.extractors

  • annex:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.annex
  • audio:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.audio
  • bids:
    • distribution: datalad-neuroimaging 0.3.3
    • module: datalad_neuroimaging.extractors.bids
  • bids_dataset:
    • distribution: datalad-neuroimaging 0.3.3
    • doc: Top level BIDS extractor class interfacing with metalad Inherits from metalad's DatasetMetadataExtractor class
    • generation: gen4
    • module: datalad_neuroimaging.extractors.bids_dataset
    • version: 0.0.1
  • datacite:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.datacite
  • datacite_gin:
    • distribution: datalad-catalog 0.2.0
    • load_error: ModuleNotFoundError(No module named 'datalad.metadata.definitions')
    • module: datalad_catalog.extractors.datacite_gin
  • datalad_core:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.datalad_core
  • datalad_rfc822:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.datalad_rfc822
  • dicom:
    • distribution: datalad-neuroimaging 0.3.3
    • module: datalad_neuroimaging.extractors.dicom
  • exif:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.exif
  • external_dataset:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: gen4
    • module: datalad_metalad.extractors.external_dataset
  • external_file:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: gen4
    • module: datalad_metalad.extractors.external_file
  • frictionless_datapackage:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.frictionless_datapackage
  • image:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.image
  • metalad_annex:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy
    • module: datalad_metalad.extractors.annex
  • metalad_core:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy
    • module: datalad_metalad.extractors.core
  • metalad_custom:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy
    • module: datalad_metalad.extractors.custom
  • metalad_example_dataset:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: gen4
    • module: datalad_metalad.extractors.metalad_example_dataset
    • version: 0.0.1
  • metalad_example_file:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: gen4
    • module: datalad_metalad.extractors.metalad_example_file
    • version: 0.0.1
  • metalad_external_dataset:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: gen4
    • module: datalad_metalad.extractors.external_dataset
  • metalad_external_file:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: gen4
    • module: datalad_metalad.extractors.external_file
  • metalad_runprov:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy
    • module: datalad_metalad.extractors.runprov
  • metalad_studyminimeta:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy
    • module: datalad_metalad.extractors.studyminimeta.main
  • nidm:
    • distribution: datalad-neuroimaging 0.3.3
    • module: datalad_neuroimaging.extractors.nidm
  • nifti1:
    • distribution: datalad-neuroimaging 0.3.3
    • module: datalad_neuroimaging.extractors.nifti1
  • xmp:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • generation: legacy-legacy
    • module: datalad_metalad.extractors.legacy.xmp

metadata.filters

  • metalad_demofilter:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • doc: Create a "histogram"-like summary of the key values of all specified name_tuple across all metadata that is yielded by the metadata iterables. Histograms bins are determined by the metadata format and "name" within the format. The "name" is a flattened JSON key hierarchy. The set of metadata yielded by the iterables is determined by the metadata urls and the recursion flag that are passed to "datalad meta-filter".
    • module: datalad_metalad.filters.demofilter
    • version: 1.0

metadata.indexers

  • metalad_studyminimeta:
    • distribution: datalad-metalad 0.4.12+45.g5f623bb
    • doc: Indexer for metadata that was extracted from studyminimeta metadata (usually contained in ".studyminimeta.yaml"-files).
    • module: datalad_metalad.indexers.studyminimeta

or, using pyperclip.copy(pandas.DataFrame.from_dict(dl.wtf(sections=['metadata'], result_renderer='disabled')[0]['infos']['metadata.extractors']).T.replace(np.nan,'',regex=True).to_markdown()), we get

| | module | distribution | generation | doc | version | load_error |
|---|---|---|---|---|---|---|
| annex | datalad_metalad.extractors.legacy.annex | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| audio | datalad_metalad.extractors.legacy.audio | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| datacite | datalad_metalad.extractors.legacy.datacite | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| datalad_core | datalad_metalad.extractors.legacy.datalad_core | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| datalad_rfc822 | datalad_metalad.extractors.legacy.datalad_rfc822 | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| exif | datalad_metalad.extractors.legacy.exif | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| frictionless_datapackage | datalad_metalad.extractors.legacy.frictionless_datapackage | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| image | datalad_metalad.extractors.legacy.image | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| xmp | datalad_metalad.extractors.legacy.xmp | datalad-metalad 0.4.12+45.g5f623bb | legacy-legacy | | | |
| bids | datalad_neuroimaging.extractors.bids | datalad-neuroimaging 0.3.3 | | | | |
| bids_dataset | datalad_neuroimaging.extractors.bids_dataset | datalad-neuroimaging 0.3.3 | gen4 | Top level BIDS extractor class interfacing with metalad Inherits from metalad's DatasetMetadataExtractor class | 0.0.1 | |
| dicom | datalad_neuroimaging.extractors.dicom | datalad-neuroimaging 0.3.3 | | | | |
| nidm | datalad_neuroimaging.extractors.nidm | datalad-neuroimaging 0.3.3 | | | | |
| nifti1 | datalad_neuroimaging.extractors.nifti1 | datalad-neuroimaging 0.3.3 | | | | |
| datacite_gin | datalad_catalog.extractors.datacite_gin | datalad-catalog 0.2.0 | | | | ModuleNotFoundError(No module named 'datalad.metadata.definitions') |
| external_dataset | datalad_metalad.extractors.external_dataset | datalad-metalad 0.4.12+45.g5f623bb | gen4 | | | |
| external_file | datalad_metalad.extractors.external_file | datalad-metalad 0.4.12+45.g5f623bb | gen4 | | | |
| metalad_annex | datalad_metalad.extractors.annex | datalad-metalad 0.4.12+45.g5f623bb | legacy | | | |
| metalad_core | datalad_metalad.extractors.core | datalad-metalad 0.4.12+45.g5f623bb | legacy | | | |
| metalad_custom | datalad_metalad.extractors.custom | datalad-metalad 0.4.12+45.g5f623bb | legacy | | | |
| metalad_example_dataset | datalad_metalad.extractors.metalad_example_dataset | datalad-metalad 0.4.12+45.g5f623bb | gen4 | | 0.0.1 | |
| metalad_example_file | datalad_metalad.extractors.metalad_example_file | datalad-metalad 0.4.12+45.g5f623bb | gen4 | | 0.0.1 | |
| metalad_external_dataset | datalad_metalad.extractors.external_dataset | datalad-metalad 0.4.12+45.g5f623bb | gen4 | | | |
| metalad_external_file | datalad_metalad.extractors.external_file | datalad-metalad 0.4.12+45.g5f623bb | gen4 | | | |
| metalad_runprov | datalad_metalad.extractors.runprov | datalad-metalad 0.4.12+45.g5f623bb | legacy | | | |
| metalad_studyminimeta | datalad_metalad.extractors.studyminimeta.main | datalad-metalad 0.4.12+45.g5f623bb | legacy | | | |
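
The one-liner used for this table depends on pandas, numpy, tabulate, and pyperclip. As a rough alternative, the same kind of table could be rendered with the standard library alone. The sketch below assumes the input has the shape of infos['metadata.extractors'], i.e. a dict mapping each extractor name to a dict of fields; the function name infos_to_markdown is hypothetical, and the column order follows first appearance rather than whatever pandas would choose.

```python
def infos_to_markdown(infos):
    """Render {name: {field: value}} as a Markdown table.

    Fields missing from a record are rendered as empty cells, which
    mirrors replacing NaN with '' in the pandas version.
    """
    # Preserve the first-seen order of fields across all records
    columns = []
    for record in infos.values():
        for key in record:
            if key not in columns:
                columns.append(key)
    lines = [
        "| | " + " | ".join(columns) + " |",
        "|---" * (len(columns) + 1) + "|",
    ]
    for name, record in infos.items():
        cells = [str(record.get(col, "")) for col in columns]
        lines.append(f"| {name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "annex": {"module": "datalad_metalad.extractors.legacy.annex",
                  "generation": "legacy-legacy"},
        "bids": {"module": "datalad_neuroimaging.extractors.bids"},
    }
    print(infos_to_markdown(sample))
```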

TODOs:

  • Decide either to
    1. 👍 finish/keep here in datalad core or
    2. 👎 here only formalize ability to extend WTF SECTION_CALLABLES (not yet sure how to make it dynamic within EnsureChoice ) from extensions and support for section.subsection, and move actual code into datalad-metalad.
  • Create a changelog snippet (add the CHANGELOG-missing label to this pull request in order to have a snippet generated from its title;
    or use scriv create locally and include the generated file in the pull request, see scriv).

Thanks for contributing!

@adswa
Member

adswa commented Mar 13, 2023

This PR breaks the test case for #6712, which ensured that calling wtf on a non-existent section would yield an impossible result instead of a key error. While the PR ensures this doesn't happen from the command line by validating the section argument, it somewhat re-introduces the/a bug for the Python API:

The test case:

# check that wtf of an unavailable section yields impossible result (#6712)
res = wtf(sections=['murkie'], on_failure='ignore')
eq_(res[0]["status"], "impossible")

In the command line, a non-existent section would now result in a constraint error:

(handbook) adina@muninn in ~/repos/datalad on git:enh-metadata-wtf
❱ datalad wtf -S crap
usage: datalad wtf [-h] [-d DATASET] [-s {some|all}] [-S SECTION] [--flavor {full|short}] [-D {html_details}] [-c] [--version]
datalad wtf: error: argument -S/--section: invalid constraint:{None, 'configuration', 'credentials', 'datalad', 'dataset', 'dependencies', 'environment', 'extensions', 'git-annex', 'location', 'metadata', 'metadata.extractors', 'metadata.filters', 'metadata.indexers', 'python', 'system', '*'} value: 'crap'

But there is no parameter validation in Python, and calling wtf with a non-existent section returns an ok result:

In [3]: wtf(sections=['murkie'])
# WTF
Out[3]: 
[{'action': 'wtf',
  'path': '/home/adina/repos/datalad',
  'type': 'dataset',
  'status': 'ok',
  'decor': None,
  'infos': {},
  'flavor': 'full'}]
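
One way the Python API could treat section names consistently with the command line is sketched below. This is a hedged illustration, not the fix that was pushed: the function resolve_sections and the KNOWN_SECTIONS tuple (a subset of the constraint values shown above) are hypothetical. A bare section name such as metadata expands to its metadata.* subsections, and an unknown name is reported as impossible rather than silently producing an empty ok result.

```python
# Hypothetical subset of the section names listed in the constraint error
KNOWN_SECTIONS = (
    "datalad",
    "metadata.extractors",
    "metadata.filters",
    "metadata.indexers",
    "python",
    "system",
)


def resolve_sections(requested, known=KNOWN_SECTIONS):
    """Yield (section, status) pairs for each requested section name."""
    for name in requested:
        if name in known:
            yield name, "ok"
            continue
        # A parent name selects all of its "parent.child" subsections
        children = [k for k in known if k.startswith(name + ".")]
        if children:
            for child in children:
                yield child, "ok"
        else:
            # Unknown section: report it instead of raising or ignoring it
            yield name, "impossible"


if __name__ == "__main__":
    print(list(resolve_sections(["metadata", "murkie"])))
```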

@yarikoptic
Member Author

Thank you @adswa for analysis and pointers! Pushed d6c8edb to resolve it and filed #7322

@yarikoptic yarikoptic added semver-patch Increment the patch version when merged CHANGELOG-missing When a PR's description does not contain a changelog item, yet. labels Mar 13, 2023
@yarikoptic
Member Author

I will also rebase -- I believe that should address linters.

@yarikoptic yarikoptic added CHANGELOG-missing When a PR's description does not contain a changelog item, yet. and removed CHANGELOG-missing When a PR's description does not contain a changelog item, yet. labels Mar 13, 2023
@github-actions github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Mar 13, 2023
@yarikoptic
Member Author

The AppVeyor failure is known: #7320

@yarikoptic yarikoptic marked this pull request as ready for review March 14, 2023 04:01
yarikoptic and others added 8 commits March 22, 2023 10:22
It was originally removed in 616fb1b as a part of
"remove all metadata stuff" step but really this does not need any metadata
module (so no dependencies) and IMHO well worth being in the WTF. Demand for such
information is stated in datalad/datalad-metalad#340
Output of WTF is already lengthy.  Adding lines which give no information is not helping consuming
the WTF information
@codeclimate

codeclimate bot commented Mar 22, 2023

Code Climate has analyzed commit 3ac402e and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Security 1

View more on Code Climate.

Member

@bpoldrack bpoldrack left a comment


I'm a bit torn on this one. Overall, I think I'd prefer to aim for the complete removal of datalad.metadata, so reintroducing reporting on datalad.metadata.XXX seems a bit off and should probably be patched in by metalad instead.

OTOH, it kinda works and doesn't really hurt.

In any case, what's not entirely clear to me is what you consider TODO here, @yarikoptic. What's marked as TODO is checked (although the decision part only has one vote ;-) ), but the description does mention cross-dependencies. What's the state of those?

@yarikoptic
Member Author

I'm a bit torn on this one. Overall, I think I'd prefer to aim for the complete removal of datalad.metadata, so reintroducing reporting on datalad.metadata.XXX seems a bit off and should probably be patched in by metalad instead.

I would not mind metalad introducing integration with datalad wtf, but then someone would need to work it out first in datalad wtf (probably via entry points). Until there are takers on that, I guess we would benefit from having an implementation here.
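
For reference, entry-point based registration of WTF sections might look roughly like the sketch below. Nothing like this exists in datalad wtf as far as this thread shows: the entry point group name "datalad.wtf.sections" and the function load_extension_sections are purely hypothetical, and only the standard importlib.metadata API is used.

```python
from importlib.metadata import entry_points


def load_extension_sections(group="datalad.wtf.sections"):
    """Return {section_name: callable} contributed by installed packages.

    The group name is a made-up placeholder for illustration.
    """
    try:
        eps = entry_points(group=group)  # selection API, Python >= 3.10
    except TypeError:
        # Python 3.8/3.9: entry_points() takes no arguments, returns a dict
        eps = entry_points().get(group, [])
    sections = {}
    for ep in eps:
        try:
            sections[ep.name] = ep.load()
        except Exception:
            # A broken extension must not break the whole report
            continue
    return sections


if __name__ == "__main__":
    # With no package registering the hypothetical group this prints {}
    print(load_extension_sections())
```

An extension such as metalad would then declare its section callables under that group in its packaging metadata, and the core command would merge them into its own table of sections.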

Here is an updated table which can be produced for docs using python -c "import tabulate, numpy as np, pyperclip, pandas, datalad.api as dl; pyperclip.copy(pandas.DataFrame.from_dict(dl.wtf(sections=['metadata'], result_renderer='disabled')[0]['infos']['metadata.extractors']).T.replace(np.nan,'',regex=True).to_markdown())"

| | module | distribution | generation | doc | version |
|---|---|---|---|---|---|
| annex | datalad_metalad.extractors.legacy.annex | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| audio | datalad_metalad.extractors.legacy.audio | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| datacite | datalad_metalad.extractors.legacy.datacite | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| datalad_core | datalad_metalad.extractors.legacy.datalad_core | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| datalad_rfc822 | datalad_metalad.extractors.legacy.datalad_rfc822 | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| exif | datalad_metalad.extractors.legacy.exif | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| frictionless_datapackage | datalad_metalad.extractors.legacy.frictionless_datapackage | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| image | datalad_metalad.extractors.legacy.image | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| xmp | datalad_metalad.extractors.legacy.xmp | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | |
| bids | datalad_neuroimaging.extractors.bids | datalad-neuroimaging 0.3.3+7.gc1721c9 | | | |
| bids_dataset | datalad_neuroimaging.extractors.bids_dataset | datalad-neuroimaging 0.3.3+7.gc1721c9 | 4 | Top level BIDS extractor class interfacing with metalad Inherits from metalad's DatasetMetadataExtractor class | 0.0.1 |
| dicom | datalad_neuroimaging.extractors.dicom | datalad-neuroimaging 0.3.3+7.gc1721c9 | | | |
| nidm | datalad_neuroimaging.extractors.nidm | datalad-neuroimaging 0.3.3+7.gc1721c9 | | | |
| nifti1 | datalad_neuroimaging.extractors.nifti1 | datalad-neuroimaging 0.3.3+7.gc1721c9 | | | |
| datacite_gin | datalad_catalog.extractors.datacite_gin | datalad-catalog 0.2.0+64.g9f6e568 | 4 | Inherits from metalad's DatasetMetadataExtractor class | 0.0.1 |
| external_dataset | datalad_metalad.extractors.external_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | |
| external_file | datalad_metalad.extractors.external_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | |
| metalad_annex | datalad_metalad.extractors.annex | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | |
| metalad_core | datalad_metalad.extractors.core | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | |
| metalad_custom | datalad_metalad.extractors.custom | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | |
| metalad_example_dataset | datalad_metalad.extractors.metalad_example_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | 0.0.1 |
| metalad_example_file | datalad_metalad.extractors.metalad_example_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | 0.0.1 |
| metalad_external_dataset | datalad_metalad.extractors.external_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | |
| metalad_external_file | datalad_metalad.extractors.external_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | |
| metalad_genericjson_dataset | datalad_metalad.extractors.genericjson_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | Generic JSON dataset-level extractor class Inherits from metalad's DatasetMetadataExtractor class | 0.0.1 |
| metalad_genericjson_file | datalad_metalad.extractors.genericjson_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | Main 'custom' file-level extractor class Inherits from metalad's FileMetadataExtractor class | 0.0.1 |
| metalad_runprov | datalad_metalad.extractors.runprov | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | |
| metalad_studyminimeta | datalad_metalad.extractors.studyminimeta.main | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | |

although the decision part only has one vote ;-)

yeah, it seems we never set requirements on either the duration or the % needed to consider it a quorum ;-)

but the description does mention cross-dependencies.

what do you mean by that? whether we need to add dependencies on extensions? (there should be none, we pick up only the ones which are installed, if any)

@yarikoptic
Member Author

on dependencies -- after/if we merge this, I would be happy to add such a table to https://github.com/datalad/datalad-extensions#readme because we do know about all extensions there. It would also be a useful run/test to ensure that all extensions are co-installable and that no weird version limits (like the one recently discovered within pybids on sqlalchemy) prevent that, since ATM each extension there is installed only in an individual CI run.

@bpoldrack
Member

@yarikoptic

what do you mean by that?

I mean this part from the PR description:

added gathering information on possible generation of extractor which is proposed in Add __generation__ to metadata extractors to be able to tell one from another datalad-metalad#351 - so this PR is contingent on arriving to conclusion there
we discovered that 'legacy-legacy' specification exists in 2 places -- in metalad and datalad_deprecated and ideally should be centralized. Most likely within metalad and make deprecated depend on it (attn @christian-monch )

What's the state of that (arriving at a conclusion there) and what does it mean for this one here?

@yarikoptic
Member Author

ah, good point -- filed datalad/datalad-metalad#370 so we do not forget but I think it has no relation to this PR.

@yarikoptic
Member Author

eh, I thought we had this merged already when I was trying to use metalad for runprov but apparently not! @mih @adswa - WDYT about this PR? meanwhile I will rebase it

here is how the table looks now for metalad:

| | module | distribution | generation | version | doc | load_error |
|---|---|---|---|---|---|---|
| annex | datalad_metalad.extractors.legacy.annex | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| audio | datalad_metalad.extractors.legacy.audio | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| datacite | datalad_metalad.extractors.legacy.datacite | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| datalad_core | datalad_metalad.extractors.legacy.datalad_core | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| datalad_rfc822 | datalad_metalad.extractors.legacy.datalad_rfc822 | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| exif | datalad_metalad.extractors.legacy.exif | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| external_dataset | datalad_metalad.extractors.external_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | | |
| external_file | datalad_metalad.extractors.external_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | | |
| frictionless_datapackage | datalad_metalad.extractors.legacy.frictionless_datapackage | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| image | datalad_metalad.extractors.legacy.image | datalad-metalad 0.4.12+101.gd87bff8 | 2 | | | |
| metalad_annex | datalad_metalad.extractors.annex | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | | |
| metalad_core | datalad_metalad.extractors.core | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | | |
| metalad_custom | datalad_metalad.extractors.custom | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | | |
| metalad_example_dataset | datalad_metalad.extractors.metalad_example_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | 0.0.1 | | |
| metalad_example_file | datalad_metalad.extractors.metalad_example_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | 0.0.1 | | |
| metalad_external_dataset | datalad_metalad.extractors.external_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | | |
| metalad_external_file | datalad_metalad.extractors.external_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | | | |
| metalad_genericjson_dataset | datalad_metalad.extractors.genericjson_dataset | datalad-metalad 0.4.12+101.gd87bff8 | 4 | 0.0.1 | Generic JSON dataset-level extractor class Inherits from metalad's DatasetMetadataExtractor class | |
| metalad_genericjson_file | datalad_metalad.extractors.genericjson_file | datalad-metalad 0.4.12+101.gd87bff8 | 4 | 0.0.1 | Main 'custom' file-level extractor class Inherits from metalad's FileMetadataExtractor class | |
| metalad_runprov | datalad_metalad.extractors.runprov | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | | |
| metalad_studyminimeta | datalad_metalad.extractors.studyminimeta.main | datalad-metalad 0.4.12+101.gd87bff8 | 3 | | | |
| xmp | datalad_metalad.extractors.legacy.xmp | datalad-metalad 0.4.12+101.gd87bff8 | | | | ModuleNotFoundError(No module named 'libxmp') |

@yarikoptic
Member Author

Since this is not a critical code path, largely restores previously present functionality, and has not received objections, I will proceed with merging it in a couple of days unless objections are raised.

@yarikoptic yarikoptic merged commit 425a8f7 into datalad:master Jun 3, 2023
@yarikoptic-gitmate
Collaborator

PR released in 0.19.0

Labels
cmd-wtf semver-patch Increment the patch version when merged