Support next generation metadata in search #6518
do you have some dataset(s) available publicly to try it out on?
2cea937
to
093fcc7
Compare
There is new metadata on |
And my venv: |
Add a test. Could be a matrix run (in travis or appveyor, I don't care) which installs metalad, extracts/aggregates metadata, and performs search. Or do you think we should publish some sample dataset like that, or maybe even "move" search into metalad altogether?
Also I left a comment about possibly bringing some "identification" to various metadata approaches datalad went through since current "next_generation" might be "echo_from_the_past" in the future.
Adding an option on which metadata to search sounds like a useful thing given the present multiplicity of metadata formats
try:
    from datalad_metalad.dump import Dump
    next_generation_metadata_available = True
except ImportError:
    next_generation_metadata_available = False
I think it would not hurt if we start "versioning" the metadata approach ;) I would count this as _gen4
- 1st gen: .datalad/meta (old implementation within datalad)
- 2nd gen: .datalad/metadata/aggregate_v1.json (currently in datalad)
- 3rd gen: the "old" one of metalad
- 4th gen: this "next_generation" - current as in metalad, right?
Maybe even add that info into docs/source/metadata.rst to reach clarity? WDYT @mih -- I think having some codenames for metadata approaches would be helpful, at least to me.
Good point. _gen4 would be appropriate, if we discriminate between _gen2 and _gen3 based on the interface, since the storage is the same (at least in _gen2 and _gen3).
I will start using "generation four".
Since gen2 and gen3 are storage-wise identical, I will refer to them as legacy
@@ -172,11 +186,11 @@ def _get_containingds_from_agginfo(info, rpath):
     return dspath


-def query_aggregated_metadata(reporton, ds, aps, recursive=False,
-                              **kwargs):
+def legacy_query_aggregated_metadata(reporton, ds, aps, recursive=False,
correspondingly gen2_query_...
The query_aggregated_metadata method from master would actually be compatible with at least gen2 and gen3 (if I am not mistaken, @mih could clarify that).
I would therefore still rename it to legacy_query_aggregated_metadata because it covers two generations (due to the shared storage model),
I have not talked with @christian-monch about his plans with this code, but personally I'd like to see all of |
I think it will indeed make sense to remove all metadata handling and move NB: originally posted this comment in another PR by mistake ;) |
My plans for metadata beyond this PR are to remove metadata code from datalad core and rely solely on the datalad-metalad extension. This would support the execution of old and new extractors, and the handling of new git-stored metadata out of the box.
Metadata content
To support old metadata content, i.e
Search
Since search is orthogonal to metadata storage and handling (if it uses a high-level metadata-access API), it can be altered independently. I think we might want to take search out of datalad core and move it either into the datalad-metalad extension, or into its own extension. It makes sense to combine it with datalad-metalad because it has a strong dependency on the metadata-access API. On the other hand, the one-size-fits-all approach to searching might not be able to tackle specific search requirements, e.g. graph-based search. For such cases a specific search extension might be better suited.
I would like to put these ideas out for discussion here. I will move them to individual issues if we have some basic agreement on the general direction.
It introduces a new metadata storage format. It also introduces extractor classes for file- or dataset-level extraction. Besides that it keeps compatibility with the existing extractors, i.e. the extractors used in datalad-core and the extractors used in `datalad-metalad<=0.2.2`.
If NG-metalad is installed, search will use the metadata returned by NG-metalad's dump command to build a search index.
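A minimal sketch of the "use gen4 metadata only if the extension is installed" decision described above. It only checks whether a top-level module can be found; the function name `available_metadata_sources` and the labels `"legacy"` and `"gen4"` are illustrative assumptions, not the PR's actual code (the PR itself guards `from datalad_metalad.dump import Dump` with an `ImportError` handler).

```python
import importlib.util


def available_metadata_sources(module_name="datalad_metalad"):
    """Return the metadata generations that search could index.

    Hypothetical helper: "legacy" is always offered, "gen4" only if the
    given top-level module can be located on the import path.
    """
    sources = ["legacy"]
    # find_spec() returns None when a top-level module cannot be found
    if importlib.util.find_spec(module_name) is not None:
        sources.append("gen4")
    return sources
```

Note that this sketch does not check the installed version, so it cannot tell a 0.3.x metalad from an older one.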
40dcc2f
to
aa6718c
Compare
This commit uses the "gen4" prefix in "gen4_query_aggregated_metadata" to identify the query aggregated metadata method for metadata stored by the fourth generation of metalad, i.e. by metalad 0.3.x. Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Codecov Report
@@ Coverage Diff @@
## master #6518 +/- ##
==========================================
+ Coverage 88.84% 90.43% +1.58%
==========================================
Files 353 353
Lines 45825 45888 +63
==========================================
+ Hits 40713 41498 +785
+ Misses 5112 4390 -722
This commit uses "legacy" to refer to metadata storage prior to datalad-metalad version 0.3.0 and "gen4" to refer to metadata storage used in datalad-metalad version 0.3.x
This commit will catch NoMetadataStoreFound in 'query_aggregated_metadata'. So it should be safe to call this method with datalad_metalad installed and no gen4-metadata available. It will just report no gen4-metadata. Also unifies object availability in different code paths, i.e. 'Dump' and 'NoMetadataStoreFound'. Adds a missing argument description.
This commit fixes the indentation of method arguments
This commit adds a simple test for 'gen4_query_aggregated_metadata' that uses a mocked Dump() implementation
This commit ensures that kwargs are added to the result dictionary in 'gen4_query_aggregated_metadata', as stated in the description.
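A rough sketch of what a test with a mocked dump could look like, including the check that extra keyword arguments end up in each result dictionary. The function `gen4_query_sketch`, its signature, and the record layout (`path`, `metadata`) are assumptions made for illustration; they are not the PR's `gen4_query_aggregated_metadata` or the real `Dump` interface.

```python
from unittest.mock import MagicMock


def gen4_query_sketch(ds_path, dump, **kwargs):
    """Yield one result per dumped record, with kwargs merged into it.

    `dump` is any callable returning an iterable of records (hypothetical
    stand-in for a dump command instance).
    """
    for record in dump(dataset=ds_path, recursive=True):
        # kwargs come first, so record fields win on a key collision
        yield dict(kwargs, path=record["path"], metadata=record["metadata"])


# The mock replaces the dump command and returns one canned record
mocked_dump = MagicMock(
    return_value=[{"path": "/ds/a.txt", "metadata": {"name": "a"}}]
)
results = list(gen4_query_sketch("/ds", mocked_dump, action="search"))
```

Because the mock records its calls, the test can also verify how the dump was invoked, e.g. with `mocked_dump.assert_called_once_with(dataset="/ds", recursive=True)`.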
This commit adapts the handling of non-existing gen4 metadata to the handling of non-existing legacy metadata, i.e. yield an impossible result.
This commit adds a test for the case that no gen4 metadata was found in the given dataset.
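The behaviour described in the last commits (catch the "no metadata store" error and yield an impossible result instead) might be sketched as follows. The exception class, `dump_gen4`, the dictionary-based dataset, and the message text are all hypothetical stand-ins so that the example runs without datalad-metalad installed.

```python
class NoMetadataStoreFound(Exception):
    """Hypothetical stand-in for the error raised when no gen4 store exists."""


def dump_gen4(dataset):
    """Hypothetical dump: yields records, or raises if there is no store."""
    if "gen4_records" not in dataset:
        raise NoMetadataStoreFound(dataset["path"])
    yield from dataset["gen4_records"]


def gen4_query(dataset):
    """Yield gen4 records, or a single 'impossible' result if none exist."""
    try:
        for record in dump_gen4(dataset):
            yield dict(record, status="ok")
    except NoMetadataStoreFound:
        # Mirror the legacy handling: report the problem as a result
        # instead of letting the exception propagate to the caller
        yield dict(
            path=dataset["path"],
            status="impossible",
            message="no gen4 metadata found in dataset",
        )
```

A caller can then treat missing gen4 metadata like any other result record, e.g. `list(gen4_query({"path": "/ds"}))` gives one record with status "impossible".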
This commit specifies that a chardet version larger than 3.0.4 and smaller than 5.0.0 is required. The reason for the upper limit is that "requests" only supports chardet version 4.
eh, one test of annex stalled on travis :-( I really don't like such ones:
I have restarted that run
datalad/metadata/search.py
Outdated
@@ -1304,6 +1314,14 @@ class Search(Interface):
            doc="""if given, the formal query that was generated from the given
            query string is shown, but not actually executed. This is mostly useful
            for debugging purposes."""),
        use_metadata=Parameter(
Let me think about it more, but ATM I dislike use_metadata, which immediately suggested a bool to me, and then it makes little sense since search works only on metadata. I needed to get to the definition of use_metadata here (from the points of use above) to understand what it actually means.
Why such confusion: even though we do have @use_cassette(cassette_name), typically anything use_ is a bool. So maybe something like metadata_type or metadata_format would be a better name and align better with its docstring?
Changed the parameter name from use_metadata to metadata_source.
This commit changes the name of the parameter '--use-metadata' to '--metadata-source'. It also changes the name of the related variables from 'use_metadata' to 'metadata_source'.
This commit replaces the last internal uses of 'use_metadata' with 'metadata_source'.
This commit adds a test to verify that metadata_source values select the correct metadata sources
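To illustrate how a metadata_source value could select what gets searched, here is a small sketch. The accepted values (`"legacy"`, `"gen4"`, and `None` meaning "all available") and the helper name `select_metadata_sources` are assumptions for illustration; the thread does not spell out the parameter's actual choices.

```python
def select_metadata_sources(metadata_source=None, gen4_available=True):
    """Map a metadata_source value to the list of sources to query.

    Assumed semantics: None selects every available source, otherwise
    exactly the named one. gen4 is dropped if the extension is missing.
    """
    known = ("legacy", "gen4")
    if metadata_source is None:
        selected = list(known)
    elif metadata_source in known:
        selected = [metadata_source]
    else:
        raise ValueError(f"unknown metadata source: {metadata_source!r}")
    if not gen4_available and "gen4" in selected:
        selected.remove("gen4")
    return selected
```

With these assumptions, a test like the one mentioned above could assert that each value yields only the expected source, and that requesting "gen4" without the extension yields nothing to query.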
Thx. Haven't seen that before. Will keep an eye on it
This commit enforces git annex version 10.20220525 for travis tests of datalad/support. This is intended to fix the following hanging test: datalad.support.tests.test_annexrepo.py::test_is_available[False] This was reported in PR datalad#6518
Code Climate has analyzed commit 90c61d6 and detected 5 issues on this pull request.
90c61d6
to
b7628b0
Compare
Git annex version was updated to fix the stalling tests in appveyor. The newly selected version of git-annex comes together with ssh version 9, which triggers an error in the ssh copy test (see issue #6655).
ok, let's see where it would bring us! ;)
This PR extends search to use next generation metadata, if a next generation metalad extension is found.
This PR is marked as work in progress and exists to support experiments with the new metadata system. One shortcoming of this patch is that legacy and NG-metadata results are both used and not de-duplicated. Also, the mapping of NG-metadata on the expected result structure, i.e. the legacy metadata-results, is probably sub-optimal.
search that allows the selection of the metadata that should be searched.
Changelog
💫 Enhancements and new features