Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support next generation metadata in search #6518

Merged
merged 19 commits into from
Jul 6, 2022

Conversation

christian-monch
Copy link
Contributor

@christian-monch christian-monch commented Mar 2, 2022

This PR extends search to use next generation metadata, if a next generation metalad extension is found.

This PR is marked as work in progress and exists to support experiments with the new metadata system. One shortcoming of this patch is that legacy and NG-metadata results are both used and not de-duplicated. Also, the mapping of NG-metadata on the expected result structure, i.e. the legacy metadata-results, is probably sub-optimal.

  • possibly add a flag to search that allows the selection of the metadata that should be searched.
  • add test

Changelog

💫 Enhancements and new features

  • use next generation metadata code in search, if it is availble

@christian-monch christian-monch added the semver-internal Changes only affect the internal API label Mar 2, 2022
@yarikoptic
Copy link
Member

do you have some dataset(s) available publicly to try it out on?

@christian-monch
Copy link
Contributor Author

do you have some dataset(s) available publicly to try it out on?

There is new metadata on smaug at /mnt/btrfs/datasets-metalad-cm/datalad/crawl/hcp-openaccess/HCP1200/100206
There is old and new metadata on smaug at /mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs.

@christian-monch
Copy link
Contributor Author

And my venv: ~cristian/venv/datalad-metalad-python3.9/

@yarikoptic yarikoptic changed the title WIP: use next generation metadata in search Use next generation metadata in search Apr 5, 2022
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test. Could be a matrix run (in travis or appveyor, I don't care) which installs metalad, extracts/aggregates metadata, and performs search. Or do you think we should publish some sample dataset like that or may be even "move" search into metalad altogether?

Also I left a comment about possibly bringing some "identification" to various metadata approaches datalad went through since current "next_generation" might be "echo_from_the_past" in the future.

Adding an option on which metadata to search sounds like a useful thing given present multiplicity of metadatas

from datalad_metalad.dump import Dump
next_generation_metadata_available = True
except ImportError:
next_generation_metadata_available = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would not hurt if we start "versioning" the metadata approach ;) I would count this as _gen4

  • 1st gen: .datalad/meta (old implementation within datalad)
  • 2nd gen: .datalad/metadata/aggregate_v1.json (currently in datalad)
  • 3rd gen: the "old" one of metalad
  • 4th gen: this "next_generation" - current as in metalad, right?

May be even add that info into docs/source/metadata.rst to reach clarity? WDYT @mih -- I think having some codenames for metadata approaches would be helpful, at least to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. _gen4 would be appropriate, if we discriminate between _gen2 and _gen3 based on the interface, since the storage is the same (at least in _gen2 and _gen3).

I will start using "generation four".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since gen2 and gen3 are storage-wise identical, I will refer to them as legacy

@@ -172,11 +186,11 @@ def _get_containingds_from_agginfo(info, rpath):
return dspath


def query_aggregated_metadata(reporton, ds, aps, recursive=False,
**kwargs):
def legacy_query_aggregated_metadata(reporton, ds, aps, recursive=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correspondingly gen2_query_...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query_aggregated_metadata method from master would actually be compatible with at least gen2 and gen3 (if I am not mistaken, @mih could clarify that).

I would therefore still rename it to legacy_query_aggregated_metadata because it covers two generations (due to the the shared storage model),

datalad/metadata/metadata.py Outdated Show resolved Hide resolved
@mih
Copy link
Member

mih commented Jun 8, 2022

I have not talked with @christian-monch about his plans with this code, but personally I'd like to see all of datalad/metadata be removed from the core package as soon as possible. It is one of these pieces that connects large chunks of otherwise obsolete functionality. Therefore adding more layers of backward compatibility is not attractive too me, except when it come with the consideration for maintenance load.

@yarikoptic
Copy link
Member

I think it will indeed make sense to remove all metadata handling and move search into -metalad (or should it be datalad-metalad-based-search extension?) whenever metalad provides feature parity to replace metadata handling in the core. Does it ATM? (IIRC it also has already at least 2 levels of different metadata formats and not yet sure if does provide handling in the most recent reincarnation for extractors in extensions such as neuroimaging).

NB: originally posted this comment in another PR by mistake ;)

@christian-monch christian-monch self-assigned this Jun 14, 2022
@christian-monch christian-monch changed the title Use next generation metadata in search WIP: use next generation metadata in search Jun 20, 2022
@christian-monch
Copy link
Contributor Author

My plans for metadata beyond this PR are to remove metadata code from datalad core and rely solely on the datalad-metalad extension. This would support the execution of old and new extractors, and the handling of new git-stored metadata out of the box.

Metadata content

To support old metadata content, i.e aggregate_v1.json plus its object files, there are a number of possibilities that can co-exist:

  1. Have a datalad-metalad-old extension, that would basically be the datalad-metalad-0.2.x. This would allow to switch between processing old and new metadata.

  2. Convert old metadata to new metadata (issue datalad-metalad/Rf some pep8 #118 Migration tool from metalad <= 0.2.1 metadata to metalad >= 0.3.0 metadata datalad-metalad#118). This would allow unified processing with the new metadata extension.

Search

Since search is orthogonal to metadata storage and handling (if it uses a high level metadata-access API), it can be altered independently. I think we might want to take search out of datalad core and move it either into the datalad-metalad extension, or into its own extension. It makes sense to combine it with datalad-metalad because it has a strong dependency on the metadata-access API. On the other hand, the one size fits all approach to searching might not be able to tackle specific search requirements, e.g. graph-based search. For such cases a specific search-extension might be better suited.

I would like to put these ideas out for discussion here. I will move them to individual issues if we have some basic agreement on the general direction.

@christian-monch
Copy link
Contributor Author

IIRC it also has already at least 2 levels of different metadata formats [...]

It introduces a new metadata storage format. It also introduces extractors-classes for file- or dataset-level extraction. Besides that it keeps compatibility with the existing extractors, i.e. the extractors used in datalad-core and the extractors used in `datalad-metalad<=0.2.2).

[...] not yet sure if does provide handling in the most recent reincarnation for extractors in extensions such as neuroimaging
If the extractors are instances of the datalad-core or datalad-metalad extractor classes, then they should be supported.

If NG-metalad is installed, search will use the metadata
returned by NG-metalad's dump command to build a search
index.
This commit uses "gen4" prefix in
"gen4_query_aggregated_metadata" to identify the
query aggregated metadata method for metdata stored
by the fourth generation of metalad, i.e. by metalad 0.3.x.

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@codecov
Copy link

codecov bot commented Jun 20, 2022

Codecov Report

Merging #6518 (cf3ca3b) into master (797dde3) will increase coverage by 1.58%.
The diff coverage is 96.34%.

❗ Current head cf3ca3b differs from pull request most recent head e9aab74. Consider uploading reports for the commit e9aab74 to get more accurate results

@@            Coverage Diff             @@
##           master    #6518      +/-   ##
==========================================
+ Coverage   88.84%   90.43%   +1.58%     
==========================================
  Files         353      353              
  Lines       45825    45888      +63     
==========================================
+ Hits        40713    41498     +785     
+ Misses       5112     4390     -722     
Impacted Files Coverage Δ
datalad/metadata/metadata.py 89.72% <92.10%> (+0.24%) ⬆️
datalad/metadata/search.py 88.93% <100.00%> (-0.08%) ⬇️
datalad/metadata/tests/test_search.py 98.52% <100.00%> (+0.24%) ⬆️
datalad/support/sshconnector.py 87.19% <0.00%> (+0.34%) ⬆️
datalad/__init__.py 97.95% <0.00%> (+16.32%) ⬆️
datalad/tests/utils.py 57.02% <0.00%> (+35.39%) ⬆️
datalad/tests/test_tests_utils.py 92.34% <0.00%> (+92.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 797dde3...e9aab74. Read the comment docs.

This commit uses "legacy" to refer to metadata storage
prior datalad-metalad version 0.3.0 and "gen4" to refer
to metadata storage used in datalad-metalad version
0.3.x
This commit will catch NoMetadataStoreFound in
'query_aggregated_metadata'. So it should be save
to call this method with datalad_metalad installed
and no gen4-metadata available. It will just report
no gen4-metadata.

Also unifies object availability in different code
paths, i.e. 'Dump' and 'NoMetadataFound'

Adds a missing argument description
This commit fixes the indentation of method arguments
This commit adds a simple test for
'gen4_query_aggregated_metadata' that uses a
mocked Dump() implementation
This commit ensures that kwargs are added to the
result dictionary in 'gen4_query_aggregated_metadata',
as stated in the description.
This commit adapts the handling of non-existing
gen4 metadata to the handling of non-existing
legacy metadata, i.e. yield an impossible result.
This commit adds test for the case that no gen4 metadata was found
in the given dataset.
@christian-monch christian-monch changed the title WIP: use next generation metadata in search Use next generation metadata in search Jun 23, 2022
This commit specifies that a chardet version larger
than 3.0.4 and smaller than 5.0.0 is required. The
reason for the upper limit is that "requests" does
only support chardet version 4.
@yarikoptic
Copy link
Member

eh, one test of annex stalled on travis :-( I really don't like such ones:

../datalad/support/tests/test_annexrepo.py::test_is_available[True] PASSED [ 59%]
../datalad/support/tests/test_annexrepo.py::test_is_available[False] 
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated

I have restarted that run

@@ -1304,6 +1314,14 @@ class Search(Interface):
doc="""if given, the formal query that was generated from the given
query string is shown, but not actually executed. This is mostly useful
for debugging purposes."""),
use_metadata=Parameter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me think about it more but ATM I dislike use_metadata which immediately suggested me being bool and then it makes little sense since search works only on metadata. Needed me to get to definition of use_metadata here (from points of use above) to understand what it actually means.
Why such confusion: Even though we do have @use_casssette(cassette_name), typically anything use_ is bool. So may be something like metadata_type or metadata_format would be a better name and align better with its docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the parameter name from use_metadata to metadata_source.

This commit changes the name of the parameter
'--use-metadata' to '--metadata-source'. It also
changes the name of the related variables from
'use_metadata' to 'metadata_source'.
This commit replaces the last internal uses of
'use_metadata' with 'metadata_source'.
This commit adds a test to verify that metadata_source
values select the correct metadata sources
@christian-monch
Copy link
Contributor Author

I have restarted that run

Thx. Haven't seen that before. Will keep an eye on it

This commit enforces git annex version 10.20220525 for
travis tests of datalad/support. This is intended to
fix the following hanging test:

datalad.support.tests.test_annexrepo.py::test_is_available[False]

This was reported in PR datalad#6518
@codeclimate
Copy link

codeclimate bot commented Jul 3, 2022

Code Climate has analyzed commit 90c61d6 and detected 5 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 4
Security 1

View more on Code Climate.

@christian-monch
Copy link
Contributor Author

Git annex version was updated to fix the stalling tests in appveyor.

The newly selected version of git-annex comes together with ssh version 9, which triggers an error in the ssh copy test (see issue #6655).

@yarikoptic
Copy link
Member

ok, let's see where it would bring us! ;)

@yarikoptic yarikoptic changed the title Use next generation metadata in search Support next generation metadata in search Jul 6, 2022
@yarikoptic yarikoptic merged commit 65ef546 into datalad:master Jul 6, 2022
@bpoldrack bpoldrack mentioned this pull request Jul 7, 2022
40 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-internal Changes only affect the internal API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants