
NF: add infrastructure for specialized indexers #4963

Merged
merged 8 commits into datalad:master on Oct 20, 2020

Conversation

christian-monch
Contributor

@christian-monch christian-monch commented Sep 25, 2020

This PR introduces infrastructure for specialized indexers, that is, indexers built specifically to generate an index from metadata of a given format (the metadata format name is the same as the name of the extractor that generated the metadata). If datalad search needs to build an index for metadata, it now checks whether a matching specialized indexer is installed. If a matching indexer is found, it is called to generate the index, i.e. the set of key-value pairs that represents the metadata in the search context (this is done for all four search modes: autofield, blobtext, egrep, and egrepcs).

If no specialized indexer is found for a metadata format, the generic indexer that was used before is applied to this format.

This PR closes issue #4944
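To make the mechanism concrete, here is a hedged sketch of how an extension could plug into it. The entry point group `datalad.metadata.indexers` appears in the PR's code; the package name `myext`, the format name `myformat`, the class `MyFormatIndexer`, and the method name `create_index` are illustrative assumptions, not datalad's actual API.

```python
# Hypothetical sketch of a specialized indexer shipped by an extension.
# Only the entry point group name is taken from the PR; the package,
# class, and method names below are illustrative assumptions.
#
# The extension's setup.py would register the indexer under the
# metadata format name it handles:
#
#   setup(
#       name='myext',
#       entry_points={
#           'datalad.metadata.indexers': [
#               'myformat = myext.indexers:MyFormatIndexer',
#           ],
#       },
#   )

from typing import Any, Dict, Generator, Tuple


class MyFormatIndexer:
    """Turn 'myformat' metadata into (key, value) pairs for the search index."""

    def __init__(self, metadata_format_name: str):
        self.metadata_format_name = metadata_format_name

    def create_index(self, metadata: Dict) -> Generator[Tuple[str, Any], None, None]:
        # Prefix every key with the format name so that entries from
        # different metadata formats cannot collide in the index.
        for key, value in metadata.items():
            yield '%s.%s' % (self.metadata_format_name, key), value
```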

@codecov

codecov bot commented Sep 25, 2020

Codecov Report

Merging #4963 into master will increase coverage by 0.14%.
The diff coverage is 96.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4963      +/-   ##
==========================================
+ Coverage   89.64%   89.79%   +0.14%     
==========================================
  Files         289      293       +4     
  Lines       40952    41168     +216     
==========================================
+ Hits        36713    36965     +252     
+ Misses       4239     4203      -36     
Impacted Files Coverage Δ
datalad/plugin/wtf.py 83.60% <80.00%> (+1.24%) ⬆️
datalad/metadata/indexers/base.py 88.88% <88.88%> (ø)
datalad/metadata/tests/test_search.py 98.27% <97.72%> (-0.20%) ⬇️
datalad/metadata/search.py 89.03% <100.00%> (+0.34%) ⬆️
datalad/distribution/tests/test_create_sibling.py 80.53% <0.00%> (-18.74%) ⬇️
datalad/support/tests/test_sshconnector.py 84.53% <0.00%> (-13.14%) ⬇️
datalad/support/github_.py 73.82% <0.00%> (-4.83%) ⬇️
datalad/support/sshconnector.py 82.18% <0.00%> (-3.67%) ⬇️
datalad/distribution/drop.py 96.42% <0.00%> (-2.37%) ⬇️
datalad/interface/annotate_paths.py 88.00% <0.00%> (-2.19%) ⬇️
... and 61 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 73e03e3...ed1692e.

Member

@mih mih left a comment


I like this. FTR: Two concrete implementations for specific metadata standards can be found here: datalad/datalad-metalad#63. We might want to think about stripping all ad-hoc index-generating logic from the present search implementation and formalizing it like this, too.

Thx @christian-monch !

@mih
Member

mih commented Oct 9, 2020

@yarikoptic Are you planning to review this, or can we go ahead without you?

@yarikoptic
Member

reviewing now

Member

@yarikoptic yarikoptic left a comment


Some minor suggestions

@@ -173,16 +175,48 @@ def _deep_kv(basekey, dct):
else:
yield key, v

def get_indexer_for(metadata_format_name: str) -> callable:
indexer = (tuple(iter_entry_points('datalad.metadata.indexers', metadata_format_name)) or (None,))[0]
Member


could you please add a comment on why [0] -- it is not obvious to me why we take the first one and disregard the others

Contributor Author


We only look at the first returned entry point, if any, because there should be at most one indexer for `metadata_format_name`.

Comment added.

Member


I would prefer such an assumption "coded in" with feedback to the "user", e.g. at least a log debug message (or better a warning) issued if more than one is somehow found present. It is often such assumptions which later cause big grief trying to figure out WTF my X is not working (because you had multiple, some other one was chosen, and nothing alerted you about that).

Contributor Author


Setuptools does not allow duplicated names in a group. So to get multiple entry points, you would have to do something very weird. Therefore I assumed that there would only be one entry point for the given name and group.

If the name is not present in the group, an appropriate error is shown to the user.

Member


https://setuptools.readthedocs.io/en/latest/pkg_resources.html
doc says

iter_entry_points(group, name=None)
Yield entry point objects from group matching name

If name is None, yields all entry points in group from all distributions in the working set, otherwise only ones matching both group and name are yielded. ...

I have not tried to overload etc., but the documentation hints that there could be multiple entry points for the same (group, name). Overall, I would not rely on such an assumption whenever a function is an iterator "by design" -- why not "hard-code" the assumption and make it explicit here?

Contributor Author


You are right. Glad we solved that! :-)

I was derailed by two things:

Each project distribution can advertise at most one entry point of a given name within the same entry point group. For example, a distutils extension could advertise two different distutils.commands entry points, as long as they had different names.
  • there seemed to be an understanding in wtf.py:_describe_metadata_extractors that names only appear once in a group since they are used as dict-keys.

BUT: as the doc continues to state, multiple entries can appear when different distributions use the same group-name combination. In our case, if, for example, datalad-metalad overwrites an extractor from datalad, or an indexer, those will all appear under a single group-name combination.

Given the existing code base, I guess we have assumed that extensions should not overwrite existing entry-points!?

I guess the question is then, do we want to allow extensions to overwrite existing entry points? If not, we can just check for multiple occurrences and report an error, if they appear.

If we want to include multiple entry point implementations, how do we discriminate between them? Should we add an optional "entry point source" specifier to configurations for, e.g., extractors or indexers? For example: datalad-config could contain datalad.metadata.nativetype entries that optionally specify the module of the extractor that should be used:

nativetype = datalad_core[datalad.metadata.extractors.datalad_core],

or we could specify the distribution that provided the module:

nativetype = datalad_core[datalad-metalad],

This can be used to disambiguate the case of multiple implementations. I will create an issue for that.

Contributor Author


Issue created: #5043

Member


Effectively the current code "allows" for multiple implementations, but silently disregards all but the first. #5043 is indeed good to solve, but I would insist on adding at least a lgr.debug or better a lgr.warning (since unsupported until a decision on #5043 is made) if multiple are detected.

Contributor Author


Added a warning and a fallback to the built-in indexer if multiple indexers are found.
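The agreed-upon behavior -- warn and fall back to the built-in indexer when more than one entry point matches -- could look roughly like the sketch below. This is an assumption-laden sketch, not the merged code: `lgr` and `_deep_kv` stand in for datalad's internals, and the `create_index` method name on the loaded indexer class is an assumption.

```python
# Sketch of get_indexer_for with the multiple-entry-point warning.
# lgr, _deep_kv, and the create_index method name are stand-ins /
# assumptions; only the entry point group name comes from the PR.
import logging
from pkg_resources import iter_entry_points

lgr = logging.getLogger('datalad.metadata.search')


def _deep_kv(basekey, dct):
    # Placeholder for datalad's built-in generic indexer: flatten
    # one level of the metadata dict into (key, value) pairs.
    for k, v in dct.items():
        yield basekey + k, v


def get_indexer_for(metadata_format_name: str):
    entry_points = tuple(
        iter_entry_points('datalad.metadata.indexers', metadata_format_name))
    if len(entry_points) > 1:
        # Multiple distributions advertise an indexer under this name;
        # we cannot reliably pick one, so warn and use the fallback.
        lgr.warning(
            'Multiple indexers found for %s, falling back to built-in indexer',
            metadata_format_name)
        return lambda metadata: _deep_kv('', metadata)
    if not entry_points:
        # No specialized indexer installed: use the generic one.
        return lambda metadata: _deep_kv('', metadata)
    indexer_class = entry_points[0].load()
    indexer = indexer_class(metadata_format_name)
    return indexer.create_index
```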

lgr.warning('Failed to load indexer %s: %s', indexer.name, exc_str(e))
return lambda metadata: _deep_kv('', metadata)

def _iterable_to_unicode_string(value, perform_operation):
Member


this helper is introduced solely to behave differently depending on val2str, which is already known at this point... I would have made it more obvious by e.g.

if val2str:
   def _val2str_helper(value):
       if isinstance(value, ...):
          ...
else:
   _val2str_helper = lambda value: value

and thus

  • avoided the internal if on every value considered
  • made the name more specific (not every iterable is considered, BTW, e.g. no dict treatment etc.)

Contributor Author


Good suggestion, will create different definitions. I will use def instead of a lambda assignment, as suggested by PEP 8 (E731).


def test_external_indexer():
""" check that external indexer are called """
from pkg_resources import EntryPoint
Member


Unlike for search.py itself, I think it is ok to import pkg_resources (once) at the top of a tests file. Or was it causing troubles?

Contributor Author


Good suggestion, applied

christian-monch and others added 3 commits October 13, 2020 11:46
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Member

@yarikoptic yarikoptic left a comment


almost there, left a few minor comments


def test_external_indexer():
""" check that external indexer are called """
class MockedIndexer:
Member


shouldn't it subclass MetadataIndexer?

Contributor Author


It could be. Then we would not only test that externally available indexers are loaded, but also that the class hierarchy is properly designed/implemented. Will add it.

Member


yeap, that is the idea... in particular if abc and @abstractmethod are used

Contributor Author


Done

""" check that external indexer are called """
class MockedIndexer:
def __init__(self, metadata_format_name: str):
pass
Member


super().__init__ if subclassed

Contributor Author


Yes, done

from typing import Any, Dict, List, Union


class MetadataIndexer(object):
Member


added a comment in the test that likely MockedIndexer should subclass this one.

Since it defines an interface, should it maybe use the abc module and define @abstractmethods so subclasses are then guaranteed to implement this interface "fully"?

Contributor Author


Good idea. Done.
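Putting the thread's pieces together (abc base class, MockedIndexer subclassing it, super().__init__), the resulting interface could look roughly like this. A sketch only: the constructor signature and the names MetadataIndexer and MockedIndexer come from the conversation, but the `create_index` method name is an assumption about datalad's actual base.py.

```python
# Hedged sketch of the abc-based interface discussed above. Only the
# constructor signature and class names are taken from the thread;
# the create_index method name is an assumption.
import abc
from typing import Any, Dict, Generator, List, Tuple, Union


class MetadataIndexer(metaclass=abc.ABCMeta):
    """Interface that all specialized indexers must implement."""

    def __init__(self, metadata_format_name: str):
        self.metadata_format_name = metadata_format_name

    @abc.abstractmethod
    def create_index(self,
                     metadata: Union[Dict, List]
                     ) -> Generator[Tuple[str, Any], None, None]:
        """Yield (key, value) pairs representing the metadata."""
        raise NotImplementedError


class MockedIndexer(MetadataIndexer):
    """Test double, as discussed: it subclasses MetadataIndexer and
    calls super().__init__, so instantiating it verifies the interface."""

    def __init__(self, metadata_format_name: str):
        super().__init__(metadata_format_name)

    def create_index(self, metadata):
        # Trivial pass-through of top-level key-value pairs.
        for key, value in metadata.items():
            yield key, value
```

With @abstractmethod in place, instantiating MetadataIndexer directly (or a subclass that forgets create_index) raises TypeError, which is exactly the guarantee the review asked for.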

Define indexer.base.MetadataIndexer as abstract base class
Fix comment typos
Add info-output for fallback to standard indexer
Since distributions may overwrite extractors or indexers
with new implementations, we add the distribution name
to the reported names. That ensures (with the current
implementation) that all extractors and indexers are
reported. In addition we report the distribution for
the different extractors and indexers.

It should be noted that extractors and indexers from
different distributions are not necessarily compatible,
e.g. extractors from datalad-metalad are not compatible
with datalad-core extractor invocations. Therefore we
use a built-in fallback indexer if multiple indexers
are detected. The new wtf format should simplify
debugging of this situation.
Check that only one indexer, i.e. one entry point, exists
for a given metadata name.

If multiple entry points are found, issue a warning and
use a built-in default indexer, since we cannot determine
reliably which indexer should be used.
@yarikoptic
Member

It seems all my dire wishes came to life, so how could I resist temptation???

@yarikoptic yarikoptic merged commit f66dce7 into datalad:master Oct 20, 2020
@christian-monch
Contributor Author

It seems all my dire wishes came to life, so how could I resist temptation???

Glad I could fulfill some of your wishes ;-) On to new ones...
