New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NF: Field-selective search in egrep mode (fixes gh-2337) #2735

Merged
merged 7 commits into from Aug 8, 2018

Conversation

Projects
None yet
4 participants
@mih
Member

mih commented Aug 3, 2018

This PR implements a number of changes for the egrep default search mode that should make it substantially more useful, without putting too much strain on novice users.

  • by default queries are case-sensitive now. An example documents how to change that in a query on a case-by-case basis.
  • search can be limited to specific keys, syntax described and documented in examples
  • multiple queries (list of expressions) are evaluated using AND to determine whether something is a hit
  • a single multi-field query (e.g. pa*:findme) is a hit, when any matching field matches the query
  • all matching key/value combinations across all (multi-field) queries are reported in the query_matched result field.
  • mode-specific search hit number limit: egrep has none now #2639
@yarikoptic

just a quick look. main question is how to instruct to not even try the "fieldspec" for a given query part

):
print('KKKKKKKKKKKKKKKKKK', query, mode)

This comment has been minimized.

@yarikoptic

yarikoptic Aug 3, 2018

Member

leftover I guess

return query.lower()
query = assure_list(query)
simple_fieldspec = re.compile(r"(?P<field>\S+?):(?P<query>.*)")
quoted_fieldspec = re.compile(r"'(?P<field>[^']+?)':(?P<query>.*)")

This comment has been minimized.

@yarikoptic

yarikoptic Aug 3, 2018

Member

I guess this should be adjusted to allow for saying explicitly that : is a part of "direct" (just q), non-fieldspec query. I guess we could allow for it to be escaped \, so above ones could then become instead of \S+? a \S*?[^\] , and we would then need to "unscape" \: ? but then what if we are looking for \:? (I guess would need a lookbehind that previous one is not \ and then we convert \\ to \
?

This comment has been minimized.

@mih

mih Aug 3, 2018

Member

I though about it, and I find a leading colon to be more natural to indicate ALL FIELDS. I.e. :term: with colon will search for term: with colon. and :this is identical to this. If one needs to search for something starting with a colon, this will do: ::term

This comment has been minimized.

@yarikoptic

yarikoptic Aug 3, 2018

Member

Awesome

mih added some commits Aug 3, 2018

ENH+RF: Use leading colon to indicate SEARCH ALL FIELDS
Needed to disable argparse fromfile_prefix_arg setting to make this
work. We don't need it, and it might get in the way for other args
as well.
@codecov

This comment has been minimized.

codecov bot commented Aug 3, 2018

Codecov Report

Merging #2735 into master will decrease coverage by 0.01%.
The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2735      +/-   ##
==========================================
- Coverage   90.42%   90.41%   -0.02%     
==========================================
  Files         245      245              
  Lines       30918    30935      +17     
==========================================
+ Hits        27959    27970      +11     
- Misses       2959     2965       +6
Impacted Files Coverage Δ
datalad/cmdline/main.py 78.65% <ø> (ø) ⬆️
datalad/metadata/tests/test_search.py 92.96% <100%> (+0.11%) ⬆️
datalad/metadata/tests/test_base.py 98.51% <100%> (ø) ⬆️
datalad/metadata/search.py 80.22% <91.66%> (+0.33%) ⬆️
datalad/api.py 86.84% <0%> (-7.9%) ⬇️
datalad/interface/utils.py 94.26% <0%> (-0.41%) ⬇️
datalad/plugin/export_to_figshare.py 20.14% <0%> (-0.3%) ⬇️
datalad/plugin/export_archive.py 100% <0%> (ø) ⬆️
datalad/plugin/tests/test_export_archive.py 100% <0%> (ø) ⬆️
datalad/cmdline/tests/test_main.py 97.05% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cce8cf5...792932d. Read the comment docs.

mih added some commits Aug 4, 2018

RF: Mode-specific search hit limits (fixes gh-2639)
Rational: In egrep no weighting of search hits is performed, hence
limiting makes very little sense.
@mih

This comment has been minimized.

Member

mih commented Aug 5, 2018

Unless any objections are voiced, I am inclined to merge this soonish.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented on datalad/metadata/search.py in 2839a6f Aug 5, 2018

This is a change in behavior right? I thought it was incencitive for the case in the basic search for years, it an I wrong?

@mih

This comment has been minimized.

Member

mih commented Aug 5, 2018

This is a change in behavior right? I thought it was incencitive for the case in the basic search for years, it an I wrong?

Yes. The change is case-sensitive by default. Rational: It is easy to turn into case-insensitive, but not the other way round. In the former implementation there wasn't even a way to conduct a case sensitive search -- which is quite handy to be able to do.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Aug 6, 2018

Could you give me a realistic use case where case really matters?

@mih

This comment has been minimized.

Member

mih commented Aug 6, 2018

Sure. Any kind of exact label where case encodes meaning, or broadens the namespace (e.g. a hash based ID that is generated like those in a shortened URL). Another one are names: a person with name Mike might be easier to distinguish from a mike. Same logic why the cmdline tool grep has case-insensitive match as an option.

But the point is really what is the value of being unable to perform case sensitive only vs. both. Any realistic query has way more fundamental issues than casing. In general, with unknown metadata structure and type, one has little idea what to search for at all.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Aug 6, 2018

Any kind of exact label where case encodes meaning, or broadens the namespace (e.g. a hash based ID that is generated like those in a shortened URL).

most likely case doesn't matter there as far as I see it

Another one are names: a person with name Mike might be easier to distinguish from a mike.

And with many other names, if they aren't as "unique" as Mike that they could be a part of the bigger word you would need to anchor them around anyways, thus making query more complex and thus possibly adding an option to make it also case-sensitive.
Some times even acronyms might be cased interestingly different: Unicef vs proper UNICEF or AOL and Aol.

But the point is really what is the value of being unable to perform case sensitive only vs. both.

Yes -- it is useful to be able to restrict casing, but IMHO it shouldn't come at a cost of making case-sensitive to be the default behavior for our default search mode!

Same logic why the cmdline tool grep has case-insensitive match as an option.

I could argue that here we should aim not for grep [**] level of user-friendliness for the default/simplest search utility. Rather more I see a default search datalad to be of usability level of "google", "apt", "emacs" [*] (;-)), where by default casing doesn't matter. And in the majority of the usecases I used, search case didn't matter. When I am searching for "alzheimer" or "spin-echo" or "neuroimaging" I do not care what was the casing, either it was a first word in the title (and thus for sure upper-cased) or just in a list of keywords (who knows - may be they are all upper-cased there). If I was to look for NeuroImage as the journal, I would indeed want case sensitive search (isn't it a nice example for emacs [*]? ;-))

So overall, I see case-sensitivity as a useful feature to have (either as an additional option, or parameter in query, or automagic like emacs does when upper-case letter is present in the query), but I would strongly vote for maintaining current behavior of default search being case insensitive since it covers (much) more of use-cases over the ones where case-sensitivity is actually needed.

If we were to breed cmdline options I would have voted for having --case-sensitive=auto,off,on with auto providing emacs-style case sensitivity. Would be handy for the majority of use-cases.

[*] in emacs search is case insensitive if there is no upper-case letter in the search query. Kinda interesting usability twist.
[**] we aren't really using or implementing grep here AFAIK so mode name is already just that -- a name which shouldn't really impose grep's default behavior upon us

@mih

This comment has been minimized.

Member

mih commented Aug 6, 2018

OK, so I'll make a new mode that implements the emacs behavior. I am fine with making that default. But I will keep the present one around, as I think being able to perform a case-sensitive match is a must.

@kyleam

This comment has been minimized.

Member

kyleam commented Aug 6, 2018

But I will keep the present one around, as I think being able to perform a case-sensitive match is a must.

Yes, I like the Emacs behavior, but I agree the user should be able to be able to perform a case-sensitive lowercase match (and Emacs of course lets you do that too by pressing M-c).

@mih

This comment has been minimized.

Member

mih commented Aug 6, 2018

FTR (random dataset), new emacs-like search mode:

(datalad3-dev) mih@meiner /tmp/public (git)-[master] % datalad -c datalad.search.index-egrep-documenttype=all search T1 |tail -n3
search(ok): /tmp/public/sub-21/anat/sub-21_acq-r30_mod-T1w_defacemask.nii.gz (file)
action summary:
  search (ok: 56)
(datalad3-dev) mih@meiner /tmp/public (git)-[master] % datalad -c datalad.search.index-egrep-documenttype=all search t1 |tail -n3
search(ok): /tmp/public/sub-21/func/sub-21_task-orientation_acq-r30_run-10_bold.nii.gz (file)
action summary:
  search (ok: 336)

The difference between 336 and 56 are false positives with regard to the intent of the query.

mih and others added some commits Aug 6, 2018

RF+ENH: New search mode egrepcs for case-sensitive search
Default egrep mode switches to case-sensitive when the query contains
uppercase chars, but is insensitive otherwise.
@mih

This comment has been minimized.

Member

mih commented Aug 6, 2018

Thx @kyleam

@mih

This comment has been minimized.

Member

mih commented Aug 8, 2018

Seems everyone is overjoyed with the update. Setting merge counter to 8h.

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Aug 8, 2018

Confirming being overjoyed and updating counter to 2h. ;-)

@mih

This comment has been minimized.

Member

mih commented Aug 8, 2018

Alrighty!

@mih mih merged commit fcaac41 into datalad:master Aug 8, 2018

6 of 7 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
datalad-pr-dl-osx-64 DEV build done.
Details
datalad-pr-docker-dl-nd14_04 DEV build done.
Details
datalad-pr-docker-dl-nd16_04 DEV build done.
Details
datalad-pr-docker-dl-nd80 DEV build done.
Details
datalad-pr-docker-dl-nd90 DEV build done.
Details

@mih mih deleted the mih:enh-search branch Aug 8, 2018

@yarikoptic yarikoptic added this to the Release 0.10.3 milestone Sep 12, 2018

yarikoptic added a commit that referenced this pull request Sep 13, 2018

Merge tag '0.10.3' into debian
0.10.3 (Sep 13, 2018) -- Almost-perfect

This is largely a bugfix release which addressed many (but not yet all)
issues of working with git-annex direct and version 6 modes, and operation
on Windows in general.  Among enhancements you will see the
support of public S3 buckets (even with periods in their names),
ability to configure new providers interactively, and improved `egrep`
search backend.

Although we do not require with this release, it is recommended to make
sure that you are using a recent `git-annex` since it also had a variety
of fixes and enhancements in the past months.

Fixes

- Parsing of combined short options has been broken since DataLad
  v0.10.0. ([#2710])
- The `datalad save` instructions shown by `datalad run` for a command
  with a non-zero exit were incorrectly formatted. ([#2692])
- Decompression of zip files (e.g., through `datalad
  add-archive-content`) failed on Python 3.  ([#2702])
- Windows:
  - colored log output was not being processed by colorama.  ([#2707])
  - more codepaths now try multiple times when removing a file to deal
    with latency and locking issues on Windows.  ([#2795])
- Internal git fetch calls have been updated to work around a
  GitPython `BadName` issue.  ([#2712]), ([#2794])
- The progess bar for annex file transferring was unable to handle an
  empty file.  ([#2717])
- `datalad add-readme` halted when no aggregated metadata was found
  rather than displaying a warning.  ([#2731])
- `datalad rerun` failed if `--onto` was specified and the history
  contained no run commits.  ([#2761])
- Processing of a command's results failed on a result record with a
  missing value (e.g., absent field or subfield in metadata).  Now the
  missing value is rendered as "N/A".  ([#2725]).
- A couple of documentation links in the "Delineation from related
  solutions" were misformatted.  ([#2773])
- With the latest git-annex, several known V6 failures are no longer
  an issue.  ([#2777])
- In direct mode, commit changes would often commit annexed content as
  regular Git files.  A new approach fixes this and resolves a good
  number of known failures.  ([#2770])
- The reporting of command results failed if the current working
  directory was removed (e.g., after an unsuccessful `install`). ([#2788])
- When installing into an existing empty directory, `datalad install`
  removed the directory after a failed clone.  ([#2788])
- `datalad run` incorrectly handled inputs and outputs for paths with
  spaces and other characters that require shell escaping.  ([#2798])
- Globbing inputs and outputs for `datalad run` didn't work correctly
  if a subdataset wasn't installed.  ([#2796])
- Minor (in)compatibility with git 2.19 - (no) trailing period
  in an error message now. ([#2815])

Enhancements and new features

- Anonymous access is now supported for S3 and other downloaders.  ([#2708])
- A new interface is available to ease setting up new providers.  ([#2708])
- Metadata: changes to egrep mode search  ([#2735])
  - Queries in egrep mode are now case-sensitive when the query
    contains any uppercase letters and are case-insensitive otherwise.
    The new mode egrepcs can be used to perform a case-sensitive query
    with all lower-case letters.
  - Search can now be limited to a specific key.
  - Multiple queries (list of expressions) are evaluated using AND to
    determine whether something is a hit.
  - A single multi-field query (e.g., `pa*:findme`) is a hit, when any
    matching field matches the query.
  - All matching key/value combinations across all (multi-field)
    queries are reported in the query_matched result field.
  - egrep mode now shows all hits rather than limiting the results to
    the top 20 hits.
- The documentation on how to format commands for `datalad run` has
  been improved.  ([#2703])
- The method for determining the current working directory on Windows
  has been improved.  ([#2707])
- `datalad --version` now simply shows the version without the
  license.  ([#2733])
- `datalad export-archive` learned to export under an existing
  directory via its `--filename` option.  ([#2723])
- `datalad export-to-figshare` now generates the zip archive in the
  root of the dataset unless `--filename` is specified.  ([#2723])
- After importing `datalad.api`, `help(datalad.api)` (or
  `datalad.api?` in IPython) now shows a summary of the available
  DataLad commands.  ([#2728])
- Support for using `datalad` from IPython has been improved.  ([#2722])
- `datalad wtf` now returns structured data and reports the version of
  each extension.  ([#2741])
- The internal handling of gitattributes information has been
  improved.  A user-visible consequence is that `datalad create
  --force` no longer duplicates existing attributes.  ([#2744])
- The "annex" metadata extractor can now be used even when no content
  is present.  ([#2724])
- The `add_url_to_file` method (called by commands like `datalad
  download-url` and `datalad add-archive-content`) learned how to
  display a progress bar.  ([#2738])

* tag '0.10.3': (214 commits)
  Changelog entry for 2.19 git compat fix
  DOC: slight tune ups to the ChangeLog
  ENH: link issue/pull #s in CHANGELOG, use tools/link_issues_CHANGELOG
  BF: remove trailing period while matching a mesage from git
  DOC: try_multiple_dec: Add description of `duration` parameter
  CLN: annexrepo: Fix grammar in a recently added comment
  TST: auto: Reformat comment from 900ee08
  DOC: _rmtree: Drop a word from summary line
  DOC: Info on IDs (fixes gh-2801)
  BF: diff -- when reraising - just raise, do not raise that instance of exception from new location
  BF(TST): precommit before removing submodule so there is no batched processes
  ENH+BF(TST): close established ssh sockets upon test finish
  BF(TST): One more "clost all log handlers in the test"
  CHANGELOG.md: Start adding entries for v0.10.3
  BF(TST): close cookies db in the teardown since atexit is later, so cannot assure no open files
  BF(TST): explicitly close created log handlers
  ENH(TST): @known_failure_windows to replace plain @skip_if_on_windows where there is hope
  BF(TST): do not swallow output while testing AutomagicIO to not cause some open files issue
  ENH(TST): Skip a test when cannot remove curent directory
  BF(TST): explicitly precommit a repo used under swallow_outputs
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment