New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved gitattribute handling #2744

Merged
merged 12 commits into from Aug 8, 2018

Conversation

Projects
None yet
3 participants
@mih
Member

mih commented Aug 5, 2018

  • Ability to query multiple paths
  • more robust parsing
  • GitRepo.set_gitattributes()
  • RF existing code to use helper
  • tests

This pull request will fix #2743

mih added some commits Aug 5, 2018

@mih mih force-pushed the mih:enh-attributes branch from 647f9cc to cf07dc2 Aug 6, 2018

mih and others added some commits Aug 7, 2018

BF: no_annex: Correct path to .gitattributes
The path should be within the dataset, not $PWD.
@kyleam

kyleam approved these changes Aug 7, 2018

Looks like a nice improvement. no_annex plugin test should pass with the latest push.

# on rerun with --force
if not attrs.get('.', {}).get('annex.largefiles', None) == '(not(mimetype=text/*))':
tbrepo.set_gitattributes([
# XXX shouldn't this be '**' to have the desired effect?

This comment has been minimized.

@kyleam

kyleam Aug 7, 2018

Member

No, a single '*' works recursively there.

This comment has been minimized.

@mih

mih Aug 8, 2018

Member

OK.

@kyleam

This comment has been minimized.

Member

kyleam commented Aug 7, 2018

I haven't clicked through all the failing tests, but the few I did were all due to the known grunt issue (#2742).

@kyleam

This comment has been minimized.

Member

kyleam commented Aug 7, 2018

The remaining test failures are legitimate. There are two places in the crawler extension that call the removed get_git_attributes method.

@codecov

This comment has been minimized.

codecov bot commented Aug 8, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@9c8a9a1). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2744   +/-   ##
=========================================
  Coverage          ?   88.76%           
=========================================
  Files             ?      245           
  Lines             ?    31195           
  Branches          ?        0           
=========================================
  Hits              ?    27691           
  Misses            ?     3504           
  Partials          ?        0
Impacted Files Coverage Δ
datalad/support/annexrepo.py 89.05% <100%> (ø)
datalad/distribution/create.py 90.07% <100%> (ø)
datalad/plugin/no_annex.py 78.72% <100%> (ø)
datalad/support/tests/test_annexrepo.py 95.33% <100%> (ø)
datalad/support/gitrepo.py 87.94% <100%> (ø)
datalad/support/tests/test_gitrepo.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c8a9a1...106a49e. Read the comment docs.

@mih

This comment has been minimized.

Member

mih commented Aug 8, 2018

Thx @kyleam, and thx for locating the failing tests. I just pushed actual tests for the new code too. Should be done by tomorrow.

mih added some commits Aug 8, 2018

BF: Humble attempt to fix unicode issue
  File "/Users/datalad/buildslave/datalad-pr-dl-osx-64/build/datalad/support/tests/test_gitrepo.py", line 1270, in test_gitattributes
    get_most_obscure_supported_name())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-13: ordinal not in range(128)
@mih

This comment has been minimized.

Member

mih commented Aug 8, 2018

I pushed the changes necessary for the crawler to https://github.com/datalad/datalad-crawler/tree/rf-attributes Will open a PR once this is merged.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Aug 8, 2018

As travis rightfully points out this API breakage breaks datalad-crawler:

AttributeError: 'AnnexRepo' object has no attribute 'get_git_attributes'
@kyleam

This comment has been minimized.

Member

kyleam commented Aug 8, 2018

As travis rightfully points out this API breakage breaks datalad-crawler:

@yarikoptic above and above?

@mih mih merged commit 3bcc58f into datalad:master Aug 8, 2018

6 of 7 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
datalad-pr-dl-osx-64 DEV build done.
Details
datalad-pr-docker-dl-nd14_04 DEV build done.
Details
datalad-pr-docker-dl-nd16_04 DEV build done.
Details
datalad-pr-docker-dl-nd80 DEV build done.
Details
datalad-pr-docker-dl-nd90 DEV build done.
Details

@mih mih deleted the mih:enh-attributes branch Aug 8, 2018

@yarikoptic

Please make it backward compatible with existing code , breaks crawler as travis has stated

from datalad import get_encoding_info
from datalad.cmd import Runner
from datalad.tests.utils import ok_

This comment has been minimized.

@yarikoptic

yarikoptic Aug 8, 2018

Member

Just a minor comment: I find the convention of using "tupled" list of imports quite nice, so it would be nice if you could try/adhere to it as well, so here it would look like

from datalad.tests.utils import (
     ok_,
     ok_clean_git,
     ...,
)

which also allows to minimize diff for the next additions since the last entry could have a trailing ,. Also IMHO easier to read since less of repetition

This comment has been minimized.

@mih

mih Aug 9, 2018

Member

It makes grepping for imports on RF much harder.

This comment has been minimized.

@yarikoptic

yarikoptic Aug 9, 2018

Member

Hmm, haven't thought about that, but not sure why I would want to go for them with import - if rfing away any hit should get renamed or removed

Parameters
----------
attrs : list
Each item is a 2-tuple, where the first element is a path pattern,

This comment has been minimized.

@yarikoptic

yarikoptic Aug 8, 2018

Member

feels that it is a use-case to use a dict (if needed ordered) not a list of 2-tuples.

This comment has been minimized.

@mih

mih Aug 9, 2018

Member

That would work in the general case, because it is possible to set attributes on a single paths on multiple lines.

that contains the target .gitattributes file (see `attrfile`).
attrfile: path
Path relative to the repository root of the .gitattributes file the
attributes shall be set in.

This comment has been minimized.

@yarikoptic

yarikoptic Aug 8, 2018

Member

just curious if you know about a use-case where having it as an option would come handy already with git? i.e. if there is another file which might follow this syntax to be used for storing some attrs?

This comment has been minimized.

@kyleam

kyleam Aug 8, 2018

Member

a use-case where having it as an option would come handy already with git?

Any .gitattributes file in a subdirectory? We could assume the base name .gitattributes and change this argument to take a directory, though the current approach seems straightforward enough.

We don't use it at the moment, but there is also $GITDIR/info/attributes, though taking that relative to the repo root isn't convenient for that use case since $GITDIR isn't necessarily "root/.git".

This comment has been minimized.

@mih

mih Aug 9, 2018

Member

True, it isn't necessarily always .gitattributes, hence I went this way.

This comment has been minimized.

@yarikoptic
attr = []
return attributes
def set_gitattributes(self, attrs, attrfile='.gitattributes'):

This comment has been minimized.

@yarikoptic

yarikoptic Aug 8, 2018

Member

I wonder if this one should be named add_gitattributes (or even append_) since that is what it does -- it appends, and thus potentially augments existing or it doesn't really "reset" or modify already existing ones, so it cannot remove previously set (besides to reset to an empty value I guess).

This comment has been minimized.

@kyleam

kyleam Aug 8, 2018

Member

thus potentially augments existing or it doesn't really "reset" or modify already existing ones

Is it true that a later value can augment a previous one? IIRC a later value in a .gitattributes file overrides the previous one.

This comment has been minimized.

@mih

mih Aug 9, 2018

Member

@kyleam but it is done on a per attribute basis.

This comment has been minimized.

@mih

mih Aug 9, 2018

Member

@yarikoptic I think neither add nor append indicate the right semantics. It appends attribute spec lines, but whether that results in a set, add, or unset behavior is determined by a complex interplay of interpreting all gitattribute specs from multiple locations.

This comment has been minimized.

@kyleam

kyleam Aug 9, 2018

Member

@kyleam but it is done on a per attribute basis.

I'm confused. I thought "existing ones" in @yarikoptic's comment referred to attributes. Perhaps I should have read it to mean a spec line.

@yarikoptic yarikoptic added this to the Release 0.10.3 milestone Sep 12, 2018

yarikoptic added a commit that referenced this pull request Sep 13, 2018

Merge tag '0.10.3' into debian
0.10.3 (Sep 13, 2018) -- Almost-perfect

This is largely a bugfix release which addressed many (but not yet all)
issues of working with git-annex direct and version 6 modes, and operation
on Windows in general.  Among enhancements you will see the
support of public S3 buckets (even with periods in their names),
ability to configure new providers interactively, and improved `egrep`
search backend.

Although we do not require with this release, it is recommended to make
sure that you are using a recent `git-annex` since it also had a variety
of fixes and enhancements in the past months.

Fixes

- Parsing of combined short options has been broken since DataLad
  v0.10.0. ([#2710])
- The `datalad save` instructions shown by `datalad run` for a command
  with a non-zero exit were incorrectly formatted. ([#2692])
- Decompression of zip files (e.g., through `datalad
  add-archive-content`) failed on Python 3.  ([#2702])
- Windows:
  - colored log output was not being processed by colorama.  ([#2707])
  - more codepaths now try multiple times when removing a file to deal
    with latency and locking issues on Windows.  ([#2795])
- Internal git fetch calls have been updated to work around a
  GitPython `BadName` issue.  ([#2712]), ([#2794])
- The progess bar for annex file transferring was unable to handle an
  empty file.  ([#2717])
- `datalad add-readme` halted when no aggregated metadata was found
  rather than displaying a warning.  ([#2731])
- `datalad rerun` failed if `--onto` was specified and the history
  contained no run commits.  ([#2761])
- Processing of a command's results failed on a result record with a
  missing value (e.g., absent field or subfield in metadata).  Now the
  missing value is rendered as "N/A".  ([#2725]).
- A couple of documentation links in the "Delineation from related
  solutions" were misformatted.  ([#2773])
- With the latest git-annex, several known V6 failures are no longer
  an issue.  ([#2777])
- In direct mode, commit changes would often commit annexed content as
  regular Git files.  A new approach fixes this and resolves a good
  number of known failures.  ([#2770])
- The reporting of command results failed if the current working
  directory was removed (e.g., after an unsuccessful `install`). ([#2788])
- When installing into an existing empty directory, `datalad install`
  removed the directory after a failed clone.  ([#2788])
- `datalad run` incorrectly handled inputs and outputs for paths with
  spaces and other characters that require shell escaping.  ([#2798])
- Globbing inputs and outputs for `datalad run` didn't work correctly
  if a subdataset wasn't installed.  ([#2796])
- Minor (in)compatibility with git 2.19 - (no) trailing period
  in an error message now. ([#2815])

Enhancements and new features

- Anonymous access is now supported for S3 and other downloaders.  ([#2708])
- A new interface is available to ease setting up new providers.  ([#2708])
- Metadata: changes to egrep mode search  ([#2735])
  - Queries in egrep mode are now case-sensitive when the query
    contains any uppercase letters and are case-insensitive otherwise.
    The new mode egrepcs can be used to perform a case-sensitive query
    with all lower-case letters.
  - Search can now be limited to a specific key.
  - Multiple queries (list of expressions) are evaluated using AND to
    determine whether something is a hit.
  - A single multi-field query (e.g., `pa*:findme`) is a hit, when any
    matching field matches the query.
  - All matching key/value combinations across all (multi-field)
    queries are reported in the query_matched result field.
  - egrep mode now shows all hits rather than limiting the results to
    the top 20 hits.
- The documentation on how to format commands for `datalad run` has
  been improved.  ([#2703])
- The method for determining the current working directory on Windows
  has been improved.  ([#2707])
- `datalad --version` now simply shows the version without the
  license.  ([#2733])
- `datalad export-archive` learned to export under an existing
  directory via its `--filename` option.  ([#2723])
- `datalad export-to-figshare` now generates the zip archive in the
  root of the dataset unless `--filename` is specified.  ([#2723])
- After importing `datalad.api`, `help(datalad.api)` (or
  `datalad.api?` in IPython) now shows a summary of the available
  DataLad commands.  ([#2728])
- Support for using `datalad` from IPython has been improved.  ([#2722])
- `datalad wtf` now returns structured data and reports the version of
  each extension.  ([#2741])
- The internal handling of gitattributes information has been
  improved.  A user-visible consequence is that `datalad create
  --force` no longer duplicates existing attributes.  ([#2744])
- The "annex" metadata extractor can now be used even when no content
  is present.  ([#2724])
- The `add_url_to_file` method (called by commands like `datalad
  download-url` and `datalad add-archive-content`) learned how to
  display a progress bar.  ([#2738])

* tag '0.10.3': (214 commits)
  Changelog entry for 2.19 git compat fix
  DOC: slight tune ups to the ChangeLog
  ENH: link issue/pull #s in CHANGELOG, use tools/link_issues_CHANGELOG
  BF: remove trailing period while matching a mesage from git
  DOC: try_multiple_dec: Add description of `duration` parameter
  CLN: annexrepo: Fix grammar in a recently added comment
  TST: auto: Reformat comment from 900ee08
  DOC: _rmtree: Drop a word from summary line
  DOC: Info on IDs (fixes gh-2801)
  BF: diff -- when reraising - just raise, do not raise that instance of exception from new location
  BF(TST): precommit before removing submodule so there is no batched processes
  ENH+BF(TST): close established ssh sockets upon test finish
  BF(TST): One more "clost all log handlers in the test"
  CHANGELOG.md: Start adding entries for v0.10.3
  BF(TST): close cookies db in the teardown since atexit is later, so cannot assure no open files
  BF(TST): explicitly close created log handlers
  ENH(TST): @known_failure_windows to replace plain @skip_if_on_windows where there is hope
  BF(TST): do not swallow output while testing AutomagicIO to not cause some open files issue
  ENH(TST): Skip a test when cannot remove curent directory
  BF(TST): explicitly precommit a repo used under swallow_outputs
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment