FYI: empty files are annexed; inconsistent mimetype reporting #3663

Closed
adswa opened this issue Sep 12, 2019 · 5 comments

@adswa
Member

adswa commented Sep 12, 2019

I'm just noting a behavior that confused me:

Let's say I create an empty file in a dataset configured with text2git, using touch somefile (no extension, no content).
The mimetype command reports this file as text/plain:

╰─➤ datalad create -c text2git blubb
[INFO   ] Creating a new annex repo at /tmp/blubb 
[INFO   ] Running procedure cfg_text2git 
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
create(ok): /tmp/blubb (dataset)
╭─adina@muninn /tmp
╰─➤ cd blubb; touch somefile
╭─adina@muninn /tmp/blubb on master
╰─➤ mimetype somefile
somefile: text/plain

A subsequent datalad save annexes the file:

╭─adina@muninn /tmp/blubb on master
╰─➤ datalad save -m "blabla"
add(ok): somefile (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
╭─adina@muninn /tmp/blubb on master
╰─➤ ls -l
total 4
lrwxrwxrwx 1 adina adina 108 Sep 12 10:30 somefile -> .git/annex/objects/2W/kW/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e

The inconsistency (and my confusion) arises from the fact that text2git configures .gitattributes so that no text files are regarded as largefiles:

* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=(not(mimetype=text/*))

The rules for largefiles in the docs of Git-annex help to clarify what I was missing:

mimetype=glob
Looks up the MIME type of a file, and checks if the glob matches it.
For example, "mimetype=text/*" will match many varieties of text files, including "text/plain", but also "text/x-shellscript", "text/x-makefile", etc.
The MIME types are the same that are displayed by running file --mime-type
This is only available to use when git-annex was built with the MagicMime build flag.
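To make the glob semantics quoted above concrete, here is a minimal Python sketch. It uses `fnmatchcase` as a rough stand-in for git-annex's matcher (an assumption; git-annex's own implementation is not shown in this thread), and the helper name `mimetype_matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def mimetype_matches(glob: str, mimetype: str) -> bool:
    """Approximate a `mimetype=glob` test with shell-style globbing.

    This is only an illustration of the documented behavior,
    not git-annex's actual code.
    """
    return fnmatchcase(mimetype, glob)


# MIME types that appear in this thread or in the quoted docs.
for mt in ("text/plain", "text/x-shellscript", "inode/x-empty", "inode/symlink"):
    print(mt, mimetype_matches("text/*", mt))
```

Under this approximation the `text/*` glob covers the text types but neither of the `inode/...` types, so a rule of the form `not(mimetype=text/*)` would treat those as largefiles.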

If I actually run file --mime-type somefile it is not reported as a text file anymore:

╰─➤ file --mime-type somefile
somefile: inode/symlink

That's just as an FYI. However, is it actually useful/intended to annex files with a size of 0?
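The `inode/symlink` result is consistent with the `ls -l` output above: after the save, `somefile` is a symlink into `.git/annex/objects`, so a tool that does not follow links describes the link rather than the content. A small Python sketch of that distinction follows; the temporary paths only stand in for the work-tree entry and the annex object, and `describe` is a hypothetical helper.

```python
import os
import tempfile


def describe(path: str) -> str:
    """Report whether a path is a symlink, plus the size of the real content."""
    if os.path.islink(path):
        target = os.path.realpath(path)
        # A dangling link has no content to measure.
        size = os.path.getsize(target) if os.path.exists(target) else None
        return f"symlink (target size: {size})"
    return f"regular file (size: {os.path.getsize(path)})"


with tempfile.TemporaryDirectory() as tmp:
    content = os.path.join(tmp, "content")  # stands in for the annex object
    link = os.path.join(tmp, "somefile")    # stands in for the work-tree entry
    open(content, "w").close()              # create an empty file
    os.symlink(content, link)
    print(describe(link))
    print(describe(content))
```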

@bpoldrack
Member

FTR:

$ file --mime-type [empty file]
anotherfile: inode/x-empty

Apparently that's what git-annex uses. However, we do need to figure out what to do with such configuration procedures, since the current behavior doesn't make a lot of sense. Maybe just add inode/x-empty to the things that should not be annexed?

@kyleam
Contributor

kyleam commented Sep 12, 2019

> However, is it actually useful/intended to annex files with a size of 0?

I can't think of a reason why it would be. The file name is already tracked, so there can be no unintended information that's being exposed.

I read @bpoldrack's comment as suggesting that you should add inode/x-empty to your repo's .gitattributes. That would of course work fine, but in this case I'd also be ok with just bundling the condition with cfg_text2git:

diff --git a/datalad/resources/procedures/cfg_text2git.py b/datalad/resources/procedures/cfg_text2git.py
index 0218d2f9d..490666ea0 100644
--- a/datalad/resources/procedures/cfg_text2git.py
+++ b/datalad/resources/procedures/cfg_text2git.py
@@ -10,7 +10,7 @@
     check_installed=True,
     purpose='configuration')
 
-annex_largefiles = '(not(mimetype=text/*))'
+annex_largefiles = '(not((mimetype=text/*)or(mimetype=inode/x-empty)))'
 attrs = ds.repo.get_gitattributes('*')
 if not attrs.get('*', {}).get(
         'annex.largefiles', None) == annex_largefiles:

If a user has instructed datalad to send text files to git, I think it's more likely that they'll be surprised an empty file goes to annex than that they'll be annoyed about our loose interpretation of an empty file as a text file.
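To compare the two `annex_largefiles` expressions from the diff, the following Python sketch evaluates both for a few MIME types. It only emulates the boolean logic with `fnmatchcase` (an assumption about how the globs behave); the function names are hypothetical and nothing here calls git-annex.

```python
from fnmatch import fnmatchcase


def largefile_before(mimetype: str) -> bool:
    # Emulates: (not(mimetype=text/*))
    return not fnmatchcase(mimetype, "text/*")


def largefile_after(mimetype: str) -> bool:
    # Emulates: (not((mimetype=text/*)or(mimetype=inode/x-empty)))
    return not (
        fnmatchcase(mimetype, "text/*")
        or fnmatchcase(mimetype, "inode/x-empty")
    )


for mt in ("text/plain", "inode/x-empty", "image/png"):
    print(f"{mt:15} before={largefile_before(mt)} after={largefile_after(mt)}")
```

In this emulation only the `inode/x-empty` row changes: it counts as a largefile under the old expression and not under the patched one.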

I suppose it's worth considering if the default .gitattributes (i.e., no -c text2git) should include not(mimetype=inode/x-empty), but I think the consistency of "all files go to annex unless told otherwise" is worth keeping.

@bpoldrack
Member

@kyleam:
tldr; completely agree.

My intention actually was either to add it to text2git (and possibly other config procedures) or to make it part of the default .gitattributes instead. I just wanted to hear about possible objections or alternatives.

> but I think the consistency of "all files go to annex unless told otherwise" is worth keeping.

Yes, I lean towards that idea as well.

@kyleam
Contributor

kyleam commented Sep 12, 2019

@adswa Would you like to take a crack at wrapping the above patch into a proper commit? I think it should come down to providing the rationale in the commit message and adding a test.

I suppose one tricky part about testing cfg_text2git is that the behavior depends on git-annex being built with the MagicMime flag. But assuming MagicMime support should be fine because neurodebian's git-annex has it and it looks like conda's does now too. If it ends up being a problem down the road, we can skip the test based on the git annex version output.
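As a rough sketch of the "skip based on the git annex version output" idea: the helper below looks for a flag on a `build flags:` line. The line format is assumed from typical `git annex version` output, the sample text is made up for illustration, and `has_build_flag` is a hypothetical name rather than an existing datalad helper.

```python
def has_build_flag(version_output: str, flag: str = "MagicMime") -> bool:
    """Return True if `flag` is listed on the 'build flags:' line.

    Assumes the output contains a line of the form
    'build flags: Flag1 Flag2 ...'; returns False if no such line exists.
    """
    for line in version_output.splitlines():
        if line.startswith("build flags:"):
            return flag in line.split(":", 1)[1].split()
    return False


# Illustrative sample only, not captured from a real installation.
sample = "git-annex version: (sample)\nbuild flags: Assistant Webapp MagicMime\n"
print(has_build_flag(sample))
```

In a test suite the text would presumably come from something like `subprocess.run(["git", "annex", "version"], capture_output=True, text=True).stdout`, with the test skipped when the helper returns False.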

@adswa
Member Author

adswa commented Sep 13, 2019

Yes, I'd love to!

adswa added a commit to adswa/datalad that referenced this issue Sep 13, 2019
Empty files are of mimetype inode/x-empty, and hence would be annexed. This behavior is likely unexpected after applying the text2git configuration. See 'datalad#3663 (comment)' for details.
adswa added a commit to adswa/datalad that referenced this issue Sep 16, 2019
Empty files are of mimetype inode/x-empty, and hence would be annexed. This behavior is likely unexpected after applying the text2git configuration. See 'datalad#3663 (comment)' for details.

use file size rule instead of mime type
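The commit note mentions a file size rule but does not show the resulting expression. As a sketch only, assuming git-annex's `largerthan=` matcher is combined with the existing mimetype test, the logic might look like the following. The combined expression in the comment is hypothetical and not taken from the commit, and `is_largefile` is an illustrative name.

```python
def is_largefile(size_bytes: int, mimetype: str) -> bool:
    """Emulate a hypothetical rule such as
    ((largerthan=0)and(not(mimetype=text/*))).

    Empty files are never largefiles here, whatever MIME type is reported.
    """
    return size_bytes > 0 and not mimetype.startswith("text/")


print(is_largefile(0, "inode/x-empty"))  # empty file
print(is_largefile(120, "text/plain"))   # non-empty text file
print(is_largefile(120, "image/png"))    # non-empty binary file
```

A size test of this kind would not depend on how an empty file's MIME type happens to be reported.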
@kyleam kyleam closed this as completed in 0ec2b11 Sep 26, 2019