Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add interface to core's File::Compare #255

Closed
grr opened this issue Sep 23, 2022 · 10 comments
Closed

add interface to core's File::Compare #255

grr opened this issue Sep 23, 2022 · 10 comments
Labels

Comments

@grr
Copy link

grr commented Sep 23, 2022

I often have to compare the contents pointed to by the paths. I had been using the digest method, which works fine for the purpose, but I just benchmarked against File::Compare and the latter is 15-30x faster, so I'm definitely switching to using compare.

According to corelist: File::Compare was first released with perl 5.004, so it would not add an additional dependency.

@xdg xdg added the wishlist label Sep 23, 2022
@xdg
Copy link
Contributor

xdg commented Sep 23, 2022

Thanks for the suggestion. This does fit the mission of Path::Tiny -- making it easy to do things the "right way".

path($one)->compare($two)

Path:Tiny can do a raw open on the two files and provide the handles to File::Compare. And it can default to a 64k buffer size, which appears recommended as a good default for I/O.

I'm not inclined to support the compare_text method because dealing with file encodings leads to even more API bloat and is already a specialized case. Users can open the two file handles directly and iterate them easily enough.

Note to self: compare isn't a great verb for this. Many (but not all) boolean methods start with is. Maybe compares_equal?

@xdg
Copy link
Contributor

xdg commented Sep 23, 2022

@ap -- you've got good perspective on APIs -- what do you think about naming?

@ap
Copy link
Contributor

ap commented Sep 23, 2022

My first thought was “what’s wrong with compare?”

I don’t feel that compares_equal improves on it.

And I couldn’t immediately think of anything else that does.

OTOH, taking a step back, I don’t actually disagree that compare is a bit meh.

Indeed, on second thought, it occurs to me that “compares” and “equal” both say the same thing, and that’s the problem, no? It’s just meh twice. You want to say what is equal. Which would be the contents, which means you are looking for something like contents_equal.

My only misgiving with that particular suggestion is that it completely loses any connection to File::Compare as the underlying module.

(FWIW I agree with not supporting compare_text.)

@grr
Copy link
Author

grr commented Sep 23, 2022

which means you are looking for something like contents_equal.

My only misgiving with that particular suggestion is that it completely loses any connection to File::Compare as the underlying module.

It sounds like we're already assuming the return value will be reversed (unless we name it contents_differ). Looking through Path::Tiny's docs, there are a number of tests that are named is_ (is_absolute, is_relative, is_rootdir), beside the file test operators, so why not something like $a->is_same_content($b)?

The docs should also probably mention that File::Compare does not check if the underlying files are the same (it will happily take the same file as both inputs), so the user might still want to compare with realpath(). And there doesn't appear to be a portable way to check if a file is a hard-link of another: https://www.perlmonks.org/?node_id=859658

@ap
Copy link
Contributor

ap commented Sep 23, 2022

All the existing is_foo methods are simple predicates about the invocant itself. This new method would be an odd one out among the set, so I don’t know that that’s a good place to locate it.

Just linguistically is_same_content is nails on a chalkboard though. It would have have to be called has_same_content anyway, or else something like is_identical_in_content.

This all gets rather cumbersome IMO.

It sounds like we're already assuming the return value will be reversed

That’s a good point. I don’t know that I’m happy with that.

@xdg
Copy link
Contributor

xdg commented Sep 23, 2022

Annoyingly, File::Compare has 0 for unequal, 1 for equal and -1 for error which means it doesn't actually work like cmp, which would make a name like cmp_content descriptive and intuitive. (Could reimplement to do those semantics without File::Compare...)

  • has_equal_content?

@grr
Copy link
Author

grr commented Sep 24, 2022

The docs should also probably mention that File::Compare does not check if the underlying files are the same (it will happily take the same file as both inputs), so the user might still want to compare with realpath(). And there doesn't appear to be a portable way to check if a file is a hard-link of another: https://www.perlmonks.org/?node_id=859658

The unix cmp command (which is part of the POSIX spec, so is available on most systems) does both of these checks: http://git.savannah.gnu.org/cgit/diffutils.git/tree/src/cmp.c#n284
And it's exit value is similar to to File::Compare::compare's return values: Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.

@xdg
Copy link
Contributor

xdg commented Sep 28, 2022

In the spirit of "write the docs first", here's what I propose. I think it's hard to misunderstand or misuse.

=method has_same_bytes

    if ( path("foo.txt")->has_same_bytes("bar.txt") ) {
       # ...
    }

This method returns true if both the invocant and the argument can be opened as
file handles and the handles contain the same bytes.  It returns false if their
contents differ.  If either can't be opened as a file (e.g. a directory or
non-existent file), the method throws an exception.  If both can be opened and
both have the same C<realpath>, the method returns true without scanning any
data.

=cut

Note: there is a subtlety with realpath where a realpath to a non-existent leaf is possible, which is why both must be opened before the realpath check.

@ap
Copy link
Contributor

ap commented Sep 28, 2022

LGTM

@grr
Copy link
Author

grr commented Sep 30, 2022

Just linguistically is_same_content is nails on a chalkboard though. It would have have to be called has_same_content anyway, or else something like is_identical_in_content.

I Just realized why i was having the opposite feeling. has implies a membership test, whereas is implies an identity test. And both of those words are frequently used in Perl with those contexts: has with Moo/Moose (though it's used to install a member into a set, instead of testing for membership), and is with the isa op.

@xdg xdg closed this as completed in 1308e50 Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants