Discussion on future directions of "hardlinkpy" and variants #36

chadnetzer · 2018-07-21T05:26:56Z

@akaihola @wolfospealain @jamescassell @JohnVillalovos I'm opening this issue as a general place to discuss future directions for the hardlinkpy program, and the various forks.

Currently, there are two forks (mine and the one from @wolfospealain) from this repo that have advanced a fair bit ahead of the akaihola/master branch, enough that it's questionable whether they'll be in any way mergeable. AFAICT, it's a matter of coincidental timing that both @wolfospealain and I got the itch to work on significant modifications to this program recently. Also, from looking around the internet, I found a number of variants and rewrites of hardlinkpy that others have worked on over the years. Currently Ubuntu/Debian have a "hardlink" package written by Julian Klode that was inspired by "hardlinkpy" but rewritten from scratch in C with an MIT license. Notably, it is aware of extended attributes as well as normal inode metadata, which is a potentially valuable feature. Red Hat/Fedora, on the other hand, is still using the original "hardlink" written by Jakub Jelinek, presumably because it still serves its intended purpose for them (hardlinking the identical kernel header files). So two of the main distributions currently have a "hardlink" program, each serving the same purpose but with different implementations (both related inspirationally to hardlinkpy, but independent from it). And there are a few other "hardlinkpy"-inspired variants around as well.

Both @wolfospealain and I appear to have each made our own significant variations on the existing code base in the last few weeks, taking different approaches, but inspired by some of the same ideas (supporting new Python versions, using classes instead of module globals, updating options, etc).

It's worth discussing overall what direction @akaihola sees for his repo going forward, since it is the first search engine hit for "hardlinkpy". I can't speak for @wolfospealain, but given the License changes (GPL 3 only), and other recent updates, I expect he intends to maintain his own independent fork of a "hardlink" program.

As for me, I also got to a point where it made sense to significantly diverge from the current repo master; my PR #33 shows a progression of changes that are (in principle) mergeable into the current codebase, but for the other things I wanted to achieve I decided on a major refactoring, and rebranding, to allow me to break from some of the backwards option compatibility issues. In particular, I found that having to remember to specify the "dry-run" option was a chore, as I was mostly using the program in dry-run mode to gather information on what files were hardlinkable (without wanting to actually hard-link them yet). In general, I preferred a "safer" default. So I decided on making a program called "hardlinkable", instead of "hardlink", which by default reports data but makes no changes (and can be used more easily as a service by other Python software). One of my motivations was making sure the reported statistics on space saved for the various options are always accurate, which I have found to be unreliable in the "hardlink" variants that I have tested. I have a "hardlinkable_devel" branch in my repo (chadnetzer/hardlinkpy) that shows where I am currently, for those interested.

So, with that throat clearing, I'm interested in what thoughts others have about possible future directions for this (or other) forks of the hardlinkpy repo. I'm certainly interested in contributing to this repo for things like bug fixes, etc., as it is currently the one most people are likely to discover when searching for an alternative to the "hardlink" program that comes with their Linux distro (which is how I discovered it years ago). However, since it's unlikely that Red Hat or Debian will replace their "hardlink" packages with a similarly named one unless it offers significant improvements over the existing versions they are using, I think I'm committed to developing my variant independently as an alternative to the classic "hardlink" program. Since there seems to be a recent surge of development interest on this repo (the @akaihola repo), I'm interested in his thoughts, and those of @JohnVillalovos and others, on what they see for the future of the "hardlinkpy" program.

akaihola · 2018-07-21T17:06:43Z

Thanks @chadnetzer for your excellent summary!

My repository is high on search engine results probably because it's the earliest of the active forks, and actually predates @JohnVillalovos's own fork on GitHub.

I'm trying to summarize the status of different forks below. Please correct me if I'm mischaracterizing your forks and your intentions. Would be nice to acknowledge other significant work on the original code base with details, too; any particular ones to note here @chadnetzer?

JohnVillalovos/hardlinkpy
- no changes from final state of his original Google Code repository, abandoned
akaihola/hardlinkpy
- first to mirror original project from Google Code -> high on search engines
- no strong personal needs or vision for the functionality
- cares for Python 3 compatibility and good test coverage
- open towards handing over maintenance
wolfospealain/hardlinkpy
- obsoleted in favor of wolfospealain/hardlink
wolfospealain/hardlink
- project rename, license change, UI and internal changes, OO rewrite
- opinionated fork for personal use?
chadnetzer/hardlinkpy
- UI and internal changes, OO rewrite, Python 2.3 & 3.x compatibility
- prefer to work on an "official" version?

Regarding the dry-run default, couldn't we just provide multiple entry points in setup.py and use a different default for the --dry-run option depending on the name of the calling binary? That way hardlink and hardlinkable would execute the same script but with a different value for --dry-run.
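To make the suggestion concrete, here is a minimal sketch of picking the --dry-run default from the name the program was invoked as. Everything in it is an assumption for illustration: the module name `hardlinkpy` in the setup.py comment, the `--no-dry-run` flag, and the rule that only the name "hardlink" links by default. It is not taken from any of the forks.

```python
import argparse
import os
import sys

# Hypothetical setup.py fragment (module name "hardlinkpy" is assumed):
#   entry_points={"console_scripts": [
#       "hardlink = hardlinkpy:main",
#       "hardlinkable = hardlinkpy:main",
#   ]}


def build_parser(prog_name):
    """Build a parser whose --dry-run default depends on the invoked name."""
    prog = os.path.basename(prog_name)
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("--dry-run", dest="dry_run", action="store_true")
    # "--no-dry-run" is an assumed flag name, used to override the default.
    parser.add_argument("--no-dry-run", dest="dry_run", action="store_false")
    parser.add_argument("directories", nargs="*")
    # Assumption: only the classic name "hardlink" modifies files by default;
    # any other name (e.g. "hardlinkable") reports without linking.
    parser.set_defaults(dry_run=(prog != "hardlink"))
    return parser


def main(argv=None):
    argv = sys.argv if argv is None else argv
    return build_parser(argv[0]).parse_args(argv[1:])
```

With this sketch, `main(["hardlinkable", "/srv"])` would leave `dry_run` set, while `main(["hardlink", "/srv"])` would not. On platforms where the installed script carries a suffix (such as `.exe`), the name comparison would need adjusting.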

As noted above, I'm willing to follow if @chadnetzer wants to take the lead and become the "official" fork.
In that case I would merge his changes into my fork, and point users to his fork in the repository description. I also appreciate changes in @wolfospealain's fork, but I would prefer to see them merged separately for clarity. I'd also like a possible license change to happen after discussion and agreement.

Finally, as stated elsewhere, I think we should eventually either obtain control of the hardlinkpy project on PyPI, or rebrand and upload a new project on PyPI. I'm not sure if the size and significance of this project warrants a GitHub organization instead of having the main fork under a personal account.

chadnetzer · 2018-07-21T22:56:12Z

@akaihola That's a good descriptive expansion of the projects I alluded to. At least for my fork, I can mention that I've essentially rewritten (or heavily modified) the current algorithm to keep track of inode/pathname information during the tree walk, and then (optionally) perform the hardlinking after the walk has been completed. This ensures that the linking step has complete information, and can (for example) perform linking in a way that minimizes the total number of link() calls made (solving the "clustering issue" that @wolfospealain mentions), and produce accurate statistics on what would happen whether it's in "dry-run" mode or not. There are some memory implications to this change, since (in principle) more information is being collected, and @wolfospealain has mentioned memory usage as a concern of his. On the other hand, I handle the pathname objects in a cleverer way, and it won't surprise me if my changes generally use less memory than the original code; currently the Statistics class keeps a lot of information around anyway, a fair amount of which can be optimized.
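A rough sketch of the two-phase idea described above (collect first, link afterwards) might look like the following. It is not the code from any fork: the function names, the dictionary layout, and the "keep the inode with the most pathnames" heuristic are assumptions for illustration, and content comparison is left out entirely (the caller is assumed to supply groups of inodes already known to be identical).

```python
import os
import stat


def scan(root):
    """Phase 1: walk the tree and record every pathname per (device, inode)."""
    inodes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, devices, sockets, ...
            entry = inodes.setdefault(
                (st.st_dev, st.st_ino),
                {"paths": [], "size": st.st_size, "nlink": st.st_nlink},
            )
            entry["paths"].append(path)
    return inodes


def plan_links(inodes, identical_groups):
    """Phase 2: plan the links without touching the disk.

    identical_groups is a list of lists of inode keys whose contents are
    assumed to have been verified as equal elsewhere.
    Returns (list of (source, destination) pairs, bytes that would be freed).
    """
    plan = []
    bytes_saved = 0
    for group in identical_groups:
        # Keep the inode that already has the most pathnames in the tree,
        # so that the fewest pathnames need to be relinked.
        keep = max(group, key=lambda key: len(inodes[key]["paths"]))
        source = inodes[keep]["paths"][0]
        for key in group:
            if key == keep:
                continue
            entry = inodes[key]
            plan.extend((source, path) for path in entry["paths"])
            # Space is only freed if every link to this inode was seen in
            # the walk; otherwise a pathname outside the tree keeps it alive.
            if len(entry["paths"]) == entry["nlink"]:
                bytes_saved += entry["size"]
    return plan, bytes_saved
```

Because the plan is computed from complete information, the same `plan` and `bytes_saved` values can be reported in dry-run mode or executed with `os.link`, which is one way the statistics could stay consistent between the two modes.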

Anyway, the long story short is that, besides what you mentioned, I'm really attempting to produce a version that always has accurate statistics while being "safe" to run (for example, aborting early if it detects that the file tree, or the files themselves, have been modified before linking them).
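As an illustration of the "abort early if something changed" idea, a check along these lines could be run just before each link. This is only a sketch under assumptions: the names `snapshot`, `verify_unchanged`, and `FileChangedError` are hypothetical, it relies on `st_mtime_ns` (Python 3.3+), and which stat fields to compare is a design choice rather than anything stated in this thread.

```python
import os


class FileChangedError(Exception):
    """Raised when a file no longer matches what the tree walk recorded."""


def snapshot(path):
    """Record the stat fields that should stay stable between walk and link."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)


def verify_unchanged(path, recorded):
    """Re-stat the file and abort if it differs from the recorded snapshot."""
    if snapshot(path) != recorded:
        raise FileChangedError(path)
```

A caller would take `snapshot(path)` during the walk and call `verify_unchanged(path, recorded)` immediately before linking. A small window between the check and the link remains, so this reduces the risk rather than eliminating it.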

> Regarding the dry-run default, couldn't we just provide multiple entry points in setup.py and use a different default for the --dry-run option depending on the name of the calling binary?

I'm open to this possibility; note that (currently) I chose the rebranding of name to allow freedom with modifying some of the option names, etc. If I were to also support being called as "hardlink", it'd be worth having a discussion on what set of options to support (and what compatibility, if any, with other "hardlink" implementations it's best to maintain).

> As noted above, I'm willing to follow if @chadnetzer wants to take the lead and become the "official" fork.

I'm open to this, though I'd like to hear from @wolfospealain about it. A number of his changes were queued up before I started my own development recently (and I incorporated his pending commits as best I could), but our recent developments suggest we have different styles in our approach. :) I'd at least be interested to know if what I've been working on would suit his needs as well, and if he is okay with a totally different code structure than his for the future direction. One strong point in favor of my current code base (imo) is my expanded test suite.

> I'd also like a possible license change to happen after discussion and agreement.

I'd like to hear more; can you open a separate issue to discuss (or maybe just a quick summary of your thoughts here first)? Are you wishing to move to GPLv3, for example?

> Finally, as stated elsewhere, I think we should eventually either obtain control of the hardlinkpy project on PyPI, or rebrand and upload a new project on PyPI.

I'd be interested if John V. joins the conversation in August, after his vacation, and whether he is open to such a thing. However, given the current distro namespace clashes of the "hardlink" name, coupled potentially with the fairly major changes to the longstanding codebase, I'm inclined to rebrand (allowing us to change command line options with more freedom). I chose "hardlinkable" to indicate my default "dry-run" version of the tool, but I was also considering "hardlinkify" as an updated name that better indicated the intent of the software.

akaihola · 2018-07-23T20:05:22Z

Continuing the refactoring discussion from #3, it indeed seems that we have two good rewrites of the project in your (@chadnetzer's) and @wolfospealain's forks. I appreciate the fact that building a clean trail of commits starting from my fork and ending in either of those forks would be lots of work and the value of that work would be questionable.

If we were to collectively "bless" either of your forks, I'll be more than happy to follow with my fork. And clearly you @chadnetzer and @wolfospealain have much more specialized needs for the tool and more expertise for ironing out the details. Therefore I don't think it makes sense for me to participate too much in design decisions.

Regarding the license, I don't see a concrete need to change it, but I'm open towards well reasoned changes. I typically favor permissive licenses, especially in library and framework code, but in the case of a stand-alone application like hardlinkpy, GPL is also reasonable.

It will indeed be interesting to hear from @JohnVillalovos. I guess it makes sense to wait for that before big decisions.

chadnetzer · 2018-07-25T05:41:02Z

@akaihola I might be doing some sporadic traveling for a couple weeks, which may coincide w/ John V returning, so in the short term we can maybe wait and see if he weighs in.

chadnetzer mentioned this issue Jul 21, 2018

Added filesize options, clustering check, change mode and timestamp use-case, keep newest attributes. #3

Closed
