Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Hash dependent build #2044

Closed
wants to merge 18 commits into from
Closed

Conversation

drug007
Copy link
Contributor

@drug007 drug007 commented Nov 27, 2020

Preface

This is the full implementation that passes tests but the last commits should be polished to get ready to review. To do this I need the feedback in general.

Description

Now dub uses timestamps of all source files to determine if the project should be rebuild. It compares timestamp of every source file to timestamp of the target file of the project. If any of source files has more recent timestamp than the target file dub rightly considered that the target file is old and rebuild the project. Call this mode time dependent build. The PR adds other mode, so called hash dependent build that uses the hash sum of a file instead of its timestamp, i.e. in this mode dub checks if the hash sum of the file has been changed comparing to the previous value on the last successful build, if so then rebuild is triggered. If no previous value exists rebuild is triggered too, of course.

How to use the feature

The new mode can be set using command line option, environment var or settings file. Option has four values:

  • absence means there is no explicit value set so default one will be used (currently it is time dependent build, the current behavior). Unfortunately default is keyword in D so absence was used. absence lets you explicitly reset the option to undefined value and can be useful in automated workflow.
  • time means time dependent build
  • sha1 and sha256 mean hash dependent build, using sha1 and sha256 algorithm respectively.

To set the option using command line interface use hash argument:

dub build --hash=sha1

To set the option using the environment variable use DUB_HASH_KIND var:

$ DUB_HASH_KIND=sha256 dub build # linux

To set the option using settings file add this:

{
    ...

    "hashKind" : "sha256"

    ...
}

to your system wide, user wide or project specific setting file, for example dub.settings.json file in the project root. Now the project will be built using sha256 algorithm every time. But you can override it using command line or env var:

# hash dependent build using sha1, not sha256 due to options priority
DUB_HASH_KIND=sha256 dub build --hash=sha1

Option priority

Options set by different ways have different priority (strong to weak):

  • command line option
  • environment variable
  • settings file
  • default value (currently time-dependent but can be changed later)

I implemented both sha1 and sha256 because sha1 is less secure than sha256 but faster so user can select what (s)he wants, some users discussed that in the D forum. Also other algorithms can be added too, especially those implemented in Phobos can be added easily.
Tests are provided but I didn't know that Actions was preferable way to run tests and appveyour and travis were used for this. I'll remake that.
Hash sum of source files are not used in calculating the hash sum of the build, so if you revert your changes dub rebuilds the project. This feature was requested but it means creating artifacts on every changes user made in source files and can take much disk space, so it is not implemented (yet).

Questions

Currently I have the following questions about the feature and its implementation:

  • what is the current code style?
  • I'm not sure that hash or hash kind term are good, probably something like BuildCachePolicy is more appropriated?
  • HashKind.absence is unusual for me too
  • should build cache depend on hash sum of all source files? see the forum post
  • your question here

@dlang-bot
Copy link
Collaborator

Thanks for your pull request and interest in making D better, @drug007! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

@drug007 drug007 changed the title Hash dependent build [WIP] Hash dependent build Nov 27, 2020
@dlang-bot dlang-bot added the WIP label Nov 27, 2020
@PetarKirov
Copy link
Member

PetarKirov commented Nov 27, 2020

Thank you, this is awesome work. One quick comment:

Now only the following CIs are used:

  • GitHub Actions - this is our main CI, as it has support for Windows, Linux and Mac
  • BuildKite - tests each dub pull request against various projects on code.dlang.org
  • DAutoTest - Build the documentation

Previously, we used multiple CI services (as none was sufficient on its own back then). Later we stopped using them, but their respectful config files were left in the repo. I corrected this in this pull request: #2042

Can you edit the commits in your branch (with interactive git rebase) to remove appveyor.yml, etc.?

@drug007
Copy link
Contributor Author

drug007 commented Nov 27, 2020

My plan is:

  • get the feedback
  • refactor my changes according to the feedback
  • split the PR to a series of the PRs to simplify code review

Of course I'll get rid of appveyor and travis traces. I had seen your commit just before I created the PR.

@rikkimax
Copy link
Contributor

Use default_ over absence. Append an underscore to it. That is what the D style guidelines says.

I would not recommend sha1 as your "cheap" option. If for some reason you need to go cheaper on the hash (unlikely since time will be the cheapest option possible), you can't get much cheaper than a FNV. sha256 is probably a sensible default. But in saying that, we don't need a cryptographic hash, just a checksum (which is what FNV is good at).

private {
    enum FNV_Prime_64 = (2 ^^ 40) + (2 ^^ 8) + 0xb3;
    enum FNV_Offset_Basis_64 = 0xcbf29ce484222325;
}

/**
 * Performs a Fowler–Noll–Vo 1a hash
 *
 * Params:
 * 		data = The data to hash
 * 		start = Start hash (for incremental hashing)
 *
 * Returns:
 * 		The hash
 */
ulong fnv_64_1a(ubyte[] data, ulong start = FNV_Offset_Basis_64) @safe nothrow @nogc {
    ulong hash = start;

    foreach (b; data) {
        hash ^= b;
        hash *= FNV_Prime_64;
    }

    return hash;
}

@drug007
Copy link
Contributor Author

drug007 commented Nov 29, 2020

Use default_ over absence. Append an underscore to it.

Then user should use dub build --hash=default_ that is not very intuitive. I just try to avoid increasing code complexity even trivial.

@CyberShadow
Copy link
Member

I don't use Dub much, but I don't understand a few things about this PR:

  1. What is the motivation for this change?

I can understand hashing source code if you're implementing a caching system such as ccache. However, dub does not cache previous builds. Such functionality would probably be better implemented at a lower layer as well, so that it can be used by build systems or dependency management strategies other than Dub.

  1. Why even allow the possibility to configure the hashing algorithm?

When discussing a similar enhancement to rund, there was a similar concern that hashing would introduce a noticeable performance overhead. Benchmarking showed that to be false. There is also the opportunity to speed up hashing by doing it in parallel; but, as already mentioned, there is no reason to require a cryptographically secure hash in this situation.

@drug007
Copy link
Contributor Author

drug007 commented Nov 29, 2020

What is the motivation for this change?

It's a request

Why even allow the possibility to configure the hashing algorithm?

There was a discussion what hash algorithm should be used. As for me there is no difference, just after my changes implementing different algorithms is rather tirvial.

@drug007
Copy link
Contributor Author

drug007 commented Nov 29, 2020

@rikkimax Ok, now I use default_ in the enum and default as an argument. The other question is that I think term hash is not appropriate here because in case of time dependent build there is no any hashing. Isn't something like build-cache-policy more suitable here? This option will be used seldom so long name is normal here.

@drug007
Copy link
Contributor Author

drug007 commented Nov 29, 2020

@CyberShadow should we use one of existing caching systems like ccache/distcc or invent another one? What do you think?

@CyberShadow
Copy link
Member

It's a request

I read the description and looked at the sources linked in the description.

None of them explain or demonstrate a use case that would be applicable here.

If the goal is to try to move towards reproducible builds, this doesn't really help with that.

A functional build system makes sense, but Dub already has strong dependency management, so I don't see how it helps with that either.

I think Symmetry needs to clarify or re-evaluate the request.

@CyberShadow should we use one of existing caching systems like ccache/distcc or invent another one? What do you think?

The first order of business is to have a clearly defined goal, with some user stories and definitions of what would pass as a MVP.

But, to answer your question... again, it depends on the goals. Dub is already a highly D-specific build system, and the difficulty of integrating it with other build systems is a known weakness. If we were to consider that as part of the overarching problem that is attempted at being solved, then evaluating integration with existing caching systems would be worthwhile. If not, then a simple D-specific implementation that's easy to set up and has no external dependencies would probably be more welcome.

@rikkimax
Copy link
Contributor

The other question is that I think term hash is not appropriate here because in case of time dependent build there is no any hashing. Isn't something like build-cache-policy more suitable here? This option will be used seldom so long name is normal here.

Maybe fileIdentifier? But I agree with @CyberShadow. I can only think of one reason where this makes sense and that is multiple workers in a build chain. And I think we need a long talk about what is required for that.

@John-Colvin
Copy link
Contributor

John-Colvin commented Dec 17, 2020

It's a request

I read the description and looked at the sources linked in the description.

None of them explain or demonstrate a use case that would be applicable here.

If the goal is to try to move towards reproducible builds, this doesn't really help with that.

A functional build system makes sense, but Dub already has strong dependency management, so I don't see how it helps with that either.

I think Symmetry needs to clarify or re-evaluate the request.

The goal is to not do a rebuild when you have already built identical files before. It's a build performance aid.

It's not a full solution for all build-performance related woes, but it does provide a useful tool.

@CyberShadow
Copy link
Member

The goal is to not do a rebuild when you have already built identical files before.

As discussed above, this doesn't do that. Dub can only cache one build (per repository and build options).

@drug007
Copy link
Contributor Author

drug007 commented Dec 17, 2020

The goal is to not do a rebuild when you have already built identical files before.

As discussed above, this doesn't do that. Dub can only cache one build (per repository and build options).

Yes, this doesn't do that, but this can be implemented. And I asked feedback exactly to get know what should be implemented and what not.

@John-Colvin
Copy link
Contributor

John-Colvin commented Dec 17, 2020

Sorry, I wasn't clear. When I said

The goal is to not do a rebuild when you have already built identical files before.

I meant that the goal is that when none of the files in a dub package have changed contents, but they have changed timestamp, no rebuild is done.

I understand that dub doesn't cache at a smaller scale, that would be great to have, but it's not required for this to be useful.

@John-Colvin
Copy link
Contributor

Use case: to prevent rebuilds of dependencies (not in ~/.dub) when their files have been touched by changing git branch & back.

@John-Colvin
Copy link
Contributor

Note for others: I have talked to @drug007 separately and any misunderstanding seems resolved. The original understanding that was the basis for the work thus far was correct and it seems my attempts to explain it just caused confusion, sorry. I'll let @drug007 discuss the last bits needed.

Tdlr; everything is mostly fine

@drug007
Copy link
Contributor Author

drug007 commented Jan 12, 2021

Closed because of #2077

@drug007 drug007 closed this Jan 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants