Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scripts for sparse checkout of some contribs #48183

Merged
merged 11 commits into from
Apr 11, 2023

Conversation

tavplubix
Copy link
Member

@tavplubix tavplubix commented Mar 29, 2023

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

See https://git-scm.com/docs/git-sparse-checkout

Before:

du -hs contrib
4.6G	contrib
find contrib -type f | wc -l
357344

After:

du -hs contrib
1.6G	contrib
find contrib -type f | wc -l
91082

It does not reduce network usage when cloning and does not reduce the sizes of .git/modules/contrib/* dirs. But it's still useful and can make some IDEs work faster. We should exclude more files from contrib.

@robot-clickhouse robot-clickhouse added pr-not-for-changelog This PR should not be mentioned in the changelog submodule changed At least one submodule changed in this PR. labels Mar 29, 2023
@antonio2368 antonio2368 self-assigned this Apr 3, 2023
@Felixoid
Copy link
Member

Felixoid commented Apr 4, 2023

I've tested it as the following:

$ git clone git@github.com:ClickHouse/ClickHouse.git --depth=1 --single-branch --branch=contrib_sparse_checkout ClickHouse-sparse
$ cd ClickHouse-sparse
# the new command
$ time ./contrib/update-submodules.sh
.....
./contrib/update-submodules.sh  61,69s user 12,43s system 21% cpu 5:50,28 total
$ du -hscx .git/modules contrib
1,1G	.git/modules
1,6G	contrib
2,7G	total
# remove the repo and apply the command from https://github.com/ClickHouse/checkout/
$ git clone git@github.com:ClickHouse/ClickHouse.git --depth=1 --single-branch --branch=contrib_sparse_checkout ClickHouse-sparse
$ time (
  git -C ClickHouse-sparse submodule sync && \
  git -C ClickHouse-sparse submodule init && \
  git config --file ClickHouse-sparse/.gitmodules --null --get-regexp path | \
    sed -z 's|.*\n||' | \
    xargs --max-procs=100 --null --no-run-if-empty --max-args=1 \
      git -C ClickHouse-sparse submodule update --depth=1 --single-branch
)
.....
( LC_ALL=en_US.UTF-8 git -C ClickHouse-sparse submodule sync && LC_ALL= git -)  71,25s user 16,78s system 92% cpu 1:35,53 total
$ du -hscx .git/modules contrib
1,1G	.git/modules
4,5G	contrib
5,6G	total

As far as I get, the only thing the submodule.<name>.update does is ignore the paths from checking out. But we still have them in the .git/modules AKA still download them. For the CI the changes are unnecessary since the ClickHouse/checkout is already optimized from the checkout time perspective.

From an end-user usability perspective, I couldn't really say. Yes, 3GB is a quite significant difference.

@@ -550,6 +550,13 @@ jobs:
with:
clear-repository: true
submodules: true
- name: Apply sparse checkout for contrib # in order to check that it doesn't break build
Copy link
Member

@Felixoid Felixoid Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, don't forget to roll it back. Unfortunately, it doesn't increase the checkout speed, it only decreases the allocated disk space.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had no intention of rolling it back, I'm going to keep it. See the comment below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary for at least one build. Because we must check that it works. Imagine that some contributor cloned the repo with "sparse checkout" following the documentation. And another contributor has full contribs. The second contributor adds some code that uses a new boost module, for example. It compiles on their machine. It compiles in the CI.

I see. In this case, the job is done very well. We have all submodules cloned in the checkout step before, and the following test is done faster.

Maybe, if we do it in BuilderDebDebug, it makes sense to adjust every workflow having it?

  • .github/workflows/backport_branches.yml
  • .github/workflows/master.yml
  • .github/workflows/release_branches.yml

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, if we do it in BuilderDebDebug, it makes sense to adjust every workflow having it?

I think that having it in pull_request.yml only is enough to avoid breaking the build with "sparse checkout". But I don't have a strong opinion on that, we can add it to all workflows for consistency.

Also it makes sense to add an option to use update-submodules.sh in https://github.com/ClickHouse/checkout (and avoid the ugly trick with rm -rf), but it will make the action less universal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Felixoid, I added the same change to other workflows

P.S. It would be great to refactor these yamls somehow and reduce the amount of copy-paste

docs/ru/development/developer-instruction.md Outdated Show resolved Hide resolved
docs/en/development/developer-instruction.md Outdated Show resolved Hide resolved
@tavplubix
Copy link
Member Author

tavplubix commented Apr 4, 2023

As far as I get, the only thing the submodule..update does is ignore the paths from checking out.

Yep.

But we still have them in the .git/modules AKA still download them.

Yep, that's written in the PR description:

It does not reduce network usage when cloning and does not reduce the sizes of .git/modules/contrib/* dirs. But it's still useful and can make some IDEs work faster. We should exclude more files from contrib.

But it's worth mentioning that git/modules/contrib/* contain compressed files. For example, .git/modules/contrib/llvm-project is only 265MB (with --depth=1) while contrib/llvm-project is 1.5GB.

For the CI the changes are unnecessary

It is necessary for at least one build. Because we must check that it works. Imagine that some contributor cloned the repo with "sparse checkout" following the documentation. And another contributor has full contribs. The second contributor adds some code that uses a new boost module, for example. It compiles on their machine. It compiles in the CI.

But then the PR from the second contributor is merged and the first contributor cannot build ClickHouse anymore because the scripts in contrib/sparse-checkout/ are stale.

Having "sparse checkout" in the CI avoids issues like this.

From an end-user usability perspective, I couldn't really say. Yes, 3GB is a quite significant difference.

My build directory is almost 50GB. I don't really care about 3GB of disk space, it's not a big difference. The problem is that an IDE may index all code in contrib/ and it may slow down the IDE significantly. Also, having less garbage in contrib/ is good for searching in code (it will search faster and will find fewer irrelevant results).

Also it's useful for understanding actual dependencies:

# FIXME why do we need csharp?
#echo '!/src/csharp/*' >> $FILES_TO_CHECKOUT

(probably we use it by mistake and it has to be fixed)

@tavplubix
Copy link
Member Author

Stateless tests (debug) [2/5] - #48574

@tavplubix tavplubix merged commit 327bfbc into master Apr 11, 2023
148 checks passed
@tavplubix tavplubix deleted the contrib_sparse_checkout branch April 11, 2023 13:18
@@ -349,6 +349,13 @@ jobs:
with:
clear-repository: true
submodules: true
- name: Apply sparse checkout for contrib # in order to check that it doesn't break build
Copy link
Member

@Felixoid Felixoid May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest reverting this part in the CI.

Besides the fact, that it only does the additional work after the modules are already checked out, this part has broken the debug build https://github.com/ClickHouse/ClickHouse/actions/runs/5033995222/jobs/9028616459#step:7:3663

cc @nickitat

upd: it's only a suggestion though. Fixing in the same PR works too. But it's better to backport it too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Felixoid, that's exactly why I added this to the CI. Changes in #50037 break sparse checkout, so it must be fixed in #50037 before merging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense to check this script in CI, but maybe as a separate step with self-explanatory name. otherwise people will just waste time over and over again because obviously most of them have no clue about this logic. and they shouldn't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Nikina. It's unclear now what is broken there, and why it differs from other steps

Copy link
Member

@Felixoid Felixoid May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the fact that it's good to have a good explanation of failure, we should use one of the fastest builds instead of a long one

image

What do you think of BuilderBinRelease as the job to test the sparse checkout? We can rename it explicitly to BuilderBinReleaseSparseSubmodules

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was unobvious that we use sparse checkout for debug build only, sorry for that. But I don't like introducing another build just to check that sparse checkout works, because it will actually duplicate an ordinary build for no good reason. I would prefer to add some note about this to the report.

we should use one of the fastest builds instead of a long one

It does not matter, sparse checkout does not make a build slower, it just takes a few seconds to re-checkout submodules, and after that build continues as usual.

What do you think of BuilderBinRelease as the job to test the sparse checkout?

Release build is slower than debug build.

Copy link
Member

@Felixoid Felixoid May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Release deb yes, but I am talking regarding the release binary build. See the report, 0:16:20 for the BuilderBinRelease vs 0:20:05 for BuilderDebDebug

I am speaking not about adding a new check, but renaming the existing one to reflect what we actually check there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-not-for-changelog This PR should not be mentioned in the changelog submodule changed At least one submodule changed in this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants