[HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env #2058
yanghua wants to merge 3 commits into apache:master
Conversation
@bvaradar knows the best about this. Can you please help review when the PR is ready?
@yanghua : Sorry for the delay. Let me review this tonight.
bvaradar left a comment:
@yanghua : Trying to understand where such caching is reused. I can see this being useful when we are repeatedly building docker images, but this is not how we have set up CI. Are you building docker images somewhere else?
```shell
echo "The binary file $HIVE_BINARY_FILE_NAME has been cached in the binary cache directory!"
else
  echo "The binary file $HIVE_BINARY_FILE_NAME did not exist in the binary cache directory, try to download."
  wget https://archive.apache.org/dist/hive/hive-$HIVE_VERSION/apache-hive-$HIVE_VERSION-bin.tar.gz -O ${HIVE_BINARY_CACHE_FILE_PATH}
```
@yanghua : This could sometimes result in a corrupted hive binary being cached. If wget fails in the middle (because of a timeout or network disconnect), the partially downloaded file would be reused in subsequent docker builds (which bypass the download). You would need to maintain some state files to track whether the files are fully downloaded or not.
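One common way to address this concern, instead of separate state files, is a temp-file-plus-rename pattern: the cache path only ever holds a complete file. A minimal sketch (the function and variable names are hypothetical, and curl stands in for the PR's wget for brevity):

```shell
# Sketch only -- not the PR's actual code. Download to a ".part"
# temporary path and move it into place only after the transfer
# succeeds, so an interrupted download can never be mistaken for
# a complete cached binary.
fetch_to_cache() {
  local url="$1" dest="$2"
  if [ -f "$dest" ]; then
    echo "cached"            # a complete file is already present
    return 0
  fi
  # mv within one filesystem is atomic, so $dest only ever
  # holds a fully downloaded file.
  if curl -fsS "$url" -o "$dest.part"; then
    mv "$dest.part" "$dest"
    echo "downloaded"
  else
    rm -f "$dest.part"       # discard the partial file
    return 1
  fi
}
```

A failed transfer leaves nothing behind, so the next build retries the download instead of reusing a truncated archive.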
Yes, I will strengthen the check logic.
My original thought was to download and cache the sha256 file in the project and verify the checksum later. However, I found that different frameworks use different formats. See here:
http://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz.sha512
https://archive.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz.sha256
So, I would just copy the hash value into a customized file.
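The idea above could be sketched as follows (file names and the function are illustrative, not from the PR): the project keeps just the bare hex digest in its own uniform file, so the differing upstream .sha256 / .sha512 layouts no longer matter.

```shell
# Sketch only: verify a cached tarball against a hex digest that
# the project stores in its own, uniformly formatted checksum file.
verify_cached_binary() {
  local archive="$1" checksum_file="$2"
  local expected actual
  # The custom file holds only the digest; strip whitespace defensively.
  expected=$(tr -d ' \n' < "$checksum_file")
  actual=$(sha256sum "$archive" | awk '{print $1}')
  [ "$expected" = "$actual" ]
}
```

Combined with the download step, a cached file that fails verification would be deleted and re-downloaded on the next build.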
No, I only want to speed up the local build process. Actually, I do not know how we publish the official docker image. Do we reuse these changed
@yanghua : Publishing images is done in an ad hoc fashion, only on demand. So, IMO, local caching of artifacts is not going to help. W.r.t. adaptations, do you have anything else other than caching in mind?
Let me clarify the purpose of this PR. I just want to speed up the local build process so that I can verify new changes frequently. I do not want to speed up the publishing process. I raised this question so that you can check if the changing of
@yanghua :
Can you kindly clarify what you mean by changes here? Do you mean hudi code changes? If you are referring to hudi, we don't have to rebuild docker images to pick up the latest hudi code. The hudi codebase is mounted inside the docker containers so that you can use the latest version. If you need to rebuild docker for other reasons, one of the most common cases is when we are upgrading the hive, spark, or presto versions, and in that case the caching won't help. Let me know if I am missing something. Regarding using SHAs to determine whether cached distributions are useful or not, this will pose additional work for someone upgrading the dependencies. Let me know your thoughts. Thanks,
Yes.
You mean that if I change the code, it would be reflected in the hudi running on docker immediately? Where can I find the configuration of this mechanism in the project? Sorry, I am not familiar with Docker.
@yanghua : You can look at the docker compose file and https://github.com/apache/hudi/blob/master/docker/setup_demo.sh We mount the hudi workspace inside docker to achieve this. I guess we can close this PR then?
Got it. IIUC, you mean:
Yes. Actually, my colleagues also do not know about it. We rarely use docker. IMO, it would be better to describe it in the documentation. WDYT?
Yes @yanghua : The docker containers are mainly for internal (testing and demo) consumption, but I agree we can document it for engineers to know. Would you mind helping to document this if possible?
OK, my pleasure.
What is the purpose of the pull request
Cache some framework binaries to speed up the progress of building docker image in local env
Brief change log
Verify this pull request
This pull request is a trivial rework / code cleanup without any test coverage.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.