
[HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env#2058

Closed
yanghua wants to merge 3 commits into apache:master from yanghua:HUDI-DOCKER-OPTIMIZATION

Conversation

@yanghua
Contributor

@yanghua yanghua commented Sep 1, 2020

What is the purpose of the pull request

Cache some framework binaries to speed up the progress of building docker image in local env

Brief change log

  • Prepare and cache binaries before building docker image in local env

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@vinothchandar
Member

@bvaradar knows the best about this. Can you please help review when the PR is ready?

@vinothchandar vinothchandar removed their assignment Sep 1, 2020
@yanghua yanghua force-pushed the HUDI-DOCKER-OPTIMIZATION branch from 9fec56b to af386a2 on September 3, 2020 11:28
@yanghua
Contributor Author

yanghua commented Sep 7, 2020

@bvaradar It seems you are busy?

@leesf Do you have time to verify this PR?

@bvaradar
Contributor

bvaradar commented Sep 8, 2020

@yanghua : Sorry for the delay. Let me review this tonight.

@bvaradar bvaradar self-requested a review September 8, 2020 01:09
Contributor

@bvaradar bvaradar left a comment


@yanghua : Trying to understand where such caching is reused. I can see this being useful when we are repeatedly building docker images, but this is not how we have set up CI. Are you building docker images somewhere else?

    echo "The binary file $HIVE_BINARY_FILE_NAME has been cached in the binary cache directory!"
    else
    echo "The binary file $HIVE_BINARY_FILE_NAME did not exist in the binary cache directory, try to download."
    wget https://archive.apache.org/dist/hive/hive-$HIVE_VERSION/apache-hive-$HIVE_VERSION-bin.tar.gz -O ${HIVE_BINARY_CACHE_FILE_PATH}
Contributor


@yanghua : This could sometimes result in a corrupted Hive binary being cached. If wget fails in the middle (because of a timeout or network disconnect), the partially downloaded file would be reused in subsequent docker builds (bypassing the download). You would need to maintain some state files to track whether the files are fully downloaded or not.
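The failure mode described here can also be guarded against without separate state files, by downloading to a temporary name and promoting it into the cache only on success. A minimal sketch, where `download_to_cache` and its arguments are hypothetical names rather than code from this PR:

```shell
# Sketch: download to a ".part" temp file and promote it into the cache
# only when wget exits successfully, so a partial download (timeout,
# disconnect) never ends up looking like a valid cached binary.
# download_to_cache and its arguments are hypothetical names.
download_to_cache() {
  local url="$1" cache_path="$2"
  local tmp="${cache_path}.part"
  if wget -q "$url" -O "$tmp"; then
    mv "$tmp" "$cache_path"   # rename is atomic on the same filesystem
  else
    rm -f "$tmp"              # discard the partial download
    return 1
  fi
}
```

A subsequent build can then treat the mere existence of the cache file as proof of a complete download, because a partial file only ever exists under the `.part` name.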

Contributor Author


Yes, will strengthen the check logic.

Contributor Author

@yanghua yanghua Sep 8, 2020


My original thought was to download and cache the sha256 file in the project and verify the checksum later. However, I found that different frameworks use different checksum formats. See here:

http://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz.sha512
https://archive.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz.sha256

So, I would just copy the hash value into a customized file.
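That plan could look like the following sketch: keep only the bare hex digest in a project-side file (normalizing away the differing upstream formats) and pick the hashing tool from the digest length. The function and file names are hypothetical, not from this PR:

```shell
# Sketch: verify a downloaded archive against a project-side file that
# contains only the bare hex digest (copied by hand from the upstream
# .sha256 / .sha512 file). Names here are hypothetical.
verify_checksum() {
  local file="$1" digest_file="$2"
  local expected actual
  # Normalize: strip whitespace/newlines, lowercase the hex digits.
  expected=$(tr -d '[:space:]' < "$digest_file" | tr 'A-F' 'a-f')
  case ${#expected} in
    64)  actual=$(sha256sum "$file" | awk '{print $1}') ;;  # SHA-256
    128) actual=$(sha512sum "$file" | awk '{print $1}') ;;  # SHA-512
    *)   echo "Unrecognized digest length in $digest_file" >&2; return 2 ;;
  esac
  [ "$expected" = "$actual" ]
}
```

Selecting the tool by digest length sidesteps the format differences between the Spark `.sha512` and Hive `.sha256` files entirely.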

@yanghua
Contributor Author

yanghua commented Sep 8, 2020

Are you building docker images somewhere else?

No, I only want to speed up the local build process. Actually, I do not know how we publish the official docker image. Do we reuse these changed Dockerfiles? If yes, we also need more adaptation.

@bvaradar
Contributor

bvaradar commented Sep 8, 2020

@yanghua : Publishing images is done in an ad hoc fashion, only on demand. So, IMO, local caching of artifacts is not going to help. W.r.t. adaptations, do you have anything else other than caching in mind?

@yanghua
Contributor Author

yanghua commented Sep 8, 2020

@yanghua : Publishing images is done in an ad hoc fashion, only on demand. So, IMO, local caching of artifacts is not going to help. W.r.t. adaptations, do you have anything else other than caching in mind?

Let me clarify the purpose of this PR. I just want to speed up the local build process so that I can verify new changes frequently. I do not want to speed up the publishing process. I raised this question so that you can check whether the changed Dockerfiles would break normal publishing.

@bvaradar
Contributor

bvaradar commented Sep 9, 2020

@yanghua :

I just want to speed up the local build process so that I can verify new changes frequently.

Can you kindly clarify what you mean by changes here? Do you mean hudi code changes? If you are referring to hudi, we don't have to rebuild docker images to pick up the latest hudi code. The hudi codebase is mounted inside the docker containers so that you can use the latest version.

If you need to rebuild docker for other reasons, one of the most common cases is when we are upgrading hive, spark, or presto versions. In that case the caching won't help. Let me know if I am missing something.

Regarding using SHAs to determine whether a cached distribution is usable or not, this will pose additional work for someone upgrading the dependencies. Let me know your thoughts.

Thanks,
Balaji.V

@yanghua
Contributor Author

yanghua commented Sep 10, 2020

If you are referring to hudi, we don't have to rebuild docker images to pick up latest hudi code.

Yes

The hudi codebase is mounted inside docker containers so that you can use the latest version.

You mean if I change the code, it would be reflected in the hudi inside docker immediately? Where can I learn about the configuration of this mechanism in the project? Sorry, I am not familiar with Docker.

@bvaradar
Contributor

@yanghua : You can look at the docker compose file and https://github.com/apache/hudi/blob/master/docker/setup_demo.sh. We mount the hudi workspace inside docker to achieve this.

I guess we can close this PR then?

@yanghua
Contributor Author

yanghua commented Sep 11, 2020

@yanghua : You can look at the docker compose file and https://github.com/apache/hudi/blob/master/docker/setup_demo.sh. We mount the hudi workspace inside docker to achieve this.

I guess we can close this PR then?

Got it. IIUC, you mean:

volumes:
    - ${HUDI_WS}:/var/hoodie/ws

I guess we can close this PR then?

Yes.

Actually, my colleagues also do not know about it. We rarely use docker. IMO, it would be better to describe it in the documentation. WDYT?

@yanghua yanghua closed this Sep 11, 2020
@bvaradar
Contributor

Yes @yanghua : The docker containers are mainly for internal (testing and demo) consumption, but I agree we can document it for engineers to know. Would you mind helping document this if possible?

@yanghua
Contributor Author

yanghua commented Sep 12, 2020

Would you mind helping document this if possible?

OK, my pleasure.
