Skip to content

Conversation

@houqp
Copy link
Member

@houqp houqp commented Jul 6, 2021

Which issue does this PR close?

Closes apache/datafusion-ballista#24

Rationale for this change

Avoid ballista integration test regressions automatically.

What changes are included in this PR?

New CI job and updated integration test script to make it runnable within CI.

Merged ballista base docker build with the main docker build so we can leverage buildx cache for the app image build. On top of that, we don't get much value out of dedicated base image if we are not publishing that base image to a public registry.

reference: automata-network/automata#11

@houqp
Copy link
Member Author

houqp commented Jul 6, 2021

Integration test took 28m to run, compared to our main Rust test job which took 14m. I think we can cut down the build time by 10m if we avoid rebuilding the base image on every run. I am thinking we can host the pre-built base image with public github docker registry, thoughts?

@jorgecarleitao
Copy link
Member

The build log is a bit unreadable in its current form. Could we trim the output?

I went through the build log and I am not sure having a pre-built image helps:

  • 4m are spent in apt-get installing.
  • most of the time is spent building a --release binary. Do we need it to be a release build?
  • we are not caching anything from cargo, like we do for the other builds

I suggest that we trim the time by caching the cargo build, which has no dependency on docker nor docker registry.

@houqp
Copy link
Member Author

houqp commented Jul 6, 2021

You are right, base image build took around 5 mins, then it's 20 mins for ballista build and 4 mins for running the integration tests. I will look into optimization for the ballista build step.

@houqp houqp marked this pull request as draft July 6, 2021 04:30
@andygrove
Copy link
Member

Thanks @houqp it is great to see this being worked on 🚀

@houqp
Copy link
Member Author

houqp commented Jul 6, 2021

Turns out release build made a big difference for test run time, in debug mode, job runtime went from 28m to almost 1 hour.

@andygrove
Copy link
Member

The integration tests are crazy slow because the distributed query execution in Ballista is fundamentally broken (see #707) and fragments of the query are executed multiple times. I am hoping to have this all fixed in the next few weeks (I only have time at weekends to work on this).

@github-actions github-actions bot added the development-process Related to development process of DataFusion label Aug 8, 2021
@houqp houqp marked this pull request as ready for review August 8, 2021 06:01
@houqp
Copy link
Member Author

houqp commented Aug 8, 2021

Finally got cargo cache working within docker builds, this is way more complicated than I initially expected...

Ballista integration test run now completes roughly 1 minute faster than our full Rust workspace test suit when there is a cache hit. I think we should be good for now. As a bonus, subsequent ballista docker build should be a lot faster on local machine as well if buildx cache is enabled.

@andygrove
Copy link
Member

Thanks @houqp for persevering with this! I will try and review in the next day or two.

@houqp houqp marked this pull request as draft August 10, 2021 06:28
@houqp
Copy link
Member Author

houqp commented Aug 10, 2021

converting PR back to draft mode since I noticed buildx just released a native github action backend that we can leverage to keep layer cache size from growing unbounded. i will give https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md#cache-backend-api a try this weekend.

@alamb
Copy link
Contributor

alamb commented Oct 26, 2021

Marking PRs that haven't had activity in over a month as 'stale-pr' to help me filter the list. Please remove the label or let me know if "stale" is not the correct designation

@alamb
Copy link
Contributor

alamb commented Nov 2, 2021

Closing a seemingly stale PR -- please reopen if that was a mistake.

@alamb alamb closed this Nov 2, 2021
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
## Which issue does this PR close?

Part of apache#372 and apache#551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the test that requires to see SparkArithmeticException

## How are these changes tested?

Enabled `SPARK-40389: Don't eliminate a cast which can cause overflow`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

development-process Related to development process of DataFusion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI should run Ballista integration tests

4 participants