ARROW-3367: [INTEGRATION] Port Spark integration test to the docker-compose setup #3300
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3300 +/- ##
==========================================
+ Coverage 88.56% 88.58% +0.01%
==========================================
Files 546 546
Lines 73059 73059
==========================================
+ Hits 64708 64721 +13
+ Misses 8242 8233 -9
+ Partials 109 105 -4
Continue to review full report at Codecov.
LGTM except one small comment
fails locally for me with
This Docker build pollutes the local git clone with files and directories owned by root. This has been an issue for me with other Docker builds. Is it possible to stop doing this? See https://issues.apache.org/jira/browse/ARROW-3078
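A common mitigation for root-owned files (not what this PR ended up doing) is to pass the host user's UID/GID to docker, so anything the container writes into a bind mount is owned by the invoking user. The sketch below only assembles and prints the command; `arrow-build` is a hypothetical image name.

```shell
# Sketch: run the container as the host user so files created in the
# mounted /arrow directory are not owned by root.
host_uid="$(id -u)"
host_gid="$(id -g)"
cmd="docker run --rm -u ${host_uid}:${host_gid} -v $PWD:/arrow arrow-build"
echo "$cmd"
```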
@wesm So the Arrow source directory is only contaminated by maven.
Is it possible to build the Java library inside the container rather than using the local volume?
@wesm Sure, it's possible, but the two approaches serve different purposes. I've optimized the current Dockerfiles to minimize rebuilds. So currently my approach is to build exclusively the dependencies in the container, then mount arrow's source into the container and build there. We can optimize the other way around and build the libraries within the container, but that approach is more suited for production containers. In this particular case, we can build spark as a dependency in the container (image actually :)), but then we either need to build arrow before it (in a previous layer) or recompile afterwards hoping that maven can reuse the previous build artifacts. In the first case an edit would trigger a full rebuild. Theoretically docker volumes don't have any performance penalty on linux hosts. BTW it builds successfully on the DGX machine.
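The dependencies-in-image, source-as-volume split described above might look roughly like this (a hypothetical Dockerfile sketch; the base image and package list are illustrative, not Arrow's actual files):

```dockerfile
FROM ubuntu:18.04

# Bake only the toolchain and third-party dependencies into the image,
# so these layers are cached across source edits.
RUN apt-get update && \
    apt-get install -y cmake g++ maven python3 && \
    rm -rf /var/lib/apt/lists/*

# Note: no COPY of the Arrow sources. docker-compose bind mounts them
# at /arrow, so editing a source file never invalidates an image layer.
WORKDIR /arrow
```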
# Run pyarrow related Python tests only
echo "Testing PySpark:"
python/run-tests --modules pyspark-sql
The --testnames option is only available after the 2.4.0 release.
here is what I used to test pyarrow for all related python tests:
python/run-tests --testnames=pyspark.sql.tests.test_arrow,pyspark.sql.tests.test_pandas_udf,pyspark.sql.tests.test_pandas_udf_scalar,pyspark.sql.tests.test_pandas_udf_grouped_agg,pyspark.sql.tests.test_pandas_udf_grouped_map,pyspark.sql.tests.test_pandas_udf_window
Hey @BryanCutler! Could you please give me a spark commit hash I should use to test spark with arrow 0.12?
@kszucs I just merged in support for 0.12.0 here apache/spark@16990f9, so you could also use the master branch
Thanks @BryanCutler !
I've never seen an error like that. Have you tried to rebuild the parent images?
Executed from the root directory /home/wesm/code/arrow. You have sudo on this machine if you want to see for yourself.
I get it now: docker can't gather the build context (arrow files). I have two workarounds for it, will fix tomorrow.
Force-pushed 02a465f to b8d67cf
- .:/arrow:ro # ensures that docker won't contaminate the host directory
- alpine-cache:/build:delegated
volumes:
Switched to named volumes.
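For illustration, the named-volume arrangement could look roughly like this in docker-compose (a sketch; the service and volume names are placeholders, not the PR's actual file):

```yaml
services:
  alpine-build:
    volumes:
      - .:/arrow:ro                    # source stays read-only in the container
      - alpine-cache:/build:delegated  # build artifacts go to a named volume

volumes:
  alpine-cache:
```

Named volumes live in docker's own storage, so build output never lands in the host checkout and can still be reused across runs.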
@@ -21,13 +21,19 @@ version: '3.5'
x-ubuntu-volumes:
&ubuntu-volumes
- .:/arrow:delegated
- ${ARROW_DOCKER_CACHE_DIR:-./docker_cache}/ubuntu:/build:delegated
- .:/arrow:ro # ensures that docker won't contaminate the host directory
This brought up new issues.
Now the python build is failing because _generated_version.py can't be written in the source directory (setting write_to to False resolves that). So python setup.py build would work without contaminating the source directory. The issue is that python setup.py install is still attempting to write to the source directory (no matter which options I use).
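The write_to behavior being discussed is setuptools_scm's: when a version file target is configured, it is generated inside the source tree at build time, which fails on a read-only mount. A minimal, hypothetical setup.py fragment (not pyarrow's actual one) illustrating the difference:

```python
from setuptools import setup

setup(
    name="example-package",
    use_scm_version={
        # With "write_to" set, setuptools_scm writes a version file into
        # the source tree at build time, which fails on a read-only mount:
        # "write_to": "example/_generated_version.py",
        # With it unset, the version is computed in memory instead.
    },
    setup_requires=["setuptools_scm"],
)
```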
Hm, can you rsync the files into container space and build there?
Yes. I don't like that solution, but I'll probably have to stick with it.
ci/docker_build_java.sh (outdated)
mkdir -p /build/java
rsync -a /arrow/header /build/java
rsync -a /arrow/java /build/java
rsync -a /arrow/format /build/java
I couldn't convince maven NOT to contaminate the source directory, even by hacking the top-level pom.xml.
Force-pushed e9ddc32 to b08b33a
So finally I was able to reproduce the Arrow related issues of spark 2.4. A couple of test cases are failing (log is coming...)
Force-pushed d413f86 to 1397392
@kszucs this needs a rebase. What is the status of this build?
…egration test to work
This reverts commit 3524ce3.
Force-pushed 1397392 to e13d227
I'm struggling with maven to compile spark. Before the release it compiled, but now I need to update the scala maven plugin.
Docker didn't provide enough memory to compile spark, and the error was not descriptive... So I reproduced the previous Pyspark test failures:
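Besides raising the Docker memory limit, Spark's own build generally needs a larger JVM heap than maven's default; a sketch of the usual knob (the exact values here are a guess, tune per machine):

```shell
# Give maven more heap and code cache before compiling Spark; an
# undersized JVM tends to die with unhelpful errors.
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
echo "$MAVEN_OPTS"
```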
It'd be best to merge this PR and decide/fix the issues in follow-up pull requests, mainly because this PR also helps to prevent contamination of the source directory.
@@ -105,8 +111,8 @@ services:
context: .
dockerfile: java/Dockerfile
volumes:
- .:/arrow:delegated
- $HOME/.m2:/root/.m2:delegated
- .:/arrow:ro # ensures that docker won't contaminate the host directory
Thanks @kszucs, yes let's merge this and then look into the failure as follow up. @BryanCutler have you seen the above error?
+1, merging now. I'm sorry this was such a slog. Can you open a JIRA about the follow up fix?
# this is a nightmare, but prevents mutating the source directory
# which is bind mounted as readonly
Oof, sorry. We should report this issue to setuptools, since this behavior is really crappy.
Sorry I missed this PR. I have seen an error like the above. Spark needs to be patched to work with Arrow 0.12.0. I'll try to do that soon and run this again.
Spark updates for 0.12.0 patch here apache/spark#23657
This incorporates the fixes from #3254
Running it on a local "Docker for Mac" takes ages because of its volume virtualization and maven's excessive IO usage. So currently I need to run it on travis...
Crossbow builds here