
ARROW-3367: [INTEGRATION] Port Spark integration test to the docker-compose setup #3300

Closed · wants to merge 26 commits

Conversation

@kszucs (Member) commented Jan 3, 2019

This incorporates the fixes from #3254
Running it on a local "Docker for Mac" takes ages because of its volume virtualization and Maven's excessive IO usage, so currently I need to run it on Travis...

Crossbow builds here

@kszucs added the "WIP: PR is work in progress" label Jan 3, 2019
@codecov-io commented Jan 3, 2019

Codecov Report

Merging #3300 into master will increase coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3300      +/-   ##
==========================================
+ Coverage   88.56%   88.58%   +0.01%     
==========================================
  Files         546      546              
  Lines       73059    73059              
==========================================
+ Hits        64708    64721      +13     
+ Misses       8242     8233       -9     
+ Partials      109      105       -4
Impacted Files Coverage Δ
go/arrow/math/uint64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/int64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/int64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/uint64_amd64.go 33.33% <0%> (ø) ⬆️
cpp/src/arrow/csv/column-builder.cc 97.4% <0%> (+1.94%) ⬆️
go/arrow/math/math_amd64.go 36.84% <0%> (+5.26%) ⬆️
go/arrow/memory/memory_amd64.go 42.85% <0%> (+14.28%) ⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55848a3...b8d67cf.

@xhochy (Member) left a comment

LGTM except one small comment

integration/spark/Dockerfile (review thread; outdated, resolved)
@wesm (Member) commented Jan 5, 2019

Fails locally for me with:

/arrow/java /
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
/
Downloading Spark source archive to /build/spark...
/build/spark/spark-2.4.0 /
Building Spark with Arrow 
exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
######################################################################################### 100.0%
exec: curl --progress-bar -L https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
######################################################################################### 100.0%
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
exec: curl --progress-bar -L https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz
######################################################################################### 100.0%
Using `mvn` from path: /build/spark/spark-2.4.0/build/apache-maven-3.5.4/bin/mvn
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Testing Spark:
org.apache.spark.sql.execution.arrow
org.apache.spark.sql.execution.vectorized.ColumnarBatchSuite
org.apache.spark.sql.execution.vectorized.ArrowColumnVectorSuite
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Using `mvn` from path: /build/spark/spark-2.4.0/build/apache-maven-3.5.4/bin/mvn
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Testing PySpark:
pyspark.sql.tests.test_arrow ArrowTests
pyspark.sql.tests.test_pandas_udf PandasUDFTests
pyspark.sql.tests.test_pandas_udf_scalar ScalarPandasUDFTests
pyspark.sql.tests.test_pandas_udf_grouped_map GroupedMapPandasUDFTests
pyspark.sql.tests.test_pandas_udf_grouped_agg GroupedAggPandasUDFTests
pyspark.sql.tests.test_pandas_udf_window WindowPandasUDFTests
Traceback (most recent call last):
  File "./python/run-tests.py", line 70, in <module>
    raise Exception("Cannot find assembly build directory, please build Spark first.")
Exception: Cannot find assembly build directory, please build Spark first.

This Docker build pollutes the local git clone with files and directories owned by root. This has been an issue for me with other Docker builds. Is it possible to stop doing this? See https://issues.apache.org/jira/browse/ARROW-3078

@kszucs removed the "WIP: PR is work in progress" label Jan 7, 2019
@kszucs (Member, Author) commented Jan 7, 2019

@wesm So the Arrow source directory is only contaminated by Maven (the **/target directories). I tried to introduce an option to put the build artifacts in a separate directory (similar to how the C++, Python, and C GLib builds do it), but I guess that's bad practice and perhaps unnecessary given the platform-independent nature of Java.

@wesm (Member) commented Jan 7, 2019

Is it possible to build the Java library inside the container rather than using the local volume?

@kszucs (Member, Author) commented Jan 7, 2019

@wesm Sure, it's possible, but the two approaches serve different purposes.

I've optimized the current Dockerfiles to minimize the source-edit <=> build-and-test cycle. If we want to build Java or any other language implementation in the container, then the source must be ADDed in an earlier layer, and each code edit invalidates that layer, including all subsequent ones.

So my current approach is to build only the dependencies in the container, then mount Arrow's source to /arrow and a separate cache directory to /build to host the build artifacts, preventing contamination of the source directory. This strategy speeds up the cpp and python builds a lot, and it avoids having to rebuild the image for every change.

We could optimize the other way around and build the libraries within the container, but that approach is better suited for production containers. In this particular case, we could build Spark as a dependency in the container (image, actually :)), but then we'd either need to build Arrow before it (in a previous layer) or recompile afterwards, hoping that Maven can reuse the previous build artifacts. In the first case, an edit would trigger a full rebuild.

Theoretically, Docker volumes have no performance penalty on Linux hosts.

BTW it builds successfully on the DGX machine.
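
To make the layering strategy concrete, here is a minimal sketch of the kind of compose service described above; the service, volume, and path names are illustrative, not necessarily the exact ones in this PR:

# Dependencies are baked into the image; sources and build artifacts
# are mounted at run time, so source edits never invalidate a layer.
services:
  spark-integration:
    build:
      context: .
      dockerfile: integration/spark/Dockerfile   # installs dependencies only
    volumes:
      - .:/arrow:ro          # source tree, mounted read-only
      - build-cache:/build   # build artifacts live outside the source tree
volumes:
  build-cache: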


# Run pyarrow related Python tests only
echo "Testing PySpark:"
python/run-tests --modules pyspark-sql
@kszucs (Member, Author) commented Jan 7, 2019

The testnames option is only available after the 2.4.0 release.

A member commented:

Here is what I used to test pyarrow for all the related Python tests:

python/run-tests --testnames=pyspark.sql.tests.test_arrow,pyspark.sql.tests.test_pandas_udf,pyspark.sql.tests.test_pandas_udf_scalar,pyspark.sql.tests.test_pandas_udf_grouped_agg,pyspark.sql.tests.test_pandas_udf_grouped_map,pyspark.sql.tests.test_pandas_udf_window

@kszucs (Member, Author) commented:

Hey @BryanCutler! Could you please give me a Spark commit hash I should use to test Spark with Arrow 0.12?

@BryanCutler (Member) commented:

@kszucs I just merged in support for 0.12.0 here apache/spark@16990f9, so you could also use the master branch

@kszucs (Member, Author) commented:

Thanks @BryanCutler!

@wesm (Member) commented Jan 7, 2019

$ docker-compose build spark-integration
Building spark-integration
Traceback (most recent call last):
  File "bin/docker-compose", line 6, in <module>
  File "compose/cli/main.py", line 71, in main
  File "compose/cli/main.py", line 127, in perform_command
  File "compose/cli/main.py", line 287, in build
  File "compose/project.py", line 384, in build
  File "compose/project.py", line 366, in build_service
  File "compose/service.py", line 1082, in build
  File "site-packages/docker/api/build.py", line 154, in build
  File "site-packages/docker/utils/build.py", line 30, in tar
  File "site-packages/docker/utils/build.py", line 49, in exclude_paths
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  [Previous line repeated 1 more time]
  File "site-packages/docker/utils/build.py", line 184, in rec_walk
PermissionError: [Errno 13] Permission denied: '/home/wesm/code/arrow/java/format/target/generated-sources'
[11337] Failed to execute script docker-compose

@kszucs (Member, Author) commented Jan 7, 2019

I've never seen an error like that; the docker-compose build should not request access to arrow/java. Are you using the docker-compose.yml in Arrow's root?

Have you tried rebuilding the parent images?

$ docker-compose build cpp
$ docker-compose build python
$ docker-compose build spark-integration
$ docker-compose run spark-integration

@wesm (Member) commented Jan 7, 2019

$ docker-compose build cpp
Building cpp
Traceback (most recent call last):
  File "bin/docker-compose", line 6, in <module>
  File "compose/cli/main.py", line 71, in main
  File "compose/cli/main.py", line 127, in perform_command
  File "compose/cli/main.py", line 287, in build
  File "compose/project.py", line 384, in build
  File "compose/project.py", line 366, in build_service
  File "compose/service.py", line 1082, in build
  File "site-packages/docker/api/build.py", line 154, in build
  File "site-packages/docker/utils/build.py", line 30, in tar
  File "site-packages/docker/utils/build.py", line 49, in exclude_paths
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  [Previous line repeated 1 more time]
  File "site-packages/docker/utils/build.py", line 184, in rec_walk
PermissionError: [Errno 13] Permission denied: '/home/wesm/code/arrow/java/format/target/generated-sources'
[19391] Failed to execute script docker-compose

Executed from the root directory /home/wesm/code/arrow. You have sudo on this machine, so if you want to see for yourself, sudo su wesm and take a look.

@kszucs (Member, Author) commented Jan 7, 2019

I get it now: docker can't gather the build context (the Arrow files). I have two workarounds for it; I'll fix it tomorrow.

- .:/arrow:ro # ensures that docker won't contaminate the host directory
- alpine-cache:/build:delegated

volumes:
@kszucs (Member, Author) commented:

Switched to named volumes.
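
The piece the diff hunks don't show is the top-level declaration that makes a volume "named", i.e. Docker-managed rather than a host bind mount; a minimal sketch, reusing the alpine-cache name from the hunk above:

volumes:
  alpine-cache:   # lives in Docker-managed storage, so root-owned build
                  # artifacts never land in the host checkout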

@@ -21,13 +21,19 @@ version: '3.5'

x-ubuntu-volumes:
&ubuntu-volumes
- .:/arrow:delegated
- ${ARROW_DOCKER_CACHE_DIR:-./docker_cache}/ubuntu:/build:delegated
- .:/arrow:ro # ensures that docker won't contaminate the host directory
@kszucs (Member, Author) commented:

This brought up new issues.

@kszucs (Member, Author) commented:

Now the python build is failing because _generated_version.py can't be written to the source directory (setting write_to to False resolves that). So python setup.py build would work without contaminating the source directory.
The issue is that python setup.py install still tries to write to the source directory (no matter which options I use).
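
For context, pyarrow derives its version via setuptools_scm, whose write_to option points at a file inside the source tree; a minimal sketch of the mechanism being discussed (a generic setup.py, not pyarrow's actual one):

# setup.py (sketch)
from setuptools import setup

setup(
    name='example',
    # setuptools_scm writes the resolved version to this path at build
    # time, which fails when the source tree is mounted read-only.
    # Passing a falsy value skips the write.
    use_scm_version={'write_to': 'example/_generated_version.py'},
    setup_requires=['setuptools_scm'],
)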

A member commented:

Hm, can you rsync the files into container space and build there?

@kszucs (Member, Author) commented:

Yes. I don't like that solution, but I'll probably have to stick with it.

mkdir -p /build/java
rsync -a /arrow/header /build/java
rsync -a /arrow/java /build/java
rsync -a /arrow/format /build/java
@kszucs (Member, Author) commented Jan 8, 2019

I couldn't change Maven's mind to NOT contaminate the source directory, even by hacking the top-level pom.xml.
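
For reference, the standard knob for relocating Maven's build output is the <build><directory> element; a sketch of the kind of override that was presumably attempted (the arrow.build.dir property is hypothetical). In practice some plugins resolve ${basedir}/target directly, which would explain why this didn't fully work:

<!-- pom.xml (sketch): redirect build output out of the source tree -->
<build>
  <directory>${arrow.build.dir}/${project.artifactId}</directory>
</build>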

@kszucs (Member, Author) commented Jan 10, 2019

So I was finally able to reproduce the Arrow-related issues with Spark 2.4. A couple of test cases are failing (log is coming...)

@wesm (Member) commented Jan 23, 2019

@kszucs this needs a rebase. What is the status of this build?

@kszucs (Member, Author) commented Jan 24, 2019

I'm struggling with Maven to compile Spark. Before the release it compiled, but now I need to update the Scala Maven plugin.

@kszucs (Member, Author) commented Jan 25, 2019

Docker didn't provide enough memory to compile Spark, and the error was not descriptive...
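
(Besides raising Docker's memory limit, Spark's build documentation recommends giving Maven a larger JVM heap before building; the exact values here are illustrative:)

# give the Maven JVM enough heap and code cache for the Spark build
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
build/mvn -DskipTests package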

So I reproduced the previous PySpark test failures:

Traceback (most recent call last):
  File "/spark/spark-2.4.0/python/pyspark/sql/tests.py", line 5937, in test_column_order
    grouped_df.apply(invalid_positional_types).collect()
AssertionError: "No cast implemented" does not match "An error occurred while calling o4388.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 377, localhost, executor driver): org.apache.spark.api.python.PythonException:
 Traceback (most recent call last):
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 284, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 253, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 253, in <listcomp>
    arrs = [create_array(s, t) for s, t in series]
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 251, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "pyarrow/array.pxi", line 531, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 171, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: an integer is required (got type str)
/arrow/cpp/src/arrow/python/helpers.cc:196 code: CheckPyError()
/arrow/cpp/src/arrow/python/python_to_arrow.cc:137 code: internal::CIntFromPython(obj, &value)
/arrow/cpp/src/arrow/python/iterators.h:54 code: func(objects[i], i, &keep_going)
/arrow/cpp/src/arrow/python/python_to_arrow.cc:1010 code: converter->AppendMultipleMasked(seq, mask, size)

@kszucs (Member, Author) commented Jan 25, 2019

This is the full log

It would be best to merge this PR and decide on/fix the remaining issues in follow-up pull requests, mainly because this PR also helps prevent contamination of the source directory.

Please review @wesm @xhochy

@@ -105,8 +111,8 @@ services:
context: .
dockerfile: java/Dockerfile
volumes:
- .:/arrow:delegated
- $HOME/.m2:/root/.m2:delegated
- .:/arrow:ro # ensures that docker won't contaminate the host directory

@wesm (Member) commented Jan 25, 2019

Thanks @kszucs, yes, let's merge this and then look into the failure as a follow-up. @BryanCutler have you seen the above error?

@wesm (Member) left a comment

+1, merging now. I'm sorry this was such a slog. Can you open a JIRA about the follow-up fix?


# this is a nightmare, but prevents mutating the source directory
# which is bind mounted as readonly
A member commented:

Oof, sorry. We should report this issue to setuptools, since this is really crappy.

@BryanCutler (Member) commented:

Sorry I missed this PR. I have seen an error like the one above: Spark needs to be patched to work with Arrow 0.12.0. I'll try to do that soon and run this again.

@BryanCutler (Member) commented:

Spark updates for the 0.12.0 patch are here: apache/spark#23657
