
ARROW-3367: [INTEGRATION] Port Spark integration test to the docker-compose setup #3300

Closed · wants to merge 26 commits

Conversation

@kszucs (Member) commented Jan 3, 2019

This incorporates the fixes from #3254
Running it on a local "Docker for Mac" takes ages because of its volume virtualization and Maven's excessive IO usage, so currently I need to run it on Travis...

Crossbow builds here

@kszucs added the "WIP: PR is work in progress" label Jan 3, 2019
@codecov-io commented Jan 3, 2019

Codecov Report

Merging #3300 into master will increase coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3300      +/-   ##
==========================================
+ Coverage   88.56%   88.58%   +0.01%     
==========================================
  Files         546      546              
  Lines       73059    73059              
==========================================
+ Hits        64708    64721      +13     
+ Misses       8242     8233       -9     
+ Partials      109      105       -4
Impacted Files Coverage Δ
go/arrow/math/uint64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/int64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/int64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/uint64_amd64.go 33.33% <0%> (ø) ⬆️
cpp/src/arrow/csv/column-builder.cc 97.4% <0%> (+1.94%) ⬆️
go/arrow/math/math_amd64.go 36.84% <0%> (+5.26%) ⬆️
go/arrow/memory/memory_amd64.go 42.85% <0%> (+14.28%) ⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55848a3...b8d67cf.

@xhochy (Member) left a comment

LGTM except one small comment

integration/spark/Dockerfile (review thread; outdated, resolved)
@wesm (Member) commented Jan 5, 2019

Fails locally for me with:

/arrow/java /
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
/
Downloading Spark source archive to /build/spark...
/build/spark/spark-2.4.0 /
Building Spark with Arrow 
exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
######################################################################################### 100.0%
exec: curl --progress-bar -L https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
######################################################################################### 100.0%
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
exec: curl --progress-bar -L https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz
######################################################################################### 100.0%
Using `mvn` from path: /build/spark/spark-2.4.0/build/apache-maven-3.5.4/bin/mvn
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Testing Spark:
org.apache.spark.sql.execution.arrow
org.apache.spark.sql.execution.vectorized.ColumnarBatchSuite
org.apache.spark.sql.execution.vectorized.ArrowColumnVectorSuite
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Using `mvn` from path: /build/spark/spark-2.4.0/build/apache-maven-3.5.4/bin/mvn
Unrecognized option: -q
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Testing PySpark:
pyspark.sql.tests.test_arrow ArrowTests
pyspark.sql.tests.test_pandas_udf PandasUDFTests
pyspark.sql.tests.test_pandas_udf_scalar ScalarPandasUDFTests
pyspark.sql.tests.test_pandas_udf_grouped_map GroupedMapPandasUDFTests
pyspark.sql.tests.test_pandas_udf_grouped_agg GroupedAggPandasUDFTests
pyspark.sql.tests.test_pandas_udf_window WindowPandasUDFTests
Traceback (most recent call last):
  File "./python/run-tests.py", line 70, in <module>
    raise Exception("Cannot find assembly build directory, please build Spark first.")
Exception: Cannot find assembly build directory, please build Spark first.

This Docker build pollutes the local git clone with files and directories owned by root. This has been an issue for me with other Docker builds. Is it possible to stop doing this? See https://issues.apache.org/jira/browse/ARROW-3078

@kszucs removed the "WIP: PR is work in progress" label Jan 7, 2019
@kszucs (Member, Author) commented Jan 7, 2019

@wesm So the Arrow source directory is only contaminated by Maven (the **/target directories). I tried to introduce an option to put the build artifacts in a separate directory (similar to how the C++, Python, and C GLib builds do it), but I guess that's bad practice and perhaps unnecessary given the platform-independent nature of Java.

@wesm (Member) commented Jan 7, 2019

Is it possible to build the Java library inside the container rather than using the local volume?

@kszucs (Member, Author) commented Jan 7, 2019

@wesm Sure, it's possible, but the two approaches serve different purposes.

I've optimized the current Dockerfiles to minimize the source-edit <=> build-and-test cycle. If we want to build Java or any other language implementation in the container, then the source must be ADDed in an earlier layer, and each code edit invalidates that layer, including all subsequent ones.

So my current approach is to build only the dependencies in the container, then mount Arrow's source to /arrow and a separate cache directory to /build to host the build artifacts, preventing contamination of the source directory. This strategy speeds up the cpp and python builds a lot, and it avoids having to rebuild the image for every change.

We could optimize the other way around and build the libraries within the container, but that approach is better suited for production containers. In this particular case, we could build Spark as a dependency in the container (image, actually :)), but then we'd either need to build Arrow before it (in a previous layer) or recompile afterwards, hoping that Maven can reuse the previous build artifacts. In the first case, an edit would trigger a full rebuild.

Theoretically, Docker volumes have no performance penalty on Linux hosts.

BTW it builds successfully on the DGX machine.
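
To make the layering strategy concrete, here is a minimal sketch of the kind of compose service described above; the service, volume, and path names are illustrative, not necessarily the exact ones in this PR:

# Dependencies are baked into the image; sources and build artifacts
# are mounted at run time, so source edits never invalidate a layer.
services:
  spark-integration:
    build:
      context: .
      dockerfile: integration/spark/Dockerfile   # installs dependencies only
    volumes:
      - .:/arrow:ro          # source tree, mounted read-only
      - build-cache:/build   # build artifacts live outside the source tree
volumes:
  build-cache: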


# Run pyarrow related Python tests only
echo "Testing PySpark:"
python/run-tests --modules pyspark-sql
@kszucs (Member, Author) commented Jan 7, 2019

The testnames option is only available after the 2.4.0 release.

A member commented:

Here is what I used to test pyarrow for all the related Python tests:

python/run-tests --testnames=pyspark.sql.tests.test_arrow,pyspark.sql.tests.test_pandas_udf,pyspark.sql.tests.test_pandas_udf_scalar,pyspark.sql.tests.test_pandas_udf_grouped_agg,pyspark.sql.tests.test_pandas_udf_grouped_map,pyspark.sql.tests.test_pandas_udf_window

@kszucs (Member, Author) commented:

Hey @BryanCutler! Could you please give me a Spark commit hash I should use to test Spark with Arrow 0.12?

@BryanCutler (Member) commented:

@kszucs I just merged in support for 0.12.0 here apache/spark@16990f9, so you could also use the master branch

@kszucs (Member, Author) commented:

Thanks @BryanCutler!

@wesm (Member) commented Jan 7, 2019

$ docker-compose build spark-integration
Building spark-integration
Traceback (most recent call last):
  File "bin/docker-compose", line 6, in <module>
  File "compose/cli/main.py", line 71, in main
  File "compose/cli/main.py", line 127, in perform_command
  File "compose/cli/main.py", line 287, in build
  File "compose/project.py", line 384, in build
  File "compose/project.py", line 366, in build_service
  File "compose/service.py", line 1082, in build
  File "site-packages/docker/api/build.py", line 154, in build
  File "site-packages/docker/utils/build.py", line 30, in tar
  File "site-packages/docker/utils/build.py", line 49, in exclude_paths
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  [Previous line repeated 1 more time]
  File "site-packages/docker/utils/build.py", line 184, in rec_walk
PermissionError: [Errno 13] Permission denied: '/home/wesm/code/arrow/java/format/target/generated-sources'
[11337] Failed to execute script docker-compose

@kszucs (Member, Author) commented Jan 7, 2019

I've never seen an error like that; the docker-compose build should not request access to arrow/java. Are you using the docker-compose.yml in Arrow's root?

Have you tried rebuilding the parent images?

$ docker-compose build cpp
$ docker-compose build python
$ docker-compose build spark-integration
$ docker-compose run spark-integration

@wesm (Member) commented Jan 7, 2019

$ docker-compose build cpp
Building cpp
Traceback (most recent call last):
  File "bin/docker-compose", line 6, in <module>
  File "compose/cli/main.py", line 71, in main
  File "compose/cli/main.py", line 127, in perform_command
  File "compose/cli/main.py", line 287, in build
  File "compose/project.py", line 384, in build
  File "compose/project.py", line 366, in build_service
  File "compose/service.py", line 1082, in build
  File "site-packages/docker/api/build.py", line 154, in build
  File "site-packages/docker/utils/build.py", line 30, in tar
  File "site-packages/docker/utils/build.py", line 49, in exclude_paths
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  File "site-packages/docker/utils/build.py", line 214, in rec_walk
  [Previous line repeated 1 more time]
  File "site-packages/docker/utils/build.py", line 184, in rec_walk
PermissionError: [Errno 13] Permission denied: '/home/wesm/code/arrow/java/format/target/generated-sources'
[19391] Failed to execute script docker-compose

Executed from the root directory /home/wesm/code/arrow. You have sudo on this machine, so if you want to see for yourself, sudo su wesm and take a look.

@kszucs (Member, Author) commented Jan 7, 2019

I get it now: docker can't gather the build context (the Arrow files). I have two workarounds for it; I'll fix it tomorrow.

- .:/arrow:ro # ensures that docker won't contaminate the host directory
- alpine-cache:/build:delegated

volumes:
@kszucs (Member, Author) commented:

Switched to named volumes.
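
The piece the diff hunks don't show is the top-level declaration that makes a volume "named", i.e. Docker-managed rather than a host bind mount; a minimal sketch, reusing the alpine-cache name from the hunk above:

volumes:
  alpine-cache:   # lives in Docker-managed storage, so root-owned build
                  # artifacts never land in the host checkout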

@@ -21,13 +21,19 @@ version: '3.5'

x-ubuntu-volumes:
&ubuntu-volumes
- .:/arrow:delegated
- ${ARROW_DOCKER_CACHE_DIR:-./docker_cache}/ubuntu:/build:delegated
- .:/arrow:ro # ensures that docker won't contaminate the host directory
@kszucs (Member, Author) commented:

This brought up new issues.

@kszucs (Member, Author) commented:

Now the python build is failing because _generated_version.py can't be written to the source directory (setting write_to to False resolves that). So python setup.py build would work without contaminating the source directory.
The issue is that python setup.py install still tries to write to the source directory (no matter which options I use).
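
For context, pyarrow derives its version via setuptools_scm, whose write_to option points at a file inside the source tree; a minimal sketch of the mechanism being discussed (a generic setup.py, not pyarrow's actual one):

# setup.py (sketch)
from setuptools import setup

setup(
    name='example',
    # setuptools_scm writes the resolved version to this path at build
    # time, which fails when the source tree is mounted read-only.
    # Passing a falsy value skips the write.
    use_scm_version={'write_to': 'example/_generated_version.py'},
    setup_requires=['setuptools_scm'],
)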

A member commented:

Hm, can you rsync the files into container space and build there?

@kszucs (Member, Author) commented:

Yes. I don't like that solution, but I'll probably have to stick with it.

mkdir -p /build/java
rsync -a /arrow/header /build/java
rsync -a /arrow/java /build/java
rsync -a /arrow/format /build/java
@kszucs (Member, Author) commented Jan 8, 2019

I couldn't change Maven's mind to NOT contaminate the source directory, even by hacking the top-level pom.xml.
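
For reference, the standard knob for relocating Maven's build output is the <build><directory> element; a sketch of the kind of override that was presumably attempted (the arrow.build.dir property is hypothetical). In practice some plugins resolve ${basedir}/target directly, which would explain why this didn't fully work:

<!-- pom.xml (sketch): redirect build output out of the source tree -->
<build>
  <directory>${arrow.build.dir}/${project.artifactId}</directory>
</build>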

@kszucs (Member, Author) commented Jan 10, 2019

So I was finally able to reproduce the Arrow-related issues with Spark 2.4. A couple of test cases are failing (log is coming...)

@wesm (Member) commented Jan 23, 2019

@kszucs this needs a rebase. What is the status of this build?

@kszucs (Member, Author) commented Jan 24, 2019

I'm struggling with Maven to compile Spark. Before the release it compiled, but now I need to update the Scala Maven plugin.

@kszucs (Member, Author) commented Jan 25, 2019

Docker didn't provide enough memory to compile Spark, and the error was not descriptive...
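
(Besides raising Docker's memory limit, Spark's build documentation recommends giving Maven a larger JVM heap before building; the exact values here are illustrative:)

# give the Maven JVM enough heap and code cache for the Spark build
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
build/mvn -DskipTests package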

So I reproduced the previous PySpark test failures:

Traceback (most recent call last):
  File "/spark/spark-2.4.0/python/pyspark/sql/tests.py", line 5937, in test_column_order
    grouped_df.apply(invalid_positional_types).collect()
AssertionError: "No cast implemented" does not match "An error occurred while calling o4388.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 377, localhost, executor driver): org.apache.spark.api.python.PythonException:
 Traceback (most recent call last):
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 284, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 253, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 253, in <listcomp>
    arrs = [create_array(s, t) for s, t in series]
  File "/spark/spark-2.4.0/python/lib/pyspark.zip/pyspark/serializers.py", line 251, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "pyarrow/array.pxi", line 531, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 171, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: an integer is required (got type str)
/arrow/cpp/src/arrow/python/helpers.cc:196 code: CheckPyError()
/arrow/cpp/src/arrow/python/python_to_arrow.cc:137 code: internal::CIntFromPython(obj, &value)
/arrow/cpp/src/arrow/python/iterators.h:54 code: func(objects[i], i, &keep_going)
/arrow/cpp/src/arrow/python/python_to_arrow.cc:1010 code: converter->AppendMultipleMasked(seq, mask, size)

@kszucs (Member, Author) commented Jan 25, 2019

This is the full log

It would be best to merge this PR and decide on/fix the remaining issues in follow-up pull requests, mainly because this PR also helps prevent contamination of the source directory.

Please review @wesm @xhochy

@@ -105,8 +111,8 @@ services:
context: .
dockerfile: java/Dockerfile
volumes:
- .:/arrow:delegated
- $HOME/.m2:/root/.m2:delegated
- .:/arrow:ro # ensures that docker won't contaminate the host directory

@wesm (Member) commented Jan 25, 2019

Thanks @kszucs, yes, let's merge this and then look into the failure as a follow-up. @BryanCutler have you seen the above error?

@wesm (Member) left a comment

+1, merging now. I'm sorry this was such a slog. Can you open a JIRA about the follow-up fix?


# this is a nightmare, but prevents mutating the source directory
# which is bind mounted as readonly
A member commented:

Oof, sorry. We should report this issue to setuptools, since this is really crappy.

@BryanCutler (Member) commented:

Sorry I missed this PR. I have seen an error like the one above: Spark needs to be patched to work with Arrow 0.12.0. I'll try to do that soon and run this again.

@BryanCutler (Member) commented:

Spark updates for the 0.12.0 patch are here: apache/spark#23657
