[SPARK-13534][PySpark] Using Apache Arrow to increase performance of DataFrame.toPandas #15821

Closed · wants to merge 59 commits

Conversation

@BryanCutler (Contributor) commented Nov 9, 2016

What changes were proposed in this pull request?

Integrate Apache Arrow with Spark to increase the performance of DataFrame.toPandas. This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame can then collect the Arrow payloads, combine them, and convert them to a Pandas DataFrame. All non-complex data types are currently supported; otherwise an UnsupportedOperation exception is thrown.

Additions to Spark include a Scala package-private method Dataset.toArrowPayloadBytes that converts data partitions in the executor JVM to ArrowPayloads as byte arrays so they can be easily served, and a package-private class/object ArrowConverters that provides data type mappings and conversion routines. In Python, a public method DataFrame.collectAsArrow is added to collect Arrow payloads, and an optional flag in toPandas(useArrow=False) enables using Arrow (the old conversion is used by default).
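For illustration, a minimal usage sketch of the API described above (the useArrow flag and its default are as proposed in this PR and may change during review):

```python
# Minimal usage sketch of the proposed API; names and defaults are provisional.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-sketch").getOrCreate()
df = spark.range(1 << 20).selectExpr("id", "cast(id as double) as value")

pdf_default = df.toPandas()             # existing row-based conversion
pdf_arrow = df.toPandas(useArrow=True)  # proposed Arrow-based conversion path

assert pdf_default.equals(pdf_arrow)
```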

How was this patch tested?

Added a new test suite, ArrowConvertersSuite, that runs tests on conversion of Datasets to Arrow payloads for supported types. The suite generates a Dataset and matching Arrow JSON data, converts the Dataset to an Arrow payload, and finally validates it against the JSON data. This ensures that the schema and data have been converted correctly.

Added PySpark tests to verify that the toPandas method produces equal DataFrames with and without pyarrow. A roundtrip test ensures that the pandas DataFrame produced by pyspark is equal to one made directly with pandas.
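A rough sketch of the kind of roundtrip check described above (hypothetical; the actual tests live in the PySpark test suite and differ in detail):

```python
# Hypothetical roundtrip check: a pandas DataFrame pushed through Spark and
# collected back via the Arrow path should match both the non-Arrow result
# and the original frame.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-roundtrip-sketch").getOrCreate()

pdf_expected = pd.DataFrame({"a": list(range(100)),
                             "b": [float(i) for i in range(100)],
                             "c": ["row_%d" % i for i in range(100)]})

df = spark.createDataFrame(pdf_expected)
pdf_plain = df.toPandas()               # without Arrow
pdf_arrow = df.toPandas(useArrow=True)  # with Arrow (proposed flag)

pd.testing.assert_frame_equal(pdf_plain, pdf_arrow)
pd.testing.assert_frame_equal(pdf_arrow, pdf_expected)
```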

@SparkQA commented Nov 9, 2016

Test build #68381 has finished for PR 15821 at commit 4227ec6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 9, 2016

Test build #68425 has finished for PR 15821 at commit 3f855ec.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 9, 2016

Test build #68427 has finished for PR 15821 at commit b06e11f.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 17, 2016

Test build #68806 has finished for PR 15821 at commit 053e3a6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 18, 2016

Test build #68812 has finished for PR 15821 at commit 9191b96.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 21, 2016

Test build #68954 has finished for PR 15821 at commit 9191b96.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/pyspark/sql/dataframe.py
@@ -1508,7 +1518,7 @@ def toDF(self, *cols):
return DataFrame(jdf, self.sql_ctx)
@since(1.3)
- def toPandas(self):
+ def toPandas(self, useArrow=True):

@holdenk (Contributor) commented Nov 22, 2016

Would it maybe make more sense to default this to false or have more thorough checking that the dataframe being written with arrow is supported? At least initially the set of supported dataframes might be rather small.


@BryanCutler (Contributor) commented Nov 23, 2016

Hey @holdenk, I just had this in to do my own testing and hadn't thought about keeping the option, but if we do keep it then yeah you're right, it would be better to default to the original way.


@mariusvniekerk (Contributor) commented Nov 30, 2016

So this is very cool stuff.

Would it be reasonable to add some API pieces so that, on the Python side, things like DataFrame.mapPartitions make use of Apache Arrow to lower the serialization costs? Or is that more a follow-on piece of work?


@holdenk (Contributor) commented Nov 30, 2016

@mariusvniekerk I think just getting this working for the local connection is going to be hard, so breaking up using Arrow on the driver side into a separate follow-up piece of work would make sense.


@BryanCutler (Contributor) commented Nov 30, 2016

Thanks @mariusvniekerk; as @holdenk said, we are going to try to get something basic working first, and after we show some performance improvement we can follow up with more things.


@wesm (Member) commented Nov 30, 2016

Luckily we are on the home stretch for making the Java and C++ libraries binary compatible -- e.g. I'm working on automated testing today: apache/arrow#219


@wesm (Member) commented Dec 1, 2016

@BryanCutler I'm working with @icexelloss on my end to get involved in this. We were going to start working on unit tests to validate converting each of the Spark SQL data types to Arrow format while the Arrow Java-C++ compatibility work progresses, but we don't want to duplicate any efforts if you've started on this. Perhaps we can create an integration branch someplace to make pull requests into, since it will probably take a while until this patch gets accepted into Spark?


@wesm (Member) commented Dec 1, 2016

Related to this, we'll also want to be able to precisely instrument and benchmark the Dataset <-> Arrow conversion -- @icexelloss suggested we might be able to push the conversion down into the executors instead of doing all the work in the driver, but I'm not sure how feasible that is.


@BryanCutler (Contributor) commented Dec 1, 2016

Hi @wesm and @icexelloss, that sounds good on our end. @yinxusen has been working on validating some basic conversion so far, but everything is still very preliminary, so it would be great to work with you guys. I'll set up a new integration branch and ping you all when ready.

Related to this, we'll also want to be able to precisely instrument and benchmark the Dataset <-> Arrow conversion -- @icexelloss suggested we might be able to push the conversion down into the executors instead of doing all the work in the driver, but I'm not sure how feasible that is.

We were thinking about that too, as it would be more ideal. For simplicity we decided to first do the conversion on the driver side, which should hopefully still show a performance increase, then follow up with some work to better optimize it.


@icexelloss (Contributor) commented Dec 2, 2016

@BryanCutler , I have been working based on your branch here:
https://github.com/BryanCutler/spark/tree/wip-toPandas_with_arrow-SPARK-13534

Is this the right one?


@BryanCutler (Contributor) commented Dec 2, 2016

@icexelloss, @wesm I branched off here for us to integrate our changes https://github.com/BryanCutler/spark/tree/arrow-integration
cc @yinxusen


@wesm (Member) commented Dec 2, 2016

OK, let's open pull requests into that branch to help avoid stepping on each other's toes. Thank you.


@wesm (Member) commented Jan 18, 2017

Shall we update this PR to the latest and solicit involvement from Spark committers?


@BryanCutler (Contributor) commented Jan 19, 2017

Shall we update this PR to the latest and solicit involvement from Spark committers?

Yeah, I think it's about ready for that. After we integrate the latest changes, I'll go over once more for some minor cleanup and update this. Probably in the next day or so.


@icexelloss (Contributor) commented Jan 23, 2017

@BryanCutler (Contributor) commented Jan 23, 2017

@SparkQA commented Jan 24, 2017

Test build #71950 has finished for PR 15821 at commit 9bb75de.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Contributor) commented Jan 24, 2017

This has been updated after integrating changes made with @icexelloss and @wesm. There has been good progress made and it would be great if others could take a look and review/test this out.

The current state of toPandas() with Arrow has support for Datasets with primitive, string, and timestamp data types. Complex types such as Struct, Array, and Map are not yet supported but are a work in progress. There is a suite of tests in Scala to test Dataset -> ArrowRecordBatch conversion and a collection of JSON files that serve to validate that the converted data is correct. Also added PySpark tests to verify the Pandas DataFrame is correct. It is compiled with the current Arrow master 0.1.1-SNAPSHOT at commit apache/arrow@7d3e2a3.
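As a hypothetical illustration of that type coverage (the exact exception for unsupported types follows the UnsupportedOperation behavior described in the PR summary; names here are assumptions for the example):

```python
# Hypothetical illustration of the current type support described above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("arrow-type-support-sketch").getOrCreate()

supported = spark.range(10).select(
    F.col("id"),                            # long (primitive)
    F.col("id").cast("string").alias("s"),  # string
    F.current_timestamp().alias("ts"),      # timestamp
)
supported.toPandas(useArrow=True)           # primitives, strings, timestamps work

unsupported = spark.range(10).select(F.array("id").alias("arr"))  # complex type
unsupported.toPandas(useArrow=True)         # expected to raise for complex types
```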

The performance so far shows a significant increase, and I will follow up with a script to run and details of the results seen. Please ping me with any questions on setting up the build of Arrow or running the benchmarks. It would be great if this could be considered for Spark 2.2, as Arrow 0.2 will be released soon and will support the functionality used here.

@holdenk @davies @rxin, I would love to hear your thoughts on this so far. Thanks!
Also cc'ing some on the watch list, @mariusvniekerk @zjffdu @nchammas @zero323 @rdblue


@BryanCutler (Contributor) commented Jan 25, 2017

Old Benchmarks with Conversion on Driver

Here are some rough benchmarks done locally on a machine with 16 GB memory and 8 cores, using Spark config defaults, taken from 50 trials of calling toPandas() and measuring wall time in seconds with and without Arrow enabled:

1mm Longs

13.52x speedup on average

        With Arrow    Without Arrow
count    50.000000        50.000000
mean      0.190573         2.576587
std       0.078450         0.114455
min       0.139911         2.259916
25%       0.148212         2.516289
50%       0.163769         2.555433
75%       0.184402         2.631316
max       0.518090         2.946415

1mm Doubles

8.07x speedup on average

        With Arrow    Without Arrow
count    50.000000        50.000000
mean      0.259145         2.090295
std       0.069620         0.123091
min       0.196666         1.998588
25%       0.209051         2.015083
50%       0.230751         2.032701
75%       0.270519         2.122219
max       0.439556         2.485232

Script to generate these can be found here
Happy to run more if there is interest.
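For reference, a rough sketch of this style of benchmark (not the linked script; the sizes and the useArrow flag follow this PR's description, so treat it as illustrative):

```python
# Rough benchmark sketch: time toPandas() with and without the proposed Arrow
# path over repeated trials and summarize with pandas' describe().
import time

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("topandas-benchmark-sketch").getOrCreate()
df = spark.range(1 << 20).selectExpr("id as longs")  # ~1mm longs
df.cache().count()                                   # materialize before timing

def trial(use_arrow):
    start = time.time()
    df.toPandas(useArrow=use_arrow)
    return time.time() - start

results = pd.DataFrame({
    "With Arrow": [trial(True) for _ in range(50)],
    "Without Arrow": [trial(False) for _ in range(50)],
})
print(results.describe())
```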


@holdenk (Contributor) commented Jan 25, 2017

On a personal note, those benchmarks certainly look very exciting (<3 max of with arrow less than min of without arrow) :)

It certainly seems it would probably be worth the review bandwidth to start looking this over, but since this is pretty big and adds a new dependency, this could take a while to move forward.

It would be great to hear what the other Python focused committers (maybe @davies ?) think of this approach :)


@leifwalsh commented Jan 25, 2017

The next iteration of this for perf would likely involve generating the arrow batches on executors and having the driver use the new streaming arrow format to just forward this to python. In our experiments, assembling arrays of internal rows dominates time, then transposing them and forming an arrow record batch is pretty quick. If we can do that work in parallel on the executors, we're likely to get another big win.


@wesm (Member) commented Jan 25, 2017

Very nice to see the improved wall clock times. I have been busy engineering the pipeline between the byte stream from Spark and the resulting DataFrame -- the only major thing still left on the table that might help is converting strings in C++ to pandas.Categorical rather than returning a dense array of strings.

I'll review this patch in more detail when I can.

I'll do a bit of performance analysis (esp. on the Python side) and flesh out some of the architectural next steps (e.g. what @leifwalsh has described) in advance of Spark Summit in a couple weeks. Parallelizing the record batch conversion and streaming it to Python would be another significant perf win. Having these tools should also be helpful for speeding up UDF evaluation.


@BryanCutler (Contributor) commented Jan 25, 2017

Parallelizing the record batch conversion and streaming it to Python would be another significant perf win.

Right, I should have also mentioned that this PR takes a simplistic approach and collects rows to the driver, where all the conversion is done. Offloading this to the executors should boost the performance more.


@wesm

Some minor comments / design questions. The serialization code itself is clean and should be easy to extend when we want to support more data types: especially decimals and nested types.

I'd like to see ARROW-477 merged before the Arrow 0.2 release, but it's not strictly necessary. I'm going to nudge along that discussion on the ML so we can be on target to have up-to-date Maven artifacts for 0.2 so this will build nicely out of the box.

Are there any packaging / library conflicts to be aware of that we should address in Arrow before the release?

+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+
+object Arrow {

@wesm (Member) commented Jan 26, 2017

Perhaps a more descriptive name, like ArrowConverters?


+private[sql] class BinaryColumnWriter(allocator: BaseAllocator)
+ extends PrimitiveColumnWriter(allocator) {
+ override protected val valueVector: NullableVarBinaryVector
+ = new NullableVarBinaryVector("UTF8StringValue", allocator)

@wesm (Member) commented Jan 26, 2017

"BinaryValue"


+
+ override protected def setNull(): Unit = valueMutator.setNull(count)
+ override protected def setValue(row: InternalRow, ordinal: Int): Unit = {
+ valueMutator.setSafe(count, row.getInt(ordinal).toLong * 24 * 3600 * 1000)

@wesm (Member) commented Jan 26, 2017

Can you add a comment explaining the difference between the value representations (date ordinal vs. Joda-compatible milliseconds timestamp)?


+ override protected def setNull(): Unit = valueMutator.setNull(count)
+
+ override protected def setValue(row: InternalRow, ordinal: Int): Unit = {
+ valueMutator.setSafe(count, row.getLong(ordinal) / 1000)

@wesm (Member) commented Jan 26, 2017

Can you add a TODO that we should migrate to microsecond timestamps once ARROW-477 is in?


import scala.collection.JavaConverters._
import scala.language.implicitConversions
import scala.reflect.runtime.universe.TypeTag
import scala.util.control.NonFatal
+import org.apache.arrow.memory.RootAllocator
+import org.apache.arrow.vector.file.ArrowWriter
+import org.apache.arrow.vector.schema.ArrowRecordBatch

@wesm (Member) commented Jan 26, 2017

It may be more palatable to encapsulate these details in a class that is internal to Spark


+ */
+ @DeveloperApi
+ def collectAsArrow(
+ allocator: RootAllocator = new RootAllocator(Long.MaxValue)): ArrowRecordBatch = {

@wesm (Member) commented Jan 26, 2017

The encapsulated return value (particularly when/if we are able to perform the Array[InternalRow] conversion on the task executors, should ideally be a collection of row batches. But the internal structure of the return value should preferably be hidden from interface here. A sort of ArrowPayload data structure that we can put on the wire later


@wesm (Member) commented Jan 26, 2017

When we are able to adopt a streaming model, it might be useful for the Arrow payload data structure to be iterator based, so we can begin serving record batches to Python as soon as they begin arriving from the task executors


+
+ withNewExecutionId {
+ PythonRDD.serveIterator(Iterator(out.toByteArray), "serve-Arrow")
+ }

@wesm (Member) commented Jan 26, 2017

The mechanics of taking an ArrowPayload (as described above) and serving it to Python could be pushed down into either Arrow.scala (whatever that ends up being called) or in PythonRDD. e.g.

val payload = collectAsArrow()
payload.serveToPython()

+ ]
+ }
+ ]
+}

@wesm (Member) commented Jan 26, 2017

In the future, you may wish to generate these files rather than storing them in source control. The particulars of the Arrow JSON format for integration testing could change


+private[sql] case class DoubleData(i: Int, a_d: Double, b_d: Option[Double])
+
+
+class ArrowSuite extends SharedSQLContext {

@wesm (Member) commented Jan 26, 2017

Rename to match the new name (if any) for Arrow.scala


@BryanCutler (Contributor) commented Jan 27, 2017

Thanks for the review @wesm! Those are good ideas, I'll work on an update to this. As for the packaging conflicts, I had to add exclusions for com.fasterxml.jackson.core:jackson-annotations, jackson-databind and org.slf4j:log4j-over-slf4j. These don't seem too critical to me and it was a while ago so I'll double check if they are still a problem.

  • edit - these exclusions are still needed at this time

@wesm (Member) commented Jan 30, 2017

The conda-forge pyarrow package is now up to date with latest Arrow git master


@SparkQA commented Jan 31, 2017

Test build #72180 has started for PR 15821 at commit b35192c.

@zero323 (Contributor) commented Feb 5, 2017

This looks amazing and I can't help but wonder - if the next step is generating the arrow batches on executors is it possible to reuse this to pass data between JVM and Python UDFs? Right now, with ongoing shift towards SQL API, this is the prominent bottleneck for Python users. An improvement there could be a huge win.


@wesm (Member) commented Feb 5, 2017

@zero323 that is exactly the plan. It's a bit complicated though because the Python UDF code path handles arbitrary iterators, not just Array[InternalRow]


@mariusvniekerk (Contributor) commented Feb 5, 2017

Probably a good thing to look at is the R pieces since that is effectively constrained to InternalRow


+ = new NullableBitVector("BooleanValue", allocator)
+ override protected val valueMutator: NullableBitVector#Mutator = valueVector.getMutator
+
+ override def setNull(): Unit = valueMutator.setNull(count)

@viirya (Contributor) commented Feb 15, 2017

setNull and setValue should be protected.


+ count += 1
+ }
+
+ override def finish(): (Seq[ArrowFieldNode], Seq[ArrowBuf]) = {

@viirya (Contributor) commented Feb 15, 2017

Do we need to prevent a ColumnWriter has been written after finish is called?


+ val payloadBytes = ArrowConverters.payloadToByteArray(payload, this.schema)
+
+ withNewExecutionId {
+ PythonRDD.serveIterator(Iterator(payloadBytes), "serve-Arrow")

@viirya (Contributor) commented Feb 15, 2017

Can we convert dataset to payload without collecting them back to driver?


@BryanCutler (Contributor) commented Feb 15, 2017

Thanks @viirya for the review; I'm working on another update that will do the conversion on the executors.


@wesm (Member) commented Feb 22, 2017

The 0.2 Maven artifacts have been posted. I'll try to update the conda-forge packages this week -- if anyone can help with conda-forge maintenance that would be a big help.

Thanks!


@srowen (Member) commented Jun 26, 2017

CC @cloud-fan @BryanCutler: is there an easy fix, or do we need to revert this temporarily? It's failing the builds.


@cloud-fan (Contributor) commented Jun 26, 2017

@JoshRosen do we have a jenkins setup script like https://github.com/apache/arrow/blob/master/ci/travis_install_conda.sh#L37 ?

I think we need to make conda up to date and increase the timeout.


@wesm (Member) commented Jun 26, 2017

@srowen @cloud-fan adding the steps from https://github.com/apache/arrow/blob/master/ci/travis_install_conda.sh that update conda to the latest version and increasing the SSL timeout should fix the problem. If it does not I can take a closer look


@holdenk (Contributor) commented Jun 26, 2017

If it's still failing builds we should revert, fix the issue, and re-merge once it's fixed.


@shivaram (Contributor) commented Jun 26, 2017

@shaneknapp (Contributor) commented Jun 26, 2017

hmm. currently thinking about this. thanks for the ping, shiv.


@BryanCutler (Contributor) commented Jun 26, 2017

@shaneknapp (Contributor) commented Jun 26, 2017

@BryanCutler -- i'm ok w/holding out to discuss this in more detail tomorrow. in the meantime, i'll look over this PR and build failures and get myself up to speed w/what's going on.


@shaneknapp (Contributor) commented Jun 26, 2017

ok, @JoshRosen and i will bang our respective heads against this in about an hour. we should be able to figure something out pretty quick.


@holdenk (Contributor) commented Jun 26, 2017

@shaneknapp let me know if you want some help poking at Jenkins.


@shaneknapp (Contributor) commented Jun 26, 2017

@holdenk yeah, another set of eyes would be great! i haven't actually touched the test infra code in a long time and i'm currently wrapping my brain around the order of operations that run-pip-tests goes through in conjunction w/everything else.

i have a feeling that the chain of scripts (run-tests-jenkins -> run-tests-jenkins.py -> run-tests -> run-pip-tests) besides being confusing for humans (ie: me), is also fragile WRT conda envs (aka munging PATH) in our environment.

would installing pyarrow 0.4.0 in the py3k conda env fix things? if so, i can bang that out in moments.


@holdenk (Contributor) commented Jun 26, 2017

@shaneknapp it might, assuming the Conda cache is shared it should avoid needing to fetch the package. I'm not super sure but I think we might have better luck updating conda on the jenkins machines (if people are ok with that) since it seems like this is probably from an out of date conda.


@MaheshIBM commented Jun 27, 2017

This does not seem like a timeout issue; the certificate CN and what is used as the hostname do not match. So clearly the client downloads the certificate but is not able to verify it (no timeout). If anything, it may be possible to configure the code/command to ignore SSL cert errors.

At this point I checked that there is no host with the hostname below, so clearly some cert on the Anaconda side is not set up properly. If I get the hostname from which the package download is attempted, I can investigate more.

ping conda.binstart.org
ping: unknown host conda.binstart.org

For troubleshooting from the problematic host, you can try using openssl to verify the certs, below is a sample of a successful negotiation.

openssl s_client -showcerts -connect anaconda.org:443

Lot of output here.
Server certificate
subject=/C=US/postalCode=78701/ST=TX/L=Austin/street=221 W 6th St, Ste 1550/O=Continuum Analytics Inc/OU=Information Technology/OU=PremiumSSL Wildcard/CN=*.anaconda.org
issuer=/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Organization Validation Secure Server CA

No client certificate CA names sent
Server Temp Key: ECDH, prime256v1, 256 bits

SSL handshake has read 5488 bytes and written 373 bytes

New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES128-GCM-SHA256
Server public key is 4096 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES128-GCM-SHA256
Session-ID: 32C19785E3801B08BB2C5997BC437A54C13C6D3F4D678F5927B028B5AAE7E2C1
Session-ID-ctx:
Master-Key: 5B2AD811A5D131CD9565311AD0A4749DC0D03657E91B32B22B77813905B9CD1865FF7DB0E67395EB1DE194848DD0037A
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
Start Time: 1498537784
Timeout : 300 (sec)
Verify return code: 0 (ok)


@BryanCutler (Contributor) commented Jun 27, 2017

@MaheshIBM commented Jun 27, 2017

That leads me to believe that the download request could be resolving to different hosts every time; could that happen if there is a CDN working in the background? Not all hosts are configured to use the bad certificate, while one (or possibly more) are using a certificate with a DN of conda.binstar.org and responding to the domain name in the URL from which the package download is attempted.

If there is a way to configure pip to ignore SSL errors (only for the purpose of troubleshooting and finding the root cause of the problem here), then that is one possible direction to take. I am looking for ways to ignore SSL errors when using pip and will update the comment if I find something.

--Update
There is a --trusted-host param that can be passed to pip

--Update 2
To double check I downloaded the certificate from binstar.org and saw the values in field below, which exactly matches what pip is complaining about.

 X509v3 Subject Alternative Name: 
                DNS:anaconda.com, DNS:anacondacloud.com, DNS:anacondacloud.org, DNS:binstar.org, DNS:wakari.io

@shaneknapp (Contributor) commented Jun 27, 2017

i agree w/@MaheshIBM that we're looking at a bad CA cert. i think we're looking at a problem on continuum.io's side, not our side.

however, i do not like the thought of ignoring certs (on principle). :)

and finally, if i'm reading the run-pip-tests code correctly (and please correct me if i'm wrong @holdenk ), we're just creating a temp python environment in /tmp, installing some packages, running the tests, and then moving on.

some thoughts/suggestions:

  • our conda environment is pretty stagnant and hasn't been explicitly upgraded since we deployed anaconda python over a year ago.
  • the py3k environment that exists in the workers' conda installation is solely used by spark builds, so updating said environment w/the packages in the run-pip-tests will remove the need to download them, but at the same time, make the tests a NOOP.
  • we can hope that continuum fixes their cert issue asap. :\

@holdenk (Contributor) commented Jun 27, 2017

@shaneknapp your understanding of what the run-pip-tests code does is pretty correct. It's important to note that part of the test is installing the pyspark package itself to make sure we didn't break the packaging, and pyarrow is only installed because we want to be able to run some pyarrow tests with it -- we don't need that to be part of the packaging tests; in fact it would be simpler to have it be part of the normal tests.

So one possible approach to fix this, I think, would be updating conda on the machines because it's old, installing pyarrow into the py3k worker env, and then taking the pyarrow tests out of the packaging test and instead having them run in the normal flow.

I'm not super sure this is a cert issue per se; it seems that newer versions of conda are working fine (it's possible the SSL lib is slightly out of date and not understanding wildcards or something else in the cert)?


@shaneknapp (Contributor) commented Jun 27, 2017

okie dokie. how about i install pyarrow in the py3k conda environment right now... and once that's done, we can remove the pyarrow test from run-pip-tests and add it to the regular tests.

so, who wants to take care of the test updating? :)


@holdenk (Contributor) commented Jun 27, 2017

I can do the test updating assuming that @BryanCutler is traveling. I've got a webinar this afternoon but I can do it after I'm done with that.

Also, I don't think it's the wildcard issue now that I think about it some more; it's that new conda deprecates binstar, and our old conda is going to binstar, which just points to the conda host, but the conda host now has an SSL cert just for conda, not conda and binstar. I don't think Continuum is going to fix that; rather, I suspect the answer is going to be just to upgrade to a newer version of conda.


@shaneknapp (Contributor) commented Jun 27, 2017

yeah, i think you're right. however, upgrading to a new version of conda on a live environment does indeed scare me a little bit. :)

w/the new jenkins, i'll have a staging server dedicated to testing crap like this. ah, the future: so shiny and bright!


@shaneknapp (Contributor) commented Jun 27, 2017

anyways: installing pyarrow right now.


@shaneknapp (Contributor) commented Jun 27, 2017

done


@shaneknapp (Contributor) commented Jun 27, 2017

(py3k)-bash-4.1$ pip install pyarrow
Requirement already satisfied: pyarrow in /home/anaconda/envs/py3k/lib/python3.4/site-packages
Requirement already satisfied: six>=1.0.0 in /home/anaconda/envs/py3k/lib/python3.4/site-packages (from pyarrow)
Requirement already satisfied: numpy>=1.9 in /home/anaconda/envs/py3k/lib/python3.4/site-packages (from pyarrow)

@BryanCutler (Contributor) commented Jun 27, 2017

@shaneknapp

Contributor

shaneknapp commented Jun 27, 2017

btw, do we want pyarrow-0.4.0 or -0.4.1? i'm assuming the latter based on #15821 (comment)

@wesm

Member

wesm commented Jun 27, 2017

I recommend using the latest. The data format is forward/backward compatible so the JAR doesn't necessarily need to be 0.4.1 if you're using pyarrow 0.4.1 (0.4.1 fixed a Decimal regression in the Java library, but that isn't relevant here quite yet)
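
As an aside, a minimal sketch of how the Python side could make that version expectation explicit is below; the constant and function names are hypothetical, for illustration only, and not part of this PR:

```python
# Hypothetical sketch: fail fast with a clear message if pyarrow is missing
# or older than the version the Arrow code path was tested against.
from distutils.version import LooseVersion

_MINIMUM_PYARROW_VERSION = "0.4.1"  # assumed floor, per the discussion above


def require_minimum_pyarrow_version():
    try:
        import pyarrow
    except ImportError:
        raise ImportError("pyarrow is required for the Arrow-based toPandas()")
    if LooseVersion(pyarrow.__version__) < LooseVersion(_MINIMUM_PYARROW_VERSION):
        raise ImportError("pyarrow >= %s is required, found %s"
                          % (_MINIMUM_PYARROW_VERSION, pyarrow.__version__))
```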

@shaneknapp

Contributor

shaneknapp commented Jun 27, 2017

roger copy... "latest" is 0.4.1, which is what's currently on the jenkins workers.

@holdenk

Contributor

holdenk commented Jun 27, 2017

Great, thanks @shaneknapp. @BryanCutler, I've got a webinar; if you don't get a chance to change the tests around before I'm done teaching, I'll do it, but if your flight lands first then go for it :)

@cloud-fan

Contributor

cloud-fan commented Jun 28, 2017

The last build still failed; shall we update dev/run-pip-tests to use pip?

@cloud-fan

Contributor

cloud-fan commented Jun 28, 2017

Some PRs have been blocked by these failures for days. I'm reverting it; @BryanCutler, please reopen this PR after fixing the pip stuff, thanks!

@HyukjinKwon

Member

HyukjinKwon commented Jun 28, 2017

Wait @cloud-fan! I just want to ask a question.

@HyukjinKwon

Member

HyukjinKwon commented Jun 28, 2017

Should we maybe wait for #18443? Actually, I think there is an alternative to this - #18439 - rather than reverting the whole PR.

Reverting is also an option. I hope these options were considered (or I assume they already were).

@HyukjinKwon

Member

HyukjinKwon commented Jun 28, 2017

FWIW, I am not against reverting. I just wanted to provide some context in case it was missed.

@holdenk

Contributor

holdenk commented Jun 28, 2017

I'm against #18439; I'd rather revert this and fix it later than install packages without SSL.

@BryanCutler

Contributor

BryanCutler commented Jun 28, 2017

@cloud-fan

Contributor

cloud-fan commented Jun 28, 2017

Sorry, I didn't know there was a PR fixing the issue, and I have already reverted it. Please cherry-pick this commit into the new PR and apply the pip fix. Sorry for the trouble.

@rxin

Contributor

rxin commented Jun 28, 2017

In the future, we should IMMEDIATELY revert PRs that break the build. There is no way we should've let the build stay broken for days.

robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017

[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas

## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data have been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test ensures the pandas DataFrame produced by pyspark is equal to one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
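
For reference, a minimal usage sketch of the API described in the commit message above, assuming pyarrow is installed on the driver (the example DataFrame and app name are made up for illustration):

```python
# Hypothetical usage sketch: compare the Arrow-accelerated conversion with the
# default (non-Arrow) conversion on a simple numeric DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-toPandas-demo").getOrCreate()

df = spark.range(1 << 16).selectExpr("id", "id * 2 AS doubled")

pdf_arrow = df.toPandas(useArrow=True)   # Arrow payload path added by this PR
pdf_plain = df.toPandas()                # legacy conversion, for comparison

assert pdf_arrow.equals(pdf_plain)

spark.stop()
```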