
[SPARK-21190][PYSPARK] Python Vectorized UDFs #18659

Closed

Conversation

@BryanCutler (Member) commented Jul 17, 2017

What changes were proposed in this pull request?

This PR adds vectorized UDFs to the Python API

Proposed API
Introduce a flag to turn on vectorization for a defined UDF, for example:

@pandas_udf(DoubleType())
def plus(a, b):
    return a + b

or

plus = pandas_udf(lambda a, b: a + b, DoubleType())

Usage is the same as for normal UDFs.
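For instance, a minimal usage sketch (this assumes an existing SparkSession named spark; the column names are illustrative, not from this PR):

from pyspark.sql.functions import col, rand

df = spark.range(100).withColumn("a", rand()).withColumn("b", rand())
df.select(plus(col("a"), col("b"))).show()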

0-parameter UDFs
pandas_udf functions can declare an optional **kwargs parameter; when the UDF is evaluated, kwargs will contain a key "size" that gives the required length of the output. For example:

@pandas_udf(LongType())
def f0(**kwargs):
    return pd.Series(1).repeat(kwargs["size"])

df.select(f0())

How was this patch tested?

Added new unit tests in pyspark.sql that are enabled if pyarrow and Pandas are available.
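For reference, a rough sketch of how such conditional tests can be gated (the flag and class names here are illustrative, not necessarily the PR's exact code):

import unittest

try:
    import pandas
    import pyarrow
    _have_arrow_deps = True
except ImportError:
    _have_arrow_deps = False

@unittest.skipIf(not _have_arrow_deps, "pandas and pyarrow are required")
class VectorizedUDFTests(unittest.TestCase):
    def test_vectorized_udf_basic(self):
        pass  # actual assertions would go here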

TODO

  • Fix support for promoted types with null values
  • Discuss 0-param UDF API (use of kwargs)
  • Add tests for chained UDFs
  • Discuss behavior when pyarrow not installed / enabled
  • Cleanup pydoc and add user docs

@BryanCutler (Member Author) commented Jul 17, 2017

The following was used to test performance locally

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("vectorized_udfs").getOrCreate()

vectorize = True

if vectorize:
    from numpy import log, exp
else:
    from math import log, exp

def my_func(p1, p2):
    w = 0.5
    return exp(log(p1) + log(p2) - log(w))

df = spark.range(1 << 24, numPartitions=16).toDF("id") \
    .withColumn("p1", rand()).withColumn("p2", rand())

# the vectorized flag is the one added by this PR
my_udf = udf(my_func, DoubleType(), vectorized=vectorize)

# an action (e.g. count()) is needed to actually trigger the computation when timing
df.withColumn("p", my_udf(col("p1"), col("p2")))

** Updated with ColumnarBatches **

Non-Vectorized: ~6.127449 s

Vectorized: ~2.867868 s (2.14x speedup)

Vectorized (updated, with ColumnarBatches): ~1.877384 s (3.26x speedup)

@BryanCutler (Member Author) commented Jul 17, 2017

Some comments on the performance above

  1. I used the ArrowFileWriter that is currently in the pyspark ArrowSerializer - this carries significant overhead from copying data to temporary buffers before transferring. Using the ArrowStreamWriter I was seeing much better performance; however, it requires significant changes to PythonRunner in PythonRDD. If we move forward with this, I can present those changes as well.

  2. This naively transfers data back into a GenericInternalRow. I'm sure there is a more efficient way to do this; maybe someone more familiar with SQL internals can comment. (Update: this is now resolved by using ArrowColumnVectors.)

@SparkQA commented Jul 17, 2017

Test build #79680 has finished for PR 18659 at commit 11a7a87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 17, 2017

Test build #79682 has finished for PR 18659 at commit 063dcd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val genericRowData = fields.map { field =>
field.getAccessor.getObject(_index)
}.toArray[Any]
@kiszk (Member) commented Jul 18, 2017

How about using SpecificInternalRow to improve performance? I think that it could eliminate some boxing/unboxing. The following is a snippet for this usage.

        val fieldTypes = fields.map {
          case _: NullableIntVector => IntegerType
          case _: NullableFloat8Vector => DoubleType
          // ...
        }
        val row = new SpecificInternalRow(fieldTypes)
        fields.zipWithIndex.foreach {
          case (field: NullableIntVector, i) =>
            row.setInt(i, field.getAccessor.get(_index))
          case (field: NullableFloat8Vector, i) =>
            row.setDouble(i, field.getAccessor.get(_index))
          // ...
        }

Member:

As an additional performance optimization, can we reuse the InternalRow object that the next() method returns?
For example, I think this next() could reuse the UnsafeRow that is allocated in the generated Java code (generated here), keeping it as an instance variable.

Member Author:

Thanks @kiszk , I'll give that a shot and see if it helps!

Contributor:

@BryanCutler,

I have implemented arrow -> unsafe row conversions in:

icexelloss@8f38c15#diff-52cca47e7a940849b28d476ddf99d65eR575

This reuses the row object and doesn't do boxing. Hopefully it's useful to you as well?

@kiszk (Member) commented Jul 26, 2017

@BryanCutler
As @cloud-fan suggested here, it would be good to create a ColumnarBatch with ArrowColumnVector and get an iterator from it. It looks like a simpler implementation.
cc: @ueshin

The following is a code snippet.

    new Iterator[InternalRow] {
      private val _allocator = new RootAllocator(Long.MaxValue)
      private var _reader: ArrowFileReader = _
      private var _root: VectorSchemaRoot = _
      private var _index = 0
      private var _iterator: java.util.Iterator[ColumnarBatch.Row] = _

      loadNextBatch()

      override def hasNext: Boolean = _root != null && _index < _root.getRowCount && _iterator.hasNext

      override def next(): InternalRow = {
        _index += 1
        if (_index >= _root.getRowCount) {
          _index = 0
          loadNextBatch()
          if (!hasNext) {
            close()
          }
        }
        _iterator.next()
      }
      ...
      private def loadNextBatch(): Unit = {
        closeReader()
        if (iter.hasNext) {
          val in = new ByteArrayReadableSeekableByteChannel(iter.next().asPythonSerializable)
          _reader = new ArrowFileReader(in, _allocator)
          _root = _reader.getVectorSchemaRoot // throws IOException
          _reader.loadNextBatch() // throws IOException
          val arrowSchema = ArrowUtils.fromArrowSchema(_root.getSchema)
          val fields = _root.getFieldVectors.asInstanceOf[java.util.List[ValueVector]]
          val rows = _root.getRowCount
          val columnarBatch = ColumnarBatch.allocateArrow(fields, arrowSchema, rows)
          _iterator = columnarBatch.rowIterator
        }
      }

public final class ColumnarBatch {
  // ...
  public static ColumnarBatch allocateArrow(List<ValueVector> vectors, StructType schema, int maxRows) {
    // need to implement the corresponding constructor for ArrowColumnVector
    return new ColumnarBatch(vectors, schema, maxRows);
  }
  // ...
}

Member Author:

Thanks @kiszk , I'm giving it a try!

    return columns;
  }

  public static ColumnarBatch createReadOnly(
Member Author:

@ueshin I made some changes here to allow for use with ArrowColumnVectors. I was thinking of putting these in a separate JIRA because they can be used regardless of what is done with vectorized UDFs. What do you think?

Member:

@BryanCutler I agree with you, let's separate it from this PR.

Member Author:

ok, will do. I created https://issues.apache.org/jira/browse/SPARK-21583 for this

@SparkQA commented Jul 29, 2017

Test build #80030 has finished for PR 18659 at commit 46e4112.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 4, 2017

Test build #80264 has finished for PR 18659 at commit 912143e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 5, 2017

Test build #80265 has finished for PR 18659 at commit a01a2d3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler force-pushed the arrow-vectorized-udfs-SPARK-21404 branch from a01a2d3 to 38474d8 on August 25, 2017 18:29
@SparkQA commented Aug 25, 2017

Test build #81138 has finished for PR 18659 at commit 38474d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler force-pushed the arrow-vectorized-udfs-SPARK-21404 branch from 38474d8 to cc7ed5a on September 1, 2017 18:05
@SparkQA commented Sep 1, 2017

Test build #81321 has finished for PR 18659 at commit cc7ed5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler changed the title from [SPARK-21404][PYSPARK][WIP] Simple Python Vectorized UDFs to [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDFs on Sep 6, 2017
@SparkQA commented Sep 6, 2017

Test build #81478 has finished for PR 18659 at commit 3efa7f2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler force-pushed the arrow-vectorized-udfs-SPARK-21404 branch 2 times, most recently from 1503fa0 to fdea603 on September 6, 2017 21:56
@BryanCutler force-pushed the arrow-vectorized-udfs-SPARK-21404 branch from fdea603 to 4f6c950 on September 6, 2017 21:57
@@ -2112,7 +2113,7 @@ def wrapper(*args):


 @since(1.3)
-def udf(f=None, returnType=StringType()):
+def udf(f=None, returnType=StringType(), vectorized=False):
Member Author:

@felixcheung does this fit your idea for a more generic decorator? Not exclusively labeled pandas_udf; just enable vectorization with a flag, e.g. @udf(DoubleType(), vectorized=True).

Contributor:

I think @pandas_udf(DoubleType()) is better than @udf(DoubleType(), vectorized=True); it's more concise.

Contributor:

As we discussed in the email, we should also accept the data type in string format.
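For example, accepting a type string could look roughly like this (a sketch of the proposal, not the code currently in this PR):

@pandas_udf("double")
def plus(a, b):
    return a + b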

Contributor:

and also **kwargs to bring the size information

Member Author:

It seems like the consensus is for pandas_udf and I'm fine with that too. I'll make that change and the others brought up here.

@felixcheung (Member) commented Sep 6, 2017 via email

val outputRowIterator = ArrowConverters.fromPayloadIterator(
  outputIterator.map(new ArrowPayload(_)), context)

assert(schemaOut.equals(outputRowIterator.schema))
Member Author:

@felixcheung, I think you had also brought up checking that the return type matches what was defined in the UDF. This is done here.

@SparkQA commented Sep 7, 2017

Test build #81479 has finished for PR 18659 at commit fdea603.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

series = [series]
series = [(s, None) if not isinstance(s, (list, tuple)) else s for s in series]
arrs = [pa.Array.from_pandas(s[0], type=s[1], mask=s[0].isnull()) for s in series]
batch = pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in range(len(arrs))])
Member:

I'd use xrange.

if not isinstance(series, (list, tuple)) or \
(len(series) == 2 and isinstance(series[1], pa.DataType)):
series = [series]
series = [(s, None) if not isinstance(s, (list, tuple)) else s for s in series]
Member:

I'd use generator comprehension.

Member Author:

That would work, but does it help much since series will already be a list or tuple?

@HyukjinKwon (Member) commented Sep 20, 2017

Yea, it actually affects the performance because we can avoid an extra loop:

def im_map(x):
    print("I am map %s" % x)
    return x

def im_gen(x):
    print("I am gen %s" % x)
    return x

def im_list(x):
    print("I am list %s" % x)
    return x

items = list(range(3))
map(im_map, [im_list(item) for item in items])
map(im_map, (im_gen(item) for item in items))

And .. this actually affects the performance up to my knowledge:

import time

items = list(xrange(int(1e8)))

for _ in xrange(10):
    s = time.time()
    _ = map(lambda x: x, [item for item in items])
    print "I am list comprehension with a list: %s" % (time.time() - s)
    s = time.time()
    _ = map(lambda x: x, (item for item in items))
    print "I am generator expression with a list: %s" % (time.time() - s)

This gives me ~13% improvement in Python 2

@HyukjinKwon (Member) commented Sep 20, 2017

This might not be a big deal, but I usually use a generator if it iterates once and is discarded. It should consume less memory too, since a list comprehension is fully evaluated first, to my knowledge.

Member Author:

Thanks @HyukjinKwon , I suppose if there are more than a few series then it might make some difference. In that case, every little bit helps so sounds good to me!

reader = pa.RecordBatchFileReader(pa.BufferReader(obj))
batches = [reader.get_batch(i) for i in range(reader.num_record_batches)]
# NOTE: a 0-parameter pandas_udf will produce an empty batch that can have num_rows set
num_rows = sum([batch.num_rows for batch in batches])
Member:

I'd use generator comprehension here too.

Member Author:

I guess this makes sense because it's a summation; there's no sense in making a list and then adding it all up.
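In other words, roughly (a sketch of the suggested change):

num_rows = sum(batch.num_rows for batch in batches)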

"""
import pyarrow as pa
reader = pa.RecordBatchFileReader(pa.BufferReader(obj))
batches = [reader.get_batch(i) for i in range(reader.num_record_batches)]
Member:

And .. xrange here too

@cloud-fan (Contributor):

What if users installed an older version of pyarrow? Shall we throw an exception and ask them to upgrade, or work around the type casting issue?

@SparkQA commented Sep 19, 2017

Test build #81945 has finished for PR 18659 at commit 69112a5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • // enable memo iff we serialize the row with schema (schema and class should be memorized)
  • abstract class EvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)

@BryanCutler (Member Author):

Thanks for the reviews @ueshin, @viirya, and @HyukjinKwon! I updated with your comments.

@BryanCutler (Member Author) commented Sep 19, 2017

Regarding the upgrade of Arrow, the concerns of #18974 are still valid - namely, it has some risk, and upgrading the Python side is a good amount of work that only a couple of people have access to do. Would it be better to discuss the upgrade strategy in another JIRA?
cc @holdenk

@BryanCutler (Member Author):

> What if users installed an older version of pyarrow? Shall we throw an exception and ask them to upgrade, or work around the type casting issue?

@cloud-fan, in regard to handling problems that might come up when using different versions of Arrow, I think we should first decide on a minimum supported version, then maybe we could put that version of pyarrow as a requirement for PySpark. If we decide to stay with 0.4.1, which we currently use, then we should probably work around the type casting issue and make sure this PR works with that version.
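As a rough illustration, a minimum-version check could look something like this (the variable name and the exact minimum are assumptions, not this PR's code):

import pyarrow as pa
from distutils.version import LooseVersion

_minimum_pyarrow_version = "0.4.1"  # assumed minimum, for illustration only
if LooseVersion(pa.__version__) < LooseVersion(_minimum_pyarrow_version):
    raise ImportError("pyarrow >= %s must be installed; found %s"
                      % (_minimum_pyarrow_version, pa.__version__))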

@SparkQA commented Sep 20, 2017

Test build #81955 has finished for PR 18659 at commit f451d65.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

OK, let's work around the type casting issue and discuss the Arrow upgrade later.

 *           \         /
 *            \     socket  (input of UDF)
 *             \       /
 *          upstream (from child)
Contributor:

Is Upstream better?

Member:

I think upstream is fine.

Contributor:

Maybe it just makes me uncomfortable to see Downstream at the top, forgive me.

Member:

That's fine, but either looks fine; not a big deal.

@BryanCutler (Member Author):

@ueshin I haven't had much luck with the casting workaround:

pa.Array.from_pandas(s.astype(t.to_pandas_dtype(), copy=False), mask=s.isnull(), type=t)

It appears that it forces a copy for the floating point -> integer cast and then checks for NaNs, so I get the error ValueError: Cannot convert non-finite values (NA or inf) to integer. I'm using Pandas 0.20.1, but I also tried 0.19.4 with the same result. Any ideas?
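The pandas side of this failure can be reproduced on its own (a small sketch, not PR code):

import numpy as np
import pandas as pd

s = pd.Series([0.1, np.nan, 0.3])
s.astype("int64")  # raises ValueError: Cannot convert non-finite values (NA or inf) to integer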

@ueshin (Member) commented Sep 21, 2017

@BryanCutler Hmm, I'm not exactly sure why it doesn't work (or why mine works), but I guess we can use fillna(0) before casting, like:

pa.Array.from_pandas(s.fillna(0).astype(t.to_pandas_dtype(), copy=False), mask=s.isnull(), type=t)
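For context, a self-contained sketch of that workaround (the series and target type here are illustrative):

import pandas as pd
import pyarrow as pa

s = pd.Series([1.0, None, 3.0])   # float series containing a null
t = pa.int64()                    # target Arrow type, promoted from float

# fill nulls with a placeholder before the cast; the mask keeps them null in Arrow
arr = pa.Array.from_pandas(
    s.fillna(0).astype(t.to_pandas_dtype(), copy=False),
    mask=s.isnull(), type=t)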

@BryanCutler (Member Author):

Thanks @ueshin, that works to allow the tests to pass. I do worry that it might cause some other issues, and I would much prefer we upgrade Arrow to handle this, but I'll push this and we can discuss.

@SparkQA commented Sep 21, 2017

Test build #82042 has finished for PR 18659 at commit 53926cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 22, 2017

Test build #82053 has finished for PR 18659 at commit b8ffa50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""

def __init__(self):
super(ArrowPandasSerializer, self).__init__()
Member:

Do we need this?

Member Author:

No, that was leftover. I'll remove it in a follow-up.

@cloud-fan (Contributor):

LGTM, merging to master!

We can address the remaining minor comments in follow-ups, and have new PRs to remove the 0-parameter UDF and use the Arrow streaming protocol.

@asfgit closed this in 27fc536 on Sep 22, 2017
@BryanCutler changed the title from [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs to [SPARK-21190][PYSPARK] Python Vectorized UDFs on Sep 22, 2017
@BryanCutler (Member Author):

Thanks @cloud-fan, @ueshin, and others who reviewed! I'll make follow-ups to disable the 0-parameter UDF and complete the docs for this.
