[SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow #22807
Conversation
Test build #97925 has finished for PR 22807 at commit
Thanks @viirya, looks good so far! Were you thinking of putting in a Spark config to toggle the safe flag here or in a follow-up? I think we really need that because prior to v0.11.0 it allowed unsafe casts, but now it raises an error by default.
python/pyspark/sql/tests.py
Outdated
udf_boolean = df.select(['A']).withColumn('udf', udf('A'))

# Since 0.11.0, PyArrow supports the feature to raise an error for unsafe cast.
if LooseVersion(pa.__version__) >= LooseVersion("0.11.0"):
I checked how 0.8.0 works and it does raise an error for something like overflows, but not for truncation like in this test. Can you also add a check for overflow that doesn't depend on the version?
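(A hedged sketch of such a version-independent overflow check, assuming an active SparkSession named spark; the test actually added in this PR may differ:)

import pandas as pd
from pyspark.sql.functions import pandas_udf

df = spark.createDataFrame(pd.DataFrame({'id': [1]}))
# 128 does not fit into a signed byte, so any supported PyArrow version should reject the conversion.
overflow_udf = pandas_udf(lambda x: pd.Series([128] * len(x)), 'byte')
try:
    df.select(overflow_udf('id')).collect()
except Exception as e:
    print("overflow detected:", e)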
Ok, sure.
BTW, let's bump up the minimal required PyArrow and Pandas versions if possible in 3.0 :-)
@BryanCutler Do you mean this same udf raises an error like an overflow? I tried with 0.8.0 but it doesn't. Am I missing something?
I think I was talking about something like an integer overflow. In pyarrow 0.8.0 it will raise an error:
In [10]: pa.Array.from_pandas(pd.Series([128]), type=pa.int8())
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-10-49026b548de3> in <module>()
----> 1 pa.Array.from_pandas(pd.Series([128]), type=pa.int8())
...
ArrowInvalid: Integer value out of bounds
but in 0.11.1 with safe=False, it will allow this:
In [11]: pa.Array.from_pandas(pd.Series([128]), type=pa.int8(), safe=False)
Out[11]:
<pyarrow.lib.Int8Array object at 0x7f49eebb0f48>
[
-128
]
So I think I was saying you could add a test that makes sure the default behavior is to raise an error on overflow.
It seems like casts from float to integral types always work without an error in pyarrow < 0.11:
>>> pa.Array.from_pandas(pd.Series([128.0]), type=pa.int8())
<pyarrow.lib.Int8Array object at 0x11c42d940>
[
-128
]
Yeah but in pyarrow 0.11.0+ you'd see an error:
>>> import pandas as pd
>>> import pyarrow as pa
>>> pa.__version__
'0.11.1'
>>> pa.Array.from_pandas(pd.Series([128.0]), type=pa.int8())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 474, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Floating point value truncated
>>> pa.Array.from_pandas(pd.Series([128.0]), type=pa.int8(), safe=False)
<pyarrow.lib.Int8Array object at 0x7f3ee1a4b868>
[
-128
]
Yeah. I was thinking about how we should handle this behavior change.
We will have the behavior change anyway regardless of the config, right?
- safe=True: we can't use nullable integral types
- safe=False: we can't detect an integer overflow
I think so. safe=False does the type conversion anyway, even on an overflow.
If using safe=False by default, it is possible to break users' code without any notification. If using safe=True by default, it still breaks users' code, but there is an error message, so users should know what happened.
@BryanCutler Thanks for looking at this! Yeah, this is WIP work for early review, and I will add a config to toggle the safe flag.
-            return pa.Array.from_pandas(s, mask=mask, type=t, safe=False)
+            try:
+                array = pa.Array.from_pandas(s, mask=mask, type=t, safe=True)
+            except pa.ArrowException as e:
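(A minimal sketch of the intent of this change: catch Arrow's unsafe-cast error and re-raise something more instructive for pandas UDF users. The message wording below is an assumption, not necessarily the exact text merged:)

try:
    array = pa.Array.from_pandas(s, mask=mask, type=t, safe=True)
except pa.ArrowException as e:
    # Re-raise with a hint about the unsafe cast and how to disable the check.
    error_msg = ("Exception thrown when converting pandas.Series (%s) to Arrow array (%s). "
                 "It can be caused by an unsafe cast such as an overflow or float truncation. "
                 "The safe type check can be disabled via "
                 "spark.sql.execution.pandas.arrowSafeTypeConversion.")
    raise RuntimeError(error_msg % (s.dtype, t), e)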
@BryanCutler Now it catches ArrowException.
Sorry for not pushing this for a while. Hope we can get it in soon.
Test build #100917 has finished for PR 22807 at commit
Do we have a proper way to get a Spark config value on the executor side, like in the serializers here? I found that in
Test build #100918 has finished for PR 22807 at commit
So since pyarrow 0.11, we won't allow nullable integral types by default?
@ueshin Ideally we should provide a config to turn off this check. But I'm wondering if there is a proper way to obtain the config value in the serializers (#22807 (comment)). Do you have some suggestions?
@viirya We can use
@ueshin Thanks. I've noticed that way, but was wondering if it is the only way to do it. I will use it to set the safe check.
Test build #100926 has finished for PR 22807 at commit
Test build #100931 has finished for PR 22807 at commit
Test build #100935 has finished for PR 22807 at commit
Thanks for getting back to this @viirya! Looks pretty good; I just think maybe parse the runner_conf earlier and use a safe flag in the ArrowStreamPandasSerializer. Also, I'm a little concerned about the default value that might break a lot of people's code... but also if it is false by default, then it will allow overflows when it didn't before, so I'm not sure what's best.
python/pyspark/worker.py
Outdated
@@ -253,7 +253,7 @@ def read_udfs(pickleSer, infile, eval_type):
     # NOTE: if timezone is set here, that implies respectSessionTimeZone is True
     timezone = runner_conf.get("spark.sql.session.timeZone", None)
-    ser = ArrowStreamPandasSerializer(timezone)
+    ser = ArrowStreamPandasSerializer(timezone, runner_conf)
I think it's slightly better to parse the runner_conf for the needed options here and pass in the required flags like we do with timezone, instead of passing around the runner_conf.
Ok. I passed in the required flag now.
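(For reference, a minimal sketch of what parsing the option in worker.py's read_udfs could look like; the safecheck variable name is an assumption:)

timezone = runner_conf.get("spark.sql.session.timeZone", None)
safecheck = runner_conf.get(
    "spark.sql.execution.pandas.arrowSafeTypeConversion", "true").lower() == "true"
# Pass a plain flag to the serializer instead of the whole runner_conf dict.
ser = ArrowStreamPandasSerializer(timezone, safecheck)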
python/pyspark/sql/session.py
Outdated
        batches = [_create_batch([(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)],
-                                 timezone)
+                                 timezone, runner_conf)
Here, it doesn't really make sense to use a runner_conf, so it would be better to just pass in the flag.
python/pyspark/serializers.py
Outdated
return pa.Array.from_pandas(s, mask=mask, type=t, safe=False)

enabledArrowSafeTypeCheck = \
    runner_conf.get("spark.sql.execution.pandas.arrowSafeTypeConversion", "true") == 'true'
Might want to tack on .lower() to ensure you are checking lower case.
res = df.select(int_f(col('int')))
self.assertEquals(df.collect(), res.collect())
with self.sql_conf({
        "spark.sql.execution.pandas.arrowSafeTypeConversion": False}):
This and other tests fail if arrowSafeTypeConversion=True?
Yes, please see @ueshin's comment #22807 (comment).
I see, it's because of the NULL values.
  "when detecting unsafe type conversion. When false, disabling Arrow's type " +
  "check and do type conversions anyway.")
  .booleanConf
  .createWithDefault(true)
If enabling by default causes a lot of the current tests to fail, it might also do the same for users - maybe we should disable by default?
As you said, if it is false by default, it allows overflow. cc @ueshin @HyukjinKwon What do you think? Which is the better default value?
I'd favor true by default. Do we even need this flag? As in many such cases, I'm just not clear a) how a user would find this option and b) when they would disable it. Disabling allows something to continue that is going to also fail or give an incorrect answer, right?
I think the big issue with this is when NULL values are introduced in an integer column. Pandas will automatically convert these to floating-points to represent the NULLs, then when Arrow casts it back to integer, it will raise an error due to truncation - I don't think Arrow checks the actual values, but maybe it should? For example, with safe=True:
>>> pa.Array.from_pandas(pd.Series([1, None]), type=pa.int32(), safe=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 474, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Floating point value truncated
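(For illustration only, a hedged sketch of the value-level check hinted at above; this is not something Arrow or this PR does. It shows that the floats pandas produces for a nullable integer column could, in principle, be cast back losslessly:)

import numpy as np
import pandas as pd

s = pd.Series([1, None])                      # pandas upcasts to float64: [1.0, NaN]
vals = s.dropna()
print(bool((vals == np.floor(vals)).all()))   # True: this particular float -> int cast loses nothing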
Does Spark check for these types of errors with standard udfs, when not using Arrow?
Thanks @kszucs
I would have left some comments about that JIRA here though, rather than making everybody click and read the whole issue. So, the NULL -> Integer issue will be fixed in Arrow 0.12.0?
Do you mean leaving comments in this SQL config doc?
No .. I was referring to #22807 (comment) ..
> So, NULL -> Integer issue will be fixed in Arrow 0.12.0?

Yes, I verified this and ran the unit tests on the release.
val PANDAS_ARROW_SAFE_TYPE_CONVERSION =
  buildConf("spark.sql.execution.pandas.arrowSafeTypeConversion")
    .internal()
    .doc("When true, enabling Arrow do safe type conversion check when converting" +
maybe reword to "When true, Arrow will perform safe type conversion when converting " +
  buildConf("spark.sql.execution.pandas.arrowSafeTypeConversion")
    .internal()
    .doc("When true, enabling Arrow do safe type conversion check when converting" +
      "Pandas.Series to Arrow Array during serialization. Arrow will raise errors " +
nit: don't capitalize Array
Fixed.
    .internal()
    .doc("When true, enabling Arrow do safe type conversion check when converting" +
      "Pandas.Series to Arrow Array during serialization. Arrow will raise errors " +
      "when detecting unsafe type conversion. When false, disabling Arrow's type " +
it might be good to elaborate what is an unsafe conversion, e.g. overflow
Added.
Thanks @BryanCutler for continuing to review this. I addressed the above comments. We need to decide the default config value.
Test build #100958 has finished for PR 22807 at commit
So, in these situations, we agree some error should occur. The current one is an overflow error -- right? Or does it silently overflow? Non-Arrow UDFs seem to do the latter. Based on that, I'm not clear the Arrow behavior should be different, especially if it already matches that behavior and is what Arrow users expect. No, this isn't a case for a flag. That is a failure to decide, punted to the user.
@srowen we discussed adding this flag a while ago when pyarrow 0.11.1 introduced this option, which seemed to change the default behavior in PySpark. Our intention was to preserve the default behavior in PySpark and provide the user a config to change it if needed. It was only recently we learned there was a behavior change either way the config is set. It does seem like there is a bug in Arrow that is affecting this too, https://issues.apache.org/jira/browse/ARROW-4258, although we still should make a patch for v0.11.1.
@BryanCutler @srowen Based on the summary, if we set the config value to false, I think we should also add some words to the migration guide about the behavior change. What do you think?
To be clear, if the new setting is false, then that enables the 'silent' behavior? That seems like the right default. Yes, it can't hurt to note this in the release notes ('release-notes' label in JIRA and 'Docs text').
I think a default setting of false and a note in the migration guide sound like the best option.
Changed the default config value to false and added a note to the migration guide. Please take a look when you have time. Thanks.
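(As an aside, a minimal usage sketch, assuming an active SparkSession named spark: with the new default of false, the strict check has to be enabled explicitly.)

spark.conf.set("spark.sql.execution.pandas.arrowSafeTypeConversion", "true")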
Looks good pending tests
Test build #101473 has finished for PR 22807 at commit
@@ -197,6 +197,66 @@ def foofoo(x, y):
                ).collect
            )

    def test_pandas_udf_detect_unsafe_type_conversion(self):
        from distutils.version import LooseVersion
        from pyspark.sql.functions import pandas_udf
nit: after the unit test reorg, the pandas_udf import is at the top so it's not needed here or in the other test.
Removed it. Thanks!
LGTM except for just one minor nit
</th>
</tr>
</table>
Looks good, thanks for adding this!
</tr>
<tr>
<th>
<b>version > 0.11.0, arrowSafeTypeConversion=true</b>
Just a quick question (don't be blocked by this). So, do we plan to use true in the near future? It looks like it was made false by default to prevent behaviour changes, and in particular because of the NULL -> integer issue (which is fixed in Arrow 0.12.0).
Sounds like we can make it true and remove this configuration when we set the minimal Arrow version to 0.12.0. Am I correct?
I think this config can be kept even after we set the minimal Arrow version to 0.12.0. If anything goes wrong, users can still disable the type check.
For now, there isn't consistent behavior across integer overflow and floating-point truncation. Either true or false causes a behavior change. It is false by default to make it consistent with non-Arrow UDFs.
If we are going to change it to true in the future, isn't that a behavior change again?
Hmhmm .. yea .. Good point about consistency with the regular UDFs. I was thinking of 1. removing such configurations eventually (personally I don't like the bunch of configurations we have currently ..), 2. making all UDFs expect the exact types (<- not sure yet, it needs discussion). Yes, we can talk later ..
Yeah, I think the config should stay too since it is only able to be set through pyarrow
Test build #101501 has finished for PR 22807 at commit
retest this please.
# Disabling Arrow safe type check.
with self.sql_conf({
        "spark.sql.execution.pandas.arrowSafeTypeConversion": False}):
    df.select(['A']).withColumn('udf', udf('A')).collect()
nit: Can we assert the result too?
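(A hedged sketch of what that assertion could look like; the expected values are an assumption since they depend on the udf defined in this test:)

result = df.select(['A']).withColumn('udf', udf('A')).collect()
self.assertEqual(len(result), 3)  # 3 input rows; the 'udf' column holds the unsafely cast values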
values = [1.0] * 3
pdf = pd.DataFrame({'A': values})
df = self.spark.createDataFrame(pdf).repartition(1)
sorry if I missed something. Why should we repartition?
I thought I wrote this to make sure all values are in a single partition so it matches the length of the returned Pandas Series. This can be simplified. We can do this when we touch the code here next time.
import pandas as pd
import pyarrow as pa

df = self.spark.range(0, 1)
nit: spark.range(0).
# Overflow cast causes an error.
with self.sql_conf({"spark.sql.execution.pandas.arrowSafeTypeConversion": False}):
    with self.assertRaisesRegexp(Exception,
                                 "Integer value out of bounds"):
Looks like it can be inlined.
LGTM too
Test build #101506 has finished for PR 22807 at commit
Merged to master.
Thanks guys! @HyukjinKwon I will make a follow-up to address the above minor comments. Thanks.
Eh, it's okie. Minor is minor. Let's fix it later when we touch that code.
@HyukjinKwon ok. no problem.
Thanks @viirya!
[SPARK-25811][PySpark] Raise a proper error when unsafe cast is detected by PyArrow

## What changes were proposed in this pull request?

Since 0.11.0, PyArrow supports raising an error for unsafe casts ([PR](apache/arrow#2504)). We should use it to raise a proper error for pandas udf users when such a cast is detected.

Added a SQL config `spark.sql.execution.pandas.arrowSafeTypeConversion` to disable the Arrow safe type check.

## How was this patch tested?

Added tests and tested manually.

Closes apache#22807 from viirya/SPARK-25811.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Since 0.11.0, PyArrow supports raising an error for unsafe casts (PR). We should use it to raise a proper error for pandas udf users when such a cast is detected.
Added a SQL config spark.sql.execution.pandas.arrowSafeTypeConversion to disable the Arrow safe type check.

How was this patch tested?
Added tests and tested manually.