[SPARK-19223][SQL][PySpark] Fix InputFileBlockHolder for datasources which are based on HadoopRDD or NewHadoopRDD #16585
Conversation
Test build #71366 has started for PR 16585 at commit
retest this please.
Test build #71368 has finished for PR 16585 at commit
retest this please.
Test build #71370 has finished for PR 16585 at commit
ping @rxin @cloud-fan
should the proper fix be that the Python thread transfers the proper information over?
@rxin Thanks for looking at this. I think the simplest way to transfer the info is using
SGTM
@cloud-fan Is SGTM for the current approach, or
Force-pushed from e2d872c to 1563e03
BTW please add a test case for this. Thanks.
Test build #71474 has finished for PR 16585 at commit
Test build #71475 has finished for PR 16585 at commit
Force-pushed from 157f70f to d7c05d2
import org.apache.spark.{SparkContext, SparkFunSuite}
import org.apache.spark.util.Utils

class InputFileBlockHolderSuite extends SparkFunSuite {
shall we just write a pyspark test?
I can't find a proper built-in datasource for testing it. I think all file-based datasources are based on FileFormat; they don't use HadoopRDD/NewHadoopRDD. The pyspark test shown in the description uses the spark-xml package to test this manually.
I've written a pyspark test which directly reads a file and produces a HadoopRDD/NewHadoopRDD-based dataframe, so this scala test is removed.
Test build #71511 has finished for PR 16585 at commit
Test build #71509 has finished for PR 16585 at commit
Test build #71510 has finished for PR 16585 at commit
Force-pushed from d7c05d2 to 8380617
Test build #71540 has finished for PR 16585 at commit
Test build #71559 has finished for PR 16585 at commit
Force-pushed from 3444499 to 2ce65cb
Test build #71561 has finished for PR 16585 at commit
def filename(path):
    return path

self.spark.udf.register('sameText', filename)
where do we call this registered function?
oh. wrongly copied.
    'org.apache.hadoop.io.Text')

df2 = self.spark.read.json(rdd2).select(input_file_name().alias('file'))
row = df2.select(sameText(df2['file'])).first()
nit: row2?
sure.
Test build #71575 has started for PR 16585 at commit
LGTM, pending jenkins
retest this please.
Test build #71581 has finished for PR 16585 at commit
retest this please.
Test build #71596 has finished for PR 16585 at commit
thanks, merging to master!
@rxin @cloud-fan Thanks!
## What changes were proposed in this pull request?

For some datasources which are based on HadoopRDD or NewHadoopRDD, such as spark-xml, `InputFileBlockHolder` doesn't work with Python UDFs.

To reproduce it, run the following code with `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`:

```python
from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    return path

session = SparkSession.builder.appName('APP').getOrCreate()
session.udf.register('sameText', filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
df.select('file').show()                # works
df.select(sameText(df['file'])).show()  # returns empty content
```

The issue is that in `HadoopRDD` and `NewHadoopRDD` we set the file block's info in `InputFileBlockHolder` before the returned iterator begins consuming. `InputFileBlockHolder` records this info in a thread-local variable. When running a Python UDF in batch, we set up another thread to consume the iterator from the child plan's output RDD, so we can't read the info back from that other thread.

To fix this, we have to set the info in `InputFileBlockHolder` after the iterator begins consuming, so the info can be read in the correct thread.

## How was this patch tested?

Manual test with the above example code for the spark-xml package on pyspark: `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`. Added a pyspark test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#16585 from viirya/fix-inputfileblock-hadooprdd.
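The thread-local hand-off at the heart of this bug can be illustrated without Spark at all. The sketch below is a minimal, hypothetical model in plain Python (the names `holder`, `read_partition_eager`, and `read_partition_lazy` are stand-ins, not Spark's actual `InputFileBlockHolder` API): a value set eagerly before the iterator is returned lands in the producing thread's local storage and is invisible to the consuming thread, while setting it lazily as the iterator is consumed puts it in whichever thread actually iterates.

```python
import threading

# Stand-in for InputFileBlockHolder's per-thread state (hypothetical).
holder = threading.local()

def read_partition_eager(file_name, records):
    # Buggy pattern: set the file info before returning the iterator.
    # The value is stored in the *producing* thread's local storage.
    holder.file_name = file_name
    return iter(records)

def read_partition_lazy(file_name, records):
    # Fixed pattern: the generator body only runs on first next(),
    # so the info is set in whichever thread consumes the iterator.
    def gen():
        holder.file_name = file_name
        for r in records:
            yield r
    return gen()

def consume_in_other_thread(it):
    # Mimics PySpark's batch UDF evaluation, which consumes the child
    # plan's iterator in a separate writer thread.
    seen = []
    def run():
        for _ in it:
            seen.append(getattr(holder, 'file_name', ''))
    t = threading.Thread(target=run)
    t.start()
    t.join()
    return seen

print(consume_in_other_thread(read_partition_eager('a.xml', [1, 2])))  # ['', '']
print(consume_in_other_thread(read_partition_lazy('a.xml', [1, 2])))   # ['a.xml', 'a.xml']
```

This mirrors the symptom in the description: `input_file_name()` evaluated inline works, but once a Python UDF forces consumption on another thread, the eagerly-set info is gone.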