
[SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks #24605

Closed
jose-torres wants to merge 6 commits into apache:master from jose-torres:fix254

Conversation

@jose-torres
Contributor

What changes were proposed in this pull request?

Unset InputFileBlockHolder at the end of tasks to stop the file name from leaking over to other tasks in the same thread. This happens in particular in PySpark because of its complex threading model.

How was this patch tested?

New PySpark test.
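A minimal sketch of what the new test checks, assembled from the review snippets below; it assumes the PySpark test-class fixtures (self.spark, self.sc) and the sample file python/test_support/hello/hello.txt:

from pyspark.sql.functions import input_file_name

def test_input_file_name_reset_for_rdd(self):
    rdd = self.sc.textFile('python/test_support/hello/hello.txt').map(lambda x: {'data': x})
    df = self.spark.createDataFrame(rdd, 'data STRING')
    df.select(input_file_name().alias('file')).collect()

    # A job that reads no files should now see an empty file name rather than
    # the one left over from the previous job on the same worker thread.
    non_file_df = self.spark.range(0, 100, 1, 100).select(input_file_name().alias('file'))
    for result in non_file_df.collect():
        self.assertEqual(result[0], '')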

@zsxwing
Member

zsxwing commented May 14, 2019

LGTM

@zsxwing
Member

zsxwing commented May 14, 2019

ok to test

@SparkQA

SparkQA commented May 14, 2019

Test build #105388 has finished for PR 24605 at commit fdbae70.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105389 has finished for PR 24605 at commit 7e1eafe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105390 has finished for PR 24605 at commit bc2a874.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105391 has finished for PR 24605 at commit e30c6ab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105392 has finished for PR 24605 at commit 7ea89c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import sys

from pyspark.sql import Row
from pyspark.sql.types import *
Member

Can we avoid the wildcard import?
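For instance, importing just the names the test actually uses (per the snippet later in this review):

from pyspark.sql.types import StructType, StructField, StringType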

# [SC-12160]: if everything was properly reset after the last job, this should return
# empty string rather than the file read in the last job.
for result in results:
    assert(result[0] == '')
Member

We can stick to self.assertEqual for a better message.
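For example, the following is equivalent but reports both values on failure:

for result in results:
    self.assertEqual(result[0], '')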

results = non_file_df.collect()
self.assertTrue(len(results) == 100)

# [SC-12160]: if everything was properly reset after the last job, this should return
Member

What's SC-12160?

Member

+1 for @HyukjinKwon's comment. Is this an internal issue tracker ID?
Could you update the PR, @jose-torres?

[Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)])

def test_input_file_name_reset_for_rdd(self):
    from pyspark.sql.functions import udf, input_file_name
Member

I think we can just import at the top.

def test_input_file_name_reset_for_rdd(self):
    from pyspark.sql.functions import udf, input_file_name
    rdd = self.sc.textFile('python/test_support/hello/hello.txt').map(lambda x: {'data': x})
    df = self.spark.createDataFrame(rdd, StructType([StructField('data', StringType(), True)]))
Member

Actually, you don't have to import types:

spark.createDataFrame(rdd, "data STRING")

df = self.spark.createDataFrame(rdd, StructType([StructField('data', StringType(), True)]))
df.select(input_file_name().alias('file')).collect()

non_file_df = self.spark.range(0, 100, 1, 100).select(input_file_name().alias('file'))
Member

Seems like we don't need the alias 'file'.

And why don't we just use spark.range(100)?
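A sketch of that suggestion (note: unlike range(0, 100, 1, 100), the one-argument form does not pin the partition count to 100, so the job runs with the default parallelism):

non_file_df = self.spark.range(100).select(input_file_name())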

@HyukjinKwon
Member

Some nits. Looks good to me too.

@jose-torres
Contributor Author

Pushed changes addressing the comments. Sorry it took so long; I was on a trip.

@SparkQA

SparkQA commented May 22, 2019

Test build #105702 has finished for PR 24605 at commit c8c8d7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

Thanks! Merged to master, please manually backport to 2.4!

jose-torres added a commit to jose-torres/spark that referenced this pull request May 23, 2019
Unset InputFileBlockHolder at the end of tasks to stop the file name from leaking over to other tasks in the same thread. This happens in particular in PySpark because of its complex threading model.

New PySpark test.

Closes apache#24605 from jose-torres/fix254.

Authored-by: Jose Torres <torres.joseph.f+github@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>