
[SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks #24605

Closed
jose-torres wants to merge 6 commits into apache:master from jose-torres:fix254

Conversation

@jose-torres
Contributor

What changes were proposed in this pull request?

Unset InputFileBlockHolder at the end of tasks to stop the file name from leaking over to other tasks in the same thread. This happens in particular in PySpark because of its complex threading model.

How was this patch tested?

New PySpark test.
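A minimal sketch of what the new test checks, assembled from the review snippets below; it assumes the PySpark test-class fixtures (self.spark, self.sc) and the sample file python/test_support/hello/hello.txt:

from pyspark.sql.functions import input_file_name

def test_input_file_name_reset_for_rdd(self):
    rdd = self.sc.textFile('python/test_support/hello/hello.txt').map(lambda x: {'data': x})
    df = self.spark.createDataFrame(rdd, 'data STRING')
    df.select(input_file_name().alias('file')).collect()

    # A job that reads no files should now see an empty file name rather than
    # the one left over from the previous job on the same worker thread.
    non_file_df = self.spark.range(0, 100, 1, 100).select(input_file_name().alias('file'))
    for result in non_file_df.collect():
        self.assertEqual(result[0], '')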

@zsxwing
Member

zsxwing commented May 14, 2019

LGTM

@zsxwing
Member

zsxwing commented May 14, 2019

ok to test

@SparkQA

SparkQA commented May 14, 2019

Test build #105388 has finished for PR 24605 at commit fdbae70.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105389 has finished for PR 24605 at commit 7e1eafe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105390 has finished for PR 24605 at commit bc2a874.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105391 has finished for PR 24605 at commit e30c6ab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2019

Test build #105392 has finished for PR 24605 at commit 7ea89c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import sys

from pyspark.sql import Row
from pyspark.sql.types import *
Member

Can we avoid the wildcard import?
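For instance, importing just the names the test actually uses (per the snippet later in this review):

from pyspark.sql.types import StructType, StructField, StringType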

# [SC-12160]: if everything was properly reset after the last job, this should return
# empty string rather than the file read in the last job.
for result in results:
    assert(result[0] == '')
Member

We can stick to self.assertEqual for a better message.
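For example, the following is equivalent but reports both values on failure:

for result in results:
    self.assertEqual(result[0], '')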

results = non_file_df.collect()
self.assertTrue(len(results) == 100)

# [SC-12160]: if everything was properly reset after the last job, this should return
Member

What's SC-12160?

Member

+1 for @HyukjinKwon's comment. Is this an internal issue tracker ID?
Could you update the PR, @jose-torres?

[Row(name=u'Tom'), Row(name=u'Alice'), Row(name=None)])

def test_input_file_name_reset_for_rdd(self):
    from pyspark.sql.functions import udf, input_file_name
Member

I think we can just import at the top.

def test_input_file_name_reset_for_rdd(self):
    from pyspark.sql.functions import udf, input_file_name
    rdd = self.sc.textFile('python/test_support/hello/hello.txt').map(lambda x: {'data': x})
    df = self.spark.createDataFrame(rdd, StructType([StructField('data', StringType(), True)]))
Member

Actually, you don't have to import types:

spark.createDataFrame(rdd, "data STRING")

df = self.spark.createDataFrame(rdd, StructType([StructField('data', StringType(), True)]))
df.select(input_file_name().alias('file')).collect()

non_file_df = self.spark.range(0, 100, 1, 100).select(input_file_name().alias('file'))
Member

Seems like we don't need the alias 'file'.

And why don't we just use spark.range(100)?
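A sketch of that suggestion (note: unlike range(0, 100, 1, 100), the one-argument form does not pin the partition count to 100, so the job runs with the default parallelism):

non_file_df = self.spark.range(100).select(input_file_name())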

@HyukjinKwon
Member

Some nits. Looks good to me too.

@jose-torres
Contributor Author

Pushed changes addressing the comments. Sorry it took so long; I was on a trip.

@SparkQA

SparkQA commented May 22, 2019

Test build #105702 has finished for PR 24605 at commit c8c8d7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

Thanks! Merged to master, please manually backport to 2.4!

jose-torres added a commit to jose-torres/spark that referenced this pull request May 23, 2019
Unset InputFileBlockHolder at the end of tasks to stop the file name from leaking over to other tasks in the same thread. This happens in particular in PySpark because of its complex threading model.

New PySpark test.

Closes apache#24605 from jose-torres/fix254.

Authored-by: Jose Torres <torres.joseph.f+github@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>