[SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages #19845
Conversation
I could fold this change into
cc @jiangxb1987 and @viirya, who I see have touched and reviewed similar code, and @imatiach-msft, who is the primary author of this code.
Test build #84297 has finished for PR 19845 at commit
Let me look into this tomorrow. :)
This should be a valid fix, LGTM. Also cc @ueshin
+1, LGTM.
LGTM
```diff
-ctx = SparkContext._active_spark_context
-spark = SparkSession(ctx)
-image_schema = ctx._jvm.org.apache.spark.ml.image.ImageSchema
+spark = SparkSession.builder.getOrCreate()
```
Yeah, I think this should be the best practice to initialize a SparkSession.
Thanks for reviewing this @jiangxb1987, @dongjoon-hyun and @viirya.
Shall we also add a simple test to
LGTM as long as all tests pass. But what about the other methods that use `ctx = SparkContext._active_spark_context`? Should those be modified as well? Can you run those one after the other in sequence too, or do they fail with the same error? I recall I tried to get the current session just like this in the first iteration, but the Python tests started to fail when run in parallel; that was the original reason to move to `_active_spark_context`.
@imatiach-msft, ah, I think it's not about SparkContext but SparkSession (instantiating `SparkSession(...)` directly, to be clear), which seems to cause multiple Hive clients when Hive support is enabled. I think most of PySpark's unit tests do not enable Hive support by default. Also, I double-checked: I ran this multiple times and it seems fine after this fix as well. Let me keep my eyes on the Jenkins tests for this; the tests are basically run in parallel on Jenkins.
@viirya, actually, I think it's not that simple to just add multiple
Test build #84342 has finished for PR 19845 at commit
```python
cls.tearDownClass()
raise unittest.SkipTest("Hive is not available")
cls.spark = HiveContext._createForTesting(cls.sc)
```
Add a classmethod `tearDownClass` to stop the `cls.spark`? I didn't see `HiveContextSQLTests` close it anyway; maybe we can fix that too.
Sure, that should be safer, but let me fix only this one here for now. Let's fix that up too when we touch the code around there.
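For context, here is a sketch of how a Hive-gated `setUpClass` plus the suggested `tearDownClass` could look; the suite name and scaffolding are hypothetical, not the verbatim code added in this PR:

```python
import unittest

from py4j.protocol import Py4JError
from pyspark import SparkContext
from pyspark.sql import HiveContext


class HiveImageReaderTest(unittest.TestCase):
    """Hypothetical suite name; runs only when Hive is on the JVM classpath."""

    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[2]", cls.__name__)
        try:
            # Probe the JVM for Hive; if the class is missing, skip the suite.
            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
        except (Py4JError, TypeError):
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")
        cls.spark = HiveContext._createForTesting(cls.sc)

    @classmethod
    def tearDownClass(cls):
        # Stop the context so later suites don't inherit a stale Hive client,
        # per the review comment about closing cls.spark.
        if getattr(cls, "sc", None) is not None:
            cls.sc.stop()
            cls.sc = None
```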
@HyukjinKwon Thanks. I forgot that Hive support is needed to test it. The added test looks good.
Test build #84369 has finished for PR 19845 at commit
Merged to master. Thanks for reviewing this @viirya, @jiangxb1987, @dongjoon-hyun and @imatiach-msft.
What changes were proposed in this pull request?
Calling `ImageSchema.readImages` multiple times, as below in the PySpark shell, throws an error:
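A minimal reproduction sketch, assuming a Hive-enabled PySpark shell; the image path is illustrative, and the exact traceback from the PR description (which pointed at a second Hive client being initiated) is not reproduced verbatim here:

```python
from pyspark.ml.image import ImageSchema

# The first call works: a SparkSession is created around the active context.
ImageSchema.readImages("data/mllib/images")

# Before this fix, the second call constructed another SparkSession via
# SparkSession(ctx), which tried to initiate a second Hive client and
# failed when Hive support was enabled.
ImageSchema.readImages("data/mllib/images")
```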
Seems we'd better stick to `SparkSession.builder.getOrCreate()`, as in:
- spark/python/pyspark/sql/streaming.py (line 329 in 51620e2)
- spark/python/pyspark/sql/column.py (line 541 in dc5d34d)
- spark/python/pyspark/sql/readwriter.py (line 105 in 33d43bf)
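In other words, the fix swaps direct `SparkSession` construction for the builder. A simplified sketch of the two patterns, not the verbatim diff:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Make sure a context/session exists for the demonstration.
SparkSession.builder.master("local[2]").appName("demo").getOrCreate()

# Problematic pattern (pre-fix): wraps the active SparkContext in a brand-new
# SparkSession on every call; with Hive support enabled, this can try to
# initiate a second Hive client and fail.
ctx = SparkContext._active_spark_context
spark_new = SparkSession(ctx)

# Preferred pattern (this PR): reuse the active session if one exists.
spark = SparkSession.builder.getOrCreate()
```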
How was this patch tested?
This was tested as below in the PySpark shell:
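A sketch of the manual check; the original shell transcript is not preserved in this copy, and the path is illustrative:

```python
from pyspark.ml.image import ImageSchema

# With the fix, both calls resolve to the same SparkSession through
# SparkSession.builder.getOrCreate(), so no second Hive client is created
# and repeated invocations succeed.
ImageSchema.readImages("data/mllib/images")
ImageSchema.readImages("data/mllib/images")
```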