[SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers#45269
xinrong-meng wants to merge 12 commits into apache:master
HyukjinKwon
left a comment
Shall we add the API into python/docs/source/reference/pyspark.sql/spark_session.rst as well?
I was looking for the API doc. Thank you, @HyukjinKwon!
SparkSession.createDataFrame
SparkSession.getActiveSession
SparkSession.newSession
SparkSession.profile
I think we should also have a dedicated section for profile.show, profile.dump.
I hit
[autosummary] failed to import pyspark.sql.SparkSession.profile.dump.
Possible hints:
* AttributeError: 'property' object has no attribute 'dump'
* ImportError:
* ModuleNotFoundError: No module named 'pyspark.sql.SparkSession'
The profile property returns a Profile class instance, so Sphinx might have difficulty accessing it. Do you happen to know the best way to resolve that?
Hmm, I was thinking the same, but it kept failing with the error message.
I think SparkSession.builder works because it is a classproperty whereas profile is a property of SparkSession.
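The difference can be reproduced outside Spark. The sketch below uses hypothetical stand-in classes (`Session`, `Profile`, `Builder`) and a minimal `classproperty` descriptor written for illustration. It is not PySpark's actual implementation, only a demonstration of how Python resolves the two kinds of attribute when they are accessed on the class, which is how autosummary imports a dotted path.

```python
class classproperty:
    """Minimal class-level property (illustrative; not PySpark's code)."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner):
        # Called for both Session.builder and Session().builder.
        return self.fget(owner)


class Builder:  # hypothetical stand-in
    pass


class Profile:  # hypothetical stand-in
    def dump(self, path):
        return f"dumped to {path}"


class Session:  # hypothetical stand-in for SparkSession
    @classproperty
    def builder(cls):
        return Builder()

    @property
    def profile(self):
        return Profile()


# Accessed on the class, a classproperty evaluates and returns its value...
builder_attr = Session.builder

# ...but a plain property returns the property object itself.
profile_attr = Session.profile

try:
    Session.profile.dump
    error = None
except AttributeError as exc:
    error = str(exc)

print(type(builder_attr).__name__)
print(type(profile_attr).__name__)
print(error)
```

On an instance, `Session().profile.dump(...)` resolves normally, which is why the API works at runtime even though class-level lookup of `profile.dump` fails.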
I have a workaround 76e7387 by using autoclass, but it doesn't look consistent with the rest of the page, as shown below.
I'm wondering if we should have a follow-up dedicated to that part.
Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
Python/Pandas UDFs.
SparkContext-based
I think you can just remove this and add one additional section called "runtime profiler".
How about putting the new doc first?
- Identifying Hot Loops (Python Profilers)
  - Driver Side
    ...
  - Executor Side
    - Python/Pandas UDF
      Show the new profiler usage
    - Legacy (for RDD or non-Spark Connect)
      Put the current doc here
      - Python/Pandas UDF
      - Driver Side
I believe there are many existing users of SparkContext-based profilers. Shall we keep them in the debugging guide until SparkSession-based profilers gain more adoption and positive feedback? I'll adjust the order to show SparkSession-based profilers first, as @ueshin suggested. What do you think, @HyukjinKwon?
We will remove the "legacy" profilers for readability and clarity, and start preparing a migration guide.
Marked as WIP to wait for #45378 to be merged first, and will adjust afterwards.
Merged to master.

What changes were proposed in this pull request?
Documentation for SparkSession-based Profilers.
Why are the changes needed?
For easier user onboarding and better usability.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual test. Screenshots of the built HTML pages are shown below.
Was this patch authored or co-authored using generative AI tooling?
No.