[SPARK-35624][CORE]Support reading inputbytes and inputrecords of the… #33874
Conversation
@cloud-fan hi, please take a look.

+CC @cloud-fan, @attilapiros who took a look at this last.

ok to test

Test build #142907 has finished for PR 33874 at commit

Kubernetes integration test starting

Kubernetes integration test status failure
@allendang001 have you checked the Python code? Especially spark/python/pyspark/status.py (lines 29 to 31 in 387a251).

I am not familiar with the Python code; could you help me complete this part?

cc @HyukjinKwon
Just adding these fields into:

diff --git a/python/pyspark/status.py b/python/pyspark/status.py
index a6fa7dd3144..f342ee38a2d 100644
--- a/python/pyspark/status.py
+++ b/python/pyspark/status.py
@@ -28,7 +28,7 @@ class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):
class SparkStageInfo(namedtuple("SparkStageInfo",
"stageId currentAttemptId name numTasks numActiveTasks "
- "numCompletedTasks numFailedTasks")):
+ "numCompletedTasks numFailedTasks inputBytes inputRecords")):
"""
Exposes information about Spark Stages.
"""
diff --git a/python/pyspark/status.pyi b/python/pyspark/status.pyi
index 0558e245f49..8ea885693bb 100644
--- a/python/pyspark/status.pyi
+++ b/python/pyspark/status.pyi
@@ -32,6 +32,8 @@ class SparkStageInfo(NamedTuple):
numActiveTasks: int
numCompletedTasks: int
numFailedTasks: int
+ inputBytes: int
+ inputRecords: int
class StatusTracker:
def __init__(self, jtracker: JavaObject) -> None: ...
diff --git a/python/pyspark/tests/test_context.py b/python/pyspark/tests/test_context.py
index 4611d038f96..2c28fbabcc8 100644
--- a/python/pyspark/tests/test_context.py
+++ b/python/pyspark/tests/test_context.py
@@ -239,6 +239,8 @@ class ContextTests(unittest.TestCase):
self.assertEqual(1, len(job.stageIds))
stage = tracker.getStageInfo(job.stageIds[0])
self.assertEqual(rdd.getNumPartitions(), stage.numTasks)
+ self.assertGreater(stage.inputBytes, 0)
+ self.assertEqual(10, stage.inputRecords)
sc.cancelAllJobs()
t.join()

BTW, please keep the GitHub PR template as is (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE), and describe, under "Does this PR introduce any user-facing change?", which interface it adds, preferably with an example. Also, please add a test at
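To make the shape of the proposed change concrete, here is a minimal, standalone sketch of the extended SparkStageInfo namedtuple from the diff above. The field names match the patch, but the constructed values are made up for illustration, and this sketch deliberately does not import pyspark.

```python
from collections import namedtuple

# Extended namedtuple matching the diff above: the last two fields,
# inputBytes and inputRecords, are the ones this PR proposes to add.
class SparkStageInfo(namedtuple("SparkStageInfo",
                                "stageId currentAttemptId name numTasks numActiveTasks "
                                "numCompletedTasks numFailedTasks inputBytes inputRecords")):
    """Exposes information about Spark stages, including the new input metrics."""

# Construct a stage info the way the JVM-side status tracker might populate it
# (the concrete values here are invented for the example).
stage = SparkStageInfo(
    stageId=0, currentAttemptId=0, name="collect at demo.py:10",
    numTasks=4, numActiveTasks=0, numCompletedTasks=4, numFailedTasks=0,
    inputBytes=2048, inputRecords=10)

print(stage.inputBytes, stage.inputRecords)  # -> 2048 10
```

Because namedtuple construction is positional, appending the two new fields at the end keeps any existing positional callers working, which is presumably why the diff adds them last.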
Oh, and we should fix the Python example here: spark/examples/src/main/python/status_api_demo.py (lines 59 to 60 in 0494dc9).

Lastly, Apache Spark leverages the resources of GitHub Actions in your forked repository to test your PR. Please enable it; see also https://github.com/apache/spark/pull/33874/checks?check_run_id=3471668214.

@HyukjinKwon thanks a lot
Test build #143111 has finished for PR 33874 at commit

Force-pushed from ebe9447 to c67a9dd

Test build #143112 has finished for PR 33874 at commit
I have submitted the code, @HyukjinKwon PTAL.

Kubernetes integration test starting

Kubernetes integration test status failure

Kubernetes integration test unable to build dist. Exiting with code: 1
… stage in spark context interface
Force-pushed from e49e05e to 16e6323
Test build #143143 has finished for PR 33874 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

retest this please

@allendang001 mind enabling GitHub Actions in your forked repository? See also https://github.com/apache/spark/pull/33874/checks?check_run_id=3565581644

Test build #143176 has finished for PR 33874 at commit

Kubernetes integration test starting

Kubernetes integration test status failure
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Add inputBytes and inputRecords fields to SparkStageInfo.
Why are the changes needed?
One of our projects needs to count the amount of data scanned and the number of rows scanned during the execution of Spark SQL statements, but the current version of Spark does not provide an interface to view these metrics, so I want to obtain them through the SparkContext status-tracking interface.
Does this PR introduce any user-facing change?
Yes. It exposes a new interface in the Spark context: SparkStageInfo now includes inputBytes and inputRecords.
How was this patch tested?
Manual test.
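If the change lands, user code would read the new metrics through the existing status-tracker path (sc.statusTracker().getStageInfo(...) in PySpark). The sketch below shows that usage pattern with a tiny stand-in tracker so it runs without a Spark cluster; FakeStatusTracker, report_input_metrics, and all values are invented for illustration, while the getStageInfo call shape mirrors the real PySpark API.

```python
from collections import namedtuple

# Stage info shape proposed by this PR (last two fields are new).
SparkStageInfo = namedtuple(
    "SparkStageInfo",
    "stageId currentAttemptId name numTasks numActiveTasks "
    "numCompletedTasks numFailedTasks inputBytes inputRecords")

class FakeStatusTracker:
    """Stand-in for pyspark's StatusTracker, returning canned stage info."""
    def __init__(self, stages):
        self._stages = stages

    def getStageInfo(self, stage_id):
        # The real tracker returns None for unknown stages; mirror that.
        return self._stages.get(stage_id)

def report_input_metrics(tracker, stage_ids):
    """Sum input metrics across stages, the way a monitoring hook might."""
    total_bytes = total_records = 0
    for sid in stage_ids:
        info = tracker.getStageInfo(sid)
        if info is not None:
            total_bytes += info.inputBytes
            total_records += info.inputRecords
    return total_bytes, total_records

tracker = FakeStatusTracker({
    0: SparkStageInfo(0, 0, "scan", 4, 0, 4, 0, 1024, 6),
    1: SparkStageInfo(1, 0, "agg", 2, 0, 2, 0, 512, 4),
})
print(report_input_metrics(tracker, [0, 1]))  # -> (1536, 10)
```

With a real SparkContext, `tracker` would instead come from `sc.statusTracker()` and the stage IDs from `tracker.getJobInfo(job_id).stageIds`.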