
Conversation

allendang001 commented Aug 31, 2021

What changes were proposed in this pull request?
Add inputBytes and inputRecords fields to SparkStageInfo.

Why are the changes needed?
One of our projects needs to count the amount of data scanned and the number of rows read while a Spark SQL statement is executing. The current version of Spark does not provide an interface to view these metrics, so this change exposes them through the SparkContext status tracker interface.

Does this PR introduce any user-facing change?
Yes. SparkStageInfo, returned by the SparkContext status tracker, gains two new fields: inputBytes and inputRecords (see the sketch below).
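
For illustration, here is a minimal PySpark sketch of how the fields could be read once they are exposed as proposed. The background thread, the sleep, the toy RDD, and the app name are illustrative only; reading from files (e.g. via sc.textFile) is what would actually populate the input metrics.

import threading
import time
from pyspark import SparkContext

sc = SparkContext(appName="stage-input-metrics-demo")  # app name is illustrative
rdd = sc.parallelize(range(10), 2).map(lambda x: x * 2)

# Run the job in a background thread so the driver can poll stage metrics.
t = threading.Thread(target=rdd.count)
t.start()
time.sleep(1)  # give the job a moment to be submitted; a real program would poll

tracker = sc.statusTracker()
for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info is not None:
        # inputBytes and inputRecords are the fields proposed in this PR.
        print("Stage %d has read %d bytes / %d records so far"
              % (stage_id, info.inputBytes, info.inputRecords))
t.join()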

How was this patch tested?
Manual test

@allendang001 (Author)

@cloud-fan hi, please take a look.

mridulm (Contributor) commented Sep 1, 2021

+CC @cloud-fan, @attilapiros who took a look at this last.
I am fine with the change in general.

@cloud-fan (Contributor)

ok to test

SparkQA commented Sep 1, 2021

Test build #142907 has finished for PR 33874 at commit fb66ab3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47410/

SparkQA commented Sep 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47410/

@attilapiros (Contributor)

@allendang001 have you checked the Python code?

Especially the status.py:

class SparkStageInfo(namedtuple("SparkStageInfo",
                                "stageId currentAttemptId name numTasks numActiveTasks "
                                "numCompletedTasks numFailedTasks")):

allendang001 (Author) commented Sep 6, 2021

@allendang001 have you checked the Python code?

Especially the status.py:

class SparkStageInfo(namedtuple("SparkStageInfo",
                                "stageId currentAttemptId name numTasks numActiveTasks "
                                "numCompletedTasks numFailedTasks")):

I am not familiar with the Python code. Could you help me complete this part?

@cloud-fan (Contributor)

cc @HyukjinKwon

@HyukjinKwon (Member)

Just adding these fields into status.py should be enough, e.g.:

diff --git a/python/pyspark/status.py b/python/pyspark/status.py
index a6fa7dd3144..f342ee38a2d 100644
--- a/python/pyspark/status.py
+++ b/python/pyspark/status.py
@@ -28,7 +28,7 @@ class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):

 class SparkStageInfo(namedtuple("SparkStageInfo",
                                 "stageId currentAttemptId name numTasks numActiveTasks "
-                                "numCompletedTasks numFailedTasks")):
+                                "numCompletedTasks numFailedTasks inputBytes inputRecords")):
     """
     Exposes information about Spark Stages.
     """
diff --git a/python/pyspark/status.pyi b/python/pyspark/status.pyi
index 0558e245f49..8ea885693bb 100644
--- a/python/pyspark/status.pyi
+++ b/python/pyspark/status.pyi
@@ -32,6 +32,8 @@ class SparkStageInfo(NamedTuple):
     numActiveTasks: int
     numCompletedTasks: int
     numFailedTasks: int
+    inputBytes: int
+    inputRecords: int

 class StatusTracker:
     def __init__(self, jtracker: JavaObject) -> None: ...
diff --git a/python/pyspark/tests/test_context.py b/python/pyspark/tests/test_context.py
index 4611d038f96..2c28fbabcc8 100644
--- a/python/pyspark/tests/test_context.py
+++ b/python/pyspark/tests/test_context.py
@@ -239,6 +239,8 @@ class ContextTests(unittest.TestCase):
             self.assertEqual(1, len(job.stageIds))
             stage = tracker.getStageInfo(job.stageIds[0])
             self.assertEqual(rdd.getNumPartitions(), stage.numTasks)
+            self.assertGreater(stage.inputBytes, 0)
+            self.assertEqual(10, stage.inputRecords)

             sc.cancelAllJobs()
             t.join()

BTW, please keep the GitHub PR template as is (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE), and under "Does this PR introduce any user-facing change?" describe which interface it adds, preferably with an example.

Also, please add a test at StatusTrackerSuite.

@HyukjinKwon (Member)

Oh, and we should fix the Python example here:

print("Stage %d: %d tasks total (%d active, %d complete)" %
      (sid, info.numTasks, info.numActiveTasks, info.numCompletedTasks))

just like you did in Java.
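
One possible shape for the updated print (a sketch only; the exact wording is up to the author, and inputBytes/inputRecords assume the fields proposed in this PR):

print("Stage %d: %d tasks total (%d active, %d complete), "
      "%d input bytes, %d input records" %
      (sid, info.numTasks, info.numActiveTasks, info.numCompletedTasks,
       info.inputBytes, info.inputRecords))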

Lastly, Apache Spark leverages the resources of GitHub Actions in your forked repository to test your PR. Please enable it; see also https://github.com/apache/spark/pull/33874/checks?check_run_id=3471668214.

allendang001 (Author) commented Sep 9, 2021

StatusTrackerSuite

@HyukjinKwon thanks a lot

github-actions bot added the PYTHON label on Sep 9, 2021
SparkQA commented Sep 9, 2021

Test build #143111 has finished for PR 33874 at commit ebe9447.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

allendang001 force-pushed the add-input-bytes-and-records branch from ebe9447 to c67a9dd on September 9, 2021 06:52
SparkQA commented Sep 9, 2021

Test build #143112 has finished for PR 33874 at commit c67a9dd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@allendang001 (Author)

I have submitted the code, @HyukjinKwon PTAL.

SparkQA commented Sep 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47615/

SparkQA commented Sep 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47615/

SparkQA commented Sep 9, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47617/

allendang001 force-pushed the add-input-bytes-and-records branch from e49e05e to 16e6323 on September 10, 2021 09:11
SparkQA commented Sep 10, 2021

Test build #143143 has finished for PR 33874 at commit 16e6323.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47647/

SparkQA commented Sep 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47647/

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

@allendang001 mind enabling GitHub Actions in your forked repository? See also https://github.com/apache/spark/pull/33874/checks?check_run_id=3565581644

SparkQA commented Sep 12, 2021

Test build #143176 has finished for PR 33874 at commit 16e6323.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47679/

SparkQA commented Sep 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47679/

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Dec 22, 2021
github-actions bot closed this on Dec 23, 2021