[SPARK-39922][CORE][TESTS] Introduce IteratorSizeBenchmark for Utils#getIteratorSize method #37339

Closed
LuciferYang wants to merge 2 commits into apache:master from LuciferYang:SPARK-39922

Conversation

LuciferYang (Contributor) commented Jul 29, 2022

What changes were proposed in this pull request?

This PR adds a micro-benchmark for the o.a.spark.util.Utils#getIteratorSize method.

Why are the changes needed?

/**
* Counts the number of elements of an iterator using a while loop rather than calling
* [[scala.collection.Iterator#size]] because it uses a for loop, which is slightly slower
* in the current version of Scala.
*/
def getIteratorSize(iterator: Iterator[_]): Long = {

According to the method comment, Utils#getIteratorSize was added because scala.collection.Iterator#size used a for loop, which was slightly slower in the version of Scala that was current at the time.

When this method was added, Spark used Scala 2.10. Spark currently uses Scala 2.12. This PR introduces IteratorSizeBenchmark to check that the conclusion still holds when the major version of Scala is upgraded; if it does not, we should use Iterator#size directly.
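The comparison described above can be sketched without any Spark dependency. This is only a rough illustration, not the IteratorSizeBenchmark added by this PR: the object name IteratorSizeSketch and the helpers countWithWhileLoop and bestTimeNs are hypothetical, and a plain System.nanoTime loop is far less reliable than a proper benchmark harness.

```scala
object IteratorSizeSketch {
  // Accumulates results so the JIT is less likely to discard the measured work.
  private var sink = 0L

  // While-loop counter, following what the Scaladoc of Utils#getIteratorSize describes.
  def countWithWhileLoop(iterator: Iterator[_]): Long = {
    var count = 0L
    while (iterator.hasNext) {
      count += 1L
      iterator.next()
    }
    count
  }

  // Hypothetical helper: evaluates `body` `iterations` times and returns the
  // best wall-clock time of a single evaluation, in nanoseconds.
  def bestTimeNs(iterations: Int)(body: => Long): Long = {
    var best = Long.MaxValue
    var i = 0
    while (i < iterations) {
      val start = System.nanoTime()
      sink += body
      val elapsed = System.nanoTime() - start
      if (elapsed < best) best = elapsed
      i += 1
    }
    best
  }

  def main(args: Array[String]): Unit = {
    val data = Array.range(0, 10000)
    // A fresh iterator is created for every run because counting consumes it.
    val whileNs = bestTimeNs(1000)(countWithWhileLoop(data.iterator))
    val sizeNs = bestTimeNs(1000)(data.iterator.size.toLong)
    println(s"while loop:    $whileNs ns")
    println(s"Iterator.size: $sizeNs ns")
  }
}
```

The absolute numbers printed by such a sketch depend heavily on the JVM, warm-up, and Scala version, which is the reason for having a dedicated benchmark.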

Does this PR introduce any user-facing change?

No, this is a test-only change.

How was this patch tested?

Pass GitHub Actions.

LuciferYang changed the title from "[SPARK-39922][CORE][TESTS] Introduce IteratorSizeBenchmark for Utils#getIteratorSize method" to "[WIP][SPARK-39922][CORE][TESTS] Introduce IteratorSizeBenchmark for Utils#getIteratorSize method" on Jul 29, 2022
github-actions bot added the CORE label on Jul 29, 2022

LuciferYang commented Jul 29, 2022

According to the method comment, Utils#getIteratorSize was added because scala.collection.Iterator#size used a for loop, which was slightly slower in the version of Scala that was current at the time.

In fact, this conclusion may no longer hold with Scala 2.13 (I will update this PR with the test results).

LuciferYang commented

The results of IteratorSizeBenchmark with Scala 2.13 are as follows. Utils.getIteratorSize is slower than Iterator.size in every case (Relative of 0.4X or lower), and the gap grows with the iterator size:

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Range iterator size 10:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     0              1           0        259.9           3.8       1.0X
Use Utils.getIteratorSize                             1              2           0         72.8          13.7       0.3X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Range iterator size 100:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     2              2           0         55.4          18.0       1.0X
Use Utils.getIteratorSize                             9              9           0         11.6          86.5       0.2X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Range iterator size 1000:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     2              2           0         54.4          18.4       1.0X
Use Utils.getIteratorSize                            83             83           0          1.2         827.0       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Range iterator size 10000:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     2              2           0         54.0          18.5       1.0X
Use Utils.getIteratorSize                           818            820           2          0.1        8181.3       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Range iterator size 30000:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     2              2           0         57.6          17.4       1.0X
Use Utils.getIteratorSize                          2437           2448          16          0.0       24367.8       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Seq iterator size 10:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     3              4           0         30.0          33.4       1.0X
Use Utils.getIteratorSize                             9             10           1         10.8          92.2       0.4X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Seq iterator size 100:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     3              4           0         28.8          34.7       1.0X
Use Utils.getIteratorSize                            80             82           1          1.3         798.8       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Seq iterator size 1000:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     4              4           1         25.1          39.8       1.0X
Use Utils.getIteratorSize                           751            753           2          0.1        7513.1       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Seq iterator size 10000:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     4              4           0         25.3          39.5       1.0X
Use Utils.getIteratorSize                          7659           7676          24          0.0       76589.3       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Seq iterator size 30000:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     4              4           0         24.8          40.4       1.0X
Use Utils.getIteratorSize                         23008          23025          24          0.0      230080.0       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Array iterator size 10:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     1              1           0        166.5           6.0       1.0X
Use Utils.getIteratorSize                             6              7           0         15.5          64.7       0.1X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Array iterator size 100:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     0              0           0        315.6           3.2       1.0X
Use Utils.getIteratorSize                            56             56           0          1.8         559.3       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Array iterator size 1000:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     0              0           0        309.1           3.2       1.0X
Use Utils.getIteratorSize                           613            617           2          0.2        6134.1       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Array iterator size 10000:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     0              0           0        306.4           3.3       1.0X
Use Utils.getIteratorSize                          6220           6237          25          0.0       62196.7       0.0X

OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test Array iterator size 30000:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Use Iterator.size                                     0              0           0        316.8           3.2       1.0X
Use Utils.getIteratorSize                         18712          18739          39          0.0      187118.4       0.0X
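Because the Relative column is rounded to one decimal place, most rows show 0.0X and hide the actual size of the gap. A small sketch of how the factor can be recovered from the Per Row(ns) column, assuming Relative is the baseline per-row time divided by the candidate per-row time (which matches 3.8 / 13.7 ≈ 0.3 in the first table). The object and method names here are hypothetical; the numbers are copied from the tables above.

```scala
object RelativeColumnSketch {
  // Relative speed as it appears in the "Relative" column (assumed definition):
  // baseline per-row time divided by candidate per-row time.
  def relative(baselineNs: Double, candidateNs: Double): Double =
    baselineNs / candidateNs

  // The inverse view: how many times slower the candidate is than the baseline.
  def slowdown(baselineNs: Double, candidateNs: Double): Double =
    candidateNs / baselineNs

  def main(args: Array[String]): Unit = {
    // (case, Iterator.size Per Row ns, Utils.getIteratorSize Per Row ns)
    val rows = Seq(
      ("Range 10", 3.8, 13.7),
      ("Range 1000", 18.4, 827.0),
      ("Seq 30000", 40.4, 230080.0),
      ("Array 30000", 3.2, 187118.4))
    rows.foreach { case (name, base, cand) =>
      val rel = relative(base, cand)
      val slow = slowdown(base, cand)
      println(f"$name%-12s relative=$rel%.3fX slowdown=$slow%.0fx")
    }
  }
}
```

Read this way, the per-row time of Utils.getIteratorSize grows roughly in proportion to the iterator size, while that of Iterator.size stays nearly flat.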

@@ -0,0 +1,105 @@
OpenJDK 64-Bit Server VM 1.8.0_342-b07 on Linux 5.15.0-1014-azure
LuciferYang commented

/**
* Counts the number of elements of an iterator using a while loop rather than calling
* [[scala.collection.Iterator#size]] because it uses a for loop, which is slightly slower
* in the current version of Scala.
*/
def getIteratorSize(iterator: Iterator[_]): Long = {

Based on the benchmark results, the conclusion above holds only when Scala 2.12 and Java 8 are used.

LuciferYang changed the title from "[WIP][SPARK-39922][CORE][TESTS] Introduce IteratorSizeBenchmark for Utils#getIteratorSize method" to "[SPARK-39922][CORE][TESTS] Introduce IteratorSizeBenchmark for Utils#getIteratorSize method" on Jul 29, 2022

LuciferYang commented Jul 29, 2022

If we make the following change, the performance of Utils#getIteratorSize improves significantly with Scala 2.13, but it works for Scala 2.13 only. Should we make this change? It would need to distinguish between Scala versions, because Scala 2.12 has no knownSize method. I also haven't found a way to optimize this for Scala 2.12.

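def getIteratorSize(iterator: Iterator[_]): Long = {
  // knownSize (Scala 2.13 only) is -1 when the size cannot be computed cheaply;
  // 0 is a valid known size, so compare with >= 0 rather than > 0.
  val known = iterator.knownSize
  if (known >= 0) {
    known.toLong
  } else {
    // Fall back to counting the elements one by one.
    var count = 0L
    while (iterator.hasNext) {
      count += 1L
      iterator.next()
    }
    count
  }
}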
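To illustrate when the knownSize shortcut applies, here is a minimal sketch that requires Scala 2.13. It is not Spark code: the object name KnownSizeSketch is hypothetical, and the method only mirrors the shape of the proposal above. An iterator over an array can report its size cheaply, while an iterator produced by filter cannot and reports -1, so it takes the counting path.

```scala
object KnownSizeSketch {
  // Same shape as the proposal above; compiles only on Scala 2.13+,
  // where IterableOnce#knownSize exists.
  def getIteratorSize(iterator: Iterator[_]): Long = {
    val known = iterator.knownSize
    if (known >= 0) {
      known.toLong
    } else {
      var count = 0L
      while (iterator.hasNext) {
        count += 1L
        iterator.next()
      }
      count
    }
  }

  def main(args: Array[String]): Unit = {
    val fromArray = Array(1, 2, 3).iterator
    // filter cannot know how many elements will pass without evaluating them.
    val filtered = Iterator(1, 2, 3, 4).filter(_ % 2 == 0)
    println(s"array iterator knownSize    = ${fromArray.knownSize}")
    println(s"filtered iterator knownSize = ${filtered.knownSize}")
    println(s"array iterator size    = ${getIteratorSize(fromArray)}")
    println(s"filtered iterator size = ${getIteratorSize(filtered)}")
  }
}
```

One behavioral difference to keep in mind: the shortcut path returns without advancing the iterator, whereas the counting path consumes it.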

HyukjinKwon (Member) left a comment

I feel like this is too narrow to benchmark. The observed perf diff was minor too. Also given that it heavily depends on Scala version, I guess nothing we can do even when it's slower.

LuciferYang commented

> I feel like this is too narrow to benchmark. The observed perf diff was minor too. Also given that it heavily depends on Scala version, I guess nothing we can do even when it's slower.

OK, I will close this PR and optimize the implementation of this method once Scala 2.13 is the default version.
