
[SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 #24157

Closed
wants to merge 4 commits

Conversation

LantaoJin
Contributor

What changes were proposed in this pull request?

HighlyCompressedMapStatus uses a RoaringBitmap to record the empty blocks, but a RoaringBitmap cannot be serialized/deserialized correctly with the unsafe KryoSerializer.
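For context, a rough sketch (not the actual Spark implementation) of the idea: each reduce partition whose block size is zero is recorded in a RoaringBitmap, so a bitmap that does not survive the unsafe Kryo round-trip corrupts the record of which shuffle blocks are empty.

import org.roaringbitmap.RoaringBitmap

// Simplified sketch only: collect the indices of zero-size blocks in a bitmap,
// mirroring the empty-blocks bookkeeping described above.
def emptyBlocksOf(blockSizes: Array[Long]): RoaringBitmap = {
  val emptyBlocks = new RoaringBitmap()
  blockSizes.indices.foreach { i =>
    if (blockSizes(i) == 0L) emptyBlocks.add(i)
  }
  emptyBlocks
}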

How was this patch tested?

Added a UT:

  test("kryo serialization with RoaringBitmap") {
    val bitmap = new RoaringBitmap
    bitmap.add(1787)

    val safeSer = new KryoSerializer(conf).newInstance()
    val bitmap2 : RoaringBitmap = safeSer.deserialize(safeSer.serialize(bitmap))
    assert(bitmap2.equals(bitmap))

    conf.set("spark.kryo.unsafe", "true")
    val unsafeSer = new KryoSerializer(conf).newInstance()
    val bitmap3 : RoaringBitmap = unsafeSer.deserialize(unsafeSer.serialize(bitmap))
    assert(bitmap3.equals(bitmap)) // this will fail
  }

@LantaoJin
Contributor Author

This UT only demonstrates the problem after #24156 is fixed. Before that, it is easy to reproduce by replacing conf.set("spark.kryo.unsafe", "true") with conf.set("spark.kyro.unsafe", "true").

@LantaoJin
Contributor Author

Since the current RoaringBitmap cannot be serialized/deserialized correctly by the unsafe KryoSerializer, the first fix I can think of is replacing this data structure entirely, or at least when unsafe Kryo is used. What do you think?
cc @dongjoon-hyun @vanzin @squito @gatorsmile

@srowen
Member

srowen commented Mar 20, 2019

Does this need to be serialized? I wouldn't think so if it doesn't work!

val bitmap2 : RoaringBitmap = safeSer.deserialize(safeSer.serialize(bitmap))
assert(bitmap2.equals(bitmap))

conf.set("spark.kryo.unsafe", "true")
Contributor

@attilapiros attilapiros Mar 20, 2019

You are changing the conf which is also used by other tests within the suite, so now the execution order of these tests matters. If the execution starts with this test and the others run later, they might fail.
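A minimal sketch of an alternative (assuming the suite's shared conf: SparkConf field and the imports used by the test above): clone the conf so the unsafe setting does not leak into the other tests.

test("kryo serialization with RoaringBitmap (unsafe, isolated conf)") {
  val bitmap = new RoaringBitmap
  bitmap.add(1787)

  // Clone the shared conf so "spark.kryo.unsafe" does not affect other tests.
  val unsafeConf = conf.clone.set("spark.kryo.unsafe", "true")
  val unsafeSer = new KryoSerializer(unsafeConf).newInstance()
  val bitmap2: RoaringBitmap = unsafeSer.deserialize(unsafeSer.serialize(bitmap))
  assert(bitmap2.equals(bitmap))
}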

Contributor Author

It can be moved to a completely new suite. I will update it.

@AmplabJenkins

Can one of the admins verify this patch?

@squito
Contributor

squito commented Mar 20, 2019

@srowen I think this is only necessary with "spark.kryo.unsafe=true" -- it probably never worked with that configuration before, but did work with the default "spark.kryo.unsafe=false"

Err, scratch that, I was looking at entirely the wrong thing. I'm also confused here -- so far, this change is just the failing UT, right? You will add the actual fix to the behavior as part of this PR?

@LantaoJin
Contributor Author

LantaoJin commented Mar 21, 2019

@srowen @squito I've added another UT, which uses a minimized dataset from our production issue.
In this UT, I had to comment out one line in ShuffleBlockFetcherIterator to avoid the job failing:

if (buf.size == 0) {
  // throwFetchFailedException(blockId, address, new IOException(msg))
}

After that, the test "fail zero-size blocks" in ShuffleBlockFetcherIteratorSuite will fail. That check was introduced by #21219, so in Spark 2.3.x this UT does not need the hard-coded commenting.

@squito
Contributor

squito commented Mar 22, 2019

@LantaoJin I'm still confused by the status of this -- it seems it's just test changes, not behavior changes, but it sounds like you are saying some behavior is just broken. It's labeled as a WIP, but you've also pinged people for review. Are you looking for help in determining the right fix? If so, it would help us if you could give a more complete description of what goes wrong. I don't see anything obviously wrong with unsafe kryo and roaring bitmap -- you could try serializing a tiny bitmap and see if the bits make sense.

Or do you believe this by itself is actually the complete change?
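A minimal sketch of that byte-level check (assuming a standalone SparkConf; names and output handling are illustrative only, not part of this PR):

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.roaringbitmap.RoaringBitmap

// Serialize a tiny bitmap with the safe and the unsafe Kryo configuration and
// dump the raw bytes so any point of divergence is visible.
val bitmap = new RoaringBitmap
bitmap.add(1787)

def kryoBytes(unsafe: Boolean): Array[Byte] = {
  val conf = new SparkConf().set("spark.kryo.unsafe", unsafe.toString)
  val buf = new KryoSerializer(conf).newInstance().serialize(bitmap)
  val bytes = new Array[Byte](buf.remaining())
  buf.get(bytes)
  bytes
}

println("safe  : " + kryoBytes(unsafe = false).map("%02x".format(_)).mkString(" "))
println("unsafe: " + kryoBytes(unsafe = true).map("%02x".format(_)).mkString(" "))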

@srowen srowen closed this Mar 29, 2019
@LantaoJin LantaoJin changed the title [WIP][SPARK-27216][CORE] Kryo serialization with RoaringBitmap [SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 Apr 1, 2019
@LantaoJin
Contributor Author

LantaoJin commented Apr 1, 2019

How can I reopen this closed ticket? @srowen I have now found the root cause: it is a bug in RoaringBitmap that is fixed in the latest version (the unit tests above illustrate it). I renamed this PR and am going to push a new commit.

@LantaoJin
Contributor Author

LantaoJin commented Apr 1, 2019

SQLQueryWithKryoSuite is overkill for this PR but is useful for illustrating the problem, so I will remove it from the code and keep it in this comment. After upgrading to the latest RoaringBitmap version, the UT below passes.

package org.apache.spark.sql

import org.apache.spark.internal.config
import org.apache.spark.internal.config.Kryo._
import org.apache.spark.internal.config.SERIALIZER
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSQLContext

class SQLQueryWithKryoSuite extends QueryTest with SharedSQLContext {

  override protected def sparkConf = super.sparkConf
    .set(SERIALIZER, "org.apache.spark.serializer.KryoSerializer")
    .set(KRYO_USE_UNSAFE, true)

  test("kryo unsafe data quality issue") {
    // This issue can be reproduced when
    // 1. Enable KryoSerializer
    // 2. Set spark.kryo.unsafe to true
    // 3. Use HighlyCompressedMapStatus since it uses RoaringBitmap
    // 4. Set spark.sql.shuffle.partitions to 6000; 6000 triggers the issue with the supplied data
    // 5. Comment out the zero-size block fetch-fail exception in ShuffleBlockFetcherIterator,
    //    or this job will fail with FetchFailedException.
    withSQLConf(
      SQLConf.SHUFFLE_PARTITIONS.key -> "6000",
      config.SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS.key -> "-1") {
      withTempView("t") {
        val df = spark.read.parquet(testFile("test-data/dates.parquet")).toDF("date")
        df.createOrReplaceTempView("t")
        checkAnswer(
          sql("SELECT COUNT(*) FROM t"),
          sql(
            """
              |SELECT SUM(a) FROM
              |(
              |SELECT COUNT(*) a, date
              |FROM t
              |GROUP BY date
              |)
            """.stripMargin))
      }
    }
  }
}

@srowen
Member

srowen commented Apr 1, 2019

@LantaoJin you should be able to reopen this, or it will reopen if you push a new commit.

@LantaoJin
Contributor Author

Sorry, I cannot reopen it because of a force push. I opened #24264 as an update.

@srowen
Member

srowen commented Apr 1, 2019

That's fine, I can reopen them too, but you already have a new PR.
