[MINOR] Test case for hoodie.merge.allow.duplicate.on.inserts #6949

Merged (5 commits, Oct 19, 2022)
@@ -901,4 +901,47 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
}
}
}

  test("Test enable hoodie.merge.allow.duplicate.on.inserts when write") {
    spark.sql("set hoodie.datasource.write.operation = insert")
    Seq("mor", "cow").foreach { tableType =>
      withTempDir { tmp =>
        val tableName = generateTableName
        spark.sql(
          s"""
             | create table $tableName (
             |   id int,
             |   name string,
             |   price double,
             |   ts long,
             |   dt string
             | ) using hudi
             | partitioned by (dt)
             | location '${tmp.getCanonicalPath}/$tableName'
             | tblproperties (
             |   primaryKey = 'id',
             |   preCombineField = 'ts',
             |   type = '$tableType'
             | )
           """.stripMargin)
        spark.sql(s"insert into $tableName partition(dt='2021-12-25') values (1, 'a1', 10, 1000)")
        checkAnswer(s"select id, name, price, ts, dt from $tableName")(
          Seq(1, "a1", 10, 1000, "2021-12-25")
        )
        spark.sql("set hoodie.merge.allow.duplicate.on.inserts = false")
        spark.sql(s"insert into $tableName partition(dt='2021-12-25') values (1, 'a2', 20, 1001)")
        checkAnswer(s"select id, name, price, ts, dt from $tableName")(
          Seq(1, "a2", 20, 1001, "2021-12-25")
        )
        spark.sql("set hoodie.merge.allow.duplicate.on.inserts = true")
        spark.sql(s"insert into $tableName partition(dt='2021-12-25') values (1, 'a3', 30, 1002)")
Contributor:

I'd suggest we extract the primary key to a common var to make the sharing apparent.
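The suggestion above could look like the following sketch (hypothetical snippet, not from the PR): pull the primary-key column name into a shared `val` so the DDL and any later assertions reference the same definition instead of repeating the literal `"id"`.

```scala
// Hypothetical sketch: share the primary-key column name between the
// DDL string and later checks instead of repeating the literal "id".
val primaryKey = "id"

val createTableSql =
  s"""create table t ($primaryKey int, name string) using hudi
     |tblproperties (primaryKey = '$primaryKey')""".stripMargin

// Both occurrences now come from the same definition.
assert(createTableSql.contains(s"primaryKey = '$primaryKey'"))
```

Renaming the key column then becomes a one-line change rather than a search through every SQL string in the test.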

        checkAnswer(s"select id, name, price, ts, dt from $tableName")(
          Seq(1, "a2", 20, 1001, "2021-12-25"),
          Seq(1, "a3", 30, 1002, "2021-12-25")
        )
      }
    }
    spark.sql("set hoodie.merge.allow.duplicate.on.inserts = false")
Contributor:

I don't think it makes sense to set these after the test is done.

Contributor Author:

Since it defaults to false, we need to restore it after the test is complete.

Contributor:

We should reset it at the beginning of the test.

Contributor Author:

> We should reset it at the beginning of the test.

This is done to ensure that other test cases are not affected by this one, just like lines 945, 535, and so on.

Contributor:

I have seen many cases use this approach of setting a config at the beginning and restoring it at the end. IMO, it would be better to use a method like org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf. @Zouxxyy, are you interested in creating a ticket to follow up on this optimization of the Spark SQL UTs?
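The `withSQLConf` helper the reviewer mentions sets a config only for the duration of a block and restores the previous value afterwards, even if the body throws. A minimal sketch of that shape, using a plain mutable map as a stand-in for the Spark session config (the real helper in `org.apache.spark.sql.test.SQLTestUtilsBase` reads and writes Spark's `SQLConf`):

```scala
import scala.collection.mutable

// Stand-in for the session config; the real helper operates on Spark's SQLConf.
val conf = mutable.Map("hoodie.merge.allow.duplicate.on.inserts" -> "false")

def withSQLConf[T](pairs: (String, String)*)(body: => T): T = {
  // Remember the previous values (None means the key was unset).
  val previous = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try body
  finally previous.foreach {
    case (k, Some(v)) => conf(k) = v
    case (k, None)    => conf.remove(k)
  }
}

withSQLConf("hoodie.merge.allow.duplicate.on.inserts" -> "true") {
  // The override is visible only inside this block.
  assert(conf("hoodie.merge.allow.duplicate.on.inserts") == "true")
}

// Restored automatically afterwards, so other tests are unaffected.
assert(conf("hoodie.merge.allow.duplicate.on.inserts") == "false")
```

Compared to manual `set`/reset pairs scattered through the test body, the try/finally guarantees the restore runs even when an assertion inside the block fails.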

    spark.sql("set hoodie.datasource.write.operation = upsert")
  }
}
@@ -609,7 +609,6 @@ class TestMergeIntoTable2 extends HoodieSparkSqlTestBase {
| preCombineField = 'ts'
| )
""".stripMargin)
-    spark.sql("set hoodie.merge.allow.duplicate.on.inserts = true")
// Insert data
Contributor:

Why drop this line?

Contributor Author:

> Why drop this line?

Because this configuration does not work here and can be misleading.

    spark.sql(s"insert into $tableName select 1, 1, 'a1', 1, 10, '2022-08-18'")
    checkAnswer(s"select id1, id2, name, price, ts, dt from $tableName")(