Skip to content

Flink: Refactor SimpleDataUtil#assertTableRecords method#1765

Merged
openinx merged 2 commits intoapache:masterfrom
zhangjun0x01:refactorAssertTableRows
Nov 19, 2020
Merged

Flink: Refactor SimpleDataUtil#assertTableRecords method#1765
openinx merged 2 commits intoapache:masterfrom
zhangjun0x01:refactorAssertTableRows

Conversation

@zhangjun0x01
Copy link
Contributor

if the parameter List<Record> contains repeated data, the Set will remove the duplicates, which is different from our expected value,we should compare the original list

@github-actions github-actions bot added the flink label Nov 13, 2020
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using hashCode to sort this list is correct if we only want to check wether the two list is equal or not. But it may not friend to debug the root cause when two list is not equal because those elements are disordered. I'd prefer to compare the fields in lexicographical order if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will there be some unknown data type in the Record that does not implement the compareTo interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if this is just for comparing two lists, then it would be better to use a set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertTableRecords method provides a List parameter. For a new user to use this method, he may not know to use set when comparing data.

Whether there is such a situation: the original data is (1,'a'),(1,'a'), but the actual running result is (1,'a'). This test case passed, but it is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove the duplicate records from tests so that we can use a set?

Copy link
Contributor Author

@zhangjun0x01 zhangjun0x01 Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can generate non-repetitive test data when writing test cases, but there must be duplicate data in the production environment. Will there be test cases that pass but the production environment is incorrect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this SimpleDataUtil, all Record should have the same schema (id and data column), so I think we could just compare those two fields directly. Don't have to compare the hashCode.

About using set to compare lists, I'd prefer to compare sorted lists. Because in future equality-delete test cases, we may delete the same row several times to make sure that we've gurantteed the correct semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to test the multi-level partition filter, I constructed a 2-level partition in this method : TestRewriteDataFilesAction#testRewriteDataFilesWithFilter , so each record has three fields. I don’t know if there will be similar test cases in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, how about using the com.google.common.collect.Multiset to assert those lists with duplicated records ? Don't have to sort based on hashCode, and it also allow to use more fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,it is great ,I update the pr.

@zhangjun0x01 zhangjun0x01 force-pushed the refactorAssertTableRows branch from e52aaa8 to b32919f Compare November 19, 2020 08:08
@openinx openinx merged commit c6c9d38 into apache:master Nov 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments