[HUDI-392] Introduce DistributedTestDataSource to generate test data #1115
yanghua wants to merge 5 commits into apache:hudi_test_suite_refactor
Conversation
Force-pushed f97e62d to cb91a79
Will add more test cases to cover complex DAGs.
Is the reason for this change to use the timestamp to generate partition for which a double value is not conducive ?
Yes, I tried to change the type of the timestamp field in source.avsc to double. However, it caused TestHoodieTestSuiteJob#testComplexDag to fail.
Okay, we can fix that, shouldn't be difficult
Not sure I follow this method properly:
- We shouldn't use any Test* names in the source code.
- How do we plan to support more properties in the future? Do we need to make code changes every time?
- What is the need for RocksDB use here? Again, "Test" in the name.
- fetchNext is always fetching 1000000 records, why is that?
- The name of the method says "upsert" but I see only inserts getting generated.
Hi @n3nash When implementing this feature, I also had some questions (e.g. your points 1, 2, 3). Actually, using DistributedTestDataSource was @vinothchandar's suggestion.
I know DistributedTestDataSource has some limitations. IMO, we can refactor it to make it more scalable and turn it into a general data generator. WDYT?
About question 4:
- fetchNext is always fetching 1000000 records, why is that ?
It is because of this statement:
InputBatch<JavaRDD<GenericRecord>> batch = distributedTestDataSource.fetchNext(Option.empty(), 10000000);
Will try to fix it.
About question 5:
The name of the method says "upsert" but I see only inserts getting generated..
In AbstractBaseTestSource#fetchNextBatch, it concatenates insertStream and updateStream. IMO, it can provide upsert functionality. I will try to figure it out.
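The concatenation described above can be illustrated with a minimal, self-contained sketch. Plain Java streams stand in for the RDD-based insert and update streams here; the record type and variable names are hypothetical stand-ins, not the actual Hudi types.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch: concatenating an insert stream and an update stream
// yields one combined "upsert" batch, analogous to what
// AbstractBaseTestSource#fetchNextBatch does with its streams.
// The String record type and names are hypothetical.
public class UpsertStreamSketch {
    public static void main(String[] args) {
        Stream<String> insertStream = Stream.of("insert-1", "insert-2");
        Stream<String> updateStream = Stream.of("update-1");

        // The combined stream carries both kinds of records.
        List<String> batch = Stream.concat(insertStream, updateStream)
                .collect(Collectors.toList());
        System.out.println(batch); // [insert-1, insert-2, update-1]
    }
}
```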
@yanghua Yes, we should refactor those parts.
For (5), what I mean is: when we perform distributedTestDataSource.fetchNext(Option.empty(), 10000000), does it return a bunch of updates + inserts, or just inserts?
I have debugged the call chain of the relevant methods.
The key call chain is listed below:
DistributedTestDataSource#fetchNext
DistributedTestDataSource#fetchNewData
DistributedTestDataSource#fetchNextBatch
In DistributedTestDataSource#fetchNextBatch, it calculates the number of insert and update records. Core logic:
int numExistingKeys = dataGenerator.getNumExistingKeys();
int numUpdates = Math.min(numExistingKeys, sourceLimit / 2);
int numInserts = sourceLimit - numUpdates;
The sourceLimit variable is specified by the caller (here, 10000000). However, the numExistingKeys variable is always 0; it can only change after calling some methods in HoodieTestDataGenerator to generate insert records. In our scenario, those methods are never invoked. So here:
numUpdates = 0;
numInserts = sourceLimit;
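As a quick check, the split logic quoted above can be exercised in isolation. This is a standalone sketch mirroring those three lines, not the actual Hudi class:

```java
public class BatchSplitSketch {
    // Mirrors the core logic quoted above from fetchNextBatch.
    static int numUpdates(int numExistingKeys, int sourceLimit) {
        return Math.min(numExistingKeys, sourceLimit / 2);
    }

    public static void main(String[] args) {
        int sourceLimit = 10_000_000;

        // No keys generated yet (the scenario in this thread): all inserts.
        int updates = numUpdates(0, sourceLimit);
        int inserts = sourceLimit - updates;
        System.out.println("inserts=" + inserts + ", updates=" + updates);
        // prints: inserts=10000000, updates=0

        // Once existing keys are tracked, up to half the batch is updates.
        int warmUpdates = numUpdates(500, 1_000);
        System.out.println("warm updates=" + warmUpdates);
        // prints: warm updates=500
    }
}
```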
Okay, so I'm still unclear: can we pass the exact number of inserts/upserts to create using the above logic? If not, this might not be that useful.
Hi @vinothchandar, WDYT about
I did not see any place where use
We can leave it in
Force-pushed dcfbab1 to 1d2ecbc
…createNewCommitTime
…d testMultipleValueKeyGenerator NPE
Travis is green now.
import java.io.Serializable;

/**
 * An insert node which uses {@link org.apache.hudi.utilities.sources.DistributedTestDataSource}
Is this supposed to generate inserts or upserts? The name of the class says differently.
Also, the name of the node is slightly confusing: the existing upsertNode also generates data in a distributed manner, since it also uses RDD-based logic. Maybe name the new class UpsertNodeUsingDistributedGenerator or something along these lines?
props.setProperty(TestSourceConfig.MAX_UNIQUE_RECORDS_PROP, String.valueOf(operation.getNumRecordsInsert()));
props.setProperty(TestSourceConfig.NUM_SOURCE_PARTITIONS_PROP, String.valueOf(operation.getNumInsertPartitions()));
props.setProperty(TestSourceConfig.USE_ROCKSDB_FOR_TEST_DATAGEN_KEYS, "true");
DistributedTestDataSource distributedTestDataSource = new DistributedTestDataSource(
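For illustration, the configuration pattern above can be sketched with plain java.util.Properties. The key strings and values below are hypothetical stand-ins: the actual values of the TestSourceConfig constants and the operation getters are not shown in this diff.

```java
import java.util.Properties;

// Sketch of the set-then-read property flow used to configure the source.
// All keys and values here are hypothetical placeholders.
public class SourcePropsSketch {
    public static void main(String[] args) {
        int numRecordsInsert = 100;   // would come from operation.getNumRecordsInsert()
        int numInsertPartitions = 4;  // would come from operation.getNumInsertPartitions()

        Properties props = new Properties();
        // Hypothetical keys standing in for the TestSourceConfig constants.
        props.setProperty("hoodie.test.max.unique.records", String.valueOf(numRecordsInsert));
        props.setProperty("hoodie.test.num.source.partitions", String.valueOf(numInsertPartitions));
        props.setProperty("hoodie.test.use.rocksdb.for.datagen.keys", "true");

        // A consumer (such as the data source) would read them back as typed values.
        System.out.println(Integer.parseInt(props.getProperty("hoodie.test.max.unique.records")));
        // prints: 100
    }
}
```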
Still see "test" names in the core logic. Either rename it, or add it to a utils folder so it can be used in the src code.
@n3nash I think the whole workload generation is a bit confusing now. I am rethinking how to refactor it. Do you think the generator you implemented can replace
@yanghua I was on a holiday break, apologies for the late response. Have you tried to run the test suite? If the current data generation methodology meets our needs, we might not require the DistributedTestDataSource. If not, we can tweak the current implementation or bring in the DistributedSource, wdyt?
Hi @n3nash No need to apologize, happy holidays. Yes, I have run the test suite several times. It works fine. IMO, the
More details about integrating with Azure can be found here:
It has not been done.
@yanghua Okay, it's good to hear that you were able to try out the test suite. Maybe we need to prepare some more elaborate test suite DAGs which cover all use-cases and code paths/APIs? I'm open to any refactoring ideas that you might have for the data generation; let me know when you have those thoughts more concrete and shareable. Integrating Azure Pipelines and the test suite would be good to close the loop on a first version of the test suite. Let's continue to focus on that (and Hudi with Flink of course :) ).
@n3nash OK, will try to review the whole test suite again to see if I can find some issues.
Force-pushed 3dc85eb to 0456214
Force-pushed de6ec05 to ff13b2a
Force-pushed ff13b2a to 839c1a4
Force-pushed 247d923 to ea2c616
Force-pushed aadba78 to bf59232
@yanghua Is it okay to close this now?
Yes, closing...
What is the purpose of the pull request
Introduce DistributedTestDataSource to generate test data
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.