[HUDI-392] Introduce DistributedTestDataSource to generate test data #1115
yanghua wants to merge 5 commits into apache:hudi_test_suite_refactor
Conversation
Force-pushed f97e62d to cb91a79
Will add more test cases to cover complex DAGs.
Is the reason for this change to use the timestamp to generate partition for which a double value is not conducive ?
Yes, I tried to change the type of the timestamp field in source.avsc to double. However, it caused TestHoodieTestSuiteJob#testComplexDag to fail.
Okay, we can fix that, shouldn't be difficult
Not sure I follow this method properly:
- We shouldn't use any Test* names in the source code.
- How do we plan to support more properties in the future? Do we need to make code changes every time?
- What is the need for RocksDB use here? Again, "Test" in the name.
- fetchNext is always fetching 1000000 records, why is that?
- The name of the method says "upsert" but I see only inserts getting generated.
Hi @n3nash When implementing this feature, I also had some questions (e.g. your points 1, 2, 3). Actually, using DistributedTestDataSource was @vinothchandar's suggestion.
I know DistributedTestDataSource has some limitations. IMO, we can refactor it to make it more scalable and turn it into a general data generator. WDYT?
About question 4:
- fetchNext is always fetching 1000000 records, why is that ?
It is because of this statement:
InputBatch<JavaRDD<GenericRecord>> batch = distributedTestDataSource.fetchNext(Option.empty(), 10000000);
Will try to fix it.
About question 5:
The name of the method says "upsert" but I see only inserts getting generated..
In AbstractBaseTestSource#fetchNextBatch, it concatenates insertStream and updateStream. IMO, it can provide upsert functionality. I will try to figure it out.
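The concatenation described above can be illustrated with a minimal, self-contained sketch. Plain Java streams stand in for the RDD-based insert and update streams here; the record type and variable names are hypothetical stand-ins, not the actual Hudi types.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch: concatenating an insert stream and an update stream
// yields one combined "upsert" batch, analogous to what
// AbstractBaseTestSource#fetchNextBatch does with its streams.
// The String record type and names are hypothetical.
public class UpsertStreamSketch {
    public static void main(String[] args) {
        Stream<String> insertStream = Stream.of("insert-1", "insert-2");
        Stream<String> updateStream = Stream.of("update-1");

        // The combined stream carries both kinds of records.
        List<String> batch = Stream.concat(insertStream, updateStream)
                .collect(Collectors.toList());
        System.out.println(batch); // [insert-1, insert-2, update-1]
    }
}
```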
@yanghua Yes, we should refactor those parts.
For (5), what I mean is: when we perform distributedTestDataSource.fetchNext(Option.empty(), 10000000), does it return a bunch of updates + inserts, or just inserts?
I have debugged the call chain of the relevant methods.
The key call chain is listed below:
DistributedTestDataSource#fetchNext
DistributedTestDataSource#fetchNewData
DistributedTestDataSource#fetchNextBatch
In DistributedTestDataSource#fetchNextBatch, it calculates the number of insert and update records. Core logic:
int numExistingKeys = dataGenerator.getNumExistingKeys();
int numUpdates = Math.min(numExistingKeys, sourceLimit / 2);
int numInserts = sourceLimit - numUpdates;
The sourceLimit variable is specified by the caller (here, 10000000). However, the numExistingKeys variable is always 0; it can only change after calling some methods in HoodieTestDataGenerator to generate insert records. In our scenario, those methods are never invoked. So here:
numUpdates = 0;
numInserts = sourceLimit;
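As a quick check, the split logic quoted above can be exercised in isolation. This is a standalone sketch mirroring those three lines, not the actual Hudi class:

```java
public class BatchSplitSketch {
    // Mirrors the core logic quoted above from fetchNextBatch.
    static int numUpdates(int numExistingKeys, int sourceLimit) {
        return Math.min(numExistingKeys, sourceLimit / 2);
    }

    public static void main(String[] args) {
        int sourceLimit = 10_000_000;

        // No keys generated yet (the scenario in this thread): all inserts.
        int updates = numUpdates(0, sourceLimit);
        int inserts = sourceLimit - updates;
        System.out.println("inserts=" + inserts + ", updates=" + updates);
        // prints: inserts=10000000, updates=0

        // Once existing keys are tracked, up to half the batch is updates.
        int warmUpdates = numUpdates(500, 1_000);
        System.out.println("warm updates=" + warmUpdates);
        // prints: warm updates=500
    }
}
```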
Okay, so I'm still unclear: can we pass the exact number of inserts/upserts to create using the above logic? If not, this might not be that useful.
Hi @vinothchandar, WDYT about
I did not see any place where use
We can leave it in
Force-pushed dcfbab1 to 1d2ecbc
…createNewCommitTime
…d testMultipleValueKeyGenerator NPE
Travis is green now.
import java.io.Serializable;

/**
 * An insert node which uses {@link org.apache.hudi.utilities.sources.DistributedTestDataSource}
Is this supposed to generate inserts or upserts? The name of the class says differently.
Also, the name of the node is slightly confusing: the existing upsertNode also generates data in a distributed manner, since it also uses RDD-based logic. Maybe name the new class UpsertNodeUsingDistributedGenerator or something along these lines?
props.setProperty(TestSourceConfig.MAX_UNIQUE_RECORDS_PROP, String.valueOf(operation.getNumRecordsInsert()));
props.setProperty(TestSourceConfig.NUM_SOURCE_PARTITIONS_PROP, String.valueOf(operation.getNumInsertPartitions()));
props.setProperty(TestSourceConfig.USE_ROCKSDB_FOR_TEST_DATAGEN_KEYS, "true");
DistributedTestDataSource distributedTestDataSource = new DistributedTestDataSource(
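For illustration, the configuration pattern above can be sketched with plain java.util.Properties. The key strings and values below are hypothetical stand-ins: the actual values of the TestSourceConfig constants and the operation getters are not shown in this diff.

```java
import java.util.Properties;

// Sketch of the set-then-read property flow used to configure the source.
// All keys and values here are hypothetical placeholders.
public class SourcePropsSketch {
    public static void main(String[] args) {
        int numRecordsInsert = 100;   // would come from operation.getNumRecordsInsert()
        int numInsertPartitions = 4;  // would come from operation.getNumInsertPartitions()

        Properties props = new Properties();
        // Hypothetical keys standing in for the TestSourceConfig constants.
        props.setProperty("hoodie.test.max.unique.records", String.valueOf(numRecordsInsert));
        props.setProperty("hoodie.test.num.source.partitions", String.valueOf(numInsertPartitions));
        props.setProperty("hoodie.test.use.rocksdb.for.datagen.keys", "true");

        // A consumer (such as the data source) would read them back as typed values.
        System.out.println(Integer.parseInt(props.getProperty("hoodie.test.max.unique.records")));
        // prints: 100
    }
}
```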
Still see "test" names in the core logic. Either rename it, or add it to a utils folder so it can be used in the src code.
@n3nash I think the whole workload generation is a bit confusing now. I am rethinking how to refactor it. Do you think the generator you implemented can replace
@yanghua I was on a holiday break, apologies for the late response. Have you tried to run the test suite? If the current data generation methodology meets our needs, we might not require the DistributedTestDataSource. If not, we can tweak the current implementation or bring in the DistributedSource, wdyt?
Hi @n3nash No need to apologize, happy holidays. Yes, I have run the test suite several times. It works fine. IMO, the
More details about integrating with Azure can be found here:
It has not been done.
@yanghua Okay, it's good to hear that you were able to try out the test suite. Maybe we need to prepare some more elaborate test suite DAGs which cover all use-cases and code paths/APIs? I'm open to any refactoring ideas that you might have for the data generation; let me know when you have those thoughts more concrete and shareable. Integrating Azure Pipelines and the test suite would be good to close the loop on a first version of the test suite. Let's continue to focus on that (and Hudi with Flink of course :) ).
@n3nash OK, will try to review the whole test suite again to see if I can find some issues.
Force-pushed 3dc85eb to 0456214
Force-pushed de6ec05 to ff13b2a
Force-pushed ff13b2a to 839c1a4
Force-pushed 247d923 to ea2c616
Force-pushed aadba78 to bf59232
@yanghua Is it okay to close this now?
Yes, closing...
What is the purpose of the pull request
Introduce DistributedTestDataSource to generate test data
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.