[HUDI-392] Introduce DistributedTestDataSource to generate test data #1115

Closed
yanghua wants to merge 5 commits into apache:hudi_test_suite_refactor from yanghua:HUDI-392

Conversation

@yanghua
Contributor

@yanghua yanghua commented Dec 19, 2019

What is the purpose of the pull request

Introduce DistributedTestDataSource to generate test data

Brief change log

  • Introduce DistributedTestDataSource to generate test data

Verify this pull request


This change added tests and can be verified as follows:

  • TestHoodieTestSuiteJob#testDistributeSourceInsert

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@yanghua yanghua force-pushed the HUDI-392 branch 3 times, most recently from f97e62d to cb91a79 Compare December 19, 2019 12:26
@yanghua
Contributor Author

yanghua commented Dec 19, 2019

Will add more test cases to cover complex DAGs.

Contributor


Is the reason for this change that the timestamp is used to generate the partition, for which a double value is not suitable?

Contributor Author


Yes, I tried to change the type of the timestamp field to double. However, it causes the test case TestHoodieTestSuiteJob#testComplexDag to fail.

Contributor Author


Yes. I have tried to change the type of the timestamp field of the source.avsc file to double. However, it causes TestHoodieTestSuiteJob#testComplexDag to fail.

Contributor


Okay, we can fix that; it shouldn't be difficult.

Contributor

@n3nash n3nash Dec 19, 2019


Not sure I follow this method properly:

  1. We shouldn't use any Test* names in the source code.
  2. How do we plan to support more properties in the future? Do we need to make code changes every time?
  3. What is the need for the rocksdb use here? Again, "Test" in the name.
  4. fetchNext is always fetching 1000000 records, why is that?
  5. The name of the method says "upsert" but I see only inserts getting generated.

Contributor Author

@yanghua yanghua Dec 20, 2019


Not sure I follow this method properly:

  1. We shouldn't use any Test* names in the source code.
  2. How do we plan to support more properties in the future? Do we need to make code changes every time?
  3. What is the need for the rocksdb use here? Again, "Test" in the name.
  4. fetchNext is always fetching 1000000 records, why is that?
  5. The name of the method says "upsert" but I see only inserts getting generated.

Hi @n3nash, when implementing this feature, I also had some questions (e.g. your points 1, 2, and 3). Actually, using DistributedTestDataSource was @vinothchandar's suggestion.

I know DistributedTestDataSource has some limitations. IMO, we can refactor it to make it more scalable and turn it into a general data generator. WDYT?

About question 4:

  1. fetchNext is always fetching 1000000 records, why is that?

It is because of this statement:

InputBatch<JavaRDD<GenericRecord>> batch = distributedTestDataSource.fetchNext(Option.empty(), 10000000);

Will try to fix it.

About question 5:

  The name of the method says "upsert" but I see only inserts getting generated.

In AbstractBaseTestSource#fetchNextBatch, it connects insertStream and updateStream. IMO, it can provide upsert functionality. I will try to figure it out.
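As a rough illustration of the idea (a hypothetical standalone sketch, not the actual AbstractBaseTestSource code), concatenating an update stream with an insert stream yields a single batch containing both kinds of records:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UpsertStreamSketch {
    public static void main(String[] args) {
        // Hypothetical record keys; in Hudi these would be GenericRecords.
        Stream<String> updateStream = Stream.of("key-1", "key-2"); // updates to existing keys
        Stream<String> insertStream = Stream.of("key-3", "key-4"); // brand-new keys

        // Connecting the two streams produces one batch of upserts
        // (updates first, then inserts), mirroring the description above.
        List<String> batch = Stream.concat(updateStream, insertStream)
            .collect(Collectors.toList());
        System.out.println(batch); // [key-1, key-2, key-3, key-4]
    }
}
```

When the update stream is empty (no existing keys), the concatenated batch degenerates to pure inserts, which is exactly the behavior observed in question 5.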

Contributor


@yanghua Yes, we should refactor those parts.

For (5), what I mean is: when we perform distributedTestDataSource.fetchNext(Option.empty(), 10000000), does it return a bunch of updates + inserts (or just inserts)?

Contributor Author


I have debugged the call chain of the relevant methods.

The key chain is listed below:

DistributedTestDataSource#fetchNext
  DistributedTestDataSource#fetchNewData
    DistributedTestDataSource#fetchNextBatch

In DistributedTestDataSource#fetchNextBatch, it calculates the number of insert and update records. Core logic:

int numExistingKeys = dataGenerator.getNumExistingKeys();

int numUpdates = Math.min(numExistingKeys, sourceLimit / 2);
int numInserts = sourceLimit - numUpdates;

The sourceLimit variable is specified by the caller (here, 10000000). However, the numExistingKeys variable is always 0. It can only change after calling some methods in HoodieTestDataGenerator to generate insert records. In our scenario, these methods are never invoked. So here:

numUpdates = 0;
numInserts = sourceLimit;
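The effect of this split can be checked with a small standalone sketch (a hypothetical class that mirrors only the two lines of core logic quoted above, not the actual Hudi source):

```java
public class BatchSplitSketch {
    // Mirrors the insert/update split quoted above (sketch, not Hudi source):
    // up to half of sourceLimit becomes updates, capped by the number of
    // existing keys; the remainder becomes inserts.
    static int[] split(int numExistingKeys, int sourceLimit) {
        int numUpdates = Math.min(numExistingKeys, sourceLimit / 2);
        int numInserts = sourceLimit - numUpdates;
        return new int[] {numInserts, numUpdates};
    }

    public static void main(String[] args) {
        // With no existing keys (as in the scenario above), everything is an insert.
        int[] fresh = split(0, 10000000);
        System.out.println("inserts=" + fresh[0] + " updates=" + fresh[1]); // inserts=10000000 updates=0

        // Once keys exist, up to half the batch becomes updates.
        int[] warm = split(3000, 10000000);
        System.out.println("inserts=" + warm[0] + " updates=" + warm[1]); // inserts=9997000 updates=3000
    }
}
```

This makes the observed behavior concrete: with numExistingKeys fixed at 0, the min() always evaluates to 0 updates, so fetchNext can only ever produce inserts.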

Contributor


Okay, so I'm still unclear: can we pass the exact number of inserts/upserts to create using the above logic? If not, this might not be that useful.

@yanghua
Contributor Author

yanghua commented Dec 23, 2019

Hi @vinothchandar, WDYT about DistributedTestDataSource? It seems this class is not used anywhere; it is only tested in TestHoodieDeltaStreamer. Can we move it into the hudi-test-suite module?

@vinothchandar
Member

@yanghua IIUC, @bvaradar actually uses it to run a test job that generates random data on the cluster. So maybe leave it in hoodie-utilities so that the bundle also has it. In general, it's a nice way to start running DeltaStreamer with some fake data.

@yanghua
Contributor Author

yanghua commented Dec 24, 2019

  @yanghua IIUC, @bvaradar actually uses it to run a test job that generates random data on the cluster.

I did not see any place that uses DistributedTestDataSource in the master branch.

  So maybe leave it in hoodie-utilities so that the bundle also has it. In general, it's a nice way to start running DeltaStreamer with some fake data.

We can leave it in the hoodie-utilities module. However, it lives in the test package, and as @n3nash mentioned, we had better avoid using test code from another module.

@yanghua yanghua force-pushed the hudi_test_suite_refactor branch from dcfbab1 to 1d2ecbc Compare December 24, 2019 06:02
@yanghua
Contributor Author

yanghua commented Dec 24, 2019

Travis is green now.

import java.io.Serializable;

/**
* An insert node which uses {@link org.apache.hudi.utilities.sources.DistributedTestDataSource}
Contributor


Is this supposed to generate inserts or upserts? The name of the class suggests otherwise.
Also, the name of the node is slightly confusing: the existing upsertNode also generates data in a distributed manner, since it also uses RDD-based logic. Maybe name the new class UpsertNodeUsingDistributedGenerator or something along these lines?

props.setProperty(TestSourceConfig.MAX_UNIQUE_RECORDS_PROP, String.valueOf(operation.getNumRecordsInsert()));
props.setProperty(TestSourceConfig.NUM_SOURCE_PARTITIONS_PROP, String.valueOf(operation.getNumInsertPartitions()));
props.setProperty(TestSourceConfig.USE_ROCKSDB_FOR_TEST_DATAGEN_KEYS, "true");
DistributedTestDataSource distributedTestDataSource = new DistributedTestDataSource(
Contributor


Still see "test" names in the core logic. Either rename it and add it to a utils folder so it can be used in the src code.

@yanghua
Contributor Author

yanghua commented Dec 25, 2019

@n3nash I think the whole workload generation is a bit confusing now; I am rethinking how to refactor it. Do you think the generator you implemented can replace DistributedTestDataSource?

@n3nash
Contributor

n3nash commented Jan 3, 2020

@yanghua I was on a holiday break, apologies for the late response. Have you tried to run the test suite? If the current data generation methodology meets our needs, we might not require the DistributedTestDataSource. If not, we can tweak the current implementation or bring in the DistributedSource, WDYT?

@yanghua
Contributor Author

yanghua commented Jan 4, 2020

@yanghua I was on a holiday break, apologies for the late response. Have you tried to run the test suite? If the current data generation methodology meets our needs, we might not require the DistributedTestDataSource. If not, we can tweak the current implementation or bring in the DistributedSource, WDYT?

Hi @n3nash, no need to apologize, happy holidays. Yes, I have run the test suite several times. It works fine.

IMO, DistributedTestDataSource will not block the test suite. Actually, I think the test payload generation is a little confusing currently. I was thinking about how to refactor it; however, that work was interrupted by other things: integrating with Azure Pipelines and designing how to integrate Hudi with Flink.

More details about integrating with Azure can be found here:

It has not been done yet.

cc @vinothchandar

@n3nash
Contributor

n3nash commented Jan 7, 2020

@yanghua Okay, it's good to hear that you were able to try out the test suite. Maybe we need to prepare some more elaborate test suite DAGs which cover all use cases and code paths/APIs?

I'm open to any refactoring ideas that you might have for the data generation; let me know when you have those thoughts more concrete and shareable.

Integrating Azure Pipelines with the test suite would be a good way to close the loop on a first version of the test suite. Let's continue to focus on that (and Hudi with Flink, of course :) ).
Can you do another pass at the PR and see if there are any glaring open items (apart from the data generation refactor, which I will let you do) that need work? I can then take that up this week, so hopefully in the next few days we have a PR ready to go through a final review process.

@yanghua
Contributor Author

yanghua commented Jan 7, 2020

@n3nash OK, I will review the whole test suite again to see if I can find any issues.

@yanghua yanghua force-pushed the hudi_test_suite_refactor branch 2 times, most recently from 3dc85eb to 0456214 Compare January 14, 2020 08:20
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 2 times, most recently from de6ec05 to ff13b2a Compare July 7, 2020 06:40
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch from ff13b2a to 839c1a4 Compare July 9, 2020 15:45
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 14 times, most recently from 247d923 to ea2c616 Compare July 21, 2020 18:05
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 10 times, most recently from aadba78 to bf59232 Compare July 30, 2020 22:43
@n3nash
Contributor

n3nash commented Aug 5, 2020

@yanghua Is it okay to close this now ?

@yanghua
Contributor Author

yanghua commented Aug 5, 2020

@yanghua Is it okay to close this now ?

Yes, closing...

@yanghua yanghua closed this Aug 5, 2020