Hudi Test Suite (Refactor) by n3nash · Pull Request #991 · apache/hudi

n3nash · 2019-11-01T19:41:03Z

- Flexible schema payload generation
- Different types of workload generation such as inserts, upserts etc
- Post process actions to perform validations
- Interoperability of test suite to use HoodieWriteClient and HoodieDeltaStreamer so both code paths can be tested
- Custom workload dag generation
- Ability to perform parallel operations, such as upsert and compaction
- Ability to run the test suite either in docker (local env) or on your cluster setup

n3nash · 2019-11-03T22:03:02Z

@vinothchandar @bvaradar I've tried to address most of the comments you had on PR #623. I've also added 2 end-to-end test cases that can be run either inside the docker container or in your own cluster (where you can scale test the same DAG).
@yanghua and I are figuring out a way to work together on this PR to land each of our parts.

n3nash · 2019-11-03T23:50:08Z

Tests pass locally but fail on Travis with a "javax.servlet.FilterRegistration"'s signer information error; looking into this.

yanghua · 2019-11-04T03:30:15Z

Every subclass of DagNode defines an execute method, and we do not use its return value in DagScheduler.
Shall we define an abstract method with this signature:

public abstract void execute(ExecutionContext context) throws Exception;

To provide different runtime context information, we can define a context POJO, e.g. ExecutionContext.

What do you think?

The idea was to have the execute method return its output so we can actually perform validations at any level in the DAG if needed.
If you take a look at ValidateNode, that's what it does: it just uses the output of the last execution. Let me think about this a little more. I like the idea of removing the if-else in favor of an abstract method; refactored.

yanghua · 2019-11-04T03:39:18Z

After refactoring DagNode to provide an abstract execute method, we could replace these if/else branches here with a single line like:

node.execute(xxx);
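
A minimal sketch of the shape discussed above, assuming a context POJO and a result stored on the node so that a later validation node can still read it. SketchNode, InsertSketchNode, ValidateSketchNode, and SketchScheduler are hypothetical names made up for illustration; they are not the classes in this PR, and the traversal assumes a simple tree rather than a general DAG.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: none of these classes are copied from the PR.

/** Context POJO handed to every node; here it only records which nodes ran. */
final class ExecutionContext {
    final List<String> executed = new ArrayList<>();
}

/** Base node: subclasses implement execute and may store a result. */
abstract class SketchNode {
    private final List<SketchNode> children = new ArrayList<>();
    protected Object result;

    public abstract void execute(ExecutionContext context) throws Exception;

    Object getResult() { return result; }
    List<SketchNode> getChildren() { return children; }
    void addChild(SketchNode child) { children.add(child); }
}

final class InsertSketchNode extends SketchNode {
    @Override
    public void execute(ExecutionContext context) {
        result = 100; // pretend 100 records were written
        context.executed.add("insert");
    }
}

final class ValidateSketchNode extends SketchNode {
    private final SketchNode upstream;

    ValidateSketchNode(SketchNode upstream) { this.upstream = upstream; }

    @Override
    public void execute(ExecutionContext context) {
        // Validation reads the stored output of an earlier node.
        result = Integer.valueOf(100).equals(upstream.getResult());
        context.executed.add("validate");
    }
}

final class SketchScheduler {
    /** Breadth-first walk; no if/else on the node type is needed. */
    static void run(SketchNode root, ExecutionContext context) {
        Deque<SketchNode> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            SketchNode node = pending.poll();
            try {
                node.execute(context);
            } catch (Exception e) {
                throw new IllegalStateException("Node failed", e);
            }
            pending.addAll(node.getChildren());
        }
    }
}
```

Keeping the result on the node is one way to reconcile a void execute with validation at any level; returning the value from execute would work as well.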

yanghua · 2019-11-04T05:52:32Z

Hi @n3nash Thanks for refactoring the original PR. I left some comments; some of them may need further discussion.

General:
- Some class files are missing the license header, and some of them have old license information (Uber -> Apache?).
- It would be better to rename the module name to reflect the real purpose, e.g. hudi-end-to-end-test or hudi-test-suite or something else?
Detailed:

Please see inline comments.

n3nash · 2019-11-04T06:59:57Z

@yanghua Thanks for pointing out the Uber -> Apache licensing, I've fixed that in all files now.
I'm open to naming it hudi-test-suite (that's what we called it before).

yanghua

Some comments from my side.

yanghua · 2019-11-04T05:53:18Z

{@org.apache.hadoop.hdfs.LocalFileSystem} -> {@link org.apache.hadoop.hdfs.LocalFileSystem}

yanghua

Some comments from my side.

yanghua · 2019-11-04T09:06:46Z

Many places use the avro and parquet file extensions. Shall we define a general constant?

I'd like to keep it contained since it's only used in 2 places; made it public and consolidated the usages.

yanghua

Some new comments.

yanghua · 2019-11-05T06:23:15Z

It would be better to unify the logging facade (slf4j) across the whole project.

We have been using apache log4j, let's file a ticket and discuss this there.

It seems there is a plan to replace log4j with slf4j. So let's do this later.

There is a ticket for this already.

yanghua · 2019-11-05T06:24:01Z

Shall we replace System.out.println with logging?

This is a remnant of debugging; removing it.

yanghua · 2019-11-05T06:24:30Z

Shall we replace System.out.println with logging?

yanghua · 2019-11-05T06:30:15Z

IMO, it would be better to wrap each of the two statements in its own try/catch, so that an exception thrown by the first one does not prevent the second one from being invoked.

Well, if one of them fails, ideally we want to throw the exception which would terminate the jvm and hence the local hive service as well.
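
One way to get both behaviors, sketched under assumptions: attempt both shutdown calls, then rethrow the first failure so the JVM still terminates. ShutdownSketch and Stoppable are hypothetical names for illustration; the actual statements under review are not shown in this thread.

```java
// Hypothetical sketch: the real service classes are not shown in the thread.
final class ShutdownSketch {
    /** Stand-in for anything with a stop call that may fail. */
    interface Stoppable {
        void stop() throws Exception;
    }

    /**
     * Tries to stop both services. The second stop runs even if the first
     * one throws; the first failure is rethrown afterwards, with any second
     * failure attached as a suppressed exception.
     */
    static void stopBoth(Stoppable first, Stoppable second) throws Exception {
        Exception failure = null;
        try {
            first.stop();
        } catch (Exception e) {
            failure = e;
        }
        try {
            second.stop();
        } catch (Exception e) {
            if (failure == null) {
                failure = e;
            } else {
                failure.addSuppressed(e);
            }
        }
        if (failure != null) {
            throw failure;
        }
    }
}
```

The exception still propagates to the caller, which matches the intent of failing loudly, while the second resource gets a chance to shut down.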

yanghua · 2019-11-05T06:33:07Z

It would be better to unify the prefix of the class name, based on the naming of the project, HoodieTestSuiteJob looks better.

yanghua

Still some comments

yanghua · 2019-11-05T06:56:13Z

This helper class reads both Avro and Parquet files, so we need to update this doc.

yanghua · 2019-11-05T06:57:42Z

AVRO -> PARQUET

yanghua · 2019-11-05T09:00:43Z

Here is a suggestion about the naming of the nodes to make the DAG clearer. For example, we can rename them: root -> firstLevel, child1 -> secondLevel, child2 -> thirdLevel1, newNode2 -> thirdLevel2. This is just an example; there may be another way to make the structure clearer.

Renamed and made it more clear

yanghua · 2019-11-05T11:39:06Z

Here is also a naming suggestion. It seems we have too many classes with a name pattern like xxxWrapper. What about HiveTestServiceProvider?

There are only 2 classes with wrapper in the name, not many :) But I like the name you suggested, made the change.

yanghua · 2019-11-05T11:40:45Z

It seems we did not use this class anywhere? Is it useful?

Yes, need to use this to efficiently consume batches. Will make adjustments to use this class soon.

yanghua · 2019-11-05T12:57:32Z

@n3nash There is now a conflicting file.

yanghua

Two new comments.

yanghua · 2019-11-06T09:08:15Z

What about renaming to DeltaOutputType? Generally, there are two common pairs: input <-> output, source <-> sink.

done, I agree

yanghua · 2019-11-06T09:09:43Z

I am confused. It seems we have removed this class from the master branch, @vinothchandar right?

yanghua · 2019-11-06T09:51:26Z

IMO, we could also replace this if/else pattern here with another technique (e.g. reflection, a unified constructor...). WDYT?

I think that's possible, will address this at the end, after all other comments are addressed.
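
A rough sketch of the reflection idea, assuming the branches only choose which class to instantiate and that every candidate exposes a no-argument constructor. The thread does not show the code in question, so ReflectiveFactorySketch, Generator, AvroGenerator, and ParquetGenerator are hypothetical names for illustration only.

```java
// Hypothetical sketch: the if/else under review is not shown in the thread.
final class ReflectiveFactorySketch {
    /** Made-up common interface for the types the if/else chose between. */
    interface Generator {
        String format();
    }

    static final class AvroGenerator implements Generator {
        public String format() { return "avro"; }
    }

    static final class ParquetGenerator implements Generator {
        public String format() { return "parquet"; }
    }

    /**
     * Instantiates a Generator from its class name instead of branching on
     * a type flag. Relies on a unified (no-argument) constructor.
     */
    static Generator create(String className) {
        try {
            Class<? extends Generator> type =
                Class.forName(className).asSubclass(Generator.class);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }
}
```

The trade-off is that a bad class name fails at runtime rather than at compile time, so the error message should name the offending class.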

yanghua · 2019-11-15T03:35:27Z

@n3nash Is your latest force-pushed commit trying to fix the CI issues? I would not suggest using force-push for a big PR that needs many iterations. Single commits can be reviewed easily. WDYT?

As @vinothchandar said, this is a big PR. Since it's a single, independent module, can we split it into several subtasks to make it more manageable? We could add some subtasks under HUDI-289. WDYT?

vinothchandar · 2019-11-15T14:04:46Z

+1 on this PR. Deferring the squash until merge time would be ideal.
+1 on making the build pass first, before we spend more time on it. Ideally we should also add a few "recipes" to the integ test and have them passing too.

@yanghua I agree with you that ideally we would have phased this as smaller check-ins. But this one is a special case, since it was started even before incubating... So we are where we are.

We can treat Nishith's branch as the current feature branch, or even merge this to a feature branch in apache/incubator-hudi (like we did for packaging fixes), and keep iterating there? That would also help @yanghua contribute changes on top of this, and ultimately merge this into master when ready. My 2c

yanghua · 2019-11-16T08:56:57Z

@vinothchandar I agree with you, especially about moving this PR into a feature branch so that I can interact with @n3nash. When can we move to the feature branch? Right now it is not convenient for me to update this PR because it may cause conflicts.

n3nash · 2019-11-16T13:29:50Z

@yanghua I'd like to first understand our plan for iterating on this feature before we can decide to move it to a feature branch.

- Can you list out the changes you wish to make on top of the existing PR?
- Let's not try to push to this PR.
- Once I understand our plan to merge this PR and the incremental changes we want to make on top of this, I'm open to separating this into a feature branch.

On a side note of force-push, I think it's fair to squash later since this is a large PR.

vinothchandar · 2019-11-18T06:15:07Z

I think there are open comments here already. But getting this PR to pass travis and getting to a feature branch would be a good starting point.

Ultimately, @yanghua can see if his JIRA for long running tests can be implemented on top of this. I expect he might have to make changes on top of this. Hope that provides some context

yanghua · 2019-11-18T08:43:06Z

@yanghua I'd like to first understand our plan for iterating on this feature before we can decide to move it to a feature branch.

Can you list out the changes you wish to make on top of the existing PR ?

Let's not try to push to this PR
Once I understand our plan to merge this PR and the incremental changes we want to make on top of this, I'm open to separating this into a feature branch.

On a side note of force-push, I think it's fair to squash later since this is a large PR.

Hi @n3nash there are some things that should be done in the future. For example:

Try to utilize DistributedTestDataSource to generate workload;
How to integrate with some online pipeline service which provides the basic infrastructure ability;
...

Actually, the reason I want to move to the feature branch is that I want to interact with you quickly, not just comment (I can fix some of the issues I found). If we work on the same PR, conflicts will be very frequent. WDYT?

vinothchandar · 2019-11-18T15:13:25Z

@n3nash I have not read through all of your responses yet. Will do so in some time.

But I just wanted to highlight one issue that I feel strongly about. IMO, for any test framework, we need to write a few hard test cases that identify issues and run continuously in a CI environment for a few weeks before we can deem it complete. In that sense, if we can have a feature branch, then you both can quickly iterate and get us to this stage.

We have some large ambitions in the coming months, and having this in top shape will ease all of our minds. As someone supporting one of the largest data lakes out there :), I am sure you'd agree that catching issues upfront is way way way better than chasing them.

yanghua · 2019-11-20T09:36:09Z

@n3nash Travis is still red. Can we move this PR into a new branch now so that I can start to do something?

- Flexible schema payload generation
- Different types of workload generation such as inserts, upserts etc
- Post process actions to perform validations
- Interoperability of test suite to use HoodieWriteClient and HoodieDeltaStreamer so both code paths can be tested
- Custom workload sequence generator
- Ability to perform parallel operations, such as upsert and compaction

n3nash · 2019-11-21T10:50:28Z

@yanghua I've pushed the PR to a feature branch : https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor. I will shortly close this PR.

The build is currently failing because it exceeds the log limit; I'm looking at ways to fix that. Once you have started working on the first task you want to add to the branch, let's chat. I'd like to understand what you have in mind and how it aligns with the current PR, so that we can provide a first version of the test suite with sufficient features for large feature testing.

yanghua · 2019-11-21T11:32:05Z

@n3nash Thanks. Will check out the new feature branch soon.

vinothchandar · 2019-11-27T05:12:43Z

@yanghua @n3nash what's the status on this as of now? Would be great to see some plan written down somewhere.

yanghua · 2019-11-27T05:57:33Z

@n3nash there are still some conflicts.

@vinothchandar While waiting for this PR to be pushed into a single feature branch, I spent some time on #1049 and HUDI-184. I will return to the test suite soon.

yanghua · 2019-11-28T06:40:14Z

There are so many conflicts. I have tried to rebase on the master branch; however, it's hard to do this across multiple commits. So I would like to first squash the commits and then rebase to fix the conflicts. FYI, cc @n3nash @vinothchandar

vinothchandar · 2019-11-30T15:51:55Z

I think @n3nash has pushed a new feature branch? Is this PR still relevant? I'll let @n3nash chime in. But if this PR is what we need, then resolving the conflicts and getting it into good shape is top priority IMO. Do you agree @n3nash?

yanghua · 2019-12-01T04:35:32Z

Yes, we should rebase it on the master branch. The work is reflected in #1057. However, I cannot merge it into this PR or the feature branch, so I created a copy.

yanghua · 2019-12-02T07:39:16Z

Some pending work on my side:

- introduce DistributedTestDataSource to refactor the existing DeltaGenerator;
- rename the hudi-bench module to another name;
- replace all occurrences of the code style log.info(String.format(xxx));
- research Azure Pipelines;

vinothchandar · 2019-12-02T17:16:50Z

@yanghua If you have a buildable PR that you feel can be used as a starting point, please go ahead.
Are we at that point, or are you waiting for something else?

We can close this one, as @n3nash mentioned before? (please close if you agree as well)

will shortly close this PR.

yanghua · 2019-12-03T05:54:58Z

OK, let's close this PR and move to #1057.

n3nash added the status:in-progress Work in progress label Nov 1, 2019

n3nash changed the title ~~[WIP] Hudi Test Suite (Refactor)~~ Hudi Test Suite (Refactor) Nov 1, 2019

n3nash force-pushed the hudi_test_suite_refactor branch 5 times, most recently from b3eb59b to bfd17fe Compare November 2, 2019 17:44

vinothchandar mentioned this pull request Nov 2, 2019

Hudi Test Suite #623

Closed

n3nash force-pushed the hudi_test_suite_refactor branch 2 times, most recently from 8375f14 to bffdf83 Compare November 3, 2019 06:57

n3nash removed the status:in-progress Work in progress label Nov 3, 2019

n3nash force-pushed the hudi_test_suite_refactor branch 2 times, most recently from 0f64991 to d0974e8 Compare November 3, 2019 21:43

n3nash force-pushed the hudi_test_suite_refactor branch from d0974e8 to 3931ef6 Compare November 3, 2019 23:52

yanghua reviewed Nov 4, 2019

n3nash force-pushed the hudi_test_suite_refactor branch from 3931ef6 to 40c82ea Compare November 4, 2019 05:01

n3nash force-pushed the hudi_test_suite_refactor branch from 40c82ea to 1209fd3 Compare November 4, 2019 06:58

yanghua reviewed Nov 4, 2019

yanghua reviewed Nov 5, 2019

yanghua mentioned this pull request Nov 5, 2019

[HUDI-324] TimestampKeyGenerator should support milliseconds #993

Merged

yanghua reviewed Nov 6, 2019

n3nash force-pushed the hudi_test_suite_refactor branch from 1033f82 to dac689e Compare November 20, 2019 10:03

n3nash added 2 commits November 20, 2019 20:59

Adressing CR comments part 1

07b4c12

n3nash force-pushed the hudi_test_suite_refactor branch 4 times, most recently from 8173923 to b060994 Compare November 21, 2019 07:49

fixing build issues due to javax servlet

0c2ed53

n3nash force-pushed the hudi_test_suite_refactor branch from b060994 to 0c2ed53 Compare November 21, 2019 11:11

Fixing some unit tests

4e95f60

yanghua mentioned this pull request Nov 28, 2019

Hudi Test Suite #1057

Closed

5 tasks

yanghua closed this Dec 3, 2019
