Hudi Test Suite (Refactor) #991

Closed
n3nash wants to merge 4 commits into apache:master from n3nash:hudi_test_suite_refactor

@n3nash
Contributor

@n3nash n3nash commented Nov 1, 2019

- Flexible schema payload generation
- Different types of workload generation, such as inserts, upserts, etc.
- Post process actions to perform validations
- Interoperability of test suite to use HoodieWriteClient and HoodieDeltaStreamer so both code paths can be tested
- Custom workload dag generation
- Ability to perform parallel operations, such as upsert and compaction
- Ability to run the test suite either in Docker (local env) or on your cluster setup

@n3nash n3nash added the status:in-progress Work in progress label Nov 1, 2019
@n3nash n3nash changed the title from "[WIP] Hudi Test Suite (Refactor)" to "Hudi Test Suite (Refactor)" Nov 1, 2019
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 5 times, most recently from b3eb59b to bfd17fe Compare November 2, 2019 17:44
@vinothchandar vinothchandar mentioned this pull request Nov 2, 2019
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 2 times, most recently from 8375f14 to bffdf83 Compare November 3, 2019 06:57
@n3nash n3nash removed the status:in-progress Work in progress label Nov 3, 2019
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 2 times, most recently from 0f64991 to d0974e8 Compare November 3, 2019 21:43
@n3nash
Contributor Author

n3nash commented Nov 3, 2019

@vinothchandar @bvaradar I've tried to address most of the comments you left on PR #623. I've also added 2 end-to-end test cases that can be run either inside the Docker container or on your own cluster (where you can scale-test the same DAG).
@yanghua and I are figuring out a way to work together on this PR so that each of us can land our part.

@n3nash
Contributor Author

n3nash commented Nov 3, 2019

Tests pass locally but are failing on Travis with an error about "javax.servlet.FilterRegistration"'s signer information. Looking into this.

@n3nash n3nash force-pushed the hudi_test_suite_refactor branch from d0974e8 to 3931ef6 Compare November 3, 2019 23:52
Contributor

Every subclass of DagNode defines an execute method, and DagScheduler does not use its return value.
Shall we define an abstract method with the following signature:

  public abstract void execute(ExecutionContext context) throws Exception;

To provide different runtime context information, we can define a context POJO, e.g. ExecutionContext.

What do you think?

Contributor Author

@n3nash n3nash Nov 4, 2019

The idea was to have the execute method return its output so we can perform validations at any level in the DAG if needed.
If you take a look at ValidateNode, that's what it does: it just uses the output of the last execution. Let me think about this a little more. I like the idea of removing the if-else in favor of an abstract method; refactored.

Contributor

Once DagNode is refactored to provide an abstract execute method, we could replace this if/else chain with a single line like:

node.execute(xxx);
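
A minimal sketch of the shape being discussed, not the code in this PR. It assumes a hypothetical ExecutionContext that carries the previous node's output, so a ValidateNode-style check can still read it even though execute returns void. The fields of ExecutionContext, the simplified DagNode, and the names InsertRecordsNode, ExpectCountNode, and SketchDagScheduler are all illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical context POJO: carries runtime state from one node to the next.
class ExecutionContext {
  private Object lastResult; // output of the most recently executed node

  Object getLastResult() { return lastResult; }
  void setLastResult(Object lastResult) { this.lastResult = lastResult; }
}

// Simplified stand-in for DagNode; the real class holds more state.
abstract class DagNode {
  private final List<DagNode> children = new ArrayList<>();

  DagNode addChild(DagNode child) {
    children.add(child);
    return this;
  }

  List<DagNode> getChildren() { return children; }

  // The abstract method proposed in the review comment.
  public abstract void execute(ExecutionContext context) throws Exception;
}

// Illustrative node: pretends to write records and publishes the count.
class InsertRecordsNode extends DagNode {
  private final int numRecords;

  InsertRecordsNode(int numRecords) { this.numRecords = numRecords; }

  @Override
  public void execute(ExecutionContext context) {
    context.setLastResult(numRecords);
  }
}

// Illustrative validation node: checks the previous node's output.
class ExpectCountNode extends DagNode {
  private final int expected;

  ExpectCountNode(int expected) { this.expected = expected; }

  @Override
  public void execute(ExecutionContext context) {
    if (!Integer.valueOf(expected).equals(context.getLastResult())) {
      throw new IllegalStateException(
          "Expected " + expected + " but got " + context.getLastResult());
    }
  }
}

// Breadth-first walk; a real scheduler would also wait for all parents of a node.
class SketchDagScheduler {
  static void run(DagNode root, ExecutionContext context) throws Exception {
    Deque<DagNode> queue = new ArrayDeque<>();
    Set<DagNode> visited = new HashSet<>();
    queue.add(root);
    while (!queue.isEmpty()) {
      DagNode node = queue.poll();
      if (!visited.add(node)) {
        continue; // already executed via another parent
      }
      node.execute(context); // no if/else on the node type
      queue.addAll(node.getChildren());
    }
  }
}
```

Keeping the previous output in the context is only one option; having execute return a value, as the author describes, would serve validation equally well.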

@n3nash n3nash force-pushed the hudi_test_suite_refactor branch from 3931ef6 to 40c82ea Compare November 4, 2019 05:01
@yanghua
Contributor

yanghua commented Nov 4, 2019

Hi @n3nash, thanks for refactoring the original PR. I left some comments; some of them may need further discussion.

  • General:

    • Some class files are missing the license header, and some still carry the old license information (Uber -> Apache?).
    • It would be better to rename the module to reflect its real purpose, e.g. hudi-end-to-end-test, hudi-test-suite, or something else.
  • Detailed:

    • Please see the inline comments.

@n3nash n3nash force-pushed the hudi_test_suite_refactor branch from 40c82ea to 1209fd3 Compare November 4, 2019 06:58
@n3nash
Contributor Author

n3nash commented Nov 4, 2019

@yanghua Thanks for pointing out the Uber -> Apache licensing, I've fixed that in all files now.
I'm open to naming it hudi-test-suite (that's what we called it before).

Contributor

@yanghua yanghua left a comment

Some comments from my side.

Contributor

{@org.apache.hadoop.hdfs.LocalFileSystem} -> {@link org.apache.hadoop.hdfs.LocalFileSystem}

Contributor Author

Done

Contributor

@yanghua yanghua left a comment

Some comments from my side.

Contributor

The Avro and Parquet file extensions are used in many places. Shall we define a shared constant?

Contributor Author

I'd like to keep it contained since it's only used in 2 places. I made it public and consolidated the usages.
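
For illustration only, a shared constant of the kind suggested above could look like the sketch below. The class name FileExtensions, the helper methods, and the literal values (assumed to be the conventional ".avro" and ".parquet") are hypothetical and do not come from this PR.

```java
// Hypothetical holder class; the name, values, and helpers are illustrative only.
final class FileExtensions {
  static final String AVRO_EXTENSION = ".avro";
  static final String PARQUET_EXTENSION = ".parquet";

  private FileExtensions() {
    // constants holder, not meant to be instantiated
  }

  // True when the path ends with the Avro extension.
  static boolean isAvro(String path) {
    return path.endsWith(AVRO_EXTENSION);
  }

  // True when the path ends with the Parquet extension.
  static boolean isParquet(String path) {
    return path.endsWith(PARQUET_EXTENSION);
  }
}
```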

Contributor

@yanghua yanghua left a comment

Some new comments.

Contributor

It would be better to standardize on one logging facade (slf4j) across the whole project.

Contributor Author

We have been using apache log4j, let's file a ticket and discuss this there.

Contributor

It seems there is a plan to replace log4j with slf4j. So let's do this later.

Member

there is a ticket for this already.

Contributor

Shall we replace System.out.println with logging?

Contributor Author

This is a remnant of debugging; removed.

Contributor

Shall we replace System.out.println with logging?

Contributor Author

same here

Contributor

IMO, it would be better to wrap each of the two statements in its own try/catch, so that if the first one throws an exception the second one is still invoked.

Contributor Author

Well, if one of them fails, ideally we want to throw the exception which would terminate the jvm and hence the local hive service as well.
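
One possible way to reconcile the two positions above, sketched under assumptions rather than taken from this PR: attempt every stop call in its own try/catch, then rethrow the first failure so the process still terminates with an error. StoppableService and ShutdownHelper are hypothetical names, not classes from the test suite.

```java
// Hypothetical abstraction over a service that can be shut down.
interface StoppableService {
  void stop() throws Exception;
}

final class ShutdownHelper {
  private ShutdownHelper() {
    // static helper, not meant to be instantiated
  }

  // Stops every service, even if an earlier one fails, then rethrows the first failure.
  static void stopAll(StoppableService... services) throws Exception {
    Exception first = null;
    for (StoppableService service : services) {
      try {
        service.stop();
      } catch (Exception e) {
        if (first == null) {
          first = e; // remember the first failure
        } else {
          first.addSuppressed(e); // keep later failures attached to it
        }
      }
    }
    if (first != null) {
      throw first; // still fail loudly after all stops were attempted
    }
  }
}
```

With this shape the second stop call always runs, and the caller still sees an exception if either call failed.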

Contributor

It would be better to unify the prefix of the class name, based on the naming of the project, HoodieTestSuiteJob looks better.

Contributor Author

done

Contributor

@yanghua yanghua left a comment

Still some comments

Contributor

This helper class reads both Avro and Parquet files, so we need to update this doc.

Contributor Author

done

Contributor

AVRO -> PARQUET

Contributor Author

done

Contributor

Here is a suggestion about the naming of the nodes to make the DAG clearer. For example, we can rename them as follows: root -> firstLevel, child1 -> secondLevel, child2 -> thirdLevel1, newNode2 -> thirdLevel2. This is just an example; there may be another way to make the structure clearer.

Contributor Author

Renamed and made it clearer.

Contributor

Here is also a naming suggestion. It seems we have too many classes with a name pattern like xxxWrapper. What about HiveTestServiceProvider?

Contributor Author

There are only 2 classes with "wrapper" in the name, not many :) But I like the name you suggested, so I made the change.

Contributor

It seems this class is not used anywhere. Is it useful?

Contributor Author

Yes, need to use this to efficiently consume batches. Will make adjustments to use this class soon.

@yanghua
Contributor

yanghua commented Nov 5, 2019

@n3nash There is now a conflicting file.

Contributor

@yanghua yanghua left a comment

Two new comments.

Contributor

What about renaming it to DeltaOutputType? Generally, there are two common pairs: input <-> output and source <-> sink.

Contributor Author

done, I agree

Contributor

I am confused. It seems we have removed this class from the master branch, @vinothchandar right?

Contributor

IMO, we can also replace this if/else chain with another technique here (e.g. reflection, a unified constructor, ...). WDYT?
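
A rough sketch of the reflection idea, assuming every implementation exposes the same single-argument constructor. All names here (WorkloadConfig, Workload, InsertWorkload, UpsertWorkload, WorkloadFactory) are hypothetical stand-ins, not classes from this PR.

```java
import java.lang.reflect.Constructor;

// Hypothetical config object shared by every constructor.
class WorkloadConfig {
  final int numRecords;

  WorkloadConfig(int numRecords) { this.numRecords = numRecords; }
}

// Hypothetical base type; each subclass keeps the same constructor signature.
abstract class Workload {
  protected final WorkloadConfig config;

  Workload(WorkloadConfig config) { this.config = config; }

  abstract String describe();
}

class InsertWorkload extends Workload {
  InsertWorkload(WorkloadConfig config) { super(config); }

  @Override
  String describe() { return "insert " + config.numRecords; }
}

class UpsertWorkload extends Workload {
  UpsertWorkload(WorkloadConfig config) { super(config); }

  @Override
  String describe() { return "upsert " + config.numRecords; }
}

final class WorkloadFactory {
  private WorkloadFactory() {
    // static factory, not meant to be instantiated
  }

  // Looks the class up by name and calls its (WorkloadConfig) constructor,
  // so adding a new implementation needs no new if/else branch.
  static Workload create(String className, WorkloadConfig config)
      throws ReflectiveOperationException {
    Class<? extends Workload> clazz = Class.forName(className).asSubclass(Workload.class);
    Constructor<? extends Workload> ctor = clazz.getDeclaredConstructor(WorkloadConfig.class);
    return ctor.newInstance(config);
  }
}
```

The trade-off is that a mistyped class name or a mismatched constructor only surfaces at runtime, which is why the unified constructor signature matters.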

Contributor Author

I think that's possible. I will address this at the end, after all other comments are addressed.

Member

+1.

Contributor Author

Done

@yanghua
Contributor

yanghua commented Nov 15, 2019

@n3nash Is your latest force-pushed commit an attempt to fix the CI issues? I would not suggest force-pushing to a big PR that needs many iterations. Separate commits can be reviewed more easily. WDYT?

As @vinothchandar said, this is a big PR. Since it's a single, independent module, can we split it into several subtasks to make it more manageable? We could add some subtasks under HUDI-289. WDYT?

@vinothchandar
Member

+1 on this PR. Deferring the squash until merge time would be ideal.
+1 on making the build pass first, before we spend more time on it. Ideally we should also add a few "recipes" to the integ test and have them passing too.

@yanghua I agree that ideally we would have phased this as smaller check-ins. But this one is a special case, since it was started even before incubation, so we are where we are.

We can treat Nishith's branch as the current feature branch, or even merge this into a feature branch in apache/incubator-hudi (like we did for the packaging fixes) and keep iterating there. That would also help @yanghua contribute changes on top of this, and ultimately we merge it into master when ready. My 2c.

@yanghua
Contributor

yanghua commented Nov 16, 2019

@vinothchandar I agree with you, especially about moving this PR into a feature branch so that I can interact with @n3nash. When can we move to the feature branch? Right now it is not convenient for me to update this PR because it may cause conflicts.

@n3nash
Contributor Author

n3nash commented Nov 16, 2019

@yanghua I'd like to first understand our plan for iterating on this feature before we can decide to move it to a feature branch.

  1. Can you list out the changes you wish to make on top of the existing PR ?
  2. Let's not try to push to this PR
    Once I understand our plan to merge this PR and the incremental changes we want to make on top of this, I'm open to separating this into a feature branch.

On a side note of force-push, I think it's fair to squash later since this is a large PR.

@vinothchandar
Member

I think there are open comments here already. But getting this PR to pass Travis and moving it to a feature branch would be a good starting point.

Ultimately, @yanghua can see if his JIRA for long-running tests can be implemented on top of this. I expect he might have to make changes on top of this. Hope that provides some context.

@yanghua
Contributor

yanghua commented Nov 18, 2019

> @yanghua I'd like to first understand our plan for iterating on this feature before we can decide to move it to a feature branch.
>
>   1. Can you list out the changes you wish to make on top of the existing PR ?
>   2. Let's not try to push to this PR
>     Once I understand our plan to merge this PR and the incremental changes we want to make on top of this, I'm open to separating this into a feature branch.
>
> On a side note of force-push, I think it's fair to squash later since this is a large PR.

Hi @n3nash, there are some things that should be done in the future. For example:

  • Try to use DistributedTestDataSource to generate the workload;
  • Work out how to integrate with an online pipeline service that provides the basic infrastructure;
  • ...

Actually, the reason I want to move to a feature branch is that I want to iterate with you quickly, not just comment (I can fix some of the issues I find myself). If we both work on the same PR, conflicts will be very frequent. WDYT?

@vinothchandar
Member

@n3nash I have not read through all of your responses yet. Will do so in some time.

But I just wanted to highlight one issue that I feel strongly about. IMO, for any test framework, we need to write a few hard test cases that identify issues and run continuously in a CI environment for a few weeks before we can deem it complete. In that sense, if we have a feature branch, you both can iterate quickly and get us to that stage.

We have some large ambitions in the coming months, and having this in top shape will ease all of our minds. As someone supporting one of the largest data lakes out there :), I am sure you'd agree that catching issues upfront is way, way, way better than chasing them.

@yanghua
Contributor

yanghua commented Nov 20, 2019

@n3nash Travis is still red. Can we move this PR into a new branch now so that I can start working on it?

@n3nash n3nash force-pushed the hudi_test_suite_refactor branch from 1033f82 to dac689e Compare November 20, 2019 10:03
    - Flexible schema payload generation
    - Different types of workload generation such as inserts, upserts etc
    - Post process actions to perform validations
    - Interoperability of test suite to use HoodieWriteClient and HoodieDeltaStreamer so both code paths can be tested
    - Custom workload sequence generator
    - Ability to perform parallel operations, such as upsert and compaction
@n3nash n3nash force-pushed the hudi_test_suite_refactor branch 4 times, most recently from 8173923 to b060994 Compare November 21, 2019 07:49
@n3nash
Contributor Author

n3nash commented Nov 21, 2019

@yanghua I've pushed the PR to a feature branch : https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor. I will shortly close this PR.

The build right now is failing due to exceeded log limits, I'm looking at ways to fix them. Once you have started working on the first task you want to add to the branch, let's chat, I'd like to understand what you have in mind and how it aligns with the current PR so that we can provide a first version of test suite with sufficient features for large feature testing.

@n3nash n3nash force-pushed the hudi_test_suite_refactor branch from b060994 to 0c2ed53 Compare November 21, 2019 11:11
@yanghua
Contributor

yanghua commented Nov 21, 2019

@n3nash Thanks. Will check out the new feature branch soon.

@vinothchandar
Member

@yanghua @n3nash What's the status on this as of now? It would be great to see a plan written down somewhere.

@yanghua
Contributor

yanghua commented Nov 27, 2019

@n3nash there are still some conflicts.

@vinothchandar While waiting for this PR to be pushed to a single feature branch, I spent some time on #1049 and HUDI-184. I will return to the test suite soon.

@yanghua
Contributor

yanghua commented Nov 28, 2019

There are so many conflicts. I tried to rebase onto the master branch, but that is hard to do with multiple commits. So I would like to first squash the commits and then rebase to fix the conflicts. FYI, cc @n3nash @vinothchandar

@yanghua yanghua mentioned this pull request Nov 28, 2019
5 tasks
@vinothchandar
Member

I think @n3nash has pushed a new feature branch? Is this PR still relevant? I'll let @n3nash chime in. But if this PR is what we need, then resolving the conflicts and getting it into good shape is top priority IMO. Do you agree, @n3nash?

@yanghua
Contributor

yanghua commented Dec 1, 2019

Yes, we should rebase it onto the master branch. That work is reflected in #1057. However, I cannot merge it into this PR or the feature branch, so I created a copy.

@yanghua
Contributor

yanghua commented Dec 2, 2019

Some pending work on my side:

  • introduce DistributedTestDataSource to refactor the existing DeltaGenerator;
  • rename the hudi-bench module;
  • replace all uses of the code style log.info(String.format(xxx));
  • research Azure Pipelines;

@vinothchandar
Member

@yanghua If you have a buildable PR that you feel can be used as a starting point, please go ahead.
Are we at that point, or are you waiting for something else?

We can close this one, as @n3nash mentioned before? (Please close it if you agree as well.)

> will shortly close this PR.

@yanghua
Contributor

yanghua commented Dec 3, 2019

OK, let's close this PR and move to #1057.

@yanghua yanghua closed this Dec 3, 2019