Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: add subtask register #4896

Merged
merged 3 commits into from
Jun 8, 2023

Conversation

chenggui53
Copy link
Contributor

@chenggui53 chenggui53 commented Apr 11, 2023

⚠️ Pre Checklist

Please complete ALL items in this checklist, and remove before submitting

  • I have read through the Contributing Documentation.
  • I have added relevant tests.
  • I have added relevant documentation.
  • I will add labels to the PR, such as pr-type/bug-fix, pr-type/feature-development, etc.

Summary

This PR addresses the issue of subtask registration in our GitLab plugin. Previously, subtasks had to be manually specified in a particular order, which was error-prone and time-consuming. To solve this problem, I added a register and dependency sorter that automatically register and orders subtasks based on their dependencies.

Does this close any open issues?

Closes #4413

Screenshots

Include any relevant screenshots here.

Other Information

Any other information that is important to this PR.

@klesh
Copy link
Contributor

klesh commented Apr 12, 2023

@chenggui53 Thanks for your contribution.
I don't think this can be merged since we need those subtasks to be executed IN ORDER, however, init() func breaks this principle.

@chenggui53
Copy link
Contributor Author

@chenggui53 Thanks for your contribution. I don't think this can be merged since we need those subtasks to be executed IN ORDER, however, init() func breaks this principle.

@klesh Yes, so when I registered subtask with init, I added pre-dependencies declaration and queued the results when fetching the list in backend/plugins/gitlab/impl/impl.go, which was tested in the order defined by the previous code. At the same time, this part will only be called when the plugin is initialized, the ordering will not affect the performance, and can reduce the problems caused by the wrong order of the plugins or the plugins are not registered.

@chenggui53
Copy link
Contributor Author

@chenggui53 Thanks for your contribution. I don't think this can be merged since we need those subtasks to be executed IN ORDER, however, init() func breaks this principle.

@klesh In the short time I've been using the gitlab plugin, I've found some problems caused by manually filling in subtasks (once a specific subtask not being added to the list and not taking effect after being declared in the advanced configuration, and once a subtask being added repeatedly), I think it's worth taking some time to solve it systematically. If Gitlab's plugin registration can be well optimized, the relevant code can also be used in plugins with more subtasks such as GitHub, reducing the possibility of problems.

@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch from 6b12b65 to 4067987 Compare April 13, 2023 10:57
@klesh
Copy link
Contributor

klesh commented Apr 13, 2023

@chenggui53 Aah, you are right, sorry for missing the point.
@keon94 @CamilleTeruel @hezyin @likyh @mindlesscloud what do you think?

@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch from 46af817 to bcb74d3 Compare April 13, 2023 14:11
func init() {
register(SubTaskMetaRegister{
Plugin: &CollectTagMeta,
Dependencies: []string{ExtractApiMergeRequestDetailsMeta.Name},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like Dependencies is always just going to be a single subtask? Is there a reason for this to be a slice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands, we only need to specify a dependent subtask, but if it has dependencies on the results of several subtasks when creating a new subtask, then slice will be useful

@klesh
Copy link
Contributor

klesh commented Apr 14, 2023

not sure if we need this kind of mechanism, if so, is it appropriate to sit it on the task level (organize subtasks in a plugin only), or should we do it on the framework level (organize tasks and their subtasks across all plugins). I would like to hear your thoughts on this.

@chenggui53
Copy link
Contributor Author

not sure if we need this kind of mechanism, if so, is it appropriate to sit it on the task level (organize subtasks in a plugin only), or should we do it on the framework level (organize tasks and their subtasks across all plugins). I would like to hear your thoughts on this.

@klesh I personally think that this registration mechanism should be used at the framework level, but since I am not sure whether this change will be accepted, I only modified the gitlab plugin that I use a lot first. If needed, I can change this to the framework level and adapt it to other plugins.

@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch 2 times, most recently from 3d2acb9 to 4857973 Compare April 14, 2023 12:24
@CamilleTeruel
Copy link
Contributor

CamilleTeruel commented Apr 14, 2023

not sure if we need this kind of mechanism, if so, is it appropriate to sit it on the task level (organize subtasks in a plugin only), or should we do it on the framework level (organize tasks and their subtasks across all plugins). I would like to hear your thoughts on this.

It is currently the responsibility of the plugin to order its subtasks while taking into account their dependencies.
I think it is valuable to let the plugin declare those dependencies and let the framework order subtasks accordingly. I find it less error-prone because it declares something that was tacit.

Declaring those dependencies might also be a first step to enable other improvements.
Currently we have a hierarchy of different task types: plans, stages, tasks and subtasks.
This has the merit of imposing a structure on task definition but it also comes with costs:

  • Dependencies are implicit:
    • a subtask might depend on a subtask listed before in the same task or on a subtask from an earlier stage. Only a fraction of those implicit dependencies are genuine. With declared dependencies we may be able to parallelize subtasks more efficiently.
    • The plugin developer has to know and remember the execution rules: stages and subtasks are executed sequentially and tasks in parallel. Those execution details could be hidden from the plugins.
  • This make the code to generate pipelines plan complex (~100 lines of procedural code for each plugin)
  • This introduces different protocols for data source plugins and metric plugins. Everything could be declared with dependencies instead.
  • Non-uniformity: there is a possible logic duplication between different types of tasks for concerns like: execution order, resource management, monitoring, caching, error handling, retries, ...

Adding dependency declaration at the subtask level does not solves all those point but is a step in the good direction IMO.

Now regarding implementation, I find SubtaskMeta to be a better place to declare those dependencies than to call a register function in init functions. I understand that the purpose of the subtask center is to handle subtask registration, and that calling a function to register each subtask is needed for that. But I think that subtask registration should be handled separately with auto-discovery by the framework.

For how to express dependencies, I think we should not force to declare that a subtask depends on another specific subtask but instead declare that the task require that some type of entity has been collected. I agree that it poses no problem for dependencies between subtasks defined by the same plugin: an extractor will always depend on the same convertor for example. But this doesn't work for dependencies between subtasks from different different plugins. For example, the DORA plugin requires some data to be present, like CICDPipelineCommit or PullRequest, but doesn't and mustn't care which plugin collected that data. So the declaration of dependencies should say "I need pull requests" not "I depend on gitlab plugin". I think that declaring dependencies that way would allow us to simplify plugin code and the plugin model of the framework.

To sum up, here is what I think we sould do:

  • Add Requirements and Productions slice fields to SubtaskMeta
    • The type of each item is not clear as it should ideally be a supertype of RawData, and all tool and domain models. So some refactorings are needed to subsume all those types. Otherwise have to fallback to []interface{} or []string, which is not great but can work as a first step.
    • domain types may be deduced from productions, so we can later remove the DomainTypes field
  • Let the framework order the subtasks of a plugin according to their requirements and provisions
  • Add auto-discovery of a plugin subtasks

@chenggui53
Copy link
Contributor Author

@CamilleTeruel Thank you very much for your guidance, as I am not particularly familiar with many aspects of devlake, and after reading your instructions, I feel very helpful and I will refer to your instructions to optimize these a bit, which may take some time.

@klesh
Copy link
Contributor

klesh commented Apr 16, 2023

Great discussion, thanks to @CamilleTeruel @chenggui53
I would like to add my 2 cents.

  1. auto-discovery may not be easy in Golang, especially since all our plugins get compiled into a .so file, there is no official support for scanning Types/Vars in a dynlib. I think init func is acceptable if most of us agreed that putting registration in another file is bad.
  2. dependency over entities is indeed a good idea, however, it is kind of complicated, I would accept tasks-base dependency at the moment. I think they can coexist in the future, developers can choose one or another to achieve the goal. We can start on the plugin level by providing a helper function to resolve the dependency and see how it goes.
  3. I don't see the loop detection in the implementation, the dependency should be a DAG, I think we should do it.
  4. Is the sorting algo stable? I would prefer a stable outcome for debugging's sake.

@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch from 4857973 to f85c4f7 Compare April 17, 2023 01:27
@chenggui53 chenggui53 changed the title refactor(gitlabPlugin): add subtask register refactor: add subtask register Apr 17, 2023
@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch 3 times, most recently from 3dfba8a to b08cad9 Compare April 19, 2023 10:46
@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch from aee16ee to aaa5521 Compare May 5, 2023 11:49
@chenggui53 chenggui53 closed this May 5, 2023
@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch from aaa5521 to 48d6413 Compare May 5, 2023 12:15
@chenggui53 chenggui53 reopened this May 31, 2023
@klesh
Copy link
Contributor

klesh commented Jun 2, 2023

Looking good! Good job. Is it ready yet?

@chenggui53
Copy link
Contributor Author

Looking good! Good job. Is it ready yet?

hi @klesh, sorry for taking so long, can you help to review this pr when you have time? It mainly includes the following content (to avoid a large number of modifications in each plugin, first only modify the gitlab plugin, and then apply it to other plugins after the code is ok)

in backend/core

  1. Add the subTaskMetas dependencies field in the subTaskMeta structure

in backend/plugin

  1. Stable topological sorting algorithm and loop detection
  2. The sort interface can be extended to entity-level sorting in the future. Currently, only subtask-level sorting is implemented

In backend/plugins/gitlab, take gitlab as an example

  1. In the impl.go file, loop detection will be performed during package init
  2. Add dependencies definition to each subTaskMeta
  3. Each subtask adds an init function for registration
  4. Added the register.go file to register subtask, and added tests to ensure compliance with the tone

@klesh
Copy link
Contributor

klesh commented Jun 6, 2023

@chenggui53 Nice job, will do

chenggui53 pushed a commit to chenggui53/incubator-devlake that referenced this pull request Jun 6, 2023
chenggui53 pushed a commit to chenggui53/incubator-devlake that referenced this pull request Jun 6, 2023
chenggui53 pushed a commit to chenggui53/incubator-devlake that referenced this pull request Jun 6, 2023
chenggui53 pushed a commit to chenggui53/incubator-devlake that referenced this pull request Jun 6, 2023
chenggui53 pushed a commit to chenggui53/incubator-devlake that referenced this pull request Jun 6, 2023
@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch 2 times, most recently from 7f28107 to 97eb8ed Compare June 6, 2023 11:13
@chenggui53
Copy link
Contributor Author

hi @klesh I‘ve fix some workflow errors, can you help trigger workflow again ?

@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch 3 times, most recently from c8a6b6d to 92b28b4 Compare June 7, 2023 02:25
Copy link
Contributor

@klesh klesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenggui53 Very impressive work 👍👍👍. Would you mind resolving some minor issues?🙏

backend/plugins/gitlab/tasks/register_test.go Outdated Show resolved Hide resolved
@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch 2 times, most recently from 48b0849 to 93fdc93 Compare June 7, 2023 11:47
@chenggui53 chenggui53 requested a review from klesh June 7, 2023 11:48
@chenggui53 chenggui53 force-pushed the feat-add-gitlab-subtask-register branch from c988df0 to 1548893 Compare June 8, 2023 07:44
Copy link
Contributor

@klesh klesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@klesh
Copy link
Contributor

klesh commented Jun 8, 2023

@chenggui53 Thanks for your contribution, it has been a pleasure working with you.

@chenggui53
Copy link
Contributor Author

chenggui53 commented Jun 8, 2023

@chenggui53 Thanks for your contribution, it has been a pleasure working with you.

@klesh Me too, and I really appreciate your guidance :). Do you think it's time to apply this register to other plugins, or I should create a new pr to do that?

@klesh klesh merged commit b63489c into apache:main Jun 8, 2023
7 of 9 checks passed
@chenggui53 chenggui53 deleted the feat-add-gitlab-subtask-register branch August 4, 2023 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants