
[Feature][Task instance] Automatic cleaning of database task instance table #16145

Closed
3 tasks done
pinkfloyds opened this issue Jun 13, 2024 · 14 comments
Labels: feature, Stale

Comments

@pinkfloyds (Contributor)

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

After long-term use of DolphinScheduler, the task instance table accumulates a large amount of data, which leads to slow queries. It would be useful to have a feature that automatically cleans up old task instance records, with a precise whitelist/blacklist mechanism at the project/ProcessDefinition level. After simple configuration, it would clean up automatically on a periodic schedule (every few months).

I would like to contribute this feature. One question: would it be more appropriate to integrate this as a system-level function of DS, or to configure it per ProcessDefinition?

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct.

pinkfloyds added the feature and Waiting for reply labels Jun 13, 2024
@EricGao888 (Member) commented Jun 13, 2024

I think you need to provide a more detailed design. Configuring at the system level and at the workflow-definition level each has pros and cons; the question is too general for reviewers to answer without a detailed design.

EricGao888 removed the Waiting for reply label Jun 13, 2024
@EricGao888 (Member) commented Jun 13, 2024

BTW, you need to submit a DSIP for any potential modifications to the DS core.

#14102

@zhuxt2015 (Contributor)

I recommend using the MySQL Event Scheduler to automatically delete old task instances.
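As a rough sketch of this suggestion, assuming a MySQL metadata database with the Event Scheduler enabled (the event name, schedule, and three-month retention window are illustrative placeholders, not part of the proposal above):

```sql
-- The Event Scheduler must be enabled first:
--   SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS clean_task_instance
ON SCHEDULE EVERY 1 MONTH
DO
  DELETE FROM t_ds_task_instance
  WHERE start_time < NOW() - INTERVAL 3 MONTH;
```

One trade-off of this approach is that the cleanup runs inside the database, outside DolphinScheduler's own scheduling and permission model.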

@davidzollo (Contributor)

Good idea. Please design a proposal first.

@pinkfloyds (Contributor, Author)

Plan One: Configure Workflow:

# black list
delete from t_ds_task_instance
       where (project_code in ($project_code)
          or process_instance_id in (select id
                                     from t_ds_process_instance
                                     where process_definition_code in ($process_definition_code)))
                 and start_time < NOW() - INTERVAL 3 MONTH;
# white list
delete from t_ds_task_instance
       where (project_code not in ($project_code) or
          process_instance_id not in (select id from t_ds_process_instance
                                            where process_definition_code in ($process_definition_code)))
      and start_time < NOW() - INTERVAL 3 MONTH;

Choose either the blacklist or the whitelist. The example retains data from the past three months; modify the interval as needed. Fill in the corresponding values for $project_code and $process_definition_code (separated by commas), and remove a condition entirely if it is not needed.
Set up an SQL node to implement the above. The metadata database needs to be registered in the datasource center, and the workflow is scheduled to run on the 1st of every month; it can then be exported as a JSON file for download and reuse.
User documentation needs to be written for the manual configuration. There are many manual steps, so improper use may cause unexpected results and there is a slight barrier to entry, but the development cycle is short (it is basically developed already).
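For concreteness, here is what the blacklist statement might look like with the placeholders filled in (the codes 1234567890 and 9876543210 are hypothetical values, not real project or definition codes):

```sql
-- Hypothetical example; substitute real project/definition codes
delete from t_ds_task_instance
       where (project_code in (1234567890)
          or process_instance_id in (select id
                                     from t_ds_process_instance
                                     where process_definition_code in (9876543210)))
         and start_time < NOW() - INTERVAL 3 MONTH;
```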

Plan Two: Integration at the System Level:
Add a "Log Cleaning Policy" item to the menu and implement the above functionality in the UI. All users could use it directly, but the development cycle is longer.

@pinkfloyds (Contributor, Author)

How about this plan?
Optimized Plan One:

Once the workflow is developed, it would be added to DS's database initialization script, along with detailed documentation, so that all users can use it easily after simple configuration.

@eelnomel

I used a Python script that connects to the metadata database to delete data from the t_ds_task_instance table. This made the workflow scheduler invalid, and the workflow could not be taken offline. Have you tested this? My version: 3.0.1.

@SbloodyS (Member)

> Plan One: Configure Workflow: … Plan Two: Integration at the System Level: … (quoted from the comment above)

Simply deleting rows from these tables is not enough. There are already APIs for deleting workflow definitions and workflow instances. We could create a sample workflow that selects the data to be deleted according to certain filtering criteria and calls the deletion API. I think this is good practice in terms of both user intrusiveness and iterative updates.

@pinkfloyds (Contributor, Author)

> Plan One: Configure Workflow: … Plan Two: Integration at the System Level: … (quoted from the comment above)

> Simply deleting these tables is not enough. There are already workflow definitions and workflow instances api for deleting. … (quoted from the reply above)

I understand, but some data operations are involved. For example, the whitelist requires querying all workflow definition ids, and DS does not provide a relevant interface. Should we add a dedicated interface or query directly with SQL? Both options seem a bit troublesome; which method is better?

I feel that implementing the above functionality with a Python script would be better than using a shell script.
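If the SQL route were chosen, the lookup in question might be as simple as the following (this assumes the t_ds_process_definition table of the DS metadata schema; treat the exact column names as an assumption):

```sql
-- Fetch all workflow definition codes for building the whitelist
select code, name from t_ds_process_definition;
```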

github-actions bot commented Aug 1, 2024

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Aug 1, 2024
@SbloodyS SbloodyS removed the Stale label Aug 1, 2024
@SbloodyS (Member) commented Aug 1, 2024

> I understand, but there are some data operations involved, such as the need to query all workflow_definitions_id for the whitelist, and DS does not provide a relevant interface. Should we add a specific interface or directly query using SQL? Both options seem a bit troublesome. Which method is better?
>
> I feel that implementing the above functionality with a Python script would be better than using a shell script.

Why do you think a Python script is better? DolphinScheduler is a Java project; using a unified language to implement functionality is a basic requirement.

@q4q5q6qw commented Aug 14, 2024

I have the same problem. I wonder if a scheduled task could be started in the code to perform the cleanup.
In addition, two configuration items could be added: the first indicates whether automatic deletion is enabled, and the second sets the retention period (for example, data generated more than one month ago is considered expired). The pseudo-code is as follows:

public void cleanupExpiredInstances(List<Instance> instances) {
    // Collect the instances whose retention period has elapsed and delete them in one batch
    List<Instance> expired = instances.stream()
            .filter(this::isInstanceExpired)
            .collect(Collectors.toList());
    if (!expired.isEmpty()) {
        instanceMapper.batchDelete(expired);
    }
}

private boolean isInstanceExpired(Instance instance) {
    // retentionDays would come from the second configuration item
    long expireThreshold = retentionDays * 24L * 3600 * 1000;
    long currentTime = System.currentTimeMillis();
    long startTime = instance.getStartTime().getTime();
    return currentTime - startTime > expireThreshold;
}

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Sep 14, 2024
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 21, 2024
Projects: none yet
Development: no branches or pull requests
7 participants