-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add query results directory and prevent the auto cleaner from cleaning it #14446
Add query results directory and prevent the auto cleaner from cleaning it #14446
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR LGTM along with a few comments.
Can you please test it out as well and confirm if it's working locally? You can set up a LocalFileStorageConnector
and then manually create a couple of files under the paths and run the durable storage cleaner and check if its execution is as expected.
Also, please update the docs with this behavior and add a release note for the same.
@@ -150,4 +154,23 @@ public static String getControllerTaskIdWithPrefixFromPath(String path) | |||
return null; | |||
} | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the doc with an example of the query results path as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with examples
@@ -138,7 +142,7 @@ public static String getOutputsFileNameForPath( | |||
* </ul> | |||
*/ | |||
@Nullable | |||
public static String getControllerTaskIdWithPrefixFromPath(String path) | |||
public static String getNextDirNameWithPrefixFromPath(String path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the Javadoc for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
if (nextDirName != null && !nextDirName.isEmpty()) { | ||
if (runningTaskIds.contains(nextDirName)) { | ||
// do nothing | ||
} else if (DurableStorageUtils.QUERY_RESULTS_DIR.equals(nextDirName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please confirm if this logic works? Let's say __query_results has a known id and an unknown id, then we would execute files.remove(currentFile)
when we iterate over the unknown id.
However the currentFile
would be the whole results directory or am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not understand what you meant here, current file would be something inside "unknown id", so it would add that file to the list to delete. (We only delete individual files in the cleaner, the storage connector does not allow us to delete directories with the function called)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point! Just revisited the Javadocs for listDir
. This LGTM, thanks for clarfying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! LGTM 🚀
…g it (apache#14446) Adds support for automatic cleaning of a "query-results" directory in durable storage. This directory will be cleaned up only if the task id is not known to the overlord. This will allow the storage of query results after the task has finished running.
This PR makes some changes to the MSQ auto cleaner to add support for a "query results" directory in durable storage. This change will allow us to store query results as well on durable storage in the future.
The new structure of durable storage now contains a new special directory
query-results/<controller_id>/<file>
: This will contain query results in the future. Files in this directory would be cleaned by the auto cleaner once the taskcontroller_id
has been dropped from the list of known tasks.This should not break backward compatibility as the structure is the same as before, with an additional directory. An earlier version of overlord should simply delete the new query results directory as well.
Tested out the changes are working with an S3StorageConnector. Query directories for task Ids which were returned by knownTasks were not deleted. Other files were deleted as before.
Release notes:
This PR has: