Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove tempStorageDirectory and rely on task dir instead #16416

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

adarshsanjeev
Copy link
Contributor

@adarshsanjeev adarshsanjeev commented May 8, 2024

Removes the temporary directory used by durable storage and export, from a user configured value. This PR instead reuses the temporary storage configured for the task.

Pending:

  • This PR adds the bindings to peons to inject the temporary directory location. However, these changes are still pending for indexers.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

}

@Override
public StorageConnector get()
{
return new AzureStorageConnector(this, azureStorage);
final File tempDir = injector.getInstance(Key.get(File.class, Names.named(DataSourceTaskIdHolder.TMP_DIR_BINDING)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this change @adarshsanjeev 👍
Although, it might be worth checking if you can directly JacksonInject the File object - one way to try it could be to do :

@JacksonInject
@Named(DataSourceTaskIdHolder.TMP_DIR_BINDING)
File tempDir;

Also, could consider adding the JacksonInject in the constructor itself to keep all the relevant things there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lastly, could be worth having the temp dir instance setup for other processes as well, and also for indexers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the above initially, but injecting the tempDir as a variable fails (maybe since it is a jackson inject, and that binding has not been added?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it fail for processes which are not peons? if that's the case you can try also marking this Nullable
also, if you have the error available, please feel free to paste it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above code fails for peons with the error
Guice configuration errors: 1) No implementation for java.io.File annotated with @com.google.inject.name.Named(value="druidTempDirectory") was bound. while locating java.io.File annotated with @com.google.inject.name.Named(value="druidTempDirectory") 1 error

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels May 21, 2024
@adarshsanjeev adarshsanjeev marked this pull request as ready for review May 21, 2024 05:20
Copy link
Contributor

@findingrish findingrish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, a note asserting that the changes are backward compatible would help!

List<String> exportedFiles = (List<String>) queryKernel.getResultObjectForStage(finalStageId);
log.info("Query [%s] exported %d files.", queryDef.getQueryId(), exportedFiles.size());
exportMetadataManager.writeMetadata(exportedFiles);
} else if (MSQControllerTask.isExport(querySpec)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why were there two else condition with the same condition originally MSQControllerTask.isExport(querySpec)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to a merge from master. Git might have created a new copy of the if condition during the merge. It was added here: https://github.com/apache/druid/pull/16168/files#diff-b9c30db60c6f71d4ccabd6e9ebc8dc1fbeee4db9fca63d5fbaa0ef5833e8e4acL1796.

@@ -64,14 +57,12 @@ public class AzureOutputConfig
public AzureOutputConfig(
@JsonProperty(value = "container", required = true) String container,
@JsonProperty(value = "prefix", required = true) String prefix,
@JsonProperty(value = "tempDir", required = true) File tempDir,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this exchanged between the Controller and Workers? Can this change cause any failure during upgrade?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is created from the runtime properties, and should not be passed between workers. TmpDirectories should not have been passed anywhere, since they are "per peon" properties, and passing the value doesn't make sense.

The only changed class that should have been passed between peons is the exportStorageProvider classes, and these do not contain the tmpDirectory values.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. I am wondering if there's a way to inject the tempDir directly into the connector using Guice while constructing the instance, instead of relying on the callers to pass it.

Can you please add a release note for the PR stating that the druid.export.storage.s3.tempLocalDir and related properties are now defunct and storage connector will use the tasks's temp directory to store the temporary data.


@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
public interface StorageConnectorProvider extends Provider<StorageConnector>
public interface StorageConnectorProvider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why this change is needed.

Comment on lines -87 to -90
binder.bind(Key.get(StorageConnector.class, MultiStageQuery.class))
.toProvider(Key.get(StorageConnectorProvider.class, MultiStageQuery.class))
.in(LazySingleton.class);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change required? Do we still preserve the singleton nature of the storage connector? Also, we should still preserve the binding annotation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Dependencies Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants