Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature-13324 contol loading lookups in peons #16266

Closed
wants to merge 1 commit into from

Conversation

IgorBerman
Copy link
Contributor

@IgorBerman IgorBerman commented Apr 11, 2024

Fixes #13324

Permits to disable lookups loading for some types of peons

for example compaction or kill tasks
this should improve memory footprint
e.g. for compaction we can use this memory for loading more rows into memory and thus improving throughput of compaction

I've followed same path as loadBroadcastSegments parameter works and reused LookupSerdeModule which already used in multiple services that doesn't need to load lookups

The fix adds doesntNeedLookups method to interface of task with default of false

Release note


Key changed/added classes in this PR
  • Task - added doesntNeedLookups with default implementation that returns false, i.e. most of peons will continue to load lookups
  • CompactionTask - overrides it with true
  • CliPeon - refers to the process parameter and loads relevant lookup module(which either loads lookups or not)

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IgorBerman , thanks for your solution! It makes sense to optimize lookup loading, but we need to leave some room in this design for future improvements. Please read comments for details.

@@ -331,4 +331,9 @@ static TaskInfo<TaskIdentifier, TaskStatus> toTaskIdentifierInfo(TaskInfo<Task,
taskInfo.getTask().getMetadata()
);
}

default boolean doesntNeedLookups()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method should rather be called needsLookups() or loadLookups(). There is no point naming this method in the negative.

Also, since we want to extend this feature in the future to be able to load lookups on demand, I think we should return an enum here with possible values NONE, ALL, ON_DEMAND.

For the time-being, the behaviour for ALL and ON_DEMAND would be the same. But later, ON_DEMAND can be optimized to load only required lookups.

So, overall:

default LookupLoadingMode needsLookups()
{
   return LookupLoadingMode.ALL;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @kfaraz for the comments.
Makes sense to me.
I'll ping you when your suggestions will be addressed.
Also here is something that I've discovered lately and if you have some suggestions/ideas please share them: I've tried to start nano cluster with my changes and discovered that it's not that simple. TaskToolbox injects LookupNodeService, i.e. currently it's not possible to remove it without additional refactoring, so I'm trying to convert it to "Lazy" one by injecting Providing instead so that LookupSerdeModule could create this Provider as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed interface @kfaraz

@@ -379,6 +379,11 @@ public TaskStatus call()
command.add("true");
}

if (task.doesntNeedLookups()) {
command.add("--doesntNeedLookups");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter should be called --loadLookups instead, and should take values all, none, onDemand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

@IgorBerman
Copy link
Contributor Author

run nano cluster locally and verified that all services started, ingestion from kafka works and compaction also able to work.
compaction task gets argument of
2024-04-18T08:28:44,500 INFO [main] org.apache.druid.cli.CliPeon - * sun.java.command: org.apache.druid.cli.Main internal peon var/druid/task/slot1/compact_my_ds_2024-04-18T08:28:42.514Z 1 --loadLookups NONE

@IgorBerman IgorBerman marked this pull request as ready for review April 18, 2024 13:02
Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @IgorBerman ! Overall, this looks like a neat approach to me. I have left some comments.

@@ -185,6 +187,13 @@ public class CliPeon extends GuiceRunnable
@Option(name = "--loadBroadcastSegments", title = "loadBroadcastSegments", description = "Enable loading of broadcast segments")
public String loadBroadcastSegments = "false";


/**
* If set to true then lookups won't be loaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This javadoc needs to be updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -363,7 +372,7 @@ public String getTaskIDFromTask(final Task task)
new IndexingServiceTuningConfigModule(),
new InputSourceModule(),
new ChatHandlerServerModule(properties),
new LookupModule()
LookupLoadingMode.NONE == loadLookups ? new LookupSerdeModule() : new LookupModule()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this logic in a private method getLookupModule() and add a javadoc there explaining what we are doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A log line which indicates the chosen lookup loading mode and the expected effects would also be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added log and private method. also added tests for it

@@ -537,6 +539,115 @@ ProcessHolder runTaskProcess(List<String> command, File logFile, TaskLocation ta
Assert.assertTrue(forkingTaskRunner.restore().isEmpty());
}

@Test
public void testTaskRun_doesnt_pass_loadLookups_Argument() throws Exception
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other tests in this class don't use the underscore style of naming tests. Please adhere to the existing style in this class if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

@IgorBerman IgorBerman force-pushed the feature-13324 branch 2 times, most recently from dc80221 to fe1b2cb Compare April 19, 2024 14:17
@IgorBerman
Copy link
Contributor Author

@kfaraz Hi, I think I addressed all of your comments. Also hopefully fixed coverage.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for promptly addressing the feedback! I have left a few more suggestions.

Assert.assertTrue(peon.resolveLookupModule() instanceof LookupSerdeModule);
}

@Test(expected = IllegalArgumentException.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Assert.assertThrows() instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced with assertThrows

@@ -58,6 +59,9 @@ public List<? extends Module> getJacksonModules()
public void configure(Binder binder)
{
JsonConfigProvider.bind(binder, LookupModule.PROPERTY_BASE, LookupConfig.class);
//Even though LookupSerdeModule will be used for processes that are not loading lookups, we need to bind LookupNodeService,
//so that TaskToolboxFactory and TaskToolbox will get their dependency. Binding it won't load lookups.
Copy link
Contributor

@kfaraz kfaraz Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suppose we could have just made this dependency @Nullable in TaskToolboxFactory.

The only usage is to build a DruidDiscoveryNode where the LookupNodeService object is used to populate the services map.

From the point of service discovery, I think it would be more appropriate to just not add the lookup node service to the map rather than declaring a dummy node service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried this and got following error for compaction task

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Could not find a suitable constructor in org.apache.druid.discovery.LookupNodeService. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
  at org.apache.druid.discovery.LookupNodeService.class(LookupNodeService.java:38)
  while locating org.apache.druid.discovery.LookupNodeService
    for the 26th parameter of org.apache.druid.indexing.common.TaskToolboxFactory.<init>(TaskToolboxFactory.java:165)
  at org.apache.druid.cli.CliPeon.bindTaskConfigAndClients(CliPeon.java:506) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliPeon$1)

1 error
	at org.apache.druid.cli.CliPeon.run(CliPeon.java:421)
	at org.apache.druid.cli.Main.main(Main.java:112)
Caused by: java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with Guice(using spring boot at work) so might be there is other option to have Optional dependency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually @nullable is the way to mark it optional...so I'm not sure why it doesn't work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe in additional to @nullable need to add empty c'tor to LookupNodeService ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, we can try to address this in a follow up.

services/src/main/java/org/apache/druid/cli/CliPeon.java Outdated Show resolved Hide resolved
services/src/main/java/org/apache/druid/cli/CliPeon.java Outdated Show resolved Hide resolved
services/src/main/java/org/apache/druid/cli/CliPeon.java Outdated Show resolved Hide resolved
@IgorBerman
Copy link
Contributor Author

@kfaraz thanks for the review! I think I've addressed all your comments besides binding of Lookup service. Tried your idea but seems compaction peon fails with this approach.

@IgorBerman IgorBerman force-pushed the feature-13324 branch 3 times, most recently from 1de34f8 to 1355717 Compare April 20, 2024 15:30
@IgorBerman
Copy link
Contributor Author

I see that 1 test failing: org.apache.druid.indexing.overlord.ForkingTaskRunnerTest#testInvalidTaskContextJavaOptsArray
I've looked at it and for me it seems like some kind of race(meanwhile can't figure out how it's connected to what I've did)
I believe there is some inter-tests dependency
getWorkerSuccessfulTaskCount & getWorkerFailedTaskCount are using static variables
private static final AtomicLong FAILED_TASK_COUNT = new AtomicLong();
private static final AtomicLong SUCCESSFUL_TASK_COUNT = new AtomicLong();

both of them updated at

if (status.isSuccess()) {
                        SUCCESSFUL_TASK_COUNT.incrementAndGet();
                      } else {
                        FAILED_TASK_COUNT.incrementAndGet();
                      }

which doesn't happen when parsing of arguments failing here

List<String> taskJavaOptsArray = jsonMapper.convertValue(
                              task.getContextValue(ForkingTaskRunnerConfig.JAVA_OPTS_ARRAY_PROPERTY),
                              new TypeReference<List<String>>() {}
                          );

which throws exception and which is goes up to outer

catch (Throwable t) {
                      throw closer.rethrow(t);
                    }

i.e. those static variables are not incremented as part of failure to parse arguments(maybe should be fixed)

@IgorBerman
Copy link
Contributor Author

@zachjsh Hi, maybe you can advice regarding org.apache.druid.indexing.overlord.ForkingTaskRunnerTest#testInvalidTaskContextJavaOptsArray and updating FAILED_TASK_COUNT & SUCCESSFUL_TASK_COUNT?

@IgorBerman
Copy link
Contributor Author

run same test on master branch and it's failing as well, which is strange

@kfaraz
Copy link
Contributor

kfaraz commented Apr 22, 2024

run same test on master branch and it's failing as well, which is strange

Yeah, oddly enough, for me, the test fails only when I run it individually. When I run all the tests in this class from the IDE together, it passes. It is most likely because the failed/success task counts are static.

A test that runs before this one must be setting the failed count to 1, which is being verified in this test too.
When run by itself, this test is not able to set the failed count to 1.

@IgorBerman
Copy link
Contributor Author

@kfaraz can you tell me what I should do in addition for this PR to be merged?

@kfaraz
Copy link
Contributor

kfaraz commented Apr 23, 2024

@IgorBerman , I have created #16323 to fix up the test for now so that the builds pass. Later, we can revisit the code and decide if we want to increment the failed task count in the case where the javaOptsArray is invalid.

@IgorBerman
Copy link
Contributor Author

thanks @kfaraz , I left 1 comment there

@kfaraz
Copy link
Contributor

kfaraz commented Apr 24, 2024

@IgorBerman , I have merged #16323 . You can try to merge the latest master into your branch to fix the errors.

@IgorBerman
Copy link
Contributor Author

thanks @kfaraz , updated my branch with upstream

@kfaraz
Copy link
Contributor

kfaraz commented Apr 24, 2024

@IgorBerman , FYI, there is another PR #16328 which tackles the same problem as this one. Let me know what you think about it.

@IgorBerman
Copy link
Contributor Author

@kfaraz thanks :) I looked at it and it seems nice approach.

I think the other PR is more generic since it permits to control which lookups will be loaded. I haven't understood if it supports configuring it as some context arguments for task.

In our case it's always all or nothing, but I can imagine more involved environments.
For our usecase Compaction and Kill tasks are relevant and seems like it will be easy to support Compaction task as well.

Additional diff that I can think of is that if we are not going to load lookups current PR uses different guice module which might create lighter runtime(i.e. less beans e.g., but I'm not sure this is the case since I'm not too familiar with differences of those 2 modules really).
This is basically the other side of coin - to support higher flexibility when making decision a bit later(when everything is initialized already and only lookups are initialized) vs to make some decision during startup of task(CliPeon)
Personally I'm in favor for other PR seems so.

@kfaraz
Copy link
Contributor

kfaraz commented Apr 25, 2024

I haven't understood if it supports configuring it as some context arguments for task.

The author intends to create that as a follow up PR.

Personally I'm in favor for other PR seems so.

Thanks, @IgorBerman .
Closing this PR in favour of #16328 .

@kfaraz kfaraz closed this Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not Load lookups on tasks that will not use them
2 participants