feature-13324 contol loading lookups in peons #16266

IgorBerman · 2024-04-11T14:53:33Z

Permits to disable lookups loading for some types of peons

for example compaction or kill tasks
this should improve memory footprint
e.g. for compaction we can use this memory for loading more rows into memory and thus improving throughput of compaction

I've followed same path as loadBroadcastSegments parameter works and reused LookupSerdeModule which already used in multiple services that doesn't need to load lookups

The fix adds doesntNeedLookups method to interface of task with default of false

Release note

Key changed/added classes in this PR

Task - added doesntNeedLookups with default implementation that returns false, i.e. most of peons will continue to load lookups
CompactionTask - overrides it with true
CliPeon - refers to the process parameter and loads relevant lookup module(which either loads lookups or not)

This PR has:

kfaraz

@IgorBerman , thanks for your solution! It makes sense to optimize lookup loading, but we need to leave some room in this design for future improvements. Please read comments for details.

kfaraz · 2024-04-14T17:47:31Z

indexing-service/src/main/java/org/apache/druid/indexing/common/task/Task.java

@@ -331,4 +331,9 @@ static TaskInfo<TaskIdentifier, TaskStatus> toTaskIdentifierInfo(TaskInfo<Task,
        taskInfo.getTask().getMetadata()
    );
  }
+
+  default boolean doesntNeedLookups()


The method should rather be called needsLookups() or loadLookups(). There is no point naming this method in the negative.

Also, since we want to extend this feature in the future to be able to load lookups on demand, I think we should return an enum here with possible values NONE, ALL, ON_DEMAND.

For the time-being, the behaviour for ALL and ON_DEMAND would be the same. But later, ON_DEMAND can be optimized to load only required lookups.

So, overall:

default LookupLoadingMode needsLookups() { return LookupLoadingMode.ALL; }

thanks @kfaraz for the comments.
Makes sense to me.
I'll ping you when your suggestions will be addressed.
Also here is something that I've discovered lately and if you have some suggestions/ideas please share them: I've tried to start nano cluster with my changes and discovered that it's not that simple. TaskToolbox injects LookupNodeService, i.e. currently it's not possible to remove it without additional refactoring, so I'm trying to convert it to "Lazy" one by injecting Providing instead so that LookupSerdeModule could create this Provider as well

renamed interface @kfaraz

kfaraz · 2024-04-14T17:48:16Z

indexing-service/src/main/java/org/apache/druid/indexing/overlord/ForkingTaskRunner.java

@@ -379,6 +379,11 @@ public TaskStatus call()
                          command.add("true");
                        }

+                        if (task.doesntNeedLookups()) {
+                          command.add("--doesntNeedLookups");


The parameter should be called --loadLookups instead, and should take values all, none, onDemand.

IgorBerman · 2024-04-18T08:32:58Z

run nano cluster locally and verified that all services started, ingestion from kafka works and compaction also able to work.
compaction task gets argument of
2024-04-18T08:28:44,500 INFO [main] org.apache.druid.cli.CliPeon - * sun.java.command: org.apache.druid.cli.Main internal peon var/druid/task/slot1/compact_my_ds_2024-04-18T08:28:42.514Z 1 --loadLookups NONE

kfaraz

Thanks, @IgorBerman ! Overall, this looks like a neat approach to me. I have left some comments.

server/src/main/java/org/apache/druid/query/lookup/LookupLoadingMode.java

kfaraz · 2024-04-19T02:40:53Z

services/src/main/java/org/apache/druid/cli/CliPeon.java

@@ -185,6 +187,13 @@ public class CliPeon extends GuiceRunnable
  @Option(name = "--loadBroadcastSegments", title = "loadBroadcastSegments", description = "Enable loading of broadcast segments")
  public String loadBroadcastSegments = "false";

+
+  /**
+   * If set to true then lookups won't be loaded


This javadoc needs to be updated.

kfaraz · 2024-04-19T02:44:48Z

services/src/main/java/org/apache/druid/cli/CliPeon.java

@@ -363,7 +372,7 @@ public String getTaskIDFromTask(final Task task)
        new IndexingServiceTuningConfigModule(),
        new InputSourceModule(),
        new ChatHandlerServerModule(properties),
-        new LookupModule()
+        LookupLoadingMode.NONE == loadLookups ? new LookupSerdeModule() : new LookupModule()


Maybe put this logic in a private method getLookupModule() and add a javadoc there explaining what we are doing.

A log line which indicates the chosen lookup loading mode and the expected effects would also be helpful.

added log and private method. also added tests for it

kfaraz · 2024-04-19T02:53:09Z

indexing-service/src/test/java/org/apache/druid/indexing/overlord/ForkingTaskRunnerTest.java

@@ -537,6 +539,115 @@ ProcessHolder runTaskProcess(List<String> command, File logFile, TaskLocation ta
    Assert.assertTrue(forkingTaskRunner.restore().isEmpty());
  }

+  @Test
+  public void testTaskRun_doesnt_pass_loadLookups_Argument() throws Exception


The other tests in this class don't use the underscore style of naming tests. Please adhere to the existing style in this class if possible.

server/src/main/java/org/apache/druid/query/lookup/LookupSerdeModule.java

IgorBerman · 2024-04-19T14:18:26Z

@kfaraz Hi, I think I addressed all of your comments. Also hopefully fixed coverage.

kfaraz

Thanks for promptly addressing the feedback! I have left a few more suggestions.

indexing-service/src/main/java/org/apache/druid/indexing/common/task/Task.java

kfaraz · 2024-04-19T18:04:52Z

services/src/test/java/org/apache/druid/cli/CliPeonTest.java

+    Assert.assertTrue(peon.resolveLookupModule() instanceof LookupSerdeModule);
+  }
+
+  @Test(expected = IllegalArgumentException.class)


Please use Assert.assertThrows() instead.

replaced with assertThrows

kfaraz · 2024-04-19T18:07:30Z

server/src/main/java/org/apache/druid/query/lookup/LookupSerdeModule.java

@@ -58,6 +59,9 @@ public List<? extends Module> getJacksonModules()
  public void configure(Binder binder)
  {
    JsonConfigProvider.bind(binder, LookupModule.PROPERTY_BASE, LookupConfig.class);
+    //Even though LookupSerdeModule will be used for processes that are not loading lookups, we need to bind LookupNodeService,
+    //so that TaskToolboxFactory and TaskToolbox will get their dependency. Binding it won't load lookups.


Hmm, I suppose we could have just made this dependency @Nullable in TaskToolboxFactory.

The only usage is to build a DruidDiscoveryNode where the LookupNodeService object is used to populate the services map.

From the point of service discovery, I think it would be more appropriate to just not add the lookup node service to the map rather than declaring a dummy node service.

tried this and got following error for compaction task

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors: 1) Could not find a suitable constructor in org.apache.druid.discovery.LookupNodeService. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private. at org.apache.druid.discovery.LookupNodeService.class(LookupNodeService.java:38) while locating org.apache.druid.discovery.LookupNodeService for the 26th parameter of org.apache.druid.indexing.common.TaskToolboxFactory.<init>(TaskToolboxFactory.java:165) at org.apache.druid.cli.CliPeon.bindTaskConfigAndClients(CliPeon.java:506) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliPeon$1) 1 error at org.apache.druid.cli.CliPeon.run(CliPeon.java:421) at org.apache.druid.cli.Main.main(Main.java:112) Caused by: java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors:

I'm not too familiar with Guice(using spring boot at work) so might be there is other option to have Optional dependency?

actually @nullable is the way to mark it optional...so I'm not sure why it doesn't work

maybe in additional to @nullable need to add empty c'tor to LookupNodeService ?

No problem, we can try to address this in a follow up.

services/src/main/java/org/apache/druid/cli/CliPeon.java

services/src/test/java/org/apache/druid/cli/CliPeonTest.java

services/src/main/java/org/apache/druid/cli/CliPeon.java

server/src/main/java/org/apache/druid/query/lookup/LookupLoadingMode.java

server/src/main/java/org/apache/druid/query/lookup/LookupSerdeModule.java

...d-extensions/src/test/java/org/apache/druid/k8s/overlord/taskadapter/K8sTaskAdapterTest.java

IgorBerman · 2024-04-20T15:17:37Z

@kfaraz thanks for the review! I think I've addressed all your comments besides binding of Lookup service. Tried your idea but seems compaction peon fails with this approach.

indexing-service/src/test/java/org/apache/druid/indexing/overlord/ForkingTaskRunnerTest.java

server/src/main/java/org/apache/druid/query/lookup/LookupLoadingMode.java

IgorBerman · 2024-04-22T06:55:12Z

I see that 1 test failing: org.apache.druid.indexing.overlord.ForkingTaskRunnerTest#testInvalidTaskContextJavaOptsArray
I've looked at it and for me it seems like some kind of race(meanwhile can't figure out how it's connected to what I've did)
I believe there is some inter-tests dependency
getWorkerSuccessfulTaskCount & getWorkerFailedTaskCount are using static variables
private static final AtomicLong FAILED_TASK_COUNT = new AtomicLong();
private static final AtomicLong SUCCESSFUL_TASK_COUNT = new AtomicLong();

both of them updated at

if (status.isSuccess()) {
                        SUCCESSFUL_TASK_COUNT.incrementAndGet();
                      } else {
                        FAILED_TASK_COUNT.incrementAndGet();
                      }

which doesn't happen when parsing of arguments failing here

List<String> taskJavaOptsArray = jsonMapper.convertValue(
                              task.getContextValue(ForkingTaskRunnerConfig.JAVA_OPTS_ARRAY_PROPERTY),
                              new TypeReference<List<String>>() {}
                          );

which throws exception and which is goes up to outer

catch (Throwable t) {
                      throw closer.rethrow(t);
                    }

i.e. those static variables are not incremented as part of failure to parse arguments(maybe should be fixed)

IgorBerman · 2024-04-22T06:57:47Z

@zachjsh Hi, maybe you can advice regarding org.apache.druid.indexing.overlord.ForkingTaskRunnerTest#testInvalidTaskContextJavaOptsArray and updating FAILED_TASK_COUNT & SUCCESSFUL_TASK_COUNT?

IgorBerman · 2024-04-22T07:00:43Z

run same test on master branch and it's failing as well, which is strange

kfaraz · 2024-04-22T08:34:41Z

run same test on master branch and it's failing as well, which is strange

Yeah, oddly enough, for me, the test fails only when I run it individually. When I run all the tests in this class from the IDE together, it passes. It is most likely because the failed/success task counts are static.

A test that runs before this one must be setting the failed count to 1, which is being verified in this test too.
When run by itself, this test is not able to set the failed count to 1.

IgorBerman · 2024-04-22T11:15:16Z

@kfaraz can you tell me what I should do in addition for this PR to be merged?

kfaraz · 2024-04-23T11:09:13Z

@IgorBerman , I have created #16323 to fix up the test for now so that the builds pass. Later, we can revisit the code and decide if we want to increment the failed task count in the case where the javaOptsArray is invalid.

IgorBerman · 2024-04-23T12:11:09Z

thanks @kfaraz , I left 1 comment there

kfaraz · 2024-04-24T08:36:07Z

@IgorBerman , I have merged #16323 . You can try to merge the latest master into your branch to fix the errors.

IgorBerman · 2024-04-24T09:08:35Z

thanks @kfaraz , updated my branch with upstream

kfaraz · 2024-04-24T14:14:44Z

@IgorBerman , FYI, there is another PR #16328 which tackles the same problem as this one. Let me know what you think about it.

IgorBerman · 2024-04-24T14:59:54Z

@kfaraz thanks :) I looked at it and it seems nice approach.

I think the other PR is more generic since it permits to control which lookups will be loaded. I haven't understood if it supports configuring it as some context arguments for task.

In our case it's always all or nothing, but I can imagine more involved environments.
For our usecase Compaction and Kill tasks are relevant and seems like it will be easy to support Compaction task as well.

Additional diff that I can think of is that if we are not going to load lookups current PR uses different guice module which might create lighter runtime(i.e. less beans e.g., but I'm not sure this is the case since I'm not too familiar with differences of those 2 modules really).
This is basically the other side of coin - to support higher flexibility when making decision a bit later(when everything is initialized already and only lookups are initialized) vs to make some decision during startup of task(CliPeon)
Personally I'm in favor for other PR seems so.

kfaraz · 2024-04-25T09:14:05Z

I haven't understood if it supports configuring it as some context arguments for task.

The author intends to create that as a follow up PR.

Personally I'm in favor for other PR seems so.

Thanks, @IgorBerman .
Closing this PR in favour of #16328 .

github-actions bot added Kubernetes Area - Ingestion labels Apr 11, 2024

IgorBerman force-pushed the feature-13324 branch from 703dd32 to 7e83988 Compare April 12, 2024 07:01

IgorBerman marked this pull request as draft April 13, 2024 08:39

kfaraz requested changes Apr 14, 2024

View reviewed changes

IgorBerman force-pushed the feature-13324 branch 2 times, most recently from 4443e93 to 42fc8e3 Compare April 17, 2024 21:31

IgorBerman requested a review from kfaraz April 17, 2024 21:32

IgorBerman force-pushed the feature-13324 branch from 42fc8e3 to 199e726 Compare April 18, 2024 05:43

IgorBerman marked this pull request as ready for review April 18, 2024 13:02

kfaraz reviewed Apr 19, 2024

View reviewed changes

IgorBerman force-pushed the feature-13324 branch 2 times, most recently from dc80221 to fe1b2cb Compare April 19, 2024 14:17

kfaraz reviewed Apr 19, 2024

View reviewed changes

IgorBerman force-pushed the feature-13324 branch from b559004 to 92144e8 Compare April 20, 2024 15:15

IgorBerman force-pushed the feature-13324 branch 3 times, most recently from 1de34f8 to 1355717 Compare April 20, 2024 15:30

kfaraz approved these changes Apr 20, 2024

View reviewed changes

indexing-service/src/test/java/org/apache/druid/indexing/overlord/ForkingTaskRunnerTest.java Outdated Show resolved Hide resolved

server/src/main/java/org/apache/druid/query/lookup/LookupLoadingMode.java Outdated Show resolved Hide resolved

IgorBerman force-pushed the feature-13324 branch from dd23158 to 57c765f Compare April 20, 2024 17:19

feature-13324 contol loading lookups in peons

a120dfe

IgorBerman force-pushed the feature-13324 branch from 57c765f to a120dfe Compare April 24, 2024 09:08

kfaraz mentioned this pull request Apr 24, 2024

Add support for selective loading of lookups in the task layer #16328

Merged

9 tasks

kfaraz closed this Apr 25, 2024

feature-13324 contol loading lookups in peons #16266

feature-13324 contol loading lookups in peons #16266

Conversation

IgorBerman commented Apr 11, 2024 • edited

Permits to disable lookups loading for some types of peons

The fix adds doesntNeedLookups method to interface of task with default of false

Release note

Key changed/added classes in this PR

kfaraz left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

IgorBerman commented Apr 18, 2024

kfaraz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

IgorBerman commented Apr 19, 2024

kfaraz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kfaraz Apr 19, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

IgorBerman commented Apr 20, 2024

IgorBerman commented Apr 22, 2024

IgorBerman commented Apr 22, 2024

IgorBerman commented Apr 22, 2024

kfaraz commented Apr 22, 2024 • edited

IgorBerman commented Apr 22, 2024

kfaraz commented Apr 23, 2024

IgorBerman commented Apr 23, 2024

kfaraz commented Apr 24, 2024

IgorBerman commented Apr 24, 2024

kfaraz commented Apr 24, 2024

IgorBerman commented Apr 24, 2024

kfaraz commented Apr 25, 2024

IgorBerman commented Apr 11, 2024 •

edited

kfaraz left a comment •

edited

kfaraz Apr 19, 2024 •

edited

kfaraz commented Apr 22, 2024 •

edited