
Hadoop index task stopped working with 0.10.1-rc1 #4536

Closed
erikdubbelboer opened this issue Jul 12, 2017 · 25 comments · Fixed by #4562

@erikdubbelboer
Contributor

With the previous version of Druid everything worked fine, but after the upgrade to 0.10.1-rc1 we are getting:

2017-07-12T10:14:01,315 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1499852511486_0003_m_000000_0, Status : FAILED
Error: com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) null returned by binding at io.druid.guice.JsonConfigProvider.bind(JsonConfigProvider.java:132) (via modules: com.google.inject.util.Modules$OverrideModule -> io.druid.storage.google.GoogleStorageDruidModule)
 but the 2nd parameter of io.druid.storage.google.GoogleDataSegmentPusher.<init>(GoogleDataSegmentPusher.java:56) is not @Nullable
  while locating io.druid.storage.google.GoogleAccountConfig
    for the 2nd parameter of io.druid.storage.google.GoogleDataSegmentPusher.<init>(GoogleDataSegmentPusher.java:56)
  while locating io.druid.storage.google.GoogleDataSegmentPusher
  at io.druid.storage.google.GoogleStorageDruidModule.configure(GoogleStorageDruidModule.java:94) (via modules: com.google.inject.util.Modules$OverrideModule -> io.druid.storage.google.GoogleStorageDruidModule)
  while locating io.druid.segment.loading.DataSegmentPusher annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=47, type=MAPBINDER, keyType=java.lang.String)
  at io.druid.guice.PolyBind.createChoice(PolyBind.java:67) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.guice.LocalDataStorageDruidModule)
  while locating io.druid.segment.loading.DataSegmentPusher

It seems that somehow the druid.google.* properties aren't propagated to the Hadoop task like they were before?

@erikdubbelboer
Contributor Author

Anyone have any idea what might cause this? I can try all kinds of things to fix it but I'm kinda lost right now.

@leventov added this to the 0.10.1 milestone Jul 12, 2017
@b-slim
Contributor

b-slim commented Jul 12, 2017

@erikdubbelboer after #4116 users can now propagate selected properties to Hadoop workers.
I'm not sure how this was working for you before?

@erikdubbelboer
Contributor Author

@b-slim It seems like this is broken. When I set tuningConfig.allowedHadoopPrefix to [ "druid.storage.", "druid.javascript.", "druid.google." ] I still get:

...
2017-07-13T08:07:10,806 INFO [main] io.druid.indexing.worker.executor.ExecutorLifecycle - Running with task: {
...
      "allowedHadoopPrefix" : [ "druid.storage.", "druid.javascript.", "druid.google." ]
...
2017-07-13T08:07:16,387 INFO [task-runner-0-priority-0] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:
...
      "allowedHadoopPrefix" : [ "druid.storage.", "druid.javascript." ]
...

So it doesn't seem to propagate to the actual job.
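For reference, the prefix-based filtering that #4116 introduced works roughly like this: only properties whose keys start with one of the allowed prefixes are forwarded to the Hadoop job. This is a minimal sketch with illustrative class and method names, not Druid's actual code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of prefix-based property propagation: only properties
// whose keys start with an allowed prefix are copied to the Hadoop job.
// Class and method names are illustrative, not Druid's actual code.
public class HadoopPropertyFilter
{
  public static Map<String, String> filterByPrefix(
      Map<String, String> properties,
      List<String> allowedPrefixes
  )
  {
    Map<String, String> forwarded = new HashMap<>();
    for (Map.Entry<String, String> entry : properties.entrySet()) {
      for (String prefix : allowedPrefixes) {
        if (entry.getKey().startsWith(prefix)) {
          forwarded.put(entry.getKey(), entry.getValue());
          break;
        }
      }
    }
    return forwarded;
  }

  public static void main(String[] args)
  {
    Map<String, String> props = new HashMap<>();
    props.put("druid.storage.type", "google");
    props.put("druid.google.bucket", "my-bucket");
    props.put("druid.host", "localhost");

    // Without "druid.google." in the allowed list, the bucket setting is
    // dropped and the Google extension can't be provisioned on the workers.
    System.out.println(
        filterByPrefix(props, Arrays.asList("druid.storage.", "druid.google.")).keySet()
    );
  }
}
```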

@erikdubbelboer
Contributor Author

This has fixed the issue for me:

diff --git a/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java b/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java
index e64c0e7..b131a5e 100644
--- a/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java
+++ b/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java
@@ -267,7 +267,7 @@ public class HadoopTuningConfig implements TuningConfig
         numBackgroundPersistThreads,
         forceExtendableShardSpecs,
         useExplicitVersion,
-        null
+        allowedHadoopPrefix
     );
   }

@@ -292,7 +292,7 @@ public class HadoopTuningConfig implements TuningConfig
         numBackgroundPersistThreads,
         forceExtendableShardSpecs,
         useExplicitVersion,
-        null
+        allowedHadoopPrefix
     );
   }

@@ -317,7 +317,7 @@ public class HadoopTuningConfig implements TuningConfig
         numBackgroundPersistThreads,
         forceExtendableShardSpecs,
         useExplicitVersion,
-        null
+        allowedHadoopPrefix
     );
   }
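The bug the diff above fixes is a classic slip in copy-style `withX` methods: each one rebuilds the config but hard-codes `null` for the newly added field, so any user-supplied `allowedHadoopPrefix` is silently dropped the first time the config is copied. A stripped-down illustration (not Druid's actual class):

```java
import java.util.List;

// Stripped-down illustration of the bug: a "with" method that rebuilds the
// object but hard-codes null for one field, silently dropping its value.
public class TuningConfigSketch
{
  private final String workingPath;
  private final List<String> allowedHadoopPrefix;

  public TuningConfigSketch(String workingPath, List<String> allowedHadoopPrefix)
  {
    this.workingPath = workingPath;
    this.allowedHadoopPrefix = allowedHadoopPrefix;
  }

  public List<String> getAllowedHadoopPrefix()
  {
    return allowedHadoopPrefix;
  }

  // Before the patch: the field is lost on every copy.
  public TuningConfigSketch withWorkingPathBuggy(String path)
  {
    return new TuningConfigSketch(path, null);
  }

  // After the patch: the field is carried through.
  public TuningConfigSketch withWorkingPath(String path)
  {
    return new TuningConfigSketch(path, allowedHadoopPrefix);
  }
}
```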

@erikdubbelboer
Contributor Author

You want me to make a pull request?

@b-slim
Contributor

b-slim commented Jul 13, 2017

Yeah sure, I will review that. Thanks, @erikdubbelboer.

@gianm
Contributor

gianm commented Jul 13, 2017

How were the properties passed down before? I thought that pre-0.10.1, properties were generally not passed down to the Hadoop cluster, so it seems weird that they'd need to be now.

Or is the difference that now in 0.10.1, the druid.storage.type property is passed down, which means now druid.google. needs to be as well if you're using the google extension? If that's the case, I think we should either add druid.google. to the default list, or give the deep storage extensions some way to say what properties they need passed down; otherwise upgrading will be too much of a pain for people using the extension.

@b-slim
Contributor

b-slim commented Jul 13, 2017

@gianm I think you are right: now that we push down the storage properties, Guice tries to bind things up, hence the issue. I'm not sure what the best fix is, but it seems we need to push down at least all the properties of the selected storage module. Anyway, I will send a first PR to fix the null issue (I guess @erikdubbelboer is in a different timezone) and then focus on how to solve this. @gianm, should this be part of 0.10.1?

@gianm
Contributor

gianm commented Jul 13, 2017

@b-slim I think we should at least have something in the release notes explaining the situation. Pushing down the storage properties would be even better. Perhaps we could do this by adding a getPropertyPrefixForHadoop to DataSegmentPusher and then automatically adding that prefix to allowedHadoopPrefix for whatever the deep storage is.
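A sketch of what that proposal could look like: each deep-storage pusher declares the property prefixes it needs on the Hadoop side, and the indexer merges them into allowedHadoopPrefix automatically. The interface and class names here are hypothetical, not the actual Druid API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposal: each deep-storage pusher declares
// which property prefixes must be forwarded to Hadoop workers, and the
// indexer merges them into allowedHadoopPrefix automatically.
interface DataSegmentPusherSketch
{
  // Most extensions keep their settings under druid.storage.
  default List<String> getAllowedPropertyPrefixesForHadoop()
  {
    return Arrays.asList("druid.storage.");
  }
}

class GoogleSegmentPusherSketch implements DataSegmentPusherSketch
{
  // The Google extension also needs its druid.google.* settings.
  @Override
  public List<String> getAllowedPropertyPrefixesForHadoop()
  {
    return Arrays.asList("druid.storage.", "druid.google.");
  }
}
```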

@jon-wei
Contributor

jon-wei commented Jul 14, 2017

@b-slim
Are you planning on working on a fix beyond #4541 for 0.10.1?

If so, let's keep this in 0.10.1 and we can bring the fix into rc2.

If not, let's move this issue to 0.11.0, and I'll mention the storage extension property push in the 0.10.1 release notes.

@b-slim
Contributor

b-slim commented Jul 14, 2017

I don't want to block the RC for this issue though; I will get to work on it, maybe next week.

@erikdubbelboer
Contributor Author

I'm trying to post at the druid-development list but my messages keep on being automatically deleted (even without any links).

I don't think this only affects the Google Storage extension. I think it affects all storage extensions as all of them use different prefixes for their properties and all of them inject their config into their SegmentPushers?

@gianm
Contributor

gianm commented Jul 17, 2017

I don't think this only affects the Google Storage extension. I think it affects all storage extensions as all of them use different prefixes for their properties and all of them inject their config into their SegmentPushers?

Most of them put their properties under druid.storage, so they aren't affected since the default list does include those. It looks like Google Storage and Azure are the two that don't.

I'm trying to post at the druid-development list but my messages keep on being automatically deleted (even without any links).

They got stuck in the list's spam filter for some reason. I marked them as not-spam and whitelisted your email address.

@gianm
Contributor

gianm commented Jul 17, 2017

Looks like the old messages are deleted though, so you might have to re-post them.

@gianm
Contributor

gianm commented Jul 17, 2017

I think S3 should be ok, since Hadoop's S3 FileSystem is used for connection from hadoop -> s3, not Druid's S3 client. So even though Druid's S3 client won't be properly instantiated, it wouldn't matter.

Has anyone tried this with 0.10.1-rc1 to confirm? I think we have but I'm not totally sure.

@jon-wei
Contributor

jon-wei commented Jul 17, 2017

@gianm We did run some hadoop tests against 0.10.1 before rc1 was released for this PR: #4116, but the deep storage was HDFS in those tests.

@jon-wei
Contributor

jon-wei commented Jul 18, 2017

I just ran a batch ingestion task with 0.10.1-rc1 on EMR, using S3 as deep storage, and the task worked.

@gianm
Contributor

gianm commented Jul 18, 2017

@erikdubbelboer could you try #4562 to see if that works for you, without having to specify allowedHadoopPrefix in the task?

@erikdubbelboer
Contributor Author

@gianm It looks like #4562 can't be applied to the 0.10.1 branch we are running. I tried upgrading only our middleManager node to master, but this just resulted in lots of java.io.IOException: Failed to locate service uri exceptions when starting a task. Do you have any suggestions on how to test this without having to upgrade our whole cluster to master?

@gianm
Contributor

gianm commented Jul 19, 2017

@erikdubbelboer please try #4567 which is a backport of that to 0.10.1.

Although, exceptions when middleManagers are upgraded while the rest of the cluster is not are concerning, since we do want to support upgrading them first as part of a rolling update. Could you please paste the full error messages & stack traces you got when you saw those errors? Might be something we need to fix.

@erikdubbelboer
Contributor Author

@gianm with #4567 everything works fine. There is only one exception that repeats a couple of times at the start but it doesn't seem to affect anything:

2017-07-20 09:23:56,956 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup
	at java.lang.Class.cast(Class.java:3369)
	at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
	at org.apache.logging.log4j.core.util.Loader.newCheckedInstanceOf(Loader.java:301)
	at org.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:94)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:111)
	at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:48)
	at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:75)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:171)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:145)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:70)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:57)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:140)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
	at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:103)
	at org.apache.logging.log4j.jul.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:34)
	at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:42)
	at org.apache.logging.log4j.jul.LogManager.getLogger(LogManager.java:89)
	at java.util.logging.LogManager.demandLogger(LogManager.java:551)
	at java.util.logging.Logger.demandLogger(Logger.java:455)
	at java.util.logging.Logger.getLogger(Logger.java:502)
	at com.google.inject.internal.util.Stopwatch.<clinit>(Stopwatch.java:27)
	at com.google.inject.internal.InternalInjectorCreator.<init>(InternalInjectorCreator.java:61)
	at com.google.inject.Guice.createInjector(Guice.java:96)
	at com.google.inject.Guice.createInjector(Guice.java:73)
	at io.druid.guice.GuiceInjectors.makeStartupInjector(GuiceInjectors.java:60)
	at io.druid.indexer.HadoopDruidIndexerConfig.<clinit>(HadoopDruidIndexerConfig.java:105)
	at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:215)
	at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:177)
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436)
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

The exception I'm seeing over and over again when upgrading only the middlemanager to master is:

2017-07-20T03:44:51,532 WARN [main] io.druid.indexing.common.actions.RemoteTaskActionClient - Exception submitting action for task[kill_ssps_2000-01-01T00:00:00.000Z_3000-01-01T00:00:00.000Z_2017-07-20T02:50:01.978Z]
java.io.IOException: Failed to locate service uri
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.common.task.AbstractFixedIntervalTask.isReady(AbstractFixedIntervalTask.java:92) [druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.worker.executor.ExecutorLifecycle.start(ExecutorLifecycle.java:169) [druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
	at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:364) [java-util-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:263) [java-util-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) [druid-api-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:101) [druid-services-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.cli.CliPeon.run(CliPeon.java:283) [druid-services-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.cli.Main.main(Main.java:108) [druid-services-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
Caused by: java.lang.NullPointerException
	at io.druid.curator.discovery.ServerDiscoverySelector$1.apply(ServerDiscoverySelector.java:60) ~[druid-server-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.curator.discovery.ServerDiscoverySelector$1.apply(ServerDiscoverySelector.java:52) ~[druid-server-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.curator.discovery.ServerDiscoverySelector.pick(ServerDiscoverySelector.java:108) ~[druid-server-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.common.actions.RemoteTaskActionClient.getServiceInstance(RemoteTaskActionClient.java:164) ~[druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:89) ~[druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	... 12 more
2017-07-20T03:44:51,539 INFO [main] io.druid.indexing.common.actions.RemoteTaskActionClient - Will try again in [PT4.686S].

@jon-wei
Contributor

jon-wei commented Jul 20, 2017

@erikdubbelboer

The JNDI exception at startup can be ignored.

For the second exception, based on the line numbers in ServerDiscoverySelector, it looks like you've pulled in some post 0.10.1 changes (i.e., in master but not in 0.10.1-rc1 or rc2), specifically this PR that adds TLS support: #4270

I'm guessing what's happening is that the service announcement changes that set the TLS-enabled port property are missing because only the middleManager got the upgrade, but this newer code running on the middleManager expects that property to be non-null and gets an NPE here.

Can you try basing your custom branch on 0.10.1-rc2?
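The rolling-upgrade hazard described here reduces to a few lines: newer code reads a field (such as a TLS port) that older services never announce, so it arrives as null and dereferencing it throws. This is an illustrative sketch, not Druid's actual discovery code:

```java
// Illustrative sketch (not Druid's actual discovery code) of the
// rolling-upgrade hazard: newer code assumes a field that older
// services never announce, so it arrives as null and unboxing throws.
class AnnouncedServiceSketch
{
  final String host;
  final Integer plaintextPort; // announced by all versions
  final Integer tlsPort;       // announced only by newer versions; may be null

  AnnouncedServiceSketch(String host, Integer plaintextPort, Integer tlsPort)
  {
    this.host = host;
    this.plaintextPort = plaintextPort;
    this.tlsPort = tlsPort;
  }

  // Unsafe: throws NullPointerException when the announcing server is
  // older and never set a TLS port (auto-unboxing a null Integer).
  int pickPortUnsafe()
  {
    return tlsPort;
  }

  // Defensive: fall back to the plaintext port when tlsPort is absent.
  int pickPort()
  {
    return tlsPort != null ? tlsPort : plaintextPort;
  }
}
```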

@gianm
Contributor

gianm commented Jul 20, 2017

I'm guessing what's happening is that the service announcement changes that set the TLS-enabled port property are missing because only the middleManager got the upgrade, but this newer code running on the middleManager expects that property to be non-null and gets an NPE here.

Is that what happens when #4270 is rolled out to MMs but not overlords? If so, that makes it impossible to do a proper rolling update, and we should address it before 0.11.0.
