
Hadoop index task stopped working with 0.10.1-rc1 #4536

Closed
erikdubbelboer opened this issue Jul 12, 2017 · 25 comments · Fixed by #4562

@erikdubbelboer
Contributor

With the previous version of Druid everything worked fine, but after the upgrade to 0.10.1-rc1 we are getting:

2017-07-12T10:14:01,315 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1499852511486_0003_m_000000_0, Status : FAILED
Error: com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) null returned by binding at io.druid.guice.JsonConfigProvider.bind(JsonConfigProvider.java:132) (via modules: com.google.inject.util.Modules$OverrideModule -> io.druid.storage.google.GoogleStorageDruidModule)
 but the 2nd parameter of io.druid.storage.google.GoogleDataSegmentPusher.<init>(GoogleDataSegmentPusher.java:56) is not @Nullable
  while locating io.druid.storage.google.GoogleAccountConfig
    for the 2nd parameter of io.druid.storage.google.GoogleDataSegmentPusher.<init>(GoogleDataSegmentPusher.java:56)
  while locating io.druid.storage.google.GoogleDataSegmentPusher
  at io.druid.storage.google.GoogleStorageDruidModule.configure(GoogleStorageDruidModule.java:94) (via modules: com.google.inject.util.Modules$OverrideModule -> io.druid.storage.google.GoogleStorageDruidModule)
  while locating io.druid.segment.loading.DataSegmentPusher annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=47, type=MAPBINDER, keyType=java.lang.String)
  at io.druid.guice.PolyBind.createChoice(PolyBind.java:67) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.guice.LocalDataStorageDruidModule)
  while locating io.druid.segment.loading.DataSegmentPusher

It seems that somehow the druid.google.* properties aren't propagated to the Hadoop task like they were before?

@erikdubbelboer
Contributor Author

Anyone have any idea what might cause this? I can try all kinds of things to fix it but I'm kinda lost right now.

@leventov added this to the 0.10.1 milestone Jul 12, 2017
@b-slim
Contributor

b-slim commented Jul 12, 2017

@erikdubbelboer after #4116 users can now propagate selected properties to Hadoop workers.
I'm not sure how this was working for you before?

@erikdubbelboer
Contributor Author

@b-slim It seems like this is broken. When I set tuningConfig.allowedHadoopPrefix to [ "druid.storage.", "druid.javascript.", "druid.google." ] I still get:

...
2017-07-13T08:07:10,806 INFO [main] io.druid.indexing.worker.executor.ExecutorLifecycle - Running with task: {
...
      "allowedHadoopPrefix" : [ "druid.storage.", "druid.javascript.", "druid.google." ]
...
2017-07-13T08:07:16,387 INFO [task-runner-0-priority-0] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:
...
      "allowedHadoopPrefix" : [ "druid.storage.", "druid.javascript." ]
...

So it doesn't seem to propagate to the actual job.
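For reference, the prefix-based filtering that #4116 introduced works roughly like this: only properties whose keys start with one of the allowed prefixes are forwarded to the Hadoop job. This is a minimal sketch with illustrative class and method names, not Druid's actual code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of prefix-based property propagation: only properties
// whose keys start with an allowed prefix are copied to the Hadoop job.
// Class and method names are illustrative, not Druid's actual code.
public class HadoopPropertyFilter
{
  public static Map<String, String> filterByPrefix(
      Map<String, String> properties,
      List<String> allowedPrefixes
  )
  {
    Map<String, String> forwarded = new HashMap<>();
    for (Map.Entry<String, String> entry : properties.entrySet()) {
      for (String prefix : allowedPrefixes) {
        if (entry.getKey().startsWith(prefix)) {
          forwarded.put(entry.getKey(), entry.getValue());
          break;
        }
      }
    }
    return forwarded;
  }

  public static void main(String[] args)
  {
    Map<String, String> props = new HashMap<>();
    props.put("druid.storage.type", "google");
    props.put("druid.google.bucket", "my-bucket");
    props.put("druid.host", "localhost");

    // Without "druid.google." in the allowed list, the bucket setting is
    // dropped and the Google extension can't be provisioned on the workers.
    System.out.println(
        filterByPrefix(props, Arrays.asList("druid.storage.", "druid.google.")).keySet()
    );
  }
}
```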

@erikdubbelboer
Contributor Author

This has fixed the issue for me:

diff --git a/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java b/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java
index e64c0e7..b131a5e 100644
--- a/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java
+++ b/indexing-hadoop/src/main/java/io/druid/indexer/HadoopTuningConfig.java
@@ -267,7 +267,7 @@ public class HadoopTuningConfig implements TuningConfig
         numBackgroundPersistThreads,
         forceExtendableShardSpecs,
         useExplicitVersion,
-        null
+        allowedHadoopPrefix
     );
   }

@@ -292,7 +292,7 @@ public class HadoopTuningConfig implements TuningConfig
         numBackgroundPersistThreads,
         forceExtendableShardSpecs,
         useExplicitVersion,
-        null
+        allowedHadoopPrefix
     );
   }

@@ -317,7 +317,7 @@ public class HadoopTuningConfig implements TuningConfig
         numBackgroundPersistThreads,
         forceExtendableShardSpecs,
         useExplicitVersion,
-        null
+        allowedHadoopPrefix
     );
   }
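The bug the diff above fixes is a classic slip in copy-style `withX` methods: each one rebuilds the config but hard-codes `null` for the newly added field, so any user-supplied `allowedHadoopPrefix` is silently dropped the first time the config is copied. A stripped-down illustration (not Druid's actual class):

```java
import java.util.List;

// Stripped-down illustration of the bug: a "with" method that rebuilds the
// object but hard-codes null for one field, silently dropping its value.
public class TuningConfigSketch
{
  private final String workingPath;
  private final List<String> allowedHadoopPrefix;

  public TuningConfigSketch(String workingPath, List<String> allowedHadoopPrefix)
  {
    this.workingPath = workingPath;
    this.allowedHadoopPrefix = allowedHadoopPrefix;
  }

  public List<String> getAllowedHadoopPrefix()
  {
    return allowedHadoopPrefix;
  }

  // Before the patch: the field is lost on every copy.
  public TuningConfigSketch withWorkingPathBuggy(String path)
  {
    return new TuningConfigSketch(path, null);
  }

  // After the patch: the field is carried through.
  public TuningConfigSketch withWorkingPath(String path)
  {
    return new TuningConfigSketch(path, allowedHadoopPrefix);
  }
}
```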

@erikdubbelboer
Contributor Author

You want me to make a pull request?

@b-slim
Contributor

b-slim commented Jul 13, 2017

Yeah sure, I will review that. Thanks, @erikdubbelboer.

@gianm
Contributor

gianm commented Jul 13, 2017

How were the properties passed down before? I thought that pre-0.10.1, properties were generally not passed down to the Hadoop cluster, so it seems weird that they'd need to be now.

Or is the difference that now in 0.10.1, the druid.storage.type property is passed down, which means now druid.google. needs to be as well if you're using the google extension? If that's the case, I think we should either add druid.google. to the default list, or give the deep storage extensions some way to say what properties they need passed down; otherwise upgrading will be too much of a pain for people using the extension.

@b-slim
Contributor

b-slim commented Jul 13, 2017

@gianm I think you are right: now that we push down the storage properties, Guice tries to bind things up, hence the issue. I'm not sure what the best fix is, but it seems we need to push down at least all the properties of the selected storage module. Anyway, I will send a first PR to fix the null issue (I guess @erikdubbelboer is in a different timezone) and then focus on how to solve this. @gianm, should this be part of 0.10.1?

@gianm
Contributor

gianm commented Jul 13, 2017

@b-slim I think we should at least have something in the release notes explaining the situation. Pushing down the storage properties would be even better. Perhaps we could do this by adding a getPropertyPrefixForHadoop to DataSegmentPusher and then automatically adding that prefix to allowedHadoopPrefix for whatever the deep storage is.
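A sketch of what that proposal could look like: each deep-storage pusher declares the property prefixes it needs on the Hadoop side, and the indexer merges them into allowedHadoopPrefix automatically. The interface and class names here are hypothetical, not the actual Druid API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposal: each deep-storage pusher declares
// which property prefixes must be forwarded to Hadoop workers, and the
// indexer merges them into allowedHadoopPrefix automatically.
interface DataSegmentPusherSketch
{
  // Most extensions keep their settings under druid.storage.
  default List<String> getAllowedPropertyPrefixesForHadoop()
  {
    return Arrays.asList("druid.storage.");
  }
}

class GoogleSegmentPusherSketch implements DataSegmentPusherSketch
{
  // The Google extension also needs its druid.google.* settings.
  @Override
  public List<String> getAllowedPropertyPrefixesForHadoop()
  {
    return Arrays.asList("druid.storage.", "druid.google.");
  }
}
```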

@jon-wei
Contributor

jon-wei commented Jul 14, 2017

@b-slim
Are you planning on working on a fix beyond #4541 for 0.10.1?

If so, let's keep this in 0.10.1 and we can bring the fix into rc2.

If not, let's move this issue to 0.11.0, and I'll mention the storage extension property push in the 0.10.1 release notes.

@b-slim
Contributor

b-slim commented Jul 14, 2017

I don't want to block the RC for this issue though; I will get to work on it, maybe next week.

@erikdubbelboer
Contributor Author

I'm trying to post at the druid-development list but my messages keep on being automatically deleted (even without any links).

I don't think this only affects the Google Storage extension. I think it affects all storage extensions as all of them use different prefixes for their properties and all of them inject their config into their SegmentPushers?

@gianm
Contributor

gianm commented Jul 17, 2017

I don't think this only affects the Google Storage extension. I think it affects all storage extensions as all of them use different prefixes for their properties and all of them inject their config into their SegmentPushers?

Most of them put their properties under druid.storage, so they aren't affected since the default list does include those. It looks like Google Storage and Azure are the two that don't.

I'm trying to post at the druid-development list but my messages keep on being automatically deleted (even without any links).

They got stuck in the list's spam filter for some reason. I marked them as not-spam and whitelisted your email address.

@gianm
Contributor

gianm commented Jul 17, 2017

Looks like the old messages are deleted though, so you might have to re-post them.

@gianm
Contributor

gianm commented Jul 17, 2017

I think S3 should be ok, since Hadoop's S3 FileSystem is used for connection from hadoop -> s3, not Druid's S3 client. So even though Druid's S3 client won't be properly instantiated, it wouldn't matter.

Has anyone tried this with 0.10.1-rc1 to confirm? I think we have but I'm not totally sure.

@jon-wei
Contributor

jon-wei commented Jul 17, 2017

@gianm We did run some hadoop tests against 0.10.1 before rc1 was released for this PR: #4116, but the deep storage was HDFS in those tests.

@jon-wei
Contributor

jon-wei commented Jul 18, 2017

I just ran a batch ingestion task with 0.10.1-rc1 on EMR, using S3 as deep storage, and the task worked.

@gianm
Contributor

gianm commented Jul 18, 2017

@erikdubbelboer could you try #4562 to see if that works for you, without having to specify allowedHadoopPrefix in the task?

@erikdubbelboer
Contributor Author

@gianm It looks like #4562 can't be applied to the 0.10.1 branch we are running. I tried upgrading only our middleManager node to master, but this just resulted in lots of java.io.IOException: Failed to locate service uri exceptions when starting a task. Do you have any suggestions on how to test this without having to upgrade our whole cluster to master?

@gianm
Contributor

gianm commented Jul 19, 2017

@erikdubbelboer please try #4567 which is a backport of that to 0.10.1.

Although, exceptions when middleManagers are upgraded while the rest of the cluster is not are concerning, since we do want to support upgrading them first as part of a rolling update. Could you please paste the full error messages & stack traces you got when you saw those errors? Might be something we need to fix.

@erikdubbelboer
Contributor Author

@gianm with #4567 everything works fine. There is only one exception that repeats a couple of times at the start but it doesn't seem to affect anything:

2017-07-20 09:23:56,956 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup
	at java.lang.Class.cast(Class.java:3369)
	at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
	at org.apache.logging.log4j.core.util.Loader.newCheckedInstanceOf(Loader.java:301)
	at org.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:94)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:111)
	at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:48)
	at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:75)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:171)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:145)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:70)
	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:57)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:140)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
	at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:103)
	at org.apache.logging.log4j.jul.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:34)
	at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:42)
	at org.apache.logging.log4j.jul.LogManager.getLogger(LogManager.java:89)
	at java.util.logging.LogManager.demandLogger(LogManager.java:551)
	at java.util.logging.Logger.demandLogger(Logger.java:455)
	at java.util.logging.Logger.getLogger(Logger.java:502)
	at com.google.inject.internal.util.Stopwatch.<clinit>(Stopwatch.java:27)
	at com.google.inject.internal.InternalInjectorCreator.<init>(InternalInjectorCreator.java:61)
	at com.google.inject.Guice.createInjector(Guice.java:96)
	at com.google.inject.Guice.createInjector(Guice.java:73)
	at io.druid.guice.GuiceInjectors.makeStartupInjector(GuiceInjectors.java:60)
	at io.druid.indexer.HadoopDruidIndexerConfig.<clinit>(HadoopDruidIndexerConfig.java:105)
	at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:215)
	at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:177)
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436)
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

The exception I'm seeing over and over again when upgrading only the middlemanager to master is:

2017-07-20T03:44:51,532 WARN [main] io.druid.indexing.common.actions.RemoteTaskActionClient - Exception submitting action for task[kill_ssps_2000-01-01T00:00:00.000Z_3000-01-01T00:00:00.000Z_2017-07-20T02:50:01.978Z]
java.io.IOException: Failed to locate service uri
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.common.task.AbstractFixedIntervalTask.isReady(AbstractFixedIntervalTask.java:92) [druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.worker.executor.ExecutorLifecycle.start(ExecutorLifecycle.java:169) [druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
	at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:364) [java-util-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:263) [java-util-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:156) [druid-api-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:101) [druid-services-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.cli.CliPeon.run(CliPeon.java:283) [druid-services-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.cli.Main.main(Main.java:108) [druid-services-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
Caused by: java.lang.NullPointerException
	at io.druid.curator.discovery.ServerDiscoverySelector$1.apply(ServerDiscoverySelector.java:60) ~[druid-server-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.curator.discovery.ServerDiscoverySelector$1.apply(ServerDiscoverySelector.java:52) ~[druid-server-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.curator.discovery.ServerDiscoverySelector.pick(ServerDiscoverySelector.java:108) ~[druid-server-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.common.actions.RemoteTaskActionClient.getServiceInstance(RemoteTaskActionClient.java:164) ~[druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:89) ~[druid-indexing-service-0.11.0-SNAPSHOT.jar:0.11.0-SNAPSHOT]
	... 12 more
2017-07-20T03:44:51,539 INFO [main] io.druid.indexing.common.actions.RemoteTaskActionClient - Will try again in [PT4.686S].

@jon-wei
Contributor

jon-wei commented Jul 20, 2017

@erikdubbelboer

The JNDI exception at startup can be ignored.

For the second exception, based on the line numbers in ServerDiscoverySelector, it looks like you've pulled in some post 0.10.1 changes (i.e., in master but not in 0.10.1-rc1 or rc2), specifically this PR that adds TLS support: #4270

I'm guessing what's happening is that the service announcement changes that set the TLS-enabled port property are missing because only the middleManager got the upgrade, but this newer code running on the middleManager expects that property to be non-null and gets an NPE here.

Can you try basing your custom branch on 0.10.1-rc2?
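The rolling-upgrade hazard described here reduces to a few lines: newer code reads a field (such as a TLS port) that older services never announce, so it arrives as null and dereferencing it throws. This is an illustrative sketch, not Druid's actual discovery code:

```java
// Illustrative sketch (not Druid's actual discovery code) of the
// rolling-upgrade hazard: newer code assumes a field that older
// services never announce, so it arrives as null and unboxing throws.
class AnnouncedServiceSketch
{
  final String host;
  final Integer plaintextPort; // announced by all versions
  final Integer tlsPort;       // announced only by newer versions; may be null

  AnnouncedServiceSketch(String host, Integer plaintextPort, Integer tlsPort)
  {
    this.host = host;
    this.plaintextPort = plaintextPort;
    this.tlsPort = tlsPort;
  }

  // Unsafe: throws NullPointerException when the announcing server is
  // older and never set a TLS port (auto-unboxing a null Integer).
  int pickPortUnsafe()
  {
    return tlsPort;
  }

  // Defensive: fall back to the plaintext port when tlsPort is absent.
  int pickPort()
  {
    return tlsPort != null ? tlsPort : plaintextPort;
  }
}
```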

@gianm
Contributor

gianm commented Jul 20, 2017

I'm guessing what's happening is that the service announcement changes that set the TLS-enabled port property are missing because only the middleManager got the upgrade, but this newer code running on the middleManager expects that property to be non-null and gets an NPE here.

Is that what happens when #4270 is rolled out to MMs but not overlords? If so, that makes it impossible to do a proper rolling update, and we should address it before 0.11.0.
