MultiLangDaemon Throws NullPointerException Going From One Shard To Two When Multiple Daemons Are Running #29

Closed
eyesoftime opened this Issue Jun 29, 2015 · 14 comments

eyesoftime commented Jun 29, 2015

If a stream with one shard is split into two shards, and two daemons are run starting from the trim horizon of the shards, the daemon processing the parent shard dies with a NullPointerException when it reaches the end of that shard. The second daemon takes over processing of the child shards, but by then one of the daemons has exited. Happens with 1.2.0 as well as 1.4.0.

Jun 29, 2015 11:41:47 AM com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor stopProcessing
SEVERE: Encountered an error while trying to shutdown child process
java.lang.NullPointerException
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.shutdown(MultiLangRecordProcessor.java:154)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Jun 29, 2015 11:41:47 AM com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor stopProcessing
SEVERE: Encountered error while trying to shutdown
java.lang.NullPointerException
        at com.amazonaws.services.kinesis.multilang.MessageWriter.close(MessageWriter.java:163)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.childProcessShutdownSequence(MultiLangRecordProcessor.java:186)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.stopProcessing(MultiLangRecordProcessor.java:249)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.shutdown(MultiLangRecordProcessor.java:164)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Contributor

kevincdeng commented Jun 29, 2015

Thanks for reporting. I'll try to reproduce this.

Contributor

kevincdeng commented Jun 29, 2015

Hi eyesoftime,

I'm not able to reproduce this. Can you provide the steps which you took to produce the problem?

What language are you using to process the records? Are you using one of the official multilang KCLs?

Thanks

eyesoftime commented Jun 29, 2015

I was loading records into the stream, each one about 130 KB of arbitrary data with indexes applied for tracking. Initially I started out with one shard. After about 2500 records I split the shard into two, and after another 2500 records I merged the new shards. So there are four shards altogether. There were no consumers running at the time (but that doesn't really change the outcome).

Then I started one daemon with the Python sample application (with additional logging added, again for tracking). While it was consuming records from the first shard, I started the second daemon, which then sat idle, since the first shard hadn't been fully consumed yet and the second and third shards are children of the first. When the consumer of the first shard reached the end, the first daemon died with the NPE. It happened repeatedly, whether both daemons ran on the same EC2 instance or one ran in parallel on my local machine. The same thing happened when the test was done 2-1-2 with shards, i.e., merging then splitting; in that case it also died when going from one shard to two.

Hope it helps you.
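For anyone reproducing this, the split step needs an explicit NewStartingHashKey inside the parent shard's hash key range. A rough boto3 sketch of the 1 → 2 step described above (the stream name, region, and helper names are made up for illustration):

```python
def midpoint(start_key: int, end_key: int) -> int:
    """Hash key that splits the range [start_key, end_key] roughly in half."""
    return start_key + (end_key - start_key) // 2


def split_first_shard(stream_name: str, region: str = "us-east-1") -> None:
    """Split a stream's first shard evenly, as in the 1 -> 2 step above."""
    import boto3  # imported lazily so the midpoint helper stays dependency-free

    kinesis = boto3.client("kinesis", region_name=region)
    shard = kinesis.describe_stream(
        StreamName=stream_name
    )["StreamDescription"]["Shards"][0]
    rng = shard["HashKeyRange"]
    kinesis.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str(
            midpoint(int(rng["StartingHashKey"]), int(rng["EndingHashKey"]))
        ),
    )
    # The later 2 -> 1 step would use kinesis.merge_shards(...) on the
    # two child shards.
```

For a single initial shard covering the full Kinesis range [0, 2**128 - 1], the midpoint split yields two evenly loaded children.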

Contributor

kevincdeng commented Jun 29, 2015

Thanks for the information. I did not do the shard merge in my own test, so that might be the problem. I will do so to see if that reproduces the problem.

Contributor

kevincdeng commented Jun 29, 2015

I have reproduced the problem. The problem isn't with MultiLangRecordProcessor per se, but rather with the Worker implementation.

A Worker will sometimes call shutdown on an IRecordProcessor even if initialize has not been called on the same instance. Since MultiLangRecordProcessor uses its initialize method to construct certain fields, and its shutdown method assumes that those fields have been initialized, an NPE occurs.

Once again thank you for reporting the problem. It will be fixed in a future release.
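The failure mode described above can be sketched in Python; the class and method names here are illustrative stand-ins, not the actual KCL interfaces:

```python
# Python analogue of the bug: state is built in initialize(), and
# shutdown() assumes that state exists.
class RecordProcessor:
    def initialize(self, shard_id):
        # In MultiLangRecordProcessor, initialize() is where the child
        # process and its MessageWriter get constructed.
        self.message_writer = object()

    def shutdown(self):
        # Raises AttributeError if initialize() never ran -- the Python
        # equivalent of the NullPointerException in the stack traces above.
        return self.message_writer


def worker_shutdown(processor):
    """Mimics the Worker path that may shut down a never-initialized processor."""
    try:
        processor.shutdown()
        return "clean shutdown"
    except AttributeError:
        return "crash: shutdown before initialize"
```

Calling `worker_shutdown(RecordProcessor())` without a prior `initialize()` hits the same shutdown-before-initialize ordering the Worker can produce.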

pboocock commented Jul 28, 2015

@kevincdeng

I'm using the Python wrapper for this package, and I seem to be running into the same issue. Is there a workaround you can recommend to guarantee that initialize always gets called?

Contributor

kevincdeng commented Jul 28, 2015

If your code doesn't need the shard id, you might be able to place it in the constructor of the class instead. Do you absolutely need initialize to be called? If it's just to ensure proper functioning of the shutdown method, adding a flag to check whether initialization has happened might be sufficient.
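The flag-based guard suggested above can be sketched like this (hypothetical names, not the real KCL API): remember whether initialize() ran, and make shutdown() a no-op otherwise.

```python
class SafeRecordProcessor:
    def __init__(self):
        # Anything that must exist before shutdown() goes here or gets
        # guarded by the flag below.
        self.initialized = False
        self.shard_id = None

    def initialize(self, shard_id):
        self.shard_id = shard_id
        self.initialized = True

    def shutdown(self, reason="TERMINATE"):
        if not self.initialized:
            # The Worker shut us down before initialize(); nothing to clean up.
            return False
        # ...close resources that initialize() opened, checkpoint, etc...
        return True
```

In the real interface, whether any shutdown work is needed also depends on the shutdown reason passed in; the flag only protects against the never-initialized case.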

kevinrivers commented Nov 16, 2015

Was this fixed in a recent version?

soulcutter commented Jan 28, 2016

This remains unfixed. MultiLangRecordProcessor has not been changed since Oct 2014: https://github.com/awslabs/amazon-kinesis-client/blob/73ac2c0e25a25776cbc88f2c685223fb049e6757/src/main/java/com/amazonaws/services/kinesis/multilang/MultiLangRecordProcessor.java

I was able to reproduce this issue on 1.6.1 (the current latest version).

findchris commented Mar 3, 2016

@kevincdeng ETA here?

pboocock commented Mar 21, 2016

@kevincdeng @findchris FWIW I've found success by setting the failoverTimeMillis property in the .properties file to a high value (e.g. 100 s).
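For reference, that workaround would look something like this in the KCL daemon's .properties file (the value is illustrative; the KCL 1.x default is 10 seconds):

```properties
# Time (ms) a worker may go without renewing its lease before another
# worker takes over; raised to 100 s from the 10 s default.
failoverTimeMillis = 100000
```

A longer failover time makes lease handoffs between workers less aggressive, which appears to sidestep the shutdown-before-initialize ordering; it also delays legitimate failover by the same amount.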

Kahn commented Apr 4, 2016

This apparently shipped in https://github.com/awslabs/amazon-kinesis-client#release-162-march-23-2016. @manango can you close or merge this please? It's confusing to leave open.

Contributor

manango commented Apr 5, 2016

The issue has been resolved in the 1.6.2 release. Closing the issue.

@manango manango closed this Apr 5, 2016

@manango manango removed the fix in progress label Apr 5, 2016

prashantalhat commented Dec 19, 2016

I am facing similar issue in https://github.com/awslabs/amazon-kinesis-client-net. Can someone please help?
