
process s3 event messages from SQS #215

Open
wants to merge 3 commits into base: master
Conversation

@cjxkb210 commented Aug 7, 2020

Issue #, if available:

Description of changes:

Added the ability to consume S3 event messages from an SQS queue. This facilitates having multiple consumers of S3 event notifications for the same prefix. You could, for example, subscribe an SNS topic to the S3 event for a given prefix and then have multiple SQS queues subscribed to that topic. This expands the usage options for multiple, varying consumers of the same S3 prefix event.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
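
Not the PR's actual diff, but a minimal sketch of the kind of event unwrapping this change implies, assuming raw message delivery so each SQS record body is the original S3 notification JSON; the handler shape and logging are illustrative only:

```js
// Illustrative sketch only, not the code in this PR. Assumes SNS raw message
// delivery, so each SQS record body is the original S3 notification JSON.
exports.handler = async function (event, context) {
    for (const record of event.Records || []) {
        if (record.eventSource === 'aws:sqs') {
            // S3 event arriving indirectly: unwrap it from the SQS message body.
            const body = JSON.parse(record.body);
            for (const s3Record of body.Records || []) {
                console.log('S3 object via SQS:', s3Record.s3.bucket.name, s3Record.s3.object.key);
                // hand off to the existing S3 event processing path here
            }
        } else if (record.eventSource === 'aws:s3') {
            // S3 event arriving directly, as before.
            console.log('S3 object direct:', record.s3.bucket.name, record.s3.object.key);
        }
    }
};
```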
@cjxkb210 (Author) commented Aug 7, 2020

Hi,

We'd like to expose the option of having multiple consumers of S3 events for a given prefix. AWS only allows one consumer of S3 events for a given prefix; this consumer can be a Lambda, an SQS queue, or an SNS topic.

If the RedshiftLambdaLoader could consume S3 events indirectly via an SQS queue, this would open up the option of using an SNS topic as the subscriber of S3 events. In this case we could subscribe multiple queues; for example, if you wanted to load the data into Redshift via the RedshiftLambdaLoader but independently consume from a separate queue with a process that ingests the data into an index.

I've created the attached PR for review. Please note that in my example I've also allowed the processing of multiple SQS messages in the same request. In my testing I enabled raw message delivery on an SNS topic, to which I subscribed an SQS queue that the RedshiftLambdaLoader could consume from. This minimised the impact of the changes.
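
For reference, a sketch of that fan-out wiring using the AWS SDK for JavaScript (v2); the topic and queue ARNs are placeholders, and this is not code from the PR:

```js
// Sketch of the SNS -> SQS fan-out wiring with raw message delivery enabled.
// Placeholder ARNs; not part of this PR.
const AWS = require('aws-sdk');
const sns = new AWS.SNS();

async function subscribeQueueToTopic(topicArn, queueArn) {
    // Subscribe the queue to the topic that receives the S3 event notifications...
    const sub = await sns.subscribe({
        TopicArn: topicArn,
        Protocol: 'sqs',
        Endpoint: queueArn
    }).promise();

    // ...and enable raw message delivery so the SQS body is the original S3
    // event JSON rather than an SNS envelope.
    await sns.setSubscriptionAttributes({
        SubscriptionArn: sub.SubscriptionArn,
        AttributeName: 'RawMessageDelivery',
        AttributeValue: 'true'
    }).promise();
}
```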

Please note, I'm not fully convinced that processing multiple SQS messages in the same request is safe with the batch locking logic. I believe it may handle this, although perhaps it was not intended to do so?

Could you review and let me know your thoughts on exposing such a feature? Would this approach cause you any concerns?

Thanks,
Chris.

@cjxkb210 (Author) commented Aug 7, 2020

Hi @IanMeyers, would you mind reviewing my suggestion above?

Much appreciated,
Chris.

@IanMeyers self-assigned this Aug 7, 2020
@IanMeyers (Contributor):

Sorry, I have been on holiday. Will review Monday the 10th.

@cjxkb210 (Author) commented Aug 7, 2020

Thank you, have a great weekend.

@cjxkb210 (Author) commented Aug 11, 2020

After testing, I confirmed my initial concern that processing the SQS messages in a tight loop will cause problems.

Failing further processing to Batch 7a2afa1f-7dc5-45e8-b0ad-677ac59738d1 which may be stuck in 'locked' state. If so, unlock the back using node unlockBatch.js <batch ID>, delete the processed file marker with node processedFiles.js -d <filename>, and then re-store the file in S3"

I'm still digging into the cause; I think a Redshift maintenance window might be in play. Perhaps the batch was left in a locked state because the Lambda timed out and did not mark the batch as failed.

@IanMeyers (Contributor):

Lambda timeouts can indeed result in locked batches.

@cjxkb210 (Author):

Yeah, I've seen logs that look like this:
Remaining duration of -90597 insufficient to load cluster

It's possible I simply got unlucky and the function did not manage to fail the batch, leaving it in a locked state.

I've had my changes running via SQS, loading batches, for 12 hours now. The data I'm testing with gets placed onto S3 every 15 minutes.

@cjxkb210 (Author):

I'll be testing this over a longer time period, but this is ready for review now. I've increased my timeouts a little and added failed-batch re-processing with SQS delayed delivery for the Redshift maintenance windows.
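
For illustration, delayed re-delivery along these lines is the general idea; the queue URL, payload, and delay value are placeholders rather than the exact values used in this PR:

```js
// Minimal sketch of re-queueing a failed batch with a delivery delay so the
// loader retries after a Redshift maintenance window. Placeholder values only.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

async function requeueFailedBatch(queueUrl, s3Event) {
    await sqs.sendMessage({
        QueueUrl: queueUrl,
        MessageBody: JSON.stringify(s3Event),
        DelaySeconds: 900 // SQS caps per-message delay at 15 minutes
    }).promise();
}
```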

@cjxkb210 (Author) commented Aug 12, 2020

Interestingly, I noticed this in the logs (I don't think it is in any way related to this PR):

2020-08-12T13:48:12.102+01:00
START RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 Version: $LATEST

2020-08-12T13:48:12.115+01:00
info: Failing Batch 48e2e469-2fbd-4e98-99c6-9b028831a612 due to "Remaining duration of -89145 insufficient to load cluster"

2020-08-12T13:48:12.458+01:00
END RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46
REPORT RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 Duration: 355.01 ms Billed Duration: 400 ms Memory Size: 256 MB Max Memory Used: 103 MB

I confirmed that the "Remaining duration of -89145 insufficient to load cluster" message did pertain to the same request, 1a5aa337-9f84-5f13-9486-f49a78bcbb46, by checking the batches table.

This means the context.getRemainingTimeInMillis() function returned -89145 just 13 ms after request invocation, when my Lambda had been configured with a whopping 2 min 30 s timeout. The number makes no sense, and I'm not convinced getRemainingTimeInMillis is behaving nicely. I've upgraded my Node.js runtime from 10.x to 12.x in case there was an AWS defect; I couldn't find a record of one, but I'll see if the upgrade eliminates the strange behavior around context.getRemainingTimeInMillis().
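
Not from the PR, but a sketch of the kind of logging and guard that could help capture the anomaly for the support ticket; the 60-second threshold and handler shape are illustrative assumptions:

```js
// Illustrative only: record the raw remaining-time value at entry so the
// anomaly can be evidenced, and treat negative readings as suspect.
exports.handler = async function (event, context) {
    const remainingMs = context.getRemainingTimeInMillis();
    console.log('remaining ms at entry:', remainingMs, 'request:', context.awsRequestId);

    if (remainingMs < 0) {
        // A negative value 13 ms into a 2 min 30 s timeout should be impossible;
        // log it rather than silently failing the batch.
        console.warn('getRemainingTimeInMillis returned a negative value');
    } else if (remainingMs < 60 * 1000) {
        console.warn('insufficient time remaining to load the cluster');
    }
    // ...continue with normal processing
};
```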

@IanMeyers (Contributor):

Hmm, I can't reproduce this. Also, if this consistently happened, then no one's loads would ever succeed. I think there's something else going on here, but I don't see anything obvious in the code that could cause it: this is a single instance of the function being used, and while it accesses context via a parent reference, that shouldn't cause any consistency issues. Let me know what happens with 12.x?
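
For context, a purely hypothetical illustration of the stale-context capture being ruled out here; it is not a claim about the loader's code, just the one pattern that would make getRemainingTimeInMillis() go negative on a warm container:

```js
// Hypothetical failure mode, not the loader's code: a context object captured
// at module scope survives across warm invocations, so its deadline is stale.
let savedContext;

exports.handler = async function (event, context) {
    if (!savedContext) {
        savedContext = context; // captured on the first invocation only
    }
    // On a later warm invocation, the first invocation's deadline has long
    // passed, so this can return a large negative number (e.g. -89145).
    console.log('remaining:', savedContext.getRemainingTimeInMillis());
};
```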

@cjxkb210 (Author):

Yeah, I've just seen the same issue with 12.x. It's infrequent; I've seen it 3 times in 48 batches.

@IanMeyers (Contributor):

Can you please raise a support ticket?

@cjxkb210 (Author):

Yes, I'll raise a ticket for it. I'll spend a little more time this afternoon to see if I can spot anything that might explain it.

@cjxkb210 (Author):

@IanMeyers this has gone through rounds of testing and is looking good. What are your thoughts on rolling this in?

Kind Regards, Chris.

@IanMeyers (Contributor):

Thanks for the info. In general I'm supportive, but I need to do a code review. I will book out some time tomorrow (18/9) to do so.

@cjxkb210 (Author):

Thank you @IanMeyers

[REQ-2126] - consume the message for bad events, log, and move on.
[REQ-2126] - safely consume and ignore AWS test events, which, despite being test events, do not honour raw message delivery settings.
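
The test-event handling referenced in the commits above could look something like this sketch; the exact checks in the PR may differ:

```js
// Sketch of filtering out S3 test notifications; the commit's actual checks may differ.
function isS3TestEvent(messageBody) {
    try {
        const parsed = JSON.parse(messageBody);
        // S3 emits {"Service":"Amazon S3","Event":"s3:TestEvent",...} when a
        // notification configuration is created; it carries no Records array.
        return parsed && parsed.Event === 's3:TestEvent';
    } catch (err) {
        return false; // not JSON, so not a test event
    }
}
```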