Duplicated records #33
For example, I have this S3 file:
The line starting with
Unfortunately, because Kinesis supports an at-least-once delivery semantic, you are unable to suppress duplicates unless you track IDs using a secondary mechanism (which can then move you to 'at most once' delivery semantics, which opens the possibility of data loss). It is best to leave the duplicate records in your delivery destination in S3, and deduplicate them within the analysis system you are using.
Thanks a lot for your answer @IanMeyers. It's much clearer. What do you mean by:
Thanks again, Ian
I will probably follow Brent Nash's approach, even if I haven't yet found a way to trigger all of this using this Lambda and Firehose.
Closing the issue because it's not related to the Lambda itself.
Yes, Brent describes one correct way to merge new data that may contain duplicates. Thanks!
Hey @benoittgt, I'm on vacation in the mountains at the moment, so my internet is spotty, but let me try to share a few details. As Ian mentioned, Kinesis has "at least once" semantics, so you can get the occasional duplicate. In my experience, they're mostly due to producer retries. See https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html for more details.

In the system(s) I've built, we do what Ian mentioned: we archive all events (including duplicates) to S3 as sort of the "raw" record of what was received. Deduplication happens when we move data from Kinesis or S3 into destination data stores like Redshift or Elasticsearch Service.

The basic gist of it is that the producer (i.e. the thing sending data to Kinesis) generates a unique v4 UUID (an "event_id") and sends it as part of every event. This is the field that can be used to deduplicate. When loading into Redshift, we use a temporary staging table and then perform a LEFT JOIN against the destination table(s), taking only event_ids that don't already exist there. In the case of Elasticsearch Service, you can do a similar thing by using the event_id as the id in the standard index/type/id mapping ESS provides (duplicates just get overwritten with the same record, which has the effect of deduping). You can come up with similar mechanisms for other data stores as well.

Hope that helps, and let me know if you need any further details.
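A minimal sketch (not from this project) of the producer side Brent describes: attaching a v4 UUID event_id to every record before putting it on the stream, so downstream loads can dedupe on it. The stream name, payload fields, and the uuid dependency are assumptions for illustration.

'use strict';

// Sketch only: every event gets a unique event_id (v4 UUID) before it is
// sent to Kinesis; consumers dedupe on that field later.
const AWS = require('aws-sdk');
const uuidv4 = require('uuid/v4'); // assumed dependency: the uuid package

const kinesis = new AWS.Kinesis();

function sendEvent(payload) {
  const event = Object.assign({ event_id: uuidv4() }, payload);
  return kinesis.putRecord({
    StreamName: 'my-stream',        // placeholder stream name
    PartitionKey: event.event_id,   // spreads records across shards
    Data: JSON.stringify(event)
  }).promise();
}

// Example usage:
// sendEvent({ type: 'connection_opened', user_id: 42 })
//   .then(() => console.log('sent'))
//   .catch(console.error);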
Thanks a lot @brentnash. Sorry to bother you during your vacation.
I'm using Firehose for the moment to copy from S3 to Redshift. I will probably
I got another answer at http://disq.us/p/1eyg90w. I will add this mechanism tomorrow and will publish a blog post about the implementation next month. Thanks again to both of you.
@brentnash The only issue I will have with keeping this Lambda while Firehose copies into Redshift is that the staging table will constantly receive data. I'm probably going to use two staging tables and
That looks quite complicated for a few duplicates, but I think it's the best thing to do.
I have something similar to:

BEGIN;
ALTER TABLE active_connections_temp
RENAME TO active_connections_process;
CREATE TABLE active_connections_temp (LIKE active_connections_process);
COMMIT;
BEGIN;
INSERT INTO active_connections_final
SELECT DISTINCT active_connections_process.*
FROM active_connections_process
LEFT JOIN active_connections_final ON active_connections_final.id = active_connections_process.id
WHERE active_connections_final.id IS NULL
ORDER BY active_connections_process.id;
DROP TABLE active_connections_process;
COMMIT;
The other thing to consider is that you'll get relatively few duplicates while running normally, but the two places I see large numbers of duplicates are:
That being said, you're right, I don't think Firehose is a great fit for this at the moment. It's a lot of effort for potentially not much gain, like you said. If you really want to use Firehose to write all the way to Redshift, then you're either going to have to do something complicated like you mentioned above or you'll have to live with the dupes. I'll try to find the actual SQL I use tomorrow to compare against what you posted above, but yours looks about like what I'd expect, with the SELECT DISTINCT and the LEFT JOIN on id (which I assume is your unique ID).

If you're not concerned about #2 I mentioned above (i.e. you only care about duplicates that happen within a limited time window), you could also consider using the new Firehose embedded Lambda function feature (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html) to do some clever deduping. Maybe you could have it store the last hour or day worth of unique event IDs that it has seen in a DynamoDB table and try to dedupe that way. I haven't tried it... just an idea. Good luck!
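A rough, untested sketch of that idea, assuming a Node.js Firehose transformation Lambda, records that are JSON objects carrying an event_id, and a hypothetical DynamoDB table named seen_event_ids with event_id as its partition key and a TTL attribute:

'use strict';

const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

const TABLE = 'seen_event_ids';      // hypothetical table name
const TTL_SECONDS = 24 * 60 * 60;    // remember IDs for one day

// Conditional put: succeeds only if the event_id has not been seen yet.
function markIfUnseen(eventId) {
  return dynamo.put({
    TableName: TABLE,
    Item: {
      event_id: eventId,
      expires_at: Math.floor(Date.now() / 1000) + TTL_SECONDS
    },
    ConditionExpression: 'attribute_not_exists(event_id)'
  }).promise()
    .then(() => true)
    .catch(err => {
      if (err.code === 'ConditionalCheckFailedException') return false; // duplicate
      throw err;
    });
}

exports.handler = function(event, context, callback) {
  // Firehose sends base64-encoded records; each must be returned with its
  // recordId and a result of 'Ok', 'Dropped' or 'ProcessingFailed'.
  const work = event.records.map(record => {
    const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
    return markIfUnseen(payload.event_id).then(isNew => ({
      recordId: record.recordId,
      result: isNew ? 'Ok' : 'Dropped',
      data: record.data
    }));
  });

  Promise.all(work)
    .then(records => callback(null, { records: records }))
    .catch(callback);
};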
Thanks @brentnash for your answer. We now have a Lambda that runs in production with a staging table. The code looks like:

'use strict';
const config = require('./redshift_config_from_env');
const redshiftConn = `pg://${config.user}:${config.password}@${config.host}/${config.database}`;
const pgp = require('pg-promise')();
// Rename the constantly-filling _temp table to _process and recreate an
// empty _temp table so Firehose can keep delivering without interruption.
var tableCopyQuery = function(tableName) {
return `
ALTER TABLE ${tableName}_temp
RENAME TO ${tableName}_process;
CREATE TABLE ${tableName}_temp (LIKE ${tableName}_process);`;
};
// Insert only the staged rows whose id is not already in the destination
// table, then drop the processed staging table.
var insertQuery = function(tableName) {
return `
INSERT INTO ${tableName}
SELECT DISTINCT ${tableName}_process.*
FROM ${tableName}_process
LEFT JOIN ${tableName} USING (id)
WHERE ${tableName}.id IS NULL
ORDER BY ${tableName}_process.id;
DROP TABLE ${tableName}_process;`;
};
// Run the swap-and-merge for each stats table inside a single transaction.
exports.handler = function(event, context) {
const client = pgp(redshiftConn);
return client.tx(function (t) {
return t.batch([
t.none(tableCopyQuery('user_stats')),
t.none(insertQuery('user_stats')),
t.none(tableCopyQuery('admin_stats')),
t.none(insertQuery('admin_stats'))
]);
})
.then(function () {
return context.succeed(`Successfully merged.`);
})
.catch(function (error) {
return context.fail(`Failed to run queries : ${JSON.stringify(error)}`);
});
};

For the moment it's working, but I will wait a few days to be sure.
Hey @benoittgt, Just to follow up, I checked my merge SQL and it looks pretty similar to yours. The only differences I see are:
One other thought is that you may want to check what the PRIMARY KEY / SORT KEY / DIST KEY are set to on your staging table. If your "id" is not part of those, your merges may take longer than necessary. Though since you're using CREATE TABLE ... LIKE ... you might not have a choice, since you'll inherit those values from your parent table. Glad to hear it seems to be working for you!
Hello @brentnash
Thanks a lot! Two days in and it's still working as intended.
We eventually ran into occasional "Serializable isolation violation" errors on the table from other insert queries in Redshift. We ended up removing the Redshift insert from Firehose and letting it handle only the S3 delivery. The copy from S3 to a temp table and the insert into the final table are done in one transaction by a Lambda. Also, every insert transaction locks the Redshift table before doing anything. It works quite perfectly (except #37). It takes some time, but we finally have a solution that can be easily debugged and is very efficient. Thanks again for the help.
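For reference, a sketch of roughly that shape (not the exact production code), reusing pg-promise as in the Lambda above; the table names, COPY options, S3 path handling, and IAM role ARN are placeholders.

'use strict';

const pgp = require('pg-promise')();
const config = require('./redshift_config_from_env');
const db = pgp(`pg://${config.user}:${config.password}@${config.host}/${config.database}`);

// One transaction: lock the destination table, COPY the new S3 object into a
// temporary staging table, then insert only the ids not already present.
function mergeBatch(s3Path) {
  return db.tx(t => t.batch([
    t.none('LOCK active_connections_final;'),
    t.none('CREATE TEMP TABLE active_connections_staging (LIKE active_connections_final);'),
    t.none(`COPY active_connections_staging FROM '${s3Path}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            JSON 'auto' GZIP;`),
    t.none(`INSERT INTO active_connections_final
            SELECT DISTINCT s.*
            FROM active_connections_staging s
            LEFT JOIN active_connections_final f USING (id)
            WHERE f.id IS NULL;`)
  ]));
}

exports.handler = function(event, context) {
  // Assumes the Lambda is triggered by the S3 event for the delivered object.
  const record = event.Records[0];
  const s3Path = `s3://${record.s3.bucket.name}/${record.s3.object.key}`;

  mergeBatch(s3Path)
    .then(() => context.succeed('Merged batch from ' + s3Path))
    .catch(error => context.fail('Merge failed: ' + JSON.stringify(error)));
};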
I published with my team two blog posts about our migration. You are mentioned! Thanks again for your help.
https://medium.com/appaloosa-store-engineering/migrating-our-analytics-stack-from-mongodb-to-aws-redshift-334230d9ef7e
https://medium.com/appaloosa-store-engineering/from-mongodb-to-aws-redshift-a-practical-guide-5ec8ee8fb147
Thanks Benoit! Glad everything worked out! Good luck!
~Brent
Hello @IanMeyers and others,
I'm getting random duplicate rows. For the last day:
I can see the duplicate entries in the S3 file, but not when reading the Kinesis stream content.
Trying to investigate.