
Improve the deduplication id. #149

Open

ryanfox1985 wants to merge 3 commits into master

Conversation

ryanfox1985

No description provided.

```ruby
def fifo_required_params(serialized_job)
  parsed_job = JSON.parse(serialized_job)
  # Derive a deterministic ID from the job class and its arguments,
  # rather than using the randomly generated job_id.
  deduplication_id = OpenSSL::Digest::MD5.hexdigest(parsed_job['job_class'] + parsed_job['arguments'].to_s)
```
ryanfox1985 (Author) commented on this diff · Feb 28, 2023

Basically this is the main change and the motivation of this PR (the other changes were caused by RuboCop). The job_id is a random UUID, so I think the deduplication ID should be something deterministic, like my suggestion.
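
To illustrate the difference (a minimal sketch, not code from the gem): hashing the job class plus arguments is stable across enqueues, while the job_id is a fresh UUID every time.

```ruby
require 'openssl'
require 'securerandom'

# The proposed digest: same class + same arguments => same deduplication ID.
digest = ->(job_class, args) { OpenSSL::Digest::MD5.hexdigest(job_class + args.to_s) }

digest.call('WithdrawalJob', ['some-account-id', 100]) ==
  digest.call('WithdrawalJob', ['some-account-id', 100]) # => true

# ActiveJob's job_id, by contrast, is a fresh UUID on every enqueue:
SecureRandom.uuid == SecureRandom.uuid # => false
```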

samsonjs commented Mar 6, 2023

It seems like this is really dependent on a specific use-case. Let's say that I have a queue of transactions for a bank account and someone withdraws $100 twice, resulting in 2 jobs with the class WithdrawalJob and arguments ['some-account-id', 100]. Those should definitely not be de-duplicated.
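
To make the hazard concrete (a sketch of the proposed hashing applied to this example):

```ruby
require 'openssl'

first  = OpenSSL::Digest::MD5.hexdigest('WithdrawalJob' + ['some-account-id', 100].to_s)
second = OpenSSL::Digest::MD5.hexdigest('WithdrawalJob' + ['some-account-id', 100].to_s)

# => true: SQS would silently drop the second, legitimate withdrawal
# within the deduplication window.
first == second
```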

ryanfox1985 (Author)

In my opinion, your example is missing a timestamp argument. For a bank, I think the time at which the withdrawal is performed is essential.

samsonjs commented Mar 7, 2023

Sorry, that wasn't very clear. The specifics aren't important; my point is only that, because it depends on the application, it should probably be optional.

ryanfox1985 (Author)

In my opinion, having a random UUID as the deduplication_id is wrong usage; as it is now, it's not possible to take advantage of AWS FIFO queues.

The motivation for the deduplication_id of FIFO queues is to not process duplicated messages within a certain time window (I think it's 5 minutes).

Your withdrawal example has a design problem when used with FIFO queues: if you want to use FIFO, then I think you should call WithdrawalJob with ['some-account-id', 100, timestamp], as I proposed.
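
For example (a minimal sketch, assuming a standard ActiveJob class named WithdrawalJob):

```ruby
# The timestamp argument makes each withdrawal's arguments, and therefore
# the derived deduplication_id, unique.
WithdrawalJob.perform_later('some-account-id', 100, Time.now.to_i)
```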

ryanfox1985 (Author)

Since the deduplication_id is mandatory for FIFO queues, having a random UUID sounds like a quick patch just to be able to use FIFO queues at all.

ryanfox1985 (Author) commented Mar 7, 2023

My bad, it seems it's not mandatory:

https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagededuplicationid-property.html

So then, I think using a random UUID is the same as not setting a deduplication_id at all.

ryanfox1985 (Author) commented Mar 7, 2023

Perhaps a good approach would be to not set this field by default, and to add a property to the library that enables setting it.

That way, the user could enable/disable the deduplication_id. What do you think?
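
Something like this, say (a purely hypothetical sketch; neither the module nor the option name is taken from the gem):

```ruby
# Hypothetical configuration; the names are invented for illustration.
SqsActiveJob.configure do |config|
  # true  => derive the deduplication_id from job class + arguments (this PR)
  # false => current behaviour (random job_id), the default
  config.deterministic_deduplication_id = true
end
```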

ryanfox1985 (Author)

Sorry, but I think both (message_group_id and deduplication_id) are mandatory for FIFO queues.

ryanfox1985 (Author)

Do you prefer this approach? 48ba66d

By the way, I dislike it; I think it's a wrong usage of FIFO queues...

samsonjs commented Mar 8, 2023

> Do you prefer this approach? 48ba66d

I'm not sure what the best approach is to configure this behaviour. Maybe some of the maintainers who are more familiar with the codebase can chime in on that. Maybe the deduplication ID could be an optional param that's set on each job somehow, because it sounds like you may want/need different behaviour per job. The parameter could be a symbol describing the strategy, so you don't have to calculate the hash yourself and pass it in.
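
For example (purely hypothetical; the deduplicate_by macro and the strategy symbols are invented to illustrate the idea):

```ruby
class WithdrawalJob < ApplicationJob
  # Every message is unique; relies on the random job_id (current behaviour).
  deduplicate_by :job_id
end

class RefreshAccountCacheJob < ApplicationJob
  # Identical class + arguments within the dedup window collapse into one message.
  deduplicate_by :content
end
```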


Looking at the SQS docs, though, I think they basically state that, depending on your specific use-case, you may want to provide a deduplication ID either to make sure messages with the same content are treated as unique, or, on the other hand, to make sure that messages with slightly different content are treated as the same.

The producer should provide message deduplication ID values for each message in the following scenarios:

  • Messages sent with identical message bodies that Amazon SQS must treat as unique.
  • Messages sent with identical content but different message attributes that Amazon SQS must treat as unique.
  • Messages sent with different content (for example, retry counts included in the message body) that Amazon SQS must treat as duplicates.

The next section about single producer/consumer also has more info about what to do in different circumstances, which really makes me think it should be parameterized somehow.
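
For instance, the last case (different content treated as duplicates) could look like this (a sketch; retry_count and enqueued_at are just examples of volatile fields excluded from the digest):

```ruby
require 'json'
require 'openssl'

def dedup_id(serialized_job)
  volatile = %w[retry_count enqueued_at]           # fields that change between attempts
  stable = JSON.parse(serialized_job).reject { |key, _| volatile.include?(key) }
  OpenSSL::Digest::MD5.hexdigest(stable.to_json)   # identical stable content => duplicate
end
```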

ryanfox1985 (Author)

In my opinion it shouldn't be parameterized, because the deduplication_id is mandatory and FIFO queues have this behaviour.

The implementation should therefore cover this deduplication; otherwise the feature is totally useless.

samsonjs commented Mar 8, 2023

Using the ActiveJob ID isn't useless. It serves the purpose of preventing 2 jobs with identical parameters from being mistakenly detected as duplicates. Other strategies are valid too, but it depends on your specific use-case.

ryanfox1985 (Author) commented Mar 8, 2023

It doesn't depend on the use-case; it's a design/implementation problem.

Your theory doesn't make sense, because if it did, then the message_deduplication_id would be optional, not mandatory.
