
Do not blindly delete duplicate schedules #389

Merged: 5 commits into celery:master on Mar 3, 2021

Conversation

@izimobil (Contributor) commented Mar 2, 2021

Because there are no unique constraints on the schedule models (except for
SolarSchedule), the schedule tables can contain duplicates. Those duplicate
schedules should not be blindly deleted when encountered, because they can be
linked to existing tasks, and the tasks would then be deleted as well
(on-delete cascade).

Instead, we now just return the first duplicate found, to avoid creating
further duplicates.

This fixes issue #322.
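A minimal sketch of the pattern this change targets, using a hypothetical IntervalSchedule-like model (the real from_schedule methods in django-celery-beat build their spec in a more involved way; the model and field names below are illustrative assumptions):

```python
from django.core.exceptions import MultipleObjectsReturned
from django.db import models


class IntervalScheduleSketch(models.Model):
    """Hypothetical stand-in for django-celery-beat's IntervalSchedule."""

    every = models.IntegerField()
    period = models.CharField(max_length=24)

    @classmethod
    def from_schedule(cls, schedule, period="seconds"):
        # 'schedule' is assumed to be a celery.schedules.schedule whose
        # run_every attribute is a timedelta.
        spec = {"every": int(schedule.run_every.total_seconds()),
                "period": period}
        try:
            return cls.objects.get(**spec)
        except cls.DoesNotExist:
            return cls(**spec)  # unsaved instance; the caller persists it
        except MultipleObjectsReturned:
            # Old behaviour: cls.objects.filter(**spec).delete() followed by
            # return cls(**spec); the delete cascades to every PeriodicTask
            # that points at one of the duplicate rows.
            # New behaviour: keep the duplicates and reuse one of them. This
            # branch only runs when at least two rows matched, so .first()
            # is guaranteed to return an instance, never None.
            return cls.objects.filter(**spec).first()
```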

@auvipy (Member) commented Mar 2, 2021

can you check if this PR https://github.com/celery/django-celery-beat/pull/269/files is somewhat related?

@izimobil (Contributor, Author) commented Mar 2, 2021

> can you check if this PR https://github.com/celery/django-celery-beat/pull/269/files is somewhat related?

It is related in the sense that it would prevent future duplicates. The idea of adding unique constraints is good, but there should be a warning in the changelog because it's not fully backwards compatible: existing code may try to create duplicate schedules (which currently works) and get an IntegrityError exception once the constraints are added.

Furthermore, because existing databases can already contain duplicate schedules, the from_schedule methods should not be modified.

IMO you can merge my PR first (the bug is really critical in the sense that it deletes existing tasks!), release a bugfix version, and add the constraints in a future major version.

(If that PR is reworked to only add constraints, the tests in my PR will have to be removed.)
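As a hedged illustration of the backwards-compatibility concern above: SolarSchedule is the one model that already enforces uniqueness via unique_together, so a second identical create raises IntegrityError instead of silently inserting a duplicate row; adding the same kind of constraint to the other schedule models would make code that currently creates duplicates fail the same way. The test class name, event, and coordinates below are made up for the example:

```python
from django.db import IntegrityError, transaction
from django.test import TestCase

from django_celery_beat.models import SolarSchedule


class UniqueConstraintBehaviourSketch(TestCase):
    def test_duplicate_solar_schedule_raises(self):
        spec = {"event": "sunrise", "latitude": "48.85", "longitude": "2.35"}
        SolarSchedule.objects.create(**spec)
        # With the unique_together constraint in place, the second identical
        # create raises IntegrityError rather than inserting a duplicate.
        # transaction.atomic() keeps the test's outer transaction usable
        # after the expected failure.
        with self.assertRaises(IntegrityError), transaction.atomic():
            SolarSchedule.objects.create(**spec)
```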

@auvipy (Member) commented Mar 2, 2021

@arnau126 can you check this please?

@auvipy (Member) left a comment:

Hope return cls.objects.filter(**spec).first() won't create any regression, since it is a bug fix.

@izimobil (Contributor, Author) commented Mar 2, 2021

> Hope return cls.objects.filter(**spec).first() won't create any regression, since it is a bug fix.

Frankly, it can't be worse than it is now. I had tasks disappear in my production environment, and that is really difficult to repair when you have hundreds of tasks.

Also, since cls.objects.filter(**spec).first() is in the block handling the MultipleObjectsReturned exception, you are guaranteed to retrieve a schedule instance and not None.

@auvipy (Member) commented Mar 2, 2021

Just waiting for another pair of eyes to cross-check; will merge soon after that.

@hartwork (Contributor) left a comment:

Hi!

This improves over the existing code for sure. 👍
What I am wondering is:

  • Why are we using cls() over cls.objects.create(), so that those functions sometimes return model instances that are persisted already and sometimes return instances that are not yet persisted in the database? Is that a bug or a feature?
  • It seems like we effectively have a tailor-made get_or_create here. Why not use what Django offers and simplify most of this code away for good? (A hedged sketch of that idea follows after this comment.)
  • The added test cases are effectively 3 times the same test. Personally, I would use parameterized.expand and a single parametrized test to avoid duplication, but it's just my two cents and I'm aware there are multiple schools of thought on that subject.

Best, Sebastian
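A hedged sketch of the get_or_create idea from the second bullet; this is not the code that was merged, and the model is a hypothetical stand-in. Note two semantic differences: get_or_create always persists the row (whereas the current code sometimes returns an unsaved instance), and it does not resolve pre-existing duplicates (with several matching rows it raises MultipleObjectsReturned):

```python
from django.db import models


class IntervalScheduleGetOrCreateSketch(models.Model):
    """Hypothetical stand-in for django-celery-beat's IntervalSchedule."""

    every = models.IntegerField()
    period = models.CharField(max_length=24)

    @classmethod
    def from_schedule(cls, schedule, period="seconds"):
        spec = {"every": int(schedule.run_every.total_seconds()),
                "period": period}
        # Returns the existing row when exactly one matches, creates (and
        # saves) one otherwise; no duplicates are created or deleted.
        obj, _created = cls.objects.get_or_create(**spec)
        return obj
```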

# create 2 duplicate schedules
sched1 = CrontabSchedule.objects.create(hour="4")
CrontabSchedule.objects.create(hour="4")
self.assertEqual(CrontabSchedule.objects.count(), 2)
@hartwork (Contributor):

Just an idea:
I think adding .filter(hour="4") here and in line 93 would be a cheap way to make this test more robust to any potential cross-test effects. It would help fight future flakiness. What do you think?
PS: This applies to the other two tests as well.
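A hedged sketch of what that suggestion could look like for the CrontabSchedule test (the test class and method names are hypothetical):

```python
from django.test import TestCase

from django_celery_beat.models import CrontabSchedule


class CrontabDuplicateSketch(TestCase):
    def test_duplicate_schedules_scoped_count(self):
        CrontabSchedule.objects.create(hour="4")
        CrontabSchedule.objects.create(hour="4")
        # Scoping the count to hour="4" keeps schedules created by other
        # tests or fixtures from affecting the assertion.
        self.assertEqual(CrontabSchedule.objects.filter(hour="4").count(), 2)
```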

@izimobil (Contributor, Author):

Good idea

Comment on lines 121 to 124:

def test_duplicate_schedules(self):
    # Duplicates cannot be tested for solar schedules because of the
    # unique constraints in the SolarSchedule model
    pass
@hartwork (Contributor):

I'm not a big fan of adding dead code, to be honest. There are other things that we cannot test for either, right? Does this code really add value?

@izimobil (Contributor, Author):

See next comment

@hartwork (Contributor):

I checked all. Which one?

@izimobil (Contributor, Author):

Given that the constraint has been there from the very beginning (ec46367), we can indeed drop the MultipleObjectsReturned block (as well as the test method, of course).

@hartwork (Contributor):

So the test would be dropped as well, excellent. 👍

Comment on lines 124 to 127:

 except MultipleObjectsReturned:
-    cls.objects.filter(**spec).delete()
-    return cls(**spec)
+    # unique_together constraint should not permit reaching this code,
+    # but just in case, we return the first schedule found
+    return cls.objects.filter(**spec).first()
@hartwork (Contributor):

In the interest of avoiding dead code, why not drop these lines altogether?

@izimobil (Contributor, Author) commented Mar 2, 2021:

Given that the constraint has been there from the very beginning (ec46367), we can indeed drop the MultipleObjectsReturned block (as well as the test method, of course).

- removed MultipleObjectsReturned code block for solar schedules as
  unique_together constraint safely prevents duplicates
- removed related test method
@izimobil (Contributor, Author) commented Mar 2, 2021

Hi,

I've updated the PR.

> * Why are we using `cls()` over `cls.objects.create()`, so that those functions sometimes return model instances that _are_ persisted already and sometimes return instances that _are not_ yet persisted in the database? Is that a bug or a feature?
>
> * It seems like we effectively have a tailor-made `get_or_create` here. Why not use what Django offers and simplify most of this code away for good?

I asked myself the same question regarding why the schedule isn't saved, but keep in mind that I don't want to break anything with this bug-fix PR. Improving the code should be the job of another PR, IMHO.

> * The added test cases are effectively 3 times the same test. Personally, I would use [`parameterized.expand`](https://pypi.org/project/parameterized/) and a single parametrized test to avoid duplication, but it's just my two cents and I'm aware there are multiple schools of thought on that subject.

I won't take the responsibility of adding yet another requirement just for that; instead I've factored the shared code out into a mixin class (see the sketch below).

Thanks for your review, regards.
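A hedged sketch of the mixin approach mentioned above, reduced to the duplicate-counting part (the attribute names are made up, and the merged tests also exercise from_schedule):

```python
from django.test import TestCase

from django_celery_beat.models import CrontabSchedule, IntervalSchedule


class DuplicateScheduleTestMixin:
    # Concrete test cases set these two attributes; the shared test body is
    # written only once.
    schedule_model = None
    schedule_spec = None

    def test_duplicate_schedules(self):
        self.schedule_model.objects.create(**self.schedule_spec)
        self.schedule_model.objects.create(**self.schedule_spec)
        self.assertEqual(
            self.schedule_model.objects.filter(**self.schedule_spec).count(),
            2,
        )


class CrontabDuplicateTests(DuplicateScheduleTestMixin, TestCase):
    schedule_model = CrontabSchedule
    schedule_spec = {"hour": "4"}


class IntervalDuplicateTests(DuplicateScheduleTestMixin, TestCase):
    schedule_model = IntervalSchedule
    schedule_spec = {"every": 10, "period": IntervalSchedule.SECONDS}
```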

@hartwork (Contributor) left a comment:

@izimobil thanks for taking this further! 👍 🙏

@hartwork (Contributor) commented Mar 2, 2021

Travis CI taking ages to start builds as usual… 🤷‍♂️

@izimobil (Contributor, Author) commented Mar 2, 2021

@auvipy I think you can safely merge this one; I can't believe no one else has run into the dramatic impact of the previous code!
Cheers.

@auvipy (Member) commented Mar 3, 2021

I've caught the flu, but I'll review and merge it tomorrow for sure.

@auvipy auvipy merged commit bab79d9 into celery:master on Mar 3, 2021
@hartwork (Contributor) commented Mar 3, 2021

> I've caught the flu, but I'll review and merge it tomorrow for sure.

All the best!
