Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the loop variable scheduler issue #4468

Merged
merged 5 commits into from
Nov 27, 2023
Merged

Conversation

pmahindrakar-oss
Copy link
Contributor

@pmahindrakar-oss pmahindrakar-oss commented Nov 22, 2023

Tracking issue

#4285

Describe your changes

All kudos to @leonlnj who discovered this bug in Gojek production deployment and helped debug this nasty loop variable issue which caused unnecessary production issue where schedules ran with incorrect times

Overall the issue that was observed was when Gojek redeployed there scheduler, scheduler ran its regular catchup routine to recover from the missed schedule time during the redeploy and found schedules it needed to trigger.
This is well tested behavior in the fleet and we never encountered any issues with it.

But in case of Gojek, it mysteriously ran the schedules which shouldn't have been run and also ran with incorrect catcup from time.

There is much detail in the attached issue.
To summarize,

  • The lastExecTime for each active schedule is computed using its updatedAt timestamp and whats stored in the lastsnapshot for that schedule
  • If the schedule is available in the snapshot and has timestamp greater than the updatedAt timestamp then we use that timestamp or else we assume schedule was updated recently and hence we should use that for catching up
  • Now consider the case where schedules dont have a snapshot and they will all use updatedAt timestamp,
  • What was happening though was the lastExecTime which was pointer to loop variable s.UpdatedAt , when the loop variable moved to next schedule, the lastExecTime being pointer used the new schedules UpdatedAt
  • So essentially this value will eventually have last schedules UpdatedAt timestamp and in gojek's case it was a new schedule and this schedule's updatedAt timestamp was used as lastExecTime for knowing the catupFromTime for the schedule
  • All the active schedules started then firing using the last scheudes updateAt timestamp

The unit test written by Gojek team nicely capture this behavior.
Have updated those tests with assertions and added the loop variable fix.

There is ongoing open golang issue to fix the gotcha which many go developers have complained or ran into golang/go#56010

Hopefully they fix that soon

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@pmahindrakar-oss
Copy link
Contributor Author

cc : @leonlnj

Copy link

codecov bot commented Nov 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (db61e22) 58.98% compared to head (c5fa1b4) 76.39%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4468       +/-   ##
===========================================
+ Coverage   58.98%   76.39%   +17.41%     
===========================================
  Files         488       18      -470     
  Lines       41772     1521    -40251     
===========================================
- Hits        24639     1162    -23477     
+ Misses      15172      295    -14877     
+ Partials     1961       64     -1897     
Flag Coverage Δ
unittests 76.39% <ø> (+17.41%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pingsutw
pingsutw previously approved these changes Nov 22, 2023
flyteadmin/run.sh Outdated Show resolved Hide resolved
@wild-endeavor
Copy link
Contributor

I'm actually pretty lost... can you help me understand please?

https://goplay.tools/snippet/mNyKXEvyv4q

I understand that the pointer inside the iterator (t here, s in flyte code) doesn't change, cuz it's just one object. But why does that matter? the values that that pointer points to do change.

can you also add a comment where you do schedule := s? feels like something i would just delete if i came across it.

@leonlnj
Copy link

leonlnj commented Nov 24, 2023

I'm actually pretty lost... can you help me understand please?

https://goplay.tools/snippet/mNyKXEvyv4q

I understand that the pointer inside the iterator (t here, s in flyte code) doesn't change, cuz it's just one object. But why does that matter? the values that that pointer points to do change.

can you also add a comment where you do schedule := s? feels like something i would just delete if i came across it.

@wild-endeavor Because the jobStore store the lastExecTime as a pointer, when the iterator move, the value of it changes as well. This cause the catchup function to use an unexpected lastExecTime that don't belong to it and possibly fire schedule that was earlier than the original schedule's updatedAt time. Hope this helps.

Screenshot 2023-11-24 at 12 06 50 PM https://goplay.tools/snippet/_jD1NzTDZC6

@wild-endeavor
Copy link
Contributor

ohh we're saving it, got it, didn't see that. thank you @leonlnj

@wild-endeavor
Copy link
Contributor

@pmahindrakar-oss could you still remove that weird file? i added a comment.

@pmahindrakar-oss
Copy link
Contributor Author

pmahindrakar-oss commented Nov 27, 2023

@wild-endeavor this might be better example to showcase the issue https://goplay.tools/snippet/J1rE_i5BJxv and have removed the delve debugger file

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Nov 27, 2023
Signed-off-by: pmahindrakar-oss <prafulla.mahindrakar@gmail.com>
Signed-off-by: pmahindrakar-oss <prafulla.mahindrakar@gmail.com>
Signed-off-by: pmahindrakar-oss <prafulla.mahindrakar@gmail.com>
Signed-off-by: pmahindrakar-oss <prafulla.mahindrakar@gmail.com>
Signed-off-by: pmahindrakar-oss <prafulla.mahindrakar@gmail.com>
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 27, 2023
@pmahindrakar-oss pmahindrakar-oss merged commit 98dd505 into master Nov 27, 2023
45 checks passed
@pmahindrakar-oss pmahindrakar-oss deleted the fix-scheduler-loop-bug branch November 27, 2023 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants