Port QueueWorker #3267

Closed
8 tasks done
pbiggar opened this issue Nov 26, 2021 · 12 comments

@pbiggar
Member

pbiggar commented Nov 26, 2021

  • Port the queueworker code from OCaml to F# (mostly done, but it doesn't actually start as the main method is empty)
  • Start the queueworker in main
  • Determine how the queueworker is going to run (multi-threaded?)
  • add queueworker container to containers/
  • add queueworker service to services/
  • add healthcheck
  • Add tests
  • determine how we're going to switch over to this
@pbiggar
Member Author

pbiggar commented Dec 21, 2021

Deployment plan:

  • Disable actually running the queueworker in code.
  • Deploy 2 workers and see if they stay up
  • Ensure we see honeycomb events
  • Wait until after the security review to actually pull from the queue

after the security review

  • study existing queueworker metrics and telemetry to see how it's running in prod
  • figure out why deployment isn't working
  • Scale to 1 worker
  • change "dequeue" to be selective

@pbiggar
Member Author

pbiggar commented Dec 21, 2021

We can in the future look at running multithreaded stuff here, but not yet.

@pbiggar
Member Author

pbiggar commented Mar 1, 2022

Given that these workers actually need quite a bit of CPU and memory, I think we should look at multithreading here. How many jobs should we run at once on a single 3-CPU node? Maybe 16, with 4 nodes?
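
As a rough illustration of capping the number of jobs in flight on one node, here is a minimal Python sketch (the real queueworker is F#). The cap of 16 is the figure floated above; `process_event` and `run_batch` are hypothetical names, and Python threads illustrate the concurrency cap only, not CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 16  # assumption: the "maybe 16" figure from the comment


def process_event(event: int) -> int:
    """Hypothetical stand-in for running one queued event."""
    return event * 2


def run_batch(events: list[int], max_workers: int = MAX_CONCURRENT_JOBS) -> list[int]:
    # The executor never runs more than max_workers jobs at once;
    # map() yields results in the same order as the input.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_event, events))
```

With 4 nodes each running this with a cap of 16, at most 64 events would be in flight at a time.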

@pbiggar pbiggar self-assigned this Mar 9, 2022
@pbiggar pbiggar removed their assignment Apr 16, 2022
@pbiggar
Member Author

pbiggar commented May 8, 2022

The new plan is to implement a new queue #3777, and the F# queueworker would pull from that queue. Then we set feature flags at emit and choose which canvases/handlers to send there then.
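
To make the flag-at-emit idea concrete, here is a minimal Python sketch (not the project's F# code). It assumes a per-canvas flag; the `Emitter` class, its fields, and the in-memory lists standing in for the two queues are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Emitter:
    # Hypothetical flag store: canvases that have been switched to the new queue.
    new_queue_canvases: set[str] = field(default_factory=set)
    old_queue: list[tuple[str, str]] = field(default_factory=list)
    new_queue: list[tuple[str, str]] = field(default_factory=list)

    def emit(self, canvas: str, handler: str) -> str:
        """Route one event by canvas and report which queue received it."""
        if canvas in self.new_queue_canvases:
            self.new_queue.append((canvas, handler))
            return "new"
        self.old_queue.append((canvas, handler))
        return "old"
```

Because the decision is made at emit time, moving a canvas over is just a matter of adding it to the flag set; events already in the old queue are unaffected.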

first version

  • implement new queue
  • add local emulator
  • migrate DB
    • add migration script
    • run the migration in prod
  • deploy and get containers running in production
  • manually test the queue in production
  • add feature flag to emit and cronchecker on canvases
  • add tests
    • tests needed for each state transition in the diagram, especially errors
  • add permissions for production
  • move things to configuration

second version

  • queueworker deployment should not go to zero pods
  • get honeycomb data right for the existing instrumentation
  • implement unpausing/unblocking: loading all saved events into the queue
  • test using scheduling rules to block a handler
  • manually test pausing a handler in production
  • tests for pausing/unpausing
  • add retries on failure / delay_until
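
One way the "retries on failure / delay_until" item could work is sketched below in Python (not the project's F#). The retry limit, the fixed delay, and the `Event`, `is_ready`, and `on_failure` names are assumptions for illustration; only the `delay_until` field name comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_RETRIES = 2                     # assumption, not from the issue
RETRY_DELAY = timedelta(minutes=5)  # assumption, not from the issue


@dataclass
class Event:
    id: int
    retries: int = 0
    delay_until: Optional[datetime] = None


def is_ready(event: Event, now: datetime) -> bool:
    """An event may run if it has no delay or its delay has elapsed."""
    return event.delay_until is None or event.delay_until <= now


def on_failure(event: Event, now: datetime) -> bool:
    """Schedule a retry; return False once retries are exhausted."""
    if event.retries >= MAX_RETRIES:
        return False
    event.retries += 1
    event.delay_until = now + RETRY_DELAY
    return True
```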

third version

  • don't include queuing time in "dequeueAndProcess"
  • look at honeycomb and see if we have sufficient instrumentation
  • move a few users over to the new queue
  • check flushing is hooked up
  • handle FSTODOs
  • fix queue counts
  • add an index for queue counts
  • apply index in production
  • test queue count in production
  • add time limit for event execution (moving to issue)
  • write queue doc
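
For the "don't include queuing time" item, the idea is to report time spent waiting in the queue separately from time spent running the handler. A minimal Python sketch follows; `Timings`, the `clock` parameter, and this function's signature are hypothetical and do not reflect the actual `dequeueAndProcess` code.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Timings:
    queue_delay_s: float  # time the event sat in the queue
    processing_s: float   # time spent running the handler


def dequeue_and_process(
    enqueued_at: float,
    handler: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> Timings:
    # enqueued_at must come from the same clock that is passed in here.
    started = clock()
    handler()
    finished = clock()
    return Timings(
        queue_delay_s=started - enqueued_at,
        processing_s=finished - started,
    )
```

Reporting the two numbers as separate fields keeps a backed-up queue from looking like slow handler execution in the telemetry.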

@StachuDotNet
Member

StachuDotNet commented May 9, 2022

... Then we set feature flags at emit and choose which canvases/handlers to send there then.

This is to migrate slowly to the F# queueworker, rather than all at once?

test disabling with admin rules in production

What does this mean? Disabling the queueworker for general purposes? (what do we normally do that for? in case of emergencies?)

@StachuDotNet
Member

StachuDotNet commented May 9, 2022

Some TODOs:

  • refer to events.md in QueueWorker/readme.md
    should we refer to this elsewhere?
  • consider relocating events.md to docs rather than docs/production
    (currently, the queue-scheduler refers to the wrong directory)
  • update events.md given these changes

@pbiggar
Member Author

pbiggar commented May 9, 2022

... Then we set feature flags at emit and choose which canvases/handlers to send there then.

This is to migrate slowly to the F# queueworker, rather than all at once?

Yes

test disabling with admin rules in production

What does this mean? Disabling the queueworker for general purposes? (what do we normally do that for? in case of emergencies?)

We have a way to disable a user's queues in case something goes awry (either abuse or operational issue).
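
A rough Python sketch of how such a rule check might look before an event is run (the project itself is F#). The "block" and "pause" kinds echo the vocabulary used in this thread; the `Rule` shape and `should_run` function are assumptions, not the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str     # "block" or "pause"; names assumed from the thread's wording
    canvas: str
    handler: str


def should_run(rules: list[Rule], canvas: str, handler: str) -> bool:
    """An event runs only if no rule targets its canvas and handler."""
    return not any(r.canvas == canvas and r.handler == handler for r in rules)
```

Removing the rule makes the handler eligible again, at which point any saved events would need to be loaded back into the queue.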

@pbiggar
Member Author

pbiggar commented May 9, 2022

FYI, I made a new version of the doc with the state diagram in Mermaid format!

https://github.com/darklang/dark/pull/3786/files#diff-5f242dada1ad8afc2b333f365391724a7d2d97e1b0a757f64f46a533df82eb64

@pbiggar
Member Author

pbiggar commented May 13, 2022

Progress report: the first PR was merged and the queue code is running in production. I ran a few tests by flipping the feature flag on the canvas, and it worked! Honestly a little surprised, but there we go.

@pbiggar
Member Author

pbiggar commented May 16, 2022

Progress report: I have moved maybe 30 users over to the new queue. I'll do it in larger chunks going forward.

@pbiggar
Member Author

pbiggar commented May 16, 2022

Moving this here

4th version

  • run more than one queue at a time
  • validate that the users transitioned over are working
  • check what operational stuff will go off if we move the queue
  • get a sense of what memory and cpu values are like (don't seem usable)
  • test out more than one queue at a time
  • fix pusher serialization with libdarkinternal::pushStrollerEvent_v1

8th version (numbers got out of sync)

  • why is there a two-minute gap where no events are run after a deploy?
  • why do I see exceptions in the log that aren't in rollbar and telemetry?
  • fix exception Status(StatusCode="DeadlineExceeded"

finishing steps

  • move the rest of the users over to the new queue
  • add honeycomb triggers if something goes awry with the queue

@pbiggar
Member Author

pbiggar commented May 29, 2022

Done!

@pbiggar pbiggar closed this as completed May 29, 2022
Repository owner moved this from In Progress to Done in Release 2 in Darklang priorities May 29, 2022