
Resource check queue #3788

Closed
pivotal-jwinters opened this issue Apr 24, 2019 · 9 comments · Fixed by #4202

@pivotal-jwinters (Contributor) commented Apr 24, 2019

Instead of checking resources synchronously, we want to add them to a queue (a table in the db). This way we get better visibility into which resource checks are happening, better de-duplication of concurrent checks, and the ability to rate limit.
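
To make that concrete, here is a minimal sketch of what such a table might look like, assuming a Postgres-backed queue. The column names and the partial unique index are illustrative guesses, not the actual migration:

```go
// Hypothetical sketch of a checks table -- not the real migration, just an
// illustration of the shape of the queue.
package lidarsketch

const createChecksTable = `
CREATE TABLE checks (
    id                       bigserial PRIMARY KEY,
    resource_config_scope_id integer NOT NULL,
    status                   text    NOT NULL DEFAULT 'pending', -- pending | running | succeeded | errored
    manually_triggered       boolean NOT NULL DEFAULT false,
    plan                     jsonb,        -- what to check: type, source, from-version
    create_time              timestamptz NOT NULL DEFAULT now(),
    start_time               timestamptz,
    end_time                 timestamptz
);

-- at most one pending/running check per scope gives cheap de-duping
CREATE UNIQUE INDEX one_active_check_per_scope
    ON checks (resource_config_scope_id)
    WHERE status IN ('pending', 'running');
`
```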

@pivotal-jwinters (Contributor, Author) commented May 13, 2019

We've been investigating a two-component approach: a scanner and a checker, plus manual interaction via the api.

scanner

The scanner is responsible for determining if a new check should be added to the database. It should be concerned with things like paused jobs/pipelines, pinned resources, disabled resources, check interval, etc...
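
A rough sketch of what that per-resource decision might look like (hypothetical types and helpers, not the actual scanner code):

```go
// Hypothetical sketch of the scanner's per-resource decision; the types and
// helpers here are illustrative, not Concourse's actual radar/lidar code.
package lidarsketch

import (
	"context"
	"time"
)

type Resource struct {
	ScopeID       int
	Paused        bool // stands in for paused pipeline/resource, disabled, pinned, ...
	CheckInterval time.Duration
	LastCheckEnd  time.Time
}

type CheckFactory interface {
	// CreateCheck inserts a 'pending' row into the checks table.
	CreateCheck(ctx context.Context, scopeID int, manuallyTriggered bool) error
}

type Scanner struct{ checks CheckFactory }

// maybeEnqueue applies the scanner-only preconditions and, if the resource is
// due, enqueues a check. The checker never re-evaluates these preconditions.
func (s *Scanner) maybeEnqueue(ctx context.Context, r Resource) error {
	if r.Paused {
		return nil
	}
	if time.Since(r.LastCheckEnd) < r.CheckInterval {
		return nil // not due yet
	}
	return s.checks.CreateCheck(ctx, r.ScopeID, false)
}
```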

checker

The checker is responsible for running any checks (and resource type checks) that get added to the database. The checker will scan the table and grab a lock to run each check. If it can't get the lock, it will simply retry the next time it runs. The checker should run regardless of any preconditions that the scanner might care about.
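
Again hypothetically (the real locking and table access would go through Concourse's db package), a checker tick could look like:

```go
// Hypothetical sketch of a checker tick; interfaces are illustrative only.
package lidarsketch

import (
	"context"
	"log"
)

type Lock interface{ Release() error }

type Check interface {
	// TryLock attempts to grab a lock keyed on the check (e.g. a Postgres
	// advisory lock), so only one web node runs a given check.
	TryLock() (Lock, bool, error)
	Run(ctx context.Context) error
	Finish(runErr error) error // records success/failure on the check row
}

type CheckStore interface {
	PendingChecks(ctx context.Context) ([]Check, error)
}

// runPending scans the table and runs whatever it can grab a lock for. If
// another node holds the lock we simply skip the check; it will be retried
// on the next tick if it is still pending.
func runPending(ctx context.Context, store CheckStore) {
	checks, err := store.PendingChecks(ctx)
	if err != nil {
		log.Printf("failed to list pending checks: %v", err)
		return
	}
	for _, c := range checks {
		lock, acquired, err := c.TryLock()
		if err != nil || !acquired {
			continue
		}
		go func(c Check, lock Lock) {
			defer lock.Release()
			_ = c.Finish(c.Run(ctx))
		}(c, lock)
	}
}
```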

api

Any request made to the check-resource api would result in a check being added to the database (skipping all of the scanner's preconditions). These checks will still be subject to any UNIQUE constraints in the database; for example, the request is a no-op if an equivalent check is already pending/running.
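
As a sketch of how that de-duping could fall out of the schema itself (using the hypothetical table and columns from the earlier sketch, not the real queries):

```go
// Hypothetical enqueue path shared by the scanner and the check-resource API.
// Relies on the partial unique index sketched above; column names are guesses.
package lidarsketch

import (
	"context"
	"database/sql"
)

// createCheck returns true if a new check was enqueued, or false if an
// equivalent check was already pending/running (the UNIQUE constraint makes
// the request a no-op rather than a duplicate).
func createCheck(ctx context.Context, db *sql.DB, scopeID int, manual bool) (bool, error) {
	res, err := db.ExecContext(ctx, `
		INSERT INTO checks (resource_config_scope_id, status, manually_triggered)
		VALUES ($1, 'pending', $2)
		ON CONFLICT DO NOTHING`,
		scopeID, manual,
	)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n > 0, err
}
```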

@pivotal-jwinters (Contributor, Author) commented May 15, 2019

One thing we need to keep in mind is that the checks table could get quite large, so we'll need to figure out how we're going to GC old checks.
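
For illustration, one simple approach would be a periodic reaper that deletes finished checks after a retention window (table/column names are the hypothetical ones from the earlier sketch, not necessarily what ends up shipping):

```go
// Hypothetical GC pass -- one simple retention-based strategy.
package lidarsketch

import (
	"context"
	"database/sql"
	"time"
)

// reapOldChecks deletes finished checks older than the retention window so
// the table stays bounded while recent check history remains visible.
func reapOldChecks(ctx context.Context, db *sql.DB, retention time.Duration) (int64, error) {
	res, err := db.ExecContext(ctx, `
		DELETE FROM checks
		WHERE status IN ('succeeded', 'errored')
		AND end_time < now() - ($1 * interval '1 second')`,
		retention.Seconds(),
	)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}
```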

@vito (Member) commented May 16, 2019

@pivotal-jwinters Should they just be marked as completed after the check is done and/or removed?

Or, could we track the queue time and reap anything queued before the last_checked value?

pivotal-jwinters added a commit that referenced this issue May 22, 2019
pivotal-jwinters added a commit that referenced this issue May 28, 2019
@pivotal-jwinters (Contributor, Author) commented Jun 19, 2019

While working through this issue we've hit a bit of a wrinkle.

Ever since we introduced global resources, radar's behaviour has been a bit strange. Every resource still has its own check interval, but if two resources share the same scope, only one will actually get the lock and run the check. This makes sense in the current context of everything we need in order to run a check; however, moving forward we want to remodel things a bit.

In the context of our two-component system (outlined above), the scanner will still interact at the resource level, but once we get to the checker we will no longer know which resource the check actually ties back to. We started thinking about how we could run a check with the actual resource:

Since the scope doesn't actually know the atc.Source to run the check against, we'll need to store it in the checks table. This starts to look a LOT like a build plan for checks, so why not actually make it one.

This led us down the path of refactoring the exec package so that we can add an exec.CheckStep. We're going to have to change a few more things along the way, but it seems like a pretty good path forward.
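
To make the direction a bit more concrete, here's a rough sketch, assuming a simplified version of the exec.Step interface (the actual interface and the resource plumbing differ):

```go
// Rough sketch of a check step against a *simplified* Step interface; the
// real exec package in Concourse looks different.
package execsketch

import "context"

type RunState interface{} // stand-in for the engine's per-build state

type Step interface {
	Run(ctx context.Context, state RunState) error
	Succeeded() bool
}

// CheckPlan is the "build plan for checks": everything needed to run a check
// without knowing which pipeline resource it originally came from.
type CheckPlan struct {
	Type        string
	Source      map[string]interface{} // the atc.Source stored on the check row
	FromVersion map[string]string
}

// CheckStep runs a resource check as just another step in the exec engine.
type CheckStep struct {
	plan      CheckPlan
	check     func(ctx context.Context, plan CheckPlan) ([]map[string]string, error)
	succeeded bool
}

func (s *CheckStep) Run(ctx context.Context, state RunState) error {
	versions, err := s.check(ctx, s.plan)
	if err != nil {
		return err
	}
	// The real implementation would save the discovered versions against the
	// resource config scope here.
	_ = versions
	s.succeeded = true
	return nil
}

func (s *CheckStep) Succeeded() bool { return s.succeeded }
```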

pivotal-jwinters added a commit that referenced this issue Jun 19, 2019
pivotal-jwinters added a commit that referenced this issue Jun 19, 2019
xtremerui pushed a commit that referenced this issue Jul 3, 2019
@vito (Member) commented Jul 23, 2019

Notes from IPM regarding the handling of missing versions for parent resource types:

  • Let's have fly check-resource fail fast if the parent resource type has no version.
  • Let's add fly check-resource --recursive which will recurse into the parent types and perform checks for them, and then check the specified resource.
  • Then have fly check-resource print the command to run when it hits that error.

Also: be sure to not surface 'orange boxes' for missing resource type versions - just show a special message in the UI, since we expect this state to happen when a resource is first configured.

@vito vito added this to In progress in Core side-road Jul 31, 2019
@vito vito added this to To do in Chequeue via automation Aug 6, 2019
@vito vito removed this from In progress in Core side-road Aug 6, 2019
@vito vito moved this from To do to In progress in Chequeue Aug 6, 2019
Chequeue automation moved this from In progress to Done Sep 6, 2019
@cirocosta (Member) commented Sep 12, 2019

Hey @pivotal-jamie-klassen, did UX end up creating an issue for making the necessary UI changes to fetch the status of the specific check_id created after triggering a check?

(i.e., use the /api/v1/checks/:check_id)

https://github.com/concourse/concourse/pull/4202/files#diff-071d67b5e901a7e8d76fa176b8677d0cR15-R16

Asking because we just saw the effects of not having that implemented 😁

Thank you!

@pivotal-jamie-klassen (Contributor) commented Sep 17, 2019

@cirocosta #4444

@kcmannem (Member) commented Oct 3, 2019

Some observations after deploying 5.6 with Lidar on wings, our internally supported Concourse environment.

In the graph below, the blue line splits time into before and after Lidar. We can see that CPU load drops from ~45% to ~30%, along with a big reduction in transactions/min, from ~35k/min to ~25k/min.
[screenshot: CPU load and DB transactions/min, before and after Lidar]

Something we also noticed is that the web nodes now have less variance in the number of queries they make; this is great progress towards making Concourse easier to scale horizontally.
[screenshot: per-web-node query rates]

Although the number of queries has decreased, it's important to note that the aggregate number of resource checks across all web nodes remains the same. :party:
[screenshot: aggregate resource checks across web nodes]

Throughput from the DB's perspective has increased even though the number of transactions has gone down.
[screenshot: DB throughput]

Looking at the GC activity we mentioned before (which we had to wait to see kick off), we can see that it indeed does what we wanted: after 6 hours, storage stabilized.
[screenshot: checks storage stabilizing after ~6 hours]

@kcmannem (Member) commented Oct 3, 2019

Another set of observations, from hush-house, our internally supported Concourse environment. This set takes a closer look at how the Go runtime performs after Lidar is enabled.

We see a 1k (50%) drop in the number of goroutines being spawned.
[screenshot: goroutine count]

Something we're still curious about is that GC for the Go process increased by a significant amount. Even though q75 is still negligible after the increase, q100 (the worst offenders) is orders of magnitude greater.
[screenshots: Go GC quantiles, q75 and q100]
