
RFC: allow running web backend and frontend in separate processes #47

Closed · wants to merge 2 commits

Conversation

@jamieklassen (Member) commented Mar 23, 2020

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>

@jwntrs (Contributor) commented Mar 30, 2020

I love the new subcommand proposal. I've been wanting to separate these for a while; thanks for formalizing these ideas.

One of the drivers for concourse/concourse#5305 was making gc easier to configure so that it's easier to separate out.

I'm not sure about the ENABLE_* flags. This was supposed to have been addressed with the components table, where each component can be paused without having to restart the web node. Is there something that the components table doesn't address?

EDIT: Oh I see. The goal of the ENABLE_* flags would be to configure each web node with a different set of services enabled, without having to create new subcommands.

* `concourse ui` for the frontend, web handlers
* `concourse api` for the http and https servers
* `concourse backend` for the scheduler/gc/logcollector/syslogdrainer/etc
* `concourse gateway` for the TSA

Review comment (Contributor):

It would also be nice to see a separate concourse auth component.
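
To make the ENABLE_* idea concrete, here is a minimal, hypothetical sketch of how env-var toggles could select which grouper members a given web node runs, without adding new subcommands. The variable names, member names, and placeholder runner are illustrative assumptions, not Concourse's actual wiring.

```go
// Hypothetical sketch: build the web node's member list from ENABLE_*-style
// toggles so each node can run a different subset of services without new
// subcommands. Names and members are illustrative, not Concourse's.
package main

import (
	"os"

	"github.com/tedsuo/ifrit"
	"github.com/tedsuo/ifrit/grouper"
)

// placeholder stands in for a real component's run loop.
func placeholder() ifrit.Runner {
	return ifrit.RunFunc(func(signals <-chan os.Signal, ready chan<- struct{}) error {
		close(ready)
		<-signals
		return nil
	})
}

// enabled treats an unset variable as "on", so the default is a full web node.
func enabled(envVar string) bool {
	value, set := os.LookupEnv(envVar)
	return !set || value == "true"
}

func webMembers() grouper.Members {
	var members grouper.Members
	for name, envVar := range map[string]string{
		"api":       "CONCOURSE_ENABLE_API",
		"auth":      "CONCOURSE_ENABLE_AUTH",
		"scheduler": "CONCOURSE_ENABLE_SCHEDULER",
		"gc":        "CONCOURSE_ENABLE_GC",
	} {
		if enabled(envVar) {
			members = append(members, grouper.Member{Name: name, Runner: placeholder()})
		}
	}
	return members
}

func main() {
	group := grouper.NewParallel(os.Interrupt, webMembers())
	<-ifrit.Invoke(group).Wait()
}
```

A deployment could then run one node with only `api`/`auth` enabled and another with only `scheduler`/`gc`, while leaving all flags unset keeps today's all-in-one behaviour.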

@jwntrs (Contributor) commented Mar 30, 2020

Also, I've actually attempted a spike to separate the UI from the API more than once. There are a couple of problems:

  1. We would either need to support CORS or use a web proxy so that web requests appear to originate from the same origin (see the sketch after this list).

  2. So many ports. Currently operators only need to expose a single port for all web traffic in Concourse. In the microservice world they would either need to expose many ports or run more proxies.
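
For problem 1, here is a minimal sketch of what the CORS route could look like if the UI were served from a different origin than the API; the origin, port, and handler below are assumptions for illustration only.

```go
// Hypothetical sketch of option 1: if the UI lived on its own origin, the API
// would need to answer CORS preflights. The allowed origin is an assumption.
package main

import (
	"log"
	"net/http"
)

func allowUIOrigin(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "https://ui.concourse.example.com")
		w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS")
		w.Header().Set("Access-Control-Allow-Headers", "Authorization, Content-Type")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent) // answer the preflight directly
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	api := http.NewServeMux()
	api.HandleFunc("/api/v1/info", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"version":"dev"}`))
	})
	log.Fatal(http.ListenAndServe(":8080", allowUIOrigin(api)))
}
```

The web-proxy alternative avoids this entirely by keeping a single public origin, which is what the reverse-proxy discussion further down comes back to.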

@jamieklassen (Member, Author) commented Apr 2, 2020

I'm starting to have serious reservations about the use of the term 'microservices' to describe this change. The key idea here is to improve observability of the web node. The proposed solution happens to leverage existing mechanisms from Linux by allowing operators to configure the web node to spread its work across multiple co-located processes. Microservices, by definition, are 'independently deployable services modeled around a business domain'. As is perhaps typical for discussions involving microservices, that is a way bigger change than I'm trying to suggest - I'm not trying to make any recommendations about deployment cycles or business domains or separating the persistence layer or distributed transactions etc etc etc. This is not at all intended to be an architectural suggestion at that level.

@jwntrs (Contributor) commented Apr 2, 2020

Given the difficulties of splitting ui/api, I think we can get some big wins just by splitting up the atc into frontend (api, web, auth) and backend (gc, scheduler, lidar, etc.). Basically, the current `atc run` command becomes `frontend` and `backend`.

                       +------------- quickstart -------------+
                       |                                      |
         +----------  web ----------+                       worker
         |                          |
       tsa          +------------- atc -------------+
                    |               |               |
                frontend         backend         migrate

Thoughts?
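
A rough sketch of what that split could look like, assuming the members that `atc run` wires up today can be partitioned by subcommand; the member names and placeholder runners here are illustrative only.

```go
// Rough sketch: partition the members that `atc run` starts today into a
// `frontend` and a `backend` subcommand. Member names are illustrative; the
// real atc wires up many more components.
package main

import (
	"fmt"
	"os"

	"github.com/tedsuo/ifrit"
	"github.com/tedsuo/ifrit/grouper"
)

// stub stands in for a real component's run loop.
func stub() ifrit.Runner {
	return ifrit.RunFunc(func(signals <-chan os.Signal, ready chan<- struct{}) error {
		close(ready)
		<-signals
		return nil
	})
}

func frontendMembers() grouper.Members {
	return grouper.Members{
		{Name: "api", Runner: stub()},
		{Name: "web", Runner: stub()},
		{Name: "auth", Runner: stub()},
	}
}

func backendMembers() grouper.Members {
	return grouper.Members{
		{Name: "scheduler", Runner: stub()},
		{Name: "gc", Runner: stub()},
		{Name: "lidar", Runner: stub()},
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: atc frontend|backend")
		os.Exit(1)
	}
	var members grouper.Members
	switch os.Args[1] {
	case "frontend":
		members = frontendMembers()
	case "backend":
		members = backendMembers()
	default:
		fmt.Fprintln(os.Stderr, "usage: atc frontend|backend")
		os.Exit(1)
	}
	<-ifrit.Invoke(grouper.NewParallel(os.Interrupt, members)).Wait()
}
```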

@jamieklassen (Member, Author) commented:

The key acceptance criterion is this: I want to know which API endpoint is eating all my memory without pulling a heap dump. Thinking lean here, suppose we just had a way to run the ListAllJobs function in its own process, with the guarantee that this process will handle all /api/v1/jobs requests. What consequences does this entail?

Unix processes each have their own file handles. The web node has three kinds of file handles of interest that I can think of: stdout/stderr for logs; the ones that point to config files, which I believe should be opened, read, and then closed as part of the startup process; and TCP sockets. What kinds of sockets does it have?

  • listeners:
    • HTTP (for web & api)
    • HTTPS (ditto)
    • SSH (TSA)
    • pprof debug
    • forwarded worker connections
  • database connections

The key consequence that stands out to me is what @pivotal-jwinters has pointed out: which process actually serves the traffic, i.e. owns the HTTP listener? While we're at it, we probably want a consistent notion of a logical database connection pool: who will coordinate and make sure that the set of connections is equitably distributed between these processes?

When we ask the question this way, the solution that suggests itself is to rely primarily on proxies:

  • a reverse proxy receiving all the web traffic and somehow 'routing' the 'work' of serving those requests to the correct process (sketched below)
  • a db proxy process on the other end that handles the logical pooling.
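
A minimal sketch of the reverse-proxy variant, assuming the all-jobs-lister and the rest of the web handlers listen on hypothetical local ports:

```go
// Sketch: one process owns the single public listener and routes /api/v1/jobs
// to a separate all-jobs-lister process. Port numbers are assumptions.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	jobsLister, _ := url.Parse("http://127.0.0.1:9001") // hypothetical all-jobs-lister process
	webNode, _ := url.Parse("http://127.0.0.1:9002")    // everything else

	mux := http.NewServeMux()
	mux.Handle("/api/v1/jobs", httputil.NewSingleHostReverseProxy(jobsLister))
	mux.Handle("/", httputil.NewSingleHostReverseProxy(webNode))

	// Memory/CPU spent listing jobs now shows up on the :9001 process in
	// ordinary tools like top or ps, without touching pprof.
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```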

We can also turn this question on its head. Let's assume that the web node continues its usual existence and manages all the interesting file handles; then how do we define the inter-process communication between the web process and the all-jobs-lister process? The first design that comes to mind when I ask the question this way would be to use the most minimal unix IPC mechanisms: the web process is a parent of the all-jobs-lister, which it spins up on demand, passing any information from the request via command-line parameters or environment variables; when the all-jobs-lister terminates, the web process receives the information it needs from an exit code or perhaps by reading stdout.
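
A minimal sketch of that spawn-on-demand design, assuming a hypothetical `concourse-all-jobs-lister` helper binary whose exit code and stdout are the entire protocol:

```go
// Sketch: the web process forks an all-jobs-lister child per request, passes
// inputs via env, and relays the child's stdout as the response body.
// The helper binary name and TEAM variable are hypothetical.
package main

import (
	"log"
	"net/http"
	"os"
	"os/exec"
)

func listAllJobs(w http.ResponseWriter, r *http.Request) {
	cmd := exec.Command("concourse-all-jobs-lister")
	cmd.Env = append(os.Environ(), "TEAM="+r.URL.Query().Get("team"))

	out, err := cmd.Output() // a non-zero exit code surfaces as err
	if err != nil {
		http.Error(w, "listing jobs failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}

func main() {
	http.HandleFunc("/api/v1/jobs", listAllJobs)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```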

and reduce requirements.

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
@jamieklassen changed the title from "RFC: microservices" to "RFC: allow running web backend and frontend in separate processes" on Apr 2, 2020
@cirocosta (Member) commented Apr 2, 2020

Hey @pivotal-jamie-klassen, thanks for bringing this up!

I wanted to challenge a point with regards to the key problem (as you
mentioned, "improving observability of the web node"):

"I want to know which API endpoint is eating all my memory without pulling a heap dump."

(from #47 (comment))

Could you expand on why relying on heap dumps is undesirable?


Update: I just remembered that this is a thing: https://rakyll.org/profiler-labels/
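
For reference, a small sketch of what those profiler labels could look like on an API handler, so a CPU profile can be sliced by endpoint without splitting the process at all; the handler and label names are illustrative.

```go
// Sketch of runtime/pprof profiler labels: goroutines serving each endpoint
// are tagged, so `go tool pprof` can break samples down by endpoint.
package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
)

func labeled(endpoint string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		pprof.Do(r.Context(), pprof.Labels("endpoint", endpoint), func(ctx context.Context) {
			h(w, r.WithContext(ctx))
		})
	}
}

func main() {
	http.HandleFunc("/api/v1/jobs", labeled("ListAllJobs", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("[]")) // stand-in for the real jobs listing
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```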

How can we improve situations like this in the future? I would like the following acceptance criteria:

A way to report compute-resource consumption of Concourse by "member" (or smaller), such that
* data can be harvested in real-time

Review comment (Member):

Could you expand on what you meant by this? Do you mean that this should be continuously collected throughout the execution of the member, or that at the moment I need it, I should be able to retrieve it (i.e., take a snapshot)?

@jamieklassen (Member, Author) commented Apr 2, 2020

@cirocosta two benefits:

  • operators don't need to learn how to use pprof to reason about their systems
  • process-level metrics can be measured in real-time (like watching top, etc)

my hope is that by having the option to have the measurements in real time, overall anxiety is reduced - a big part of my story is frantically searching through code and that wasn't very fun. I'm not sure I can draw a strong link between "real-time" and "reduced anxiety" beyond the fact that "big enterprises like graphs".

EDIT: Maybe that's an important part of the story. The folks I was on Zoom with had to follow explicit instructions to collect a heap dump and send it to me over a file-sharing service, and then I would crack open `go tool pprof` to inspect it. The whole feedback cycle felt awkward.


# Proposal

Broadly, we need a solution that lets operators choose whether they want to run the various Concourse computations in `ifrit/grouper.Members` or actual separate unix processes -- these things are supposed to be nearly-interchangeable concepts anyway. We could provide this configurability at many levels of granularity, so let's focus this RFC on a simple case.
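
A sketch of the near-interchangeability this paragraph leans on: the same logical component wired either as an in-process `grouper.Member` or behind a runner that execs it as a separate unix process. The subcommand, env toggle, and placeholder work loop are assumptions, not the actual Concourse CLI.

```go
// Sketch: run the "backend" component either in-process or as a child unix
// process, selected by a hypothetical env toggle.
package main

import (
	"os"
	"os/exec"

	"github.com/tedsuo/ifrit"
	"github.com/tedsuo/ifrit/grouper"
)

// backendWork stands in for the backend component's real run loop.
func backendWork() ifrit.Runner {
	return ifrit.RunFunc(func(signals <-chan os.Signal, ready chan<- struct{}) error {
		close(ready)
		<-signals
		return nil
	})
}

// backendAsChildProcess runs the same component as its own unix process by
// re-execing this binary with a "backend" argument and forwarding signals.
func backendAsChildProcess() ifrit.Runner {
	return ifrit.RunFunc(func(signals <-chan os.Signal, ready chan<- struct{}) error {
		cmd := exec.Command(os.Args[0], "backend")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			return err
		}
		close(ready)
		done := make(chan error, 1)
		go func() { done <- cmd.Wait() }()
		select {
		case sig := <-signals:
			cmd.Process.Signal(sig)
			return <-done
		case err := <-done:
			return err
		}
	})
}

func main() {
	// Child mode: `<binary> backend` just runs the component directly.
	if len(os.Args) > 1 && os.Args[1] == "backend" {
		<-ifrit.Invoke(backendWork()).Wait()
		return
	}

	member := grouper.Member{Name: "backend", Runner: backendWork()}
	if os.Getenv("CONCOURSE_BACKEND_SEPARATE_PROCESS") == "true" { // hypothetical toggle
		member.Runner = backendAsChildProcess()
	}
	<-ifrit.Invoke(grouper.NewParallel(os.Interrupt, grouper.Members{member})).Wait()
}
```

Either way, the rest of the web node is wired identically; only the process boundary moves.
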
Review comment from @cirocosta (Member), Apr 2, 2020:

I'm struggling a bit to understand whether I'm looking at the RFC from the right angle 😬

e.g., if I take this exact sentence, and append to it a "so it solves what I'm trying to solve", is the following correct?

[...] we need a solution that lets operators choose whether they want to run the various Concourse computations in ifrit/grouper.Members or actual separate unix processes, so that observability of the web node is improved, being even able to know which API endpoint is eating all my memory without pulling a heap dump

(just trying to make sure I'm with the right mindset)

thanks!

Review comment from the author (@jamieklassen):

I've definitely made a mess here - as though talking about microservices didn't give it away. I should maybe restate my goal as "figure out WTF is eating my memory without a heap dump". In this case the RFC just proposes a way to figure out whether the frontend or backend is to blame.

Review comment from the author (@jamieklassen):

Another key thing that is missing is the ~3h offline conversation between @pivotal-jwinters and me where these goals were heavily revised...

@ringods commented Apr 2, 2020

I perceive this RFC as a mixture of a problem description and a solution that narrows the problem but does not solve it.

The problem in the narrative is one of observability, as the title mentions: you want operational insight into what is misbehaving.

While not wanting to promote this tool (not affiliated, BTW), may I point everybody to the Honeycomb blog? A number of their tracing articles describe the need for such insights at run time and define the problem as follows: find out about your unknown unknowns.

Reading further in the RFC, the proposal goes immediately into chopping up the process into smaller pieces. While this narrows the area of the code you have to look into, you still have the same problem, albeit on a smaller scale.

I wanted to make a counterproposal to integrate OpenTelemetry into the code base, but I was surprised to see that it is already in the Go dependencies:

https://github.com/concourse/concourse/blob/5d47620249f15478765e23075fc3576cd102812c/go.mod#L103-L105

I'm interested to hear whether the OpenTelemetry API usage is also fully integrated into the code base. If not, I would tackle that first, as it will give you the data needed to guide you in deciding what is best split off into separate processes.

Stop logging, start tracing! 😉
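
For what it's worth, a minimal sketch of what tracing one API handler with the OpenTelemetry Go API could look like (the API shown is the current one and may differ from the version pinned in Concourse's go.mod; tracer and span names are illustrative, and exporter setup is omitted):

```go
// Sketch: wrap an API handler in an OpenTelemetry span so slow requests show
// up in traces rather than logs. A real setup would also configure a tracer
// provider and exporter (e.g. OTLP) at startup.
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func listAllJobs(w http.ResponseWriter, r *http.Request) {
	ctx, span := otel.Tracer("atc/api").Start(r.Context(), "ListAllJobs")
	defer span.End()

	span.SetAttributes(attribute.String("team", r.URL.Query().Get("team")))

	_ = ctx // the handler's DB calls would take ctx and record child spans
	w.Write([]byte("[]"))
}

func main() {
	http.HandleFunc("/api/v1/jobs", listAllJobs)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```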

@cirocosta (Member) commented Apr 14, 2020

Reading concourse/concourse#5360 (comment):

"The web node CPU utilization looks like it's taken load off one instance, but not the other. Not sure how to interpret this."

reminded me of this RFC. Having web split into smaller pieces could indeed help (is most of the work coming from scheduling? from serving API requests? from scanning for checks?), even though I think (I might be wrong) that ultimately root-causing on a per-function basis would only be truly possible via profiling (which, yes, is very expensive to do on a continuous basis, but there are ways to mitigate that; e.g., see "Google-Wide Profiling").

@vito self-assigned this Apr 15, 2021