RFC: allow running web backend and frontend in separate processes #47
Conversation
Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
I love the new subcommand proposal. I've been wanting to separate these for a while; thanks for formalizing these ideas. One of the drivers for concourse/concourse#5305 was making gc easier to configure so that it's easier to separate out. I'm not sure about the EDIT: Oh, I see. The goal of the
47-microservices/proposal.md (outdated):
* `concourse ui` for the frontend, web handlers
* `concourse api` for the http and https servers
* `concourse backend` for the scheduler/gc/logcollector/syslogdrainer/etc
* `concourse gateway` for the TSA
It would also be nice to see a separate `concourse auth` component.

Also, I've actually attempted a spike to separate the
I'm starting to have serious reservations about the use of the term 'microservices' to describe this change. The key idea here is to improve observability of the web node. The proposed solution happens to leverage existing mechanisms from Linux by allowing operators to configure the web node to spread its work across multiple co-located processes. Microservices, by definition, are 'independently deployable services modeled around a business domain'. As is perhaps typical for discussions involving microservices, that is a way bigger change than I'm trying to suggest - I'm not trying to make any recommendations about deployment cycles or business domains or separating the persistence layer or distributed transactions etc etc etc. This is not at all intended to be an architectural suggestion at that level.
Given the difficulties of splitting ui/api, I think we can get some big wins just by splitting the atc into frontend (api, web, auth) and backend (gc, scheduler, lidar, etc.). Basically the current

Thoughts?
The key acceptance criterion is this: I want to know which API endpoint is eating all my memory without pulling a heap dump. Thinking lean here, suppose we simply ran separate unix processes. Each process has to have its own file handles, and the web node has three kinds of file handles of interest that I can think of: stdout/stderr for logs; the ones that point to config files, which I believe should be opened, read, and then closed as part of the startup process; and TCP sockets. What kind of sockets does it have?
The key consequence that stands out to me is what @pivotal-jwinters has pointed out: Which process actually serves the traffic, i.e. owns the HTTP listener? While we're at it, we probably want a consistent notion of a logical database connection pool - who will coordinate and make sure that the set of connections is equitably distributed between these processes? When we ask the question this way, the solution that suggests itself is to rely primarily on proxies:
We can also turn this question on its head. Let's assume that the web node continues its usual existence and manages all the interesting file handles - then how do we define the inter-process communication between the web process and the all-jobs-lister process? The first design that comes to mind when I ask the question this way would be to use the most-minimal unix IPC mechanisms - the web process is a parent of the all-jobs-lister, which it spins up on-demand, passing any information in the request via command-line parameters or environment variables, and when the all-jobs-lister terminates, the web receives the information it needs from an exit code or perhaps by reading stdout.
and reduce requirements. Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
Hey @pivotal-jamie-klassen, thanks for bringing this up! I wanted to challenge a point with regards to the key problem (as you
Could you expand on why relying on heap dumps is undesirable? Update: I just remembered that this is a thing: https://rakyll.org/profiler-labels/
How can we improve situations like this in the future? I would like the following acceptance criteria:

A way to report compute-resource consumption of Concourse by "member" (or smaller), such that
* data can be harvested in real-time
Could you expand on what you meant by this? Do you mean that it should be continuously collected throughout the execution of the member, or that at the moment I need it, I should be able to retrieve it (i.e., take a snapshot)?
@cirocosta two benefits:
My hope is that by having the option to see the measurements in real time, overall anxiety is reduced - a big part of my story is frantically searching through code, and that wasn't very fun. I'm not sure I can draw a strong link between "real-time" and "reduced anxiety" beyond the fact that "big enterprises like graphs". EDIT: maybe that's an important part of the story. The folks I was on Zoom with had to follow explicit instructions to collect a heap dump, send it to me over a file-sharing service, and then I would crack open
# Proposal
Broadly, we need a solution that lets operators choose whether they want to run the various Concourse computations in `ifrit/grouper.Members` or actual separate unix processes -- these things are supposed to be nearly-interchangeable concepts anyway. We could provide this configurability at many levels of granularity, so let's focus this RFC on a simple case.
I'm struggling a bit with understanding if I'm looking at the RFC with the right angle 😬
e.g., if I take this exact sentence, and append to it a "so it solves what I'm trying to solve", is the following correct?
[...] we need a solution that lets operators choose whether they want to run the various Concourse computations in `ifrit/grouper.Members` or actual separate unix processes, so that observability of the web node is improved, being even able to know which API endpoint is eating all my memory without pulling a heap dump
(just trying to make sure I'm with the right mindset)
thanks!
I've definitely made a mess here - as though talking about microservices didn't give it away. I should maybe restate my goal as "figure out WTF is eating my memory without a heap dump". In this case the RFC just proposes a way to figure out whether the frontend or backend is to blame.
The key thing that is also missing is the ~3h offline conversation between @pivotal-jwinters and me where these goals were heavily revised...
I perceive this RFC as a mixture of a problem description and a solution that narrows, but does not solve, the problem. The problem in the narrative is one of observability, as the title mentions: you want operational insight into what is behaving wrong. While not wanting to promote this tool (not affiliated, BTW), may I point everybody to the Honeycomb blog? A number of their tracing articles describe the need for such insights at run time and define the problem as follows: find out about your unknown unknowns.

Reading further in the RFC, the proposal goes immediately into chopping up the process into smaller pieces. While this narrows the area of the code you have to look into, you still have the same problem, albeit on a smaller scale. I wanted to make a counterproposal to integrate OpenTelemetry into the code base, but I was surprised to see that it is already in the Go dependencies. I'm interested to hear whether the OpenTelemetry API usage is also fully integrated in the code base. If not, I would first make work of that, as it will bring you the data needed to guide you in deciding what should best be split off as separate processes. Stop logging, start tracing! 😉
Reading concourse/concourse#5360 (comment) reminded me of this RFC - having