Add flux components slides #267

Merged
merged 5 commits into from Apr 4, 2024

vsoch
Member

@vsoch vsoch commented Mar 28, 2024

This PR adds an abstractions and architecture guide, which right now is our small set of documentation slides that go over flux (high level) and explain the space of projects. This addition includes:

  1. Various tweaks to fix build/config errors I noticed on other pages (the first commit)
  2. The index.rst (front page) links directly to the architectures page (first picture below)
  3. The guides page includes the new page (second picture)
  4. The architectures page directly links the slides, and they have the flux release code (third picture)

image

image

image

And it's quite pleasant to click through the slides and learn about flux! Each is very simple with words / pictures highlighted to make a point.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Problem: the flux components are diverse and can be confusing.
Solution: create an architecture page that includes a short
set of slides that go over high level concepts and projects.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Member Author

vsoch commented Mar 28, 2024

Oh neat, readthedocs has a new dashboard!
image

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch requested a review from garlick March 28, 2024 22:35
Contributor

@grondo grondo left a comment

Looks good thanks! I have one question about naming the slides "architecture slides", which I think may confuse some people.

.. _flux-architecture:

#################
Flux Architecture
Contributor

Is "architecture" the correct term here? This seems to describe a high-level overview of Flux Framework and many of the components that are currently projects under that umbrella. It could just be me, but when I see "Flux Architecture", I would expect to see something more along the lines of what's described here

Maybe this document should be called "Flux Framework Overview and Components"?

Member

@garlick garlick left a comment

Generally good! Here are a few nitpicky suggestions that you may or may not want to include:

  • Where does flux run? How about a third box that is "your laptop", since it is so easy to start a flux instance anywhere for experimentation (a nice feature)
  • The flux core slide is a bit skimpy. Maybe you could steal something from here and expand it into two or three slides? It does contain a lot that defines the overall architecture of Flux.
  • Flux-security: only required at an HPC site if using Flux as the native resource manager. Not required if running Flux under another RM like slurm. It is intended to collect all the tricky security bits of flux into one small, auditable, infrequently changing package
  • On the pmix slide I would specifically mention OpenMPI. I would hate to leave anyone wondering "Hey I use MPI, do I need PMIX?" Do you use openmpi? Ok maybe.
  • The segue into other flux projects is a little choppy and those sentences in boxes seem a little word salady to me. We just talked about components and now we're putting them inside each other? Please have another look at those and see if they still make sense to you. I'm not too sure what to suggest. Maybe given that this is a transition, something about bridging the HPC and cloud communities? It may not be bad to acknowledge their differences here?
  • At this point you are in your element and I wouldn't want to suggest that you do anything different!

@vsoch
Member Author

vsoch commented Apr 4, 2024

Thanks @garlick ! I'll get started on these changes and ping you when they are ready for a second review. I really appreciate it!

@vsoch
Member Author

vsoch commented Apr 4, 2024

@garlick the slides are updated - please take a look!

The segue into other flux projects is a little choppy and those sentences in boxes seem a little word salady to me. We just talked about components and now we're putting them inside each other? Please have another look at those and see if they still make sense to you. I'm not too sure what to suggest. Maybe given that this is a transition, something about bridging the HPC and cloud communities? It may not be bad to acknowledge their differences here?

I agree, but I don't have better ideas at the moment, primarily because all of these are changing so rapidly. The way I'm viewing these slides now is that the first section on flux projects is a cohesive thing, and the remaining sections (starting with fluence, for example) are separate references for when someone asks about them specifically. I think to map these into the same space we need a larger discussion / mapping out of the components and projects. I started a flux architecture initiative but it didn't pick up any interest, so I've only been doing this thinking as needed. I think we can just improve upon this over time.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@garlick
Member

garlick commented Apr 4, 2024

Better!

Note: see @grondo's comment about the title - I thought that was a good comment

  • In the core slides, I think I would not use the throughput example as 1 job/s is far less than what we can do normally, and we don't provide tooling to nest flux instances to increase throughput for a single stream of work. A more down to earth performance benefit of flux recursion is that, in contrast to a monolithic resource manager like slurm, batch jobs run as full flux instances, and thus could run a taxing workload (like a high throughput one) without impacting the parent Flux instance or other batch jobs.

  • How do nodes communicate? The first slide kind of sounds like the lead broker role is user-optional, and that followers connect directly to the leader socket. Maybe just say "one broker is designated as the leader, situated at the root of a tree based overlay network"? Followers join the overlay network.

  • In flux-security, I think the sentence about the tricky bits and auditability should be in the text of the first slide as that's fundamental to its existence as a component
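
To make the "leader at the root of a tree based overlay network" wording concrete, here is a minimal sketch of how followers could find their parent in such a tree. It is only an illustration: the fanout `k`, the rank numbering with the leader at rank 0, and the helper names `parent_rank` and `path_to_leader` are assumptions made for this example, not flux-core's actual implementation or API.

```python
from typing import List, Optional


def parent_rank(rank: int, k: int = 2) -> Optional[int]:
    """Return the parent of `rank` in a k-ary tree rooted at rank 0.

    Rank 0 plays the leader role (assumption for this sketch) and has no parent.
    """
    if rank == 0:
        return None
    return (rank - 1) // k


def path_to_leader(rank: int, k: int = 2) -> List[int]:
    """List the ranks a message passes through on its way up to the leader."""
    path = [rank]
    while path[-1] != 0:
        path.append(parent_rank(path[-1], k))
    return path


if __name__ == "__main__":
    # With fanout 2 and seven brokers, followers 1 and 2 attach to the leader,
    # and followers 3..6 attach to followers 1 and 2 rather than to the leader.
    for r in range(7):
        print(r, path_to_leader(r))
```

The point of the sketch is the one made above: most followers do not connect to the leader directly, they join the tree through an intermediate broker.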

Problem: we are not really talking about architecture.
Solution: rename to components.
@vsoch
Member Author

vsoch commented Apr 4, 2024

@grondo apologies I just missed your comment! Changes:

  • renamed to "Flux Components" (files, and documentation)
  • removed throughput slides
  • flux security "tricky bits" moved to first slide
  • rephrased the TBON slides
  • I added one slide for fluxion, which felt way too empty!

@garlick
Member

garlick commented Apr 4, 2024

Thanks!

Did all those changes get made? I'm still seeing the throughput example on slide 13 for example

@vsoch
Member Author

vsoch commented Apr 4, 2024

Is your browser caching? Here is a direct link to the slides: https://docs.google.com/presentation/d/10EchFMjJYFCZGa0CMWR1AwazGsLGXJYY5Nwj34nSZvg/edit?usp=sharing

And that set has the throughput ones removed:
image

@garlick
Member

garlick commented Apr 4, 2024

I meant the "real world example" slide which is the culmination of the throughput example (I see it there in the png)

@vsoch
Member Author

vsoch commented Apr 4, 2024

That's from the learning guide though - https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html#fully-hierarchical-resource-management-techniques

Why is it wrong? It's mostly meant to demonstrate why the instances are useful. There isn't really anything I can find that shows it beyond that.

@vsoch
Member Author

vsoch commented Apr 4, 2024

ok I added back the original slides and changed to "Here is a real world example that shows increasing throughput to 500 jobs/second with three instance levels." I think the real world example is still important - instances / nesting is one of Flux's key features and we don't do a good job anywhere of telling people why they might care. The job submission example is relatively simple (easy to understand) and I think achieves that. But if there is a better example I can definitely use that, I just don't know of one.

@garlick
Member

garlick commented Apr 4, 2024

Why is it wrong? It's mostly meant to demonstrate why the instances are useful. There isn't really anything I can find that shows it beyond that.

We don't support it with tooling, it fragments resources, and it's more of a stunt than a real solution to a problem. Also I hate to advertise 1 job/s when we can get

 garlick@system76-pc:~/proj/flux-core$ src/cmd/flux start src/test/throughput.py  -x
number of jobs: 100
submit time:    0.134 s (743.7 job/s)
script runtime: 0.649 s
job runtime:    0.550 s
throughput:     181.8 job/s (script: 154.1 job/s)
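
For readers who want to check how the job/s figures relate to the timings, here is a small arithmetic sketch. It is not the `throughput.py` test script itself; it only assumes that each rate is the job count divided by the corresponding elapsed time, and the helper name `jobs_per_second` is made up for this example.

```python
def jobs_per_second(njobs: int, seconds: float) -> float:
    """Rate of completed jobs, assuming rate = job count / elapsed time."""
    return njobs / seconds


if __name__ == "__main__":
    njobs = 100             # "number of jobs" from the output above
    job_runtime = 0.550     # "job runtime" in seconds
    script_runtime = 0.649  # "script runtime" in seconds

    print(f"throughput: {jobs_per_second(njobs, job_runtime):.1f} job/s")
    print(f"script:     {jobs_per_second(njobs, script_runtime):.1f} job/s")
```

The same division applied to the printed submit time (0.134 s) gives roughly 746 job/s rather than the reported 743.7, presumably because the displayed time is rounded to three decimals.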

The example I was suggesting is

A more down to earth performance benefit of flux recursion is that, in contrast to a monolithic resource manager like slurm, batch jobs run as full flux instances, and thus could run a taxing workload (like a high throughput one) without impacting the parent Flux instance or other batch jobs.

@chu11
Member

chu11 commented Apr 4, 2024

just some nits

slide 18 - I don't think the Go bindings are a part of flux-core. Perhaps say something along the lines of "other bindings like rust/go available in other projects"?

slide 26 - perhaps more generically "flux-security is needed when different users will be running jobs on the resources, such as when flux is the native resource manager on an HPC cluster". There are some people that install schedulers on clusters just for themselves and no one else.

slide 37 - sorry if it's just me, but the English here sounds weird to me "When we expose the flux sched bindings in Go, we create a plugin called Fluence" ... It sounds like the Go bindings are called Fluence. Do you mean "We exposed the flux-sched bindings in Go, which allowed us to create a plugin called Fluence"?

slide 45 - Not super knowledgeable of kubernetes speak, so it could be me ... the sentence is also a little run on, so I read this as ...

"We can map Flux components into containers AND kubernetes abstractions, this allows us to implement ...."

but I think you mean

"We can map Flux components into containers. By using kubernetes abstractions we can implement ...."

@vsoch
Member Author

vsoch commented Apr 4, 2024

@garlick I added that description and cut the other slides entirely. I think in the future we do want to have a convincing example, because it's hard to understand the recursive / nesting and people often need something that is more proof in the pudding than hypothetical.

@chu11

  • I removed the reference to the Go bindings
  • suggestion for flux-security applied
  • fluence suggestion too
  • I do mean flux components into abstractions. An abstraction in Kubernetes is (high level) a pod, config map, job. We have to figure out analogous Kubernetes abstractions to handle different components of Flux.

Member

@garlick garlick left a comment

LGTM!

@vsoch vsoch changed the title Add architecture slides Add flux components slides Apr 4, 2024
@vsoch
Member Author

vsoch commented Apr 4, 2024

@chu11 when it looks good to you we can merge.

@chu11
Member

chu11 commented Apr 4, 2024

looks good

@vsoch
Member Author

vsoch commented Apr 4, 2024

Thank you to you both!

@vsoch vsoch merged commit 07bf155 into master Apr 4, 2024
5 checks passed
@vsoch vsoch deleted the add-architecture-slides branch April 4, 2024 20:17

4 participants