add google experiment data
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Jun 7, 2023
1 parent 921314e commit eff10a8
Showing 14 changed files with 1,880 additions and 0 deletions.
1 change: 1 addition & 0 deletions google/service-timing/README.md
@@ -13,3 +13,4 @@ of experiments are attempting to investigate different aspects.
- [run8](run8) was one more attempt to reproduce the issue (done, and one huge timeout)
- [run9](run9) was the final case to replicate (did)
- [run10](run10) is the equivalent experiment but scaled up to a larger cluster
- [run11](run11) contains results from Dmitri on the Google networking team.
70 changes: 70 additions & 0 deletions google/service-timing/run11/README.md
@@ -0,0 +1,70 @@
# Google Experiments

These experiments were run by Google across 22 runs. We have data for:

- [workers.json](workers.json)
- [leaders.json](leaders.json)

The leaders (rank 0) are the most relevant, and their timings also reflect the workers
hooking up, so we can focus on them. That is also much less data to process and plot.
I think we are primarily interested in:

- 'init->quorum': RANK 0: reflects any delay in running rc1
- 'quorum->run': RANK 0: likely the time it takes to network
- 'run->cleanup': RANK 0: the runtime of LAMMPS
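
Each stage duration listed above is the difference between two consecutive state timestamps. The following is a minimal sketch of that calculation. It assumes, hypothetically, that each run is recorded as a mapping from state name to a timestamp in seconds; the actual layout of `leaders.json` is not documented here and may differ. The `stage_durations` helper and the sample values are illustrative only, not measured data.

```python
# Transitions of primary interest, taken from the list above.
TRANSITIONS = [
    ("init", "quorum"),
    ("quorum", "run"),
    ("run", "cleanup"),
]


def stage_durations(timestamps):
    """Return {"start->end": seconds} for each transition with both endpoints present.

    `timestamps` is an assumed layout: state name -> timestamp in seconds.
    """
    durations = {}
    for start, end in TRANSITIONS:
        if start in timestamps and end in timestamps:
            durations[f"{start}->{end}"] = timestamps[end] - timestamps[start]
    return durations


# Made-up sample record for one leader (rank 0); not real experiment data.
sample = {"init": 10.0, "quorum": 12.5, "run": 40.0, "cleanup": 95.0}
result = stage_durations(sample)
print(result)
```

Collecting these per-run dictionaries across all runs gives one list of durations per transition, which is the input a histogram needs.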

## Plots

### init->quorum

Reflects delay in running rc1.
This plot is probably not telling us anything interesting.


![lammps-hist-stage-init-to-quorum.png](lammps-hist-stage-init-to-quorum.png)

### quorum->run

Reflects the time it takes to network.
We can clearly see the group that doesn't have a zeromq timeout set.
Since the network isn't ready from the get-go, the timeout kicks in.
I don't think this should happen.

![lammps-hist-stage-quorum-to-run.png](lammps-hist-stage-quorum-to-run.png)
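
When a histogram shows two well-separated groups like this, one simple way to pull them apart for closer inspection is to split the sorted durations at their largest gap. The sketch below is illustrative only: `split_at_largest_gap` is a hypothetical helper, and the values are made up rather than taken from the Google runs.

```python
def split_at_largest_gap(values):
    """Split values into (low, high) groups at the largest gap between sorted neighbors."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return ordered, []
    # gaps[i] is the distance between ordered[i] and ordered[i + 1]
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Made-up durations in seconds; not real experiment data.
durations = [1.2, 0.9, 31.0, 1.1, 30.4, 29.8]
fast, slow = split_at_largest_gap(durations)
print(fast, slow)
```

This only makes sense when the groups are clearly separated; with overlapping groups the largest gap is not a meaningful boundary.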

### run->cleanup

Reflects the runtime of LAMMPS.
This is enormously concerning for such a small problem size on such a large
cluster! It reflects a huge cost in the networking and no benefit to the run.

![lammps-hist-stage-run-to-cleanup.png](lammps-hist-stage-run-to-cleanup.png)

### cleanup->shutdown

This plot is probably not telling us anything interesting.

![lammps-hist-stage-cleanup-to-shutdown.png](lammps-hist-stage-cleanup-to-shutdown.png)

### goodbye->exit

This plot seems to have three different groups - I'd be interested to know if
the zeromq timeout is related to this, but I don't think we collected that.

![lammps-hist-stage-goodbye-to-exit.png](lammps-hist-stage-goodbye-to-exit.png)

### join->init

![lammps-hist-stage-join-to-init.png](lammps-hist-stage-join-to-init.png)

### none->join

This plot is probably not telling us anything interesting.

![lammps-hist-stage-none-to-join.png](lammps-hist-stage-none-to-join.png)

### shutdown->finalize

This plot is probably not telling us anything interesting.

![lammps-hist-stage-shutdown-to-finalize.png](lammps-hist-stage-shutdown-to-finalize.png)