Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tools: Apply stats patch on trace_replay.ml #4

Merged
merged 1 commit into from Mar 1, 2023

Conversation

punchagan
Copy link
Contributor

No description provided.

@punchagan punchagan marked this pull request as ready for review February 28, 2023 10:35
@punchagan punchagan marked this pull request as draft February 28, 2023 10:49
let exec_commit rs ((time, message, c), hash) =
let level = rs.current_row.level in
Stat_recorder.set_stat_specs (specs_of_row rs.current_row);
let time = Time.Protocol.of_seconds time in
let c = on_lhs_context rs c in
let* hash' = Context.commit ~time ?message c in
on_rhs_hash rs hash hash';
Stdlib.Hashtbl.add rs.hash_per_level level hash';
Stdlib.Hashtbl.add rs.hash_per_level rs.current_block_idx hash';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For context, this line change is a hack to allow the selection of the GC commit at the previous split, as it turns out that the cycles can get misaligned otherwise... I'm only mentioning it in case it interfere with @adatario trace update, just so you know there's a trap there :P )

Copy link
Contributor Author

@punchagan punchagan Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may have fallen into the trap. With this patch applied, I see the following error when 8192 commits have been processed (which is the value of --gc-when):

Replaying trace   8192/145000 commits 09:23 (ETA: --:--) [---------------]   5%
replay: internal error, uncaught exception:
        File "src/trace/stats_trace_recorder.ml", line 927, characters 14-20: Assertion failed
        Raised at Lwt.Miscellaneous.poll in file "src/core/lwt.ml", line 3077, characters 20-29
        Called from Lwt_main.run.run_loop in file "src/unix/lwt_main.ml", line 31, characters 10-20
        Called from Lwt_main.run in file "src/unix/lwt_main.ml", line 118, characters 8-13
        Re-raised at Lwt_main.run in file "src/unix/lwt_main.ml", line 124, characters 4-13
        Called from Cmdliner_term.app.(fun) in file "cmdliner_term.ml", line 24, characters 19-24
        Called from Cmdliner_eval.run_parser in file "cmdliner_eval.ml", line 34, characters 37-44

The assertion here fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I've not seen this assert before hmhm any chance it could be related to the skip-gc perhaps? (since the assert seems unhappy that there was no GC?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha ok yeah you're right, it's also possible that it doesn't find a target commit the first time the GC could run due to a +-1 offset? (I've no idea why it still tries to reach for the latest GC stats though)

Copy link
Contributor Author

@punchagan punchagan Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the skip-gc part of the original patch and pushed it as a separate branch/commit. @adatario can look at this, when overhauling the GC triggering code. I'll report if I see any more failures on the replay benchmark run while dumping these stats.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ui, it's all a bit of a mine-field :)

I also think that this is an interference with skip-gc and the new recording of gc events.

Copy link
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! So excited to see the new graphs :)

Co-authored-by: Puneeth Chaganti <punchagan@muse-amuse.in>
@punchagan punchagan marked this pull request as ready for review February 28, 2023 12:35
@adatario
Copy link
Owner

Thanks @punchagan!

Looks good to me.

I think we just need to be careful with the fact that gc/split can now be caused by trace and also by our previous hacks. But with splitting out of the skip-gc option the effect of this should be limited.

One thought though: Would it make sense to integrate these additional out-of-band stats into the regular stats? That is, add entries to src/trace/stats.ml and handling then when running the manage_stats tool? Maybe that's something we can keep in list of potential/future improvements?

@adatario adatario merged commit 7763bcc into adatario:main Mar 1, 2023
@punchagan
Copy link
Contributor Author

One thought though: Would it make sense to integrate these additional out-of-band stats into the regular stats? That is, add entries to src/trace/stats.ml and handling then when running the manage_stats tool? Maybe that's something we can keep in list of potential/future improvements?

@art-w might be better equipped to comment on this. :)

@punchagan punchagan deleted the apply-stats-patch branch March 1, 2023 09:37
@art-w
Copy link
Contributor

art-w commented Mar 1, 2023

Yeah hm I've no idea, but I agree it sounds cleaner if possible :D This was a quick hack because I didn't want to mess with the existing stats (.. they might even be logging similar information already?)

@adatario
Copy link
Owner

adatario commented Mar 1, 2023

I ran into a little issue with this patch: All process write to the same file (/tmp/tezos_replay_stats). This is causing trouble on posada where multiple users are running multiple replays:

        Sys_error("/tmp/tezos_replay_stats: Permission denied")

I suggest adding a command line option that enables writing these traces to a specified location. E.g: tezos-context-reply --extra-gc-stats /tmp/tezos_replay_stats would enable writing to the specified file.

In the scripts we can then use mktemp to get safe temporary file locations.

@punchagan what do you think?

@punchagan
Copy link
Contributor Author

Sounds like a good idea. 👍

@adatario
Copy link
Owner

adatario commented Mar 2, 2023

Ran into another issue, when a replay is killed the /tmp/tezos_replay_stats is not properly cleaned up. When then re-running the replay, I get following error:

replay: internal error, uncaught exception:
        Invalid_argument("index out of bounds")
        Raised by primitive operation at Dune__exe__Trace_replay.Make.Commit_stats.print_stats in file "src/tools/replay/trace_replay.ml", line 672, characters 15-22
        Called from Dune__exe__Trace_replay.Make.Commit_stats.add in file "src/tools/replay/trace_replay.ml", line 684, characters 11-25
        Called from Dune__exe__Trace_replay.Make.exec_commit in file "src/tools/replay/trace_replay.ml", line 697, characters 4-23
        Called from Dune__exe__Trace_replay.Make.exec_block in file "src/tools/replay/trace_replay.ml", line 902, characters 18-36
        Called from Dune__exe__Trace_replay.Make.exec_blocks.(fun).aux in file "src/tools/replay/trace_replay.ml", line 941, characters 33-60
        Called from Progress_engine__Renderer.Make.Reporters.apply_all in file "src/progress/engine/renderer.ml", line 464, characters 54-59
        Called from Stdlib__Fun.protect in file "fun.ml", line 33, characters 8-15
        Re-raised at Stdlib__Fun.protect in file "fun.ml", line 38, characters 6-52
        Called from Dune__exe__Trace_replay.Make.run in file "src/tools/replay/trace_replay.ml", line 997, characters 32-75
        Called from Dune__exe__Replay.main in file "src/tools/replay/replay.ml", line 69, characters 15-30
        Called from Cmdliner_term.app.(fun) in file "cmdliner_term.ml", line 24, characters 19-24
        Called from Cmdliner_eval.run_parser in file "cmdliner_eval.ml", line 34, characters 37-44

Which seems to be caused by a dangling /tmp/tezos_replay_stats still lying around.

The suggestion with command-line option to specify path should also solve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants