Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrument mod_mam #4224

Merged
merged 9 commits into from
Mar 20, 2024
Merged

Instrument mod_mam #4224

merged 9 commits into from
Mar 20, 2024

Conversation

chrzaszcz
Copy link
Member

@chrzaszcz chrzaszcz commented Feb 15, 2024

This PR updates mod_mam to use mongoose_instrument and adds instrumentation tests.

mod_mam

All metrics are updated to use mongoose_instrument. The naming scheme is simple: they start with mod_mam_pm or mod_mam_muc, followed by the event name.

Measurements include some data like params and jid. This is mostly to differentiate between the events in big tests. IMO we could add items here on demand. So far I limited the additions, but there doesn't seem to be a any real cost, because they are not copied anywhere. I decided neither not to put them in separate metadata, nor to put the metrics in a separate sub-map, but we could do that if we see benefit.

Apart from that, I tried not to modify what is instrumented, but only how it's done - some parts of MAM could be instrumented in a better (more consistent) way.

Tests

Because it is the first time we are testing instrumentation, new test helpers were needed. The main helper module instrument_helper is started and stopped with the whole test suite.

In init_per_suite, you call instrument_helper:start(DeclaredEvents), providing a list of declared events ({EventName, Labels} tuples). It injects and starts instrument_event_table on the tested node (currently mim()). Two ETS tables are initialized:

  • instrument_event_table will collect declared events with measurements on mim(). Non-declared events are skipped.
  • instrument_event_status_table contains status (tested/untested) for each of the declared event tuples{EventName, Labels}.

In the tests, there are calls to instrument_helper:assert/3, which assert that a matching event with measurements is present, and mark {EventName, Labels} as tested.

In end_per_suite, you call instrument_helper:stop(), which checks the status of tested/untested events.

If any events were declared and logged (collected in instrument_event_table), but were not tested with assert/3, end_per_suite would fail.

Events that were declared, but not logged, indicate functionality, that is not covered by any tests. We have such functionality, and this is why currently there is no fail in this case. IMO, we could change this later. Alternatively, we could explicitly list all uncovered instrumentation, but it would be a bit tedious.

Metric tests, which checked exometer metrics replaced by the new instrumentation, are now removed.

Current test coverage

This is the current report from end_per_suite. There is no fail, because the untested events were not logged, i.e. the functionality is not covered by the tests.

[🔗](file:///.../big_tests.tests.mam_SUITE.logs/run.2024-03-18_09.58.00/mam_suite.end_per_suite.html#e-35252)

Tested instrumentation events:
 [{mod_mam_muc_archive_message,#{host_type => <<"localhost">>}},
  {mod_mam_muc_dropped_iq,#{host_type => <<"localhost">>}},
  {mod_mam_muc_flushed,#{host_type => <<"localhost">>}},
  {mod_mam_muc_lookup,#{host_type => <<"localhost">>}},
  {mod_mam_muc_remove_archive,#{host_type => <<"localhost">>}},
  {mod_mam_pm_archive_message,#{host_type => <<"localhost">>}},
  {mod_mam_pm_dropped_iq,#{host_type => <<"localhost">>}},
  {mod_mam_pm_flushed,#{host_type => <<"localhost">>}},
  {mod_mam_pm_get_prefs,#{host_type => <<"localhost">>}},
  {mod_mam_pm_lookup,#{host_type => <<"localhost">>}},
  {mod_mam_pm_remove_archive,#{host_type => <<"localhost">>}},
  {mod_mam_pm_set_prefs,#{host_type => <<"localhost">>}}]


*** User 2024-03-18 09:58:40.225 ***
[🔗](file://.../big_tests/ct_report/ct_run.test@m.2024-03-18_09.57.59/big_tests.tests.mam_SUITE.logs/run.2024-03-18_09.58.00/mam_suite.end_per_suite.html#e-35253)

Instrumentation events that were not logged - functionality not covered by tests:
[{mod_mam_muc_dropped,#{host_type => <<"localhost">>}},
 {mod_mam_muc_get_prefs,#{host_type => <<"localhost">>}},
 {mod_mam_muc_set_prefs,#{host_type => <<"localhost">>}},
 {mod_mam_pm_dropped,#{host_type => <<"localhost">>}}]

The following changes will be needed to cover all events, but they are excluded from this PR:

  • Testing prefs for MUC. The tests seem to be completely absent. It requires a separate story.
  • Sorting out and testing the dropped events. The events are inconsistent, e.g. they occur only for async writers. They should be amended and tested.

This was referenced Feb 15, 2024
@mongoose-im

This comment was marked as outdated.

Copy link

codecov bot commented Feb 15, 2024

Codecov Report

Attention: Patch coverage is 72.22222% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 84.43%. Comparing base (7f5ee86) to head (ce475da).

Files Patch % Lines
src/mam/mod_mam_muc.erl 63.15% 7 Missing ⚠️
src/mam/mod_mam_muc_rdbms_arch_async.erl 60.00% 2 Missing ⚠️
src/mam/mod_mam_pm.erl 91.30% 2 Missing ⚠️
src/mam/mod_mam_rdbms_arch_async.erl 60.00% 2 Missing ⚠️
src/mam/mod_mam_elasticsearch_arch.erl 0.00% 1 Missing ⚠️
src/mam/mod_mam_muc_elasticsearch_arch.erl 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/instrument    #4224      +/-   ##
======================================================
- Coverage               84.44%   84.43%   -0.02%     
======================================================
  Files                     557      556       -1     
  Lines                   33677    33633      -44     
======================================================
- Hits                    28440    28399      -41     
+ Misses                   5237     5234       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mongoose-im

This comment was marked as outdated.

@chrzaszcz chrzaszcz marked this pull request as ready for review February 15, 2024 15:49
Base automatically changed from instrument/poc to feature/instrument February 20, 2024 09:30
@chrzaszcz chrzaszcz changed the base branch from feature/instrument to instrument/label-consistency February 22, 2024 18:51
@mongoose-im

This comment was marked as outdated.

@chrzaszcz chrzaszcz force-pushed the instrument/label-consistency branch 6 times, most recently from 582da60 to 277171b Compare February 27, 2024 11:56
@mongoose-im

This comment was marked as outdated.

Base automatically changed from instrument/label-consistency to feature/instrument February 28, 2024 07:54
@mongoose-im

This comment was marked as outdated.

@chrzaszcz chrzaszcz force-pushed the instrument/mod_mam branch 3 times, most recently from 72a514f to 031fb81 Compare March 14, 2024 14:25
@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

@chrzaszcz chrzaszcz force-pushed the instrument/mod_mam branch 2 times, most recently from 0970da4 to c4bfdff Compare March 15, 2024 18:28
@mongoose-im

This comment was marked as outdated.

@mongoose-im

This comment was marked as outdated.

Testing procedure:
1. In init_per_suite, call instrument_helper:start(DeclaredEvents).
   The helper installs instrument_event_table on mim(),
   which will collect events with matching EventName and Labels.
2. In the tests, call instrument_helper:assert/3.
   The function fails if there is no matching event.
3. In end_per_suite, call instrument_helper:stop/0,
   which deletes the collected events, and logs a summary
   of event status (tested/untested).

Don't delete
Unify event names: mod_mam_pm_* and mod_mam_muc_*

No functional changes other than metric renaming
All instrumentation is done with mongoose_instrument now.
Purpose:
- More information in logs
- Ability tomatch specific events in big tests
The checks in mongoose_instrument would cause an exception.
Only events, that are already triggered, are instrumented.
Archive deletion was triggered only in end_per_testcase,
so I decided to add it to two test cases, which already tested
instrumentation.

The plan is to extend the tests to all events in a separate task.
@mongoose-im
Copy link
Collaborator

mongoose-im commented Mar 18, 2024

elasticsearch_and_cassandra_26 / elasticsearch_and_cassandra_mnesia / ce475da
Reports root/ big
OK: 435 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0


small_tests_25 / small_tests / ce475da
Reports root / small


small_tests_26_arm64 / small_tests / ce475da
Reports root / small


small_tests_26 / small_tests / ce475da
Reports root / small


ldap_mnesia_25 / ldap_mnesia / ce475da
Reports root/ big
OK: 2275 / Failed: 0 / User-skipped: 895 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_26 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4522 / Failed: 0 / User-skipped: 105 / Auto-skipped: 0


ldap_mnesia_26 / ldap_mnesia / ce475da
Reports root/ big
OK: 2275 / Failed: 0 / User-skipped: 895 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4522 / Failed: 0 / User-skipped: 105 / Auto-skipped: 0


dynamic_domains_mysql_redis_26 / mysql_redis / ce475da
Reports root/ big
OK: 4489 / Failed: 0 / User-skipped: 138 / Auto-skipped: 0


internal_mnesia_26 / internal_mnesia / ce475da
Reports root/ big
OK: 2415 / Failed: 0 / User-skipped: 755 / Auto-skipped: 0


mysql_redis_26 / mysql_redis / ce475da
Reports root/ big
OK: 4890 / Failed: 0 / User-skipped: 133 / Auto-skipped: 0


pgsql_cets_26 / pgsql_cets / ce475da
Reports root/ big
OK: 4439 / Failed: 0 / User-skipped: 174 / Auto-skipped: 0


dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / ce475da
Reports root/ big
OK: 4517 / Failed: 2 / User-skipped: 108 / Auto-skipped: 0

graphql_muc_light_SUITE:admin_http:admin_muc_light:end_per_group
{error,
 {{unregistering_failed,
   {amount,1},
   {unregistered_items,
  [{{<<"_admin_send_message_to_room_errors_1043">>,
     [{escalus_event_mgr,<0.24567.0>},
    {tc_name,admin_send_message_to_room_errors},
    {escalus_cleaner,<0.24566.0>},
    {watchdog,<0.24565.0>},
    {muc_light_host,<<"muclight.domain.example.com">>},
    {secondary_muc_light_host,<<"muclight.domain.example.org">>},
    {protocol,http},
    {schema_endpoint,admin},
    {listener_opts,
     #{module => mongoose_graphql_handler,path => "/api/graphql",
       host => "localhost",username => <<"admin">>,
       password => <<"secret">>,sse_idle_timeout => 3600000,
       schema_endpoint => admin}},
    {{ejabberd_cwd,mongooseim@localhost},
     "/home/circleci/project/_build/mim1/rel/mongooseim"},
    {preset,"odbc_mssql_mnesia"},
    {mim_data_dir,
     "/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE_data"},
    {tc_logfile,
     "https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4224/212820/odbc_mssql_mnesia.26.1.2-amd64/big/ct_run.test%4096a0c1a898e6.2024-03-18_09.08.02/big_tests.tests.graphql_muc_light_SUITE.logs/run.2024-03-18_09.12.27/graphql_muc_light_suite.admin_send_message_to_room_errors.html"},
    {tc_group_properties,[{name,admin_muc_light}]},
    {tc_group_path,[[{name,admin_http}]]},
    {data_dir,
     "/home/circleci/project/big_tests/_build/default/lib/mongoose_tests/ebin/graphql_muc_light_SUITE_data/"},
    {priv_dir,
     "https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4224/212820/odbc_mssql_mnesia.26.1.2-amd64/big/ct_run.test%4096a0c1a898e6.2024-03-18_09.08.02/big_tests.te...

Report log

graphql_muc_light_SUITE:admin_cli:admin_muc_light:admin_create_room_with_unprepped_id
{error,
  {#{what => invalid_response_code,expected_type => ok,
     response_code => {exit_status,1}},
   [{graphql_helper,assert_response_code,2,
      [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
       {line,256}]},
    {graphql_helper,get_ok_value,2,
      [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
       {line,239}]},
    {graphql_muc_light_SUITE,admin_create_room_with_unprepped_id,1,
      [{file,
         "/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE.erl"},
       {line,1156}]},
    {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
    {test_server,run_test_case_eval1,6,
      [{file,"test_server.erl"},{line,1302}]},
    {test_server,run_test_case_eval,9,
      [{file,"test_server.erl"},{line,1234}]}]}}

Report log

graphql_muc_light_SUITE:domain_admin:domain_admin_muc_light:admin_create_room_with_unprepped_id
{error,
  {{badmatch,null},
   [{graphql_muc_light_SUITE,admin_create_room_with_unprepped_id,1,
      [{file,
         "/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE.erl"},
       {line,1155}]},
    {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
    {test_server,run_test_case_eval1,6,
      [{file,"test_server.erl"},{line,1302}]},
    {test_server,run_test_case_eval,9,
      [{file,"test_server.erl"},{line,1234}]}]}}

Report log


pgsql_mnesia_25 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4911 / Failed: 0 / User-skipped: 112 / Auto-skipped: 0


pgsql_mnesia_26 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4911 / Failed: 0 / User-skipped: 112 / Auto-skipped: 0


mssql_mnesia_26 / odbc_mssql_mnesia / ce475da
Reports root/ big
OK: 4908 / Failed: 0 / User-skipped: 115 / Auto-skipped: 0


dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / ce475da
Reports root/ big
OK: 4519 / Failed: 0 / User-skipped: 108 / Auto-skipped: 0

@chrzaszcz chrzaszcz marked this pull request as draft March 18, 2024 11:42
@chrzaszcz chrzaszcz marked this pull request as ready for review March 18, 2024 11:42
Copy link
Collaborator

@NelsonVides NelsonVides left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments, but mostly reduced to already documenting any breaking change together with the PR introducing them (otherwise we're going to forget or potentially get confused down the road), and bypassing meck.

Comment on lines +92 to +94
#{time => Time,
time_per_message => round(Time / MessageCount),
count => MessageCount}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that this time_per_message key is too specific here, whoever will collect this data can already calculate so from the time and count keys, so maybe it is redundant, wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be. I just made the assumption of not removing anything that was there already, and I would like to do such changes separately. Of course it makes sense, and I thought about the same actually.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, I just hope not to bypass any of the things we've discovered here and there once we arrive to the moment of touching it all. Make sure to have notes about these points.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have follow-up stories for MAM already, so we can add this. But actually someone might use time_per_message, and it would be easier to switch the metric name than to also have to change the calculations? I think that "don't delete metrics - only rename or add" is still a good rule of thumb.

@@ -150,11 +147,13 @@ process_mam_iq(Acc, From, To, IQ, _Extra) ->
?LOG_WARNING(#{what => mam_max_delay_reached,
text => <<"Return max_delay_reached error IQ from MAM">>,
action => Action, acc => Acc}),
mongoose_metrics:update(HostType, modMamDroppedIQ, 1),
mongoose_instrument:execute(mod_mam_pm_dropped_iq,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this will change already the name of the metric in exometer, right? Don't we want to document these kinds of breaking changes already with the PR introducing them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we could just document all metrics at once as well, because almost everything will change - that was my reasoning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above 👍🏽

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the story for docs already, and it was my plan to do it there.

@@ -1851,10 +1861,6 @@ archived(Config) ->
#forwarded_message{result_id=ArcId} = parse_forwarded_message(ArcMsg),
?assert_equal(Id, ArcId),
ok
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also remove that ok if we're already removing dead code here 🤷🏽‍♂️

Comment on lines +10 to +11
meck:new(mongoose_instrument_log, [no_link, passthrough]),
meck:expect(mongoose_instrument_log, handle_event,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... There should be a better way than meck here. If we're already rpc-inserting a module here, and we're creating an ets table in the remote node, we can just rpc-insert a module that implements the mongoose_instrument behaviour already and does the ets inserts on its own, entirely bypassing meck.

Copy link
Member Author

@chrzaszcz chrzaszcz Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many ways without meck, and I tried the following:

  • Implementing mongoose_instrument_* is not easy, because all instrumentation is set up on startup. You would need to restart mongooseim.
  • Adding a special list of global handlers. This adds extra code executed for each event, that is only for tests, so I don't like it.
  • Log handler. This required a lot of work to silence the debug logs by reconfiguring existing handlers, and looked really messy. But actually a debug handler is what I would do if I had to get rid of meck.

I implemented all three of them, and the current one is the fourth attempt. I really don't see a cleaner solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate. Points 2 and 3 aren't good at all from my point of view, so I'd discard them before even trying. Point 1 is the one that sounds best to me, wonder if there's a way to rework how handlers are initialised so that they can be injected dynamically without restarting MIM. I'll make a note for this one, it can be in a separate PR, it's well encapsulated in the instrumentation helper modules.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add this functionality, because it's not needed during normal operation, and I didn't want something just for tests. Also, we want to avoid the situation where configuration (mongoose_config) is different from actual setup (an ad-hoc added handler). So it would be something like dynamic_modules, but for handlers. It's doable, but do we really need it?

@NelsonVides NelsonVides merged commit 94366fd into feature/instrument Mar 20, 2024
3 of 4 checks passed
@NelsonVides NelsonVides deleted the instrument/mod_mam branch March 20, 2024 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants