Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broken_config parsing wrong system.config.receive #626

Open
Xelynega opened this issue Mar 28, 2023 · 6 comments
Open

broken_config parsing wrong system.config.receive #626

Xelynega opened this issue Mar 28, 2023 · 6 comments

Comments

@Xelynega
Copy link

I'm running the 1.4.1 branch and am trying to create a device that passes the broken_config sequence test, but I've been running into issues and haven't been able to pass the test while others pass fine(extra_config, system_min_loglevel)

This is my understanding of what the sequencer tool is doing during the test(up to checking the response to the broken config):

  1. Sync state/config and check the state last_config matches the config timestamp
  2. Set min_loglevel to 100
  3. Wait for a system.config.apply log message, ignore log messages where timestamp < lastConfigUpdate
  4. Send the broken config
  5. Wait for a system.config.receive log message, ignore log messages where timestamp < lastConfigUpdate
  6. Check that the state has a stored status message with a warning/error(checkThatHasInterestingStatus(true))

The order of messages during the test from that my device is logging is:
TIME 1: <- config with testing.sequence_name of "reset_config" and min_loglevel of 200
TIME 2: -> state with last_config == TIME1
TIME 2: -> system.config.apply log of 300
TIME 3: <- config with testing.sequence_name of "reset_config", min_loglevel of 200, and system.extra_field of "reset_config"
TIME 4: -> state with last_config == TIME3
TIME 4: -> system.config.apply log of 300
TIME 5: <- config with testing.sequence_name of "reset_config", min_loglevel of 200, and system.extra_field of "reset_config"
TIME 6: -> state with last_config == TIME5
TIME 6: -> system.config.apply log of 300
TIME 7: <- config with testing.sequence_name of "broken_config" and min_loglevel of 200
TIME 8: -> state with last_config == TIME7
TIME 8: -> system.config.apply log of 300
TIME 9: <- config with testing.sequence_name of "broken_config" and min_loglevel of 100
TIME 10: -> state with last_config == TIME9
TIME 10: -> log event with system.config.apply log of 300, system.config.receive of 100, system.config.parse of 100
TIME 11: <- config received equal to b'{ broken because extra_field == break_json'
TIME 12: -> state with last_config == TIME9 and status message
TIME 12: -> log event with system.config.receive of 100, system.config.parse of 500

According to the verbose output of sequencer lastConfigUpdate gets set to TIME10 when system.config.receive from TIME10 is received by the sequencer, and TIME10 is the cutoff time for filtering out logs for the untilLogged call . What I think is happening is the system.config.receive message at TIME10 updates the "filter before" time, then since TIME10 >= TIME10, the call to untilLogged(SYSTEM_CONFIG_RECEIVE, SYSTEM_CONFIG_RECEIVE_LEVEL); after setExtraField("break_json") processes the system.config.receive message from TIME10 and calls checkThatHasInterestingStatus(true), which fails since the state hasn't been updates yet.

The example sequence log in validator/sequences/broken_config/sequencer.log shows that lastConfigUpdate was updated based on the system.config.receive message from T10(2023-03-07T16:40:01Z), the state with the error was sent at T12(2023-03-07T16:40:03Z), but the system.config.receive message that triggered the checkThatHasInterestingStatus call happened at T10:

2023-03-07T16:40:11Z DEBUG validation.feature.sequence stage finished waiting for log category system.config.receive level DEBUG was logged at 27s after 9s
2023-03-07T16:40:11Z TRACE validation.feature.sequence stage resume executing test at 27s
2023-03-07T16:40:11Z DEBUG validation.feature.sequence matching 2023-03-07T16:40:01Z DEBUG system.config.receive: success

Any help with this would be appreciated, thanks.

@Xelynega
Copy link
Author

Some additional information, It turns out that in my case I was able to pass the test by enforcing that logs are sent after the state messages instead of whatever order minimized context switching with my current implementation. Is it a requirement of the protocol to send state logs after the state message? I can't find any information that would indicate this is the case in the current documentation.

@grafnu
Copy link
Collaborator

grafnu commented Mar 29, 2023

Thank you for the detailed report! I'm starting to investigate... Short answer is "no, that is not a requirement" -- the messages are considered asynchronous in that there's no strict ordering required at that granularity. I don't have an answer for you yet, but I'm starting to dig through it.

@grafnu
Copy link
Collaborator

grafnu commented Mar 29, 2023

Ok, I have a preliminary fix on this working branch: https://github.com/grafnu/udmi/tree/logordering -- I was able to reproduce by introducing a before-state-send delay, and then the change I have in there clears it up, but if possible I'd like to get verification from you that it fixes the problem on your end. If that works, then I can cleanup and submit a PR proper.

@Xelynega
Copy link
Author

Xelynega commented Mar 29, 2023

Thank you for the quick reply. The logordering branch you've provided gets past that hasInterestingSystemStatus(true) check, but it's throwing an error on the checkNotLogged(SYSTEM_CONFIG_APPLY, SYSTEM_CONFIG_APPLY_LEVEL) even though the last one sent was for the config that set the minLogLevel to 100.

The system.config.apply message that throws the error has a timestamp of 2023-03-29T15:26:35Z. The state message with the system error status has a timestamp 2023-03-29T15:26:37.310223Z. The state message before the one with the system error status(that sets the min_loglevel to 100) has a timestamp 2023-03-29T15:26:35.530494Z, and the last log message related to lastConfigUpdate set it to 2023-03-29T15:26:35Z. I've attached the full log of the test with -v. Thanks for your help.

udmi.log

@grafnu
Copy link
Collaborator

grafnu commented Mar 30, 2023

Ok, I think I figured out what's going on and just pushed a potential fix to the same branch. Basically, in many situations the timestamps are quantized to a second, but the config updates might happen faster than that so it was intending to reject log messages from before the update but was including them since the (quantized) timestamps were a match. The log handling is a bit tricky because they end up asynchronous and out-of-order sometimes. The fix was just to add a fixed delay before updating the config, which should force log messages to be quantized usefully.

If this doesn't work then I might suggest we work out giving me access to the device entry in your cloud IoT instance (really just need the reflector private key you're using) and then I can run the tests locally on my end. But, hopefully this last iteration fixes the issue and there's no more!

@Xelynega
Copy link
Author

The latest change gets me to the end of the test, but fails at the seconds last line(the checkNotLogged call). It looks like the min_loglevel for the device is set to DEBUG(100), then the test checks that system.config.apply was sent, system.config.receive was not sent, and system.config.parse was not sent(both of which the device sends since it was just configured with a loglevel of DEBUG).

Changing the two checkNotLogged calls at the end of the test to untilLogged calls results in a consistent pass on my end, so I'm wondering if those are mistakenly swapped.

Thanks again for your help, let me know where I can pass the reflector key to you and I'll get it to you so you can run the sequence on your end if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants