Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: RuntimePolicies: Fix flake when validating logs #14529

Merged
merged 1 commit into from
Jan 8, 2021

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Jan 6, 2021

For the first test executed as part of RuntimePolicies, the validation of logs may include logs from the BeforeAll block's execution. Given we restart Cilium in that block, it may lead to JoinEP messages in logs if Cilium was stopped while endpoints were being regenerated.

An example of such a flake is visible at Cilium-PR-Runtime-4.9#3207. That test failed because JoinEP messages were found in logs. Using timestamps from test-output.log, we can check that the validation of logs included the BeforeAll block:

18:58:10 STEP: Running BeforeAll block for EntireTestsuite RuntimePolicies
[...]
19:03:27 STEP: Running JustAfterEach block for EntireTestsuite RuntimePolicies
time="2021-01-05T19:03:27Z" level=debug msg="running command: sudo journalctl -au cilium --since '316.7919274 seconds ago'"

This pull request changes our validation of logs in RuntimePolicies to ensure we only validate logs that correspond to the current test, excluding BeforeAll/BeforeEach blocks.

Related: e558100 ("test: RuntimeKVStoreTest: Fix flake when validating logs")

@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. ci/flake This is a known failure that occurs in the tree. Please investigate me! needs-backport/1.9 labels Jan 6, 2021
@pchaigno pchaigno requested a review from a team as a code owner January 6, 2021 08:43
@pchaigno pchaigno requested a review from a team January 6, 2021 08:43
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Jan 6, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.2 Jan 6, 2021
@pchaigno pchaigno requested a review from kkourt January 6, 2021 08:43
@pchaigno
Copy link
Member Author

pchaigno commented Jan 6, 2021

retest-runtime

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the intention, but it looks like this change will skip validating logs in the initial set up phases like -

Expect(vm.SetUpCiliumWithHubble()).To(BeNil())
ExpectCiliumReady(vm)

What if there is a panic in starting the cilium agent?

@pchaigno
Copy link
Member Author

pchaigno commented Jan 7, 2021

What if there is a panic in starting the cilium agent?

If there is a panic in starting the agent, the tests won't be able to pass anyway. They will fail with the usual "beforeAll failures". I think that's preferable given it highlights it's not the test itself failing but its preparation.

@aditighag
Copy link
Member

But doesn't it defeat the purpose of having ValidateNoErrorsInLogs? Aren't the cilium start-up logs are as critical, if not more when highlighting/validating logs in test runs?

// ValidateNoErrorsInLogs checks in cilium logs since the given duration (By
// default `CurrentGinkgoTestDescription().Duration`) do not contain `panic`,
// `deadlocks` or `segmentation faults` messages . In case of any of these
// messages, it'll mark the test as failed.

The test should be marked as failed when there are any of these messages in the start-up phase too. What do you think?

@pchaigno
Copy link
Member Author

pchaigno commented Jan 7, 2021

But doesn't it defeat the purpose of having ValidateNoErrorsInLogs? Aren't the cilium start-up logs are as critical, if not more when highlighting/validating logs in test runs?

Yes, but in this case, we start Cilium with our default configuration, which we already do in a bunch of other places. So if any of these "bad messages" pop up during startup here, they also will in other places.
That said, I agree we should ideally fail if those messages appear in this Cilium startup. Maybe we could add a call to ValidateNoErrorsInLogs in the BeforeAll block here. WDYT?

@aditighag
Copy link
Member

aditighag commented Jan 7, 2021

But doesn't it defeat the purpose of having ValidateNoErrorsInLogs? Aren't the cilium start-up logs are as critical, if not more when highlighting/validating logs in test runs?

Yes, but in this case, we start Cilium with our default configuration, which we already do in a bunch of other places. So if any of these "bad messages" pop up during startup here, they also will in other places.
That said, I agree we should ideally fail if those messages appear in this Cilium startup. Maybe we could add a call to ValidateNoErrorsInLogs in the BeforeAll block here. WDYT?

If I understand correctly, we want to validate logs from BeforeAll (once) as well as during every test run? I wonder if we make another call to ValidateNoErrorsInLogs in addition to JustAfterEach, would we get the desired behavior?

This is what I have in mind (I'm not sure if you also suggested something similar) -

BeforeAll(func() {
testStartTime = time.Now()
: 

vm.ValidateNoErrorsInLogs(time.Since(testStartTime))
}

& 

JustBeforeEach and JustAfterEach from your current patch

@pchaigno
Copy link
Member Author

pchaigno commented Jan 7, 2021

If I understand correctly, we want to validate logs from BeforeAll (once) as well as during every test run?

Yes. Your snippet of code is also what I had in mind :-)

I'll update the branch.

@pchaigno
Copy link
Member Author

pchaigno commented Jan 7, 2021

Hm, actually, if we do that, we'll get error from the previous shutdown again. We should stop Cilium, register time.Now(), then only start Cilium and check the logs. I'll update tomorrow morning.

For the first test executed as part of RuntimePolicies, the validation
of logs may include logs from the BeforeAll block's execution. Given we
restart Cilium in that block, it may lead to JoinEP messages in logs if
Cilium was stopped while endpoints were being regenerated.

An example of such a flake is visible at [1]. That test failed because
JoinEP messages were found in logs. Using timestamps from
test-output.log, we can check that the validation of logs included the
BeforeAll block:

    18:58:10 STEP: Running BeforeAll block for EntireTestsuite RuntimePolicies
    [...]
    19:03:27 STEP: Running JustAfterEach block for EntireTestsuite RuntimePolicies
    time="2021-01-05T19:03:27Z" level=debug msg="running command: sudo journalctl -au cilium --since '316.7919274 seconds ago'"

This commit changes our validation of logs in RuntimePolicies to ensure
we only validate logs that correspond to the current test, excluding
BeforeAll/BeforeEach blocks.

1 - https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/3207/testReport/junit/(root)/Suite-runtime/RuntimePolicies_L3_L4_Checks/
Related: e558100 ("test: RuntimeKVStoreTest: Fix flake when validating logs")
Reported-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-runtimepolicies-flake branch from 7fdc41d to 2770766 Compare January 8, 2021 16:29
@pchaigno
Copy link
Member Author

pchaigno commented Jan 8, 2021

test-me-please

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

We need to stop and start Cilium separately (vs. doing a restart)

Now that we are stopping cilium process, do we need to update the restart command that we were executing earlier?

This is where cilium is being restarted, but looks the function is called from couple of other places so we might just leave it as-is -

res = s.Exec("sudo systemctl restart cilium")

@pchaigno
Copy link
Member Author

pchaigno commented Jan 8, 2021

The Runtime CI pipeline is passing and reviews are in, so I think this is ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 8, 2021
@pchaigno pchaigno merged commit c6b4a5e into master Jan 8, 2021
@pchaigno pchaigno deleted the pr/pchaigno/fix-runtimepolicies-flake branch January 8, 2021 18:55
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.2 Jan 11, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.2 Jan 13, 2021
joamaki added a commit to joamaki/cilium that referenced this pull request Sep 9, 2021
When running runtime tests locally sometimes the test fail as level=error
log entries are found that are the result of cilium-agent restarts during
provisioning.

This is similar to the fix done in cilium#14529.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
kaworu pushed a commit that referenced this pull request Sep 10, 2021
When running runtime tests locally sometimes the test fail as level=error
log entries are found that are the result of cilium-agent restarts during
provisioning.

This is similar to the fix done in #14529.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
kaworu pushed a commit to kaworu/cilium that referenced this pull request Sep 27, 2021
[ upstream commit 38994b0 ]

When running runtime tests locally sometimes the test fail as level=error
log entries are found that are the result of cilium-agent restarts during
provisioning.

This is similar to the fix done in cilium#14529.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
kaworu pushed a commit to kaworu/cilium that referenced this pull request Sep 28, 2021
[ upstream commit 38994b0 ]

When running runtime tests locally sometimes the test fail as level=error
log entries are found that are the result of cilium-agent restarts during
provisioning.

This is similar to the fix done in cilium#14529.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
jibi pushed a commit that referenced this pull request Sep 29, 2021
[ upstream commit 38994b0 ]

When running runtime tests locally sometimes the test fail as level=error
log entries are found that are the result of cilium-agent restarts during
provisioning.

This is similar to the fix done in #14529.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
aanm pushed a commit that referenced this pull request Oct 2, 2021
[ upstream commit 38994b0 ]

When running runtime tests locally sometimes the test fail as level=error
log entries are found that are the result of cilium-agent restarts during
provisioning.

This is similar to the fix done in #14529.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
No open projects
1.9.2
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

None yet

5 participants