Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-3281. Add timeouts to all robot tests #723

Merged
merged 1 commit into from
Mar 27, 2020

Conversation

hanishakoneru
Copy link
Contributor

What changes were proposed in this pull request?

We have seen in some CI runs that the acceptance test suit is getting cancelled as it runs for more than 6 hours. Because of this, the test results and logs are also not saved.

This Jira aims to add a 5 minute timeout to all robot tests. In case some tests require more time, we can update the timeout. This would help to isolate the test which could be causing the whole acceptance test suit to time out.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3281

How was this patch tested?

CI acceptance test suit can test this change as it only adds a timeout to robot tests.

@hanishakoneru hanishakoneru requested review from elek and arp7 March 25, 2020 23:43
Copy link
Contributor

@vivekratnavel vivekratnavel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hanishakoneru for working on this. We can try adding these timeouts, but I'm not sure it will really help.

I performed a small experiment: added "10 seconds" timeout and sleep 30. Based on Robot logs, it seems the test was marked as failed after 10 seconds, but the command was not interrupted and still took 30 seconds to execute:

<msg timestamp="20200326 11:52:18.496" level="INFO">Running command 'sleep 30 2&gt;&amp;1'.</msg>
<msg timestamp="20200326 11:52:28.347" level="FAIL">Test timeout 10 seconds exceeded.</msg>
<status status="FAIL" endtime="20200326 11:52:28.347" starttime="20200326 11:52:18.495"></status>
</kw>
<status status="FAIL" endtime="20200326 11:52:48.466" starttime="20200326 11:52:18.495"></status>
</kw>

@hanishakoneru
Copy link
Contributor Author

@adoroszlai, I think even with that limitation the timeout will help us isolate the problem. Let's say the acceptance suit is cancelled, we could still get to know which test contributed to the time out.

Copy link
Contributor

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@elek
Copy link
Member

elek commented Mar 27, 2020

@adoroszlai, I think even with that limitation the timeout will help us isolate the problem. Let's say the acceptance suit is cancelled, we could still get to know which test contributed to the time out.

There are two timeouts:

  1. timeout of the test (measured between two steps)
  2. timeout of one step test steps

As far as I understood @adoroszlai warned us that even if we have a test level timeout it doesn't help at all, if 2nd is not in place. If one curl based command is hanging (and robot test doesn't do a kill) it won't be stopped (and we won't have any logs / results).

But I agree even without 2nd, it's good to have this patch.

On the other hand, I tested it with sleep, and it seems to be working for me...

*** Settings ***
Documentation       Timeout test
Library             OperatingSystem
Test Timeout        20 seconds
#Resource            commonlib.robot

*** Test cases ***
Execute PI calculation
                    ${output} =      Run                     sleep 60
                    Should Contain   ${output}               completed successfully
time robot test.robot
==============================================================================
Test :: Timeout test
==============================================================================
Execute PI calculation                                                | FAIL |
Test timeout 20 seconds exceeded.
------------------------------------------------------------------------------
Test :: Timeout test                                                  | FAIL |
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
Output:  /home/elek/projects/ozone/hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/smoketest/output.xml
Log:     /home/elek/projects/ozone/hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/smoketest/log.html
Report:  /home/elek/projects/ozone/hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/smoketest/report.html
robot test.robot  0.25s user 0.03s system 1% cpu 20.285 total

As you see my sleep command was killed after 20 seconds.

Note: originally I suggested to put the Timeout to the commonlib.robot to avoid code duplication, but I tested it and doesn't work.

@hanishakoneru
Copy link
Contributor Author

hanishakoneru commented Mar 27, 2020

Note: originally I suggested to put the Timeout to the commonlib.robot to avoid code duplication, but I tested it and doesn't work.

Yes. Learned that robot framework does not allow "global timeout" by design.

@adoroszlai, @elek are we good to merge this patch?

@adoroszlai
Copy link
Contributor

@adoroszlai, @elek are we good to merge this patch?

Yes, thanks.

@hanishakoneru
Copy link
Contributor Author

Thank you all for the reviews. Will merge this PR.

@hanishakoneru hanishakoneru merged commit eece604 into apache:master Mar 27, 2020
isahekmat pushed a commit to isahekmat/hadoop-ozone that referenced this pull request Mar 29, 2020
@adoroszlai
Copy link
Contributor

@elek @hanishakoneru a 6-hours run despite timeout settings: https://github.com/apache/hadoop-ozone/runs/548358523

@hanishakoneru
Copy link
Contributor Author

@elek @hanishakoneru a 6-hours run despite timeout settings: https://github.com/apache/hadoop-ozone/runs/548358523

This branch was not rebased to include the timeout change. Can we rebase and retry.

@adoroszlai
Copy link
Contributor

@elek @hanishakoneru a 6-hours run despite timeout settings: https://github.com/apache/hadoop-ozone/runs/548358523

This branch was not rebased to include the timeout change. Can we rebase and retry.

This run is a push build for 94413cd on master, which does have the timeout change eece604.

@elek
Copy link
Member

elek commented Apr 1, 2020

Bot not in topology/cli which is timed out. Will create a follow-up PR to copy it to there...

@adoroszlai
Copy link
Contributor

Bot not in topology/cli which is timed out. Will create a follow-up PR to copy it to there...

Good catch. But can you please explain why topology/cli would be the one that timed out?
Freon test from basic/basic was not yet done:

2020-03-31T10:47:00.9355111Z ozone-topology-basic :: Smoketest ozone cluster startup                       
2020-03-31T10:47:00.9355811Z ==============================================================================
2020-03-31T10:47:01.0327747Z Check webui static resources                                          | PASS |
2020-03-31T10:47:01.0328887Z ------------------------------------------------------------------------------
2020-03-31T16:26:10.4480521Z ##[error]The operation was canceled.

Compare this to a normal run:

2020-04-01T03:34:33.4721761Z ozone-topology-basic :: Smoketest ozone cluster startup                       
2020-04-01T03:34:33.4722354Z ==============================================================================
2020-04-01T03:34:33.5768474Z Check webui static resources                                          | PASS |
2020-04-01T03:34:33.5772576Z ------------------------------------------------------------------------------
2020-04-01T03:35:39.2555783Z Start freon testing                                                   | PASS |
2020-04-01T03:35:39.2557114Z ------------------------------------------------------------------------------
2020-04-01T03:35:39.2568636Z ozone-topology-basic :: Smoketest ozone cluster startup               | PASS |
2020-04-01T03:35:39.2573293Z 2 critical tests, 2 passed, 0 failed
2020-04-01T03:35:39.2573731Z 2 tests total, 2 passed, 0 failed
2020-04-01T03:35:39.2573985Z ==============================================================================
2020-04-01T03:35:39.2586020Z Output:  /tmp/smoketest/ozone-topology/result/robot-ozone-topology-ozone-topology-basic-scm.xml
2020-04-01T03:35:41.8733119Z ==============================================================================
2020-04-01T03:35:41.8738206Z ozone-topology-cli :: Smoketest ozone cluster startup                         
2020-04-01T03:35:41.8738825Z ==============================================================================
2020-04-01T03:35:43.3927035Z Run printTopology                                                     | PASS |
2020-04-01T03:35:43.3927937Z ------------------------------------------------------------------------------
2020-04-01T03:35:44.8556870Z Run printTopology -o                                                  | PASS |
2020-04-01T03:35:44.8557296Z ------------------------------------------------------------------------------
2020-04-01T03:35:44.8576945Z ozone-topology-cli :: Smoketest ozone cluster startup                 | PASS |
2020-04-01T03:35:44.8580699Z 2 critical tests, 2 passed, 0 failed
2020-04-01T03:35:44.8580837Z 2 tests total, 2 passed, 0 failed

@hanishakoneru
Copy link
Contributor Author

@adoroszlai , @elek how about we print a start and end time (or duration) for each test suit.

@hanishakoneru
Copy link
Contributor Author

Just saw that there is an option to list the timestamp of each message.
The last test to run is ozone-topology-basic/Check webui static resources. The next test in the suit is basic/Start freon testing which is probably the one which is running forever and hence leading to cancellation of the whole acceptance test suit.

Tue, 31 Mar 2020 10:46:59 GMT   Safe mode is off
Tue, 31 Mar 2020 10:47:00 GMT   ==============================================
Tue, 31 Mar 2020 10:47:00 GMT   ozone-topology-basic :: Smoketest ozone cluster startup                       
Tue, 31 Mar 2020 10:47:00 GMT   ==============================================
Tue, 31 Mar 2020 10:47:01 GMT   Check webui static resources                                          | PASS |
Tue, 31 Mar 2020 10:47:01 GMT   ----------------------------------------------------------------
Tue, 31 Mar 2020 16:26:10 GMT   ##[error]The operation was canceled.

@elek @adoroszlai

@elek
Copy link
Member

elek commented Apr 2, 2020

Good catch. But can you please explain why topology/cli would be the one that timed out?

You are right it seems to be the freon. But in this case:

  1. Why can't I see the timeout?
  2. Why freon is hanging?

@hanishakoneru hanishakoneru deleted the HDDS-3281 branch December 1, 2020 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants