LP1974205: Case memory/memory_stress_ng be terminated after performing for a while #51

beliaev-maksim · 2022-11-28T09:00:53Z

This issue was migrated from https://bugs.launchpad.net/plainbox-provider-checkbox/+bug/1974205

Summary

Status	Created on	Heat	Importance	Security related
Confirmed	2022-05-19 15:14:17	172	Critical	False

Description

[Summary]
The terminal that runs the memory/memory_stress_ng job is closed (the system appears to reboot) and the checkbox-cli process is terminated after the job has been running for a while.

The issue can also be observed on the Cinnamon Bay platform, where the system reboots after this case has been running for a while.

[Steps to reproduce]

Boot into OS
Run "checkbox-cli run com.canonical.certification::memory/memory_stress_ng"

[Expected result]
checkbox-cli won't be terminated.

[Actual result]
checkbox-cli will be terminated.

[Failure rate]
10/10

[Additional information]
CID: 202112-29802
SKU: TRBA-DVT2-C4
Base Image: dell-bto-jammy-jellyfish-tentacool-X07-20220331-4.iso
Product Name: XPS 13 9320
BIOS Version: 0.2.13
kernel-version: 5.15.0-23-generic

CID: 202203-30134
SKU: CB16T-DVT2-C1
Image: canonical-oem-somerville-jammy-amd64-20220504-33+jellyfish-minccino+X11
system-manufacturer: Dell Inc.
system-product-name: Precision 7670
bios-version: 1.3.1
CPU: 12th Gen Intel(R) Core(TM) i5-12600HX (16x)
GPU: 00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:4688] (rev 0c)
kernel-version: 5.17.0-9004-oem

Attachments

sosreport-V-high-memory-2022-06-28-zjcpqwv.tar.xz
checkbox-ng.zip

Tags:
['jammy-test', 'oem-priority', 'originate-from-1974073', 'originate-from-1981168', 'originate-from-1981180', 'originate-from-1982914', 'originate-from-1983006', 'originate-from-1983068', 'originate-from-1989083', 'somerville', 'stella', 'sutton']

beliaev-maksim · 2022-11-28T09:00:54Z

This thread was migrated from launchpad.net

https://launchpad.net/~bladernr wrote on 2022-05-19 17:50:07:

what is the config of the failing systems? is there adequate RAM and swap?

Adrian is currently working on tweaking this a bit based on suggestions from Colin King to not be as aggressive as we have been on some of the stressors, so the output of that may help here.

I suspect that on the one system where checkbox's terminal is being killed, OOM Killer is being aggressive and is killing off checkbox or the shell checkbox is running in.

The one that is rebooting is more troubling, that suggests something has gone seriously wrong and probably should be looked at separately. I can see, and somewhat understand the first case when running a very memory hungry stress tool, but I cannot accept a system rebooting because of that stress, that would be, IMO a catastrophic failure.

https://launchpad.net/~pieq wrote on 2022-05-24 07:49:42:

Regarding the test killing the checkbox process, I think we are starting to see this more and more often. Recently, another colleague made some changes related to this, but only for the project he's working on[1].

I agree with Jeff regarding the reboot: if the device reboots, it means the kernel and the system are not handling things properly. @peiyao you should file a bug about this for your project, if it hasn't been done yet.

[1] https://code.launchpad.net/~rickwu4444/zhongyi/+git/checkbox-provider-zhongyi/+merge/421194

https://launchpad.net/~baconyao wrote on 2022-05-24 09:42:38:

Re #1

List the failed DUTs below:

Cinnamon Bay:

CB16T-DVT2-C1 / 16 GB Memory
CB16T-DVT2-C2 / 64 GB
CB16P-DVT2-C1 / 16 GB
CB16P-DVT2-C2X / 64 GB
CB16P-DVT2-C3 / 64 GB
CB16P-DVT2-C4 / 32 GB
CB17-DVT2-C1 / 16 GB
CB17-DVT2-C3 / 32 GB

Orchid Bay

OCBY-DVT2-C1U / 8GB
OCBY-DVT2-C3 / 32 GB
OCBY-DVT2-C2 / 16 GB

https://launchpad.net/~huntu207 wrote on 2022-06-28 09:43:48:

still observed the memory_stress_ng test killing the checkbox process w/o reboot on new desktop platform

CID: 202206-30364
SKU: Venc-High-EVT-C1
Image: canonical-oem-somerville-jammy-amd64-20220504-33+jellyfish-zubat+X20
system-manufacturer: Dell Inc.
system-product-name: OptiPlex SFF 7010
bios-version: 0.3.13

https://launchpad.net/~binli wrote on 2022-07-01 05:36:58:

We also hit this issue on the sutton jammy image, and I opened an MR to fix it.

https://code.launchpad.net/~binli/plainbox-provider-checkbox/+git/plainbox-provider-checkbox/+merge/425934

https://launchpad.net/~clairlin wrote on 2022-07-28 08:49:35:

The issue can be reproduced on stella (BOG-SKU2, cid:202112-29752)

https://launchpad.net/~bladernr wrote on 2022-07-28 13:45:48:

before you go to all this trouble and hide the issue by disabling OOMD (by the way, oomd is supposed to be triggered and is an expected part of some stress-ng test cases), run the version of plainbox-provider-checkbox now in the dev PPA which has recently been modified to ease some of the load that was causing issues on some systems.

There are some memory stressors that are now run a bit less stressfully and that has resolved several issues that have cropped up on systems with low ram to core ratios and other configurations.

So please try that updated stress-ng-test script and see if that resolves the issue you're trying to fix here.

https://launchpad.net/~os369510 wrote on 2022-08-10 08:18:10:

Replying to comment #9: Weichen helped upgrade checkbox to the latest version, and the issue remains.

In my case (stella), I don't think the system is actually rebooting. Instead, the outcome depends on how memory/memory_stress_ng is launched.

From bug report https://bugs.launchpad.net/stella/+bug/1983006/comments/9

Aug 9 18:03:29 ubuntu systemd-oomd[921]: Killed /user.slice/user-1001.slice/user@1001.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-f6b6384b-b966-430c-a0a6-9d9a873e133d.scope due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 62.83% > 50.00% for > 20s with reclaim activity
...
Aug 9 18:03:29 ubuntu systemd-oomd[921]: Killed /user.slice/user-1001.slice/user@1001.service/session.slice/org.gnome.Shell@wayland.service due to memory pressure for /user.slice/user-1001.slice/user@1001.service being 75.56% > 50.00% for > 20s with reclaim activity

systemd-oomd killed the gnome-terminal and gnome-shell cgroups.
Thus, any process launched from gnome-terminal or gnome-shell is terminated as well.
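
For triage, the fields in the systemd-oomd journal lines quoted above can be extracted mechanically. The following is a minimal Python sketch and not part of checkbox: `parse_oomd_kill` is a hypothetical helper name, and the regular expression is an assumption fitted only to the two log lines in this comment, so other systemd versions may word the message differently.

```python
import re
from typing import Optional

# Pattern fitted to the journal lines quoted in this comment (assumption:
# the message wording is stable across the systemd versions of interest).
OOMD_KILL_RE = re.compile(
    r"systemd-oomd\[\d+\]: Killed (?P<victim>\S+) due to memory pressure "
    r"for (?P<monitored>\S+) being (?P<pressure>[\d.]+)% > "
    r"(?P<limit>[\d.]+)% for > (?P<duration>\d+)s"
)


def parse_oomd_kill(line: str) -> Optional[dict]:
    """Return the killed cgroup and pressure figures, or None if no match."""
    match = OOMD_KILL_RE.search(line)
    if match is None:
        return None
    return {
        "victim": match.group("victim"),        # cgroup that was killed
        "monitored": match.group("monitored"),  # cgroup whose pressure was measured
        "pressure": float(match.group("pressure")),
        "limit": float(match.group("limit")),
        "duration_s": int(match.group("duration")),
    }


# First log line from this comment, split only for readability.
SAMPLE = (
    "Aug 9 18:03:29 ubuntu systemd-oomd[921]: Killed "
    "/user.slice/user-1001.slice/user@1001.service/app.slice/"
    "app-org.gnome.Terminal.slice/"
    "vte-spawn-f6b6384b-b966-430c-a0a6-9d9a873e133d.scope "
    "due to memory pressure for "
    "/user.slice/user-1001.slice/user@1001.service "
    "being 62.83% > 50.00% for > 20s with reclaim activity"
)

if __name__ == "__main__":
    print(parse_oomd_kill(SAMPLE))
```

Note that the measured cgroup (`user@1001.service`) is the parent of the killed one, which is why anything started from the terminal goes down with it.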

So the question is: does this test case expect the gnome services (and other graphical.target services) to be terminated?

https://launchpad.net/~kent-jclin wrote on 2022-08-10 08:46:03:

@jeremy,

No, the test case does not expect the gnome services (and other graphical.target services) to be terminated.

https://launchpad.net/~os369510 wrote on 2022-08-10 11:41:05:

Then can checkbox avoid relying on gnome-terminal or gnome-shell? If checkbox runs the stressors under gnome-terminal or gnome-shell, they will likely be killed first.

I think setsid can prevent this, but checkbox would need to work out how to sync with the new session.

BTW, I think it might not happen in a checkbox-remote environment because, AFAIK, checkbox-remote relies on checkbox-ng.service.
It may be worth a try if someone can help.

https://launchpad.net/~pieq wrote on 2022-08-10 18:46:15:

I tried running this test using checkbox remote. Unfortunately, at some point the controller loses access to the DUT and, because of lp:1936477, the session cannot be resumed...

So, for the time being, checkbox remote is not an option either @_@)

https://launchpad.net/~kchsieh wrote on 2022-08-11 01:57:13:

@pieq

is it valid to run it with ssh?

https://launchpad.net/~pieq wrote on 2022-08-11 11:31:57:

@kc: It was a good idea to try.

I launched a test this morning:

ssh into the DUT (201912-27634 running up-to-date stock 22.04)
run checkbox-cli from the DUT, and select the memory/memory_stress_ng job

Test passed:

https://certification.canonical.com/hardware/201912-27634/submission/276148/test-results/pass/

I'm not sure what that means, though. Does it mean systemd did not trigger oom-killer? Does it mean the system actually behaved as expected?

https://launchpad.net/~kchsieh wrote on 2022-08-11 11:58:20:

@pieq

I tried to categorize the failed cases here [1]. They are both related to gnome-shell, but I don't think they are bugs, since gnome-shell does use a lot of RSS, which makes it more likely to be killed by systemd-oomd or oom-killer. An ssh session lets the test keep running even when the gnome-shell service is killed.

Actually, I don't know what I can fix when the test stops because of gnome-shell, because there is no fatal error or stack trace, and we didn't protect gnome-shell from being killed.

So I'd like to ask whether we can run it over ssh for platform certification.

[1] https://bugs.launchpad.net/somerville/+bug/1983068/comments/1

https://launchpad.net/~bladernr wrote on 2022-08-15 20:51:07:

@pieq it could be related to the recently landed changes that make the memory stress test run certain stressors less aggressively. Maybe... I know that resolved issues with lockups and tests being killed on some smaller server systems, and maybe that's the case here as well?

https://launchpad.net/~os369510 wrote on 2022-08-16 08:56:36:

I discussed this topic with Kent. QA may consider making this non-blocking for testing, but it does lead to a bad user experience (if the user session uses a lot of memory).

A session created by sshd avoids being monitored by systemd-oomd.
Also, if checkbox is somehow not able to use ssh during the test, then systemd-run might be the other option.
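
As an illustration of the systemd-run option, here is a minimal Python sketch that only builds the command line. It assumes checkbox-cli can run as a transient system service started with root privileges, which this thread has not verified; the unit name `checkbox-memory-stress` and the helper `build_systemd_run_cmd` are hypothetical. Without `--user`, systemd-run asks the system manager to start the unit, so the job would not live under `user@1001.service`.

```python
import shlex

# Job ID taken from the "Steps to reproduce" section of this bug.
JOB_ID = "com.canonical.certification::memory/memory_stress_ng"


def build_systemd_run_cmd(job_id: str, unit_name: str = "checkbox-memory-stress") -> list:
    """Build an argv that runs checkbox-cli as a transient system service.

    Assumption: the caller has root privileges and checkbox-cli behaves
    correctly outside a desktop session.
    """
    return [
        "systemd-run",
        f"--unit={unit_name}",  # hypothetical unit name, for easy journal lookup
        "--wait",               # block until the service finishes
        "--collect",            # unload the unit afterwards, even on failure
        "--pipe",               # connect stdin/stdout/stderr to the caller
        "checkbox-cli",
        "run",
        job_id,
    ]


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(build_systemd_run_cmd(JOB_ID)))
```

Whether systemd-oomd still acts on such a unit depends on the ManagedOOM settings of the slice it lands in, so this would need to be checked on the DUT.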

I am still wondering what the target of memory/memory_stress_ng is (hw? kswapd? oom-killer? systemd-oomd?), because that determines how we deal with this issue. (oom-killer will kill some processes before systemd-oomd does.)

BTW, I tried opening 30+ firefox tabs on a 4G system, and gnome-shell was not killed by systemd-oomd because the load did not reach the 50%/20s threshold (but stress-ng does).

Thus, it depends on the user scenarios.

Anyway, back to the bug title: "Case memory/memory_stress_ng be terminated after performing for a while".
If checkbox doesn't want to be killed, then please put checkbox or the stressors in a session other than the user session, via sshd or systemd-run.

For the bad user experience, I filed a separate bug: https://bugs.launchpad.net/oem-priority/+bug/1985887.

We cannot take further action until we know what this test case is meant to test.

https://launchpad.net/~kaihengfeng wrote on 2022-08-17 07:02:26:

Personally I would like to have stress-ng run twice, once with systemd-oomd and once without, so both worlds are covered.

https://launchpad.net/~jay-ch wrote on 2022-08-22 04:16:46:

Hi @pierre may I know the next step forward ?

The other related bug #1985887 (systemd kills gnome-shell or gnome-terminal) is invalid (expected behavior), and bug #1983068 (Screen freeze when running memory/memory_stress_ng) is blocked without an improved tool that can run through.

I believe the next step is to find a memory stressor that does not stress the system to the point where systemd kills the process, or to run the stressors via sshd.

can you or someone advise the test development plan onwards?

https://launchpad.net/~bladernr wrote on 2022-08-22 13:28:44:

would it be reasonable to adjust oomd's threshold? At least
experimentally? I only ask because I wonder if this isn't similar to
the issue with systemd-oomd killing off userspace applications.

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1972159

https://launchpad.net/~os369510 wrote on 2022-08-22 14:54:05:

Replying to comment #21: if so, you can set DefaultMemoryPressureLimit in oomd.conf,
or ManagedOOMMemoryPressureLimit on a unit[1].

BTW, Bug#1972159 is a different issue.
The stress_ng issue is the same as Bug#1985887.

[1] https://www.freedesktop.org/software/systemd/man/oomd.conf.html
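
For the threshold experiment discussed above, the drop-in could be generated along these lines. This is a minimal Python sketch: the `[OOM]` section and the `DefaultMemoryPressureLimit=` and `DefaultMemoryPressureDurationSec=` options come from the oomd.conf man page [1], while the helper name `render_oomd_dropin` and the example values (60%, 30s) are arbitrary assumptions, not recommended settings. If the monitored unit sets its own limit, this default would presumably not apply to it.

```python
def render_oomd_dropin(limit_percent: int, duration_sec: int) -> str:
    """Render the text of an oomd.conf drop-in with a custom default limit.

    The values are placeholders for experimentation only.
    """
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be in the range 1..100")
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return (
        "[OOM]\n"
        f"DefaultMemoryPressureLimit={limit_percent}%\n"
        f"DefaultMemoryPressureDurationSec={duration_sec}s\n"
    )


if __name__ == "__main__":
    # Print the drop-in; it would be saved under /etc/systemd/oomd.conf.d/
    # (the file name is up to the tester), followed by a restart of
    # systemd-oomd for the change to take effect.
    print(render_oomd_dropin(60, 30), end="")
```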

* Revise the echo string * Add: cupfreq module probe (canonical#51) * Add: cupfreq module probe * Check hwrng test (canonical#57) * Modify hwrng test more flexible * Add README.md * Make ce-oem-info jobs be attachment jobs (canonical#58) ce-oem-info jobs are aimed at collecting the system bootup information from snapd and cloud-init to help developers analyze bugs. Therefore, these jobs are more suitable to be attachment jobs and should not fail even if some information can not be found. For example, some systems may not have cloud-init. In these systems, the job cannot gather the information from cloud-init, but this should not be a failure. * Modify the device tree dump job (canonical#59) The current ce-oem-device-tree/dump job does not match the resource job format and will show warnings when running it. The original plan is to parse this job to match the resource format to let any other jobs able to get the dts info conveniently. As parse the dts to resource job format is a huge work, and will not be done in the near future. This commit excludes the ce-oem-device-tree/dump job from the test plan and add comments to explain the purpose. Also keep the attachment job to make sure we could get the device tree information when checking the report. * Revamp caam_hwrng_test by looping w/ smaller size (canonical#60) Some of the project devices' caam_hwrng have buffer, so previously the dd size is increased to 10M to exceed the buffer size to make the interrupt increase. It will make the device without buffer take a very long time (over 30 mins) to do the dd command. Therefore, this commit separated the size to 512K and looped it 20 times. On the device without buffer it could do 512K and detect the increased interrupt to get the pass result, which is much faster. On the device with buttfer it could also exceed the buffer by dd 20 times and detect the increased interrupt to get the pass result. 
* Fix the nested crypto test plan name (canonical#61) Previously the caam test plan was revamped to accelerator test plan, but forgot to update the name nested in ce-oem-automated. Fix the name in this commit. * Add: new socketcan stress test (canonical#62) * Add: new socketcan tests add new socketcan stress test (100000 loop echo test) add bus-off test and fixed bug * Add: include new socketcan tests (canonical#63) include new socketecan test plan into ce-oem test plans * Add ubuntu-frame and glmark2 tests (canonical#64) * Add ubuntu-frame and glmark2 tests * Modify the way checking ubuntu-frame active * Modify: Add more details in README * Add: implement the RPMSG-tty tests implement the RPMSG-tty tests for i.MX series * Fix: update tox configuration update tox configuration to perform tests with specific Python version * Add: check VPU device tests (canonical#68) * Add: check VPU device tests adding new test cases about identify VPU devices * Fix: fixed the issues from comments fixed the issues from comments * Fix: bug fixed bug fixed * Fix: fixed issues fixed issues * Fix: refactor the thermal tests (canonical#69) * Fix: refactor the thermal tests refactor the thermal tests * Fix: correct the argument name in socketcan_test (canonical#71) corrected the argument name in socketcan_test.py * Add dbus reboot stress (canonical#70) * Add script for installing test snap and connect interfaces * Add reboot via dbus command This test is for snap strict confinement mode * Add tcp test (canonical#65) * Add TCP test * Move ce-oem README inside the provider directory * Change the ce-oem provider namespace to the contrib namespace * Add README for the contrib area * Remove .gitignore from contrib/ Artefact left from the ce-oem provider migration --------- Co-authored-by: stanley31huang <stanley.huang@canonical.com> Co-authored-by: rickwu666666 <98441647+rickwu666666@users.noreply.github.com> Co-authored-by: Nara Huang <nara.huang@canonical.com> Co-authored-by: rickwu4444 
<rick.wu@canonical.com> Co-authored-by: baconyao <patrick.chang@canonical.com> Co-authored-by: liaou3 <vincent.liao@canonical.com> Co-authored-by: patliuu <111331153+patliuu@users.noreply.github.com> Co-authored-by: Patrick Liu <patrick.liu@canonical.com> Co-authored-by: hanhsuan <32028620+hanhsuan@users.noreply.github.com>

beliaev-maksim added the FromLaunchpad and Importance: Critical labels Nov 28, 2022

beliaev-maksim closed this as completed Nov 28, 2022

beliaev-maksim reopened this Nov 28, 2022

beliaev-maksim added the bug label Nov 28, 2022

beliaev-maksim closed this as completed Nov 28, 2022

beliaev-maksim reopened this Nov 28, 2022

LiaoU3 mentioned this issue Jan 11, 2023

Fix: Checkbox is killed by systemd-oomd when running memory/memory_stress_ng job #297

Merged

pieqq closed this as completed in #297 Feb 20, 2023

pieqq pushed a commit that referenced this issue Jan 12, 2024

Add: cupfreq module probe (#51)

d18c177

* Add: cupfreq module probe

pieqq pushed a commit that referenced this issue Jan 12, 2024

Add: cupfreq module probe (#51)

bdba990

* Add: cupfreq module probe

kissiel pushed a commit that referenced this issue Jan 12, 2024

Add: cupfreq module probe (#51)

0d8349d

* Add: cupfreq module probe
