DLPX-87572 sdb: want live kernel tests to find kernel regressions early #337

Open
wants to merge 1 commit into
base: develop
from

Conversation

@sdimitro sdimitro commented Aug 17, 2023

= Problem

With our switch to the new v5.15 kernel, a subset of SDB commands broke without us realizing it until we actually needed them. Our regression dumps help us ensure that we don't introduce regressions for older kernels when developing new features, but they can't help us detect changes in the upstream kernel or ZFS that break our commands.

= This Patch

This patch attempts to provide a rudimentary mechanism for catching regressions introduced by the upstream Ubuntu kernels by running a few basic SDB commands in a GitHub Action that runs nightly and for every PR.

Specifically, this patch adds such a test for each Ubuntu LTS kernel starting from 20.04 (currently the ubuntu-latest GitHub runner tag points to 22.04, so we test that release twice, but in the future that tag will point to 24.04, and so on).

We also test against all the available Python versions for each Ubuntu version to further ensure SDB's compatibility with future Python versions.

= Misc Notes

In order to use SDB in the GitHub runner I had to introduce an extra script that downloads the kernel's debug info. See the install-live-kernel-dbg.sh script for more info.

I also decoupled the apt-install of the python-dev files into its own shell script, as different Ubuntu versions ship with different Python versions. See install-python-dev.sh for more info.

= Potential Future Items

In the future we may want to detect when our ZFS commands get out of date. test_live_kernel.sh has a way of detecting whether the ZFS module is installed and running a few ZFS commands on the live kernel. The idea is that we can either introduce GitHub Actions, like upstream openzfs does, that install our kernel module on the runner and run the commands there, or create a BlackBox test that clones the repo and runs this script.
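
As a rough illustration of the detection idea described above, here is a minimal Python sketch. It is not the contents of test_live_kernel.sh (which is a shell script); the function names, the use of /proc/modules as the detection mechanism, and the `spa` command name are all assumptions made for the example.

```python
from typing import List


def module_is_loaded(proc_modules_text: str, name: str = "zfs") -> bool:
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the loaded-module list; return an empty string if unavailable."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return ""


def commands_to_run(zfs_loaded: bool) -> List[str]:
    """Pick which SDB commands to exercise (command lists are placeholders)."""
    commands = ["stacks", "threads"]  # basic commands, always run
    if zfs_loaded:
        commands.append("spa")  # assumed ZFS command, only run with the module
    return commands


if __name__ == "__main__":
    loaded = module_is_loaded(read_proc_modules())
    print("zfs loaded:", loaded)
    print("commands:", commands_to_run(loaded))
```

Matching on the first field rather than a substring avoids treating an unrelated module whose name merely contains "zfs" as the ZFS module.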

@sdimitro sdimitro marked this pull request as ready for review August 17, 2023 17:00
@sdimitro sdimitro force-pushed the ubuntu_live_test branch 26 times, most recently from 37c2944 to 3a92736 Compare August 22, 2023 20:46
@sdimitro sdimitro changed the title WIP DLPX-87572 sdb: want live kernel tests to find kernel regressions early Aug 22, 2023
@sdimitro sdimitro force-pushed the ubuntu_live_test branch 2 times, most recently from 1f1cb88 to abf539e Compare August 22, 2023 21:30
@sdimitro sdimitro force-pushed the ubuntu_live_test branch 3 times, most recently from 2ac14fc to 52feab9 Compare August 22, 2023 22:15

codecov-commenter commented Aug 22, 2023

Codecov Report

Merging #337 (bae4146) into develop (3a1fadc) will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop     #337      +/-   ##
===========================================
+ Coverage    85.18%   85.21%   +0.03%     
===========================================
  Files           67       67              
  Lines         3077     3070       -7     
===========================================
- Hits          2621     2616       -5     
+ Misses         456      454       -2     
| Files Changed | Coverage Δ |
|---|---|
| sdb/commands/linux/threads.py | 96.55% <ø> (ø) |
| sdb/commands/linux/stacks.py | 95.42% <100.00%> (+1.04%) ⬆️ |

sdimitro added a commit to sdimitro/sdb-dlpx that referenced this pull request Aug 22, 2023
= Problem

Currently having our own custom function for figuring out a task's
state has two drawbacks:

1] As we saw in a2bdd57 things can
   get out of date and it is up to us to fix them.

2] There are some weird edge cases that we don't handle as well
   like the following crash that I have never been able to reproduce
   locally but it occasionally reproduces in the Github runners
   of PR delphix#337:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/internal/repl.py", line 107, in eval_cmd
    for obj in invoke([], input_):
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/pipeline.py", line 152, in invoke
    yield from execute_pipeline(first_input, pipeline)
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/pipeline.py", line 84, in execute_pipeline
    yield from massage_input_and_call(pipeline[-1], this_input)
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/pipeline.py", line 67, in massage_input_and_call
    yield from cmd.call(objs)
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/command.py", line 413, in call
    yield from self.__invalid_memory_objects_check(
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/command.py", line 358, in __invalid_memory_objects_check
    for obj in objs:
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/command.py", line 625, in _call
    self.pretty_print(self.caller(objs))
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/commands/linux/stacks.py", line 407, in pretty_print
    self.print_stacks(filter(self.match_stack, objs))
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/commands/linux/stacks.py", line 382, in print_stacks
    for stack_key, tasks in KernelStacks.aggregate_stacks(objs):
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/commands/linux/stacks.py", line 375, in aggregate_stacks
    stack_key = (KernelStacks.task_struct_get_state(task),
  File "/usr/local/lib/python3.8/dist-packages/sdb-0.1.0-py3.8.egg/sdb/commands/linux/stacks.py", line 221, in task_struct_get_state
    return KernelStacks.TASK_STATES[(state | exit_state) & 0x7f]
KeyError: 101
```

= This Patch

Use the drgn helper, whose implementation handles more edge cases
and is more likely to stay up to date with the latest kernels
while keeping backwards compatibility.

= Testing

The above stack trace that I was able to reproduce consistently
in that PR no longer shows up with this patch.
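
To make the failure mode concrete, here is a small self-contained Python sketch of why a direct table lookup on the masked state bits can raise `KeyError: 101`. It is not SDB's code and not drgn's implementation: the flag values are illustrative stand-ins (the real ones live in the kernel headers and have changed between releases), and the "pick the highest set bit" strategy in `tolerant_state_name` is an assumption chosen for the example.

```python
# Illustrative single-bit state flags (assumed values, not authoritative).
TASK_STATES = {
    0x00: "RUNNING",
    0x01: "INTERRUPTIBLE",
    0x02: "UNINTERRUPTIBLE",
    0x04: "STOPPED",
    0x08: "TRACED",
    0x10: "EXIT_DEAD",
    0x20: "EXIT_ZOMBIE",
    0x40: "PARKED",
}
REPORT_MASK = 0x7F


def strict_state_name(state: int, exit_state: int) -> str:
    """Direct lookup: fails whenever more than one bit is set."""
    return TASK_STATES[(state | exit_state) & REPORT_MASK]


def tolerant_state_name(state: int, exit_state: int) -> str:
    """Reduce the bitmask to a single bit before looking it up."""
    bits = (state | exit_state) & REPORT_MASK
    if bits == 0:
        return TASK_STATES[0]
    highest = 1 << (bits.bit_length() - 1)  # keep only the highest set bit
    return TASK_STATES.get(highest, f"UNKNOWN(0x{bits:x})")


if __name__ == "__main__":
    # 101 == 0x65 has four bits set, so it is not a key in the table.
    try:
        strict_state_name(101, 0)
    except KeyError as err:
        print("strict lookup failed with KeyError:", err)
    print("tolerant lookup:", tolerant_state_name(101, 0))
```

On a live kernel the state fields can be read while a task is changing state, so combinations with several bits set are plausible inputs, which is one reason a lookup keyed on exact values is fragile.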
sdimitro added a commit to sdimitro/sdb-dlpx that referenced this pull request Aug 23, 2023
sdimitro added a commit that referenced this pull request Aug 23, 2023