
mgr/diskprediction_local: Add tests for device failure predictors #37513

Closed
wants to merge 7 commits

Conversation

chauhankaranraj
Contributor

This PR adds a sample smartctl JSON file and unit tests that check that the Red Hat predictor and ProphetStor predictor models do not throw errors and yield valid outputs when using the versions of the Python packages specified in requirements.txt.

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Karanraj Chauhan <kachau@redhat.com>

Hey @yaarith, I wrote some tests to address point 1 of this issue. Please lmk what you think :)

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


Add a sample smartctl JSON and unit tests to check that the Red Hat
predictor and ProphetStor predictor models function properly
and give valid outputs.

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Karanraj Chauhan <kachau@redhat.com>

# predict and check if value is a valid output
pred = rh_predictor.predict(self.predict_datas)
self.assertIn(pred, self.valid_outputs)
Contributor

We should use the assertion to verify the specific output we expect the model to return for a specific input (rather than any of the model's possible outputs in general).

Same in def test_prophetstor_predictor(self).
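
For illustration, a minimal sketch of the suggested change, reusing rh_predictor and self.predict_datas from the snippet quoted above; the expected label "good" follows the healthy-disk sample discussed later in this conversation and is an assumption here, not necessarily the exact value asserted in the PR:

# predict and assert the specific output expected for the
# known-healthy sample input (expected label assumed)
pred = rh_predictor.predict(self.predict_datas)
self.assertEqual(pred, 'good')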

Contributor Author

Makes sense, I'll update it 😄

Contributor Author

Thanks for being patient. I've updated the tests to check for the specific expected output, and also added a check to ensure the models don't crash on an invalid JSON input.
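
For illustration, a rough sketch (not the exact test added in this PR) of what the invalid-input check might look like; the malformed input and the graceful-failure expectation below are assumptions based on the description above:

# feed a malformed smartctl JSON and make sure the predictor
# reports the problem instead of raising (input shape assumed)
invalid_input = [{'not_a_smartctl_field': 1}]
try:
    rh_predictor.predict(invalid_input)
except Exception as e:
    self.fail('predictor raised on invalid input: %s' % e)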

Karanraj Chauhan and others added 5 commits October 7, 2020 16:50
Updated sample input to be a JSON that we know is *unambiguously*
in a "good" health state.

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Karanraj Chauhan <kachau@redhat.com>
Updated tests to check whether the models specifically output "good",
and not just one of "good", "warning", "bad", for the sample input
JSON that is known to be representative of a healthy disk.

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Karanraj Chauhan <kachau@redhat.com>
Updated tests to check that the models don't throw errors, and instead
show a message and exit gracefully, when the input is an invalid
smartctl JSON.

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Karanraj Chauhan <kachau@redhat.com>
diskprediction_local unit tests do not have tox (make check)
declarations.

tox.ini installs the requirements needed by the diskprediction_local
module, so the tests run in the expected environment.

Tests can now be run with:
$ ./do_cmake
$ cd build
$ ctest -R run-tox-mgr-diskprediction_local -V

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
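
For context, a rough sketch of what such a tox.ini might contain; this is an assumption for illustration only, not the exact file added in this PR:

[tox]
envlist = py3
skipsdist = true

[testenv]
deps =
    -rrequirements.txt
    pytest
commands = pytest {posargs}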

In order to verify that the diskprediction_local models function properly
across all supported distributions, we can simply run their unit tests
via this rados suite.
We do not need to bring up a cluster, since we use smartctl JSON samples
to mock the input to the models.

This by itself will exercise the loading of the required dependencies,
and will provide us with an accurate indication of how the
diskprediction_local module behaves on each distribution.

Please note:
1. These tests will fail when:
  - sklearn or numpy are not installed.
  - sklearn or numpy are installed, but with different (older or newer)
    versions which cause the code to break (due to lack of backward
    compatibility, or API changes).

2. Currently the tests print warnings that several functions used by the
   models will be deprecated in newer sklearn versions, but the tests are
   considered to pass successfully.

3. Nautilus runs Python 2; sklearn 0.20 was the last version to support
   Python 2.7.

Fixes: https://tracker.ceph.com/issues/47448
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
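
For illustration, a rough sketch of the sample-JSON mocking approach described in the commit message above, reusing rh_predictor from the earlier snippet; the file name and the shape of the data handed to the predictor are assumptions:

import json
import os

# load the checked-in smartctl sample so no live cluster or real
# device is needed (file name assumed for illustration)
sample_path = os.path.join(os.path.dirname(__file__), 'sample_smartctl.json')
with open(sample_path) as f:
    predict_datas = json.load(f)

pred = rh_predictor.predict(predict_datas)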
@yaarith
Contributor

yaarith commented Oct 19, 2020

jenkins retest this please

Logic was missing, and diskprediction_local CMake tests could not
pass.

Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
@yaarith
Contributor

yaarith commented Oct 20, 2020

Regarding backporting these tests to Octopus and Nautilus:

The missing sklearn dependency is fixed and merged in master:
#35648
but could not be backported to Octopus:
#37376
and probably cannot be backported to Nautilus for the same reason.
This means the tests will never pass in Octopus and Nautilus, so we will not backport them until we find a solution.

Whenever we do backport to Nautilus though, we should change:
https://github.com/ceph/ceph/pull/37513/files#diff-f3ba8fbe68e55784d2e2b3617712a2d6de00f1b18dfb447657d5f45d0fe96596R2
to:
envlist = py27

@yaarith
Contributor

yaarith commented Nov 23, 2020

Update: we are still waiting for the scikit-learn package to become available in EPEL 8.

@tchaikov
Contributor

tchaikov commented Dec 6, 2020

Update: we are still waiting for the scikit-learn package to become available in EPEL 8.

@yaarith I cannot find a Bugzilla ticket asking for python3-scikit-learn in EPEL 8, nor can I find an epel8 branch at https://src.fedoraproject.org/rpms/python-scikit-learn. Have we pinged the maintainer about creating the epel8 branch?

@yaarith
Contributor

yaarith commented Dec 7, 2020

@yaarith I cannot find a Bugzilla ticket asking for python3-scikit-learn in EPEL 8, nor can I find an epel8 branch at https://src.fedoraproject.org/rpms/python-scikit-learn. Have we pinged the maintainer about creating the epel8 branch?

Hi @ktdreyer, regarding @tchaikov's comment here, I'm not sure what the next steps are once we have the packages ready on our end.

@yaarith
Contributor

yaarith commented Jan 6, 2021

@tchaikov after a discussion with @ktdreyer, I contacted the maintainer, asking if he can maintain the EPEL (7 and 8) versions. Will update with progress.

@tchaikov
Contributor

FWIW, https://copr.fedorainfracloud.org/coprs/tchaikov/python-scikit-learn/

@stale

stale bot commented Jul 21, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jul 21, 2021
@rzarzynski rzarzynski requested a review from yaarith March 8, 2022 17:37
@stale stale bot removed the stale label Mar 8, 2022
@djgalloway djgalloway changed the base branch from master to main July 8, 2022 00:00
@djgalloway djgalloway requested a review from a team as a code owner July 8, 2022 00:00
@github-actions

github-actions bot commented Sep 6, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 6, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Oct 28, 2022