mgr/diskprediction Add diskprediction plugin service #22239

Closed
wants to merge 23 commits into from

hsiang41
Contributor

@hsiang41 hsiang41 commented May 25, 2018

The DiskProphet plugin service continuously collects time series data and sends it to a DiskProphet server. Users have the option to fetch the predicted health state of the physical disk backing an OSD. The prediction result for the physical disk is stored in the ceph device info (#22423).

The plugin has two modes.
Local - The plugin includes an internal predictor module. It can use device health data to do a simple prediction.
Cloud - This mode relies on the data pushed by the plugin, which includes ceph cluster/mon/osd status and workload, to predict device health.

Signed-off-by: Rick Chen rick.chen@prophetstor.com

To test, ping sage in #ceph-devel for a credential to use, or see /ceph/diskprediction_config.txt on teuthology.

@liewegas liewegas added the mgr label May 25, 2018
@hsiang41 hsiang41 force-pushed the mgr/diskprophet branch 5 times, most recently from b847f8b to 3cd1144 Compare May 29, 2018 06:33
@liewegas liewegas added the DNM label Jun 1, 2018
@liewegas
Member

liewegas commented Jun 1, 2018

Discussed offline yesterday. Summary:

  • the module should avoid any local storage in files and use the cluster instead
  • we'll hold off on this until we have some generic infrastructure in place to (1) map device ids to osds and vice versa, and (2) associate predicted failures with devices
  • the module should try to avoid any branding (or be distributed out of tree)
  • the API to query the prediction service should be clearly documented

@liewegas
Member

liewegas commented Jun 1, 2018

Blocked by this work: https://pad.ceph.com/p/smart

@hsiang41 hsiang41 force-pushed the mgr/diskprophet branch 3 times, most recently from 92a3568 to 5bc4cbc Compare June 5, 2018 09:05
@hsiang41 hsiang41 changed the title mgr/diskprophet Add diskprophet plugin service mgr/diskprediction Add diskprophet plugin service Jun 7, 2018
@hsiang41 hsiang41 changed the title mgr/diskprediction Add diskprophet plugin service mgr/diskprediction Add diskpredictionplugin service Jun 7, 2018
@hsiang41 hsiang41 changed the title mgr/diskprediction Add diskpredictionplugin service mgr/diskprediction Add diskprediction plugin service Jun 7, 2018
@hsiang41 hsiang41 force-pushed the mgr/diskprophet branch 8 times, most recently from 4d237ce to 407813e Compare June 7, 2018 08:12
@liewegas
Member

liewegas commented Jun 8, 2018

Discussed offline today:

  • Will consume the device tracking in mgr: add device id tracking #22423, with a change to that PR that specifies the life expectancy as a range instead of a specific date (e.g. 0-4 weeks, 4-6 weeks, 6+ weeks)
  • Prophetstor plans to open source a simplified prediction algorithm and include it in a mgr module!
  • That algorithm will rely on some recent history of smart data. The plan is to resurrect/adapt parts of DNM: pybind/mgr: update smart mgr module #21301 to store this in rados.
  • The smart module will scrape the latest data and then either (1) query an api to report metrics and get back a predicted life (prophetstor's commercial product or some other service implementing the API), or (2) use the built-in mode that records the latest sample in rados and generates a new prediction based on its model.
  • (An alternative approach would be to implement 2 as a separate mgr service/module that implements the same REST API that 1 consumes. This makes for nice slideware but it is more work to implement.)
  • Both modes will take the resulting life expectancy and feed it back to 'ceph device set-life-expectancy'. Another module (or part of the same module) will implement the policy that automatically responds to looming device failures by marking OSDs out or raising health alerts.

@hsiang41 Let me know if I missed anything or got it wrong!
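
A minimal sketch of the range idea discussed above: a prediction is mapped to one of the example buckets (0-4 weeks, 4-6 weeks, 6+ weeks) and turned into a from/to date pair that could be handed to 'ceph device set-life-expectancy'. The bucket names, the helper functions, the ISO date format, and the argument order are assumptions for illustration, not the interface that #22423 defines.

```python
from datetime import date, timedelta

# Hypothetical buckets mirroring the example ranges in the comment above.
# Each maps to (min_weeks, max_weeks); None means open-ended.
BUCKETS = {
    "0-4 weeks": (0, 4),
    "4-6 weeks": (4, 6),
    "6+ weeks": (6, None),
}


def life_expectancy_range(bucket, today):
    """Return (from_date, to_date) for a bucket; to_date is None if open-ended."""
    min_weeks, max_weeks = BUCKETS[bucket]
    start = today + timedelta(weeks=min_weeks)
    end = None if max_weeks is None else today + timedelta(weeks=max_weeks)
    return start, end


def set_life_expectancy_args(dev_id, bucket, today):
    """Build a CLI argument list; argument order and date format are assumed."""
    start, end = life_expectancy_range(bucket, today)
    args = ["ceph", "device", "set-life-expectancy", dev_id, start.isoformat()]
    if end is not None:
        args.append(end.isoformat())
    return args


if __name__ == "__main__":
    print(set_life_expectancy_args(
        "WDC_WD6002FFWX-68TZ4N0_K1GX50LD", "4-6 weeks", date(2018, 6, 8)))
```

Keeping the range as a pair of dates, with an open upper end for the last bucket, means both the cloud mode and the built-in mode could feed the same call.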


The connection settings can be configured on any machine with the proper cephx
credentials; they are usually the monitor node with client.admin keyring.
Run the following command to set up connection betweet Ceph system and
Member

s/betweet/between/

@hsiang41 hsiang41 force-pushed the mgr/diskprophet branch 2 times, most recently from 50ed034 to fabaaca Compare August 29, 2018 04:45
Add the plugin failure reason to the command status.
Rename part of the command prefix to device.
Change the local predictor to rely on the devicehealth history.

Signed-off-by: Rick Chen rick.chen@prophetstor.com
@liewegas
Member

I'm getting

$ bin/ceph device predict-life-expectancy WDC_WD6002FFWX-68TZ4N0_K1GX50LD
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2018-08-31 17:18:14.302 7faa22924700 -1 WARNING: all dangerous and experimental features are enabled.
2018-08-31 17:18:14.321 7faa22924700 -1 WARNING: all dangerous and experimental features are enabled.
2018-08-31 17:18:14.578 7faa22924700  0 mgrc start_command no mgr session (no running mgr daemon?), waiting
Error EINVAL: Traceback (most recent call last):
  File "/home/sage/src/ceph/src/pybind/mgr/diskprediction/module.py", line 329, in handle_command
    return fun(inbuf, cmd)
  File "/home/sage/src/ceph/src/pybind/mgr/diskprediction/module.py", line 298, in _predict_life_expectancy
    result = obj_predictor.query_info('', cmd['dev_id'], '')
  File "/home/sage/src/ceph/src/pybind/mgr/diskprediction/common/localpredictor.py", line 106, in query_info
    predicted_result = self._local_predict(predict_datas)
  File "/home/sage/src/ceph/src/pybind/mgr/diskprediction/common/localpredictor.py", line 74, in _local_predict
    return obj_predictor.predict(smart_datas)
  File "/home/sage/src/ceph/src/pybind/mgr/diskprediction/predictor/DiskFailurePredictor.py", line 210, in predict
    clf = joblib.load(modelpath)
  File "/usr/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/usr/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib64/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 328, in load_build
    Unpickler.load_build(self)
  File "/usr/lib64/python2.7/pickle.py", line 1230, in load_build
    d = inst.__dict__
AttributeError: 'unicode' object has no attribute '__dict__'

when using the local prediction mode.
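
The traceback shows the failure surfacing from deep inside the unpickler as a bare AttributeError. One way such a failure could be reported more helpfully is to wrap the model load and re-raise with the model path and a hint about library versions. This is a sketch only: `ModelLoadError` and `load_model` are hypothetical names, and the standard-library `pickle` stands in for `joblib.load` so the example is self-contained.

```python
import pickle


class ModelLoadError(Exception):
    """Raised when a pre-trained model file cannot be deserialized."""


def _pickle_load(path):
    # Stand-in loader; the module in this PR uses joblib.load instead.
    with open(path, "rb") as fobj:
        return pickle.load(fobj)


def load_model(path, loader=_pickle_load):
    """Load a model, converting low-level unpickling errors into ModelLoadError."""
    try:
        return loader(path)
    except (AttributeError, ImportError, EOFError, OSError,
            pickle.UnpicklingError) as exc:
        raise ModelLoadError(
            "cannot load model %r (%s: %s); the installed libraries may be "
            "older than the ones the model was saved with"
            % (path, type(exc).__name__, exc)
        ) from exc
```

A command handler could then catch `ModelLoadError` and return its message as the command status instead of a raw traceback.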

@hsiang41
Contributor Author

hsiang41 commented Sep 1, 2018 via email

@hsiang41
Contributor Author

hsiang41 commented Sep 4, 2018

@liewegas Did you pip install the requirements.txt stored in the diskprediction plugin's install path? The numpy library needs to be at least version 1.8.2.
pip install -r /diskprediction/requirements.txt --upgrade

@liewegas
Member

liewegas commented Sep 5, 2018

Okay, so we have a larger challenge here of translating the requirements.txt into rpm package versions and adding them to ceph.spec.in and debian/control. I'm not sure where the pickle version you're referring to is coming from?

@hsiang41
Contributor Author

hsiang41 commented Sep 6, 2018

The pickle library is a numpy dependency, and older numpy versions did not support Unicode. Can we modify install-deps.sh to add a pip install of the requirements?

@liewegas
Member

liewegas commented Sep 6, 2018

@jcsp I'm assuming the preferred path is to rely on installed packages for everything. This makes me a bit nervous as there are a lot of dependencies here. Is there an option to do a virtualenv and bundle the dependencies?

@jcsp
Contributor

jcsp commented Sep 6, 2018

@liewegas Bundling python dependencies is generally impractical if they have native code components (as e.g. numpy does).

The good news is that scikit-learn (version 0.18.1) is already packaged in Fedora at least.

The google/grpc bits I'm not so sure, but given that they're only used by DiskProphet's product users (right?) maybe we can worry less about them; perhaps any packaging that needs doing for that could happen outside of the upstream Ceph packaging.

@hsiang41 a couple of questions for you: I see that the requirements.txt references sklearn==0.0, whereas the sklearn page on pypi says to use scikit-learn instead. Is there a significant difference in the interfaces?

@liewegas
Member

liewegas commented Sep 6, 2018

local model -> recall: 63%, false alarm: 6.5%, accuracy: 78.25%
cloud engine -> Recall: 97%, accuracy: 95%
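
For reference, the three local-model figures fit together under the usual confusion-matrix definitions if the evaluation set held equal numbers of failing and healthy disks. The counts below are hypothetical, chosen only to reproduce the quoted percentages; they are not the actual evaluation data, and the comment does not say which definitions were used.

```python
def recall(tp, fn):
    """Fraction of failing disks that were flagged as failing."""
    return tp / (tp + fn)


def false_alarm_rate(fp, tn):
    """Fraction of healthy disks that were wrongly flagged as failing."""
    return fp / (fp + tn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all disks that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical counts: 1000 failing disks and 1000 healthy disks.
    tp, fn = 630, 370
    fp, tn = 65, 935
    print("recall:      %.2f%%" % (100 * recall(tp, fn)))
    print("false alarm: %.2f%%" % (100 * false_alarm_rate(fp, tn)))
    print("accuracy:    %.2f%%" % (100 * accuracy(tp, tn, fp, fn)))
```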

@hsiang41 hsiang41 force-pushed the mgr/diskprophet branch 2 times, most recently from ad9ac72 to 4921a4e Compare September 11, 2018 02:11
@hsiang41
Contributor Author

@jcsp The sklearn interface is the same as scikit-learn. I modified requirements.txt to use scikit-learn and also updated the dependent libraries as below:
numpy==1.15.1
scikit-learn==0.19.2
scipy==1.1.0

@hsiang41 hsiang41 force-pushed the mgr/diskprophet branch 2 times, most recently from 55a274f to 1b723ec Compare September 11, 2018 09:57
@jcsp
Contributor

jcsp commented Sep 11, 2018

@hsiang41 it would be useful to work out how those versions relate to what's available in major distros, (e.g. centos7 has numpy 1.7.1, is that recent enough?) and whether the versions in distros are compatible with your code. The goal is to know whether we can just add dependency lines to the RPM packaging, or whether somebody would need to create special packages in order to use this ceph-mgr module on certain distros.

Regarding the grpc dependencies, am I correct in thinking those are only relevant to people using your cloud engine?

@hsiang41
Contributor Author

hsiang41 commented Sep 12, 2018

@jcsp I have tested the rpm packages below on my machine (CentOS Linux release 7.5.1804 (Core)); they work with the local predictor.
numpy-1.7.1-13.el7.x86_64
scipy-0.12.1-6.el7.x86_64
python-scikit-learn-0.18.1-3.el7.x86_64
I downloaded the python-scikit-learn package from https://centos.pkgs.org/7/harbottle-epypel-x86_64/.

The grpc dependencies are used by the diskprediction plugin to push data into the cloud service, but I did not find any rpm package for grpc. Do you have any advice about this problem?
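
The question about distro versions can be checked mechanically. The sketch below splits the RPM names listed above into name and version and compares them with the pins from the updated requirements.txt. The helper names and the mapping from RPM package names to pip names are assumptions, and the version comparison is a plain numeric tuple comparison, not full RPM or PEP 440 ordering.

```python
def parse_rpm(nvra):
    """Split 'name-version-release.arch' into (name, version)."""
    nvr, _arch = nvra.rsplit(".", 1)
    name, version, _release = nvr.rsplit("-", 2)
    return name, version


def version_tuple(version):
    """Turn '1.7.1' into (1, 7, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Pins quoted from the updated requirements.txt in this thread.
PINS = {"numpy": "1.15.1", "scikit-learn": "0.19.2", "scipy": "1.1.0"}

# Assumed mapping from RPM package name to pip project name.
RPM_TO_PIP = {
    "numpy": "numpy",
    "scipy": "scipy",
    "python-scikit-learn": "scikit-learn",
}


def older_than_pin(rpms):
    """Return {pip_name: (installed, pinned)} for packages older than their pin."""
    result = {}
    for nvra in rpms:
        name, version = parse_rpm(nvra)
        pip_name = RPM_TO_PIP[name]
        if version_tuple(version) < version_tuple(PINS[pip_name]):
            result[pip_name] = (version, PINS[pip_name])
    return result


if __name__ == "__main__":
    installed = [
        "numpy-1.7.1-13.el7.x86_64",
        "scipy-0.12.1-6.el7.x86_64",
        "python-scikit-learn-0.18.1-3.el7.x86_64",
    ]
    for name, (have, want) in sorted(older_than_pin(installed).items()):
        print("%s: installed %s, pinned %s" % (name, have, want))
```

All three CentOS 7 packages come out older than the pins, which together with the report that the local predictor works on them suggests the exact pins may be stricter than the code needs.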

@@ -0,0 +1,1775 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: mainServer.proto
Contributor

can we include mainServer.proto in the source tree instead of the generated python binding?

Contributor Author

@hsiang41 hsiang41 Sep 13, 2018

@tchaikov If we add mainServer.proto to the source tree, we need to add a script that converts the proto to py code in the ceph-mgr deploy script. Do we need to do this?

Contributor

@hsiang41 i am not sure if i am following you. could you define "ceph-mgr deploy script"? is it the src/pybind/mgr/diskprediction/CMakeLists.txt ?

Contributor Author

@tchaikov My concern is that generating the python code from the proto in CMakeLists.txt means ceph needs to install the grpc library and the grpc plugins library. These libraries do not have rpm packages.

Contributor

@hsiang41 i see.. so what we need is to have the grpc_tools.protoc and googleapis-common-protos python modules ready for compiling the .proto definition file. since this grpc server is hosted in the cloud, and the grpc service is only available to users of this cloud service, i'd suggest packaging it downstream.
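
For anyone packaging the cloud mode downstream, regenerating the Python bindings from mainServer.proto would look roughly like the following. The sketch only builds the command line; `protoc_command` and the directory layout are assumptions, and actually running the command requires the grpcio-tools package, which provides `grpc_tools.protoc`.

```python
import sys


def protoc_command(proto_file, include_dir=".", out_dir="."):
    """Build the argv that invokes grpc_tools.protoc as a module."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        "-I" + include_dir,
        "--python_out=" + out_dir,
        "--grpc_python_out=" + out_dir,
        proto_file,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since grpcio-tools may be absent.
    print(" ".join(protoc_command("mainServer.proto")))
```

Passing the resulting list to `subprocess.check_call` at package build time would keep the generated bindings out of the source tree.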

@@ -0,0 +1,77 @@
{
Contributor

my concern is the license of the pre-trained SVM models. these data files are pickled SVM classifiers, which are machine readable once un-pickled. but what about their "source"? or are they in source form already? if so, how is a user supposed to "modify" them effectively, even if he/she understands SVMs and the python language? LGPL 2.1 requires the work to be accompanied by its source code. if we cannot provide the source, we will have to re-distribute these data files under a different license.

i had a hard time when preparing[0] a software package which used a statistical language model for debian. the packaged software was licensed under LGPL2.1 and CDDL. and the package was rejected by debian's FTP master because of the license of the pre-trained data: we licensed it under the same dual license.

yes, in this context, we are the upstream developers, not downstream maintainers. but i think it's worth mentioning.


[0] https://lists.debian.org/debian-devel/2008/05/msg00005.html

Contributor

Interesting thought. In my opinion, the original dataset would not constitute "source code" for licensing purposes, but I can see how there could be some debate. It is probably prudent to apply a different license to the model.

Perhaps just declare the model files as public domain -- @hsiang41 , what do you think?

Member

Using a different license for the model files would be the easiest thing.

Contributor Author

@jcsp Agreed, but I do not know how to do this. Do you have a sample? Or do I need to add some comment in my project?

Contributor

@hsiang41 in the top level COPYING file you can see a list of various exceptions. I'd add a section at the bottom of that, and also add a COPYING file in your models/ directory that makes a statement that these particular files are donated by ProphetStor to be used by anyone for any purpose and you make no copyright claims over them.

Contributor

@hsiang41 probably you could update https://github.com/ceph/ceph/blob/master/COPYING , https://github.com/ceph/ceph/blob/master/debian/copyright accordingly in this PR ? like

diff --git a/COPYING b/COPYING
index cd45ce086a..f0a37b8bf3 100644
--- a/COPYING
+++ b/COPYING
@@ -145,3 +145,7 @@ Files: src/include/timegm.h
   Copyright (C) Copyright Howard Hinnant
   Copyright (C) Copyright 2010-2011 Vicente J. Botet Escriba
   License: Boost Software License, Version 1.0
+
+Files: src/pybind/mgr/diskprediction/predictor/models/*
+Copyright: None
+License: Public domain

@jcsp
Contributor

jcsp commented Sep 12, 2018

I downloaded the python-scikit-learn package from https://centos.pkgs.org/7/harbottle-epypel-x86_64/.

OK, so this is the challenging part. If we have a dependency on python-scikit-learn, then we probably also need to be providing (+ therefore building, as we can't rely on third party repos) that package.

I see that Fedora has a python-scikit-learn package, Ubuntu 16.04 has a python-sklearn package, and SUSE has it in tumbleweed+leap 15, but not in SLES.

In the short term, the answer is probably to just include this module, but make clear to users that they will need to find their own python-scikit-learn packages before using it.

The grpc dependencies are used by the diskprediction plugin to push data into the cloud service, but I did not find any rpm package for grpc. Do you have any advice about this problem?

That's probably up to you, as it would only be diskprophet customers that are affected. Your options are basically to build packages yourself, or ask your customers to install using pip if they are comfortable with that.
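
One way to make the missing-package situation clear to users is to probe for the optional imports up front and report what is missing, rather than failing with a traceback at prediction time. A sketch with hypothetical helper names: `sklearn` is the import name of scikit-learn, and the per-mode lists are an illustrative assumption, not the module's real dependency list.

```python
import importlib.util

# Import names needed per mode (illustrative assumption only).
REQUIRED = {
    "local": ["numpy", "scipy", "sklearn"],
    "cloud": ["grpc"],
}


def missing_modules(names):
    """Return the import names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def check_mode(mode):
    """Return (ok, message) describing whether a mode's dependencies are present."""
    missing = missing_modules(REQUIRED[mode])
    if missing:
        return False, "%s mode unavailable, missing python modules: %s" % (
            mode, ", ".join(missing))
    return True, "%s mode dependencies found" % mode


if __name__ == "__main__":
    for mode in sorted(REQUIRED):
        print(check_mode(mode)[1])
```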

@hsiang41
Contributor Author

@votdev Can you help review my changes, which already follow your advice?

Refresh local predictor model.

Signed-off-by: Rick Chen rick.chen@prophetstor.com
hsiang41 added 2 commits September 15, 2018 01:49
1. Refresh diskprediction plugin doc guide.
2. Change the COPYING file.

Signed-off-by: Rick Chen rick.chen@prophetstor.com
Correct the "ceph device set-cloud-prediction-config" command typo.

Signed-off-by: Rick Chen rick.chen@prophetstor.com
@liewegas
Member

Merged via #24104

@liewegas liewegas closed this Sep 18, 2018
8 participants