Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support to CEPH input. #3311

Merged
merged 26 commits into from Jan 24, 2017

Conversation

@amandahla
Copy link
Contributor

commented Jan 9, 2017

Using as reference Telegraf (https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ceph/ceph.go). Working in progress.
#197

amandahla added 10 commits Jan 3, 2017
Merge remote-tracking branch 'upstream/master'
Conflicts:
	metricbeat/metricbeat.yml
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Jan 9, 2017

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

1 similar comment
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Jan 9, 2017

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

enabled: true
period: 1s

ceph:

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 9, 2017

Collaborator

Is this working? This looks like invalid yaml. All config options related to a metricset should be in the module itself.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 9, 2017

Author Contributor

Actually no. I commited a new version with this file corrected now.

@ruflin
Copy link
Collaborator

left a comment

Thanks a lot for starting with this PR and adding support for CEPH to metricbeat. What is your current stage? Should we already have a detailed look at the PR or do you prefer first some high level feedback?

@@ -1,105 +0,0 @@
###################### Metricbeat Configuration Example #######################

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 9, 2017

Collaborator

This file should not be removed. Running make update should generate it again.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 9, 2017

Author Contributor

Thanks, new commit with this corrected.

@amandahla

This comment has been minimized.

Copy link
Contributor Author

commented Jan 9, 2017

Hi Ruflin, thanks for the review. I prefer a detailed look because everything is working but it can really be better. I had difficult to deal with json output from ceph and only tested with a container (https://hub.docker.com/r/ceph/demo/).
Any help / review is appreciated. :-)

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 10, 2017

I will have a closer look. Please make sure to run make fmt so all code adheres to the coding standards.

@ruflin
Copy link
Collaborator

left a comment

Thanks for all the work on this. To make this testable it is important to add a testing environment and add integration tests.

Have a look at the docker-compose.yml file in metricbeat. There you can add the ceph service and then add tests for it. This will also make it easy to create the example data.json files and will directly show all the fields that are exported by these metricsets.

enabled: true
period: 1s

ceph:

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

This part of the config must be moved under the module. I assume it doesn't work the way it is at the moment.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Moved.

# Consult the ceph documentation for more detail on keyring generation.
user = "client.admin"
# Ceph configuration to use to locate the cluster
config_path = "/etc/ceph/ceph.conf"

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

In general we recommend metricbeat to connect to each node and get the specific metrics for each node instead from the cluster. What is inside this file? You should use hosts: ["..."] for that if possible.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

I used telegraf as reference and that only collects performance metrics from the MON and OSD nodes in a Ceph storage cluster. I can change the description, perhaps.

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 11, 2017

Collaborator

I'm not aware of the inner workings of CEPH. Can you share some details on what MON and OSD nodes are?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 11, 2017

Author Contributor

A Ceph Storage Cluster requires at least one MON and at least 2 OSD.

A OSD (Object Storage Daemon) handles data (storage, replication, recovery...) and provide some monitoring information to MON by checking others OSD for a heartbeat.

A MON (Monitor) handles the cluster state (maps, history...).

You can have this architecture on one server/node but just for tests. Real productions environments works in a distributed way.

If you want metrics just for the daemons where metricbeat its collecting, you can use Admin Socket Stats. In our case here, it's the "perf" metricset.

If you want metrics for the whole cluster, you can use ceph commands. In our case here, it's the "status", "df" and "osdpoolstats" metricsets.

That's why I wrote those comments on config.yml. Depending on choosen metricset, you need just some parameters.

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 12, 2017

Collaborator

So perf is a local stat that should be collect by all metricbeat instances and the other 3 are cluster stats, means only one of the instances connecting needs to collect them? If there are multiple MON, will they provide different data?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 12, 2017

Author Contributor

"...means only one of the instances connecting needs to collect them?" Yes.
"If there are multiple MON, will they provide different data?" No, at least it's not expected this behavior.

enabled: true
period: 5s

ceph:

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

what is this line for? Forgot to remove?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Changed. Useless line, sorry.

period: 5s

ceph:
# location of ceph binary

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

I suggest to split up the config into config.full.yml and config.yml and only have the extensive options in the full file. See other metricsets for examples.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Working.

- module: ceph
metricsets: ["perf","status","df","osdpoolstats"]
enabled: true
period: 5s

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

The default we use is 10s

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Changed.

logp.Err("An error occurred while parsing data for getting ceph df: %v", err)
}

stats_fields, ok := data["stats"].(map[string]interface{})

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

we use statsFields etc as naming ocnvention.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Changed.

This is the ceph Module.

We used code/reference from:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ceph/ceph.go

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

We should make sure to not just copy / paste code from here. Or was this also created by you?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Actually I copy/paste code and then adapted to collect metrics. It's really similar but not exactly equal. That's why I wrote this on docs.asciidoc. Is it a problem?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Ah: This referente on telegraf was not created by me.

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 11, 2017

Collaborator

I think we need to be careful about the licenses here. Getting inspired by some code is one thing, copy / pasting it an other. If we have the same code, we should use it as a library ;-)

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 13, 2017

Author Contributor

I saw that Telegraf uses MIT License. So, I understand that there is no problem to copy parts of the Software. Unfortunately, I don't think that it's possible to use as a library.
To get a better view what kind of copy I did, here it's a example. I changed the code to generate events.
Mine (line 40)
Telegraf (line 462)

description: >
df
fields:
- name: example

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

Fields docs have to be added. This will also make it easier for us to understand what kind of data is provided.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Working.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 13, 2017

Author Contributor

Please, just to be sure: is there a way to generate fields automatically? Or just manually?


func New(base mb.BaseMetricSet) (mb.MetricSet, error) {

config := ceph.Config{}

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

All the Ceph metricsets should be marked experimental first. We do this for all metricsets for the first release. See other metricsets for examples.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Working.

},
"metricset":{
"host":"localhost",
"module":"mysql",

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 10, 2017

Collaborator

you can generate this files by running make generate-json if you have created the integration test files which are responsible to create the data (see other metricsets). This requires a working docker environment.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 10, 2017

Author Contributor

Working.

amandahla added 3 commits Jan 10, 2017
Merge with upstream.
Config reading changed.
@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 11, 2017

@amandahla Ping me when it is ready for an other round of reviews.

amandahla added 2 commits Jan 16, 2017
@amandahla

This comment has been minimized.

Copy link
Contributor Author

commented Jan 16, 2017

Hi @ruflin
I made some changes here:

  • Removed "perf" metricset because our CEPH Admin told me that isn't so useful. So we have now just the ones from cluster.
  • Wrote fields.yml files.
  • Wrote integration tests.
  • Working on unity tests.
    Please, can you review?
@ruflin
Copy link
Collaborator

left a comment

Currently this module depends on executing a binary which we normally don't do in Metricbeat. It should be changed to use the CEPH REST API which will also simplify the implementation and configuration. Are there any limitations using the REST API?

@@ -46,13 +47,26 @@ services:
volumes:
- ${PWD}/..:/go/src/github.com/elastic/beats/
- /var/run/docker.sock:/var/run/docker.sock
- datavolume:/etc/ceph

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

Do we need these data volumes on the client side?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

I'll study the rest api but I think that we can remove all this and don't use command any more.

command: make
entrypoint: /go/src/github.com/elastic/beats/metricbeat/docker-entrypoint.sh

# Modules
apache:
build: ${PWD}/module/apache/_meta

ceph:
image: ceph/demo
container_name: metricbeatceph

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

Please do not give the container a special name. This should not be needed.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

Same as above :)
"I'll study the rest api but I think that we can remove all this and don't use command any more."

@@ -91,3 +105,8 @@ services:

zookeeper:
image: jplock/zookeeper:3.4.8

volumes:

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

If these volumes are ceph specific, we should have better names. If possible, I would prefer not to have "global" volumes.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

Same as above :)
I'll study the rest api but I think that we can remove all this and don't use command any more.


This is the ceph Module.

We used code/reference from:

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

If the reference is only used to get inspiration, these should be removed.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

Since I'll change to use Rest API, this reference will be no longer necessary indeed. I'll remove.

var output string

if strings.Contains(c.BinaryPath, "docker") {
cmdCeph := []string{"/usr/bin/ceph"}

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

We should not execute binaries whenever possible. As far as I can see ceph also has a REST API to fetch monitoring data: http://ceph.com/planet/experimenting-with-the-ceph-rest-api/ This should be used to fetch the data. Calling a binary requires CEPH to be installed on the machine. This works in most cases but breaks for example in docker environments, where on container is used to monitor the other containers.

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

BTW: Why is the execution different if it is in a container or not?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

If is a container, I needed to call the ceph command inside him so that's why was different. In container, command will be something like "docker exec -i container_name /usr/bin/ceph ...".
With REST api, I can remove this too.

"mons":[
{
"name":"prcdsrvv1619",
"kb_total":6281216,

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

Please check our naming conventions: https://www.elastic.co/guide/en/beats/libbeat/5.1/event-conventions.html

Whenever possible we should no use nested arrays. I think we need to have a closer look at the data set and potentially resturcture it to fit into metricbeat.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

So far, I see that with REST API the output it's more simple. But I'll have a closer look at this.

var defaultConfig = CephConfig{
BinaryPath: "/usr/bin/ceph",
User: "client.admin",
ConfigPath: "/etc/ceph/ceph.conf",

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 17, 2017

Collaborator

If we use the REST API using config etc. becomes obsolete.

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 17, 2017

Author Contributor

I agree. :)

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 17, 2017

@amandahla Keep me posted about the REST API

@amandahla

This comment has been minimized.

Copy link
Contributor Author

commented Jan 17, 2017

@ruflin Thanks for showing me the CEPH Rest API. I was so focused on use the same way that Telegraf did, that didn't even think about searching for another way.
Now I'll take a bit more time to change everything, but I hope to finish soon. I'll keep this PR open but when I finish, I'll open another, ok?

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 17, 2017

You can also just push the updated version to this branch. No worries if you overwrite the current commits. I think the title stays the same anyways ;-)

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 18, 2017

@amandahla Your last commit looks very promising. This heavily simplifies the implementation 👍 Let me know if you should have another detailed look.

@amandahla

This comment has been minimized.

Copy link
Contributor Author

commented Jan 18, 2017

Hi @ruflin
Thanks!
I think that I'll let this PR only with one MetricSet ("health") like you suggested before. Please, can you review?
It's really much better to use the Ceph Rest API!

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 19, 2017

Quick note on the failing travis build: Run make fmt to clean up the source code and make update to update all generated files. It should go green afterwards.

@ruflin
Copy link
Collaborator

left a comment

Change looks good to me. I left some comments on the naming of the fields. We are very close to get this in. Thanks a lot for going forward with this.

This is the ceph Module. Metrics are collected submitting HTTP GET requests to ceph-rest-api.
Reference: http://docs.ceph.com/docs/master/man/8/ceph-rest-api/

Thanks!

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

Remove this line.

type: keyword
description: >
Status of the round
- name: mon.avail_percent

This comment has been minimized.

Copy link
@ruflin
type: keyword
description: >
Health of the MON
- name: mon.kb_avail

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

mon.available.kb

type: long
description: >
Available KB of the MON
- name: mon.kb_total

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

mon.total.kb

type: long
description: >
Total KB of the MON
- name: mon.kb_used

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

mon.used.kb

type: long
description: >
Log bytes of MON
- name: mon.store_stats.bytes_misc

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

mon.store_stats.misc.bytes

You can also add format: bytes to all bytes value, so Kibana will automatically know how to convert it to MB etc.

    - name: mon.store_stats.sst.bytes
      type: long
      description: >
        SST bytes of MON
      format: bytes
type: long
description: >
Misc bytes of MON
- name: mon.store_stats.bytes_sst

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

same with the bytes for the following two

"encoding/json"
"github.com/elastic/beats/libbeat/common"
"github.com/elastic/beats/libbeat/logp"
"io"

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

we normally but the external imports on the two following and empty line before the internal once


events := []common.MapStr{}

event := common.MapStr{

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

You could probably use the schema package here to do the event conversion, but we can do this cleanup in a second step.


event := common.MapStr{
"cluster.overall_status": d.Output.OverallStatus,
"cluster.timechecks": common.MapStr{

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

You must use here "cluster" : common.MapStr{ "timechecks" : common.MapStr{ ... as otherwise the event will not be compatible with elasticsearch 2.x version which do not support dots in fields.

@amandahla

This comment has been minimized.

Copy link
Contributor Author

commented Jan 19, 2017

Thanks @ruflin !
I made the changes that you asked and created a unity test. I hope that will be ok now (I always forget the make fmt/update, sorry!).

@ruflin
ruflin approved these changes Jan 19, 2017
Copy link
Collaborator

left a comment

LGTM. The only thing I see missing is a quick system tests that checks if all fields are as expected. See https://github.com/elastic/beats/blob/master/metricbeat/tests/system/test_haproxy.py#L34 But we can also add this as a follow up PR.

func TestFetchEventContents(t *testing.T) {
absPath, err := filepath.Abs("./testdata/")

response, err := ioutil.ReadFile(absPath + "/sample_response.json")

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 19, 2017

Collaborator

This is a great way to test the fetcher with predefined data. We should apply the same logic to other metricsets.

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 19, 2017

jenkins, test it

@amandahla

This comment has been minimized.

Copy link
Contributor Author

commented Jan 24, 2017

Hello @ruflin . I'm thinking about start on a new metricset now. Please, is this one ok so I can use as reference?

@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 24, 2017

@amandahla will review just now and let you know.

@ruflin ruflin merged commit f9031b9 into elastic:master Jan 24, 2017

4 checks passed

CLA Commit author has signed the CLA
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
default Build finished.
Details
@ruflin

This comment has been minimized.

Copy link
Collaborator

commented Jan 24, 2017

@amandahla Just merged it. I will do a follow up PR with some cleanup and will ping you on it. So you can take this directly into account in your next metricset. Thanks for pushing this forward.


events = append(events, event)

for _, HealthService := range d.Output.Health.HealthServices {

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 24, 2017

Collaborator

@amandahla I just realised now that we have a problem here. We send two different event types as part of the same metricset. We need to extract this part here into its own metricset. Not should war we should call it. monitor?

This comment has been minimized.

Copy link
@amandahla

amandahla Jan 24, 2017

Author Contributor

@ruflin Sorry, I didn't notice that would be a problem because all data is related and is obtained with only one command.
Answering your question: Yes, I think that "monitor" its good.

This comment has been minimized.

Copy link
@ruflin

ruflin Jan 24, 2017

Collaborator

@amandahla As we send it here in separate events anyways, it doesn't make a difference if it is a different metricset. Can you open a PR with an additional metricset? Sorry that I missed that one during the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.