Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/influx: Various time fixes #20494

Merged
merged 3 commits into from Mar 27, 2018

Conversation

Projects
None yet
2 participants
@wido
Copy link
Member

commented Feb 20, 2018

This PR fixes three things:

  • Make the default interval 30 seconds
  • Make sure all values have the same timestamp and reduce syscalls for fetching the current time
  • Set a retention period by default in InfluxDB

The default interval of 5 seconds is a nice to have, but it breaks large systems. Collecting and sending all this information on a 2000 OSD cluster is killing for the Manager. We shouldn't set that by default.

In addition a optimization here is to only fetch the time once, store it in a variable and us it in that run. Otherwise values have different times, but we also perform a lot of syscalls for fetching the time.

The last fix is to set a default retention of 8 weeks in InfluxDB. Users can change this however they like, it will not be modified afterwards, but it's to prevent users setting this up and having a DB grow for ever and breaking at some point in the future.

wido added some commits Jan 31, 2018

mgr/influx: Set the default interval to 30 seconds
A interval of 5 seconds (previous default) provides a high
resolution, but also a very large number of entries in the Influx
database.

By setting the default 30 seconds the amount of entries is reduced to
1/6th of that and thus saves storage and other resources.

If users want they can change the interval to anything they like.

Signed-off-by: Wido den Hollander <wido@42on.com>
mgr/influx: Set a default retention of 8 weeks on database
Without a retention policy it is up to the administrator to set
a retention policy on the data.

This can easily be forgotten resulting in a ever growing InfluxDB
database.

Users can change the retention policy if they want to afterwards,
this is just the default policy set by the InfluxDB module when
it initially creates the database.

Signed-off-by: Wido den Hollander <wido@42on.com>
mgr/influx: Only fetch the current time once when gathering data
By fetching the current time once and storing it into a variable
we save a lot of system calls. On large clusters this can be a lot
of system calls.

In addition we also make sure that all points gathered in the same
loop/run have exactly the same timestamp.

Otherwise there will be a difference in time between the items in
InfluxDB which then causes problems when creating graphs with for
example Grafana.

Signed-off-by: Wido den Hollander <wido@42on.com>

@wido wido added the mgr label Feb 20, 2018

@wido wido requested a review from jcsp Feb 20, 2018

@jcsp

This comment has been minimized.

Copy link
Contributor

commented Feb 20, 2018

This all looks fine, although I'm a bit concerned that this was painfully slow at 5s intervals on a 2000 OSD cluster -- suggests that the code to gather and send the report was taking multiple seconds to execute.

@wido

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2018

@jcsp Well, Influx also has to accept the data and the JSON needed to be send.

The thing is, I don't have this cluster at my disposal to test and debug everything on it, it's in production.

However, having a 30 second interval means you have a 6th of the data compared to the 5s interval.

People can still use a 5s interval, but I don't think it's wise to make it the default

@wido

This comment has been minimized.

Copy link
Member Author

commented Mar 26, 2018

@jcsp Can you take a look at this one again?

If OK I want to create a backport PR just like I did with the Zabbix module in #20781 to make sure that Luminous has the latest fixes for the Influx en Zabbix module.

@jcsp

jcsp approved these changes Mar 26, 2018

@jcsp jcsp merged commit c7049c1 into ceph:master Mar 27, 2018

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details
@jcsp

This comment has been minimized.

Copy link
Contributor

commented Mar 27, 2018

(tested this locally)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.