bosun: convert datastore to Ledisdb/redis implementation. #1332

Merged
merged 5 commits into from Sep 28, 2015

4 participants

@captncraig
Contributor

The current data strategy is essentially to keep everything in memory and back it up to boltdb as a failsafe only. This has a few issues, including long startup times, long save pauses, inability to share data between multiple instances, etc.

We would like to move to a redis datastore for bosun state data, but we also don't want to alienate users who prefer a standalone application. ledisdb gives us an in-process, redis-compatible data store. A standard redis client can talk to it, or can be pointed at a real redis server instead.

What will change

  • Instead of using the bolt state file, data will be stored in ledisdb or redis.
  • Data will be converted on bosun startup, and removed from the statefile.
  • Most data structures used by the sched package will need to be reworked into a more granular key/value access pattern.
  • In embedded ledis mode, the ledis server should be available at 127.0.0.1:9565; redis clients should be able to interact with it for the most part.
  • Due to differences between ledis and redis, we will maintain a suite of tests to ensure all functionality works identically between implementations.
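
As a rough illustration of the last three bullets, here is a minimal sketch of a granular, per-metric store interface together with a conformance check that could be run against each backend. Everything named here is hypothetical (`MetadataStore`, `memStore`, `CheckStore`, the `mmeta:` key prefix, the sample unit and rate values), and the in-memory map only stands in for a redis hash; it is not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// MetricMetadata holds the per-metric fields discussed in this thread
// (desc, unit, rate). The type itself is hypothetical.
type MetricMetadata struct {
	Desc, Unit, Rate string
}

// MetadataStore is a hypothetical granular interface: one small write per
// field instead of saving a single large blob.
type MetadataStore interface {
	PutField(metric, field, value string) error
	Get(metric string) (MetricMetadata, bool, error)
}

// memStore is an in-memory stand-in for "one hash per metric" in redis.
type memStore struct {
	mu     sync.Mutex
	hashes map[string]map[string]string
}

func newMemStore() *memStore {
	return &memStore{hashes: make(map[string]map[string]string)}
}

// metricKey builds the per-metric key. The "mmeta:" prefix is an assumption.
func metricKey(metric string) string { return "mmeta:" + metric }

func (s *memStore) PutField(metric, field, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := metricKey(metric)
	if s.hashes[k] == nil {
		s.hashes[k] = make(map[string]string)
	}
	s.hashes[k][field] = value
	return nil
}

func (s *memStore) Get(metric string) (MetricMetadata, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	h, ok := s.hashes[metricKey(metric)]
	if !ok {
		return MetricMetadata{}, false, nil
	}
	return MetricMetadata{Desc: h["desc"], Unit: h["unit"], Rate: h["rate"]}, true, nil
}

// CheckStore runs the same assertions against any implementation, so an
// embedded and a remote backend can be held to identical behavior.
func CheckStore(s MetadataStore) error {
	if _, ok, err := s.Get("bosun.collect.sent"); err != nil || ok {
		return errors.New("expected an empty store to report a missing metric")
	}
	if err := s.PutField("bosun.collect.sent", "unit", "items"); err != nil {
		return err
	}
	if err := s.PutField("bosun.collect.sent", "rate", "counter"); err != nil {
		return err
	}
	got, ok, err := s.Get("bosun.collect.sent")
	if err != nil {
		return err
	}
	want := MetricMetadata{Unit: "items", Rate: "counter"}
	if !ok || got != want {
		return fmt.Errorf("got %+v (found=%v), want %+v", got, ok, want)
	}
	return nil
}

func main() {
	if err := CheckStore(newMemStore()); err != nil {
		fmt.Println("conformance check failed:", err)
		return
	}
	fmt.Println("conformance check passed")
}
```

A redis-backed implementation of the same interface could be passed to `CheckStore` unchanged, which is the point of keeping the checks independent of the backend.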

Configuration

  • redisHost = myRedis:6379

OR

  • ledisDir = /opt/bosun/ledis_data

The default setup is ledisDir = ledis_data.
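
The choice between the two settings could be resolved roughly as follows. This is a sketch under assumptions: `StoreConfig`, `Backend`, and `ResolveBackend` are hypothetical names, and giving `redisHost` precedence when both values are set is my assumption, since the description only says to set one or the other.

```go
package main

import "fmt"

// StoreConfig mirrors the two settings described above. The struct and
// field names are hypothetical, not Bosun's actual configuration types.
type StoreConfig struct {
	RedisHost string // e.g. "myRedis:6379"
	LedisDir  string // e.g. "/opt/bosun/ledis_data"
}

// Backend describes which datastore a redis client should talk to.
type Backend struct {
	Kind string // "redis" or "ledis"
	Addr string // address a redis client would dial
	Dir  string // data directory (embedded ledis only)
}

const (
	defaultLedisDir   = "ledis_data"     // default stated in the PR description
	embeddedLedisAddr = "127.0.0.1:9565" // embedded ledis address from the PR description
)

// ResolveBackend picks a backend from the config. Preferring RedisHost when
// both fields are set is an assumption made for this sketch.
func ResolveBackend(c StoreConfig) Backend {
	if c.RedisHost != "" {
		return Backend{Kind: "redis", Addr: c.RedisHost}
	}
	dir := c.LedisDir
	if dir == "" {
		dir = defaultLedisDir
	}
	return Backend{Kind: "ledis", Addr: embeddedLedisAddr, Dir: dir}
}

func main() {
	fmt.Printf("%+v\n", ResolveBackend(StoreConfig{RedisHost: "myRedis:6379"}))
	fmt.Printf("%+v\n", ResolveBackend(StoreConfig{LedisDir: "/opt/bosun/ledis_data"}))
	fmt.Printf("%+v\n", ResolveBackend(StoreConfig{}))
}
```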

Migration

I strongly recommend setting one of the above config items before rolling this change.

We will likely only convert one data structure at a time in order to test things thoroughly.

This PR only migrates metric metadata.
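
A one-time startup conversion could look roughly like the sketch below. It assumes the legacy data is marked as migrated rather than deleted, which is what a later commit message in this PR ("not deleting from bolt, just setting flag") suggests. The `legacyState` type and `MigrateMetadata` function are hypothetical, and plain maps stand in for both bolt and the new store.

```go
package main

import "fmt"

// legacyState stands in for the bolt state file. Hypothetical type.
type legacyState struct {
	Metadata map[string]string
	Migrated bool // set once the metadata has been copied to the new store
}

// MigrateMetadata copies legacy metadata into dst exactly once and returns
// the number of entries copied. The legacy data is kept, only flagged.
func MigrateMetadata(src *legacyState, dst map[string]string) int {
	if src.Migrated {
		return 0 // already converted on an earlier startup
	}
	n := 0
	for k, v := range src.Metadata {
		dst[k] = v
		n++
	}
	src.Migrated = true
	return n
}

func main() {
	src := &legacyState{Metadata: map[string]string{"bosun.collect.sent": "counter"}}
	dst := map[string]string{}
	fmt.Println("first start copied:", MigrateMetadata(src, dst))
	fmt.Println("second start copied:", MigrateMetadata(src, dst))
}
```

Keeping the old data and only flagging it leaves a fallback if the conversion turns out to be wrong.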

Future work

  • migration app to move ledis -> redis or vice versa.
  • all data structures converted
@gbrayut gbrayut commented on an outdated diff Sep 24, 2015
opentsdb/tsdb_test.go
@@ -1,6 +1,9 @@
package opentsdb
-import "testing"
+import (
+ "fmt"
@gbrayut
gbrayut Sep 24, 2015 Contributor

Unused import

@gbrayut
Contributor
gbrayut commented Sep 24, 2015

Read through the changes, nothing jumps out except the unused import statement that I commented about. Some more notes:

  1. Should probably add ledisDir = ../ledis_data to the dev.sample.conf file
  2. This branch doesn't build (missing dependencies), but I see there is another commit/pr to add those.
  3. I built the ledisparty branch and am testing it now. First thing I noticed is the host metadata tab is empty, but this may just be scollector waiting an hour before sending the metadata. When I test using /api/metadata/get?metric=bosun.collect.sent (or ?metric=scollector.collect.sent) I see the desc and unit details, but nothing for rate.
  4. I don't think the Series Type = auto is working the same as before on the graph tab. If I use the above URL to see the metadata, and it is missing the name=rate settings, I would expect the graph page to return an error when using auto. The current master returns "no metadata for sum:elastic.cluster.status: cannot use auto rate" if no metadata exists for that metric, but this branch seems to default to gauge.
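
The behavior expected in item 4 can be sketched as follows. `ResolveSeriesType` is a hypothetical function, not Bosun's actual handler, and the accepted rate values (gauge, counter, rate) are an assumption. Only the error text is taken from the message quoted above.

```go
package main

import "fmt"

// ResolveSeriesType returns the series type to use for a graph query.
// With "auto", it relies on the stored rate metadata and fails loudly when
// none exists, rather than silently defaulting to gauge.
func ResolveSeriesType(requested, query, rateMeta string) (string, error) {
	if requested != "auto" {
		return requested, nil // the user chose a type explicitly
	}
	switch rateMeta {
	case "gauge", "counter", "rate":
		return rateMeta, nil
	case "":
		return "", fmt.Errorf("no metadata for %s: cannot use auto rate", query)
	default:
		return "", fmt.Errorf("unknown rate type %q for %s", rateMeta, query)
	}
}

func main() {
	if _, err := ResolveSeriesType("auto", "sum:elastic.cluster.status", ""); err != nil {
		fmt.Println(err)
	}
	t, _ := ResolveSeriesType("auto", "sum:bosun.collect.sent", "counter")
	fmt.Println(t)
}
```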
@gbrayut
Contributor
gbrayut commented Sep 24, 2015

An hour in and the metadata tab started working, so I think it is just an scollector metadata issue.

The two issues mentioned above are still occurring: series type = auto, and the /api/metadata/get?metric=bosun.collect.sent route missing rate details.

@gbrayut
Contributor
gbrayut commented Sep 24, 2015

Confirmed the series type is being set correctly now, but if I query for something that doesn't have rate metadata, I would expect to get the error message indicated above. It now seems to default to gauge instead of warning that you have to select a type manually.

Not sure if you still want comments here or on the other PR.

@captncraig captncraig Converting metadata storage from in-memory to a redis-based model.
677f012
@captncraig
Contributor

@gbrayut this should be the authority now, and should build. Dependencies have been previously vendored.

@captncraig captncraig update redigo
86f7e48
@gbrayut
Contributor
gbrayut commented Sep 25, 2015

Looks good now. The graph tab displays the correct error when there isn't metadata for gauge/counter.

Probably ready to start testing this on branchbosun, just make sure to check the host metadata tab to see if anything stops working there. I'm out in Seattle next week but ping me if you run into any issues.

captncraig added some commits Sep 28, 2015
@captncraig captncraig not deleting from bolt, just setting flag.
d782773
@captncraig captncraig Setting reds clientname
88f74b0
@captncraig captncraig fix
008f048
@captncraig captncraig merged commit 26acd8a into master Sep 28, 2015

2 checks passed

bosun All checks Passed!
continuous-integration/travis-ci/pr The Travis CI build passed
@captncraig captncraig deleted the ledis branch Nov 4, 2015
@krutaw
krutaw commented Mar 4, 2016

So, just to make sure I'm understanding this correctly: if we were to configure Bosun to interface with Redis, we'd be able to share the data across multiple Bosun instances and thus have a team of Bosun instances handling the work? Or am I completely off here?

@kylebrandt
Member

@krutaw Redis does not give us clustering. It does position us to have a redis replica and an instance of bosun that only reads the state. I don't think we were looking towards active-active; the redis read-only replica would seem to be the next logical step for us, but we are not there yet.

The main reason we brought in redis was performance. We had everything in big blobs, and lock times would cause 30 second delays in places as we started to grow our instance.

@krutaw
krutaw commented Mar 4, 2016

That makes a lot of sense. Is that something that is on the radar? Or, more to the point, how do you guys handle the whole "High Availability" question regarding Bosun internally?

@kylebrandt
Member

Currently manual failover and backups. Bosun also has to restart to change the config, which causes about a 20 second gap as it loads all the last data points from redis (although if you don't index your data to bosun, this doesn't matter).

In general though, I posted this earlier this week to show what we do at Stack: http://kbrandt.com/post/bosun_arch/

@krutaw
krutaw commented Mar 4, 2016

AWESOME post, seriously, thank you. So quick question, how do you detect when the bosun instance needs to be rebuilt from backup/restarted/etc?
