Configuring Graphite for StatsD
-------------------------------

Many users have been confused to see their hit counts averaged, go missing when the data is intermittent, or never get stored when statsd sends at a different interval than Graphite expects. Careful setup of Graphite, as suggested below, should help alleviate all of these issues. When configuring Graphite, the two main factors to consider are:

1. What is the highest resolution of data points kept by Graphite, and at which points in time is data downsampled to lower resolutions? This decision is by nature directly related to your functional requirements: how far back should you keep data, and what data resolution do you actually need? However, the retention rules you set must also be in sync with statsd.

2. How should data be aggregated when downsampled, in order to correctly preserve its meaning? Graphite of course knows nothing of the 'meaning' of your data, so let's explore the correct setup for the various metrics sent by statsd.

### Storage Schemas

To define retention and downsampling which match your needs, edit Graphite's conf/storage-schemas.conf file. Here is a simple example file that would handle all metrics sent by statsd:

    [stats]
    pattern = ^stats.*
    retentions = 10s:6h,1min:6d,10min:1800d

This translates to: for all metrics starting with 'stats' (i.e. all metrics sent by statsd), capture:

* 6 hours of 10 second data (what we consider "near-realtime")
* 6 days of 1 minute data
* about 5 years (1800 days) of 10 minute data

These settings have been a good tradeoff so far between file size (database files are fixed size) and the data we care about. Each "stats" database file is about 3.2 MB with these retentions.

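The file size quoted above can be sanity-checked with a little arithmetic. The sketch below assumes Whisper's on-disk layout uses 12 bytes per data point, a 16-byte file header, and 12 bytes of metadata per archive; those sizes and all helper names are assumptions made for illustration, not values taken from this document.

```python
# Rough size estimate for one Whisper file under the retentions above.
# The byte sizes are assumptions about Whisper's layout, not from this doc.
POINT_SIZE = 12         # assumed: 4-byte timestamp + 8-byte double
HEADER_SIZE = 16        # assumed: fixed file metadata
ARCHIVE_INFO_SIZE = 12  # assumed: per-archive metadata

UNITS = {"s": 1, "min": 60, "h": 3600, "d": 86400}


def to_seconds(spec):
    """Convert a spec such as '10s', '1min' or '1800d' to seconds."""
    # Try longer suffixes first so 'min' is checked before single letters.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if spec.endswith(suffix):
            return int(spec[: -len(suffix)]) * UNITS[suffix]
    raise ValueError("unknown time spec: " + spec)


def points_per_archive(retentions):
    """Return the number of slots in each 'precision:duration' archive."""
    counts = []
    for archive in retentions.split(","):
        precision, duration = archive.split(":")
        counts.append(to_seconds(duration) // to_seconds(precision))
    return counts


def estimated_file_size(retentions):
    """Estimate the file size in bytes under the assumed layout."""
    counts = points_per_archive(retentions)
    return HEADER_SIZE + ARCHIVE_INFO_SIZE * len(counts) + POINT_SIZE * sum(counts)


retentions = "10s:6h,1min:6d,10min:1800d"
print(points_per_archive(retentions))
print(estimated_file_size(retentions) / 1e6, "MB (approx.)")
```

Under these assumptions the three archives hold 2160, 8640 and 259200 points, and the estimate comes to roughly 3.24 MB, in line with the figure quoted above.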
Retentions are read from the file in order, and the first pattern that matches is used. Graphite stores each metric in its own database file, and the retentions take effect when a metric file is first created. This means that changing this config file does not affect any files already created. To view or alter the settings on existing files, use whisper-info.py and whisper-resize.py, included with the Whisper package.

#### Correlation with statsd's flush interval:

In the case of the above example, what would happen if you flush from statsd any faster than every 10 seconds? In that case, multiple values for the same metric may reach Graphite within any given 10-second timespan, and only the last value would take hold and be persisted, so part of your data would be lost immediately.

To fix that, simply ensure your flush interval is at least as long as the highest-resolution retention. However, a long interval may cause other unfortunate mishaps, so keep reading, since it pays to understand what's really going on.

(Note: Older versions of Graphite do not support the human-readable time format shown above.)

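To make the data loss concrete, here is a minimal simulation. It assumes Graphite keeps exactly one value per 10-second slot and that a later write to the same slot replaces the earlier one, as described above. The function name and the sample values are hypothetical.

```python
def store(points, precision=10):
    """Simulate one-value-per-slot storage: the last write to a slot wins."""
    slots = {}
    for timestamp, value in points:
        slot = timestamp - (timestamp % precision)  # align to the slot start
        slots[slot] = value                         # replaces any earlier value
    return slots


# statsd flushing every 5 seconds: two flushes land in each 10-second slot.
flushes = [(0, 4), (5, 6), (10, 3), (15, 7)]
stored = store(flushes)

print(stored)
print("sent:", sum(v for _, v in flushes), "stored:", sum(stored.values()))
```

In this run, 20 hits were sent but only 13 survive, because the first flush in each slot is overwritten. With a 10-second flush interval, each flush gets its own slot and nothing is replaced.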
### Storage Aggregation

The next step is ensuring your data isn't corrupted or discarded when downsampled. Continuing with the example above, take for instance the downsampling of the .mean values calculated for all statsd timers:

Graphite should downsample up to 6 samples representing 10-second mean values into a single value signifying the mean for a 1-minute timespan. This is simple: just average all samples to get the new value, which is exactly the default method applied by Graphite. However, what about the .count metric also sent for timers? Each sample contains the count of occurrences per flush interval, so you want these samples summed up, not averaged!

You would not even notice any problem until you look at a graph for data older than 6 hours, since Graphite needs only the high-res 10-second samples to render the first 6 hours, but has to switch to lower-resolution data to render a longer timespan.

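The difference between the two methods is easy to see on a single rollup. The sketch below takes six hypothetical 10-second samples and collapses them into one 1-minute value, once by averaging and once by summing. The sample values are made up for illustration.

```python
# Six 10-second samples that roll up into one 1-minute value (made-up data).
mean_samples = [120.0, 80.0, 100.0, 90.0, 110.0, 100.0]  # timer .mean, in ms
count_samples = [3, 0, 5, 2, 1, 4]                       # timer .count per flush


def average(values):
    return sum(values) / len(values)


# Averaging is a reasonable rollup for the .mean metric.
print("mean rolled up by average:", average(mean_samples))

# For .count, averaging understates the number of events in the minute.
print("count rolled up by average:", average(count_samples))
print("count rolled up by sum:", sum(count_samples))
```

Here 15 events occurred during the minute, but an averaged rollup would report 2.5.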
Two other metric kinds also deserve a note:

* Counts which are normalized by statsd to signify a per-second count should not be summed, since their meaning does not change when downsampling.

* Metrics for minimum/maximum values should not be averaged but rather preserve the lowest/highest point, respectively.

Let's see now how to configure downsampling in Graphite's conf/storage-aggregation.conf:

    [min]
    pattern = \.lower$
    xFilesFactor = 0.1
    aggregationMethod = min

    [max]
    pattern = \.upper(_\d+)?$
    xFilesFactor = 0.1
    aggregationMethod = max

    [sum]
    pattern = \.sum$
    xFilesFactor = 0
    aggregationMethod = sum

    [count]
    pattern = \.count$
    xFilesFactor = 0
    aggregationMethod = sum

    [count_legacy]
    pattern = ^stats_counts.*
    xFilesFactor = 0
    aggregationMethod = sum

    [default_average]
    pattern = .*
    xFilesFactor = 0.3
    aggregationMethod = average

This means:

* For metrics ending with '.lower' or '.upper' (these are sent for all timers), keep only the minimum or maximum value, respectively, when rolling up data, and store a None if less than 10% of the datapoints were received.
* For metrics ending with '.count' or '.sum', or those under 'stats_counts', add all the values together, and store a None only if none of the datapoints were received. This captures all non-normalized counters, but ignores the per-second ones.
* For all other databases, average the values (mean) when rolling up data, and store a None if less than 30% of the datapoints were received.

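One way to check which rule a given metric name falls under is to replay the matching by hand. The sketch below assumes rules are tried in file order with a regular-expression search and that the first match wins. The metric names are hypothetical examples, and the matching details are an assumption rather than a description of Graphite's internals.

```python
import re

# (section, pattern, xFilesFactor, aggregationMethod), in file order.
RULES = [
    ("min", r"\.lower$", 0.1, "min"),
    ("max", r"\.upper(_\d+)?$", 0.1, "max"),
    ("sum", r"\.sum$", 0.0, "sum"),
    ("count", r"\.count$", 0.0, "sum"),
    ("count_legacy", r"^stats_counts.*", 0.0, "sum"),
    ("default_average", r".*", 0.3, "average"),
]


def rule_for(metric):
    """Return the first rule whose pattern matches the metric name."""
    for rule in RULES:
        if re.search(rule[1], metric):
            return rule
    return None


# Hypothetical metric names, for illustration only.
for name in [
    "stats.timers.api.lower",
    "stats.timers.api.upper_90",
    "stats.timers.api.count",
    "stats_counts.logins",
    "stats.counters.logins.rate",
    "stats.timers.api.mean",
]:
    section, _, xff, method = rule_for(name)
    print(f"{name}: [{section}] {method}, xFilesFactor={xff}")
```

Note how the per-second `.rate` counter and the timer `.mean` both fall through to the default averaging rule, while the per-flush counts are summed.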
Pay close attention to xFilesFactor: if your flush interval is longer than the highest-resolution retention, so that some slots stay empty and there are not enough samples to satisfy this minimum factor, your data would simply be lost in the first downsampling cycle. However, setting a very low factor would also produce a misleading result, since you would probably agree that if you only have a single 10-second mean value sample reported in a 10-minute timeframe, this single sample alone should not normally be downsampled into a 10-minute mean value. For counts, however, every count should count ;-), hence the zero factor.

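The effect of xFilesFactor can be sketched as a small function. This is an illustration under stated assumptions, not Graphite's actual code: a rollup is stored only if at least one sample is known and the fraction of known samples is at least xFilesFactor. The function names and sample values are hypothetical.

```python
def rollup(samples, x_files_factor, method):
    """Aggregate one window of samples, or return None if too few are known."""
    known = [s for s in samples if s is not None]
    if not known:
        return None  # nothing was received at all
    if len(known) / len(samples) < x_files_factor:
        return None  # too sparse to produce a meaningful value
    return method(known)


def mean(values):
    return sum(values) / len(values)


# A flush every 60 seconds fills only 1 of the 6 ten-second slots per minute.
sparse = [42.0, None, None, None, None, None]

print(rollup(sparse, 0.3, mean))     # 1/6 is below 0.3, so the value is dropped
print(rollup(sparse, 0.0, sum))      # a zero factor keeps the single sample
print(rollup([None] * 6, 0.0, sum))  # no samples at all still yields None
```

With a factor of 0.3, at least 2 of the 6 slots must be filled before a 1-minute value is written, which is why a long flush interval can silently discard averaged metrics.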
**Notes:**

1. A '.count' metric is calculated for all timers, but up to and including v0.5.0, non-normalized counters are written under stats_counts, not under stats.counters as you might expect. Post-0.5.0, if you set legacyNamespace=false in the config, counters are indeed written under stats.counters, in two variations: per-second counts under stats.counters.\<name\>.*rate*, and non-normalized per-flush counts under stats.counters.\<name\>.*count*. Hence, the rules above handle counts for both timers and legacy/non-legacy counters.

2. Upper and lower values are also calculated for the n-percentile value defined for timers. Apart from the optional percentile suffix accepted by the [max] pattern, the above example does not include rules for these, for brevity and performance.

Similar to retentions, the aggregations in effect for any metric are set once the metric is first received, so a change to these settings would not affect existing metrics.

### In conclusion

Graphite's handling of your statsd metrics should be verified at least once: Is data mysteriously lost at any point? Is data downsampled properly? Are you defining graphs for counter metrics without knowing what timespan each y-value actually represents? (Admittedly, in some cases you may not even care about the y-values in the graph, as only the trend is of any interest. The coolest graphs seem to always lack y-values...)

For more information, see: http://graphite.readthedocs.org/en/latest/config-carbon.html