-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding ParserSpec for Influx Line Protocol #5440
Conversation
return new Double(raw); | ||
} | ||
|
||
// TODO: Support returning numeric value? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would allow true
/false
to be converted to an int 1
/0
, which would exposing an additional config param and might not be very useful. If you think it's necessary / highly useful, I could add it but my first inclination is to wait and see if anyone needs it before complicating the code and config. If it would be sufficient to just remove the TODO that would be my preference.
could you please fill CLA http://druid.io/community/cla.html ? |
Hi @b-slim, thanks for the quick feedback. I will work with my employer (Target) to get the ball rolling on signing the corporate CLA. It might take a while but I will let you know as soon as I hear from the authorities. |
|
||
private void parseTimestamp(String timestamp, Map<String, Object> dest) | ||
{ | ||
if (timestamp.length() < 7) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment about the nanosecond->millisecond conversion here?
Some other comments:
- Should timestamps under 1 millisecond be treated as invalid? To me it seems like 0 would be a better return value for that case, or rounding the number to nearest milli
- The truncation could let bad timestamps slip by, e.g., if I got a timestamp "152036516185300abcd", this is probably bad data but this conversion would allow that timestamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @jon-wei; I'll add comment and improve default handling there. Still waiting on the Man to sign the CLA 😁; should be pretty soon.
- Remove extraneous TODO - Better handling of parse errors (e.g. invalid timestamp) - Handle sub-millisecond timestamps
CLA was signed and submitted; I made one additional commit to address PR feedback (removed todo, better handling of short / invalid timestamps) but did not squash to make it easier to see what changed -- I can squash if desired. |
super cool, @njhartwell can you please make sure to add documentation for the extension? |
), | ||
testCase( | ||
"truncated timestamp", | ||
"foo,region=us-east-1,host=127.0.0.1 m=1.0,n=3.0,o=500i 123", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a timestamp like "500i 123" should throw a ParseException instead of being treated as 0, can you add a check on the truncated portion to make sure that it's also numeric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 500i there is the (integer) value for the measurement named o
and 123
is a valid timestamp. Invalid timestamps are caught by the parser since a valid line has to end with a number, newline or EOF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, sorry, misread it, I see you have a case already for that type of failure in testParseFailures
LGTM, I'll merge after docs, the merge will auto-squash |
Thanks @jon-wei! Just added docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs nits.
|
||
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-influx-extensions`. | ||
|
||
This extension enables Druid to parse [InfluxDB Line Protocol](https://docs.influxdata.com/influxdb/v1.5/write_protocols/line_protocol_tutorial/), a popular text-based timeseries metric serialization format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps "parse the InfluxDB Line Protocol" would read slightly better.
```cpu,application=dbhost=prdb123,region=us-east-1 usage_idle=99.24,usage_user=0.55 1520722030000000000``` | ||
|
||
which contains four parts: | ||
- measurement: A string indicating what kind of measurement is represented (e.g. cpu, network, web_requests) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would stick with "name" over "kind" here. It's also the nomenclature in the Influx docs that you linked above.
- measurements: one or more key-value pairs; values can be numeric, boolean, or string | ||
- timestamp: nanoseconds since Unix epoch (the parser truncates it to milliseconds) | ||
|
||
The parser extracts these fields into a map, giving the measurement the key `measurement` and the timestamp the key `_ts`. The tag and measurement keys are copied verbatum, so users should take care to avoid name collisions. It is up to the ingestion spec to decide which fields should be treated as dimensions and which should be treated as metrics (typically tags correspond to dimensions and values correspond to measurements). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/verbatum/verbatim/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"metrics" and "measurements" seem to be used interchangeably here but I believe it should always be "metrics" no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @josephglanville. Fixed first three as you suggested. measurements
is the term used in the influx line protocol to refer to the second set of k/v pairs and I stuck with that in the docs here to avoid ambiguity with Druid metrics
since they don't necessarily line up--you might have some influx measurements that you ingest as dimensions. That suggestion did lead me to fix this line "typically tags correspond to dimensions and values correspond to measurements" which I updated to "typically tags correspond to dimensions and measurements correspond to metrics".
LGTM. 👍 |
@jon-wei do you need anything else on this or can it be merged? Thanks! |
@njhartwell thanks for the contrib! |
Hi @njhartwell, can you please fill out the Druid CLA here: http://druid.io/community/cla.html |
Just did, although this contribution is on behalf of Target Corp. which I am told has already signed the corporate CLA. |
Got it, thanks for making that clear. |
|
||
private Object parseQuotedString(String text) | ||
{ | ||
return text.substring(1, text.length() - 1).replaceAll("\\\\\"", "\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right number of back slashes (5)? Maybe just 4 needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 4 slashes is a syntax error. The first four are read into the java string as two backlashes, and the fifth is paired with the double quote, so that the string passed to the regex library is \\"
which matches the string \"
, i.e. an escaped quote.
This is a fairly complete parser for the current version of Influx Line Protocol (https://docs.influxdata.com/influxdb/v1.4/write_protocols/line_protocol_tutorial/) which is a common format for sending time series metric data.