Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ParserSpec for Influx Line Protocol #5440

Merged
merged 4 commits into from
Mar 26, 2018

Conversation

njhartwell
Copy link
Contributor

This is a fairly complete parser for the current version of Influx Line Protocol (https://docs.influxdata.com/influxdb/v1.4/write_protocols/line_protocol_tutorial/) which is a common format for sending time series metric data.

return new Double(raw);
}

// TODO: Support returning numeric value?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would allow true/false to be converted to an int 1/0, which would exposing an additional config param and might not be very useful. If you think it's necessary / highly useful, I could add it but my first inclination is to wait and see if anyone needs it before complicating the code and config. If it would be sufficient to just remove the TODO that would be my preference.

@b-slim
Copy link
Contributor

b-slim commented Feb 28, 2018

could you please fill CLA http://druid.io/community/cla.html ?

@njhartwell
Copy link
Contributor Author

Hi @b-slim, thanks for the quick feedback. I will work with my employer (Target) to get the ball rolling on signing the corporate CLA. It might take a while but I will let you know as soon as I hear from the authorities.


private void parseTimestamp(String timestamp, Map<String, Object> dest)
{
if (timestamp.length() < 7) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about the nanosecond->millisecond conversion here?

Some other comments:

  • Should timestamps under 1 millisecond be treated as invalid? To me it seems like 0 would be a better return value for that case, or rounding the number to nearest milli
  • The truncation could let bad timestamps slip by, e.g., if I got a timestamp "152036516185300abcd", this is probably bad data but this conversion would allow that timestamp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @jon-wei; I'll add comment and improve default handling there. Still waiting on the Man to sign the CLA 😁; should be pretty soon.

- Remove extraneous TODO
- Better handling of parse errors (e.g. invalid timestamp)
- Handle sub-millisecond timestamps
@njhartwell
Copy link
Contributor Author

CLA was signed and submitted; I made one additional commit to address PR feedback (removed todo, better handling of short / invalid timestamps) but did not squash to make it easier to see what changed -- I can squash if desired.

@drcrallen
Copy link
Contributor

super cool, @njhartwell can you please make sure to add documentation for the extension?

),
testCase(
"truncated timestamp",
"foo,region=us-east-1,host=127.0.0.1 m=1.0,n=3.0,o=500i 123",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a timestamp like "500i 123" should throw a ParseException instead of being treated as 0, can you add a check on the truncated portion to make sure that it's also numeric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 500i there is the (integer) value for the measurement named o and 123 is a valid timestamp. Invalid timestamps are caught by the parser since a valid line has to end with a number, newline or EOF.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, sorry, misread it, I see you have a case already for that type of failure in testParseFailures

@jon-wei
Copy link
Contributor

jon-wei commented Mar 12, 2018

LGTM, I'll merge after docs, the merge will auto-squash

@njhartwell
Copy link
Contributor Author

Thanks @jon-wei! Just added docs

Copy link
Contributor

@josephglanville josephglanville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs nits.


To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-influx-extensions`.

This extension enables Druid to parse [InfluxDB Line Protocol](https://docs.influxdata.com/influxdb/v1.5/write_protocols/line_protocol_tutorial/), a popular text-based timeseries metric serialization format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "parse the InfluxDB Line Protocol" would read slightly better.

```cpu,application=dbhost=prdb123,region=us-east-1 usage_idle=99.24,usage_user=0.55 1520722030000000000```

which contains four parts:
- measurement: A string indicating what kind of measurement is represented (e.g. cpu, network, web_requests)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would stick with "name" over "kind" here. It's also the nomenclature in the Influx docs that you linked above.

- measurements: one or more key-value pairs; values can be numeric, boolean, or string
- timestamp: nanoseconds since Unix epoch (the parser truncates it to milliseconds)

The parser extracts these fields into a map, giving the measurement the key `measurement` and the timestamp the key `_ts`. The tag and measurement keys are copied verbatum, so users should take care to avoid name collisions. It is up to the ingestion spec to decide which fields should be treated as dimensions and which should be treated as metrics (typically tags correspond to dimensions and values correspond to measurements).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/verbatum/verbatim/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"metrics" and "measurements" seem to be used interchangeably here but I believe it should always be "metrics" no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @josephglanville. Fixed first three as you suggested. measurements is the term used in the influx line protocol to refer to the second set of k/v pairs and I stuck with that in the docs here to avoid ambiguity with Druid metrics since they don't necessarily line up--you might have some influx measurements that you ingest as dimensions. That suggestion did lead me to fix this line "typically tags correspond to dimensions and values correspond to measurements" which I updated to "typically tags correspond to dimensions and measurements correspond to metrics".

@josephglanville
Copy link
Contributor

LGTM. 👍

@njhartwell
Copy link
Contributor Author

@jon-wei do you need anything else on this or can it be merged? Thanks!

@jon-wei jon-wei merged commit ea30c05 into apache:master Mar 26, 2018
@jon-wei
Copy link
Contributor

jon-wei commented Mar 26, 2018

@njhartwell thanks for the contrib!

@gianm
Copy link
Contributor

gianm commented Mar 27, 2018

Hi @njhartwell, can you please fill out the Druid CLA here: http://druid.io/community/cla.html

@njhartwell
Copy link
Contributor Author

Just did, although this contribution is on behalf of Target Corp. which I am told has already signed the corporate CLA.

@gianm
Copy link
Contributor

gianm commented Mar 27, 2018

Got it, thanks for making that clear.

@njhartwell njhartwell deleted the influx-parser-spec branch July 25, 2018 21:00
@dclim dclim added this to the 0.13.0 milestone Oct 8, 2018

private Object parseQuotedString(String text)
{
return text.substring(1, text.length() - 1).replaceAll("\\\\\"", "\"");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right number of back slashes (5)? Maybe just 4 needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 4 slashes is a syntax error. The first four are read into the java string as two backlashes, and the fifth is paired with the double quote, so that the string passed to the regex library is \\" which matches the string \", i.e. an escaped quote.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants