Add stringLast and stringFirst aggregators extension #5789
Conversation
Looks good!
@@ -0,0 +1,41 @@
---
This file's in the extensions-core folder, but the link to it in extensions.md points at extensions-contrib.
You are right!
@JsonCreator
public StringFirstAggregatorFactory(
    @JsonProperty("name") String name,
    @JsonProperty("fieldName") final String fieldName,
It'd be cool if this (and last) supported expressions like the *sum and min/max aggregators
What is the expressions usage? I have seen that:
But I can't find anything in the Druid documentation about expressions. Taking a look at the class:
I think that is another way to define the fieldName.
Could you give me some documentation about that, or some example?
I'm afraid there's no comprehensive documentation at the moment. I haven't been digging into Druid's source for long, so feel free to ignore me!
There's some documentation on the language itself here: https://github.com/druid-io/druid/blob/master/docs/content/misc/math-expr.md
I think adding support for it would mean making the constructor of this class similar to LongSumAggregatorFactory's, taking an injected macroTable and a string for the expression.
Then in your factorize methods you'd check for an expression; if one is present, you'd return an ExprEvalSelector instead of the result of metricFactory.makeColumnValueSelector.
Making the ExprEvalSelector would look something like this:
final Expr expr = Parser.parse(fieldExpression, macroTable);
return ExpressionSelectors.makeExprEvalSelector(metricFactory, expr);
Druid's expression system is not documented yet because it was an experimental feature (but maybe it's time to document it).
I think we can add an expression field here later.
@andresgomezfrr thanks for the contribution! I'm reviewing this PR.
@andresgomezfrr thanks for the PR. I left some comments.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializablePairSerde extends ComplexMetricSerde
Would you please add some Javadoc for this class? It should describe the value types to be serialized/deserialized and where this class is used.
public class SerializablePairSerde extends ComplexMetricSerde
{

  public SerializablePairSerde()
Unnecessary class constructor.
@Override
public String getTypeName()
{
  return "serializablePairLongString";
Can we define this string as a static variable somewhere and use it?
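A minimal sketch of that suggestion; the constant name and holder class are hypothetical, not from the patch:

```java
public class TypeNameSketch
{
  // Hypothetical shared constant; the real patch would put this somewhere
  // both the serde and the aggregator factories can reference.
  public static final String TYPE_NAME = "serializablePairLongString";

  public static String getTypeName()
  {
    return TYPE_NAME; // instead of repeating the string literal at each use site
  }

  public static void main(String[] args)
  {
    System.out.println(getTypeName());
  }
}
```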
@Override
public ObjectStrategy getObjectStrategy()
{
  return new ObjectStrategy<SerializablePair>()
Please specify type parameters.
I will create a new class SerializablePairLongString, because if we specify type parameters I can't write SerializablePair<Long, String>.class.
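The restriction being worked around here is that Java has no class literal for a parameterized type (`SerializablePair<Long, String>.class` does not compile), while a concrete subclass captures the type arguments and does have a usable `.class` token. A self-contained sketch with a stand-in pair type:

```java
public class PairLiteralSketch
{
  // Stand-in for Druid's SerializablePair.
  static class SerializablePair<L, R>
  {
    final L lhs;
    final R rhs;

    SerializablePair(L lhs, R rhs)
    {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  // A concrete subtype pins down <Long, String>, so its class literal
  // can be passed wherever a Class token is required.
  static class SerializablePairLongString extends SerializablePair<Long, String>
  {
    SerializablePairLongString(Long lhs, String rhs)
    {
      super(lhs, rhs);
    }
  }

  public static void main(String[] args)
  {
    Class<SerializablePairLongString> clazz = SerializablePairLongString.class;
    System.out.println(clazz.getSimpleName());
  }
}
```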
return new ObjectStrategy<SerializablePair>()
{
  @Override
  public int compare(SerializablePair o1, SerializablePair o2)
Please specify type parameters.
long lastTime = mutationBuffer.getLong(position);
if (time >= lastTime) {
  byte[] valueBytes = lastString.getBytes(StandardCharsets.UTF_8);
lastString is nullable. You should check whether it's null.
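A hedged sketch of the null guard being asked for, isolated from the surrounding buffer logic (the empty-array fallback is an assumption for the example, not necessarily what the patch should do):

```java
import java.nio.charset.StandardCharsets;

public class NullGuardSketch
{
  // Stand-in for the aggregator's serialization step: treat a null
  // lastString as an empty payload instead of throwing an NPE.
  static byte[] toBytes(String lastString)
  {
    return lastString == null
           ? new byte[0]
           : lastString.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    System.out.println(toBytes("ab").length);
  }
}
```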
byte[] valueBytes = new byte[stringSizeBytes];
mutationBuffer.position(position + Long.BYTES + Integer.BYTES);
mutationBuffer.get(valueBytes, 0, stringSizeBytes);
serializablePair = new SerializablePair<>(timeValue, new String(valueBytes, StandardCharsets.UTF_8));
Please use StringUtils.toUtf8() instead.
public void aggregate()
{
  SerializablePair<Long, String> pair = (SerializablePair<Long, String>) selector.getObject();
  if (pair.lhs >= lastTime) {
pair can be null.
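One way to sketch the guard, with stand-in types rather than Druid's real aggregator classes (keeping the current value when the selector returns nothing):

```java
public class PairGuardSketch
{
  // Stand-in for SerializablePair<Long, String>.
  static class Pair
  {
    final long lhs;
    final String rhs;

    Pair(long lhs, String rhs)
    {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  // Returns the updated "last" pair, keeping the current one when the
  // incoming pair is null or older. The null check comes first, which
  // is the reviewer's point.
  static Pair aggregate(Pair current, Pair incoming)
  {
    if (incoming == null) {
      return current;
    }
    if (current == null || incoming.lhs >= current.lhs) {
      return incoming;
    }
    return current;
  }

  public static void main(String[] args)
  {
    Pair kept = aggregate(new Pair(5, "a"), null);
    System.out.println(kept.rhs);
  }
}
```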
SerializablePair<Long, String> pair = (SerializablePair<Long, String>) selector.getObject();
long lastTime = mutationBuffer.getLong(position);
if (pair.lhs >= lastTime) {
pair can be null.
long lastTime = mutationBuffer.getLong(position);
if (pair.lhs >= lastTime) {
  mutationBuffer.putLong(position, pair.lhs);
  byte[] valueBytes = pair.rhs.getBytes(StandardCharsets.UTF_8);
Please use StringUtils.toUtf8() instead.
I have one more comment. Can we move this to druid core (i.e.,
I did all the improvements and refactors. Now we will start to move the source to the druid core. @jihoonson thanks for the code review!! 😃
@andresgomezfrr thank you for the quick fix!
long time = timeSelector.getLong();
if (time < firstTime) {
  firstTime = time;
  Object value = valueSelector.getObject();
I mean, value might accidentally be another type because of bugs. We need to add a sanity check.
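A sketch of that sanity check: verify the selector's object is the expected type before casting, failing early with a descriptive message instead of a ClassCastException downstream. The stand-in type is invented for the example:

```java
public class SanityCheckSketch
{
  // Stand-in for the expected value type (e.g. a SerializablePair).
  static class SerializablePairStandIn
  {
  }

  // Null passes through; any other unexpected type fails loudly.
  static SerializablePairStandIn checkedValue(Object value)
  {
    if (value != null && !(value instanceof SerializablePairStandIn)) {
      throw new IllegalStateException(
          "Unexpected value type: " + value.getClass().getName());
    }
    return (SerializablePairStandIn) value;
  }

  public static void main(String[] args)
  {
    System.out.println(checkedValue(new SerializablePairStandIn()) != null);
  }
}
```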
@andresgomezfrr thanks. Please check my comment here. Also, this unit test failure looks legitimate.
Fixed the test, I didn't see it, sorry! Removed the
@andresgomezfrr thanks for the quick fix! I left my last comments.
@Override
public List<AggregatorFactory> getRequiredColumns()
{
  return Arrays.asList(new StringLastAggregatorFactory(fieldName, fieldName, maxStringBytes));
nit: can use Collections.singletonList() instead.
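For reference, the difference is that Collections.singletonList builds an immutable one-element list without the varargs array that Arrays.asList allocates. A minimal sketch (method and field names are illustrative only):

```java
import java.util.Collections;
import java.util.List;

public class SingletonListSketch
{
  // Illustrative replacement for the getRequiredColumns() body above.
  static List<String> requiredColumns(String fieldName)
  {
    return Collections.singletonList(fieldName);
  }

  public static void main(String[] args)
  {
    System.out.println(requiredColumns("last_user").size());
  }
}
```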
👍
@Override
public List<AggregatorFactory> getRequiredColumns()
{
  return Arrays.asList(new StringFirstAggregatorFactory(fieldName, fieldName, maxStringBytes));
nit: can use Collections.singletonList() instead.
👍
@Override
public void fold(ColumnValueSelector selector)
{
  if (firstString == null) {
It looks like this is checking whether reset() has been called or not. But firstValue can be null even when reset() has been called, because selector.getObject() can return null. I think we need an isReset flag to check this.
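A hedged sketch of that suggestion: track whether the combiner has seen a value with an explicit flag, since a null first value is legitimate. The Combiner class here is a stand-in, not Druid's AggregateCombiner:

```java
public class IsResetSketch
{
  static class Combiner
  {
    private boolean isReset = false; // becomes true once reset() runs
    private Object firstValue;

    void reset(Object selectorValue)
    {
      firstValue = selectorValue; // may legitimately be null
      isReset = true;
    }

    void fold(Object selectorValue)
    {
      if (!isReset) { // not "firstValue == null"
        reset(selectorValue);
      }
      // otherwise keep the existing firstValue
    }

    Object get()
    {
      return firstValue;
    }
  }

  public static void main(String[] args)
  {
    Combiner c = new Combiner();
    c.fold(null);    // first fold acts as reset, even when the value is null
    c.fold("later"); // must NOT overwrite the null first value
    System.out.println(c.get());
  }
}
```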
Yeah, that's true. Good point!
The latest change looks good to me. Thanks @andresgomezfrr!
Does stringFirst actually work at ingestion time? The doc change made here (changing the existing claim that first/last aggregators don't work at ingestion time to say that only numeric ones don't), plus the implementation of makeAggregateCombiner, makes it seem like it should. When I define a Kafka indexing service data source with a stringFirst aggregator, I can properly query the metric against data as the indexing task indexes it. But the indexing task's publish stage fails (in 0.13) with errors like:
Has anyone actually successfully used stringFirst at ingestion time?
Yes, I use it at indexing time. Could you share your ingestion spec and some example input data?
OK, I attempted a trivial reproduction by working through the Kafka stream tutorial but removing
This worked just fine (including actually publishing), so it's unclear what happened when I ran it in our cluster. In our cluster, this was the ingestion spec. Note that we use a custom parser implementation which parses some protobufs into MapBasedInputRow, but it should just end up mapping
I'll keep investigating, but definitely curious to hear if there's anything obviously strange here!
I don't really understand the variety of slightly different entry points involved in the metric aggregation process, but it seems like if StringFirstAggregateCombiner used the same logic as
I tried running this again with DEBUG logging enabled but nothing obvious showed up. Same stack trace. Is there a good way to poke at the persisted files while a task runs and see if they're in the right format?
@glasser Hmm, I just noticed the AggregateCombiner type is a
@gianm I could believe that — it would explain why some small reproductions I tried worked but running in our QA cluster didn't. Though I don't know what a spill file is :) |
@glasser By 'spill file' I mean the files that get written to disk every maxRowsInMemory or intermediatePersistPeriod (and are merged later into a single segment) |
Hmm, I'm not sure if that's exactly it. I've been trying the standard quickstart Kafka ingestion example with this supervisor:
Note the maxRowsInMemory: 3000, which is less than the number of rows in wikiticker-2015-09-12-sampled.json. (I tried setting it to just 1, but that leads to OOMs.) This job runs successfully. I should probably try with just an index task instead of Kafka to make it simpler, though.
Yeah, running this task against a fresh download of 0.13-incubating succeeds, even though I would think it would need to invoke AggregateCombiner?
Oh hmm.
Oh, right. We need to actually make things roll up with each other, so set a non-trivial queryGranularity. I now have an actual reproduction so I'll open an issue: #7243. |
@glasser @andresgomezfrr Did you happen to find any workaround for this issue? Or is it solved in any of the latest versions? I am facing this exact issue when using stringLast aggregation during ingestion. |
I didn't have time to look into this and we switched to working on migrating this particular weird data source away from Druid instead. Somebody told me this might have been fixed though. |
Hi all,
This PR contains a Druid extension module that adds stringLast and stringFirst aggregators.