
Add stringLast and stringFirst aggregators extension #5789

Merged

Conversation

Contributor

@andresgomezfrr andresgomezfrr commented May 20, 2018

Hi all,

This PR contains a Druid extension module that adds stringLast and stringFirst aggregators.

Contributor

@dylwylie dylwylie left a comment

Looks good!

@@ -0,0 +1,41 @@
---
Contributor

This file's in the extensions-core folder, but the link to it in extensions.md points at extensions-contrib.

Contributor Author

You are right!

@JsonCreator
public StringFirstAggregatorFactory(
@JsonProperty("name") String name,
@JsonProperty("fieldName") final String fieldName,
Contributor

It'd be cool if this (and last) supported expressions like the *sum and min/max aggregators

Contributor Author

What is the expression usage? I have seen this:

https://github.com/druid-io/druid/blob/master/processing/src/main/java/io/druid/query/aggregation/DoubleSumAggregatorFactory.java#L42

But I can't find anything in the Druid documentation about expressions. Taking a look at the class:

https://github.com/druid-io/druid/blob/master/processing/src/main/java/io/druid/query/aggregation/SimpleDoubleAggregatorFactory.java

I think that is another way to define the fieldName... Could you point me to some documentation about that, or some example?

Contributor

Afraid there's no comprehensive documentation at the moment. I haven't been digging into Druid's source for very long, so feel free to ignore me!

There's some documentation on the language itself here: https://github.com/druid-io/druid/blob/master/docs/content/misc/math-expr.md

I think adding support for it would mean making the constructor of this class similar to LongSumAggregatorFactory's, taking an injected macroTable and a string for the expression.

Then in your factorize methods you'd check for an expression; if it's present, you'd return an ExprEvalSelector instead of the result of metricFactory.makeColumnValueSelector.

Making the ExprEvalSelector would look something like this:

final Expr expr = Parser.parse(fieldExpression, macroTable);
return ExpressionSelectors.makeExprEvalSelector(metricFactory, expr);
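
Putting those pieces together, a rough sketch of an expression-aware selector method follows; fieldName, fieldExpression, and macroTable are assumed fields mirroring LongSumAggregatorFactory, so treat this as an illustration rather than code from this PR:

// Illustrative sketch only; assumes nullable fieldName/fieldExpression fields and an
// injected ExprMacroTable, mirroring LongSumAggregatorFactory. Not the PR's code.
private ColumnValueSelector<?> makeStringSelector(ColumnSelectorFactory metricFactory)
{
  if (fieldName != null) {
    // Plain column reference, as the factory does today.
    return metricFactory.makeColumnValueSelector(fieldName);
  } else {
    // Expression-based input, as suggested above.
    final Expr expr = Parser.parse(fieldExpression, macroTable);
    return ExpressionSelectors.makeExprEvalSelector(metricFactory, expr);
  }
}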

Contributor

Druid's expression system is not documented yet because it was an experimental feature (but maybe it's time to document it).

I think we can add an expression field here later.

@andresgomezfrr changed the title from "Add lastString and firstString aggregators extension" to "Add stringLast and stringFirst aggregators extension" on May 29, 2018
@jihoonson
Contributor

@andresgomezfrr thanks for the contribution! I'm reviewing this PR.

Contributor

@jihoonson jihoonson left a comment

@andresgomezfrr thanks for the PR. I left some comments.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializablePairSerde extends ComplexMetricSerde
Contributor

Would you please add some Javadoc for this class? It should describe the value types being serialized/deserialized and where this class is used.

public class SerializablePairSerde extends ComplexMetricSerde
{

public SerializablePairSerde()
Contributor

Unnecessary class constructor.

@Override
public String getTypeName()
{
return "serializablePairLongString";
Contributor

Can we define this string as a static variable somewhere and use it?

@Override
public ObjectStrategy getObjectStrategy()
{
return new ObjectStrategy<SerializablePair>()
Contributor

Please specify type parameters.

Contributor Author

I will create a new class SerializablePairLongString, because if we specify type parameters here I can't later write SerializablePair<Long, String>.class.
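
For reference, a minimal sketch of such a concrete subclass is below. Java has no SerializablePair<Long, String>.class literal, so a dedicated subtype gives the ObjectStrategy a usable Class token; the merged class may differ in details (e.g. Jackson annotations for serde).

// Minimal sketch; the actual class may additionally carry @JsonCreator/@JsonProperty.
public class SerializablePairLongString extends SerializablePair<Long, String>
{
  public SerializablePairLongString(Long lhs, String rhs)
  {
    super(lhs, rhs);
  }
}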

return new ObjectStrategy<SerializablePair>()
{
@Override
public int compare(SerializablePair o1, SerializablePair o2)
Contributor

Please specify type parameters.


long lastTime = mutationBuffer.getLong(position);
if (time >= lastTime) {
byte[] valueBytes = lastString.getBytes(StandardCharsets.UTF_8);
Contributor

lastString is nullable. You should check whether it's null.

byte[] valueBytes = new byte[stringSizeBytes];
mutationBuffer.position(position + Long.BYTES + Integer.BYTES);
mutationBuffer.get(valueBytes, 0, stringSizeBytes);
serializablePair = new SerializablePair<>(timeValue, new String(valueBytes, StandardCharsets.UTF_8));
Contributor

Please use StringUtils.toUtf8() instead.

public void aggregate()
{
SerializablePair<Long, String> pair = (SerializablePair<Long, String>) selector.getObject();
if (pair.lhs >= lastTime) {
Contributor

pair can be null.


SerializablePair<Long, String> pair = (SerializablePair<Long, String>) selector.getObject();
long lastTime = mutationBuffer.getLong(position);
if (pair.lhs >= lastTime) {
Contributor

pair can be null.

long lastTime = mutationBuffer.getLong(position);
if (pair.lhs >= lastTime) {
mutationBuffer.putLong(position, pair.lhs);
byte[] valueBytes = pair.rhs.getBytes(StandardCharsets.UTF_8);
Contributor

Please use StringUtils.toUtf8() instead.

@jihoonson
Contributor

I have one more comment. Can we move this to Druid core (i.e., druid-processing) rather than extensions-contrib, like the first/last aggregators for Long/Float/Double? I think it's worthwhile.

@andresgomezfrr
Contributor Author

I made all the improvements and refactors. Now we will start to move the source into Druid core.

@jihoonson thanks for the code review!! 😃

Contributor

@jihoonson jihoonson left a comment

@andresgomezfrr thank you for the quick fix!

long time = timeSelector.getLong();
if (time < firstTime) {
firstTime = time;
Object value = valueSelector.getObject();
Contributor

I mean, value might accidentally be another type because of bugs. We need to add a sanity check.
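
As an illustration of the kind of sanity check meant here (a sketch, not the PR's exact code):

// Sketch of a type sanity check; ISE is Druid's IllegalStateException helper.
final Object value = valueSelector.getObject();
if (value != null && !(value instanceof String)) {
  throw new ISE("Unexpected value type [%s], expected String", value.getClass().getName());
}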

@jihoonson
Contributor

@andresgomezfrr thanks. Please check my comment here.

Also, this unit test failure looks legitimate.

Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec <<< FAILURE! - in io.druid.query.aggregation.first.StringFirstAggregationTest
testStringFirstAggregateCombiner(io.druid.query.aggregation.first.StringFirstAggregationTest)  Time elapsed: 0.002 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:<[AAAA]> but was:<[BBBB]>
	at org.junit.Assert.assertEquals(Assert.java:115)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at io.druid.query.aggregation.first.StringFirstAggregationTest.testStringFirstAggregateCombiner(StringFirstAggregationTest.java:166)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runners.Suite.runChild(Suite.java:127)
	at org.junit.runners.Suite.runChild(Suite.java:26)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeLazy(JUnitCoreWrapper.java:119)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:87)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75)
	at org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:161)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)

@andresgomezfrr
Contributor Author

Fixed the test, I didn't see it, sorry!

Removed the filterNullValues property. Instead, you can use the aggregator wrapped in a FilteredAggregator at ingestion time. Thanks @jihoonson

Contributor

@jihoonson jihoonson left a comment

@andresgomezfrr thanks for the quick fix! I left my last comments.

@Override
public List<AggregatorFactory> getRequiredColumns()
{
return Arrays.asList(new StringLastAggregatorFactory(fieldName, fieldName, maxStringBytes));
Contributor

nit: can use Collections.singletonList() instead.

Contributor Author

👍

@Override
public List<AggregatorFactory> getRequiredColumns()
{
return Arrays.asList(new StringFirstAggregatorFactory(fieldName, fieldName, maxStringBytes));
Contributor

nit: can use Collections.singletonList() instead.

Contributor Author

👍

@Override
public void fold(ColumnValueSelector selector)
{
if (firstString == null) {
Contributor

It looks like this is to check whether reset() has been called or not. But firstString can be null even when reset() has been called, because selector.getObject() can return null. I think we need an isReset flag to check this.

Contributor Author

Yeah, that's true, good point!
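
A minimal sketch of that flag-based approach (an illustration of the suggestion, not necessarily the merged code):

// Sketch: track whether reset() has run, since firstString itself may legitimately be null.
private String firstString;
private boolean isReset = false;

@Override
public void reset(ColumnValueSelector selector)
{
  firstString = (String) selector.getObject();
  isReset = true;
}

@Override
public void fold(ColumnValueSelector selector)
{
  if (!isReset) {
    firstString = (String) selector.getObject();
    isReset = true;
  }
}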

Contributor

@jihoonson jihoonson left a comment

The latest change looks good to me. Thanks @andresgomezfrr!

@jihoonson jihoonson merged commit e270362 into apache:master Aug 1, 2018
@andresgomezfrr andresgomezfrr deleted the feature-first-last-string-aggregators branch August 2, 2018 07:01
@dclim dclim added this to the 0.13.0 milestone Oct 8, 2018
@andresgomezfrr andresgomezfrr restored the feature-first-last-string-aggregators branch October 17, 2018 14:04
Contributor

glasser commented Mar 5, 2019

Does stringFirst actually work at ingestion time? The doc change made here (changing the existing claim that first/last aggregators don't work at ingestion time to say that only numeric ones don't), plus the implementation of makeAggregateCombiner, makes it seem like it should. And when I define a Kafka indexing service data source with a stringFirst aggregator, I can properly query the metric against the data as the indexing task indexes it.

But the indexing task publish stage fails (in 0.13) with errors like:

2019-03-05T08:26:25,579 WARN [appenderator_merge_0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Failed to push merged index for segment[trace_refs_2019-03-05T07:00:00.000Z_2019-03-05T08:00:00.000Z_2019-03-05T07:00:01.884Z_2].
java.lang.ClassCastException: org.apache.druid.query.aggregation.SerializablePairLongString cannot be cast to java.lang.String
	at org.apache.druid.query.aggregation.first.StringFirstAggregateCombiner.reset(StringFirstAggregateCombiner.java:35) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.RowCombiningTimeAndDimsIterator.resetCombinedMetrics(RowCombiningTimeAndDimsIterator.java:249) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.RowCombiningTimeAndDimsIterator.combineToCurrentTimeAndDims(RowCombiningTimeAndDimsIterator.java:229) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.RowCombiningTimeAndDimsIterator.moveToNext(RowCombiningTimeAndDimsIterator.java:191) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.IndexMergerV9.mergeIndexesAndWriteColumns(IndexMergerV9.java:492) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.IndexMergerV9.makeIndexFiles(IndexMergerV9.java:191) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.IndexMergerV9.merge(IndexMergerV9.java:914) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.IndexMergerV9.mergeQueryableIndex(IndexMergerV9.java:832) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.IndexMergerV9.mergeQueryableIndex(IndexMergerV9.java:810) ~[druid-processing-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.mergeAndPush(AppenderatorImpl.java:719) ~[druid-server-0.13.0-incubating.jar:0.13.0-incubating]
	at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.lambda$push$1(AppenderatorImpl.java:623) ~[druid-server-0.13.0-incubating.jar:0.13.0-incubating]
	at com.google.common.util.concurrent.Futures$1.apply(Futures.java:713) [guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:861) [guava-16.0.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Has anyone actually successfully used stringFirst at ingestion time?

@andresgomezfrr
Contributor Author

Yes, I use it at indexing time. Could you share your ingestion spec and some example input data?

Contributor

glasser commented Mar 7, 2019

OK, I attempted a trivial reproduction by working through the Kafka stream tutorial, but removing channel from dimensionsSpec and changing metricsSpec to:

    "metricsSpec": [{
      "name": "channel",
      "fieldName": "channel",
      "type": "stringFirst",
      "maxStringBytes": 100
    }],

This worked just fine (including actually publishing), so it's unclear what happened when I ran it in our cluster.

In our cluster, this was the ingestion spec. Note that we use a custom parser implementation which parses some protobufs into MapBasedInputRow, but it should just end up mapping sampled_trace_id to a String.

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "trace_refs",
    "parser": {
      "type": "mdg_trace_refs",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [{
            "name": "gcs_bucket",
            "type": "string"
          }, {
            "name": "duration_bucket",
            "type": "long"
          }, {
            "name": "client_reference_id",
            "type": "string"
          }, {
            "name": "client_name",
            "type": "string"
          }, {
            "name": "client_version",
            "type": "string"
          }, {
            "name": "query_id",
            "type": "string"
          }, {
            "name": "query_name",
            "type": "string"
          }, {
            "name": "query_signature",
            "type": "string"
          }, {
            "name": "schema_hash",
            "type": "string"
          }, {
            "name": "schema_tag",
            "type": "string"
          }, {
            "name": "service_id",
            "type": "string"
          }, {
            "name": "service_version",
            "type": "string"
          }, {
            "name": "trace_id",
            "type": "string"
          }]
        }
      }
    },
    "metricsSpec": [{
      "name": "sampled_trace_id",
      "fieldName": "sampled_trace_id",
      "type": "stringFirst",
      "maxStringBytes": 100
    }, {
      "name": "total_trace_size_bytes",
      "fieldName": "total_trace_size_bytes",
      "type": "longSum"
    }],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE",
      "rollup": true
    }
  },
  "ioConfig": {
    "topic": "engine-reports-trace-processed",
    "consumerProperties": {
      "bootstrap.servers": "kafka:9092",
      "max.poll.records": 10000,
      "max.partition.fetch.bytes": 33554432
    },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H",
    "lateMessageRejectionPeriod": "P31D",
    "earlyMessageRejectionPeriod": "PT1H",
    "useEarliestOffset": false
  },
  "tuningConfig": {
    "type": "kafka",
    "logParseExceptions": true,
    "maxParseExceptions": 0,
    "maxSavedParseExceptions": 1,
    "basePersistDirectory": "/tmp/ignored"
  }
}

I'll keep investigating, but definitely curious to hear if there's anything obviously strange here!

Contributor

glasser commented Mar 7, 2019

I don't really understand the variety of slightly different entry points involved in the metric aggregation process, but it seems like if StringFirstAggregateCombiner used the same logic as StringFirstAggregator.aggregate, where it allows selector.getObject() to return either a String or a SerializablePairLongString, then whatever this issue actually is would be resolved.
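
In code, that suggestion would be something along these lines (a hedged sketch of the idea, not a tested fix):

// Sketch: accept either a plain String or an already-combined SerializablePairLongString
// from the selector, mirroring what StringFirstAggregator.aggregate tolerates.
private static String readStringValue(ColumnValueSelector<?> selector)
{
  final Object value = selector.getObject();
  if (value instanceof SerializablePairLongString) {
    return ((SerializablePairLongString) value).rhs;
  }
  return (String) value;
}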

Contributor

glasser commented Mar 7, 2019

I tried running this again with DEBUG logging enabled but nothing obvious showed up. Same stack trace.

Is there a good way to poke at the persisted files while a task runs and see if they're in the right format?

Contributor

gianm commented Mar 12, 2019

@glasser Hmm, I just noticed the AggregateCombiner type is an ObjectAggregateCombiner<String>. I would think it should be an ObjectAggregateCombiner<SerializablePairLongString>. I wonder if the combiner just plain doesn't work, and the reason the repro doesn't trigger this is that it is loading too little data and doesn't actually need to combine anything from two different spill files (I believe that's when AggregateCombiners are used).

Contributor

glasser commented Mar 12, 2019

@gianm I could believe that — it would explain why some small reproductions I tried worked but running in our QA cluster didn't. Though I don't know what a spill file is :)

Contributor

gianm commented Mar 12, 2019

@glasser By 'spill file' I mean the files that get written to disk every maxRowsInMemory or intermediatePersistPeriod (and are merged later into a single segment)

Contributor

glasser commented Mar 12, 2019

Hmm, I'm not sure if that's exactly it. I've been trying the standard quickstart Kafka ingestion example with this supervisor:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "cityName",
            "comment",
            "countryIsoCode",
            "countryName",
            "isAnonymous",
            "isMinor",
            "isNew",
            "isRobot",
            "isUnpatrolled",
            "metroCode",
            "namespace",
            "page",
            "regionIsoCode",
            "regionName",
            "user",
            { "name": "added", "type": "long" },
            { "name": "deleted", "type": "long" },
            { "name": "delta", "type": "long" }
          ]
        }
      }
    },
    "metricsSpec": [{
      "name": "channel",
      "fieldName": "channel",
      "type": "stringFirst",
      "maxStringBytes": 100
    }],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "rollup": false
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false,
    "maxRowsInMemory": 3000
  },
  "ioConfig": {
    "topic": "wikipedia",
    "replicas": 2,
    "taskDuration": "PT2M",
    "completionTimeout": "PT20M",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}

Note the maxRowsInMemory: 3000, which is less than the number of rows in the wikiticker-2015-09-12-sampled.json. (I tried setting it just to 1 but that leads to OOMs.) This job runs successfully.

I should probably try with just an index task instead of kafka to make it simpler though.

@glasser
Copy link
Contributor

glasser commented Mar 12, 2019

Yeah, running this task against a fresh download of 0.13-incubating succeeds, even though I would think it would need to invoke AggregateCombiner?

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
    "metricsSpec": [{
      "name": "channel",
      "fieldName": "channel",
      "type": "stringFirst",
      "maxStringBytes": 100
    }],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 1000,
      "forceExtendableShardSpecs" : true
    }
  }
}

Contributor

glasser commented Mar 12, 2019

Oh hmm. "rollup" : false looks bad, but fixing that to be "rollup": true also doesn't reproduce.

Contributor

glasser commented Mar 12, 2019

Oh, right. We need to actually make things roll up with each other, so set a non-trivial queryGranularity. I now have an actual reproduction so I'll open an issue: #7243.


Aka-shi commented May 26, 2020

@glasser @andresgomezfrr Did you happen to find any workaround for this issue? Or is it solved in any of the latest versions? I am facing this exact issue when using stringLast aggregation during ingestion.

Contributor

glasser commented May 26, 2020

I didn't have time to look into this and we switched to working on migrating this particular weird data source away from Druid instead. Somebody told me this might have been fixed though.
