Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store AggregatorFactory[] in segment metadata #1728

Merged
merged 4 commits into from
Jan 19, 2016

Conversation

himanshug
Copy link
Contributor

This PR adds the list of aggregators used to create a segment to it's metadata which is persisted in metadata.drd file. This feature aids

  1. implementation of hadoop "reindexing/merging" task so that it can merge intervals automatically with minimal user input.
  2. UI implementations to present a list of aggregators to users

Also, this PR improves the segment metadata framework which allows adding more stuff easily e.g. QueryGranularity (not added in this PR though)

In order to implement this feature, following notable changes were made.

  1. AggregatorFactory is made an abstract class, a new method was added to AggregatorFactory with a default implementation (also AggregatorFactory was converted to an abstract class), this is required in order to correctly "merge"
    aggregator factories. This is a backward incompatible change.
/**
 * Returns an AggregatorFactory that can be used to merge the output of aggregators from this factory and
 * other factory.
 * This method is relevant only for AggregatorFactory which can be used at ingestion time.
 *
 * @return a new Factory that can be used for merging the output of aggregators from this factory and other.
 */
public AggregatorFactory getMergingFactory(AggregatorFactory other) throws AggregatorFactoryNotMergeableException


  1. User can optionally supply the list of Aggregators to AppendTask so as to explicitly specify list of aggregators to be stored inside the segment metadata of final appended segment.

);
}

public static final AggregatorFactory[] mergeAggregators(List<IndexableAdapter> indexes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some javadocs to clarify what the expected behavior is when merging indices? i.e. there are a few things I think we should explicitly state, e.g. what happens if there is a naming conflict, how does the order of the list affect which aggregators are kept? Also, in some cases the aggregators may not be compatible, e.g. if a name conflict redefines a longSum metric as a complex metric, it will probably break unexpectedly somewhere downstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will update the javadoc.

I have tried to replicate the behavior that of merging name of metrics at https://github.com/druid-io/druid/blob/master/processing/src/main/java/io/druid/segment/IndexMerger.java#L435 .
If two different aggregators had same name, then the first one will be taken. Also, if any of the input index did not have aggregators in the metadata then resulting segment will not have same as well.
does this behavior look ok or we should do something else?

Also, I'm yet to write the UTs for this. sent the PR first to ensure that the behavior chosen looks ok to others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we already do this somewhere else then let's keep the behavior and document it. What defines the order in which we pass segments to this function, is that documented?

{
public static final String METRIC_SPEC_KEY = "metricsSpec";

private static final ObjectMapper mapper = new DefaultObjectMapper();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we inject or pass around jsonMapper in method calls instead ?
I wonder if the mapper created by calling constr would have visibility to the JacksonModules registered in Initialization.registerJacksonModules() ?
can you try if it works with Approximate histogram aggregator ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it needs to be injected properly or passed around, otherwise extensions and other aggregations setup as part of the AggregatorsModule will not work. IndexIO has the same problem, everything is static and it has to create it's own injector and it's really ugly, so I'd rather we pass it around if we want to keep it static.

We should also add a test to make sure injected aggregators work with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making it a method argument and have it come from caller is good .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is turning out to be harder than I thought.. SegmentMetadataHelper.xxx(..) is to be used by Index[Merger/Maker].xxx(..) all of which are static so having it injected is not possible. Index[Merger/Maker] contain a static handle to a mapper but that is not setup to handle extension. I am wondering whether it makes sense to make them as sophisticated as the one in HadoopDruidIndexerConfig or to have Index[Merger/Maker] methods get ObjectMapper from the callers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me IndexMaker and IndexMerger both implement same functions, wouldn't it make more sense for both of them to inherit from a base class, and create shared static injector + ObjectMapper there? We can also put SegmentMetadataHelper as an inner class of the base class since only IndexMaker and IndexMerger use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO best would be to have IndexMerger/Maker not actually be static, that way they can have the correct object mapper injected.

Second best would be to have callers pass in the objectMapper they want to use.

Third best would be something in the style of HadoopDruidIndexerConfig.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guobingkun I'm not too worried about IndexMerger/IndexMaker duplication in this PR, basic problem here is to really avoid creating that static injector. That said, if we do end up creating a static one then it would make sense to create it only in one class and use in other.

@gianm agree with your ordering, will try to make methods in Index[Maker/Merger] non-static.

@himanshug
Copy link
Contributor Author

@xvrl @gianm @nishantmonu51
thinking more towards #1517 and it appears that list of aggregators for various segments would be needed at the overlord (or middle manager) which means that list of aggregators has to be stored inside "descriptor.json" (having it inside metadata.drd is optional really)
that leads to me to think that Index[Merger/Maker] should't really touch segmentMetadata and callers should put aggregators inside segmentMetadata before calling IndexMerger.xxx(), that way caller will be able put same thing inside descriptor.json as well as metadata.drd.

@himanshug
Copy link
Contributor Author

ok, spoke to @cheddar about storing aggregators in descriptor.json and decided to not put it there as it might increase memory pressure on other nodes... instead have metadata be part of segment metadata query probably and coordinator/overlord or whoever else needs it can query same.

@gianm
Copy link
Contributor

gianm commented Sep 17, 2015

@himanshug, I think @jon-wei can look at that as part of other changes he is working on for the segment metadata query.

One thing I'm wondering is, should the aggregators-merging stuff here be generalized into a general segment metadata merger thing? The segment metadata query has a "merge" flag that causes it to merge individual segment results into a single overall result. It'll need some way of doing the merging for the metadata.drd metadata.

@xvrl
Copy link
Member

xvrl commented Sep 17, 2015

@himanshug I don't think storing it in descriptor has to add memory pressure. We can make the data-structures such that we don't have to always keep unnecessary information in memory. Most of the memory pressure, as far as I know is on the coordinator and brokers, which don't need that information. If we load it separately when needed and only keep the minimum for brokers/coordinators it should not be painful.

@himanshug
Copy link
Contributor Author

@xvrl From what I understand, currently, everything inside descriptor.json goes to zk nodes and to other druid nodes and hence will add pressure.
Are you suggesting that we split descriptor.json (when publishing to zk) into 2 parts and druid nodes only get one or both parts to whatever degree necessary? Even in this case though, memory pressure on zk will increase and whoever else needs it (either one of overlord/coordinator/MM) .
However, do you think there would be a major problem if we don't put it inside descriptor.json and whoever needs that information makes a segment metadata query?

@xvrl
Copy link
Member

xvrl commented Oct 23, 2015

@himanshug can we summarize in the PR description what we ultimately agreed on and why?

@xvrl
Copy link
Member

xvrl commented Oct 23, 2015

@himanshug we should also add a note on what problem this is addressing

@himanshug
Copy link
Contributor Author

@xvrl added

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from 1c93bee to e857e7f Compare November 6, 2015 04:53
@gianm
Copy link
Contributor

gianm commented Nov 8, 2015

@himanshug looks like a legitimate failure; perhaps the aggregators returned by the converted segment, and the expected aggregators, are in a different order?

@himanshug
Copy link
Contributor Author

@gianm java8 build passed, i must have used something java8 specific, will check.
BTW, this PR is ready for review .

@gianm
Copy link
Contributor

gianm commented Nov 8, 2015

@himanshug the failure might be a nondeterministic ordering thing- are the aggregators always merged in deterministic order?

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from 1ce9785 to f5eae7f Compare January 7, 2016 21:51
@himanshug
Copy link
Contributor Author

@gianm resolved conflicts.

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from f5eae7f to 9980014 Compare January 8, 2016 15:39
@himanshug
Copy link
Contributor Author

build failed with JDBCExtractionNamespaceTest failure

testFindNew[tsColumn](io.druid.server.namespace.cache.JDBCExtractionNamespaceTest)  Time elapsed: 63.055 sec  <<< ERROR!

java.lang.RuntimeException: java.util.concurrent.TimeoutException

    at java.util.concurrent.FutureTask.get(FutureTask.java:205)

    at io.druid.server.namespace.cache.JDBCExtractionNamespaceTest.tearDown(JDBCExtractionNamespaceTest.java:307)

https://travis-ci.org/druid-io/druid/jobs/101088054

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from 9980014 to 04a2f55 Compare January 12, 2016 19:48
@himanshug
Copy link
Contributor Author

rebased again to resolve merge conflicts.

@fjy
Copy link
Contributor

fjy commented Jan 16, 2016

@himanshug more merge conflicts :)

mergedAggregators.put(name, other.getMergingFactory(aggregator));
}
catch (AggregatorFactoryNotMergeableException ex) {
log.warn(ex, "failed to merge aggregator factories");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this warning include name somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, nevermind, it already does due to the automatic message in AggregatorFactoryNotMergeableException.

@gianm
Copy link
Contributor

gianm commented Jan 19, 2016

👍 after conflict fixes & travis. Looks good otherwise though. thanks @himanshug

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from 04a2f55 to fae284f Compare January 19, 2016 04:16
@himanshug
Copy link
Contributor Author

@gianm thanks for reviewing, resolved conflicts again.

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from fae284f to fc3edfe Compare January 19, 2016 04:31
@gianm
Copy link
Contributor

gianm commented Jan 19, 2016

@himanshug looks like legitimate failures in the build

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from fc3edfe to 71206c1 Compare January 19, 2016 05:03
@himanshug
Copy link
Contributor Author

@gianm fixed build, new IndexMergerV9 stuff received in latest rebase was failing with the changes.

@gianm
Copy link
Contributor

gianm commented Jan 19, 2016

👍

*/
public AggregatorFactory getMergingFactory(AggregatorFactory other) throws AggregatorFactoryNotMergeableException
{
throw new UnsupportedOperationException("can't merge");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, it'd be nice if this printed the type of base agg factory and other agg factory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added class name in the msg

@fjy
Copy link
Contributor

fjy commented Jan 19, 2016

👍 even if my comments aren't addressed

@himanshug himanshug force-pushed the aggregators_in_segment_metadata branch from 71206c1 to a99aef2 Compare January 19, 2016 20:23
fjy added a commit that referenced this pull request Jan 19, 2016
Store AggregatorFactory[] in segment metadata
@fjy fjy merged commit 0c31f00 into apache:master Jan 19, 2016
@fjy fjy mentioned this pull request Feb 5, 2016
@himanshug himanshug deleted the aggregators_in_segment_metadata branch February 8, 2016 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants