
Introduce SegmentId class #6370

Merged
merged 48 commits into from Jan 21, 2019

Conversation

@leventov
Member

commented Sep 22, 2018

This PR introduces the SegmentId class for operating with segment ids in the codebase instead of Strings.

Three reasons:

  • Memory efficiency: SegmentId's footprint is about 50 bytes smaller than a String, assuming non-compressed Oops, if I calculated correctly (should be ~25 bytes smaller since Java 9's "compact strings"). The difference should be bigger with compressed Oops.
  • Code readability and safety: it's more readable and safer to operate with Map<SegmentId, Something> than with Map<String, Something>
  • SegmentId made it possible to get rid of the segments map in DruidServer and ImmutableDruidServer, for additional memory savings.
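The idea behind the first two bullets can be sketched as a small value class. This is a hypothetical simplification, not the actual org.apache.druid.timeline.SegmentId (field names are assumed): the id becomes a typed tuple of its parts instead of one underscore-joined String, so the interval and partition number cost a few primitive fields rather than char data, and the type is safe to use as a map key.

```java
import java.util.Objects;

// Hypothetical simplification of the SegmentId idea; field names are assumed,
// this is not the real Druid class.
final class SegmentIdSketch
{
  final String dataSource;
  final long intervalStartMillis; // 8 bytes instead of a ~24-char timestamp substring
  final long intervalEndMillis;
  final String version;
  final int partitionNum;

  SegmentIdSketch(String dataSource, long start, long end, String version, int partitionNum)
  {
    this.dataSource = Objects.requireNonNull(dataSource);
    this.intervalStartMillis = start;
    this.intervalEndMillis = end;
    this.version = Objects.requireNonNull(version);
    this.partitionNum = partitionNum;
  }

  @Override
  public boolean equals(Object o)
  {
    if (!(o instanceof SegmentIdSketch)) {
      return false;
    }
    SegmentIdSketch that = (SegmentIdSketch) o;
    return intervalStartMillis == that.intervalStartMillis
        && intervalEndMillis == that.intervalEndMillis
        && partitionNum == that.partitionNum
        && dataSource.equals(that.dataSource)
        && version.equals(that.version);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(dataSource, intervalStartMillis, intervalEndMillis, version, partitionNum);
  }
}
```

With a type like this, `Map<SegmentIdSketch, Something>` says exactly what the key is, while `Map<String, Something>` says nothing.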

Also:

  • Replaced ConcurrentSkipListMap with ConcurrentHashMap in DruidDataSource, because it doesn't really need to be sorted.
  • Improved concurrency of DruidServer.
  • Improved concurrency of SQLMetadataSegmentManager.

Caveats of SegmentId introduction:

  • HTTP endpoints that return a lot of segment ids as parts of some JSON answers generate more garbage, because they convert SegmentId objects into Strings. Previously, those Strings were already stored in the heap.
  • HTTP endpoints that accept a segment id need to do more work, because they need to iterate over all possible parsings of the segment id string into SegmentId, which is ambiguous. But this should be minor compared to the cost of the HTTP call itself. The exception is MetadataResource.getDatabaseSegmentDataSourceSegment(): preserving its former behaviour precisely would require generating quite a lot of garbage String objects on each endpoint call, so the logic of this endpoint was changed (see below).
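The ambiguity comes from the id format itself, dataSource_start_end_version(_partitionNum), where the dataSource (and version) may themselves contain underscores. A minimal sketch of the "iterate all possible parsings" idea (an illustration only, not the actual SegmentId parsing code):

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentIdParsingSketch
{
  /**
   * Illustration only, not Druid's actual parsing code: because the dataSource
   * part of "dataSource_start_end_version" may itself contain underscores,
   * every underscore is a candidate boundary. A real parser would try each
   * split and keep only those whose remaining parts are valid timestamps etc.
   */
  static List<String[]> possibleDataSourceSplits(String segmentId)
  {
    List<String[]> splits = new ArrayList<>();
    for (int i = segmentId.indexOf('_'); i >= 0; i = segmentId.indexOf('_', i + 1)) {
      splits.add(new String[]{segmentId.substring(0, i), segmentId.substring(i + 1)});
    }
    return splits;
  }
}
```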

Changed HTTP endpoint behaviour:

  • MetadataResource.getDatabaseSegmentDataSourceSegment(), /datasources/{dataSourceName}/segments/{segmentId}: it used to search for a segment id match ignoring the case of the string (for reference, this dates back to this commit: 3cba0c8#diff-1fad8d6f3e6e3a17b71cd63bd21353e9R153). That's not in line with all other segmentId-accepting endpoints, and it doesn't look correct to me, because "ver" and "VER" are two different versions according to the rest of the logic. In this PR, the endpoint is changed to look for a case-sensitive segmentId match.
  • DatasourcesResource.getSegmentDataSourceSpecificInterval(), /{dataSourceName}/intervals/{interval}, simple option: it creates a HashMap of intervals, unlike the same method's full option and both the simple and full options of getSegmentDataSourceIntervals(), which all create TreeMaps of intervals. To me it looked like a slip, so I changed this endpoint to create a TreeMap as well (i.e. it now returns interval JSON maps in a different order than it did before).
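The HashMap-to-TreeMap change only affects iteration order, but that order is what the resulting JSON reflects: a TreeMap returns its interval keys sorted regardless of insertion order, while a HashMap gives no guarantee. A minimal sketch of the difference (illustrative, not the endpoint's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class IntervalOrderingSketch
{
  // A TreeMap keeps ISO-8601 interval strings in sorted (i.e. chronological)
  // order regardless of insertion order, which makes the serialized JSON
  // deterministic; a HashMap would not.
  static List<String> sortedIntervals(List<String> intervals)
  {
    TreeMap<String, Boolean> map = new TreeMap<>();
    for (String interval : intervals) {
      map.put(interval, Boolean.TRUE);
    }
    return new ArrayList<>(map.keySet());
  }
}
```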

This PR is labelled Incompatible because it changed multiple Public APIs (to return SegmentId instead of String from some getters).

@leventov

Member Author

commented Sep 22, 2018

This is also a blocker for me for fixing #6358 and making some improvements to HttpServerInventoryView.

@leventov

Member Author

commented Sep 22, 2018

For reviewers: key changed classes in this PR are SegmentId, DataSegment, DruidDataSource, DruidServer and ImmutableDruidServer. SQLMetadataSegmentManager receives some other non-trivial changes.

* The difference between this class and org.apache.druid.segment.Segment is that this class contains the segment
* metadata only, while org.apache.druid.segment.Segment represents the actual body of segment data, queryable.
*
* TODO explain the difference with org.apache.druid.query.SegmentDescriptor

@egor-ryashin

egor-ryashin Sep 29, 2018

Contributor

left todo

@leventov

leventov Oct 1, 2018

Author Member

We do usually have a policy that we create issues on GitHub instead of TODOs in code, but I felt this was too small a thing for that. We should probably allow TODOs for documentation only (including internal, Javadoc documentation). I'll start a thread about this on the dev mailing list.

@gianm

gianm Oct 1, 2018

Contributor

My feeling is that if it's too small for an issue, then you should just implement it as part of the original patch. In this case I think adding the following sentence would do it.

SegmentDescriptor is a "light" version of this class that only contains the interval, version, and partition number. SegmentDescriptor is used when brokers tell data servers which segments to include for a particular query.

@leventov

leventov Oct 1, 2018

Author Member

@gianm thank you. However, PR authors don't always have the ability to ask anybody about things like this.

* Returns a (potentially empty) lazy iteration of all possible valid parsings of the given segment id string into
* {@code SegmentId} objects.
*
* Warning: this iterable is repeats most of the parsing work on each iteration, so it should be iterated only once if

@egor-ryashin

egor-ryashin Sep 29, 2018

Contributor

iterable is

@leventov

leventov Sep 30, 2018

Author Member

Thanks, fixed

@egor-ryashin

Contributor

commented Sep 29, 2018

BTW, it would have been much easier to review if naming changes like getIdentifier -> getId and Datasource -> DataSource had been in a separate commit, as they affect a large amount of code, which makes it somewhat challenging to get the picture of the functional changes.

@leventov

Member Author

commented Sep 30, 2018

@egor-ryashin to get the picture of the functional changes, it should be enough to closely review the classes mentioned here: #6370 (comment). Changes in other places are more or less mechanical.

leventov added 4 commits Sep 30, 2018
@egor-ryashin

Contributor

commented Oct 1, 2018

I wonder if we could forbid ambiguous datasource names (i.e. containing timestamps)?

@leventov

Member Author

commented Oct 1, 2018

@egor-ryashin because data source names and "versions" don't have any prohibited characters, including "_", ambiguity is unavoidable. But it doesn't seem to me that it does much harm. Certainly it would have been better to avoid ambiguity when the segment id format was designed, but at this point backward compatibility is more important.
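A concrete instance of that ambiguity, using invented names for illustration: two different (dataSource, rest-of-id) combinations can serialize to the same underscore-joined string, so the string alone cannot be parsed back uniquely.

```java
final class AmbiguitySketch
{
  // Joins segment id parts with '_', the same separator that may legally
  // appear inside the parts themselves — which is the source of the ambiguity.
  static String joinId(String... parts)
  {
    return String.join("_", parts);
  }
}
```

For example, a dataSource named "wiki_2000" with first part "2001" produces the same id string as a dataSource named "wiki" with first part "2000_2001".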

@leventov leventov requested a review from drcrallen Oct 5, 2018
@drcrallen

Contributor

commented Jan 7, 2019

SegmentId is not just "id", it's a useful metadata container

Is there a way we can avoid that? What I'm asking is if we can remove the string ID completely to avoid such a temptation. Having metadata serialization for machine consumption inside of a string is almost always problematic in the future. Having it ONLY for human consumption would be OK.

io.druid.query.spec.MultipleSpecificSegmentSpec already references io.druid.query.SegmentDescriptor, which is mostly duplicated in SegmentIdWithShardSpec. Any need for a difference between the two classes is not clear. As such, can they be reconciled into a single metadata class that identifies whatever needs to be identified?

@leventov

Member Author

commented Jan 9, 2019

SegmentId is not just "id", it's a useful metadata container

Is there a way we can avoid that? What I'm asking is if we can remove the string ID completely to avoid such a temptation. Having metadata serialization for machine consumption inside of a string is almost always problematic in the future. Having it ONLY for human consumption would be OK.

Why should having SegmentId as a useful metadata container be avoided? What temptation? I don't understand what you are asking here or what problem you see.

@leventov

Member Author

commented Jan 9, 2019

io.druid.query.SegmentDescriptor, which is mostly duplicated in SegmentIdWithShardSpec. Any need for a difference between the two classes is not clear. As such, can they be reconciled into a single metadata class that identifies whatever needs to be identified?

The Javadoc comment of SegmentDescriptor says:

The difference between this class and {@link org.apache.druid.timeline.DataSegment} is that this class is a "light" version of {@link org.apache.druid.timeline.DataSegment}, that only contains the interval, version, and partition number. SegmentDescriptor is used when Brokers tell data servers which segments to include for a particular query.

To me, this is clear enough. Before you asked, I didn't remember that at all; I went to SegmentDescriptor to read the Javadoc, so it's clear to me not because I'm deep in the context of those things. This comment is also duplicated in the header of DataSegment, but not in the Javadoc, because DataSegment is @PublicApi and I didn't want to contaminate its class-level Javadoc with private details and explanations.

Additionally, I want to note that SegmentDescriptor has different JSON field names ("itvl", "ver" and "part"), which makes reconciling it with SegmentIdWithShardSpec much harder in practice.

…er than with DataSegment
@leventov

Member Author

commented Jan 9, 2019

After writing the previous comment, I realized that SegmentDescriptor should more properly be compared with SegmentId rather than with DataSegment; it would be a more apples-to-apples comparison. I updated the relevant comments to the following:

The difference between this class and {@link org.apache.druid.query.SegmentDescriptor} is that the latter is a "light" version of SegmentId that only contains the interval, version, and partition number. It's used where the data source, another essential part of SegmentId, is determined by the context (e.g. in {@link org.apache.druid.client.CachingClusteredClient}, where SegmentDescriptor is used when Brokers tell data servers which segments to include for a particular query) and where having lean JSON representations is important, because it's actively transferred between Druid nodes. It's also for this reason that the JSON field names of SegmentDescriptor are abbreviated.
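For illustration, the lean JSON shape with the abbreviated field names mentioned above looks roughly like this (the values are made up; real Druid serializes SegmentDescriptor with Jackson, this hand-built sketch only shows the field names):

```java
final class DescriptorJsonSketch
{
  // Hand-built illustration of the abbreviated field names "itvl", "ver" and
  // "part"; not Druid's actual Jackson-based serialization.
  static String toJson(String interval, String version, int partitionNumber)
  {
    return String.format(
        "{\"itvl\":\"%s\",\"ver\":\"%s\",\"part\":%d}",
        interval,
        version,
        partitionNumber
    );
  }
}
```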

@drcrallen

Contributor

commented Jan 10, 2019

The issue I have with putting metadata inside a string inside an object is that there is no structure to the metadata anymore. In other words you are putting serialization mechanisms inside of serialization mechanisms which almost always leads to pain down the line when some corner of the system behaves differently than expected.

The serialization system within druid is json. As such, any metadata is expected to be expressed as json, with the ability for Jackson to deserialize it into objects for consumption in the code.

One of the few exceptions to this are places with URIs in the json because the URIs are usually generated or indicative of systems outside of druid.

@leventov

Member Author

commented Jan 11, 2019

Charles, unfortunately I still don't understand what you are talking about. Before, there were segment id strings with underscores inside them, and now there are SegmentId objects with fields. If anything, there is more structure now.

@leventov

Member Author

commented Jan 17, 2019

@drcrallen could you please clarify your concerns?

@jihoonson

Contributor

commented Jan 17, 2019

@leventov sorry for the late review. The latest change looks good to me. Would you please fix the conflicts and Travis?

public String getIdentifier()
// "identifier" for backward compatibility of JSON API
@JsonProperty(value = "identifier", access = JsonProperty.Access.READ_ONLY)
public SegmentId getId()

@jihoonson

jihoonson Jan 17, 2019

Contributor

Out of curiosity, do you know why the serialized JSON has this field even though it's not used in deserialization of dataSegment?

@jihoonson

Contributor

commented Jan 17, 2019

Ah, probably I saw a stale page. It looks fine after refresh.

@drcrallen do you have more comments?

@jihoonson

Contributor

commented Jan 17, 2019

Oh, I missed one question. @leventov do you have any number for how much memory you could save with this PR?

@leventov

Member Author

commented Jan 18, 2019

@jihoonson I don't have a number. I hope the effect of removing the segments maps from DruidServer and ImmutableDruidServer will be considerable. There are also some changes I want to make to avoid creating temporary collections of all-segments cardinality (similar to #6827, but for the coordination subsystem rather than SQL). I'm waiting for this PR to make those changes.

Thanks for review!

@jihoonson

Contributor

commented Jan 21, 2019

Thanks. I'm merging this PR.

@jihoonson jihoonson merged commit 8eae26f into apache:master Jan 21, 2019
2 checks passed
Inspections: pull requests (Druid): TeamCity build finished
continuous-integration/travis-ci/pr: The Travis CI build passed
@clintropolis clintropolis referenced this pull request Jan 21, 2019
@leventov leventov deleted the metamx:segmentId branch Jan 22, 2019
@jon-wei jon-wei added this to the 0.14.0 milestone Feb 20, 2019
this.intervalChronology = interval.getChronology();
// Versions are timestamp-based Strings, interning of them doesn't make sense. If this is not the case, interning
// could be conditionally allowed via a system property.
this.version = Objects.requireNonNull(version);

@jihoonson

jihoonson Apr 14, 2019

Contributor

Hmm, I think it makes sense to intern version as well because it's pretty common to have the same version across different segments. For example, if you create 10 segments per interval using a Hadoop task, they must have the same version in each interval. If you create segments using appending tasks (like Kafka task or index task of appending mode), they must have the same version with existing segments.

I'm not sure why conditional allowance is needed though. Would you elaborate more?

@leventov

leventov Apr 15, 2019

Author Member

Probably the comment should have stated "In some setups, versions are timestamp-based strings". We don't want to intern versions because we generate them based on timestamps.

@jihoonson

jihoonson Apr 15, 2019

Contributor

I mean, even though the version is generated based on a timestamp, it's common that multiple segments have the same version which is created from the same timestamp (see https://github.com/apache/incubator-druid/blob/master/server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java#L684-L700 for allocating the same version for appending segments to the timeChunk where some segments already exist).

In some setups, versions are timestamp-based strings.

I think the version is always a timestamp-based string for now. Or is it possible to change it to use something other than a timestamp?

@leventov

leventov Apr 16, 2019

Author Member

Or is it possible to change to use something else other than timestamp?

I didn't know how the versions are created; I only assumed that they might not be timestamps when you started talking about interning.

Ok, it seems to me that SegmentId is not the right place to do version interning, but DataSegment's constructor probably is, if the ShardSpec implies a lot of partitions.
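The "conditionally allowed via a system property" idea from the code comment above could look like the following sketch. The property name druid.segmentId.internVersions is invented purely for illustration; it is not a real Druid setting:

```java
import java.util.Objects;

final class VersionInterningSketch
{
  // Invented property name, for illustration only — not a real Druid setting.
  private static final boolean INTERN_VERSIONS = Boolean.getBoolean("druid.segmentId.internVersions");

  // String.intern() returns the canonical instance from the JVM's string pool,
  // so many segments sharing one version would share one String object when
  // the flag is on; when it's off, the version is stored as-is.
  static String internIfEnabled(String version)
  {
    Objects.requireNonNull(version);
    return INTERN_VERSIONS ? version.intern() : version;
  }
}
```

This matches jihoonson's observation that many segments commonly share one version (e.g. all segments created by one Hadoop task for one interval), which is when interning pays off.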

gvsmirnov added a commit to Plumbr/druid that referenced this pull request Jul 23, 2019
 from 0.10.1 to 0.15.0-incubating.

The following additional changes had to be made for compatibility with newer codebase:

1. DataSegment.makeDataSegmentIdentifier(...) call replaced with SegmentId.of(...).toString(), see apache#6370
2. size() added to PartitionHolder
3. Minor changes to comply with the new lint checks, e.g. using StringUtils, unfolding imports, fixing indentation, etc