PHOENIX-4925 Use a Variant Segment tree to organize Guide Post Info #482

binshi-bing · 2019-04-11T06:09:07Z

@dbwong please review this early draft. I haven't completely sorted out the local index and cross-region boundary conditions when generating scans. I'm still adding unit and integration tests and working on getting the tests to pass. Please focus on the main logic, data structure, main algorithms, and interfaces first. The main data structure change is in GuidePostsInfo for now.

karanmehta93 · 2019-04-11T06:22:21Z

@BinShi-SecularBird can you fix the conflicts?

phoenix-core/src/main/java/org/apache/phoenix/iterate/SnapshotScanner.java

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/DefaultStatisticsCollector.java

dbwong

Initial high-level review. I'm generally okay with the high-level picture, but I think the tree classes badly need unit tests before I review in detail.

dbwong · 2019-04-12T20:14:33Z

phoenix-core/src/main/java/org/apache/phoenix/query/QueryServices.java

    // The size of the thread pool used for refreshing cached table stats in stats client cache
    public static final String STATS_CACHE_THREAD_POOL_SIZE = "phoenix.stats.cache.threadPoolSize";
+    // The targeted size of the guide post chunk measured in the number of guide posts.
+    public static final String STATS_TARGETED_CHUNK_SIZE = "phoenix.stats.targeted.chunk.size";


I don't think this should be at the query level; it should be at the table level, since ideally we won't be collecting stats for the table multiple times. With your guide post chunks, shouldn't this match the collected size? Is this for a chunk size of 1 for backward compatibility?

This is a cluster configuration applied to all queries and all tables. Isn't it the right place to add a cluster configuration?

Hmm, what about the case where the chunk size here is greater than the table's guide post size? For example, will you combine chunks?

If we do it at the table level, we have to add an extra column to store this chunk size. Adding more and more columns to syscat is not a good idea. What's the best practice here, @dbwong?

Chunk size is max(1, count of guide posts).

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/DefaultStatisticsCollector.java

dbwong · 2019-04-12T20:27:32Z

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/GuidePostChunk.java

+/**
+ * A guide post chunk is comprised of a group of guide posts, and it has one of key ranges below:
+ *     (UNBOUND, gp_i0], (gp_i0, gp_i1], (gp_i1, gp_i2], ..., (gp_in, gp_n], (gp_n, UNBOUND)
+ * where gp_x is one of guide post collected on the server side. The last guide post chunk is a DUMMY chunk


Let's call it something other than dummy, maybe residual? Also describe that it covers the key range not covered by the guide posts.

Changed name to ending chunk.
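
To make the key ranges in the Javadoc above concrete, here is a minimal sketch of grouping sorted guide posts into chunk ranges, with a final ending chunk covering the keys after the last guide post. The class `ChunkRangeSketch`, its `Range` type, the use of `String` keys, and the use of `null` for UNBOUND are all assumptions made for illustration; this is not the `GuidePostChunk` class from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the GuidePostChunk class from the patch.
class ChunkRangeSketch {

    /** A key range (lower, upper]; null stands for UNBOUND in this sketch. */
    static final class Range {
        final String lowerExclusive;
        final String upperInclusive;

        Range(String lowerExclusive, String upperInclusive) {
            this.lowerExclusive = lowerExclusive;
            this.upperInclusive = upperInclusive;
        }

        @Override
        public String toString() {
            String lo = lowerExclusive == null ? "UNBOUND" : lowerExclusive;
            if (upperInclusive == null) {
                return "(" + lo + ", UNBOUND)";
            }
            return "(" + lo + ", " + upperInclusive + "]";
        }
    }

    /**
     * Groups sorted guide posts into chunks of at most targetedChunkSize guide posts.
     * Each chunk ends at its last guide post; one extra ending chunk covers the
     * key range after the last guide post.
     */
    static List<Range> toChunkRanges(List<String> sortedGuidePosts, int targetedChunkSize) {
        if (targetedChunkSize < 1) {
            throw new IllegalArgumentException("targetedChunkSize must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        String lower = null; // the first chunk starts at UNBOUND
        for (int i = 0; i < sortedGuidePosts.size(); i += targetedChunkSize) {
            int last = Math.min(i + targetedChunkSize, sortedGuidePosts.size()) - 1;
            String upper = sortedGuidePosts.get(last);
            ranges.add(new Range(lower, upper));
            lower = upper;
        }
        ranges.add(new Range(lower, null)); // ending chunk: (last guide post, UNBOUND)
        return ranges;
    }
}
```

With guide posts `b, d, f, h, k` and a targeted chunk size of 2, this sketch yields three guide post chunks plus the ending chunk. With no guide posts at all, only the ending chunk remains, and it spans the whole key space.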

phoenix-core/src/it/java/org/apache/phoenix/end2end/ExplainPlanWithStatsEnabledIT.java

phoenix-core/src/it/java/org/apache/phoenix/schema/stats/BaseStatsCollectorIT.java

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/GuidePostsInfo.java

dbwong · 2019-04-12T23:58:17Z

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/GuidePostsInfo.java

+            Construct(this.guidePostChunks, 0, n - 1, 0);
+        } else {
+            this.treeSize = 0;
+            this.nodes = null;


I dislike the use of null to indicate presence; consider Optional.

I don't need to check whether nodes is null, and there is no nested check for a null object, so using the Optional class is overkill here.
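
For readers following the `Construct(...)` call and the null-versus-Optional exchange above, here is a rough sketch of an array-backed segment tree in which the empty case is modeled with `Optional`. It is a plain sum tree over per-chunk row counts, not the variant tree in the patch. The class name `SegmentTreeSketch`, the `rowCount` method, and the node layout are assumptions for illustration.

```java
import java.util.Optional;

// Hypothetical sketch; names and layout are assumptions, not the patch's GuidePostsInfo.
class SegmentTreeSketch {
    private final int leafCount;
    // Empty when there are no chunks, instead of a null array.
    private final Optional<long[]> nodes;

    SegmentTreeSketch(long[] chunkRowCounts) {
        this.leafCount = chunkRowCounts.length;
        if (leafCount > 0) {
            // 4 * n is a safe upper bound on the size of an array-backed segment tree.
            long[] tree = new long[4 * leafCount];
            construct(tree, chunkRowCounts, 0, leafCount - 1, 0);
            this.nodes = Optional.of(tree);
        } else {
            this.nodes = Optional.empty();
        }
    }

    // Recursively fills tree[node] with the sum of leaves[lo..hi] and returns that sum.
    private static long construct(long[] tree, long[] leaves, int lo, int hi, int node) {
        if (lo == hi) {
            tree[node] = leaves[lo];
            return tree[node];
        }
        int mid = lo + (hi - lo) / 2;
        tree[node] = construct(tree, leaves, lo, mid, 2 * node + 1)
                + construct(tree, leaves, mid + 1, hi, 2 * node + 2);
        return tree[node];
    }

    /** Sum of row counts over chunk indexes [from, to]; 0 when the tree is empty. */
    long rowCount(int from, int to) {
        return nodes.map(t -> query(t, 0, leafCount - 1, 0, from, to)).orElse(0L);
    }

    private static long query(long[] tree, int lo, int hi, int node, int from, int to) {
        if (to < lo || hi < from) {
            return 0; // no overlap with the queried interval
        }
        if (from <= lo && hi <= to) {
            return tree[node]; // node interval fully inside the queried interval
        }
        int mid = lo + (hi - lo) / 2;
        return query(tree, lo, mid, 2 * node + 1, from, to)
                + query(tree, mid + 1, hi, 2 * node + 2, from, to);
    }
}
```

The trade-off matches the thread: `Optional` forces every reader of `nodes` to handle the empty case explicitly, while a null field is lighter when, as the author argues, no caller ever needs to check it.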

dbwong · 2019-04-13T00:02:22Z

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/GuidePostsInfoBuilder.java

     */
-    public boolean trackGuidePost(ImmutableBytesWritable row, long byteCount, long rowCount,
-            long updateTimestamp) {
+    public boolean trackGuidePost(ImmutableBytesWritable row, long byteCount, long rowCount, long updateTimestamp) {


This isn't really following the builder pattern. Consider renaming, handling this through exceptions, or changing the caller.

The intent is not to use the builder pattern here; there is no need for it.

dbwong · 2019-04-13T00:10:58Z

phoenix-core/src/main/java/org/apache/phoenix/schema/stats/GuidePostChunk.java

+    /**
+     * The index of the guide post chunk in the chunk array.
+     */
+    private final int guidePostChunkIndex;


This seems to leak the top-level classes down into this class. The accessor is never used. Can you briefly explain why this index needs to be tracked at multiple levels? InnerPointLookupResult in GuidePostsInfo even tracks two separate indexes.

It's used in DecodedGuidePostCache to track the guide post chunks in the cache.

binshi-bing · 2019-04-23T00:04:23Z

@dbwong, could you please review BaseResultIterators.getParallelScans()? All the required changes should be there now. BaseResultIterators.getRowKeyRanges() contains the logic for generating query key ranges for local indexes.

dbwong

Initial thoughts and feedback on the code in BaseResultIterators, @BinShi-SecularBird.