[CARBONDATA-3680][alpha-feature]Support Secondary Index feature on carbon table. #3608

Closed
wants to merge 2 commits

Conversation

@akashrn5 (Contributor) commented Feb 10, 2020

Why is this PR needed?

Currently we have datamaps such as the default datamaps (block and blocklet), coarse-grained datamaps like bloom, and fine-grained datamaps like lucene, which help in better pruning during queries. What if we introduce another kind of datamap that holds the blocklet ID as an index? At the initial level we call it an index, and it works as a child table to the main table, like MV does in the current code.

Let's introduce the secondary index on the carbon table: a child table of the main table, created on a column much like a lucene datamap, where we specify index columns. In the same way, we create a secondary index on a column; the index entries for that column are blocklet IDs, which enables better pruning and faster queries when there is a filter on the index column.

What changes were proposed in this PR?

Introduced the SI feature. It contains (see the sketch after this list):

  1. create SI table
  2. load to SI
  3. query from SI
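
As a rough illustration of these three operations, here is a minimal sketch using Spark SQL from Java. The table, column, and index names are hypothetical, and the DDL syntax is paraphrased from the parser changes in this PR rather than taken from official documentation:

```java
import org.apache.spark.sql.SparkSession;

public class SecondaryIndexSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("si-sketch")
        .master("local[*]")
        .getOrCreate();

    // 1. Create a main table and a secondary index table on one column.
    //    "maintable", "name_index", and the columns are hypothetical.
    spark.sql("CREATE TABLE maintable(id INT, name STRING, city STRING) "
        + "STORED AS carbondata");
    spark.sql("CREATE INDEX name_index ON TABLE maintable(name) AS 'carbondata'");

    // 2. Loading the main table also loads the SI table, which stores the
    //    indexed column values together with their blocklet IDs.
    spark.sql("LOAD DATA INPATH '/path/to/data.csv' INTO TABLE maintable");

    // 3. A filter on the indexed column is pruned via the SI table first,
    //    so only matching blocklets of the main table are scanned.
    spark.sql("SELECT * FROM maintable WHERE name = 'john'").show();
  }
}
```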

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • Yes

@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/214/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1916/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/221/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1923/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/229/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1931/
@akashrn5 force-pushed the SI_feature branch 2 times, most recently from 26cb931 to c5c1b46 on February 11, 2020 11:00
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/235/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/236/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1938/
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/237/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1939/
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/244/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1946/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/248/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1951/

@Indhumathi27 (Contributor): retest this please

@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/252/

@QiangCai (Contributor): please add description

@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1955/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/261/

@akashrn5 changed the title from "[WIP]Si feature" to "[CARBONDATA-3680]Support Secondary Index feature on carbon table." on Feb 12, 2020
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/264/

* threshold of high cardinality
*/
@CarbonProperty
public static final String HIGH_CARDINALITY_THRESHOLD = "high.cardinality.threshold";
Reviewer (Contributor): I think this is not required now

@akashrn5 (Author): yes, removed


/**
* Default value for SI segment Compaction / merge small files
* Making this true degrade the LOAD performance
Reviewer (Contributor): please explain in the comment when the user should set this to true

@akashrn5 (Author): done

@@ -2341,4 +2347,78 @@ private CarbonCommonConstants() {
* Default first day of week
*/
public static final String CARBON_TIMESERIES_FIRST_DAY_OF_WEEK_DEFAULT = "SUNDAY";

@CarbonProperty
public static final String CARBON_PUSH_LEFTSEMIEXIST_JOIN_AS_IN_FILTER =
Reviewer (Contributor): Please explain in the comment when the user should set this to true

/**
* Method to prune the segments based on task min/max values
*
* @param segments
Reviewer (Contributor): remove it if you are not writing a description

@akashrn5 (Author): done

@@ -32,14 +32,14 @@
* It is the wrapper around datamap and related filter expression. By using it user can apply
* datamaps in expression style.
*/
public interface DataMapExprWrapper extends Serializable {
public abstract class DataMapExprWrapper implements Serializable {
Reviewer (Contributor): you can still use an interface; a Java 8 interface can have default implementations

@akashrn5 (Author): since some users are still on older Java versions, I think we can keep this until we completely move out.
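
For context, the reviewer's point refers to Java 8 default methods, which let an interface carry shared behavior without being converted to an abstract class. An illustrative sketch with hypothetical names, not code from this PR:

```java
import java.io.Serializable;

// A Java 8 interface can mix abstract methods with "default" methods that
// provide an inherited implementation, much like an abstract class would.
public interface ExprWrapperSketch extends Serializable {

  // Implementations must still provide this.
  String getUniqueId();

  // Shared behavior lives in the interface itself; implementations
  // may override it but are not forced to.
  default boolean supportsPruning() {
    return true;
  }
}
```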

public void serializeMemoryBlock() {
}

public void copyToMemoryBlock() {
@jackylk (Contributor), Feb 12, 2020: why empty implementation? not abstract?

@akashrn5 (Author): it's required only for the unsafe implementation.
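
In other words, the base class keeps a no-op body so that only the unsafe-backed store has to override it. A hypothetical sketch of that pattern, not the actual class hierarchy:

```java
// The no-op base keeps on-heap subclasses from being forced to
// implement hooks that are only meaningful for off-heap storage.
abstract class BlockletStoreSketch {

  public void serializeMemoryBlock() {
    // Intentionally empty: only unsafe (off-heap) stores need this.
  }

  public void copyToMemoryBlock() {
    // Intentionally empty for the same reason.
  }
}

class UnsafeBlockletStoreSketch extends BlockletStoreSketch {

  @Override
  public void serializeMemoryBlock() {
    // ...copy on-heap rows into an off-heap memory block here...
  }
}
```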

identifierWrapper.getConfiguration(), indexInfos);
identifierWrapper.isAddToUnsafe(),
identifierWrapper.getConfiguration(),
identifierWrapper.isSerializeDmStore(), indexInfos);
Reviewer (Contributor): move indexInfos to the next line

@akashrn5 (Author): done

identifierWrapper.getConfiguration(), indexInfos);
identifierWrapper.isAddToUnsafe(),
identifierWrapper.getConfiguration(),
identifierWrapper.isSerializeDmStore(), indexInfos);
Reviewer (Contributor): move indexInfos to the next line

@akashrn5 (Author): done

@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/265/

@akashrn5 changed the title from "[CARBONDATA-3680]Support Secondary Index feature on carbon table." to "[CARBONDATA-3680][alpha-feature]Support Secondary Index feature on carbon table." on Feb 12, 2020
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/268/
@CarbonDataQA1: Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1971/

* merge the data files for upcoming loads or run SI rebuild command which does this job for all
* segments. (REBUILD INDEX <index_table>)
*/
public static final String DEFAULT_CARBON_SI_SEGMENT_MERGE = "false";
Reviewer (Contributor): DEFAULT_CARBON_SI_SEGMENT_MERGE => CARBON_SI_SEGMENT_MERGE_DEFAULT

@akashrn5 (Author): done
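
To make the trade-off concrete: merging small SI files during load slows the load down, which is why the default is "false", and the code comment above points to REBUILD INDEX as the deferred alternative. A minimal sketch, assuming the property key behind this constant is "carbon.si.segment.merge" and using a hypothetical index table name:

```java
import org.apache.carbondata.core.util.CarbonProperties;
import org.apache.spark.sql.SparkSession;

public class SiSegmentMergeSketch {
  public static void mergeSmallSiFiles(SparkSession spark) {
    // Option 1: pay the merge cost at load time. The key
    // "carbon.si.segment.merge" is an assumption about the constant above.
    CarbonProperties.getInstance()
        .addProperty("carbon.si.segment.merge", "true");

    // Option 2: keep loads fast and merge the small files afterwards for
    // all segments, as the comment suggests. "name_index" is hypothetical.
    spark.sql("REBUILD INDEX name_index");
  }
}
```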

@@ -61,7 +61,8 @@

public DiskBasedDMSchemaStorageProvider(String storePath) {
this.storePath = CarbonUtil.checkAndAppendHDFSUrl(storePath);
this.mdtFilePath = storePath + CarbonCommonConstants.FILE_SEPARATOR + "datamap.mdtfile";
this.mdtFilePath = CarbonUtil.checkAndAppendHDFSUrl(
Reviewer (Contributor): suggested
this.mdtFilePath = this.storePath + CarbonCommonConstants.FILE_SEPARATOR + "datamap.mdtfile";

@akashrn5 (Author): done

* inside the secondary index table, we need to delete the stale carbondata files.
* refer {@link org.apache.spark.sql.secondaryindex.rdd.CarbonSIRebuildRDD}
*/
private static void cleanUpDataFilesAfterSmallFIlesMergeForSI(CarbonTable table,
Reviewer (Contributor): cleanUpDataFilesAfterSmallFIlesMergeForSI => cleanUpDataFilesAfterSmallFilesMergeForSI

@akashrn5 (Author): done

case indexTableName ~ table ~ cols ~ indexStoreType ~ tblProp =>

if (!("carbondata".equalsIgnoreCase(indexStoreType) ||
"org.apache.carbondata.format".equalsIgnoreCase(indexStoreType))) {
Reviewer (Contributor): support only "carbondata"

@akashrn5 (Author): done

}

protected lazy val showIndexes: Parser[LogicalPlan] =
(SHOW ~> opt(FORMATTED)) ~> (INDEXES | INDEX) ~> ON ~> ident ~ opt((FROM | IN) ~> ident) <~
Reviewer (Contributor): how about
(SHOW ~> INDEXES ~> ON ~> (ident <~ ".").? ~ ident

@akashrn5 (Author): done
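
With the suggested grammar, the statement takes the form SHOW INDEXES ON [db.]table. A hypothetical usage sketch:

```java
import org.apache.spark.sql.SparkSession;

public class ShowIndexesSketch {
  public static void listIndexes(SparkSession spark) {
    // "maintable" is a hypothetical table; the optional database
    // qualifier comes from the (ident <~ ".").? part of the grammar.
    spark.sql("SHOW INDEXES ON maintable").show();
  }
}
```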

throw new ErrorMessage(
s"Parent Table ${ carbonTable.getDatabaseName }." +
s"${ carbonTable.getTableName }" +
s" is Partition Table and Secondary index on Partition table is not supported ")
Reviewer (Contributor): is it easy to support it in the future?

@akashrn5 (Author): yes, there will be some issues in the pruning part, but we can support it

## See the License for the specific language governing permissions and
## limitations under the License.
## ------------------------------------------------------------------------
org.apache.spark.sql.CarbonSource
Reviewer (Contributor): already exists in the integration/spark2 module

@akashrn5 (Author): removed

@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/275/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1978/
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/278/
@CarbonDataQA1: Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1981/

@QiangCai (Contributor): LGTM

@asfgit closed this in f127245 on Feb 13, 2020
asfgit pushed a commit that referenced this pull request on Oct 19, 2020:
Why is this PR needed?
There are random failures of 11 test cases in CI.

What changes were proposed in this PR?
a) After analyzing and adding logs, it was found that a test case added in #3608 does not reset the carbon.enable.auto.load.merge carbon property.
The failure is random because if any other test case in CarbonIndexFileMergeTestCase (one that resets that property) runs before the problematic test case, it sets the value back to false and masks the issue. Also, the order of test case execution within a suite is not always the same, so the failure appears randomly.

b) A test case added by #3832 has a problem: "test cdc with compaction" fails randomly because segment 12.2 is sometimes not present, and the merge carbon property is not reset when it fails.
This happens because, with auto compaction, segment 0.1 sometimes gets marked for delete during the merge operation. There is no record mismatch; segment 12.2 simply does not exist and the data lands under some other segment name. The check for segment 12.2 has been removed.
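
As a general pattern (a minimal sketch, not code from this PR), a test that mutates a global carbon property can guard the reset in a finally block so that suite execution order cannot mask the setting. The constant names are assumed to be the ones in CarbonCommonConstants; the helper itself is hypothetical:

```java
import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;

public class AutoMergePropertyGuard {
  public static void runWithAutoMerge(Runnable testBody) {
    CarbonProperties props = CarbonProperties.getInstance();
    // Enable auto load merge only for the duration of the test body.
    props.addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE, "true");
    try {
      testBody.run();
    } finally {
      // Always restore the default so the execution order of other
      // test cases in the suite cannot mask or expose this setting.
      props.addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE,
          CarbonCommonConstants.DEFAULT_ENABLE_AUTO_LOAD_MERGE);
    }
  }
}
```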

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

this closes #3948