[CARBONDATA-3680][alpha-feature]Support Secondary Index feature on carbon table. #3608

Closed
wants to merge 2 commits

Conversation

@akashrn5 (Contributor) commented Feb 10, 2020

Why is this PR needed?

Currently we have datamaps such as the default datamaps (block and blocklet), coarse-grained datamaps like bloom, and fine-grained datamaps like lucene, which help in better pruning during queries. What if we introduce another kind of datamap that holds the blocklet ID as an index? At the initial level we call it an index, and it works as a child table to the main table, like MV does in the current code.

Let's introduce the secondary index on the carbon table: a child table of the main table, created on a column much like a lucene datamap, where we specify index columns. In the same way, we create a secondary index on a column; the index entries for that column are blocklet IDs, which enables better pruning and faster queries when there is a filter on the index column.

What changes were proposed in this PR?

Introduced the SI feature. It contains (see the sketch after this list):

  1. create SI table
  2. load to SI
  3. query from SI
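
As a rough illustration of these three operations, here is a minimal sketch using Spark SQL from Java. The table, column, and index names are hypothetical, and the DDL syntax is paraphrased from the parser changes in this PR rather than taken from official documentation:

```java
import org.apache.spark.sql.SparkSession;

public class SecondaryIndexSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("si-sketch")
        .master("local[*]")
        .getOrCreate();

    // 1. Create a main table and a secondary index table on one column.
    //    "maintable", "name_index", and the columns are hypothetical.
    spark.sql("CREATE TABLE maintable(id INT, name STRING, city STRING) "
        + "STORED AS carbondata");
    spark.sql("CREATE INDEX name_index ON TABLE maintable(name) AS 'carbondata'");

    // 2. Loading the main table also loads the SI table, which stores the
    //    indexed column values together with their blocklet IDs.
    spark.sql("LOAD DATA INPATH '/path/to/data.csv' INTO TABLE maintable");

    // 3. A filter on the indexed column is pruned via the SI table first,
    //    so only matching blocklets of the main table are scanned.
    spark.sql("SELECT * FROM maintable WHERE name = 'john'").show();
  }
}
```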

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • Yes

@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/214/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1916/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/221/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1923/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/229/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1931/
@akashrn5 force-pushed the SI_feature branch 2 times, most recently from 26cb931 to c5c1b46 on February 11, 2020 11:00
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/235/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/236/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1938/
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/237/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1939/
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/244/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1946/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/248/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1951/

@Indhumathi27 (Contributor): retest this please

@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/252/

@QiangCai (Contributor): please add description

@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1955/
@CarbonDataQA1: Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/261/

@akashrn5 changed the title from "[WIP]Si feature" to "[CARBONDATA-3680]Support Secondary Index feature on carbon table." on Feb 12, 2020
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/264/

* threshold of high cardinality
*/
@CarbonProperty
public static final String HIGH_CARDINALITY_THRESHOLD = "high.cardinality.threshold";
Reviewer (Contributor): I think this is not required now

@akashrn5 (Author): yes, removed


/**
* Default value for SI segment Compaction / merge small files
* Making this true degrade the LOAD performance
Reviewer (Contributor): please explain in the comment when the user should set this to true

@akashrn5 (Author): done

@@ -2341,4 +2347,78 @@ private CarbonCommonConstants() {
* Default first day of week
*/
public static final String CARBON_TIMESERIES_FIRST_DAY_OF_WEEK_DEFAULT = "SUNDAY";

@CarbonProperty
public static final String CARBON_PUSH_LEFTSEMIEXIST_JOIN_AS_IN_FILTER =
Reviewer (Contributor): Please explain in the comment when the user should set this to true

/**
* Method to prune the segments based on task min/max values
*
* @param segments
Reviewer (Contributor): remove it if you are not writing a description

@akashrn5 (Author): done

@@ -32,14 +32,14 @@
* It is the wrapper around datamap and related filter expression. By using it user can apply
* datamaps in expression style.
*/
public interface DataMapExprWrapper extends Serializable {
public abstract class DataMapExprWrapper implements Serializable {
Reviewer (Contributor): you can still use an interface; a Java 8 interface can have default implementations

@akashrn5 (Author): since some users are still on older Java versions, I think we can keep this until we completely move out.
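
For context, the reviewer's point refers to Java 8 default methods, which let an interface carry shared behavior without being converted to an abstract class. An illustrative sketch with hypothetical names, not code from this PR:

```java
import java.io.Serializable;

// A Java 8 interface can mix abstract methods with "default" methods that
// provide an inherited implementation, much like an abstract class would.
public interface ExprWrapperSketch extends Serializable {

  // Implementations must still provide this.
  String getUniqueId();

  // Shared behavior lives in the interface itself; implementations
  // may override it but are not forced to.
  default boolean supportsPruning() {
    return true;
  }
}
```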

public void serializeMemoryBlock() {
}

public void copyToMemoryBlock() {
@jackylk (Contributor), Feb 12, 2020: why empty implementation? not abstract?

@akashrn5 (Author): it's required only for the unsafe implementation.
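
In other words, the base class keeps a no-op body so that only the unsafe-backed store has to override it. A hypothetical sketch of that pattern, not the actual class hierarchy:

```java
// The no-op base keeps on-heap subclasses from being forced to
// implement hooks that are only meaningful for off-heap storage.
abstract class BlockletStoreSketch {

  public void serializeMemoryBlock() {
    // Intentionally empty: only unsafe (off-heap) stores need this.
  }

  public void copyToMemoryBlock() {
    // Intentionally empty for the same reason.
  }
}

class UnsafeBlockletStoreSketch extends BlockletStoreSketch {

  @Override
  public void serializeMemoryBlock() {
    // ...copy on-heap rows into an off-heap memory block here...
  }
}
```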

identifierWrapper.getConfiguration(), indexInfos);
identifierWrapper.isAddToUnsafe(),
identifierWrapper.getConfiguration(),
identifierWrapper.isSerializeDmStore(), indexInfos);
Reviewer (Contributor): move indexInfos to the next line

@akashrn5 (Author): done

identifierWrapper.getConfiguration(), indexInfos);
identifierWrapper.isAddToUnsafe(),
identifierWrapper.getConfiguration(),
identifierWrapper.isSerializeDmStore(), indexInfos);
Reviewer (Contributor): move indexInfos to the next line

@akashrn5 (Author): done

@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/265/

@akashrn5 changed the title from "[CARBONDATA-3680]Support Secondary Index feature on carbon table." to "[CARBONDATA-3680][alpha-feature]Support Secondary Index feature on carbon table." on Feb 12, 2020
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/268/
@CarbonDataQA1: Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1971/

* merge the data files for upcoming loads or run SI rebuild command which does this job for all
* segments. (REBUILD INDEX <index_table>)
*/
public static final String DEFAULT_CARBON_SI_SEGMENT_MERGE = "false";
Reviewer (Contributor): DEFAULT_CARBON_SI_SEGMENT_MERGE => CARBON_SI_SEGMENT_MERGE_DEFAULT

@akashrn5 (Author): done
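
To make the trade-off concrete: merging small SI files during load slows the load down, which is why the default is "false", and the code comment above points to REBUILD INDEX as the deferred alternative. A minimal sketch, assuming the property key behind this constant is "carbon.si.segment.merge" and using a hypothetical index table name:

```java
import org.apache.carbondata.core.util.CarbonProperties;
import org.apache.spark.sql.SparkSession;

public class SiSegmentMergeSketch {
  public static void mergeSmallSiFiles(SparkSession spark) {
    // Option 1: pay the merge cost at load time. The key
    // "carbon.si.segment.merge" is an assumption about the constant above.
    CarbonProperties.getInstance()
        .addProperty("carbon.si.segment.merge", "true");

    // Option 2: keep loads fast and merge the small files afterwards for
    // all segments, as the comment suggests. "name_index" is hypothetical.
    spark.sql("REBUILD INDEX name_index");
  }
}
```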

@@ -61,7 +61,8 @@

public DiskBasedDMSchemaStorageProvider(String storePath) {
this.storePath = CarbonUtil.checkAndAppendHDFSUrl(storePath);
this.mdtFilePath = storePath + CarbonCommonConstants.FILE_SEPARATOR + "datamap.mdtfile";
this.mdtFilePath = CarbonUtil.checkAndAppendHDFSUrl(
Reviewer (Contributor): suggested
this.mdtFilePath = this.storePath + CarbonCommonConstants.FILE_SEPARATOR + "datamap.mdtfile";

@akashrn5 (Author): done

* inside the secondary index table, we need to delete the stale carbondata files.
* refer {@link org.apache.spark.sql.secondaryindex.rdd.CarbonSIRebuildRDD}
*/
private static void cleanUpDataFilesAfterSmallFIlesMergeForSI(CarbonTable table,
Reviewer (Contributor): cleanUpDataFilesAfterSmallFIlesMergeForSI => cleanUpDataFilesAfterSmallFilesMergeForSI

@akashrn5 (Author): done

case indexTableName ~ table ~ cols ~ indexStoreType ~ tblProp =>

if (!("carbondata".equalsIgnoreCase(indexStoreType) ||
"org.apache.carbondata.format".equalsIgnoreCase(indexStoreType))) {
Reviewer (Contributor): support only "carbondata"

@akashrn5 (Author): done

}

protected lazy val showIndexes: Parser[LogicalPlan] =
(SHOW ~> opt(FORMATTED)) ~> (INDEXES | INDEX) ~> ON ~> ident ~ opt((FROM | IN) ~> ident) <~
Reviewer (Contributor): how about
(SHOW ~> INDEXES ~> ON ~> (ident <~ ".").? ~ ident

@akashrn5 (Author): done
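
With the suggested grammar, the statement takes the form SHOW INDEXES ON [db.]table. A hypothetical usage sketch:

```java
import org.apache.spark.sql.SparkSession;

public class ShowIndexesSketch {
  public static void listIndexes(SparkSession spark) {
    // "maintable" is a hypothetical table; the optional database
    // qualifier comes from the (ident <~ ".").? part of the grammar.
    spark.sql("SHOW INDEXES ON maintable").show();
  }
}
```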

throw new ErrorMessage(
s"Parent Table ${ carbonTable.getDatabaseName }." +
s"${ carbonTable.getTableName }" +
s" is Partition Table and Secondary index on Partition table is not supported ")
Reviewer (Contributor): is it easy to support it in the future?

@akashrn5 (Author): yes, there will be some issues in the pruning part, but we can support it

## See the License for the specific language governing permissions and
## limitations under the License.
## ------------------------------------------------------------------------
org.apache.spark.sql.CarbonSource
Reviewer (Contributor): already exists in the integration/spark2 module

@akashrn5 (Author): removed

@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/275/
@CarbonDataQA1: Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1978/
@CarbonDataQA1: Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/278/
@CarbonDataQA1: Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1981/

@QiangCai (Contributor): LGTM

@asfgit closed this in f127245 on Feb 13, 2020
asfgit pushed a commit that referenced this pull request on Oct 19, 2020:
Why is this PR needed?
There are random failures of 11 test cases in CI.

What changes were proposed in this PR?
a) After analyzing and adding logs, it was found that a test case added in #3608 does not reset the carbon.enable.auto.load.merge carbon property.
The failure is random because if any other test case in CarbonIndexFileMergeTestCase (one that resets that property) runs before the problematic test case, it sets the value back to false and masks the issue. Also, the order of test case execution within a suite is not always the same, so the failure appears randomly.

b) A test case added by #3832 has a problem: "test cdc with compaction" fails randomly because segment 12.2 is sometimes not present, and the merge carbon property is not reset when it fails.
This happens because, with auto compaction, segment 0.1 sometimes gets marked for delete during the merge operation. There is no record mismatch; segment 12.2 simply does not exist and the data lands under some other segment name. The check for segment 12.2 has been removed.
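
As a general pattern (a minimal sketch, not code from this PR), a test that mutates a global carbon property can guard the reset in a finally block so that suite execution order cannot mask the setting. The constant names are assumed to be the ones in CarbonCommonConstants; the helper itself is hypothetical:

```java
import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;

public class AutoMergePropertyGuard {
  public static void runWithAutoMerge(Runnable testBody) {
    CarbonProperties props = CarbonProperties.getInstance();
    // Enable auto load merge only for the duration of the test body.
    props.addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE, "true");
    try {
      testBody.run();
    } finally {
      // Always restore the default so the execution order of other
      // test cases in the suite cannot mask or expose this setting.
      props.addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE,
          CarbonCommonConstants.DEFAULT_ENABLE_AUTO_LOAD_MERGE);
    }
  }
}
```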

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

this closes #3948