
[CARBONDATA-3025] add more metadata in carbon file footer #2829

Closed
wants to merge 3 commits

Conversation

@akashrn5 (Contributor) commented Oct 17, 2018

Changes Proposed in this PR:

  1. Add more info to the carbon file footer, such as written_by (which will be the Spark application name) in the case of insert into and load commands. This info can be read using the CLI.
  2. For the SDK, an API will be exposed to write this info into the footer and another API will be exposed to read it (a usage sketch follows this list).
  3. The footer will also record the CarbonData version in which the file was written, which is helpful for diagnostics, compatibility checks, etc.
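
A hedged usage sketch of the SDK path from points 1 and 2, assembled from the builder calls shown in the diffs further down; the output path, schema JSON, and application name are placeholders, not values from this PR:

```java
import org.apache.carbondata.sdk.file.CarbonWriter;
import org.apache.carbondata.sdk.file.Schema;

public class WrittenByExample {
  public static void main(String[] args) throws Exception {
    // Placeholder output path and schema; replace with real values.
    String path = "/tmp/carbon_written_by_demo";
    String schemaJson = "[{\"name\":\"string\"},{\"age\":\"int\"}]";

    // writtenBy() records the writing application in the file footer, so the
    // CarbonData CLI (or the SDK read API) can later report who wrote the file.
    CarbonWriter writer = CarbonWriter.builder()
        .outputPath(path)
        .withCsvInput(Schema.parseJson(schemaJson))
        .writtenBy("MySparkApplication")
        .build();

    writer.write(new String[]{"robot", "1"});
    writer.close();
  }
}
```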

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

  • [] Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/839/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1037/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9105/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/846/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9111/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1043/

@@ -206,6 +206,8 @@ struct FileFooter3{
4: optional list<BlockletInfo3> blocklet_info_list3; // Information about blocklets of all columns in this file for V3 format
5: optional dictionary.ColumnDictionaryChunk dictionary; // Blocklet local dictionary
6: optional bool is_sort; // True if the data is sorted in this file, it is used for compaction to decide whether to use merge sort or not
7: optional string written_by; // written by is used to write who wrote the file, it can be LOAD, or SDK etc
Contributor

please add a map<string, string> property, and put these two new fields into this property map. We can extend any field by using this property map instead of adding more fields

Contributor Author

added a map

@@ -56,7 +56,7 @@ class TestCarbonFileInputFormatWithExternalCarbonTable extends QueryTest with Be
val builder = CarbonWriter.builder()
val writer =
builder.outputPath(writerPath + "/Fact/Part0/Segment_null")
.withCsvInput(Schema.parseJson(schema)).build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
Contributor

Suggested change
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("TestCarbonFileInputFormatWithExternalCarbonTable").build()

Contributor Author

added the class name for writtenBy

@@ -139,13 +139,13 @@ class TestNonTransactionalCarbonTable extends QueryTest with BeforeAndAfterAll {
.sortBy(sortColumns.toArray)
.uniqueIdentifier(
System.currentTimeMillis).withBlockSize(2).withLoadOptions(options)
.withCsvInput(Schema.parseJson(schema)).build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
Contributor

Suggested change
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("TestNonTransactionalCarbonTable").build()

Contributor Author

changed

@@ -984,7 +984,7 @@ class SparkCarbonDataSourceTest extends FunSuite with BeforeAndAfterAll {
val writer =
builder.outputPath(path)
.uniqueIdentifier(System.nanoTime()).withBlockSize(2)
.withCsvInput(new Schema(structType)).build()
.withCsvInput(new Schema(structType)).writtenBy("DataSource").build()
@jackylk (Contributor) commented Oct 18, 2018

use the class name instead

Contributor Author

ok

@@ -460,4 +464,20 @@ public String getColumnCompressor() {
public void setColumnCompressor(String columnCompressor) {
this.columnCompressor = columnCompressor;
}

public String getAppName() {
Contributor

It is better to use the same name: writtenBy

Contributor Author

handled as a carbon property

return version;
}

public void setVersion(String version) {
Contributor

Is this the carbon jar version or the application version? The application version should be given in writtenBy.

Contributor Author

it is the carbon version

@@ -76,6 +80,17 @@ public static Schema readSchemaInDataFile(String dataFilePath) throws IOExceptio
return new Schema(columnSchemaList);
}

public static String getVersionDetails(String dataFilePath) throws IOException {
Contributor

Why add this method in SchemaReader? This information is from the footer, right?

Contributor Author

Yes, this info is from the footer. In the SDK, the API we expose is the schema reader, so I thought I could expose one more API there to get the version details. Will this be OK, or should it go in a new class for footer info? Please suggest.
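
A hedged sketch of calling the read-side API discussed here, using the getVersionDetails signature from the diff above. The class name SchemaReader and its package are taken from this review thread and are placeholders, since where the method finally lives is exactly what is being discussed; the file path is also a placeholder:

```java
import java.io.IOException;

// Placeholder import: the package and final class hosting getVersionDetails()
// were still under discussion at this point in the review.
import org.apache.carbondata.sdk.file.SchemaReader;

public class ReadVersionSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder path to an existing .carbondata file.
    String dataFilePath = "/tmp/carbon_written_by_demo/part-0-0.carbondata";

    // Signature taken from the diff above: static String getVersionDetails(String).
    String version = SchemaReader.getVersionDetails(dataFilePath);
    System.out.println("File written by CarbonData version: " + version);
  }
}
```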

@akashrn5 force-pushed the integrate1 branch 3 times, most recently from 896f8ab to 821aa19 on October 18, 2018 09:52
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/858/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/860/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9124/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1058/

@@ -206,6 +206,7 @@ struct FileFooter3{
4: optional list<BlockletInfo3> blocklet_info_list3; // Information about blocklets of all columns in this file for V3 format
5: optional dictionary.ColumnDictionaryChunk dictionary; // Blocklet local dictionary
6: optional bool is_sort; // True if the data is sorted in this file, it is used for compaction to decide whether to use merge sort or not
7: optional map<string, string> extra_info; // written by is used to write who wrote the file, it can be Aplication name, or SDK etc and version in which this carbondata file is written etc
Contributor

Since this is optional and we will set a lot of extra information in the footer, I think we can provide one general interface to set and get this info, which means we do not need to provide separate 'writtenBy' and 'setVersion' interfaces; following that pattern, the interfaces will keep multiplying.

In my opinion, we can provide just one interface, setExtraInfo/getExtraInfo, which accepts/returns a map.
Moreover, this extraInfo is optional, which means you do not need to set it in all the test cases; you only need to cover it in your focused test case to avoid too many changes.
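
A minimal sketch of the single map-based accessor proposed here, using a hypothetical holder class in place of the real footer model (this illustrates the suggestion only, not the API the PR finally adopted):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder illustrating the proposal: one generic extra-info map
// instead of dedicated writtenBy/version accessors.
public class FooterExtraInfoSketch {
  private Map<String, String> extraInfo = new HashMap<>();

  public void setExtraInfo(Map<String, String> extraInfo) {
    this.extraInfo = new HashMap<>(extraInfo);
  }

  public Map<String, String> getExtraInfo() {
    return Collections.unmodifiableMap(extraInfo);
  }

  public static void main(String[] args) {
    FooterExtraInfoSketch footer = new FooterExtraInfoSketch();
    Map<String, String> info = new HashMap<>();
    info.put("written_by", "MySparkApplication"); // assumed key names
    info.put("version", "1.5.1");
    footer.setExtraInfo(info);
    System.out.println(footer.getExtraInfo());
  }
}
```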

Contributor Author

A map is created for all the extra info; I did not quite get the point, since currently it is already a map, and that suits adding extra metadata. Regarding the test cases: since these are API-level changes, we need to change those test cases.

Contributor

Please modify the comment for this field

Contributor Author

modified

@@ -371,8 +381,14 @@ public CarbonWriter build() throws IOException, InvalidLoadOptionException {
"Writer type is not set, use withCsvInput() or withAvroInput() or withJsonInput() "
+ "API based on input");
}
if (this.writtenByApp == null) {
Contributor

Add a check for an empty string too. There is no point in setting an empty value.
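
A hedged sketch of the null-or-empty guard being requested, reusing the writtenByApp field name from the diff above; the exception type and message are assumptions for illustration:

```java
// Illustrative stand-in for the validation inside CarbonWriterBuilder.build().
public final class WrittenByGuardSketch {
  static void validateWrittenBy(String writtenByApp) {
    // Reject both null and blank values: there is no point in writing an
    // empty written_by entry into the footer.
    if (writtenByApp == null || writtenByApp.trim().isEmpty()) {
      throw new RuntimeException(
          "writtenBy must be set to a non-empty application name before calling build()");
    }
  }

  public static void main(String[] args) {
    validateWrittenBy("MySparkApplication"); // passes
    validateWrittenBy("   ");                // throws: blank value is rejected
  }
}
```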

Contributor Author

added

@akashrn5 force-pushed the integrate1 branch 2 times, most recently from b4a15d6 to aae31eb on October 22, 2018 08:45
@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1108/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/908/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9174/

@akashrn5 force-pushed the integrate1 branch 2 times, most recently from 786fcfe to e4f17d0 on October 23, 2018 04:58
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/946/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9211/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1154/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1162/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/957/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9221/

@akashrn5 force-pushed the integrate1 branch 2 times, most recently from d905fd8 to 6821413 on October 23, 2018 10:26
@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9227/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1170/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/962/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9230/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/969/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1179/

/**
* carbon version detail to be written in carbondata footer for better maintanability
*/
public static final String CARBON_VERSION_FOOTER_INFO = "version";
Contributor

I think it is better to give a more specific name instead of "version". How about "carbon_writter_version"?
And in the comment, you can say it is the CarbonData software version used when writing the carbon files.

Contributor Author

yes, that name suits better

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1002/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9268/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1215/

@akashrn5 (Contributor Author)

@akashrn5 Instead of changing many classes to pass writtenBy and appName, can't we set them in CarbonProperties and, in the writer step, get them from there and write to thrift?

handled
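
A hedged sketch of the CarbonProperties route mentioned above: set the values once, then read them back at write time and place them in the footer's extra-info map. The property key names here are assumptions for illustration; addProperty/getProperty are the usual CarbonProperties accessors:

```java
import org.apache.carbondata.core.util.CarbonProperties;

public class FooterPropertySketch {
  // Assumed key names for illustration only; the real constants are defined
  // in this PR's changes, not here.
  private static final String WRITTEN_BY_KEY = "carbon.writtenby.app.name";
  private static final String VERSION_KEY = "carbon.version.footer.info";

  public static void main(String[] args) {
    // Load/insert path: stash the values once instead of threading them
    // through many intermediate classes.
    CarbonProperties.getInstance().addProperty(WRITTEN_BY_KEY, "MySparkApplication");
    CarbonProperties.getInstance().addProperty(VERSION_KEY, "1.5.1");

    // Writer step: read them back and write them into the thrift footer.
    String writtenBy = CarbonProperties.getInstance().getProperty(WRITTEN_BY_KEY);
    String version = CarbonProperties.getInstance().getProperty(VERSION_KEY);
    System.out.println("written_by=" + writtenBy + ", version=" + version);
  }
}
```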

@kunal642 (Contributor)

LGTM

@asfgit closed this in 9578786 on Oct 25, 2018
asfgit pushed a commit that referenced this pull request Nov 21, 2018
Changes Proposed in this PR:
Add more info to the carbon file footer, such as written_by (which will be the Spark application name)
in the case of insert into and load commands. This info can be read using the CLI.
For the SDK, an API will be exposed to write this info into the footer and another API will be exposed to read it.
The footer will also record the CarbonData version in which the file was written,
which is helpful for diagnostics and compatibility checks.

This closes #2829