New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CARBONDATA-3025]add more metadata in carbon file footer #2829
Conversation
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/839/ |
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1037/ |
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9105/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/846/ |
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9111/ |
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1043/ |
@@ -206,6 +206,8 @@ struct FileFooter3{ | |||
4: optional list<BlockletInfo3> blocklet_info_list3; // Information about blocklets of all columns in this file for V3 format | |||
5: optional dictionary.ColumnDictionaryChunk dictionary; // Blocklet local dictionary | |||
6: optional bool is_sort; // True if the data is sorted in this file, it is used for compaction to decide whether to use merge sort or not | |||
7: optional string written_by; // written by is used to write who wrote the file, it can be LOAD, or SDK etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a map<string, string> property, and put these two new fields into this property map. We can extend any field by using this property map instead of adding more fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a map
@@ -56,7 +56,7 @@ class TestCarbonFileInputFormatWithExternalCarbonTable extends QueryTest with Be | |||
val builder = CarbonWriter.builder() | |||
val writer = | |||
builder.outputPath(writerPath + "/Fact/Part0/Segment_null") | |||
.withCsvInput(Schema.parseJson(schema)).build() | |||
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build() | |
.withCsvInput(Schema.parseJson(schema)).writtenBy("TestCarbonFileInputFormatWithExternalCarbonTable").build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added classname for writtenby
@@ -139,13 +139,13 @@ class TestNonTransactionalCarbonTable extends QueryTest with BeforeAndAfterAll { | |||
.sortBy(sortColumns.toArray) | |||
.uniqueIdentifier( | |||
System.currentTimeMillis).withBlockSize(2).withLoadOptions(options) | |||
.withCsvInput(Schema.parseJson(schema)).build() | |||
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build() | |
.withCsvInput(Schema.parseJson(schema)).writtenBy("TestNonTransactionalCarbonTable").build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
@@ -984,7 +984,7 @@ class SparkCarbonDataSourceTest extends FunSuite with BeforeAndAfterAll { | |||
val writer = | |||
builder.outputPath(path) | |||
.uniqueIdentifier(System.nanoTime()).withBlockSize(2) | |||
.withCsvInput(new Schema(structType)).build() | |||
.withCsvInput(new Schema(structType)).writtenBy("DataSource").build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the class name instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -460,4 +464,20 @@ public String getColumnCompressor() { | |||
public void setColumnCompressor(String columnCompressor) { | |||
this.columnCompressor = columnCompressor; | |||
} | |||
|
|||
public String getAppName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better use the same name: writtenBy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handled as carbon property
return version; | ||
} | ||
|
||
public void setVersion(String version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this carbon jar version? or application version? Application version should be given in the writtenBy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is carbon version
@@ -76,6 +80,17 @@ public static Schema readSchemaInDataFile(String dataFilePath) throws IOExceptio | |||
return new Schema(columnSchemaList); | |||
} | |||
|
|||
public static String getVersionDetails(String dataFilePath) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this method in SchemaReader? This information is from Footer, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this info is from footer, in SDK we expose API is schema reader right, i thought may be i can expose one more API there only to get the version details, will this be ok? or need to write this in new class for footer info? please suggest
896f8ab
to
821aa19
Compare
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/858/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/860/ |
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9124/ |
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1058/ |
@@ -206,6 +206,7 @@ struct FileFooter3{ | |||
4: optional list<BlockletInfo3> blocklet_info_list3; // Information about blocklets of all columns in this file for V3 format | |||
5: optional dictionary.ColumnDictionaryChunk dictionary; // Blocklet local dictionary | |||
6: optional bool is_sort; // True if the data is sorted in this file, it is used for compaction to decide whether to use merge sort or not | |||
7: optional map<string, string> extra_info; // written by is used to write who wrote the file, it can be Aplication name, or SDK etc and version in which this carbondata file is written etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is optional and we will set many extra information in the footer, I think we can provide a general interface to set and get this info, which means that we do not need to provide 'writtenBy' and 'setVersion' interface. Because following this pattern, the interfaces will become more and more.
In my opinion, we can only provide one interface setExtraInfo/getExtraInfo and it accepts/returns a map.
Moreover, this extraInfo is optional, which means you do not need to set it in all the tes tcases, you just need to focus your test case to avoid too many changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for all the extra info, create map, it didnt get much of it, currently, it is map, and this suits for adding extra meta, and about changing test case, since those are the api level changes, we need to change those test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please modify the comment for this field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified
@@ -371,8 +381,14 @@ public CarbonWriter build() throws IOException, InvalidLoadOptionException { | |||
"Writer type is not set, use withCsvInput() or withAvroInput() or withJsonInput() " | |||
+ "API based on input"); | |||
} | |||
if (this.writtenByApp == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a check for empty string too. No point in setting empty value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
b4a15d6
to
aae31eb
Compare
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1108/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/908/ |
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9174/ |
786fcfe
to
e4f17d0
Compare
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/946/ |
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9211/ |
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1154/ |
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1162/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/957/ |
Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9221/ |
d905fd8
to
6821413
Compare
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9227/ |
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1170/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/962/ |
Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9230/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/969/ |
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1179/ |
/** | ||
* carbon version detail to be written in carbondata footer for better maintanability | ||
*/ | ||
public static final String CARBON_VERSION_FOOTER_INFO = "version"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to give a more specific name instead of "version". How about "carbon_writter_version"
And in the comment, you can say it is CarbonData software version used when writing the carbon files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, even this suits better
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1002/ |
Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9268/ |
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1215/ |
handled |
LGTM |
Changes Proposed in this PR: Add more info to carbon file footer, like written_by (which will be spark application_name) in case of insert into and load command. To read this info one can use CLI. For SDK this API will be exposed to write this info in footer and one API will exposed to read this info from SDK. footer will have information about in which version of carbon the file is written, which will be helpful for getting details, for comaptibility etc. This closes #2829
Changes Proposed in this PR:
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
[] Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
- How it is tested? Please attach test report.
- Is it a performance related change? Please attach the performance test report.
- Any additional information to help reviewers in testing this change.
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.