
[CARBONDATA-3025] add more metadata in carbon file footer #2829

Closed
wants to merge 3 commits

Conversation

@akashrn5 (Contributor) commented Oct 17, 2018

Changes Proposed in this PR:

  1. Add more info to the carbon file footer, such as written_by (which will be the Spark application name) in the case of insert into and load commands. This info can be read using the CLI.
  2. For the SDK, an API will be exposed to write this info into the footer and another API will be exposed to read it (a usage sketch follows this list).
  3. The footer will also record the CarbonData version in which the file was written, which is helpful for diagnostics, compatibility checks, etc.
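
A hedged usage sketch of the SDK path from points 1 and 2, assembled from the builder calls shown in the diffs further down; the output path, schema JSON, and application name are placeholders, not values from this PR:

```java
import org.apache.carbondata.sdk.file.CarbonWriter;
import org.apache.carbondata.sdk.file.Schema;

public class WrittenByExample {
  public static void main(String[] args) throws Exception {
    // Placeholder output path and schema; replace with real values.
    String path = "/tmp/carbon_written_by_demo";
    String schemaJson = "[{\"name\":\"string\"},{\"age\":\"int\"}]";

    // writtenBy() records the writing application in the file footer, so the
    // CarbonData CLI (or the SDK read API) can later report who wrote the file.
    CarbonWriter writer = CarbonWriter.builder()
        .outputPath(path)
        .withCsvInput(Schema.parseJson(schemaJson))
        .writtenBy("MySparkApplication")
        .build();

    writer.write(new String[]{"robot", "1"});
    writer.close();
  }
}
```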

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

  • [] Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/839/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1037/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9105/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/846/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9111/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1043/

@@ -206,6 +206,8 @@ struct FileFooter3{
4: optional list<BlockletInfo3> blocklet_info_list3; // Information about blocklets of all columns in this file for V3 format
5: optional dictionary.ColumnDictionaryChunk dictionary; // Blocklet local dictionary
6: optional bool is_sort; // True if the data is sorted in this file, it is used for compaction to decide whether to use merge sort or not
7: optional string written_by; // written by is used to write who wrote the file, it can be LOAD, or SDK etc
Contributor

please add a map<string, string> property, and put these two new fields into this property map. We can extend any field by using this property map instead of adding more fields

Contributor Author

added a map

@@ -56,7 +56,7 @@ class TestCarbonFileInputFormatWithExternalCarbonTable extends QueryTest with Be
val builder = CarbonWriter.builder()
val writer =
builder.outputPath(writerPath + "/Fact/Part0/Segment_null")
.withCsvInput(Schema.parseJson(schema)).build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
Contributor

Suggested change
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("TestCarbonFileInputFormatWithExternalCarbonTable").build()

Contributor Author

added the class name for writtenBy

@@ -139,13 +139,13 @@ class TestNonTransactionalCarbonTable extends QueryTest with BeforeAndAfterAll {
.sortBy(sortColumns.toArray)
.uniqueIdentifier(
System.currentTimeMillis).withBlockSize(2).withLoadOptions(options)
.withCsvInput(Schema.parseJson(schema)).build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
Contributor

Suggested change
.withCsvInput(Schema.parseJson(schema)).writtenBy("SDK").build()
.withCsvInput(Schema.parseJson(schema)).writtenBy("TestNonTransactionalCarbonTable").build()

Contributor Author

changed

@@ -984,7 +984,7 @@ class SparkCarbonDataSourceTest extends FunSuite with BeforeAndAfterAll {
val writer =
builder.outputPath(path)
.uniqueIdentifier(System.nanoTime()).withBlockSize(2)
.withCsvInput(new Schema(structType)).build()
.withCsvInput(new Schema(structType)).writtenBy("DataSource").build()
@jackylk (Contributor) commented Oct 18, 2018

use the class name instead

Contributor Author

ok

@@ -460,4 +464,20 @@ public String getColumnCompressor() {
public void setColumnCompressor(String columnCompressor) {
this.columnCompressor = columnCompressor;
}

public String getAppName() {
Contributor

It is better to use the same name: writtenBy

Contributor Author

handled as a carbon property

return version;
}

public void setVersion(String version) {
Contributor

Is this the carbon jar version or the application version? The application version should be given in writtenBy.

Contributor Author

it is the carbon version

@@ -76,6 +80,17 @@ public static Schema readSchemaInDataFile(String dataFilePath) throws IOExceptio
return new Schema(columnSchemaList);
}

public static String getVersionDetails(String dataFilePath) throws IOException {
Contributor

Why add this method in SchemaReader? This information is from the footer, right?

Contributor Author

Yes, this info is from the footer. In the SDK, the API we expose is the schema reader, so I thought I could expose one more API there to get the version details. Will this be OK, or should it go in a new class for footer info? Please suggest.
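
A hedged sketch of calling the read-side API discussed here, using the getVersionDetails signature from the diff above. The class name SchemaReader and its package are taken from this review thread and are placeholders, since where the method finally lives is exactly what is being discussed; the file path is also a placeholder:

```java
import java.io.IOException;

// Placeholder import: the package and final class hosting getVersionDetails()
// were still under discussion at this point in the review.
import org.apache.carbondata.sdk.file.SchemaReader;

public class ReadVersionSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder path to an existing .carbondata file.
    String dataFilePath = "/tmp/carbon_written_by_demo/part-0-0.carbondata";

    // Signature taken from the diff above: static String getVersionDetails(String).
    String version = SchemaReader.getVersionDetails(dataFilePath);
    System.out.println("File written by CarbonData version: " + version);
  }
}
```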

@akashrn5 force-pushed the integrate1 branch 3 times, most recently from 896f8ab to 821aa19 on October 18, 2018 09:52
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/858/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/860/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9124/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1058/

@@ -206,6 +206,7 @@ struct FileFooter3{
4: optional list<BlockletInfo3> blocklet_info_list3; // Information about blocklets of all columns in this file for V3 format
5: optional dictionary.ColumnDictionaryChunk dictionary; // Blocklet local dictionary
6: optional bool is_sort; // True if the data is sorted in this file, it is used for compaction to decide whether to use merge sort or not
7: optional map<string, string> extra_info; // written by is used to write who wrote the file, it can be Aplication name, or SDK etc and version in which this carbondata file is written etc
Contributor

Since this is optional and we will set a lot of extra information in the footer, I think we can provide one general interface to set and get this info, which means we do not need to provide separate 'writtenBy' and 'setVersion' interfaces; following that pattern, the interfaces will keep multiplying.

In my opinion, we can provide just one interface, setExtraInfo/getExtraInfo, which accepts/returns a map.
Moreover, this extraInfo is optional, which means you do not need to set it in all the test cases; you only need to cover it in your focused test case to avoid too many changes.
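
A minimal sketch of the single map-based accessor proposed here, using a hypothetical holder class in place of the real footer model (this illustrates the suggestion only, not the API the PR finally adopted):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder illustrating the proposal: one generic extra-info map
// instead of dedicated writtenBy/version accessors.
public class FooterExtraInfoSketch {
  private Map<String, String> extraInfo = new HashMap<>();

  public void setExtraInfo(Map<String, String> extraInfo) {
    this.extraInfo = new HashMap<>(extraInfo);
  }

  public Map<String, String> getExtraInfo() {
    return Collections.unmodifiableMap(extraInfo);
  }

  public static void main(String[] args) {
    FooterExtraInfoSketch footer = new FooterExtraInfoSketch();
    Map<String, String> info = new HashMap<>();
    info.put("written_by", "MySparkApplication"); // assumed key names
    info.put("version", "1.5.1");
    footer.setExtraInfo(info);
    System.out.println(footer.getExtraInfo());
  }
}
```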

Contributor Author

A map is created for all the extra info; I did not quite get the point, since currently it is already a map, and that suits adding extra metadata. Regarding the test cases: since these are API-level changes, we need to change those test cases.

Contributor

Please modify the comment for this field

Contributor Author

modified

@@ -371,8 +381,14 @@ public CarbonWriter build() throws IOException, InvalidLoadOptionException {
"Writer type is not set, use withCsvInput() or withAvroInput() or withJsonInput() "
+ "API based on input");
}
if (this.writtenByApp == null) {
Contributor

Add a check for an empty string too. There is no point in setting an empty value.
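
A hedged sketch of the null-or-empty guard being requested, reusing the writtenByApp field name from the diff above; the exception type and message are assumptions for illustration:

```java
// Illustrative stand-in for the validation inside CarbonWriterBuilder.build().
public final class WrittenByGuardSketch {
  static void validateWrittenBy(String writtenByApp) {
    // Reject both null and blank values: there is no point in writing an
    // empty written_by entry into the footer.
    if (writtenByApp == null || writtenByApp.trim().isEmpty()) {
      throw new RuntimeException(
          "writtenBy must be set to a non-empty application name before calling build()");
    }
  }

  public static void main(String[] args) {
    validateWrittenBy("MySparkApplication"); // passes
    validateWrittenBy("   ");                // throws: blank value is rejected
  }
}
```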

Contributor Author

added

@akashrn5 force-pushed the integrate1 branch 2 times, most recently from b4a15d6 to aae31eb on October 22, 2018 08:45
@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1108/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/908/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9174/

@akashrn5 force-pushed the integrate1 branch 2 times, most recently from 786fcfe to e4f17d0 on October 23, 2018 04:58
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/946/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9211/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1154/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1162/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/957/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9221/

@akashrn5 force-pushed the integrate1 branch 2 times, most recently from d905fd8 to 6821413 on October 23, 2018 10:26
@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9227/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1170/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/962/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9230/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/969/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1179/

/**
* carbon version detail to be written in carbondata footer for better maintanability
*/
public static final String CARBON_VERSION_FOOTER_INFO = "version";
Contributor

I think it is better to give a more specific name instead of "version". How about "carbon_writter_version"?
And in the comment, you can say it is the CarbonData software version used when writing the carbon files.

Contributor Author

yes, that name suits better

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1002/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9268/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1215/

@akashrn5 (Contributor Author)

@akashrn5 Instead of changing many classes to pass writtenBy and appName, can't we set them in CarbonProperties and, in the writer step, get them from there and write to thrift?

handled
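
A hedged sketch of the CarbonProperties route mentioned above: set the values once, then read them back at write time and place them in the footer's extra-info map. The property key names here are assumptions for illustration; addProperty/getProperty are the usual CarbonProperties accessors:

```java
import org.apache.carbondata.core.util.CarbonProperties;

public class FooterPropertySketch {
  // Assumed key names for illustration only; the real constants are defined
  // in this PR's changes, not here.
  private static final String WRITTEN_BY_KEY = "carbon.writtenby.app.name";
  private static final String VERSION_KEY = "carbon.version.footer.info";

  public static void main(String[] args) {
    // Load/insert path: stash the values once instead of threading them
    // through many intermediate classes.
    CarbonProperties.getInstance().addProperty(WRITTEN_BY_KEY, "MySparkApplication");
    CarbonProperties.getInstance().addProperty(VERSION_KEY, "1.5.1");

    // Writer step: read them back and write them into the thrift footer.
    String writtenBy = CarbonProperties.getInstance().getProperty(WRITTEN_BY_KEY);
    String version = CarbonProperties.getInstance().getProperty(VERSION_KEY);
    System.out.println("written_by=" + writtenBy + ", version=" + version);
  }
}
```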

@kunal642 (Contributor)

LGTM

@asfgit closed this in 9578786 on Oct 25, 2018
asfgit pushed a commit that referenced this pull request Nov 21, 2018
Changes Proposed in this PR:
Add more info to the carbon file footer, such as written_by (which will be the Spark application name)
in the case of insert into and load commands. This info can be read using the CLI.
For the SDK, an API will be exposed to write this info into the footer and another API will be exposed to read it.
The footer will also record the CarbonData version in which the file was written,
which is helpful for diagnostics and compatibility checks.

This closes #2829