Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CARBONDATA-2996] CarbonSchemaReader support read schema from folder path #2804

Closed

Conversation

xubo245
Copy link
Contributor

@xubo245 xubo245 commented Oct 9, 2018

[CARBONDATA-2996] CarbonSchemaReader support read schema from folder path
1.Deprecated readSchemaInIndexFile and readSchemaInDataFile, unify them to readSchema
2.Deprecated readSchemaInSchemaFile
3.readSchema support read schema from folder path,carbonindex file, and carbondata file. and user can decide whether check all files schema

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/746/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9012/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/944/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/757/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/758/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/956/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9024/

@xubo245
Copy link
Contributor Author

xubo245 commented Oct 10, 2018

@KanakaKumar @kunal642 @jackylk Please review it.

if (carbonFiles == null || carbonFiles.length < 1) {
throw new RuntimeException("Carbon data file not exists.");
}
dataFilePath = carbonFiles[0].getAbsolutePath();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking only one data file (first file) ?

What if this folder has multiple files with different schema. what if user wanted schema info from other file also?

Supporting schema read from folder is not required as this is exposed for user and he has the list of files.
a) to read one file, user passes single file for this API. -- already supported
b) to read multiple files, user can list files and pass all the files he want schema and call our API in a list -- already supported.

Just reading first file from folder doesn't make sense. This PR is not required as existing API already support all user scenarios.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, take the only one data file.
It's more convenient for user give a path to read schema。and maybe the folder has sub-folder,use need list iteratively。There are some customer has this problem。
We can judge the different files schema if it's necessary。SDK can throw exception if multiple files has different schema。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case you can implement,

String getFirstCarbonFile(path, ExtenstionType)

and pass it to existing method. ReadSchemaFromFile() must only read it. It should not do any extra work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I add ReadSchemaFromFirstDataFile and ReadSchemaFromFirstIndexFile

@ajantha-bhat
Copy link
Member

@xubo245 :
Just reading first file from folder doesn't make sense. This PR is not required as existing API already support all user scenarios.
please check my comment for more details.

@xubo245
Copy link
Contributor Author

xubo245 commented Oct 26, 2018

@ajantha-bhat There are already some user has this problem。 Between different services,they only give the path to other, the user need list the index/data file, even though they need list sub-folder iteratively to find the carbon index/data file, which is not convenient for user。 We can make it become public function for all user。

@xubo245
Copy link
Contributor Author

xubo245 commented Oct 30, 2018

@ajantha-bhat @KanakaKumar Please review again.

@ajantha-bhat
Copy link
Member

@xubo245 : In that case you can implement,

String getFirstCarbonFile(path, ExtenstionType)

and pass it to existing method. ReadSchemaFromFile() must only read it. It should not do any extra work.

FileUtils.deleteDirectory(new File(path));

Field[] fields = new Field[11];
fields[0] = new Field("stringField", DataTypes.STRING);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write you can move it in the setup() step

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

assert (strings[1].equalsIgnoreCase("shortField"));
assert (strings[2].equalsIgnoreCase("intField"));
assert (strings[3].equalsIgnoreCase("longField"));
assert (strings[4].equalsIgnoreCase("doubleField"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can move it to a method and use for both the test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

@xubo245 xubo245 force-pushed the CARBONDATA-2996_SchemaSupportFolder branch from a1f9629 to 5046e76 Compare October 30, 2018 07:37
@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1148/

@CarbonDataQA
Copy link

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1359/

@CarbonDataQA
Copy link

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9413/

@CarbonDataQA
Copy link

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1365/

@xubo245
Copy link
Contributor Author

xubo245 commented Oct 30, 2018

retest this please

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1158/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9420/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1374/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1161/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9425/

@xubo245
Copy link
Contributor Author

xubo245 commented Oct 31, 2018

@ajantha-bhat CI pass, please check again.

* @param path carbondata file path
* @return first carbondata file name
*/
public static String getFirstCarbonDataFile(String path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already suggested to keep getFirstCarbonFile(path, extension) -- this only will give data or index file based on the extension.

no need to have duplicate code for both index and data file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, misunderstand , sorry。
Updated

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9448/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1187/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1242/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9507/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1458/

@xubo245
Copy link
Contributor Author

xubo245 commented Nov 4, 2018

@KanakaKumar @kunal642 @ajantha-bhat CI pass, Please check.

* @return schema
* @throws IOException
*/
public static Schema readSchema(String path, boolean checkFilesSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkFilesSchema should be validateSchema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

.asOriginOrder();

assertEquals(schema.getFieldsLength(), 12);
checkSchema(schema);
} catch (Throwable e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done,added Assert.fail();

* @return schema
* @throws IOException
*/
public static Schema readSchema(String path, boolean checkFilesSchema) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readSchema(String path, boolean checkFilesSchema)
-- Is this schema validation method is required ? If no use case we can skip this.. during query execution anyways schema is validated.

Copy link
Contributor Author

@xubo245 xubo245 Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when user only want to check schema and no need to query data, they can use readSchema.

}
});
if (carbonFiles == null || carbonFiles.length < 1) {
throw new RuntimeException("Carbon file not exists.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why RunTimeException, IO related failures should throw IOException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

@xubo245 xubo245 force-pushed the CARBONDATA-2996_SchemaSupportFolder branch 2 times, most recently from c61c6a6 to a27280e Compare November 6, 2018 02:30
…path

1.Deprecated readSchemaInIndexFile and readSchemaInDataFile, unify them to readSchema in SDK
2.delete readSchemaInIndexFile and readSchemaInDataFile, unify them to readSchema in CSDK
3.readSchema support read schema from folder path,carbonindex file, and carbondata file. and user can decide whether check all files schema
@xubo245 xubo245 force-pushed the CARBONDATA-2996_SchemaSupportFolder branch from a27280e to e853036 Compare November 6, 2018 02:37
@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1297/

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1509/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9558/

@xubo245
Copy link
Contributor Author

xubo245 commented Nov 6, 2018

@KanakaKumar @kunal642 CI pass, please check it.

schema = readSchemaFromIndexFile(carbonIndexFiles[0].getAbsolutePath());
for (int i = 1; i < carbonIndexFiles.length; i++) {
Schema schema2 = readSchemaFromIndexFile(carbonIndexFiles[i].getAbsolutePath());
if (schema != schema2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use equals .. schema.equals(schema2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

if (indexFilePath != null) {
return readSchemaFromIndexFile(indexFilePath);
} else {
String dataFilePath = getCarbonFile(path, CARBON_DATA_EXT)[0].getAbsolutePath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per getCarbonFile(...) implementation, if there is no INDEX file found, it throws exception. So, there is no need of this else case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, removed else

}
return carbonFiles;
}
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can stick to one contract from the method. Either return the list or throw exception. Generally listing APIs should not return null, if this case is not expected, we can throw exception to avoid null checks in the callers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, throw exception

* @return
* @throws IOException
*/
public static String getVersionDetails(String dataFilePath) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This complete method is displayed as removed and added again. Is it possible to avoid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't

}
} else {
String indexFilePath = getCarbonFile(path, INDEX_FILE_EXT)[0].getAbsolutePath();
if (indexFilePath != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this null check is not required. Is there any chance the absolute path can be null ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@CarbonDataQA
Copy link

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1531/

@CarbonDataQA
Copy link

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9578/

@CarbonDataQA
Copy link

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1320/

@xubo245
Copy link
Contributor Author

xubo245 commented Nov 6, 2018

@KanakaKumar CI pass, please check it.

@KanakaKumar
Copy link
Contributor

LGTM

1 similar comment
@jackylk
Copy link
Contributor

jackylk commented Nov 7, 2018

LGTM

@asfgit asfgit closed this in 6093a32 Nov 7, 2018
asfgit pushed a commit that referenced this pull request Nov 21, 2018
…path

1.Deprecated readSchemaInIndexFile and readSchemaInDataFile, unify them to readSchema in SDK
2.delete readSchemaInIndexFile and readSchemaInDataFile, unify them to readSchema in CSDK
3.readSchema support read schema from folder path,carbonindex file, and carbondata file. and user can decide whether check all files schema

This closes #2804
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants