[CARBONDATA-2996] CarbonSchemaReader support read schema from folder path #2804

xubo245 · 2018-10-09T07:29:19Z

[CARBONDATA-2996] CarbonSchemaReader support read schema from folder path
1. Deprecate readSchemaInIndexFile and readSchemaInDataFile and unify them into readSchema.
2. Deprecate readSchemaInSchemaFile.
3. readSchema supports reading the schema from a folder path, a carbonindex file, or a carbondata file, and the user can decide whether to check the schema of all files.

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

Any interfaces changed?
No
Any backward compatibility impacted?
No
Document update required?
No
Testing done
Added test cases.
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
https://issues.apache.org/jira/browse/CARBONDATA-2951

CarbonDataQA · 2018-10-09T07:39:33Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/746/

CarbonDataQA · 2018-10-09T08:44:38Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9012/

CarbonDataQA · 2018-10-09T08:46:59Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/944/

CarbonDataQA · 2018-10-10T02:05:02Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/757/

CarbonDataQA · 2018-10-10T02:14:55Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/758/

CarbonDataQA · 2018-10-10T03:23:20Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/956/

CarbonDataQA · 2018-10-10T03:34:00Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9024/

xubo245 · 2018-10-10T03:54:53Z

@KanakaKumar @kunal642 @jackylk Please review it.

ajantha-bhat · 2018-10-26T06:05:38Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+      if (carbonFiles == null || carbonFiles.length < 1) {
+        throw new RuntimeException("Carbon data file not exists.");
+      }
+      dataFilePath = carbonFiles[0].getAbsolutePath();


Is only one data file (the first file) taken?

What if this folder has multiple files with different schemas? What if the user wants schema info from another file as well?

Supporting schema reads from a folder is not required: this API is exposed to the user, who already has the list of files.
a) To read one file, the user passes a single file to this API. -- already supported
b) To read multiple files, the user can list the files, pick all the ones whose schema is wanted, and call our API for each file in the list. -- already supported

Just reading the first file from a folder doesn't make sense. This PR is not required, as the existing API already supports all user scenarios.

Yes, it takes only one data file.
It's more convenient for the user to give a path to read the schema from, and the folder may have sub-folders that the user would need to list recursively. Some customers have this problem.
We can compare the schemas of the different files if necessary. The SDK can throw an exception if multiple files have different schemas.

In that case you can implement,

String getFirstCarbonFile(path, ExtenstionType)

and pass its result to the existing method. ReadSchemaFromFile() should only read the file; it should not do any extra work.

OK, I added ReadSchemaFromFirstDataFile and ReadSchemaFromFirstIndexFile.

ajantha-bhat · 2018-10-26T06:06:15Z

@xubo245:
Just reading the first file from a folder doesn't make sense. This PR is not required, as the existing API already supports all user scenarios.
Please check my comment for more details.

xubo245 · 2018-10-26T12:27:38Z

@ajantha-bhat Some users already have this problem. When one service hands data to another, it passes only the path, so the user has to list the index/data files, and may even have to walk sub-folders recursively to find the carbon index/data files, which is inconvenient. We can make this a public function for all users.
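For illustration, the recursive listing that a caller currently has to write when only a folder path is available might look like the sketch below. It is not code from this PR: the class name `CarbonFileFinder`, the method `findFiles`, and the extension strings used in the usage note are assumptions made for the example, and only standard `java.nio.file` calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical caller-side helper; not part of the CarbonData SDK. */
final class CarbonFileFinder {
  private CarbonFileFinder() {}

  /**
   * Recursively collects the regular files under {@code root} whose names end
   * with {@code extension}, sorted by path so the result is deterministic.
   */
  static List<Path> findFiles(Path root, String extension) throws IOException {
    // Files.walk visits sub-folders too; the stream must be closed after use.
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(extension))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

A caller would then pass, for example, `findFiles(root, ".carbonindex").get(0)` to an existing single-file schema reader, after checking that the list is not empty.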

xubo245 · 2018-10-30T03:43:29Z

@ajantha-bhat @KanakaKumar Please review again.

ajantha-bhat · 2018-10-30T04:35:45Z

@xubo245 : In that case you can implement,

String getFirstCarbonFile(path, ExtenstionType)

and pass its result to the existing method. ReadSchemaFromFile() should only read the file; it should not do any extra work.

ajantha-bhat · 2018-10-30T04:36:42Z

store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java

+      FileUtils.deleteDirectory(new File(path));
+
+      Field[] fields = new Field[11];
+      fields[0] = new Field("stringField", DataTypes.STRING);


The write step can be moved into setup().

ajantha-bhat · 2018-10-30T04:37:10Z

store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java

+      assert (strings[1].equalsIgnoreCase("shortField"));
+      assert (strings[2].equalsIgnoreCase("intField"));
+      assert (strings[3].equalsIgnoreCase("longField"));
+      assert (strings[4].equalsIgnoreCase("doubleField"));


These assertions can be moved into a method and used by both test cases.

CarbonDataQA · 2018-10-30T08:43:38Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1148/

CarbonDataQA · 2018-10-30T09:29:48Z

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1359/

CarbonDataQA · 2018-10-30T10:31:13Z

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9413/

CarbonDataQA · 2018-10-30T11:25:57Z

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1365/

xubo245 · 2018-10-30T11:52:22Z

retest this please

CarbonDataQA · 2018-10-30T12:25:17Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1158/

CarbonDataQA · 2018-10-30T12:35:58Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9420/

CarbonDataQA · 2018-10-30T13:59:53Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1374/

CarbonDataQA · 2018-10-30T14:02:28Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1161/

CarbonDataQA · 2018-10-30T14:31:58Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9425/

xubo245 · 2018-10-31T01:53:31Z

@ajantha-bhat CI passed, please check again.

ajantha-bhat · 2018-10-31T07:16:17Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+   * @param path carbondata file path
+   * @return first carbondata file name
+   */
+  public static String getFirstCarbonDataFile(String path) {


I have already suggested keeping getFirstCarbonFile(path, extension): it returns only the data file or the index file, depending on the extension.

There is no need for duplicate code for index and data files.

OK, I misunderstood, sorry.
Updated.

CarbonDataQA · 2018-10-31T11:06:54Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9448/

CarbonDataQA · 2018-10-31T11:28:48Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1187/

CarbonDataQA · 2018-11-02T09:54:27Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1242/

CarbonDataQA · 2018-11-02T11:10:56Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9507/

CarbonDataQA · 2018-11-02T11:21:39Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1458/

xubo245 · 2018-11-04T10:57:37Z

@KanakaKumar @kunal642 @ajantha-bhat CI passed, please check.

KanakaKumar · 2018-11-05T15:18:43Z

docs/sdk-guide.md

+   * @return schema
+   * @throws IOException
+   */
+  public static Schema readSchema(String path, boolean checkFilesSchema);


checkFilesSchema should be validateSchema

KanakaKumar · 2018-11-05T15:47:11Z

store/sdk/src/test/java/org/apache/carbondata/sdk/file/CarbonSchemaReaderTest.java

          .asOriginOrder();

      assertEquals(schema.getFieldsLength(), 12);
      checkSchema(schema);
+    } catch (Throwable e) {
+      e.printStackTrace();


The test should fail here.

OK, done. Added Assert.fail();
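The point of this exchange can be shown with a small plain-Java sketch. It avoids a JUnit dependency: the explicit `AssertionError` stands in for what JUnit's `Assert.fail()` throws, and every name in it (`FailOnExceptionSketch`, `readSchemaThatFails`, and so on) is hypothetical rather than taken from `CarbonSchemaReaderTest`.

```java
/** Hypothetical illustration of why a catch block in a test must fail the test. */
final class FailOnExceptionSketch {
  private FailOnExceptionSketch() {}

  /** Stand-in for the code under test; it always throws. */
  static void readSchemaThatFails() {
    throw new IllegalStateException("simulated read failure");
  }

  /** Anti-pattern: the exception is swallowed, so the "test" still reports success. */
  static boolean swallowingTestPasses() {
    try {
      readSchemaThatFails();
    } catch (Throwable e) {
      // The reviewed code only called e.printStackTrace() here; nothing is rethrown.
    }
    return true; // Reached even though the code under test failed.
  }

  /** Fix: turn the unexpected exception into an assertion failure. */
  static boolean failingTestPasses() {
    try {
      readSchemaThatFails();
    } catch (Throwable e) {
      // Equivalent in effect to calling Assert.fail(...) in a JUnit test.
      throw new AssertionError("unexpected exception: " + e.getMessage(), e);
    }
    return true;
  }
}
```

In a JUnit test, an alternative with the same effect is to drop the try/catch and declare the test method with `throws Exception`, so the framework reports the failure itself.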

KanakaKumar · 2018-11-05T15:51:29Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+   * @return schema
+   * @throws IOException
+   */
+  public static Schema readSchema(String path, boolean checkFilesSchema) throws IOException {


readSchema(String path, boolean checkFilesSchema)
-- Is this schema validation method required? If there is no use case, we can skip it; the schema is validated during query execution anyway.

When users only want to check the schema and do not need to query data, they can use readSchema.

KanakaKumar · 2018-11-05T15:53:01Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+            }
+          });
+      if (carbonFiles == null || carbonFiles.length < 1) {
+        throw new RuntimeException("Carbon file not exists.");


Why RuntimeException? I/O-related failures should throw IOException.

…path
1. Deprecate readSchemaInIndexFile and readSchemaInDataFile and unify them into readSchema in the SDK.
2. Delete readSchemaInIndexFile and readSchemaInDataFile and unify them into readSchema in the CSDK.
3. readSchema supports reading the schema from a folder path, a carbonindex file, or a carbondata file, and the user can decide whether to check the schema of all files.

CarbonDataQA · 2018-11-06T04:11:57Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1297/

CarbonDataQA · 2018-11-06T04:41:55Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1509/

CarbonDataQA · 2018-11-06T05:03:25Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9558/

xubo245 · 2018-11-06T06:47:09Z

@KanakaKumar @kunal642 CI passed, please check it.

KanakaKumar · 2018-11-06T09:48:54Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+        schema = readSchemaFromIndexFile(carbonIndexFiles[0].getAbsolutePath());
+        for (int i = 1; i < carbonIndexFiles.length; i++) {
+          Schema schema2 = readSchemaFromIndexFile(carbonIndexFiles[i].getAbsolutePath());
+          if (schema != schema2) {


Use equals: schema.equals(schema2). The != operator compares object references, not schema contents.
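A minimal sketch of the difference follows. `SimpleSchema` is a hypothetical stand-in, not the SDK's `Schema` class, and the sketch assumes the real class overrides `equals` in a comparable way; without such an override, `equals` would fall back to reference comparison as well.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical value class standing in for a schema; not the SDK's Schema. */
final class SimpleSchema {
  private final List<String> fieldNames;

  SimpleSchema(String... fieldNames) {
    // Copy the array so later changes by the caller cannot alter this schema.
    this.fieldNames = Arrays.asList(fieldNames.clone());
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof SimpleSchema)) {
      return false;
    }
    return fieldNames.equals(((SimpleSchema) other).fieldNames);
  }

  @Override
  public int hashCode() {
    return fieldNames.hashCode();
  }

  /** Returns the first schema if all schemas are equal by content; throws otherwise. */
  static SimpleSchema requireConsistent(List<SimpleSchema> schemas) {
    if (schemas.isEmpty()) {
      throw new IllegalArgumentException("no schemas to compare");
    }
    SimpleSchema first = schemas.get(0);
    for (int i = 1; i < schemas.size(); i++) {
      // equals compares contents; first != schemas.get(i) would be true for
      // any two separately constructed objects, even identical ones.
      if (!first.equals(schemas.get(i))) {
        throw new IllegalStateException("schema of file " + i + " differs from the first file");
      }
    }
    return first;
  }
}
```

Two schemas read from two different files are always distinct objects, so a `!=` check would report a mismatch for every folder containing more than one file.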

KanakaKumar · 2018-11-06T09:53:21Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+      if (indexFilePath != null) {
+        return readSchemaFromIndexFile(indexFilePath);
+      } else {
+        String dataFilePath = getCarbonFile(path, CARBON_DATA_EXT)[0].getAbsolutePath();


As per the getCarbonFile(...) implementation, an exception is thrown if no index file is found. So is this else case needed at all?

Yes, removed the else.

KanakaKumar · 2018-11-06T09:56:34Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+      }
+      return carbonFiles;
+    }
+    return null;


The method should stick to one contract: either return the list or throw an exception. Generally, listing APIs should not return null; if this case is not expected, we can throw an exception so that callers can avoid null checks.

OK, it throws an exception now.
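The contract discussed here, together with the earlier request to throw IOException for I/O failures and to share one helper between index and data files, could be sketched as follows. This is an assumption-laden illustration using `java.io.File`, not the PR's actual `getCarbonFile` implementation; `CarbonFileLister` and `listFiles` are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical listing helper; illustrates the contract only. */
final class CarbonFileLister {
  private CarbonFileLister() {}

  /**
   * Lists the files directly inside {@code folder} whose names end with
   * {@code extension}. Never returns null or an empty array: both cases are
   * reported as an IOException, so callers need no null or length checks.
   */
  static File[] listFiles(String folder, String extension) throws IOException {
    File dir = new File(folder);
    // File.listFiles returns null when the path is not a directory or an I/O error occurs.
    File[] files = dir.listFiles((parent, name) -> name.endsWith(extension));
    if (files == null) {
      throw new IOException("Cannot list folder: " + folder);
    }
    if (files.length == 0) {
      throw new IOException("No file with extension " + extension + " in folder: " + folder);
    }
    Arrays.sort(files); // Listing order is unspecified; sort to make "first file" deterministic.
    return files;
  }
}
```

Because the extension is a parameter, the same helper serves both index and data files, and a call such as `listFiles(path, extension)[0]` is safe without a preceding null check.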

KanakaKumar · 2018-11-06T09:57:42Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+   * @return
+   * @throws IOException
+   */
+  public static String getVersionDetails(String dataFilePath) throws IOException {


This whole method shows up in the diff as removed and added again. Can that be avoided?

KanakaKumar · 2018-11-06T10:00:07Z

store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonSchemaReader.java

+      }
+    } else {
+      String indexFilePath = getCarbonFile(path, INDEX_FILE_EXT)[0].getAbsolutePath();
+      if (indexFilePath != null) {


I think this null check is not required. Is there any chance the absolute path can be null?

CarbonDataQA · 2018-11-06T14:00:16Z

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1531/

CarbonDataQA · 2018-11-06T14:05:24Z

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9578/

CarbonDataQA · 2018-11-06T14:27:19Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1320/

xubo245 · 2018-11-06T14:57:24Z

@KanakaKumar CI passed, please check it.

KanakaKumar · 2018-11-06T15:07:09Z

LGTM

jackylk · 2018-11-07T01:54:08Z

LGTM

…path
1. Deprecate readSchemaInIndexFile and readSchemaInDataFile and unify them into readSchema in the SDK.
2. Delete readSchemaInIndexFile and readSchemaInDataFile and unify them into readSchema in the CSDK.
3. readSchema supports reading the schema from a folder path, a carbonindex file, or a carbondata file, and the user can decide whether to check the schema of all files.
This closes #2804

ajantha-bhat reviewed Oct 26, 2018

ajantha-bhat reviewed Oct 30, 2018

xubo245 force-pushed the CARBONDATA-2996_SchemaSupportFolder branch from a1f9629 to 5046e76 Compare October 30, 2018 07:37

ajantha-bhat reviewed Oct 31, 2018

KanakaKumar reviewed Nov 5, 2018

xubo245 force-pushed the CARBONDATA-2996_SchemaSupportFolder branch 2 times, most recently from c61c6a6 to a27280e Compare November 6, 2018 02:30

xubo245 force-pushed the CARBONDATA-2996_SchemaSupportFolder branch from a27280e to e853036 Compare November 6, 2018 02:37

KanakaKumar reviewed Nov 6, 2018

optimize

85996c6

asfgit closed this in 6093a32 Nov 7, 2018
