[CARBONDATA-2659] Support partition table by DataFrame API #2415

jackylk · 2018-06-26T12:34:12Z

Currently, partition tables are supported only through SQL; they should also be supported by the Spark DataFrame API.
This PR adds an option to specify the partition columns when writing a DataFrame to a carbon table.
For example:

    df.write
      .format("carbondata")
      .option("tableName", "carbon_df_table")
      .option("partitionColumns", "c1, c2")  // a list of column names
      .mode(SaveMode.Overwrite)
      .save()
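
As a rough illustration of how a comma-separated option value such as "c1, c2" could be turned into a list of column names, here is a minimal sketch. The object name `PartitionOptionSketch` and the trimming rules are assumptions made for illustration; the actual `CarbonOption` parsing code is not shown in this thread. The `Option` return type mirrors the `options.partitionColumns.get` usage visible in the reviewed diff.

```scala
// Hypothetical helper, not the real CarbonOption implementation.
object PartitionOptionSketch {
  // Returns None when the option is absent; otherwise the trimmed,
  // non-empty column names in the order they were written.
  def partitionColumns(options: Map[String, String]): Option[Seq[String]] =
    options.get("partitionColumns").map { raw =>
      raw.split(",").map(_.trim).filter(_.nonEmpty).toSeq
    }
}
```

With this sketch, `Map("partitionColumns" -> "c1, c2")` yields the two names `c1` and `c2`, and a map without the key yields `None`.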

Any interfaces changed?
Added an option for DataFrame.write
Any backward compatibility impacted?
No
Document update required?
Testing done
Added one test case
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
NA

xuchuanyin · 2018-06-26T13:52:48Z

integration/spark-common/src/main/scala/org/apache/carbondata/spark/CarbonOption.scala

    options.getOrElse("partitionClass",
      "org.apache.carbondata.processing.partition.impl.SampleDataPartitionerImpl")
  }

-  def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
+  lazy val tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean


I remember that the 'tempCSV' option has been deprecated.

xuchuanyin · 2018-06-26T13:55:02Z

integration/spark2/src/main/scala/org/apache/spark/sql/CarbonDataFrameWriter.scala

+      options.partitionColumns.get.map { column =>
+        val c = schema.fields.find(_.name.equalsIgnoreCase(column))
+        if (c.isEmpty) {
+          throw new MalformedCarbonCommandException(s"invalid partition column: $column")


Is validation for duplicated column names missing?
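
One way the duplicate check could sit alongside the existing "invalid partition column" check is sketched below. It is a standalone illustration, not the code in this PR: `PartitionColumnCheck`, `validate`, and the error wording for duplicates are hypothetical, and `fieldNames` stands in for `schema.fields.map(_.name)`. Names are compared case-insensitively, matching the `equalsIgnoreCase` lookup in the diff.

```scala
import java.util.Locale

// Hypothetical standalone check; not the code merged in this PR.
object PartitionColumnCheck {
  // Returns the first problem found as a message, or None if the columns are valid.
  def validate(fieldNames: Seq[String], partitionColumns: Seq[String]): Option[String] = {
    // Normalize case so that "c1" and "C1" count as the same column.
    val normalized = partitionColumns.map(_.toLowerCase(Locale.ROOT))
    // Removing one occurrence of each distinct name leaves only the repeats.
    val duplicates = normalized.diff(normalized.distinct).distinct
    if (duplicates.nonEmpty) {
      val names = duplicates.mkString(", ")
      Some(s"duplicate partition column: $names")
    } else {
      partitionColumns
        .find(column => !fieldNames.exists(_.equalsIgnoreCase(column)))
        .map(column => s"invalid partition column: $column")
    }
  }
}
```

In the writer itself, a `Some(message)` result would presumably be turned into a `MalformedCarbonCommandException`, as the existing check already does for unknown columns.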

CarbonDataQA · 2018-06-26T14:15:30Z

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5387/

CarbonDataQA · 2018-06-26T18:12:08Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6561/

CarbonDataQA · 2018-06-26T21:13:22Z

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5395/

CarbonDataQA · 2018-06-26T22:04:09Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6568/

ravipesala · 2018-06-27T02:25:57Z

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5468/

ravipesala · 2018-06-27T06:01:27Z

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5471/

ravipesala · 2018-06-27T11:04:09Z

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5475/

CarbonDataQA · 2018-07-13T17:34:26Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7105/

CarbonDataQA · 2018-07-13T20:21:12Z

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5881/

ravipesala · 2018-07-17T06:48:47Z

integration/spark2/src/main/scala/org/apache/spark/sql/CarbonDataFrameWriter.scala

+    }
+
+    val schemaWithoutPartition = if (options.partitionColumns.isDefined) {
+      val fields = schema.filterNot(field => options.partitionColumns.get.contains(field.name))


Better to use exists with equalsIgnoreCase inside filterNot instead of contains, so that the comparison is case-insensitive.
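
The suggestion can be sketched as follows. This is a simplified illustration rather than the merged code: `SchemaFilterSketch` and `Field` are hypothetical stand-ins (the real code works on Spark's `StructField`), and only the field name is modeled.

```scala
// Hypothetical stand-in types; the real code operates on Spark's StructType.
object SchemaFilterSketch {
  final case class Field(name: String)

  // Drops every field whose name matches a partition column, ignoring case.
  def withoutPartitionColumns(schema: Seq[Field], partitionColumns: Seq[String]): Seq[Field] =
    schema.filterNot(field => partitionColumns.exists(_.equalsIgnoreCase(field.name)))
}
```

With `contains`, a schema field named `C1` would not match a partition column written as `c1` and would stay in the non-partition schema; the `exists` form removes it.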

ravipesala · 2018-07-17T06:49:53Z

@jackylk Please rebase it

CarbonDataQA · 2018-07-18T02:41:57Z

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7268/

ravipesala · 2018-07-18T02:47:50Z

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5896/

CarbonDataQA · 2018-07-18T04:40:20Z

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6037/

ravipesala · 2018-07-18T13:32:39Z

retest this please

CarbonDataQA · 2018-07-18T17:34:55Z

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7291/

CarbonDataQA · 2018-07-18T19:22:52Z

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6058/

ravipesala · 2018-08-07T12:45:32Z

retest this please

ravipesala · 2018-08-07T13:52:26Z

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6200/

CarbonDataQA · 2018-08-07T15:43:49Z

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7826/

CarbonDataQA · 2018-08-07T17:00:29Z

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6551/

ravipesala · 2018-08-08T06:33:24Z

LGTM

Currently, partition tables are supported only through SQL; they should also be supported by the Spark DataFrame API. This PR adds an option to specify the partition columns when writing a DataFrame to a carbon table. This closes #2415

xuchuanyin reviewed Jun 26, 2018

ravipesala reviewed Jul 17, 2018

add test

4a26a2c

fix fix comment

jackylk force-pushed the dataframe-partition branch from ca5feb2 to 4a26a2c Compare July 18, 2018 01:12

fix comment

a6c1129

asfgit closed this in 8f7b594 Aug 8, 2018
