
[SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility #7231

Closed · wants to merge 25 commits

Commits
03c3bd9
Refactors Parquet read path to implement backwards-compatibility rules
liancheng Jul 5, 2015
0525346
Removes old Parquet record converters
liancheng Jul 5, 2015
a74fb2c
More comments
liancheng Jul 5, 2015
bcac49f
Removes the 16-byte restriction of decimals
liancheng Jul 5, 2015
6437d4b
Assembles requested schema from Parquet file schema
liancheng Jul 5, 2015
1781dff
Adds test case for SPARK-8811
liancheng Jul 5, 2015
7fb21f1
Reverts an unnecessary debugging change
liancheng Jul 5, 2015
38fe1e7
Adds explicit return type
liancheng Jul 6, 2015
802cbd7
Fixes bugs related to schema merging and empty requested columns
liancheng Jul 6, 2015
884d3e6
Fixes styling issue and reverts unnecessary changes
liancheng Jul 6, 2015
0cc1b37
Fixes MiMa checks
liancheng Jul 6, 2015
a099d3e
More comments
liancheng Jul 6, 2015
06cfe9d
Adds comments about TimestampType handling
liancheng Jul 6, 2015
13b9121
Adds ParquetAvroCompatibilitySuite
liancheng Jul 7, 2015
440f7b3
Adds generated files to .rat-excludes
liancheng Jul 7, 2015
1d390aa
Adds parquet-thrift compatibility test
liancheng Jul 7, 2015
f2208cd
Adds README.md for Thrift/Avro code generation
liancheng Jul 7, 2015
a8f13bb
Using Parquet writer API to do compatibility tests
liancheng Jul 7, 2015
3d7ab36
Fixes .rat-excludes
liancheng Jul 7, 2015
7946ee1
Fixes Scala styling issues
liancheng Jul 7, 2015
926af87
Simplifies Parquet compatibility test suites
liancheng Jul 8, 2015
598c3e8
Adds extra Maven repo for hadoop-lzo, which is a transitive dependenc…
liancheng Jul 8, 2015
b8c1295
Excludes the whole parquet package from MiMa
liancheng Jul 8, 2015
c6fbc06
Removes WIP file committed by mistake
liancheng Jul 8, 2015
360fe18
Adds ParquetHiveCompatibilitySuite
liancheng Jul 8, 2015
2 changes: 2 additions & 0 deletions .rat-excludes
Original file line number Diff line number Diff line change
@@ -91,3 +91,5 @@ help/*
html/*
INDEX
.lintr
gen-java.*
.*avpr
33 changes: 33 additions & 0 deletions pom.xml
@@ -161,6 +161,7 @@
<fasterxml.jackson.version>2.4.4</fasterxml.jackson.version>
<snappy.version>1.1.1.7</snappy.version>
<netlib.java.version>1.1.2</netlib.java.version>
<thrift.version>0.9.2</thrift.version>
<!-- For maven shade plugin (see SPARK-8819) -->
<create.dependency.reduced.pom>false</create.dependency.reduced.pom>

@@ -179,6 +180,8 @@
<hbase.deps.scope>compile</hbase.deps.scope>
<hive.deps.scope>compile</hive.deps.scope>
<parquet.deps.scope>compile</parquet.deps.scope>
<parquet.test.deps.scope>test</parquet.test.deps.scope>
<thrift.test.deps.scope>test</thrift.test.deps.scope>

<!--
Overridable test home. So that you can call individual pom files directly without
@@ -270,6 +273,18 @@
<enabled>false</enabled>
</snapshots>
</repository>
<!-- For transitive dependencies brought by parquet-thrift -->
<repository>
<id>twttr-repo</id>
<name>Twttr Repository</name>
<url>http://maven.twttr.com</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<!-- TODO: This can be deleted after Spark 1.4 is posted -->
<repository>
<id>spark-1.4-staging</id>
@@ -1101,6 +1116,24 @@
<version>${parquet.version}</version>
<scope>${parquet.deps.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<version>${parquet.version}</version>
<scope>${parquet.test.deps.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-thrift</artifactId>
<version>${parquet.version}</version>
<scope>${parquet.test.deps.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libthrift</artifactId>
<version>${thrift.version}</version>
<scope>${thrift.test.deps.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
17 changes: 2 additions & 15 deletions project/MimaExcludes.scala
@@ -60,21 +60,8 @@ object MimaExcludes {
"org.apache.spark.ml.regression.LeastSquaresCostFun.this"),
// SQL execution is considered private.
excludePackage("org.apache.spark.sql.execution"),
// NanoTime and CatalystTimestampConverter is only used inside catalyst,
// not needed anymore
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.timestamp.NanoTime"),
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.timestamp.NanoTime$"),
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.CatalystTimestampConverter"),
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.CatalystTimestampConverter$"),
// SPARK-6777 Implements backwards compatibility rules in CatalystSchemaConverter
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.ParquetTypeInfo"),
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.ParquetTypeInfo$")
// Parquet support is considered private.
excludePackage("org.apache.spark.sql.parquet")
) ++ Seq(
// SPARK-8479 Add numNonzeros and numActives to Matrix.
ProblemFilters.exclude[MissingMethodProblem](
Original file line number Diff line number Diff line change
@@ -17,11 +17,12 @@

package org.apache.spark.sql.types

import scala.util.Try
import scala.util.parsing.combinator.RegexParsers

import org.json4s._
import org.json4s.JsonAST.JValue
import org.json4s.JsonDSL._
import org.json4s._
import org.json4s.jackson.JsonMethods._

import org.apache.spark.annotation.DeveloperApi
@@ -82,6 +83,9 @@ abstract class DataType extends AbstractDataType {


object DataType {
private[sql] def fromString(raw: String): DataType = {
Try(DataType.fromJson(raw)).getOrElse(DataType.fromCaseClassString(raw))
}

def fromJson(json: String): DataType = parseDataType(parse(json))

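The new `DataType.fromString` helper tries the JSON parser first and, only if that throws, falls back to the legacy case-class string parser via `Try(...).getOrElse(...)`. A minimal self-contained sketch of this fallback pattern — the two parser stubs below are illustrative stand-ins, not Spark's real `fromJson`/`fromCaseClassString`:

```scala
import scala.util.Try

// Hypothetical stand-in for the JSON schema parser: accepts only
// JSON-looking input, throws otherwise.
def fromJson(raw: String): String =
  if (raw.startsWith("{")) "json:" + raw else sys.error("not JSON")

// Hypothetical stand-in for the legacy case-class string parser.
def fromCaseClassString(raw: String): String = "caseClass:" + raw

// The pattern from the patch: try the JSON parser first and fall back
// to the legacy parser if it throws.
def fromString(raw: String): String =
  Try(fromJson(raw)).getOrElse(fromCaseClassString(raw))

println(fromString("""{"type":"integer"}""")) // parsed by the JSON path
println(fromString("IntegerType"))            // legacy fallback path
```

The `Try`-based fallback keeps older schema strings (written by earlier Spark versions) readable without the caller having to know which format it holds.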
Original file line number Diff line number Diff line change
@@ -311,6 +311,11 @@ object StructType extends AbstractDataType {

private[sql] override def simpleString: String = "struct"

private[sql] def fromString(raw: String): StructType = DataType.fromString(raw) match {
case t: StructType => t
case _ => throw new RuntimeException(s"Failed parsing StructType: $raw")
}

def apply(fields: Seq[StructField]): StructType = StructType(fields.toArray)

def apply(fields: java.util.List[StructField]): StructType = {
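`StructType.fromString` narrows the result of the generic parser: any parse result other than a `StructType` is rejected with a `RuntimeException`. The cast-or-fail pattern can be sketched with hypothetical stand-in types (the `parseDataType` stub is illustrative, not Spark's parser):

```scala
sealed trait DataType
case class StructType(fields: Seq[String]) extends DataType
case object IntegerType extends DataType

// Illustrative stand-in for DataType.fromString.
def parseDataType(raw: String): DataType =
  if (raw.startsWith("struct")) StructType(Seq("a", "b")) else IntegerType

// The pattern from the patch: parse generically, then require that the
// result is actually a StructType.
def structFromString(raw: String): StructType = parseDataType(raw) match {
  case t: StructType => t
  case _ => throw new RuntimeException(s"Failed parsing StructType: $raw")
}
```

Matching on the subtype rather than using `asInstanceOf` turns a bad input into a descriptive error instead of a `ClassCastException` deep inside the read path.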
36 changes: 36 additions & 0 deletions sql/core/pom.xml
@@ -101,9 +101,45 @@
<version>9.3-1102-jdbc41</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-thrift</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libthrift</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<executions>
<execution>
<id>add-scala-test-sources</id>
<phase>generate-test-sources</phase>
<goals>
<goal>add-test-source</goal>
</goals>
<configuration>
<sources>
<source>src/test/scala</source>
<source>src/test/gen-java</source>
</sources>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>