This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Test against Spark 1.5.0-rc2 in Travis #76

Closed
JoshRosen wants to merge 14 commits into master from test-against-1.5.0

Conversation

JoshRosen
Contributor

This PR adds Spark 1.5.0-rc2 to our Travis build matrix. I removed the use of TestSQLContext since that class has been removed in Spark 1.5.
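
For context, here is a minimal sketch of the kind of replacement involved, assuming a test suite that previously imported the removed singleton (names here are illustrative, not the PR's exact test code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative replacement for the removed TestSQLContext: the suite creates
// and owns a local SQLContext instead of importing Spark's test singleton.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("spark-avro-tests"))
val sqlContext = new SQLContext(sc)
// ... run tests against sqlContext, then call sc.stop() in the suite's teardown.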

@JoshRosen JoshRosen changed the title from [WIP] Test against Spark 1.5.0-rc1 in Travis to Test against Spark 1.5.0-rc1 in Travis on Aug 24, 2015
@JoshRosen
Contributor Author

It looks like this legitimately fails tests with 1.5.0-rc1: https://travis-ci.org/databricks/spark-avro/jobs/77030705

18:42:40.130 ERROR org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Aborting task.
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toScala(CatalystTypeConverters.scala:332)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toScala(CatalystTypeConverters.scala:318)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toScala$1.apply(CatalystTypeConverters.scala:178)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toScala$1.apply(CatalystTypeConverters.scala:177)
    at org.apache.spark.sql.types.ArrayData.foreach(ArrayData.scala:127)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScala(CatalystTypeConverters.scala:177)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:185)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:148)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toScala(CatalystTypeConverters.scala:110)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:278)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:245)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:406)
    at org.apache.spark.sql.sources.OutputWriter.writeInternal(interfaces.scala:380)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:240)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Looks like there's a problem in the decimal converter.

@JoshRosen
Contributor Author

There's also a match error in CatalystTypeConverters:

[info]   Cause: scala.MatchError: 3.14 (of class java.lang.Double)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:321)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:318)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:245)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info]   at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:393)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439)
[info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
[info]   ...

@JoshRosen
Contributor Author

The first issue is a legitimate Spark bug: https://issues.apache.org/jira/browse/SPARK-10190

@JoshRosen
Contributor Author

The NPE will be fixed by apache/spark#8401.

private lazy val avroSchema = if (paths.isEmpty) {
  throw NoFilesException
} else {
  // As of Spark 1.5.0, it's possible to receive an array which contains a single non-existent
Contributor Author


@liancheng, could you take a look at this change? It looks like the globPathIfNecessary change in Spark 1.5.0 means that non-existent paths may get passed down to data sources if those paths didn't contain any glob characters. In the cases where this happens, though, I think that we will only receive an array with a single path, so the third-party data source code should only need to check path existence when paths.size == 1. Would be good to confirm this intuition, though.
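
For readers following along, a minimal sketch of the check being discussed, assuming the relation exposes paths: Seq[String] and a Hadoop Configuration (variable and method names here are illustrative, not the PR's exact code):

import java.io.FileNotFoundException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative sketch: with Spark 1.5's globPathIfNecessary, a glob-free,
// non-existent path can reach the data source as a single-element array,
// so an existence check is only needed in the paths.size == 1 case.
def checkPathExists(paths: Seq[String], hadoopConf: Configuration): Unit = {
  if (paths.size == 1) {
    val path = new Path(paths.head)
    val fs = path.getFileSystem(hadoopConf)
    if (!fs.exists(path)) {
      throw new FileNotFoundException(s"Path does not exist: $path")
    }
  }
}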

Contributor


Yeah, according to the existing code paths calling globPathIfNecessary in Spark 1.5, this assumption is right. We should probably do the existence check within globPathIfNecessary when the path pattern doesn't contain any glob characters, though.

@codecov-io

Current coverage is 93.30%

Merging #76 into master will increase coverage by +0.37% as of 01e5e1d

@@            master    #76   diff @@
=====================================
  Files            6      6       
  Stmts          269    269       
  Branches        45     45       
  Methods          0      0       
=====================================
+ Hit            250    251     +1
  Partial          0      0       
+ Missed          19     18     -1

Review entire Coverage Diff as of 01e5e1d


@JoshRosen JoshRosen modified the milestone: 2.0 Aug 25, 2015
@JoshRosen JoshRosen changed the title from Test against Spark 1.5.0-rc1 in Travis to Test against Spark 1.5.0-rc2 in Travis on Aug 26, 2015
@JoshRosen
Contributor Author

Going to update this to test with RC2, then will merge if it passes.

@JoshRosen
Contributor Author

I just realized that this is sort of testing the wrong thing: we should be compiling with a fixed version of Spark and running tests with different versions in order to better simulate how our released library will actually be used. I'm going to update the Travis build to do this.
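
For concreteness, one possible shape of such a setup, using hypothetical sparkVersion / testSparkVersion values driven by a system property that the Travis matrix could set; this is a sketch of the idea, not the build changes actually made here:

// build.sbt sketch (hypothetical setting names; not this repo's actual build):
// compile against a pinned Spark release, but let CI inject a different
// version onto the test classpath via -Dspark.testVersion=<version>.
val sparkVersion = "1.4.1"
val testSparkVersion = sys.props.getOrElse("spark.testVersion", sparkVersion)

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  ("org.apache.spark" %% "spark-sql" % testSparkVersion % "test").force()
)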

@JoshRosen
Contributor Author

I have a partial fix for separate compile and test versions of the same dependency but I'm not convinced that it works correctly / isn't brittle, so let's hold off on merging for now.

@JoshRosen
Contributor Author

Yep, it didn't work: SBT recompiled the non-test sources with the newer dependency version:

[info] Compiling 6 Scala sources to /home/travis/build/databricks/spark-avro/target/scala-2.10/classes...
[info] [info] Cleaning datadir [/home/travis/build/databricks/spark-avro/target/scala-2.10/scoverage-data]
[info] [info] Beginning coverage instrumentation
[info] [info] Instrumentation completed [505 statements]
[info] [info] Wrote instrumentation file [/home/travis/build/databricks/spark-avro/target/scala-2.10/scoverage-data/scoverage.coverage.xml]
[info] [info] Will write measurement data to [/home/travis/build/databricks/spark-avro/target/scala-2.10/scoverage-data]
[info] Compiling 5 Scala sources to /home/travis/build/databricks/spark-avro/target/scala-2.10/test-classes...

@JoshRosen
Contributor Author

@marmbrus, it looks like this library is going to face the same multi-Hadoop-version compatibility problems that spark-redshift has, since it calls the same incompatible methods on TaskAttemptContext, so I think we need to block the release on performing a similar fix here. I'll bring this PR up to date now.
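
For background, the usual workaround is a small reflection shim: TaskAttemptContext is a class in Hadoop 1.x but an interface in Hadoop 2.x, so code compiled against one can fail at runtime against the other. A sketch of the general technique (not the follow-up fix itself):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Illustrative shim: look the method up on the runtime class so the compiled
// bytecode makes no class-vs-interface assumption about TaskAttemptContext.
def getConfiguration(context: TaskAttemptContext): Configuration = {
  val method = context.getClass.getMethod("getConfiguration")
  method.invoke(context).asInstanceOf[Configuration]
}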

@JoshRosen
Contributor Author

Will address Hadoop compatibility in a separate PR.

@JoshRosen JoshRosen closed this in 44a806e Aug 28, 2015
@JoshRosen JoshRosen deleted the test-against-1.5.0 branch August 28, 2015 03:04
@findchris

@JoshRosen - You wrote "There's also a match error in CatalystTypeConverters" above. Should that have been resolved by your commits to this PR?

I'm seeing something similar:

scala.MatchError: 1.000000000000 (of class java.lang.String)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:323)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
    at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)

@JoshRosen
Contributor Author

Hi @findchris,

For legacy reasons, the existing releases of spark-avro store decimals as strings instead of native Avro decimals, since decimals weren't supported by Avro prior to version 1.7.7. In a future release, I'd like to fix this (see #80).

How did you manage to trigger the error above? I could see how that might happen if you saved a decimal column to Avro and then read it back while manually specifying the schema as a decimal. If that wasn't what you were doing, could you share a small reproduction that triggers the issue?
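
For anyone else hitting this, here is a sketch of the round trip described above, with made-up paths and column names and assuming a SQLContext named sqlContext (Spark 1.5-era API); it only illustrates the hypothesis that a manually specified decimal schema hands the stored string to Catalyst's DecimalConverter:

import org.apache.spark.sql.types._

// Hypothetical reproduction sketch: spark-avro writes DecimalType columns out
// as Avro strings, so reading the file back with a manually specified
// DecimalType schema gives DecimalConverter a java.lang.String.
val df = sqlContext.createDataFrame(Seq(Tuple1(BigDecimal("1.000000000000")))).toDF("amount")
df.write.format("com.databricks.spark.avro").save("/tmp/decimal-as-string.avro")

val forcedSchema = StructType(Seq(StructField("amount", DecimalType(13, 12))))
sqlContext.read
  .schema(forcedSchema)                      // manually re-imposing the decimal type
  .format("com.databricks.spark.avro")
  .load("/tmp/decimal-as-string.avro")
  .collect()                                 // scala.MatchError: ... (of class java.lang.String)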
