Conversation
It looks like this legitimately fails tests with 1.5.0-rc1: https://travis-ci.org/databricks/spark-avro/jobs/77030705

18:42:40.130 ERROR org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Aborting task.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toScala(CatalystTypeConverters.scala:332)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toScala(CatalystTypeConverters.scala:318)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toScala$1.apply(CatalystTypeConverters.scala:178)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toScala$1.apply(CatalystTypeConverters.scala:177)
at org.apache.spark.sql.types.ArrayData.foreach(ArrayData.scala:127)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScala(CatalystTypeConverters.scala:177)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:185)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:148)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toScala(CatalystTypeConverters.scala:110)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:278)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:245)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:406)
at org.apache.spark.sql.sources.OutputWriter.writeInternal(interfaces.scala:380)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:240)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Looks like there's a problem in the decimal converter.
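For context, the failure mode is a Catalyst-to-Scala decimal converter that dereferences its input without a null check, so a null decimal nested inside an array or struct blows up. A minimal, self-contained sketch of the null-safe pattern (the `InternalDecimal` stand-in is hypothetical, not Spark's real class):

```scala
import java.math.{BigDecimal => JavaBigDecimal}

// Hypothetical stand-in for Spark's internal Decimal representation.
final case class InternalDecimal(underlying: JavaBigDecimal) {
  def toJavaBigDecimal: JavaBigDecimal = underlying
}

object NullSafeDecimalConverter {
  // The rc1 converter dereferenced the Catalyst value unconditionally,
  // so a null decimal inside an array or struct NPE'd as in the trace above.
  def toScala(catalystValue: InternalDecimal): JavaBigDecimal =
    if (catalystValue == null) null else catalystValue.toJavaBigDecimal

  def main(args: Array[String]): Unit = {
    println(toScala(InternalDecimal(new JavaBigDecimal("3.14")))) // 3.14
    println(toScala(null))                                        // null, no NPE
  }
}
```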
There's also a match error in CatalystTypeConverters:

[info] Cause: scala.MatchError: 3.14 (of class java.lang.Double)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:321)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:318)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:245)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info] at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:393)
[info] at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439)
[info] at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:439)
[info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
[info] ...
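The MatchError comes from the decimal converter's to-Catalyst path, whose pattern match covered only BigDecimal inputs. A hedged sketch of the widened match (an illustration of the fix shape, not the verbatim Spark patch):

```scala
import java.math.{BigDecimal => JavaBigDecimal}

object ToCatalystDecimalSketch {
  // rc1 matched only scala.BigDecimal and java.math.BigDecimal, so a raw
  // boxed double fell through: scala.MatchError: 3.14 (of class java.lang.Double).
  def toCatalyst(scalaValue: Any): JavaBigDecimal = scalaValue match {
    case d: BigDecimal       => d.bigDecimal
    case d: JavaBigDecimal   => d
    case d: java.lang.Double => JavaBigDecimal.valueOf(d.doubleValue()) // the missing case
  }

  def main(args: Array[String]): Unit =
    println(toCatalyst(3.14)) // 3.14, instead of a MatchError
}
```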
The first issue is a legitimate Spark bug: https://issues.apache.org/jira/browse/SPARK-10190

The NPE will be fixed by apache/spark#8401.
private lazy val avroSchema = if (paths.isEmpty) {
  throw NoFilesException
} else {
  // As of Spark 1.5.0, it's possible to receive an array which contains a single non-existent
  // path; check that the path exists before using it to read the schema.
@liancheng, could you take a look at this change? It looks like the globPathIfNecessary change in Spark 1.5.0 means that non-existent paths may get passed down to data sources if those paths didn't contain any glob characters. In the cases where this happens, though, I think we will only receive an array with a single path, so third-party data source code should only need to check path existence when paths.size == 1 (see the sketch below). It would be good to confirm this intuition, though.
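A sketch of the single-path existence check described above, using Hadoop's FileSystem API (the helper name and exception choice are assumptions, not spark-avro's exact code):

```scala
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object PathExistenceCheck {
  // Spark 1.5 only passes a non-existent path through when the original
  // input had no glob characters, in which case paths.size == 1.
  def checkPathsExist(paths: Seq[String], conf: Configuration): Unit =
    if (paths.size == 1) {
      val path = new Path(paths.head)
      val fs = path.getFileSystem(conf)
      if (!fs.exists(path)) {
        throw new FileNotFoundException(s"Path does not exist: $path")
      }
    }
}
```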
Yeah, according to the existing code paths calling globPathIfNecessary in Spark 1.5, this assumption is right. We should probably do the existence check within globPathIfNecessary itself when the path pattern doesn't contain any glob characters, though.
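A sketch of doing the check inside a globPathIfNecessary-style helper instead (the signature and glob-character set are assumptions; Spark's real helper lives in its own utilities):

```scala
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{FileSystem, Path}

object GlobPathSketch {
  private val globChars: Set[Char] = "{}[]*?\\".toSet

  def globPathIfNecessary(fs: FileSystem, pattern: Path): Seq[Path] =
    if (pattern.toString.exists(globChars)) {
      // Glob patterns may legitimately match nothing; globStatus also
      // returns null in some no-match cases, hence the Option wrapper.
      Option(fs.globStatus(pattern)).toSeq.flatten.map(_.getPath)
    } else {
      // No glob characters: fail fast here instead of handing a
      // phantom path down to the data source.
      if (!fs.exists(pattern)) {
        throw new FileNotFoundException(s"Path does not exist: $pattern")
      }
      Seq(pattern)
    }
}
```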
Going to update this to test with RC2, then will merge if it passes.
I just realized that this is sort of testing the wrong thing: we should be compiling with a fixed version of Spark and running tests with different versions in order to better simulate how our released library will actually be used. I'm going to update the Travis build to do this.
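A sketch of what that split can look like in an sbt build, where the test classpath forces a different Spark version than the one used for compilation (the system-property name and the force() trick are assumptions, not necessarily the final setup):

```scala
// build.sbt sketch: compile against a fixed Spark version, run tests
// against whatever -Dspark.testVersion supplies.
val sparkVersion = "1.4.1"
val testSparkVersion = sys.props.getOrElse("spark.testVersion", sparkVersion)

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Pin the test classpath to the requested version even if it differs
  // from the compile-time one.
  ("org.apache.spark" %% "spark-sql" % testSparkVersion % "test").force()
)
```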
I have a partial fix for separate compile and test versions of the same dependency, but I'm not convinced that it works correctly / isn't brittle, so let's hold off on merging for now.
Yep, it didn't work: SBT recompiled the non-test sources with the newer dependency version.
@marmbrus, it looks like this library is going to face the same multi-Hadoop-version-compatibility problems that
Will address Hadoop compatibility in a separate PR.
@JoshRosen - You wrote "There's also a match error in CatalystTypeConverters" above. Should that have been resolved by your commits to this PR? I'm seeing something similar:

scala.MatchError: 1.000000000000 (of class java.lang.String)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:323)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
Hi @findchris,

For legacy reasons, the existing releases of spark-avro write decimal columns out as strings.

How did you manage to trigger the error above? I could see how that might happen if you saved a decimal column to Avro and then read it back while manually specifying the schema as a decimal. If that wasn't what you were doing, could you share a small reproduction that triggers the issue?
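For reference, a hypothetical reproduction of that round trip (the paths, local master, and the decimal-as-string legacy behavior are all assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

object DecimalRoundTripRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("repro"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Write: with no Avro decimal support, the column is stored as the
    // string "1.000000000000" rather than as a decimal.
    sc.parallelize(Seq(Tuple1(BigDecimal("1.000000000000"))))
      .toDF("amount")
      .write.format("com.databricks.spark.avro").save("/tmp/decimal_repro")

    // Read back while forcing a decimal schema: the converter now sees a
    // java.lang.String and throws the MatchError quoted above.
    val forced = StructType(Seq(StructField("amount", DecimalType(20, 12))))
    sqlContext.read.schema(forced)
      .format("com.databricks.spark.avro").load("/tmp/decimal_repro")
      .collect()
  }
}
```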
This PR adds Spark 1.5.0-rc2 to our Travis build matrix. I removed the use of TestSQLContext since that class has been removed in Spark 1.5.
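With TestSQLContext gone in 1.5, each suite has to manage its own context; a sketch of the replacement pattern with ScalaTest (the suite name and local-master config are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Each suite owns its contexts instead of relying on the removed
// org.apache.spark.sql.test.TestSQLContext singleton.
class AvroSuiteSketch extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _
  protected var sqlContext: SQLContext = _

  override protected def beforeAll(): Unit = {
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("AvroSuiteSketch"))
    sqlContext = new SQLContext(sc)
  }

  override protected def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }
}
```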