Conversation

@wangyum (Member) commented Apr 20, 2019

What changes were proposed in this pull request?

The Parquet file format is Spark's default data source for input and output, so the parquet-provided profile is confusing for end users:

  1. Build Spark with parquet-provided:
./dev/make-distribution.sh --name parquet-provided --tgz -Phadoop-2.7 -Phive -Pparquet-provided
  2. Save the ML model:
scala> model.save("/tmp/spark/w2v")
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat could not be instantiated
  at java.util.ServiceLoader.fail(ServiceLoader.java:232)
  at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:632)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:607)
  at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelWriter.saveImpl(Word2Vec.scala:352)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
  at org.apache.spark.ml.util.MLWritable.save(ReadWrite.scala:287)
  at org.apache.spark.ml.util.MLWritable.save$(ReadWrite.scala:287)
  at org.apache.spark.ml.feature.Word2VecModel.save(Word2Vec.scala:210)
  ... 47 elided
Caused by: java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetOutputFormat$JobSummaryLevel
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
  at java.lang.Class.getConstructor0(Class.java:3075)
  at java.lang.Class.newInstance(Class.java:412)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
  ... 71 more
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputFormat$JobSummaryLevel
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 76 more

End users will be confused about the relationship between Parquet and ML models; see the reproduction sketch below.
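
A self-contained sketch of the reproduction above (the original report does not show how model was trained; the Word2Vec setup here is illustrative, following the standard spark.ml example):

import org.apache.spark.ml.feature.Word2Vec

// Illustrative training step, not taken from the original report: fit a small
// Word2Vec model so that there is something to save.
val docs = Seq(
  "Hi I heard about Spark".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)
val df = spark.createDataFrame(docs).toDF("text")
val model = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
  .fit(df)

// ML persistence writes model data as Parquet, so on a parquet-provided build
// without Parquet jars on the classpath this call fails with the
// NoClassDefFoundError shown above.
model.save("/tmp/spark/w2v")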

How was this patch tested?

Manual tests.

@SparkQA commented Apr 20, 2019

Test build #104770 has finished for PR 24422 at commit 10d6d88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) left a comment


Will we break other applications by removing this? For example, I have seen Spark built with parquet-provided used for Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
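
For reference, that guide has Spark built with the "provided" profiles so that Hive supplies the Hadoop and Parquet jars at runtime. A rough sketch of that kind of build command (the profile list here is illustrative, not quoted from the wiki):

# Illustrative "provided"-style distribution, used when the host application
# (here Hive) ships its own Hadoop and Parquet jars.
./dev/make-distribution.sh --name hadoop-provided --tgz \
  -Pyarn -Phadoop-provided -Phadoop-2.7 -Pparquet-provided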

@felixcheung (Member) commented:

I don't see how that's confusing? The ML model is persisted in Parquet format.

@wangyum closed this Apr 21, 2019
