
[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… #9595

Closed · wants to merge 9 commits

Conversation

@zjffdu (Contributor) commented Nov 10, 2015

… Add LibSVMOutputWriter

The behavior of LibSVMRelation is unchanged except for the addition of LibSVMOutputWriter; a minimal sketch of the writer's formatting step follows the list below.

  • Partitioning is still not supported
  • Multiple input paths are not supported
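
For context, the writer ultimately serializes each (label, features) row into LIBSVM's text format. Here is a minimal sketch of that formatting step, assuming the (label: Double, features: Vector) schema used by LibSVMRelation; the helper name toLibSVMLine is hypothetical, not code from this patch:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Hypothetical helper: render one row as a LIBSVM text line,
// "<label> <index1>:<value1> <index2>:<value2> ...", with 1-based indices.
def toLibSVMLine(row: Row): String = {
  val label = row.getDouble(0)
  val features = row.getAs[Vector](1)
  val sb = new StringBuilder(label.toString)
  features.foreachActive { (index, value) =>
    sb.append(s" ${index + 1}:$value")
  }
  sb.toString()
}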

@SparkQA commented Nov 10, 2015

Test build #45516 has finished for PR 9595 at commit 801dc5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister

@mengxr (Contributor) commented Nov 10, 2015

cc @Lewuathe

StructField("label", DoubleType, nullable = false) ::
StructField("features", new VectorUDT(), nullable = false) :: Nil
)
extends HadoopFsRelation with Logging with Serializable {
A Contributor commented on this diff:

Is it really necessary to mix in the Logging trait here? HadoopFsRelation already does it.

The Contributor Author replied:

Right, I will correct that.
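
The implied fix is a one-liner, dropping the redundant mixin from the declaration quoted above:

extends HadoopFsRelation with Serializable {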

@SparkQA commented Nov 11, 2015

Test build #45593 has finished for PR 9595 at commit a26c19c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister

@zjffdu (Contributor Author) commented Dec 1, 2015

@Lewuathe Would you mind helping review this? Thanks.

import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrameReader, DataFrame, Row, SQLContext}
A Contributor commented on this diff:

Is it necessary to change this to an underscore (wildcard) import? We can keep this:

import org.apache.spark.sql.{DataFrameReader, DataFrame, Row, SQLContext}

@SparkQA commented Dec 1, 2015

Test build #46952 has finished for PR 9595 at commit 611a9ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister

@Lewuathe (Contributor) commented Dec 2, 2015

@zjffdu LGTM. Could you create another JIRA for supporting multiple input paths on LibSVMRelation as a follow-up? Thanks.

@zjffdu (Contributor Author) commented Dec 2, 2015

Sure, created SPARK-12086 for that.

: BaseRelation = {
val path = parameters.getOrElse("path",
throw new IllegalArgumentException("'path' must be specified"))
override def createRelation(sqlContext: SQLContext,
A Contributor commented on this diff:

Chop down the arguments and use 4-space indentation.
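
For reference, the requested style puts each argument on its own line, indented four spaces past the def. A sketch of how the quoted declaration would look; the full parameter list is an assumption based on the HadoopFsRelationProvider API of that era, not copied from the patch:

override def createRelation(
    sqlContext: SQLContext,
    paths: Array[String],
    dataSchema: Option[StructType],
    partitionColumns: Option[StructType],
    parameters: Map[String, String]): HadoopFsRelation = {
  val path = parameters.getOrElse("path",
    throw new IllegalArgumentException("'path' must be specified"))
  // ...
}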

@mengxr (Contributor) commented Jan 19, 2016

@zjffdu Sorry for the delay! I made a pass and left some comments inline. You also need to rebase onto master to resolve the conflicts. Please let me know whether you have time to update this PR.

Btw, another piece of follow-up work would be exposing options for formatting the output values. Currently we use the default format, which outputs 16 digits per double value; that might be too long for common use cases. Could you create a JIRA for this? Thanks!

@zjffdu (Contributor Author) commented Jan 20, 2016

Thanks @mengxr for the review; I will update the patch and create a follow-up JIRA.

df.write.save(writepath)

val df2 = sqlContext.read.format("libsvm").load(writepath)
val row1 = df.first()
The Contributor Author commented on this diff:

This is a bug in the test code: I am verifying df rather than df2, which is why the test passed. The one-line fix is sketched below.
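
The fix is simply to assert against the round-tripped DataFrame (variable names as in the quoted test above):

val df2 = sqlContext.read.format("libsvm").load(writepath)
val row1 = df2.first()  // verify the re-read df2, not the original df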

@SparkQA commented Jan 20, 2016

Test build #49797 has finished for PR 9595 at commit 4a406ab.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 20, 2016

It's weird that the Scala style check passes on my local box.

import com.google.common.base.Objects
import org.apache.hadoop.fs.{Path, FileStatus}
A Contributor commented on this diff:

I didn't know we were checking the ordering of imports now. F should come before P. You can use Scala Import Organizer with IntelliJ to organize imports quickly.
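
That is, the corrected line (as it appears later in this PR) reads:

import org.apache.hadoop.fs.{FileStatus, Path}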

@zjffdu (Contributor Author) commented Jan 20, 2016

@mengxr Is there any way to retrigger the build? BTW, what do you mean by "exposing options to format the output values"? Is there a new Spark feature for encoding doubles in a compact way? I may be missing something here.

@mengxr (Contributor) commented Jan 20, 2016

test this please

@mengxr (Contributor) commented Jan 20, 2016

LIBSVM is a text format, so we need to consider the cost of storing numerical values. In the current implementation the output can be text like 1 1:0.12345678901234, while people might not need more than 6 digits on the feature values; 1 1:0.123456 should be sufficient. But we don't have an option to control the formatting of values in the implementation. It could be just a parameter of this LIBSVM data source, or it could be a global flag, because the same issue also applies to other text formats like CSV and JSON. Let's create a JIRA and move our discussion there. A sketch of what such an option could look like follows.
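
To make the idea concrete, here is a minimal sketch of digit-limited formatting; the helper formatValue and its maxDigits parameter are hypothetical, not an existing Spark option:

import java.math.MathContext

// Hypothetical: round a feature value to a configurable number of
// significant digits instead of the full default double representation.
def formatValue(value: Double, maxDigits: Int = 6): String =
  BigDecimal(value).round(new MathContext(maxDigits)).toString

formatValue(0.12345678901234)  // "0.123457"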

@mengxr (Contributor) commented Jan 20, 2016

test this please

@SparkQA commented Jan 20, 2016

Test build #49813 has finished for PR 9595 at commit ed4e822.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 20, 2016

Test build #49814 has finished for PR 9595 at commit ed4e822.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 20, 2016

Thanks for clarifying. In that case, we may lose precision when reading. Maybe making it LIBSVM-specific is better; anyway, we can discuss it in the JIRA.

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
A Contributor commented on this diff:

{ should come before l. Please try Scala Import Organizer :) This is what I got:

import java.io.IOException

import com.google.common.base.Objects
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

import org.apache.spark.annotation.Since
import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, DataFrameReader, Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

@SparkQA commented Jan 20, 2016

Test build #49822 has finished for PR 9595 at commit 4d265d8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 21, 2016

test this please

@SparkQA commented Jan 21, 2016

Test build #49831 has finished for PR 9595 at commit 41e8c2f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2016

Test build #49851 has finished for PR 9595 at commit 0d6d06d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2016

Test build #49853 has finished for PR 9595 at commit 8a2c96f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2016

Test build #49857 has finished for PR 9595 at commit 5bdf224.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jan 25, 2016

test this please

@SparkQA commented Jan 25, 2016

Test build #50007 has finished for PR 9595 at commit 5bdf224.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 25, 2016

Thanks @mengxr. I'm not sure why the test fails; I will take a look when I have time.

@mengxr (Contributor) commented Jan 25, 2016

The failed test is unrelated to this PR and is tracked here: https://issues.apache.org/jira/browse/SPARK-10086. I will ask Jenkins to make another try.

@mengxr (Contributor) commented Jan 25, 2016

test this please

@SparkQA commented Jan 26, 2016

Test build #50032 has finished for PR 9595 at commit 5bdf224.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jan 26, 2016

Let's wait for #10909 first.

@mengxr (Contributor) commented Jan 26, 2016

test this please

@SparkQA commented Jan 26, 2016

Test build #50078 has finished for PR 9595 at commit 5bdf224.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jan 27, 2016

LGTM. Merged into master. Thanks!

@asfgit closed this in 1dac964 on Jan 27, 2016