
[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… #9595

Closed · wants to merge 9 commits

Conversation

@zjffdu (Contributor) commented Nov 10, 2015

… Add LibSVMOutputWriter

The behavior of LibSVMRelation is unchanged except for the addition of LibSVMOutputWriter; a minimal sketch of the writer's formatting step follows the list below.

  • Partitioning is still not supported
  • Multiple input paths are not supported
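
For context, the writer ultimately serializes each (label, features) row into LIBSVM's text format. Here is a minimal sketch of that formatting step, assuming the (label: Double, features: Vector) schema used by LibSVMRelation; the helper name toLibSVMLine is hypothetical, not code from this patch:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Hypothetical helper: render one row as a LIBSVM text line,
// "<label> <index1>:<value1> <index2>:<value2> ...", with 1-based indices.
def toLibSVMLine(row: Row): String = {
  val label = row.getDouble(0)
  val features = row.getAs[Vector](1)
  val sb = new StringBuilder(label.toString)
  features.foreachActive { (index, value) =>
    sb.append(s" ${index + 1}:$value")
  }
  sb.toString()
}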

@SparkQA commented Nov 10, 2015

Test build #45516 has finished for PR 9595 at commit 801dc5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister

@mengxr (Contributor) commented Nov 10, 2015

cc @Lewuathe

StructField("label", DoubleType, nullable = false) ::
StructField("features", new VectorUDT(), nullable = false) :: Nil
)
extends HadoopFsRelation with Logging with Serializable {
A Contributor commented on this diff:

Is it really necessary to mix in the Logging trait here? HadoopFsRelation already does it.

The Contributor Author replied:

Right, I will correct that.
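
The implied fix is a one-liner, dropping the redundant mixin from the declaration quoted above:

extends HadoopFsRelation with Serializable {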

@SparkQA commented Nov 11, 2015

Test build #45593 has finished for PR 9595 at commit a26c19c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister

@zjffdu (Contributor Author) commented Dec 1, 2015

@Lewuathe Would you mind helping review this? Thanks.

import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrameReader, DataFrame, Row, SQLContext}
A Contributor commented on this diff:

Is it necessary to change this to an underscore (wildcard) import? We can keep this:

import org.apache.spark.sql.{DataFrameReader, DataFrame, Row, SQLContext}

@SparkQA commented Dec 1, 2015

Test build #46952 has finished for PR 9595 at commit 611a9ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister

@Lewuathe (Contributor) commented Dec 2, 2015

@zjffdu LGTM. Could you create another JIRA for supporting multiple input paths on LibSVMRelation as a follow-up? Thanks.

@zjffdu (Contributor Author) commented Dec 2, 2015

Sure, created SPARK-12086 for that.

: BaseRelation = {
val path = parameters.getOrElse("path",
throw new IllegalArgumentException("'path' must be specified"))
override def createRelation(sqlContext: SQLContext,
A Contributor commented on this diff:

Chop down the arguments and use 4-space indentation.
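
For reference, the requested style puts each argument on its own line, indented four spaces past the def. A sketch of how the quoted declaration would look; the full parameter list is an assumption based on the HadoopFsRelationProvider API of that era, not copied from the patch:

override def createRelation(
    sqlContext: SQLContext,
    paths: Array[String],
    dataSchema: Option[StructType],
    partitionColumns: Option[StructType],
    parameters: Map[String, String]): HadoopFsRelation = {
  val path = parameters.getOrElse("path",
    throw new IllegalArgumentException("'path' must be specified"))
  // ...
}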

@mengxr (Contributor) commented Jan 19, 2016

@zjffdu Sorry for the delay! I made a pass and left some comments inline. You also need to rebase onto master to resolve the conflicts. Please let me know whether you have time to update this PR.

Btw, another piece of follow-up work would be exposing options for formatting the output values. Currently we use the default format, which outputs 16 digits per double value; that might be too long for common use cases. Could you create a JIRA for this? Thanks!

@zjffdu (Contributor Author) commented Jan 20, 2016

Thanks @mengxr for the review; I will update the patch and create a follow-up JIRA.

df.write.save(writepath)

val df2 = sqlContext.read.format("libsvm").load(writepath)
val row1 = df.first()
The Contributor Author commented on this diff:

This is a bug in the test code: I am verifying df rather than df2, which is why the test passed. The one-line fix is sketched below.
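
The fix is simply to assert against the round-tripped DataFrame (variable names as in the quoted test above):

val df2 = sqlContext.read.format("libsvm").load(writepath)
val row1 = df2.first()  // verify the re-read df2, not the original df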

@SparkQA commented Jan 20, 2016

Test build #49797 has finished for PR 9595 at commit 4a406ab.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 20, 2016

It's weird that the Scala style check passes on my local box.

import com.google.common.base.Objects
import org.apache.hadoop.fs.{Path, FileStatus}
A Contributor commented on this diff:

I didn't know we were checking the ordering of imports now. F should come before P. You can use Scala Import Organizer with IntelliJ to organize imports quickly.
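
That is, the corrected line (as it appears later in this PR) reads:

import org.apache.hadoop.fs.{FileStatus, Path}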

@zjffdu (Contributor Author) commented Jan 20, 2016

@mengxr Is there any way to retrigger the build? BTW, what do you mean by "exposing options to format the output values"? Is there a new Spark feature for encoding doubles in a compact way? I may be missing something here.

@mengxr (Contributor) commented Jan 20, 2016

test this please

@mengxr (Contributor) commented Jan 20, 2016

LIBSVM is a text format, so we need to consider the cost of storing numerical values. In the current implementation the output can be text like 1 1:0.12345678901234, while people might not need more than 6 digits on the feature values; 1 1:0.123456 should be sufficient. But we don't have an option to control the formatting of values in the implementation. It could be just a parameter of this LIBSVM data source, or it could be a global flag, because the same issue also applies to other text formats like CSV and JSON. Let's create a JIRA and move our discussion there. A sketch of what such an option could look like follows.
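
To make the idea concrete, here is a minimal sketch of digit-limited formatting; the helper formatValue and its maxDigits parameter are hypothetical, not an existing Spark option:

import java.math.MathContext

// Hypothetical: round a feature value to a configurable number of
// significant digits instead of the full default double representation.
def formatValue(value: Double, maxDigits: Int = 6): String =
  BigDecimal(value).round(new MathContext(maxDigits)).toString

formatValue(0.12345678901234)  // "0.123457"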

@mengxr (Contributor) commented Jan 20, 2016

test this please

@SparkQA commented Jan 20, 2016

Test build #49813 has finished for PR 9595 at commit ed4e822.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 20, 2016

Test build #49814 has finished for PR 9595 at commit ed4e822.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 20, 2016

Thanks for clarifying. In that case, we may lose precision when reading. Maybe making it LIBSVM-specific is better; anyway, we can discuss it in the JIRA.

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
A Contributor commented on this diff:

{ should come before l. Please try Scala Import Organizer :) This is what I got:

import java.io.IOException

import com.google.common.base.Objects
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

import org.apache.spark.annotation.Since
import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, DataFrameReader, Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

@SparkQA commented Jan 20, 2016

Test build #49822 has finished for PR 9595 at commit 4d265d8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 21, 2016

test this please

@SparkQA commented Jan 21, 2016

Test build #49831 has finished for PR 9595 at commit 41e8c2f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2016

Test build #49851 has finished for PR 9595 at commit 0d6d06d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2016

Test build #49853 has finished for PR 9595 at commit 8a2c96f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2016

Test build #49857 has finished for PR 9595 at commit 5bdf224.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jan 25, 2016

test this please

@SparkQA commented Jan 25, 2016

Test build #50007 has finished for PR 9595 at commit 5bdf224.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu (Contributor Author) commented Jan 25, 2016

Thanks @mengxr. I'm not sure why the test fails; I will take a look when I have time.

@mengxr (Contributor) commented Jan 25, 2016

The failed test is unrelated to this PR and is tracked here: https://issues.apache.org/jira/browse/SPARK-10086. I will ask Jenkins to make another try.

@mengxr (Contributor) commented Jan 25, 2016

test this please

@SparkQA commented Jan 26, 2016

Test build #50032 has finished for PR 9595 at commit 5bdf224.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jan 26, 2016

Let's wait for #10909 first.

@mengxr (Contributor) commented Jan 26, 2016

test this please

@SparkQA commented Jan 26, 2016

Test build #50078 has finished for PR 9595 at commit 5bdf224.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jan 27, 2016

LGTM. Merged into master. Thanks!

@asfgit closed this in 1dac964 on Jan 27, 2016