[SPARK-30347][ML] LibSVMDataSource attach AttributeGroup #27003

Closed
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:libsvm_attr_group
zhengruifeng commented Dec 24, 2019

What changes were proposed in this pull request?

Make LibSVMDataSource attach an AttributeGroup to the `features` column.

Why are the changes needed?

LibSVMDataSource currently attaches a special metadata entry, `numFeatures`, to the `features` column to record the vector size:

scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")

scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}

However, ML implementations obtain the vector size through AttributeGroup, which does not read this metadata entry and therefore reports the size as unknown (-1):

scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._

scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
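
The mismatch can be reproduced without a data file. The following is a minimal sketch, assuming only the public classes from spark-sql and spark-mllib are on the classpath. The field is built by hand, the value 4 is the feature count from the example above, and `toStructField(existingMetadata)` is used purely for illustration; it is not a claim about how the patch is implemented.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{MetadataBuilder, StructField}

// A vector field that carries only the LibSVM-style "numFeatures" entry.
val plainMeta = new MetadataBuilder().putLong("numFeatures", 4L).build()
val plainField = StructField("features", VectorType, nullable = false, plainMeta)

// AttributeGroup looks for its own "ml_attr" entry, so the size is unknown here.
val sizeBefore = AttributeGroup.fromStructField(plainField).size

// Layer an AttributeGroup on top of the existing metadata, keeping "numFeatures".
val patchedField = new AttributeGroup("features", 4).toStructField(plainMeta)
val sizeAfter = AttributeGroup.fromStructField(patchedField).size
```

Here `sizeBefore` is the "unknown" marker and `sizeAfter` is the declared attribute count, while `patchedField.metadata` still contains the original `numFeatures` entry.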

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

zhengruifeng commented Dec 24, 2019

After this PR, LibSVMDataSource attaches an AttributeGroup that ML implementations can read, while keeping the existing `numFeatures` metadata:

scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._

scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
19/12/24 18:47:35 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"ml_attr":{"num_attrs":4},"numFeatures":4}

scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = 4
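
The warning in the session above points at the `numFeatures` read option, which avoids the extra scan over the input. A small sketch of passing it and then reading the size back the way ML code does; `loadLibSVM` and `vectorSize` are hypothetical helper names introduced here, and `path` stands for any LibSVM-formatted file:

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: load a LibSVM file with the feature count given up front,
// so the data source does not need to scan the input to determine it.
def loadLibSVM(spark: SparkSession, path: String, numFeatures: Int): DataFrame =
  spark.read.format("libsvm").option("numFeatures", numFeatures.toString).load(path)

// Hypothetical helper: read the vector size through AttributeGroup.
// Returns -1 when the column carries no AttributeGroup metadata.
def vectorSize(df: DataFrame, colName: String = "features"): Int =
  AttributeGroup.fromStructField(df.schema(colName)).size
```

On a build that includes this change, `vectorSize(loadLibSVM(spark, path, 4))` would be expected to match the supplied feature count; on earlier builds it would be expected to return -1, as in the PR description.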

SparkQA commented Dec 24, 2019

Test build #115733 has finished for PR 27003 at commit ce67623.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen left a comment

Looks OK pending tests if it's just attaching known metadata

@zhengruifeng zhengruifeng deleted the libsvm_attr_group branch December 26, 2019 02:04
zhengruifeng commented

Merged to master, thanks @srowen for reviewing!

fqaiser94 pushed a commit to fqaiser94/spark that referenced this pull request Mar 30, 2020
Closes apache#27003 from zhengruifeng/libsvm_attr_group.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>