Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7198] [MLLIB] VectorAssembler should output ML attributes #6452

Closed
wants to merge 2 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented May 28, 2015

VectorAssembler should carry over ML attributes. For unknown attributes, we assume numeric values. This PR handles the following cases:

  1. DoubleType with ML attribute: carry over
  2. DoubleType without ML attribute: numeric value
  3. Scalar type: numeric value
  4. VectorType with all ML attributes: carry over and update names
  5. VectorType with number of ML attributes: assume all numeric
  6. VectorType without ML attributes: check the first row and get the number of attributes

@jkbradley

@SparkQA
Copy link

SparkQA commented May 28, 2015

Test build #33642 has finished for PR 6452 at commit facdb1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Schema transformation.
val schema = dataset.schema
lazy val first = dataset.first()
val attrs = $(inputCols).flatMap { c =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not remove "index" from attributes. Should it remove or fix that field? (It's sort of odd that attributes have an index field.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indices gets reset in AttributeGroup:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/attribute/AttributeGroup.scala#L33

Let's re-visit the attribute API in 1.5:)

@jkbradley
Copy link
Member

LGTM other than those 2 items.

Side comment: Where should we document that a column of Vectors needs to have Vectors of the same length? This is a general rule, but some users may not realize it.

@mengxr
Copy link
Contributor Author

mengxr commented May 28, 2015

I don't have a good place in mind. Maybe in the home page of spark.ml user guide.

@SparkQA
Copy link

SparkQA commented May 28, 2015

Test build #33675 has finished for PR 6452 at commit a9d2469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM pending tests

@jkbradley
Copy link
Member

Oh I guess tests are done

@jkbradley
Copy link
Member

I'll go ahead and merge this into master and branch-1.4

asfgit pushed a commit that referenced this pull request May 28, 2015
`VectorAssembler` should carry over ML attributes. For unknown attributes, we assume numeric values. This PR handles the following cases:

1. DoubleType with ML attribute: carry over
2. DoubleType without ML attribute: numeric value
3. Scalar type: numeric value
4. VectorType with all ML attributes: carry over and update names
5. VectorType with number of ML attributes: assume all numeric
6. VectorType without ML attributes: check the first row and get the number of attributes

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6452 from mengxr/SPARK-7198 and squashes the following commits:

a9d2469 [Xiangrui Meng] add space
facdb1f [Xiangrui Meng] VectorAssembler should output ML attributes

(cherry picked from commit 7859ab6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 7859ab6 May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
`VectorAssembler` should carry over ML attributes. For unknown attributes, we assume numeric values. This PR handles the following cases:

1. DoubleType with ML attribute: carry over
2. DoubleType without ML attribute: numeric value
3. Scalar type: numeric value
4. VectorType with all ML attributes: carry over and update names
5. VectorType with number of ML attributes: assume all numeric
6. VectorType without ML attributes: check the first row and get the number of attributes

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6452 from mengxr/SPARK-7198 and squashes the following commits:

a9d2469 [Xiangrui Meng] add space
facdb1f [Xiangrui Meng] VectorAssembler should output ML attributes
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
`VectorAssembler` should carry over ML attributes. For unknown attributes, we assume numeric values. This PR handles the following cases:

1. DoubleType with ML attribute: carry over
2. DoubleType without ML attribute: numeric value
3. Scalar type: numeric value
4. VectorType with all ML attributes: carry over and update names
5. VectorType with number of ML attributes: assume all numeric
6. VectorType without ML attributes: check the first row and get the number of attributes

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#6452 from mengxr/SPARK-7198 and squashes the following commits:

a9d2469 [Xiangrui Meng] add space
facdb1f [Xiangrui Meng] VectorAssembler should output ML attributes
} else {
Some(attr.withName(c))
}
case _: NumericType | BooleanType =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick question regarding this, why do we not consider a 3.2 case where we have Scalar type with ML attributes? is it because there's no such thing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants