[SPARK-13610][ML] Create a Transformer to disassemble vectors in Data… #16486

Closed
wants to merge 2 commits into from

Conversation


@leonfl leonfl commented Jan 6, 2017

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-13610

What changes were proposed in this pull request?

Add a VectorDisassembler that disassembles a vector field into single fields.

How was this patch tested?

Unit tests for this feature have been added to ml.

@srowen
Member

srowen commented Jan 6, 2017

I don't think this is worth adding. It's pretty easy to pull out a single field from a vector already.

@leonfl
Author

leonfl commented Jan 6, 2017

It's a counterpart to VectorAssembler, which makes it easy for users to handle single fields and vector fields.
Pulling out a single field is easy, but extracting all the single fields of a vector still requires users to write some code.

Our business uses the disassemble transformation a lot, and every time it has to be handled by writing custom code. This Transformer would make it easier for users to understand and use, right?

@leonfl
Author

leonfl commented Jan 6, 2017

@mengxr, could you help to check this patch? Thanks

@leonfl
Author

leonfl commented Jan 9, 2017

@jkbradley, could you also help check this patch, since you are familiar with this issue? Thanks.

@mrjrdnthms

I could use this. I have a UDF to pick out the single values I want, but my implementation is slow. Here is my Python UDF:

    probTrue_udf = udf(lambda value: value[1].item(), FloatType())

I was hoping there would be a lower-level API that did the disassemble transformation quickly.
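
For readers on Spark 3.0 or later, one option that avoids a Python UDF is `pyspark.ml.functions.vector_to_array`, which converts the vector column to an array column inside the JVM. The sketch below is not the `VectorDisassembler` from this pull request. The helper names `element_names` and `disassemble`, the `<column>_<index>` naming scheme, and the requirement that the caller passes the vector size are all assumptions made for illustration.

```python
def element_names(vector_col, size):
    """Build one output column name per vector element, e.g. probability_0."""
    return [f"{vector_col}_{i}" for i in range(size)]


def disassemble(df, vector_col, size):
    """Return df with `size` extra columns, one per element of vector_col."""
    # Imported here so that element_names stays usable without a Spark install.
    from pyspark.ml.functions import vector_to_array  # added in Spark 3.0
    from pyspark.sql.functions import col

    # Array column built from the vector column; indexing it yields a Column.
    arr = vector_to_array(col(vector_col))
    new_cols = [arr[i].alias(name)
                for i, name in enumerate(element_names(vector_col, size))]
    return df.select("*", *new_cols)
```

For the single value picked out by the UDF in the comment above, `vector_to_array(col("probability"))[1]` would be the corresponding expression, assuming the vector column is named `probability` (the thread does not give the column name).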

@leonfl
Author

leonfl commented Apr 24, 2017

@mrjrdnthms, this is implemented with a UDF, which runs a little slower but is easy to use.
If you want it to run faster, you can implement it with mapPartitions and a row iterator instead of a UDF.
That implementation will reduce the running time a lot.

@mrjrdnthms

@leonfl The Python UDF is too slow for my task. By "mapPartitions and row iterator" do you mean doing the transformation on the RDD directly instead of on the DataFrame? Sorry for the basic question, I am new to Spark. And thanks for the help.

@leonfl
Author

leonfl commented May 2, 2017

@mrjrdnthms, yes, your understanding is correct. In Scala it looks like this:

    val rows: RDD[Row] = df.rdd.map(
      rowIn => {
        // handle the rowIn and return a Row
      }
    )
    val newDF = df.sqlContext.createDataFrame(rows, /*create the newDF schema*/)
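
The same per-row idea can be sketched in PySpark with `mapPartitions`. This is only an illustration: `disassemble_partition` is a hypothetical helper, and it assumes each row supports `asDict()` (as `pyspark.sql.Row` does) and that the vector value has a `toArray()` method (as `pyspark.ml.linalg` vectors do). Note that a Python function passed to `mapPartitions` still moves data between the JVM and Python, so it may not be faster than a Python UDF; the reduction in running time mentioned earlier refers to the Scala version.

```python
def disassemble_partition(rows, vector_col):
    """Yield one dict per input row, replacing vector_col by scalar fields.

    `rows` is the row iterator that mapPartitions passes in for one partition.
    """
    for row in rows:
        fields = row.asDict()
        vector = fields.pop(vector_col)
        for i, value in enumerate(vector.toArray()):
            # Convert to a plain Python float so schema inference accepts it.
            fields[f"{vector_col}_{i}"] = float(value)
        yield fields


# Hypothetical usage on a DataFrame `df` with a vector column "features":
#   from pyspark.sql import Row
#   flat = df.rdd.mapPartitions(
#       lambda rows: (Row(**d) for d in disassemble_partition(rows, "features")))
#   new_df = flat.toDF()
```

Working on a whole partition at a time keeps any per-partition setup out of the per-row path, which is the main reason to prefer `mapPartitions` over `map` here.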

@AmplabJenkins

Can one of the admins verify this patch?

@leonfl leonfl closed this Dec 15, 2017
@AlbertPlaPlanas

Was this ever implemented?

@HarborZeng

Such a great transformer. I don't understand why they chose to ignore this patch.

@diegoxfx

I don't think this is worth adding. It's pretty easy to pull out a single field from a vector already.

It is not possible to retrieve a single element from the VectorAssembler output. It is only possible to retrieve a subset of the array, and the result is still an array rather than a single element.
