[SPARK-4736][mllib] [random forest] functions returning the category with weights #3583
This version adds two functions, 1) predictByVotingWithWeight(features: Vector) and 2) predictWithWeight(features: Vector), and modifies the existing function predictByVoting(features: Vector). There are at least two reasons for this change: 1) In practice, we want to find the top N samples for a given category. However, in version 1.3.0, predict returns only the predicted category, without a weight. 2) In our data, positive and negative samples are highly imbalanced: there are far fewer positive samples than negative ones. As a result, voting predicts very few samples as positive. If the weights are also returned, users can choose a suitable threshold to adjust the predictions and improve performance.
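To make the idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code in this PR: the object name, the parameters `treePredictions` and `treeWeights`, and the choice to report the weight as a fraction of the total tree weight are all assumptions made for illustration.

```scala
// Hypothetical illustration only; not the MLlib implementation.
object WeightedVoteSketch {

  /** Returns (winning category, share of the total tree weight that voted for it). */
  def predictByVotingWithWeight(
      treePredictions: Seq[Int],
      treeWeights: Seq[Double]): (Int, Double) = {
    require(treePredictions.nonEmpty, "need at least one tree prediction")
    require(treePredictions.length == treeWeights.length, "need one weight per tree")

    // Sum the weights of the trees that voted for each category.
    val votes: Map[Int, Double] = treePredictions.zip(treeWeights)
      .groupBy(_._1)
      .map { case (category, pairs) => category -> pairs.map(_._2).sum }

    val total = votes.values.sum
    // If two categories tie, which one wins is unspecified in this sketch.
    val (category, weight) = votes.maxBy(_._2)
    (category, weight / total)
  }
}
```

With ten equally weighted trees of which three vote for category 1 and seven for category 0, `WeightedVoteSketch.predictByVotingWithWeight(Seq(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), Seq.fill(10)(1.0))` yields category 0 with weight 0.7. For a binary problem the positive-class share is then 1 - 0.7 = 0.3, which a caller could compare against a threshold lower than 0.5 or use to rank samples for a top N list.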
|
@dikejiang Do you mind creating a JIRA and adding the JIRA number to the PR title? Thanks! |
|
@mengxr done! |
|
@dikejiang Thanks for the PR! I'm wondering if you'd be interested in a more general API. In the new experimental ML package, I have a PR [https://www.github.com//pull/3637] which introduces a few prediction methods, one of which is: What do you think of using this instead of only predicting the top label's weight? |
|
@jkbradley Of course I am interested in the more general API if it can provide confidence predictions, because in practice we usually need such a confidence value to produce a top N ranking. In addition, I quite agree with you that confidence predictions could be improved by incorporating each tree's confidence. |
|
@dikejiang Great, thanks! |
|
@mengxr OK to go? |
|
@dikejiang Apologies--I think I was not clear. I was recommending that you change this PR to implement predictRaw(), rather than predictWithWeight(). Does that sound reasonable? Since predictRaw gives more info than predictWithWeight, it seems best to only include predictRaw. Thanks! |
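One way to read "predictRaw gives more info than predictWithWeight" is sketched below in plain Scala. This is an assumption-laden illustration, not the API from the linked PR: it supposes that a raw prediction is a vector of per-class vote totals, and the names `RawPredictionSketch`, `numClasses`, and `topWithWeight` are hypothetical.

```scala
// Hypothetical illustration only; the real predictRaw semantics may differ.
object RawPredictionSketch {

  /** Per-class vote totals: entry i is the summed weight of trees voting for class i. */
  def predictRaw(
      treePredictions: Seq[Int],
      treeWeights: Seq[Double],
      numClasses: Int): Array[Double] = {
    require(treePredictions.length == treeWeights.length, "need one weight per tree")
    val raw = Array.fill(numClasses)(0.0)
    treePredictions.zip(treeWeights).foreach { case (category, weight) =>
      raw(category) += weight
    }
    raw
  }

  /** The top label and its share of the total vote can be derived from the raw vector. */
  def topWithWeight(raw: Array[Double]): (Int, Double) = {
    val best = raw.indices.maxBy(i => raw(i))
    (best, raw(best) / raw.sum)
  }
}
```

Under these assumptions, the top label and its weight are recoverable from the raw vector, while the reverse is not true, so a caller who needs the score of a non-winning class (for example, the rare positive class) can read it directly from the vector.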
change function names: predictWithWeight -> predictRaw, predictByVotingWithWeight -> predictRawByVoting
|
@dikejiang Do you still plan to update this PR to return a Vector of probabilities? I'm planning a major reorganization of trees & ensembles APIs here: [https://issues.apache.org/jira/browse/SPARK-6113] |
|
Can one of the admins verify this patch? |
|
@dikejiang This work is now being done here: [https://issues.apache.org/jira/browse/SPARK-3727] If you still want to work on this task, please coordinate on the JIRA I linked. Thanks! |