[FLINK-1901] [core] Create sample operator for Dataset. #949
Conversation
Previously, I planned to leave the sample Scala API to a separate PR since I'm not very familiar with Scala, but the failed build shows that Flink has a test to make sure the Scala and Java APIs match. I'll try to add the Scala API and an integration test later.
The Scala API and its integration test have been merged into the latest commit. @tillrohrmann, do you have time to review this PR?
Thanks for your contribution @ChengXiangLi. I'll try to review your PR tomorrow morning.
I think, if I'm not mistaken, that hasNext has to be idempotent. Thus it should return true if current != null.
Could we put the last return statement in an else branch?
Thanks for your contribution @ChengXiangLi. The code is really well tested and well structured. Great work :-) I had only some minor comments. There is however one thing I'm not so sure about. With the current implementation, all parallel tasks of the sampling operator will get the same random generator/seed value. Thus, every node will generate the same sequence of random numbers. I think this can have a negative influence on the sampling. What we could do is to use a RichMapPartitionFunction and derive a distinct seed for each parallel task from its subtask index.
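The per-task seeding idea can be sketched outside of Flink as plain Java. Here `subtaskIndex` stands in for what a `RichMapPartitionFunction` would obtain via `getRuntimeContext().getIndexOfThisSubtask()`, and the simple `baseSeed + subtaskIndex` derivation is only an illustrative assumption, not the PR's actual code:

```java
import java.util.Random;

public class PerTaskSeed {
    // Derive a distinct but reproducible RNG per parallel task from one base seed.
    // The derivation scheme (baseSeed + subtaskIndex) is an assumption for illustration.
    static Random rngForSubtask(long baseSeed, int subtaskIndex) {
        return new Random(baseSeed + subtaskIndex);
    }

    public static void main(String[] args) {
        // Two subtasks with the same base seed produce different sequences...
        long s0 = rngForSubtask(42L, 0).nextLong();
        long s1 = rngForSubtask(42L, 1).nextLong();
        System.out.println(s0 != s1);
        // ...but each subtask's sequence is reproducible across runs.
        System.out.println(s0 == rngForSubtask(42L, 0).nextLong());
    }
}
```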
Thanks for the review, Till. As for your last suggestion, I was aware of this issue before, but I didn't know about the existence of RichMapPartitionFunction. Thanks for the reminder; I'll update the code soon.
Force-pushed from 1afae81 to 23b93d7.
Hi @tillrohrmann, the current implementation of sample with fixed size generates a fixed-size sample for each partition rather than for the whole dataset, while users most likely expect the latter most of the time. I'm researching how to sample a fixed number of elements randomly from a distributed data stream; I think we can pause this PR review until I push the fix.
You're right that users usually want a fixed size for the whole sample. An easy solution could be to assign each item an index. Just ping me when you've found a solution for the problem. Looking forward to reviewing it :-)
I have worked on this problem before. The idea is to divide the data into blocks and find the probability of selecting an element from each block:

blockNumbers = (block_id, data) -> (block_id, count)
(1...k) -> (list of block ids we'll be sampling from)
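A minimal, non-distributed sketch of the block-based idea above (the method names and the proportional-draw details are illustrative assumptions, not the PR's code): count elements per block, then draw k block ids with probability proportional to their counts:

```java
import java.util.*;

public class BlockSampling {
    // Draw k block ids, each chosen with probability proportional to its element count.
    static List<Integer> sampleBlockIds(Map<Integer, Integer> blockCounts, int k, Random rnd) {
        long total = 0;
        for (int c : blockCounts.values()) {
            total += c;
        }
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            // Pick a point in [0, total) and find the block whose cumulative range contains it.
            long target = (long) (rnd.nextDouble() * total);
            long cumulative = 0;
            for (Map.Entry<Integer, Integer> e : blockCounts.entrySet()) {
                cumulative += e.getValue();
                if (target < cumulative) {
                    chosen.add(e.getKey());
                    break;
                }
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(0, 100);
        counts.put(1, 900);
        // Block 1 holds 90% of the data, so it should be drawn far more often.
        System.out.println(sampleBlockIds(counts, 5, new Random(1L)));
    }
}
```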
Force-pushed from 23b93d7 to 6e39f84.
Thanks for the input, @tillrohrmann and @sachingoel0101. I'd like to implement the fixed-size sampling with only one pass through the source dataset: when a user samples a dataset, the dataset is usually quite large, so passing through it multiple times would add considerable cost. In my solution, the basic idea of fixed-size sampling over a distributed stream is: generate a random number for each input element as its weight and select the top K elements by weight; since the weights are generated randomly, the selected top K elements form a random sample. You can find more details in the code and Javadoc.
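The one-pass idea described above — give every element a uniformly random weight and keep the K largest — can be sketched with a bounded min-heap. This is an illustrative standalone version, not the PR's actual implementation:

```java
import java.util.*;

public class WeightedTopKSample {
    // One pass: assign each element a random weight, keep the K elements with largest weight.
    static <T> List<T> sample(Iterable<T> input, int k, long seed) {
        Random rnd = new Random(seed);
        // Min-heap on weight, so the smallest-weight element is evicted first.
        PriorityQueue<AbstractMap.SimpleEntry<Double, T>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.getKey(), b.getKey()));
        for (T element : input) {
            double weight = rnd.nextDouble();
            if (heap.size() < k) {
                heap.add(new AbstractMap.SimpleEntry<>(weight, element));
            } else if (weight > heap.peek().getKey()) {
                heap.poll();
                heap.add(new AbstractMap.SimpleEntry<>(weight, element));
            }
        }
        List<T> result = new ArrayList<>();
        for (AbstractMap.SimpleEntry<Double, T> e : heap) {
            result.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        // Exactly 10 elements come back, drawn uniformly at random.
        System.out.println(sample(data, 10, 42L).size()); // 10
    }
}
```

Because every element gets an independent uniform weight, each subset of size K is equally likely to survive; merging the per-partition top-K lists and taking the global top K then yields a fixed-size sample of the whole dataset in one pass.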
Hello @ChengXiangLi, perhaps this paper may help in deciding which sampling algorithm to use for the exact sample size case. It provides an implementation specifically designed for a MapReduce environment.
Thanks, @thvasilo. That paper introduces a random sampling algorithm that extends the one I described before. It uses two thresholds to filter elements before sorting: an element with weight above the upper threshold is included in the final top K with very high probability, and an element with weight below the lower threshold is excluded with very high probability. With an accepted failure probability, we can therefore accept elements above the upper threshold and reject elements below the lower threshold without sorting, and only sort the elements whose weight lies between the two. It's a very good algorithm and I've noted it down for a future improvement, but I don't want to implement it right away. This PR is large enough for me, so I'd like to leave the algorithmic optimizations for the future and keep the basic implementations of the sampling algorithms here: simple, easy to understand, correct, and usable as a performance baseline for further improvements.
Agreed, we can take a look at the optimized algorithm later.
Hi @tillrohrmann, it's ready for review now.
Adding comments to some of the verify[..] functions explaining what each one verifies would help with code understanding.
Thanks, @thvasilo. I've added more comments in the latest commit.
Force-pushed from a95b33e to 97e4cd0.
The Javadoc here is copied from the fraction-based sample function.
Sorry for the late review @ChengXiangLi. I finished it now and the PR looks really good. There is only one thing I would like to change before merging it: moving the sample methods into a utility class such as DataSetUtils.

All of Flink's internal operations work on serialized data which is stored in Flink's managed memory. The managed memory allows Flink to detect when to spill elements from memory to disk to avoid out-of-memory exceptions and, thus, to make the system robust. The managed memory is a pre-allocated area of memory which is administered by Flink's memory manager.

I think that your current implementation will work for most use cases, but in order to make it part of the core API, we also have to deal with the case where our sample cannot be materialized on the remaining heap of a running Flink program. In order to achieve this, I think it would be necessary to implement a native sampling operator that works on managed memory. For the topK implementation, we would need something like a sorter that operates on serialized data in managed memory.

I know that you've already put a lot of effort into writing the sampling operator and that this result might be a little bit demotivating. However, if we want to make it right and robust, then I think this is the way to go. Additionally, we would add a proper topK operator to Flink's API, which is missing big time :-) If you want to, you could also take the lead here. The further discussion should then happen in a separate issue. I'm more than willing to assist you in implementing this operator. What do you think?
Thanks for the detailed explanation, @tillrohrmann. As the owner of this issue, making it work correctly and efficiently is my top priority, so I would never say no to a reasonable and proper suggestion (although I did expect it to come earlier :-)).
@ChengXiangLi, you're right, I should have noticed earlier and raised a flag. That's my bad, sorry. But your work is not in vain; I think it's an excellent piece of work. For the sake of completeness, let's do it once the topK operator is in place.
OK @tillrohrmann, great to hear that. Besides, I've created FLINK-2549 for the topK operator.
Oh cool, then you were faster than me :-)
Force-pushed from 96fd822 to c50bc27.
Moved the sample/sampleWithSize operators to DataSetUtils and updated the unit tests.
Looks great. Thanks a lot for your contribution @ChengXiangLi. Will merge it now.
Hey @ChengXiangLi, I just observed a failure on a test case: https://travis-ci.org/sachingoel0101/flink/jobs/76649177
It's great to have this in; I'll try to update the cross-validation and SGD to use it.
Hi @sachingoel0101, when sampling with a fraction, it's not easy to verify whether the DataSet was sampled with exactly the input fraction. In the test, I take 5 samples, use the average size to compute the observed fraction, and then verify that the observed fraction differs from the input fraction by no more than 10%. The following can still happen: the sampler samples the DataSet with the input fraction, but the sampled result is so small or so large that it falls outside our verification bound. This happens with very low probability, say less than 0.001 in this test. It should be OK if this failure happens only very occasionally; please let me know if you find that it doesn't.
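The verification scheme described above can be sketched as follows (a standalone illustration, not the actual test code): run a Bernoulli sampler several times, average the sample sizes, and check the observed fraction against the requested one with a 10% tolerance:

```java
import java.util.Random;

public class FractionVerification {
    // Bernoulli sampling: keep each element independently with probability `fraction`.
    static int sampleSize(int sourceSize, double fraction, Random rnd) {
        int count = 0;
        for (int i = 0; i < sourceSize; i++) {
            if (rnd.nextDouble() < fraction) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int sourceSize = 10_000;
        double fraction = 0.3;
        Random rnd = new Random(0L);
        // Average the sample size over 5 runs, as in the test described above.
        double total = 0;
        for (int run = 0; run < 5; run++) {
            total += sampleSize(sourceSize, fraction, rnd);
        }
        double observed = total / 5 / sourceSize;
        // Verify the observed fraction is within 10% of the requested one.
        System.out.println(Math.abs(observed - fraction) / fraction <= 0.1);
    }
}
```

With 10,000 elements and 5 runs, the averaged fraction concentrates tightly around the requested one, so the 10% bound fails only with very small probability — which is exactly the residual flakiness discussed in this thread.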
@ChengXiangLi, I know that it is hard to verify a random sampler implementation, but we need to fix this failing test because it makes verifying other pull requests difficult: some builds of other pull requests fail because of the K-S test and the fraction sampling test. There is a JIRA issue covering this. I'm testing with an increased number of samples and a larger source size; if I get a notable result, I'll post it.
Thanks, @chiwanpark, waiting for your result.
Hey @ChengXiangLi, I have another concern regarding the seed for sampling. It doesn't seem to serve its purpose: I tried sampling with a fraction three times with the same seed, yet I got different results every time.
@thvasilo, the PR did not yet include the proper sampling behaviour within iterations. See FLINK-2396.
@sachingoel0101, you're right. The problem is that Flink does not give you a guarantee in which order the elements will arrive. But this problem won't be fixed by setting the seed for all sampling operators to the same value. There always might be an operator, e.g. rebalance, which will completely randomize your element order.
Yes. I was only wondering whether we should at least ensure this when the sampling happens right at the source, though.
IMHO, this makes understanding the semantics of the sampling operator only more complicated because it behaves differently depending on the job graph structure. I would rather document this limitation more prominently.
[FLINK-1901] [core] Enable sample with fixed size on the whole dataset.
[FLINK-1901] [core] Add more comments for RandomSamplerTest.
[FLINK-1901] [core] Refactor PoissonSampler output Iterator.
[FLINK-1901] [core] Move sample/sampleWithSize operator to DataSetUtils.
Adds notes for commons-math3 to LICENSE and NOTICE file.

This closes apache#949.