METRON-1350: Add reservoir sampling functions to Stellar #867

cestella · 2017-12-13T21:07:12Z

Contributor Comments

Sampling capabilities would fit very well with the profiler and enable algorithms that do not necessarily support our existing probabilistic sketches. We should add a reservoir sampler and utilities to merge and resample.

You can play with SAMPLE_INIT, SAMPLE_ADD, SAMPLE_MERGE and SAMPLE_GET in the REPL (launched with mvn exec:java -Dexec.mainClass="org.apache.metron.stellar.common.shell.StellarShell" -pl metron-analytics/metron-statistics):

[Stellar]>>> ?SAMPLE_ADD
SAMPLE_ADD
Description: Add to a sample

Arguments:
	sampler - Sampler to use.  If null, then a default Uniform sampler is created
	o - The value to add.  If o is an Iterable, then each item is added.

Returns:
[Stellar]>>> s_10 := SAMPLE_INIT(10)
[Stellar]>>> sample := REDUCE( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ], (s, x) -> SAMPLE_ADD(s, x), SAMPLE_INIT(5))
[Stellar]>>> SAMPLE_GET(sample)
[6, 8, 11, 4, 5]
[Stellar]>>> SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ])
org.apache.metron.statistics.sampling.UniformSampler@3d8d06c0
[Stellar]>>> SAMPLE_GET(SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ]))
[5, 2, 5, 7, 10, 5, 2, 5, 7, 10]

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:

For all changes:

Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
```
mvn -q clean integration-test install && build_utils/verify_licenses.sh 
```
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via site-book/target/site/index.html:
```
cd site-book
mvn site
```

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

simonellistonball · 2017-12-13T21:18:04Z

Should the size limit on the sample really be a cutoff? In a likely usage scenario, a user would sample over a window in a profile. Limiting the size is likely to skew the sample toward the beginning of the window rather than being genuinely uniform. Would a random replacement strategy make more sense once over the limit? This could be a lot heavier in terms of performance, but may be more mathematically sound.

cestella · 2017-12-13T21:28:56Z

Sorry, I am not sure I understand; this does random replacement once past the size limit. Am I misunderstanding your question?

cestella · 2017-12-13T21:30:47Z

...cs/metron-statistics/src/main/java/org/apache/metron/statistics/sampling/UniformSampler.java

+    if (reservoir.size() < size) {
+      reservoir.add(o);
+    } else {
+      int rIndex = rng.nextInt(seen + 1);


Just so I'm clear: up to the reservoir size, we add to the reservoir. Once we're past the reservoir size, we do a random replacement as per https://en.wikipedia.org/wiki/Reservoir_sampling
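For readers following the thread, here is a minimal sketch of the fill-then-replace behavior described above (the classic "Algorithm R" from the linked article). The class name `ReservoirSketch`, the seed parameter, and the `rIndex < size` replacement check are assumptions for illustration; only the first four lines of `add` appear in the diff, so this is not necessarily how `UniformSampler` finishes the branch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative uniform reservoir sampler (Algorithm R); NOT the Metron class. */
final class ReservoirSketch<T> {
  private final List<T> reservoir;
  private final int size;
  private final Random rng;
  private int seen = 0; // number of items offered so far

  ReservoirSketch(int size, long seed) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    this.size = size;
    this.reservoir = new ArrayList<>(size);
    this.rng = new Random(seed);
  }

  void add(T o) {
    if (reservoir.size() < size) {
      // Still filling: every item is kept.
      reservoir.add(o);
    } else {
      // Past the limit: pick a slot in [0, seen]; replace only if it falls
      // inside the reservoir, which happens with probability size / (seen + 1).
      int rIndex = rng.nextInt(seen + 1);
      if (rIndex < size) {
        reservoir.set(rIndex, o);
      }
    }
    seen++;
  }

  /** Returns a copy so callers cannot mutate the reservoir. */
  List<T> get() {
    return new ArrayList<>(reservoir);
  }

  public static void main(String[] args) {
    ReservoirSketch<Integer> sample = new ReservoirSketch<>(5, 0L);
    for (int i = 1; i <= 11; i++) {
      sample.add(i);
    }
    // Five of the eleven values; which ones depends on the seed.
    System.out.println(sample.get());
  }
}
```

Because late items can still displace early ones, the sample is not skewed toward the start of the window, which is the concern raised in the first comment.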

You are 100% right; that's what I get for skim reading.

Shouldn't we reference Reservoir Sampling in the documentation? Then the use of Universal and other terms would be more in context.

This makes me think that we need "namespace" scoped documentation

I modified the docs to link to the Wikipedia article on reservoir sampling and made things a bit clearer regarding Uniform.

ottobackwards

Very nice @cestella, a couple of comments.

ottobackwards · 2017-12-13T21:20:16Z

metron-analytics/metron-statistics/README.md

+
+#### `SAMPLE_ADD`
+* Description: Add a value or collection of values to a sampler.
+* Input:


This makes it seem like the Uniform sampler is a 'known' thing. But it is not, either by explanation or by reference to where it is explained (as we have done when referring to algorithms before).
Is there another type of sampler?

Somewhere (I'm not sure where) we should say:
"A sampler is a xxxxx that is | does | acts as xxxxx for the sample functions. The default has these properties, but you can override that in init"

Why even mention the Universal?

Sorry, uniform here is intended to mean that each element has an equal probability of being in the sample (e.g. the probability is drawn from a uniform probability distribution). I can probably do a better job documenting.
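As a short worked check of the "equal probability" claim, assuming the standard replacement rule from the Wikipedia article (the m-th item, for m > k, replaces a uniformly chosen slot with probability k/m), every item in a stream of n items ends up in a reservoir of size k with probability k/n. The symbols k, n, i and m are introduced here for illustration and do not appear in the PR.

```latex
% Reservoir of size k, stream of n >= k items.
% Item m (m > k) evicts any given slot with probability (k/m) * (1/k) = 1/m.
\begin{align*}
\Pr[\text{item } i \le k \text{ is in the final sample}]
  &= \prod_{m=k+1}^{n}\left(1 - \frac{k}{m}\cdot\frac{1}{k}\right)
   = \prod_{m=k+1}^{n}\frac{m-1}{m}
   = \frac{k}{n},\\
\Pr[\text{item } i > k \text{ is in the final sample}]
  &= \frac{k}{i}\prod_{m=i+1}^{n}\frac{m-1}{m}
   = \frac{k}{i}\cdot\frac{i}{n}
   = \frac{k}{n}.
\end{align*}
```

Both cases give k/n, so position in the stream does not affect the chance of being sampled.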

Couldn't this be simplified to:

    if (ret == null) {
      if (obj != null) {
        throw new IllegalStateException(argName + "argument(" + obj + " is expected to be an " + expectedClazz.getName() + ", but was " + obj);
      }
    }
    return Optional.ofNullable(ret);

OK, it seemed like the Uniform implementation was leaking.

There are definitely other types of reservoir samplers which we will probably want. Most specifically a sampler that is biased toward recency (so non-uniform in that case).

Recency would surely be more relevant for merged resampling in a profile context?

They're both needed. Some use-cases would be fine without bias and some would be better with bias. As a follow-on, I was planning on adding a biased sampler, but this is a big enough PR without it.

Then we'll have a get sample types method, like we do with other things like this right?

Actually, more than likely it'd be a separate init, since each type is going to take different parameters depending on the algorithm. So a biased sampler would be SAMPLE_INIT_BIASED(size, ...). I should say, the rest of the functions only presume the Sampler interface, so they should work with biased samplers, uniform samplers, or any other kind of sampler.
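To illustrate the point about the other functions depending only on an interface, here is a rough sketch. Everything in it is hypothetical: `SamplerSketch`, `MostRecentSampler`, and `SampleFunctionsSketch.sampleAdd` are made-up names, and the real Metron `Sampler` interface may expose different methods. The toy sampler simply keeps the newest items, as a stand-in for a genuinely recency-biased algorithm, and the null-sampler default described in the SAMPLE_ADD help is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical minimal contract; the real Metron Sampler interface may differ. */
interface SamplerSketch {
  void add(Object o);
  List<Object> get();
}

/** Toy stand-in for a recency-biased sampler: keeps only the newest `size` items. */
final class MostRecentSampler implements SamplerSketch {
  private final Deque<Object> items = new ArrayDeque<>();
  private final int size;

  MostRecentSampler(int size) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    this.size = size;
  }

  @Override
  public void add(Object o) {
    if (items.size() == size) {
      items.removeFirst(); // evict the oldest item
    }
    items.addLast(o);
  }

  @Override
  public List<Object> get() {
    return new ArrayList<>(items);
  }
}

final class SampleFunctionsSketch {
  /** Follows the documented SAMPLE_ADD behavior: an Iterable has each item added. */
  static SamplerSketch sampleAdd(SamplerSketch sampler, Object o) {
    if (o instanceof Iterable) {
      for (Object x : (Iterable<?>) o) {
        sampler.add(x);
      }
    } else {
      sampler.add(o);
    }
    return sampler;
  }

  public static void main(String[] args) {
    SamplerSketch s = new MostRecentSampler(3);
    sampleAdd(s, Arrays.asList(1, 2, 3, 4, 5));
    sampleAdd(s, 6);
    System.out.println(s.get());
  }
}
```

With this shape, adding a new sampler type only needs a new init function; the add and get helpers stay unchanged.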

ottobackwards · 2017-12-14T14:46:01Z

I'm +1 by inspection, this looks great!

cestella · 2017-12-14T14:47:54Z

I'll keep that +1 in my pocket...I want to spin it up in full-dev and try it in conjunction with a PR that I'm not quite done with yet. Let's hold off merging until I can validate that I haven't botched the API in some horrible way for sampling ;)

ottobackwards · 2017-12-14T14:48:54Z

@cestella comments?
https://issues.apache.org/jira/browse/METRON-1361

cestella · 2017-12-14T15:10:28Z

I commented on the ticket, but will repeat a portion of it here at the risk of this being quite off-topic. @justinleet has some great ideas/prototypes about automatic documentation generation from Stellar. I'd support a new annotation for documenting namespaces.

ottobackwards · 2017-12-14T15:32:15Z

Any chance I can see @justinleet's ideas/prototypes?

justinleet · 2017-12-14T16:46:08Z

@ottobackwards Left a comment on the ticket you made: https://issues.apache.org/jira/browse/METRON-1361?focusedCommentId=16291159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16291159

METRON-1350: Add reservoir sampling functions to Stellar

7e1a19e

cestella commented Dec 13, 2017

ottobackwards reviewed Dec 13, 2017

Simplifying.

2149683

Added the ability to specify a sampler to merge into.

5a2a14b

cestella mentioned this pull request Dec 15, 2017

METRON-1364: Add an implementation of Robust PCA outlier detection #870

Closed

10 tasks

asfgit closed this in 3f0b1b7 Dec 20, 2017

iraghumitra pushed a commit to iraghumitra/incubator-metron that referenced this pull request Feb 17, 2018

METRON-1350: Add reservoir sampling functions to Stellar closes apach…

1de68bd

…e#867
