This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1350: Add reservoir sampling functions to Stellar #867

Closed
wants to merge 3 commits into from


cestella
Member

@cestella cestella commented Dec 13, 2017

Contributor Comments

Sampling capabilities would fit very well with the profiler and enable algorithms that do not necessarily support our existing probabilistic sketches. We should add a reservoir sampler and utilities to merge and resample.

You can play with SAMPLE_INIT, SAMPLE_ADD, SAMPLE_MERGE and SAMPLE_GET via the REPL (via mvn exec:java -Dexec.mainClass="org.apache.metron.stellar.common.shell.StellarShell" -pl metron-analytics/metron-statistics):

[Stellar]>>> ?SAMPLE_ADD
SAMPLE_ADD
Description: Add to a sample

Arguments:
	sampler - Sampler to use.  If null, then a default Uniform sampler is created
	o - The value to add.  If o is an Iterable, then each item is added.

Returns:
[Stellar]>>> s_10 := SAMPLE_INIT(10)
[Stellar]>>> sample := REDUCE( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ], (s, x) -> SAMPLE_ADD(s, x), SAMPLE_INIT(5))
[Stellar]>>> SAMPLE_GET(sample)
[6, 8, 11, 4, 5]
[Stellar]>>> SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ])
org.apache.metron.statistics.sampling.UniformSampler@3d8d06c0
[Stellar]>>> SAMPLE_GET(SAMPLE_ADD(s_10, [5, 2, 5, 7, 10 ]))
[5, 2, 5, 7, 10, 5, 2, 5, 7, 10]
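The transcript above does not exercise SAMPLE_MERGE. As a rough sketch of what a uniform merge could look like (an assumption for illustration, not the code in this PR): if each sampler remembers how many items it has been offered, two samples can be combined by repeatedly choosing a source with probability proportional to the number of items it still represents, then drawing a random retained item from that source. The names `MergeSketch`, `Sample`, and `merge` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not the SAMPLE_MERGE implementation from this PR.
class MergeSketch {

  // A uniform sample plus the number of items offered to the sampler that produced it.
  static final class Sample<T> {
    final List<T> items;
    final long seen;

    Sample(List<T> items, long seen) {
      this.items = items;
      this.seen = seen;
    }
  }

  // Merge two uniform samples into a uniform sample of at most `size` items.
  // Assumes each input still holds at least min(seen, size) items.
  static <T> Sample<T> merge(Sample<T> a, Sample<T> b, int size, Random rng) {
    if (a.items.size() < Math.min(a.seen, size) || b.items.size() < Math.min(b.seen, size)) {
      throw new IllegalArgumentException("input samples are too small to merge at this size");
    }
    List<T> poolA = new ArrayList<>(a.items);
    List<T> poolB = new ArrayList<>(b.items);
    long remA = a.seen; // items of stream A not yet accounted for
    long remB = b.seen; // items of stream B not yet accounted for
    List<T> out = new ArrayList<>();
    while (out.size() < size && remA + remB > 0) {
      // Choose stream A with probability remA / (remA + remB).
      boolean fromA = rng.nextDouble() < (double) remA / (remA + remB);
      List<T> pool = fromA ? poolA : poolB;
      // Draw without replacement from the chosen sample.
      out.add(pool.remove(rng.nextInt(pool.size())));
      if (fromA) {
        remA--;
      } else {
        remB--;
      }
    }
    return new Sample<>(out, a.seen + b.seen);
  }

  public static void main(String[] args) {
    // Values borrowed from the transcript: 5 retained out of 11 seen, and 5 out of 5 seen.
    Sample<Integer> a = new Sample<>(Arrays.asList(6, 8, 11, 4, 5), 11);
    Sample<Integer> b = new Sample<>(Arrays.asList(5, 2, 5, 7, 10), 5);
    System.out.println(merge(a, b, 5, new Random()).items);
  }
}
```

Weighting by the number of items seen, rather than by the number retained, is what keeps a sample taken over a busy window from being under-represented after the merge.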

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks has been executed in the root metron folder via:

    mvn -q clean integration-test install && build_utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and/or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, then run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@simonellistonball
Contributor

Should the size limit on the sample really be a cutoff? In a likely usage scenario, a user would sample over a window in a profile. Limiting the size is likely to skew the sample toward the beginning of the window rather than being genuinely uniform. Would a random replacement strategy make more sense when over the limit? This could be a lot heavier in terms of performance, but may be more mathematically sound.

@cestella
Member Author

Sorry, I'm not sure I understand: this does perform random replacement once past the size limit. Am I misunderstanding your question?

if (reservoir.size() < size) {
  reservoir.add(o);
} else {
  int rIndex = rng.nextInt(seen + 1);
Member Author

Just so I'm clear: up to the reservoir size, we add to the reservoir. Once we're past the reservoir size, we do a random replacement as per https://en.wikipedia.org/wiki/Reservoir_sampling
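For readers skimming the thread, here is a minimal, self-contained sketch of the fill-then-replace rule described above, built around the four lines quoted from the diff. The class name `ReservoirSketch`, the seeded constructor, and the `get` method are assumptions for illustration; this is not the PR's `UniformSampler`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of fill-then-replace reservoir sampling.
class ReservoirSketch<T> {
  private final List<T> reservoir = new ArrayList<>();
  private final int size;
  private final Random rng;
  private int seen = 0; // number of items offered before the current one

  ReservoirSketch(int size, long seed) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    this.size = size;
    this.rng = new Random(seed);
  }

  void add(T o) {
    if (reservoir.size() < size) {
      // Still filling: keep everything.
      reservoir.add(o);
    } else {
      // Past the size limit: this is item number seen + 1, so it is kept
      // with probability size / (seen + 1), replacing a random resident.
      int rIndex = rng.nextInt(seen + 1);
      if (rIndex < size) {
        reservoir.set(rIndex, o);
      }
    }
    seen++;
  }

  List<T> get() {
    return new ArrayList<>(reservoir);
  }

  public static void main(String[] args) {
    ReservoirSketch<Integer> sampler = new ReservoirSketch<>(5, 0L);
    for (int i = 1; i <= 11; i++) {
      sampler.add(i);
    }
    // Prints five of the values 1..11; which five depends on the seed.
    System.out.println(sampler.get());
  }
}
```

Late arrivals are not simply dropped once the reservoir is full, which is why the sample does not skew toward the start of a window.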

Contributor

You are 100% right; that's what I get for skim-reading.

Contributor

Shouldn't we reference Reservoir Sampling in the documentation? Then the use of Uniform and other terms would be more in context.

Contributor

This makes me think that we need "namespace"-scoped documentation.

Member Author

I modified the docs to link to the Wikipedia article on reservoir sampling and made things a bit clearer regarding Uniform.

Contributor

@ottobackwards ottobackwards left a comment

Very nice, @cestella. A couple of comments.


#### `SAMPLE_ADD`
* Description: Add a value or collection of values to a sampler.
* Input:
Contributor

This makes it seem like the Uniform sampler is a 'known' thing. But it is not, either by explanation or by reference to where it is explained (as we have done when referring to algorithms before).
Is there another type of sampler?

Somewhere (I'm not sure where) we should say:
"A sampler is a xxxxx that is | does | acts as xxxxx for the sample functions. The default has these properties, but you can override that in init"

Why even mention Uniform?

Member Author

Sorry, uniform here is intended to mean that each element has an equal probability of being in the sample (i.e. the selection probability is drawn from a uniform probability distribution). I can probably do a better job documenting.
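To spell out the "equal probability" claim, here is the standard argument for the fill-then-replace rule. It is not taken from the PR itself, and the symbols are notation introduced here: k is the reservoir size, n ≥ k is the number of items offered, and i is the position of an item in the stream. An item arriving at position i > k is admitted with probability k/i, and at each later step j it is evicted only if the new item is admitted (probability k/j) and lands on its slot (probability 1/k):

```latex
\begin{align*}
P(\text{item } i \text{ is in the final sample})
  &= \frac{k}{i} \prod_{j=i+1}^{n} \left(1 - \frac{k}{j}\cdot\frac{1}{k}\right) \\
  &= \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j}
   = \frac{k}{i}\cdot\frac{i}{n}
   = \frac{k}{n}, \qquad i > k.
\end{align*}
```

For one of the first k items, admission is certain and the same telescoping product runs from j = k+1 to n, which again gives k/n. Every item therefore has the same chance of being retained, regardless of when it arrived.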

Contributor

Couldn't this be simplified to

if (ret == null) {
  if (obj != null) {
    throw new IllegalStateException(argName + " argument (" + obj
                                    + ") is expected to be an " + expectedClazz.getName()
                                    + ", but was " + obj
                                    );
  }
}
return Optional.ofNullable(ret);

Contributor

OK, it seemed like the Uniform implementation was leaking.

Member Author

There are definitely other types of reservoir samplers which we will probably want, most specifically a sampler that is biased toward recency (so non-uniform in that case).

Contributor

Recency would surely be more relevant for merged resampling in a profile context?

Member Author

@cestella cestella Dec 13, 2017

They're both needed. Some use-cases would be fine without bias and some would be better with bias. As a follow-on, I was planning on adding a biased sampler, but this is a big enough PR without it.

Contributor

Then we'll have a get-sample-types method, like we do with other things like this, right?

Member Author

@cestella cestella Dec 13, 2017

Actually, more than likely it'd be a separate init, since each type is going to have different parameters depending on the algorithm. So a biased sampler would be SAMPLE_INIT_BIASED(size, ...). I should say, the rest of the functions only presume the Sampler interface, so they should work with biased samplers, uniform samplers, or any other kind of sampler.
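A rough sketch of that design, under stated assumptions: the interface name `SamplerSketch` and its two methods are hypothetical stand-ins rather than the PR's actual Sampler API, and the recency-biased rule shown (always admit the new item; evict a random resident with probability equal to the reservoir's fill fraction) is just one simple way to bias toward recent items, not a planned implementation. The point is only that a helper like `addAll` depends on the interface alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical interface; the method names are assumptions, not the PR's Sampler API.
interface SamplerSketch {
  void add(Object o);
  List<Object> get();
}

// Uniform: every item offered so far is equally likely to be retained.
class UniformSketch implements SamplerSketch {
  private final List<Object> reservoir = new ArrayList<>();
  private final int size;
  private final Random rng;
  private int seen = 0;

  UniformSketch(int size, long seed) {
    if (size <= 0) throw new IllegalArgumentException("size must be positive");
    this.size = size;
    this.rng = new Random(seed);
  }

  @Override public void add(Object o) {
    if (reservoir.size() < size) {
      reservoir.add(o);
    } else {
      int rIndex = rng.nextInt(seen + 1);
      if (rIndex < size) reservoir.set(rIndex, o);
    }
    seen++;
  }

  @Override public List<Object> get() { return new ArrayList<>(reservoir); }
}

// Recency-biased: the new item always enters, so older items decay out over time.
class RecencyBiasedSketch implements SamplerSketch {
  private final List<Object> reservoir = new ArrayList<>();
  private final int size;
  private final Random rng;

  RecencyBiasedSketch(int size, long seed) {
    if (size <= 0) throw new IllegalArgumentException("size must be positive");
    this.size = size;
    this.rng = new Random(seed);
  }

  @Override public void add(Object o) {
    double fill = (double) reservoir.size() / size;
    if (rng.nextDouble() < fill) {
      // Evict a random resident to make room (always the case once full).
      reservoir.set(rng.nextInt(reservoir.size()), o);
    } else {
      reservoir.add(o);
    }
  }

  @Override public List<Object> get() { return new ArrayList<>(reservoir); }
}

class SamplerDemo {
  // Works with any implementation because it only presumes the interface.
  static SamplerSketch addAll(SamplerSketch s, Iterable<?> values) {
    for (Object v : values) s.add(v);
    return s;
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);
    System.out.println(addAll(new UniformSketch(5, 0L), data).get());
    System.out.println(addAll(new RecencyBiasedSketch(5, 0L), data).get());
  }
}
```

Keeping the algorithm-specific parameters in the constructors mirrors the idea of a separate init function per sampler type, while add and get stay shared.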

@ottobackwards
Contributor

I'm +1 by inspection, this looks great!

@cestella
Member Author

I'll keep that +1 in my pocket...I want to spin it up in full-dev and try it in conjunction with a PR that I'm not quite done with yet. Let's hold off merging until I can validate that I haven't botched the API in some horrible way for sampling ;)

@ottobackwards
Contributor

@cestella
Member Author

I commented on the ticket, but will repeat a portion of it here at the risk of this being quite off-topic. @justinleet has some great ideas/prototypes about automatic documentation generation from Stellar. I'd support a new annotation for documenting namespaces.

@ottobackwards
Contributor

Any chance I can see @justinleet's ideas/prototypes?

4 participants