Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample operator #368

Merged
merged 6 commits into from
Jan 23, 2020
Merged

Sample operator #368

merged 6 commits into from
Jan 23, 2020

Conversation

okennedy
Copy link
Member

This branch adds a new Operator called DrawSample, which makes sampling a first-class primitive (as a step towards addressing #362). Currently support exists for two forms of sampling:

  • Uniform: Draw a flat % of the input data.
  • Stratified: Indicate a column and a value -> % map. Pass through the specified % of records, depending on the value in the indicated column.

Both of these sampling are probabilistic (you get approximately the specified %, rather than exactly that amount).

The choice to make DrawSample a distinct operator (rather than folding it into, say Select) was motivated by the following:

  • Mimir and Spark handle functions differently. Both % based sampling strategies rely on seeded calls to rand() for consistency... at present, this has the potential to create inconsistent sample sets depending on where rand() is executed (see Replace UDFs/UDAs with Spark's Catalog #361)
  • Implementing stratified sampling in Mimir is tricky, since the sampler needs to track the value -> % map. This is implemented in Spark by inlining the map into a closure through spark's udf method. It would be possible in Mimir if we had support for dictionary types... but we don't. Thus, the only way to implement this (without resorting to repeated string parsing in a tight loop) is directly in RAToSpark... which means we need something on the order of an Operator to pull it off.

I would like to revisit this decision, ideally once we add a more robust typesystem into Mimir (e.g., #367).

A few more notes:

  • The sampling operators are currently not exposed to the outside world. Initially, I think the right way to do this would be through the Vizier API. I want to get the Mimir model rewrite in place before exposing it through the Mimir SQL API. @mrb24 : I need some help figuring out how to muck with that code, since it looks generated and I don't know where the source lives.
  • Adding deterministic sampling (i.e., give me EXACTLY # rows) is going to be tricky... the easiest way would be some sort of shuffle + limit...
  • The current stratified sampling operator is precise, but a little user-hostile. An ideal version of the operator would first do a group-by count to find the # in each group, and then select sampling rates to ensure equal representation from each group. This would be best implemented using the new model framework.

- Added a simple DrawSamples operator and fixed all the resulting warnings (this *should* mean that DrawSamples is now properly handled by the codebase...)
- SparkToRA no longer has a fall-through case (these... really shouldn't exist on a sealed trait matcher)

DrawSamples in its current form can follow one of several 'modes' defined in [[mimir.algebra.sample]]
 - [[SampleRowsUniformly]]: Include each row independently with a fixed probability
 - [[SampleStratifiedOn]]: Include each row independently with a probability based on a the row's value in a specified column.

The mode defines rules for compiling the specified expression.  This generally happens through some sort of call to rand() < p.
- Added serialization test case for sampling
  - Fixed a small JSON parsing bug in the deserialization step
- Added test cases for sampling execution
  - I somehow managed to miss a case handler for Provenance.compile.  That's fixed now
  - We seem to be generally producing samples of the right size!
- Added some extra sampling-related Kryo superclasses to SparkUtilsSpec and SparkUtils
- Added sampling to OperatorConstructors
- Fixed unrelated test failures in DatasetShapeSpec and SParkMimirCSVDataSourceSpec caused by changing the caveat message for type inference
- Adding formal Scala Play encode/decode support for SamplingModes
- Tweaking the Create Sample update UX.
- Adding API route /view/sample with CreateSampleRequest / CreateSampleResponse to create samples of a dataset
@mrb24 mrb24 merged commit 480eb60 into master Jan 23, 2020
@mrb24
Copy link
Contributor

mrb24 commented Jan 23, 2020

@okennedy Create view statements were not parsing after merging this pull. I made a change to MimirSQL.scala to fix it. Was there a reason for having what I removed in this commit?
commit: 6e91f3b

@okennedy okennedy deleted the SampleOperator branch January 24, 2020 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants