Sample operator #368

okennedy · 2020-01-20T19:43:54Z

This branch adds a new Operator called DrawSample, which makes sampling a first-class primitive (as a step towards addressing #362). Currently support exists for two forms of sampling:

Uniform: Draw a flat % of the input data.
Stratified: Indicate a column and a value -> % map. Pass through the specified % of records, depending on the value in the indicated column.

Both of these sampling are probabilistic (you get approximately the specified %, rather than exactly that amount).

The choice to make DrawSample a distinct operator (rather than folding it into, say Select) was motivated by the following:

Mimir and Spark handle functions differently. Both % based sampling strategies rely on seeded calls to rand() for consistency... at present, this has the potential to create inconsistent sample sets depending on where rand() is executed (see Replace UDFs/UDAs with Spark's Catalog #361)
Implementing stratified sampling in Mimir is tricky, since the sampler needs to track the value -> % map. This is implemented in Spark by inlining the map into a closure through spark's udf method. It would be possible in Mimir if we had support for dictionary types... but we don't. Thus, the only way to implement this (without resorting to repeated string parsing in a tight loop) is directly in RAToSpark... which means we need something on the order of an Operator to pull it off.

I would like to revisit this decision, ideally once we add a more robust typesystem into Mimir (e.g., #367).

A few more notes:

The sampling operators are currently not exposed to the outside world. Initially, I think the right way to do this would be through the Vizier API. I want to get the Mimir model rewrite in place before exposing it through the Mimir SQL API. @mrb24 : I need some help figuring out how to muck with that code, since it looks generated and I don't know where the source lives.
Adding deterministic sampling (i.e., give me EXACTLY # rows) is going to be tricky... the easiest way would be some sort of shuffle + limit...
The current stratified sampling operator is precise, but a little user-hostile. An ideal version of the operator would first do a group-by count to find the # in each group, and then select sampling rates to ensure equal representation from each group. This would be best implemented using the new model framework.

- Added a simple DrawSamples operator and fixed all the resulting warnings (this *should* mean that DrawSamples is now properly handled by the codebase...) - SparkToRA no longer has a fall-through case (these... really shouldn't exist on a sealed trait matcher) DrawSamples in its current form can follow one of several 'modes' defined in [[mimir.algebra.sample]] - [[SampleRowsUniformly]]: Include each row independently with a fixed probability - [[SampleStratifiedOn]]: Include each row independently with a probability based on a the row's value in a specified column. The mode defines rules for compiling the specified expression. This generally happens through some sort of call to rand() < p.

- Added serialization test case for sampling - Fixed a small JSON parsing bug in the deserialization step - Added test cases for sampling execution - I somehow managed to miss a case handler for Provenance.compile. That's fixed now - We seem to be generally producing samples of the right size! - Added some extra sampling-related Kryo superclasses to SparkUtilsSpec and SparkUtils - Added sampling to OperatorConstructors - Fixed unrelated test failures in DatasetShapeSpec and SParkMimirCSVDataSourceSpec caused by changing the caveat message for type inference

- Adding formal Scala Play encode/decode support for SamplingModes - Tweaking the Create Sample update UX. - Adding API route /view/sample with CreateSampleRequest / CreateSampleResponse to create samples of a dataset

mrb24 · 2020-01-23T13:18:23Z

@okennedy Create view statements were not parsing after merging this pull. I made a change to MimirSQL.scala to fix it. Was there a reason for having what I removed in this commit?
commit: 6e91f3b

okennedy added 4 commits January 14, 2020 11:08

cleanup

285cf4d

Work in progress

9f8a2f6

okennedy requested a review from mrb24 January 20, 2020 19:43

okennedy mentioned this pull request Jan 20, 2020

Stratified Sampling Operator #362

Closed

okennedy added 2 commits January 21, 2020 15:09

Adding a UX for sampling

158ddde

Connecting sampling directly to the Vizier API

480eb60

- Adding formal Scala Play encode/decode support for SamplingModes - Tweaking the Create Sample update UX. - Adding API route /view/sample with CreateSampleRequest / CreateSampleResponse to create samples of a dataset

okennedy mentioned this pull request Jan 22, 2020

Sampling ui VizierDB/web-api-async#61

Merged

mrb24 merged commit 480eb60 into master Jan 23, 2020

okennedy deleted the SampleOperator branch January 24, 2020 15:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sample operator #368

Sample operator #368

okennedy commented Jan 20, 2020

mrb24 commented Jan 23, 2020

Sample operator #368

Sample operator #368

Conversation

okennedy commented Jan 20, 2020

mrb24 commented Jan 23, 2020