DRILL-6951: Row set based mock data source #1809

paul-rogers · 2019-06-19T02:33:53Z

The mock data source is used in several tests to generate a large volume of sample data, such as when testing spilling. The mock data source also lets us try new plugin features in a very simple context. During the development of the row set framework, the mock data source was converted to use the new framework to verify functionality. This commit upgrades the mock data source with that work.

The work changes none of the functionality. It does, however, improve memory usage. Batches are limited, by default, to 10 MB in size. The row set framework minimizes internal fragmentation in the largest vector. (Previously, internal fragmentation averaged 25% but could be as high as 50%.)
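To illustrate where figures like 25% and 50% can come from, here is a minimal sketch. It assumes (the description above does not state this) that a vector's buffer is rounded up to the next power of two. The class and method names are hypothetical and are not part of Drill.

```java
// Hypothetical illustration only; not Drill code.
public class FragmentationSketch {

  // ASSUMPTION: buffers are sized to the next power of two >= dataBytes.
  static long allocatedSize(long dataBytes) {
    if (dataBytes <= 1) {
      return 1;
    }
    return Long.highestOneBit(dataBytes - 1) << 1;
  }

  // Fraction of the allocated buffer that holds no data.
  static double wastedFraction(long dataBytes) {
    long alloc = allocatedSize(dataBytes);
    return (double) (alloc - dataBytes) / alloc;
  }

  // Mean wasted fraction over every data size in (lo, hi].
  static double averageWaste(long lo, long hi) {
    double sum = 0;
    for (long d = lo + 1; d <= hi; d++) {
      sum += wastedFraction(d);
    }
    return sum / (hi - lo);
  }

  public static void main(String[] args) {
    long half = 1L << 19;
    long full = 1L << 20;
    // Worst case: one byte past a power of two wastes just under half.
    System.out.println("worst case: " + wastedFraction(half + 1));
    // Sizes spread evenly across one doubling interval waste about a quarter.
    System.out.println("average:    " + averageWaste(half, full));
  }
}
```

Under that assumption, a data size just above a power of two leaves almost half of the buffer empty, and sizes spread evenly between two powers of two leave about a quarter empty on average.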

As it turns out, the hash aggregate tests depended on the internal fragmentation: without it, the hash agg no longer spilled for the same row count. Adjusted the generated row counts to recreate a data volume that caused spilling.

One test in particular always failed due to assertions in the hash agg code. These appear to be true bugs and are described in DRILL-7301. After multiple failed attempts to get the test to work, it was disabled until DRILL-7301 is fixed.

Added a new unit test to sanity check the mock data source. (No test existed for this functionality except as verified via other unit tests.)

arina-ielchiieva · 2019-06-24T13:13:21Z

@paul-rogers looks good, please squash the commits.

paul-rogers · 2019-06-25T02:46:47Z

Squashed commits and rebased on latest master.

paul-rogers · 2019-07-01T00:28:36Z

Rebased onto master.

arina-ielchiieva · 2019-07-13T12:35:13Z

@paul-rogers please resolve the conflicts.

paul-rogers · 2019-07-14T19:59:19Z

Rebased on master. Tests fail with two unrelated failures (these failures occur about half the time I've run full tests over the last few months):

[ERROR] Failures: 
[ERROR]   TestDynamicUDFSupport.testDropFunction:555 Binary should exist in local udf directory
[ERROR] Errors: 
[ERROR]   TestDynamicUDFSupport.testReRegisterTheSameJarWithDifferentContent:600->BaseTestQuery.testRunAndReturn:340 » Rpc


arina-ielchiieva · 2019-07-15T10:27:24Z

+1

paul-rogers changed the title ~~Drill 6951~~ DRILL-6951: Merge row set based mock data source Jun 19, 2019

paul-rogers force-pushed the DRILL-6951 branch from 1d3cfc1 to 2a54a0c on June 19, 2019 02:51

paul-rogers changed the title ~~DRILL-6951: Merge row set based mock data source~~ DRILL-6951: Row set based mock data source Jun 20, 2019

paul-rogers force-pushed the DRILL-6951 branch from 7f1bd0c to 9ca0822 on June 25, 2019 02:46

paul-rogers force-pushed the DRILL-6951 branch from 9ca0822 to ca499d5 on July 1, 2019 00:28

paul-rogers force-pushed the DRILL-6951 branch from ca499d5 to aa403b7 on July 14, 2019 19:59

paul-rogers force-pushed the DRILL-6951 branch from aa403b7 to 23e8a8a on July 14, 2019 23:29

arina-ielchiieva merged commit 3599dfd into apache:master Jul 15, 2019
