Persistent shaper configs applied upon ingest to pools #2695

philrz · 2021-05-10T18:42:49Z

To date, applying shapers has always been a client-side operation. A couple of examples:

zq may be used (often with -I) to shape data before importing it into a lake via zed load, dragging it into the Brim app, etc.
Brimcap performs all its shaping client-side, turning data into rich ZNG before it's posted to a Zed Lake.

However, a useful app workflow might be to define a persistent shaper config such that unshaped data could be incrementally added to a pool and shaped server-side without the user having to include or mention the shaper code every time. For example, in a multi-person org, a Zed-savvy user may be responsible for perfecting the "golden" shaping configs and defining policy that ensures they're applied on incoming data for certain pools. Then other users could just import their NDJSON/CSV/etc. directly to those pools without having to know anything about shapers.
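As a rough illustration of that workflow, here is a minimal Python sketch in which a shaper is registered once per pool and then applied to every record that arrives afterwards. Everything in it is hypothetical: `ShaperRegistry`, `set_shaper`, `ingest`, and `golden_shaper` are made-up names, not part of the Zed lake or any of its clients, and a plain Python function stands in for real shaper code.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Shaper = Callable[[Record], Record]


class ShaperRegistry:
    """Hypothetical server-side store holding one persistent shaper per pool."""

    def __init__(self) -> None:
        self._shapers: Dict[str, Shaper] = {}
        self._pools: Dict[str, List[Record]] = {}

    def set_shaper(self, pool: str, shaper: Shaper) -> None:
        # Done once by the Zed-savvy user who owns the "golden" config.
        self._shapers[pool] = shaper

    def ingest(self, pool: str, ndjson_lines: Iterable[str]) -> int:
        # Other users post unshaped NDJSON; the pool's shaper is applied here.
        # Pools without a registered shaper store records unchanged.
        shaper = self._shapers.get(pool, lambda rec: rec)
        stored = self._pools.setdefault(pool, [])
        count = 0
        for line in ndjson_lines:
            if not line.strip():
                continue  # skip blank lines
            stored.append(shaper(json.loads(line)))
            count += 1
        return count

    def records(self, pool: str) -> List[Record]:
        return list(self._pools.get(pool, []))


def golden_shaper(rec: Record) -> Record:
    """Example shaper: derive a typed UTC timestamp from the 'created' string."""
    shaped = dict(rec)
    created = str(rec["created"]).replace("Z", "+00:00")
    shaped["_ts"] = datetime.fromisoformat(created).astimezone(timezone.utc)
    return shaped
```

The point of the sketch is only the division of labor: `set_shaper` is called once by whoever maintains the config, while callers of `ingest` never see or mention the shaper.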

We expect this issue to start with design tasks such as determining how the shaping configs are attached/persisted in the Lake, then thinking about how they're invoked by the Brim app and zed.

Note: This may overlap with the "intake" concept we've discussed in the past and tracked via brimdata/zui#1481.


mccanne · 2021-12-14T13:28:31Z

Moving to icebox for now as this should probably be part of an ingest system design instead of an attribute of a pool in the backend.

philrz · 2023-08-28T18:55:10Z

Today I thought of this issue again in the context of a community user's inquiry in a recent Slack thread. They were trying to use the Python client to replicate command lines they'd traditionally done at the shell. Their specific question:

is there a way to specify the type of some fields like I can with the zq | zed load -
for eg:
zq -i json '_ts:=time(created)' infile.json | zed load -

I couldn't think of a way for them to replicate that whole pipeline within Python unless they were invoking the zq binary. That is, the Zed Python client can load data from file-like objects into the lake, or read data back out of the lake using queries, but the first half of that pipeline is entirely "non-lake". That got me thinking again about how it would be handy if that kind of shaping were somehow persisted server-side so it could be applied on ingest, since that way a dumber client like the Python one (or an even dumber one like curl) could post the unshaped data and have it be shaped before being stored. FWIW, when I described this to the user, their response was "oh that would be awesome!", but per @mccanne's most recent comment above, I'm sure there are other design considerations that might favor an approach other than the one I originally thought of in this issue.
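To make the gap concrete, here is a rough sketch of a pure-Python stand-in for the first half of that pipeline. `preshape` is a hypothetical helper, not part of the Zed Python client, and it assumes `created` holds an ISO 8601 string. It returns a file-like object of the kind the client can load, but because JSON has no time type it can only write `_ts` as a normalized string, so it does not really reproduce what `_ts:=time(created)` does in zq.

```python
import io
import json
from datetime import datetime, timezone


def preshape(ndjson_text: str) -> io.StringIO:
    """Hypothetical client-side step: add a normalized '_ts' to each record."""
    out = io.StringIO()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        # Assumes 'created' is ISO 8601; a trailing 'Z' is rewritten so that
        # older Python versions of fromisoformat() accept it.
        created = datetime.fromisoformat(rec["created"].replace("Z", "+00:00"))
        # '_ts' is still a JSON string here, not a time-typed value.
        rec["_ts"] = created.astimezone(timezone.utc).isoformat()
        out.write(json.dumps(rec) + "\n")
    out.seek(0)  # rewind so a loader can read from the start
    return out
```

Every Python user would have to carry a helper like this around, which is the burden a server-side shaper applied on ingest would remove.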

philrz mentioned this issue May 10, 2021

Store mapping definitions in the archive #1265

Closed

philrz added this to the Data MVP1 milestone May 12, 2021

philrz removed this from the ETL Lake milestone Oct 25, 2022

philrz changed the title ~~Persistent shaper configs attached to pools~~ Persistent shaper configs applied upon ingest to pools Dec 2, 2022

philrz added the community label Aug 28, 2023
