Persistent shaper configs applied upon ingest to pools #2695

Open
philrz opened this issue May 10, 2021 · 2 comments
Comments

@philrz
Contributor

philrz commented May 10, 2021

To date, applying shapers has always been a client-side operation. A couple of examples:

  1. zq may be used (often with -I) to shape data before importing it into a lake via zed load, dragging it into the Brim app, etc.
  2. Brimcap performs all its shaping to turn data into rich ZNG client-side before it's posted to a Zed Lake

However, a useful app workflow might be to define a persistent shaper config such that unshaped data could be incrementally added to a pool and shaped server-side without the user having to include or mention the shaper code every time. For example, in a multi-person org, a Zed-savvy user may be responsible for perfecting the "golden" shaping configs and defining policy that ensures they're applied on incoming data for certain pools. Then other users could just import their NDJSON/CSV/etc. directly to those pools without having to know anything about shapers.

We expect this issue to start with design tasks such as determining how the shaping configs are attached/persisted in the Lake, then thinking about how it's invoked by the Brim app and zed.
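To make the idea concrete, here is a minimal sketch of the behavior described above: a shaper is registered once per pool and then applied to every record ingested into that pool. Everything in it is an assumption for illustration only. `ShapingLake`, `set_shaper`, `ingest_ndjson`, and `add_ts` are hypothetical names, the "lake" is an in-memory dict, and the shaper is a Python function rather than Zed shaper code. It says nothing about how a Zed Lake would actually persist or invoke a shaper.

```python
import json
from datetime import datetime
from typing import Callable, Dict, Iterable, List

Record = dict
Shaper = Callable[[Record], Record]


class ShapingLake:
    """Hypothetical in-memory stand-in for a lake with per-pool shapers."""

    def __init__(self) -> None:
        self._shapers: Dict[str, Shaper] = {}
        self._pools: Dict[str, List[Record]] = {}

    def set_shaper(self, pool: str, shaper: Shaper) -> None:
        # Done once by the user who owns the "golden" shaping config.
        self._shapers[pool] = shaper

    def ingest_ndjson(self, pool: str, lines: Iterable[str]) -> int:
        # Callers post unshaped NDJSON; the pool's shaper (if any) is
        # applied here, so callers never mention the shaper themselves.
        shaper = self._shapers.get(pool, lambda record: record)
        stored = self._pools.setdefault(pool, [])
        count = 0
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            stored.append(shaper(json.loads(line)))
            count += 1
        return count

    def records(self, pool: str) -> List[Record]:
        return list(self._pools.get(pool, []))


def add_ts(record: Record) -> Record:
    # Example shaper: derive a typed timestamp field from a string field.
    shaped = dict(record)
    created = record["created"].replace("Z", "+00:00")
    shaped["_ts"] = datetime.fromisoformat(created)
    return shaped
```

With this sketch, one user calls `lake.set_shaper("logs", add_ts)` a single time, and everyone else only calls `lake.ingest_ndjson("logs", lines)` with raw data.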

Note: This may overlap with the "intake" concept we've discussed in the past and tracked via brimdata/zui#1481.

@philrz philrz added this to the Data MVP1 milestone May 12, 2021
@mccanne
Collaborator

mccanne commented Dec 14, 2021

Moving to icebox for now as this should probably be part of an ingest system design instead of an attribute of a pool in the backend.

@philrz philrz removed this from the ETL Lake milestone Oct 25, 2022
@philrz philrz changed the title from "Persistent shaper configs attached to pools" to "Persistent shaper configs applied upon ingest to pools" Dec 2, 2022
@philrz
Contributor Author

philrz commented Aug 28, 2023

Today I thought of this issue again in the context of a community user's inquiry in a recent Slack thread. They were trying to use the Python client to replicate command lines they'd traditionally run at the shell. Their specific question:

is there a way to specify the type of some fields like I can with the zq | zed load -
for eg:
zq -i json '_ts:=time(created)' infile.json | zed load -

I couldn't think of a way for them to replicate that whole pipeline within Python unless they were invoking the zq binary. That is, the Zed Python client can load data from file-like objects into the lake, or read data back out of the lake using queries, but the first half of that pipeline is entirely "non-lake". That got me thinking again about how it would be handy if that kind of shaping were somehow persisted server-side so it could be applied on ingest, since that way a dumber client like the Python one (or an even dumber one like curl) could post the unshaped data and have it be shaped before being stored. FWIW, when I described this to the user, their response was "oh that would be awesome!", but per @mccanne's most recent comment above, I'm sure there are other design considerations that might favor an approach other than the one I originally thought of in this issue.
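For reference, the "invoke the zq binary" workaround mentioned above could be sketched roughly as follows. It uses only Python's standard `subprocess` module and mirrors the user's quoted command line verbatim. The helper names `build_commands` and `shape_and_load` are hypothetical, and the sketch assumes `zq` and `zed` are on the `PATH` and that `zed load -` already knows which pool to target.

```python
import subprocess
from typing import List, Tuple


def build_commands(
    infile: str, expr: str = "_ts:=time(created)"
) -> Tuple[List[str], List[str]]:
    # Same arguments as the shell pipeline quoted above:
    #   zq -i json '_ts:=time(created)' infile.json | zed load -
    shape_cmd = ["zq", "-i", "json", expr, infile]
    load_cmd = ["zed", "load", "-"]
    return shape_cmd, load_cmd


def shape_and_load(infile: str) -> int:
    """Pipe zq's shaped output into zed load; return a nonzero code on failure."""
    shape_cmd, load_cmd = build_commands(infile)
    shaper = subprocess.Popen(shape_cmd, stdout=subprocess.PIPE)
    loader = subprocess.Popen(load_cmd, stdin=shaper.stdout)
    # Close our copy of the pipe so zq sees a broken pipe if zed exits early.
    shaper.stdout.close()
    loader_rc = loader.wait()
    shaper_rc = shaper.wait()
    return loader_rc or shaper_rc
```

This still leaves the shaping expression in every client script, which is exactly the duplication a server-side persistent shaper would avoid.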
