Fault tolerant data input #4546

Open
philrz opened this issue Apr 25, 2023 · 3 comments

@philrz
Contributor

philrz commented Apr 25, 2023

At the time this issue is being opened, Zed is at commit 58e7993.

We've had a few community issues that speak to a desire for fault-tolerant data input, i.e., when a parse error is encountered while reading one of Zed's supported formats, skip over the "bad data" and continue reading more "good" data when possible.

Indeed, we can see how this could be quite handy in some use cases, such as a user with a large amount of data that has a parse error somewhere deep inside it. Their goal might be to find a "needle in a haystack": if a quick search on the non-corrupt parts reveals what they're looking for, they're done, so having to pause and make the data 100% clean just to read it in and start searching is a hindrance. A quick survey of other tools shows that many CSV/JSON readers have options to skip over parsing errors, so Zed could offer a similar option for readers where it makes sense.

In a group discussion a novel approach was proposed where Zed could turn the "bad data" into error values that include the bad data itself and a timestamp. As an alternative to just dropping the data as many other tools might do, this would give the user an easier way to see what got skipped & why and perhaps still do crude searches against it or even clean it up and commit the data into new values.

As suggested in a comment in #4514, once we have this functionality it would probably be helpful to surface these errors to clients like Zui in a way that draws the user's attention to their presence and count when they happen. Follow-on issues to deal with that may be opened once this base functionality exists.

@philrz
Contributor Author

philrz commented Apr 25, 2023

Here's a crude example of implementing the proposal using existing building blocks.

$ zq -version
Version: v1.7.0-41-g58e7993d

$ cat messages.ndjson 
{"message": "One"}
{"message": "Two"}
Message
{"message": "Three"}

$ zq -i json messages.ndjson 
messages.ndjson: invalid character 'M' looking for beginning of value

$ zq -z -i line 'yield (parse_zson(this) == null) ? error({failed_to_parse: this, at: now()}): parse_zson(this)' messages.ndjson 
{message:"One"}
{message:"Two"}
error({failed_to_parse:"Message",at:2023-04-25T19:05:49.367519Z})
{message:"Three"}
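
For readers who want to see the same idea outside of Zed, here is a minimal Python sketch of the approach shown above: each line is parsed as JSON, and a line that fails to parse becomes an error record carrying the original text and a timestamp instead of aborting the read. The function name `read_tolerant` and the shape of the error record are assumptions made for illustration; they mirror the `failed_to_parse` and `at` fields from the example and are not part of any Zed API.

```python
import json
from datetime import datetime, timezone


def read_tolerant(lines):
    """Yield one value per non-blank line; unparseable lines become error records.

    Hypothetical helper for illustration only, not a Zed API.
    """
    for raw in lines:
        text = raw.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines rather than reporting them
        try:
            yield json.loads(text)
        except json.JSONDecodeError:
            # Keep the bad data and note when it was seen, rather than dropping it.
            yield {
                "error": {
                    "failed_to_parse": text,
                    "at": datetime.now(timezone.utc).isoformat(),
                }
            }


# Same input as messages.ndjson in the example above.
sample = [
    '{"message": "One"}',
    '{"message": "Two"}',
    "Message",
    '{"message": "Three"}',
]

results = list(read_tolerant(sample))
for value in results:
    print(value)
```

As in the `zq` example, the three good records pass through and the bad line is kept as a value that can still be searched or repaired later.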

@philrz
Contributor Author

philrz commented Dec 22, 2023

There was another recent request for this functionality in brimdata/zui#2933. The user made a couple of suggestions that might be worth considering in our design here:

  1. Including the line number from an input file in the error value could be very helpful if the user wants to go back and fix the syntax error in the data source. We already have line numbers in some error messages in the existing reader, e.g., with the test data from Csv data import problem #4514:

$ zq -version
Version: v1.12.0-3-gec5165f0

$ zq imdb.csv
imdb.csv: record on line 6: wrong number of fields

  2. It's somewhat orthogonal, but when the data is being imported to a Zed lake, as an alternative to having the errors become values in the pool, they could potentially be redirected to a separate pool. I've heard other community users make similar suggestions in other contexts, e.g., where to send the debug output described in Inspect the output at a midpoint in a Zed script #4487. So kind of like how the UNIX shells allow redirecting of stderr and stdout to different destinations or merging them into one output stream, perhaps Zed could echo this approach since it would be familiar to most users.
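
Both suggestions can be illustrated together with a short Python sketch: it records the 1-based line number of each bad line and routes good values and errors to two separate collections, loosely analogous to stdout and stderr (or to two pools). The function name `split_stream` and the error fields `source`, `line`, and `reason` are hypothetical names chosen for this sketch, not existing Zed behavior.

```python
import json


def split_stream(lines, source="input"):
    """Return (good, errors); errors carry the line number of the bad input.

    Hypothetical helper for illustration only, not a Zed API.
    """
    good, errors = [], []
    # enumerate from 1 so line numbers match what a text editor would show
    for line_no, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            continue  # blank lines are skipped but still counted
        try:
            good.append(json.loads(text))
        except json.JSONDecodeError as exc:
            errors.append(
                {
                    "source": source,
                    "line": line_no,
                    "failed_to_parse": text,
                    "reason": exc.msg,
                }
            )
    return good, errors


sample = [
    '{"message": "One"}',
    '{"message": "Two"}',
    "Message",
    '{"message": "Three"}',
]

good, errors = split_stream(sample, source="messages.ndjson")
print(len(good), "good values,", len(errors), "error(s)")
for err in errors:
    print(err["source"], "line", err["line"], ":", err["failed_to_parse"])
```

A caller could then send the two collections to different destinations or merge them back into one stream, which is the choice the stderr/stdout analogy describes.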

@philrz
Contributor Author

philrz commented Feb 21, 2024

Note the changes over the years with "partial loads" as captured in brimdata/zui#2660.
