Fault tolerant data input #4546

philrz · 2023-04-25T18:45:55Z

At the time this issue is being opened, Zed is at commit 58e7993.

We've had a few community issues that speak to a desire for fault-tolerant data input, e.g., when a parse error is encountered while reading one of Zed's supported formats, skip over the "bad data" and continue reading the "good" data where possible.

JSON - Add a fault tolerance option for multi-line json parsing #4106
CSV - Csv data import problem #4514
Format unspecified - When I import some data, there may be bad strings in some entries in the data. Can you let me skip these data and continue importing? zui#2756

Indeed, we can see how this could be quite handy in some use cases, such as a user with a large amount of data that has a parse error somewhere deep inside it. Their goal might be to find a "needle in a haystack": if a quick search on the non-corrupt parts reveals what they're looking for, they're done, so having to pause and make the data 100% clean just to read it in and start searching is a hindrance. A quick survey of other tools shows that many CSV/JSON readers have options to skip over parsing errors, so Zed could offer a similar option for readers where it makes sense.

In a group discussion a novel approach was proposed where Zed could turn the "bad data" into error values that include the bad data itself and a timestamp. As an alternative to just dropping the data as many other tools might do, this would give the user an easier way to see what got skipped & why and perhaps still do crude searches against it or even clean it up and commit the data into new values.
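
To make the proposal concrete outside of Zed, here is a minimal Python sketch of the same idea for newline-delimited JSON. It is not Zed's implementation; the function name `read_ndjson_tolerant` and the shape of the error record (`failed_to_parse`, `reason`, `at`) are assumptions made up for illustration.

```python
import json
from datetime import datetime, timezone


def read_ndjson_tolerant(lines):
    """Yield one value per non-empty line.

    Lines that fail to parse are not dropped. They are wrapped in a
    hypothetical error record that keeps the bad text and a timestamp.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        try:
            yield json.loads(text)
        except json.JSONDecodeError as exc:
            yield {
                "error": {
                    "failed_to_parse": text,
                    "reason": exc.msg,
                    "at": datetime.now(timezone.utc).isoformat(),
                }
            }


if __name__ == "__main__":
    sample = ['{"message": "One"}\n', "Message\n", '{"message": "Three"}\n']
    for value in read_ndjson_tolerant(sample):
        print(value)
```

Because the bad line survives inside the error record, a user could still search it or repair it later rather than losing it silently.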

As suggested in a comment in #4514, once we have this functionality it would probably be helpful to surface these errors to clients like Zui in a way that draws the user's attention to their presence and count when they occur. Follow-on issues to deal with that may be opened once this base functionality exists.


philrz · 2023-04-25T19:08:33Z

Here's a crude example of implementing the proposal using existing building blocks.

$ zq -version
Version: v1.7.0-41-g58e7993d

$ cat messages.ndjson 
{"message": "One"}
{"message": "Two"}
Message
{"message": "Three"}

$ zq -i json messages.ndjson 
messages.ndjson: invalid character 'M' looking for beginning of value

$ zq -z -i line 'yield (parse_zson(this) == null) ? error({failed_to_parse: this, at: now()}): parse_zson(this)' messages.ndjson 
{message:"One"}
{message:"Two"}
error({failed_to_parse:"Message",at:2023-04-25T19:05:49.367519Z})
{message:"Three"}

philrz · 2023-12-22T17:51:14Z

There was another recent request for this functionality in brimdata/zui#2933. The user made a couple of suggestions that might be worth considering in our design here:

Including the line number from an input file in the error value could be very helpful if the user wants to go back and fix the syntax error in the data source. We already have line numbers in some error messages in the existing reader, e.g., with the test data from Csv data import problem #4514:

$ zq -version
Version: v1.12.0-3-gec5165f0

$ zq imdb.csv 
imdb.csv: record on line 6: wrong number of fields
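
As a rough illustration of carrying the line number into the error value, here is a small Python sketch built on the standard `csv` module. The function name `read_csv_tolerant` and the error record fields (`message`, `line`, `fields`) are hypothetical and are not how Zed's CSV reader works; the sketch also assumes the first row is a header and that a "bad" record is one whose field count differs from it.

```python
import csv
import io


def read_csv_tolerant(stream):
    """Yield a dict per well-formed row and an error record per malformed row."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return  # empty input
    for row in reader:
        if not row:
            continue  # csv.reader returns [] for blank lines
        if len(row) != len(header):
            yield {
                "error": {
                    "message": "wrong number of fields",
                    # line_num counts source lines read so far, so for a
                    # single-line record it is that record's line number.
                    "line": reader.line_num,
                    "fields": row,
                }
            }
        else:
            yield dict(zip(header, row))


if __name__ == "__main__":
    data = io.StringIO("title,year\nA,2001\nB\nC,2003\n")
    for value in read_csv_tolerant(data):
        print(value)
```

With the line number stored in the error record, a user can go straight to the offending line in the source file while the remaining rows still load.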

It's somewhat orthogonal, but when the data is being imported to a Zed lake, as an alternative to having the errors become values in the pool, they could potentially be redirected to a separate pool. I've heard other community users make similar suggestions in other contexts, e.g., where to send the debug output described in Inspect the output at a midpoint in a Zed script #4487. Much like how UNIX shells allow redirecting stderr and stdout to different destinations or merging them into one output stream, perhaps Zed could echo this approach, since it would be familiar to most users.
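
The stdout/stderr analogy can be sketched in a few lines of Python. This is only an illustration of the routing idea, not a proposed Zed interface: the function name `route`, the convention that an error is a dict whose only key is `error`, and the use of file-like objects in place of pools are all assumptions.

```python
import io
import json


def route(values, out, err=None):
    """Write ordinary values to `out` and error records to `err`.

    When `err` is None, both kinds go to `out`, similar to merging
    stderr into stdout in a shell. Returns a count of each kind.
    """
    err = out if err is None else err
    counts = {"values": 0, "errors": 0}
    for value in values:
        is_error = isinstance(value, dict) and set(value) == {"error"}
        dest = err if is_error else out
        dest.write(json.dumps(value) + "\n")
        counts["errors" if is_error else "values"] += 1
    return counts


if __name__ == "__main__":
    values = [{"message": "One"}, {"error": {"failed_to_parse": "Message"}}]
    good, bad = io.StringIO(), io.StringIO()
    print(route(values, good, bad))
```

The returned counts are one simple way a client could tell the user that some records ended up in the error destination.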

philrz · 2024-02-21T01:09:48Z

Note the changes over the years with "partial loads" as captured in brimdata/zui#2660.

philrz added the community label Apr 25, 2023

philrz mentioned this issue Dec 22, 2023

Suggestion to solve csv import problem (defective) brimdata/zui#2933

Closed

philrz mentioned this issue Feb 21, 2024

Change "Load successful" pop-up to reflect partial loads brimdata/zui#2660

Closed

This was referenced Feb 21, 2024

JSON load times: Brim v0.31.0 vs. Zui v1.6.0 brimdata/zui#3000

Open

Add the fuse option to the data entered linearly brimdata/zui#3011

Open
