Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debug output beyond "put: a referenced field is missing" #2331

Closed
philrz opened this issue Mar 13, 2021 · 2 comments · Fixed by #3368 or #4686
Closed

Add debug output beyond "put: a referenced field is missing" #2331

philrz opened this issue Mar 13, 2021 · 2 comments · Fixed by #3368 or #4686
Assignees

Comments

@philrz
Copy link
Contributor

philrz commented Mar 13, 2021

Repro is with zq commit 2897dd3.

One of the quirks of how Zeek outputs NDJSON is that it will drop "unset" fields/records entirely rather than output them with explicit null values, such as we see in the following pair of smb_files events where one contains times and one does not.

$ cat smb_files.ndjson | jq .
{
  "_path": "smb_files",
  "_write_ts": "2018-03-24T17:23:35.680864Z",
  "ts": "2018-03-24T17:23:35.680290Z",
  "uid": "CJKUcZ2cGcTrU6FpN7",
  "id.orig_h": "10.128.0.233",
  "id.orig_p": 44132,
  "id.resp_h": "10.47.21.82",
  "id.resp_p": 445,
  "action": "SMB::FILE_OPEN",
  "path": "\\\\10.47.21.82\\ADMIN$",
  "name": "Temp\\wdszAPrV.tmp",
  "size": 196,
  "times.modified": "2018-03-24T17:19:18.111192Z",
  "times.accessed": "2018-03-24T17:19:11.777714Z",
  "times.created": "2018-03-24T17:19:11.777714Z",
  "times.changed": "2018-03-24T17:19:18.111192Z"
}
{
  "_path": "smb_files",
  "_write_ts": "2018-03-24T17:25:04.361621Z",
  "ts": "2018-03-24T17:25:04.360742Z",
  "uid": "CjTxiyIZxrlkPWtNi",
  "id.orig_h": "10.164.94.120",
  "id.orig_p": 39947,
  "id.resp_h": "10.47.3.142",
  "id.resp_p": 445,
  "action": "SMB::FILE_OPEN",
  "name": "\\browser",
  "size": 0
}

I'm trying to shape this data. One of our quirks at the moment is that the only way to convert ISO timestamps into time type is to use the iso() function. However, in this case an error message is produced for the one record where times was absent.

$ zq -z -i ndjson 'put times.modified=iso(times.modified)' smb_files.ndjson 
put: a referenced field is missing
{_path:"smb_files",_write_ts:"2018-03-24T17:23:35.680864Z",action:"SMB::FILE_OPEN",id:{orig_h:"10.128.0.233",orig_p:44132.,resp_h:"10.47.21.82",resp_p:445.},name:"Temp\\wdszAPrV.tmp",path:"\\\\10.47.21.82\\ADMIN$",size:196.,times:{accessed:"2018-03-24T17:19:11.777714Z",changed:"2018-03-24T17:19:18.111192Z",created:"2018-03-24T17:19:11.777714Z",modified:2018-03-24T17:19:18.111192Z},ts:"2018-03-24T17:23:35.680290Z",uid:"CJKUcZ2cGcTrU6FpN7"}
{_path:"smb_files",_write_ts:"2018-03-24T17:25:04.361621Z",action:"SMB::FILE_OPEN",id:{orig_h:"10.164.94.120",orig_p:39947.,resp_h:"10.47.3.142",resp_p:445.},name:"\\browser",size:0.,ts:"2018-03-24T17:25:04.360742Z",uid:"CjTxiyIZxrlkPWtNi"}

This is easy to spot when working with just two records, but originally I was working with a giant shaping config that targeted lots of fields and I was also reading millions of lines of input. Therefore I just saw a flood of these a referenced field is missing errors and I wasn't sure if they were benign (like this one) or perhaps caused by a mistake on my part with perhaps having made a typo in a field name.

In conclusion, it would have helped a lot if the debug output had been able to list the name(s) of the specific field(s) that were deemed "missing" and perhaps (optionally?) a way to see the entirety of the record it was missing from.

In discussing this with @mccanne, he imagined the ideal way to address this would be via a comprehensive "debug mode" that offered the detail via a stream of ZNG errors. However, he also imagined there might be something ad hoc we could do as a first step.

@philrz philrz added this to the Data MVP1 milestone Mar 15, 2021
@philrz philrz modified the milestones: Data MVP0, Data MVP1 May 10, 2021
@philrz
Copy link
Contributor Author

philrz commented Jun 22, 2021

We revisited this topic today in the era of Zed Lakes and some of the experience of the beta testers from the community.

Something worth emphasizing is that despite the put: a referenced field is missing error message, the stream output continued and the record that lacked the times.modified field still appeared in the output. This might very well match the user's intent (i.e., make the operation work where it can work, and pass through the data where it can't).

However, if the user had made a typo (let's say they'd typed time.modified by accident) they might still be scratching their head wondering why the field they thought they were changing is still a string in their output. For this case it was pointed out that the metadata approach proposed in #2769 should help address this case, as long as the data is stored in a Zed Lake: The metadata for the segments touched by the query could be checked and a single warning could be delivered to the user informing them that the field named in their query exists nowhere in the data.

For the simple zq case where we lack this metadata, we might consider just dropping the current put: a referenced field is missing warning since the tool is effectively doing what the user asked, similar to how a grep obviously can't tell a user if they made a typo in their search pattern.

@philrz
Copy link
Contributor Author

philrz commented Jun 29, 2023

Multiple changes have accumulated that have addressed this issue. First, the put: a referenced field is missing error message was phased out in linked PR #3368 in favor of an approach using Zed's first-class errors. Then linked PR #4686 added some user-facing docs that describe how to make use of these errors to provide the kind of debug sought in this issue. The docs show multiple approaches based on how severe the user thinks such a problem might be for their use case, i.e., whether an entire record becomes an error value or just individual problematic fields.

Circling back to the original example in this issue and verifying how it looks are current Zed commit 8bf604c, here's how the shaping looks in current syntax.

$ zq -version
Version: v1.8.1-53-g8bf604cf

$ zq -Z 'nest_dotted(this) | put times.modified:=time(times.modified)' smb_files.ndjson
{
    _path: "smb_files",
    _write_ts: "2018-03-24T17:23:35.680864Z",
    ts: "2018-03-24T17:23:35.680290Z",
    uid: "CJKUcZ2cGcTrU6FpN7",
    id: {
        orig_h: "10.128.0.233",
        orig_p: 44132,
        resp_h: "10.47.21.82",
        resp_p: 445
    },
    action: "SMB::FILE_OPEN",
    path: "\\\\10.47.21.82\\ADMIN$",
    name: "Temp\\wdszAPrV.tmp",
    size: 196,
    times: {
        modified: 2018-03-24T17:19:18.111192Z,
        accessed: "2018-03-24T17:19:11.777714Z",
        created: "2018-03-24T17:19:11.777714Z",
        changed: "2018-03-24T17:19:18.111192Z"
    }
}
{
    _path: "smb_files",
    _write_ts: "2018-03-24T17:25:04.361621Z",
    ts: "2018-03-24T17:25:04.360742Z",
    uid: "CjTxiyIZxrlkPWtNi",
    id: {
        orig_h: "10.164.94.120",
        orig_p: 39947,
        resp_h: "10.47.3.142",
        resp_p: 445
    },
    action: "SMB::FILE_OPEN",
    name: "\\browser",
    size: 0,
    times: {
        modified: error("missing")
    }
}

@philrz philrz closed this as completed Jun 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants