
Refactor analyze #110

Merged: 6 commits merged into main from analyze-refactor on Jun 30, 2021

Conversation

mattnibs
Collaborator

The old implementation of analyze was prone to hard-to-reproduce, hard-to-debug
concurrency bugs. Rewrite analyze to simplify things. This commit divorces the
analyzer process command phase from the log tailer phase, which were previously
unified. Separating the two phases of analyze makes it easier to isolate each
one and diagnose issues.

This refactor fixes the deadlock bug encountered when one analyzer process
errored out in the midst of analysis.

Also:

  • Have the load command use the add / commit endpoints instead of the
    deprecated log post endpoints.

Closes #109
Closes #71
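
To make the phase separation concrete, here is a minimal Go sketch of the idea. This is a sketch only, not the actual brimcap code: runAnalyzers, readLogs, passing the pcap path as an argument, and the *.log glob are all illustrative assumptions. Phase 1 runs every analyzer command to completion in its own work directory; phase 2 only then collects the logs they produced.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

// Phase 1: run every analyzer command to completion in its own work
// directory, collecting any errors. No log reader is involved yet, so a
// process that exits early simply surfaces its error here.
func runAnalyzers(pcap string, cmds, workdirs []string) error {
	var wg sync.WaitGroup
	errs := make([]error, len(cmds))
	for i := range cmds {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			cmd := exec.Command(cmds[i], pcap)
			cmd.Dir = workdirs[i]
			if out, err := cmd.CombinedOutput(); err != nil {
				errs[i] = fmt.Errorf("%s: %w\n%s", cmds[i], err, out)
			}
		}(i)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// Phase 2: only after all processes have exited, walk the work
// directories and gather the log files to be shaped and loaded.
func readLogs(workdirs []string) ([]string, error) {
	var logs []string
	for _, dir := range workdirs {
		matches, err := filepath.Glob(filepath.Join(dir, "*.log"))
		if err != nil {
			return nil, err
		}
		logs = append(logs, matches...)
	}
	return logs, nil
}

func main() {
	workdirs := []string{"wd-zeek", "wd-suricata"}
	cmds := []string{"zeek-wrapper.sh", "suricata-wrapper.sh"}
	if err := runAnalyzers("wrccdc.pcap", cmds, workdirs); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	logs, err := readLogs(workdirs)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("logs to load:", logs)
}

Because no log reader is attached while the analyzer processes are still running, an analyzer that errors out mid-run comes back as an ordinary error from phase 1 rather than hanging the pipeline, the kind of failure mode described in #71.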

@mattnibs mattnibs requested review from nwt, mccanne and philrz June 29, 2021 15:30
@mattnibs mattnibs force-pushed the analyze-refactor branch 5 times, most recently from e0c8e4b to 5713d85 Compare June 30, 2021 15:15
@philrz
Contributor

philrz commented Jun 30, 2021

I just tested this branch as of commit 5713d85 and it seems effective at addressing the deadlock symptom I reported in #71. I did see something else unexpected, which @mattnibs said he'll respond to.

The repro setup is similar to what's described in #71: my custom Zeek and Suricata are installed with the wrapper scripts, one of which depends on jq, which is not installed. I did have to update my config YAML to conform to the new Zed syntax in the shaper.

$ cat zeek-suricata-new.yml
analyzers:
  - cmd: zeek-wrapper.sh
  - cmd: suricata-wrapper.sh
    globs: ["deduped*.json"]
    shaper: |
      type alert = {
        timestamp: time,
        event_type: bstring,
        src_ip: ip,
        src_port: port=(uint16),
        dest_ip: ip,
        dest_port: port=(uint16),
        vlan: [uint16],
        proto: bstring,
        app_proto: bstring,
        alert: {
          severity: uint16,
          signature: bstring,
          category: bstring,
          action: bstring,
          signature_id: uint64,
          gid: uint64,
          rev: uint64,
          metadata: {
            signature_severity: [bstring],
            former_category: [bstring],
            attack_target: [bstring],
            deployment: [bstring],
            affected_product: [bstring],
            created_at: [bstring],
            performance_impact: [bstring],
            updated_at: [bstring],
            malware_family: [bstring],
            tag: [bstring]
          }
        },
        flow_id: uint64,
        pcap_cnt: uint64,
        tx_id: uint64,
        icmp_code: uint64,
        icmp_type: uint64,
        tunnel: {
          src_ip: ip,
          src_port: port=(uint16),
          dest_ip: ip,
          dest_port: port=(uint16),
          proto: bstring,
          depth: uint64
        },
        community_id: bstring
      }
      filter event_type=="alert" | put . := shape(alert) | rename ts:=timestamp

$ brimcap -version
Version: v0.0.5-19-g5713d85

$ brimcap analyze -config zeek-suricata-new.yml wrccdc.pcap > /dev/null
{"type":"error","error":"zeek-wrapper.sh: WorkDir not set; suricata-wrapper.sh: WorkDir not set"}

That was the one unexpected speedbump I hit: I'd created this config YAML back when an explicit workdir setting was not strictly required. The changes in this PR make it look like it now is, so I created the directories manually and added the settings.

$ cat repro.yml 
analyzers:
  - cmd: zeek-wrapper.sh
    workdir: wd-zeek
  - cmd: suricata-wrapper.sh
    workdir: wd-suricata
    globs: ["deduped*.json"]
    shaper: |
      type alert = {
        timestamp: time,
        event_type: bstring,
        src_ip: ip,
        src_port: port=(uint16),
        dest_ip: ip,
        dest_port: port=(uint16),
        vlan: [uint16],
        proto: bstring,
        app_proto: bstring,
        alert: {
          severity: uint16,
          signature: bstring,
          category: bstring,
          action: bstring,
          signature_id: uint64,
          gid: uint64,
          rev: uint64,
          metadata: {
            signature_severity: [bstring],
            former_category: [bstring],
            attack_target: [bstring],
            deployment: [bstring],
            affected_product: [bstring],
            created_at: [bstring],
            performance_impact: [bstring],
            updated_at: [bstring],
            malware_family: [bstring],
            tag: [bstring]
          }
        },
        flow_id: uint64,
        pcap_cnt: uint64,
        tx_id: uint64,
        icmp_code: uint64,
        icmp_type: uint64,
        tunnel: {
          src_ip: ip,
          src_port: port=(uint16),
          dest_ip: ip,
          dest_port: port=(uint16),
          proto: bstring,
          depth: uint64
        },
        community_id: bstring
      }
      filter event_type=="alert" | put . := shape(alert) | rename ts:=timestamp

$ brimcap analyze -config repro.yml wrccdc.pcap > /dev/null
{"type":"status","ts":{"sec":1625068592,"ns":361491935},"pcap_read_size":65536,"pcap_total_size":500001372,"records_written":0}
...
{"type":"error","error":"suricata-wrapper.sh exited with code 127\nstdout:\n30/6/2021 -- 08:56:31 - \u003cNotice\u003e - This is Suricata version 6.0.2 RELEASE running in USER mode\n30/6/2021 -- 08:56:56 - \u003cNotice\u003e - all 5 packet processing threads, 4 management threads initialized, engine started.\n30/6/2021 -- 08:57:32 - \u003cNotice\u003e - Signal Received.  Stopping engine.\n30/6/2021 -- 08:57:43 - \u003cNotice\u003e - Pcap-file module read 1 files, 1650919 packets, 473586644 bytes\nstderr:\n/usr/local/bin/suricata-wrapper.sh: line 3: jq: command not found\n"} 

On repeated runs, I got this appropriate failure message 5 out of 5 times. By comparison, when I used the same repro steps against the current tip of main (commit 6406f89), it deadlocked 4 times and returned a failure message only once.

As for the speedbump, looking at the changes in the linked PR, it looks like making the workdir setting required may be intentional. If it is, I'd probably need to go update some wiki articles, so we should talk about the motivation for the change.

Resolved review threads:
analyzer/config.go
analyzer/processes.go (outdated, 4 threads)
analyzer/reader.go (outdated)
cli/analyzecli/display.go (outdated)
cmd/brimcap/ztests/analyze-combiner-error.yaml (outdated)
analyzer/analyzer.go (outdated, 2 threads)
mattnibs and others added 4 commits June 30, 2021 13:32
Co-authored-by: Noah Treuhaft <noah.treuhaft@gmail.com> (all 4 commits)
@philrz
Contributor

philrz commented Jun 30, 2021

I just re-ran my test with the newer commit 085a264 on this branch and confirmed that it once again runs without requiring an explicit workdir setting. Thanks @mattnibs!
