
Refactor analyze #110

Merged: 6 commits merged into main from analyze-refactor on Jun 30, 2021

Conversation

mattnibs
Collaborator

The old implementation of analyze was prone to hard-to-reproduce, hard-to-debug
concurrency bugs. Rewrite analyze to simplify things. This commit divorces the
analyzer process command phase from the log tailer phase, which were previously
unified. Separating the two phases of analyze makes it easier to isolate each
one and diagnose issues.

This refactor fixes the deadlock bug encountered when one analyzer process
errored out in the midst of analysis.

Also:

  • Have the load command use the add / commit endpoints instead of the
    deprecated log post endpoints.

Closes #109
Closes #71
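
To make the phase separation concrete, here is a minimal Go sketch of the idea. This is a sketch only, not the actual brimcap code: runAnalyzers, readLogs, passing the pcap path as an argument, and the *.log glob are all illustrative assumptions. Phase 1 runs every analyzer command to completion in its own work directory; phase 2 only then collects the logs they produced.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

// Phase 1: run every analyzer command to completion in its own work
// directory, collecting any errors. No log reader is involved yet, so a
// process that exits early simply surfaces its error here.
func runAnalyzers(pcap string, cmds, workdirs []string) error {
	var wg sync.WaitGroup
	errs := make([]error, len(cmds))
	for i := range cmds {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			cmd := exec.Command(cmds[i], pcap)
			cmd.Dir = workdirs[i]
			if out, err := cmd.CombinedOutput(); err != nil {
				errs[i] = fmt.Errorf("%s: %w\n%s", cmds[i], err, out)
			}
		}(i)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// Phase 2: only after all processes have exited, walk the work
// directories and gather the log files to be shaped and loaded.
func readLogs(workdirs []string) ([]string, error) {
	var logs []string
	for _, dir := range workdirs {
		matches, err := filepath.Glob(filepath.Join(dir, "*.log"))
		if err != nil {
			return nil, err
		}
		logs = append(logs, matches...)
	}
	return logs, nil
}

func main() {
	workdirs := []string{"wd-zeek", "wd-suricata"}
	cmds := []string{"zeek-wrapper.sh", "suricata-wrapper.sh"}
	if err := runAnalyzers("wrccdc.pcap", cmds, workdirs); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	logs, err := readLogs(workdirs)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("logs to load:", logs)
}

Because no log reader is attached while the analyzer processes are still running, an analyzer that errors out mid-run comes back as an ordinary error from phase 1 rather than hanging the pipeline, the kind of failure mode described in #71.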

@mattnibs mattnibs requested review from nwt, mccanne and philrz June 29, 2021 15:30
@mattnibs mattnibs force-pushed the analyze-refactor branch 5 times, most recently from e0c8e4b to 5713d85 Compare June 30, 2021 15:15
@philrz
Contributor

philrz commented Jun 30, 2021

I just tested this branch as of commit 5713d85 and it seems effective at addressing the deadlock symptom I reported in #71. I did see something else unexpected, which @mattnibs said he'll respond to.

The repro setup is similar to what's described in #71: my custom Zeek and Suricata are installed with the wrapper scripts, one of which depends on jq, which is not installed. I did have to update my config YAML to conform to the new Zed syntax in the shaper.

$ cat zeek-suricata-new.yml
analyzers:
  - cmd: zeek-wrapper.sh
  - cmd: suricata-wrapper.sh
    globs: ["deduped*.json"]
    shaper: |
      type alert = {
        timestamp: time,
        event_type: bstring,
        src_ip: ip,
        src_port: port=(uint16),
        dest_ip: ip,
        dest_port: port=(uint16),
        vlan: [uint16],
        proto: bstring,
        app_proto: bstring,
        alert: {
          severity: uint16,
          signature: bstring,
          category: bstring,
          action: bstring,
          signature_id: uint64,
          gid: uint64,
          rev: uint64,
          metadata: {
            signature_severity: [bstring],
            former_category: [bstring],
            attack_target: [bstring],
            deployment: [bstring],
            affected_product: [bstring],
            created_at: [bstring],
            performance_impact: [bstring],
            updated_at: [bstring],
            malware_family: [bstring],
            tag: [bstring]
          }
        },
        flow_id: uint64,
        pcap_cnt: uint64,
        tx_id: uint64,
        icmp_code: uint64,
        icmp_type: uint64,
        tunnel: {
          src_ip: ip,
          src_port: port=(uint16),
          dest_ip: ip,
          dest_port: port=(uint16),
          proto: bstring,
          depth: uint64
        },
        community_id: bstring
      }
      filter event_type=="alert" | put . := shape(alert) | rename ts:=timestamp

$ brimcap -version
Version: v0.0.5-19-g5713d85

$ brimcap analyze -config zeek-suricata-new.yml wrccdc.pcap > /dev/null
{"type":"error","error":"zeek-wrapper.sh: WorkDir not set; suricata-wrapper.sh: WorkDir not set"}

That was the one unexpected speedbump I hit: I'd created this config YAML back when an explicit workdir setting was not strictly required. The changes in this PR make it look like it now is, so I created the directories manually and added the settings.

$ cat repro.yml 
analyzers:
  - cmd: zeek-wrapper.sh
    workdir: wd-zeek
  - cmd: suricata-wrapper.sh
    workdir: wd-suricata
    globs: ["deduped*.json"]
    shaper: |
      type alert = {
        timestamp: time,
        event_type: bstring,
        src_ip: ip,
        src_port: port=(uint16),
        dest_ip: ip,
        dest_port: port=(uint16),
        vlan: [uint16],
        proto: bstring,
        app_proto: bstring,
        alert: {
          severity: uint16,
          signature: bstring,
          category: bstring,
          action: bstring,
          signature_id: uint64,
          gid: uint64,
          rev: uint64,
          metadata: {
            signature_severity: [bstring],
            former_category: [bstring],
            attack_target: [bstring],
            deployment: [bstring],
            affected_product: [bstring],
            created_at: [bstring],
            performance_impact: [bstring],
            updated_at: [bstring],
            malware_family: [bstring],
            tag: [bstring]
          }
        },
        flow_id: uint64,
        pcap_cnt: uint64,
        tx_id: uint64,
        icmp_code: uint64,
        icmp_type: uint64,
        tunnel: {
          src_ip: ip,
          src_port: port=(uint16),
          dest_ip: ip,
          dest_port: port=(uint16),
          proto: bstring,
          depth: uint64
        },
        community_id: bstring
      }
      filter event_type=="alert" | put . := shape(alert) | rename ts:=timestamp

$ brimcap analyze -config repro.yml wrccdc.pcap > /dev/null
{"type":"status","ts":{"sec":1625068592,"ns":361491935},"pcap_read_size":65536,"pcap_total_size":500001372,"records_written":0}
...
{"type":"error","error":"suricata-wrapper.sh exited with code 127\nstdout:\n30/6/2021 -- 08:56:31 - \u003cNotice\u003e - This is Suricata version 6.0.2 RELEASE running in USER mode\n30/6/2021 -- 08:56:56 - \u003cNotice\u003e - all 5 packet processing threads, 4 management threads initialized, engine started.\n30/6/2021 -- 08:57:32 - \u003cNotice\u003e - Signal Received.  Stopping engine.\n30/6/2021 -- 08:57:43 - \u003cNotice\u003e - Pcap-file module read 1 files, 1650919 packets, 473586644 bytes\nstderr:\n/usr/local/bin/suricata-wrapper.sh: line 3: jq: command not found\n"} 

On repeated runs, I got this appropriate failure message 5 out of 5 times. By comparison, when I used the same repro steps against the current tip of main (commit 6406f89), it deadlocked 4 times and returned a failure message only once.

As for the speedbump, looking at the changes in the linked PR, it looks like making the workdir setting required may be intentional. If it is, I'd probably need to go update some wiki articles, so we should talk about the motivation for the change.

Resolved review threads:
analyzer/config.go
analyzer/processes.go (outdated, 4 threads)
analyzer/reader.go (outdated)
cli/analyzecli/display.go (outdated)
cmd/brimcap/ztests/analyze-combiner-error.yaml (outdated)
analyzer/analyzer.go (outdated, 2 threads)
mattnibs and others added 4 commits June 30, 2021 13:32
Co-authored-by: Noah Treuhaft <noah.treuhaft@gmail.com> (all 4 commits)
@philrz
Contributor

philrz commented Jun 30, 2021

I just re-ran my test with the newer commit 085a264 on this branch and confirmed that it once again runs without requiring an explicit workdir setting. Thanks @mattnibs!
