Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parser for flawfinder #31

Merged
merged 1 commit into from
Apr 18, 2017

Conversation

davidCarlos
Copy link
Contributor

I'm still working on this pull request (missing tests). soon i will remove the 'WIP' tag.

weakness_messages.append(message)
# TODO: Flawfinder can returns more than one CWE,
# when this rappends, get_cwe cannot return a valid value
# Fix this.
Copy link
Member

@davidmalcolm davidmalcolm Mar 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support more than one CWE per issue in the schema? Currently <cwe> is optional within <issue>; should it be zeroOrMore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to ask this. I don't know what the best pratice in this case. When flawfinder returns more than one CWE, what should i do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post some examples of what the output looks like?

My current opinion is that we should change the schema and model.py to support multiple CWEs per issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

./docs/examples/rtsp.c:170: [4] (buffer) sscanf:
The scanf() family's %s operation, without a limit specification, permits
buffer overflows (CWE-120, CWE-20). Specify a limit to %s, or use a
different input function.
./docs/examples/synctime.c:155: [4] (buffer) sscanf:
The scanf() family's %s operation, without a limit specification, permits
buffer overflows (CWE-120, CWE-20). Specify a limit to %s, or use a
different input function.

In my opinion, Add multiple CWEs would generate a more realistic output, once one warning could be related to more than one CWE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can send another PR doing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll update the schema accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will first make this change in models, then i finish the flawfinder parser.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidmalcolm I have a doubt. Projects that already use firehose, and pass to Issue model a single cwe using a integer value. Should we maintain the compatibility? (Allows de user to pass a integer or a list of integers). Or we keep only a list of integers?

Copy link
Contributor Author

@davidCarlos davidCarlos Mar 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change impacts the others parsers too.

@davidmalcolm
Copy link
Member

As noted in #33 we don't have a good design yet for how to handle multiple CWEs in one issue.

For now, maybe have the parser capture the first CWE (if any), and note the limitation as a bug, marking it as a dependency of #33.

Thanks.

@davidCarlos davidCarlos force-pushed the flawfinder branch 2 times, most recently from 76666dd to d10b185 Compare April 5, 2017 02:27
@davidCarlos davidCarlos changed the title [WIP]: Add parser to flawfinder Add parser to flawfinder Apr 5, 2017
@davidCarlos
Copy link
Contributor Author

Hey @davidmalcolm , i have updated this PR. I added some tests, and have included flawfinder in the documentation. Now the parser capture the first CWE, as we had discussed. I created the issue #35, to map the multiple cwes problem.

Copy link
Member

@davidmalcolm davidmalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; this is looking good. I have some concerns about version-handling, and would prefer it if the parser captured the severity of issues (and maybe categories also).

Sorry if this seems nit-picky.

"""

generator = Generator(name='flawfinder',
version=flawfinder())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that a Generator's version can be None. It looks like the actual version is embedded in the first line of the output:

Flawfinder version 1.31, (C) 2001-2014 David A. Wheeler.

so presumably parse_file should be looking for lines like that and parsing them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now i get the version from the Flawfinder output.

def get_cwe(line):
"""Extract CWE from flawfinder alarms
It's possible that a alarm not have a correlated CWE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have a comment here referencing issue #35, and that it's just the first CWE if there's more than one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

try:
return check_output(["flawfinder", "--version"]).strip()
except Exception:
return '1.31'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, I'd much prefer it if parse_file parsed the version header in the output.

If the output file doesn't contain a version string, then I think it's better for the parser to leave the Generator's version as None, rather than trying to scrape it from flawfinder --version (or guessing). Rationale: I use firehose in mock-with-analysis, which does the analysis in a chroot. There might be a different version of flawfinder in the chroot compared to what's in the host $PATH.

So I think it's better to get rid of this function and do the version-handling in parse_file.

line = loaded_file.readline()

counter = 0
for weakness in weakness_paths:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than manually incrementing counter, could this code use Python's enumerate builtin? (and maybe use idx rather then counter?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


issue = Issue(cwe, None, location,
Message(text=weakness_messages[counter]), notes=None,
trace=None, severity=None, customfields=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re severity: Flawfinder's docs say:

The risk level is shown inside square brackets and varies from 0, very little risk, to 5, great risk.

It would be good to capture that number as Issue.severity (as a string, without the square brackets):
http://firehose.readthedocs.io/en/latest/data-model.html#firehose.model.severity

else:
cwe = int(weakness_cwes[counter])

issue = Issue(cwe, None, location,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The None here is the Issue.testid:
http://firehose.readthedocs.io/en/latest/data-model.html#firehose.model.testid

Perhaps the (shell) system: stuff should be captured here. If I'm reading the Flawfinder docs correctly:

Each entry has a filename, a colon, a line number, a
risk level in brackets (where 5 is the most risky), a category, the name of the function, and a description of

....then Flawfinder's "category" here is "shell" and the name of the function is "system". I'm not yet sure how that should map to a firehose Issue.testid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidmalcolm how should we handle this? Apprently flawfinder does not have a testid value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use flawfinder's "category" here for the FIrehose "testid"? Looking at the sample output, that would give values like "shell", "format", "buffer", "race", which seems sane.

@davidmalcolm davidmalcolm changed the title Add parser to flawfinder Add parser for flawfinder Apr 5, 2017
@davidCarlos
Copy link
Contributor Author

Hey @davidmalcolm , i updated this PR with the last revision.

Copy link
Member

@davidmalcolm davidmalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

I've got a few more nitpicks (sorry), covering code I had trouble following.


:arg1: file passed by cli
:returns: file loaded in memory
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually load a file into memory; it merely returns a file object representing a file opened for reading. The code would be clearer if you got rid of this function and replace users with just the open builtin.

""" Parser loaded flawfinder output

:loaded_file: flawfinder report, loaded in memory
:returns: Firehose Analysis object, representing the final XML.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"loaded_file" is actually a file-like object, not an in-memory string. Please rename to "infile", or similar ("input" is a builtint IIRC)

data["message"] = message;
line = message_line
flawfinder_data.append(data)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding it hard to follow this two-stage parsing. If I'm reading things right, the code first reads the lines, building up a list of dicts (flawfinder_data). There's then a second loop over that list, creating Issue objects and adding them to analysis.results.

Wouldn't it be much simpler to have one loop, and append Issue objects directly to analysis.results, without the data dict i.e. work directly with an Issue object instead.

Hope that makes sense; I may have misread this code.

int(line[cwe_first_int_char])
cwe_first_int_char += 1
except Exception:
cwe = line[(cwe_init_str_pos + 4):(cwe_first_int_char)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be much simpler with a regex e.g.

 re.findall(line, "CWE-([0-9]+)")

or somesuch? (which I think would give a list of tuples of strings).

Could then rename the function to find_cwes and have it return a list of ints, and do the dropping of multiples in the caller.

Can also write a unit test for the CWE parser that way.



def flawfinder(first_line):
"""Retrieve flawfinder version from report.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the current name to be unclear; please rename to parse_first_line (as that's what it does).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to get_flawfinder_version should be better, what do you think @davidmalcolm ?


class TestParseXml(unittest.TestCase):
def parse_example(self, filename):
report = load_file(os.path.join(os.path.dirname(__file__),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that it's a file-like object, maybe turn into:

path = os.path.join(...)
with open(path) as f:
    return parse_file (f)

@davidCarlos davidCarlos force-pushed the flawfinder branch 2 times, most recently from 863d9a5 to 8bf2e8a Compare April 14, 2017 23:42
@davidCarlos
Copy link
Contributor Author

@davidmalcolm thanks for the great revision, i had updated the PR.

@davidCarlos davidCarlos force-pushed the flawfinder branch 3 times, most recently from 4c64931 to 65ac128 Compare April 15, 2017 00:12
Copy link
Member

@davidmalcolm davidmalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated version.

This is looking much better, but I have a few more nitpicks (sorry).

arg_file = sys.argv[1]
report = open(arg_file, 'r')
except IOError:
print("File note found")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "note" should be "not".
But an IOError is not necessarily "file not found"; it could be a permissions error, or an attempt to open a directory rather than a file, etc. Maybe just lose this clause?

(prog.search(message_line) and prog2.search(message_line)):
issue_message += " " + message_line
try:
cwe = re.findall("CWE-([0-9]+)", message_line)[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like using try/except for parsing; I prefer to use non-exception program flow (rationale: sometimes we end up catching a different exception than the one we thought we were, so it can conceal bugs).

Anything that matches the ([0-9]+) is known to be convertible to int, so how about:

# list of ints
cwes = []

before the "while message_line" loop, and:

cwes.extend([int(m.group(0))
             for m in re.findall("CWE-([0-9]+)", message_line)])

here, and then after the "message_line" loop:

if cwes:
   first_cwe = cwes[0]
else
   first_cwe = None

to avoid the two try/except clauses.

Please reinstate the comment about issue #35

try:
testid = re.findall(r"\(([a-z]+)\)", line)[0]
except IndexError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just want the first match, just use re.search, which returns None for the "not-found" case ( https://docs.python.org/3.4/library/re.html#re.search ).

@davidCarlos
Copy link
Contributor Author

@davidmalcolm I had updated the PR

Copy link
Member

@davidmalcolm davidmalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your improvements here.

Ironically, with the code cleanups you've done, you've simplified things enough that I can now see a couple of other issues that were hiding.

testid = re.search("\(([a-z]+)\)", line).group(1)
issue_path = line.split(":")[0]
issue_line = line.split(":")[1]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the earlier cleanups; the code is clearer now.

With that, the handling of prog, prog2 and the testid looks a bit unusual; presumably these always appear on the same line, in a consistent way. Given that, I think it would be improved by using a single regular expression; something like:

# We want to match a line like:
#    ./docs/examples/asiohiper.cpp:78:  [4] (shell) system:
filename_group = "(.+)"
linenum_group = "([0-9]+)
severity_group = "\[([0-9])\]"
category_group = "\((.+)\)"
funcname_group = "(\S+)"
ws = '\s+'
PAT= (filename_group + ':' + linenum_group + ':' + ws + severity_group + ws
      + category_group + ws + funcname_group)

It should then be possible to do:

m = re.match(PAT, message_line)
if m:
  # do stuff with m.groups()


issue = Issue(first_cwe, testid, location,
Message(text=issue_message), notes=None,
trace=None, severity=None, customfields=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set the severity field to the number in square parens (keep it as a string, rather than calling int on it).

cwes = []
while message_line != "\n" and not \
(prog.search(message_line) and prog2.search(message_line)):
issue_message += " " + message_line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could reuse the PAT from above here.

	- Add only one cwe in Firehose report
	- Retrieve flawfinder version from report.
	- Use enumerate instead of counter.
	- Add tests
	- Add comment about multiple cwes.
	- Fix regex.
@davidCarlos
Copy link
Contributor Author

@davidmalcolm I had updated the PR

Copy link
Member

@davidmalcolm davidmalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates; this looks good.

@davidmalcolm davidmalcolm merged commit b912b9f into fedora-static-analysis:master Apr 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants