-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Python: Improve sensitive data modeling #6013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
No longer using points-to 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single question about a type tracker, otherwise this looks good 👍
SensitiveFunctionCall() { | ||
this.getFunction() = sensitiveFunction(classification) | ||
or | ||
nameIndicatesSensitiveData(this.getFunction().asCfgNode().(NameNode).getId(), classification) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this is to cover functions for which we do not have a definition?
Should this case not be added to be base of the type tracker instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this is to cover functions for which we do not have a definition?
Yep 👍
Should this case not be added to be base of the type tracker instead?
That would make it more powerful than this syntactic approach. Then you would be able to handle ref = getPassword; ref()
where we don't have a definition for getPassword
, but you would not be able to handle ref = foo.getPassword; ref()
😞
So this got me started thinking of how to handle this properly, and what things we currently wouldn't handle. See c341643 and ea0c1d7
* Note: We _could_ make any access to a variable with a sensitive name a source of | ||
* sensitive data, but to make path explanations in data-flow/taint-tracking good, | ||
* we don't want that, since it works against allowing users to understand the flow | ||
* in the program (which is the whole point). | ||
* | ||
* Note: To make data-flow/taint-tracking work, the expression that is _assigned_ to | ||
* the variable is marked as the source (as compared to marking the variable as the | ||
* source). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
The comment about imports was placed wrong. I also realized we didn't even have a single test-case for `this.(DataFlow::AttrRead).getAttributeNameExpr() = sensitiveLookupStringConst(classification)` so I added that (notice that this is only `getattr(foo, x)` and not `getattr(foo, "password")`)
This will enable better tests in just one second
This solution was the best I could come up with, but it _is_ a bit brittle since you need to remember to add this additional taint step to any configuration that relies on sensitive data sources... I don't see an easy way around this though :|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern, which may prompt a performance check. I like that you try to solve the problem of error reporting up front.
predicate extraStepForCalls(DataFlow::Node nodeFrom, DataFlow::CallCfgNode nodeTo) { | ||
nodeTo.getFunction() = nodeFrom | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not restrict nodeFrom
here? (say, in the manner you describe above). It seems to be adding a lot of edges otherwise...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha! Good point. This predicate will end up being quite big. I guess we could use the type-tracking approach to restrict nodeFrom
here 👍
Now there is a path from the _imports_ of the functions that would return sensitive data, so we produce more alerts. I'm not entirely happy about this "double reporting", but I'm not sure how to get around it without either: 1. disabling the extra taint-step for calls. Not ideal since we would loose good sources. 2. disabling the extra sources based on function name. Not ideal since we would loose good sources. 3. disabling the extra sources based on function name, for those calls that would be handled with the extra taint-step for calls. Not ideal since that would require running the data-flow query initially to prune these out :| So for now, I think the best approach is to accept some risk on this, and ship to learn :)
New results from last commit will look like: I've added some reasoning to dee9378 on why I think this is still an OK approach, although I'm not super happy about it. If you have other ideas, please do tell 😊 |
On django/django, this reduced the number of results in `extraStepForCalls` from 201,283 to 541
I think it would be good to do a small performance tests... I'll start one soon 👍 EDIT: follow along at https://github.com/dsp-testing/RasmusWL-dca/issues/20 |
Evaluation looks good 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for limiting the for
step and for the performance check. I do not really have a good solution for the double reporting, I think we just need to accept for now that we have multiple reasons for flagging certain results.
and also port it away from using points-to 🎉
With this extended modeling, we are able to flag up this little example https://github.com/anxolerd/dvpwa/blob/b11d0415f86cc2285158d2f07c81cd9777d8fffb/sqli/dao/user.py#L40-L41 (with
py/weak-sensitive-data-hashing
)