Skip to content

Python: Improve sensitive data modeling #6013

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 15, 2021

Conversation

RasmusWL
Copy link
Member

@RasmusWL RasmusWL commented Jun 4, 2021

and also port it away from using points-to 🎉

With this extended modeling, we are able to flag up this little example https://github.com/anxolerd/dvpwa/blob/b11d0415f86cc2285158d2f07c81cd9777d8fffb/sqli/dao/user.py#L40-L41 (with py/weak-sensitive-data-hashing)

@RasmusWL RasmusWL requested a review from a team as a code owner June 4, 2021 13:37
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single question about a type tracker, otherwise this looks good 👍

SensitiveFunctionCall() {
this.getFunction() = sensitiveFunction(classification)
or
nameIndicatesSensitiveData(this.getFunction().asCfgNode().(NameNode).getId(), classification)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is to cover functions for which we do not have a definition?
Should this case not be added to be base of the type tracker instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is to cover functions for which we do not have a definition?

Yep 👍

Should this case not be added to be base of the type tracker instead?

That would make it more powerful than this syntactic approach. Then you would be able to handle ref = getPassword; ref() where we don't have a definition for getPassword, but you would not be able to handle ref = foo.getPassword; ref() 😞

So this got me started thinking of how to handle this properly, and what things we currently wouldn't handle. See c341643 and ea0c1d7

Comment on lines +118 to +125
* Note: We _could_ make any access to a variable with a sensitive name a source of
* sensitive data, but to make path explanations in data-flow/taint-tracking good,
* we don't want that, since it works against allowing users to understand the flow
* in the program (which is the whole point).
*
* Note: To make data-flow/taint-tracking work, the expression that is _assigned_ to
* the variable is marked as the source (as compared to marking the variable as the
* source).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

RasmusWL added 6 commits June 10, 2021 14:09
The comment about imports was placed wrong. I also realized we didn't
even have a single test-case for
`this.(DataFlow::AttrRead).getAttributeNameExpr() = sensitiveLookupStringConst(classification)`
so I added that (notice that this is only `getattr(foo, x)` and not
`getattr(foo, "password")`)
This will enable better tests in just one second
This solution was the best I could come up with, but it _is_ a bit
brittle since you need to remember to add this additional taint step
to any configuration that relies on sensitive data sources... I don't
see an easy way around this though :|
@RasmusWL RasmusWL requested a review from yoff June 11, 2021 08:54
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern, which may prompt a performance check. I like that you try to solve the problem of error reporting up front.

Comment on lines 149 to 151
predicate extraStepForCalls(DataFlow::Node nodeFrom, DataFlow::CallCfgNode nodeTo) {
nodeTo.getFunction() = nodeFrom
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not restrict nodeFrom here? (say, in the manner you describe above). It seems to be adding a lot of edges otherwise...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! Good point. This predicate will end up being quite big. I guess we could use the type-tracking approach to restrict nodeFrom here 👍

Now there is a path from the _imports_ of the functions that would
return sensitive data, so we produce more alerts.

I'm not entirely happy about this "double reporting", but I'm not sure
how to get around it without either:

1. disabling the extra taint-step for calls. Not ideal since we would
   loose good sources.
2. disabling the extra sources based on function name. Not ideal since
   we would loose good sources.
3. disabling the extra sources based on function name, for those calls
   that would be handled with the extra taint-step for calls. Not ideal
   since that would require running the data-flow query initially to
   prune these out :|

So for now, I think the best approach is to accept some risk on this,
and ship to learn :)
@RasmusWL
Copy link
Member Author

New results from last commit will look like:

image

I've added some reasoning to dee9378 on why I think this is still an OK approach, although I'm not super happy about it. If you have other ideas, please do tell 😊

On django/django, this reduced the number of results in
`extraStepForCalls` from 201,283 to 541
@RasmusWL
Copy link
Member Author

RasmusWL commented Jun 14, 2021

I think it would be good to do a small performance tests... I'll start one soon 👍 EDIT: follow along at https://github.com/dsp-testing/RasmusWL-dca/issues/20

@RasmusWL
Copy link
Member Author

Evaluation looks good 👍

@RasmusWL RasmusWL requested a review from yoff June 15, 2021 11:18
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for limiting the for step and for the performance check. I do not really have a good solution for the double reporting, I think we just need to accept for now that we have multiple reasons for flagging certain results.

@yoff yoff merged commit b19d64f into github:main Jun 15, 2021
@RasmusWL RasmusWL deleted the sensitive-improvements branch June 16, 2021 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants