Python: Improve sensitive data modeling #6013

RasmusWL · 2021-06-04T13:37:28Z

and also port it away from using points-to 🎉

With this extended modeling, we are able to flag up this little example https://github.com/anxolerd/dvpwa/blob/b11d0415f86cc2285158d2f07c81cd9777d8fffb/sqli/dao/user.py#L40-L41 (with py/weak-sensitive-data-hashing)

No longer using points-to 🎉

yoff

A single question about a type tracker, otherwise this looks good 👍

yoff · 2021-06-10T09:30:18Z

python/ql/src/semmle/python/dataflow/new/SensitiveDataSources.qll

+    SensitiveFunctionCall() {
+      this.getFunction() = sensitiveFunction(classification)
+      or
+      nameIndicatesSensitiveData(this.getFunction().asCfgNode().(NameNode).getId(), classification)


I suppose this is to cover functions for which we do not have a definition?
Should this case not be added to the base of the type tracker instead?

I suppose this is to cover functions for which we do not have a definition?

Yep 👍

Should this case not be added to the base of the type tracker instead?

That would make it more powerful than this syntactic approach. Then you would be able to handle `ref = getPassword; ref()` where we don't have a definition for `getPassword`, but you would not be able to handle `ref = foo.getPassword; ref()` 😞

So this got me started thinking of how to handle this properly, and what things we currently wouldn't handle. See c341643 and ea0c1d7
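To make the limitation of the purely syntactic check concrete, here is a toy Python analogue built on the standard `ast` module. It is only a sketch, not the CodeQL implementation: the `SENSITIVE_NAME` regex and the function `syntactic_sensitive_calls` are hypothetical names introduced for illustration.

```python
import ast
import re

# Hypothetical heuristic: any identifier containing "password" is sensitive.
SENSITIVE_NAME = re.compile(r"password", re.IGNORECASE)


def syntactic_sensitive_calls(source: str) -> list[str]:
    """Return the text of calls whose callee is a bare, sensitive-looking name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and SENSITIVE_NAME.search(node.func.id)
        ):
            hits.append(ast.unparse(node))
    return hits


# The three call shapes discussed in this thread.
EXAMPLE = """
a = getPassword()
ref = getPassword
b = ref()
ref2 = foo.getPassword
c = ref2()
"""

print(syntactic_sensitive_calls(EXAMPLE))  # prints ['getPassword()']
```

Only the direct call is flagged. Both `ref()` and `ref2()` are missed by a name-only check, which is the gap the comments above are about.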

yoff · 2021-06-10T10:03:14Z

python/ql/src/semmle/python/dataflow/new/SensitiveDataSources.qll

+   * Note: We _could_ make any access to a variable with a sensitive name a source of
+   * sensitive data, but to make path explanations in data-flow/taint-tracking good,
+   * we don't want that, since it works against allowing users to understand the flow
+   * in the program (which is the whole point).
+   *
+   * Note: To make data-flow/taint-tracking work, the expression that is _assigned_ to
+   * the variable is marked as the source (as compared to marking the variable as the
+   * source).


Good point.
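As a rough illustration of the second note in that comment, the following Python sketch marks the expression assigned to a sensitive-named variable as the source, rather than every later read of the variable. It is a toy built on `ast`, not the QL model; `SENSITIVE_NAME`, `sensitive_assignment_sources`, and the analyzed snippet are assumptions made for the example.

```python
import ast
import re

# Hypothetical heuristic for sensitive variable names.
SENSITIVE_NAME = re.compile(r"password|secret", re.IGNORECASE)


def sensitive_assignment_sources(source: str) -> list[tuple[str, str]]:
    """Return (variable name, assigned expression) for sensitive-named targets."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and SENSITIVE_NAME.search(target.id):
                # The assigned expression is the source, not the variable reads.
                results.append((target.id, ast.unparse(node.value)))
    return results


EXAMPLE = """
password = read_input()
digest = md5(password)
log(password)
"""

print(sensitive_assignment_sources(EXAMPLE))  # prints [('password', 'read_input()')]
```

With a single source at `read_input()`, both later uses of `password` would be reached by flow from one place, which keeps the path explanation short.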

The comment about imports was placed wrong. I also realized we didn't even have a single test-case for `this.(DataFlow::AttrRead).getAttributeNameExpr() = sensitiveLookupStringConst(classification)` so I added that (notice that this is only `getattr(foo, x)` and not `getattr(foo, "password")`)
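For readers unfamiliar with the distinction, the two `getattr` forms look like this in Python. The `Account` class and its value are hypothetical stand-ins; how the literal form is modeled is not stated in this comment, so the sketch only shows the shapes.

```python
class Account:
    """Hypothetical object with a sensitive attribute."""

    def __init__(self) -> None:
        self.password = "hunter2"


account = Account()

# Form the new test-case targets: the attribute name is held in a variable,
# so the name expression must be traced back to the string constant.
attr_name = "password"
via_variable = getattr(account, attr_name)

# Form the comment says is NOT what this condition is about:
# the attribute name is a string literal at the call site.
via_literal = getattr(account, "password")
```

At runtime both forms read the same attribute; the difference only matters to the static analysis.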

This will enable better tests in just one second

This solution was the best I could come up with, but it _is_ a bit brittle since you need to remember to add this additional taint step to any configuration that relies on sensitive data sources... I don't see an easy way around this though :|

yoff

One concern, which may prompt a performance check. I like that you try to solve the problem of error reporting up front.

yoff · 2021-06-11T11:09:26Z

python/ql/src/semmle/python/dataflow/new/SensitiveDataSources.qll

+  predicate extraStepForCalls(DataFlow::Node nodeFrom, DataFlow::CallCfgNode nodeTo) {
+    nodeTo.getFunction() = nodeFrom
+  }


Can we not restrict nodeFrom here? (say, in the manner you describe above). It seems to be adding a lot of edges otherwise...

Aha! Good point. This predicate will end up being quite big. I guess we could use the type-tracking approach to restrict nodeFrom here 👍
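A toy Python model of why restricting `nodeFrom` shrinks the predicate, under loud assumptions: `is_sensitive_reference` and `call_step_edges` are hypothetical helpers, and the one-level alias set is a crude stand-in for type tracking (chained aliases are not followed).

```python
import ast
import re

# Hypothetical heuristic for sensitive function names.
SENSITIVE_NAME = re.compile(r"password", re.IGNORECASE)


def is_sensitive_reference(expr: ast.expr) -> bool:
    """True for `name` or `obj.name` where the final identifier looks sensitive."""
    if isinstance(expr, ast.Name):
        return bool(SENSITIVE_NAME.search(expr.id))
    if isinstance(expr, ast.Attribute):
        return bool(SENSITIVE_NAME.search(expr.attr))
    return False


def call_step_edges(source: str, restrict: bool) -> list[str]:
    """List calls that receive a callee-to-call step, optionally restricted."""
    tree = ast.parse(source)
    # One-level alias tracking: `ref = foo.getPassword` makes `ref` tracked.
    tracked: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and is_sensitive_reference(node.value):
            tracked.update(t.id for t in node.targets if isinstance(t, ast.Name))
    edges = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        allowed = is_sensitive_reference(callee) or (
            isinstance(callee, ast.Name) and callee.id in tracked
        )
        if allowed or not restrict:
            edges.append(ast.unparse(node))
    return edges


EXAMPLE = """
ref = foo.getPassword
secret = ref()
print(len(secret))
render(secret)
"""

print(len(call_step_edges(EXAMPLE, restrict=False)))  # prints 4
print(call_step_edges(EXAMPLE, restrict=True))        # prints ['ref()']
```

The unrestricted version adds a step for every call in the program, while the restricted one keeps only calls whose callee can be traced to a sensitive-looking function reference.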

Now there is a path from the _imports_ of the functions that would return sensitive data, so we produce more alerts. I'm not entirely happy about this "double reporting", but I'm not sure how to get around it without either:

1. disabling the extra taint-step for calls. Not ideal since we would lose good sources.
2. disabling the extra sources based on function name. Not ideal since we would lose good sources.
3. disabling the extra sources based on function name, for those calls that would be handled with the extra taint-step for calls. Not ideal since that would require running the data-flow query initially to prune these out :|

So for now, I think the best approach is to accept some risk on this, and ship to learn :)

RasmusWL · 2021-06-11T12:04:57Z

New results from last commit will look like:

I've added some reasoning to dee9378 on why I think this is still an OK approach, although I'm not super happy about it. If you have other ideas, please do tell 😊

On django/django, this reduced the number of results in `extraStepForCalls` from 201,283 to 541

RasmusWL · 2021-06-14T13:11:13Z

I think it would be good to do a small performance test... I'll start one soon 👍 EDIT: follow along at https://github.com/dsp-testing/RasmusWL-dca/issues/20

RasmusWL · 2021-06-15T11:17:52Z

Evaluation looks good 👍

yoff

LGTM, thanks for limiting the extra step for calls and for the performance check. I do not really have a good solution for the double reporting; I think we just need to accept for now that we have multiple reasons for flagging certain results.

RasmusWL added 8 commits June 3, 2021 12:10

Python: Use "new" SensitiveDataHeuristics

79bef11

Python: Add sensitive data test-cases

3b68c87

Python: Port sensitive data modeling

00a71a1

No longer using points-to 🎉

Python: minor cleanup in SensitiveDataSources

d6532e2

Python: Model sensitive data from subscripts

925e67d

Python: Model sensitive data based on parameter names

f5fd0f8

Python: Model sensitive data based on variable names

350f79e

Python: Add change-note

7f119dd

RasmusWL requested a review from a team as a code owner June 4, 2021 13:37

github-actions bot added documentation Python labels Jun 4, 2021

Python: Autoformat

3819a36

yoff requested changes Jun 10, 2021

RasmusWL added 6 commits June 10, 2021 14:09

Python: Add more tests for sensitive function handling

c341643

Python: Use real config in TestSensitiveDataSources

f167143

This will enable better tests in just one second

Merge branch 'main' into sensitive-improvements

04db335

Merge branch 'main' into sensitive-improvements

3d5f379

RasmusWL requested a review from yoff June 11, 2021 08:54

yoff reviewed Jun 11, 2021

Python: limit size of extraStepForCalls predicate

d19bc12

On django/django, this reduced the number of results in `extraStepForCalls` from 201,283 to 541

RasmusWL requested a review from yoff June 15, 2021 11:18

yoff approved these changes Jun 15, 2021

yoff merged commit b19d64f into github:main Jun 15, 2021

RasmusWL deleted the sensitive-improvements branch June 16, 2021 09:14
