Skip to content

Python: Port py/weak-crypto-key to use type-tracking #5075

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Mar 18, 2021
Merged

Conversation

RasmusWL
Copy link
Member

@RasmusWL RasmusWL commented Feb 2, 2021

Draft since I will need to run this query on some real projects to verify that results are still looking good.

The major difference between this query and the old (besides type-tracking vs. points-to) is how good we are at tracking integer literals. I've used the same simple approach with local source nodes as we/I have used in the other type-tracking based modeling, but I'm guessing that since being able to correctly track integer literals is so important for this query, we might need to do something a bit more sophisticated. I'm waiting to look at query results before attacking this problem, but my idea would be to try and use some global data-flow to track things properly.

EDIT: I had to use this global data-flow approach, since results were too poor without it. The most controversial bit is that I introduced a new data-flow copy in 40c592a, to avoid re-evaluation.

EDIT2: I removed that again, since I got better performance (266s -> 176s) from using type-tracking for this instead. Type-tracking is the approach we've been using in other places of the code (at least I have). So since I'm still a bit uncertain what's the right approach, I think it's less disruptive to use the type-tracking approach for now.

Although it's only a draft now, I would still like some input on the way I ended up implementing minimumSecureKeySize in 66df9af, which is why I have requested a review already. So feel free to skip rest of this PR for now 😉 -- come to think of it, we would probably be better off discussing this in person, but since I already wrote up my problem, would be great if you could both read through my thought process @tausnb and @yoff 😊 we did this, thanks 👍

@RasmusWL
Copy link
Member Author

To deal with the changes from moving query files around, I found it easier to rebase on top of main, instead of merging.

Tests working can be verified by running

```
ls ql/python/ql/test/experimental/library-tests/frameworks/crypto*/*.py | xargs -L1 sh -c 'python $0 || exit 255'
```
I did spend some time to figure out how to best write `minimumSecureKeySize`
predicate. I wanted to write once and for all the recommended sizes for each
cryptosystem.

I considered making the predicate such as

```codeql
int minimumSecureKeySize() {
    this.getName() = "RSA" and result = 2048
    or
    this.getName() = "DSA" and result = 2048
    or
    this.getName() = "ECC" and result = 244
}
```

but then it would be impossible to add a new model without also being able to
modify the body of this predicate -- which seems like a bad way to start off a
brand new way of modeling things.

So I considered if we could add it to the non-range class, such as

```codeql
class RSAKeyGeneration extends KeyGeneration {
  RSAKeyGeneration() { this.getName() = "RSA" }

  override int minimumSecureKeySize() { result = 2048 }
}
```

This has the major problem that when you're writing the models for a new
API (and therefore extending KeyGeneration::Range), there is no way for you to
see that you need to take this extra step :| (also problem about how we should
define `minimumSecureKeySize` on `KeyGeneration` class then, since if we make it
abstract, we effectively disable the ability to refine `KeyGeneration` since any
subclass must provide an implementation.)

So, therefore I ended up with this solution ;)
* Removed backend arugment that is not required
* Added DSA constants (they are just accidentially the same as RSA right now)
* Removed FakeWeakEllipticCurve and used a real weak elliptic curve instead
instead of points-to.

Looking at query results also made me realize I didn't supply a very good
"origin" for ECC in cryptography package, so I improved that 👍 -- maybe that
sohuld have been split into multiple commits... too late :(
Since WeakCrypto always makes me think that it's about all weak crypto (like
using MD5, or completely broken ciphers such as ARC4 ro DES) and not just about
weak key generation.
after asking around, this seems to be the right approach
This was the result of an internal dicussion we had about this some time ago.
We used to handle this, but no more :(

Adding this example was inspired by looking at results differences
From looking at old results on LGTM.com, this was quite common (and those alerts
doesn't really provide value).
@RasmusWL
Copy link
Member Author

To deal with the changes from moving query files around, I found it easier to rebase on top of main, instead of merging.

and since those were reverted, I rebased once again 😅

@RasmusWL RasmusWL marked this pull request as ready for review February 22, 2021 14:04
Copy link
Contributor

@felicitymay felicitymay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi - thanks for including a change log entry.

I'm a little surprised to see a query filename change. I've no idea how commonly users include queries by path in their query suites, but wonder if this needs to be flagged beyond just a mention in the change log (potentially to CodeQL CLI, code scanning, LGTM.com and LGTM Enterprise users). What error message will users see if they are affected?

@AlonaHlobina and @sj, I'll leave the discussion about whether this needs more communication to you, since you have a much better idea of how many people this migh affect than I do.

Also, do we have any documentation on the new type-tracking approach you mention here?

Co-authored-by: Felicity Chapman <felicitymay@github.com>
@RasmusWL
Copy link
Member Author

I'm a little surprised to see a query filename change. I've no idea how commonly users include queries by path in their query suites, but wonder if this needs to be flagged beyond just a mention in the change log (potentially to CodeQL CLI, code scanning, LGTM.com and LGTM Enterprise users). What error message will users see if they are affected?

I cleared this up with @sj internally, who gave it a green light ✔️

Also, do we have any documentation on the new type-tracking approach you mention here?

We're going to work on that soon, but currently there is not any documentation for this (at least not for Python).

@felicitymay
Copy link
Contributor

Thank you for following up on those questions 😄

Like we've done for pretty much everything else. An experiment to see what this
means for query performance.
Internal evaluation showed that this didn't perform better than normal (forward)
type-tracking, but it feels more like the right approach.
@RasmusWL
Copy link
Member Author

I removed the global data-flow stuff for tracking integer literals again, since I got better performance (266s -> 176s) from using type-tracking. Type-tracking is the approach we've been using in other places of the code (at least I have). So since I'm still a bit uncertain what's the right approach, I think it's less disruptive to use the type-tracking approach for now.

As highlighted in the commit message, type back-tracking didn't yield better performance, but does feel more like the right approach.

@RasmusWL
Copy link
Member Author

Converted back to draft, to make sure we don't merge it in for the next RC branch, since performance with type-tracking might be a bit wobbly.

@RasmusWL RasmusWL marked this pull request as draft February 26, 2021 16:08
@RasmusWL RasmusWL closed this Feb 27, 2021
@RasmusWL RasmusWL deleted the crypto branch February 27, 2021 09:54
@RasmusWL RasmusWL restored the crypto branch February 27, 2021 10:27
@RasmusWL RasmusWL reopened this Feb 27, 2021
@RasmusWL
Copy link
Member Author

RasmusWL commented Mar 2, 2021

Since RC branch has been created, I'm happy to get this merged now 👍

@RasmusWL RasmusWL marked this pull request as ready for review March 2, 2021 07:51
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks great! Easy to read. I wonder about the exclusion of test code; should that really be user configurable, as in the analysis run excludes test code? Or, if we bake it into queries like this, should we have a concept for it?

Co-authored-by: yoff <lerchedahl@gmail.com>
@RasmusWL
Copy link
Member Author

I wonder about the exclusion of test code; should that really be user configurable, as in the analysis run excludes test code? Or, if we bake it into queries like this, should we have a concept for it?

I'm not 100% sure what you're asking here. But I'm guessing you're asking whether this query tries to exclude results in test code, and whether that's really a good idea -- since results in test code won't be shown by default (at least not on LGTM.com).

If that is indeed what you're asking, I'm only ignoring cases where the key-size comes from a test, since that usually isn't very interesting. I saw quite a few real projects that used this setup, and wanted to get rid of these. I also added our own test-case in https://github.com/github/codeql/pull/5075/files#diff-b3091cad1d47a897a2e25187124c5ce95545ad23d6054b9afbc5de4c628b6d7f

@RasmusWL RasmusWL requested a review from yoff March 18, 2021 10:05
@yoff
Copy link
Contributor

yoff commented Mar 18, 2021

I see, it is a specific part of the logic that is excluded, so only the suggestion with a concept makes sense. However TestScope almost is that already. So I am happy to leave it as is for now, and if we find the need to exclude bits in test code more often in the future, we can make it more convenient to do so.

Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants