Skip to content

Py: refactor regex tracking to type-trackers #12552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 11, 2023

Conversation

erik-krogh
Copy link
Contributor

@erik-krogh erik-krogh commented Mar 16, 2023

Two main things are happening in this PR.
1: Use type-tracking instead of global dataflow to track string constants to where they are used as regular expressions .
A similar change in Ruby improved performance A LOT: #11879
But the library is simpler here for two reasons: no regex literals, and less implicit regex executions.
2: Refactor lots of code, split code into more files, rename some concepts, and thereby make the structure very similar to Ruby.

I figured out things a bit as I went, and that shows in the commits.
There are ugly hacks and TODO comments along the way, but each commit should work on their own.


I did not like the RegexString interface.
A natural separation of responsibilities is that the library models are responsible for saying which values are interpreted as regular expressions. But with RegexString they instead had to specify which string-constants are regular expressions.
And with that interface it's very hard to get tracking of string-constants to where they are used as regular expressions (unless implemented manually in each library model).

Also, the difference between RegexString and Regex was very confusing with the previous API.

I've replaced it with RegExpInterpretation (taken from Ruby), where the library models are instead responsible for specifying which dataflow-nodes are used as regular expressions, and we get tracking of constants to regular expressions for free.
This also allowed for a much cleaner implementation in RegExpTracking.qll (see the regSink predicate).


Evaluation looks OK. 2% gained performance on code-scanning in this evaluation.
Two lost results, and they could be re-gained from adding array-related steps to type-tracking.

@erik-krogh erik-krogh force-pushed the py-type-trackers branch 2 times, most recently from 025b49d to 33c8069 Compare March 17, 2023 16:43
@erik-krogh erik-krogh force-pushed the py-type-trackers branch 4 times, most recently from e8ab4fe to 9423df1 Compare March 21, 2023 09:45
@erik-krogh erik-krogh marked this pull request as ready for review March 21, 2023 10:55
@erik-krogh erik-krogh requested a review from a team as a code owner March 21, 2023 10:55
@erik-krogh erik-krogh added the no-change-note-required This PR does not need a change note label Mar 23, 2023
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but think this could use the eyes of our local RegExp expert @yoff as well 👀

I see that we already have RegExpInterpretation in Ruby, but I don't quite feel it matches how we usually approach these things with concepts in Python. Have you considered adding it as a top-level concept next to RegexExecution? This matches pretty well what we currently do for SQLConstruction/XPathConstruction (we could even use the name RegexConstruction).

I've also been wondering if we could streamline the way flags are handled a bit. In particular, I not happy about the split between where RegexExecution is modeled, and the flags is modeled -- I found that the re_member_flags_arg doesn't have information for fullmatch which is included in the RegexExecution modeling, but it's not apparent without looking at the code side by side.

I think it would be natural to add a getFlagsArg member predicate to the RegexExecution concept (and potential RegexConstruction). This can be used when tracking the flags argument back to the modes (FindRegexMode::getAMode).

Personally I think it would make sense to expose the final mode (MULTILINE/DOTALL) they represent on the concept itself (Like we do for the tracking of mime type in the concept for HTTP responses), but I'm a little indifferent on that part 🤷

Comment on lines 19 to 20
result.asExpr() instanceof Bytes or
result.asExpr() instanceof Unicode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result.asExpr() instanceof Bytes or
result.asExpr() instanceof Unicode
result.asExpr() instanceof StrConst

Comment on lines 43 to 53
/**
* Gets a reference to a string that reaches any regular expression execution.
* This is the forwards exploratory phase of the analysis.
*/
private DataFlow::TypeTrackingNode forwards(DataFlow::TypeTracker t) {
t.start() and
result = backwards(DataFlow::TypeBackTracker::end()) and
result.flowsTo(strStart())
or
exists(DataFlow::TypeTracker t2 | result = forwards(t2).track(t2, t)) and
result = backwards(_)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need the forward part of the first phase.

Specifically I don't understand that part about the .flowsTo(strStart()), will the following QL ever have results? (my assumption is no, since I assume that any StrConst that can flow to a regex will also be a LocalSourceNode)

result = backwards(DataFlow::TypeBackTracker::end()) and
result.flowsTo(strStart()) and
not result = strStart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the following QL ever have results?

You're right, that QL will not have results.
All it means is that the flowsTo call was redundant, the code still worked.
I've refactored it a bit.

Comment on lines +1 to +11
/**
* Provides predicates that track strings to where they are used as regular expressions.
* This is implemented using TypeTracking in two phases:
*
* 1: An exploratory backwards analysis that imprecisely tracks all nodes that may be used as regular expressions.
* The exploratory phase ends with a forwards analysis from string constants that were reached by the backwards analysis.
* This is similar to the exploratory phase of the JavaScript global DataFlow library.
*
* 2: A precise type tracking analysis that tracks constant strings to where they are used as regular expressions.
* This phase keeps track of which strings and regular expressions end up in which places.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this setup give better performance than the simple solution below?

private DataFlow::TypeTrackingNode regExpBacktracker(
  DataFlow::TypeBackTracker t, DataFlow::Node regExpUse
) {
  t.start() and
  regExpUse = regSink() and
  result = regExpUse.getALocalSource()
  or
  exists(DataFlow::TypeBackTracker t2 | result = regExpBacktracker(t2, regExpUse).backtrack(t2, t))
}

DataFlow::Node regExpSourceSimple(DataFlow::Node re) {
  re = regSink() and
  result = regExpBacktracker(DataFlow::TypeBackTracker::end(), re) and
  result = strStart()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. At least it did when I wrote the same thing for Ruby.

Your simple solution will have more tuples because the regExpBacktracker predicate could track many paths that never reach any string-constants.
And I the number of tuples is low in the exploratory phase because it doesn't keep track of which source/sink flow came from.

It's the same basic idea in the first phase of the global dataflow library (at least in JavaScript, I assume the shared dataflow library does some of the same).

@@ -0,0 +1,1074 @@
/**
* Library for parsing for Python regular expressions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT

Suggested change
* Library for parsing for Python regular expressions.
* Library for parsing Python regular expressions.

DataFlow::Node use;

RegExp() {
(this instanceof Bytes or this instanceof Unicode) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(this instanceof Bytes or this instanceof Unicode) and
this instanceof StrConst and

Comment on lines 35 to 36
this =
API::moduleImport("re").getMember(any(string name | name != "escape")).getACall().getArg(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to suggest we remove this class altogether, because RegExpInterpretation is always used together with any(Concepts::RegexExecution exec).getRegex(), but I can see that if a compiled regex is never executed , it will not become a RegexExecution 🤔 (this can also happen if our tracking of it fails to determine it reaches an execution)

So how about the following?

Suggested change
this =
API::moduleImport("re").getMember(any(string name | name != "escape")).getACall().getArg(0)
this =
API::moduleImport("re").getMember("compile").getACall().getParameter(0, "pattern").asSink()

Also, could we move this into Stdlib.qll next to the other regex modeling?

@RasmusWL RasmusWL requested a review from yoff March 23, 2023 13:25
@RasmusWL
Copy link
Member

Merged in main to bring in fixes for broken ql tests.

@erik-krogh
Copy link
Contributor Author

I think I've addressed your review comments.

Except for the parsing of flags, which I feel belongs in some followup work instead.

@erik-krogh erik-krogh requested a review from RasmusWL March 24, 2023 11:20
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for doing all this cleanup 👍 I think it's ok to leave the flags handling for follow up work. Do you agree that this could be a nice thing to do?

RegExpInterpretation 🚲🧃🥶

I want to do a little bit of bike-shedding around the name RegExpInterpretation, since we have RegexExecution in Concepts.qll already. I think it would be nice to have consistent naming.

After doing some research on how widely used the terms RegExp vs Regex I realized I don't really care much what it is called1. The files currently use a mix of both, so we're already in a bit of a mess. But the ReDoS libs and shared implementation around that use RegExp, so I guess we can go with that long term and phase out Regex.

If we have agreement around this, I think we should just merge this PR with current RegExp naming, and make a PR to change the name of RegexExecution.

research on regexp vs regex

I did a little reading on Regex vs. RegExp:
Seems like some languages (at least JS, Ruby, and Go) are using RegExp, while most other languages (at least C++, Python, C#, Java, Haskell, Swift) use Regex.

I also did a little google trends search, where it seems like regex is the clear winner.

image

suffix

Besides the RegExp/Regex discussion, what were your thoughts on the RegexConstruction/RegExpConstruction naming? (which matches what we've done for other such cases).

Footnotes

  1. although regexp is hard on my muscle memory 😂

@erik-krogh
Copy link
Contributor Author

I think it's ok to leave the flags handling for follow up work. Do you agree that this could be a nice thing to do?

Yes. I'm not sure what will come of it, but seems worth looking into.

Seems like some languages (at least JS, Ruby, and Go) are using RegExp.

Yes, JS has an build-in RegExp class. That's probably where our original naming came from.

If we have agreement around this, I think we should just merge this PR with current RegExp naming, and make a PR to change the name of RegexExecution.

Sounds like we have agreement.

Besides the RegExp/Regex discussion, what were your thoughts on the RegExpConstruction naming?

I went for the suffix that Ruby had, and I think it makes sense.
That a regular expression gets constructed is in some cases an implementation detail, because it's not always returned to the user.
Both names sound fine to me, but I much prefer using a name that's already used.

Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long review time, I somehow found this harder to review than normal. One thing that might have helped is if moving RegexTreeView.qll and changing it had happened in two separate commits, then I would not have had to generate the diff manually (the diff turned out to be trivial, so I can see why it was tempting to do it this way).

Generally, this is great! There is one comment that I hope I have improved, because otherwise I did not understand what is going on after all :-)

Comment on lines 29 to 36
exists(DataFlow::CallCfgNode callNode |
call = callNode and
result =
mode_from_node([
callNode
.getArg(re_member_flags_arg(callNode.(DataFlow::MethodCallNode).getMethodName())),
callNode.getArgByName("flags")
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call is already a CallCfgNode, so introducing callNode is redundant:

Suggested change
exists(DataFlow::CallCfgNode callNode |
call = callNode and
result =
mode_from_node([
callNode
.getArg(re_member_flags_arg(callNode.(DataFlow::MethodCallNode).getMethodName())),
callNode.getArgByName("flags")
])
result =
mode_from_node([
call
.getArg(re_member_flags_arg(call.(DataFlow::MethodCallNode).getMethodName())),
call.getArgByName("flags")
])

Comment on lines 29 to 31
private DataFlow::Node sink;

RegExpPatternSource() { astNode = this.asExpr() }
RegExpPatternSource() { this = regExpSource(sink) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am tempted to turn the predicate RegExpSource::regSink into the charpred of a class, just so I could annotate the result type of RexExpSource::regExpSource with it. Then we could see it here as:

  private RegExpSink sink;

  RegExpPatternSource() { this = regExpSource(sink) }

(where I imagine the class was called RegExpSink). I think this might clarify the constraints on sink a little bit.

Comment on lines +528 to +534
private string getUnicode() {
result = Numbers::parseHexInt(this.getText().suffix(2)).toUnicode()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This qldoc seems to have gotten lost during the move

Suggested change
private string getUnicode() {
result = Numbers::parseHexInt(this.getText().suffix(2)).toUnicode()
}
/**
* Gets the unicode char for this escape.
* E.g. for `\u0061` this returns "a".
*/
private string getUnicode() {
result = Numbers::parseHexInt(this.getText().suffix(2)).toUnicode()
}

Comment on lines 425 to 427
* A node that is not a regular expression literal, but is used in places that
* may interpret it as one. Instances of this class are typically strings that
* flow to method calls like `re.compile`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it confusing that this talks about strings that flow to places. Is the point of this refactor not, that this concept only includes those places (where strings are being interpreted as regular expressions)?
The definition of regSink suggest that we do tracking based on the RegExpInterpretation concept among other things.

Also, I think we do not have regular expression literals in Python.

Suggested change
* A node that is not a regular expression literal, but is used in places that
* may interpret it as one. Instances of this class are typically strings that
* flow to method calls like `re.compile`.
* A node where a string is interpreted as a regular expression,
* for instance an argument to `re.compile`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your comment is better, and yes the concept only includes the places where a string is interpreted as a regular expression.

Also, I think we do not have regular expression literals in Python.

Too much copy-paste from my end.

Comment on lines 437 to 439
* A node that is not a regular expression literal, but is used in places that
* may interpret it as one. Instances of this class are typically strings that
* flow to method calls like `re.compile`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the above, we should remember to change this also.

Suggested change
* A node that is not a regular expression literal, but is used in places that
* may interpret it as one. Instances of this class are typically strings that
* flow to method calls like `re.compile`.
* A node where a string is interpreted as a regular expression,
* for instance an argument to `re.compile`.

@erik-krogh
Copy link
Contributor Author

I did a rebase on main to solve some conflicts.

@yoff
Copy link
Contributor

yoff commented May 1, 2023

I wonder if the last commit could be avoided, had we not changed the result type of getAParse? I would be happy to accept either way...

@erik-krogh
Copy link
Contributor Author

I wonder if the last commit could be avoided, had we not changed the result type of getAParse? I would be happy to accept either way...

It could have been avoided.
But I think it's fine as is.

@yoff yoff self-requested a review May 8, 2023 18:01
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - and thanks for the extra clean-up! 👍

@erik-krogh erik-krogh requested a review from RasmusWL May 9, 2023 07:52
@RasmusWL RasmusWL merged commit 62f0c64 into github:main May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants