Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruby: Use taint tracking instead of type tracking to define regExpSource #8332

Merged
merged 2 commits into from
Mar 18, 2022

Conversation

hvitved
Copy link
Contributor

@hvitved hvitved commented Mar 4, 2022

Follow-up to #8293.

@github-actions github-actions bot added the Ruby label Mar 4, 2022
@hvitved hvitved marked this pull request as ready for review March 8, 2022 08:05
@hvitved hvitved requested a review from a team as a code owner March 8, 2022 08:05
@hvitved hvitved added the no-change-note-required This PR does not need a change note label Mar 8, 2022
Copy link
Contributor

@aibaars aibaars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me. How does this approach compare to using a type tracker in terms of precision and efficiency?

@hvitved
Copy link
Contributor Author

hvitved commented Mar 9, 2022

How does this approach compare to using a type tracker in terms of precision and efficiency?

Precision: The data-flow library has much higher precision, e.g. we are tracking contents (fields) and call contexts precisely.
Performance: While it may a bit slower, it doesn't matter much in practice (as witnessed by DCA). Unlike type-tracking, the data-flow library applies sophisticated pruning of nodes before tracking flow from a given source to a given node, which helps a lot on performance.

@hvitved hvitved closed this Mar 9, 2022
@hvitved hvitved reopened this Mar 9, 2022
@RasmusWL
Copy link
Member

RasmusWL commented Mar 9, 2022

Since we discussed these two approaches in architecture meeting, I would be very interested to see whether you have any concrete examples where the new taint-tracking based modeling is able to give a result for regExpSource where the old type-tracking based approach wasn't. I couldn't see any changes to tests, or any new results in alerts/alert-meta in the performance test.

@hmac
Copy link
Contributor

hmac commented Mar 9, 2022

As I understand it, we have to be careful to not re-use the same dataflow library multiple times (hence DataFlow2,3 etc.) If we want to use taint tracking for another library concept, will we have to duplicate this library again in case someone writes a query that uses both regExpSource and the other library concept? In other words, would this copy be better named something like DataFlowImplForRegExpSource, if we can't re-use it for anything else?

@hvitved
Copy link
Contributor Author

hvitved commented Mar 10, 2022

Since we discussed these two approaches in architecture meeting, I would be very interested to see whether you have any concrete examples where the new taint-tracking based modeling is able to give a result for regExpSource where the old type-tracking based approach wasn't. I couldn't see any changes to tests, or any new results in alerts/alert-meta in the performance test.

There is, in my opinion, really no reason to use type tracking if data-flow can be used. Even though no tests change, it doesn't mean that using taint-tracking instead of type tracking will have no effect in the future. In addition to precise tracking of fields and call contexts, data-flow also takes flow summaries into account, and applies the data-flow library's more precise call graph for callbacks/lambdas. I would be interested to see examples where using data-flow/taint-tracking instead of type tracking yields performance problems.

@hvitved
Copy link
Contributor Author

hvitved commented Mar 10, 2022

As I understand it, we have to be careful to not re-use the same dataflow library multiple times (hence DataFlow2,3 etc.) If we want to use taint tracking for another library concept, will we have to duplicate this library again in case someone writes a query that uses both regExpSource and the other library concept? In other words, would this copy be better named something like DataFlowImplForRegExpSource, if we can't re-use it for anything else?

That is correct. There are cases where reusing the same copy of the data-flow library in multiple places in other libraries is OK (specifically when those two libraries are known to always be in scope at the same time), but in general it is safer to always just use a separate copy. However, in the not-too-distant future we will be able to get rid of all the copies using a parameterized module instead.

Copy link
Contributor

@asgerf asgerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I approve of switching to taint-tracking here. In JS I believe it was a case of "all we have is a hammer" so we used type-tracking because we don't have multiple copies of the data-flow configuration.

@hvitved hvitved force-pushed the ruby/regexp-taint-flow branch 3 times, most recently from da66e1f to 8b9efd4 Compare March 16, 2022 12:35
nickrolfe
nickrolfe previously approved these changes Mar 18, 2022
Copy link
Contributor

@nickrolfe nickrolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this PR look good, and I think it should be merged.

But I think there is more work to do in this area. What I want to be able to do is take a DataFlow::Node for some arbitrary expression (that our analysis says has a regexp value), and get the resulting root RegExpTerm. It's not clear to me how to do that. That's what motivated me to make #7985, but I'm not sure if that change is still needed after this PR and #8293.

@aibaars
Copy link
Contributor

aibaars commented Mar 18, 2022

The changes in this PR look good, and I think it should be merged.

But I think there is more work to do in this area. What I want to be able to do is take a DataFlow::Node for some arbitrary expression (that our analysis says has a regexp value), and get the resulting root RegExpTerm. It's not clear to me how to do that. That's what motivated me to make #7985, but I'm not sure if that change is still needed after this PR and #8293.

Would that be solved by the RegExpPatternSource defined in #7917 ?

@hvitved
Copy link
Contributor Author

hvitved commented Mar 18, 2022

Rebased again to sync in latest DataFlowImpl.qll/TaintTrackingImpl.qll changes.

@aibaars aibaars merged commit beef8e2 into github:main Mar 18, 2022
@hvitved hvitved deleted the ruby/regexp-taint-flow branch March 22, 2022 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Ruby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants