Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace re with re2 #32303

Merged
merged 2 commits into from
Jul 6, 2023
Merged

Conversation

pierrejeambrun
Copy link
Member

@pierrejeambrun pierrejeambrun commented Jul 1, 2023

Replace re with re2 in core. Leave re for dev/scripts/test and providers.

(Providers don't have re2 dependencies).

CAMELCASE_TO_SNAKE_CASE_REGEX needs to be rewritten without using lookaround. I was not successful at that, so allowing re here.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:CLI area:dev-tools provider:cncf-kubernetes Kubernetes provider related issues area:logging area:serialization area:webserver Webserver related Issues labels Jul 1, 2023
@pierrejeambrun pierrejeambrun added this to the Airlfow 2.6.3 milestone Jul 1, 2023
@pierrejeambrun pierrejeambrun added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) type:bug-fix Changelog: Bug Fixes and removed changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) labels Jul 1, 2023
@pierrejeambrun pierrejeambrun reopened this Jul 1, 2023
@bolkedebruin
Copy link
Contributor

Maybe it is good to know why? It brings in a non core dependency that needs to be maintained?

Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see a commit message that explains why and a removal of "import re2 as re".

@@ -17,7 +17,7 @@
"""Providers sub-commands."""
from __future__ import annotations

import re
import re2 as re
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is a functional difference between re and re2 we are bound to end up in issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution is vague then in this case and a (innocuous looking) functional commit message to show that it addresses the core of the problem is required. What does re2 solve that re doesn't and we cannot do otherwise and why?

Additionally re2 is not just a drop-in replacement as you have shown yourself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I think the actual issue is that we are trusting user input here and re2 seems on the surface right now to just to be band-aid and not addressing the core of the problem. Also re2 fallsback to re if it doesn't know how to handle the regex. But maybe you can elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback is explicit in pyre2 (we are not using it), but I didn't find any mention of it in python google-re2 bindings. I tried a few unsupported regexp and it threw errors straight away. I believe there is no fallback, this was discussed on the issue. Did you find example or doc regarding fallback for google-re2 ?

re2 solves the ReDos problem with linear time regexp engine.

I don't have newer arguments than what has already been discussed in the github issue. (backtracking problem, pyre2 vs google-re2, fallback etc.).

As it is not yet released with a patch for this I purposely didn't give too many details here.

If we prefer a different approach, just let me know I can close this in favor of other suggestions.

If needed can we follow up on the airflow-s issue, or security mailing list ?

Note: Commit message has been updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I think the actual issue is that we are trusting user input here and re2 seems on the surface right now to just to be band-aid and not addressing the core of the problem

Not really. I think it's not a band-aid, using rgular expression is part of our API specification, so we cannot really remove it unless we have a very good reason (and it's actually useful). So solving a potential way how you could (mostly accidentally) trigger the situation where it it will take a lot of time is the right approach - we do not want to remove the functionality there. Moreover - since we will already have the google-re2 dependency (which BTW is proven and battle tested because it is used internally in go language), we can use the opportunity to use it elsewhere whre we use regular expressions and protect other pleaces.

Also re2 fallsback to re if it doesn't know how to handle the regex. But maybe you can elaborate?

The fallback is a mechanism for another library, It was a mistake to mention it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I stand corrected on mentioning the fallback mechanism. I had to find out myself as there was so little detail in the commit message.

The commit message could read:

"Use linear time regular expressions

The standard regexp library can consume > O(n) in certain circumstances. The re2 library does not have this issue.
"

Which clarifies, but doesnt give away the issue.

I do not fully agree with your assessment @potiuk that we want to keep that. We are trusting user input here and regexp engines are notorious to have issues. Imho the root cause is trusting user input and that is what probably should be addressed. I also find that a very good reason for change :-). The new, current, commit message says as much now ("untrusted"). This is a workaround still.

I won't stand in the way of the commit, but I stand by my opinion that it is band-aid.

@pierrejeambrun
Copy link
Member Author

pierrejeambrun commented Jul 2, 2023

... and a removal of "import re2 as re".

Do you want me to replace this with import re2 ? (I'm not sure I get It)

@bolkedebruin
Copy link
Contributor

... and a removal of "import re2 as re".

Do you want me to replace this with import re2 ? (I'm not sure I get It)

I prefer explicit over implicit. 're2' is not exactly the same as re as regexps are different, so yes please use import re2.

The standard regexp library can consume > O(n) in certain circumstances.
The re2 library does not have this issue.
@pierrejeambrun
Copy link
Member Author

@bolkedebruin Commit message has been updated with your suggestion.

Explicit re2 import are now made when re2 is used.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@potiuk potiuk requested a review from bolkedebruin July 4, 2023 16:02
@potiuk
Copy link
Member

potiuk commented Jul 4, 2023

Some tests failing @pierrejeambrun

@pierrejeambrun
Copy link
Member Author

Some tests failing @pierrejeambrun

Yep, one mock needed to be adjusted. Should be green now.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@potiuk
Copy link
Member

potiuk commented Jul 5, 2023

cc: @bolkedebruin -> I captured your concerns about "band-aid" in #32360 - described the reasonign and consequences, and marked it as "involves core breaking changes" issue. We have started recently to mark issues with such label to see much more easily what kind of changes we might consider when deciding if/when we release Airflow 3 with breaking changes.

Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for addressing @pierrejeambrun and thanks for noting the concern @potiuk

@bolkedebruin bolkedebruin merged commit ee38382 into apache:main Jul 6, 2023
61 checks passed
ephraimbuddy pushed a commit that referenced this pull request Jul 6, 2023
The standard regexp library can consume > O(n) in certain circumstances.
The re2 library does not have this issue.

(cherry picked from commit ee38382)
ephraimbuddy pushed a commit that referenced this pull request Jul 7, 2023
The standard regexp library can consume > O(n) in certain circumstances.
The re2 library does not have this issue.

(cherry picked from commit ee38382)
@pierrejeambrun pierrejeambrun deleted the replace-re-with-re2 branch July 7, 2023 07:05
@tshirtman
Copy link
Contributor

tshirtman commented Nov 15, 2023

So a side effect of that is that .airflowignore regex mode now can’t use backward or forward lookahead, this seems a bit hand heavy considering if someone untrusted can modify this file, we certainly have bigger problems on our hands.

I’ll see if i can implement our requirements (dags are prefixed with dag_ so we ignore everything that doesn’t start with that) using the glob matcher, but it seems like a common enough use-case that breaking it like this is surprising.

At the very least i think the documentation should mention that the regex implementation is re2, so users have an easier time checking their patterns are correct.

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

So a side effect of that is that .airflowignore regex mode now can’t use backward or forward lookahead, this seems a bit hand heavy considering if someone untrusted can modify this file, we certainly have bigger problems on our hands.

I’ll see if i can implement our requirements (dags are prefixed with dag_ so we ignore everything that doesn’t start with that) using the glob matcher, but it seems like a common enough use-case that breaking it like this is surprising.

At the very least i think the documentation should mention that the regex implementation is re2, so users have an easier time checking their patterns are correct.

Feel free to make some docs update on that. PRs to improve our documentation are most welcome. Yes it is possible that security issues will introduce breaking behaviour (because security has higer priority than compatibility) and we generally try to make sure that it is reflected in docs and release notes https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html so if you have the right wording and idea how to communicate it better - absolutely, by all means submit a PR. Airlfow is created by 2700 contributors - such clarifications and updates are always welcome from new contributors - showing that the contributors want to give back to the community and it's generally cool if our contributors care about other users.

Ping me in a PR you will open for that, I am always happy to review and merge those kinds of PRS.

@tshirtman
Copy link
Contributor

tshirtman commented Nov 15, 2023

Sure :) let me know if it’s fine or needs touching up.

edit: out of habit i made a markdown link, not rst, fixed ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API area:CLI area:dev-tools area:logging area:serialization area:webserver Webserver related Issues provider:cncf-kubernetes Kubernetes provider related issues type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants