Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The allowed_deserialization_classes regex is broken #34093

Closed
2 tasks done
barrysteyn opened this issue Sep 5, 2023 · 5 comments · Fixed by #36147
Closed
2 tasks done

The allowed_deserialization_classes regex is broken #34093

barrysteyn opened this issue Sep 5, 2023 · 5 comments · Fixed by #36147

Comments

@barrysteyn
Copy link

barrysteyn commented Sep 5, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

In #28829, the allowed_deserialization_classes regex was made more intuitive. It's aim (I guess) was to ensure that a period . be treated as a period and not as the wildcard regex character. The justification (given in #28829) was the following:

Typically someone would like to allow any classes below 'mymodule' to match. For example, 'mymodule.dataclasses' by setting allowed_deserialization_classes to 'mymodule.*'. However this matches everything starting with mymodule, so also mymodulemalicious.'

To correct the above, the following substitution was provided to each space separated string provided by allowed_deserialization_classes:

re.sub(r"(\w)\.", r"\1\..", p)

Now, assume I have a class that I want to be serialized called this.is.an.example. Here is the results of the following inputs to allowed_deserialization_classes

  • this.is.an.example --> This would not match (note it is the identical class name). The regular expression would change it to this\\..is\\..an\\..example. Which means that there is an extra wildcard . that would need be matched.
  • this.is.an* --> This would not match. The regular expression would change it to this\\..is\\..an* which suffer from the problem in the first example.
  • this.* --> This would match (note that this example is given in the unit tests for Make allowed_deserialization_classes more intuitive #28829). The regular expression would would change it to this\\..*. It will match because it will look for the word this followed by . and then the wildcard character with the wildstar character (which will match zero or all of anything). To note that this is THE only way to make things match without altering the input string.

In order to make a match for this.is.an.example, I need to put as input in allowed_deserialization_classes this[.]is[.]an[.]example:

  • The above input would be left alone by the regex expression substitution.
  • It also does what the author wants (which is to match the period .)
  • To note that one can also just escape the . by doing this\.is\.an\.example which would have the identical result.

In fact, any correct regex would match (even this[.]is[.]an*) as long as re.sub(r"(\w)\.", r"\1\..", p) does NOT alter things.

What I think should happen

It is dubious to alter a regex in the first place. Airflow is aimed at technical people, and as such, proper documentation would suffice. There I suggest the following:

  1. Remove the re.sub(r"(\w)\.", r"\1\..", p)
  2. Correct the documentation and warn people to escape the wildcard character.
  3. If the intention is to make it easier by altering the regex, then we need to have a serious chat about what the regex should be. Since we will be trying to guess what the user means. For example, hello.* obviously means match anything after hello, and hello.world.* means match the first period as a period, but the second one as a wildcard... You see how tough this is. Because regex can have multiple reserved symbols. A better way to do it would be look forward. Maybe something like this (although regex experts, please weigh in): re.sub(r"\.(\w)", r"\\.\1", p) (this will only escape a period if a word started directly afterwards).

How to reproduce

I reproduced by copying the logic and putting it in a test task. I did the following:

@dag(
    dag_id="test-dag-used-to-test",
    start_date=datetime(2021, 1, 1),  # start sometime in the past
    catchup=False,  # important, since we started in the past
)
def test_dag():
    @task
    def test_match():
        input = 'this.is.an.example'  # Note that is the exact class name
        p = re.sub(r"(\w)\.", r"\1\..", input)
        result = any(p.match(input))  # Returns false

    test_match()


test_dag()

Operating System

MacOS

Versions of Apache Airflow Providers

The change was introduced in #28829 which was from AirFlow 2.6.0

Deployment

Amazon (AWS) MWAA

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@barrysteyn barrysteyn added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Sep 5, 2023
@potiuk
Copy link
Member

potiuk commented Sep 5, 2023

Agree. It makes very little sense how it is defined. Especially the comment Bare “.” will be replaced so you can set airflow* is pretty misleading in the context of your examples.

I think - if we wanted to do it more intuitively (but keep the regexp compatibility) we should do it differently

  1. Check if the pattern string ends with .*. If so use glob matching rather than regexp matching
  2. Check if the string fully matches existing class (we can attempt to import the pattern and see if it works).
  3. If none of the above, fallback to regexp matching
  4. Update the documentation describing it (and treat it as a bugfix not breaking change).

I believe this will strike the right balance between the intuitiveness of full-class specification, ability to glob-match when you want to exclude whole packages while keeping the power of regular expressions (both 1) and 2) are highly unlikely to be specified with the intention of being regular expressions).

@kaxil @bolkedebruin - does it cover the intentions of the #28829 ? I believe that was the intention. The \,,. replacement indeed does not properly match expressions that contain more than one top-level package and make the matching far less intuitive if you have hierarchical packages.

@potiuk potiuk removed the needs-triage label for new issues that we didn't triage yet label Sep 5, 2023
@barrysteyn
Copy link
Author

barrysteyn commented Sep 5, 2023

Thanks @potiuk. The ultimate aim of allowed_deserialization_classes is to provide protection against unexpected class deserialization. Using a regex is goes against that - as @bolkedebruin pointed out, you can have unexpected results and end up allowing things you didn't want. However, I am guessing one of the things that @bolkedebruin wanted to do was to switch things off completely by allowing the regex .* which would match everything.

That said, I think we should do the following:

  1. If allowed_deserialization_classes is None, then operate as per normal (which would be not to allow custom class deserialization)
  2. If allowed_deserialization_classes is '' (empty string), turn the feature off (I am still not sure how useful this feature is, but I bet there are lots of people who would like to turn it off).
  3. If allowed_deserialization_classes has strings in it, then use fnmatch to do (as you suggest) a glob like pattern matching search.
  4. Do not do any regex matching (and since Make allowed_deserialization_classes more intuitive #28829, this feature has been broken and so I don't think it is used much - except maybe to turn things off)

A note on (4) above: I do not advocate for trying string match, then glob match and then regex. This is way too complex for what is aimed at securing the system. It opens up a can of worms. Rather, let's just use fnmatch if we need to match. And... we need to vastly improve the documentation:

  1. Remove cryptic comments like Bare “.” will be replaced so you can set airflow* - I am a first language English speaker and I have no idea what that implies.
  2. State that allowed_deserialization_classes is just a space separated string that will be used with split() to make lists.
  3. Give the rules as I mention above (None will enforce things, '' will stop enforcement, all other strings will be split and matched with fnmatch)

What do you think?

@potiuk
Copy link
Member

potiuk commented Sep 5, 2023

I am also fine for that. Regexp fallback was only there as a compatibility measure and I am perfectly fine to do "glob-only" - that would by my way of implemeting it if I were to implement it from scratch.

However indeed this is an obscure feature (I'd say) and using regexp was (somewhat - not completely) broken since that change and this is the first time we hear something about it, so I would not be surprised if the feature is hardly used. So it's really on the border of "is it really breaking someone's workflow".

We could really treat it as a bugfix and add release notes /significant newsfragment that announcing that this feature was using regexp was a bug that is going to be fixed in the new release. I think also the radius blast of it is very limited - the worst thing that happens, the serialisation errors will fail everything if the match stops working and we can even likely update the error message to explain that this might be due to the bugfixing of the parameter and link to documentation explaining how to fix it.

@kaxil @bolkedebruin - WDYT?

@barrysteyn
Copy link
Author

@potiuk Thanks for the reply. I have just one final question (mainly because I am new to the AirFlow community): How does it work from here? Would you like me to submit a PR (which would take me some time to set up things - but would be a pleasure to help out the community) or will someone just pick this up?

@potiuk
Copy link
Member

potiuk commented Sep 5, 2023

Absolutely - feel free to open a new PR (even now before getting feedback from the others). We can always iterate over the PR and get to the final solution there.

And we have quite good CONTRIBUTING doss to get you started (just find it in the top level of the repo) - including some useful guidelines on setting up your IDEs and we also have breeze development environment based on docker-compose that is super easy to setup and keep updated and it providers 1-1 local reproducibility of everything that happens in CI - so it's easy to reproduce and see (and fix)) problems our CI / static checks /docs building requires

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants