Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a check for trailing slash in webserver base_url #31833

Merged
merged 4 commits into from
Jun 19, 2023

Conversation

hussein-awala
Copy link
Member

closes: #31726


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Signed-off-by: Hussein Awala <hussein@awala.fr>
@boring-cyborg boring-cyborg bot added the area:webserver Webserver related Issues label Jun 10, 2023
@hussein-awala hussein-awala added the type:bug-fix Changelog: Bug Fixes label Jun 10, 2023
@hussein-awala hussein-awala added this to the Airlfow 2.6.3 milestone Jun 10, 2023
@@ -758,7 +758,7 @@ def generate_command(
def log_url(self) -> str:
"""Log URL for TaskInstance."""
iso = quote(self.execution_date.isoformat())
base_url = conf.get_mandatory_value("webserver", "BASE_URL")
base_url = conf.get_mandatory_value("webserver", "BASE_URL").rstrip("/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this could lead to some errors in the future.

It's hard for someone to guess that retrieving the base_url from the config is a special case and needs to strip trailing '/'. (I can easily imagine someone doing conf.get_mandatory_value("webserver", "BASE_URL") to access the value)

Maybe we can put that in the config accessor ? (So we don't need to bother, the equivalent of @validates/@pre_process for marshmallow field, that does some preprocesssing on the value before storing it)

Otherwise looking good

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check how can we do that for a specific config

@uranusjr
Copy link
Member

Two thoughts

  1. Should we just use urljoin instead of raw string formatting to handle the slashes instead?
  2. If we want to eliminate the possibility, I’d prefer an startup-time check to ensure the config is set without a trailing slash. If a trailing slash is found, Airflow can refuse to launch and prompt the user to correct the value.

@hussein-awala
Copy link
Member Author

Should we just use urljoin instead of raw string formatting to handle the slashes instead?

it's possible for the changes in airflow/models/taskinstance.py but not for the one in airflow/www/extensions/init_wsgi_middlewares.py since we don't join it to something else, but we provide it as an argument to a 3rd party lib.

If we want to eliminate the possibility, I’d prefer an startup-time check to ensure the config is set without a trailing slash. If a trailing slash is found, Airflow can refuse to launch and prompt the user to correct the value.

I will try this one, and I don't think it's a braking change since it doesn't work with a trailing slash anyway

Signed-off-by: Hussein Awala <hussein@awala.fr>
Signed-off-by: Hussein Awala <hussein@awala.fr>
@hussein-awala
Copy link
Member Author

@pierrejeambrun @uranusjr could you recheck it please?

Copy link
Member

@pierrejeambrun pierrejeambrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bugfix but could break some user configs by preventing webservers from starting.

Should we go with a warning first before actually refusing such configuration ?

(No strong opinion, not sure how we should handle this actually)

@hussein-awala
Copy link
Member Author

@pierrejeambrun As I mentioned in my previous comment:

I don't think it's a breaking change since it doesn't work with a trailing slash anyway.

Therefore, if users utilize a base_url with a trailing slash, they will encounter an issue, and comprehending the cause might not be straightforward. IMO, raising an exception is a secure approach since it does not break any existing functionality, but it just explain the problem reason. WDYT?

Copy link
Member

@pierrejeambrun pierrejeambrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed that in your comment.

I did a few tests, looking good

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@hussein-awala hussein-awala changed the title Remove right trailing / from webserver base_url Add a check for trailing slash in webserver base_url Jun 19, 2023
@hussein-awala hussein-awala merged commit fe4a6c8 into apache:main Jun 19, 2023
42 checks passed
ephraimbuddy pushed a commit that referenced this pull request Jul 6, 2023
* Remove right trailing / from webserver base_url

Signed-off-by: Hussein Awala <hussein@awala.fr>

* use url join instead of removing trailing slash

Signed-off-by: Hussein Awala <hussein@awala.fr>

* raise an exception when base_url contains a trailing slash

Signed-off-by: Hussein Awala <hussein@awala.fr>

* Update airflow/www/extensions/init_wsgi_middlewares.py

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>

---------

Signed-off-by: Hussein Awala <hussein@awala.fr>
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
(cherry picked from commit fe4a6c8)
@@ -37,8 +38,11 @@ def _root_app(env: WSGIEnvironment, resp: StartResponse) -> Iterable[bytes]:

def init_wsgi_middleware(flask_app: Flask) -> None:
"""Handle X-Forwarded-* headers and base_url support."""
webserver_base_url = conf.get_mandatory_value("webserver", "BASE_URL", fallback="")
if webserver_base_url.endswith("/"):
raise AirflowConfigException("webserver.base_url conf cannot have a trailing slash.")
Copy link
Member

@pankajkoti pankajkoti Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering here that why can the base URL not have a trailing /?

Copy link
Member

@uranusjr uranusjr Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to how web servers rewrite URLs when you deploy a webapp under a prefix. Say you deploy under https://mydomain/airflow. When you visit https://mydomain/airflow/home in the browser (as an example; any client applies), the proxy server (e.g. NGINX) would strip the prefix, and forward the rest of the URL to the app, so in Python code (e.g. Airflow) we can just handle /home without needing to consider what prefix the entirety of Airflow is deployed under.

Now you may say, hey, why can’t either the proxy server, the gateway protocol (WSGI in Python’s case), or the web framework (Flask for Airflow’s case) be smarter and just strip or add the slash in between as appropriate? And you would be right! Some of them actually can do this (NGINX has merge_slashes, for example), but not everyone does since the slash is technically significant and does change the URL’s meaning (even if that meaning can be nonsensical in the prefix), and that extra check is not free. So it’s better to avoid fighting the tools and just stick to the technically correct configuration.

Copy link
Member

@pankajkoti pankajkoti Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I am coming from here on how the python urllib.parse.urljoin handles joining base URL and relative URLs.

Wouldn't https://my.astronomer.run/path qualify as a valid base URL? But looks like the urljoin just ignores the path when it does not end with a trailing /. If it was https://my.astronomer.run/path/, urljoin does not strip the path and makes the join as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can handle more cases if users need it. This PR was a quick fix to prevent circular redirections.

I’m sure we can make this work with a trailing slash but requires a bit more work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry, forgot to reply.)

Python’s urljoin implements the logic in <a href="..."> tags; I don’t recall the correct term for this, but it is the URL resolution logic when you click on links in a browser. URLs have two kinds, folder and file (not the right terms, but easier to understand). The URL would not have a trailing slash if you are in a file view; the trailing slash indicates a folder view. When you’re in a file, a relative path like foo indicates another file in the same folder. When you’re in a folder view, however, foo means the foo entry in the folder. This means the resolution logic would change depending on whether the URL has a trailing slash or not.

Note that Python’s local path joining logic (both pathlib and os.path) is different and does not consider the trailing slash; it instead matches the common logic used to join format in shell scripts.

But the prefix is an entirely different thing, and is intended to only be joined with simple string concatenation and skips all the folder/file logic because it is not semantically possible to support combining the prefix with absolute paths and relative paths. The only proper way to prepend a prefix is prefix + path (or other string concatenation methods such as f"{prefix}{path}", of course).

Copy link
Member

@pankajkoti pankajkoti Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood, thank you @uranusjr for the detailed explanation and propsoal.

pankajkoti added a commit to astronomer/airflow that referenced this pull request Aug 3, 2023
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment apache#31833 (comment)

closes: apache#32996
eladkal pushed a commit that referenced this pull request Aug 3, 2023
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment #31833 (comment)

closes: #32996
ephraimbuddy pushed a commit that referenced this pull request Aug 3, 2023
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment #31833 (comment)

closes: #32996
(cherry picked from commit baa1bc0)
ahidalgob pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 15, 2024
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment apache/airflow#31833 (comment)

closes: #32996
(cherry picked from commit baa1bc0438baa05d358b236eec3c343438d8d53c)
GitOrigin-RevId: 868d1389461822cf5d54faaff3ce19913fc7f08e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:webserver Webserver related Issues type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

redirect to same url after set base_url
4 participants