Only allow webserver to request from the worker log server #16754

Merged
merged 2 commits into apache:main from protect-logs-server on Jul 1, 2021

Conversation

@ashb (Member) commented Jul 1, 2021

Logs shouldn't contain any sensitive info, but they often do by
mistake. As an extra level of protection, we shouldn't allow anything
other than the webserver to access the logs.

(We can't change the bind IP from 0.0.0.0, because for the log server to
be useful it needs to be reachable from different hosts -- i.e. the
webserver will almost always be on a different node.)
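
(For context: as the discussion below about matching `secret_key` values and `log_request_clock_grace` confirms, the protection relies on log requests carrying a token signed with the shared `[webserver] secret_key`. A minimal sketch of that shared-secret signing idea using the `itsdangerous` package -- illustrative only, not the exact Airflow implementation:)

```
# Sketch only: both the webserver (client) and the worker log server must be
# configured with the same [webserver] secret_key for the signature to verify.
from itsdangerous import TimedSerializer

signer = TimedSerializer(secret_key="the-shared-secret-key", salt="log-request")

# Webserver side: sign the log filename it wants and send the token with the request.
token = signer.dumps("<dag>/<task>/2021-07-15T11:51:59+00:00/1.log")

# Worker side: reject anything unsigned, signed with a different key, or older
# than the allowed clock grace period (loads raises BadSignature/SignatureExpired).
filename = signer.loads(token, max_age=30)
```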


@ashb ashb added this to the Airflow 2.1.2 milestone Jul 1, 2021
@ashb ashb requested review from potiuk and mik-laj Jul 1, 2021
@potiuk (Member) approved these changes Jul 1, 2021 and left a comment:

Nice!

@mik-laj approved these changes Jul 1, 2021
@github-actions (bot) commented Jul 1, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@ashb (Member, Author) commented Jul 1, 2021

All passed bar upload coverage -- merging.

@ashb ashb merged commit 2285ee9 into apache:main Jul 1, 2021
37 of 38 checks passed
@ashb ashb deleted the protect-logs-server branch Jul 1, 2021
kaxil added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commit 2285ee9)
jhtimmins added a commit that referenced this issue Jul 6, 2021 (cherry picked from commit 2285ee9)
kaxil added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commits 2285ee9 and d4a09f2)
ashb added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commits 2285ee9 and d4a09f2)
ashb added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commits 2285ee9 and d4a09f2)
ashb added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commits 2285ee9 and d4a09f2)
ashb added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commits 2285ee9 and d4a09f2)
ashb added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commits 2285ee9 and d4a09f2)
jhtimmins added a commit to astronomer/airflow that referenced this issue Jul 6, 2021 (cherry picked from commit 2285ee9)
ashb added a commit that referenced this issue Jul 6, 2021 (cherry picked from commit 2285ee9)
kaxil added a commit to astronomer/airflow that referenced this issue Jul 7, 2021 (cherry picked from commit 2285ee9)
@ashb ashb removed this from the Airflow 2.1.2 milestone Jul 7, 2021
@ashb ashb added this to the Airflow 2.1.3 milestone Jul 7, 2021
@wanderijames commented Jul 15, 2021

@ashb this appears to have broken worker log access from the web UI. A pod running the web UI cannot access logs from a worker in another pod. This can be reproduced when deployed on Kubernetes. We are getting the following:

*** Fetching from: https://worker.worker-svc.default.svc.cluster.local:8793/log/<dag>/<task>/2021-07-15T11:51:59.190528+00:00/1.log
*** Failed to fetch log file from worker. 403 Client Error: FORBIDDEN for url: https://worker.worker-svc.default.svc.cluster.local:8793/log/<dag>/<task>/2021-07-15T11:51:59.190528+00:00/1.log
For more information check: https://httpstatuses.com/403

@kaxil (Member) commented Jul 21, 2021

#17142 should take care of the Helm Chart -- rc1 for it will be created today, and hopefully the version will be released in the next 3-4 days.

kaxil added a commit to astronomer/airflow that referenced this issue Jul 21, 2021
After apache#16754 -- it is important that both Webserver and Worker have the same config value for `[webserver] secret_key` or else you will see the following error:

```
*** Fetching from: https://worker.worker-svc.default.svc.cluster.local:8793/log/<dag>/<task>/2021-07-15T11:51:59.190528+00:00/1.log
*** Failed to fetch log file from worker. 403 Client Error: FORBIDDEN for url: https://worker.worker-svc.default.svc.cluster.local:8793/log/<dag>/<task>/2021-07-15T11:51:59.190528+00:00/1.log
For more information check: https://httpstatuses.com/403
```

This happens because Airflow generates a random value for it if one isn't provided, so the webserver and the worker each end up with a different random string. Since the values don't match, the request fails with the error above.

This PR creates a K8s Secret object with a key for that setting and passes it to the components as an env var, similar to what we do with the Fernet key.
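
(To make the remedy concrete -- an illustrative sketch only, not taken from the Helm chart: the same key must reach every component, for example through Airflow's standard `AIRFLOW__WEBSERVER__SECRET_KEY` environment-variable override.)

```
# Illustration only: generate one key and hand the *same* value to the
# webserver and every worker, using Airflow's env-var override convention
# (AIRFLOW__WEBSERVER__SECRET_KEY maps to [webserver] secret_key).
import secrets

shared_env = {"AIRFLOW__WEBSERVER__SECRET_KEY": secrets.token_hex(16)}
# Inject shared_env into the webserver and worker environments alike;
# per-component random keys are exactly what produces the 403 above.
print(shared_env)
```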
@tunix commented Jul 21, 2021

> Just wondering @tunix - where did the idea that different components should have different configuration come from? Is this something implied? Or something that you found in the documentation? Should we make it very explicit in our documentation that you are not supposed to selectively set some configuration parameters on some components only?

I'm new to Airflow and I haven't seen an explicit part within the docs that says "all configuration must be shared between airflow components". In fact, all configuration keys are prefixed with some naming scheme which led me to think that some are common, some are for webserver etc... I personally don't like to have common configuration because changing a single setting requires a change in all components (hence restart) but I get that this is intentional for Airflow. I think this can be better communicated throughout the docs.

kaxil added a commit to astronomer/airflow that referenced this issue Jul 21, 2021
kaxil added a commit to astronomer/airflow that referenced this issue Jul 21, 2021
Some users seem to be using different configs for different components. This doc makes it clear what we recommend.

Context: apache#16754 (comment)

>I'm new to Airflow and I haven't seen an explicit part within the docs that says "all configuration must be shared between airflow components". In fact, all configuration keys are prefixed with some naming scheme which led me to think that some are common, some are for webserver etc... I personally don't like to have common configuration because changing a single setting requires a change in all components (hence restart) but I get that this is intentional for Airflow. I think this can be better communicated throughout the docs.
kaxil added a commit to astronomer/airflow that referenced this issue Jul 21, 2021
@kaxil (Member) commented Jul 21, 2021

@tunix Does #17146 help?

@potiuk (Member) commented Jul 21, 2021

> #17142 should take care of the Helm Chart -- rc1 for it will be created today, and hopefully the version will be released in the next 3-4 days.

wooohooo :)

kaxil added a commit that referenced this issue Jul 21, 2021
kaxil added a commit that referenced this issue Jul 21, 2021
@green2k commented Jul 23, 2021

We use the logging endpoint for worker healthchecks (an external service health-checks the nodes in the cluster). Does this mean these simple healthchecks won't work anymore?

@jedcunningham (Member) commented Jul 23, 2021

@green2k, I'd just change your success criteria to be a 403 response. That said, having a /health endpoint probably makes sense.

@green2k commented Jul 23, 2021

Yes, that's the thing. Our health-checking system expects an HTTP 200 response. We managed to add a custom healthcheck endpoint by executing pathlib.Path('logs/health').touch(exist_ok=True), so we're currently (v2.1.1) able to check the health of our workers by calling http(s)://worker/log/health.
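
(A sketch of the workaround suggested above -- treat the now-expected 403 as proof of liveness; the URL and helper name here are illustrative:)

```
# Sketch: consider the worker log server healthy if it answers at all,
# even with the 403 it now returns for unsigned requests.
import requests

def worker_log_server_alive(host: str, port: int = 8793) -> bool:
    try:
        resp = requests.get(f"http://{host}:{port}/log/health", timeout=5)
    except requests.RequestException:
        return False
    return resp.status_code in (200, 403)
```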

andrewgodwin added a commit to astronomer/airflow that referenced this issue Jul 28, 2021
andrewgodwin added a commit to astronomer/airflow that referenced this issue Jul 28, 2021
jarfgit added a commit to astronomer/airflow that referenced this issue Aug 5, 2021
jarfgit added a commit to astronomer/airflow that referenced this issue Aug 5, 2021
@shankarcb commented Nov 8, 2021

This change has introduced another bug: a 403 client error when DAG logs are fetched from the Airflow web UI. It appears to affect Airflow when set up in cluster mode. We have set up Airflow in cluster mode with Celery -- one master and multiple worker nodes (not CeleryKubernetes).
The error message said the secret_key did not match, which was not the case, since identical copies of airflow.cfg were used on the master and worker nodes.

Increasing the default fallback value to 120 seconds in the code snippet below fixed the 403 client error.
We added "log_request_clock_grace" under the [webserver] section of airflow.cfg to override this value, which did not work. We had to manually change the code in the Airflow installation to fix the issue. However, this is not a recommended or permanent solution. Please fix this in future releases.

from airflow.configuration import conf
from flask import Flask

def create_app():
    flask_app = Flask(__name__, static_folder=None)
    max_request_age = conf.getint('webserver', 'log_request_clock_grace', fallback=30)

Code: https://github.com/apache/airflow/blob/main/airflow/utils/serve_logs.py
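
(For reference, `conf.getint` resolves values through Airflow's normal config precedence, so -- assuming the process is restarted -- the grace period should also be adjustable without patching code, e.g. via the standard `AIRFLOW__WEBSERVER__LOG_REQUEST_CLOCK_GRACE` environment variable. A quick way to check what the running config resolves to:)

```
# Quick check of the value the serve-logs app will actually use; the standard
# AIRFLOW__WEBSERVER__LOG_REQUEST_CLOCK_GRACE env var (or the [webserver]
# section of airflow.cfg) feeds this same lookup.
from airflow.configuration import conf

print(conf.getint('webserver', 'log_request_clock_grace', fallback=30))
```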

@potiuk (Member) commented Nov 8, 2021

This is not a real problem but a deployment issue on your side. Your webserver and workers simply do not have their time properly synchronized. Please make sure they do (for example with NTP).

@shankarcb commented Nov 8, 2021

> This is not a real problem but a deployment issue on your side. Your webserver and workers simply do not have their time properly synchronized. Please make sure they do (for example with NTP).

Initially I did think the same and made sure they were synced, which did not help. There is only a 2-second difference between the nodes.

@potiuk (Member) commented Nov 8, 2021

I actually think it's not the change you applied to the code; more likely, redeploying the code or restarting the machine fixed the issue. I'm almost 100% sure that you had time drift.

The fact that you see 2 seconds now means that something is wrong, and the difference will probably grow. When NTP is synchronised, both machines should have exactly the same time, down to the millisecond.

You can check this by changing the value back to 30 -- as long as the time drift is smaller, it should still work.
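
(One way to verify the drift on each machine -- `ntplib` is a third-party package, used here purely as an illustration:)

```
# Compare the local clock against an NTP server; run this on the webserver and
# on each worker -- the offsets should all be close to zero.
import ntplib

offset = ntplib.NTPClient().request("pool.ntp.org", version=3).offset
print(f"clock offset vs NTP: {offset:+.3f} s")
```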

@shankarcb commented Nov 8, 2021

Airflow is installed in a separate Python virtual environment with the required Airflow libraries installed there. We updated serve_logs.py in the site-packages directory of that virtual environment, which is picked up by the Airflow process, to resolve the issue. No machine reboot was performed.

All the machines are synced with the same RHEL NTP servers.

This impacted multiple deployments of Airflow; the other deployment did not have any time lag/drift.

What is the significance of the log_request_clock_grace parameter in the code? Can it be configured in airflow.cfg?
We are fine with this solution as long as we don't need to manually change the Airflow utils code after each upgrade of Airflow.

@potiuk (Member) commented Nov 8, 2021

So you likely had another issue -- possibly an Airflow process that was not restarted and was still using the old configuration. Yes, this parameter changes the grace period, and yes, it should work if you restart Airflow. So you should look at your process/deployment, not at Airflow.
