Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-4574] allow providing private_key in SSHHook #6104

Merged

Conversation

dstandish
Copy link
Contributor

@dstandish dstandish commented Sep 14, 2019

We can already provide key_file (i.e. path to file).

This PR makes it so that we can provide actual content of the key, like any other connection.

  • can provide in extras with key "private_key"
  • can provide as parameter in SSHHook init

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

@dstandish dstandish force-pushed the feature/AIRFLOW-4574-add-private-key-param branch from 83583e8 to d83e746 Compare September 15, 2019 00:10
@dstandish dstandish changed the title [AIRFLOW-4574] add option to provide private_key in SSHHook [AIRFLOW-4574] allow providing private_key in SSHHook Sep 15, 2019
@mik-laj
Copy link
Member

mik-laj commented Sep 15, 2019

Can you also update documentation?
https://airflow.readthedocs.io/en/latest/howto/connection/ssh.html

@dstandish dstandish force-pushed the feature/AIRFLOW-4574-add-private-key-param branch from d83e746 to 51b05b3 Compare September 15, 2019 16:45
@dstandish
Copy link
Contributor Author

@mik-laj thank you -- did not realize there was this doc.

I have updated.

@codecov-io
Copy link

Codecov Report

Merging #6104 into master will decrease coverage by 0.28%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6104      +/-   ##
==========================================
- Coverage   80.09%   79.81%   -0.29%     
==========================================
  Files         606      607       +1     
  Lines       34890    35031     +141     
==========================================
+ Hits        27945    27959      +14     
- Misses       6945     7072     +127
Impacted Files Coverage Δ
airflow/contrib/hooks/ssh_hook.py 88.78% <100%> (+1.28%) ⬆️
airflow/gcp/hooks/dataflow.py 48.76% <0%> (-26.84%) ⬇️
airflow/gcp/example_dags/example_dataflow.py 0% <0%> (ø)
airflow/utils/dag_processing.py 58.98% <0%> (+0.18%) ⬆️
airflow/jobs/scheduler_job.py 74.58% <0%> (+0.3%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7be5560...51b05b3. Read the comment docs.

password = self.password.strip()
connect_kwargs.update(password=password)

# prefer pkey over key_filename when both are given
Copy link
Member

@mik-laj mik-laj Sep 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you prefer one way over another? I think it's worth throwing an exception if two mutually exclusive parameters are given.

Copy link
Contributor Author

@dstandish dstandish Sep 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly I was on the fence about this too. Ultimately of course I defer to you.

Reasoning for picking one

I guess I don't see the harm in trying at least one of them. I figured choosing one was better because it would at least try one of them, therefore it would fail in fewer circumstances. Though I understand throwing error would force user to resolve ambiguity.

Why pkey, if picking one

The choice of which one to pick, assuming we were to choose one, is probably less controversial: choosing the private key is better because the private key is actually a private key, while the path to file is just a path, and the file may or may not be there.

What does paramiko do?

I was curious and looked into paramiko. What does it do when given both? It appears that it picks pkey, but it's not super obvious to me: https://github.com/paramiko/paramiko/blob/master/paramiko/client.py#L655

Proposal

Perhaps better yet, is when given both, then pass both to paramiko, and let it do whatever it does. What you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make them mutually exclusive and error if both are passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @kaxil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i have made them mutually exclusive

also, in the interest of failing sooner than later, moved the parsing of private_key into a PKey object to __init__ from get_conn. This way, when you are testing your connection string you don't have to call get_conn to see if it parses properly.

* can provide in extras with key "private_key"
* can provide as parameter in SSHHook init
* when both key_file and private_key are given, private_key is preferred
@dstandish dstandish force-pushed the feature/AIRFLOW-4574-add-private-key-param branch from 51b05b3 to 14fbc58 Compare September 16, 2019 15:47
@mik-laj mik-laj merged commit fa8e18a into apache:master Sep 17, 2019
@pgagnon
Copy link
Contributor

pgagnon commented Sep 17, 2019

I see this has just been merged, but nevertheless... I can see why this feature could be useful, but I am concerned that it might encourage users to paste key material in their code and subsequently publish it on their VCS. To put it mildly this is an inadvisable practice.

@dstandish @mik-laj @kaxil WDYT?

@dstandish
Copy link
Contributor Author

@pgagnon the same could be said of password, which some hooks have.
The point of this in my view is that ssh is a connection like any other. Any other connection, we can supply creds through DB or env var alone; this one for some reason requires we also create a file on the instance, does not allow us to supply directly via connection string.
I suppose we could remove the option to supply param at hook instantiation, so it is an extras-only kind of deal, but I don't personally see the value. WDYT?

@dstandish dstandish deleted the feature/AIRFLOW-4574-add-private-key-param branch September 17, 2019 23:30
@pgagnon
Copy link
Contributor

pgagnon commented Sep 18, 2019

@dstandish I see your point. My issue with this is that from the user's standpoint it's much less intuitive to figure out how to add a private key to a connection contrasted with a password, which is a standard connection field. Perhaps we could update the docs/docstring to provide an example and warn users against hardcoding keys in their dags?

@dstandish
Copy link
Contributor Author

I did provide an example connection string in the documentation file, following the example with existing param key_file. I did not warn against including creds in code, though. On that, I am definitely sympathetic to your concern, but I guess my thought is it's a type of general advice that is not specific to this operator, or to airflow for that matter.

I think providing connection string examples in docstings is not a terrible idea. In this case it's pretty straightforward to figure out because you see very cleary how it pulls the value from extras, along with the others. For some reason with GCP connections I remember being very confused trying to traceback how to get the keyfile info to the right place.

I ended up writing a connection string generator that I use when adding new types of connections to our setup. Maybe including something like that, as a part of the hook, is a possibility. But could be confusing as well.

@mik-laj
Copy link
Member

mik-laj commented Sep 19, 2019

I think that it is very useful to be able to configure connections using environment variables. Airflow is launched very often in clusters in an automated manner and the ability to easily define a connection is key. Incorrect use is possible, but it is very limited and not very obvious.

I think that in order to dispel any doubts it is worth adding a short warning notice when describing the private_key parameter.

@pgagnon
Copy link
Contributor

pgagnon commented Sep 19, 2019

@mik-laj

I think that it is very useful to be able to configure connections using environment variables.

I agree that it is useful, but as a side effect it creates an avenue for users to easily leak credentials by inadvertance.

Incorrect use is possible, but it is very limited and not very obvious.

This type of credentials compromise is actually so pervasive that GitHub is now scanning repositories for key patterns. Google does the same. Similarly AWS released a git hook to prevent users from accidentally committing key material.

I don't think it can be qualified as limited at all, especially since in our case we need to keep in mind that a lot of Airflow users are data scientist types that just want to get their work done and aren't necessarily expected to have an acute understanding of proper security practices.

@dstandish
Copy link
Contributor Author

Sounds like you don't like the private_key hook init param. I don't care about this I was just adding it to be consistent with with existing hook, in which pretty much every attribute was overridable in init. There was already password, so an invitation to be a knucklehead already exists ;). If you want to remove this init param I have absolutely no objection.

But concerning allowing private key to be provided directly in airflow connection, as opposed to only key file path, I assume you do not mean to object to that. There is absolutely no difference between providing a private key here and keyfile_dict in GCP, or password in any other conn uri. The connection framework is fundamental to airflow and if we object to that, then we object to all hooks. Moreover I think the case is strong that to require distributing keys across all nodes -- in other words adding special extra processing relative to other connection types -- rather increases likelihood of security lapses relative to just being able to lean on the connection framework alone.

Perhaps it makes sense to make a new PR with your proposed change and we can discuss there?

ashb pushed a commit to ashb/airflow that referenced this pull request Oct 14, 2019
adityav pushed a commit to adityav/airflow that referenced this pull request Oct 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants