[SPARK-5158] [core] [security] Spark standalone mode can authenticate against a Kerberos-secured Hadoop cluster #4106

Closed
wants to merge 1 commit into from

Commits on Feb 25, 2015

  1. [SPARK-5158] Spark standalone mode can authenticate against a Kerberos-secured Hadoop cluster
    
    Previously, Kerberos-secured Hadoop clusters could only be accessed by Spark running on top of YARN.
    In other words, Spark standalone clusters had no way to read from secure Hadoop clusters. Other
    solutions were proposed previously, but all of them attempted to authenticate by obtaining
    a token on a single node and passing that token to all of the other Spark worker nodes.
    Shipping the token is risky, and every previous iteration left the token open
    to man-in-the-middle attacks.
    
    This patch introduces an alternative approach. It assumes that the keytab file has already been
    distributed to every node in the cluster. When Spark starts in standalone mode, each worker
    logs in via Kerberos individually, using the configuration specified in the driver's SparkConf. On
    basic Hadoop cluster setups the keytab file is often already deployed manually on all of the cluster's
    nodes, so it is not a huge stretch to expect the keytab files to be deployed to the Spark worker nodes as
    well, if they are not already there.
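Because each worker logs in on its own, each worker also has to find the principal and a readable keytab locally before it attempts the login. The sketch below shows one way that pre-login check might look. The class name and both configuration keys are hypothetical and chosen for illustration; they are not the keys this patch reads from SparkConf.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

/** Illustrative holder for the Kerberos settings a worker needs before logging in. */
final class WorkerKerberosSettings {
    // Hypothetical configuration keys, for illustration only.
    public static final String PRINCIPAL_KEY = "example.kerberos.principal";
    public static final String KEYTAB_KEY = "example.kerberos.keytab";

    public final String principal;
    public final Path keytab;

    private WorkerKerberosSettings(String principal, Path keytab) {
        this.principal = principal;
        this.keytab = keytab;
    }

    /**
     * Reads the principal and keytab path from the given configuration and
     * verifies that the keytab is readable on the local node.
     */
    public static WorkerKerberosSettings fromConf(Map<String, String> conf) {
        String principal = conf.get(PRINCIPAL_KEY);
        String keytabPath = conf.get(KEYTAB_KEY);
        if (principal == null || principal.isEmpty()) {
            throw new IllegalArgumentException("Missing " + PRINCIPAL_KEY);
        }
        if (keytabPath == null || keytabPath.isEmpty()) {
            throw new IllegalArgumentException("Missing " + KEYTAB_KEY);
        }
        Path keytab = Paths.get(keytabPath);
        // The keytab is assumed to be pre-distributed; fail early if it is absent here.
        if (!Files.isReadable(keytab)) {
            throw new IllegalStateException("Keytab not readable on this node: " + keytab);
        }
        return new WorkerKerberosSettings(principal, keytab);
    }
}
```

With validated settings, a worker could then hand the pair to Hadoop's `UserGroupInformation.loginUserFromKeytab(principal, keytab.toString())`. Failing early on a missing keytab gives a clearer error than a Kerberos failure surfacing later, during the first HDFS access.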
    
    This assumes that Spark will always authenticate with Kerberos using the same principal and keytab,
    and that the login is done at the very start of the job. Strictly speaking, we should reduce the
    amount of code that runs in a logged-in state. To put it another way,
    authentication should be performed only when files are written to or read from HDFS, and
    the subject should be logged out once the read or write completes. However, this is difficult to
    write and prone to errors, so it is left for a future refactor.
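The narrower scoping described above, where the login wraps only the HDFS read or write, could be sketched as a small helper that logs in, runs one action, and always logs out. This is not part of the patch. The `ScopedLogin` class and its `Session` interface are hypothetical names introduced here, not Spark or Hadoop types.

```java
import java.util.concurrent.Callable;

/** Illustrative helper that confines a logged-in state to a single action. */
final class ScopedLogin {
    /** Hypothetical abstraction over a Kerberos session. */
    public interface Session {
        void login() throws Exception;
        void logout() throws Exception;
    }

    private ScopedLogin() {}

    /** Logs in, runs the action, and logs out even if the action throws. */
    public static <T> T runLoggedIn(Session session, Callable<T> action) throws Exception {
        session.login();
        try {
            return action.call();
        } finally {
            // Runs on both success and failure, so the session never outlives the action.
            session.logout();
        }
    }
}
```

A `Session` implementation could, for example, delegate to `javax.security.auth.login.LoginContext`, which provides `login()` and `logout()` methods. The difficulty the commit message mentions remains: every HDFS access would have to be routed through such a helper.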
    mccheah committed Feb 25, 2015
    626318d