Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-9645] [yarn] [core] Allow shuffle service to read shuffle files. #7966

Closed
wants to merge 2 commits into from

Conversation

vanzin
Copy link
Contributor

@vanzin vanzin commented Aug 5, 2015

Spark should not mess with the permissions of directories created
by the cluster manager. Here, by setting the block manager dir
permissions to 700, the shuffle service (running as the YARN user)
wouldn't be able to serve shuffle files created by applications.

Also, the code to protect the local app dir was missing in standalone's
Worker; that has been now added. Since all processes run as the same
user in standalone, chmod 700 should not cause problems.

Spark should not mess with the permissions of directories created
by the cluster manager. Here, by setting the block manager dir
permissions to 700, the shuffle service (running as the YARN user)
wouldn't be able to serve shuffle files created by applications.
@vanzin
Copy link
Contributor Author

vanzin commented Aug 5, 2015

FYI @dragos since you may need to check what Mesos does, since that line was part of your change in SPARK-6287.

@@ -133,7 +133,6 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon
Utils.getConfiguredLocalDirs(conf).flatMap { rootDir =>
try {
val localDir = Utils.createDirectory(rootDir, "blockmgr")
Utils.chmod700(localDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this introduce security problems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The cluster manager is responsible for securing the app's directory (in YARN, only the app owner and the YARN group can read it; in standalone, only the user running the standalone service can; not sure about Mesos, which is why the Mesos code might need to be fixed separately).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually standalone seems to be missing the chmod call; I'll just add that to this change.

@andrewor14
Copy link
Contributor

Good catch!

@andrewor14
Copy link
Contributor

@tnachen @dragos could you verify the security aspects in Mesos?

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #39894 has finished for PR 7966 at commit 6e07b31.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #39887 has finished for PR 7966 at commit 384ba6a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Aug 5, 2015

pyspark tests look really flaky lately.

@vanzin
Copy link
Contributor Author

vanzin commented Aug 5, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #238 has finished for PR 7966 at commit 6e07b31.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor

Not just pyspark tests... retest this please

@tnachen
Copy link
Contributor

tnachen commented Aug 5, 2015

Mesos also chown the sandbox directory for you, but I'm not sure where this Utils.getConfiguredLocalDirs directories are going to be created.
IIUC Spark just creates in a temporary directory that's outside of Mesos work directory, and if that's true than we don't chmod that.

@SparkQA
Copy link

SparkQA commented Aug 5, 2015

Test build #39917 has finished for PR 7966 at commit 6e07b31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #39919 has finished for PR 7966 at commit 6e07b31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Aug 6, 2015

Unless someone has a different opinion, I'm gonna merge this because it's a serious regression. Mesos can be fixed separately if needed.

@andrewor14
Copy link
Contributor

@tnachen can you investigate? This is potentially a blocker for 1.5.

@tnachen
Copy link
Contributor

tnachen commented Aug 6, 2015

@andrewor14 I just took a look and I see that it's creating a local directory that's not part of Mesos sandbox. I think we should definitely fix it so it creates in the sandbox in another ticket so it the temporary data will also be managed by Mesos and can be removed with mesos when the task is gone.

Anyhow, it seems like you also need to manage the permission on the folder with Mesos since it's outside of Mesos, which means the chmod still applies for Mesos.

@vanzin
Copy link
Contributor Author

vanzin commented Aug 6, 2015

which means the chmod still applies for Mesos.

I'm a little confused about what you wrote, especially the last part. Which chmod are you talking about?

If Mesos creates directories for the application, it should tell the Spark application how to find them, and Spark should use that information. See how standalone does (the lines I touched in this PR, plus ExecutorRunner.fetchAndRunExecutor), for example.

@tnachen
Copy link
Contributor

tnachen commented Aug 6, 2015

Yes Mesos creates a directory for each task it runs. And if you run any application with Mesos your current directory should be in that directory, or else it's accessed by looking at MESOS_SANDBOX env variable.
So either we fix the code to use that as the default local dir for creating temporary dirs in Mesos mode, or we keep the same permissions changes code that's there for now since Standalone and Mesos mode are creating the directories at the same place. Does that makes sense?

@vanzin
Copy link
Contributor Author

vanzin commented Aug 6, 2015

we keep the same permissions changes code that's there for now

What permission changes? The one I'm removing? That's the part I'm confused about.

I'm removing this chmod because it breaks YARN secure mode. To use MESOS_SANDBOX you should change the code in Utils.scala that figures out the application's scratch directories (getConfiguredLocalDirs).

Standalone does not need this chmod in DiskBlockManager, because the standalone cluster manager secures the app's directory (well, it should have been doing that, but it seems that it stopped doing that at some point and so I restored the functionality in this change).

So it sounds to me like there might be a fix needed for Mesos, but since I have no access to a Mesos cluster for testing, please open a new bug to track it.

@tnachen
Copy link
Contributor

tnachen commented Aug 6, 2015

Ah got it I thought the chmod is needed but just removed for YARN since YARN manages the directory.
Yes I'll create a new ticket for Mesos seperately, then I think from Mesos POV this PR doesn't effect it.

@vanzin
Copy link
Contributor Author

vanzin commented Aug 6, 2015

Ok, sounds like we have an agreement on how to move forward, so I'll merge this.

asfgit pushed a commit that referenced this pull request Aug 6, 2015
Spark should not mess with the permissions of directories created
by the cluster manager. Here, by setting the block manager dir
permissions to 700, the shuffle service (running as the YARN user)
wouldn't be able to serve shuffle files created by applications.

Also, the code to protect the local app dir was missing in standalone's
Worker; that has been now added. Since all processes run as the same
user in standalone, `chmod 700` should not cause problems.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7966 from vanzin/SPARK-9645 and squashes the following commits:

6e07b31 [Marcelo Vanzin] Protect the app dir in standalone mode.
384ba6a [Marcelo Vanzin] [SPARK-9645] [yarn] [core] Allow shuffle service to read shuffle files.

(cherry picked from commit e234ea1)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in e234ea1 Aug 6, 2015
@vanzin vanzin deleted the SPARK-9645 branch August 6, 2015 22:41
@dragos
Copy link
Contributor

dragos commented Aug 7, 2015

You shouldn't assume those directories are under the Mesos sandbox. They are user configurable (in the Hadoop world there would be several of them on different disks to allow parallel reads/writes). Also, shuffle files live there so you can't delete them when the executor exits, or dynamic allocation will break and Spark won't find them anymore. This is the bug that @tnachen reported first on the PR to add dynamic allocation on Mesos.

That chmod was simply following what was done before, so I'm surprised there's any regression in Yarn. As far as I could tell at the time, Yarn mode took a different code path. Maybe some changes merged later unified some logic? Given that standalone mode now also allows dynamic allocation...

@vanzin
Copy link
Contributor Author

vanzin commented Aug 8, 2015

You shouldn't assume those directories are under the Mesos sandbox. They are user configurable (in the Hadoop world there would be several of them on different disks to allow parallel reads/writes).

Maybe that's how it works in mesos, and in such case, you should figure out how to fix it (in Mesos or in Spark). Yes, in YARN you can configure multiple of these directories but (i) Spark handles that just fine and (ii) YARN creates per-application directories, where the shuffle files are written. When containers (= executors) go away, those directories still remain. They're only deleted after the app finished.

Nothing has changed here at all. You had to add this chmod, which means that you intentionally tried to change the behavior. And that broke existing code.

@dragos
Copy link
Contributor

dragos commented Aug 12, 2015

Sorry for that. I was following this line which doesn't seem to be used by Yarn. I should have been more careful in my assumptions.

edit: my reply was mostly answering @tnachen's comments about moving these directories in the Mesos sandbox, which won't work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants