
[SPARK-20435][CORE] More thorough redaction of sensitive information #17725

Closed · wants to merge 4 commits

Conversation

markgrover (Member) commented Apr 22, 2017

This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information into the logs.

The motivation for this change was the appearance of a password, like the one below, in SparkListenerEnvironmentUpdate in event logs under some JVM configurations:
"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."
Previously, the redaction logic only checked whether the key matched the secret regex pattern; if it did, its value was redacted. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell us much, so the value needs to be searched as well. This PR expands the check to cover values too.
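A rough sketch of the behavioral change in Scala (this is not the actual Spark code; the helper names redactKeyOnly/redactKeyOrValue are made up for illustration, while the pattern and replacement text mirror the defaults described in this PR):

```scala
import scala.util.matching.Regex

object RedactionSketch {
  // Default pattern and replacement text described in this PR.
  val redactionPattern: Regex = "(?i)secret|password".r
  val replacementText = "*********(redacted)"

  // Old behavior: redact a value only when its key matches the pattern.
  def redactKeyOnly(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      if (redactionPattern.findFirstIn(k).isDefined) (k, replacementText) else (k, v)
    }

  // New behavior: also redact when the value matches, which catches keys like
  // "sun.java.command" whose value carries --conf ...PASSWORD=... text.
  def redactKeyOrValue(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      if (redactionPattern.findFirstIn(k).isDefined ||
          redactionPattern.findFirstIn(v).isDefined) (k, replacementText)
      else (k, v)
    }
}
```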

How was this patch tested?

New unit tests were added to ensure that no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted; however, that non-sensitive value contained the literal "secret", which caused it to be redacted anyway. Updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed the test.
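The gist of that test fix, reusing the redactKeyOrValue helper from the sketch above (the property names and values here are illustrative, not the actual UtilsSuite fixture):

```scala
import RedactionSketch.redactKeyOrValue

// Hypothetical properties: the non-sensitive value must not contain "secret"
// or "password", otherwise value-based redaction would (correctly) kick in.
val nonSensitive = "spark.regular.property" -> "regular_value"
val sensitive    = "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD" -> "secret_password"

val redacted = redactKeyOrValue(Seq(nonSensitive, sensitive)).toMap
assert(redacted("spark.regular.property") == "regular_value")
assert(redacted("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD") == "*********(redacted)")
```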

SparkQA commented Apr 22, 2017

Test build #76051 has finished for PR 17725 at commit 2f5148a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

because otherwise it gets redacted too
markgrover (Member, Author) commented Apr 22, 2017

A few decisions that I made here:

  • I considered whether just the sun.java.command property should be checked and redacted, but that seemed too specific and likely a band-aid for the current problem rather than a long-term solution, so I decided against it.

  • Redaction for the SparkListenerEnvironmentUpdate event was only being done on Spark Properties, while sun.java.command is part of System Properties. I considered doing redaction for System Properties in addition to Spark Properties (that would have gone somewhere around here) but decided against it, because that would have meant even more hardcoding and I didn't see why these two kinds of properties were special enough to be redacted but not the rest. So I decided to redact information from all four kinds of properties for that particular event (which is still the only event we run redaction on).

  • One way to redact the property value would have been to redact the minimum possible portion of the value while keeping the rest intact. For example, if the following were the unredacted case:
    "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --conf spark.other.property=2"
    one option for the redacted output could have been:
    "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=*********(redacted) --conf spark.other.property=2"
    However, such a redaction is very hard to maintain. We would have had to take the current regex (which is (?i)secret|password by default) and add matchers to it, like "("+SECRET_REDACTION_DEFAULT+"[^ ]*=)[^ ]*", in order to squeeze out and replace just the matched portion (see the sketch after this list). That seemed very fragile, and even worse when the user supplies a non-default regex. So I decided it was easiest to simply replace the entire value, even though only a small part of it contained secret or password. The redacted property now looks like:
    "sun.java.command":"*********(redacted)"

  • One thing I didn't explicitly check was the performance implication of this change. I bring this up because we were previously matching only keys against the regex; now, if the key doesn't match, we also match the value, so in the worst case we do twice as many regex matches as before. Also, we were previously doing regex matching only on Spark Properties; now we do it on all properties - Spark Properties, System Properties, JVM Properties and Classpath Properties. I don't think this should have a big performance impact, so I didn't invest time in it; I mention it here in the interest of full disclosure.
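Here is the sketch referenced in the third bullet: a rough, non-authoritative illustration (not code from this PR) of what partial-value redaction would have required. The object and method names are made up; defaultPattern stands in for the SECRET_REDACTION_DEFAULT mentioned above.

```scala
import scala.util.matching.Regex

object PartialRedactionSketch {
  // The default value of spark.redaction.regex, per the discussion above.
  val defaultPattern = "(?i)secret|password"
  val replacementText = "*********(redacted)"

  // Redacting only the matched portion means wrapping the configured pattern
  // into a capturing matcher, roughly "(" + pattern + "[^ ]*=)[^ ]*". Note that
  // even here the pattern needs an extra non-capturing group so the alternation
  // binds correctly, and the matcher assumes space-separated key=value tokens -
  // part of why this approach was considered fragile.
  val partialMatcher: Regex = ("((?:" + defaultPattern + ")[^ ]*=)[^ ]*").r

  // Turns e.g. "...HADOOP_CREDSTORE_PASSWORD=secret_password ..." into
  // "...HADOOP_CREDSTORE_PASSWORD=*********(redacted) ...".
  def partiallyRedact(value: String): String =
    partialMatcher.replaceAllIn(value,
      m => Regex.quoteReplacement(m.group(1) + replacementText))
}
```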

Thanks in advance for reviewing.

SparkQA commented Apr 22, 2017

Test build #76055 has started for PR 17725 at commit 14b0d72.

SparkQA commented Apr 22, 2017

Test build #76054 has finished for PR 17725 at commit 0603686.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

Jenkins, retest this please.

SparkQA commented Apr 24, 2017

Test build #76114 has finished for PR 17725 at commit 14b0d72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// ...
// where jvmInformation, sparkProperties, etc. are sequences of tuples.
// We go through the various kinds of properties and redact sensitive information from them.
val redactedProps = event.environmentDetails.map{
Review comment (Contributor):

.map { case (name, props) =>
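For reference, a minimal sketch of the suggested destructuring, assuming environmentDetails maps section names to sequences of key/value pairs (as the quoted comment says) and reusing the hypothetical redactKeyOrValue helper from the earlier sketch:

```scala
import RedactionSketch.redactKeyOrValue

// Illustrative environment details: section name -> key/value pairs.
val environmentDetails: Map[String, Seq[(String, String)]] = Map(
  "Spark Properties"  -> Seq("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD" -> "secret_password"),
  "System Properties" -> Seq("sun.java.command" -> "org.apache.spark.deploy.SparkSubmit ...")
)

// The suggested pattern-matching form, applied to every kind of property.
val redactedProps = environmentDetails.map { case (name, props) =>
  name -> redactKeyOrValue(props)
}
```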

// This does mean we may introduce more false positives - for example, if the value of an
// arbitrary property contains the term 'password', we may redact the value from the UI and
// logs. To work around this, the user would have to make the spark.redaction.regex property
// more specific.
kvs.map { kv =>
Review comment (Contributor):

Since you're looking at values now...

.map { case (key, value) =>
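To address the false-positive concern in the quoted comment, a user could narrow the pattern via spark.redaction.regex. A minimal sketch; the narrower regex below is only an illustrative choice, not a recommended value:

```scala
import org.apache.spark.SparkConf

// spark.redaction.regex is the property named in the comment above; this
// example narrows it so that only keys/values mentioning "password" or
// "credstore" are redacted, at the cost of missing other kinds of secrets.
val conf = new SparkConf()
  .set("spark.redaction.regex", "(?i)password|credstore")
```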

runSparkSubmit(args)
val listStatuses = fileSystem.listStatus(testDirPath)
val logData = EventLoggingListener.openEventLog(listStatuses.last.getPath, fileSystem)
Source.fromInputStream(logData).getLines().foreach {
Review comment (Contributor):

.foreach { line =>
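For context, a sketch of how such a scan might assert redaction using the suggested line binding (the exact assertions in the PR's test may differ; "secret_password" is the sentinel value from the description above):

```scala
import java.io.InputStream
import scala.io.Source

// logData would be the InputStream returned by EventLoggingListener.openEventLog above.
def assertNoSecrets(logData: InputStream): Unit = {
  Source.fromInputStream(logData).getLines().foreach { line =>
    // Illustrative check: the raw secret must never appear in the event log.
    assert(!line.contains("secret_password"), s"found unredacted secret in: $line")
  }
}
```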

"--conf", "spark.hadoop.fs.defaultFS=unsupported://example.com",
unusedJar.toString)
runSparkSubmit(args)
val listStatuses = fileSystem.listStatus(testDirPath)
Review comment (Contributor):

s/listStatuses/something else.

Use list, statuses, statusList, but "listStatuses" doesn't parse for me.

SparkQA commented Apr 25, 2017

Test build #76143 has finished for PR 17725 at commit 80e40ba.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

Hmm, this is weird, it ran for 4h 10m and got killed due to a timeout. Looking...

markgrover (Member, Author):

Jenkins, retest this please.

markgrover (Member, Author):

Kicking off another run here while I run the tests locally.

SparkQA commented Apr 26, 2017

Test build #76156 has finished for PR 17725 at commit 80e40ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

This time it finished in regular time (2h 29m) but failed a test. I ran that test locally (org.apache.spark.streaming.ReceiverSuite's 'receiver life cycle') and it passed for me, so, mostly out of laziness, I am going to issue another run here.

markgrover (Member, Author):

Jenkins, retest this please.

SparkQA commented Apr 26, 2017

Test build #76194 has finished for PR 17725 at commit 80e40ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

Finally it's passing!

vanzin (Contributor) commented Apr 27, 2017

Merging to master / 2.2.

asfgit pushed a commit that referenced this pull request Apr 27, 2017
This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information into the logs.

The motivation for this change was the appearance of a password, like the one below, in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."`
Previously, the redaction logic only checked whether the key matched the secret regex pattern; if it did, its value was redacted. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell us much, so the value needs to be searched as well. This PR expands the check to cover values too.

## How was this patch tested?

New unit tests were added to ensure that no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted; however, that non-sensitive value contained the literal "secret", which caused it to be redacted anyway. Updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed the test.

Author: Mark Grover <mark@apache.org>

Closes #17725 from markgrover/spark-20435.

(cherry picked from commit 66636ef)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
asfgit closed this in 66636ef Apr 27, 2017
markgrover (Member, Author) commented Apr 27, 2017 via email

markgrover deleted the spark-20435 branch April 27, 2017 18:27