
[SPARK-20435][CORE] More thorough redaction of sensitive information #17725

Closed · wants to merge 4 commits

Conversation

markgrover (Member) commented Apr 22, 2017

This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information into the logs.

The motivation for this change was the appearance of a password, like the one below, in SparkListenerEnvironmentUpdate in event logs under some JVM configurations:
"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."
Previously, the redaction logic only checked whether the key matched the secret regex pattern; if it did, its value was redacted. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell us much, so the value needs to be searched as well. This PR expands the check to cover values too.
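A rough sketch of the behavioral change in Scala (this is not the actual Spark code; the helper names redactKeyOnly/redactKeyOrValue are made up for illustration, while the pattern and replacement text mirror the defaults described in this PR):

```scala
import scala.util.matching.Regex

object RedactionSketch {
  // Default pattern and replacement text described in this PR.
  val redactionPattern: Regex = "(?i)secret|password".r
  val replacementText = "*********(redacted)"

  // Old behavior: redact a value only when its key matches the pattern.
  def redactKeyOnly(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      if (redactionPattern.findFirstIn(k).isDefined) (k, replacementText) else (k, v)
    }

  // New behavior: also redact when the value matches, which catches keys like
  // "sun.java.command" whose value carries --conf ...PASSWORD=... text.
  def redactKeyOrValue(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      if (redactionPattern.findFirstIn(k).isDefined ||
          redactionPattern.findFirstIn(v).isDefined) (k, replacementText)
      else (k, v)
    }
}
```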

How was this patch tested?

New unit tests were added to ensure that no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted; however, that non-sensitive value contained the literal "secret", which caused it to be redacted anyway. Updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed the test.
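The gist of that test fix, reusing the redactKeyOrValue helper from the sketch above (the property names and values here are illustrative, not the actual UtilsSuite fixture):

```scala
import RedactionSketch.redactKeyOrValue

// Hypothetical properties: the non-sensitive value must not contain "secret"
// or "password", otherwise value-based redaction would (correctly) kick in.
val nonSensitive = "spark.regular.property" -> "regular_value"
val sensitive    = "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD" -> "secret_password"

val redacted = redactKeyOrValue(Seq(nonSensitive, sensitive)).toMap
assert(redacted("spark.regular.property") == "regular_value")
assert(redacted("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD") == "*********(redacted)")
```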

SparkQA commented Apr 22, 2017

Test build #76051 has finished for PR 17725 at commit 2f5148a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

because otherwise it gets redacted too
markgrover (Member, Author) commented Apr 22, 2017

A few decisions that I made here:

  • I considered whether just the sun.java.command property should be checked and redacted, but that seemed too specific and likely a band-aid for the current problem rather than a long-term solution, so I decided against it.

  • Redaction for the SparkListenerEnvironmentUpdate event was only being done on Spark Properties, while sun.java.command is part of System Properties. I considered doing redaction for System Properties in addition to Spark Properties (that would have gone somewhere around here) but decided against it, because that would have meant even more hardcoding and I didn't see why these two kinds of properties were special enough to be redacted but not the rest. So I decided to redact information from all four kinds of properties for that particular event (which is still the only event we run redaction on).

  • One way to redact the property value would have been to redact the minimum possible portion of the value while keeping the rest intact. For example, if the following were the unredacted case:
    "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --conf spark.other.property=2"
    one option for the redacted output could have been:
    "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=*********(redacted) --conf spark.other.property=2"
    However, such a redaction is very hard to maintain. We would have had to take the current regex (which is (?i)secret|password by default) and add matchers to it, like "("+SECRET_REDACTION_DEFAULT+"[^ ]*=)[^ ]*", in order to squeeze out and replace just the matched portion (see the sketch after this list). That seemed very fragile, and even worse when the user supplies a non-default regex. So I decided it was easiest to simply replace the entire value, even though only a small part of it contained secret or password. The redacted property now looks like:
    "sun.java.command":"*********(redacted)"

  • One thing I didn't explicitly check was the performance implication of this change. I bring this up because we were previously matching only keys against the regex; now, if the key doesn't match, we also match the value, so in the worst case we do twice as many regex matches as before. Also, we were previously doing regex matching only on Spark Properties; now we do it on all properties - Spark Properties, System Properties, JVM Properties and Classpath Properties. I don't think this should have a big performance impact, so I didn't invest time in it; I mention it here in the interest of full disclosure.
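Here is the sketch referenced in the third bullet: a rough, non-authoritative illustration (not code from this PR) of what partial-value redaction would have required. The object and method names are made up; defaultPattern stands in for the SECRET_REDACTION_DEFAULT mentioned above.

```scala
import scala.util.matching.Regex

object PartialRedactionSketch {
  // The default value of spark.redaction.regex, per the discussion above.
  val defaultPattern = "(?i)secret|password"
  val replacementText = "*********(redacted)"

  // Redacting only the matched portion means wrapping the configured pattern
  // into a capturing matcher, roughly "(" + pattern + "[^ ]*=)[^ ]*". Note that
  // even here the pattern needs an extra non-capturing group so the alternation
  // binds correctly, and the matcher assumes space-separated key=value tokens -
  // part of why this approach was considered fragile.
  val partialMatcher: Regex = ("((?:" + defaultPattern + ")[^ ]*=)[^ ]*").r

  // Turns e.g. "...HADOOP_CREDSTORE_PASSWORD=secret_password ..." into
  // "...HADOOP_CREDSTORE_PASSWORD=*********(redacted) ...".
  def partiallyRedact(value: String): String =
    partialMatcher.replaceAllIn(value,
      m => Regex.quoteReplacement(m.group(1) + replacementText))
}
```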

Thanks in advance for reviewing.

SparkQA commented Apr 22, 2017

Test build #76055 has started for PR 17725 at commit 14b0d72.

SparkQA commented Apr 22, 2017

Test build #76054 has finished for PR 17725 at commit 0603686.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

Jenkins, retest this please.

SparkQA commented Apr 24, 2017

Test build #76114 has finished for PR 17725 at commit 14b0d72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// ...
// where jvmInformation, sparkProperties, etc. are sequences of tuples.
// We go through the various kinds of properties and redact sensitive information from them.
val redactedProps = event.environmentDetails.map{
Review comment (Contributor):

.map { case (name, props) =>
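For reference, a minimal sketch of the suggested destructuring, assuming environmentDetails maps section names to sequences of key/value pairs (as the quoted comment says) and reusing the hypothetical redactKeyOrValue helper from the earlier sketch:

```scala
import RedactionSketch.redactKeyOrValue

// Illustrative environment details: section name -> key/value pairs.
val environmentDetails: Map[String, Seq[(String, String)]] = Map(
  "Spark Properties"  -> Seq("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD" -> "secret_password"),
  "System Properties" -> Seq("sun.java.command" -> "org.apache.spark.deploy.SparkSubmit ...")
)

// The suggested pattern-matching form, applied to every kind of property.
val redactedProps = environmentDetails.map { case (name, props) =>
  name -> redactKeyOrValue(props)
}
```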

// This does mean we may introduce more false positives - for example, if the value of an
// arbitrary property contains the term 'password', we may redact the value from the UI and
// logs. To work around this, the user would have to make the spark.redaction.regex property
// more specific.
kvs.map { kv =>
Review comment (Contributor):

Since you're looking at values now...

.map { case (key, value) =>
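To address the false-positive concern in the quoted comment, a user could narrow the pattern via spark.redaction.regex. A minimal sketch; the narrower regex below is only an illustrative choice, not a recommended value:

```scala
import org.apache.spark.SparkConf

// spark.redaction.regex is the property named in the comment above; this
// example narrows it so that only keys/values mentioning "password" or
// "credstore" are redacted, at the cost of missing other kinds of secrets.
val conf = new SparkConf()
  .set("spark.redaction.regex", "(?i)password|credstore")
```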

runSparkSubmit(args)
val listStatuses = fileSystem.listStatus(testDirPath)
val logData = EventLoggingListener.openEventLog(listStatuses.last.getPath, fileSystem)
Source.fromInputStream(logData).getLines().foreach {
Review comment (Contributor):

.foreach { line =>
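For context, a sketch of how such a scan might assert redaction using the suggested line binding (the exact assertions in the PR's test may differ; "secret_password" is the sentinel value from the description above):

```scala
import java.io.InputStream
import scala.io.Source

// logData would be the InputStream returned by EventLoggingListener.openEventLog above.
def assertNoSecrets(logData: InputStream): Unit = {
  Source.fromInputStream(logData).getLines().foreach { line =>
    // Illustrative check: the raw secret must never appear in the event log.
    assert(!line.contains("secret_password"), s"found unredacted secret in: $line")
  }
}
```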

"--conf", "spark.hadoop.fs.defaultFS=unsupported://example.com",
unusedJar.toString)
runSparkSubmit(args)
val listStatuses = fileSystem.listStatus(testDirPath)
Review comment (Contributor):

s/listStatuses/something else.

Use list, statuses, statusList, but "listStatuses" doesn't parse for me.

SparkQA commented Apr 25, 2017

Test build #76143 has finished for PR 17725 at commit 80e40ba.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

Hmm, this is weird, it ran for 4h 10m and got killed due to a timeout. Looking...

markgrover (Member, Author):

Jenkins, retest this please.

markgrover (Member, Author):

Kicking off another run here while I run the tests locally.

SparkQA commented Apr 26, 2017

Test build #76156 has finished for PR 17725 at commit 80e40ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

This time it finished in regular time (2h 29m) but failed a test. I ran that test locally (org.apache.spark.streaming.ReceiverSuite's 'receiver life cycle') and it passed for me, so, mostly out of laziness, I am going to issue another run here.

markgrover (Member, Author):

Jenkins, retest this please.

SparkQA commented Apr 26, 2017

Test build #76194 has finished for PR 17725 at commit 80e40ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

markgrover (Member, Author):

Finally it's passing!

vanzin (Contributor) commented Apr 27, 2017

Merging to master / 2.2.

asfgit pushed a commit that referenced this pull request Apr 27, 2017
This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information into the logs.

The motivation for this change was the appearance of a password, like the one below, in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."`
Previously, the redaction logic only checked whether the key matched the secret regex pattern; if it did, its value was redacted. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell us much, so the value needs to be searched as well. This PR expands the check to cover values too.

## How was this patch tested?

New unit tests were added to ensure that no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted; however, that non-sensitive value contained the literal "secret", which caused it to be redacted anyway. Updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed the test.

Author: Mark Grover <mark@apache.org>

Closes #17725 from markgrover/spark-20435.

(cherry picked from commit 66636ef)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
asfgit closed this in 66636ef Apr 27, 2017
markgrover (Member, Author) commented Apr 27, 2017 via email

markgrover deleted the spark-20435 branch April 27, 2017 18:27