[SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values#38065
[SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values#38065dtenedor wants to merge 13 commits intoapache:masterfrom dtenedor:mask-ccn
Conversation
fetch latest from master
|
Can one of the admins verify this patch? |
initial implementation initial implementation more testing more testing
|
Hi @vitaliili-db can you please help review this when you have time? 🙏 |
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
|
@vitaliili-db @amaliujia thanks so much for your review! I responded to your comments, please take another look. |
...alyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RedactionFunctionsSuite.scala
Show resolved
Hide resolved
|
The only test failure is unrelated and flaky |
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/redactionExpressions.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Shouldn't we check if the corresponding format string char is space as well?
There was a problem hiding this comment.
Good question, I think the intention is to skip and ignore whitespace in both the input string and the format string. For example, for use cases like credit card numbers, phone numbers, and social security numbers, these are not material to the data in the field. I put a comment to this effect here.
There was a problem hiding this comment.
Update on this: per our offline conversation, we decided to require whitespace in the format string to match the input string exactly. I updated the PR accordingly.
There was a problem hiding this comment.
Seems we are going to skip the spaces on both input/format string...
There was a problem hiding this comment.
Same, this is per intention (left a comment).
There was a problem hiding this comment.
Update on this: per our offline conversation, we decided to require whitespace in the format string to match the input string exactly. I updated the PR accordingly.
respond to code review comments
|
@gengliangwang thanks for review! 👍 Responded to your comments. |
|
@gengliangwang @vinodkc FYI, it looks like there is a duplication of planned effort between this PR and https://issues.apache.org/jira/browse/SPARK-40686. We should probably dedup these into one effort. @vinodkc do you want to take this on, and I can close my Jira and help with the review? |
|
going to close this in favor of the other work by @vinodkc instead |
What changes were proposed in this pull request?
Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values.
Each of these functions converts a string 'input' representing a credit card number to an updated version applying a transformation to the characters. This can be useful for creating copies of tables with sensitive information removed, but retaining the same schema.
Both functions return an error if the format string is invalid.
MASK_CCN returns an error if the input string does not match the format string.
TRY_MASK_CCN instead returns NULL in that case.
The format can consist of the following characters, case insensitive:
No other format characters are allowed.
Any leading or trailing whitespace in the input string is stripped out.
Examples:
Why are the changes needed?
These functions are useful for processing string values.
Does this PR introduce any user-facing change?
Yes, it adds new SQL functions.
How was this patch tested?
This PR adds a new unit test suite.