[SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values #38065

Closed
dtenedor wants to merge 13 commits into apache:master from dtenedor:mask-ccn

dtenedor commented Oct 1, 2022

What changes were proposed in this pull request?

Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values.

Each of these functions takes a string 'input' representing a credit card number and returns a masked version in which the characters are transformed according to a format string. This can be useful for creating copies of tables with sensitive information removed while retaining the same schema.

Both functions return an error if the format string is invalid.

MASK_CCN returns an error if the input string does not match the format string.

TRY_MASK_CCN instead returns NULL in that case.

The format can consist of the following characters, case insensitive:

  • Each 'X' represents a digit which will be converted to 'X' in the result.
  • Each digit '0'-'9' represents a digit which will be left unchanged in the result.
  • Each '-' character must match a '-' at the same position in the input string.

No other format characters are allowed. If no format string is given, the default is XXXX-XXXX-XXXX-XXXX.
Any leading or trailing whitespace in the input string is stripped before matching.

Examples:

> SELECT MASK_CCN(ccn) FROM VALUES ("1234-5678-9876-5432") AS tab(ccn);
  XXXX-XXXX-XXXX-XXXX
> SELECT MASK_CCN("  1234-5678-9876-5432  ", "XXXX-XXXX-XXXX-1234");
  XXXX-XXXX-XXXX-5432
> SELECT MASK_CCN("[1234-5678-9876-5432]", "[XXXX-XXXX-XXXX-1234]");
  Error: the format string is invalid
> SELECT MASK_CCN("1234567898765432");
  Error: the input string does not match the format
> SELECT MASK_CCN("1234567898765432", "XXXX-XXXX-XXXX-1234");
  Error: the input string does not match the format
> SELECT TRY_MASK_CCN("1234567898765432");
  NULL
> SELECT TRY_MASK_CCN("1234567898765432", "XXXX-XXXX-XXXX-1234");
  NULL
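
For illustration only, the rules and examples above can be sketched in Python. This is not the Spark implementation from this PR: the names `mask_ccn`, `try_mask_ccn`, `InvalidFormatError`, and `InputMismatchError` are hypothetical, and the sketch assumes the stripped input must have the same length as the format string.

```python
from typing import Optional

DEFAULT_FORMAT = "XXXX-XXXX-XXXX-XXXX"  # default stated in the PR description
DIGITS = "0123456789"


class InvalidFormatError(ValueError):
    """Hypothetical error: the format string contains a disallowed character."""


class InputMismatchError(ValueError):
    """Hypothetical error: the input string does not match the format."""


def mask_ccn(value: str, fmt: str = DEFAULT_FORMAT) -> str:
    # Validate the format first: only 'X' (either case), digits, and '-' are allowed.
    for f in fmt:
        if f not in "xX-" and f not in DIGITS:
            raise InvalidFormatError("the format string is invalid")

    # Leading and trailing whitespace in the input is stripped before matching.
    stripped = value.strip()
    if len(stripped) != len(fmt):  # assumption: lengths must agree
        raise InputMismatchError("the input string does not match the format")

    out = []
    for ch, f in zip(stripped, fmt):
        if f == "-":
            # A '-' in the format must line up with a '-' in the input.
            if ch != "-":
                raise InputMismatchError("the input string does not match the format")
            out.append("-")
        elif ch in DIGITS:
            # 'X' masks the digit; a digit in the format keeps the input digit.
            out.append("X" if f in "xX" else ch)
        else:
            raise InputMismatchError("the input string does not match the format")
    return "".join(out)


def try_mask_ccn(value: str, fmt: str = DEFAULT_FORMAT) -> Optional[str]:
    # Same as mask_ccn, but a non-matching input yields None (SQL NULL).
    # An invalid format string still raises, as described above.
    try:
        return mask_ccn(value, fmt)
    except InputMismatchError:
        return None


print(mask_ccn("  1234-5678-9876-5432  ", "XXXX-XXXX-XXXX-1234"))
# prints XXXX-XXXX-XXXX-5432
```

Using two separate error types keeps the distinction drawn above: only an input mismatch is turned into NULL by the TRY_ variant, while an invalid format string is an error in both.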

Why are the changes needed?

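These functions are useful for redacting credit card numbers stored as string values, for example when creating copies of tables with sensitive information removed.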

Does this PR introduce any user-facing change?

Yes, it adds new SQL functions.

How was this patch tested?

This PR adds a new unit test suite.

@AmplabJenkins

Can one of the admins verify this patch?

initial implementation

more testing
dtenedor marked this pull request as ready for review October 3, 2022 20:49
dtenedor changed the title from "[WIP][SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values" to "[SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values" on Oct 3, 2022

dtenedor commented Oct 3, 2022

Hi @vitaliili-db can you please help review this when you have time? 🙏

dtenedor commented Oct 4, 2022

@vitaliili-db @amaliujia thanks so much for your review! I responded to your comments, please take another look.

amaliujia left a comment

LGTM with one comment

dtenedor commented Oct 5, 2022

The only test failure is unrelated and flaky

Member

Shouldn't we check if the corresponding format string char is space as well?

Contributor Author

Good question. I think the intention is to skip whitespace in both the input string and the format string. For use cases like credit card numbers, phone numbers, and social security numbers, whitespace is not material to the data in the field. I put a comment to this effect here.

Contributor Author

Update on this: per our offline conversation, we decided to require whitespace in the format string to match the input string exactly. I updated the PR accordingly.

Member

It seems we are going to skip spaces in both the input string and the format string...

Contributor Author

Same here: this is intentional (I left a comment).

Contributor Author

Update on this: per our offline conversation, we decided to require whitespace in the format string to match the input string exactly. I updated the PR accordingly.

@dtenedor
Copy link
Contributor Author

dtenedor commented Oct 5, 2022

@gengliangwang thanks for the review! 👍 Responded to your comments.

@dtenedor

@gengliangwang @vinodkc FYI, it looks like there is a duplication of planned effort between this PR and https://issues.apache.org/jira/browse/SPARK-40686. We should probably dedup these into one effort. @vinodkc do you want to take this on, and I can close my Jira and help with the review?

@dtenedor

Going to close this in favor of the other work by @vinodkc instead.

dtenedor closed this on Oct 28, 2022