[WIP][SPARK-40652][SQL] Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values#38101

Closed
dtenedor wants to merge 16 commits into apache:master from dtenedor:mask-phone

@dtenedor dtenedor commented Oct 4, 2022

What changes were proposed in this pull request?

Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values.

Each of these functions takes a string 'input' representing a phone number and returns a copy in which the characters are transformed according to a format string. This is useful for creating copies of tables that keep the same schema but have sensitive information removed.

Both functions return an error if the format string is invalid.

MASK_PHONE returns an error if the input string does not match the format string.

TRY_MASK_PHONE instead returns NULL in that case.

The format can consist of the following characters, case insensitive:

  • Each 'X' represents a digit which will be converted to 'X' in the result.
  • Each digit '0'-'9' represents a digit which will be left unchanged in the result.
  • Whitespace in the input string is left unchanged.
  • Whitespace in the format string is ignored.
  • Each '-', '+', '(' or ')' character must match the same character in the input string.
  • The default format string is: (XXX) XXX-XXXX.

No other format characters are allowed.

Examples:

> SELECT MASK_PHONE(num) FROM VALUES ("(555) 867-5309") AS tab(num);
  (XXX) XXX-XXXX
> SELECT MASK_PHONE("  555 867 5309", "  XXX XXX XXXX");
    XXX XXX XXXX
> SELECT MASK_PHONE("  +1 555 867 5309", "  +1 XXX XXX XXXX");
    +1 XXX XXX XXXX
> SELECT MASK_PHONE("[555 867 5309]", "[XXX XXX XXXX]");
  Error: the format string is invalid
> SELECT TRY_MASK_PHONE("+15558675309");
  NULL
> SELECT TRY_MASK_PHONE("+1 555 867 5309", "+1 (XXX) XXX-XXXX");
  NULL
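
The rules and examples above can be modelled with a short reference sketch. This is not the Scala implementation from the PR; it is a minimal Python model written only from the description. The names `mask_phone`, `try_mask_phone`, `InvalidFormatError` and `InputMismatchError` are hypothetical. It also assumes that a digit in the format string matches any digit in the input (the description does not say whether the digits must be equal).

```python
from typing import Optional

DEFAULT_FORMAT = "(XXX) XXX-XXXX"
PUNCTUATION = set("-+()")
DIGITS = set("0123456789")


class InvalidFormatError(ValueError):
    """The format string contains a character outside the allowed set."""


class InputMismatchError(ValueError):
    """The input string does not match the format string."""


def mask_phone(value: str, fmt: str = DEFAULT_FORMAT) -> str:
    # Whitespace in the format string is ignored; 'x' and 'X' are equivalent.
    pattern = [c.upper() for c in fmt if not c.isspace()]
    for c in pattern:
        if c != "X" and c not in DIGITS and c not in PUNCTUATION:
            raise InvalidFormatError(f"invalid format character: {c!r}")

    out = []
    pos = 0  # index of the next unconsumed format character
    for ch in value:
        if ch.isspace():
            out.append(ch)  # whitespace in the input is left unchanged
            continue
        if pos >= len(pattern):
            raise InputMismatchError("input is longer than the format")
        expected = pattern[pos]
        pos += 1
        if expected == "X" and ch in DIGITS:
            out.append("X")  # masked digit
        elif expected in DIGITS and ch in DIGITS:
            out.append(ch)  # assumption: any input digit is accepted and kept
        elif expected in PUNCTUATION and ch == expected:
            out.append(ch)  # punctuation must match exactly
        else:
            raise InputMismatchError(f"unexpected character: {ch!r}")
    if pos != len(pattern):
        raise InputMismatchError("input is shorter than the format")
    return "".join(out)


def try_mask_phone(value: str, fmt: str = DEFAULT_FORMAT) -> Optional[str]:
    # An invalid format still raises; only an input mismatch becomes None (NULL).
    try:
        return mask_phone(value, fmt)
    except InputMismatchError:
        return None
```

Under these assumptions, `mask_phone("(555) 867-5309")` returns `(XXX) XXX-XXXX`, and `try_mask_phone("+15558675309")` returns `None` because the input does not match the default format.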

This PR also adds MASK_DIGITS as an alias for MASK_CCN, since any format string can be specified for custom digit-masking use cases, e.g. social security numbers.

Why are the changes needed?

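These functions make it easy to redact phone numbers stored as string values, for example when creating copies of tables with sensitive information removed.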

Does this PR introduce any user-facing change?

Yes, it adds new SQL functions.

How was this patch tested?

This PR adds a new unit test suite.

@AmplabJenkins

Can one of the admins verify this patch?

@dtenedor
Contributor Author

@gengliangwang @HyukjinKwon @vinodkc FYI, it looks like there is a duplication of planned effort between this PR and https://issues.apache.org/jira/browse/SPARK-40686. We should probably dedup these into one effort. @vinodkc do you want to take this on, and I can close my Jira and help with the review?

@dtenedor
Contributor Author

Going to close this in favor of the other effort by @vinodkc instead

@dtenedor dtenedor closed this Oct 28, 2022