Data Cleansing #9693

Open
Cassandra-Clark opened this issue Apr 12, 2024 · 1 comment
Assignees
Labels
-libs Libraries: New libraries to be implemented x-new-feature Type: new feature request

Comments

Cassandra-Clark (Contributor) commented Apr 12, 2024

As a user, I often encounter data with common formatting issues that need to be resolved, but resolving them requires relatively complex solutions. For example, removing leading or trailing whitespace most commonly requires knowledge of a specific regular expression. We can dramatically improve the user experience for these common tasks with a Data Cleansing component that encapsulates them as easy-to-understand descriptions of each operation, allowing multiple operations to be executed in sequence against a set of columns.

A rough outline of the proposed API is as follows:

data_cleansing Vector (Text) Vector (Integer | Text | Regex) -> Table
data_cleansing self (operations=Data_Cleansing) (columns=self.column_names) =

This should support the following operations:

  • Duplicate_Whitespace
  • Leading_Whitespace
  • Trailing_Whitespace
  • All_Whitespace
  • Leading_Numbers
  • Trailing_Numbers
  • Punctuation
  • Numbers
  • Symbols
  • Non_ASCII
  • Tabs
  • Letters
  • Replace_Empty_With_Blank
  • Replace_Empty_With_Zero
  • Replace_Empty_With_Null
  • Replace_Zero_With_Empty
  • Replace_Null_With_Empty
  • Replace_Blank_With_Empty
  • Duplicate_Characters
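
To make the "regular expression behind a readable name" idea concrete, here is a minimal Python sketch of a few of the operations above. It is not the proposed Enso implementation: the `Op` enum, the `RULES` table, the `apply_op` helper and the exact patterns are all assumptions chosen for illustration, since the issue does not pin down the precise semantics of each operation.

```python
import re
from enum import Enum


class Op(Enum):
    """A subset of the proposed operations (names follow the list above)."""
    LEADING_WHITESPACE = "Leading_Whitespace"
    TRAILING_WHITESPACE = "Trailing_Whitespace"
    DUPLICATE_WHITESPACE = "Duplicate_Whitespace"
    TABS = "Tabs"
    NUMBERS = "Numbers"
    NON_ASCII = "Non_ASCII"


# Assumed (pattern, replacement) pairs; the issue does not fix these details.
RULES = {
    Op.LEADING_WHITESPACE: (re.compile(r"^\s+"), ""),
    Op.TRAILING_WHITESPACE: (re.compile(r"\s+$"), ""),
    # Collapse a run of whitespace to its first character.
    Op.DUPLICATE_WHITESPACE: (re.compile(r"(\s)\s+"), r"\1"),
    Op.TABS: (re.compile(r"\t"), ""),
    Op.NUMBERS: (re.compile(r"[0-9]"), ""),
    Op.NON_ASCII: (re.compile(r"[^\x00-\x7F]"), ""),
}


def apply_op(op: Op, value: str) -> str:
    """Apply a single named cleansing operation to one text value."""
    pattern, replacement = RULES[op]
    return pattern.sub(replacement, value)
```

With a table like this, a user picks `Op.TRAILING_WHITESPACE` instead of writing `\s+$` by hand, which is the usability gain the issue is asking for.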

These operations should execute in a consistent order, regardless of the order in which the user selects them. For example, choosing {Punctuation, Special_Characters, Tabs} should execute in the same sequence as choosing {Tabs, Special_Characters, Punctuation}.
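
One way to get order-independent execution is to sort the selection against a fixed canonical list before running anything. The sketch below is in Python and uses the three names from the example; the particular order in `CANONICAL_ORDER` and the `normalize` helper are hypothetical, as the issue does not say which sequence is intended.

```python
# Hypothetical canonical order; the issue does not specify the actual sequence.
CANONICAL_ORDER = ["Tabs", "Special_Characters", "Punctuation"]


def normalize(selected):
    """Return the selected operation names in canonical execution order."""
    rank = {name: index for index, name in enumerate(CANONICAL_ORDER)}
    unknown = set(selected) - rank.keys()
    if unknown:
        raise ValueError(f"Unknown operations: {sorted(unknown)}")
    # De-duplicate, then sort by canonical position rather than user order.
    return sorted(set(selected), key=rank.__getitem__)
```

Both `normalize(["Punctuation", "Special_Characters", "Tabs"])` and `normalize(["Tabs", "Special_Characters", "Punctuation"])` return the same list, so the result no longer depends on the order in which the operations were selected.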

Each operation should apply only to the applicable columns within the selection: Replace_Empty_With_Zero will apply to integer, decimal, and float columns, but not to a selected String column. Skipping an inapplicable column should not produce a warning.
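
The applicability rule can be sketched as a silent skip for columns of the wrong type. This Python example is an illustration only: the table is modelled as a plain dict of lists, `column_types` and `NUMERIC_TYPES` are hypothetical names, and `None` stands in for an "empty" numeric cell, which the issue does not define.

```python
# Assumed names for the numeric column types mentioned in the issue.
NUMERIC_TYPES = {"Integer", "Decimal", "Float"}


def replace_empty_with_zero(table, column_types, columns):
    """Replace empty (None) cells with 0 in the selected numeric columns.

    Non-numeric columns in the selection are skipped silently.
    """
    result = dict(table)
    for name in columns:
        if column_types[name] not in NUMERIC_TYPES:
            continue  # not applicable: skip without raising or warning
        result[name] = [0 if value is None else value for value in table[name]]
    return result


table = {"qty": [1, None, 3], "name": ["a", None, "c"]}
column_types = {"qty": "Integer", "name": "String"}
cleaned = replace_empty_with_zero(table, column_types, ["qty", "name"])
```

Here `cleaned["qty"]` has its empty cell replaced with `0`, while `cleaned["name"]` is returned unchanged even though it was part of the selection.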

Cassandra-Clark (Contributor, Author) commented Apr 24, 2024

Removed case modification, as it doesn't fit the pattern of removing or replacing existing values.

@AdRiley AdRiley self-assigned this May 7, 2024
@AdRiley AdRiley added x-new-feature Type: new feature request -libs Libraries: New libraries to be implemented labels May 7, 2024
Projects
Status: 🟢 Accepted
Development

No branches or pull requests

2 participants