[Datalab new issue type] detect ID columns in features #923

jwmueller · 2023-12-16T04:31:30Z

Add a new Datalab issue type, called something like identifier_column, that checks whether features contains a column made up entirely of sequential integers.

That is, check whether there exists a column i in features such that set(features[:,i]) = {c, c+1, ..., c+n-1} for some integer c, where n is the number of rows of features. Note that we ignore the ordering of the column, in case the dataset was pre-shuffled.
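
A minimal sketch of this check in plain Python, for illustration only. The function names `find_identifier_columns` and `_is_consecutive_integers` are hypothetical and not part of cleanlab. It assumes features can be iterated row by row and indexed by column position, and it treats a column of n rows as an identifier when its values are n distinct consecutive integers, in any order:

```python
from typing import Any, Iterable, List, Sequence


def _is_consecutive_integers(column: Iterable[Any]) -> bool:
    """True if the values are distinct integers forming an unbroken run."""
    values = set()
    count = 0
    for v in column:
        count += 1
        if isinstance(v, bool):
            return False  # do not treat True/False as 1/0
        try:
            as_int = int(v)
        except (TypeError, ValueError, OverflowError):
            return False  # None, NaN, inf, or non-numeric text
        if as_int != v:
            return False  # non-integer float such as 2.5, or a numeric string
        values.add(as_int)
    if count == 0:
        return False
    # n distinct integers spanning a range of n - 1 must be {c, ..., c + n - 1};
    # using a set makes the check independent of row order.
    return len(values) == count and max(values) - min(values) == count - 1


def find_identifier_columns(features: Sequence[Sequence[Any]]) -> List[int]:
    """Return the indices of columns that look like identifier columns."""
    rows = list(features)
    if not rows:
        return []
    return [
        i
        for i in range(len(rows[0]))
        if _is_consecutive_integers(row[i] for row in rows)
    ]


# Column 0 holds {10, 11, 12} in shuffled order; column 2 has a duplicate.
example = [[10, 0.5, 3], [12, 0.1, 3], [11, 0.9, 4]]
identifier_columns = find_identifier_columns(example)
```

One consequence of this rule is that any integer column in a single-row dataset matches trivially, so a real implementation may want a minimum row count.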

If the condition is met, we say the dataset has the identifier_column issue (similar to the non-iid issue). The overall issue summary score can simply be binary: 0 if the dataset has this issue, 1 otherwise.

In this case the info attribute of Datalab should specify which column(s) seem like the identifier.

datalab.issues does not need to reflect this issue-type at all, since it is not something related to the rows of the dataset.
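
As a rough illustration of the reporting described above, the sketch below turns a list of detected column indices into a binary summary score and an info dictionary. The function name `summarize_identifier_issue` and the dictionary keys are assumptions made for this example, not existing Datalab names:

```python
from typing import Any, Dict, List, Tuple


def summarize_identifier_issue(
    identifier_columns: List[int],
) -> Tuple[float, Dict[str, Any]]:
    """Build a dataset-level score and info from detected column indices."""
    has_issue = len(identifier_columns) > 0
    # Binary score: 0 if the dataset has the issue, 1 otherwise.
    score = 0.0 if has_issue else 1.0
    # Hypothetical key names; the info should say which column(s) look like IDs.
    info = {
        "identifier_columns": list(identifier_columns),
        "num_identifier_columns": len(identifier_columns),
    }
    return score, info


score, info = summarize_identifier_issue([0])
```

No per-row output is produced here, in line with the point that this issue type is not about individual rows.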

Motivation: ID columns are common in datasets, but generally should be left out of ML modeling or they risk harming the results.


gmottajr · 2023-12-17T06:56:54Z

Hi Jonas,
I came across your issue on GitHub and would like to clarify a few points to ensure I fully understand your requirements.

Based on your description, it seems you're looking for a method to process a dataset (structured as a list of lists, with each inner list representing a row) and identify columns that consist of sequentially increasing integers. These columns are often identifiers and might not be useful for certain types of data analyses or machine learning models. Is my understanding correct?

To address this, my preliminary proposal would be to implement a method within a class that takes a dataset as input, removes any columns of sequentially increasing integers, and returns the modified dataset. How does that sound to you? Would this approach meet your needs?

Please let me know if this aligns with your expectations or if there are any additional details or specifications you require.

Thank you!

jwmueller · 2023-12-19T10:04:43Z

There's a lot more work involved than that. The goal is to create a new IssueManager following the instructions in this guide:

https://docs.cleanlab.ai/master/cleanlab/datalab/guide/custom_issue_manager.html

This new IssueManager should specifically look at each column of the features input (a 2D matrix with rows = data points, columns = individual features).

jwmueller added the labels "good first issue" (Good for newcomers) and "help-wanted" (We need your help to add this, but it may be more challenging than a "good first issue") on Dec 16, 2023

01PrathamS mentioned this issue Jan 7, 2024

Add identifier issue_manager #947

Open

MaxJoas mentioned this issue May 9, 2024

added custom issue manager class for detecting identifier columns #1120

Open
