[Datalab new issue type] detect ID columns in features #923

Open
jwmueller opened this issue Dec 16, 2023 · 2 comments
Labels
good first issue Good for newcomers help-wanted We need your help to add this, but it may be more challenging than a "good first issue"

Comments

@jwmueller
Member

jwmueller commented Dec 16, 2023

New Datalab issue type, called something like identifier_column, that checks whether features contains a column consisting entirely of sequential integers.

That is, check whether there exists a column i in features such that set(features[:,i]) = {c, c+1, ..., c+n-1} for some integer c, where n = number of rows of features. Note that we don't consider the ordering of the column, in case the dataset was pre-shuffled.
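A minimal sketch of that per-column check, assuming features can be handled as a NumPy array. The function name is_identifier_column is hypothetical and not part of cleanlab, and the sketch reads the condition as "n distinct consecutive integers for a column with n rows":

```python
import numpy as np


def is_identifier_column(column) -> bool:
    """Return True if `column` holds n distinct consecutive integers, in any order.

    Hypothetical helper for illustration only; not a cleanlab API.
    """
    values = np.asarray(column)
    n = values.size
    # Only integer or float dtypes can qualify (strings, bools, objects are rejected).
    if n == 0 or values.dtype.kind not in "iuf":
        return False
    # Float columns qualify only if every entry is a finite whole number.
    if values.dtype.kind == "f":
        if not (np.all(np.isfinite(values)) and np.all(values == np.round(values))):
            return False
    unique = np.unique(values)  # sorted distinct values
    # n distinct integers spanning a range of exactly n - 1 must be consecutive.
    return bool(unique.size == n and unique[-1] - unique[0] == n - 1)
```

Comparing the sorted unique values rather than the raw column is what makes the check independent of row order, so a shuffled ID column such as [12, 10, 11] still qualifies.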

If the condition is met, then we say this dataset has the identifier_column issue (similar to the non-iid issue). The overall issue-summary-score can simply be binary: 0 if the dataset has this issue, 1 otherwise.

In this case the info attribute of Datalab should specify which column(s) seem like the identifier.

datalab.issues does not need to reflect this issue-type at all, since it is not something related to the rows of the dataset.
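As a rough illustration of the dataset-level outputs described above, the sketch below scans every column of a 2D features array and returns a binary score plus an info dictionary. The function names and the "identifier_columns" key are assumptions made for this example, not existing Datalab names:

```python
import numpy as np


def _is_consecutive_integers(col: np.ndarray) -> bool:
    """Hypothetical helper: True if `col` is n distinct consecutive integers."""
    if col.size == 0 or col.dtype.kind not in "iuf":
        return False
    if col.dtype.kind == "f":
        if not (np.all(np.isfinite(col)) and np.all(col == np.round(col))):
            return False
    unique = np.unique(col)
    return bool(unique.size == col.size and unique[-1] - unique[0] == col.size - 1)


def identifier_column_summary(features):
    """Return (score, info) for a 2D feature matrix (rows = data points).

    score is 0.0 if any column looks like an identifier, 1.0 otherwise.
    info lists the indices of the flagged columns under an assumed key name.
    """
    features = np.asarray(features)
    if features.ndim != 2:
        raise ValueError("features must be a 2D array")
    id_columns = [
        i for i in range(features.shape[1])
        if _is_consecutive_integers(features[:, i])
    ]
    score = 0.0 if id_columns else 1.0
    info = {"identifier_columns": id_columns}
    return score, info
```

Nothing here is computed per row, which matches the point that datalab.issues does not need an entry for this issue type.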

Motivation: ID columns are common in datasets, but generally should be left out of ML modeling or they risk harming the results.

@jwmueller jwmueller added good first issue Good for newcomers help-wanted We need your help to add this, but it may be more challenging than a "good first issue" labels Dec 16, 2023
@gmottajr

Hi Jonas,
I came across your issue on GitHub and would like to clarify a few points to ensure I fully understand your requirements.

Based on your description, it seems you're looking for a method to process a dataset (structured as a list of lists, with each inner list representing a row) and identify columns that consist of sequentially increasing integers. These columns are often identifiers and might not be useful for certain types of data analyses or machine learning models. Is my understanding correct?

To address this, my preliminary proposal would be to implement a method within a class that takes a dataset as input, removes any columns with sequentially increasing integers, and returns the modified dataset. How does that sound to you? Would this approach meet your needs?

Please let me know if this aligns with your expectations or if there are any additional details or specifications you require.

Thank you!

@jwmueller
Member Author

There's a lot more work involved than that. The goal is to create a new IssueManager following the instructions in this guide:

https://docs.cleanlab.ai/master/cleanlab/datalab/guide/custom_issue_manager.html

This new IssueManager should specifically look at each column of the features input (a 2D matrix with rows = data points, columns = individual features).
