[Datalab new issue type] detect ID columns in features #923
Labels: good first issue (good for newcomers), help-wanted (we need your help to add this, but it may be more challenging than a "good first issue")
Add a new Datalab issue type called something like `identifier_column` that checks whether `features` contains a column made up entirely of sequential integers. That is, check whether there exists a column `i` in `features` such that `set(features[:, i]) == {c, c+1, ..., c+n-1}` for some integer `c`, where `n` is the number of rows of `features`. Note that we don't consider the ordering of the column, in case the dataset was pre-shuffled.

If the condition is met, then we say this dataset has the `identifier_column` issue (similar to the non-iid issue). The overall issue-summary score can simply be binary: 0 if the dataset has this issue, 1 otherwise. In this case, the `info` attribute of Datalab should specify which column(s) seem like the identifier. `datalab.issues` does not need to reflect this issue type at all, since it is not something related to the rows of the dataset.

Motivation: ID columns are common in datasets, but they should generally be left out of ML modeling, or they risk harming the results.
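
The check described above can be sketched in a few lines of NumPy. This is only a rough illustration, not Datalab's implementation: the function names `find_identifier_columns` and `identifier_column_summary` are hypothetical, and so are the choices to skip non-numeric columns and to ignore datasets with fewer than two rows.

```python
import numpy as np


def find_identifier_columns(features):
    """Return indices of columns whose values are n distinct consecutive integers.

    Hypothetical helper: row order is ignored, so shuffled ID columns still match.
    """
    features = np.asarray(features)
    # Assumption: a dataset with fewer than 2 rows would match trivially, so skip it.
    if features.ndim != 2 or features.shape[0] < 2:
        return []

    n_rows = features.shape[0]
    id_columns = []
    for i in range(features.shape[1]):
        try:
            column = features[:, i].astype(float)
        except (TypeError, ValueError):
            continue  # Assumption: non-numeric columns are not considered.

        # Every value must be a finite integer.
        if not np.all(np.isfinite(column)) or not np.all(column == np.round(column)):
            continue

        unique_values = np.unique(column)  # sorted ascending
        # n distinct integers spanning a range of n - 1 must be consecutive.
        all_distinct = len(unique_values) == n_rows
        spans_n_values = unique_values[-1] - unique_values[0] == n_rows - 1
        if all_distinct and spans_n_values:
            id_columns.append(i)
    return id_columns


def identifier_column_summary(features):
    """Return (score, info): score is 0.0 if any ID-like column exists, else 1.0."""
    id_columns = find_identifier_columns(features)
    score = 0.0 if id_columns else 1.0
    return score, {"identifier_columns": id_columns}


if __name__ == "__main__":
    X = np.array([[12, 0.5, 7], [10, 0.1, 7], [11, 0.9, 8]])
    print(identifier_column_summary(X))
```

Comparing the number of unique values and the min-to-max range avoids sorting against an explicit `range`, and it makes the shuffled case work without extra handling.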