Exception when running LabelAmbiguity check #363

Closed
ItayGabbay opened this issue Dec 30, 2021 · 1 comment · Fixed by #399
@ItayGabbay
Contributor

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_34/2864323761.py in <module>
      1 from deepchecks.checks import LabelAmbiguity
      2 
----> 3 LabelAmbiguity().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/label_ambiguity.py in run(self, dataset, model)
     64 
     65         group_unique_data = dataset.data.groupby(dataset.features, dropna=False)
---> 66         group_unique_labels = group_unique_data.nunique()[label_col]
     67 
     68         num_ambiguous = 0

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in nunique(self, dropna)
   1803         obj = self._obj_with_exclusions
   1804         results = self._apply_to_column_groupbys(
-> 1805             lambda sgb: sgb.nunique(dropna), obj=obj
   1806         )
   1807 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _apply_to_column_groupbys(self, func, obj)
   1709         columns = obj.columns
   1710         results = [
-> 1711             func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
   1712         ]
   1713 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in <listcomp>(.0)
   1709         columns = obj.columns
   1710         results = [
-> 1711             func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
   1712         ]
   1713 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in <lambda>(sgb)
   1803         obj = self._obj_with_exclusions
   1804         results = self._apply_to_column_groupbys(
-> 1805             lambda sgb: sgb.nunique(dropna), obj=obj
   1806         )
   1807 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in nunique(self, dropna)
    671 
    672         result = self.obj._constructor(res, index=ri, name=self.obj.name)
--> 673         return self._reindex_output(result, fill_value=0)
    674 
    675     @doc(Series.describe)

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   3163         levels_list = [ping.group_index for ping in groupings]
   3164         index, _ = MultiIndex.from_product(
-> 3165             levels_list, names=self.grouper.names
   3166         ).sortlevel()
   3167 

/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    618 
    619         # codes are all ndarrays, so cartesian_product is lossless
--> 620         codes = cartesian_product(codes)
    621         return cls(levels, codes, sortorder=sortorder, names=names)
    622 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in cartesian_product(X)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in <listcomp>(.0)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

<__array_function__ internals> in repeat(*args, **kwargs)

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    477 
    478     """
--> 479     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    480 
    481 

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

MemoryError: Unable to allocate 45.3 PiB for an array with shape (25499357367644160,) and data type int16
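For scale, the size in the error message can be checked from the reported shape and dtype. This is only a quick sanity check of the numbers printed above; the variable names are illustrative.

```python
# Element count and dtype are taken from the MemoryError message above.
n_elements = 25499357367644160
itemsize_bytes = 2  # int16 is 2 bytes per element

# Convert bytes to pebibytes (1 PiB = 2**50 bytes).
size_pib = n_elements * itemsize_bytes / 2**50
print(round(size_pib, 1))  # 45.3
```

So the request is not a case of the dataset being slightly too big; the array being built is many orders of magnitude larger than any real data frame.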
@ItayGabbay
Contributor Author

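This appears to be a pandas bug with categorical features that have many unique values: as the traceback shows, nunique() on the groupby ends up in MultiIndex.from_product, which tries to build the Cartesian product of all grouping levels and runs out of memory. Please see: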

pandas-dev/pandas#45128
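A possible workaround, sketched on toy data: passing observed=True to groupby restricts the result to feature combinations that actually occur, so the full product of category levels is never materialized. The frame, column names, and variable names below are made up for illustration, and this is not necessarily the change made in #399.

```python
import pandas as pd

# Hypothetical toy data: two categorical features with unused categories.
df = pd.DataFrame({
    "f1": pd.Categorical(["a", "b", "a", "a"], categories=list("abcd")),
    "f2": pd.Categorical(["x", "y", "x", "x"], categories=list("wxyz")),
    "label": [0, 1, 1, 1],
})
features = ["f1", "f2"]

# observed=False reindexes the result to every combination of categories
# (4 * 4 here); with many high-cardinality features this product explodes.
full = df.groupby(features, observed=False)["label"].nunique()

# observed=True keeps only the combinations present in the data.
seen = df.groupby(features, observed=True)["label"].nunique()

# Groups with more than one distinct label are the ambiguous ones.
num_ambiguous_groups = int((seen > 1).sum())
print(len(full), len(seen), num_ambiguous_groups)
```

On this toy frame the unobserved groups only add zero-count rows, so dropping them does not change which groups are flagged as ambiguous.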
