Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropped Support for Custom Metadata Types #8625

Closed
DamianBarabonkovQC opened this issue Jan 26, 2022 · 3 comments · Fixed by #8629
Closed

Dropped Support for Custom Metadata Types #8625

DamianBarabonkovQC opened this issue Jan 26, 2022 · 3 comments · Fixed by #8629
Assignees
Labels
needs info Needs further information from the user

Comments

@DamianBarabonkovQC
Copy link

DamianBarabonkovQC commented Jan 26, 2022

What happened:

A recent pull request titled "Fail if meta is not a pandas object": #8563 breaks support for custom metadata types in dask. This pull request was motivated by the issue: #8537.

In our codebase, we use a custom Python class which is iterable as the dask metadata value. I would like to kindly ask if this dropped support is intentional. According to the documentation of DataFrame.map_partitions, the meta parameter may be one of "pd.DataFrame, pd.Series, dict, iterable, tuple".

More specifically, this relates to the Kartothek library with this affected custom metadata class.

What you expected to happen:

The expected behavior is to maintain support for iterable custom metadata classes, as dask did before the aforementioned pull request. This change breaks our code.

Minimal Complete Verifiable Example:

Below is a simplified code snippet reproducing this new behavior.

from typing import Iterable

import pandas as pd
import dask.dataframe as dd

class CustomMetadata(Iterable):
    """Custom class iterator returning pandas types."""

    def __init__(self, max=0):
        self.types = [(None, "f8")]

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n < len(self.types):
            ret = self.types[self.n]
            self.n += 1
            return ret
        else:
            raise StopIteration

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(df, a, b=1):
    return df.x + df.y + a + b

# Works just fine
mapped_ddf = ddf.map_partitions(myadd, 1, b=2, meta=[(None, "f8")])
print(mapped_ddf)
# Dask DataFrame Structure:
#                      0
# npartitions=2
# 0              float64
# 3                  ...
# 4                  ...
# Dask Name: myadd, 4 tasks

# Fails
mapped_ddf_custom_meta = ddf.map_partitions(myadd, 1, b=2, meta=CustomMetadata)
print(mapped_ddf_custom_meta)
# Traceback (most recent call last):
#   File "/home/X931300/debugging/playground_dask/dask_metadata.py", line 35, in <module>
#     mapped_ddf_custom_meta = ddf.map_partitions(myadd, 1, b=2, meta=CustomMetadata)
#   File "/home/X931300/debugging/dask/dask/dataframe/core.py", line 772, in map_partitions
#     return map_partitions(func, self, *args, **kwargs)
#   File "/home/X931300/debugging/dask/dask/dataframe/core.py", line 6060, in map_partitions
#     raise ValueError(
# ValueError: Meta is not valid, `map_partitions` expects output to be a pandas object. Try passing a pandas object as meta or a dict or tuple representing the (name, dtype) of the columns.
#
#
# Before the pull request #8563, the output was like above:
# Dask Series Structure:
# npartitions=2
# 0    object
# 3       ...
# 4       ...
# dtype: object
# Dask Name: myadd, 4 tasks

Environment:

  • Dask version: 2022.1.0+27.g95e1cf31 (I pulled the most recent code on main and built/installed it locally)
  • Python version: 3.10.2
  • Operating System: Ubuntu 20.04.3 LTS
  • Install method (conda, pip, source): source
@DamianBarabonkovQC DamianBarabonkovQC changed the title Lost Support for Custom Metadata Types Dropped Support for Custom Metadata Types Jan 26, 2022
@jsignell
Copy link
Member

Thanks for opening this @DamianBarabonkovQC this was definitely not an intentional effect. I will change it to accept an iterable an ping you on the PR.

@jsignell
Copy link
Member

Ok so I don't think this was ever supported how you expected. If I revert #8563 and run your code I get meta as a pandas.Series with dtype = "object". This is because you haven't instantiated the class so it is interpreted as an object. If you do instantiate the class, you get a different error:

ddf.map_partitions(myadd, 1, b=2, meta=CustomMetadata())
# TypeError: Don't know how to create metadata from <__main__.CustomMetadata object at 0x7fd88f40df40>

So I guess I'm a little confused if this ever worked properly and I have two proposed fixes.

  1. Downgrade meta error in #8563 to warning #8628
  2. Really allow any iterable to be passed as a meta #8629

@fjetter
Copy link
Member

fjetter commented Jan 27, 2022

More specifically, this relates to the Kartothek library with this affected custom metadata class.

Regarding the kartothek library, the MetaPartition class is not a dask meta object. If anything, you could have a dask Series with elements of type MetaPartition (in pandas/dask type system this would still be object).
If the library does this wrong, I suggest to open an issue there. If this is about your own code you might already be able to fix your problem by changing this to object instead.

@jsignell jsignell added the needs info Needs further information from the user label Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs info Needs further information from the user
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants