Add DataFrame.from_dict classmethod #9017

Merged: 8 commits, May 11, 2022

Conversation

MrPowers (Contributor) commented May 3, 2022

I'm not sure this was added in the right files / properly, so feel free to provide feedback! Thank you!

@GPUtester (Collaborator)

Can one of the admins verify this patch?

quasiben (Member) commented May 3, 2022

add to allowlist

douglasdavis (Member) left a comment

This indeed looks like a nice convenience (and covers more area of the pandas API). I had a few comments.

@pavithraes (Member)

@MrPowers Thanks for this PR, it looks great! I think we can also add this function to the docs here:

github-actions bot added the documentation label May 4, 2022
MrPowers (Contributor, Author) commented May 4, 2022

@pavithraes - good catch about adding this method to the docs. I think I added it properly. Can you take a look and confirm? Thanks!

@@ -33,6 +33,7 @@ File Formats:
read_fwf
from_bcolz
from_array
from_dict

Member

Just a thought, maybe from_dict might be better suited under a different heading "Python Collections"?

This works as-is too because from_array also doesn't perfectly fit under "file formats". I'll leave it to you. :)

Contributor, Author

@pavithraes - this is a good point. I'm actually not entirely sure how from_array works, so not sure the best categorization. I'm guessing it's not referring to a Dask Array cause there is a separate from_dask_array method. I guess I'd argue to keep this as-is, mainly cause I don't know enough to have an opinion.

@pavithraes (Member)

@MrPowers Looking at the CI failure, I think there might be some merge conflicts -- would you mind merging main into this branch and checking for any conflicts?

MrPowers (Contributor, Author) commented May 5, 2022

@pavithraes - I rebased on top of the latest version of main, let's see if that does the trick ;)

@pavithraes (Member)

@MrPowers Thanks!

@jrbourbeau The failing windows-py3.10 test seems unrelated to this PR, but it's not a known flaky test. Could you please help confirm? Other than that, we might be good to merge :)

jrbourbeau (Member) left a comment

Thanks @MrPowers!

{"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)
expected = dd.from_pandas(pandas_df, npartitions=2)
assert_eq(actual, expected)

Member

This is comparing that dd.from_dict and dd.from_pandas are giving a similar result. This is true, but an implementation detail. The check that we want here is that dd.from_dict and the corresponding pandas method (i.e. pd.DataFrame.from_dict) give the same result.

@@ -241,6 +241,15 @@ def test_from_bcolz_column_order():
assert list(df.loc[0].compute().columns) == ["x", "y", "a"]


def test_from_dict_dataframe():

Member

We're not currently testing that the dtype, orient, etc. parameters are working as expected. Looking at the code, they are getting forwarded properly, but it would still be good to add test coverage for those cases

How about something like:

@pytest.mark.parametrize("dtype", [int, float])
@pytest.mark.parametrize("orient", ["columns", "index"])
@pytest.mark.parametrize("npartitions", [2, 5])
def test_from_dict(dtype, orient, npartitions):
    data = {"a": range(10), "b": range(10)}
    expected = pd.DataFrame.from_dict(data, dtype=dtype, orient=orient)
    result = dd.from_dict(data, npartitions=npartitions, dtype=dtype, orient=orient)
    if orient == "index":
        # DataFrame only has two rows with this orientation
        assert result.npartitions == 1
    else:
        assert result.npartitions == npartitions
    assert_eq(result, expected)
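
To make the orient="index" branch of the suggested test concrete, here is a small pandas-only sketch. It assumes the same data dictionary as the suggestion; the names by_columns and by_index are illustrative and not part of the PR.

```python
import pandas as pd

data = {"a": range(10), "b": range(10)}

# orient="columns" (the default): dict keys become column labels.
by_columns = pd.DataFrame.from_dict(data, orient="columns")

# orient="index": dict keys become row labels, so the frame has two rows.
by_index = pd.DataFrame.from_dict(data, orient="index")

print(by_columns.shape)
print(by_index.shape)
```

With orient="index" the dictionary keys end up on the row axis, so there are only two rows available to split across partitions.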

@@ -146,6 +146,40 @@ def from_array(x, chunksize=50000, columns=None, meta=None):
return new_dd_object(dsk, name, meta, divisions)


def from_dict(data, npartitions, orient="columns", dtype=None, columns=None):

Member

I see we're adding this as a top-level dd.from_dict method, but the corresponding pandas method is actually a classmethod -- so pd.DataFrame.from_dict instead of pd.from_dict. My default is usually to match the existing DataFrame API if one exists. Thoughts of making this dd.DataFrame.from_dict?
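
The API question raised here (a module-level function versus an alternative constructor on the class) can be illustrated with a plain-Python sketch. Frame and both from_dict definitions below are hypothetical stand-ins, not Dask's implementation; the sketch only shows how the two call spellings differ.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame-like class (not Dask's)."""

    def __init__(self, columns):
        self.columns = columns

    @classmethod
    def from_dict(cls, data):
        # Alternative constructor: called on the class, returns an instance.
        return cls({key: list(values) for key, values in data.items()})


def from_dict(data):
    """Hypothetical module-level spelling that delegates to the classmethod."""
    return Frame.from_dict(data)


# Classmethod spelling, analogous to pd.DataFrame.from_dict(...)
frame = Frame.from_dict({"a": [1, 2], "b": [3, 4]})

# Module-level spelling, analogous to a top-level from_dict(...)
same_frame = from_dict({"a": [1, 2], "b": [3, 4]})
```

Both spellings can coexist; the classmethod form is the one that matches how pandas exposes from_dict.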

@jrbourbeau (Member)

> The failing windows-py3.10 test seems unrelated to this PR

Yep, I agree this looks unrelated. Just opened up #9035

MrPowers and others added 5 commits May 5, 2022 13:55
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
github-actions bot removed the io label May 7, 2022
MrPowers (Contributor, Author) commented May 7, 2022

Updated from_dict to be a class method based on a suggestion from @jrbourbeau. dd.DataFrame.from_dict is more consistent with pd.DataFrame.from_dict.

@pavithraes - can you please take another look and check to make sure that I properly updated the docs to reflect that this is a class method 🙏

Do you think this method should have the @derived_from(pd.DataFrame) annotation? Thanks for all the help!!

@pavithraes (Member)

> @pavithraes - can you please take another look and check to make sure that I properly updated the docs to reflect that this is a class method 🙏

@MrPowers Thank you! The docs updates look good to me.

> Do you think this method should have the @derived_from(pd.DataFrame) annotation?

I don't think we need it here because you've added a nice and comprehensive docstring. :)

pavithraes requested a review from jrbourbeau May 9, 2022 11:36
jrbourbeau (Member) left a comment

Thanks for the updates @MrPowers! Didn't get a chance to look at them today. Will plan to review / merge tomorrow though

jrbourbeau (Member) left a comment

Thanks @MrPowers -- will merge after CI passes

jrbourbeau changed the title from "Add from_dict method to create Dask DataFrame from dictionary" to "Add DataFrame.from_dict classmethod" May 11, 2022
jrbourbeau merged commit dffc957 into dask:main May 11, 2022
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022
Labels: dataframe, documentation
Linked issue (may be closed by this pull request): Make it easier to manually create DataFrames
6 participants