-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DataFrame.from_dict
classmethod
#9017
Conversation
Can one of the admins verify this patch? |
add to allowlist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indeed looks like a nice convenience (and covers more area of the pandas API). I had a few comments.
@MrPowers Thanks for this PR, it looks great! I think we can also add this function to the docs here: |
@pavithraes - good catch about adding this method to the docs. I think I added it properly. Can you take a look and confirm? Thanks! |
docs/source/dataframe-create.rst
Outdated
@@ -33,6 +33,7 @@ File Formats: | |||
read_fwf | |||
from_bcolz | |||
from_array | |||
from_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, maybe from_dict
might be better suited under a different heading "Python Collections"?
This works as-is too because from_array
also doesn't perfectly fit under "file formats". I'll leave it to you. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavithraes - this is a good point. I'm actually not entirely sure how from_array
works, so not sure the best categorization. I'm guessing it's not referring to a Dask Array cause there is a separate from_dask_array
method. I guess I'd argue to keep this as-is, mainly cause I don't know enough to have an opinion.
@MrPowers Looking at the CI failure, I think there might be some merge conflicts -- would you mind merging |
@pavithraes - I rebased on top of the latest version of main, let's see if that does the trick ;) |
@jrbourbeau The failing windows-py3.10 test seems unrelated to this PR, but it's not a known flaky test. Could you please help confirm? Other than that, we might be good to merge :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MrPowers!
dask/dataframe/io/tests/test_io.py
Outdated
{"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]}, | ||
) | ||
expected = dd.from_pandas(pandas_df, npartitions=2) | ||
assert_eq(actual, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is comparing that dd.from_dict
and dd.from_pandas
are giving a similar result. This is true, but an implementation detail. The check that we want here is that dd.from_dict
and the corresponding pandas method (i.e. pd.DataFrame.from_dict
) give the same result.
dask/dataframe/io/tests/test_io.py
Outdated
@@ -241,6 +241,15 @@ def test_from_bcolz_column_order(): | |||
assert list(df.loc[0].compute().columns) == ["x", "y", "a"] | |||
|
|||
|
|||
def test_from_dict_dataframe(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not currently testing that the dtype
, orient
, etc. parameters are working as expected. Looking at the code, they are getting forwarded properly, but it would still be good to add test coverage for those cases
How about something like:
@pytest.mark.parametrize("dtype", [int, float])
@pytest.mark.parametrize("orient", ["columns", "index"])
@pytest.mark.parametrize("npartitions", [2, 5])
def test_from_dict(dtype, orient, npartitions):
data = {"a": range(10), "b": range(10)}
expected = pd.DataFrame.from_dict(data, dtype=dtype, orient=orient)
result = dd.from_dict(data, npartitions=npartitions, dtype=dtype, orient=orient)
if orient == "index":
# DataFrame only has two rows with this orientation
assert result.npartitions == 1
else:
assert result.npartitions == npartitions
assert_eq(result, expected)
dask/dataframe/io/io.py
Outdated
@@ -146,6 +146,40 @@ def from_array(x, chunksize=50000, columns=None, meta=None): | |||
return new_dd_object(dsk, name, meta, divisions) | |||
|
|||
|
|||
def from_dict(data, npartitions, orient="columns", dtype=None, columns=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we're adding this as a top-level dd.from_dict
method, but the corresponding pandas method is actually a classmethod -- so pd.DataFrame.from_dict
instead of pd.from_dict
. My default is usually to match the existing DataFrame API if one exists. Thoughts of making this dd.DataFrame.from_dict
?
Yep, I agree this looks unrelated. Just opened up #9035 |
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Updated @pavithraes - can you please take another look and check to make sure that I properly updated the docs to reflect that this is a class method 🙏 Do you think this method should have the |
@MrPowers Thank you! The docs updates look good to me.
I don't think we need it here because you've added a nice and comprehensive docstring. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @MrPowers! Didn't get a chance to look at them today. Will plan to review / merge tomorrow though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MrPowers -- will merge after CI passes
DataFrame.from_dict
classmethod
pre-commit run --all-files
I'm not sure this was added in the right files / properly, so feel free to provide feedback! Thank you!