New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added dict format in to_bag accessories
of dataframe
#7932
Conversation
dask/dataframe/io/io.py
Outdated
return list(map(tuple, df.itertuples(index))) | ||
elif format == "dict": | ||
tuple_to_dict = lambda x: dict(x._asdict()) | ||
return list(map(tuple_to_dict, df.itertuples(index))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that there is a better way to do this.
Looking briefly at the Pandas API I find the df.to_dict(orient="records")
option, which might be a better fit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mrocklin, Thanks for the comments,
Actually I tried using df.to_dict(orient='records') , but if user specfiying the index = True | False,
I changed the orient accordingly by using orient = "index" if index else "records"
so _df_to_bag
function returns following output respectively,
case 1 index =False :
[{'Name': 'John', 'Age': 23, 'Fruit': 'orange'},
{'Name': 'Lara', 'Age': 21, 'Fruit': 'apple'},
{'Name': 'James', 'Age': 22, 'Fruit': 'apricot'},
{'Name': 'Pablo', 'Age': 21, 'Fruit': 'cherry'}]
case 2 index = True :
{0: {'Name': 'John', 'Age': 23, 'Fruit': 'orange'},
1: {'Name': 'Lara', 'Age': 21, 'Fruit': 'apple'},
2: {'Name': 'James', 'Age': 22, 'Fruit': 'apricot'},
3: {'Name': 'Pablo', 'Age': 21, 'Fruit': 'cherry'}}
But the output of ddf.to_bag is list of index of above function, please refer screenshot :
New Approach:
Current approach:
So Please let me know, how to achieve the desired output with the newer approach (df.to_dict),
Anything I missed here🤔 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rajagurunath df.to_dict(orient="records")
should be sufficient for the default case when index=False
. For the index=True
case something like this list comprehension should incorporate the index information while still only relying on pandas' public API
In [45]: df
Out[45]:
name age
foo john 7
bar mary 77
baz sally 45
In [46]: [{**{"index": idx}, **values} for values, idx in zip(df.to_dict("records"), df.index)]
Out[46]:
[{'index': 'foo', 'name': 'john', 'age': 7},
{'index': 'bar', 'name': 'mary', 'age': 77},
{'index': 'baz', 'name': 'sally', 'age': 45}]
This works, but is somewhat verbose. There may be a better way to do this with pandas' API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed code & comment explanation, I tried searching for some time not able to find a less verbose solution for the above use case, So implemented the above-suggested version
dask/dataframe/io/tests/test_io.py
Outdated
@@ -509,9 +509,14 @@ def test_to_bag(): | |||
index=pd.Index([1.0, 2.0, 3.0, 4.0], name="ind"), | |||
) | |||
ddf = dd.from_pandas(a, 2) | |||
tuple_to_dict = lambda x: dict(x._asdict()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather that we test against the final output form, rather than a copy of our code. So I would expect this to look like ...
expected = [
{"...": ..., ...},
{"...": ..., ...},
{"...": ..., ...},
]
assert ddf.to_bag.compute() == expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, got it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rajagurunath!
dask/dataframe/core.py
Outdated
format : tuple or dict,optional default:tuple | ||
returns bag of tuple or dict, based on this format | ||
parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating the docstring here. Dask uses numpydoc
for docstring formatting (xref https://numpydoc.readthedocs.io/en/latest/format.html). For this particular case, I think it should look something like:
format : tuple or dict,optional default:tuple | |
returns bag of tuple or dict, based on this format | |
parameter. | |
format : {"tuple", "dict"} | |
Whether to return a bag of tuples or dictionaries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, for letting me know about numpydoc
, this is new to me !! and implemented accordingly
dask/dataframe/io/io.py
Outdated
format : tuple or dict,optional default:tuple | ||
returns bag of tuple or dict, based on this format | ||
parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about the docstring
dask/dataframe/io/io.py
Outdated
return list(map(tuple, df.itertuples(index))) | ||
elif format == "dict": | ||
tuple_to_dict = lambda x: dict(x._asdict()) | ||
return list(map(tuple_to_dict, df.itertuples(index))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rajagurunath df.to_dict(orient="records")
should be sufficient for the default case when index=False
. For the index=True
case something like this list comprehension should incorporate the index information while still only relying on pandas' public API
In [45]: df
Out[45]:
name age
foo john 7
bar mary 77
baz sally 45
In [46]: [{**{"index": idx}, **values} for values, idx in zip(df.to_dict("records"), df.index)]
Out[46]:
[{'index': 'foo', 'name': 'john', 'age': 7},
{'index': 'bar', 'name': 'mary', 'age': 77},
{'index': 'baz', 'name': 'sally', 'age': 45}]
This works, but is somewhat verbose. There may be a better way to do this with pandas' API
dask/dataframe/io/io.py
Outdated
if format == "tuple": | ||
return list(df.iteritems()) if index else list(df) | ||
elif format == "dict": | ||
return df.to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df.to_dict()
will return a dictionary, instead of a list of dictionaries, when df
is a pandas Series
. We'll want to include some additional formatting to make sure we return a list of dictionaries when format == "dict"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, Thanks for pointing out this, I tried the following code for the conversion, let me know if this approach is okay to be proceed
df.to_frame().to_dict(orient="records")
Can one of the admins verify this patch? |
@rajagurunath I can't build the docs locally either from your PR, but I can pinpoint where the error is coming from. Since James is on PTO until Tuesday, @jsignell would you mind shining some light on what could be the problem with the docs here? |
Hi Team, After merging with the dask-main branch, docs are built without any errors. But getting the following error in macOS -python3.7 integration test fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out Any suggestions to fix this? Reference:
|
@rajagurunath Thanks for the update and thanks for your work. It seems this error is unrelated to your changes, let's wait for one of the maintainers to take a look at it. |
I think that's just a sporadic connection failure. I've restarted CI. |
Sorry I missed the ping about the docs build. This looks good to me! And the docstring is rendering properly: docs build. Merging now! |
dask.dataframe.DataFrame.to_bag
output #7920black dask
/flake8 dask
/isort dask