Added dict format in `to_bag accessories` of dataframe #7932

rajagurunath · 2021-07-25T02:38:16Z

Closes #7920 (Enable dictionary format on dask.dataframe.DataFrame.to_bag output)
Tests added / passed
Passes black dask / flake8 dask / isort dask

mrocklin · 2021-07-25T15:22:43Z

dask/dataframe/io/io.py

+            return list(map(tuple, df.itertuples(index)))
+        elif format == "dict":
+            tuple_to_dict = lambda x: dict(x._asdict())
+            return list(map(tuple_to_dict, df.itertuples(index)))


I suspect that there is a better way to do this.

Looking briefly at the Pandas API I find the df.to_dict(orient="records") option, which might be a better fit here.
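A small sketch of how the two approaches compare for the index=False case, using a hypothetical sample frame (the column names and values are made up for illustration). One practical difference, per the pandas documentation for itertuples: column names that are not valid Python identifiers get replaced with positional names, whereas to_dict keeps the original labels.

```python
import pandas as pd

# Hypothetical sample frame; the column name containing a space is deliberate.
df = pd.DataFrame({"first name": ["John", "Lara"], "Age": [23, 21]})

# Approach from the diff: namedtuples converted with _asdict().
via_itertuples = [dict(row._asdict()) for row in df.itertuples(index=False)]

# Approach suggested in the review: pandas' to_dict with orient="records".
via_to_dict = df.to_dict(orient="records")

print(via_itertuples)  # the "first name" key is renamed by itertuples
print(via_to_dict)     # the original column labels are preserved
```

For frames whose column names are all valid identifiers, the two lists should compare equal, so the choice is mostly about readability and label handling.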

Hi @mrocklin, thanks for the comments.

I did try df.to_dict(orient='records'), but since the user can specify index=True or index=False, I changed the orient accordingly, using orient = "index" if index else "records".

With that change, the _df_to_bag function returns the following output in each case:

Case 1, index=False:
[{'Name': 'John', 'Age': 23, 'Fruit': 'orange'},
{'Name': 'Lara', 'Age': 21, 'Fruit': 'apple'},
{'Name': 'James', 'Age': 22, 'Fruit': 'apricot'},
{'Name': 'Pablo', 'Age': 21, 'Fruit': 'cherry'}]

Case 2, index=True:
{0: {'Name': 'John', 'Age': 23, 'Fruit': 'orange'},
1: {'Name': 'Lara', 'Age': 21, 'Fruit': 'apple'},
2: {'Name': 'James', 'Age': 22, 'Fruit': 'apricot'},
3: {'Name': 'Pablo', 'Age': 21, 'Fruit': 'cherry'}}

But the output of ddf.to_bag is then a list of the index values from the output above; please refer to the screenshots of the new approach and the current approach.

So please let me know how to achieve the desired output with the newer approach (df.to_dict). Is there anything I missed here 🤔?
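A likely explanation for the behavior described above, sketched with plain pandas and a hypothetical two-row frame: orient="index" returns a dict rather than a list, and iterating over a dict yields only its keys. This assumes the bag simply iterates over whatever each partition function returns.

```python
import pandas as pd

# Hypothetical sample data modeled on the output shown in the comment.
df = pd.DataFrame(
    {"Name": ["John", "Lara"], "Age": [23, 21], "Fruit": ["orange", "apple"]}
)

by_records = df.to_dict(orient="records")  # list of row dicts
by_index = df.to_dict(orient="index")      # dict keyed by index label

# Iterating a list yields its elements; iterating a dict yields only its keys.
items_from_records = list(by_records)
items_from_index = list(by_index)

print(items_from_records)
print(items_from_index)  # only the index labels survive
```

So for index=True the partition function still needs to return a list of row dicts, with the index folded into each dict.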

@rajagurunath df.to_dict(orient="records") should be sufficient for the default case when index=False. For the index=True case something like this list comprehension should incorporate the index information while still only relying on pandas' public API

In [45]: df
Out[45]:
      name  age
foo   john    7
bar   mary   77
baz  sally   45

In [46]: [{**{"index": idx}, **values} for values, idx in zip(df.to_dict("records"), df.index)]
Out[46]:
[{'index': 'foo', 'name': 'john', 'age': 7},
 {'index': 'bar', 'name': 'mary', 'age': 77},
 {'index': 'baz', 'name': 'sally', 'age': 45}]

This works, but is somewhat verbose. There may be a better way to do this with pandas' API

Thanks for the detailed code and explanation. I searched for some time but could not find a less verbose solution for this use case, so I implemented the suggested version.
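For reference, the comprehension above can be wrapped in a small helper and compared against one possibly shorter variant. Both the helper name `records_with_index` and the `reset_index` variant are assumptions for illustration, not something proposed in this thread.

```python
import pandas as pd

# Same sample data as in the IPython session above.
df = pd.DataFrame(
    {"name": ["john", "mary", "sally"], "age": [7, 77, 45]},
    index=["foo", "bar", "baz"],
)


def records_with_index(frame):
    """Hypothetical helper wrapping the comprehension from the review comment."""
    return [
        {"index": idx, **values}
        for values, idx in zip(frame.to_dict(orient="records"), frame.index)
    ]


# Possible shorter variant (an assumption): move the index into a column first.
via_reset_index = df.reset_index().to_dict(orient="records")

print(records_with_index(df))
print(via_reset_index)
```

The two agree here because the index is unnamed, so reset_index labels the new column "index". With a named index, reset_index would use that name as the key while the comprehension always uses "index", so the variants are not interchangeable in general.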

mrocklin · 2021-07-25T15:23:43Z

dask/dataframe/io/tests/test_io.py

@@ -509,9 +509,14 @@ def test_to_bag():
        index=pd.Index([1.0, 2.0, 3.0, 4.0], name="ind"),
    )
    ddf = dd.from_pandas(a, 2)
+    tuple_to_dict = lambda x: dict(x._asdict())


I would rather that we test against the final output form, rather than a copy of our code. So I would expect this to look like ...

expected = [
    {"...": ..., ...},
    {"...": ..., ...},
    {"...": ..., ...},
]
assert ddf.to_bag(format="dict").compute() == expected

Makes sense, got it :)
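A minimal sketch of the testing style requested here, using pandas only. The column names and values are assumed for illustration (only the index appears in the diff), and `to_records` is a hypothetical stand-in for the code under test.

```python
import pandas as pd

# Columns are assumed for illustration; the index matches the diff above.
a = pd.DataFrame(
    {"x": ["a", "b", "c", "d"], "y": [2, 3, 4, 5]},
    index=pd.Index([1.0, 2.0, 3.0, 4.0], name="ind"),
)


def to_records(frame):
    """Hypothetical stand-in for the code under test."""
    return frame.to_dict(orient="records")


# The expectation is a literal, so a bug in to_records cannot hide inside it.
expected = [
    {"x": "a", "y": 2},
    {"x": "b", "y": 3},
    {"x": "c", "y": 4},
    {"x": "d", "y": 5},
]
assert to_records(a) == expected
```

Had the expectation been computed with the same expression as the implementation, the assertion would pass even if that expression were wrong.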

jrbourbeau

Thanks @rajagurunath!

jrbourbeau · 2021-07-26T18:37:55Z

dask/dataframe/core.py

+        format : tuple or dict,optional  default:tuple
+            returns bag of tuple or dict, based on this format
+            parameter.


Thank you for updating the docstring here. Dask uses numpydoc for docstring formatting (xref https://numpydoc.readthedocs.io/en/latest/format.html). For this particular case, I think it should look something like:

Suggested change

-        format : tuple or dict,optional  default:tuple
-            returns bag of tuple or dict, based on this format
-            parameter.
+        format : {"tuple", "dict"}
+            Whether to return a bag of tuples or dictionaries.

Thanks for letting me know about numpydoc, this is new to me! I have updated the docstrings accordingly.
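For context, a sketch of how the suggested parameter entry sits inside a complete numpydoc docstring. The function `to_bag_stub` is a hypothetical stand-in written only to show the layout; it is not the dask implementation, and the `index` description is an assumption.

```python
def to_bag_stub(index=False, format="tuple"):
    """Hypothetical stand-in used only to illustrate numpydoc layout.

    Parameters
    ----------
    index : bool, optional
        If True, include the index in each element.
    format : {"tuple", "dict"}, optional
        Whether to return a bag of tuples or dictionaries.

    Returns
    -------
    str
        The validated ``format`` value (the real method returns a bag).
    """
    # Reject anything outside the documented set of choices.
    if format not in ("tuple", "dict"):
        raise ValueError(f"format must be 'tuple' or 'dict', got {format!r}")
    return format


print(to_bag_stub(format="dict"))
```

In numpydoc, a fixed set of allowed values is written in braces, conventionally with the default listed first.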

jrbourbeau · 2021-07-26T18:38:11Z

dask/dataframe/io/io.py

+    format : tuple or dict,optional  default:tuple
+            returns bag of tuple or dict, based on this format
+            parameter.


Same comment here about the docstring

jrbourbeau · 2021-07-26T18:45:40Z

dask/dataframe/io/io.py

+            return list(map(tuple, df.itertuples(index)))
+        elif format == "dict":
+            tuple_to_dict = lambda x: dict(x._asdict())
+            return list(map(tuple_to_dict, df.itertuples(index)))


@rajagurunath df.to_dict(orient="records") should be sufficient for the default case when index=False. For the index=True case something like this list comprehension should incorporate the index information while still only relying on pandas' public API

In [45]: df
Out[45]:
      name  age
foo   john    7
bar   mary   77
baz  sally   45

In [46]: [{**{"index": idx}, **values} for values, idx in zip(df.to_dict("records"), df.index)]
Out[46]:
[{'index': 'foo', 'name': 'john', 'age': 7},
 {'index': 'bar', 'name': 'mary', 'age': 77},
 {'index': 'baz', 'name': 'sally', 'age': 45}]

This works, but is somewhat verbose. There may be a better way to do this with pandas' API

jrbourbeau · 2021-07-26T18:50:29Z

dask/dataframe/io/io.py

+        if format == "tuple":
+            return list(df.iteritems()) if index else list(df)
+        elif format == "dict":
+            return df.to_dict()


df.to_dict() will return a dictionary, instead of a list of dictionaries, when df is a pandas Series. We'll want to include some additional formatting to make sure we return a list of dictionaries when format == "dict"

Got it, thanks for pointing this out. I tried the following code for the conversion; let me know if this approach is okay to proceed with:

df.to_frame().to_dict(orient="records")
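A quick sketch of the difference being discussed, using a hypothetical named Series: Series.to_dict() yields one dict keyed by index label, while converting to a one-column frame first yields a list of per-row dicts.

```python
import pandas as pd

# Hypothetical sample Series; the name becomes the column label in to_frame().
s = pd.Series([23, 21, 22], name="Age")

as_dict = s.to_dict()                                # {0: 23, 1: 21, 2: 22}
as_records = s.to_frame().to_dict(orient="records")  # [{'Age': 23}, {'Age': 21}, {'Age': 22}]

print(as_dict)
print(as_records)
```

Note that for an unnamed Series, to_frame() labels the column 0, so each record would be keyed by 0 rather than by a descriptive name.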

rajagurunath · 2021-08-06T07:23:30Z

Hi team,

It seems there was an issue while building the docs, with the error below:

Can someone help me fix this issue?

Thanks in advance.

GPUtester · 2021-08-06T07:23:31Z

Can one of the admins verify this patch?

ncclementi · 2021-08-06T22:47:10Z

@rajagurunath I can't build the docs locally either from your PR, but I can pinpoint where the error is coming from.

Since James is on PTO until Tuesday, @jsignell would you mind shining some light on what could be the problem with the docs here?

rajagurunath · 2021-08-09T17:34:50Z

Hi Team,

After merging with the dask main branch, the docs build without any errors. But I am getting the following error in the macOS Python 3.7 integration test:

fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out

Any suggestions to fix this?

Reference:

 Pip subprocess error:
   Running command git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd
   **fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out**
 WARNING: Discarding git+https://github.com/dask/distributed. Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.
 ERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.

ncclementi · 2021-08-09T17:46:40Z

@rajagurunath Thanks for the update and thanks for your work. It seems this error is unrelated to your changes, let's wait for one of the maintainers to take a look at it.

jrbourbeau · 2021-08-11T15:57:51Z

I think that's just a sporadic connection failure. I've restarted CI.

jsignell · 2021-08-12T15:34:09Z

Sorry I missed the ping about the docs build. This looks good to me! And the docstring is rendering properly: docs build. Merging now!

Added dict format in to_bag func

ce6812f

github-actions bot added dataframe io labels Jul 25, 2021

rajagurunath mentioned this pull request Jul 25, 2021

Enable dictionary format on dask.dataframe.DataFrame.to_bag output #7920

Closed

mrocklin reviewed Jul 25, 2021

Corrected the test cases

4c04a17

jrbourbeau reviewed Jul 26, 2021

Refactoring the to_dict implementation

66e15db

Merge branch 'dask:main' into main

c46a42c

jsignell merged commit 3f1bece into dask:main Aug 12, 2021

ncclementi mentioned this pull request Sep 20, 2021

Modify demo to use df to bag as dict in to_mongo part coiled/dask-mongo#15

Closed
