Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added dict format in to_bag accessories of dataframe #7932

Merged
merged 4 commits into from Aug 12, 2021

Conversation

rajagurunath
Copy link
Contributor

return list(map(tuple, df.itertuples(index)))
elif format == "dict":
tuple_to_dict = lambda x: dict(x._asdict())
return list(map(tuple_to_dict, df.itertuples(index)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that there is a better way to do this.

Looking briefly at the Pandas API I find the df.to_dict(orient="records") option, which might be a better fit here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrocklin, Thanks for the comments,

Actually I tried using df.to_dict(orient='records') , but if user specfiying the index = True | False,
I changed the orient accordingly by using orient = "index" if index else "records"

so _df_to_bag function returns following output respectively,

case 1 index =False :
[{'Name': 'John', 'Age': 23, 'Fruit': 'orange'},
{'Name': 'Lara', 'Age': 21, 'Fruit': 'apple'},
{'Name': 'James', 'Age': 22, 'Fruit': 'apricot'},
{'Name': 'Pablo', 'Age': 21, 'Fruit': 'cherry'}]

case 2 index = True :
{0: {'Name': 'John', 'Age': 23, 'Fruit': 'orange'},
1: {'Name': 'Lara', 'Age': 21, 'Fruit': 'apple'},
2: {'Name': 'James', 'Age': 22, 'Fruit': 'apricot'},
3: {'Name': 'Pablo', 'Age': 21, 'Fruit': 'cherry'}}

But the output of ddf.to_bag is list of index of above function, please refer screenshot :

New Approach:

image

Current approach:

image

So Please let me know, how to achieve the desired output with the newer approach (df.to_dict),
Anything I missed here🤔 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajagurunath df.to_dict(orient="records") should be sufficient for the default case when index=False. For the index=True case something like this list comprehension should incorporate the index information while still only relying on pandas' public API

In [45]: df
Out[45]:
      name  age
foo   john    7
bar   mary   77
baz  sally   45

In [46]: [{**{"index": idx}, **values} for values, idx in zip(df.to_dict("records"), df.index)]
Out[46]:
[{'index': 'foo', 'name': 'john', 'age': 7},
 {'index': 'bar', 'name': 'mary', 'age': 77},
 {'index': 'baz', 'name': 'sally', 'age': 45}]

This works, but is somewhat verbose. There may be a better way to do this with pandas' API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed code & comment explanation, I tried searching for some time not able to find a less verbose solution for the above use case, So implemented the above-suggested version

@@ -509,9 +509,14 @@ def test_to_bag():
index=pd.Index([1.0, 2.0, 3.0, 4.0], name="ind"),
)
ddf = dd.from_pandas(a, 2)
tuple_to_dict = lambda x: dict(x._asdict())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather that we test against the final output form, rather than a copy of our code. So I would expect this to look like ...

expected = [
    {"...": ..., ...},
    {"...": ..., ...},
    {"...": ..., ...},
]

assert ddf.to_bag.compute() == expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, got it :)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajagurunath!

Comment on lines 4444 to 4446
format : tuple or dict,optional default:tuple
returns bag of tuple or dict, based on this format
parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating the docstring here. Dask uses numpydoc for docstring formatting (xref https://numpydoc.readthedocs.io/en/latest/format.html). For this particular case, I think it should look something like:

Suggested change
format : tuple or dict,optional default:tuple
returns bag of tuple or dict, based on this format
parameter.
format : {"tuple", "dict"}
Whether to return a bag of tuples or dictionaries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, for letting me know about numpydoc, this is new to me !! and implemented accordingly

Comment on lines 524 to 526
format : tuple or dict,optional default:tuple
returns bag of tuple or dict, based on this format
parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about the docstring

return list(map(tuple, df.itertuples(index)))
elif format == "dict":
tuple_to_dict = lambda x: dict(x._asdict())
return list(map(tuple_to_dict, df.itertuples(index)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajagurunath df.to_dict(orient="records") should be sufficient for the default case when index=False. For the index=True case something like this list comprehension should incorporate the index information while still only relying on pandas' public API

In [45]: df
Out[45]:
      name  age
foo   john    7
bar   mary   77
baz  sally   45

In [46]: [{**{"index": idx}, **values} for values, idx in zip(df.to_dict("records"), df.index)]
Out[46]:
[{'index': 'foo', 'name': 'john', 'age': 7},
 {'index': 'bar', 'name': 'mary', 'age': 77},
 {'index': 'baz', 'name': 'sally', 'age': 45}]

This works, but is somewhat verbose. There may be a better way to do this with pandas' API

if format == "tuple":
return list(df.iteritems()) if index else list(df)
elif format == "dict":
return df.to_dict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df.to_dict() will return a dictionary, instead of a list of dictionaries, when df is a pandas Series. We'll want to include some additional formatting to make sure we return a list of dictionaries when format == "dict"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, Thanks for pointing out this, I tried the following code for the conversion, let me know if this approach is okay to be proceed

df.to_frame().to_dict(orient="records")

@rajagurunath
Copy link
Contributor Author

Hi team,

Seems like, there was an issue while building Docs with below error:
image

Can some one help me with fixing this issue .

Thanks in Advance

@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@ncclementi
Copy link
Member

@rajagurunath I can't build the docs locally either from your PR, but I can pinpoint where the error is coming from.

Since James is on PTO until Tuesday, @jsignell would you mind shining some light on what could be the problem with the docs here?

@rajagurunath
Copy link
Contributor Author

Hi Team,

After merging with the dask-main branch, docs are built without any errors. But getting the following error in macOS -python3.7 integration test fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out Any suggestions to fix this?

Reference:

Installing pip dependencies: ...working... ::warning::Pip subprocess error:%0A  Running command git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd%0A  fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out%0AWARNING: Discarding git+https://github.com/dask/distributed. Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.%0AERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.%0A%0A
 Pip subprocess error:
   Running command git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd
   **fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out**
 WARNING: Discarding git+https://github.com/dask/distributed. Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.
 ERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.

@ncclementi
Copy link
Member

Hi Team,

After merging with the dask-main branch, docs are built without any errors. But getting the following error in macOS -python3.7 integration test fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out Any suggestions to fix this?

Reference:

Installing pip dependencies: ...working... ::warning::Pip subprocess error:%0A  Running command git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd%0A  fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out%0AWARNING: Discarding git+https://github.com/dask/distributed. Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.%0AERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.%0A%0A
 Pip subprocess error:
   Running command git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd
   **fatal: unable to access 'https://github.com/dask/distributed/': Failed to connect to github.com port 443: Operation timed out**
 WARNING: Discarding git+https://github.com/dask/distributed. Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.
 ERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pip-req-build-00muukbd Check the logs for full command output.

@rajagurunath Thanks for the update and thanks for your work. It seems this error is unrelated to your changes, let's wait for one of the maintainers to take a look at it.

@jrbourbeau
Copy link
Member

I think that's just a sporadic connection failure. I've restarted CI.

@jsignell
Copy link
Member

Sorry I missed the ping about the docs build. This looks good to me! And the docstring is rendering properly: docs build. Merging now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants