Better support for Maps in Pandas #34729

Closed
mikelui opened this issue Mar 26, 2023 · 0 comments · Fixed by #34730

Comments

Contributor

mikelui commented Mar 26, 2023

Describe the enhancement requested

Today (Py)Arrow -> Pandas treats:

  1. structs as Python dicts (pydicts)
  2. maps as Python lists of tuples (i.e. [(key1, value1), (key2, value2), ...])

Treating maps as a list of tuples has several pros: it preserves ordering, allows duplicate keys, and is fast to create and iterate. However, many times users simply want a ... map! (i.e. a pydict).

Having to convert every element manually via dict(map_elem) is cumbersome, inefficient, and downright nasty when working with arbitrarily nested maps in Pandas.
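To illustrate how awkward the manual route gets, here is a minimal pure-Python sketch of converting nested map values by hand. The helper name `maps_to_dicts` and the sample row are hypothetical, not part of PyArrow. The sketch assumes map entries arrive as lists of 2-tuples and structs as dicts, and it has to guess which lists are maps, which is exactly the kind of fragile heuristic a built-in option would avoid.

```python
from typing import Any


def maps_to_dicts(value: Any) -> Any:
    """Recursively turn lists of (key, value) tuples into dicts.

    Heuristic: a non-empty list whose items are all 2-tuples is treated
    as a map. An empty list is left alone because it cannot be told
    apart from an empty ordinary list without type information.
    """
    if isinstance(value, list):
        is_map = bool(value) and all(
            isinstance(item, tuple) and len(item) == 2 for item in value
        )
        if is_map:
            # Map entries: convert values recursively, keys stay as-is.
            return {key: maps_to_dicts(val) for key, val in value}
        # Ordinary list (or list of maps): convert each element.
        return [maps_to_dicts(item) for item in value]
    if isinstance(value, dict):
        # Struct already converted to a dict: descend into its fields.
        return {name: maps_to_dicts(field) for name, field in value.items()}
    return value


# Hypothetical row: a struct holding a map whose values are maps.
row = {"id": 1, "tags": [("a", [("x", 1), ("y", 2)]), ("b", [])]}
converted = maps_to_dicts(row)

# A list of maps, as it would appear in a single cell.
list_of_maps = maps_to_dicts([[("k", 1)], [("k", 2)]])
```

Note that the empty map under key "b" stays an empty list, since nothing in the data says it was a map.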

Today, PyArrow already supports (pydicts -> arrow maps) when a schema is provided. So, it's a known use case.

I propose a simple bool switch in PandasOptions for table.to_pandas(...) to generate pydicts for maps. This creates a symmetrical behavior for the (pydict -> arrow maps) flow, as well.


As alluded to above, the cons are that:

  1. Users can lose the original ordering of entries.
  2. Duplicates will be removed, resulting in potential data loss. This should be made clear to the user.
  3. Potential ambiguity when examining data that has both maps and structs.
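The duplicate-key concern can be shown with plain Python, independent of Arrow. This is only a sketch: `to_dict_lossy` and `to_dict_strict` are hypothetical names used for illustration, and the strict variant is one possible way to make the data loss visible to the user rather than silent.

```python
from typing import Any, Hashable, Iterable, Tuple

Pairs = Iterable[Tuple[Hashable, Any]]


def to_dict_lossy(pairs: Pairs) -> dict:
    """Build a dict from key/value pairs; later duplicates silently win."""
    return dict(pairs)


def to_dict_strict(pairs: Pairs) -> dict:
    """Build a dict from key/value pairs; raise if a key repeats."""
    result: dict = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value
    return result


# A map element with a duplicated key, in list-of-tuples form.
pairs = [("a", 1), ("b", 2), ("a", 3)]

lossy = to_dict_lossy(pairs)  # the entry ("a", 1) is dropped

try:
    to_dict_strict(pairs)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

With the lossy conversion the first value for "a" disappears without any signal, while the strict conversion refuses the input.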

I think the upsides of ergonomic flexibility outweigh these cons.


Related, I think there's a bug that precludes (pydicts -> arrow maps) when the type is nested (e.g. list of maps). That should be fixed as well to provide a more featureful map experience.

Component(s)

C++, Python

wjones127 pushed a commit that referenced this issue Apr 21, 2023
…34730)

### Rationale for this change

Explained in issue #34729 

### What changes are included in this PR?

- Add support for list of maps when converting Arrow to Pandas. There doesn't seem to be a strong reason to omit this. Previously it was a hard error as unsupported, due to a bool check.
- Refactor Arrow Map -> Pandas to support two paths: (1) list of tuples, or (2) pydicts
- Add another option in PandasOptions to enable (2), above
- Bugfix in nested pydicts -> Arrow maps. 
- Unit tests

### Are these changes tested?

Unit tests are added in `test_pandas.py`

### Are there any user-facing changes?

- An additional option flag in PandasOptions
- Enable list of maps to Pandas, which was previously disabled
* Closes: #34729

Authored-by: Mike Lui <mikelui@meta.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
wjones127 added this to the 13.0.0 milestone Apr 21, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023