ARROW-9096: [Python] Pandas roundtrip with dtype="object" underlying numeric column index #7822

Closed
wants to merge 18 commits into from

arw2019
Contributor

@arw2019 arw2019 commented Jul 23, 2020

This PR fixes the roundtrip conversion for Pandas DataFrames whose column index is numeric but has dtype=object, such as

import pandas as pd
from datetime import datetime
df = pd.DataFrame([1], columns=pd.Index([1], dtype=object))  # underlying int
df = pd.DataFrame([1], columns=pd.Index([1.1], dtype=object))  # underlying float
df = pd.DataFrame([1], columns=pd.Index([datetime(2018, 1, 1)], dtype=object))  # underlying datetime

https://issues.apache.org/jira/browse/ARROW-3651 largely solved the datetime variant of this problem: the conversion ran correctly, except that the column index dtype after the roundtrip did not match the original. With the current fix, a roundtrip of the problematic DataFrames from ARROW-3651 returns the exact original frame.

@arw2019 arw2019 changed the title ARROW-9096: [Python] Pandas roundtrip with object-dtype column labels with integer(floating) values: data type "integer"("floating") not understood ARROW-9096: [Python] Pandas roundtrip with object-dtype numeric column labels Jul 23, 2020
@arw2019 arw2019 changed the title ARROW-9096: [Python] Pandas roundtrip with object-dtype numeric column labels ARROW-9096: [Python] Pandas roundtrip with dtype=" numeric column labels Jul 23, 2020
@arw2019 arw2019 changed the title ARROW-9096: [Python] Pandas roundtrip with dtype=" numeric column labels ARROW-9096: [Python] Pandas roundtrip with dtype="object" underlying numeric column index Jul 23, 2020

@@ -1009,12 +1009,15 @@ def _is_generated_index_name(name):
    return re.match(pattern, name) is not None


# ARROW-9096: added integer and floating
Contributor

nit: we generally only have comments like this for TODOs. git blame/git log can track when things were added.

@@ -1080,8 +1083,10 @@ def _reconstruct_columns_from_metadata(columns, column_indexes):
]

# Convert each level to the dtype provided in the metadata
# ARROW-9096: need numpy_type to match cast against original DataFrame
Contributor

nit: drop JIRA reference. maybe also the whole comment since this is used directly below?

Contributor Author

makes sense - it's easy to see what's going on w/o this line so dropped it as suggested
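
For readers following this thread, here is a minimal sketch of the two-step cast the diff comments refer to: column labels come back as strings, get converted to the logical type recorded in the metadata, and are then cast again to the original NumPy dtype. This is a simplified illustration, not pyarrow's actual code; `LOGICAL_TO_NUMPY` and `restore_level` are hypothetical names invented for the example.

```python
import pandas as pd

# Hypothetical mapping from a logical type name to a NumPy dtype string.
LOGICAL_TO_NUMPY = {"integer": "int64", "floating": "float64"}


def restore_level(level, pandas_type, numpy_type):
    """Cast string labels to their logical type, then to the original dtype."""
    logical_dtype = LOGICAL_TO_NUMPY.get(pandas_type, pandas_type)
    level = level.astype(logical_dtype)  # e.g. "1" -> 1
    if str(level.dtype) != numpy_type:
        # The original index used a wider dtype (here: object), so keep it.
        level = level.astype(numpy_type)
    return level


labels = pd.Index(["1"], dtype=object)  # labels as stored, i.e. strings
restored = restore_level(labels, "integer", "object")
print(restored.dtype, restored[0])
```

Without the second cast the restored index would have an integer dtype, so the labels would match but the frame would not compare equal to the original.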

@emkornfield
Contributor

CC @jorisvandenbossche

Member

@wesm wesm left a comment

+1

@wesm wesm closed this in 50d6252 Aug 3, 2020
@arw2019
Contributor Author

arw2019 commented Aug 6, 2020

thanks @wesm @emkornfield for reviewing!!

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…numeric column index

- [x] closes ARROW-9096
- [x] tests added & passed

Closes apache#7822 from arw2019/ARROW-9096

Authored-by: arw2019 <andrew.r.wieteska@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>