New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1993: [Python] Add function for determining implied Arrow schema from pandas.DataFrame #1929

Closed
wants to merge 11 commits into
base: master
from

Conversation

Projects
None yet
6 participants
@keechongtan

keechongtan commented Apr 22, 2018

No description provided.

@xhochy

This comment has been minimized.

Member

xhochy commented Apr 24, 2018

cc @fjetter and @crepererum: This might be interesting for you. This is a first iteration of extracting Arrow schemas from Pandas DataFrames without doing the full conversion. It still does the conversion on object columns but I would add a follow-up PR that exposes some of the type guessing functions from C++ here so that we can call them instead of the full conversion.

@wesm

This comment has been minimized.

Member

wesm commented Jun 12, 2018

@keechongtan could you rebase this? I can try to take a closer look soon

@keechongtan keechongtan force-pushed the keechongtan:ARROW-1993 branch from 6e026ef to e993303 Jun 15, 2018

@keechongtan

This comment has been minimized.

keechongtan commented Jun 15, 2018

@wesm rebased.

@pitrou

This comment has been minimized.

Contributor

pitrou commented Jun 29, 2018

Hi @keechongtan, thanks for posting the PR! I haven't taken a detailed look at your code yet, but there seem to be some problems (perhaps due to rebasing?):

  • the flake8 step fails on Travis-CI
  • the _ndarray_to_array and from_pandas functions seem duplicated
  • Cython compilation fails (both on AppVeyor and Travis-CI)

@kszucs kszucs force-pushed the keechongtan:ARROW-1993 branch from 9836130 to 3920645 Nov 18, 2018

@kszucs

This comment has been minimized.

Member

kszucs commented Nov 19, 2018

@pitrou @wesm I made this PR up to date.

I intend to refactor the pandas_compat module afterwards, because it's pretty hard to follow what happens there. Luckily its coverage is good https://codecov.io/gh/apache/arrow/src/master/python/pyarrow/pandas_compat.py

@wesm

This comment has been minimized.

Member

wesm commented Nov 19, 2018

I will have a look

Kee Chong Tan and others added some commits Apr 22, 2018

@kszucs kszucs force-pushed the keechongtan:ARROW-1993 branch from dfa16dd to d819834 Nov 25, 2018

@kszucs kszucs requested a review from wesm Nov 25, 2018

@codecov-io

This comment has been minimized.

codecov-io commented Nov 25, 2018

Codecov Report

Merging #1929 into master will increase coverage by 1.04%.
The diff coverage is 88.09%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1929      +/-   ##
==========================================
+ Coverage   86.99%   88.03%   +1.04%     
==========================================
  Files         494      425      -69     
  Lines       70410    64845    -5565     
==========================================
- Hits        61253    57089    -4164     
+ Misses       9061     7756    -1305     
+ Partials       96        0      -96
Impacted Files Coverage Δ
python/pyarrow/tests/test_types.py 100% <100%> (ø) ⬆️
python/pyarrow/pandas_compat.py 97.98% <100%> (+0.1%) ⬆️
python/pyarrow/array.pxi 69.01% <57.14%> (-0.11%) ⬇️
python/pyarrow/types.pxi 60.09% <77.77%> (+0.25%) ⬆️
rust/src/record_batch.rs
go/arrow/array/table.go
rust/src/array.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/array/bufferbuilder.go
... and 65 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7281731...3544e42. Read the comment docs.

@wesm

Looks good. Left some comments, mostly around documentation

type_ = pa.lib._ndarray_to_arrow_type(values, type_)
if type_ is None:
type_ = pa.array(c, from_pandas=True).type
types.append(type_)

This comment has been minimized.

@wesm

wesm Nov 25, 2018

Member

Well, this is certainly not cheap to compute in all cases. I think it's fine to focus on getting the API and behavior correct, then focusing later on performance

This comment has been minimized.

@kszucs

kszucs Nov 26, 2018

Member

Yeah, my intention is to finalize this PR and improve it later.

Show resolved Hide resolved python/pyarrow/types.pxi Outdated
>>> pa.Schema.from_pandas(df)
int: int64
str: string
__index_level_0__: int64

This comment has been minimized.

@wesm

wesm Nov 25, 2018

Member

Do you want to add an example of this to the Sphinx documentation?

This comment has been minimized.

@kszucs

kszucs Nov 26, 2018

Member

pandas.rst lacks of examples, so just mentioned it :)

kszucs added some commits Nov 26, 2018

@wesm

This comment has been minimized.

Member

wesm commented Nov 26, 2018

+1

@wesm wesm closed this in 54b0af8 Nov 26, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment