Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-8355: [Python] Remove hard pandas dependency from FeatherDataset and minimize pandas dependency in test_feather.py #8244

Closed
wants to merge 4 commits into from

Conversation

arw2019
Copy link
Contributor

@arw2019 arw2019 commented Sep 23, 2020

xref #6849 (comment)

The goals of this PR are:

  • remove hard pandas dependency from FeatherDataset's __init__. Now we only check if pandas is present & if it is a recent enough version in methods like read_pandas that use it
  • remove hard pandas dependencies in test_feather.py by switching to pyarrow methods whenever possible (e.g. use pa.table to generate data; compare tables directly using pa.equals)
  • reduce boilerplate by using _check_arrow_roundtrip and _check_pandas_roundtrip wherever possible

@github-actions
Copy link

@arw2019 arw2019 force-pushed the ARROW-8355 branch 7 times, most recently from de13ef4 to 2dd24e6 Compare September 28, 2020 14:23
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
Can you check that all @pytest.mark.pandas decorators have been removed where possible?

df = pd.DataFrame(np.random.randn(*num_values),
columns=['col_' + str(i)
for i in range(num_values[1])])
table = pa.Table.from_arrays(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle we could also use pa.table(.. here instead (a bit more ergonomic to use)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will switch to that

@@ -128,19 +128,22 @@ def test_dataset(version):
num_values = (100, 100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the @pytest.mark.pandas can be removed here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this to work I had to remove the call to _check_pandas_version in FeatherDataset's __init__. I think that was a bug since FeatherDataset is supposed to work without pandas. I added a call to _check_pandas_version in FeatherDataset.read_pandas since we do want it there

@emkornfield
Copy link
Contributor

@jorisvandenbossche can this be merged now?

@jorisvandenbossche
Copy link
Member

I don't think @arw2019 already pushed the update? (although answered to the comments)

@arw2019 arw2019 changed the title ARROW-8355: [Python] Reduce the number of pandas dependent test cases in test_feather ARROW-8355: [Python] Remove hard pandas dependency from FeatherDataset and minimize pandas dependency in test_feather.py Oct 9, 2020
@arw2019
Copy link
Contributor Author

arw2019 commented Oct 9, 2020

@jorisvandenbossche @emkornfield This is ready to go:

  • addressed comments
  • updated title & description
  • rebased (CI green)

@jorisvandenbossche
Copy link
Member

Thanks @arw2019 !

@arw2019
Copy link
Contributor Author

arw2019 commented Oct 9, 2020

Thanks @jorisvandenbossche for reviewing!

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…t and minimize pandas dependency in test_feather.py

xref apache#6849 (comment)

The goals of this PR are:
- [x] remove hard `pandas` dependency from `FeatherDataset`'s `__init__`. Now we only check if `pandas` is present & if it is a recent enough version in methods like `read_pandas` that use it
- [x] remove hard `pandas` dependencies in `test_feather.py` by switching to `pyarrow` methods whenever possible (e.g. use `pa.table` to generate data; compare tables directly using `pa.equals`)
- [x] reduce boilerplate by using `_check_arrow_roundtrip` and `_check_pandas_roundtrip` wherever possible

Closes apache#8244 from arw2019/ARROW-8355

Authored-by: arw2019 <andrew.r.wieteska@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants