Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use load_options param to pass pandasOptions in load_file #1466

Merged
merged 23 commits into from
Jan 9, 2023
Merged

Conversation

pankajastro
Copy link
Contributor

@pankajastro pankajastro commented Dec 20, 2022

Description

closes: #1519

What is the current behavior?

currently, load_file do not have an option to pass the pandas-related param while reading file

What is the new behavior?

use load_options and pass the given values while reading files using the pandas path

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

@codecov
Copy link

codecov bot commented Dec 20, 2022

Codecov Report

Base: 97.59% // Head: 94.01% // Decreases project coverage by -3.57% ⚠️

Coverage data is based on head (8ac0af0) compared to base (1bad806).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1466      +/-   ##
==========================================
- Coverage   97.59%   94.01%   -3.58%     
==========================================
  Files          22       89      +67     
  Lines         789     4347    +3558     
  Branches        0      428     +428     
==========================================
+ Hits          770     4087    +3317     
- Misses         19      178     +159     
- Partials        0       82      +82     
Impacted Files Coverage Δ
python-sdk/src/astro/sql/operators/load_file.py 96.90% <ø> (ø)
python-sdk/src/astro/databases/base.py 96.00% <100.00%> (ø)
python-sdk/src/astro/databases/postgres.py 94.44% <100.00%> (ø)
python-sdk/src/astro/databases/snowflake.py 94.98% <100.00%> (ø)
python-sdk/src/astro/dataframes/load_options.py 100.00% <100.00%> (ø)
python-sdk/src/astro/files/base.py 100.00% <100.00%> (ø)
python-sdk/src/astro/files/types/base.py 78.57% <100.00%> (ø)
python-sdk/src/astro/files/types/csv.py 100.00% <100.00%> (ø)
python-sdk/src/astro/files/types/json.py 100.00% <100.00%> (ø)
python-sdk/src/astro/files/types/ndjson.py 100.00% <100.00%> (ø)
... and 67 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@pankajastro pankajastro force-pushed the pandas_ops branch 10 times, most recently from 883434a to 26e9705 Compare December 20, 2022 11:51
python-sdk/src/astro/databases/base.py Show resolved Hide resolved
python-sdk/src/astro/files/types/csv.py Outdated Show resolved Hide resolved
@utkarsharma2
Copy link
Collaborator

utkarsharma2 commented Dec 21, 2022

@pankajastro What sort of options do we want to pass using pandas_options? is delimiter?

@pankajastro
Copy link
Contributor Author

@pankajastro What sort of options do we want to pass using pandas_options? is delimiter?

yes, currently delimiter, but in future maybe other params as well which pandas read_{csv, json} etc method accept.

@utkarsharma2
Copy link
Collaborator

@pankajastro I was a little concerned because if we keep adding the internal libs-specific parameters we will soon need to expose every database-specific or file type-specific parameter. We are exposing internal details to the end user, kinda violating the interface we have provided. Instead, since load files only have to deal with the file(s) and tables can we keep the parameters specific to tables and files? Like in this scenario, the delimiter is specific to CSV files, maybe we can move this parameter to File Object? WDYT?

@pankajastro
Copy link
Contributor Author

@pankajastro I was a little concerned because if we keep adding the internal libs-specific parameters we will soon need to expose every database-specific or file type-specific parameter. We are exposing internal details to the end user, kinda violating the interface we have provided. Instead, since load files only have to deal with the file(s) and tables can we keep the parameters specific to tables and files? Like in this scenario, the delimiter is specific to CSV files, maybe we can move this parameter to File Object? WDYT?

I understand your concern and probably we will expose only limited param. We don't want to make File object messy either so I closed #1225 there is some discussion on #482 too

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pankajastro, as we discussed, my only concern - which needs a bit of further look into the code - is if we are calling multiple pandas functions as part of load_file (e.g. to create a table if it doesn't exist), and the user would like to specify custom arguments to multiple of these functions.

I feel it is fine for us to focus on the read_csv, read_json, etc., but it would be great to look at the code and confirm this is not an issue.

@pankajastro pankajastro force-pushed the pandas_ops branch 4 times, most recently from 4a8cb3a to b649a5f Compare January 3, 2023 08:06
Copy link
Collaborator

@utkarsharma2 utkarsharma2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! :)

@pankajastro
Copy link
Contributor Author

Hi @dimberman Can you please re-review this PR

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, @pankajastro , for addressing all the comments. I'm happy for this PR to be merged after this last comment we agreed on is addressed.

@pankajastro pankajastro dismissed dimberman’s stale review January 9, 2023 12:03

requested changes does not apply since were existing from before. but feel free to comments if I have missed something will take care in separate PR

@pankajastro pankajastro merged commit 38809fa into main Jan 9, 2023
@pankajastro pankajastro deleted the pandas_ops branch January 9, 2023 12:09
@phanikumv phanikumv changed the title Add add pandas_options in load_file Add pandas_options in load_file Jan 9, 2023
@pankajastro pankajastro changed the title Add pandas_options in load_file Use load_options to pass pandasOptions in load_file Jan 9, 2023
@pankajastro pankajastro changed the title Use load_options to pass pandasOptions in load_file Use load_options param to pass pandasOptions in load_file Jan 9, 2023
utkarsharma2 pushed a commit that referenced this pull request Jan 17, 2023
# Description
closes: #1519

## What is the current behavior?
currently, load_file do not have an option to pass the pandas-related
param while reading file

## What is the new behavior?
use `load_options` and pass the given values while reading files using
the pandas path

## Does this introduce a breaking change?
No

### Checklist
- [ ] Created tests which fail without the change (if possible)
- [ ] Extended the README / documentation, if necessary

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass Kwargs to read_csv, read_parquet etc
6 participants