DuckDB plugin #1419
Codecov Report
@@           Coverage Diff           @@
##           master    #1419   +/-   ##
=======================================
  Coverage   69.32%   69.32%
=======================================
  Files         305      305
  Lines       28671    28671
  Branches     2718     2718
=======================================
  Hits        19877    19877
  Misses       8276     8276
  Partials      518      518
@samhita-alla can we support accepting StructuredDataset as an input?
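@kumare3, here's an example with a StructuredDataset input:

import pandas as pd
from flytekit import kwtypes, task, workflow
from flytekit.types.structured.structured_dataset import StructuredDataset
from flytekitplugins.duckdb import DuckDBQuery

# The query refers to the input by its parameter name, pandas_df.
duckdb_task = DuckDBQuery(
    name="duckdb_sd_df",
    query="SELECT * FROM pandas_df WHERE i = 2",
    inputs=kwtypes(pandas_df=StructuredDataset),
)

@task
def get_pandas_df() -> StructuredDataset:
    return StructuredDataset(
        dataframe=pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})
    )

@workflow
def pandas_wf(pandas_df: StructuredDataset) -> pd.DataFrame:
    return duckdb_task(pandas_df=pandas_df)

assert isinstance(pandas_wf(pandas_df=get_pandas_df()), pd.DataFrame)

Let me know if you're looking for something different.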
Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
looking great! added some comments for docstrings
@cosmicBboy, thanks for reviewing the PR! Can you look through it again?
DuckDB API reference is blank. I think we need to update the https://github.com/flyteorg/flytekit/blob/master/doc-requirements.in file with
@samhita-alla we'll also need to invest a bit in enabling warnings as errors in the Sphinx build process
@cosmicBboy, fixed the docs and added a GitHub action to show warnings as errors.
    name: str,
    query: Union[str, List[str]],
    inputs: Optional[Dict[str, Union[StructuredDataset, list]]] = None,
    **kwargs,
Could we add an output schema type here, like Snowflake? If output_schema_type is None, we won't generate an output dataset.
And the type should change to StructuredDataset, because we already deprecated FlyteSchema.
@pingsutw, this task isn't helpful if there's no output dataset. The DuckDBQuery task runs some queries and returns the output of a SELECT statement. Hence, it must return the query output, and in this case, it's StructuredDataset. Also, I can definitely add output_schema_type. But the output has to always be a StructuredDataset. So is it necessary? I'm already hard-coding the output type in the initialization. Let me know what you think.
I see. Do we support INSERT or some other operations? If not, I think we don't need a schema type for now.
Yeah, we do. You can give a bunch of queries to a single DuckDBQuery task, but the last one needs to be a SELECT query: after you, say, insert the data, you need to retrieve it, right? Otherwise it's of no use. I'm using the non-persistent offering by DuckDB, so all the data will be available only within the query. Does that make sense?
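To make the "several queries, last one a SELECT, nothing persists" behaviour concrete, here is a minimal sketch. It is not the plugin's code: it swaps in Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without third-party packages, and run_queries, the items table, and the rule that parameters apply only to the final SELECT are all hypothetical choices made for illustration.

```python
import sqlite3
from typing import Any, List, Optional, Sequence, Tuple


def run_queries(
    queries: List[str],
    params: Optional[Sequence[Any]] = None,
) -> List[Tuple[Any, ...]]:
    """Run queries in order on a fresh in-memory database; return the last query's rows."""
    # A new ":memory:" database is created per call and discarded on close,
    # so nothing persists between calls (sqlite3 stand-in, not DuckDB).
    conn = sqlite3.connect(":memory:")
    try:
        for setup_query in queries[:-1]:
            conn.execute(setup_query)
        # Hypothetical convention: parameters apply to the final SELECT only.
        cursor = conn.execute(queries[-1], tuple(params or ()))
        return cursor.fetchall()
    finally:
        conn.close()


rows = run_queries(
    [
        "CREATE TABLE items (i INTEGER, j TEXT)",
        "INSERT INTO items VALUES (1, 'one'), (2, 'two'), (3, 'three')",
        "SELECT * FROM items WHERE i = ?",
    ],
    params=[2],
)
print(rows)  # [(2, 'two')]

# A second call starts from an empty database, so the table is gone.
try:
    run_queries(["SELECT * FROM items"])
    table_survived = True
except sqlite3.OperationalError:
    table_survived = False
print(table_survived)  # False
```

The second call failing is the point of the sketch: inserted data is only reachable by a SELECT issued in the same run.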
Makes sense. Thanks for the explanation.
Thanks, Kevin! Will merge this PR after @cosmicBboy approves as well.
@kumare3, let me know if this PR looks good to you.
* DuckDB integration
* add sd test and fix import
* fix lint error
* fix lint error
* list to List
* lint
* incorporated suggestions
* add duckdb to requirements and add gh action to detect doc warnings and errors
* gh action: python 3.9
* docs python 3.8 to 3.9

Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
Signed-off-by: Samhita Alla <samhitaalla@Samhitas-MacBook-Pro.local>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Samhita Alla aallasamhita@gmail.com
TL;DR
This PR adds a DuckDBQuery task plugin that runs queries using DuckDB as the DBMS.
Type
Are all requirements met?
Complete description
Capturing the crucial assumptions I made while building the task plugin:
* The DuckDBQuery task parameters a user needs to pass arguments to include query and, optionally, inputs.
* query can include a set of queries that'll be run sequentially. The last query needs to be a SELECT query.
* inputs can include a structured dataset or a list of parameters to be sent to the queries.
* output is a pyarrow table. It can be converted to any structured dataset compatible type.
* The database is :memory:, i.e., the data is always stored in an in-memory, non-persistent database. It can be set to a file, but it's difficult to make the file accessible to different DuckDBQuery pods, and without that a file makes little sense, because its persistence is what needs to be leveraged.
Example:
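As a rough illustration of the query assumption above (a single string or a list of strings, with the last statement a SELECT), the following sketch shows one way such an argument could be normalized and checked. It is an assumption-laden sketch, not the plugin's implementation: normalize_queries is a hypothetical helper, whether the plugin performs any such check is not stated here, and the prefix test is deliberately simplistic (for instance, it would reject a query that starts with WITH).

```python
from typing import List, Union


def normalize_queries(query: Union[str, List[str]]) -> List[str]:
    """Return the queries as a non-empty list whose last entry is a SELECT."""
    # Accept either one query string or a list of query strings.
    queries = [query] if isinstance(query, str) else list(query)
    if not queries:
        raise ValueError("at least one query is required")
    # Simplistic check: only looks at the leading keyword of the last query.
    last = queries[-1].lstrip().upper()
    if not last.startswith("SELECT"):
        raise ValueError("the last query must be a SELECT statement")
    return queries


single = normalize_queries("SELECT 1")
several = normalize_queries(["CREATE TABLE t (i INTEGER)", "SELECT * FROM t"])

# A list that ends in a non-SELECT statement is rejected.
try:
    normalize_queries(["CREATE TABLE t (i INTEGER)"])
    error = ""
except ValueError as exc:
    error = str(exc)

print(single)   # ['SELECT 1']
print(error)    # the last query must be a SELECT statement
```

Normalizing to a list up front keeps the rest of a task's logic the same for one query or many.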
Tracking Issue
Fixes flyteorg/flyte#3246, flyteorg/flyte#2865
Follow-up issue
NA
OR
https://github.com/flyteorg/flyte/issues/