Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose parser on DFParser to enable user controlled parsing #9729

Merged
merged 11 commits into from
Mar 28, 2024

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented Mar 21, 2024

Which issue does this PR close?

Closes #533

Rationale for this change

Being able to control the underlying parser is useful to be able to add capabilities between the SQL and logical plan layer, which can propagate downstream in the plan.

This approach seems to work nicely insofar as modifying how the SQL becomes a user-defined Statement. I.e. here I'm wrapping DF's Statement with my own, and creating it based on the underlying result from parse_copy.

However, I'm unsure on the next steps. E.g. with this strategy of wrapping the statement, I can't then use DF's machinery to generate the LogicalPlan, like using sql_statement_to_plan. It's not too bad in the basic case, as it'd be easy to generate a user defined logical node from this enum (or fall back to sql_statement_to_plan), but in more complex cases there's probably a better way, e.g. maybe to make Statement a Trait with a visit method, or something along those lines? There's probably a better/simpler approach :).

Going to open this in draft, and will come back later today/tomorrow to see what the basic strategy for constructing a LogicalPlan would look like and see if that's a reasonable first chunk.

Open to feedback on any of it, thanks!

What changes are included in this PR?

Makes the parser on DFParser public and adds an example.

Are these changes tested?

Manually ran the example. Seems to perform as expected. Though not sure what the best path is next.

Are there any user-facing changes?

Not breaking, but parser is available on DFParser.

@github-actions github-actions bot added sql core Core datafusion crate labels Mar 21, 2024
@tshauck
Copy link
Contributor Author

tshauck commented Mar 22, 2024

@alamb, would you be able to please give me some feedback here?

In a7a7808, I made parser public and added an example, and that seemed to go well insofar as I could create a new MyStatement which is an enum of a DF Statement and a custom one.

That was good, but then it can't work with df functionality like sql_statement_to_plan, so in 86588e4 I started down added an Extension enum variant to Statement, based on how the LogicalPlan extensions work, but I'm having trouble integrating it into DF writ large... e.g. I'm running into many object safety issues...

image

So I guess my questions are 1) does an Extension statement seem reasonable to you? 2) and if so, any advice on the implementation to mitigate object safety issues.

@alamb
Copy link
Contributor

alamb commented Mar 23, 2024

Thanks for working on this @tshauck -- it is really cool to see

Going to open this in draft, and will come back later today/tomorrow to see what the basic strategy for constructing a LogicalPlan would look like and see if that's a reasonable first chunk.

The simplest strategy I can think of is to implement your own version to sql_statement_to_plan but for MyStatement

So something like

    pub async fn my_statement_to_plan(
        ctx: &SessionContext,
        statement: MyStatement,
    ) -> Result<LogicalPlan> {
   ...
   // call statement_to_plan here and implement the special Copy logic
}

So I guess my questions are 1) does an Extension statement seem reasonable to you? 2) and if so, any advice on the implementation to mitigate object safety issues.

I don't fully understand the need for an Extension statement -- if the idea is to wrap DFStatement with MyStatement I think you can just implement the planning in terms of MyStatement as above.

If the idea is to avoid repetition with whatever statement_to_plan is doing, maybe we can factor out the common functionality into a module.

@github-actions github-actions bot removed the core Core datafusion crate label Mar 25, 2024
@tshauck
Copy link
Contributor Author

tshauck commented Mar 25, 2024

Thanks for the feedback, I reverted the extension changes, and now it only has the example plus making Parser public on DFParser. To your point later, there's quite a bit within sql_statement_to_plan (and functions it calls) that would be nice to not have to replicate, but perhaps going through the work of using this change in my code will make it clearer what those changes should be. Things like resolving name, handling subqueries, etc.

I guess in terms of next steps, I could move this out of draft or add my_statement_to_plan to the example, if you think that's a more complete change?

@tshauck tshauck marked this pull request as ready for review March 25, 2024 17:49
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tshauck -- this looks great 🙏 . It is really nice to get an example showing how to do this.

I left some suggestions on additional contextual comments and how to make the code potentially simpler, but I also think we could merge this PR as is.

Can you also add an entry to the README for this example?
https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples#single-process

Also, I think we can make the example code simpler with less nesting with a function. Something like

impl MyParser<'_> {
...
    /// Returns true if the next token is `COPY` keyword, false otherwise
    fn is_copy(&self) -> bool {
        matches!(
            self.df_parser.parser.peek_token().token,
            Token::Word(w) if w.keyword == Keyword::COPY
        )
    }

    pub fn parse_statement(&mut self) -> Result<MyStatement, ParserError> {
        if self.is_copy() {
            self.df_parser.parser.next_token(); // COPY
            let df_statement = self.df_parser.parse_copy()?;

            if let Statement::CopyTo(s) = df_statement {
                Ok(MyStatement::from(s))
            } else {
                Ok(MyStatement::DFStatement(Box::from(df_statement)))
            }
        } else {
            let df_statement = self.df_parser.parse_statement()?;
            Ok(MyStatement::from(df_statement))
        }
    }
...

#[tokio::main]
async fn main() -> Result<()> {
let mut my_parser =
MyParser::new("COPY source_table TO 'file.fasta' STORED AS FASTA")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😸

datafusion-examples/examples/sql_parsing.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/sql_parsing.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/sql_parsing.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/sql_parsing.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/sql_parsing.rs Outdated Show resolved Hide resolved
datafusion-examples/examples/sql_parsing.rs Outdated Show resolved Hide resolved
tshauck and others added 4 commits March 27, 2024 08:28
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
- [`simple_udfw.rs`](examples/simple_udwf.rs): Define and invoke a User Defined Window Function (UDWF)
- [`advanced_udwf.rs`](examples/advanced_udwf.rs): Define and invoke a more complicated User Defined Window Function (UDWF)
- [`sql_dialect.rs`](examples/sql_dialect.rs): Examples of using the SQL Dialect
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's sql_dialect.rs -- I also alphabetized the list, hope that's ok.

@tshauck
Copy link
Contributor Author

tshauck commented Mar 27, 2024

@alamb Thanks for all the feedback! The matches! approach is really nice... re-requesting review.

@tshauck tshauck requested a review from alamb March 27, 2024 15:55
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

200w

@@ -42,36 +42,37 @@ cargo run --example csv_sql

## Single Process

- [`advanced_udaf.rs`](examples/advanced_udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ -- very nice drive by cleanup 🙏

@alamb alamb merged commit 0534382 into apache:main Mar 28, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 28, 2024

Thanks again @tshauck

@tshauck tshauck deleted the parser-spike branch March 28, 2024 02:46
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Apr 1, 2024
…#9729)

* poc: custom parser

* play with extension statement

* tweak

* Revert "tweak"

This reverts commit e57006e.

* Revert "play with extension statement"

This reverts commit 86588e4.

* style: cargo fmt

* Update datafusion-examples/examples/sql_parsing.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* style: cargo cmt

* refactor: less nesting in parse statement

* docs: better example description

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add extension plugin to parse SQL into logical plan / user defined SQL parser
2 participants